Monday, October 17, 2011

FAST ESP : Deleting all documents in a collection




A typical FAST ESP solution has multiple collections. For example, there might be one collection for data crawled and indexed from SharePoint, another for data crawled from a database using the JDBC Connector, and so on.


Suppose that you are developing and implementing such a solution. You will have set up a number of collections and configured their crawlers so that the appropriate documents land in each collection. During the implementation you will at some point need to discard all the documents in a particular collection and crawl everything from scratch. A typical scenario is when you fix a bug in indexing and want the fix applied to the documents already in the index as well.

The tool for this is the 'collection-admin.cmd' command-line utility in the \bin directory. The exact command is:

collection-admin -m clearcollection -n COLLECTION_NAME

In the above command, the -m flag sets the mode the tool runs in. Here we use 'clearcollection' to indicate that we want to clear (reset) the collection. COLLECTION_NAME is the name of the collection you want to reset. You can find the name in the admin GUI or by running the same tool in this mode:

collection-admin -m listcollections

When you invoke the command, the UI will look like this



When this command returns, all documents in the collection will have been deleted. You can then run a full crawl to repopulate the collection with the appropriate documents.
More information and further command-line options for the 'collection-admin' tool can be found in the tool's own command-line help.
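If you reset collections often during development, the two command lines above can be assembled from a small script. The following is only a sketch in Python, not something that ships with FAST ESP: the function names and the collection name "mycollection" are made up for illustration, and it assumes the \bin directory is on the PATH so the tool can be called by name.

```python
# Hypothetical helper: builds the collection-admin command lines shown above.
# Only the -m and -n options mentioned in this post are used.
TOOL = "collection-admin"  # assumption: the tool's \bin directory is on PATH


def list_command():
    """Command line that lists all collections."""
    return [TOOL, "-m", "listcollections"]


def clear_command(collection):
    """Command line that clears (resets) a single collection."""
    if not collection.strip():
        raise ValueError("collection name must not be empty")
    return [TOOL, "-m", "clearcollection", "-n", collection]


if __name__ == "__main__":
    # "mycollection" is a placeholder, not a real collection name.
    print(" ".join(list_command()))
    print(" ".join(clear_command("mycollection")))
```

On a machine where the tool is installed, the resulting argument list could be handed to subprocess.run. Keeping the command construction separate makes it easy to print and review the command before anything is deleted.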

FAST ESP: How Relevancy Works



Relevancy is the measure of how well a set of documents (results) answers or addresses the intent of a given query.
When there are many query matches, a search engine must rank the results by relevance score, sorting the result listing so that the pages most likely to be useful appear first. Different algorithms are used to define relevancy. Relevancy definition and tuning is one of the core differentiators of the FAST ESP platform. This blog post is about the relevance framework and the related concepts and features in FAST ESP.


FAST ESP Search Relevance Framework

FAST ESP applies search relevancy through the following key steps:
  • Data mining – A document processing framework can be used to perform real-time content refinement. This includes embedded relevancy tools and integration points for 3rd party modules. An Entity Extraction framework enables extraction of named entities and key concepts from documents that may be used for result navigation
  • Linguistic normalization – Handles grammatical variations and automatic spell corrections
  • Query Processing – A query processing framework applies built-in or custom query transformations based on application specific rules
  • Ranking – The FAST InPerspective model provides a multi-faceted measurement of the quality of the match between the query and a candidate result document
  • Query Context Analysis – Presents the information from the query results in the context of the query. FAST ESP supports dynamic document summaries that display the segments of the matching document that provide the most relevant match with the query
  • Data Driven Navigation – Provides dynamic drill-down into the query result or related areas

The relevancy of a document with respect to a query is represented by a rank value. The following section lists the different elements used to calculate the rank value.


Elements of Rank Value


Element | Description
--- | ---
Freshness | Age of a document compared to the time when the query is issued
Authority | Importance of a document determined by the links to it from other documents
Quality | Assigned importance of a document, independent of the query
Geo | Importance of geographical distance between a document’s associated latitude/longitude and a target location specified in a query
Context | Importance of matching a query in a given document field
Proximity | For multi-term queries: the shorter the distance between query terms in a document, the higher the document’s rank value
Position | The earlier a query term occurs in a field, the higher the document’s rank value
Frequency | The more frequently a query term occurs in a document, the higher the document’s rank value
Completeness | The greater the number of query terms present in the same field of a matching document, the higher the document’s rank value
Number | For multi-term queries: the more query terms matched in a document, the higher the document’s rank value


Relevant Sorting of Query Results

FAST ESP provides three main methods for sorting the results of a query:

  1. Sorting by rank (relevancy score) - FAST ESP computes a rank value based on a set of parameters as described below. These parameters can be tuned in order to provide the best possible perceived relevancy for the end-user. It is possible to define multiple rank profiles that can be selected on a per query basis
  2. Sorting by field values - You may also sort query results by value of any searchable field, such as product name, product code, price or date. FAST ESP supports numeric and full-text sorting, single and multi-level sorting, ascending and descending sorting direction and national sorting rules
  3. Sorting by geographic location - The Geo Search feature provides capabilities for sorting and filtering query results based on geographic location

Rank Profile

The Rank Profile concept gives full control over the relative weight of each rank component for a given query. For example, it controls how important an article’s title is relative to the main text, or how important proximity is compared to freshness. This allows relevance to be tuned individually for the different query applications that share a FAST ESP installation.
In FAST ESP, the Rank Profile is a configuration element within the Index Profile that defines the relative weight of the different components of the dynamic rank. Multiple Rank Profiles can be specified in the Index Profile.
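To make the idea of "relative weight per rank component" concrete, here is a toy model in Python. It is not FAST ESP's actual rank formula, which this post does not describe. It simply assumes the rank value is a weighted sum of per-component scores, and the profile names, weights and scores are all invented for illustration.

```python
# Toy model only: assumes rank = weighted sum of component scores.
# FAST ESP's real computation is configured in the Index Profile.
RANK_COMPONENTS = ("freshness", "authority", "quality", "context", "proximity")


def rank_value(scores, profile):
    """Combine per-component scores using the weights of one rank profile."""
    return sum(profile.get(name, 0) * scores.get(name, 0.0)
               for name in RANK_COMPONENTS)


# Two hypothetical profiles for two query applications.
news_profile = {"freshness": 5, "context": 2, "proximity": 1}
archive_profile = {"freshness": 0, "context": 3, "proximity": 3}

# Two hypothetical documents matching the same query.
old_exact = {"freshness": 0.1, "context": 0.9, "proximity": 0.8}
new_loose = {"freshness": 0.9, "context": 0.4, "proximity": 0.2}

if __name__ == "__main__":
    for label, profile in (("news", news_profile), ("archive", archive_profile)):
        ranked = sorted(
            (("old_exact", old_exact), ("new_loose", new_loose)),
            key=lambda item: rank_value(item[1], profile),
            reverse=True,
        )
        print(label, [name for name, _ in ranked])
```

With the freshness-heavy profile the newer document comes first, while the profile that ignores freshness prefers the older document with the tighter match. That is the kind of per-application trade-off a Rank Profile is meant to express.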
Tuning the Ranking and Sorting of Query Results

The ranking and sorting of query results can be tuned in three main ways:
  • Multiple Rank Profiles can be specified in the Index Profile. A Rank Profile defines relative weight for the different components of the dynamic rank
  • Sorting attributes can be specified for individual fields of the documents
  • Result sorting can be controlled on a per query basis. By default the result is sorted by rank as defined in the default Rank Profile. Query parameters enable you to specify an alternative rank profile for the query, or a set of fields that the result shall be sorted by

Relevance support in the Query Language

FAST ESP includes a highly expressive query language that also includes advanced proximity operators:
  • Different relevance weight may be applied to different terms or phrases in a query
  • Explicit proximity (ordered/unordered NEAR) operators enable precise matching in semi-structured content without requiring a phrase match
  • Boundary match operators enable exact matching against extracted entities or entire document elements, such as a product name
  • Wildcard query support

Dynamic Client Side Ranking

Dynamic client side ranking can be done with the XRANK operator, which is part of the FAST Query Language (FQL). The boost value is specified with the parameter boost=n, where n is a signed integer. Negative boost is supported, but if boosting with a negative value produces a negative result, the result is set to 0.
It's a concept unique to FAST and I will cover it in detail in another post.
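Until that post exists, the clamping rule above can be illustrated with a few lines of Python. This is a sketch under two assumptions that the text does not spell out: that the boost is simply added to the document's rank value, and that an XRANK expression has the shape xrank(main query, boost query, boost=n). Check the FQL reference for the exact syntax. The function names and query strings are made up.

```python
def boosted_rank(base_rank, boost):
    """Add a signed boost to a rank value; negative results are set to 0."""
    return max(0, base_rank + boost)


def xrank_query(main_query, boost_query, boost):
    """Build an XRANK-style expression (the string shape is an assumption)."""
    if not isinstance(boost, int):
        raise TypeError("boost must be a signed integer")
    return f"xrank({main_query}, {boost_query}, boost={boost})"


if __name__ == "__main__":
    # Hypothetical query: boost laptop results that also mention a campaign.
    print(xrank_query('string("laptop")', 'string("campaign")', 500))
    print(boosted_rank(1200, 500))   # positive boost raises the rank
    print(boosted_rank(300, -500))   # would be negative, so it becomes 0
```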


Rank Modification Tools


FAST ESP provides tools to modify the rank of individual documents. These tools enable you to perform Absolute Query Boost, Relative Query Boost or Relative Document Boost for given documents in the FAST ESP index. An example is a product database where you want to boost the products with the highest profit margins, products related to campaigns, and so on.
Two main tools exist for this purpose:
1) Search Business Center (SBC) - This is an optional, GUI based tool which enables query-oriented rank tuning. The SBC also includes a powerful query reporting module that may be used to assist in the rank tuning. Using the SBC you can change the ranking for each query using three different methods:
  • Top Ten - to position the document in one of ten reserved places that will be returned at the top of the results list
  • Add boost points - to add a value to a document to increase its relevancy relative to the other documents returned in the search results. You can also add negative boost points to a document.
  • Block from query - to prevent the blocked document from appearing in the search results for the query.
2) Rank Tuning Bulk Loader - This is a standard FAST ESP tool that enables you to perform the same rank tuning as the SBC, using an XML file as input. The XML file contains a specification of the rank modifications to be performed.

How SharePoint Does Relevancy

Relevancy in SharePoint 2007

SharePoint 2007 also has a search relevance architecture, though it is much less sophisticated and flexible than the one in FAST ESP. For details, see "Enterprise Search Relevance Architecture Overview" in the SharePoint SDK documentation. See also "Improving Relevance" for ways to adjust the settings used in the ranking calculation for Enterprise Search results, along with other techniques you can try for improving relevance.

Thursday, April 14, 2011

FAST ESP: Multi-node installation and Windows Automatic Private IP Addressing


The FAST ESP installer runs the following system checks:


1. The machine is ping-able
2. All 4000 ports from 13000 to 16999 are free
3. Other checks, such as whether Java is installed, whether the directory exists, etc.
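
If you want to verify the port range yourself before starting the installer, a quick probe can be written in Python. This is my own sketch, not the installer's check: it only tries to bind each TCP port over IPv4, and the function name is made up.

```python
import socket


def busy_ports(first, last, host="0.0.0.0"):
    """Return the TCP ports in [first, last] that cannot be bound on host."""
    busy = []
    for port in range(first, last + 1):
        # A fresh socket per port; it is closed again when the block exits.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                # Binding failed, so something else holds the port
                # (or we are not allowed to use it).
                busy.append(port)
    return busy
```

Calling busy_ports(13000, 16999) on each node should return an empty list if the range is free. Note that the probe only sees ports that are taken at the moment it runs.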


Now the interesting part: all system checks passed and the installation finished with 0 errors. I spent another hour installing all the patches, only to find ESP failing to create a collection and the default views.


The following error was thrown:
Failed to deploy views '[oursite, espsystemwebcluster]' (deployment id: 1956). Failed to perform deployment transaction. Deployment failed: Could not deploy resource 'no.fast.messages.resourceservice.ResourceHandle@2c91b364' 
The error hinted at a communication problem while deploying the resource files for collection creation. I was not convinced, though, because the system status looked like this:


[Screenshot: system status]


I then searched the Microsoft support site for a resolution and was glad to find an article suggesting that restarting ESP solves the issue. I happily restarted, but no luck!


After a day of analyzing the issue, including uninstalling and reinstalling FAST ESP, I finally decided to open a ticket with Microsoft. Before opening the ticket, I ran ipconfig on all 4 servers to collect the network information, and bingo: all servers but one had 10.1.x.x addresses, while the remaining one showed 169.x.x.x. The network folks immediately recognized that this IP was incorrect.


What I describe below should really be included in the FAST installation guide.


From Windows Millennium Edition onwards, Windows automatically assigns an IP address to enabled network adapters that have no static address and cannot reach a DHCP server (Automatic Private IP Addressing, or APIPA). These automatic addresses come from the 169.254.x.x range, which is why one server showed a 169.x.x.x address. This method decreases administration time and means that IP addresses can be reused. It is also recommended for all sizes of networks that do not have a direct Internet connection or an available DHCP service.


See the links below:

http://support.microsoft.com/kb/307287
http://support.microsoft.com/kb/956196
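
A check like the one I did by hand with ipconfig can be automated. The sketch below, in Python, flags any node whose address falls in the APIPA range 169.254.0.0/16. The node names and addresses are invented examples (my real servers were only 10.1.x.x and 169.x.x.x as described above), and the function names are my own.

```python
import ipaddress

# Automatic Private IP Addressing uses the IPv4 link-local range.
APIPA_RANGE = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(address):
    """True if the address is an automatically assigned private address."""
    return ipaddress.ip_address(address) in APIPA_RANGE


def suspicious_nodes(node_addresses):
    """Names of the nodes whose reported address is in the APIPA range."""
    return sorted(name for name, address in node_addresses.items()
                  if is_apipa(address))


if __name__ == "__main__":
    # Hypothetical four-node installation; addresses are examples only.
    nodes = {
        "node1": "10.1.0.11",
        "node2": "10.1.0.12",
        "node3": "10.1.0.13",
        "node4": "169.254.12.7",
    }
    print(suspicious_nodes(nodes))
```

Running this kind of check on every node before a multi-node installation would have pointed at the misconfigured server right away.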


Resolution:


Disabling all the adapters except the one holding the static IP solved the issue!