Enterprise search

Monday, October 17, 2011

FAST ESP : Deleting all documents in a collection

A typical FAST ESP solution has multiple collections. For example, there might be one collection for data crawled and indexed from SharePoint, another for data crawled from a database using the JDBC Connector, and so on.

Suppose that you are developing and implementing such a solution. You will have set up a number of collections and configured their crawlers so that the appropriate documents end up in each collection. During the implementation you will at some point need to discard all the documents in a particular collection and crawl everything from scratch. A typical scenario is when you fix a bug in indexing and want the fix applied to the documents already in the index as well.
The tool for this is the 'collection-admin.cmd' command-line utility in the \bin directory. The exact command is:

collection-admin -m clearcollection -n <collection-name>

In the above command, the -m flag indicates the mode in which to run the tool. In our case we use 'clearcollection' to indicate that we want to clear (reset) the collection. '<collection-name>' is the name of the collection you want to reset. You can find the name in the admin GUI or by running the same tool in the following mode:

collection-admin -m listcollections

When you invoke the command, the UI will look like this

When this command returns, all documents in the collection will have been deleted. You can then do a full crawl to repopulate the collection with the appropriate documents.
More information about the 'collection-admin' tool and its command-line options can be found in the tool's own command-line help.
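If you reset collections often during development, the two invocations above can be wrapped in a small script. The following is a minimal Python sketch and not part of FAST ESP: the helper names and the collection name 'sharepoint_docs' are hypothetical, and it assumes that 'collection-admin' can be found on the PATH. It uses only the -m and -n options shown above.

```python
import os
import subprocess

TOOL = "collection-admin"  # assumption: the ESP bin directory is on the PATH


def list_collections_cmd():
    """Build the command that lists all collections."""
    return [TOOL, "-m", "listcollections"]


def clear_collection_cmd(name):
    """Build the command that clears (resets) one collection."""
    if not name or not name.strip():
        raise ValueError("a collection name is required")
    return [TOOL, "-m", "clearcollection", "-n", name]


def run(cmd):
    """Run the tool and return its exit code; output goes to the console."""
    # On Windows the tool is a .cmd script, so it is started through the shell.
    return subprocess.run(cmd, shell=(os.name == "nt")).returncode


if __name__ == "__main__":
    # 'sharepoint_docs' is a hypothetical collection name used for illustration.
    print(" ".join(clear_collection_cmd("sharepoint_docs")))
```

Building the command as a list keeps the collection name a single argument even if it contains spaces. Call run() on the result only when you really want to clear the collection.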

FAST ESP: How Relevancy works?

Relevancy is the measure of how well a set of documents (results) answers or addresses the intent of a given query.
When there are many query matches, a search engine must rank the results by relevance score, sorting the result list so that the pages most likely to be useful appear first. Different algorithms are used to define relevancy. Relevancy definition and tuning is one of the core differentiators of the FAST ESP platform. This blog post is about the relevance framework and related concepts and features in FAST ESP.

FAST ESP Search Relevance Framework

FAST ESP applies search relevancy through the following key steps:

Data mining – A document processing framework can be used to perform real-time content refinement. This includes embedded relevancy tools and integration points for 3rd party modules. An Entity Extraction framework enables extraction of named entities and key concepts from documents that may be used for result navigation
Linguistic normalization – Handles grammatical variations and automatic spell corrections
Query Processing – A query processing framework applies built-in or custom query transformations based on application specific rules
Ranking based on the FAST InPerspective model provides a multi-faceted measurement of the quality of the match between the query and a candidate result document
Query Context Analysis indicates the ability to present the information from the query results in context of the query. FAST ESP supports dynamic document summaries that display the segments of the matching document that provide the most relevant match with the query
Data Driven Navigation provides dynamic drill-down into the query result or related areas.

The relevancy of a document with respect to a query is represented by a rank value. The following section lists the different elements used to calculate it.

Elements of Rank Value

Element	Description
Freshness	Age of a document compared to the time when the query is issued
Authority	Importance of a document determined by the links to it from other documents
Quality	Assigned importance of a document, independent of the query
Geo	Importance of geographical distance between a document’s associated latitude/longitude and a target location specified in a query
Context	Importance of matching a query in a given document field
Proximity	For multi-term queries: the shorter the distance between query terms in a document, the higher the document’s rank value
Position	The earlier a query term occurs in a field, the higher the document’s rank value
Frequency	The more frequently a query term occurs in a document, the higher the document’s rank value
Completeness	The greater the number of query terms present in the same field of a matching document, the higher the document’s rank value
Number	For multi-term queries: the more query terms matched in a document, the higher the document’s rank value

Relevant Sorting of Query Results

FAST ESP provides three main methods for sorting the results of a query:

Sorting by rank (relevancy score) - FAST ESP computes a rank value based on a set of parameters as described below. These parameters can be tuned in order to provide the best possible perceived relevancy for the end-user. It is possible to define multiple rank profiles that can be selected on a per query basis

Sorting by field values - You may also sort query results by value of any searchable field, such as product name, product code, price or date. FAST ESP supports numeric and full-text sorting, single and multi-level sorting, ascending and descending sorting direction and national sorting rules

Sorting by geographic location - The Geo Search feature provides capabilities for sorting and filtering query results based on geographic location
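Multi-level sorting by field values behaves like a sort on a compound key. Here is a minimal Python sketch of the idea, using hypothetical product records. It illustrates the concept only and says nothing about how FAST ESP implements sorting internally.

```python
# Hypothetical result documents with two searchable fields.
results = [
    {"name": "Widget", "price": 20.0},
    {"name": "Gadget", "price": 15.0},
    {"name": "Widget", "price": 12.5},
]


def multi_level_sort(docs, levels):
    """Sort docs by several fields; levels is a list of (field, descending)."""
    ordered = list(docs)
    # Python's sort is stable, so applying the levels from last to first
    # yields a correct multi-level ordering.
    for field, descending in reversed(levels):
        ordered.sort(key=lambda d: d[field], reverse=descending)
    return ordered


# Sort by name ascending, then by price descending within each name.
for doc in multi_level_sort(results, [("name", False), ("price", True)]):
    print(doc["name"], doc["price"])
```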

Rank Profile

A Rank Profile gives full control over the relative weight of each rank component for a given query: for example, how important an article’s title is relative to the main text, or how important proximity is relative to freshness. This enables individual relevance tuning of different query applications that share one FAST ESP installation.
In FAST ESP, the Rank Profile is a configuration element within the Index Profile and defines the relative weight of the different components of the dynamic rank. Multiple Rank Profiles can be specified in the Index Profile.

Tuning the Ranking and Sorting of Query Results

The ranking and sorting of query results can be tuned in three main ways:

Multiple Rank Profiles can be specified in the Index Profile. A Rank Profile defines relative weight for the different components of the dynamic rank
Sorting attributes can be specified for individual fields of the documents
Result sorting can be controlled on a per query basis. By default the result is sorted by rank as defined in the default Rank Profile. Query parameters enable you to specify an alternative rank profile for the query, or a set of fields that the result shall be sorted by
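To make the idea of per-query rank profiles concrete, here is a small Python sketch. It assumes that the rank value is a plain weighted sum of component scores, which is my simplification and not FAST's actual formula. The profile names, component names and weights are all hypothetical.

```python
# Hypothetical rank profiles: relative weights for a few rank components.
RANK_PROFILES = {
    "default": {"freshness": 1.0, "proximity": 1.0, "context_title": 1.0, "context_body": 1.0},
    "news": {"freshness": 3.0, "proximity": 1.0, "context_title": 2.0, "context_body": 1.0},
}


def rank_value(component_scores, profile="default"):
    """Combine per-component scores using the weights of the chosen profile."""
    weights = RANK_PROFILES[profile]
    return sum(weights[name] * score for name, score in component_scores.items())


# Component scores of one hypothetical document for one query.
doc_scores = {"freshness": 0.5, "proximity": 0.25, "context_title": 1.0, "context_body": 0.5}
print(rank_value(doc_scores))          # default profile
print(rank_value(doc_scores, "news"))  # profile selected for this query
```

The same document gets a different rank value depending on which profile the query selects, which is the point of defining several profiles.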

Relevance support in the Query Language

FAST ESP includes a highly expressive query language that also includes advanced proximity operators:

Different relevance weight may be applied to different terms or phrases in a query
Explicit proximity (ordered/unordered NEAR) operators enable precise matching in semi-structured content without the need for a phrase match
Boundary match operators enable exact matching against extracted entities or entire document elements such as a product name
Wildcard query support

Dynamic Client Side Ranking

Dynamic client side ranking can be done by using the XRANK operator which is a part of the FAST Query Language (FQL). The boost value is specified with the parameter boost=n, where n is some signed integer value. Negative boost is supported, but if the result of boosting with a negative value is negative then the result will be set to 0.
It's a concept unique to FAST, and I will cover it in detail in another post.
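The rule for negative boosts is easy to state in code. This is a minimal Python sketch of the arithmetic described above. The function name is hypothetical and the sketch does not show FQL syntax.

```python
def apply_boost(rank, boost):
    """Add a signed integer boost to a rank value, never going below 0."""
    return max(0, rank + boost)


print(apply_boost(100, 50))   # positive boost
print(apply_boost(100, -30))  # negative boost, result still positive
print(apply_boost(20, -50))   # negative result is set to 0
```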

Rank Modification Tools

FAST ESP provides tools to modify rank for individual documents. These tools enable you to perform Absolute Query Boost, Relative Query Boost or Relative Document Boost for given documents in the FAST ESP index. An example could be a product database where it may be desired to boost products with highest profit margins, boost products related to campaigns, etc.
Two main tools exist for this purpose:
1) Search Business Center (SBC) - This is an optional, GUI based tool which enables query-oriented rank tuning. The SBC also includes a powerful query reporting module that may be used to assist in the rank tuning. Using the SBC you can change the ranking for each query using three different methods:

Top Ten - to position the document in one of ten reserved places that will be returned at the top of the results list
Add boost points - to add a value to a document to increase its relevancy relative to the other documents returned in the search results. You can also add negative boost points to a document.
Block from query - to prevent the blocked document from appearing in the search results for the query.

2) Rank Tuning Bulk Loader - This is a standard FAST ESP tool that enables you to perform the same rank tuning as the SBC, using an XML file as input. The XML file contains a specification of the rank modifications to be performed.
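The three query-specific methods (Top Ten, boost points and blocking) can be pictured as a post-processing step on a ranked result list. The Python sketch below is only an illustration of the concept. The function and parameter names are hypothetical, it assumes a pinned document must already match the query, and it is not how the SBC or the bulk loader is implemented.

```python
def tune_results(results, pinned=None, boosts=None, blocked=None):
    """Apply query-specific rank tuning to a list of (doc_id, rank) pairs.

    pinned:  {position (1-10): doc_id}, reserved places at the top of the list
    boosts:  {doc_id: boost points}, positive or negative
    blocked: set of doc_ids that must not appear for this query
    """
    pinned = pinned or {}
    boosts = boosts or {}
    blocked = blocked or set()

    # Block from query: drop blocked documents entirely.
    ranks = {doc: rank for doc, rank in results if doc not in blocked}
    # Add boost points: adjust the rank of individual documents.
    for doc, points in boosts.items():
        if doc in ranks:
            ranks[doc] += points
    # Top Ten: pinned documents come first, in order of their reserved place.
    top = [doc for pos, doc in sorted(pinned.items()) if 1 <= pos <= 10 and doc in ranks]
    rest = sorted((d for d in ranks if d not in top), key=lambda d: ranks[d], reverse=True)
    return top + rest


hits = [("a", 90), ("b", 80), ("c", 70), ("d", 60)]
print(tune_results(hits, pinned={1: "d"}, boosts={"c": 25}, blocked={"a"}))
```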

How SharePoint does Relevancy?

Relevancy in SharePoint 2007

SharePoint 2007 also has a search relevance architecture, though it is much less sophisticated and flexible than that of FAST ESP. For details, refer to the SharePoint SDK documentation, Enterprise Search Relevance Architecture Overview. Also refer to Improving Relevance to learn some of the ways you can adjust the settings used in the ranking calculation for Enterprise Search results, as well as other techniques you can try for improving relevance.

Thursday, April 14, 2011

FAST ESP: Multi-node installation and Windows Automatic Private IP Addressing

The FAST ESP installer runs system checks for the following:

1. The machine is pingable
2. All 4000 ports from 13000 to 16999 are free
3. Remaining checks, such as Java being installed and required directories existing

Now the interesting part: all system checks passed and the installation completed with 0 errors. I spent another hour installing all the patches, only to find ESP failing to create a collection and the default views.

The following error is thrown
Failed to deploy views '[oursite, espsystemwebcluster]' (deployment id: 1956). Failed to perform deployment transaction. Deployment failed: Could not deploy resource 'no.fast.messages.resourceservice.ResourceHandle@2c91b364'
The above error hinted at a communication failure while deploying the resource files for collection creation. But I was not convinced, as the system status showed the following

Later I searched the Microsoft support site for a resolution and was glad to find an article suggesting that restarting ESP solves the issue. I happily restarted, but no luck!

After a day of analyzing the issue, which included uninstalling and reinstalling FAST ESP, I finally decided to open a ticket with Microsoft. Before opening the ticket, I ran ipconfig on all 4 servers to collect the network information, and bingo: all servers but one had 10.1.x.x addresses, while the remaining server showed 169.x.x.x. The network folks immediately recognized that this IP was incorrect.

What I describe below really ought to be included in the FAST installation guide.

From Windows Millennium Edition onwards, Windows assigns an automatic private IP address (APIPA, in the 169.254.x.x range) to any enabled network adapter that cannot obtain an address from a DHCP server. With automatic IP addressing, the IP address is configured automatically. This method decreases administration time and means that IP addresses can be reused. It is also recommended for networks of all sizes that do not have a direct Internet connection or an available DHCP service.

Please check the links below:

http://support.microsoft.com/kb/307287
http://support.microsoft.com/kb/956196

Resolution:

I disabled all the adapters except the one holding the static IP, and that solved the issue!
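A check like this would have saved me a day. Below is a minimal Python sketch that flags nodes with an address in the APIPA range 169.254.0.0/16. The host names and addresses are hypothetical, and you would fill in the dictionary from the ipconfig output of each node. It is not part of the FAST ESP installer.

```python
import ipaddress

APIPA_NETWORK = ipaddress.ip_network("169.254.0.0/16")


def is_apipa(address):
    """Return True if the address lies in the APIPA range 169.254.0.0/16."""
    return ipaddress.ip_address(address) in APIPA_NETWORK


def suspicious_hosts(host_addresses):
    """Given {hostname: [ip, ...]}, list the hosts that have an APIPA address."""
    return sorted(h for h, addrs in host_addresses.items() if any(is_apipa(a) for a in addrs))


# Hypothetical addresses collected with ipconfig on each node.
nodes = {
    "esp-node1": ["10.1.0.11"],
    "esp-node2": ["10.1.0.12", "169.254.17.4"],
}
print(suspicious_hosts(nodes))
```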

Wednesday, May 20, 2009

ESP Sharepoint web parts

The table below lists and describes the Web parts included in the ESP Web Parts package:

ESP Search Config

ESP Search Config is a hidden Web part for configuring the ESP connection, selecting the appropriate search view, and setting other search access parameters. You will need one ESP Search Config Web part for each search group.

ESP Search Box

ESP Search Box provides an input box for submitting a search query.

ESP Did You Mean

ESP Did You Mean analyzes the search query and returns the appropriate information. The output depends on the selected spellchecking mode:
• Suggest will return a suggestion if the query was misspelled.
• Modification will automatically resubmit the query and show the changed query term(s).

ESP Result List

ESP Result List displays the set of results found as a list.

ESP Search Statistics

ESP Search Statistics displays information such as the response time for the current query, the number of results returned, and the number of pages.

ESP Search Action

ESP Search Action displays links for actions that can be performed on the result set. Currently, sorting of fields/rank profiles is the only action supported.

ESP Search Pagination

ESP Search Pagination is used to page the search results.

ESP Navigator

ESP Navigator displays a single navigator for the search result set. This provides an aggregation of the values for a particular field and can be used to refine the search result set.

ESP Navigators

ESP Navigators displays all navigators returned for the current search result.

ESP Breadcrumb

ESP Breadcrumb displays the search term(s) and list of navigators used to obtain the current result set.

FAST Enterprise Search Platform (ESP)

I have spent the last week in Las Vegas attending the FAST Forward 09 conference. About a year ago the Norwegian company FAST Search & Transfer was acquired by Microsoft and, like me, customers all over the world wondered what would happen. Some thought it was great to have a huge company with its R&D resources take the platform forward, while others like me feared a technology transition that would include cancelling support for other operating systems and integration with nothing but Microsoft technology.

It was very clear that the Microsoft Marketing department had a lot to say about the conference and what messages were to be conveyed. Somewhere behind all that you could still see some of the old FAST mentality, but it was really toned down. To me the conference was about convincing existing customers that MS is committed to Enterprise Search and giving Sharepoint customers some idea of what Enterprise Search is all about.

It is clear that the product line is diversifying in line with a common Microsoft strategy:

Solutions for Internet Business

FAST Search for Internet Business
FAST Search for Sharepoint Internet sites
FAST AdMomentum

Solutions for Business Productivity

FAST Search for Sharepoint
FAST Search for Internal Application

FAST Search for Sharepoint won’t be available until Office Wave 14 (including Sharepoint) is released, so in the meantime there will be a product called FAST ESP for Sharepoint that can be used today and will have a license migration path towards FAST Search for Sharepoint. That product will have a product license of around 25 000 USD, and additional Client Access Licenses (CALs) will then follow in the standard MS manner.

So what does all of this mean for those of us who would like to see FAST ESP continue as an enterprise component in a heterogeneous environment? Well, MS has committed to 10 years of support for current customers, I guess as a gesture towards those who are worried. Over and over again I heard representatives talking about how important those high-end installations on other operating systems are. The same message appeared when it came to connectors and integration with Enterprise Content Management systems like EMC Documentum. Still, most if not all demos were connected to Sharepoint and/or other MS-specific technologies.

The technical roadmap means that the past year has been devoted to rewriting their next generation search platform from Java to .Net. The first product to be released is the Content Integration Studio (CIS), which consists of a Visual Studio component (I guess it was earlier in Eclipse) and a server-side execution engine. This will only be available on Windows since it is deeply connected to the .Net environment. It looks like a promising product, with support for flows instead of a linear pipeline to handle the processing of information before it is handed off to the index engine. CIS therefore sits in front of FAST ESP, and a combination of actions in flows and in old pipelines can be executed. Information from CIS is written to ESP, which then creates the index and also processes queries against it.

What I think we can expect is that new innovation will focus on creating a modular architecture where CIS is the first module. Features in ESP will then be gradually reengineered in a .Net environment, thus creating a common search platform some years into the future. It will likely mean that we will still see one or two upgrades to the core ESP as we know it today, to enable it to function together with the new components. Content Fusion will most likely be the next module that will extend ESP, but on a .Net architecture.

When it comes to the presentation logic, where we today have the FAST Search Front-End (SFE), we will see it either as Web parts for Sharepoint or as AJAX Aerogel from MS. These are currently developed using Javascript but will include Silverlight later on.

These will initially be offered in both an IIS and a Tomcat flavour, and possibly others if there is demand. They will initially be integrated with ESP and Unity, thus opening up a new approach to developing a search experience on top of them.

In general I don’t like the Microsoft approach of insisting on owning the whole technology stack themselves and refusing to invest in other standards-based projects. Instead of developing their own AJAX libraries they could have used ExtJS or even Google Web Toolkit. While it is not open source, MS argues that it comes with a very permissive licence that has many of the same qualities. A good thing is that MS was committed to making sure that this framework works on all major browsers including Firefox, Safari and Chrome. It is interoperable with jQuery.

In summary I think it is kind of a mixed experience. The new features being developed are truly needed to keep FAST one of the most advanced search engines available. I think many of the features look really promising and I can’t wait to get my hands on them. On the other hand it is clear that things are going proprietary (FAST ESP had a lot of open source in it) and being aligned with a Microsoft stack, thus gradually minimizing options. That includes how new technologies are being implemented (MS ones instead of open source), what operating systems it will run on and what the support for developing presentation logic looks like. It means I have to have people who know both Java and .Net, both Flash and Silverlight (possibly JavaFx) and both ExtJS/GWT and MS AJAX/Aerogel.

We are deeply invested in the EMC Documentum Platform and would of course like to continue to use ESP as a way to add advanced capabilities and performance to our architecture. However, I think I will over time get sick and tired of Microsoft sales people trying to convince me to use Sharepoint instead of Documentum. For anybody who knows how both platforms work it is almost a joke, but I will most likely have to keep explaining and explaining. I just hope that we can have a decent connector developed for Documentum.

FAST ESP: Different ways to retrieve Content

Content retrieval is done very differently in FAST than in SharePoint 2007. FAST ESP may retrieve content from the data sources using two broad approaches:

1) Content Pull: this approach leverages content connectors to retrieve the information via standard APIs or interfaces provided by the source content repositories. This is the core technology of most search solutions, and includes retrieval of file server based documents, web based information, databases or any other enterprise applications. The content connectors do not require integration programming towards the target data repositories.

2) Content Push: this approach requires that the data repositories, applications or messaging middleware send the data directly to FAST ESP via its Content API. This omits the latency of crawling but it requires a closer relationship between the content application and search engine. Multiple programmatic interfaces like .NET, Java, C++, XML-RPC are available for pushing content.

Content Connectors

A content connector is a program that extracts content from some source system, maps the content from the source document model to the document model of FAST ESP, and feeds the documents to FAST ESP for indexing. FAST ESP ships with several commonly used connectors like:

Enterprise Crawler
File Traverser
Database Connector

Other connectors available from FAST include: Microsoft Content Management Server (MCMS) 2002, Documentum, Sharepoint Portal Server (SPS) 2003, Microsoft Office SharePoint Server (MOSS) 2007, StarTeam, CaliberRM, Meridio, Oracle Content Server...

Connectors available from FAST Partners include: Kapow, Vignette VCM, OpenText LiveLink, Interwoven TeamSite, FileNet P8 Content Manager, SAP EP Connector, and SAP PLM...

The connectors for the specific applications mentioned above have various properties, one of which is that they recognize the source system's security model. This enables security trimming of the search results, i.e. search users see only the information they have access to.

In addition to all these options, the FAST ESP SDK also provides a Content Connector Toolkit which helps you create your own connector application.

Enterprise Crawler

Content on Web sites/applications can be retrieved using the FAST Enterprise Crawler. The Crawler scans specified web sites and follows hyperlinks, extracts the desired information and detects duplicates. It interfaces directly with the Content API to submit the content. The document processing converts the HTML into structured data as defined by the web representation.

Multiple web domains from Intranet, Extranet or Internet can be specified, with individually configured refresh rate, MIME-type support, etc. Parts of web domains can be included/excluded from the crawl using regular expression based configuration. The Crawler supports incremental crawling, dynamic pages, entitled content (cookie, SSL, password), HTTP 1.0/1.1, FTP, frames, Macromedia Flash content, robots.txt and meta robots tags.
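The regular expression based include/exclude idea can be sketched in a few lines of Python. The patterns and the domain below are hypothetical, and the rule that a URL must match an include pattern and no exclude pattern is my assumption for illustration. The actual crawler configuration format is not shown here.

```python
import re

# Hypothetical include/exclude rules for one web domain.
INCLUDE = [re.compile(r"^https?://intranet\.example\.com/docs/")]
EXCLUDE = [re.compile(r"\.(zip|exe)$"), re.compile(r"/archive/")]


def should_crawl(url):
    """A URL is crawled if it matches an include rule and no exclude rule."""
    if not any(p.search(url) for p in INCLUDE):
        return False
    return not any(p.search(url) for p in EXCLUDE)


for url in (
    "http://intranet.example.com/docs/guide.html",
    "http://intranet.example.com/docs/archive/old.html",
    "http://intranet.example.com/hr/policy.html",
):
    print(url, should_crawl(url))
```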

Intelligent loop detection keeps the crawler from repeatedly traversing the same page. During incremental crawling, the Crawler can be configured to focus on retrieving new content only, or detecting modified or deleted items in previously retrieved content.

The Browser Engine is a standalone component which is used by the Enterprise Crawler to extract information from JavaScript and Flash files. A unique JavaScript parser enables the crawler to index dynamic content generated by JavaScript on the client side, and follow JavaScript generated links. The crawler includes the ability to follow hyperlinks and index textual content from Macromedia Flash files.

A multi-node Crawler architecture provides unlimited scalability in number of crawled web servers and number of documents.

File Traverser

Files from any reachable file server can be retrieved using the File Traverser. It scans specified file directories on file servers, retrieves content of various formats, and submits it to a collection in the same way as the Enterprise Crawler. More than 400 file types can be processed, including popular document types such as Microsoft Office, plain text and Adobe PDF files.

The file traverser crawls all sub directories starting at a given top directory. It then processes all files that match defined extensions, such as html, pdf and doc, and generates a URL per document based on a given prefix. Documents are then sent to the ESP Content API in configurable batches. The size of the batches is limited by two factors: total file size and number of documents.
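The traversal and batching logic described above can be sketched in Python as follows. This is an illustration only: the function names and the URL prefix are hypothetical, the sketch stops at building the batches, and it does not call the ESP Content API.

```python
import os


def traverse(top_dir, extensions, url_prefix):
    """Yield (url, path, size) for every matching file under top_dir."""
    wanted = {e.lower().lstrip(".") for e in extensions}
    for root, dirs, files in os.walk(top_dir):
        dirs.sort()  # visit sub directories in a deterministic order
        for name in sorted(files):
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext in wanted:
                path = os.path.join(root, name)
                rel = os.path.relpath(path, top_dir).replace(os.sep, "/")
                yield url_prefix + rel, path, os.path.getsize(path)


def batches(items, max_docs, max_bytes):
    """Group (url, path, size) items into batches limited by count and total size."""
    batch, total = [], 0
    for item in items:
        size = item[2]
        # Start a new batch if adding this item would exceed either limit.
        if batch and (len(batch) >= max_docs or total + size > max_bytes):
            yield batch
            batch, total = [], 0
        batch.append(item)
        total += size
    if batch:
        yield batch


if __name__ == "__main__":
    docs = traverse(".", ["html", "pdf", "doc"], "file://fileserver/share/")
    for number, batch in enumerate(batches(docs, max_docs=100, max_bytes=10_000_000), 1):
        print("batch", number, "contains", len(batch), "documents")
```

Note that in this sketch a single file larger than max_bytes still ends up in a batch of its own rather than being skipped.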

You may configure the file traverser with authorization levels in order to retrieve entitled content. Several instances of the traverser may be configured with different authorization levels in order to handle multiple collections of data with different entitlements.

Database Connector

FAST ESP provides an index architecture that is well suited for both structured and unstructured information. Integrating the search engine with a relational database is performed for two main reasons:

Relational databases are not very efficient at handling large query volumes. Exporting the data to a dedicated search engine may dramatically off-load the database servers
Integrating a large number of different data sources into one index and one search bar provides a more convenient search experience

FAST ESP provides connectors for a number of relational database systems such as SQL Server, Oracle, MySQL and DB2. The connectors support flexible indexing of structured content and document attachments.

Database retrieval may be configured using SQL statements, and content from multiple tables or databases may be pre-joined prior to indexing. This enables tailoring of the content schema in the Index to frequent queries. Document attachments may also be indexed together with the database content.
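To illustrate what pre-joining means, the Python sketch below uses SQLite from the standard library as a stand-in database. The table and column names are hypothetical, and this is not the configuration format of the FAST database connector. It only shows how a single SQL statement can turn rows from two tables into flat documents ready for indexing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, category_id INTEGER);
    CREATE TABLE category (id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO category VALUES (1, 'Tools'), (2, 'Toys');
    INSERT INTO product VALUES (10, 'Hammer', 1), (11, 'Robot', 2);
""")

# One SQL statement pre-joins the tables so that each row is a complete,
# flat document.
PREJOIN_SQL = """
    SELECT p.id, p.name, c.title AS category
    FROM product p JOIN category c ON c.id = p.category_id
    ORDER BY p.id
"""


def fetch_documents(connection):
    """Return one dict per joined row, keyed by column name."""
    cursor = connection.execute(PREJOIN_SQL)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


for document in fetch_documents(conn):
    print(document)
```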

FAST ESP is very scalable and can be sized to index billions of database records if needed. An optimized incremental update feature is also provided that takes advantage of update notifications provided by commercial databases. In this way FAST ESP will only request the database for content that is known to be updated. This approach ensures that the latency from a database table change until it is updated in FAST ESP can be minimized, and the system will impose minimal load on the database host system.

FAST ESP versus MOSS 2007 / Microsoft Search Server

It took me some time, but I figured out some differences between Microsoft Search Server / MOSS 2007 and Microsoft FAST ESP. These differences do not come from Microsoft or from FAST. However, it has come to my notice that Microsoft and FAST will announce a complete and correct list of the differences between the two products at the conference in Las Vegas next week. These differences are based on business requirements and will help me and you make the right decisions when implementing search for our customers.

Data Volume Scalability

FAST ESP has proven its scalability and can handle many millions, and even billions, of documents. The system scales by adding hardware. MOSS performs well up to 20 million documents; above this, the search solution is no longer supported by Microsoft.

Query Rate Scalability

FAST ESP delivers a model that can handle up to thousands of QPS (queries per second), while the QPS of MOSS is still unknown. FAST ESP differs from its competitors on the market in that it expresses its query rate in QPS instead of QPM (queries per minute).

High availability

FAST ESP can be deployed in a distributed manner so that there is no single point of failure, while MOSS uses only one index server, which is a single point of failure.

Taxonomy

FAST ESP provides contextual search insight information over arbitrary XML content. This is not available in MOSS.

Navigation

FAST ESP has something called FAST recommendations, which makes recommendations based on user profiles and activities. This is not available in MOSS.

Real-time alerts

FAST ESP supports real-time alerts to monitor ingested content and alerts the user through mechanisms like SMS, Email, Blackberry and other means. This is not available in MOSS.

Faceted Navigation on Metadata

One of the biggest advantages of FAST is its out-of-the-box (OOTB) entity extraction capability, which covers over 30 different entities (e.g. companies, products, persons, and more). It also supports deep navigation technology, while MOSS supports only a shallow faceted search solution based on its best bets results.

Advanced federation

FAST ESP supports advanced federation, including sending queries to various web search APIs, mixing results, and shallow navigation. MOSS supports federation only without mixing the results from different sources and navigation components; it shows them separately.

Relevance tuning

FAST ESP has an open architecture and is fully configurable for relevancy. There is a management GUI for business rules. MOSS is not easy to configure for relevancy tuning (as stated, code-level changes are required).

Linguistics

FAST ESP supports word inflection for 32 languages with high-quality lemmatization. Spell checking such as "Did you mean" is also available for these languages. It is possible to tune spelling dictionaries and algorithms, and indexing is supported for more than 80 languages. MOSS 2007 uses only stemming algorithms and basic spell checking, with few configuration options.

FAST ESP has also a dictionary management platform for managing dictionaries for synonyms, spell checking, anti-phrasing, lemmatization, black lists, companies, personal names and more. This is not available in MOSS.

Enrichment of indexing and handling search

FAST ESP has a document processing framework which provides the customer with 11 pre-configured pipelines for handling various types of content (e.g. HTML, unstructured data like PDFs and Office documents). Each of those pipelines can be customized based on over 100 stages. You are able to define your own pipelines and stages. Stages are developed in Python. This is not available in MOSS.

Doing Search

FAST ESP supports approximate search, such as people finding with fuzzy logic, n-gram search, synonyms, spell check, lemmatization and wildcards (prefix, postfix and full). MOSS supports only the * wildcard and has no front truncation.

Duplicate results

FAST ESP has a framework that supports de-duplication of results both before indexing and at query time. The criteria can be customized, for example same title, same author and so on. This is not available in MOSS.
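The idea of de-duplicating on configurable fields can be sketched in Python like this. The function name and the sample documents are hypothetical, and treating values as equal after trimming and lower-casing is my own simplification, not a description of the FAST framework.

```python
def deduplicate(results, fields=("title", "author")):
    """Keep the first result for each distinct combination of the given fields."""
    seen = set()
    unique = []
    for doc in results:
        # Normalise case and whitespace so trivially different values collapse.
        key = tuple(str(doc.get(f, "")).strip().lower() for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


hits = [
    {"title": "Q1 Report", "author": "Ann", "url": "a"},
    {"title": "q1 report ", "author": "ann", "url": "b"},
    {"title": "Q1 Report", "author": "Bob", "url": "c"},
]
print([doc["url"] for doc in deduplicate(hits)])
```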

Reporting

MOSS 2007 supports simple query reporting and has some trouble when you want to transfer logs to 3rd party tools, while FAST ESP supports sophisticated query reporting including top queries, top zero-result queries, managed queries and more. Query logs are in W3C format, and tools are provided to transfer these logs to an RDBMS for analysis using 3rd party reporting tools like SQL Server Reporting and Crystal Reports.