Wednesday, May 20, 2009

FAST ESP: Different ways to retrieve Content

Content retrieval is done very differently in FAST than in SharePoint 2007. FAST ESP may retrieve content from the data sources using two broad approaches:

1) Content Pull: this approach leverages content connectors to retrieve the information via standard APIs or interfaces provided by the source content repositories. This is the core technology of most search solutions, and includes retrieval of file server based documents, web based information, databases or any other enterprise applications. The content connectors do not require integration programming towards the target data repositories.

2) Content Push: this approach requires that the data repositories, applications or messaging middleware send the data directly to FAST ESP via its Content API. This avoids the latency of crawling, but it requires tighter coupling between the content application and the search engine. Several programmatic interfaces, such as .NET, Java, C++ and XML-RPC, are available for pushing content.

Content Connectors

A content connector is a program that extracts content from some source system, maps the content from the source document model to the document model of FAST ESP, and feeds the documents to FAST ESP for indexing. FAST ESP ships with several commonly used connectors like:

Enterprise Crawler
File Traverser
Database Connector

Other connectors available from FAST include: Microsoft Content Management Server (MCMS) 2002, Documentum, SharePoint Portal Server (SPS) 2003, Microsoft Office SharePoint Server (MOSS) 2007, StarTeam, CaliberRM, Meridio, Oracle Content Server...

Connectors available from FAST Partners include: Kapow, Vignette VCM, OpenText LiveLink, Interwoven TeamSite, FileNet P8 Content Manager, SAP EP Connector, and SAP PLM...

The connectors for the specific applications mentioned above have various properties, one of which is that they recognize the source system's security model. This enables security trimming of the search results, i.e. search users see only the information they have access to.
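To make the idea of security trimming concrete, here is a minimal sketch. It assumes each hit carries an `acl` list of principals copied from the source system at indexing time; the function name, the `acl` field and the principal strings are all hypothetical and are not part of any FAST ESP API.

```python
from typing import Dict, List, Set


def trim_results(hits: List[Dict], user_principals: Set[str]) -> List[Dict]:
    """Keep only the hits whose ACL shares at least one principal with the user.

    Each hit is assumed (hypothetically) to look like
    {"url": "...", "acl": ["group:hr", "user:alice"]}.
    """
    return [hit for hit in hits if user_principals & set(hit["acl"])]


hits = [
    {"url": "file://hr/salaries.xls", "acl": ["group:hr"]},
    {"url": "file://public/handbook.pdf", "acl": ["group:everyone"]},
]
visible = trim_results(hits, {"user:bob", "group:everyone"})
```

In this example only the handbook remains in `visible`, because the user is not a member of `group:hr`.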

In addition to all these options, the FAST ESP SDK also provides a Content Connector Toolkit which helps you create your own connector application.
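The three steps a connector performs (extract, map, feed) can be sketched as follows. This is only an illustration of the pattern, not the Content Connector Toolkit itself: `IndexDocument`, `FIELD_MAP`, `map_document` and `run_connector` are hypothetical names, and the `feed` callback stands in for whatever call actually submits a document for indexing.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable

# Hypothetical mapping from source attribute names to index field names.
FIELD_MAP = {"Title": "title", "Body": "body", "ModifiedBy": "author"}


@dataclass
class IndexDocument:
    """Hypothetical target document model: an ID plus named text fields."""
    doc_id: str
    fields: Dict[str, str] = field(default_factory=dict)


def map_document(source: Dict[str, str], id_key: str = "Url") -> IndexDocument:
    """Map a source record to the target model, dropping unmapped attributes."""
    fields = {target: source[src] for src, target in FIELD_MAP.items() if src in source}
    return IndexDocument(doc_id=source[id_key], fields=fields)


def run_connector(
    extract: Iterable[Dict[str, str]],
    feed: Callable[[IndexDocument], None],
) -> int:
    """Extract each source record, map it, feed it, and return the count fed."""
    count = 0
    for source in extract:
        feed(map_document(source))
        count += 1
    return count
```

Passing `list.append` of an empty list as `feed` is a simple way to try the sketch without any search engine behind it.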

Enterprise Crawler

Content on web sites and web applications can be retrieved using the FAST Enterprise Crawler. The Crawler scans specified web sites, follows hyperlinks, extracts the desired information and detects duplicates. It interfaces directly with the Content API to submit the content. The document processing converts the HTML into structured data as defined by the web representation.

Multiple web domains from Intranet, Extranet or Internet can be specified, each with individually configured refresh rate, MIME-type support, etc. Parts of web domains can be included in or excluded from the crawl using regular expression based configuration. The Crawler supports incremental crawling, dynamic pages, entitled content (cookie, SSL, password), HTTP 1.0/1.1, FTP, frames, Macromedia Flash content, robots.txt and meta robots tags.
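Regular expression based include/exclude rules can be pictured with a short sketch. The rule format and the precedence used here (an exclude match always wins) are assumptions made for illustration; the actual crawler configuration syntax is not shown in this post, and the host name is made up.

```python
import re
from typing import Iterable, List, Pattern


def compile_rules(patterns: Iterable[str]) -> List[Pattern[str]]:
    """Compile a list of regular expression strings once, up front."""
    return [re.compile(p) for p in patterns]


def should_crawl(url: str, include: List[Pattern[str]], exclude: List[Pattern[str]]) -> bool:
    """Return True if the URL matches an include rule and no exclude rule.

    Assumption for this sketch: exclude rules take precedence over include rules.
    """
    if any(rule.search(url) for rule in exclude):
        return False
    return any(rule.search(url) for rule in include)


# Hypothetical rules: crawl one intranet host, skip its archive and ZIP files.
INCLUDE = compile_rules([r"^https?://intranet\.example\.com/"])
EXCLUDE = compile_rules([r"/archive/", r"\.zip$"])
```

With these rules, `should_crawl("http://intranet.example.com/news/today.html", INCLUDE, EXCLUDE)` is accepted, while anything under `/archive/` or ending in `.zip` is rejected.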

Intelligent loop detection keeps the crawler from repeatedly traversing the same page. During incremental crawling, the Crawler can be configured to focus on retrieving new content only, or detecting modified or deleted items in previously retrieved content.
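One common way to detect new, modified and deleted items between two crawls is to compare content fingerprints. The sketch below illustrates that general technique only; it is not a description of how the Enterprise Crawler implements it, and `fingerprint` and `diff_crawl` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest used to detect content changes."""
    return hashlib.sha256(content).hexdigest()


def diff_crawl(
    previous: Dict[str, str],
    current: Dict[str, bytes],
) -> Tuple[List[str], List[str], List[str]]:
    """Compare stored fingerprints (url -> digest) with freshly fetched pages.

    Returns three sorted URL lists: (new, modified, deleted).
    """
    new = sorted(url for url in current if url not in previous)
    modified = sorted(
        url
        for url, body in current.items()
        if url in previous and previous[url] != fingerprint(body)
    )
    deleted = sorted(url for url in previous if url not in current)
    return new, modified, deleted
```

A crawl that only cares about new content would act on the first list alone; a full refresh would also re-index the second list and remove the third from the index.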

The Browser Engine is a standalone component which is used by the Enterprise Crawler to extract information from JavaScript and Flash files. A unique JavaScript parser enables the crawler to index dynamic content generated by JavaScript on the client side, and follow JavaScript generated links. The crawler includes the ability to follow hyperlinks and index textual content from Macromedia Flash files.

A multi-node Crawler architecture lets the crawl scale out, by adding nodes, in both the number of crawled web servers and the number of documents.

File Traverser

Files from any reachable file server can be retrieved using the File Traverser. It scans specified file directories on file servers, retrieves content of various formats, and submits it to a collection in the same way as the Enterprise Crawler. More than 400 file types can be processed, including popular document types such as Microsoft Office, plain text and Adobe PDF files.

The File Traverser crawls all subdirectories starting at a given top directory. It then processes all files that match defined extensions, such as html, pdf and doc, and generates a URL per document based on a given prefix. Documents are then sent to the ESP Content API in configurable batches. The size of the batches is limited by two factors: total file size and number of documents.
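The traversal and batching behaviour described above can be sketched as follows. The function names, the parameter names and the rule that an oversized single file still gets a batch of its own are assumptions for illustration; the sketch only collects URLs and does not call the ESP Content API.

```python
import os
from typing import Iterable, Iterator, List, Set, Tuple


def traverse(top: str, extensions: Set[str], url_prefix: str) -> Iterator[Tuple[str, int]]:
    """Yield (url, size_in_bytes) for every file under `top` with a matching extension."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            extension = os.path.splitext(name)[1].lstrip(".").lower()
            if extension in extensions:
                path = os.path.join(dirpath, name)
                relative = os.path.relpath(path, top).replace(os.sep, "/")
                yield url_prefix + relative, os.path.getsize(path)


def batches(
    items: Iterable[Tuple[str, int]], max_docs: int, max_bytes: int
) -> Iterator[List[str]]:
    """Group URLs into batches limited by document count and total file size.

    A file larger than max_bytes is still emitted, alone in its own batch.
    """
    batch: List[str] = []
    batch_bytes = 0
    for url, size in items:
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(url)
        batch_bytes += size
    if batch:
        yield batch
```

A call such as `batches(traverse("/data/docs", {"html", "pdf", "doc"}, "file://server/docs/"), max_docs=100, max_bytes=50_000_000)` would then produce the groups to submit, where the path, prefix and limits are made-up values.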

You may configure the File Traverser with authorization levels in order to retrieve entitled content. Several instances of the traverser may be configured with different authorization levels in order to handle multiple collections of data with different entitlements.

Database Connector

FAST ESP provides an index architecture that is well suited for both structured and unstructured information. Integrating the search engine with a relational database is performed for two main reasons:

Relational databases are not very efficient for handling large query volumes. Exporting the data to a dedicated search engine may dramatically off-load the database servers.
Integrating a large number of different data sources into one index and one search bar provides a more convenient search experience.

FAST ESP provides connectors for a number of relational database systems such as SQL Server, Oracle, MySQL and DB2. The connectors support flexible indexing of structured content and document attachments.

Database retrieval may be configured using SQL statements, and content from multiple tables or databases may be pre-joined prior to indexing. This enables tailoring of the content schema in the index to frequent queries. Document attachments may also be indexed together with the database content.
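Pre-joining means that a single SQL statement flattens several tables into one record per document before indexing. The sketch below shows the idea using Python's built-in `sqlite3` module as a stand-in database; the `products` and `categories` tables and all column names are invented for the example and have nothing to do with the actual connector configuration.

```python
import sqlite3
from typing import Dict, List


def create_sample_db() -> sqlite3.Connection:
    """Build a small in-memory database with two related (hypothetical) tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE products (
            id INTEGER PRIMARY KEY, name TEXT, description TEXT, category_id INTEGER
        );
        INSERT INTO categories VALUES (1, 'Books'), (2, 'Music');
        INSERT INTO products VALUES
            (10, 'Search Basics', 'An introduction', 1),
            (11, 'Index Blues', 'A studio album', 2);
        """
    )
    return conn


def build_documents(conn: sqlite3.Connection) -> List[Dict[str, object]]:
    """Run the pre-join and return one flat dict per document to index."""
    cursor = conn.execute(
        """
        SELECT p.id, p.name, c.name AS category, p.description
        FROM products AS p
        JOIN categories AS c ON c.id = p.category_id
        ORDER BY p.id
        """
    )
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]
```

Because the category name is already part of each flattened record, a query on `category` can be answered from the index alone, without a join at search time.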

FAST ESP is very scalable and can be sized to index billions of database records if needed. An optimized incremental update feature is also provided that takes advantage of update notifications provided by commercial databases. In this way FAST ESP queries the database only for content that is known to have been updated. This approach minimizes the latency between a change in a database table and the corresponding update in FAST ESP, and it imposes minimal load on the database host system.
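The principle of fetching only rows that are known to have changed can be illustrated with a trigger-maintained change log. This is a generic sketch using `sqlite3` as a stand-in, not the notification mechanism of any particular commercial database or of the FAST ESP connector; the table, trigger and function names are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def create_db() -> sqlite3.Connection:
    """Create a table plus a trigger that logs the ID of every updated row."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE article_changes (article_id INTEGER);
        CREATE TRIGGER articles_updated AFTER UPDATE ON articles
        BEGIN
            INSERT INTO article_changes (article_id) VALUES (NEW.id);
        END;
        """
    )
    conn.executemany(
        "INSERT INTO articles (id, title) VALUES (?, ?)",
        [(1, "Alpha"), (2, "Beta"), (3, "Gamma")],
    )
    return conn


def fetch_changed(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """Return only rows recorded in the change log, then clear the log."""
    rows = conn.execute(
        """
        SELECT id, title FROM articles
        WHERE id IN (SELECT article_id FROM article_changes)
        ORDER BY id
        """
    ).fetchall()
    conn.execute("DELETE FROM article_changes")
    return rows
```

After an `UPDATE` touches one row, `fetch_changed` returns just that row, and a second call returns nothing until the next change, so the full table is never rescanned.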
