Enterprise search: 5/17/09

Wednesday, May 20, 2009

ESP Sharepoint web parts

The table below lists and describes the Web parts included in the ESP Web Parts package:

ESP Search Config

ESP Search Config is a hidden Web part for configuring the ESP connection, selecting the appropriate search view, and setting other search access parameters. You will need one ESP Search Config Web part for each search group.

ESP Search Box

ESP Search Box provides an input box for submitting a search query.

ESP Did You Mean

ESP Did You Mean analyzes the search query and returns the appropriate information. The output depends on the selected spellchecking mode:
• Suggest will return a suggestion if the query was misspelled.
• Modification will automatically resubmit the query and show the changed query term(s).

ESP Result List

ESP Result List displays the set of results found as a list.

ESP Search Statistics

ESP Search Statistics displays information such as the response time for the current query, the number of results returned, and the number of pages.

ESP Search Action

ESP Search Action displays links for actions that can be performed on the result set. Currently, sorting of fields/rank profiles is the only action supported.

ESP Search Pagination

ESP Search Pagination is used to page the search results.

ESP Navigator

ESP Navigator displays a single navigator for the search result set. This provides an aggregation of the values for a particular field and can be used to refine the search result set.

ESP Navigators

ESP Navigators displays all navigators returned for the current search result.

ESP Breadcrumb

ESP Breadcrumb displays the search term(s) and list of navigators used to obtain the current result set.

FAST Enterprise Search Platform (ESP)

I have spent the last week in Las Vegas attending the FAST Forward 09 conference. About a year ago the Norwegian company FAST Search & Transfer was acquired by Microsoft and, like me, customers all over the world wondered what would happen. Some thought it was great to have a huge company with its R&D resources take the platform forward, while others, like me, feared a technology transition that would include cancelling support for other operating systems and integrating with nothing but Microsoft technology.

It was very clear that the Microsoft marketing department had a lot to say about the conference and which messages were to be conveyed. Somewhere behind all that you could still see some of the old FAST mentality, but it was really toned down. To me the conference was about convincing existing customers that MS is committed to Enterprise Search and giving Sharepoint customers some idea of what Enterprise Search is all about.

It is clear that the product line is diversifying in a common Microsoft strategy:

Solutions for Internet Business

• FAST Search for Internet Business
• FAST Search for Sharepoint Internet sites
• FAST AdMomentum

Solutions for Business Productivity

• FAST Search for Sharepoint
• FAST Search for Internal Application

FAST Search for Sharepoint won’t be available until Office Wave 14 (incl Sharepoint) is released, so in the meantime there will be a product called FAST ESP for Sharepoint that can be used today and will have a license migration path towards FAST Search for Sharepoint. That product will have a product license of around 25 000 USD, and additional Client Access Licenses (CALs) will follow in the standard MS manner.

So what does all of this mean for those of us who would like to see FAST ESP continue as an enterprise component in a heterogeneous environment? Well, MS has committed to 10 years of support for current customers, I guess as a gesture towards those who are worried. Over and over again I heard representatives talking about how important those high-end installations on other operating systems are. The same message appeared when it came to connectors and integration with Enterprise Content Management systems like EMC Documentum. Still, most if not all demos were connected to Sharepoint and/or other MS-specific technologies.

The technical roadmap means that the past year has been devoted to rewriting their next generation search platform from Java to .Net. The first product that will be released is the Content Integration Studio (CIS), which consists of a Visual Studio component (I guess it was earlier in Eclipse) and a server-side execution engine. This will only be available on Windows since it is deeply connected to the .Net environment. It looks like a promising product with support for flows instead of a linear pipeline to handle the processing of information before it is handed off to the index engine. CIS therefore sits in front of FAST ESP, and a combination of actions in flows and in old pipelines can be executed. Information from CIS is written to ESP, which then creates the index and also processes queries against it.

What I think we can expect is that new innovation will focus on creating a modular architecture where CIS is the first module. Features in ESP will then be gradually reengineered in a .Net environment, thus creating a common search platform some years into the future. It will likely mean that we will still see one or two upgrades to the core ESP as we know it today to enable it to function together with the new components. Content Fusion will most likely be the next module that will extend ESP, but on a .Net architecture.

When it comes to the presentation logic, where we today have the FAST Search Front-End (SFE), we will see it delivered either as Web parts for Sharepoint or as AJAX Aerogel from MS. These are currently developed using Javascript but will include Silverlight later on.

These will initially be offered in both an IIS and a Tomcat flavour, and possibly others if there is demand. They will initially be integrated with ESP and Unity, thus opening up a new approach to developing a search experience on top of them.

In general I don’t like the Microsoft approach of insisting on owning the whole technology stack themselves and refusing to invest in other standards-based projects. Instead of developing their own AJAX libraries they could have used ExtJS or even Google Web Toolkit. While it is not open source, MS argues that it comes with a very permissive licence that has many of the same qualities. A good thing is that MS was committed to making sure that this framework works on all major browsers including Firefox, Safari and Chrome. It is interoperable with jQuery.

In summary I think it is kind of a mixed experience. The new features being developed are truly needed to keep FAST one of the most advanced search engines available. I think many of the features look really promising and I can’t wait to get my hands on them. On the other hand it is clear that things are going proprietary (FAST ESP had a lot of open source in it) and that the product is being aligned with a Microsoft stack, thus gradually minimizing options. That includes how new technologies are being implemented (MS ones instead of open source), what operating systems it will run on and what the support for developing presentation logic looks like. It means I have to have people who know both Java and .Net, both Flash and Silverlight (possibly JavaFX) and both ExtJS/GWT and MS AJAX/Aerogel.

We are deeply invested in the EMC Documentum platform and would of course like to continue to use ESP as a way to add advanced capabilities and performance to our architecture. However, I think I will over time get sick and tired of Microsoft sales people trying to convince me to use Sharepoint instead of Documentum. For anybody who knows how both platforms work it is almost a joke, but I will most likely have to keep explaining and explaining. I just hope that we can have a decent connector developed for Documentum.

FAST ESP: Different ways to retrieve Content

Content retrieval is done very differently in FAST than in SharePoint 2007. FAST ESP may retrieve content from the data sources using two broad approaches:

1) Content Pull: this approach leverages content connectors to retrieve the information via standard APIs or interfaces provided by the source content repositories. This is the core technology of most search solutions, and includes retrieval of file server based documents, web based information, databases or any other enterprise applications. The content connectors do not require integration programming towards the target data repositories.

2) Content Push: this approach requires that the data repositories, applications or messaging middleware send the data directly to FAST ESP via its Content API. This omits the latency of crawling but it requires a closer relationship between the content application and search engine. Multiple programmatic interfaces like .NET, Java, C++, XML-RPC are available for pushing content.

Content Connectors

A content connector is a program that extracts content from some source system, maps the content from the source document model to the document model of FAST ESP, and feeds the documents to FAST ESP for indexing. FAST ESP ships with several commonly used connectors like:

• Enterprise Crawler
• File Traverser
• Database Connector

Other connectors available from FAST include: Microsoft Content Management Server (MCMS) 2002, Documentum, Sharepoint Portal Server (SPS) 2003, Microsoft Office SharePoint Server (MOSS) 2007, StarTeam, CaliberRM, Meridio, Oracle Content Server...

Connectors available from FAST Partners include: Kapow, Vignette VCM, OpenText LiveLink, Interwoven TeamSite, FileNet P8 Content Manager, SAP EP Connector, and SAP PLM...

The connectors for the specific applications mentioned above have various properties, one of which is that they recognize the source system's security model. This enables security trimming of the search results, i.e. search users see only the information they have access to.
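To make the idea of security trimming a bit more concrete, here is a minimal Java sketch. It is not the FAST ESP API: the Hit class, the group names and the trim method are all hypothetical, and it simply assumes that each indexed document carries the list of groups allowed to read it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of security trimming; not part of the FAST ESP API.
final class SecurityTrimSketch {

    // A search hit plus the groups allowed to read it (assumed to be captured at index time).
    static final class Hit {
        final String title;
        final Set<String> allowedGroups;

        Hit(String title, String... allowedGroups) {
            this.title = title;
            this.allowedGroups = new HashSet<>(Arrays.asList(allowedGroups));
        }
    }

    // Keep only the hits that share at least one group with the searching user.
    static List<String> trim(List<Hit> hits, Set<String> userGroups) {
        List<String> visible = new ArrayList<>();
        for (Hit hit : hits) {
            if (!Collections.disjoint(hit.allowedGroups, userGroups)) {
                visible.add(hit.title);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("Budget 2009", "finance"),
                new Hit("Cafeteria menu", "everyone"),
                new Hit("Merger plan", "board", "finance"));
        Set<String> user = new HashSet<>(Arrays.asList("everyone", "engineering"));
        System.out.println(trim(hits, user)); // only the hits this user may read
    }
}
```

A real connector would map the source system's own users and groups rather than plain strings, but the filtering step has the same shape.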

In addition to all these options, the FAST ESP SDK also provides a Content Connector Toolkit which helps you create your own connector application.

Enterprise Crawler

Content on Web sites/applications can be retrieved using the FAST Enterprise Crawler. The Crawler scans specified web sites and follows hyperlinks, extracts the desired information and detects duplicates. It interfaces directly with the Content API to submit the content. The document processing converts the HTML into structured data as defined by the web representation.

Multiple web domains from Intranet, Extranet or Internet can be specified, each with an individually configured refresh rate, MIME-type support, etc. Parts of web domains can be included in or excluded from the crawl using regular expression based configuration. The Crawler supports incremental crawling, dynamic pages, entitled content (cookie, SSL, password), HTTP 1.0/1.1, FTP, frames, Macromedia Flash content, robots.txt and meta robots tags.
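As a rough illustration of regular expression based include/exclude rules, the Java sketch below decides whether a URL is in scope. The class name, the example patterns and the rule that an exclude match always wins are my own assumptions; the real Enterprise Crawler is configured rather than coded, and its exact precedence rules may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; it only illustrates regex-based crawl scoping.
final class CrawlScopeSketch {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    CrawlScopeSketch(List<String> includeRegexes, List<String> excludeRegexes) {
        for (String regex : includeRegexes) includes.add(Pattern.compile(regex));
        for (String regex : excludeRegexes) excludes.add(Pattern.compile(regex));
    }

    // Assumption: an exclude match always wins over an include match.
    boolean inScope(String url) {
        for (Pattern exclude : excludes) {
            if (exclude.matcher(url).find()) return false;
        }
        for (Pattern include : includes) {
            if (include.matcher(url).find()) return true;
        }
        return false; // not matched by any include rule
    }

    public static void main(String[] args) {
        CrawlScopeSketch scope = new CrawlScopeSketch(
                Arrays.asList("^https?://intranet\\.example\\.com/"),
                Arrays.asList("/archive/", "\\.zip$"));
        System.out.println(scope.inScope("http://intranet.example.com/docs/a.html"));
        System.out.println(scope.inScope("http://intranet.example.com/archive/old.html"));
    }
}
```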

Intelligent loop detection keeps the crawler from repeatedly traversing the same page. During incremental crawling, the Crawler can be configured to focus on retrieving new content only, or detecting modified or deleted items in previously retrieved content.
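One way to picture the incremental modes is as a comparison between two crawl snapshots. The Java sketch below assumes each snapshot is a map from URL to a content checksum; that representation and all the names are hypothetical, since the Crawler's internal bookkeeping is not described here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: classify URLs by comparing two crawl snapshots (URL -> checksum).
final class CrawlDiffSketch {

    // URLs seen now but not in the previous crawl.
    static List<String> added(Map<String, String> previous, Map<String, String> current) {
        List<String> urls = new ArrayList<>();
        for (String url : new TreeSet<>(current.keySet())) {
            if (!previous.containsKey(url)) urls.add(url);
        }
        return urls;
    }

    // URLs present in both crawls whose checksum changed.
    static List<String> modified(Map<String, String> previous, Map<String, String> current) {
        List<String> urls = new ArrayList<>();
        for (String url : new TreeSet<>(current.keySet())) {
            String old = previous.get(url);
            if (old != null && !old.equals(current.get(url))) urls.add(url);
        }
        return urls;
    }

    // URLs from the previous crawl that can no longer be found.
    static List<String> deleted(Map<String, String> previous, Map<String, String> current) {
        List<String> urls = new ArrayList<>();
        for (String url : new TreeSet<>(previous.keySet())) {
            if (!current.containsKey(url)) urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        Map<String, String> previous = new HashMap<>();
        previous.put("/a.html", "c1");
        previous.put("/b.html", "c2");
        Map<String, String> current = new HashMap<>();
        current.put("/b.html", "c9");
        current.put("/c.html", "c3");
        System.out.println("new: " + added(previous, current));
        System.out.println("modified: " + modified(previous, current));
        System.out.println("deleted: " + deleted(previous, current));
    }
}
```

A crawl focused on new content only would act on the first list and skip the other two.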

The Browser Engine is a standalone component which is used by the Enterprise Crawler to extract information from JavaScript and Flash files. A unique JavaScript parser enables the crawler to index dynamic content generated by JavaScript on the client side, and follow JavaScript generated links. The crawler includes the ability to follow hyperlinks and index textual content from Macromedia Flash files.

A multi-node Crawler architecture provides unlimited scalability in number of crawled web servers and number of documents.

File Traverser

Files from any reachable file server can be retrieved using the File Traverser. It scans specified file directories on file servers, retrieves content of various formats, and submits it to a collection in the same way as the Enterprise Crawler. More than 400 file types can be processed, including popular document types such as Microsoft Office, plain text and Adobe PDF files.

The file traverser crawls all subdirectories starting at a given top directory. It then processes all files that match the defined extensions, such as html, pdf and doc, and generates a URL per document based on a given prefix. Documents are then sent to the ESP Content API in configurable batches. The size of each batch is limited by two factors: total file size and number of documents.
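The two batch limits are easiest to see in code. The following Java sketch groups file sizes into batches that respect both a maximum document count and a maximum total size. The names are hypothetical, and the choice to let a single oversized file travel in a batch of its own is my assumption, not documented File Traverser behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of batching limited by document count and total size.
final class TraverserBatchSketch {

    // Splits file sizes (in bytes) into batches; each inner list is one batch.
    static List<List<Long>> batch(List<Long> fileSizes, int maxDocs, long maxBytes) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            boolean wouldOverflow = current.size() >= maxDocs || currentBytes + size > maxBytes;
            if (wouldOverflow && !current.isEmpty()) {
                batches.add(current); // flush the batch that is already full
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // Assumption: a file larger than maxBytes still goes out, alone in its batch.
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) batches.add(current);
        return batches;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(4L, 4L, 4L, 10L, 1L);
        System.out.println(batch(sizes, 2, 10L)); // [[4, 4], [4], [10], [1]]
    }
}
```

In the example the first batch is closed by the document limit and the later ones by the size limit.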

You may configure the file traverser with authorization levels in order to retrieve entitled content. Several instances of the traverser may be configured with different authorization levels in order to handle multiple collections of data with different entitlements.

Database Connector

FAST ESP provides an index architecture that is well suited for both structured and unstructured information. Integrating the search engine with a relational database is performed for two main reasons:

• Relational databases are not very efficient at handling large query volumes. Exporting the data to a dedicated search engine may dramatically off-load the database servers.
• Integrating a large number of different data sources into one index and one search bar provides a more convenient search experience.

FAST ESP provides connectors for a number of relational database systems such as SQL Server, Oracle, MySQL and DB2. The connectors support flexible indexing of structured content and document attachments.

Database retrieval may be configured using SQL statements, and content from multiple tables or databases may be pre-joined prior to indexing. This enables tailoring of the content schema in the index to frequent queries. Document attachments may also be indexed together with the database content.

FAST ESP is very scalable and can be sized to index billions of database records if needed. An optimized incremental update feature is also provided that takes advantage of update notifications provided by commercial databases. In this way FAST ESP will only query the database for content that is known to have been updated. This approach minimizes the latency from a database table change until it is reflected in FAST ESP, and the system imposes minimal load on the database host system.

FAST ESP versus MOSS 2007 / Microsoft Search Server

It took me some time, but I figured out some differences between Microsoft Search Server / MOSS 2007 and Microsoft FAST ESP. These differences do not come from Microsoft or FAST. But it came to my notice that Microsoft and FAST will announce a complete and correct list of the differences between the two products at the conference in Las Vegas next week. These differences will help me and you make the right decisions for our customers when implementing search, based on business requirements.

Data Volume Scalability

FAST ESP has proven its scalability and can handle many millions, and even billions, of documents. The system scales by expanding the hardware. MOSS still performs up to 20 million documents. Above this, the search solution is no longer supported by Microsoft.

Query Rate Scalability

FAST ESP delivers a model which can handle up to thousands of QPS (queries per second), while the QPS of MOSS is still unknown. FAST ESP differs from its competitors on the market in that it expresses its query rate in QPS instead of QPM (queries per minute).

High availability

FAST ESP can be deployed in a distributed manner so that there is no single point of failure, while MOSS only uses one index server and has a single point of failure.

Taxonomy

FAST ESP provides contextual search insight information over arbitrary XML content. This is not available in MOSS.

Navigation

FAST ESP has something called FAST recommendations which recommends based on user profiles and activities. This is not available in MOSS.

Real-time alerts

FAST ESP supports real-time alerts to monitor ingested content and alerts the user through mechanisms like SMS, Email, Blackberry and other means. This is not available in MOSS.

Faceted Navigation on Metadata

One of the biggest advantages of FAST is its OOTB entity extraction capability, which covers over 30 different entities (e.g. companies, products, persons, and more). It also supports deep navigation technology, while MOSS supports only a shallow faceted search solution based on its results by best bets.

Advanced federation

FAST ESP supports advanced federation including sending queries to various web search APIs, mixing results, and shallow navigation. MOSS only supports federation without mixing of results from different sources and navigation components, showing them separately instead.

Relevance tuning

FAST ESP has an open architecture and is fully configurable for relevancy. There is a management GUI for business rules. MOSS is not easy to configure for relevance tuning (code-level changes are required).

Linguistics

FAST ESP supports word flexion for 32 languages with high-quality lemmatization. Spell checking such as "Did you mean" is also available for these languages. Spelling dictionaries and algorithms can be tuned, and indexing is supported for more than 80 languages. MOSS 2007 only uses stemming algorithms and basic spell checking, without many configuration options.

FAST ESP also has a dictionary management platform for managing dictionaries for synonyms, spell checking, anti-phrasing, lemmatization, black lists, companies, personal names and more. This is not available in MOSS.

Enrichment of indexing and handling search

FAST ESP has a document processing framework which provides the customer with 11 pre-configured pipelines for handling various types of content (e.g. HTML, unstructured data like PDFs and Office documents). Each of those pipelines can be customized based on over 100 stages. You are able to define your own pipelines and stages. Stages are developed in Python. This is not available in MOSS.

Doing Search

FAST ESP supports approximate search such as people finding with fuzzy logic, n-gram search, synonyms, spell check, lemmatization and wildcards (pre, post and full). MOSS supports only wildcards like * and has no front truncation.

Duplicate results

FAST ESP has a framework that supports de-duplication of results both before indexing and at query time. The duplicate criteria can be customized, for example same title, same author or other means. This is not available in MOSS.
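As a small illustration of query-time de-duplication on title and author, here is a Java sketch. The Doc class and the normalization (trimming and lower-casing) are hypothetical choices of mine and say nothing about how the FAST ESP framework implements it; the sketch keeps the highest-ranked document of each duplicate group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of de-duplicating a ranked result list by title and author.
final class DedupSketch {

    static final class Doc {
        final String url;
        final String title;
        final String author;

        Doc(String url, String title, String author) {
            this.url = url;
            this.title = title;
            this.author = author;
        }
    }

    // Keeps the first (highest-ranked) document for each title/author combination.
    static List<Doc> dedupe(List<Doc> ranked) {
        Map<String, Doc> firstByKey = new LinkedHashMap<>();
        for (Doc doc : ranked) {
            String key = normalize(doc.title) + "|" + normalize(doc.author);
            firstByKey.putIfAbsent(key, doc);
        }
        return new ArrayList<>(firstByKey.values());
    }

    private static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<Doc> ranked = Arrays.asList(
                new Doc("http://a.example.com/report.pdf", "Annual Report", "J. Smith"),
                new Doc("http://b.example.com/copy.pdf", "annual report ", "j. smith"),
                new Doc("http://a.example.com/plan.doc", "Project Plan", "J. Smith"));
        for (Doc doc : dedupe(ranked)) {
            System.out.println(doc.url);
        }
    }
}
```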

Reporting

MOSS 2007 supports simple query reporting and has some trouble when you want to transfer logs to 3rd party tools, while FAST ESP supports sophisticated query reporting including top queries, top zero-result queries, managed queries and more. Query logs are in W3C format, and tools are provided to support transferring these logs to an RDBMS for analysis using 3rd party reporting tools like SQL Server Reporting and Crystal Reports.

FAST ESP Factory Interfaces (Java Search API)

A connection to the QRserver can be established using three approaches. One approach takes no parameters, but relies on the properties being specified in the com.fastsearch.esp.search.SearchFactory.properties file.

The default value is the HttpSearchFactory factory instance:

Code:

ISearchFactory factory = SearchFactory.newInstance();

A second approach requires you to instantiate a Search Factory based on a factory class, passed as a String.

The default implementation is com.fastsearch.esp.search.http.HttpSearchFactory:

Code:

String factory_name = "com.fastsearch.esp.search.http.HttpSearchFactory";
ISearchFactory factory = SearchFactory.newInstance(factory_name);

A third approach requires you to instantiate a Search Factory based on a Properties object. The following property is mandatory:

Code:

Properties prop = new Properties();
prop.setProperty("com.fastsearch.esp.search.http.qrservers", "host1:port,host2:port");
ISearchFactory factory = SearchFactory.newInstance(prop);

Other available parameters:

• By default, ESP uses HTTP GET as the request method. When using QBE (Query By Example), the request method must be set to HTTP POST:

Code:

com.fastsearch.esp.search.http.requestmethod

• Use secure socket layer when searching through the Java API, default is false:

Code:

com.fastsearch.esp.search.http.usessl

• Set the maximum number of connections for an HttpSearchEngine object, default is 10. Note that if you are searching through search-views, each search-view instance will hold one HttpSearchEngine instance for each QRServer:

Code:

com.fastsearch.esp.search.http.maxconnections

• Specify a proxy server used by the API's httpclient. Default is disabled, but the API will still look for proxies configured through Java's system properties http.proxyHost and http.proxyPort. If the system property is set and you do not want the search API to use it, override by setting this parameter to "none":

Code:

com.fastsearch.esp.search.http.proxy

All three approaches allow for the specification of multiple QRservers. However, the first and second approach require hostnames and ports to be specified in the factory's properties file. The third approach allows for hostnames and ports to be specified as arguments to the properties object. The arguments will then be appended to the factory's properties file.
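Putting the parameters above together, a Properties object for the third approach could be assembled roughly like this. The sketch only uses java.util.Properties so that it runs without the FAST libraries; the helper class, the host names and the port are placeholders, and the literal value "POST" for the request method is my assumption, since the exact accepted values are not listed here.

```java
import java.util.Properties;

// Hypothetical helper that collects the search factory properties named above.
final class SearchPropsSketch {
    static final String PREFIX = "com.fastsearch.esp.search.http.";

    static Properties build(String qrservers, boolean queryByExample) {
        Properties prop = new Properties();
        prop.setProperty(PREFIX + "qrservers", qrservers);  // the mandatory property
        prop.setProperty(PREFIX + "maxconnections", "10");  // the stated default, spelled out
        prop.setProperty(PREFIX + "proxy", "none");         // ignore http.proxyHost/http.proxyPort
        if (queryByExample) {
            // Assumption: the value literal for HTTP POST is "POST".
            prop.setProperty(PREFIX + "requestmethod", "POST");
        }
        return prop;
    }

    public static void main(String[] args) {
        // Placeholder hosts and port; replace with your own QRservers.
        Properties prop = build("search1.example.com:15100,search2.example.com:15100", true);
        prop.list(System.out);
        // With the FAST search API on the classpath, prop would then be handed to
        // SearchFactory.newInstance(prop) as in the third approach.
    }
}
```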

Tip: Faster, Better Search for Movable Type

Out of the box Movable Type's (MT) search engine isn't terrible. It meets minimum requirements, is fairly accurate, but not much more. The big issue is performance. Under any sort of load, MT's native search starts to creak and groan, and suck gobs of CPU time.
Enter Mark Carey and his latest version of Fast Search for MT.
Fast Search v1.5 is a (PHP, as opposed to Perl) plugin extension for the Movable Type publishing platform. The plugin provides fast searching of MT blog posts with low CPU load and a flexible query syntax. Fast Search uses the PHP-based dynamic publishing system of Movable Type to search for entries using native MySQL database queries.
The new version provides search options that enable the user to choose between MySQL full text searching, a standard SQL LIKE query, or a combination of the two.

While full text is the most performant, all of the Fast Search methods yield significant performance gains over MT's native Perl-CGI-based search engine.
The latest version of Fast Search for MT includes these new features:
Multi-Byte Character Support. Fast Search can now search entries with content stored in multi-byte characters, such as Arabic and Japanese, using the "Non-Fulltext" search method.
Total Number of Results Found. A new tag for displaying the total number of results in the database, so you can display output such as "Results 1-10 of 256".
Search Page Links. Display linked page numbers to link directly to multiple pages of search results.
Previous Page Link. Links to the previous page of search results.
Short Words Searching. Using the "Non-Fulltext" or "Fallback" search methods, you can now search for words under 4 characters in length. This was a serious limitation inherent in MySQL's full text search, given the number of TLAs in use today.
Match Partial Words. Using the "Non-Fulltext" or "Fallback" methods, you can now search for partial words.
"Mixed" Multi-blog Searching. You can now search multiple blogs in the same search. The results are merged into a single list of search results, not grouped by blog.
Fast Search will automatically create a MySQL full text index, enabling very fast searches. For people who want to search multi-byte languages or short/partial words, the "Non-Fulltext" option enables this functionality. "Non-Fulltext" conducts a SQL LIKE query that runs in the native MySQL query engine. The Fallback option is a combination search: it first tries a full text search and then, if full text returns zero results, a LIKE query.
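The Fallback behaviour can be sketched in a few lines. This is a Java illustration rather than the plugin's actual PHP, and the two in-memory matchers are hypothetical stand-ins for the MySQL full text and LIKE queries. The 4-character cutoff in the first matcher mirrors the short-word limitation mentioned in the feature list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical sketch of a "full text first, LIKE as fallback" search strategy.
final class FallbackSearchSketch {

    // Runs the primary search and only falls back when it returns nothing.
    static <T> List<T> searchWithFallback(String query,
                                          Function<String, List<T>> fullText,
                                          Function<String, List<T>> like) {
        List<T> hits = fullText.apply(query);
        return hits.isEmpty() ? like.apply(query) : hits;
    }

    // Stand-in for full text search: whole words only, words under 4 characters never match.
    static List<String> wholeWordMatch(List<String> entries, String query) {
        List<String> hits = new ArrayList<>();
        String q = query.toLowerCase(Locale.ROOT);
        if (q.length() < 4) return hits;
        for (String entry : entries) {
            String[] words = entry.toLowerCase(Locale.ROOT).split("\\W+");
            if (Arrays.asList(words).contains(q)) hits.add(entry);
        }
        return hits;
    }

    // Stand-in for a LIKE '%query%' search: plain case-insensitive substring match.
    static List<String> substringMatch(List<String> entries, String query) {
        List<String> hits = new ArrayList<>();
        String q = query.toLowerCase(Locale.ROOT);
        for (String entry : entries) {
            if (entry.toLowerCase(Locale.ROOT).contains(q)) hits.add(entry);
        }
        return hits;
    }

    // Convenience wiring of the two stand-ins.
    static List<String> search(List<String> entries, String query) {
        return searchWithFallback(query,
                q -> wholeWordMatch(entries, q),
                q -> substringMatch(entries, q));
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("Tuning the FAQ page", "Searching faster");
        System.out.println(search(entries, "FAQ"));       // too short for full text, found by fallback
        System.out.println(search(entries, "searching")); // found by the full text stand-in
    }
}
```

The design point is simply that the cheaper, indexed search runs first and the slower substring scan runs only when it comes back empty.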
The new pagination links feature of v1.5 enables Google-style links to multiple search results pages, along with "Next" and "Previous" buttons.
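The arithmetic behind output such as "Results 1-10 of 256" and the page links is simple enough to show. The Java sketch below is only an illustration with hypothetical names; the plugin itself exposes this through MT template tags, not through code like this.

```java
// Hypothetical sketch of the arithmetic behind result ranges and page counts.
final class PaginationSketch {

    // Number of result pages, rounding up so a partial last page is counted.
    static int pageCount(int total, int pageSize) {
        return (total + pageSize - 1) / pageSize;
    }

    // Label for a 1-based page number, e.g. "Results 1-10 of 256".
    static String rangeLabel(int page, int pageSize, int total) {
        if (total == 0) return "No results";
        int first = (page - 1) * pageSize + 1;
        int last = Math.min(page * pageSize, total);
        return "Results " + first + "-" + last + " of " + total;
    }

    public static void main(String[] args) {
        System.out.println(rangeLabel(1, 10, 256));       // Results 1-10 of 256
        System.out.println(pageCount(256, 10) + " pages"); // 26 pages
    }
}
```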
The MySQL full text queries are built and executed within the framework of Movable Type's PHP-based dynamic publishing system. The database content is then displayed via MT templates and tags.
The first step of the PHP code is to determine the search query and any optional arguments that need to be considered. The final phase hooks into the Movable Type template system, enabling users to customize the search results pages, just like any other MT template.
Fast Search reuses many of the same template tags from the CGI-based search, making it easy to make the switch to much faster searching.
One important note for those considering the transition: In case it's not obvious, Fast Search relies on the MySQL database engine as the backend for search. Therefore to utilize this product, your MT installs must already be storing content in a MySQL database. This is not a given. In fact many MT installs are configured to use Berkeley DB file system databases.
Fast Search v1.5 is a commercial extension to the Movable Type publishing platform. It is free for non-commercial use or about US$ 100 for grubby capitalists like you and me.

10 Steps To IT Certification - Part 1

Skilled network professionals are not going to be losing many jobs, no matter what happens to employment figures or prospects in other areas of the economy, and new IT jobs will always be available to properly educated, trained and prepared individuals. Notice the three different words there. Even after getting a degree (education), someone who wants to work in this area of continuing high demand needs the proper, current certifications (training). Before you begin accruing the various Cisco, Microsoft, Novell, Oracle and other certifications, you need to pass the exams, which is where the third concept, preparation, enters the picture.

Whichever expertise you seek to hone or add with any particular certification, you must enter the process with a serious commitment. There will be a considerable investment of time, energy and money if your goal is a quality certification that represents you as a leader in your field. There is a lot more to it than learning how to take a test, and there may be any number of between-the-lines sidesteps and detours on your journey, but the following 10 steps to IT certification should get you pointed in the right direction.

Step 1: A specialty means you’re focused

The old terms “engineer” and “programmer” do not say much about what a person does anymore. Today, even within Information Technology (IT) in general and network technology in particular, there are numerous areas of concentration. You can specialize in the core routing and switching components of networks, focus on network security issues like firewalls and hacker detection/prevention, and/or obtain certifications for dealing with telephony technologies. There are numerous certifications from which to choose and you will obviously do best in the areas you find most interesting and personally rewarding. After you get your first certification you may decide to work toward others, but consider focusing on a single area as your “specialty.”

Step 2: An investment means you’re serious

Don’t shortchange yourself if you make the decision in #1 above to get going on a particular certification, whether or not you have a plan to accrue more than the first one. One or 20, a certification requires a serious investment of time, energy and funds. Do your due diligence before you spend money on study materials, by talking to colleagues, reading book reviews on Amazon.com and checking the websites of the companies whose certifications you want to earn (Microsoft, Cisco and the others). User forums on the Internet, specifically dealing with the area of study, are another good source for information on the best study aids.

Step 3: Question authority

Do not be intimidated by people who tell you they are gurus or experts in “networks” or “networking” (which could mean they do a lot of Facebook postings, among other things). There really aren’t any IT gurus. The science is changing and advancing so quickly, and there is so much going on in so many areas, that just about any subject matter you picked to become “an expert” in would be obsolete before you finished your studies. The depth and breadth of network technology can positively overwhelm newbies and veterans alike. Just remember that you are not being certified as a “know-it-all,” but as having expertise in a particular area, expertise that you must work at maintaining. Whether it’s wireless, security or telephony, get grounded in that first, specialized IT area, then decide what you wish to add to your skill set. This will come naturally as you progress in your career.

Step 4: Be an Aristotelian (balanced)

If you are not careful, you can easily “bury yourself” under mountains of homework, practice tests, study materials and books. This is especially easy to do if you are trying to knock off a certification exam in an insufficient amount of time. You should certainly spend the time that is required to prepare well for your certification exam, but it is crucial to make enough time for the rest of your life, too. Take regular breaks, and also “break away” from computers, manuals and books for at least a few hours daily. Not only will you reduce the chance of burnout, you will give yourself the necessary time to digest what you have studied. This is an important step in changing what you studied into what you “learned.”

CBT Planet.com offers comprehensive computer and employee career training including self-paced computer based training (cbt), computer training videos http://www.cbtplanet.com/ , instructor-led courses and IT certifications boot camps. Visit them online today for all your cbt online.
