November 09, 2009

SearchDev Dinner in San Jose at ESS West

We've just put the final touches on the annual SearchDev dinner in conjunction with the Enterprise Search Summit West next week in San Jose, California. Anyone who attends the conference, or anyone in the Bay Area, is welcome to attend.

Lucid Imagination is sponsoring the dinner this year along with New Idea Engineering. It will be held on Wednesday night, November 18, at 6 PM in the San Jose Hilton, adjacent to the convention center.

Seats are limited, so if you think you will want to attend, please RSVP today to info(at)ideaeng.com with your name and names of the folks who will join you. Of course, replace the (at) with @...

Miles


November 06, 2009

Relevance by, for, and of the people...

Have you ever browsed a search result list, clicked on a result with a promising teaser, and been frustrated that the document didn't live up to its summary? Me too... you mutter 'this search sucks' to yourself, click the browser's Back button, and browse the result list again, hoping for a better result.

It seems the obsession with 'social search' has led a few of the best known search companies to tie click popularity back into the base relevance engine. Google recently announced Self-Learning Scorer as a new part of its latest Google Search Appliance update, and Microsoft announced a similar interactive behavior ranking capability in both SharePoint and FAST ESP search - Behavioral Adaptation, one engineer called it.

Color us skeptical. We like the concept of click popularity, but we prefer to see it linked with a 'thumbs-up/thumbs-down' feedback mechanism. If people like the document they see, they won't bother telling you what a great job you did; but trust us, if it's not what they wanted, they will spend the extra few seconds to enter a negative vote. We've not been able to find out the details of the Google feature; Microsoft tells us that the recommendations have a 'time to live' of 30 days, so at least there's hope that crummy documents with great summaries won't fill the top spots of your search result lists.
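To make the idea concrete, here is a rough sketch of how clicks, thumbs-up/thumbs-down votes, and a 30-day time to live might combine into a ranking boost. This is not how Google or Microsoft do it; the weights, the `Feedback` record, and the `popularity_boost` and `rerank` functions are all our own assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TTL = timedelta(days=30)  # the 30-day 'time to live' mentioned above

# Hypothetical weights: a thumbs-down counts for more than a click,
# so a crummy document with a great summary loses ground quickly.
WEIGHTS = {"click": 1.0, "up": 2.0, "down": -3.0}


@dataclass
class Feedback:
    doc_id: str
    kind: str       # "click", "up", or "down"
    when: datetime


def popularity_boost(events, doc_id, now):
    """Sum the weights of all unexpired feedback events for one document."""
    total = 0.0
    for event in events:
        if event.doc_id == doc_id and now - event.when <= TTL:
            total += WEIGHTS[event.kind]
    return total


def rerank(results, events, now, scale=0.1):
    """Adjust (doc_id, base_score) pairs by feedback and sort best-first.

    `scale` is an assumed tuning knob that keeps feedback from
    overwhelming the engine's base relevance score.
    """
    adjusted = [
        (doc_id, score + scale * popularity_boost(events, doc_id, now))
        for doc_id, score in results
    ]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```

In this sketch a click that is followed by a thumbs-down nets out negative, and any feedback older than 30 days simply stops counting.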

What do you think?

  

November 05, 2009

Call for Papers: Enterprise Search Summit East, May 2010

My friend Michelle Manafy over at Info Today has asked me to post their call for papers for Enterprise Search Summit East, May 11 - 12, 2010. ESS East has been one of the premier shows, and Michelle has updated the format to give attendees more face time with speakers and make the show more valuable.

If you're implementing search now, you're ahead of a lot of folks - share what you've learned! Submit a paper today! You've only got until November 30!

See you in New York!

/s/Miles

November 04, 2009

PDF - The New Legacy Data

In the old days companies referred to paper documents as "legacy data", boxes and boxes of important printed documents that were difficult to access.  If you've been in the industry for a while, you'll recall all the high speed scanning / OCR companies that cropped up to solve this problem.

Today virtually all documents and manuals are created electronically, and thanks to high-quality formats like Adobe PDF and numerous electronic distribution channels, documents tend to stay in digital format.

To me, PDF has replaced paper as the new "legacy" format - there's a ton of technical data now being published in this format.  And to paraphrase an old commercial, "data checks in, but it don't check out".

Of course there are ways to get all that tabular technical data back out of PDF and into more usable forms, but getting this right is not trivial, and it's certainly not Adobe's priority.  We're not chiding them for this; their business model is clearly served by getting data INTO PDF. Acrobat can now export to XML, and other solutions can help you get the content out as well.

A similar case could be made for HTML, Word, Excel and PowerPoint.  Each of these formats has problems of its own.

PDF has some particular details that can thwart enterprise search:

  • Not all PDF files have searchable text, and users are generally unaware of the difference.
  • PDF files come in many dialects.
  • Tabular data in PDF is sometimes difficult for software to infer; humans easily see the rows and columns, but unlike other document formats, there is no intrinsic hierarchical document structure, just pixels, lines and text snippets with various X,Y coordinates.
  • Older PDF formats were not as capable when dealing with other languages, such as Arabic.

All of these issues have solutions, but all of them require some thought and careful tool selection.
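
To give a feel for the tabular data problem, here is a minimal sketch of one common heuristic: grouping text snippets into rows by their Y coordinate. It assumes some PDF tool has already handed you snippets as (x, y, text) tuples with y increasing down the page; the `group_into_rows` function and its `y_tolerance` parameter are hypothetical names, not part of any particular library.

```python
def group_into_rows(snippets, y_tolerance=2.0):
    """Group (x, y, text) snippets into table rows.

    Snippets whose y value is within `y_tolerance` of the first snippet
    in the current row are treated as the same row. Each row is returned
    as a list of text cells ordered left to right.
    """
    rows = []  # each entry: [anchor_y, [(x, text), ...]]
    for x, y, text in sorted(snippets, key=lambda s: (s[1], s[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    return [
        [text for _, text in sorted(cells, key=lambda cell: cell[0])]
        for _, cells in rows
    ]


# Example: two rows of a spec table, with slightly uneven baselines
snippets = [
    (120, 100.4, "3.3 V"),
    (10, 100.0, "Vcc"),
    (10, 115.0, "Icc"),
    (120, 114.8, "20 mA"),
]
table = group_into_rows(snippets)
```

Real documents are messier than this: cells that wrap onto multiple lines, merged columns, and rotated text all break a simple baseline match, which is why tool selection matters.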

So the complexity of OCR has been replaced, on some level, with document filters, entity extraction, ETL and optimized fulltext search.

October 30, 2009

Request for other enterprise search professionals

I recently spoke with someone who is looking for people who work for companies with large Autonomy implementations.  I figured that you folks who read our blog may well meet his ideal specifications, so I'm posting his note here. Feel free to contact him directly.

I am looking to set up 30min to an hour telephone consultation calls  for one of our clients with enterprise search product users who work with a private company that purchases $500k or more of licenses a year or individuals who have an opinion on new beta products being introduced by enterprise search companies.

You can reach him directly at:

                Emailaddr

Appreciate it if you'd mention our blog if you get in touch with him. Thanks!

/s/Miles