March 06, 2018

Lucidworks Announces Search as a Service

Not Your Grandfather's Site Search

Some of you may know that New Idea Engineering spun off a company in the mid-'90s offering Verity-powered site search for thousands of clients. Sadly, our investors insisted we violate the cardinal rule of business I learned at HP - "be profitable" - so when the "dot-com" bubble burst, we were back to being New Idea Engineering again!

I remain a fan of hosted search to this day and have been pleasantly surprised to see companies like Algolia, Swiftype (now part of Elastic), and a few other "search as a service" organizations reinventing the capabilities that we - along with our competitor Atomz - offered more than 20 years ago! And I include with them the 'cloud-based' search services offered by other established enterprise search companies like Coveo, Microsoft, and until recently, Google.

That said, we've always striven to be fully vendor-neutral when it comes to recommending products and services to our clients, and we go out of our way to understand and work with all of the major enterprise search vendors.

Over the last several months I've had the opportunity to use early releases of a product Lucidworks announced this morning: Lucidworks Site Search. As I said, I am a fan of hosted search - or 'search as a service' - and in full disclosure, I was a Lucidworks employee a few years back and, yes, I am a shareholder.

I had an opportunity to talk with Will Hayes, Lucidworks' CEO, about Lucidworks' entry into the hosted search market. Even in its initial release, it looks pretty impressive.

First, Lucidworks Site Search is powered by the newest release of their enterprise product, Fusion 4.0, announced just last week and available for download. One of the exciting new capabilities in Fusion 4 is the full integration with Spark to enhance search with machine learning. It's not quite Google's "people like you" out of the box, but it's a giant step towards AI in the enterprise.

Fusion 4 also provides the ability to create, test, and move into production custom 'portable' search applications. When I first looked at the product last week, I confess to not having the vision to see just how powerful that capability is. It seems that the Lucidworks Site Search announced this morning is an example of a powerful custom search app written specifically for site search.

But Lucidworks has great plans for its Site Search product. It can be run in the cloud, initially on AWS but soon expanding to other cloud services including Azure and Google. And for reliability, you can elect to have Lucidworks Site Search span multiple data centers and even multiple cloud services. As you'd expect in an enterprise product, it supports a wide variety of document formats, security, faceted navigation, and a full management console. Finally, I understand that plans are in the works for Lucidworks Site Search to be installed "on-prem" and even to federate results (respecting document security) from the cloud and from your local instance at the same time.

Over the coming weeks and months I'll be writing more about Fusion 4, Lucid Site Search, and search apps. Stay tuned!

February 22, 2018

Search Is the User Experience, not the kernel

In the early days of what we now call 'enterprise search', there was no distinction between the search product and the underlying technology. Verity Topic ran on the Verity kernel and Fulcrum ran on the Fulcrum kernel, and that's the way it was - until recently.

In reality, writing the core of an enterprise search product is tough. It has to efficiently create an index of all the words in virtually any kind of file; it has to scale to index millions of documents; and it has to respect document-level security using a variety of protocols. All of this has to deliver results in well under a second, and now machine learning is becoming an expected capability as well. All for code that no user will ever see.

Hosted search vendor Swiftype provides a rich search experience for administrators and for users, but Elasticsearch is the technology under the covers. And yesterday, Coveo announced that their popular enterprise search product will also be available with the Elastic engine rather than only with Coveo's existing proprietary kernel. This marks the start of a trend that I think may become ubiquitous.

Lucidworks, for example, is synonymous with Solr; but conceptually there is no reason their Fusion product couldn't run on a different search kernel - even on Elastic. However, with their investment in Solr, that does seem unlikely, especially with their ability to federate results from Elastic and other kernels with their App Studio, part of the recent Twigkit acquisition.

Nonetheless, enterprise search is not the kernel: it's the capabilities the product exposes for operation, management, and the search experience.

Of course, there are differences between Elastic and Coveo, for example, as well as among other kernels. But in reality, as long as the administrative and user experiences get the work done, the technology doing the work under the covers matters only in a few fringe cases. And ironically, Elastic, like many other platforms, has its own potentially serious fringe conditions. At the UI level, solving those cases across multiple kernels is probably far less work than managing and maintaining a proprietary kernel.

And this may be an opportunity for Coveo: until now, it's been a cloud- and Windows-only platform. This could mark their entry into multi-platform environments.

February 20, 2018

Search, the Enterprise Orphan

It seems that everywhere I go, I hear how bad enterprise search is. Users, IT staff, and management complain, and eventually organizations decide that replacing their existing vendor is the best solution. I'd wager that companies switch their search platforms more frequently than any other mission-critical application.

While the situation is frustrating for organizations that use search, the current state isn't as bad for the actual search vendors: if prospects are universally unhappy with a competing product, it's easier to sell a replacement technology that promises to be everything the current platform is not. It may seem that the only loser is the current vendor - and they are often too busy converting new customers to their own platform to worry much.

But in fact, switching search vendors every few years is a real problem for the organization that simply wants its employees and users to find the right content accurately, quickly and without any significant user training. After all, employees are born with the ability to use Google!


The Higher-Level Story

Why is enterprise search so bad? In my experience, search implemented and managed properly is pretty darned good. As I see it, the problem is that at most organizations, search doesn’t have an owner.  On LinkedIn, a recent search for “vice president database” jobs shows over 1500 results. Searching for “vice president enterprise search”? Zero hits.

This means that search, recognized as mission-critical by senior management, often doesn’t have an owner outside of IT, whose objective is to keep enterprise applications up and running. Search may be one of the few enterprise applications where “up and running” is just not good enough.

Sadly, there is often no “search owner”; no “search quality team”; and likely no budget for measuring and maintaining result quality.

Search Data Quality

We’ve all heard the expression “Garbage In, Garbage Out”. What is data quality when it comes to search? And how can you measure it?

Ironically, enterprise content authors have an easy way to impact search data quality; but few use it. The trick? Document Properties – also known as ‘metadata’.

When you create any document, there is always data about the document - metadata. Some of the metadata 'just happens': the file date, its size, and the file name and path. Other metadata depends on author-provided properties like the title, subject, and other fielded data maintained in the Office 'Properties' tab. And there are tools like the Stanford Named Entity Recognition tool (licensed under the GNU General Public License) that can extract advanced metadata from the full text of a document.

Some document properties are filled in automatically. In Microsoft Office, for example, the Properties form provides a way to define field values including the author name, company, and other fields. The problem is that few people go to the effort of filling in the property fields correctly, so you end up with bad metadata. And bad metadata is arguably worse than no metadata.
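Curious how your own content measures up? It's easy to spot-check. Here's a minimal sketch using the python-docx library - the folder name is hypothetical, and any Office-format parser would do - that dumps the core properties a search engine would typically index:

    # Spot-check author-provided metadata in Word documents.
    # Requires: pip install python-docx
    from pathlib import Path
    import docx

    def audit_metadata(folder):
        """Print the core properties a search engine would index."""
        for path in Path(folder).glob("**/*.docx"):
            props = docx.Document(str(path)).core_properties
            print(f"{path.name}:")
            print(f"  title:   {props.title or '(missing)'}")
            print(f"  author:  {props.author or '(missing)'}")
            print(f"  subject: {props.subject or '(missing)'}")

    audit_metadata("intranet_docs")  # hypothetical folder of intranet content

Run something like that against a sample of your intranet and you'll know within minutes whether you have a metadata problem.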

On the enterprise side, I heard about an organization that wanted to reward employees who authored popular content for the intranet. The theory was that recognizing and rewarding useful content creation would help improve the overall quality and utility of the corporate intranet.

An organization we did a project for a few years ago was curious about the poor metadata in its intranet document repository, so it ran a test. After examining its Microsoft Office documents, it discovered that a single employee had apparently authored nearly half of all the intranet content! It turned out that this employee, an office assistant, had created the document that everyone in the organization used as the starting point for their common standard reports - so the auto-filled author property in every copy pointed back to that one template.

Solving the Problem

Enterprise search technology has advanced to an amazing level. A number of search vendors have even integrated machine learning tools like Spark to surface popular content for frequent queries. And search-related reporting has become a standard part of nearly all search product offerings, so metrics such as top queries and zero hits are available and increasingly actionable.
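Even without a vendor's reporting console, the basic metrics are easy to compute yourself. Here's a minimal sketch - it assumes a simple CSV query log with query and hit_count columns, which is an invention for illustration; your log format will differ:

    # Surface top queries and zero-result queries from a search log.
    # Assumes a CSV log with columns: query, hit_count
    import csv
    from collections import Counter

    top_queries = Counter()
    zero_hits = Counter()

    with open("query_log.csv", newline="") as f:
        for row in csv.DictReader(f):
            query = row["query"].strip().lower()
            top_queries[query] += 1
            if int(row["hit_count"]) == 0:
                zero_hits[query] += 1

    print("Top 10 queries:", top_queries.most_common(10))
    print("Top 10 zero-result queries:", zero_hits.most_common(10))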

To really take advantage of these new technological solutions, you need a team of folks who actively participate in making your enterprise search a success, so you can break the "buy-replace" loop.

Start by identifying an executive owner, and then pull together a team of co-conspirators who can help. Sometimes just looking at the reports you already have and acting on them can go a long way.

Review the queries with no results and see if synonyms can surface the right content without changing the content itself - see the sketch below. Identify the right page for your most popular queries and define one or two "best bets". And if some frequent queries don't really have relevant content, work with your web team to create it.
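Most engines let you load synonyms directly into the analysis chain (Solr and Elasticsearch both read a synonyms file, for instance), but the idea is easy to illustrate with query-side expansion. A toy sketch - the synonym entries here are invented:

    # Query-side synonym expansion: rewrite a user query so terms with
    # known synonyms also match documents that use the other wording.
    SYNONYMS = {
        "pto": ["vacation", "paid time off"],  # hypothetical entries
        "comp": ["compensation", "salary"],
    }

    def expand_query(query):
        """Turn each term that has synonyms into an OR group."""
        parts = []
        for term in query.lower().split():
            alts = SYNONYMS.get(term)
            if alts:
                group = " OR ".join([term] + ['"%s"' % a for a in alts])
                parts.append("(" + group + ")")
            else:
                parts.append(term)
        return " ".join(parts)

    print(expand_query("pto policy"))
    # -> (pto OR "vacation" OR "paid time off") policy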

Funding? Find the right person in your organization to convince that spending a little money on fixing the problems now will break the "buy-replace" cycle and save some significant but needlessly recurring expenses.

Like so many things, a little ongoing effort can solve the problem.

December 04, 2017

Search Indices are Not Content Repositories

Recently on Quora, someone asked for help with a corrupt Elasticsearch index. A number of folks responded, all recommending that he simply rebuild the search index and move on.

The bad news, it turns out, was that this person didn't have any source documents: he was so impressed with what Elasticsearch did that he had been using it as his primary content store. When it crashed, his content was gone. This is not an indictment of Elasticsearch: the same thing can happen with any complex software product, whether Elasticsearch, Solr, or SharePoint.

In my reply, I told him how sorry I was for his loss and suggested he get to work restoring or recreating his content. I even offered to call and express my condolences in person.

Then I launched into what I really felt I needed to say - there on his behalf, and here for yours. I suggested - no, actually I insisted - that you NEVER use ANY search index as your primary store for content. Let me be more specific: NEVER. EVER.

Some platforms, such as Solr and commercial software based on Solr (i.e., Lucidworks), have a reasonably robust ability to replicate the index over multiple servers or nodes, which provides some safety (I'm thinking SolrCloud here); others do not. But the replication is a copy of the INDEX, which is NOT your documents.

The search index is optimized for retrieval. Databases, CMS, file systems and other tech are for storage.

For one, I'm not sure any search engine stores the entire original document. Conceptually, most search indices have two 'logical' (if not physical) files:

You can think of the first file as a database table with one row per document, holding field values (Title, Author, etc.). This file generally stores the URL, file name, or database row as well - basically, 'where do I go to find the full document?' - and maybe a few other field values.

The second file is a list of all the words (minus stop words) in all of your documents. Each word is stored once, along with a list of byte offsets marking where the word appears in each document (one offset per instance of the word), plus pointers to all of the documents that contain that word. Again: stop words are generally NOT indexed, so they are usually not in the index.

(There is more detail in an older article on our website, "Relational Databases vs. Full-Text Search Engines - New Idea Engineering".)

COULD you rebuild the full document? Well, it depends on the search platform. In most platforms I've seen, it would be difficult because stop words are not even stored. Recreating a document that omits 'the', 'a', 'an', 'and', etc. MIGHT be human-readable, but it is NOT the original document.
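To make that two-file picture concrete, here's a toy inverted index - a deliberately simplified sketch, nothing like how any real engine lays out its files - that shows why rebuilding a document from the index is lossy:

    # Toy inverted index: a document table plus a postings list.
    # Stop words are dropped at index time, so they are unrecoverable.
    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is"}

    docs = {}      # doc_id -> "where do I go to find the full document?"
    postings = {}  # word -> {doc_id: [byte offsets of each instance]}

    def index_document(doc_id, url, text):
        docs[doc_id] = {"url": url}
        offset = 0
        for word in text.lower().split():
            if word not in STOP_WORDS:
                postings.setdefault(word, {}).setdefault(doc_id, []).append(offset)
            offset += len(word) + 1  # +1 for the space

    index_document(1, "http://example.com/memo", "the cat sat on the mat")

    # "Rebuilding" doc 1 from the postings: the stop words are simply gone.
    positions = []
    for word, doc_map in postings.items():
        for off in doc_map.get(1, []):
            positions.append((off, word))
    positions.sort()
    print(" ".join(word for _, word in positions))  # -> "cat sat on mat"

The document table knows where to fetch the original; the postings file can only give back a stop-word-free skeleton of it.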

Secondly, not all search engine indices are replicated for redundancy. The assumption is that if you lose the file system where the content lives, you can still search; you just can't retrieve any documents until you restore the original content.

And some platforms do not give you a way to access the index, short of searching. And a search index is an index, not a repository.

Finally, some platforms are better at redundant failover of indices than others. If the platform you use is one that does not have redundancy BY DEFAULT - like some very popular platforms - and you use that index as the primary data store for your documents and the index dies... you're what we used to call SOL: 'sure outta luck'.

The moral of the story? DO NOT USE A SEARCH INDEX AS THE PRIMARY DATASTORE. Specific enough?

October 11, 2017

A Search Center of Excellence is still relevant for managing enterprise search

I was having a conversation with an old friend who manages enterprise search at her organization, a biotech company back east. We've worked together on search projects going back to my days at Verity - for you young'uns, that's what we call BG: 'before Google'.

Around the time of an engagement we did - sometime after Google but before Solr - "Centers of Excellence" (COEs) had become very popular, and we decided we could define the roles and responsibilities of a Search Center of Excellence, or SCOE: the team that manages the full breadth of operation and management for enterprise search. We began preaching the gospel of the SCOE at trade show events and on our blog, where you can find that original article.

My friend and I had a great conversation about how successfully her team has managed three generations of search platforms with the SCOE, and how they still maintain the responsibilities the SCOE assumed years back - with only a few meetings a year to review how search is doing, address any concerns, and map out enhancements as they become available.

It worked then, and it works now. The SCOE is a great idea! Let me know if you'd like to talk about it.