In the old days, companies referred to paper documents as "legacy data": boxes and boxes of important printed documents that were difficult to access. If you've been in the industry for a while, you'll recall all the high-speed scanning and OCR companies that cropped up to solve this problem.
Today virtually all documents and manuals are created electronically, and thanks to high-quality formats like Adobe PDF and numerous electronic distribution channels, documents tend to stay in digital form.
To me, PDF has replaced paper as the new "legacy" format - there's a
ton of technical data now being published in this format. And to
paraphrase an old commercial, "data checks in, but it don't check out".
Of course there are ways to get all that tabular technical data back
out of PDF and into more usable forms, but getting this right is not trivial, and it's certainly not Adobe's priority. We're not chiding them for this; their business model is clearly served by getting data INTO PDF. Acrobat can now export to XML, and other solutions can help you get the content out as well.
A similar case could be made for HTML, Word, Excel and PowerPoint. Each of these formats has problems of its own.
PDF has some particular details that can thwart enterprise search:
- Not all PDF files have searchable text, and users are generally unaware of the difference.
- PDF files come in many dialects.
- Tabular data in PDF is sometimes difficult for software to infer; humans easily see the rows and columns, but unlike other document formats, there is no intrinsic hierarchical document structure, just pixels, lines and text snippets with various X,Y coordinates.
- Older PDF formats were not as capable when dealing with other languages, such as Arabic.
All of these issues have solutions, but all of them require some thought and careful tool selection.
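To make the tabular-data point concrete, here is a minimal sketch of how software might rebuild rows and columns from positioned text snippets. It assumes some PDF library has already produced a list of (x, y, text) tuples, and that y grows toward the top of the page, as in PDF's default coordinate system. The function name `group_into_rows`, the `y_tolerance` parameter and the sample data are all made up for illustration; real tools have to cope with much messier input (merged cells, wrapped text, rotated pages).

```python
def group_into_rows(snippets, y_tolerance=2.0):
    """Group (x, y, text) snippets into table rows.

    Snippets whose y coordinate is within y_tolerance of the first
    snippet placed in the current row are treated as the same row.
    Rows come back top to bottom, cells left to right.
    """
    rows = []  # each entry: [y of first snippet in row, [(x, text), ...]]
    # Assumption: larger y means higher on the page, so sort y descending.
    for x, y, text in sorted(snippets, key=lambda s: (-s[1], s[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Order the cells of each row by x and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical snippets, in the scrambled order an extractor might emit them.
snippets = [
    (200.0, 680.1, "5 V"), (72.0, 700.0, "Part"), (320.0, 699.8, "Current"),
    (72.0, 680.0, "A100"), (200.0, 700.4, "Voltage"), (320.0, 680.3, "2 A"),
    (320.0, 660.2, "1 A"), (72.0, 660.0, "B200"), (200.0, 659.9, "12 V"),
]

table = group_into_rows(snippets)
for row in table:
    print(" | ".join(row))
```

Even this toy version needs a tuning knob (`y_tolerance`) that a human reader never thinks about, which is part of why getting tables out of PDF takes careful tool selection.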
So the complexity of OCR has been replaced, on some level, with
the complexity of document filters, entity extraction, ETL and optimized full-text search.