Thanks, Tim.  A few quick comments and a few questions:

    1) the toughest pdfs to identify are those that are partly
    searchable (text) and partly not (image-based text).  However, I've
    found that such documents tend to exist in clusters.
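
    (One crude way to spot the mixed ones might be to count pages that
    yield no text at all: a partly searchable pdf would show up as some
    pages with text and some without.  A rough sketch using the pypdf
    library; the path is a placeholder and the check is only
    approximate.)

        from pypdf import PdfReader   # third-party library; PyPDF2 works similarly

        def page_text_profile(path):
            """Return (pages_with_text, pages_without_text) for one pdf."""
            reader = PdfReader(path)
            with_text = without_text = 0
            for page in reader.pages:
                if (page.extract_text() or "").strip():
                    with_text += 1
                else:
                    without_text += 1
            return with_text, without_text

        w, wo = page_text_profile("/path/to/suspect.pdf")   # placeholder path
        if w and wo:
            print(f"mixed pdf: {w} searchable pages, {wo} image-only pages")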

    2) email documents (.eml) are no problem, provided -filetypes eml is
    included in the indexing command.  Otherwise those files are silently
    skipped and you'll completely miss all such documents in lower
    subdirectories.  (A quick pre-indexing check is sketched below.)
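
    (The sketch below just tallies the extensions under the document
    root and flags anything not in the -filetypes list; the path and
    the list are placeholders for your own setup.)

        import os
        from collections import Counter

        DOC_ROOT = "/path/to/docs"                    # placeholder
        FILETYPES = {"pdf", "doc", "docx", "eml"}     # whatever you pass to -filetypes

        counts = Counter()
        for dirpath, _dirs, files in os.walk(DOC_ROOT):
            for name in files:
                ext = os.path.splitext(name)[1].lstrip(".").lower()
                counts[ext or "(no extension)"] += 1

        for ext, n in counts.most_common():
            note = "" if ext in FILETYPES else "   <-- not in -filetypes"
            print(f"{n:6d}  {ext}{note}")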

    3) I have indexed other repositories and noticed some silent
    failures (mostly for large .doc documents).  I wish there were some
    way to log these errors so that it would be obvious which documents
    have been excluded.

    4) I still don't understand the use of tika-eval - is that an
    application that you run against a collection, or something else?

    5) I've seen references to tika-server, but I'm still not clear on
    how that tool might be usefully applied in a case like this.
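
    (If I'm reading the docs right, and I may well not be: you start it
    with java -jar tika-server-<version>.jar, it listens on port 9998
    by default, and you PUT files at it over HTTP to get text or
    metadata back.  If so, something like the sketch below would also
    give the error log I was wishing for in 3); it flags files that
    fail or come back nearly empty.  It uses the Python requests
    library, and the paths, threshold, and log format are just
    placeholders.)

        import os
        import requests   # third-party; pip install requests

        TIKA_TEXT = "http://localhost:9998/tika"   # default tika-server port
        DOC_ROOT = "/path/to/docs"                 # placeholder

        with open("extract-report.log", "w") as log:
            for dirpath, _dirs, files in os.walk(DOC_ROOT):
                for name in files:
                    path = os.path.join(dirpath, name)
                    try:
                        with open(path, "rb") as f:
                            r = requests.put(TIKA_TEXT, data=f,
                                             headers={"Accept": "text/plain"},
                                             timeout=300)
                        if r.status_code != 200:
                            log.write(f"HTTP {r.status_code}\t{path}\n")
                        elif len(r.text.split()) < 10:   # crude "no real text" threshold
                            log.write(f"NEARLY-EMPTY\t{path}\n")
                    except Exception as exc:
                        log.write(f"FAILED {exc}\t{path}\n")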

    6) Adobe Acrobat Pro apparently has a batch mode suitable for
    flagging unsearchable (that is, image-based) pdf files and fixing them.

    7) Another problem I've encountered is documents that are themselves
    composites of other documents (like an email thread).  The problem
    is that a hit on such a document doesn't tell you much about the
    true relevance of each contained document.  You have to do a
    laborious manual search to figure it out.
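
    (If tika-server turns out to be the right tool, its /rmeta endpoint
    looks relevant here: as I read the docs, it returns one JSON record
    per contained or attached document, each with its own text, so the
    pieces of a thread could be examined, or indexed, separately.  A
    hedged sketch; the URL and the metadata key names are my reading of
    the docs, not something I've verified.)

        import requests   # third-party; pip install requests

        RMETA = "http://localhost:9998/rmeta/text"   # recursive metadata, plain-text content
        path = "/path/to/thread.eml"                 # placeholder

        with open(path, "rb") as f:
            records = requests.put(RMETA, data=f,
                                   headers={"Accept": "application/json"},
                                   timeout=300).json()

        # The first record is the container; the rest are embedded/attached documents.
        for rec in records:
            name = rec.get("X-TIKA:embedded_resource_path") or rec.get("resourceName", "(container)")
            text = rec.get("X-TIKA:content") or ""
            print(name, len(text.split()), "words")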

    8) Is there a way to return the size of a matching document (which,
    I think, would help identify non-searchable/image documents)?
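
    (My guess is that this depends on whether a size field gets stored
    at index time; if it does, asking for it in fl on an ordinary query
    would be enough.  A sketch, where the collection name and the field
    name, here stream_size, are assumptions about the schema rather
    than anything I've verified.)

        import requests   # third-party; pip install requests

        SOLR = "http://localhost:8983/solr/mycollection/select"   # placeholder collection
        params = {
            "q": "body:invoice",        # placeholder query
            "fl": "id,stream_size",     # assumes a stored size field exists in the schema
            "rows": 20,
            "wt": "json",
        }
        resp = requests.get(SOLR, params=params, timeout=60).json()
        for doc in resp["response"]["docs"]:
            print(doc.get("stream_size"), doc["id"])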

Regards,

Terry




On 04/18/2018 12:50 PM, Allison, Timothy B. wrote:
> To be Waldorf to Erick's Statler (if I may), lots of things can go wrong 
> during content extraction.[1]  I had two big concerns when I heard of your 
> task:
>
>
>
> 1) image-only pdfs, which can parse without problem, but which might yield 0 
> content.
>
> 2) emails (see, e.g. SOLR-12048)
>
>
>
> It sounds like you're taking care of 1), and 2) doesn't apply because you're 
> using Tika (although note that we've made some major changes to our RFC822 
> parsing in the upcoming Tika 1.18).  So, no need to read further! 😊
>
>
>
> In general, surprising things can happen during the content extraction phase, 
> and unless you are monitoring/measuring/evaluating what's extracted, your 
> search system can yield results that are downright dangerous if you assume 
> that the full stack is actually working.
>
>
>
> I worked with one batch of documents where HALF of the Excel files weren't 
> being parsed.  They all had the same quirk, which caused an exception in POI; 
> because they were inside zip files, and Tika's legacy/default behavior is to 
> silently ignore embedded exceptions, the owners of the search system had 
> _no idea_ that they'd never be able to find those documents.  At one point, 
> Tika wasn't extracting sdt form fields in docx or form fields in pdf...at 
> all...imagine if your document set was a bunch of docx with sdts or pdfs with 
> form fields...  We just fixed a bug to pull text from joined shapes in 
> ppt...we've been missing that text for years!
>
>
>
> Those are a few horror stories; I have many, and there are countless more yet 
> to be discovered!
>
>
>
> The goal of tika-eval[2] is to allow you to see if things don't look right 
> based on your expectations.[3]  It doesn't help with indexing at all per se, 
> but it can allow you to see odd things and 1) change your processing pipeline 
> (add OCR where necessary or use an alternate parser for some file formats) or 
> 2) raise an issue to fix bugs in the content extraction libraries, or at 
> least 3) recognize that you aren't getting reliable content out of ~x% of 
> your documents.  If manually checking PDFs to determine whether or not to run 
> OCR is a hassle, run tika-eval and identify those docs that have a low word 
> count/page ratio.
>
>
>
> A couple of handfuls of Welsh documents; I thought we only had English...what?! 
> No, that's just bad content extraction (character mapping failure in the PDF 
> or other mojibake).  Average token length in this document is 1, and it is 
> supposed to be English...what?  No, that's the spacing problem that Erick 
> mentioned.  Average words per page in some pdfs = 2?  No, that's an 
> image-only pdf...that needs to go through OCR.  Ratio of out-of-vocabulary 
> words = 90%?  No, that's character encoding mojibake.
>
>
>
>
>
>> I was recently indexing a set of about 13,000 documents and at one point, a 
>> document caused Solr to crash.  I had to restart it.  I removed the offending 
>> document, and restarted the indexing.  It then eventually happened again, so 
>> I did the same thing.
>
>
>
> Crash, crash like OOM?  If you're able to share that file with Tika or PDFBox, 
> we can _try_ to fix the underlying bug if there is one.  Sometimes, though, our 
> parsers require far more memory than is ideal. 😐
>
>
>
> If you have questions about tika-eval, please ask over on the Tika list.  
> Apologies for too many words.  Thank you, all, for this discussion!
>
>
>
> Cheers,
>
>
>
>            Tim
>
>
>
>
>
> P.S. On metadata author vs. creator, for a good while we've been trying to 
> standardize on Dublin Core -- dc:creator.  If you see areas for improvement, 
> let us know.
>
>
>
> [1] 
> https://www.slideshare.net/TimAllison6/haystack-2018-apachetikaevaltallison
>
> [2] https://wiki.apache.org/tika/TikaEval
>
> [3] Obviously, without ground truth, there is no automated way to detect the 
> sdt/form field/grouped text box problems, but tika-eval does what it can to 
> identify and count:
>
> a) catastrophic problems (oom, permanent hang)
>
> b) catchable exceptions
>
> c) corrupted text
>
> d) nearly entirely missing text
>
>
>
>
