Chris A. Mattmann created SOLR-7137:
---------------------------------------

             Summary: Upgrade to Tika 1.7 in 4_10_3 branch
                 Key: SOLR-7137
                 URL: https://issues.apache.org/jira/browse/SOLR-7137
             Project: Solr
          Issue Type: Bug
          Components: contrib - Solr Cell (Tika extraction)
    Affects Versions: 4.10.3
            Reporter: Chris A. Mattmann
            Priority: Blocker
             Fix For: 4.10.4


I have been trying out SolrCell as an alternative way to ingest around 40M 
images using Tesseract OCR and Tika. I noticed that in 4.10.3 Tika is pinned to 
1.5. With Tika 1.5 and SolrCell 4.10.3, only about 5,600 images out of a 
50,000 subset are ingested when I run a series of 50K cURL commands against the 
extract handler. I suspected this was because some of the extracted characters 
are oddball characters (4@#@#/ ^^^^), since Tesseract does not always extract 
the right text. But then I remembered that Tesseract didn't land in Tika until 
1.7.
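
For reference, the series of cURL commands against the extract handler can be 
sketched roughly as below. This is only an illustration, not the exact commands 
used: the host, the "collection1" core name, the file names, the "myfile" form 
field, and the helper names extract_url and curl_command are all assumptions. 
The /update/extract path and the literal.id and commit parameters are the 
standard ones for the extracting request handler.

```python
from urllib.parse import urlencode


def extract_url(solr_base, doc_id, commit=False):
    """Build the extract-handler URL for one document (hypothetical helper)."""
    params = {"literal.id": doc_id, "commit": "true" if commit else "false"}
    return f"{solr_base}/update/extract?{urlencode(params)}"


def curl_command(solr_base, image_path, doc_id):
    """Return the argv list for one cURL upload (hypothetical helper)."""
    # -F sends the file as multipart form data; "myfile" is an arbitrary field name.
    return ["curl", extract_url(solr_base, doc_id), "-F", f"myfile=@{image_path}"]


if __name__ == "__main__":
    # Assumed local Solr 4.x instance with the default example core.
    base = "http://localhost:8983/solr/collection1"
    for i, path in enumerate(["scan-0001.png", "scan-0002.png"]):
        print(" ".join(curl_command(base, path, f"img-{i}")))
```

Giving each upload its own literal.id makes it possible to compare the IDs that 
were sent with the IDs that end up in the index, which is one way to arrive at 
a count such as "5,600 out of 50,000".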

So regardless, I thought I'd upgrade the 4.10.x branch to Tika 1.7. The 
attached patch to do so is trivial (Tika + compress updates). With it, all 50K 
images in the 50K subset are ingested, but I'm noticing something else weird. 
Even though Tesseract is called, and even though I can verify that text is 
extracted from certain images, all I get in the "content" field of SolrCell is 
a run of "\n \n \n \n \n \n" text. Extremely odd.
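
One way to quantify the whitespace-only "content" symptom is to scan indexed 
documents and flag those whose content has no visible text. The sketch below 
assumes the documents have already been fetched as plain dictionaries with an 
"id" key and an optional "content" key (a string or a list of strings); the 
helper names is_blank_content and blank_ids are hypothetical.

```python
def is_blank_content(content):
    """Return True if content is missing or contains only whitespace."""
    # Solr may return a multi-valued field as a list, so normalize to a list.
    values = content if isinstance(content, list) else [content]
    return all(not (value or "").strip() for value in values)


def blank_ids(docs):
    """Collect the IDs of documents whose "content" field has no visible text."""
    return [doc["id"] for doc in docs if is_blank_content(doc.get("content"))]


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    sample = [
        {"id": "img-0", "content": "\n \n \n \n \n \n"},
        {"id": "img-1", "content": ["\n", "some extracted text"]},
        {"id": "img-2"},
    ]
    print(blank_ids(sample))
```

Running a check like this before and after the Tika upgrade would show whether 
the blank content affects every image or only a subset.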




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
