In the Tika entity processor, use the option onError="skip".

The alternatives are abort (the default) and continue (carry on as if nothing 
had happened).

skip drops the current document and the import carries on with the next one.
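
For illustration, here is roughly where the attribute would go in a DIH config. This is only a sketch, not your actual file: the baseDir path, the fileName pattern, the entity and data source names, and the "content" field are placeholders I made up, so adjust them to your setup.

```xml
<dataConfig>
  <!-- Binary data source that hands the raw file bytes to Tika -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: walks the directory tree and lists files.
         baseDir and fileName are placeholder values. -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/docs" fileName=".*\.pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <!-- Inner entity: runs Tika on each file.
           onError="skip" skips a document that fails instead of
           aborting the whole import (the default is "abort"). -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="bin"
              format="text" onError="skip">
        <!-- "content" is a placeholder Solr field name -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that onError is set on the entity that does the parsing (the Tika one), since that is where the exception is thrown.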

> On 15.03.2019 at 12:44, Demian Katz <demian.k...@villanova.edu> wrote:
> 
> Jörn (and anyone else with more experience with this than I have),
> 
> I've been working on Whitney with this issue. It is a PDF file, and it can be 
> opened successfully in a PDF reader. Interestingly, if I try to extract data 
> from it on the command line, Tika version 1.3 throws a lot of warnings but 
> does successfully extract data, but several newer versions, including 1.17 
> and 1.20 (haven't tested other intermediate versions) encounter a fatal error 
> and extract nothing. So this seems like something that used to work but has 
> stopped. Unfortunately, we haven't been able to find a way to downgrade to an 
> old enough Tika in her Solr installation to work around the problem that way.
> 
> The bigger question, though, is whether there's a way to allow the DIH to 
> simply ignore errors and keep going. Whitney needs to index several terabytes 
> of arbitrary documents for her project, and at this scale, she can't afford 
> the time to stop and manually intervene for every strange document that 
> happens to be in the collection. It would be far preferable for the 
> indexing process to ignore exceptions and carry on rather than stop 
> dead at the first problem. (I'm also pretty sure that Whitney is already 
> using the ignoreTikaException attribute in her configuration, but it doesn't 
> seem to help in this instance).
> 
> Any suggestions would be greatly appreciated!
> 
> thanks,
> Demian
> 
> -----Original Message-----
> From: Jörn Franke <jornfra...@gmail.com> 
> Sent: Friday, March 15, 2019 4:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Help with a DIH config file
> 
> Do you have an exception?
> It could be that the PDF is broken - can you open it on your computer with a 
> PDF reader?
> 
> If the exception is related to Tika and PDF, then file an issue with the 
> PDFBox project. If there is an issue with Tika and MS Office documents, then 
> Apache POI is the right project to ask.
> 
>> On 15.03.2019 at 03:41, wclarke <wcla...@widernet.org> wrote:
>> 
>> Thank you so much.  You helped a great deal.  I am running into one 
>> last issue where the Tika DIH stops and fails on documents in a 
>> specific language (Malayalam).  Do you know of a workaround?
>> 
>> 
>> 
