In the Tika entity processor, use the option onError="skip". The alternatives are "abort" (the default) and "continue" (behave as if nothing had happened).
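For illustration, a minimal data-config.xml sketch with onError set on the Tika entity might look like the following. The baseDir path, the fileName pattern, and the "content" field name are placeholders I made up, not values from Whitney's setup, so adjust them to the actual configuration:

```xml
<dataConfig>
  <!-- Binary data source so Tika can read raw file contents -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity walks the file system; baseDir and fileName are placeholders -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/docs" fileName=".*\.pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <field column="fileAbsolutePath" name="id"/>
      <!-- Inner entity parses each file; onError="skip" drops a failing document -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin" onError="skip">
        <!-- "content" is an assumed field name in the Solr schema -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The attribute goes on the entity that does the parsing, so a failure in one file should affect only that file and not the rest of the import.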
With "skip", the current document is skipped and the import moves on to the next one.

> On 15.03.2019 at 12:44, Demian Katz <demian.k...@villanova.edu> wrote:
>
> Jörn (and anyone else with more experience with this than I have),
>
> I've been working with Whitney on this issue. It is a PDF file, and it can be
> opened successfully in a PDF reader. Interestingly, if I try to extract data
> from it on the command line, Tika version 1.3 throws a lot of warnings but
> does successfully extract data, but several newer versions, including 1.17
> and 1.20 (haven't tested other intermediate versions) encounter a fatal error
> and extract nothing. So this seems like something that used to work but has
> stopped. Unfortunately, we haven't been able to find a way to downgrade to an
> old enough Tika in her Solr installation to work around the problem that way.
>
> The bigger question, though, is whether there's a way to allow the DIH to
> simply ignore errors and keep going. Whitney needs to index several terabytes
> of arbitrary documents for her project, and at this scale, she can't afford
> the time to stop and manually intervene for every strange document that
> happens to be in the collection. It would be greatly preferable if the
> indexing process could ignore exceptions and proceed than if it just stops
> dead at the first problem. (I'm also pretty sure that Whitney is already
> using the ignoreTikaException attribute in her configuration, but it doesn't
> seem to help in this instance).
>
> Any suggestions would be greatly appreciated!
>
> thanks,
> Demian
>
> -----Original Message-----
> From: Jörn Franke <jornfra...@gmail.com>
> Sent: Friday, March 15, 2019 4:18 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Help with a DIH config file
>
> Do you have an exception?
> It could be that the PDF is broken - can you open it on your computer with a
> PDF reader?
>
> If the exception is related to Tika and PDF then file an issue with the
> PDFBox project. If there is an issue with Tika and MS Office documents then
> Apache POI is the right project to ask.
>
>> On 15.03.2019 at 03:41, wclarke <wcla...@widernet.org> wrote:
>>
>> Thank you so much. You helped a great deal. I am running into one
>> last issue where the Tika DIH is stopping at a specific language and
>> fails there (Malayalam). Do you know of a workaround?
>>
>>
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html