Re: PDF extraction using Tika

2020-08-26 Thread Walter Underwood
should run Tika separately as it's entirely possible for it to fail to parse a PDF and crash - and if you're running it in DIH & Solr it then brings down everything. Separate your PDF processing from your Solr indexing.

RE: [EXT] Re: PDF extraction using Tika

2020-08-26 Thread Hanjan, Harinderdeep S.
memory footprint. For example, the following will limit it to 2GB > java -Xmx2048m -jar tika-server-1.24.jar - H -----Original Message----- From: Jan Høydahl [mailto:jan@cominvent.com] Sent: August 26, 2020 6:19 AM To: solr-user Subject: [EXT] Re: PDF extraction using Tika When I wor
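A minimal sketch of the "run Tika separately" approach discussed in this thread, assuming a tika-server instance is already running and listening on its default port 9998. The class name TikaClientSketch and the methods buildExtractRequest and extractText are hypothetical names made up for illustration; only the JDK 11+ java.net.http client is used. The point it shows is that a bad PDF or a crashed Tika process surfaces as an empty result in the indexing client instead of taking Solr down with it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.Optional;

public class TikaClientSketch {

    // Build a PUT request against the server's /tika endpoint, asking for plain text back.
    // The 60 second timeout is an arbitrary value chosen for this sketch.
    static HttpRequest buildExtractRequest(URI tikaBase, byte[] pdfBytes) {
        return HttpRequest.newBuilder(tikaBase.resolve("/tika"))
                .timeout(Duration.ofSeconds(60))
                .header("Accept", "text/plain")
                .header("Content-Type", "application/pdf")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(pdfBytes))
                .build();
    }

    // Returns the extracted text, or Optional.empty() if the file cannot be read,
    // the server is unreachable, the request times out, or the status is not 200.
    static Optional<String> extractText(HttpClient client, URI tikaBase, Path pdf) {
        try {
            HttpRequest request = buildExtractRequest(tikaBase, Files.readAllBytes(pdf));
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200
                    ? Optional.of(response.body())
                    : Optional.empty();
        } catch (IOException e) {
            // Covers missing files, refused connections and timeouts.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("usage: java TikaClientSketch <file.pdf>");
            return;
        }
        // Assumed location of the separately running Tika server.
        URI tikaBase = URI.create("http://localhost:9998");
        Optional<String> text =
                extractText(HttpClient.newHttpClient(), tikaBase, Path.of(args[0]));
        System.out.println(text
                .map(t -> "extracted " + t.length() + " characters")
                .orElse("extraction failed; skipping this file"));
    }
}
```

The extracted text would then be sent to Solr in a separate step (for example with SolrJ), so a failure on one PDF only costs that one document.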

Re: PDF extraction using Tika

2020-08-26 Thread Jan Høydahl
a PDF and crash - and if you're running it in DIH & Solr it then brings down everything. Separate your PDF processing from your Solr indexing. Cheers Charlie

Re: PDF extraction using Tika

2020-08-26 Thread Charlie Hull
- and if you're running it in DIH & Solr it then brings down everything. Separate your PDF processing from your Solr indexing. Cheers Charlie Thanks, Srinivas Kashyap -----Original Message----- From: Alexandre Rafalovitch Sent: 24 August 2020 20:54 To: solr-user Subject: Re: PDF extraction using

RE: PDF extraction using Tika

2020-08-25 Thread Srinivas Kashyap
Thanks Phil, I will modify it according to the need. Thanks, Srinivas -----Original Message----- From: Phil Scadden Sent: 26 August 2020 02:44 To: solr-user@lucene.apache.org Subject: RE: PDF extraction using Tika Code for SolrJ is going to be very dependent on your needs but the beating

RE: PDF extraction using Tika

2020-08-25 Thread Phil Scadden
Admin", password); UpdateResponse ur = req.process(solr,"prindex"); req.commit(solr, "prindex"); -----Original Message----- From: Srinivas Kashyap Sent: Tuesday, 25 August 2020 17:04 To: solr-user@lucene.apache.org Subject: RE: PDF extraction usi

Re: PDF extraction using Tika

2020-08-25 Thread Joe Doupnik
PDF extraction using Tika The issue seems to be more with a specific file and at the level way below Solr's or possibly even Tika's: Caused by: java.io.IOException: expected='>' actual=' ' at offset 2383 at org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.ja

Re: PDF extraction using Tika

2020-08-25 Thread Charlie Hull
Sent: 24 August 2020 20:54 To: solr-user Subject: Re: PDF extraction using Tika The issue seems to be more with a specific file and at the level way below Solr's or possibly even Tika's: Caused by: java.io.IOException: expected='>' actual=' ' at offs

RE: PDF extraction using Tika

2020-08-24 Thread Srinivas Kashyap
from PDF and pushes into Solr? Thanks, Srinivas Kashyap -----Original Message----- From: Alexandre Rafalovitch Sent: 24 August 2020 20:54 To: solr-user Subject: Re: PDF extraction using Tika The issue seems to be more with a specific file and at the level way below Solr's or possibly even

Re: PDF extraction using Tika

2020-08-24 Thread Alexandre Rafalovitch
The issue seems to be more with a specific file and at the level way below Solr's or possibly even Tika's: Caused by: java.io.IOException: expected='>' actual=' ' at offset 2383 at org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045) Are you indexing the

PDF extraction using Tika

2020-08-24 Thread Srinivas Kashyap
Hello, We are using TikaEntityProcessor to extract the content out of PDFs and make the content searchable. When Jetty is run on a Windows-based machine, we are able to successfully load documents using a DIH full import (Tika entity). Here the PDFs are maintained in the Windows file system. But when