Re: PDF extraction using Tika

2020-08-26 Thread Walter Underwood
should run Tika separately as it's entirely possible for it to fail to parse a PDF and crash - and if you're running it in DIH & Solr it then brings down everything. Separate your PDF processing from your Solr indexing.

RE: [EXT] Re: PDF extraction using Tika

2020-08-26 Thread Hanjan, Harinderdeep S.
memory footprint. For example, the following will limit it to 2GB: java -Xmx2048m -jar tika-server-1.24.jar - H -----Original Message----- From: Jan Høydahl [mailto:jan@cominvent.com] Sent: August 26, 2020 6:19 AM To: solr-user Subject: [EXT] Re: PDF extraction using Tika When I wor
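A minimal sketch of talking to a standalone tika-server like the one started above, assuming it listens on its default port 9998 and that plain text is wanted from its /tika endpoint. The function names, the timeout, and the choice to return None on failure are illustrative assumptions, not something from the original messages. Only the Python standard library is used.

```python
import http.client
import urllib.request

# Assumption: tika-server runs locally on its default port.
TIKA_URL = "http://localhost:9998"


def build_tika_request(pdf_bytes, tika_url=TIKA_URL):
    """Build a PUT request asking tika-server for plain-text output."""
    return urllib.request.Request(
        tika_url.rstrip("/") + "/tika",
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_text(pdf_bytes, tika_url=TIKA_URL, timeout=60):
    """Return the extracted text, or None if Tika fails on this document."""
    request = build_tika_request(pdf_bytes, tika_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        # A bad PDF or a crashed server costs only this one document.
        return None
```

Because the parse happens in the Tika process, a PDF that makes the parser fail shows up here as a None result for that file and the indexing side keeps running.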

Re: PDF extraction using Tika

2020-08-26 Thread Jan Høydahl
a PDF and crash - and if you're running it in DIH & Solr it then brings down everything. Separate your PDF processing from your Solr indexing. Cheers Charlie

Re: PDF extraction using Tika

2020-08-26 Thread Charlie Hull
- and if you're running it in DIH & Solr it then brings down everything. Separate your PDF processing from your Solr indexing. Cheers Charlie Thanks, Srinivas Kashyap -----Original Message----- From: Alexandre Rafalovitch Sent: 24 August 2020 20:54 To: solr-user Subject: Re: PDF extraction using

RE: PDF extraction using Tika

2020-08-25 Thread Srinivas Kashyap
Thanks Phil, I will modify it according to the need. Thanks, Srinivas -----Original Message----- From: Phil Scadden Sent: 26 August 2020 02:44 To: solr-user@lucene.apache.org Subject: RE: PDF extraction using Tika Code for solrj is going to be very dependent on your needs but the beating

RE: PDF extraction using Tika

2020-08-25 Thread Phil Scadden
Admin", password); UpdateResponse ur = req.process(solr,"prindex"); req.commit(solr, "prindex"); -----Original Message----- From: Srinivas Kashyap Sent: Tuesday, 25 August 2020 17:04 To: solr-user@lucene.apache.org Subject: RE: PDF extraction usi
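For anyone not using SolrJ, the same push step can be sketched against Solr's JSON update handler over plain HTTP. This is only a rough illustration under assumptions: the base URL and the field names `id` and `content` are hypothetical, the core name `prindex` is taken from the snippet above, and the basic-auth credentials used in the SolrJ code are left out.

```python
import json
import urllib.request


def build_solr_update_request(solr_base_url, core, docs, commit=True):
    """Build a POST request that sends a JSON array of documents to /update."""
    url = f"{solr_base_url.rstrip('/')}/{core}/update"
    if commit:
        url += "?commit=true"
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_documents(solr_base_url, core, docs, timeout=30):
    """Send the documents to Solr and return its parsed JSON response."""
    request = build_solr_update_request(solr_base_url, core, docs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call would look something like `index_documents("http://localhost:8983/solr", "prindex", [{"id": "report-1", "content": text}])`, where `text` is whatever came back from the extraction step.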

Re: PDF extraction using Tika

2020-08-25 Thread Joe Doupnik
PDF extraction using Tika The issue seems to be more with a specific file and at the level way below Solr's or possibly even Tika's: Caused by: java.io.IOException: expected='>' actual=' ' at offset 2383 at org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.ja

Re: PDF extraction using Tika

2020-08-25 Thread Charlie Hull
Sent: 24 August 2020 20:54 To: solr-user Subject: Re: PDF extraction using Tika The issue seems to be more with a specific file and at the level way below Solr's or possibly even Tika's: Caused by: java.io.IOException: expected='>' actual=' ' at offs

RE: PDF extraction using Tika

2020-08-24 Thread Srinivas Kashyap
from PDF and pushes into solr? Thanks, Srinivas Kashyap -----Original Message----- From: Alexandre Rafalovitch Sent: 24 August 2020 20:54 To: solr-user Subject: Re: PDF extraction using Tika The issue seems to be more with a specific file and at the level way below Solr's or possibly even

Re: PDF extraction using Tika

2020-08-24 Thread Alexandre Rafalovitch
The issue seems to be more with a specific file and at the level way below Solr's or possibly even Tika's: Caused by: java.io.IOException: expected='>' actual=' ' at offset 2383 at org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045) Are you indexing the