RE: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
Well, I have a lot of OCRed PDFs, but the extremely slow text extraction is 
hard to pin down. The bulk of the OCRed ones aren't too slow, but then I have 
one that will take several minutes. I use a little utility, pdftotext.exe, for 
making a crude guess at whether OCR is necessary, and it is much faster (but 
not that easy to use in the indexing workflow). Some of the big modern ones 
(fully digital) can also be very slow. Maybe it's the amount of inline imagery? 
That doesn't seem to bother pdftotext.
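
For what it's worth, the crude pdftotext check described above could be scripted roughly like this. It is only a sketch: the function names, the five-page sample, the 30-second timeout and the 50-characters-per-page threshold are all assumptions made up for illustration. It relies on pdftotext writing to stdout when the output file is given as "-" and on its default behaviour of ending each page with a form feed.

```python
import subprocess


def extract_text(pdf_path, max_pages=5, timeout_s=30, exe="pdftotext"):
    """Run pdftotext on the first max_pages pages.

    Returns the extracted text, or None if the tool is missing,
    fails, or does not finish within timeout_s seconds.
    """
    # "-l N" stops after page N; "-" as the output file means stdout.
    cmd = [exe, "-enc", "UTF-8", "-l", str(max_pages), pdf_path, "-"]
    try:
        result = subprocess.run(
            cmd, capture_output=True, timeout=timeout_s, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return result.stdout.decode("utf-8", errors="replace")


def probably_needs_ocr(text, min_chars_per_page=50):
    """Crude guess: very little visible text per page suggests an image PDF.

    None (extraction failed or timed out) is also flagged, so the file
    gets looked at instead of being silently indexed as empty.
    """
    if text is None:
        return True
    pages = text.split("\f")
    # pdftotext ends every page with a form feed, so drop the empty tail.
    if len(pages) > 1 and not pages[-1].strip():
        pages.pop()
    visible = sum(len("".join(page.split())) for page in pages)
    return visible / len(pages) < min_chars_per_page


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        flagged = probably_needs_ocr(extract_text(path))
        print(path, "-> needs OCR or a closer look" if flagged else "-> has text")
```

The timeout means a pathological file gets flagged rather than stalling the whole indexing run; the threshold would need tuning against a real collection.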

Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Walter Underwood
No need to prove it. More modern PDF formats are easier to decode, but for many 
years the text was move-print-move-print, so the font metrics were necessary to 
guess at spaces. Plus, the glyph IDs had to be mapped to characters, so some 
PDFs were effectively a substitution cipher. Our team joked about using cbw 
(Crypt Breakers Workbench) for PDF decoding, but decided it would be a problem 
for export.
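
To make the move-print-move-print problem concrete, here is a toy sketch of how an extractor might rebuild one line of text. Everything in it is an assumption for illustration: the Glyph record, the to_unicode table (standing in for whatever glyph-to-character mapping the font provides) and the space_tolerance parameter are hypothetical, and real extractors are considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    glyph_id: int  # font-specific code, not a Unicode code point
    x: float       # left edge on the line, in points
    width: float   # advance width from the font metrics, in points


def decode_line(glyphs, to_unicode, space_tolerance=0.3):
    """Map glyph IDs to characters and guess where the spaces go.

    A space is inserted when the gap between two glyphs exceeds
    space_tolerance times the average glyph width on the line.
    """
    if not glyphs:
        return ""
    avg_width = sum(g.width for g in glyphs) / len(glyphs)
    threshold = space_tolerance * avg_width
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None and g.x - (prev.x + prev.width) > threshold:
            out.append(" ")
        # Unmapped glyph IDs become U+FFFD, the replacement character.
        out.append(to_unicode.get(g.glyph_id, "\ufffd"))
        prev = g
    return "".join(out)


# Hypothetical font: the glyph IDs bear no relation to character codes.
to_unicode = {7: "t", 3: "o", 9: "b", 1: "e"}
line = [Glyph(7, 0.0, 5.0), Glyph(3, 5.0, 5.0),
        Glyph(9, 14.0, 5.0), Glyph(1, 19.5, 5.0)]

print(decode_line(line, to_unicode))                        # to be
print(decode_line(line, to_unicode, space_tolerance=0.05))  # to b e
print(decode_line(line, to_unicode, space_tolerance=1.0))   # tobe
```

The same glyphs come out as three different strings depending on the tolerance, which is why the font metrics and the tunable spacing parameters matter so much.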

I saw one two-column PDF where the glyphs were laid out strictly top to bottom, 
across both columns. Whee!
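
A small sketch of why that layout hurts, assuming each word comes with page coordinates and that the x position of the gutter between the columns is already known (both are assumptions for illustration; finding the gutter is the hard part in practice):

```python
def naive_order(words):
    """Read strictly top to bottom, left to right, ignoring columns."""
    return [text for text, x, y in sorted(words, key=lambda w: (w[2], w[1]))]


def column_order(words, gutter_x):
    """Read the left column top to bottom, then the right column."""
    def by_position(w):
        return (w[2], w[1])

    left = sorted((w for w in words if w[1] < gutter_x), key=by_position)
    right = sorted((w for w in words if w[1] >= gutter_x), key=by_position)
    return [text for text, x, y in left + right]


# (text, x, y) with y growing down the page; two columns, two rows.
words = [("A1", 0, 0), ("B1", 100, 0), ("A2", 0, 10), ("B2", 100, 10)]

print(naive_order(words))       # ['A1', 'B1', 'A2', 'B2']
print(column_order(words, 50))  # ['A1', 'A2', 'B1', 'B2']
```

The naive order interleaves the two columns line by line, which is what strict top-to-bottom glyph order produces.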

A friend observed that turning a PDF into a structured document is like turning 
hamburger back into a cow. The PDF standard has improved a lot, but then you 
get an OCR’ed PDF. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Erick Erickson
I'm going to guess it's the exact opposite. The metadata is the
"semi-structured" part, which is much easier to collect than the PDF
body text. I mean there are parameters to tweak that control how much
space between letters (in the body text) should be allowed while still
treating them as a single word. I'm not quite sure how to prove that,
but I'd be willing to make a bet ;)

Erick

On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden  wrote:
> I am indexing PDFs, and a separate process has converted any image PDFs to 
> searchable PDFs before Solr gets near them. I notice that Tika is very slow 
> at parsing some PDFs. I don't need any metadata (which I suspect is slowing 
> Tika down), just the text. Has anyone used an alternative PDF text extraction 
> library in a SolrJ context?