RE: Specialized Solr Application

2018-04-20 Thread Allison, Timothy B.
>1) the toughest pdfs to identify are those that are partly searchable (text) and partly not (image-based text).  However, I've found that such documents tend to exist in clusters. Agreed. We should do something better in Tika to identify image-only pages on a page-by-page basis, and

Re: Specialized Solr Application

2018-04-19 Thread Terry Steichen
Thanks, Tim.  A couple of quick comments and a couple of questions: 1) the toughest pdfs to identify are those that are partly searchable (text) and partly not (image-based text).  However, I've found that such documents tend to exist in clusters. 2) email documents (.eml) are no

RE: Specialized Solr Application

2018-04-18 Thread Allison, Timothy B.
To be Waldorf to Erick's Statler (if I may), lots of things can go wrong during content extraction.[1] I had two big concerns when I heard of your task: 1) image-only PDFs, which can parse without problem but might yield 0 content; 2) emails (see, e.g., SOLR-12048). It sounds like yo

Re: Specialized Solr Application

2018-04-18 Thread Erick Erickson
ike date, subject, to and from. Other (so-called 'rich text') >>> documents (like pdfs and Word-type), the metadata is not so useful, but >>> on the other hand, there's not much consistent structure to the >>> documents I have to deal with. >>> >>

Re: Specialized Solr Application

2018-04-18 Thread Terry Steichen
However, there's a premium on precision (and recall) in searches. >>> Please, oh, please, no matter what you're using for content/text extraction >>> and/or OCR, run tika-eval[1] on the output to ensure that you are >>> getting mostly language-y content ou

Re: Specialized Solr Application

2018-04-17 Thread Erick Erickson
you're using for content/text extraction >> and/or OCR, run tika-eval[1] on the output to ensure that you are >> getting mostly language-y content out of your documents. Ping us on the >> Tika user's list if you have any questions. >> >> Bad text,

Re: Specialized Solr Application

2018-04-17 Thread Terry Steichen
wiki.apache.org/tika/TikaEval > > -Original Message- > From: Charlie Hull [mailto:char...@flax.co.uk] > Sent: Tuesday, April 17, 2018 4:17 AM > To: solr-user@lucene.apache.org > Subject: Re: Specialized Solr Application > > On 16/04/2018 19:48, Terry Steichen wrote: >>

RE: Specialized Solr Application

2018-04-17 Thread Allison, Timothy B.
ay, April 17, 2018 4:17 AM To: solr-user@lucene.apache.org Subject: Re: Specialized Solr Application On 16/04/2018 19:48, Terry Steichen wrote: > I have from time-to-time posted questions to this list (and received > very prompt and helpful responses).  But it seems that many of you are >

Re: Specialized Solr Application

2018-04-17 Thread Charlie Hull
On 16/04/2018 19:48, Terry Steichen wrote: I have from time-to-time posted questions to this list (and received very prompt and helpful responses).  But it seems that many of you are operating in a very different space from me.  The problems (and lessons-learned) which I encounter are often very

Specialized Solr Application

2018-04-16 Thread Terry Steichen
I have from time-to-time posted questions to this list (and received very prompt and helpful responses).  But it seems that many of you are operating in a very different space from me.  The problems (and lessons-learned) which I encounter are often very different from those that are reflected in ex