Re: Regarding pdf indexing issue

2018-07-11 Thread Terry Steichen
Walter, well said. (And I love the hamburger conversion analogy - very apt.) The only thing I will add is that when you have a collection of similar rich text documents, you might be able to construct queries that respect internal structures within the documents. If all/most of your documents

Re: Regarding pdf indexing issue

2018-07-11 Thread Shamik Sinha
You could try the Tesseract tool to check data extraction from PDFs or images, and then proceed accordingly. As far as I understand, the PDF is an image and not data. A searchable PDF actually overlays the selectable text as hidden text over the PDF image. These PDFs can be indexed and

Re: Regarding pdf indexing issue

2018-07-11 Thread Walter Underwood
PDF is not a structured document format. It is a printer control format. PDF does not have a paragraph marker. Instead, it says to move to this spot on the page, choose this font, and print this letter. For a paragraph, it moves farther. For the next letter in a word, it moves a little bit.
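To make the point above concrete, here is a minimal sketch of what an extractor has to do when all it gets is positioned glyphs: guess word, line, and paragraph breaks from the size of the gaps. The `Glyph` record, the `place` helper, and the gap thresholds are hypothetical, chosen only for illustration; they are not taken from any PDF library.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One printed character. Hypothetical record, not a real PDF API."""
    char: str
    x: float      # left edge of the glyph
    y: float      # vertical position, measured from the top of the page
    width: float


def place(text, x, y, advance=6.0, width=5.0, space=4.0):
    """Simulate 'move to this spot and print this letter' for a string."""
    glyphs = []
    for ch in text:
        if ch == " ":
            x += space        # a space is only a larger move, not a glyph
            continue
        glyphs.append(Glyph(ch, x, y, width))
        x += advance
    return glyphs


def rebuild_text(glyphs, word_gap=2.0, line_gap=14.0):
    """Guess words, lines, and paragraphs from glyph positions alone."""
    paragraphs, lines, line = [], [], ""
    prev = None
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if prev is not None and g.y != prev.y:
            # Vertical move: the current line is finished.
            lines.append(line)
            line = ""
            if g.y - prev.y > line_gap:
                # A larger vertical move is assumed to start a paragraph.
                paragraphs.append(" ".join(lines))
                lines = []
        elif prev is not None and g.x - (prev.x + prev.width) > word_gap:
            # A larger horizontal move is assumed to be a word break.
            line += " "
        line += g.char
        prev = g
    if line:
        lines.append(line)
    if lines:
        paragraphs.append(" ".join(lines))
    return paragraphs


page = place("Hi all", 0, 0) + place("bye", 0, 12) + place("New para", 0, 40)
paragraphs = rebuild_text(page)
```

Every threshold here is a guess, which is why real documents with columns, tables, or unusual spacing so often come out scrambled.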

Re: Regarding pdf indexing issue

2018-07-11 Thread Erick Erickson
Solr will not do this automatically; the Extracting Request Handler simply indexes the entire contents of the doc without regard to things like paragraphs. Ditto with HTML. This is actually a task that requires getting into Tika and using all the bells and whistles there. I'd recommend two
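As a rough illustration of the kind of work this implies, the sketch below takes plain text that has already been extracted from a document and splits it into one record per heading, so that each section could be indexed as its own document. It makes no Solr or Tika calls. The heading heuristic (a short single line without closing punctuation) and the field names `id`, `heading_s`, and `body_t` are assumptions made for the example, not anything Solr or Tika prescribes.

```python
import re


def is_heading(block):
    """Assumed heuristic: a short single line with no closing punctuation."""
    return (
        "\n" not in block
        and len(block) <= 60
        and not block.endswith((".", "?", "!", ":"))
    )


def split_sections(doc_id, text):
    """Split extracted plain text into one record per heading and its body."""
    sections = []
    heading, body = None, []

    def flush():
        # Emit the section collected so far, if there is anything in it.
        if heading is not None or body:
            sections.append({
                "id": f"{doc_id}#{len(sections)}",   # hypothetical id scheme
                "heading_s": heading or "",
                "body_t": "\n\n".join(body),
            })

    # Blocks are runs of text separated by at least one blank line.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    for block in blocks:
        if is_heading(block):
            flush()
            heading, body = block, []
        else:
            # Undo hard line wraps inside a paragraph.
            body.append(" ".join(block.split()))
    flush()
    return sections


sample = (
    "Introduction\n\n"
    "Solr indexes documents.\nIt is built on Lucene.\n\n"
    "Installation\n\n"
    "Download the archive.\n\n"
    "Unpack it.\n"
)
records = split_sections("manual.pdf", sample)
```

With records shaped like this, a search hit can point at the matching section instead of the whole file, provided the extracted text preserves blank lines between headings and paragraphs, which PDF extraction does not guarantee.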

Regarding pdf indexing issue

2018-07-11 Thread Rahul Prasad Dwivedi
Hello Team, I am using Solr to index and search PDF documents. I have gone through the documentation on your website and installed Solr, but I am unable to index and search the documents. For example: suppose we have a PDF file that has a number of paragraphs, each with a separate heading. So if I search for