Re: Regarding pdf indexing issue

2018-07-11 Thread Terry Steichen
Walter, Well said.  (And I love the hamburger conversion analogy - very apt.) The only thing I will add is that when you have a collection of similar rich text documents, you might be able to construct queries to respect internal structures within the documents.  If all/most of your documents

Re: Regarding pdf indexing issue

2018-07-11 Thread Shamik Sinha
You could try the Tesseract tool to check text extraction from PDFs or images, and then proceed accordingly. As far as I understand, the PDF is an image and not data. A searchable PDF overlays the selectable text as hidden text over the page image. These PDFs can be indexed and

Re: Regarding pdf indexing issue

2018-07-11 Thread Walter Underwood
PDF is not a structured document format. It is a printer control format. PDF does not have a paragraph marker. Instead, it says to move to this spot on the page, choose this font, and print this letter. For a paragraph, it moves farther. For the next letter in a word, it moves a little bit.
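To make the point about positioning concrete, here is a minimal sketch of how an extractor might rebuild words from positioned glyphs on one line. The `(x_start, x_end, char)` tuples, the function name, and the `word_gap` threshold are all hypothetical; real extractors such as PDFBox use considerably more elaborate heuristics.

```python
def group_glyphs_into_words(glyphs, word_gap=3.0):
    """Group glyphs on a single line into words by horizontal gap.

    glyphs: list of (x_start, x_end, char) tuples (hypothetical input shape).
    word_gap: assumed threshold; a gap larger than this starts a new word.
    """
    words = []
    current = ""
    previous_end = None
    for x_start, x_end, char in sorted(glyphs):
        # A large jump between glyphs is treated as a word boundary.
        if previous_end is not None and x_start - previous_end > word_gap:
            words.append(current)
            current = ""
        current += char
        previous_end = x_end
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    line = [(0, 5, "H"), (5, 10, "i"), (20, 25, "y"), (25, 30, "o")]
    print(group_glyphs_into_words(line))
```

The same idea, applied to vertical gaps, is roughly how paragraph breaks have to be guessed, which is why paragraph structure is unreliable in extracted PDF text.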

Re: Regarding pdf indexing issue

2018-07-11 Thread Erick Erickson
Solr will not do this automatically; the Extracting Request Handler simply indexes the entire contents of the doc without regard to things like paragraphs, etc. Ditto with HTML. This is actually a task that requires getting into Tika and using all the bells and whistles there. I'd recommend two

Regarding pdf indexing issue

2018-07-11 Thread Rahul Prasad Dwivedi
Hello Team, I am using Solr for indexing and searching PDF documents. I have gone through the documentation on your website and installed Solr, but I am unable to index and search the document. For example: suppose we have a PDF file which has a number of paragraphs, each with a separate heading. So if I search for

Apache Solr - Pdf Indexing.

2014-04-29 Thread vignesh
Hi Team, I am indexing PDFs using Apache Solr 3.6. I am passing around 3000 keywords using the OR operator and am able to get the files containing the keywords. Kindly guide me on how to get the list of keywords found in a given PDF file. Note: in schema.xml I have declared a unique tag, id.

Re: Apache Solr - Pdf Indexing.

2014-04-29 Thread Alexandre Rafalovitch
Your question is not terribly clear. Are you having trouble indexing PDFs in general? Try the tutorial and specifically look for the extract handler. Or have you already got the PDFs into the system, but your 3000-keyword query does not match them? In which case it might be just that PDF extraction is limited by

Re: Apache Solr - Pdf Indexing.

2014-04-29 Thread Gora Mohanty
On Apr 29, 2014 2:52 PM, vignesh vignes...@ninestars.in wrote: Hi Team, I am indexing PDFs using Apache Solr 3.6. I am passing around 3000 keywords using the OR operator and am able to get the files containing the keywords. Kindly guide me on how to get the list of keywords found in a given PDF file. What

Apache Solr - Pdf Indexing.

2014-04-29 Thread vignesh
Hi Team, I am indexing PDFs using Apache Solr 3.6. I am passing around 3000 keywords using the OR operator (gardens OR flowers OR time OR train OR trees OR etc.) and am able to get the files containing these keywords. But not every PDF file will contain all the keywords; some may
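One possible client-side approach to the question above, sketched under assumptions: build the OR query from the keyword list, then, for each returned file, check which keywords occur in its stored or extracted text. The function names are hypothetical, and the matching here is plain lowercase whole-word comparison, which will not necessarily agree with whatever analysis (stemming, stop words) the Solr field type applies.

```python
import re


def build_or_query(keywords):
    """Join keywords into a single OR query string."""
    return " OR ".join(keywords)


def keywords_in_text(text, keywords):
    """Return the keywords that occur as whole words in the given text.

    Matching is case-insensitive and exact; no stemming is applied.
    """
    tokens = set(re.findall(r"\w+", text.lower()))
    return [keyword for keyword in keywords if keyword.lower() in tokens]


if __name__ == "__main__":
    keywords = ["gardens", "flowers", "time", "train", "trees"]
    print(build_or_query(keywords))
    print(keywords_in_text("The gardens have many trees.", keywords))
```

This only works if the document text is available to the client, for example from a stored field.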

PDF Indexing

2014-04-02 Thread Sujatha Arun
Hi, I am able to use Tika and DIH to index a PDF as a single document. However, I need each page to be a single document. Is there any built-in mechanism to achieve this, or do I have to use PDFBox or another tool to achieve it? Regards

Re: PDF Indexing

2014-04-02 Thread Ahmet Arslan
Hi Sujatha, There is no built-in mechanism. Prepare page documents outside of Solr.  http://searchhub.org/2012/02/14/indexing-with-solrj/ And you may want to save the text content somewhere too. If you change something in index analysis/schema you need to reindex. If you save text data, you
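A minimal sketch of "preparing page documents outside of Solr", assuming the per-page text has already been extracted by some other tool (PDFBox, for example). The function name, the id scheme, and the `file_id` and `page` field names are hypothetical and would need matching fields in the schema.

```python
import json


def build_page_docs(file_id, page_texts):
    """Build one Solr document (as a dict) per page of extracted text."""
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{file_id}_p{page_number}",  # hypothetical unique id per page
            "file_id": file_id,                 # hypothetical field: source file
            "page": page_number,                # hypothetical field: 1-based page
            "text": text,
        })
    return docs


if __name__ == "__main__":
    pages = ["First page text.", "Second page text."]
    # The JSON list could then be sent to Solr by whatever client is in use.
    print(json.dumps(build_page_docs("report.pdf", pages), indent=2))
```

Keeping the extracted page text alongside the index, as suggested above, means a schema change only requires rebuilding these documents, not re-parsing the PDFs.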

Re: PDF Indexing

2014-04-02 Thread Jack Krupansky
: Wednesday, April 2, 2014 3:35 PM To: solr-user@lucene.apache.org Subject: Re: PDF Indexing Hi Sujatha, There is no built in mechanism. Prepare page documents outside of the solr. http://searchhub.org/2012/02/14/indexing-with-solrj/ And you may want to save text content somewhere too. If you change

Re: PDF indexing issues

2013-11-18 Thread Marcello Lorenzi
should check the Apache PDFBox project. A similar question: https://issues.apache.org/jira/browse/PDFBOX-940 2013/11/15 Marcello Lorenzi mlore...@sorint.it Hi, during our testing of Apache Solr 4.3, we noticed some errors during PDF indexing: ERROR - 2013-11-15 15:14:26.248

PDF indexing issues

2013-11-15 Thread Marcello Lorenzi
Hi, during our testing of Apache Solr 4.3, we noticed some errors during PDF indexing: ERROR - 2013-11-15 15:14:26.248; org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse predefined CMAP file for 'PDFXC30-Indentity0-UCS2' ERROR - 2013-11-15 15:14:36.108

Re: PDF indexing issues

2013-11-15 Thread Furkan KAMACI
You should check the Apache PDFBox project. A similar question: https://issues.apache.org/jira/browse/PDFBOX-940 2013/11/15 Marcello Lorenzi mlore...@sorint.it Hi, during our testing of Apache Solr 4.3, we noticed some errors during PDF indexing: ERROR - 2013-11-15 15:14:26.248

Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-27 Thread Furkan KAMACI
indexing works. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 5:30 AM To: solr-user@lucene.apache.org Subject: Document is missing mandatory uniqueKey field: id for Solr PDF indexing I use Solr 4.2.1 and these are my fields: <field name="id"

Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-26 Thread Furkan KAMACI
I use Solr 4.2.1 and these are my fields: <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> <field name="text" type="text_general" indexed="true" stored="true"/> <!-- Common metadata fields, named specifically to match up with SolrCell metadata when parsing rich

Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-26 Thread Raymond Wiker
You could start by running java -jar post.jar -help; the 7th example shows exactly what you need to do to add a document id. On Fri, Apr 26, 2013 at 11:30 AM, Furkan KAMACI furkankam...@gmail.com wrote: I use Solr 4.2.1 and these are my fields: <field name="id" type="string" indexed="true" stored="true"
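For context on adding a document id: the extracting request handler accepts `literal.<fieldname>` request parameters, so the uniqueKey can be supplied as `literal.id`. The sketch below only builds such a URL; the function name is hypothetical and the base URL is the one that appears in this thread's SimplePostTool output, so adjust both for a real installation.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, commit=True):
    """Build an extract-handler URL that sets the uniqueKey via literal.id."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Base URL taken from the thread; the document id is an arbitrary example.
    print(build_extract_url("http://localhost:8983/solr/update/extract", "doc1"))
```

The PDF itself would then be sent as the body of a POST to that URL by whatever HTTP client is in use.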

Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-26 Thread Furkan KAMACI
Hi Raymond; Now I get this error: SimplePostTool: WARNING: IOException while reading response: java.io.FileNotFoundException: 2013/4/26 Raymond Wiker rwi...@gmail.com You could start by running java -jar post.jar -help; the 7th example shows exactly what you need to do to add a document id.

Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-26 Thread Jan Høydahl
http://wiki.apache.org/solr/post.jar -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 26 Apr 2013 at 13:28, Furkan KAMACI furkankam...@gmail.com wrote: Hi Raymond; Now I get this error: SimplePostTool: WARNING: IOException

Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-26 Thread Furkan KAMACI
It would be nice if you could help me. I get this error: SimplePostTool version 1.5 Posting files to base url http://localhost:8983/solr/update/extract.. Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log POSTing

Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-26 Thread Furkan KAMACI
I think that I should start a new thread for my question, to help people who search for the same situation. 2013/4/26 Furkan KAMACI furkankam...@gmail.com It would be nice if you could help me. I get this error: SimplePostTool version 1.5 Posting files to base url

Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-26 Thread Jack Krupansky
-Original Message- From: Furkan KAMACI Sent: Friday, April 26, 2013 5:30 AM To: solr-user@lucene.apache.org Subject: Document is missing mandatory uniqueKey field: id for Solr PDF indexing I use Solr 4.2.1 and these are my fields: <field name="id" type="string" indexed="true" stored="true"

Re: Document is missing mandatory uniqueKey field: id for Solr PDF indexing

2013-04-26 Thread Furkan KAMACI
To: solr-user@lucene.apache.org Subject: Document is missing mandatory uniqueKey field: id for Solr PDF indexing I use Solr 4.2.1 and these are my fields: <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> <field name="text" type="text_general" indexed="true"

Re: PDF indexing

2012-05-08 Thread Lance Norskog
Krupansky -Original Message- From: Tolga Sent: Monday, May 07, 2012 3:24 PM To: solr-user@lucene.apache.org Subject: PDF indexing Hi, From what I have read, I think I have to use Tika (?) to index PDF, xls, doc, etc files. How do I start? Do I use mvn clean install in the source

PDF indexing

2012-05-07 Thread Tolga
Hi, From what I have read, I think I have to use Tika (?) to index PDF, xls, doc, etc. files. How do I start? Do I use mvn clean install in the source directory to get all the jar files to begin? CentOS doesn't provide mvn; how do I build Tika after getting it from http://maven.apache.org ?

Re: PDF indexing

2012-05-07 Thread Jack Krupansky
Try SolrCell (ExtractingRequestHandler). See: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Tolga Sent: Monday, May 07, 2012 3:24 PM To: solr-user@lucene.apache.org Subject: PDF indexing Hi, From what I have read, I think I have

Re: PDF indexing

2012-05-07 Thread Tolga
On 05/07/2012 10:35 PM, Jack Krupansky wrote: Try SolrCell (ExtractingRequestHandler). See: http://wiki.apache.org/solr/ExtractingRequestHandler -- Jack Krupansky -Original Message- From: Tolga Sent: Monday, May 07, 2012 3:24 PM To: solr-user@lucene.apache.org Subject: PDF indexing

PDF indexing

2011-09-29 Thread Jón Helgi Jónsson
Good day, I'm checking if Solr would work for indexing PDFs. My requirements are: 1) I must know which page has what contents. 2) Right-to-left search support, such as Hebrew. This has been the trickiest to achieve. I would also prefer to know the position of the searched contents on the page

Re: response time for pdf indexing

2011-06-23 Thread simon
How long are the documents? Indexing a large document can be slow (although 2 seconds is very slow indeed). 2011/6/22 Rode González (libnova) r...@libnova.es: Hi! We are using Zend Search based on Lucene. Our indexing pdf consultations take longer than 2 seconds. We want to change to

response time for pdf indexing

2011-06-22 Thread libnova
Hi! We are using Zend Search based on Lucene. Our indexing pdf consultations take longer than 2 seconds. We want to change to Solr to try to solve this problem. i. Can anyone tell me the response time for queries on PDF documents in Solr? ii. Can anyone tell me some strategies to

RE: response time for pdf indexing

2011-06-22 Thread Steven A Rowe
; Marcos; Mario Crespo (Silvereme); 'Rode' Subject: response time for pdf indexing Hi ! We are using Zend Search based on Lucene. Our indexing pdf consultations take longer than 2 seconds. We want to change to solr to try to solve this problem. i. Can anyone tell me the response time