Walter,
Well said. (And I love the hamburger conversion analogy - very apt.)
The only thing I will add is that when you have a collection of similar
rich text documents, you might be able to construct queries to respect
internal structures within the documents. If all/most of your documents
You could try the Tesseract tool to check whether text can be extracted from
your PDFs or images, and then proceed accordingly. As far as I understand, a
scanned PDF is an image and not data. A searchable PDF overlays the selectable
text as hidden text over the PDF image. These PDFs can be indexed and
PDF is not a structured document format. It is a printer control format.
PDF does not have a paragraph marker. Instead, it says to move
to this spot on the page, choose this font, and print this letter. For a
paragraph, it moves farther. For the next letter in a word, it moves a
little bit.
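To make that concrete, here is a minimal Python sketch of how an extractor
has to guess words, lines and paragraphs from nothing but glyph positions.
The Glyph record, the rebuild_text function and both gap thresholds are
hypothetical, made up for illustration; real extractors use more elaborate
heuristics.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # the letter that was printed
    x: float      # left edge of the letter
    y: float      # baseline of the letter
    width: float  # advance width of the letter


def rebuild_text(glyphs, word_gap=2.0, para_gap=20.0):
    """Guess paragraphs from positioned glyphs (assumed to be in reading order)."""
    paragraphs = []
    lines = []
    current = ""
    prev = None
    for g in glyphs:
        if prev is not None:
            if g.y != prev.y:
                # The baseline changed: the current line is finished.
                lines.append(current)
                current = ""
                # A large vertical move is taken to mean a new paragraph.
                if abs(g.y - prev.y) >= para_gap:
                    paragraphs.append(" ".join(lines))
                    lines = []
            elif g.x - (prev.x + prev.width) >= word_gap:
                # A horizontal gap on the same line is taken to mean a space.
                current += " "
        current += g.char
        prev = g
    if current:
        lines.append(current)
    if lines:
        paragraphs.append(" ".join(lines))
    return paragraphs
```

The thresholds are pure guesswork, which is why paragraph boundaries
recovered from a PDF are never fully reliable.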
Solr will not do this automatically. The Extracting Request Handler
simply indexes the entire contents of the doc without regard to things
like paragraphs. Ditto with HTML. This is actually a task that
requires getting into Tika and using all the bells and whistles there.
I'd recommend two
Hello Team,
I am using Solr for indexing and searching PDF documents.
I have gone through the documentation on your website and installed Solr, but
I am unable to index and search the document.
For example: suppose we have a PDF file that has a number of paragraphs, each
with a separate heading.
So if I search for
Hi Team,
I am indexing PDFs using Apache Solr 3.6. I pass around 3000
keywords using the OR operator and am able to get the files containing the
keywords. Kindly guide me on how to get the list of keywords found in a .PDF file.
Note: in schema.xml I have declared a unique id field.
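One possible approach, sketched in Python, is to check on the client side
which of the keywords occur in the text of each returned file. This is an
assumption about what is being asked, not a Solr feature: it presumes you
can get at the extracted text (for example from a stored field), and the
function names are hypothetical. It only handles single-word keywords.

```python
import re


def build_or_query(keywords):
    """Join single-word keywords into one OR query string."""
    return " OR ".join(keywords)


def matched_keywords(text, keywords):
    """Return the keywords that occur as whole words in text (case-insensitive)."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return [k for k in keywords if k.lower() in tokens]
```

For example, matched_keywords("Trees line the gardens.", ["gardens",
"flowers", "trees"]) keeps only "gardens" and "trees".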
Your question is not terribly clear. Are you having trouble indexing PDFs
in general? Try the tutorial and specifically look for the extract handler.
Or have you already got the PDF into the system, but your 3000-keyword query
does not match it? In which case it might be just that PDF extraction is limited by
On Apr 29, 2014 2:52 PM, vignesh vignes...@ninestars.in wrote:
Hi Team,
I am indexing PDF using Apache Solr 3.6 . Passing around
3000 keywords using the OR operator and able to get the files containing
the keywords. Kindly guide me to get the keyword list in a .PDF file.
What
Hi Team,
I am indexing PDFs using Apache Solr 3.6. I pass around 3000
keywords using the OR operator (gardens OR flowers OR time OR train OR trees
OR etc.) and am able to get the files containing these keywords. But not every
.PDF file will contain all the keywords; some may
Hi,
I am able to use Tika and DIH to index a PDF as a single document. However,
I need each page to be a single document. Is there any built-in mechanism to
achieve this, or do I have to use PDFBox or some other tool?
Regards
Hi Sujatha,
There is no built-in mechanism. Prepare the page documents outside of Solr.
http://searchhub.org/2012/02/14/indexing-with-solrj/
And you may want to save the text content somewhere too. If you change something
in the index analysis or schema, you need to reindex. If you save text data, you
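Preparing page documents outside of Solr could look roughly like this in
Python. It assumes the text of each page has already been extracted with
some tool (PDFBox, for example); the function name and the field names
file_id, page and text are hypothetical and would need matching fields in
your schema.

```python
def page_documents(file_id, page_texts):
    """Turn a list of per-page texts into one index document per page."""
    docs = []
    for number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{file_id}_page_{number}",  # unique key, one per page
            "file_id": file_id,                # lets you group pages by file
            "page": number,                    # 1-based page number
            "text": text,
        })
    return docs
```

Writing the result to a file (with json.dump, say) is one way to keep the
extracted text around for a later reindex.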
: Wednesday, April 2, 2014 3:35 PM
To: solr-user@lucene.apache.org
Subject: Re: PDF Indexing
You should check the Apache PDFBox project. A similar question:
https://issues.apache.org/jira/browse/PDFBOX-940
2013/11/15 Marcello Lorenzi mlore...@sorint.it
Hi,
during our testing of Apache Solr 4.3, we noticed some errors
occurring during PDF indexing:
ERROR - 2013-11-15 15:14:26.248
Hi,
during our testing of Apache Solr 4.3, we noticed some errors
occurring during PDF indexing:
ERROR - 2013-11-15 15:14:26.248;
org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse
predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
ERROR - 2013-11-15 15:14:36.108
indexing works.
-- Jack Krupansky
-Original Message- From: Furkan KAMACI
Sent: Friday, April 26, 2013 5:30 AM
To: solr-user@lucene.apache.org
Subject: Document is missing mandatory uniqueKey field: id for Solr PDF
indexing
I use Solr 4.2.1 and these are my fields:
field name=id
I use Solr 4.2.1 and these are my fields:
<field name="id" type="string" indexed="true" stored="true" required="true"
multiValued="false" />
<field name="text" type="text_general" indexed="true" stored="true"/>
!-- Common metadata fields, named specifically to match up with
SolrCell metadata when parsing rich
You could start by doing
java -jar post.jar -help
--- the 7th example shows exactly what you need to do to add a document id.
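For reference, the ExtractingRequestHandler takes the id as a literal.id
request parameter. A minimal Python sketch of building such a request URL
follows; the function name is hypothetical, and the base URL is only an
assumption based on the default localhost:8983 setup.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, commit=True):
    """Build an /update/extract URL that supplies the uniqueKey via literal.id."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{base_url}/update/extract?{urlencode(params)}"
```

extract_url("http://localhost:8983/solr", "doc1") gives a URL you can POST
the PDF to with any HTTP client.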
On Fri, Apr 26, 2013 at 11:30 AM, Furkan KAMACI furkankam...@gmail.com wrote:
I use Solr 4.2.1 and these are my fields:
field name=id type=string indexed=true stored=true
Hi Raymond;
Now I get that error: SimplePostTool: WARNING: IOException while reading
response: java.io.FileNotFoundException:
2013/4/26 Raymond Wiker rwi...@gmail.com
You could start by doing
java -jar post.jar -help
--- the 7th example shows exactly what you need to do to add a document id.
http://wiki.apache.org/solr/post.jar
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
26. apr. 2013 kl. 13:28 skrev Furkan KAMACI furkankam...@gmail.com:
Hi Raymond;
Now I get that error: SimplePostTool: WARNING: IOException
I would appreciate any help. I get this error:
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update/extract..
Entering auto mode. File endings considered are
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing
I think that I should start a new thread for my question, to help people who
search for the same situation.
2013/4/26 Furkan KAMACI furkankam...@gmail.com
If you can help me it would be nice. I get that error:
SimplePostTool version 1.5
Posting files to base url
Krupansky
-Original Message- From: Tolga Sent: Monday, May 07, 2012 3:24 PM
To: solr-user@lucene.apache.org Subject: PDF indexing
Hi,
From what I have read, I think I have to use Tika (?) to index PDF, xls,
doc, etc files. How do I start? Do I use mvn clean install in the source
Hi,
From what I have read, I think I have to use Tika (?) to index PDF,
xls, doc, etc. files. How do I start? Do I use mvn clean install in the
source directory to get all the jar files to begin? CentOS doesn't
provide mvn, so how do I build Tika after getting it from
http://maven.apache.org ?
Try SolrCell (ExtractingRequestHandler).
See:
http://wiki.apache.org/solr/ExtractingRequestHandler
-- Jack Krupansky
-Original Message-
From: Tolga
Sent: Monday, May 07, 2012 3:24 PM
To: solr-user@lucene.apache.org
Subject: PDF indexing
Hi,
From what I have read, I think I have
On 05/07/2012 10:35 PM, Jack Krupansky wrote:
Try SolrCell (ExtractingRequestHandler).
See:
http://wiki.apache.org/solr/ExtractingRequestHandler
-- Jack Krupansky
-Original Message- From: Tolga Sent: Monday, May 07, 2012 3:24
PM To: solr-user@lucene.apache.org Subject: PDF indexing
Good day,
I'm checking whether Solr would work for indexing PDFs. My requirements are:
1) I must know which page has what contents.
2) Right-to-left search support, such as for Hebrew. This has been the
trickiest to achieve.
I would also prefer to know the position of the searched contents on the page
How long are the documents? Indexing a large document can be slow
(although 2 seconds is very slow indeed).
2011/6/22 Rode González (libnova) r...@libnova.es:
Hi !
We are using Zend Search based on Lucene. Our indexing pdf consultations
take longer than 2 seconds.
We want to change to
Hi!
We are using Zend Search, which is based on Lucene. Our queries against indexed
PDFs take longer than 2 seconds.
We want to change to Solr to try to solve this problem.
i. Can anyone tell me the response time for queries on PDF documents in Solr?
ii. Can anyone tell me some strategies to
; Marcos; Mario Crespo
(Silvereme); 'Rode'
Subject: response time for pdf indexing