Cescy wrote:
Thanks.
You mean that in a Java class I should (1) convert the PDF to XML, (2)
extract the text from the XML file, (3) extract the images, (4) index and
search using Lucene, and (5) display the context in the original format?
1) Extract PDF metadata and searchable text -- use PDFBox (steps 1 and
7 are sketched in the code just after this list)
2) Extract tiff images from PDF -- use PDFBox
3) Manipulate tiffs to be uncompressed -- use ImageMagick
4) Manipulate uncompressed tiffs to be a max depth of 8 bits
per pixel -- use ImageMagick
5) OCR resulting tiff -- use tesseract
6) Write all data (metadata, extracted searchable text, OCR text) to
an XML file
7) Index (but do not store in the index) the metadata, searchable
text, and OCR text -- use Lucene
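In code, steps 1 and 7 come out roughly like this -- a minimal sketch
against the PDFBox 1.x and Lucene 3.x APIs of the time; the file and
field names ("sample.pdf", "content", "xmlPath") are just placeholders:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;

public class IndexPdf {
    public static void main(String[] args) throws Exception {
        // Step 1: pull the metadata and the "searchable" text layer
        // out of the PDF
        PDDocument pdf = PDDocument.load(new File("sample.pdf"));
        PDDocumentInformation info = pdf.getDocumentInformation();
        String title = info.getTitle() == null ? "" : info.getTitle();
        String text = new PDFTextStripper().getText(pdf);
        pdf.close();

        // Step 7: index everything but store none of it -- Store.NO
        // keeps the index small; the full text lives in the XML file
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")),
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("title", title,
                Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field("content", text,
                Field.Store.NO, Field.Index.ANALYZED));
        // Store only a pointer back to the XML file holding the text
        doc.add(new Field("xmlPath", "sample.xml",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}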
Which APIs should I use at each step?
If the system finds the matching text, how does it show the context (maybe
including images, tables...) to the user in the original format? Because
extracting the images and the text are two separate steps...
I don't understand the question.
Can I search for the result text in the PDF file and highlight it, then
split out that page?
I show the text that I've extracted (as opposed to the PDF itself).
First I use Lucene to highlight it and then I use an AJAX framework to
display it. The framework does the splitting and paging. I also
provide a link to the original PDF, but I don't do any highlighting of
the PDF. FYI -- Adobe Reader has some built-in highlighting features
that are available from the command line, but I don't use them.
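The highlighting bit is roughly this -- a sketch against the Lucene 3.x
contrib highlighter; the "content" field name just matches the indexing
sketch earlier and is otherwise arbitrary:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.Version;

public class HighlightText {
    // Wrap each hit in <b> tags; "text" is what was read back from
    // the XML file, not from the index
    static String[] highlight(Query query, String text) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        Highlighter hl = new Highlighter(
                new SimpleHTMLFormatter("<b>", "</b>"),
                new QueryScorer(query));
        return hl.getBestFragments(analyzer, "content", text, 5);
    }

    public static void main(String[] args) throws Exception {
        Query q = new QueryParser(Version.LUCENE_30, "content",
                new StandardAnalyzer(Version.LUCENE_30)).parse("contract");
        for (String frag : highlight(q, "... the contract was signed ..."))
            System.out.println(frag);
    }
}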
James
Could you please send me some screenshots of your project, or the JAR so
that I can run it?
------------------ Original ------------------
From: "James Wilson"<[email protected]>;
Date: Tue, Mar 8, 2011 01:52 AM
To: "users"<[email protected]>;
Cc: "itext-questions"<[email protected]>; "java-user"<[email protected]>;
Subject: Re: about pdf search
Cescy wrote:
Hi,
I am developing a PDF search engine, for use on a local computer only, to
search a massive set of PDF documents.
I used PDFBox + Lucene to index and search, and now I have to display the
context to the user, from the PDF file, in the user interface. How can I
achieve this?
I have completed a project to do the exact same thing. I put the pdf
text in XML files. Then after I do a Lucene search I read the text from
the XML files. I do not store the text in the Lucene index. That would
bloat the index and slow down my searches. FYI -- I use PDFBox to
extract the "searchable" text and I use tesseract (OCR) to extract the
text from the images within the PDFs. In order to make tesseract work
correctly I have to use ImageMagick to make many modifications to the
images so that tesseract can OCR them correctly. Image modification/OCR
is a slow process and it is extremely resource intensive (CPU
utilization specifically -- Disk IO to a lesser extent).
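If you want to drive ImageMagick and tesseract from Java, shelling out
is the simplest route; a rough sketch (the flags and file names are only
illustrative, and both binaries are assumed to be on the PATH):

import java.io.File;
import java.io.InputStream;

public class OcrTiff {
    // Run a command and drain its output so it can't block on a full pipe
    static void run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd)
                .redirectErrorStream(true).start();
        InputStream out = p.getInputStream();
        while (out.read() != -1) { /* discard */ }
        if (p.waitFor() != 0)
            throw new RuntimeException("failed: " + cmd[0]);
    }

    public static void main(String[] args) throws Exception {
        // Steps 3-4: uncompress the tiff and force 8 bits per pixel
        run("convert", "page1.tif", "-compress", "None", "-depth", "8",
                "page1_clean.tif");
        // Step 5: tesseract writes its text next to the output base
        // name, i.e. page1_clean.txt
        run("tesseract", "page1_clean.tif", "page1_clean");
        System.out.println(new File("page1_clean.txt").exists());
    }
}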
As far as displaying the extracted text goes, I would use an AJAX
framework that would provide a nice pop-up view of the text. This pop-up
should also have built-in paging. I use Lucene's built-in highlighting of
matches as well.
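The search-then-fetch flow is roughly this -- a sketch against the
Lucene 3.x API, where the stored "xmlPath" field is just my placeholder
for however you point back at the XML file:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SearchAndFetch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher =
                new IndexSearcher(FSDirectory.open(new File("index")));
        Query q = new QueryParser(Version.LUCENE_30, "content",
                new StandardAnalyzer(Version.LUCENE_30)).parse("contract");
        for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
            Document doc = searcher.doc(hit.doc);
            // The index stores only a pointer; the real text is in XML
            String xmlPath = doc.get("xmlPath");
            BufferedReader r = new BufferedReader(new FileReader(xmlPath));
            StringBuilder xml = new StringBuilder();
            for (String line; (line = r.readLine()) != null; )
                xml.append(line).append('\n');
            r.close();
            // ...parse the XML, pull out the text, highlight, display...
        }
        searcher.close();
    }
}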
Oh almost forgot -- I use PDFBox to extract the images from the PDFs.
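Roughly like so, assuming the PDFBox 1.x API that was current then
(newer versions moved these classes around):

import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class ExtractImages {
    public static void main(String[] args) throws Exception {
        PDDocument pdf = PDDocument.load(new java.io.File("sample.pdf"));
        int n = 0;
        // Walk every page's resources and dump each image XObject
        for (Object p : pdf.getDocumentCatalog().getAllPages()) {
            PDResources res = ((PDPage) p).getResources();
            Map<?, ?> images = res.getImages();
            if (images == null) continue;
            for (Object o : images.values()) {
                PDXObjectImage img = (PDXObjectImage) o;
                // write2file adds the image suffix (.jpg/.png) itself
                img.write2file("img" + n++);
            }
        }
        pdf.close();
    }
}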
James
THX
--
James J. Wilson II
Systems Engineer
U.S. District Court
District of New Mexico
333 Lomas Blvd., NW
Albuquerque, NM 87102
Phone: (505) 348-2081
Fax: (505) 348-2028