Cescy wrote:
Thanks.


You mean that in a Java class I should (1) convert the PDF to XML, (2) extract
the text in the XML file, (3) extract the images, (4) index and search using
Lucene, and (5) display the context in the original format?

1)  Extract PDF metadata and searchable text -- use PDFBox
2)  Extract tiff images from the PDF -- use PDFBox
3)  Manipulate the tiffs to be uncompressed -- use ImageMagick
4)  Manipulate the uncompressed tiffs to be a max depth of 8 bits
per pixel -- use ImageMagick
5)  OCR the resulting tiff -- use tesseract
6)  Write all data (metadata, extracted searchable text, OCR text) to an
XML file
7)  Index (but do not store in the index) all metadata, searchable text,
and OCR text -- use Lucene
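
Here's a rough sketch of steps 1 and 7, assuming PDFBox 1.x and Lucene 3.x
(the class names moved around in later releases); the file and field names
are just placeholders:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        // Step 1: pull the searchable text out of the PDF
        PDDocument pdf = PDDocument.load(new File("sample.pdf"));
        String text;
        try {
            text = new PDFTextStripper().getText(pdf);
        } finally {
            pdf.close();
        }

        // Step 7: index the text but do NOT store it -- storing it
        // would bloat the index; the text itself lives in the XML files
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")), cfg);
        Document doc = new Document();
        doc.add(new Field("contents", text,
                Field.Store.NO, Field.Index.ANALYZED));
        // Store the path so the XML file can be found again at search time
        doc.add(new Field("path", "sample.pdf",
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();
    }
}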

Which APIs should I use at each step?


If the system finds the result text, how does the system show the context
(maybe including images, tables...) in the original format to the user?
Because extracting the images and the text are two separate steps...

I don't understand the question.

Can I search for the result text in the PDF file and highlight it, then split
that page?

I show the text that I've extracted (as opposed to the PDF itself). First I
use Lucene to highlight it, and then I use an AJAX framework to display it.
The framework does the splitting and paging. I also provide a link to the
original PDF, but I don't do any highlighting of the PDF. FYI -- Adobe Reader
has some built-in highlighting features that are available from the command
line, but I don't use them.
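
As a concrete example of the highlighting step, a minimal sketch with the
Lucene 3.x contrib Highlighter might look like this (the "contents" field
and the analyzer are assumed to match whatever was used at index time, and
the query string is just an example):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.util.Version;

public class HighlightExample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
        Query query = new QueryParser(Version.LUCENE_35, "contents", analyzer)
                .parse("warrant");
        // Wrap each match in <b> tags for the AJAX pop-up view
        Highlighter highlighter = new Highlighter(
                new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
        String pageText = "...text read back from the XML file...";
        // Returns null if the query doesn't match this text
        String snippet = highlighter.getBestFragment(analyzer, "contents", pageText);
        System.out.println(snippet);
    }
}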

James

Could you please send me some screenshots of your project, or the JAR, so
that I can run it?
------------------ Original ------------------
From:  "James Wilson"<[email protected]>;
Date:  Tue, Mar 8, 2011 01:52 AM
To: "users"<[email protected]>; Cc: "itext-questions"<[email protected]>; "java-user"<[email protected]>; Subject: Re: about pdf search

Cescy wrote:
Hi,


I am developing a PDF search engine, just for use on a local computer to
search a massive number of PDF documents.


I used PDFBox + Lucene to index and search, and then I have to display the
context from the PDF file to the user in the user interface. HOW CAN I
ACHIEVE THIS?

I have completed a project to do the exact same thing.  I put the PDF
text in XML files.  Then, after I do a Lucene search, I read the text from
the XML files.  I do not store the text in the Lucene index.  That would
bloat the index and slow down my searches.  FYI -- I use PDFBox to
extract the "searchable" text and I use tesseract (OCR) to extract the
text from the images within the PDFs.  In order to make tesseract work
correctly I have to use ImageMagick to make many modifications to the
images so that tesseract can OCR them correctly.  Image modification/OCR
is a slow process and it is extremely resource intensive (CPU utilization
specifically -- disk IO to a lesser extent).
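
If you want to drive ImageMagick and tesseract from Java (steps 3-5 in the
list above), a minimal sketch using ProcessBuilder could look like the one
below. The file names are placeholders, and -compress None / -depth 8 are
just one way to meet the uncompressed, 8-bits-per-pixel requirement; the
exact flags you need will depend on your source images:

import java.io.IOException;
import java.util.Arrays;

public class OcrPrep {
    static void run(String... cmd) throws IOException, InterruptedException {
        // Run an external tool and fail loudly on a non-zero exit code
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("Command failed: " + Arrays.toString(cmd));
        }
    }

    public static void main(String[] args) throws Exception {
        // Steps 3-4: decompress the TIFF and cap it at 8 bits per pixel
        run("convert", "page1.tif", "-compress", "None", "-depth", "8",
            "page1-prepped.tif");
        // Step 5: OCR it; tesseract writes the recognized text to page1.txt
        run("tesseract", "page1-prepped.tif", "page1");
    }
}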

As far as displaying the extracted text, I would use an AJAX framework that
provides a nice pop-up view of the text.  This pop-up should also have
built-in paging.  I use Lucene's built-in highlighting of matches as well.

Oh almost forgot -- I use PDFBox to extract the images from the PDFs.
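
For the image-extraction step, a sketch against the PDFBox 1.x API (where
PDResources.getImages() still existed; later releases replaced it) might
look like this, with the file names again just placeholders:

import java.io.File;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public class ImageExtractor {
    public static void main(String[] args) throws Exception {
        PDDocument doc = PDDocument.load(new File("sample.pdf"));
        try {
            int n = 0;
            List<?> pages = doc.getDocumentCatalog().getAllPages();
            for (Object p : pages) {
                PDPage page = (PDPage) p;
                Map<?, ?> images = page.getResources().getImages();
                for (Object img : images.values()) {
                    // write2file appends a suffix based on the image type
                    ((PDXObjectImage) img).write2file("image-" + (n++));
                }
            }
        } finally {
            doc.close();
        }
    }
}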

James

THX



--
James J. Wilson II
Systems Engineer
U.S. District Court
District of New Mexico
333 Lomas Blvd., NW
Albuquerque, NM 87102
Phone:  (505) 348-2081
Fax:    (505) 348-2028
