Re: about pdf search

Cescy Mon, 07 Mar 2011 20:50:46 -0800

Thanks.

So you mean that in my Java class I should (1) convert the PDF to XML, (2) 
extract the text into the XML file, (3) extract the images, (4) index and 
search using Lucene, and (5) display the context in the original format?

Which APIs should I use at each step?

Once the system finds the result text, how does it show the context (possibly 
including images and tables) to the user in the original format? Extracting 
the images and extracting the text are two separate steps...

Can I search for the result text in the PDF file itself, highlight it, and 
then split out that page?

Could you please send me some screenshots of your project, or the JAR so that 
I can run it?

------------------ Original ------------------
From:  "James Wilson"<[email protected]>;
Date:  Tue, Mar 8, 2011 01:52 AM
To:  "users"<[email protected]>; 
Cc:  "itext-questions"<[email protected]>; 
"java-user"<[email protected]>; 
Subject:  Re: about pdf search

 Cescy wrote:
> Hi,
> 
> 
> I am developing a PDF search engine, for use on a local computer only, to 
> search a large number of PDF documents.
> 
> 
> I used PDFBox + Lucene to index and search, and now I have to display the 
> context from the PDF file to the user in the user interface. How can I 
> achieve this?

I have completed a project to do the exact same thing.  I put the pdf
text in XML files.  Then after I do a Lucene search I read the text from
the XML files.  I do not store the text in the Lucene index.  That would
bloat the index and slow down my searches.  FYI -- I use PDFBox to
extract the "searchable" text and I use tesseract (OCR) to extract the
text from the images within the PDFs.  To get usable OCR output I first
have to run the images through ImageMagick and apply a number of
modifications.  Image modification and OCR are slow and extremely
resource intensive (CPU utilization specifically -- disk I/O to a
lesser extent).
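
To make the XML side-file idea concrete, here is a minimal sketch using only
the JDK's built-in XML classes.  The class name PageTextStore, the element
names (document, page) and the attributes (source, number) are assumptions
made up for illustration; they are not the layout of the project described
above.  The idea is that the Lucene index would keep only the file path and
page number, and the text is read back from the XML after a hit.

```java
import java.io.File;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: one XML file per PDF, one <page> element per page.
public class PageTextStore {

    // Write the extracted text of each page as <page number="n">...</page>.
    public static void write(File xmlFile, String pdfName, List<String> pages)
            throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("document");
        root.setAttribute("source", pdfName);
        doc.appendChild(root);
        for (int i = 0; i < pages.size(); i++) {
            Element page = doc.createElement("page");
            page.setAttribute("number", Integer.toString(i + 1));
            page.setTextContent(pages.get(i)); // special characters are escaped on output
            root.appendChild(page);
        }
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(xmlFile));
    }

    // Read back the text of one page (1-based); returns null if the page is absent.
    public static String readPage(File xmlFile, int pageNumber) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(xmlFile);
        NodeList pages = doc.getElementsByTagName("page");
        String wanted = Integer.toString(pageNumber);
        for (int i = 0; i < pages.getLength(); i++) {
            Element page = (Element) pages.item(i);
            if (wanted.equals(page.getAttribute("number"))) {
                return page.getTextContent();
            }
        }
        return null;
    }
}
```

After a search, the hit only needs to carry the XML path and the page number;
PageTextStore.readPage(file, n) then returns the text to display.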

As far as displaying the extracted text, I would use an AJAX framework
that provides a nice pop-up view of the text.  This pop-up should
also have built-in paging.  I use Lucene's built-in highlighting of
matches as well.
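
To show what highlighting a match in its context amounts to, here is a small
plain-Java stand-in.  It is NOT Lucene's highlighter; it is a simplified
sketch that wraps literal, case-insensitive occurrences of a single term in
<b> tags and trims the text to a window around the first match.  The class
name SnippetHighlighter and the <b> markup are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for a real highlighter: literal single-term matching only.
public class SnippetHighlighter {

    // Returns the text from `context` characters before the first match to
    // `context` characters after it, with every occurrence of `term` inside
    // that window wrapped in <b>...</b>.  Returns null if there is no match.
    public static String snippet(String text, String term, int context) {
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher first = pattern.matcher(text);
        if (!first.find()) {
            return null;
        }
        int start = Math.max(0, first.start() - context);
        int end = Math.min(text.length(), first.end() + context);
        String window = text.substring(start, end);
        // $0 is the matched text itself, so the original casing is preserved.
        return pattern.matcher(window).replaceAll("<b>$0</b>");
    }
}
```

For example, snippet("The quick brown fox jumps", "brown", 6) returns
"quick <b>brown</b> fox j".  Before putting such a snippet into a web page,
the surrounding text would still need HTML escaping, which this sketch omits.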

Oh, I almost forgot -- I use PDFBox to extract the images from the PDFs.
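
Once the images are on disk, the ImageMagick and tesseract steps can be
driven from Java as external commands.  The sketch below is only an
illustration: the class name OcrCommands and the particular ImageMagick
options (greyscale, 200% resize) are assumptions, not the modifications
actually used in the project, and both tools are assumed to be on the PATH.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that builds and runs the external OCR commands.
public class OcrCommands {

    // ImageMagick: convert the extracted image to an enlarged greyscale TIFF.
    // The chosen options are illustrative; real scans may need different ones.
    public static List<String> preprocess(File in, File outTiff) {
        return Arrays.asList("convert", in.getPath(),
                "-colorspace", "Gray", "-resize", "200%", outTiff.getPath());
    }

    // tesseract <image> <outputbase> writes the recognised text to <outputbase>.txt
    public static List<String> ocr(File tiff, File outBase) {
        return Arrays.asList("tesseract", tiff.getPath(), outBase.getPath());
    }

    // Run one command, forwarding its output to the console, and return its exit code.
    public static int run(List<String> command) throws Exception {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

A caller would run preprocess(...) and then ocr(...) for each image and check
that both exit codes are 0.  Since this stage is CPU-bound, it is probably
best done as a batch job ahead of indexing rather than at search time.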

James
> 
> 
> THX

-- 
James J. Wilson II
Systems Engineer
U.S. District Court
District of New Mexico
333 Lomas Blvd., NW
Albuquerque, NM 87102
Phone:  (505) 348-2081
Fax:    (505) 348-2028
