Thanks.
You mean that in a Java class I should:
(1) convert the PDF to XML
(2) extract the text into the XML file
(3) extract the images
(4) index and search using Lucene
(5) display the context in its original format

Which APIs should I use at each step?

If the system finds the result text, how does it show the context (possibly including images, tables, ...) to the user in the original format, given that extracting images and extracting text are two separate steps? Can I search for the result text in the PDF file, highlight it, and then split out that page?

Could you please send me some screenshots of your project, or the JAR, so that I can run it?

------------------ Original ------------------
From: "James Wilson"<[email protected]>
Date: Tue, Mar 8, 2011 01:52 AM
To: "users"<[email protected]>
Cc: "itext-questions"<[email protected]>; "java-user"<[email protected]>
Subject: Re: about pdf search

Cescy wrote:
> Hi,
>
> I am developing a PDF search engine, to be used on a local computer to
> search a large number of PDF documents.
>
> I used PDFBox + Lucene to index and search, and then I have to display the
> context to the user in the PDF file in the user interface. HOW CAN I ACHIEVE THIS???

I have completed a project that does exactly the same thing. I put the PDF text in XML files. Then, after I do a Lucene search, I read the text from the XML files. I do not store the text in the Lucene index, because that would bloat the index and slow down my searches.

FYI: I use PDFBox to extract the "searchable" text, and I use tesseract (OCR) to extract the text from the images within the PDFs. To make tesseract work correctly I have to use ImageMagick to make many modifications to the images first. Image modification/OCR is a slow process and it is extremely resource intensive (CPU utilization specifically, disk I/O to a lesser extent).

As for displaying the extracted text, I would use an AJAX framework that provides a nice pop-up view of the text. This pop-up should also have built-in paging.
I use Lucene's built-in highlighting of matches as well.

Oh, almost forgot: I use PDFBox to extract the images from the PDFs.

James

>
>
> THX

--
James J. Wilson II
Systems Engineer
U.S. District Court
District of New Mexico
333 Lomas Blvd., NW
Albuquerque, NM 87102
Phone: (505) 348-2081
Fax: (505) 348-2028
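
For reference, the "text in XML files, not in the Lucene index" idea from the quoted reply could be sketched roughly as below. This is only a guess at the shape of such a side store, not James's actual code: the class name PageTextStore, the <document>/<page> layout and the 1-based "number" attribute are all assumptions made for illustration. It uses only the JDK's XML classes; the page strings would come from PDFBox (or OCR), and the resulting XML would be written to one file per PDF.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical side store: page text is kept in XML instead of in the Lucene index. */
public class PageTextStore {

    /** Serializes the already-extracted text of each page into one XML string. */
    public static String toXml(String docId, List<String> pages) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("document"); // assumed element name
            root.setAttribute("id", docId);
            doc.appendChild(root);
            for (int i = 0; i < pages.size(); i++) {
                Element page = doc.createElement("page"); // assumed element name
                page.setAttribute("number", String.valueOf(i + 1)); // 1-based page number
                page.setTextContent(pages.get(i)); // special characters are escaped on output
                root.appendChild(page);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not write page text XML", e);
        }
    }

    /** Returns the text of one page (1-based), or null if that page is not in the XML. */
    public static String pageText(String xml, int pageNumber) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList pages = doc.getElementsByTagName("page");
            String wanted = String.valueOf(pageNumber);
            for (int i = 0; i < pages.getLength(); i++) {
                Element page = (Element) pages.item(i);
                if (wanted.equals(page.getAttribute("number"))) {
                    return page.getTextContent();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("could not read page text XML", e);
        }
    }
}
```

With something like this, the Lucene index would only need to carry the document id and page number for each hit (an assumption about the index layout), and the text to display or highlight would be looked up with pageText(...) after the search.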

