PDF Text Stripper

2002-07-09 Thread Ben Litchfield
any bugs or feature requests. The library can be retrieved from http://www.csh.rit.edu/~ben/projects/pdfparser/ -Ben Litchfield -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]

Re: PDF Text Stripper

2002-07-09 Thread Ben Litchfield
Can you send me the PDF document that you are having problems with and I will look into it. There are still some issues that I am working out with the spacing of characters. -Ben On Tue, 9 Jul 2002, Keith Gunn wrote: On Tue, 9 Jul 2002, Ben Litchfield wrote: Hi, I have written a PDF

Re: problems with HTML Parser

2002-08-14 Thread Ben Litchfield
Maurits, You can get a PDF parser from http://www.pdfbox.org -Ben On Wed, 14 Aug 2002, Maurits van Wijland wrote: Keith, I haven't noticed the problem with the Parser...but you trigger me by saying that you have a PDFParser!!! Are you able to contribute this PDFParser?? Maurits. --

Re: pdfbox on solaris

2002-08-28 Thread Ben Litchfield
I know that there are some memory issues with some documents. The next release of pdfbox fixes some of these. Although I am not sure why it would run differently under windows than solaris. Off the top of my head maybe the solaris JVM uses more memory per object than the windows JVM. The

IOException not a directory

2002-10-28 Thread Ben Litchfield
Has anybody seen this type of error before? This used to work and all of a sudden broke. That path is a folder. Ben Litchfield 2002-10-28 12:51:31,109 [Default] java.io.IOException: \\Finsrv04\JBoss-2.4.1_Tomcat-3.2.3\fast_generated_output\lucene\website\index not a directory 2002-10-28 12

Re: PDF Text extraction

2002-12-27 Thread Ben Litchfield
String line = null; while( (line = contentsReader.readLine() ) != null ) { System.out.println( line ); } I have not tested if this compiles but it should be pretty close. Ben Litchfield On Fri, 27 Dec 2002, Suhas Indra wrote: Hello List I am using PDFBox to index some of the PDF documents
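
The loop quoted above can be put into a compilable form. This is only a minimal sketch: the class name ReadExtractedText and the StringReader input are assumptions made for illustration, and in the original message contentsReader would wrap the text extracted from a PDF rather than a hard-coded string.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ReadExtractedText {
    // Reads every line from the reader, using the same while loop as the message.
    static List<String> readLines(Reader contents) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader contentsReader = new BufferedReader(contents)) {
            String line = null;
            while ((line = contentsReader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a fixed string stands in for extracted PDF text.
        Reader extracted = new StringReader("first line\nsecond line");
        for (String line : readLines(extracted)) {
            System.out.println(line);
        }
    }
}
```

Collecting the lines into a list instead of printing them directly makes the loop easier to reuse, for example when the text is handed on to an indexer.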

RE: OutOfMemoryException while Indexing an XML file/PdfParser

2003-02-18 Thread Ben Litchfield
I am aware of the issues with parsing certain PDF documents. I am currently working on refactoring PDFBox to deal with large documents. You will see this in the next release. I would like to thank people for feedback and sending problem documents. Ben Litchfield http://www.pdfbox.org On Tue

[ANN] PDFBox 0.6.0

2003-03-05 Thread Ben Litchfield
would fail with some pdfs with double endobj definitions -Added PDF document summary fields to the lucene document Thank you, Ben Litchfield http://www.pdfbox.org

Re: [ANN] PDFBox 0.6.0

2003-03-06 Thread Ben Litchfield
java.io.EOFException with message: Unexpected end of ZLIB input stream Eric Anderson LanRx Network Solutions Quoting Ben Litchfield [EMAIL PROTECTED]: I would like to announce the next release of PDFBox. PDFBox allows for PDF documents to be indexed using lucene through a simple interface. Please

Re: [ANN] PDFBox 0.6.0

2003-03-09 Thread Ben Litchfield
I believe this problem has been fixed with 0.6.1. Please give it a try. Ben Litchfield -- On Thu, 6 Mar 2003, Eric Anderson wrote: When it throws the exception, the indexer fails, so I cannot continue the index. It appears that it's only related to some files, as I have been able

Re: getting PDFBox O/P into a stream

2003-03-25 Thread Ben Litchfield
can be called from the command line to create an index. Ben Litchfield -- On Tue, 25 Mar 2003, Ramrakhiani, Vikas wrote: Can someone please help me with the command to get O/P from PDFBox on command line or into streams rather than dumping it into a text file. thanks, vikas

RE: out of memory

2003-04-02 Thread Ben Litchfield
It is possible that a single PDF is causing the issue. Can you track it down to that one and let me know which it is? It would be very helpful if you could send it to me as well. Ben http://www.pdfbox.org On Wed, 2 Apr 2003, Eoghan S wrote: i have tried every memory setting

Re: Lucene demo ideas?

2003-09-17 Thread Ben Litchfield
- Index text and HTML files. Any others? What, no PDF files!! Ben -- http://www.pdfbox.org

Re: Does the Lucene search engine work with PDF's?

2003-10-17 Thread Ben Litchfield
You need to be able to extract the text from them and feed that to lucene. http://www.pdfbox.org can extract text from pdf documents. Ben On Fri, 17 Oct 2003, Andre Hughes wrote: Hello, Can the Lucene search engine index and search through PDF documents? What are the file format limits for

Re: Exotic format indexing?

2003-10-30 Thread Ben Litchfield
Unfortunately, it is not quite so easy. I am not sure about Word documents but PDFs usually have their contents compressed so a raw fishing around for text would be pointless. Your best bet is to use a package like the one from textmining.org that handles various formats for you. Ben On Thu,
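
To illustrate why scanning raw bytes for text fails on compressed content, here is a small sketch using the deflate support in java.util.zip. It is not a PDF parser: the class name CompressedStreamDemo and the sample sentence are assumptions, and a real PDF adds object structure, filters and font encodings on top of the compression shown here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressedStreamDemo {
    // Compresses the input with zlib/deflate.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverses deflate(); fails if the stream is truncated or needs a dictionary.
    static byte[] inflate(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                throw new DataFormatException("truncated or unsupported stream");
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }

    // Naive "fishing around": looks for the needle in the bytes as they are stored.
    static boolean rawBytesContain(byte[] data, String needle) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) throws DataFormatException {
        // Assumption: a repeated sample sentence stands in for a page's text.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            text.append("Lucene needs plain text to index. ");
        }
        byte[] compressed = deflate(text.toString().getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("found in compressed bytes: " + rawBytesContain(compressed, "plain text"));
        System.out.println("found after inflating: " + rawBytesContain(inflate(compressed), "plain text"));
    }
}
```

The search only succeeds once the stream has been decompressed, which is the step a dedicated extraction package performs before any text can be handed to an indexer.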

Re: Missing pdf document title

2003-11-10 Thread Ben Litchfield
I would try two things. 1) Is PDFBox getting the title from the document? You can run this example to find out: java org.pdfbox.examples.pdmodel.PrintDocumentMetaData input-pdf 2) Is the lucene field getting properly set in the lucene database? I would use luke (http://www.getopt.org/luke/) to

RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Ben Litchfield
Yes, just add the log4j configuration. The easiest way to do that is as a system parameter like this java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML -create -index c:\\index .. Where log4j.xml is the path to your log4j config, PDFBox has an example one you can use. Ben

RE: Lucene refresh index function (incremental indexing).

2003-11-25 Thread Ben Litchfield
Message- From: Ben Litchfield [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 25, 2003 9:45 AM To: Lucene Users List Subject: RE: Lucene refresh index function (incremental indexing). Yes, just add the log4j configuration. The easiest way to do that is as a system parameter like

Re: Indexing japanese PDF documents

2004-03-22 Thread Ben Litchfield
Yes he did, but I was away the past couple days. As this is more of a PDFBox issue I responded in the PDFBox forums, please follow the thread there if you are interested. Ben On Mon, 22 Mar 2004, Otis Gospodnetic wrote: I have not tried these other tools yet. Have you asked Ben Litchfield

Re: Problem while Indexing Pdf files

2004-03-25 Thread Ben Litchfield
The latest release of PDFBox changed the way it dealt with fonts and introduced this bug, please try the version in CVS and let me know if you are still having a problem. Ben On Thu, 25 Mar 2004, Ankur Goel wrote: Hi, I have to index PDF files. For that I am using pdfbox. But when I try

Re: too many files open error

2004-03-26 Thread Ben Litchfield
As PDFBox is an all Java solution there is no specific linux/unix version. The source that is available with the downloaded package should suit your needs. What does the sourceforge site not provide for you? Ben On Fri, 26 Mar 2004, Charlie Smith wrote: Is there another source for the

Re: PDFBox problem.

2004-07-23 Thread Ben Litchfield
I usually use -Dlog4j.configuration=log4j.xml when invoking java from the command line, but I believe this depends on your environment. ex java -Dlog4j.configuration=log4j.xml org.pdfbox.ExtractText input.pdf Ben On Fri, 23 Jul 2004, Christiaan Fluit wrote: We invoke the following

Re: pdfbox performance.

2004-07-28 Thread Ben Litchfield
Different PDFs will exhibit different extraction speeds because of the way that PDF documents are structured. I assume you are using the latest version 0.6.6, could you give 0.6.5 a try and see if you notice faster speeds. Ben On Thu, 29 Jul 2004, Miroslaw Milewski wrote: Paul Smith wrote:

Re: PDFBox Issue

2004-08-17 Thread Ben Litchfield
PDFBox comes with log4j version 1.2.5(according to MANIFEST.MF in jar file), I believe that 1.2.8 is the latest. I will make sure that the next version of PDFBox includes the latest log4j version, which I assume is what everybody would like to use. But, by looking at the below error message it

Re: Fw: pdf search

2004-08-20 Thread Ben Litchfield
In order to search through a PDF document the text must be extracted from the PDF document. There are several libraries to do that, including http://www.pdfbox.org After you have the text from the PDF document you just add it to the lucene index like any other text document. You should go
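
The two steps described above, extracting the text and then indexing it like any other text document, can be sketched as follows. Everything here is a hypothetical stand-in: TextExtractor is not PDFBox's API and TinyIndex is not Lucene's; they only show the shape of the pipeline with a toy in-memory index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ExtractThenIndex {
    /** Hypothetical stand-in for a PDF text extraction library. */
    interface TextExtractor {
        String extractText(String documentName);
    }

    /** Toy in-memory inverted index; a stand-in for a real search index. */
    static class TinyIndex {
        private final Map<String, Set<String>> postings = new HashMap<>();

        // Splits the text into lower-cased terms and records which document holds each one.
        void add(String documentName, String text) {
            for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(documentName);
                }
            }
        }

        // Returns the names of all documents containing the term.
        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
        }
    }

    public static void main(String[] args) {
        // Assumption: canned text replaces real PDF parsing.
        TextExtractor extractor = name -> "Quarterly report on search engines";
        TinyIndex index = new TinyIndex();
        index.add("report.pdf", extractor.extractText("report.pdf"));
        System.out.println(index.search("search"));
    }
}
```

The index never sees the PDF itself, only the extracted string, which is why the choice of extraction library can be made independently of the search engine.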

Re: integration of lucene with pdfbox

2004-08-23 Thread Ben Litchfield
If you can use lucene on its own then you already know how to add a lucene Document to the index. So you need to be able to take a PDF and get a lucene Document. org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument() does that for you. Ben On Mon, 23 Aug 2004, Santosh wrote: I

Re: pdf in Chinese

2004-09-08 Thread Ben Litchfield
This appears to be more of a PDFBox issue than a lucene issue, please post an issue to the PDFBox site. Also note, that because of certain encodings that a PDF writer can use, it is impossible to extract text from all PDF documents. Ben On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote: it is not

PDF-Text Performance comparison

2004-09-08 Thread Ben Litchfield
On Wed, 8 Sep 2004, Chas Emerick wrote: PDFTextStream: fast PDF text extraction for Java applications http://snowtide.com/home/PDFTextStream/ For those that have not seen, snowtide.com has done a performance comparison against several Java PDF-Text libraries, including Snowtide's

Re: Out of memory in lucene 1.4.1 when re-indexing large number of documents

2004-09-10 Thread Ben Litchfield
I can say that gc is not collecting these objects since I forced gc runs when indexing every now and then (when parsing pdf-type objects, that is): No effect. What PDF parser are you using? Is the problem within the parser and not lucene? Are you releasing all resources? Ben

Re: Google Desktop Could be Better

2004-10-16 Thread Ben Litchfield
that the capability of indexing PDF documents would outweigh the extra time for the download. Ben On Sat, 16 Oct 2004, Bill Tschumy wrote: On Oct 16, 2004, at 9:47 PM, Ben Litchfield wrote: types. It uses Lucene underneath. I'm thinking about extending it in the direction that Google Desktop

RE: Highlighting PDF file after the search

2004-09-27 Thread Ben Litchfield
With some work this is possible with PDFBox. PDFBox extracts text with positioning and sizing. When the text was found you could add to the page content stream the drawing of a highlighted box. PDFBox has an open RFE for this functionality, please monitor it for progress.

Re: Need advice: what pdf lib to use?

2004-10-22 Thread Ben Litchfield
Please post any PDFBox issues you notice on the PDFBox sourceforge bug list, if possible attach/email any problem PDFs that you encounter. There are some efforts underway to improve the speed of PDFBox, you can monitor the progress at

Re: Need advice: what pdf lib to use?

2004-10-25 Thread Ben Litchfield
PDFBox does not 'stumble' when it gives that message, that is correct functionality if that permission is not allowed. If your company is willing to pay a 'fortune' why not sponsor a change to an open source project for half a fortune. Ben http://www.pdfbox.org On Mon, 25 Oct 2004 [EMAIL

Re: Need advice: what pdf lib to use?

2004-10-25 Thread Ben Litchfield
security set. In short, if You also could implement this uncorrect functionality the closed source guys did, it would be really great! As far as sponsoring is concerned I would be ready to hack (or at least to try) it even for 1/3 of that fortune:))) J. Ben Litchfield [EMAIL PROTECTED

Re: PDF Index Time

2004-11-18 Thread Ben Litchfield
PDFBox is slow, there is an open issue for it on the sourceforge site and I am actively working on improving speed and should see significant improvements in the next release. I have not extensively tried the snowtide package but they have a trial download and the docs show that it should be

.NET Version of Lucene

2004-12-06 Thread Ben Litchfield
I know there has been talk about a .NET version of lucene. I have been looking into doing something similar for PDFBox and came across a project called IKVM http://www.ikvm.net/ I don't believe it has been mentioned on this list. It is a little different approach than what people have been

Re: C# Ports

2004-12-15 Thread Ben Litchfield
I have created a DLL from the lucene jars for use in the PDFBox project. It uses IKVM(http://www.ikvm.net) to create a DLL from a jar. The binary version can be found here http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip This includes the ant script used to

Re: FOP Generated PDF and PDFBox

2005-01-21 Thread Ben Litchfield
Are you indexing the FOP PDFs differently than other PDF documents? Can I assume that you are using PDFBox's LucenePDFDocument.getDocument() method? Ben On Fri, 21 Jan 2005, Luke Shannon wrote: Hello; Our CMS now allows users to create PDF documents (uses FOP) and then search them. I

Re: Use an executable from java ...

2005-01-31 Thread Ben Litchfield
I will assume you are asking this question on the lucene mailing list because you now want to index that PDF document. Have you tried PDFBox? It can't create an html file for you but it can extract text. Ben http://www.pdfbox.org On Mon, 31 Jan 2005, Bertrand VENZAL wrote: Hi all, I ve

Re: Investingating Lucene For Project

2005-03-01 Thread Ben Litchfield
See inlined comments below. We have had requests from some clients who would like the ability to index PDF files, now and possibly other text files in the future. The PDF files live on a server and are in a structured environment. I would like to somehow index the content inside the PDF and