any bugs or feature requests.
The library can be retrieved from
http://www.csh.rit.edu/~ben/projects/pdfparser/
-Ben Litchfield
--
To unsubscribe, e-mail: mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Can you send me the PDF document that you are having problems with so
that I can look into it?
There are still some issues that I am working out with the spacing of
characters.
-Ben
On Tue, 9 Jul 2002, Keith Gunn wrote:
On Tue, 9 Jul 2002, Ben Litchfield wrote:
Hi,
I have written a PDF
Maurits,
You can get a PDF parser from http://www.pdfbox.org
-Ben
On Wed, 14 Aug 2002, Maurits van Wijland wrote:
Keith,
I haven't noticed the problem with the Parser...but you trigger me
by saying that you have a PDFParser!!!
Are you able to contribute this PDFParser??
Maurits.
--
I know that there are some memory issues with some documents. The next
release of pdfbox fixes some of these. I am not sure, though, why it
would run differently under Windows than under Solaris. Off the top of my
head, maybe the Solaris JVM uses more memory per object than the Windows
JVM.
The
Has anybody seen this type of error before? This used to work and then
suddenly broke. That path is a folder.
Ben Litchfield
2002-10-28 12:51:31,109 [Default] java.io.IOException:
\\Finsrv04\JBoss-2.4.1_Tomcat-3.2.3\fast_generated_output\lucene\website\index
not a directory
2002-10-28 12
String line = null;
while( (line = contentsReader.readLine()) != null )
{
    System.out.println( line );
}
I have not tested if this compiles but it should be pretty close.
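For reference, here is a self-contained sketch of the loop above. It
assumes contentsReader is a java.io.BufferedReader wrapped around the
extracted text. The class name ReadLinesSketch, the readAllLines helper,
and the StringReader sample input are hypothetical stand-ins and are not
part of PDFBox.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class ReadLinesSketch {
    // Collects every line from the reader using the same readLine() loop.
    // Hypothetical helper: the original snippet printed lines directly.
    static List<String> readAllLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader contentsReader = new BufferedReader(source)) {
            String line;
            while ((line = contentsReader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // The StringReader stands in for the extracted document text.
        Reader sample = new StringReader("first line\nsecond line");
        for (String line : readAllLines(sample)) {
            System.out.println(line);
        }
    }
}
```

Any Reader can be passed in place of the StringReader, so the same loop
works whether the text comes from a file or from memory.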
Ben Litchfield
On Fri, 27 Dec 2002, Suhas Indra wrote:
Hello List
I am using PDFBox to index some of the PDF documents
I am aware of the issues with parsing certain PDF documents. I am
currently working on refactoring PDFBox to deal with large documents. You
will see this in the next release. I would like to thank people for their
feedback and for sending problem documents.
Ben Litchfield
http://www.pdfbox.org
On Tue
would fail with some pdfs with double endobj
definitions
-Added PDF document summary fields to the lucene document
Thank you,
Ben Litchfield
http://www.pdfbox.org
java.io.EOFException
with message: Unexpected end of ZLIB input stream
Eric Anderson
LanRx Network Solutions
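As background on the message itself: the JDK raises this exact
EOFException when a zlib stream ends before the compressed data is
complete. The sketch below reproduces that with the standard java.util.zip
classes only. It is an illustration under that assumption and does not
claim to show what PDFBox was doing in the report above; the class and
method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class TruncatedZlibSketch {
    // Compresses the data into a complete zlib stream.
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Reads the stream to its end; returns true if an EOFException is raised.
    static boolean failsWithEof(byte[] compressed) {
        try (InputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[256];
            while (in.read(buffer) != -1) {
                // Discard the inflated bytes; only the outcome matters here.
            }
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] full = compress(
            "sample text, sample text, sample text".getBytes(StandardCharsets.UTF_8));
        // Cut the stream in half to simulate a damaged or incomplete stream.
        byte[] cut = Arrays.copyOf(full, full.length / 2);
        System.out.println("complete stream fails: " + failsWithEof(full));
        System.out.println("truncated stream fails: " + failsWithEof(cut));
    }
}
```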
Quoting Ben Litchfield [EMAIL PROTECTED]:
I would like to announce the next release of PDFBox. PDFBox allows for
PDF documents to be indexed using lucene through a simple interface.
Please
I believe this problem has been fixed with 0.6.1. Please give it a try.
Ben Litchfield
--
On Thu, 6 Mar 2003, Eric Anderson wrote:
When it throws the exception, the indexer fails, so I cannot continue the index.
It appears that it's only related to some files, as I have been able
can be called from the command line to create an
index.
Ben Litchfield
--
On Tue, 25 Mar 2003, Ramrakhiani, Vikas wrote:
Can someone please help me with the command to get output from PDFBox on
the command line or into streams rather than dumping it into a text file?
thanks,
vikas
It is possible that a single PDF is causing the issue. Can you track it
down and let me know which one it is? It would be very helpful if you
could send it to me as well.
Ben
http://www.pdfbox.org
On Wed, 2 Apr 2003, Eoghan S wrote:
I have tried every memory setting
- Index text and HTML files. Any others?
What, no PDF files!!
Ben
--
http://www.pdfbox.org
You need to be able to extract the text from them and feed that to lucene.
http://www.pdfbox.org can extract text from PDF documents.
Ben
On Fri, 17 Oct 2003, Andre Hughes wrote:
Hello,
Can the Lucene search engine index and search through PDF documents?
What are the file format limits for
Unfortunately, it is not quite so easy. I am not sure about Word
documents, but PDFs usually have their contents compressed, so fishing
around in the raw bytes for text would be pointless. Your best bet is to
use a package like the one from textmining.org that handles various
formats for you.
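To illustrate why a raw byte search fails, here is a minimal sketch using
only java.util.zip. It assumes the content is compressed with
zlib/deflate, which is what PDF's Flate filter uses; real PDFs may use
other filters as well. The class, the helper names, and the sample
sentence are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class CompressedTextSketch {
    // Hypothetical sample text: one sentence repeated so that it compresses well.
    static byte[] sampleText() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            sb.append("Lucene can index this sentence. ");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    // Compresses bytes with zlib/deflate.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompresses a complete zlib stream back into the original bytes.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // incomplete input: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Naive byte search, i.e. "fishing around" in the raw bytes.
    static boolean containsBytes(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] text = sampleText();
        byte[] word = "Lucene".getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(text);
        System.out.println("word found in plain text: " + containsBytes(text, word));
        System.out.println("word found in compressed bytes: " + containsBytes(packed, word));
        System.out.println("word found after inflating: " + containsBytes(inflate(packed), word));
    }
}
```

The text only becomes searchable again after the stream is decoded,
which is the step a PDF library performs before handing text to an
indexer.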
Ben
On Thu,
I would try two things.
1) Is PDFBox getting the title from the document?
You can run this example to find out:
java org.pdfbox.examples.pdmodel.PrintDocumentMetaData input-pdf
2) Is the lucene field getting properly set in the lucene database? I
would use luke (http://www.getopt.org/luke/) to
Yes, just add the log4j configuration. The easiest way to do that is as a
system parameter like this:
java -Dlog4j.configuration=log4j.xml org.apache.lucene.demo.IndexHTML
-create -index c:\\index ..
Here log4j.xml is the path to your log4j config. PDFBox has an example
one you can use.
Ben
Message-
From: Ben Litchfield [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 25, 2003 9:45 AM
To: Lucene Users List
Subject: RE: Lucene refresh index function (incremental indexing).
Yes, just add the log4j configuration. The easiest way to do that is as a
system parameter like
Yes he did, but I was away the past couple of days. As this is more of a
PDFBox issue, I responded in the PDFBox forums. Please follow the thread
there if you are interested.
Ben
On Mon, 22 Mar 2004, Otis Gospodnetic wrote:
I have not tried these other tools yet.
Have you asked Ben Litchfield
The latest release of PDFBox changed the way it dealt with fonts and
introduced this bug. Please try the version in CVS and let me know if you
are still having a problem.
Ben
On Thu, 25 Mar 2004, Ankur Goel wrote:
Hi,
I have to index PDF files. For that I am using pdfbox. But when I try
As PDFBox is an all Java solution there is no specific linux/unix version.
The source that is available with the downloaded package should suit your
needs. What does the sourceforge site not provide for you?
Ben
On Fri, 26 Mar 2004, Charlie Smith wrote:
Is there another source for the
I usually use -Dlog4j.configuration=log4j.xml when invoking java from
the command line, but I believe this depends on your environment.
For example:
java -Dlog4j.configuration=log4j.xml org.pdfbox.ExtractText input.pdf
Ben
On Fri, 23 Jul 2004, Christiaan Fluit wrote:
We invoke the following
Different PDFs will exhibit different extraction speeds because of the way
that PDF documents are structured.
I assume you are using the latest version, 0.6.6. Could you give 0.6.5 a
try and see if you notice faster speeds?
Ben
On Thu, 29 Jul 2004, Miroslaw Milewski wrote:
Paul Smith wrote:
PDFBox comes with log4j version 1.2.5 (according to MANIFEST.MF in the jar
file); I believe that 1.2.8 is the latest. I will make sure that the next
version of PDFBox includes the latest log4j version, which I assume is
what everybody would like to use.
But, by looking at the below error message it
In order to search through a PDF document, the text must first be
extracted from it. There are several libraries to do that, including
http://www.pdfbox.org. After you have the text from the PDF document, you
just add it to the lucene index like any other text document. You should
go
If you can use lucene on its own then you already know how to add a lucene
Document to the index. So you need to be able to take a PDF and get a
lucene Document.
org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument()
does that for you.
Ben
On Mon, 23 Aug 2004, Santosh wrote:
I
This appears to be more of a PDFBox issue than a lucene issue. Please post
an issue to the PDFBox site.
Also note that, because of certain encodings that a PDF writer can use, it
is impossible to extract text from all PDF documents.
Ben
On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote:
it is not
On Wed, 8 Sep 2004, Chas Emerick wrote:
PDFTextStream: fast PDF text extraction for Java applications
http://snowtide.com/home/PDFTextStream/
For those who have not seen it, snowtide.com has done a performance
comparison against several Java PDF-Text libraries, including Snowtide's
I can say that gc is not collecting these objects since I forced gc
runs when indexing every now and then (when parsing pdf-type objects,
that is): No effect.
What PDF parser are you using? Is the problem within the parser and not
lucene? Are you releasing all resources?
Ben
that
the capability of indexing PDF documents would outweigh the extra time for
the download.
Ben
On Sat, 16 Oct 2004, Bill Tschumy wrote:
On Oct 16, 2004, at 9:47 PM, Ben Litchfield wrote:
types. It uses Lucene underneath. I'm thinking about extending it in
the direction that Google Desktop
With some work this is possible with PDFBox. PDFBox extracts text with
positioning and sizing. Once the text is found, you could add the drawing
of a highlighted box to the page content stream.
PDFBox has an open RFE for this functionality. Please monitor it for
progress.
Please post any PDFBox issues you notice on the PDFBox sourceforge bug
list. If possible, attach or email any problem PDFs that you encounter.
There are some efforts underway to improve the speed of PDFBox; you can
monitor the progress at
PDFBox does not 'stumble' when it gives that message. That is correct
functionality if that permission is not allowed.
If your company is willing to pay a 'fortune', why not sponsor a change to
an open source project for half a fortune?
Ben
http://www.pdfbox.org
On Mon, 25 Oct 2004 [EMAIL
security set.
In short, if you could also implement this incorrect functionality that
the closed source guys did, it would be really great!
As far as sponsoring is concerned, I would be ready to hack (or at least
to try) it even for 1/3 of that fortune:)))
J.
Ben Litchfield [EMAIL PROTECTED
PDFBox is slow. There is an open issue for it on the sourceforge site, and
I am actively working on improving speed. You should see significant
improvements in the next release.
I have not extensively tried the snowtide package but they have a trial
download and the docs show that it should be
I know there has been talk about a .NET version of lucene. I have been
looking into doing something similar for PDFBox and came across a project
called IKVM (http://www.ikvm.net/). I don't believe it has been mentioned
on this list.
It is a little different approach than what people have been
I have created a DLL from the lucene jars for use in the PDFBox project.
It uses IKVM (http://www.ikvm.net) to create a DLL from a jar.
The binary version can be found here:
http://www.csh.rit.edu/~ben/projects/pdfbox/nightly-release/PDFBox-.NET-0.7.0-dev.zip
This includes the ant script used to
Are you indexing the FOP PDFs differently than other PDF documents?
Can I assume that you are using PDFBox's LucenePDFDocument.getDocument()
method?
Ben
On Fri, 21 Jan 2005, Luke Shannon wrote:
Hello;
Our CMS now allows users to create PDF documents (uses FOP) and then search
them.
I
I will assume you are asking this question on the lucene mailing list
because you now want to index that PDF document.
Have you tried PDFBox? It can't create an html file for you but it can
extract text.
Ben
http://www.pdfbox.org
On Mon, 31 Jan 2005, Bertrand VENZAL wrote:
Hi all,
I've
See inlined comments below.
We have had requests from some clients who would like the ability to
index PDF files now, and possibly other text files in the future. The
PDF files live on a server and are in a structured environment. I would
like to somehow index the content inside the PDF and