Re: Use an executable from java ...
Kristian,

I assume all of your comments refer to version 0.7.0 of PDFBox. That version brought some great improvements in speed and accuracy.

> That's curious, because we found that pdftotext was able to
> convert 33% more PDF documents than PDFBox.

Results will differ depending on the set of PDF documents. I welcome bug reports (if they don't already exist) on the 33% that are not working for you. In particular, PDFBox needs some work on non-English languages.

> That's good. Our application supports alternative conversion pipelines
> that provide fallback mechanisms. If the first converter cannot convert a
> document, a second converter is called. So PDFBox is our fallback
> converter.

Well, at least PDFBox made it as the "fallback". :)

Ben
http://www.pdfbox.org

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: Use an executable from java ...
Hi Christiaan,

> Just to defend PDFBox: we actually recently decided to move in the
> opposite direction.

I didn't want to offend PDFBox *g*

> We just removed pdftotext from our application and are now using PDFBox
> 0.7.0 for all our PDF processing. Before, we were using them both in
> parallel: pdftotext for fast text extraction and PDFBox for all metadata
> such as titles, authors, etc.

pdftotext is able to produce HTML output which contains this metadata as well. Converting from PDF to HTML and parsing the HTML is (in our tests) still twice as fast as PDFBox.

> Upon closer inspection of the output, we also saw that pdftotext was not
> able to extract text from a significant number of PDFs (9 out of 113
> documents, all perfectly readable PDF documents) while PDFBox performed
> flawlessly. For us, quality is of greater concern than speed.

That's curious, because we found that pdftotext was able to convert 33% more PDF documents than PDFBox.

> Finally, I must say that the speed and quality of Ben's replies to bug
> reports and suggestions is very impressive, giving us confidence that
> future problems will be handled satisfactorily.

That's good. Our application supports alternative conversion pipelines that provide fallback mechanisms. If the first converter cannot convert a document, a second converter is called. So PDFBox is our fallback converter.

Greetings
Kristian

--
ACRONYM: Acronym Causing Recursion, Obviously Numbing Your Mind
Kristian Hermsdorf
Interface Projects GmbH
Tolkewitzer Straße 49
01277 Dresden
tel.: ++49-351-3 18 09 39
mail: [EMAIL PROTECTED]
priv: [EMAIL PROTECTED]
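
The fallback mechanism described in this message could be sketched roughly as follows. This is a minimal illustration only, not the application's actual code: the class `FallbackPipeline`, the interface `TextConverter`, and the rule "an exception or empty result means the converter failed" are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of a conversion pipeline with fallback; all names are illustrative. */
class FallbackPipeline {

    /** A converter returns the extracted text, or throws if it cannot handle the document. */
    interface TextConverter {
        String convert(String path) throws Exception;
    }

    private final List<TextConverter> converters;

    /** Converters are tried in the order given, so the preferred one goes first. */
    FallbackPipeline(TextConverter... converters) {
        this.converters = Arrays.asList(converters);
    }

    /** Returns the first non-empty result, or an empty Optional if every converter failed. */
    Optional<String> convert(String path) {
        for (TextConverter converter : converters) {
            try {
                String text = converter.convert(path);
                if (text != null && !text.isEmpty()) {
                    return Optional.of(text);
                }
            } catch (Exception e) {
                // Assumption: any exception means "could not convert"; try the next converter.
            }
        }
        return Optional.empty();
    }
}
```

With this shape, a pdftotext-backed converter would be passed first and a PDFBox-backed one second; swapping the order reverses which tool acts as the fallback.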
Re: Use an executable from java ...
Kristian Hermsdorf wrote:

> We're using pdftotext as well, because PDFBox is really slow. If your
> application has to run under Windows, you will probably experience some
> mysterious Java VM crashes while executing external processes in batch
> mode. (This is because of a bug in the Windows VM... we implemented our
> own Process with JNI to compensate for this bug.)

Just to defend PDFBox: we actually recently decided to move in the opposite direction. We just removed pdftotext from our application and are now using PDFBox 0.7.0 for all our PDF processing. Before, we were using them both in parallel: pdftotext for fast text extraction and PDFBox for all metadata such as titles, authors, etc.

One reason for this is that with version 0.7.0 the difference in performance was only marginal on our test set of 113 PDF documents from various sources. Of course the difference will be bigger when you are only extracting text, because in the old situation we had to let two tools process the same file.

Upon closer inspection of the output, we also saw that pdftotext was not able to extract text from a significant number of PDFs (9 out of 113 documents, all perfectly readable PDF documents) while PDFBox performed flawlessly. For us, quality is of greater concern than speed.

Finally, I must say that the speed and quality of Ben's replies to bug reports and suggestions is very impressive, giving us confidence that future problems will be handled satisfactorily.

Regards,

Chris
Re: Use an executable from java ...
> Hi
>
> I've a kind of problem to execute a converting tool to modify a pdf to
> an html under Linux. In fact, I have an executable "pdftohtml" which
> works correctly in batch mode, and when I want to use it through Java
> under Windows 2000 it works also, BUT it does not work at all on the
> server under Linux. I'm using the following code

You've got to read the process's stdout and stderr while the process is running. If you don't read those streams, the process will block after it has written some (about 8k) bytes to its stdout/stderr.

We're using pdftotext as well, because PDFBox is really slow. If your application has to run under Windows, you will probably experience some mysterious Java VM crashes while executing external processes in batch mode. (This is because of a bug in the Windows VM... we implemented our own Process with JNI to compensate for this bug.)

Greetings,
Kristian
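
The advice about reading stdout and stderr while the process runs can be sketched as below. This is a minimal sketch under stated assumptions, not code from the thread: the names `ProcessRunner`, `Result`, and `Drainer` are made up for illustration, and the output is simply buffered in memory using the platform default charset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

/** Hypothetical helper that drains a child's stdout and stderr while it runs. */
class ProcessRunner {

    /** Exit code plus everything the process wrote to stdout and stderr. */
    static final class Result {
        final int exitCode;
        final String stdout;
        final String stderr;

        Result(int exitCode, String stdout, String stderr) {
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }
    }

    /** Copies one stream into memory on its own thread so the child never blocks on a full pipe. */
    private static final class Drainer extends Thread {
        private final InputStream in;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        Drainer(InputStream in) {
            this.in = in;
        }

        @Override
        public void run() {
            byte[] chunk = new byte[4096];
            try {
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            } catch (IOException e) {
                // Stream closed early; keep whatever was read so far.
            }
        }

        String text() {
            return buffer.toString(); // platform default charset (an assumption of this sketch)
        }
    }

    /** Starts the command, drains both streams concurrently, then waits for the exit code. */
    static Result run(List<String> command) throws IOException, InterruptedException {
        Process proc = new ProcessBuilder(command).start();
        proc.getOutputStream().close(); // the child gets no stdin
        Drainer out = new Drainer(proc.getInputStream());
        Drainer err = new Drainer(proc.getErrorStream());
        out.start();
        err.start();
        int exitCode = proc.waitFor();
        out.join();
        err.join();
        return new Result(exitCode, out.text(), err.text());
    }
}
```

A call such as `ProcessRunner.run(Arrays.asList("pdftohtml", "in.pdf", "out"))` (file names hypothetical) would then return once the tool exits, regardless of how much it printed. Two threads are used because reading one stream to the end before touching the other can still deadlock if the child fills the unread pipe.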
Re: Use an executable from java ...
Check out http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html which provides some pointers and code that should be helpful.

Cheers,
Kelvin
http://www.supermind.org

On Mon, 31 Jan 2005 19:01:11 +0100, Bertrand VENZAL wrote:
> Hi all,
>
> I've a kind of problem to execute a converting tool to modify a pdf
> to an html under Linux. In fact, I have an executable "pdftohtml"
> which works correctly in batch mode, and when I want to use it
> through Java under Windows 2000 it works also, BUT it does not work
> at all on the server under Linux. I'm using the following code.
>
> scommand = "/bin/sh -c \"myCommand fileName output\" ";
>
> Runtime runtime = Runtime.getRuntime();
> Process proc = runtime.exec(scommand);
> proc.waitFor();
>
>
> I'm running my code under Linux-redhat with a classic shell. Is
> there another way to do the same thing, or maybe am I missing
> something? Any help will be greatly appreciated.
>
> Thanks
> Bertrand
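
One thing worth checking in the quoted code, as a possible contributor rather than a confirmed diagnosis: `Runtime.exec(String)` splits the command on whitespace with a `StringTokenizer` and does not interpret quotes, so the quoted command line never reaches `/bin/sh` as a single argument. The sketch below shows that splitting and an explicit argument array as an alternative. The class and method names (`ExecArguments`, `tokenizeLikeRuntimeExec`, `shellCommand`, `runInShell`) are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical helper contrasting string-based and array-based command invocation. */
class ExecArguments {

    /** Mirrors the whitespace-only splitting documented for Runtime.exec(String). */
    static List<String> tokenizeLikeRuntimeExec(String command) {
        List<String> args = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(command);
        while (tokens.hasMoreTokens()) {
            args.add(tokens.nextToken());
        }
        return args;
    }

    /** Builds an explicit argument vector so the shell receives the command line as one argument. */
    static String[] shellCommand(String commandLine) {
        return new String[] {"/bin/sh", "-c", commandLine};
    }

    /** Runs the command line through /bin/sh and returns its exit code. */
    static int runInShell(String commandLine) throws IOException, InterruptedException {
        Process proc = new ProcessBuilder(shellCommand(commandLine))
                .inheritIO() // share the parent's stdout/stderr so no pipe can fill up
                .start();
        return proc.waitFor();
    }
}
```

For the command string in the quoted message, `tokenizeLikeRuntimeExec` yields five tokens, with `"myCommand` and `output"` still carrying their quote characters, whereas `shellCommand("myCommand fileName output")` yields exactly three arguments.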
Re: Use an executable from java ...
I will assume you are asking this question on the Lucene mailing list because you now want to index that PDF document. Have you tried PDFBox? It can't create an HTML file for you, but it can extract text.

Ben
http://www.pdfbox.org

On Mon, 31 Jan 2005, Bertrand VENZAL wrote:
> Hi all,
>
> I've a kind of problem to execute a converting tool to modify a pdf to an
> html under Linux. In fact, I have an executable "pdftohtml" which works
> correctly in batch mode, and when I want to use it through Java under
> Windows 2000 it works also, BUT it does not work at all on the server under
> Linux. I'm using the following code.
>
> scommand = "/bin/sh -c \"myCommand fileName output\" ";
>
> Runtime runtime = Runtime.getRuntime();
> Process proc = runtime.exec(scommand);
> proc.waitFor();
>
>
> I'm running my code under Linux-redhat with a classic shell.
> Is there another way to do the same thing, or maybe am I missing something?
> Any help will be greatly appreciated.
>
> Thanks
> Bertrand