Re: Use an executable from java ...

2005-02-08 Thread Ben Litchfield

Kristian,

I assume all of your comments refer to the 0.7.0 version of PDFBox.  There
were some great improvements in that version in terms of speed and
accuracy.

> That's curious, because in our experience pdftotext was able to
> convert 33% more PDF documents than PDFBox.

Depending on the set of PDF documents you will notice different results.
I welcome any bug reports (if they don't already exist) on that 33% that
are not working for you.  In particular, PDFBox needs some work on
non-English languages.


> That's good. Our application supports alternative conversion pipelines
> that provide fallback mechanisms. If the first converter cannot convert a
> document, a second converter is called. So PDFBox is our fallback
> converter.


Well, at least PDFBox made it as the "fallback".  :)

Ben
http://www.pdfbox.org

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Use an executable from java ...

2005-02-08 Thread Kristian Hermsdorf

Hi Christiaan,

> Just to defend PDFBox: we actually recently decided to move in the
> opposite direction.

I didn't want to offend PDFBox *g*

> We just removed pdftotext from our application and are now using PDFBox
> 0.7.0 for all our PDF processing. Before we were using them both in
> parallel: pdftotext for fast text extraction and PDFBox for all metadata
> such as titles, authors, etc.

pdftotext is able to produce HTML output which contains this metadata as
well. Conversion from PDF to HTML and parsing the HTML is (in our tests)
still twice as fast as PDFBox.

> Upon closer inspection of the output, we also saw that pdftotext was not
> able to extract text from a significant number of PDFs (9 out of 113
> documents, all perfectly readable PDF documents) while PDFBox performed
> flawlessly. For us, quality is of greater concern than speed.

That's curious, because in our experience pdftotext was able to convert
33% more PDF documents than PDFBox.

> Finally, I must say that the speed and quality of Ben's replies to bug
> reports and suggestions are very impressive, giving us confidence that
> future problems will be handled satisfactorily.

That's good. Our application supports alternative conversion pipelines that
provide fallback mechanisms. If the first converter cannot convert a
document, a second converter is called. So PDFBox is our fallback converter.
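
The fallback arrangement described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `TextConverter` and `FallbackConverter` are hypothetical names, not part of PDFBox, pdftotext, or the application discussed in this thread, and treating empty output as a failure is an assumption as well.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical interface: one way of turning a PDF file into plain text. */
interface TextConverter {
    String convert(String path) throws Exception;
}

/** Tries each converter in order and returns the first usable result. */
final class FallbackConverter implements TextConverter {
    private final List<TextConverter> converters;

    FallbackConverter(TextConverter... converters) {
        this.converters = Arrays.asList(converters);
    }

    public String convert(String path) throws Exception {
        Exception last = null;
        for (TextConverter converter : converters) {
            try {
                String text = converter.convert(path);
                // Assumption: null or blank output counts as a failed conversion.
                if (text != null && text.trim().length() > 0) {
                    return text;
                }
            } catch (Exception e) {
                last = e; // remember the failure and try the next converter
            }
        }
        throw new Exception("No converter could handle " + path, last);
    }
}
```

A pipeline with pdftotext first and PDFBox second would then be built as `new FallbackConverter(first, second)`, where each argument wraps one of the two tools.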
Greetings
Kristian
--
ACRONYM: Acronym Causing Recursion, Obviously Numbing Your Mind  
Kristian Hermsdorf
Interface Projects GmbH
Tolkewitzer Straße  49  
01277 Dresden   
tel.: ++49-351-3 18 09 39
mail: [EMAIL PROTECTED]
priv: [EMAIL PROTECTED]


Re: Use an executable from java ...

2005-02-08 Thread Christiaan Fluit

Kristian Hermsdorf wrote:

> We're using pdftotext as well, because PDFBox is really slow. If your
> application should work under Windows you will probably experience some
> mysterious Java VM crashes while executing external processes in batch
> mode. (This is because of a bug in the Windows VM... we implemented our
> own Process with JNI to compensate for this bug.)

Just to defend PDFBox: we actually recently decided to move in the
opposite direction.

We just removed pdftotext from our application and are now using PDFBox 
0.7.0 for all our PDF processing. Before we were using them both in 
parallel: pdftotext for fast text extraction and PDFBox for all metadata 
such as titles, authors, etc.

One reason for this is that with version 0.7.0 the difference in
performance was only marginal on our test set of 113 PDF documents from
various sources. Of course the difference will be bigger when you are
only extracting text, because in the old situation we had to let two
tools process the same file.

Upon closer inspection of the output, we also saw that pdftotext was not
able to extract text from a significant number of PDFs (9 out of 113
documents, all perfectly readable PDF documents) while PDFBox performed
flawlessly. For us, quality is of greater concern than speed.

Finally, I must say that the speed and quality of Ben's replies to bug
reports and suggestions are very impressive, giving us confidence that
future problems will be handled satisfactorily.

Regards,
Chris


Re: Use an executable from java ...

2005-02-08 Thread Kristian Hermsdorf

Hi,

> I have a problem executing a conversion tool that turns a PDF into
> HTML under Linux. I have an executable, "pdftohtml", which works
> correctly in batch mode, and it also works when I call it through Java
> under Windows 2000, BUT it does not work at all on the server under
> Linux. I'm using the following code

You've got to read the process's stdout and stderr while the process is
running. If you don't read those streams, the process will block after it
has written some (about 8k) bytes to its stdout/stderr.

We're using pdftotext as well, because PDFBox is really slow. If your
application should work under Windows you will probably experience some
mysterious Java VM crashes while executing external processes in batch
mode. (This is because of a bug in the Windows VM... we implemented our
own Process with JNI to compensate for this bug.)
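
The advice above about reading stdout and stderr while the process runs might look like the following minimal sketch. `ProcessRunner`, `Result`, and `Drainer` are hypothetical names invented for this example, and it assumes the output is small enough to hold in memory. Each stream is copied on its own thread so the child process never stalls on a full pipe buffer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: runs a command and drains both of its output streams. */
final class ProcessRunner {

    /** Exit code plus everything the process wrote. */
    static final class Result {
        final int exitCode;
        final String stdout;
        final String stderr;

        Result(int exitCode, String stdout, String stderr) {
            this.exitCode = exitCode;
            this.stdout = stdout;
            this.stderr = stderr;
        }
    }

    /** Copies one stream into memory on its own thread. */
    private static final class Drainer extends Thread {
        private final InputStream in;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        Drainer(InputStream in) {
            this.in = in;
        }

        @Override
        public void run() {
            byte[] chunk = new byte[4096];
            try {
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
            } catch (IOException e) {
                // Stream closed early; keep whatever was read so far.
            }
        }

        String text() {
            return buffer.toString(); // platform default charset
        }
    }

    static Result run(String... command) throws IOException, InterruptedException {
        Process proc = Runtime.getRuntime().exec(command);
        proc.getOutputStream().close(); // this sketch sends no input to the child
        Drainer out = new Drainer(proc.getInputStream());
        Drainer err = new Drainer(proc.getErrorStream());
        out.start();
        err.start();
        int exitCode = proc.waitFor(); // safe now: both pipes are being emptied
        out.join();
        err.join();
        return new Result(exitCode, out.text(), err.text());
    }
}
```

A call such as `ProcessRunner.run("/bin/sh", "-c", "myCommand fileName output")` would then return the exit code together with whatever the tool printed, which also helps when diagnosing why a command fails on one platform only.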
Greetings,
Kristian


Re: Use an executable from java ...

2005-01-31 Thread Kelvin Tan
Check out http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html 
which provides some pointers and code which should be helpful.

Cheers,
Kelvin
http://www.supermind.org

On Mon, 31 Jan 2005 19:01:11 +0100, Bertrand VENZAL wrote:
> Hi all,
>
> I have a problem executing a conversion tool that turns a PDF into
> HTML under Linux. I have an executable, "pdftohtml", which works
> correctly in batch mode, and it also works when I call it through
> Java under Windows 2000, BUT it does not work at all on the server
> under Linux. I'm using the following code.
>
> scommand = "/bin/sh -c \"myCommand fileName output\" ";
>
> Runtime runtime = Runtime.getRuntime();
> Process proc = runtime.exec(scommand);
> proc.waitFor();
>
>
> I'm running my code under Red Hat Linux with a classic shell. Is
> there another way to do the same thing, or am I missing
> something? Any help will be greatly appreciated.
>
> Thanks
> Bertrand
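
One possible explanation for the failure of the code quoted above, offered as a guess rather than a diagnosis: `Runtime.exec(String)` splits its argument on whitespace only (it uses a `StringTokenizer`) and gives quote characters no special meaning, so `/bin/sh -c` would receive `"myCommand` as its command instead of the whole quoted string. The sketch below imitates that splitting to show the resulting tokens, and builds the array form that hands the shell a single command line. `ExecTokens` and its methods are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical helper showing how a single command string gets split. */
final class ExecTokens {

    /** Mimics the whitespace-only splitting applied by Runtime.exec(String). */
    static List<String> split(String command) {
        List<String> tokens = new ArrayList<String>();
        StringTokenizer tokenizer = new StringTokenizer(command);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    /** Array form: the shell receives the whole command line as one argument. */
    static String[] shellCommand(String commandLine) {
        return new String[] {"/bin/sh", "-c", commandLine};
    }
}
```

Passing the result of `ExecTokens.shellCommand("myCommand fileName output")` to `Runtime.getRuntime().exec(String[])` avoids the splitting, since each array element reaches the shell unchanged.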






Re: Use an executable from java ...

2005-01-31 Thread Ben Litchfield

I will assume you are asking this question on the Lucene mailing list
because you now want to index that PDF document.

Have you tried PDFBox?  It can't create an HTML file for you but it can
extract text.

Ben
http://www.pdfbox.org



On Mon, 31 Jan 2005, Bertrand VENZAL wrote:

> Hi all,
>
> I have a problem executing a conversion tool that turns a PDF into
> HTML under Linux. I have an executable, "pdftohtml", which works
> correctly in batch mode, and it also works when I call it through Java
> under Windows 2000, BUT it does not work at all on the server under
> Linux. I'm using the following code.
>
> scommand = "/bin/sh -c \"myCommand fileName output\" ";
>
> Runtime runtime = Runtime.getRuntime();
> Process proc = runtime.exec(scommand);
> proc.waitFor();
>
>
> I'm running my code under Red Hat Linux with a classic shell.
> Is there another way to do the same thing, or am I missing something?
> Any help will be greatly appreciated.
>
> Thanks
> Bertrand
>
>
>
