[BlueObelisk-discuss] PDF 2 text parsing?

2010-08-14 Thread Egon Willighagen
Hi all (and Peter's team in particular),
Converting PDF back into text is a somewhat tricky exercise, as words can actually be characters rendered in approximately word format... Strigi [0] does a decent but not perfect recovery of text... is there someone here who has experience with PDF 2 text

Re: [BlueObelisk-discuss] PDF 2 text parsing?

2010-08-14 Thread Peter Murray-Rust
On Sat, Aug 14, 2010 at 11:41 AM, Egon Willighagen egon.willigha...@gmail.com wrote:
Hi all (and Peter's team in particular),
Converting PDF back into text is a somewhat tricky exercise, as words can actually be characters rendered in approximately word format...

This understates the

Re: [BlueObelisk-discuss] PDF 2 text parsing?

2010-08-14 Thread Egon Willighagen
On Sat, Aug 14, 2010 at 4:26 PM, Peter Murray-Rust pm...@cam.ac.uk wrote:
On Sat, Aug 14, 2010 at 11:41 AM, Egon Willighagen
Converting PDF back into text is a somewhat tricky exercise, as words can actually be characters rendered in approximately word format...
This understates the problem

Re: [BlueObelisk-discuss] PDF 2 text parsing?

2010-08-14 Thread Peter Murray-Rust
On Sat, Aug 14, 2010 at 4:50 PM, Egon Willighagen egon.willigha...@gmail.com wrote:
Well, the problem is just how to accurately extract the text from PDF files... it would need a C++ library that takes characters and returns text, I guess... but since the plugin is offline in Strigi anyway,
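
To make the problem discussed in this thread concrete: a PDF page often stores individually positioned glyphs rather than words, so "takes characters and returns text" means guessing word and line breaks from geometry. The following is only a minimal sketch of that idea, written in Python rather than the C++ mentioned above. The names `Glyph` and `glyphs_to_text`, the `gap_ratio` and `line_tol` thresholds, and the assumption that each glyph comes with a left edge, baseline and width are all hypothetical and chosen for illustration; this is not how Strigi or any particular PDF library works.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One positioned character, as a PDF renderer might report it (assumed shape)."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF coordinates grow upward
    width: float  # advance width of the glyph


def glyphs_to_text(glyphs: List[Glyph],
                   gap_ratio: float = 0.3,
                   line_tol: float = 2.0) -> str:
    """Rebuild text from glyph positions using simple geometric heuristics.

    gap_ratio and line_tol are illustrative guesses, not tuned values.
    """
    # Group glyphs into lines: top of page first, baselines within line_tol merge.
    lines = []  # each entry is (baseline_y, [glyphs on that line])
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left to right within a line
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A horizontal gap that is large relative to the previous glyph
            # is treated as a word break; small gaps are taken as kerning.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

A fixed gap threshold like this is exactly where such recovery tends to go wrong: justified text, tight kerning, superscripts and multi-column layouts all break the assumption that spacing alone marks word and line boundaries.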