Re: [iText-questions] Save PDF as plain text

2011-11-19 Thread WMJ
Hello, Oh, it is in the extra pack and I am using the C# version which doesn't have that class--I failed to find it. WMJ. > > From: 1T3XT BVBA >Subject: Re: [iText-questions] Save PDF as plain text > > >On 19/11/2011 17:01, W

Re: [iText-questions] Save PDF as plain text

2011-11-19 Thread 1T3XT BVBA
On 19/11/2011 17:01, WMJ wrote: Hello Kevin, Thank you for the reply. Currently the event model is read-only, and it does do a good job telling us some text and image information. However, when it comes to modifying the existing document content, we have to write it back. That's a lacking fea

Re: [iText-questions] Save PDF as plain text

2011-11-19 Thread WMJ
line. Up to now, the page stamper cannot achieve this. That's why I am looking for a solution to change the content. WMJ. > > From: Kevin Day >Subject: Re: [iText-questions] Save PDF as plain text > >WMJ - > >Your comment about "

Re: [iText-questions] Save PDF as plain text

2011-11-19 Thread Kevin Day
WMJ - Your comment about "requiring lots of classes" makes me think that maybe you are trying to interact with the low level PdfContentStreamProcessor, instead of registering a RenderListener? I'm wondering if you might be trying to use the wrong level of the architecture. There is a hierarchy o

Re: [iText-questions] Save PDF as plain text

2011-11-18 Thread WMJ
I've forgotten to agree that PDF can have a lot of crap. :) > > From: Leonard Rosenthol >Subject: Re: [iText-questions] Save PDF as plain text > > >Find some vector-heavy documents such as those in prepress/publishing or CAD >draw

Re: [iText-questions] Save PDF as plain text

2011-11-17 Thread WMJ
ol >Subject: Re: [iText-questions] Save PDF as plain text > > >Find some vector-heavy documents such as those in prepress/publishing or CAD >drawings.  Those will give you the heaviest content streams for your DOM.   >I've enclosed TWO PAGES from a REAL WORLD document to

Re: [iText-questions] Save PDF as plain text

2011-11-17 Thread WMJ
Hello, Firstly I agree that it is easy to convert the current event model to a DOM model. And I've already implemented a very basic model with one or two days' work. So far I've processed quite a few PDF files and I think huge page command trees are rare. Few PDF documents contain page

Re: [iText-questions] Save PDF as plain text

2011-11-16 Thread Paulo Soares
I'll take care of the cmaps encoding right after the big PDF support. Paulo - Original Message - From: Kevin Day To: itext-questions@lists.sourceforge.net Sent: Tuesday, November 15, 2011 4:45 PM Subject: Re: [iText-questions] Save PDF as plain text Dániel Kékesi

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread Kevin Day
WMJ wrote: > > Currently the parser is event-based. I would much prefer to have a DOM-like > thing. > h... Well that was certainly part of the original design consideration. But when you are processing a stack-based operator stream, and you have the potential for huge streams, an event base

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread Kevin Day
Dániel Kékesi wrote: > > Not to hijack this thread, but what I'd like to see is support > for > more encoding types. For example the attached document produces no output > using any extraction strategy (I tried with 5.1.2). > Not a hijack at all - I think this is a much more meani

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread mkl
WMJ, WMJ wrote: > Currently the parser is event-based. I would much prefer to have a DOM-like > thing. IMO it is a good choice of the current API to work in an event-based manner. On the one hand this requires the least resources --- if it always first transformed the page content into objects as

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread Leonard Rosenthol
Post here mailto:itext-questions@lists.sourceforge.net>> Subject: Re: [iText-questions] Save PDF as plain text Hello Kevin, Currently the parser is event-based. I would much prefer to have a DOM-like thing. For example, q BT 1 0 0 1 12 12 Tm /F1 12 Tf %F1 is a font named Times New Roman (abcdef

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread WMJ
Hello, I agree. Supporting more encoding types really helps with extracting text. Currently only text with a ToUnicode section in the corresponding fonts can be extracted. WMJ > > > >Hi Kevin, > >Not to hijack this thread, but what I'd like to see is to have suppor

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread WMJ
Hello Kevin, Currently the parser is event-based. I would much prefer to have a DOM-like thing. For example, q BT 1 0 0 1 12 12 Tm /F1 12 Tf %F1 is a font named Times New Roman (abcdef) Tj ET Q The above commands can be parsed to structured content like this: abce
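
The event-versus-DOM distinction discussed in this thread can be illustrated with a small standalone sketch. This is not iText's parser and uses no iText classes: `ContentStreamSketch`, `Node`, `OperatorListener` and `buildTree` are hypothetical names invented for illustration. It assumes a content stream that contains only numbers, names and simple unescaped `(...)` strings (no arrays, dictionaries, comments or inline images). The tokenizer fires one event per operator, and a listener layered on top of those events builds a tree by nesting on `q`/`Q` and `BT`/`ET`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration only; not part of iText. */
class ContentStreamSketch {

    /** One tree node: an operator, its operands, and any nested operators. */
    static final class Node {
        final String operator;
        final List<String> operands;
        final List<Node> children = new ArrayList<>();

        Node(String operator, List<String> operands) {
            this.operator = operator;
            this.operands = operands;
        }
    }

    /** Event callback, invoked once per operator in stream order. */
    interface OperatorListener {
        void onOperator(String operator, List<String> operands);
    }

    /** Splits the stream into tokens; only simple, unescaped (...) strings are supported. */
    static List<String> tokenize(String stream) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < stream.length()) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                int end = stream.indexOf(')', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated string");
                }
                tokens.add(stream.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < stream.length()
                        && !Character.isWhitespace(stream.charAt(i))
                        && stream.charAt(i) != '(') {
                    i++;
                }
                tokens.add(stream.substring(start, i));
            }
        }
        return tokens;
    }

    /** Numbers, names (/F1) and strings are operands; everything else is treated as an operator. */
    static boolean isOperand(String token) {
        char c = token.charAt(0);
        return c == '/' || c == '(' || c == '-' || c == '+' || c == '.' || Character.isDigit(c);
    }

    /** Event model: operands accumulate until an operator arrives, then one event fires. */
    static void parse(String stream, OperatorListener listener) {
        List<String> operands = new ArrayList<>();
        for (String token : tokenize(stream)) {
            if (isOperand(token)) {
                operands.add(token);
            } else {
                listener.onOperator(token, operands);
                operands = new ArrayList<>();
            }
        }
    }

    /** DOM-like model built on top of the events: q/Q and BT/ET open and close nesting levels. */
    static Node buildTree(String stream) {
        Node root = new Node("root", new ArrayList<>());
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        parse(stream, (op, operands) -> {
            if (op.equals("Q") || op.equals("ET")) {
                if (stack.size() > 1) {
                    stack.pop();
                }
                return;
            }
            Node node = new Node(op, operands);
            stack.peek().children.add(node);
            if (op.equals("q") || op.equals("BT")) {
                stack.push(node);
            }
        });
        return root;
    }
}
```

Calling `ContentStreamSketch.buildTree("q BT 1 0 0 1 12 12 Tm /F1 12 Tf (abcdef) Tj ET Q")` yields a `q` node containing a `BT` node with `Tm`, `Tf` and `Tj` children. The tree holds every operator in memory at once, which is the cost raised elsewhere in the thread for very large content streams; the event callback alone does not.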

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread Kevin Day
WMJ wrote: > > The parser is not very powerful or convenient yet, but it does point you > to the most detailed part of the PDF text. > WMJ - what would you see as enhancements that would make it more powerful or convenient? Understanding the details of this will help us improve. Thanks. -- V

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread Kevin Day
There are currently two text extraction strategies. One is a very simple extraction of text directly from the content stream. The other is a much more advanced, location-based extraction (this is the default). Extending that to add additional formatting capabilities is possible, and was the inte

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread WMJ
Currently you don't have any option. You have to analyze the position of the extracted text segments and determine whether there should be spaces between them and whether adjacent lines belong to the same paragraph. If you want to know about the color, font, style and size of the text, you ha
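
The position analysis described above can be sketched without any PDF library. The following is a minimal illustration, assuming the text segments have already been extracted and sorted into reading order. `ChunkJoinSketch`, `Chunk` and `join` are hypothetical names, and both thresholds (half a space width for a word gap, 0.5 units of baseline difference for a new line) are arbitrary assumptions rather than values taken from iText.

```java
import java.util.List;

/** Hypothetical illustration only; not part of iText. */
class ChunkJoinSketch {

    /** A piece of extracted text with its horizontal extent and baseline, in page units. */
    static final class Chunk {
        final String text;
        final float startX;
        final float endX;
        final float baselineY;

        Chunk(String text, float startX, float endX, float baselineY) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.baselineY = baselineY;
        }
    }

    /**
     * Joins chunks that are already in reading order. A changed baseline starts a new line;
     * a horizontal gap wider than half a space width inserts a space.
     */
    static String join(List<Chunk> chunks, float spaceWidth) {
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk current : chunks) {
            if (previous != null) {
                if (Math.abs(current.baselineY - previous.baselineY) > 0.5f) {
                    out.append('\n');
                } else if (current.startX - previous.endX > spaceWidth / 2f) {
                    out.append(' ');
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }
}
```

Deciding whether two adjacent lines belong to the same paragraph would need further heuristics, such as comparing line spacing and indentation, which this sketch does not attempt.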

Re: [iText-questions] Save PDF as plain text

2011-11-15 Thread Verakso
Thanks for pointing me in the right direction - that helped a lot. I have managed to extract text from my PDF files, but I wish there were some more "formatting" options on the output - have I missed anything? I have a small project where I used foolabs Xpdf pdftotext.exe, which has an option

Re: [iText-questions] Save PDF as plain text

2011-11-14 Thread 1T3XT BVBA
On 15/11/2011 0:03, Verakso wrote: > this ends up with links to old posts that say iText can't do that. Those must be very old mails. iText has been able to parse PDFs for plain text for a couple of years now. > I do know that iText doesn't do OCR but how do I convert a page to > plain text? That's ex

[iText-questions] Save PDF as plain text

2011-11-14 Thread Verakso
On the webpage iTextpdf.com there is a section saying that iText can read and extract PDF files. Actually it says that iText can access the content stream of each page. But it must be me that is blind, because all my searches on how to do this end up with links to old posts that say iText can'