Hello,
Oh, it is in the extra pack and I am using the C# version which doesn't have
that class--I failed to find it.
WMJ.
>
> From: 1T3XT BVBA
>Subject: Re: [iText-questions] Save PDF as plain text
>
>
On 19/11/2011 17:01, WMJ wrote:
Hello Kevin,
Thank you for the reply.
Currently the event model is read-only, and it does a good job of
reporting text and image information. However, when it comes to
modifying the existing document content, we have to write it back.
That's a lacking fea
line.
Up to now, the page stamper cannot achieve this. That's why I am looking for
a solution to change the content.
WMJ.
>
> From: Kevin Day
>Subject: Re: [iText-questions] Save PDF as plain text
>
>WMJ -
>
>Your comment about "
WMJ -
Your comment about "requiring lots of classes" makes me think that maybe you
are trying to interact with the low level PdfContentStreamProcessor, instead
of registering a RenderListener? I'm wondering if you might be trying to
use the wrong level of the architecture.
There is a hierarchy o
I've forgotten to agree that PDF can have a lot of crap. :)
>
> From: Leonard Rosenthol
>Subject: Re: [iText-questions] Save PDF as plain text
>
>
>Find some vector-heavy documents such as those in prepress/publishing or CAD
>drawings. Those will give you the heaviest content streams for your DOM.
>I've enclosed TWO PAGES from a REAL WORLD document to
Hello,
Firstly, I agree that it is easy to convert the current event model to a DOM
model, and I've already implemented a very basic model in one or two
days' work.
I've processed quite a few PDF files so far, and I think huge page command
trees are rare. Few PDF documents contain page
I'll take care of the cmaps encoding right after the big PDF support.
Paulo
- Original Message -
From: Kevin Day
To: itext-questions@lists.sourceforge.net
Sent: Tuesday, November 15, 2011 4:45 PM
Subject: Re: [iText-questions] Save PDF as plain text
Dániel Kékesi
WMJ wrote:
>
> Currently the parser is event-based. I would much rather have a DOM-like
> thing.
>
h... Well that was certainly part of the original design consideration.
But when you are processing a stack based operator stream, and you have the
potential for huge streams, an event base
Dániel Kékesi wrote:
>
> Not to hijack this thread, but what I'd like to see is support for
> more encoding types. For example, the attached document produces no output
> using any extraction strategy (I tried with 5.1.2).
>
Not a hi-jack at all - I think this is a much more meani
WMJ,
WMJ wrote:
> Currently the parser is event-based. I would much rather have a DOM-like
> thing.
IMO it is a good choice for the current API to work in an event-based manner.
On the one hand this requires the least resources --- if it always first
transformed the page content into objects as
Hello,
I agree. Supporting more encoding types would really help with extracting text.
Currently only text whose font has a ToUnicode section can be extracted.
WMJ
>
>
>
>Hi Kevin,
>
>Not to hijack this thread, but what I'd like to see is to have suppor
Hello Kevin,
Currently the parser is event-based. I would much rather have a DOM-like thing.
For example,
q
BT
1 0 0 1 12 12 Tm
/F1 12 Tf %F1 is a font named Times New Roman
(abcdef) Tj
ET
Q
The above commands can be parsed to structured content like this:
abce
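To make the DOM-like idea concrete, here is a minimal sketch of how such a command stream could be turned into a tree. It is not iText code and not the structure WMJ implemented (that part of the message is cut off above). The names Node, parse and dump are hypothetical, and the tokenizer is an assumption that only handles numbers, names, simple literal strings and comments; arrays, hex strings, dictionaries and inline images are ignored.

```python
import re

class Node:
    """Hypothetical tree node for this sketch; not an iText class."""
    def __init__(self, op, operands=None):
        self.op = op                    # operator name, e.g. "q", "BT", "Tj"
        self.operands = operands or []  # operand tokens that preceded the operator
        self.children = []              # nested commands inside q...Q or BT...ET

# Operators that open a nested block, mapped to the operator that closes it.
BLOCKS = {"q": "Q", "BT": "ET"}

# Literal string | comment | name | number or operator.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"    # (abcdef), without nested parentheses
    r"|%[^\n]*"                # comment up to the end of the line
    r"|/[^\s()<>\[\]/%]*"      # name such as /F1
    r"|[^\s()<>\[\]/%]+"       # number or operator
)
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")

def parse(stream):
    """Build a tree from a content stream, nesting q/Q and BT/ET blocks."""
    root = Node("root")
    stack = [root]
    operands = []
    for match in TOKEN.finditer(stream):
        tok = match.group()
        if tok.startswith("%"):
            continue                          # drop comments
        if tok[0] in "(/" or NUMBER.fullmatch(tok):
            operands.append(tok)              # operand: wait for its operator
            continue
        if len(stack) > 1 and BLOCKS.get(stack[-1].op) == tok:
            stack.pop()                       # closing operator ends the block
        else:
            node = Node(tok, operands)
            stack[-1].children.append(node)
            if tok in BLOCKS:
                stack.append(node)            # opening operator starts a block
        operands = []
    return root

def dump(node, depth=0):
    """Return one indented line per command, children below their parent."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + " ".join(child.operands + [child.op]))
        lines.extend(dump(child, depth + 1))
    return lines

example = "q\nBT\n1 0 0 1 12 12 Tm\n/F1 12 Tf %F1 is a font\n(abcdef) Tj\nET\nQ\n"
print("\n".join(dump(parse(example))))
```

A tree like this holds every command of the page in memory at once, which is the resource trade-off raised elsewhere in the thread for very large content streams.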
WMJ wrote:
>
> The parser is not very powerful or convenient yet, but it does point you
> to the most detailed part of the PDF text.
>
WMJ - what would you see as enhancements that would make it more powerful or
convenient? Understanding the details of this will help us improve.
Thanks.
--
V
There are currently two text extraction strategies. One is a very simple
extraction of text directly from the content stream. The other is a much
more advanced, location based extraction (this is the default).
Extending that to add additional formatting capabilities is possible, and
was the inte
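As a rough illustration of the difference between the two approaches, the sketch below contrasts reading text in content-stream order with ordering it by position on the page. It is not iText's implementation: the (text, x, y) tuple layout and the function names are assumptions made for this example, and comparing baselines for exact equality is a simplification.

```python
def simple_order(chunks):
    """Concatenate text in the order it appears in the content stream."""
    return "".join(text for text, x, y in chunks)

def location_order(chunks):
    """Order text top to bottom, then left to right, one output line per baseline."""
    # PDF y coordinates grow upwards, so a larger y is higher on the page.
    ordered = sorted(chunks, key=lambda c: (-c[2], c[1]))
    lines, current, current_y = [], [], None
    for text, x, y in ordered:
        if current_y is not None and y != current_y:
            lines.append("".join(current))    # baseline changed: start a new line
            current = []
        current.append(text)
        current_y = y
    if current:
        lines.append("".join(current))
    return "\n".join(lines)

# Chunks listed in stream order, which need not match reading order.
chunks = [("world", 50, 700), ("Second line", 10, 680), ("Hello ", 10, 700)]
print(simple_order(chunks))
print(location_order(chunks))
```

With these made-up chunks, the stream-order result is scrambled while the position-based result reads naturally, which is why a location-based strategy is the more useful default.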
Currently you don't have any option.
You have to analyze the positions of the extracted text segments and determine
whether there should be spaces between them, and whether adjacent lines belong
to the same paragraph. If you want to know about the color, font, style and
size of the text, you ha
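The kind of position analysis described here might look like the following sketch. Everything in it is an assumption for illustration: the Chunk fields, the gap threshold of a quarter of the font size, and the leading limit of 1.5 times the font size are made-up values, not anything iText provides or recommends.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """Hypothetical text segment as a render callback might report it."""
    text: str
    x: float          # left edge of the segment
    width: float      # rendered width of the segment
    font_size: float

def join_line(chunks, gap_ratio=0.25):
    """Join segments on one baseline, adding a space where the gap is wide."""
    parts = []
    prev = None
    for chunk in sorted(chunks, key=lambda c: c.x):
        if prev is not None:
            gap = chunk.x - (prev.x + prev.width)
            already_spaced = prev.text.endswith(" ") or chunk.text.startswith(" ")
            if gap > gap_ratio * chunk.font_size and not already_spaced:
                parts.append(" ")
        parts.append(chunk.text)
        prev = chunk
    return "".join(parts)

def same_paragraph(prev_baseline, next_baseline, font_size, max_leading=1.5):
    """Treat two lines as one paragraph if the next baseline is close below."""
    distance = prev_baseline - next_baseline
    return 0 < distance <= max_leading * font_size

line = [Chunk("Hel", 0, 15, 10), Chunk("lo", 15, 10, 10), Chunk("world", 30, 25, 10)]
print(join_line(line))
print(same_paragraph(700, 688, 10), same_paragraph(700, 670, 10))
```

Real documents need more care than this (rotated text, kerning adjustments, multiple columns), so the thresholds would have to be tuned per document type.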
Thanks for pointing me in the right direction - that helped a lot.
I have managed to extract text from my PDF files, but I wish there were
some more "formatting" options for the output - have I missed anything?
I have a small project where I used foolabs Xpdf pdftotext.exe, which has
an option
On 15/11/2011 0:03, Verakso wrote:
> this ends up with links to old posts that say iText can't do that.
Those must be very old mails. iText has been able to parse PDFs for plain text
for a couple of years now.
> I do know that iText doesn't do OCR, but how do I convert a page to
> plain text?
That's ex
On the webpage iTextpdf.com there is a section saying that iText can read
and extract PDF files.
Actually it says that iText can access the content stream of each page.
But I must be blind, because all my searches on how to do this
end up with links to old posts that say iText can'