Hi Dimuthu
I'm travelling for the next week, so I'm going to be a little slow at replying
and somewhat brief.
The scale can simply be 1.0 at all times. The font size should be the height of
the current line of text in points (1/72 inch). To calculate this from the
height of the text in pixels
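The pixel-to-point conversion depends on the resolution of the image the text was measured in. A minimal sketch, assuming the page image was rendered at a known DPI; the class and method names are hypothetical and not part of PDFBox:

```java
// Hypothetical helper, not part of PDFBox: converts a height measured in
// pixels on a rendered page image into points (1 point = 1/72 inch).
final class PointConversion {
    static final float POINTS_PER_INCH = 72f;

    // dpi is the resolution at which the page image was rendered (an assumption).
    static float pixelsToPoints(float pixels, float dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // A line of text 50 px tall in a 300 DPI image is 12 pt high.
        System.out.println(pixelsToPoints(50f, 300f)); // prints 12.0
    }
}
```

If the image was rendered at 72 DPI, pixels and points coincide and no conversion is needed.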
Hi John,
I managed to override the processStream method and pass some hardcoded
text position values to it.
I still have doubts about the totalVerticalDisplacementDisp and
fontSizeText variables. Is there a standard way to calculate the
fontSizeText variable? What is the use of
Hi Dimuthu
Each line of text is handled by the processEncodedText method in PDFStreamEngine
which calls processTextPosition once for each character. The processTextPosition
method in PDFTextStripper collects the text positions into lines, paragraphs and
columns (also called “articles”). Text on a
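The "one callback per character" pattern can be sketched with simplified stand-in classes. Everything below is hypothetical and is not PDFBox's real API: `Pos` is a stripped-down text position, and the baseline tolerance is an assumed parameter. The real text stripper is considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins, NOT PDFBox's real classes: they only mirror the
// "one callback per character" pattern described in the message.
final class LineCollector {
    // Hypothetical minimal text position: one character and its baseline origin.
    static final class Pos {
        final char ch;
        final float x;
        final float y;

        Pos(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    private final List<StringBuilder> lines = new ArrayList<>();
    private final float tolerance;
    private float lastY;

    LineCollector(float tolerance) {
        this.tolerance = tolerance;
    }

    // Called once per character; starts a new line when the baseline moves
    // by more than the tolerance.
    void processTextPosition(Pos p) {
        if (lines.isEmpty() || Math.abs(p.y - lastY) > tolerance) {
            lines.add(new StringBuilder());
        }
        lines.get(lines.size() - 1).append(p.ch);
        lastY = p.y;
    }

    List<String> getLines() {
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : lines) {
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        LineCollector c = new LineCollector(2f);
        c.processTextPosition(new Pos('H', 10, 700));
        c.processTextPosition(new Pos('i', 16, 700));
        c.processTextPosition(new Pos('!', 10, 686));
        System.out.println(c.getLines()); // two lines: "Hi" and "!"
    }
}
```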
Hi John,
I looked at the processTextPosition method in PDFTextStripper, but I
couldn't understand the actual process happening inside the method. What
should the input for that method be? In my case I have words with
bounding-box coordinates. How can I make that data compatible
with the input of
functionalities into existing PDFtoText algorithms or package them as a
new subsystem (something like an API)?
-Original Message-
From: John Hewson j...@jahewson.com
Sent: 26/02/2014 07:38
To: dev@pdfbox.apache.org dev@pdfbox.apache.org
Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
Hi Dimuthu
1. Print that data into the PDDocument again and pass it through the
TextStripper of PDFBox. This could reduce the performance of the overall process.
This was what I had in mind, but rather than printing the text into the
PDDocument you can inject it directly into PDFTextStripper as TextPosition
OK, fixed. This is what I did:
Right-click on the new project > Debug As > Debug Configurations > Source
> Add > Project
Then I selected the PDFBox project.
Thanks
Dimuthu
On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha dimuthu.upeks...@gmail.com
wrote:
I'm using eclipse. This is what I want. I
Hi John,
I have a couple of questions.
1. What are glyphs?
2. What is the main requirement of this project?
As far as I understood, first we need to generate an image of malformed PDFs
from PDFBox and then process it using OCR for more accurate results. But the
problem is,
1. What are glyphs?
http://en.wikipedia.org/wiki/Glyph
2. What is the main requirement of this project?
As far as I understood, first we need to generate an image of malformed PDFs
from PDFBox and then process it using OCR for more accurate results. But the
problem
Yes, exactly. By location data I just mean (x,y) coordinates and page rotation.
There is another use case for OCR: some fonts embedded in PDFs have corrupt
encodings, which means the ASCII codes map to the wrong glyphs. We could OCR
the glyphs to repair the encoding.
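The repair idea can be sketched as overlaying OCR results on the font's declared mapping. This is an illustration only: the class, the method, and the assumption that OCR yields one recognised string per character code are all hypothetical, and nothing here is an existing PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: none of these names exist in PDFBox.
final class EncodingRepair {
    // declared: character code -> character claimed by the (possibly corrupt) encoding
    // ocr:      character code -> character recognised by OCR from the rendered glyph
    static Map<Integer, String> repair(Map<Integer, String> declared,
                                       Map<Integer, String> ocr) {
        // Work on a copy so the declared mapping is left untouched.
        Map<Integer, String> fixed = new HashMap<>(declared);
        for (Map.Entry<Integer, String> e : ocr.entrySet()) {
            // Prefer what the glyph actually looks like over what the encoding claims.
            fixed.put(e.getKey(), e.getValue());
        }
        return fixed;
    }

    public static void main(String[] args) {
        Map<Integer, String> declared = new HashMap<>();
        declared.put(65, "A");
        declared.put(66, "B");
        Map<Integer, String> ocr = new HashMap<>();
        ocr.put(65, "T"); // the glyph drawn for code 65 actually looks like "T"
        System.out.println(repair(declared, ocr).get(65)); // prints T
    }
}
```

Codes for which OCR produced no result keep their declared mapping in this sketch.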
-- John
On 25 Feb 2014,
Hi Dimuthu
The PDFBox website can be found at http://pdfbox.apache.org/. It contains a
basic overview of the project
and details on how to obtain the source code and build PDFBox for yourself.
Currently we do not perform any OCR and PDFBOX-1912 details the only thoughts
so far regarding it.
Hi John,
Thanks for the reply. Yes, I checked out the PDFBox code and managed to build
it successfully. I looked at the classes you mentioned and I got a rough
idea of how they work. To check them I added the JARs from the target
folder to a separate Java project. I tried samples in
The best approach is to build from the complete source and use that build. Then
you should be able to set breakpoints within the PDFBox source code.
(That's what I do all the time.)
Tilman
On 25.02.2014 07:38, DImuthu Upeksha wrote:
Hi John,
Thanks for the reply. Yes I checked out PDFBox code and
Which IDE are you using? You should be able to run the PDFToText class (in
pdfbox-tools) using your IDE and pass a PDF file path as the command line
argument.
-- John
On 24 Feb 2014, at 22:38, DImuthu Upeksha dimuthu.upeks...@gmail.com wrote:
Hi John,
Thanks for the reply. Yes I checked
I'm using Eclipse. This is what I want. I created a new Java application
project (say TestPDFBox) with a main class containing the following code.
    PDDocument document = new PDDocument();
    PDPage blankPage = new PDPage();
    document.addPage(blankPage);
    document.save("BlankPage.pdf");
    document.close();
Then I need
Hi,
I am Dimuthu Upeksha, a Computer Engineering undergraduate at the University of
Moratuwa, Sri Lanka. I successfully completed GSoC 2013 with the Apache ISIS
[1] project. I'm very interested in OCR and image processing, so
I would like to select this project idea as my GSoC 2014 project