Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-04-16 Thread John Hewson
Hi Dimuthu, I'm travelling for the next week so I'm going to be a little slow at replying and somewhat brief. The scale can simply be 1.0 at all times. The font size should be the height of the current line of text in points (1/72 inch). To calculate this from the height of the text in pixels
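
A minimal sketch of the pixel-to-point conversion mentioned in the message above, assuming the resolution (DPI) of the OCR image is known. The class and method names are hypothetical and not part of PDFBox; the only fact relied on is that one point is 1/72 inch.

```java
public class FontSizeEstimator {
    /** Points per inch, by definition of the typographic point. */
    private static final float POINTS_PER_INCH = 72f;

    /**
     * Converts a text-line height measured in image pixels to points.
     * heightPx: height of the line in pixels (e.g. from an OCR bounding box).
     * dpi: resolution of the image in pixels per inch (assumed known).
     */
    public static float pixelsToPoints(float heightPx, float dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        // pixels / dpi gives inches; inches * 72 gives points.
        return heightPx * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // A 50-pixel-high line in a 300 DPI image.
        System.out.println(pixelsToPoints(50f, 300f)); // prints 12.0
    }
}
```

If the OCR engine reports a different resolution per page, the DPI would need to be passed in per page rather than assumed constant.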

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-04-14 Thread DImuthu Upeksha
Hi John, I managed to override the processStream method and pass some hardcoded text position values to it. I still have doubts about the totalVerticalDisplacementDisp and fontSizeText variables. Is there a standard way to calculate the fontSizeText variable? What is the use of

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-25 Thread John Hewson
Hi Dimuthu Each line of text is handled by the processEncodedText method in PDFStreamEngine which calls processTextPosition once for each character. The processTextPosition method in PDFStreamEngine collects the text positions into lines, paragraphs and columns (also called “articles”). Text on a
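
A rough sketch of the idea of collecting per-character positions into lines, as described in the message above. This is not PDFTextStripper's actual algorithm: the `Glyph` class, the `groupIntoLines` method and the fixed vertical tolerance are all hypothetical, and the y axis is assumed to grow downward as in image coordinates.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineGrouper {
    /** Hypothetical stand-in for one recognised character and its position. */
    public static final class Glyph {
        final String text;
        final float x;
        final float y;

        public Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs whose y coordinate lies within the given tolerance of the
     * first glyph of a line, then orders each line from left to right.
     */
    public static List<String> groupIntoLines(List<Glyph> glyphs, float tolerance) {
        // Sort top to bottom first (assumes y grows downward).
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(last.get(0).y - g.y) <= tolerance) {
                last.add(g); // same line as the previous glyph
            } else {
                List<Glyph> fresh = new ArrayList<>();
                fresh.add(g); // start a new line
                lines.add(fresh);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            // Within a line, order glyphs left to right.
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) {
                sb.append(g.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

A real implementation would also have to handle word spacing, rotation, paragraphs and columns, which this sketch ignores.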

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-24 Thread DImuthu Upeksha
Hi John, I looked at the processTextPosition method in PDFTextStripper, but I couldn't understand what actually happens inside the method. What should be the input for that method? In my case I have words with their bounding box coordinates. How can I make that data compatible with the input of

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-19 Thread John Hewson
functionalities into existing PDFtoText algorithms or package them as a new sub system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-19 Thread DImuthu Upeksha
into existing PDFtoText algorithms or package them as a new sub system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-19 Thread John Hewson
Hi Dimuthu 1 Print those data into PDDocument again and pass through TextStripper of PDFBox. This could reduce the performance of overall process. This was what I had in mind, but rather than printing the text into the PDDocument you can inject it directly into PDFTextStripper as TextPosition

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-16 Thread DImuthu Upeksha
functionalities into existing PDFtoText algorithms or package them as a new sub system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-13 Thread John Hewson
To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-11 Thread DImuthu Upeksha
Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-11 Thread John Hewson
To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-11 Thread DImuthu Upeksha
or package them as a new sub system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-10 Thread John Hewson
@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-10 Thread John Hewson
system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-10 Thread John Hewson
Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-09 Thread DImuthu Upeksha
- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-07 Thread DImuthu Upeksha
To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-05 Thread DImuthu Upeksha
2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map to the wrong glyphs. We

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-05 Thread John Hewson
or package them as a new sub system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-04 Thread John Hewson
sub system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-03 Thread John Hewson
functionalities into existing PDFtoText algorithms or package them as a new sub system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-03 Thread DImuthu Upeksha
- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-03 Thread DImuthu Upeksha
PDFtoText algorithms or package them as a new sub system(something like an API)? -Original Message- From: John Hewson j...@jahewson.com Sent: 26/02/2014 07:38 To: dev@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-03-01 Thread DImuthu Upeksha
@pdfbox.apache.org dev@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-28 Thread John Hewson
@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-26 Thread DImuthu Upeksha
Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map to the wrong glyphs. We

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-26 Thread John Hewson
@pdfbox.apache.org Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-26 Thread DImuthu Upeksha
Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map to the wrong glyphs. We could OCR the glyphs

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-25 Thread DImuthu Upeksha
Ok, fixed. This is what I did: right click on the new project -> Debug As -> Debug Configurations -> Source -> Add -> Project. Then I selected the PDFBox project. Thanks Dimuthu On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha dimuthu.upeks...@gmail.com wrote: I'm using Eclipse. This is what I want. I

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-25 Thread DImuthu Upeksha
Hi John, I have a couple of questions. 1. What are "glyphs"? 2. What is the main requirement of this project? As far as I understood, first we need to generate an image of malformed PDFs from PDFBox and then we need to process it using OCR for more accurate results. But the problem is,

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-25 Thread John Hewson
1. What is called "glyphs"? http://en.wikipedia.org/wiki/Glyph 2. What is the main requirement of this project? As far as I understood, first we need to generate an image of malformed pdfs from PDFBox and then we need to do processing using OCR for further accurate results. But the problem

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-25 Thread John Hewson
Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map to the wrong glyphs. We could OCR the glyphs to repair the encoding. -- John On 25 Feb 2014,

RE: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-25 Thread Dimuthu
: [GSoC 2014]Optical Character Recognition project - Introduction Yes, exactly. By location data I just mean (x,y) coordinates and page rotation. There is another use case for OCR: some fonts embedded in PDFs have corrupt encodings, which means the ASCII codes map to the wrong glyphs. We could OCR

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-24 Thread John Hewson
Hi Dimuthu, The PDFBox website can be found at http://pdfbox.apache.org/. It contains a basic overview of the project and details on how to obtain the source code and build PDFBox for yourself. Currently we do not perform any OCR, and PDFBOX-1912 details the only thoughts so far regarding it.

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-24 Thread DImuthu Upeksha
Hi John, Thanks for the reply. Yes, I checked out the PDFBox code and managed to build it successfully. I looked at the classes you mentioned and I got a rough idea of how they work. To check them I added the jars from the target folder to my separate Java project. I tried samples in

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-24 Thread Tilman Hausherr
The best approach is to build from the complete source and to use that build. Then you should be able to set breakpoints within the PDFBox source code. (That's what I do all the time.) Tilman On 25.02.2014 07:38, DImuthu Upeksha wrote: Hi John, Thanks for the reply. Yes I checked out PDFBox code and

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-24 Thread John Hewson
Which IDE are you using? You should be able to run the PDFToText class (in pdfbox-tools) using your IDE and pass a PDF file path as the command line argument. -- John On 24 Feb 2014, at 22:38, DImuthu Upeksha dimuthu.upeks...@gmail.com wrote: Hi John, Thanks for the reply. Yes I checked

Re: [GSoC 2014]Optical Character Recognition project - Introduction

2014-02-24 Thread DImuthu Upeksha
I'm using Eclipse. This is what I want. I created a new Java application project (say TestPDFBox) with a main class with the following code. PDDocument document = new PDDocument(); PDPage blankPage = new PDPage(); document.addPage(blankPage); document.save("BlankPage.pdf"); document.close(); Then I need

[GSoC 2014]Optical Character Recognition project - Introduction

2014-02-23 Thread DImuthu Upeksha
Hi, I am Dimuthu Upeksha, a Computer Engineering undergraduate at the University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013 with the Apache ISIS [1] project. I'm very much interested in OCR and image processing, so I would like to select this project idea as my GSoC 2014 project