Re: reading text out of ps/pdf

2001-01-15 Thread Tuukka Toivonen

On Sun, 14 Jan 2001, Jan Goebel wrote:

> you can maybe try scanner/OCR software like GOCR (open source)
> take a look at:
> http://altmark.nat.uni-magdeburg.de/~jschulen/ocr/index.html

Sure. You can try it. But don't expect too much. When I last tested all the
free OCR programs for Linux (maybe half a year ago) -- there were several
of them -- they were all very poor. I think that OCR programs are one of
the weakest points of Linux :(

There is still Wine...




Re: reading text out of ps/pdf

2001-01-15 Thread Herbert Voss

Tuukka Toivonen wrote:
> 
> On Sun, 14 Jan 2001, Jan Goebel wrote:
> 
> > you can maybe try scanner/OCR software like GOCR (open source)
> > take a look at:
> > http://altmark.nat.uni-magdeburg.de/~jschulen/ocr/index.html
> 
> Sure. You can try it. But don't expect too much. When I last tested all the
> free OCR programs for Linux (maybe half a year ago) -- there were several
> of them -- they were all very poor. I think that OCR programs are one of
> the weakest points of Linux :(

I don't think so, because I use OCRShop from www.vividata.com.
Sure, it's not open source, but $99 for a personal edition is
okay, and it works wonderfully for me ;-)

Herbert 

-- 
[EMAIL PROTECTED]
http://perce.de/lyx/




Re: reading text out of ps/pdf

2001-01-14 Thread Jan Goebel

Hello,

you can maybe try scanner/OCR software like GOCR (open source)
take a look at:
http://altmark.nat.uni-magdeburg.de/~jschulen/ocr/index.html

good luck

jan

PS: @christopher: if you are successful, could you let me know?
I may need it myself sometime, too.

On Sat, 13 Jan 2001, Matej Cepl wrote:

> Christopher Jones wrote:
> > So my question is: is there any software out there which attempts to look at
> > bitmaps and guess what the ascii would be-- something like those programs which
> > read books through a scanner and try to match font characters to the image. And
> > I say this question is a reach, because I know that those programs which I have
> > heard about are either very expensive or very inaccurate.
> 
> I am afraid you do not have much choice other than to try one of these
> programs. Some of them are now much better than they used to be.
> Try to find somebody with a scanner - most of these programs should
> be able to read documents from an external file. I am afraid
> there is nothing better to offer you.
> 
> Matej
> 

-- 
+---
 Jan Goebel (mailto:[EMAIL PROTECTED])

 DIW Berlin 
 Longitudinal Data and Microanalysis
 Königin-Luise-Str. 5
 D-14195 Berlin -- Germany --
 phone: 49 30 89789-377

+---



Re: reading text out of ps/pdf

2001-01-14 Thread Matej Cepl

Christopher Jones wrote:
> So my question is: is there any software out there which attempts to look at
> bitmaps and guess what the ascii would be-- something like those programs which
> read books through a scanner and try to match font characters to the image. And
> I say this question is a reach, because I know that those programs which I have
> heard about are either very expensive or very inaccurate.

I am afraid you do not have much choice other than to try one of these
programs. Some of them are now much better than they used to be.
Try to find somebody with a scanner - most of these programs should
be able to read documents from an external file. I am afraid
there is nothing better to offer you.

Matej





Re: reading text out of ps/pdf

2001-01-13 Thread R. E. de Lima-Lopes

yes

there is a tool called ps2ascii; it extracts plain text from *.ps files

[]s
lima-lopes

R.E. de Lima-Lopes
[EMAIL PROTECTED]
GNU/Linux Registered User # 182240

On Sat, 13 Jan 2001, Christopher Jones wrote:

> Date: Sat, 13 Jan 2001 11:34:48 -0600
> From: Christopher Jones <[EMAIL PROTECTED]>
> To: LyX <[EMAIL PROTECTED]>
> Subject: reading text out of ps/pdf
> 
> This is a reach, I know. But in the hopes that there is something out there for
> me, I'll ask the question: is there anything which reads text out of a bitmapped
> pdf or ps file? 
> 




Re: reading text out of ps/pdf

2001-01-13 Thread Christopher Jones

I have that tool. But some pdf or ps files consist not of coded text but of a
bitmapped image. For instance, pdf and ps files which I download from journal
databases are scanned images of journal pages. ps2ascii and pdftotext will not
extract text from these files, since there is no ascii content to extract. 

Anyway, that is the best explanation I have been able to figure out, by
examining the contents of pdf and ps files and seeing that the post-preamble
stuff is sometimes text, sometimes not, and seeing that ps2ascii poops out on
the latter, though not on the former.
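
The check described above can be sketched in a few lines. This is only a rough illustration: it assumes pdftotext is installed and on the PATH, and the function names and the min_chars threshold are made up for the example, not part of any tool. The idea is that a scanned, image-only file yields almost no text when run through pdftotext.

```python
import subprocess


def extract_text(pdf: str) -> str:
    """Run pdftotext on the file and return whatever text it finds.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf, "-"],
        capture_output=True,
        text=True,
        errors="replace",  # tolerate odd bytes in the extracted text
        check=True,
    )
    return result.stdout


def looks_scanned(extracted: str, min_chars: int = 20) -> bool:
    """Guess that a file is image-only when almost no text comes out.

    min_chars is an arbitrary threshold picked for this sketch.
    """
    visible = "".join(extracted.split())  # drop spaces, newlines, form feeds
    return len(visible) < min_chars
```

Something like looks_scanned(extract_text("article.pdf")) would then hint at whether OCR is needed at all; "article.pdf" is just a placeholder name.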

So my question is: is there any software out there which attempts to look at
bitmaps and guess what the ascii would be-- something like those programs which
read books through a scanner and try to match font characters to the image. And
I say this question is a reach, because I know that those programs which I have
heard about are either very expensive or very inaccurate.

Thanks very much for the response.

On Sun, 14 Jan 2001, you wrote:
> yes
> 
> there is a tool called ps2ascii; it extracts plain text from *.ps files
> 



Re: reading text out of ps/pdf

2001-01-13 Thread Herbert Voss

Christopher Jones wrote:
> 
> I have that tool. But some pdf or ps files consist not of coded text but of a
> bitmapped image. For instance, pdf and ps files which I download from journal
> databases are scanned images of journal pages. ps2ascii and pdftotext will not
> extract text from these files, since there is no ascii content to extract.
> 
> So my question is: is there any software out there which attempts to look at
> bitmaps and guess what the ascii would be-- something like those programs which
> read books through a scanner and try to match font characters to the image. And
> I say this question is a reach, because I know that those programs which I have
> heard about are either very expensive or very inaccurate.

With

pdfimages -f 1 file.pdf DirForTheImages

you extract all images in the PDF file; the last argument is used as the
prefix of the image file names. With the option -j, images stored as JPEG
in the PDF are saved as JPEGs; otherwise the default is PPM or PBM
format (a good choice). With

pdftotext file.pdf file.txt

you convert all the text to a text file. When the PDF file has some scanned
text that is saved as images, you can convert these from PBM to TIFF and
then run an OCR program.
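
The steps above could be chained roughly as follows. This is a sketch under assumptions, not a tested recipe: it assumes pdfimages and GOCR (the open-source program mentioned elsewhere in this thread) are installed, and the helper names and the "page" prefix are invented for the example. GOCR reads PBM/PPM files directly, so the TIFF conversion is left out here; an OCR program that wants TIFF input would need that extra step.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def pdfimages_cmd(pdf: str, prefix: str) -> list[str]:
    """Command that writes every image in the PDF as <prefix>-NNN.pbm/.ppm."""
    return ["pdfimages", "-f", "1", pdf, prefix]


def ocr_cmd(image: str, text_out: str) -> list[str]:
    """Command that runs GOCR on one image and writes the recognized text."""
    return ["gocr", "-i", image, "-o", text_out]


def ocr_scanned_pdf(pdf: str, workdir: str) -> list[Path]:
    """Extract the page images from a scanned PDF and OCR each of them."""
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    prefix = str(out / "page")  # hypothetical prefix for the image files
    subprocess.run(pdfimages_cmd(pdf, prefix), check=True)

    text_files = []
    # pdfimages names its output <prefix>-000.pbm, <prefix>-001.ppm, ...
    for image in sorted(out.glob("page-*.p?m")):
        text_file = image.with_suffix(".txt")
        subprocess.run(ocr_cmd(str(image), str(text_file)), check=True)
        text_files.append(text_file)
    return text_files
```

Calling ocr_scanned_pdf("file.pdf", "ocr_out") would leave one .txt file per extracted image in ocr_out; how good the text is depends entirely on the OCR program.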


Herbert

-- 
[EMAIL PROTECTED]
http://perce.de/lyx/





Re: reading text out of ps/pdf

2001-01-13 Thread Matej Cepl

Christopher Jones wrote:
> So my question is: is there any software out there which attempts to look at
> bitmaps and guess what the ascii would be-- something like those programs which
> read books through a scanner and try to match font characters to the image. And
> I say this question is a reach, because I know that those programs which I have
> heard about are either very expensive or very inaccurate.

I am afraid you do not have much choice other than to try one of these
programs. Some of them are now much better than they used to be.
Try to find somebody with a scanner - most of these programs should
be able to read documents from an external file. I am afraid
there is nothing better to offer you.

Matej