Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

2015-08-04 Thread Matt Sherman
That worked pretty well.  There is still some cleanup I have to do,
but replacing [A-z]^p[A-z] with [A-z] [A-z] took care of a lot of it.
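
For anyone working on a plain-text export rather than in Word, the same
letter/break/letter replacement can be sketched in Python. This is only an
illustration under assumptions that are not in the thread: the function name
join_wrapped_lines is made up, the breaks are taken to be "\n" or "\r\n",
and [A-Za-z] is used because in most regex flavours the range [A-z] also
matches the punctuation characters that sit between Z and a in ASCII.

```python
import re


def join_wrapped_lines(text: str) -> str:
    """Replace a line break that sits directly between two letters with a space.

    Hypothetical helper for illustration; not the Word replacement itself.
    """
    # The lookbehind and lookahead check for a letter on each side without
    # consuming it, so only the line break itself is replaced.
    return re.sub(r"(?<=[A-Za-z])\r?\n(?=[A-Za-z])", " ", text)


sample = "An annotated\nbibliography entry.\nNext entry"
print(join_wrapped_lines(sample))
```

Breaks that follow punctuation (such as the one after "entry." in the
sample) are left alone, which is why some cleanup remains after this pass.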

On Tue, Aug 4, 2015 at 12:17 PM, Kyle Banerjee  wrote:
> On Tue, Aug 4, 2015 at 6:09 AM, Matt Sherman 
> wrote:
>
>> I am on Windows machines, so I don't have quite the easy access to
>> that useful command.  Someone had earlier put the OCR in a doc file so
>> I've been playing with that more than with the raw PDF OCR.
>>
>>
> Versions of the Unix utilities that run on Windows are available, but you
> can just use Microsoft Word to do what you want. Just use the find/replace
> function. In Word, you can search for a paragraph marker by looking for
> "^p" (caret p).
>
> Because you undoubtedly have real paragraphs in the document which you
> don't want to remove, I'd recommend substituting double paragraph marks
> with something unique (e.g. "@ZZZ@") before replacing all the other
> paragraph marks with a space. Then replace your unique marker with a
> paragraph.
>
> HTH,
>
> kyle


Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

2015-08-04 Thread Kyle Banerjee
On Tue, Aug 4, 2015 at 6:09 AM, Matt Sherman 
wrote:

> I am on Windows machines, so I don't have quite the easy access to
> that useful command.  Someone had earlier put the OCR in a doc file so
> I've been playing with that more than with the raw PDF OCR.
>
>
Versions of the Unix utilities that run on Windows are available, but you
can just use Microsoft Word to do what you want. Just use the find/replace
function. In Word, you can search for a paragraph marker by looking for
"^p" (caret p).

Because you undoubtedly have real paragraphs in the document which you
don't want to remove, I'd recommend substituting double paragraph marks
with something unique (e.g. "@ZZZ@") before replacing all the other
paragraph marks with a space. Then replace your unique marker with a
paragraph.
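
The three steps above can be sketched in Python for a plain-text copy of the
OCR. This is a rough sketch under stated assumptions, not a description of
what Word does internally: the name unwrap_paragraphs is made up, real
paragraphs are assumed to be separated by at least one blank line, and
"@ZZZ@" is assumed not to occur anywhere in the text.

```python
import re

PLACEHOLDER = "@ZZZ@"  # marker from the message above; any unused string works


def unwrap_paragraphs(text: str) -> str:
    """Join wrapped lines while keeping real paragraph breaks (hypothetical helper)."""
    text = text.replace("\r\n", "\n")  # normalise Windows line endings first
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text; pick another")
    # Step 1: protect real paragraph breaks. Assumption: a run of two or more
    # newlines marks a paragraph; the whole run is treated as one break.
    text = re.sub(r"\n{2,}", PLACEHOLDER, text)
    # Step 2: every remaining break is a wrapped line, so turn it into a space.
    text = text.replace("\n", " ")
    # Step 3: restore each protected break as a single paragraph break.
    return text.replace(PLACEHOLDER, "\n")


sample = "First line\nwraps here.\n\nSecond entry\nwraps too."
print(unwrap_paragraphs(sample))
```

Each paragraph ends up on one line, which is convenient if the next step is
splitting entries into a tab-delimited file.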

HTH,

kyle


Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

2015-08-04 Thread Matt Sherman
Hm, after a little looking on someone's suggestion, it turns out I was
wrong: they are not line breaks, they are paragraph marks.

On Tue, Aug 4, 2015 at 9:21 AM, Scancella, John  wrote:
> Matt,
>
> A Word document does funny things to the text since it is actually html (try
> opening a .doc in a plain text editor and you will see it is html). I would
> try to get the plain ASCII text instead, and then install Cygwin, which
> contains sed and a bunch of other useful Unix/Linux commands.
> see http://stackoverflow.com/a/127567/2896744 for more info.


Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

2015-08-04 Thread Scancella, John
Matt,

A Word document does funny things to the text since it is actually html (try
opening a .doc in a plain text editor and you will see it is html). I would try
to get the plain ASCII text instead, and then install Cygwin, which contains
sed and a bunch of other useful Unix/Linux commands.
see http://stackoverflow.com/a/127567/2896744 for more info.

From: Code for Libraries [[email protected]] On Behalf Of Matt Sherman 
[[email protected]]
Sent: Tuesday, August 04, 2015 9:09 AM
To: [email protected]
Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

I am on Windows machines, so I don't have quite the easy access to
that useful command.  Someone had earlier put the OCR in a doc file so
I've been playing with that more than with the raw PDF OCR.



Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

2015-08-04 Thread Matt Sherman
I am on Windows machines, so I don't have quite the easy access to
that useful command.  Someone had earlier put the OCR in a doc file so
I've been playing with that more than with the raw PDF OCR.

On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John  wrote:
> Matt,
>
> There are probably a dozen ways to do this, but it would be really helpful to
> know what operating system you are on. For example, if you are using Linux,
> you can run it through sed using
>   cat  | sed 's/\n//' >> 
> see http://stackoverflow.com/a/800644/2896744 for more info


Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

2015-08-04 Thread Scancella, John
Matt,

There are probably a dozen ways to do this, but it would be really helpful to
know what operating system you are on. For example, if you are using Linux, you
can run it through sed using
  cat  | sed 's/\n//' >> 
see http://stackoverflow.com/a/800644/2896744 for more info
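
A caveat on the one-liner above: sed normally reads its input one line at a
time with the trailing newline already removed, so s/\n// on its own usually
has nothing to match. A cross-platform way to get the intended effect is a
few lines of Python. This is a minimal sketch, assuming the OCR has been
saved as a UTF-8 text file; the names remove_line_breaks and convert_file
are made up for illustration. Like the sed command, it flattens everything,
including real paragraph breaks.

```python
from pathlib import Path


def remove_line_breaks(text: str) -> str:
    """Join all lines of text with single spaces (hypothetical helper)."""
    # splitlines() copes with \n, \r\n and \r; blank lines are dropped.
    pieces = (line.strip() for line in text.splitlines())
    return " ".join(piece for piece in pieces if piece)


def convert_file(src: str, dst: str, encoding: str = "utf-8") -> None:
    """Read src, remove its line breaks, and write the result to dst."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(remove_line_breaks(text) + "\n", encoding=encoding)


print(remove_line_breaks("one\ntwo\r\n\r\nthree"))
```

Calling convert_file with an input path and an output path of your choosing
writes the flattened text to a new file and leaves the original untouched.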

From: Code for Libraries [[email protected]] On Behalf Of Matt Sherman 
[[email protected]]
Sent: Monday, August 03, 2015 10:29 PM
To: [email protected]
Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

Hi Code4Lib folks,

I was wondering if anyone had some experience cleaning up OCR text.
In particular, I am trying to figure out how to deal with the random
line breaks that come from OCR.  I am trying to parse out a
bibliography with regex.  I think I've figured out which queries I
need to run to break it up so I can make it into a tab-delimited text
file, but I noticed that the text does the classic thing of OCR
inserting line breaks where they physically are on the page.  This
will obviously be a bit of an issue, since it would break the
annotation into a bunch of lines rather than leaving it as one block so I
can manipulate it into a database.  So I am wondering if anyone who
has worked with OCR text before has a suggested way to clean up those
line breaks without doing 300+ pages by hand.  Any thoughts would be
welcome.

Matt Sherman


[CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

2015-08-03 Thread Matt Sherman
Hi Code4Lib folks,

I was wondering if anyone had some experience cleaning up OCR text.
In particular, I am trying to figure out how to deal with the random
line breaks that come from OCR.  I am trying to parse out a
bibliography with regex.  I think I've figured out which queries I
need to run to break it up so I can make it into a tab-delimited text
file, but I noticed that the text does the classic thing of OCR
inserting line breaks where they physically are on the page.  This
will obviously be a bit of an issue, since it would break the
annotation into a bunch of lines rather than leaving it as one block so I
can manipulate it into a database.  So I am wondering if anyone who
has worked with OCR text before has a suggested way to clean up those
line breaks without doing 300+ pages by hand.  Any thoughts would be
welcome.

Matt Sherman