Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
That worked pretty well. There is still some cleanup I have to do, but
replacing [A-z]^p[A-z] with [A-z] [A-z] did a lot of the cleanup.

On Tue, Aug 4, 2015 at 12:17 PM, Kyle Banerjee wrote:
> On Tue, Aug 4, 2015 at 6:09 AM, Matt Sherman wrote:
>
>> I am on Windows machines, so I don't have quite the easy access to
>> that useful command. Someone had earlier put the OCR in a doc file so
>> I've been playing with that more than with the raw PDF OCR.
>
> Versions of the unix utilities that run on Windows are available, but you
> can just use Microsoft Word to do what you want. Just use the find/replace
> function. In Word, you can search for a paragraph marker by looking for
> "^p" (caret p)
>
> Because you undoubtedly have real paragraphs in the document which you
> don't want to remove, I'd recommend substituting double paragraph marks
> with something unique (e.g. "@ZZZ@") before replacing all the other
> paragraph marks with a space. Then replace your unique marker with a
> paragraph.
>
> HTH,
>
> kyle
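For anyone doing the same substitution outside Word, here is a minimal Python sketch of the letter/break/letter replacement described above. It assumes the OCR text has been saved as plain text with "\n" line endings, and the function name join_broken_lines is hypothetical. One caveat: the range [A-z] also matches the ASCII punctuation characters that sit between "Z" and "a" (the square brackets, backslash, caret, underscore, and backtick), so the sketch uses [A-Za-z] instead.

```python
import re

# A single newline with an ASCII letter directly on each side.
# Assumption: the text uses "\n" line endings (not Word's paragraph marks).
_BREAK_BETWEEN_LETTERS = re.compile(r"(?<=[A-Za-z])\n(?=[A-Za-z])")


def join_broken_lines(text: str) -> str:
    """Replace a line break that sits directly between two letters with a space."""
    return _BREAK_BETWEEN_LETTERS.sub(" ", text)


if __name__ == "__main__":
    sample = "an annotated\nbibliography.\n\nNext entry"
    print(join_broken_lines(sample))
```

Breaks that follow punctuation (a comma or period at the end of a line, for example) are left alone by this pattern, which is probably the kind of cleanup that still has to be done afterwards.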
Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
On Tue, Aug 4, 2015 at 6:09 AM, Matt Sherman wrote:
> I am on Windows machines, so I don't have quite the easy access to
> that useful command. Someone had earlier put the OCR in a doc file so
> I've been playing with that more than with the raw PDF OCR.

Versions of the unix utilities that run on Windows are available, but you
can just use Microsoft Word to do what you want. Just use the find/replace
function. In Word, you can search for a paragraph marker by looking for
"^p" (caret p)

Because you undoubtedly have real paragraphs in the document which you
don't want to remove, I'd recommend substituting double paragraph marks
with something unique (e.g. "@ZZZ@") before replacing all the other
paragraph marks with a space. Then replace your unique marker with a
paragraph.

HTH,

kyle
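The three-step placeholder trick above can also be sketched in Python for plain-text OCR output. This is only an illustration under stated assumptions: the text uses "\n" line endings, real paragraphs are separated by at least one blank line, and the function name unwrap_paragraphs is hypothetical. The marker "@ZZZ@" is the one suggested in the message.

```python
import re

# The unique marker suggested in the message; it must not already occur in the text.
PLACEHOLDER = "@ZZZ@"


def unwrap_paragraphs(text: str) -> str:
    """Join lines within a paragraph while keeping blank-line paragraph breaks."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text; pick another one")
    # Step 1: protect real paragraph breaks (two or more consecutive newlines).
    # Note that runs of three or more newlines collapse to one paragraph break.
    protected = re.sub(r"\n{2,}", PLACEHOLDER, text)
    # Step 2: every remaining newline is a mid-paragraph break; make it a space.
    joined = protected.replace("\n", " ")
    # Step 3: restore the real paragraph breaks.
    return joined.replace(PLACEHOLDER, "\n\n")


if __name__ == "__main__":
    print(unwrap_paragraphs("line one\nline two\n\nnew para\ncontinues"))
```

The check for an existing marker is there because step 3 would otherwise turn an "@ZZZ@" that was already in the text into a paragraph break.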
Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
Hm, after a little looking on someone's suggestion, it turns out I was
wrong: they are not line breaks, they are paragraph marks.

On Tue, Aug 4, 2015 at 9:21 AM, Scancella, John wrote:
> Matt,
>
> A Word document does funny things to the text since it is actually HTML (try
> opening a .doc in a plain text editor and you will see it is HTML). I would
> try and get the plain ASCII text instead, and then install Cygwin, which
> contains sed and a bunch of other useful Unix/Linux commands.
> See http://stackoverflow.com/a/127567/2896744 for more info.
>
> From: Code for Libraries On Behalf Of Matt Sherman
> Sent: Tuesday, August 04, 2015 9:09 AM
> Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>
> I am on Windows machines, so I don't have quite the easy access to
> that useful command. Someone had earlier put the OCR in a doc file so
> I've been playing with that more than with the raw PDF OCR.
>
> On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John wrote:
>> Matt,
>>
>> There are probably a dozen ways to do this, but it would be really helpful
>> to know what operating system you are on? For example, if you are using
>> Linux, you can run it through sed using
>> cat | sed 's/\n//' >>
>>
>> see http://stackoverflow.com/a/800644/2896744 for more info
>>
>> From: Code for Libraries On Behalf Of Matt Sherman
>> Sent: Monday, August 03, 2015 10:29 PM
>> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>>
>> Hi Code4Lib folks,
>>
>> I was wondering if anyone had some experience cleaning up OCR text.
>> Particularly I am trying to figure out how I can deal with the random
>> line breaks that come from OCR. I am trying to parse out a
>> bibliography with regex. I think I've figured out which queries I
>> need to run to break it up so I can make it into a tab-delimited text
>> file but I noticed that the text does the classic thing of OCR
>> inserting line breaks where they physically are on the page. This
>> will obviously be a bit of an issue since it would break the
>> annotation into a bunch of lines rather than leaving it one block so I
>> can manipulate it into a database. So I am wondering if anyone who
>> has worked with OCR text before has a suggested way to clean up those
>> line breaks without doing 300+ pages by hand? Any thoughts would be
>> welcome.
>>
>> Matt Sherman
Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
Matt,

A Word document does funny things to the text since it is actually HTML (try
opening a .doc in a plain text editor and you will see it is HTML). I would
try and get the plain ASCII text instead, and then install Cygwin, which
contains sed and a bunch of other useful Unix/Linux commands.
See http://stackoverflow.com/a/127567/2896744 for more info.

From: Code for Libraries On Behalf Of Matt Sherman
Sent: Tuesday, August 04, 2015 9:09 AM
Subject: Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

I am on Windows machines, so I don't have quite the easy access to
that useful command. Someone had earlier put the OCR in a doc file so
I've been playing with that more than with the raw PDF OCR.

On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John wrote:
> Matt,
>
> There are probably a dozen ways to do this, but it would be really helpful to
> know what operating system you are on? For example, if you are using Linux,
> you can run it through sed using
> cat | sed 's/\n//' >>
>
> see http://stackoverflow.com/a/800644/2896744 for more info
>
> From: Code for Libraries On Behalf Of Matt Sherman
> Sent: Monday, August 03, 2015 10:29 PM
> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>
> Hi Code4Lib folks,
>
> I was wondering if anyone had some experience cleaning up OCR text.
> Particularly I am trying to figure out how I can deal with the random
> line breaks that come from OCR. I am trying to parse out a
> bibliography with regex. I think I've figured out which queries I
> need to run to break it up so I can make it into a tab-delimited text
> file but I noticed that the text does the classic thing of OCR
> inserting line breaks where they physically are on the page. This
> will obviously be a bit of an issue since it would break the
> annotation into a bunch of lines rather than leaving it one block so I
> can manipulate it into a database. So I am wondering if anyone who
> has worked with OCR text before has a suggested way to clean up those
> line breaks without doing 300+ pages by hand? Any thoughts would be
> welcome.
>
> Matt Sherman
Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
I am on Windows machines, so I don't have quite the easy access to
that useful command. Someone had earlier put the OCR in a doc file so
I've been playing with that more than with the raw PDF OCR.

On Tue, Aug 4, 2015 at 8:19 AM, Scancella, John wrote:
> Matt,
>
> There are probably a dozen ways to do this, but it would be really helpful to
> know what operating system you are on? For example, if you are using Linux,
> you can run it through sed using
> cat | sed 's/\n//' >>
>
> see http://stackoverflow.com/a/800644/2896744 for more info
>
> From: Code for Libraries On Behalf Of Matt Sherman
> Sent: Monday, August 03, 2015 10:29 PM
> Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
>
> Hi Code4Lib folks,
>
> I was wondering if anyone had some experience cleaning up OCR text.
> Particularly I am trying to figure out how I can deal with the random
> line breaks that come from OCR. I am trying to parse out a
> bibliography with regex. I think I've figured out which queries I
> need to run to break it up so I can make it into a tab-delimited text
> file but I noticed that the text does the classic thing of OCR
> inserting line breaks where they physically are on the page. This
> will obviously be a bit of an issue since it would break the
> annotation into a bunch of lines rather than leaving it one block so I
> can manipulate it into a database. So I am wondering if anyone who
> has worked with OCR text before has a suggested way to clean up those
> line breaks without doing 300+ pages by hand? Any thoughts would be
> welcome.
>
> Matt Sherman
Re: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
Matt,

There are probably a dozen ways to do this, but it would be really helpful to
know what operating system you are on? For example, if you are using Linux,
you can run it through sed using
cat | sed 's/\n//' >>

see http://stackoverflow.com/a/800644/2896744 for more info

From: Code for Libraries On Behalf Of Matt Sherman
Sent: Monday, August 03, 2015 10:29 PM
Subject: [CODE4LIB] Looking for Ideas on Line Breaks in OCR Text

Hi Code4Lib folks,

I was wondering if anyone had some experience cleaning up OCR text.
Particularly I am trying to figure out how I can deal with the random
line breaks that come from OCR. I am trying to parse out a
bibliography with regex. I think I've figured out which queries I
need to run to break it up so I can make it into a tab-delimited text
file but I noticed that the text does the classic thing of OCR
inserting line breaks where they physically are on the page. This
will obviously be a bit of an issue since it would break the
annotation into a bunch of lines rather than leaving it one block so I
can manipulate it into a database. So I am wondering if anyone who
has worked with OCR text before has a suggested way to clean up those
line breaks without doing 300+ pages by hand? Any thoughts would be
welcome.

Matt Sherman
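A note on the sed one-liner in the message above: sed reads its input one line at a time and strips the trailing newline before applying the script, so a plain s/\n// normally has nothing to match. As a platform-neutral alternative (it runs the same on Windows), here is a minimal Python sketch of "replace every line break". The function names and the file paths passed to them are hypothetical, and the sketch assumes the OCR text is UTF-8 plain text.

```python
def strip_newlines(text: str, replacement: str = " ") -> str:
    """Replace every line break in text with replacement."""
    # Normalise Windows (\r\n) and bare carriage-return endings first.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.replace("\n", replacement)


def strip_newlines_in_file(src: str, dst: str) -> None:
    """Read src, remove its line breaks, and write the result to dst."""
    # newline="" disables newline translation so the endings are seen as stored.
    with open(src, encoding="utf-8", newline="") as infile:
        text = infile.read()
    with open(dst, "w", encoding="utf-8", newline="") as outfile:
        outfile.write(strip_newlines(text))


if __name__ == "__main__":
    print(strip_newlines("first line\r\nsecond line\nthird line"))
```

Like the sed approach, this removes every break, including the ones between real paragraphs, so it is only a starting point for text where paragraph boundaries do not matter.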
[CODE4LIB] Looking for Ideas on Line Breaks in OCR Text
Hi Code4Lib folks,

I was wondering if anyone had some experience cleaning up OCR text.
Particularly I am trying to figure out how I can deal with the random
line breaks that come from OCR. I am trying to parse out a
bibliography with regex. I think I've figured out which queries I
need to run to break it up so I can make it into a tab-delimited text
file but I noticed that the text does the classic thing of OCR
inserting line breaks where they physically are on the page. This
will obviously be a bit of an issue since it would break the
annotation into a bunch of lines rather than leaving it one block so I
can manipulate it into a database. So I am wondering if anyone who
has worked with OCR text before has a suggested way to clean up those
line breaks without doing 300+ pages by hand? Any thoughts would be
welcome.

Matt Sherman
