Re: [CODE4LIB] Looking for a script to clean up OCR text files

2014-11-23 Thread Monica Rivero

Hi Erica,

We are working on a similar project converting concert performances
from the past 20 years for our School of Music. Though we use simple
OCR for PDFs (supporting full-text searching), we are selectively
cleaning up OCR for metadata purposes. That is, we take the first page
of each PDF, extract the text, and convert that text to titles and dates.
We use simple regular expressions to remove line breaks and extra
white space.
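
The kind of regular-expression cleanup described above could be sketched roughly as follows in Python. This is only an illustration, not the actual expressions used in the project: the function name normalize_whitespace and the sample string are made up for the example.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse line breaks and runs of white space into single spaces."""
    # \s matches spaces, tabs, and line breaks, so a single substitution
    # removes line breaks and squeezes extra white space at the same time.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample of first-page OCR text (an assumption for illustration).
sample = "Spring  Concert\n  School\tof Music \n\n March 3,  1998 "
print(normalize_whitespace(sample))
```

Flattening everything to one line suits short title and date strings; it would not be appropriate where paragraph breaks need to survive.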


Here are our working guidelines: http://bit.ly/1v0c7w2
Perhaps there might be something here that could be of help to you?


Best of luck with your project!

kind regards,
Monica

Quoting Kevin Hawkins kevin.s.hawk...@ultraslavonic.info:


It sounds like there are two sorts of things you need to clean up:

a) OCR errors

b) Formatting (like unnecessary line breaks)

For the former, I understand that Adobe Acrobat and ABBYY FineReader
have built-in spellchecking tools.  PrimeOCR, an expensive OCR
package, has a related package called PrimeVerify that does this.


If you don't have any of these, you could simply open the OCR output
in a text editor with spellchecking to look for things to fix.  You
could even copy and paste into Microsoft Word and use its
spellchecker; you'd probably need to correct the source file in
parallel while reviewing it in Word.


As for formatting, this one is harder.  But instead of trying to  
solve that, I wonder if you're sure it's worth doing.  If you're  
only using the OCR to drive search of the scanned page images, why  
does it matter if there are some unnecessary line breaks in your OCR  
text?


Kevin

On 11/22/14 12:44 PM, scott bacon wrote:

Erica,

You may find what you need from OpenRefine: http://openrefine.org/



On Fri, Nov 21, 2014 at 5:15 PM, Erica FINDLEY eri...@multco.us wrote:


Greetings,

I am working on a project to digitize concert programs. These are the type
of programs you get when attending a musical concert that list performers
and details about the concert.

Since these items are text heavy we have decided to use OCR software to
output a text file that will enable full text searching in our platform.

These text files are for the most part accurate, but often have unnecessary
line breaks and pockets of extra characters and/or incorrect
capitalization. I would like to pretty them up a little bit if possible.

I am wondering if there is a script I can use on multiple files to clean
these types of things up. I don't want to have the digitization staff
manually edit each text file or have to open each one to run a macro in a
text editor.
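
A batch script of the sort asked about here might look like the minimal Python sketch below. It is an assumption-laden illustration rather than an existing tool: the function names clean_ocr_text and clean_directory are hypothetical, the files are assumed to be UTF-8 .txt files in a single folder, and blank lines are assumed to mark real paragraph breaks. It joins line breaks inside paragraphs, collapses extra white space, and drops stray control characters; it does not attempt to fix capitalization or OCR misreadings.

```python
import re
from pathlib import Path


def clean_ocr_text(text: str) -> str:
    """Join wrapped lines within paragraphs and squeeze extra white space."""
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters (form feeds etc.), keeping tab and newline.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, turn line breaks and space runs into one space.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p) + "\n"


def clean_directory(src: Path, dst: Path) -> int:
    """Clean every .txt file in src, write results to dst, return the count."""
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        (dst / path.name).write_text(clean_ocr_text(raw), encoding="utf-8")
        count += 1
    return count
```

Calling clean_directory(Path("ocr_raw"), Path("ocr_clean")) (both folder names are placeholders) would write cleaned copies to a separate folder, leaving the original OCR output untouched so the results can be spot-checked before anything is replaced.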

I have been searching online and so far haven't found anything that will
work for my situation.

thanks in advance,

*Erica Findley*
Cataloging/Metadata Librarian
Multnomah County Library
Phone: 503.988.5466
eri...@multco.us
www.multcolib.org




Digital Curation Coordinator
Digital Scholarship Services
Fondren Library, Rice University


Re: [CODE4LIB] Automated Embedded Metadata Extraction in Photographs: Possible or Pipedream?

2013-12-17 Thread Monica Rivero

Hi Shea,

Well, one option you might explore is extracting metadata from images
using exiftool (http://www.sno.phy.queensu.ca/~phil/exiftool/) to a CSV
or TXT file and then converting this file to whatever file format
(e.g., XML) your tool uses for batch import to your CMS. So, semi-automated.
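
As a rough sketch of the second half of that workflow: exiftool can write a CSV with a command such as `exiftool -csv -r photos/ > metadata.csv` (the folder and file names are placeholders), where the first column is SourceFile and the remaining columns are tag names. The Python below then turns such a CSV into a simple XML document. The records/record element names and the function names are assumptions for illustration only; the real structure would have to follow whatever the CMS batch import expects.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET


def xml_safe_tag(column: str) -> str:
    """Turn a CSV column heading into a usable XML element name."""
    tag = re.sub(r"[^A-Za-z0-9_]", "_", column)
    # Element names may not start with a digit, so prefix an underscore.
    return tag if tag[:1].isalpha() or tag[:1] == "_" else "_" + tag


def exiftool_csv_to_xml(csv_text: str) -> str:
    """Build one <record> per CSV row; element names here are hypothetical."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # Skip empty cells and any overflow cells without a heading.
            if column is None or not value:
                continue
            ET.SubElement(record, xml_safe_tag(column)).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample row in the layout that exiftool -csv produces.
sample_csv = "SourceFile,Artist,CreateDate\nphotos/img001.jpg,Jane Doe,2013:12:01 10:15:00\n"
print(exiftool_csv_to_xml(sample_csv))
```

Keeping SourceFile in each record is what lets the import step relate the metadata back to its image.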


We currently do the reverse: we embed metadata into images and then ingest
them into our IR (DSpace).


hope this helps,
Monica

On 12/17/2013 3:37 PM, Swauger,Shea wrote:

Hi all,

I'm wondering if there is a systematic method that can extract metadata 
embedded in digital photographs and then ingest that metadata into a CMS and 
relate them to their corresponding images. We currently use DigiTool, if that 
makes a difference.

Thanks!

Shea Swauger
Data Management Librarian
Colorado State University



Re: [CODE4LIB] Question for Institutional Repository Folks

2013-10-28 Thread Monica Rivero
If you have Adobe Acrobat Professional, you can use the option
File > Create > Combine files into one single PDF. This will combine the
password-protected PDF with a coversheet PDF containing the metadata you
are looking to add.


Good luck!

Monica

On 10/28/2013 1:16 PM, Matthew Sherman wrote:

Correct, it is locked only to editing.  The professor is around so I
probably should contact him as you suggest.  I was asking in case I run
into something where I cannot contact the professor, but asking him
directly is probably the best move.  As for adding it to the metadata I am
just a bit unsure, as the e-mail they sent me requested that I "Please add
this text to the pdf file:"


On Mon, Oct 28, 2013 at 2:04 PM, Jim DelRosso jd...@cornell.edu wrote:


Just to clarify: the password's only necessary to *edit* the PDF?

In my experience, most publishers are fine with required statements going
in the metadata, so long as the metadata is visible to users. That being
said, it does depend on the publisher, and their specific request.

Is it possible to contact the author directly about getting the password,
or a PDF that's not password-locked?

Jim

*Jim DelRosso, MPA, MSLIS
Digital Projects Coordinator*
*Hospitality, Labor, and Management Library*
Catherwood Library
ILR School
Cornell University
239D Ives Hall
Ithaca, NY 14853
p 607.255.8688
f 607.255.9641
e jd...@cornell.edu
www.ilr.cornell.edu
*Advancing the World of Work*


On Mon, Oct 28, 2013 at 1:50 PM, Matthew Sherman
matt.r.sher...@gmail.com wrote:


We use DSpace for our repository so any editing to the PDFs has to be done
in Acrobat before uploading.  I can add a note to the metadata in DSpace,
but I am not sure if that fulfills the permissions agreement.  I was
recently hired for this position so I do not know who provided us the file
to upload in the first place.  That is why I am asking if anyone else has
dealt with this since I am unsure if I can ever get the password.


On Mon, Oct 28, 2013 at 1:18 PM, Jim DelRosso jd...@cornell.edu wrote:


Matt,

Does the software you use generate cover pages that you can edit? Or can
you add the note to the metadata page associated with the document?

Jim



On Mon, Oct 28, 2013 at 1:13 PM, Matthew Sherman
matt.r.sher...@gmail.com wrote:


Hello Code4libbers,

I had a question for others who work with institutional repositories.
I have a file given by a professor that I have permission to post if I
add a note to the PDF, but the file is password locked.  Has anyone else
run into this problem before?  Can anyone give me some advice on how I can
edit this to add the required note to the top of the PDF?  Any advice is
welcome.

Matt Sherman








Re: [CODE4LIB] Tool to highlight differences in two files

2013-04-23 Thread Monica Rivero

Hi Wilhelmina,

We've used oXygen and TextWrangler (Mac only).

regards,
Monica

On 4/23/2013 3:24 PM, Wilhelmina Randtke wrote:

I would like to compare versions of a website scraped at different times to
see what paragraphs on a page have changed.  Does anyone here know of a
tool for holding two files side by side and noting what is the same and
what is different between the files?

It seems like any simple script to note differences in two strings of text
would work, but I don't know a tool to use.
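
A simple script of that kind could be sketched with Python's standard difflib module, as below. This is a minimal illustration under stated assumptions: the function name changed_paragraphs is hypothetical, the two inputs are assumed to be plain text already extracted from the scraped pages, and paragraphs are assumed to be separated by blank lines.

```python
import difflib


def changed_paragraphs(old: str, new: str) -> list[str]:
    """Return paragraphs removed ("- ") or added ("+ ") between two texts."""
    # Split each version into paragraphs on blank lines.
    old_paras = [p.strip() for p in old.split("\n\n") if p.strip()]
    new_paras = [p.strip() for p in new.split("\n\n") if p.strip()]
    # ndiff compares the two paragraph lists and prefixes each entry with
    # "  " (same), "- " (only in old), "+ " (only in new) or "? " (hints).
    diff = difflib.ndiff(old_paras, new_paras)
    return [entry for entry in diff if entry.startswith(("- ", "+ "))]


# Made-up page text for illustration.
old_page = "Opening hours are 9 to 5.\n\nContact us by phone."
new_page = "Opening hours are 9 to 5.\n\nContact us by email.\n\nNew branch opening soon."
for entry in changed_paragraphs(old_page, new_page):
    print(entry)
```

For a side-by-side view rather than a list, difflib.HtmlDiff in the same module can render two sequences of lines as an HTML comparison table.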

-Wilhelmina Randtke