Re: [CODE4LIB] Looking for a script to clean up OCR text files
Hi Erica,

We are working on a similar project converting concert performances from the past 20 years for our School of Music. Though we use simple OCR for PDFs (supporting full-text searching), we are selectively cleaning up OCR for metadata purposes: we take the first page of each PDF, extract the text, and convert that text to titles and dates. We use simple regular expressions to remove line breaks and extra white space. Here are our working guidelines: http://bit.ly/1v0c7w2. Perhaps there might be something here that could be of help to you?

Best of luck with your project!

Kind regards,
Monica

Quoting Kevin Hawkins kevin.s.hawk...@ultraslavonic.info:

It sounds like there are two sorts of things you need to clean up:

a) OCR errors
b) Formatting (like unnecessary line breaks)

For the former, I understand that Adobe Acrobat and ABBYY FineReader have built-in spellchecking tools. PrimeOCR, an expensive OCR package, has a related package called PrimeVerify that does this. If you don't have any of these, you could simply open the OCR output in a text editor with spellchecking to look for things to fix. You could even copy and paste into Microsoft Word and use its spellchecker; you'd probably need to correct the source file in parallel to scanning it in Word.

As for formatting, this one is harder. But instead of trying to solve that, I wonder if you're sure it's worth doing. If you're only using the OCR to drive search of the scanned page images, why does it matter if there are some unnecessary line breaks in your OCR text?

Kevin

On 11/22/14 12:44 PM, scott bacon wrote:

Erica,

You may find what you need from OpenRefine: http://openrefine.org/

On Fri, Nov 21, 2014 at 5:15 PM, Erica FINDLEY eri...@multco.us wrote:

Greetings,

I am working on a project to digitize concert programs. These are the type of programs you get when attending a musical concert that list performers and details about the concert.
Since these items are text-heavy, we have decided to use OCR software to output a text file that will enable full-text searching in our platform. These text files are for the most part accurate, but often have unnecessary line breaks and pockets of extra characters and/or incorrect capitalization. I would like to pretty them up a little bit if possible.

I am wondering if there is a script I can use on multiple files to clean these types of things up. I don't want to have the digitization staff manually edit each text file or have to open each one to run a macro in a text editor. I have been searching online and so far haven't found anything that will work for my situation.

Thanks in advance,

*Erica Findley*
Cataloging/Metadata Librarian
Multnomah County Library
Phone: 503.988.5466
eri...@multco.us
www.multcolib.org

Digital Curation Coordinator
Digital Scholarship Services
Fondren Library, Rice University
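The regular-expression cleanup mentioned in this thread (removing unnecessary line breaks and extra white space across many files without opening each one) might look roughly like the sketch below. It is only an illustration, not anyone's production script: the function names `clean_ocr_text` and `clean_directory` are made up for this example, and it assumes that a blank line marks a real paragraph break while a single line break inside a paragraph is OCR noise. That assumption may be wrong for performer lists in concert programs, so check a few files by hand first.

```python
import re
from pathlib import Path


def clean_ocr_text(text: str) -> str:
    """Collapse extra white space and join lines broken inside a paragraph.

    Assumption (not from the thread): blank lines separate paragraphs,
    single line breaks are unwanted.
    """
    # Normalize Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that hug a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Split on blank lines, then join the lines inside each paragraph.
    paragraphs = re.split(r"\n{2,}", text.strip())
    return "\n\n".join(p.replace("\n", " ") for p in paragraphs) + "\n"


def clean_directory(src_dir, dst_dir) -> int:
    """Clean every .txt file in src_dir, writing results to dst_dir.

    Originals are left untouched. Returns the number of files processed.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob("*.txt")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        (dst / path.name).write_text(clean_ocr_text(raw), encoding="utf-8")
        count += 1
    return count
```

Writing to a separate output directory keeps the raw OCR available for comparison. Fixing capitalization or stray characters would need rules specific to the collection and is not attempted here.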
Re: [CODE4LIB] Automated Embedded Metadata Extraction in Photographs: Possible or Pipedream?
Hi Shea,

Well, one option you might explore is extracting metadata from images using exiftool (http://www.sno.phy.queensu.ca/~phil/exiftool/) to a CSV or TXT file and then converting this file to whatever file format (e.g., XML) you use for batch import into your CMS. So, semi-automated. We currently do the reverse: we embed metadata into images and then ingest them into our IR (DSpace).

Hope this helps,
Monica

On 12/17/2013 3:37 PM, Swauger,Shea wrote:

Hi all,

I'm wondering if there is a systematic method that can extract metadata embedded in digital photographs and then ingest that metadata into a CMS and relate them to their corresponding images. We currently use DigiTool, if that makes a difference.

Thanks!

Shea Swauger
Data Management Librarian
Colorado State University
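As a rough illustration of the semi-automated route described above, the sketch below converts CSV produced by exiftool into a simple XML document. It assumes the CSV was generated with something like `exiftool -csv -r <image-directory> > metadata.csv`, whose first column is `SourceFile`. The function name `exif_csv_to_xml` and the `records`/`record`/`field` element names are invented for this example; a real batch import into DigiTool or another CMS would need whatever schema that system expects.

```python
import csv
import io
import xml.etree.ElementTree as ET


def exif_csv_to_xml(csv_text: str) -> str:
    """Turn exiftool-style CSV into a generic XML string.

    One <record> per image, keyed by the SourceFile column so the
    metadata can be related back to its image. Element names are
    placeholders, not a real CMS import schema.
    """
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = ET.SubElement(root, "record", source=row.get("SourceFile") or "")
        for tag, value in row.items():
            # Skip the key column, empty cells, and overflow cells
            # (DictReader stores those under the key None).
            if tag is None or tag == "SourceFile" or not value:
                continue
            field = ET.SubElement(record, "field", name=tag)
            field.text = value
    return ET.tostring(root, encoding="unicode")
```

For a real run, read the file with `open("metadata.csv", newline="", encoding="utf-8").read()` and pass the text in; mapping the resulting fields onto the CMS's own import format is a separate step.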
Re: [CODE4LIB] Question for Institutional Repository Folks
If you have Adobe Acrobat Professional, you can use the option File > Create > Combine Files into a Single PDF. This will combine the password-protected PDF with a cover sheet PDF containing the metadata you are looking to add.

Good luck!
Monica

On 10/28/2013 1:16 PM, Matthew Sherman wrote:

Correct, it is locked only to editing. The professor is around, so I probably should contact him as you suggest. I was asking in case I ran into something where I could not contact the professor, but asking him directly is probably the best move. As for adding it to the metadata, I am just a bit unsure, as the e-mail they sent me requested that I "Please add this text to the pdf file:"

On Mon, Oct 28, 2013 at 2:04 PM, Jim DelRosso jd...@cornell.edu wrote:

Just to clarify: the password's only necessary to *edit* the PDF?

In my experience, most publishers are fine with required statements going in the metadata, so long as the metadata is visible to users. That being said, it does depend on the publisher, and their specific request.

Is it possible to contact the author directly about getting the password, or a PDF that's not password-locked?

Jim

*Jim DelRosso, MPA, MSLIS Digital Projects Coordinator*
*Hospitality, Labor, and Management Library*
Catherwood Library
ILR School
Cornell University
239D Ives Hall
Ithaca, NY 14853
p 607.255.8688
f 607.255.9641
e jd...@cornell.edu
www.ilr.cornell.edu
*Advancing the World of Work*

On Mon, Oct 28, 2013 at 1:50 PM, Matthew Sherman matt.r.sher...@gmail.com wrote:

We use DSpace for our repository, so any editing of the PDFs has to be done in Acrobat before uploading. I can add a note to the metadata in DSpace, but I am not sure if that fulfills the permissions agreement. I was recently hired for this position, so I do not know who provided us the file to upload in the first place. That is why I am asking if anyone else has dealt with this, since I am unsure if I can ever get the password.
On Mon, Oct 28, 2013 at 1:18 PM, Jim DelRosso jd...@cornell.edu wrote:

Matt,

Does the software you use generate cover pages that you can edit? Or can you add the note to the metadata page associated with the document?

Jim

*Jim DelRosso, MPA, MSLIS Digital Projects Coordinator*
*Hospitality, Labor, and Management Library*
Catherwood Library
ILR School
Cornell University
239D Ives Hall
Ithaca, NY 14853
p 607.255.8688
f 607.255.9641
e jd...@cornell.edu
www.ilr.cornell.edu
*Advancing the World of Work*

On Mon, Oct 28, 2013 at 1:13 PM, Matthew Sherman matt.r.sher...@gmail.com wrote:

Hello Code4libbers,

I had a question for others who work with institutional repositories. I have a file given to me by a professor that I have permission to post if I add a note to the PDF, but the file is password locked. Has anyone else run into this problem before? Can anyone give me some advice on how I can edit this to add the required note to the top of the PDF? Any advice is welcome.

Matt Sherman
Re: [CODE4LIB] Tool to highlight differences in two files
Hi Wilhelmina,

We've used oXygen and TextWrangler (Mac only).

Regards,
Monica

On 4/23/2013 3:24 PM, Wilhelmina Randtke wrote:

I would like to compare versions of a website scraped at different times to see what paragraphs on a page have changed. Does anyone here know of a tool for holding two files side by side and noting what is the same and what is different between the files? It seems like any simple script to note differences in two strings of text would work, but I don't know a tool to use.

-Wilhelmina Randtke
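For the "simple script" route raised in the question, one possibility is Python's standard `difflib` module. The sketch below is a suggestion added for illustration, not a tool named in the thread. It compares two versions of a page paragraph by paragraph, assuming the scraped text has already been reduced to plain text with blank lines between paragraphs; the function name `changed_paragraphs` is made up for this example.

```python
import difflib
import re


def split_paragraphs(text: str) -> list:
    """Split plain text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def changed_paragraphs(old_text: str, new_text: str) -> list:
    """Return (operation, old paragraphs, new paragraphs) for each change.

    operation is "replace", "delete", or "insert", as reported by
    difflib.SequenceMatcher; unchanged paragraphs are omitted.
    """
    old_paras = split_paragraphs(old_text)
    new_paras = split_paragraphs(new_text)
    matcher = difflib.SequenceMatcher(a=old_paras, b=new_paras, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old_paras[i1:i2], new_paras[j1:j2]))
    return changes
```

If a side-by-side view is wanted, `difflib.HtmlDiff().make_file(old_lines, new_lines)` from the same module produces an HTML table with the two versions in adjacent columns.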