Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-23 Thread Fitchett, Deborah
For turning a bibliography into RIS format, I wrote a tool based on a whole 
pile of regex commands bundled into sed files wrapped in an AppleScript app:

Webpage: http://deborahfitchett.com/toys/ref2ris/ 
Code4Lib article: http://journal.code4lib.org/articles/6286

Let me know if you've got questions about using/adapting it. Both of those 
links also list other tools I found trying to do similar things.
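
To give a flavour of the approach (this isn't code from ref2ris, just a minimal Python sketch of the same regex-to-RIS idea, built around one made-up "Author (Year). Title. Journal, Volume, Pages." pattern; a real bibliography needs a whole pile of such rules, which is what the sed files accumulate):

import re

# One invented citation pattern, e.g. "Smith, J. (2001). Parsing citations. Journal of Examples, 12, 34-56."
CITATION = re.compile(
    r"(?P<author>[^(]+)\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>[^.]+)\.\s*"
    r"(?P<journal>[^,]+),\s*(?P<volume>\d+),\s*(?P<pages>[\d-]+)\."
)

def to_ris(line):
    """Convert one citation line into a RIS record, or return None if it doesn't match."""
    m = CITATION.match(line.strip())
    if not m:
        return None
    pages = m.group("pages").split("-")
    return "\n".join([
        "TY  - JOUR",
        "AU  - " + m.group("author").strip().rstrip(","),
        "PY  - " + m.group("year"),
        "TI  - " + m.group("title").strip(),
        "JO  - " + m.group("journal").strip(),
        "VL  - " + m.group("volume"),
        "SP  - " + pages[0],
        "EP  - " + pages[-1],
        "ER  - ",
    ])

print(to_ris("Smith, J. (2001). Parsing citations. Journal of Examples, 12, 34-56."))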

Deborah

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Lease Morgan
Sent: Friday, 19 June 2015 5:04 a.m.
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata 
and/or a Database

On Jun 18, 2015, at 12:02 PM, Matt Sherman  wrote:

> I am working with a colleague on a side project which involves some 
> scanned bibliographies and making them more web 
> searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects 
> we need, I am at a bit of a loss on how to automate the process of 
> putting the bibliography in a more structured format so that we can 
> avoid going through hundreds of pages by hand.  I am pretty sure 
> regular expressions are needed, but I have not had an instance where I 
> need to automate extracting data from one file type (PDF OCR or text 
> extracted to Word doc) and place it into another (either a database or 
> an XML file) with some enrichment.  I would appreciate any suggestions 
> for approaches or tools to look into.  Thanks for any help/thoughts people 
> can give.


If I understand your question correctly, then you have two problems to address: 
1) converting PDF, Word, etc. files into plain text, and 2) marking up the 
result (which is a bibliography) into structured data. Correct?

If so, then if your PDF documents have already been OCRed, or if you have other 
files, then you can probably feed them to TIKA to quickly and easily extract 
the underlying plain text. [1] I wrote a brain-dead shell script to run TIKA in 
server mode and then convert Word (.docx) files. [2]

When it comes to marking up the result into structured data, well, good luck. I 
think such an application is something Library Land sought for a long time. 
"Can you say Holy Grail?"

[1] Tika - https://tika.apache.org
[2] brain-dead script - 
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

—
Eric




Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-19 Thread Kevin Hawkins
See also http://wiki.tei-c.org/index.php/Heuristics , which discusses 
this problem more broadly conceived.  I've just added a link to the 
archives of this very discussion.  --Kevin


On 6/18/15 12:52 PM, Matt Sherman wrote:

The hope is to take these bibliographies and put them into more of a web
searchable/sortable format for researchers to make use of them.  My
colleague was taking some inspiration from the Marlowe Bibliography (
https://marlowebibliography.org/), though we are hoping to possibly get a
bit more robust with the bibliography we are working on.  The important
first step is to be able to parse the existing OCRed bibliography scans we
have into a database, possibly a custom XML format but a database will
probably be easier to append and expand down the road.

On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee wrote:

How you want to preprocess and structure the data depends on what you hope
to achieve. Can you say more about what you want the end product to look
like?

kyle

On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman wrote:

That is a pretty good summation of it yes.  I appreciate the suggestions,
this is a bit of a new realm for me and while I know what I want it to do
and the structure I want to put it in, the conversion process has been
eluding me so thanks for giving me some tools to look into.

On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:

On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:

I am working with a colleague on a side project which involves some
scanned bibliographies and making them more web
searchable/sortable/browse-able.
While I am quite familiar with the metadata and organization aspects
we need, I am at a bit of a loss on how to automate the process of
putting the bibliography in a more structured format so that we can
avoid going through hundreds of pages by hand.  I am pretty sure
regular expressions are needed, but I have not had an instance where I
need to automate extracting data from one file type (PDF OCR or text
extracted to Word doc) and place it into another (either a database or
an XML file) with some enrichment.  I would appreciate any suggestions
for approaches or tools to look into.  Thanks for any help/thoughts
people can give.

If I understand your question correctly, then you have two problems to
address: 1) converting PDF, Word, etc. files into plain text, and 2)
marking up the result (which is a bibliography) into structured data.
Correct?

If so, then if your PDF documents have already been OCRed, or if you have
other files, then you can probably feed them to TIKA to quickly and easily
extract the underlying plain text. [1] I wrote a brain-dead shell script to
run TIKA in server mode and then convert Word (.docx) files. [2]

When it comes to marking up the result into structured data, well, good
luck. I think such an application is something Library Land sought for a
long time. "Can you say Holy Grail?"

[1] Tika - https://tika.apache.org
[2] brain-dead script -
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

—
Eric







Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-19 Thread Sylvain Machefert
Hi all,
As Matt's problem is related to parsing citations, I would definitely have a 
look at the tools cited by Cindy, because going the regexp route will quickly 
become a nightmare. Even if the citations were created following a common 
reference style, there will inevitably be inconsistencies, amplified by the OCR 
process. These tools already try to deal with that, so give them a try (FreeCite 
lists other tools or libraries trying to accomplish this).
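
For anyone who wants to script it, something along these lines should work against FreeCite's HTTP interface; the endpoint path, the 'citation' form parameter and the Accept header below are assumptions to double-check against the FreeCite site before relying on them:

import requests

# Assumed endpoint and parameter names -- verify against the FreeCite documentation.
FREECITE_URL = "http://freecite.library.brown.edu/citations/create"

def parse_citation(raw_citation):
    """Send one raw citation string to FreeCite and return the parsed response."""
    resp = requests.post(
        FREECITE_URL,
        data={"citation": raw_citation},          # parameter name assumed
        headers={"Accept": "application/json"},   # the service can reportedly also return XML
    )
    resp.raise_for_status()
    return resp.json()

print(parse_citation("Morgan, E. L. (2012). Some made-up article title. Code4Lib Journal, 18."))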

Looks like a fun project btw!

Regards,
Sylvain

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Harper, 
Cynthia
Sent: 18 June 2015 19:49
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata 
and/or a Database

Eric or others, do you know of any utility that converts a PDF and retains 
coding for where the font or font style changes? Or converts a web page with 
associated CSS and notes where font styles and HTML text blocks stop and start? 
It seems that would be the starting point for recognizing citation entities.  
I've seen websites for FreeCite http://freecite.library.brown.edu/ and Parscit 
http://aye.comp.nus.edu.sg/parsCit/ through web searches, but don't know how 
close they got to the Grail before becoming legend.

Cindy Harper

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Lease Morgan
Sent: Thursday, June 18, 2015 1:04 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata 
and/or a Database

On Jun 18, 2015, at 12:02 PM, Matt Sherman  wrote:

> I am working with a colleague on a side project which involves some
> scanned bibliographies and making them more web 
> searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects
> we need, I am at a bit of a loss on how to automate the process of
> putting the bibliography in a more structured format so that we can
> avoid going through hundreds of pages by hand.  I am pretty sure
> regular expressions are needed, but I have not had an instance where I
> need to automate extracting data from one file type (PDF OCR or text
> extracted to Word doc) and place it into another (either a database or
> an XML file) with some enrichment.  I would appreciate any suggestions
> for approaches or tools to look into.  Thanks for any help/thoughts people 
> can give.


If I understand your question correctly, then you have two problems to address: 
1) converting PDF, Word, etc. files into plain text, and 2) marking up the 
result (which is a bibliography) into structured data. Correct?

If so, then if your PDF documents have already been OCRed, or if you have other 
files, then you can probably feed them to TIKA to quickly and easily extract 
the underlying plain text. [1] I wrote a brain-dead shell script to run TIKA in 
server mode and then convert Word (.docx) files. [2]

When it comes to marking up the result into structured data, well, good luck. I 
think such an application is something Library Land sought for a long time. 
"Can you say Holy Grail?"

[1] Tika - https://tika.apache.org
[2] brain-dead script - 
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

—
Eric




Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Owen Stephens
It may depend on the format of the PDF, but I’ve used the Scraperwiki Python 
Module ‘pdf2xml’ function to extract text data from PDFs in the past. There is 
a write up (not by me) at 
http://schoolofdata.org/2013/08/16/scraping-pdfs-with-python-and-the-scraperwiki-module/, 
and an example of how I’ve used it at 
https://github.com/ostephens/british_library_directory_of_library_codes/blob/master/scraper.py
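
As a rough starting point, the call looks something like the sketch below; the file name is a placeholder, and the function name (exposed as scraperwiki.pdftoxml in the versions I've seen) and the <text> element attributes should be checked against the write up and my scraper rather than taken from here:

import lxml.etree
import scraperwiki

# Convert a local PDF to XML; "bibliography.pdf" is a placeholder file name and
# the function name is assumed -- confirm against the links above.
with open("bibliography.pdf", "rb") as f:
    xml = scraperwiki.pdftoxml(f.read())

root = lxml.etree.fromstring(xml.encode("utf-8") if isinstance(xml, str) else xml)

# Each <text> element carries position and font attributes, which is what makes
# this route useful for spotting where one citation ends and the next begins.
for text_el in root.iter("text"):
    content = "".join(text_el.itertext()).strip()
    if content:
        print(text_el.get("top"), text_el.get("left"), text_el.get("font"), content)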
 


Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

> On 18 Jun 2015, at 17:02, Matt Sherman  wrote:
> 
> Hi Code4Libbers,
> 
> I am working with a colleague on a side project which involves some scanned
> bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, I am at a bit of a loss on how to automate the process of putting
> the bibliography in a more structured format so that we can avoid going
> through hundreds of pages by hand.  I am pretty sure regular expressions
> are needed, but I have not had an instance where I need to automate
> extracting data from one file type (PDF OCR or text extracted to Word doc)
> and place it into another (either a database or an XML file) with some
> enrichment.  I would appreciate any suggestions for approaches or tools to
> look into.  Thanks for any help/thoughts people can give.
> 
> Matt Sherman


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Matt Sherman
Thanks, that is interesting since we can export from the PDFs, and while
the OCR text is a little messy it is in decent shape.  I'll certainly look
into that.

On Thu, Jun 18, 2015 at 3:13 PM, Gordon, Bonnie 
wrote:

> We're actually also working on getting a bibliography from a Word Doc to a
> more structured format. We're using regular expressions in LibreOffice
> Writer to mark up the citations, then insert tabs between the elements,
> and then copy into a spreadsheet (similar to what's described in
> http://programminghistorian.org/lessons/understanding-regular-expressions ).
> However, our bibliography was originally a Word Doc, not OCRed text. This
> method is pretty reliant on consistent formatting, though, so messy OCR
> could complicate things. Another thing to note is that it's easiest when
> you know what format the citation is for (e.g., a book or article), since
> that impacts how the citation is structured.  I'd be happy to provide a
> sample citation in each step of the process.
>
> All the best,
> Bonnie
>
> On 6/18/15, 1:52 PM, "Matt Sherman" wrote:
>
> > The hope is to take these bibliographies and put them into more of a web
> > searchable/sortable format for researchers to make use of them.  My
> > colleague was taking some inspiration from the Marlowe Bibliography (
> > https://marlowebibliography.org/), though we are hoping to possibly get a
> > bit more robust with the bibliography we are working on.  The important
> > first step is to be able to parse the existing OCRed bibliography scans we
> > have into a database, possibly a custom XML format but a database will
> > probably be easier to append and expand down the road.
> >
> > On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee wrote:
> >
> > > How you want to preprocess and structure the data depends on what you
> > > hope to achieve. Can you say more about what you want the end product
> > > to look like?
> > >
> > > kyle
> > >
> > > On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman wrote:
> > >
> > > > That is a pretty good summation of it yes.  I appreciate the
> > > > suggestions, this is a bit of a new realm for me and while I know
> > > > what I want it to do and the structure I want to put it in, the
> > > > conversion process has been eluding me so thanks for giving me some
> > > > tools to look into.
> > > >
> > > > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:
> > > >
> > > > > On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
> > > > >
> > > > > > I am working with a colleague on a side project which involves
> > > > > > some scanned bibliographies and making them more web
> > > > > > searchable/sortable/browse-able.
> > > > > > While I am quite familiar with the metadata and organization
> > > > > > aspects we need, I am at a bit of a loss on how to automate the
> > > > > > process of putting the bibliography in a more structured format
> > > > > > so that we can avoid going through hundreds of pages by hand.  I
> > > > > > am pretty sure regular expressions are needed, but I have not had
> > > > > > an instance where I need to automate extracting data from one
> > > > > > file type (PDF OCR or text extracted to Word doc) and place it
> > > > > > into another (either a database or an XML file) with some
> > > > > > enrichment.  I would appreciate any suggestions for approaches or
> > > > > > tools to look into.  Thanks for any help/thoughts people can give.
> > > > >
> > > > > If I understand your question correctly, then you have two problems
> > > > > to address: 1) converting PDF, Word, etc. files into plain text,
> > > > > and 2) marking up the result (which is a bibliography) into
> > > > > structured data. Correct?
> > > > >
> > > > > If so, then if your PDF documents have already been OCRed, or if
> > > > > you have other files, then you can probably feed them to TIKA to
> > > > > quickly and easily extract the underlying plain text. [1] I wrote a
> > > > > brain-dead shell script to run TIKA in server mode and then convert
> > > > > Word (.docx) files. [2]
> > > > >
> > > > > When it comes to marking up the result into structured data, well,
> > > > > good luck. I think such an application is something Library Land
> > > > > sought for a long time. "Can you say Holy Grail?"
> > > > >
> > > > > [1] Tika - https://tika.apache.org
> > > > > [2] brain-dead script -
> > > > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> > > > >
> > > > > —
> > > > > Eric


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Gordon, Bonnie
We're actually also working on getting a bibliography from a Word Doc to a
more structured format. We're using regular expressions in LibreOffice
Writer to mark up the citations, then insert tabs between the elements,
and then copy into a spreadsheet (similar to what's described in
http://programminghistorian.org/lessons/understanding-regular-expressions).
However, our bibliography was originally a Word Doc, not OCRed text. This
method is pretty reliant on consistent formatting, though, so messy OCR
could complicate things. Another thing to note is that it's easiest when
you know what format the citation is for (e.g., a book or article), since
that impacts how the citation is structured.  I'd be happy to provide a
sample citation in each step of the process.
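
For anyone more comfortable scripting it than doing the replacements in Writer, here is a rough Python sketch of the same mark-up-then-split idea; the book-citation pattern and the file name are invented for illustration, and a real bibliography would need one pattern per citation type:

import csv
import re

# Invented pattern for "Author. Title. Place: Publisher, Year." book citations.
BOOK = re.compile(
    r"(?P<author>[^.]+)\.\s+(?P<title>[^.]+)\.\s+"
    r"(?P<place>[^:]+):\s*(?P<publisher>[^,]+),\s*(?P<year>\d{4})\."
)

def mark_up(line):
    """Split one citation into its elements, mirroring the tab-insertion step in Writer."""
    m = BOOK.match(line.strip())
    if not m:
        return None
    return list(m.group("author", "title", "place", "publisher", "year"))

citations = [
    "Doe, Jane. A Made-Up Monograph. Chicago: Example Press, 1999.",
    "Roe, Richard. Another Imaginary Book. London: Sample House, 2004.",
]

# Write tab-separated rows that can be opened directly as a spreadsheet.
with open("citations.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["author", "title", "place", "publisher", "year"])
    for c in citations:
        row = mark_up(c)
        if row:
            writer.writerow(row)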

All the best,
Bonnie



On 6/18/15, 1:52 PM, "Matt Sherman"  wrote:

> The hope is to take these bibliographies and put them into more of a web
> searchable/sortable format for researchers to make use of them.  My
> colleague was taking some inspiration from the Marlowe Bibliography (
> https://marlowebibliography.org/), though we are hoping to possibly get a
> bit more robust with the bibliography we are working on.  The important
> first step is to be able to parse the existing OCRed bibliography scans we
> have into a database, possibly a custom XML format but a database will
> probably be easier to append and expand down the road.
>
> On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee wrote:
>
> > How you want to preprocess and structure the data depends on what you
> > hope to achieve. Can you say more about what you want the end product to
> > look like?
> >
> > kyle
> >
> > On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman wrote:
> >
> > > That is a pretty good summation of it yes.  I appreciate the
> > > suggestions, this is a bit of a new realm for me and while I know what
> > > I want it to do and the structure I want to put it in, the conversion
> > > process has been eluding me so thanks for giving me some tools to look
> > > into.
> > >
> > > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:
> > >
> > > > On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
> > > >
> > > > > I am working with a colleague on a side project which involves some
> > > > > scanned bibliographies and making them more web
> > > > > searchable/sortable/browse-able.
> > > > > While I am quite familiar with the metadata and organization
> > > > > aspects we need, I am at a bit of a loss on how to automate the
> > > > > process of putting the bibliography in a more structured format so
> > > > > that we can avoid going through hundreds of pages by hand.  I am
> > > > > pretty sure regular expressions are needed, but I have not had an
> > > > > instance where I need to automate extracting data from one file
> > > > > type (PDF OCR or text extracted to Word doc) and place it into
> > > > > another (either a database or an XML file) with some enrichment.  I
> > > > > would appreciate any suggestions for approaches or tools to look
> > > > > into.  Thanks for any help/thoughts people can give.
> > > >
> > > > If I understand your question correctly, then you have two problems
> > > > to address: 1) converting PDF, Word, etc. files into plain text, and
> > > > 2) marking up the result (which is a bibliography) into structured
> > > > data. Correct?
> > > >
> > > > If so, then if your PDF documents have already been OCRed, or if you
> > > > have other files, then you can probably feed them to TIKA to quickly
> > > > and easily extract the underlying plain text. [1] I wrote a brain-dead
> > > > shell script to run TIKA in server mode and then convert Word (.docx)
> > > > files. [2]
> > > >
> > > > When it comes to marking up the result into structured data, well,
> > > > good luck. I think such an application is something Library Land
> > > > sought for a long time. "Can you say Holy Grail?"
> > > >
> > > > [1] Tika - https://tika.apache.org
> > > > [2] brain-dead script -
> > > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> > > >
> > > > —
> > > > Eric


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Harper, Cynthia
Eric or others, do you know of any utility that converts a PDF and retains 
coding for where the font or font style changes? Or converts a web page with 
associated CSS and notes where font styles and HTML text blocks stop and start? 
It seems that would be the starting point for recognizing citation entities.  
I've seen websites for FreeCite http://freecite.library.brown.edu/ and Parscit 
http://aye.comp.nus.edu.sg/parsCit/ through web searches, but don't know how 
close they got to the Grail before becoming legend.
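
What I am picturing is something like the sketch below, which walks a PDF with pdfminer.six and prints the font name(s) used on each text line; the library choice, the attribute names, and the file name are all my guess at how this could be done, not a tested recipe:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

# For each text line, report the set of fonts used; a change of font between
# lines (or within one) is a hint that a new citation element has started.
# "bibliography.pdf" is a placeholder file name.
for page in extract_pages("bibliography.pdf"):
    for element in page:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            fonts = {char.fontname for char in line if isinstance(char, LTChar)}
            print(sorted(fonts), line.get_text().strip())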

Cindy Harper

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Lease Morgan
Sent: Thursday, June 18, 2015 1:04 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata 
and/or a Database

On Jun 18, 2015, at 12:02 PM, Matt Sherman  wrote:

> I am working with a colleague on a side project which involves some 
> scanned bibliographies and making them more web 
> searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects 
> we need, I am at a bit of a loss on how to automate the process of 
> putting the bibliography in a more structured format so that we can 
> avoid going through hundreds of pages by hand.  I am pretty sure 
> regular expressions are needed, but I have not had an instance where I 
> need to automate extracting data from one file type (PDF OCR or text 
> extracted to Word doc) and place it into another (either a database or 
> an XML file) with some enrichment.  I would appreciate any suggestions 
> for approaches or tools to look into.  Thanks for any help/thoughts people 
> can give.


If I understand your question correctly, then you have two problems to address: 
1) converting PDF, Word, etc. files into plain text, and 2) marking up the 
result (which is a bibliography) into structured data. Correct?

If so, then if your PDF documents have already been OCRed, or if you have other 
files, then you can probably feed them to TIKA to quickly and easily extract 
the underlying plain text. [1] I wrote a brain-dead shell script to run TIKA in 
server mode and then convert Word (.docx) files. [2]

When it comes to marking up the result into structured data, well, good luck. I 
think such an application is something Library Land sought for a long time. 
"Can you say Holy Grail?"

[1] Tika - https://tika.apache.org
[2] brain-dead script - 
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

—
Eric


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Matt Sherman
The hope is to take these bibliographies and put them into more of a web
searchable/sortable format for researchers to make use of them.  My
colleague was taking some inspiration from the Marlowe Bibliography (
https://marlowebibliography.org/), though we are hoping to possibly get a
bit more robust with the bibliography we are working on.  The important
first step is to be able to parse the existing OCRed bibliography scans we
have into a database, possibly a custom XML format but a database will
probably be easier to append and expand down the road.

On Thu, Jun 18, 2015 at 1:11 PM, Kyle Banerjee 
wrote:

> How you want to preprocess and structure the data depends on what you hope
> to achieve. Can you say more about what you want the end product to look
> like?
>
> kyle
>
> On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman 
> wrote:
>
> > That is a pretty good summation of it yes.  I appreciate the suggestions,
> > this is a bit of a new realm for me and while I know what I want it to do
> > and the structure I want to put it in, the conversion process has been
> > eluding me so thanks for giving me some tools to look into.
> >
> > On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan wrote:
> >
> > > On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
> > >
> > > > I am working with a colleague on a side project which involves some
> > > > scanned bibliographies and making them more web
> > > > searchable/sortable/browse-able.
> > > > While I am quite familiar with the metadata and organization aspects
> > > > we need, I am at a bit of a loss on how to automate the process of
> > > > putting the bibliography in a more structured format so that we can
> > > > avoid going through hundreds of pages by hand.  I am pretty sure
> > > > regular expressions are needed, but I have not had an instance where
> > > > I need to automate extracting data from one file type (PDF OCR or
> > > > text extracted to Word doc) and place it into another (either a
> > > > database or an XML file) with some enrichment.  I would appreciate
> > > > any suggestions for approaches or tools to look into.  Thanks for any
> > > > help/thoughts people can give.
> > >
> > > If I understand your question correctly, then you have two problems to
> > > address: 1) converting PDF, Word, etc. files into plain text, and 2)
> > > marking up the result (which is a bibliography) into structured data.
> > > Correct?
> > >
> > > If so, then if your PDF documents have already been OCRed, or if you
> > > have other files, then you can probably feed them to TIKA to quickly
> > > and easily extract the underlying plain text. [1] I wrote a brain-dead
> > > shell script to run TIKA in server mode and then convert Word (.docx)
> > > files. [2]
> > >
> > > When it comes to marking up the result into structured data, well, good
> > > luck. I think such an application is something Library Land sought for
> > > a long time. "Can you say Holy Grail?"
> > >
> > > [1] Tika - https://tika.apache.org
> > > [2] brain-dead script -
> > > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> > >
> > > —
> > > Eric
> > >
> >
>


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Kyle Banerjee
How you want to preprocess and structure the data depends on what you hope
to achieve. Can you say more about what you want the end product to look
like?

kyle

On Thu, Jun 18, 2015 at 10:08 AM, Matt Sherman 
wrote:

> That is a pretty good summation of it yes.  I appreciate the suggestions,
> this is a bit of a new realm for me and while I know what I want it to do
> and the structure I want to put it in, the conversion process has been
> eluding me so thanks for giving me some tools to look into.
>
> On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan  wrote:
>
> > On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
> >
> > > I am working with a colleague on a side project which involves some
> > > scanned bibliographies and making them more web
> > > searchable/sortable/browse-able.
> > > While I am quite familiar with the metadata and organization aspects we
> > > need, I am at a bit of a loss on how to automate the process of putting
> > > the bibliography in a more structured format so that we can avoid going
> > > through hundreds of pages by hand.  I am pretty sure regular expressions
> > > are needed, but I have not had an instance where I need to automate
> > > extracting data from one file type (PDF OCR or text extracted to Word
> > > doc) and place it into another (either a database or an XML file) with
> > > some enrichment.  I would appreciate any suggestions for approaches or
> > > tools to look into.  Thanks for any help/thoughts people can give.
> >
> > If I understand your question correctly, then you have two problems to
> > address: 1) converting PDF, Word, etc. files into plain text, and 2)
> > marking up the result (which is a bibliography) into structured data.
> > Correct?
> >
> > If so, then if your PDF documents have already been OCRed, or if you have
> > other files, then you can probably feed them to TIKA to quickly and
> > easily extract the underlying plain text. [1] I wrote a brain-dead shell
> > script to run TIKA in server mode and then convert Word (.docx) files. [2]
> >
> > When it comes to marking up the result into structured data, well, good
> > luck. I think such an application is something Library Land sought for a
> > long time. "Can you say Holy Grail?"
> >
> > [1] Tika - https://tika.apache.org
> > [2] brain-dead script -
> > https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
> >
> > —
> > Eric
> >
>


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Matt Sherman
That is a pretty good summation of it yes.  I appreciate the suggestions,
this is a bit of a new realm for me and while I know what I want it to do
and the structure I want to put it in, the conversion process has been
eluding me so thanks for giving me some tools to look into.

On Thu, Jun 18, 2015 at 1:04 PM, Eric Lease Morgan  wrote:

> On Jun 18, 2015, at 12:02 PM, Matt Sherman wrote:
>
> > I am working with a colleague on a side project which involves some scanned
> > bibliographies and making them more web searchable/sortable/browse-able.
> > While I am quite familiar with the metadata and organization aspects we
> > need, I am at a bit of a loss on how to automate the process of putting
> > the bibliography in a more structured format so that we can avoid going
> > through hundreds of pages by hand.  I am pretty sure regular expressions
> > are needed, but I have not had an instance where I need to automate
> > extracting data from one file type (PDF OCR or text extracted to Word
> > doc) and place it into another (either a database or an XML file) with some
> > enrichment.  I would appreciate any suggestions for approaches or tools
> > to look into.  Thanks for any help/thoughts people can give.
>
>
> If I understand your question correctly, then you have two problems to
> address: 1) converting PDF, Word, etc. files into plain text, and 2)
> marking up the result (which is a bibliography) into structured data.
> Correct?
>
> If so, then if your PDF documents have already been OCRed, or if you have
> other files, then you can probably feed them to TIKA to quickly and easily
> extract the underlying plain text. [1] I wrote a brain-dead shell script to
> run TIKA in server mode and then convert Word (.docx) files. [2]
>
> When it comes to marking up the result into structured data, well, good
> luck. I think such an application is something Library Land sought for a
> long time. "Can you say Holy Grail?"
>
> [1] Tika - https://tika.apache.org
> [2] brain-dead script -
> https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff
>
> —
> Eric
>


Re: [CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Eric Lease Morgan
On Jun 18, 2015, at 12:02 PM, Matt Sherman  wrote:

> I am working with a colleague on a side project which involves some scanned
> bibliographies and making them more web searchable/sortable/browse-able.
> While I am quite familiar with the metadata and organization aspects we
> need, I am at a bit of a loss on how to automate the process of putting
> the bibliography in a more structured format so that we can avoid going
> through hundreds of pages by hand.  I am pretty sure regular expressions
> are needed, but I have not had an instance where I need to automate
> extracting data from one file type (PDF OCR or text extracted to Word doc)
> and place it into another (either a database or an XML file) with some
> enrichment.  I would appreciate any suggestions for approaches or tools to
> look into.  Thanks for any help/thoughts people can give.


If I understand your question correctly, then you have two problems to address: 
1) converting PDF, Word, etc. files into plain text, and 2) marking up the 
result (which is a bibliography) into structured data. Correct?

If so, then if your PDF documents have already been OCRed, or if you have other 
files, then you can probably feed them to TIKA to quickly and easily extract 
the underlying plain text. [1] I wrote a brain-dead shell script to run TIKA in 
server mode and then convert Word (.docx) files. [2]
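
(The gist has the details, but the basic server-mode interaction is just an HTTP PUT of each file to the server's /tika endpoint, which listens on port 9998 by default. A minimal sketch in Python rather than shell, with a placeholder file name:)

import requests

# Assumes a Tika server is already running locally, e.g.:
#   java -jar tika-server.jar
TIKA_URL = "http://localhost:9998/tika"

def extract_text(path):
    """PUT one file to the Tika server and return the extracted plain text."""
    with open(path, "rb") as f:
        resp = requests.put(TIKA_URL, data=f, headers={"Accept": "text/plain"})
    resp.raise_for_status()
    return resp.text

print(extract_text("bibliography.docx"))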

When it comes to marking up the result into structured data, well, good luck. I 
think such an application is something Library Land sought for a long time. 
"Can you say Holy Grail?"

[1] Tika - https://tika.apache.org
[2] brain-dead script - 
https://gist.github.com/ericleasemorgan/c4e34ffad96c0221f1ff

— 
Eric


[CODE4LIB] Desiring Advice for Converting OCR Text into Metadata and/or a Database

2015-06-18 Thread Matt Sherman
Hi Code4Libbers,

I am working with a colleague on a side project which involves some scanned
bibliographies and making them more web searchable/sortable/browse-able.
While I am quite familiar with the metadata and organization aspects we
need, I am at a bit of a loss on how to automate the process of putting
the bibliography in a more structured format so that we can avoid going
through hundreds of pages by hand.  I am pretty sure regular expressions
are needed, but I have not had an instance where I need to automate
extracting data from one file type (PDF OCR or text extracted to Word doc)
and place it into another (either a database or an XML file) with some
enrichment.  I would appreciate any suggestions for approaches or tools to
look into.  Thanks for any help/thoughts people can give.

Matt Sherman