Re: [CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 100401

2010-04-20 Thread Jose Manuel Barrueco


	We have been using this software with very good results in our
citation extraction project, CitEc (Citations in Economics)
(http://citec.repec.org). The only problem we have is the quality of
the input data. We are using a commercial OCR engine from Vividata
Inc., but it cannot handle all types of PDFs. Does anyone have
experience with converting PDF to ASCII text? Thanks for your help.
Regards,





---
José Manuel Barrueco
http://www.uv.es/=barrueco


[CODE4LIB] Re: [CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 100401

2010-04-20 Thread Jacob Larsen
Linux has ps2ascii (part of Ghostscript), which extracts ASCII text from
either .ps or .pdf files. Also, Nutch (http://lucene.apache.org/nutch/)
comes with a PDF parse plugin.

/Jacob Larsen
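
For anyone who wants to script the ps2ascii suggestion, here is a minimal
sketch in Python. It assumes Ghostscript is installed and ps2ascii is on
the PATH; the function names build_command and pdf_to_text are made up for
this illustration and are not part of ParsCit or Ghostscript. Note that
ps2ascii only pulls out text already embedded in the file, so it will
probably not help with scanned, image-only PDFs, which still need OCR.

```python
import shutil
import subprocess


def build_command(pdf_path, txt_path=None):
    """Return the ps2ascii argument list: input file, then optional output file."""
    cmd = ["ps2ascii", str(pdf_path)]
    if txt_path is not None:
        cmd.append(str(txt_path))
    return cmd


def pdf_to_text(pdf_path):
    """Run ps2ascii on one file and return whatever text it writes to stdout."""
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii not found on PATH (it ships with Ghostscript)")
    result = subprocess.run(
        build_command(pdf_path),
        capture_output=True,   # collect stdout instead of printing it
        text=True,             # decode bytes to str
        errors="replace",      # do not fail on odd bytes in the output
        check=True,            # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling pdf_to_text("paper.pdf") (a placeholder file name) should return
the extracted text as one string, which could then be written to a file
and passed on to a citation parser.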






[CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 100401

2010-04-19 Thread Min-Yen Kan
Dear all:

The ParsCit team has also been updating the ParsCit package, and is
happy to announce a new version that improves on classification
accuracy.  This version also adds a fully integrated module for
document logical structure parsing, which classifies each line of the
input into one of 23 logical structure categories (e.g., page number,
title, section header, figure, table, figureCaption).  The structure
can be extracted from either plain text or XML output files that come
from an OCR engine.  The version also benefits from a number of
user-contributed fixes and training data.

You can either download a copy of ParsCit for your own use, or use it
through a web services interface. We welcome your feedback and hope
that, if you use ParsCit or any other freely available reference string
parsing tool, you can contribute annotated data to help make these
models more robust.

ParsCit and its online demos are available from:
http://wing.comp.nus.edu.sg/parsCit/
Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-100401.zip

Cheers,

Min