Re: [CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 100401
We have been using this software with very good results in our citation extraction project, CitEc (Citations in Economics) (http://citec.repec.org). The only problem we have is the quality of the input data. We are using a commercial OCR engine from Vividata Inc, but it cannot handle all types of PDFs. Does anyone have experience with conversion from PDF to ASCII?

Thanks for your help.

Regards,

On Mon, 19 Apr 2010, Min-Yen Kan wrote:

Dear all:

The ParsCit team has also been updating the ParsCit package, and is happy to announce a new version that improves on classification accuracy. This version also adds a fully integrated module for document logical structure parsing, which classifies each line of the input into one of 23 logical structure categories (e.g., page number, title, section header, figure, table, figureCaption). The structure can be extracted from either plain text or XML output files that come from an OCR engine. The version also benefits from a number of user-contributed fixes and training data.

You can either download a copy of ParsCit for your own use, or use it through a web services interface. We welcome your feedback and hope that if you use ParsCit or any other freely available reference string parsing tool, you can contribute annotated data to help make these models more robust.

ParsCit (and its online demos) are available from: http://wing.comp.nus.edu.sg/parsCit/

Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-100401.zip

Cheers,
Min

---
José Manuel Barrueco
http://www.uv.es/=barrueco
[CODE4LIB] Re: [CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 100401
Linux has ps2ascii, which extracts ASCII text from either .ps or .pdf files. Nutch (http://lucene.apache.org/nutch/) also comes with a PDF parse plugin.

/Jacob Larsen

-Original message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On behalf of Jose Manuel Barrueco
Sent: 20 April 2010 08:57
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 100401

We have been using this software with great performance in our citation extraction project: CitEc (Citations in Economics) (http://citec.repec.org). The only problem we have is related to the quality of the input data. We are using a commercial OCR engine from Vividata Inc, but it's not able to deal with all types of PDFs. Does anyone have experience with conversion from PDF to ASCII?

Thanks for your help.

Regards,

---
José Manuel Barrueco
http://www.uv.es/=barrueco
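
Jacob's ps2ascii suggestion could be scripted roughly as follows. This is only a sketch, assuming Ghostscript's ps2ascii is installed and on the PATH; the helper names (default_output, build_command, pdf_to_ascii) are made up for illustration and are not part of ParsCit or CitEc. Note that ps2ascii reads the text layer of a PDF, so image-only scans would presumably still need OCR.

```python
import shutil
import subprocess
from pathlib import Path


def default_output(pdf_path):
    """Hypothetical helper: put the text file next to the PDF, with a .txt suffix."""
    return Path(pdf_path).with_suffix(".txt")


def build_command(pdf_path, txt_path):
    """Build the argv list for `ps2ascii input output` (assumes ps2ascii is on PATH)."""
    return ["ps2ascii", str(pdf_path), str(txt_path)]


def pdf_to_ascii(pdf_path, txt_path=None):
    """Convert one PDF to text.

    Returns the output path, or None when ps2ascii is not installed.
    Raises subprocess.CalledProcessError if the conversion itself fails.
    """
    txt_path = Path(txt_path) if txt_path else default_output(pdf_path)
    if shutil.which("ps2ascii") is None:
        return None
    subprocess.run(build_command(pdf_path, txt_path), check=True)
    return txt_path
```

Calling pdf_to_ascii("paper.pdf") would write paper.txt beside the input, which could then be passed to a reference parser as plain text.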
[CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 100401
Dear all:

The ParsCit team has also been updating the ParsCit package, and is happy to announce a new version that improves on classification accuracy. This version also adds a fully integrated module for document logical structure parsing, which classifies each line of the input into one of 23 logical structure categories (e.g., page number, title, section header, figure, table, figureCaption). The structure can be extracted from either plain text or XML output files that come from an OCR engine. The version also benefits from a number of user-contributed fixes and training data.

You can either download a copy of ParsCit for your own use, or use it through a web services interface. We welcome your feedback and hope that if you use ParsCit or any other freely available reference string parsing tool, you can contribute annotated data to help make these models more robust.

ParsCit (and its online demos) are available from: http://wing.comp.nus.edu.sg/parsCit/

Current Distribution: http://wing.comp.nus.edu.sg/parsCit/parscit-100401.zip

Cheers,
Min