Dear Avram,

I'm returning to this thread to shamelessly plug the citation parser I wrote in the last couple of weeks:

https://github.com/inukshuk/anystyle-parser

I had to parse about 8000 references and was not satisfied with the results I got using ParsCit and FreeCite. The parser follows the same general approach, but I've extended and improved (I hope) much of the feature elicitation; also, I'm using wapiti instead of libcrf++ because, IMO, it has a much cleaner codebase and because I personally preferred a C over a C++ implementation. In any case, wapiti is extremely fast and my models produced very encouraging results for my data once I had trained them on about 30 references (in addition to the CORA dataset).

Picking up on your idea, it would be extremely easy to adapt CSL styles to generate tagged output. Thus, we could automate the process of producing valid training data, as you suggest.

Anyway, I thought I'd let you (and anyone interested in parsing citation references) know about the project. If you want to try out the parser but encounter any problems, don't hesitate to contact me for help. A word of caution: if your results are not accurate right away, try to tag one or two references and train the parser; I tried to make training the parser with new references very easy.

/end shameless plug

Best,
Sylvester

On Jul 26, 2011, at 11:51 PM, Avram Lyon wrote:

> On Tue, Jul 26, 2011 at 10:36 PM, Simon Kornblith <[email protected]> wrote:
>> So, I have a crazy idea of how to shift as much of the complexity of
>> generating CSL away from the user as possible. Essentially, I want to be
>> able to copy and paste bibliography entries from a journal's reference list
>> into a box and end up with a formatted style.
>> As far as the implementation goes, we would need to:
>> 1) Convert the bibliography entries to a series of labeled fields using a
>> parser such as FreeCite.
>
> I just spent some time getting FreeCite running locally.
> The project has been largely dormant for two years or so, but there's someone
> who's been committing to a fork on Github lately, and I was able to
> get it to work on my machine pretty quickly, once I remembered my
> Rails mambo. It works somewhat better than the current hosted version
> at Brown-- it at least recognizes post-1999 dates. If we could build
> some capability for the user to override the tags, an interactive
> review, then I think it'd make a reasonable platform.
>
> I think one of the issues that FreeCite struggles with is limited
> training data-- we should be able to provide strong data on things
> like author names, place names, publishers and the like (from the data
> stores of Zotero and perhaps Mendeley), that might make the tagging
> more accurate. We can also produce tagged training data using
> citeproc-js and known inputs to give good, comprehensive descriptions
> of major patterns in citation formatting.
>
> Avram
>
> _______________________________________________
> xbiblio-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/xbiblio-devel
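
A rough illustration of the training-data idea discussed in this thread (formatting known inputs and emitting the same reference with field tags): the sketch below does not use CSL, citeproc-js, or the parser itself. The miniature STYLE table, the render function, and the sample record are all hypothetical stand-ins, and keeping the punctuation inside each tag is simply a choice made here, not a statement about any tool's training format.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical miniature "style": ordered (field, trailing punctuation) pairs.
# A real CSL style is far richer; this only stands in for one.
STYLE = [
    ("author", "."),
    ("date", "."),
    ("title", "."),
    ("journal", ","),
    ("volume", ":"),
    ("pages", "."),
]


def render(record, style=STYLE, tagged=False):
    """Format a record; with tagged=True, wrap each field in <field> tags."""
    parts = []
    for field, suffix in style:
        value = record.get(field)
        if not value:
            continue  # skip fields the record does not have
        text = f"{value}{suffix}"
        if tagged:
            text = f"<{field}>{escape(text)}</{field}>"
        parts.append(text)
    return " ".join(parts)


# Made-up record, for illustration only.
record = {
    "author": "Jane Doe",
    "date": "2011",
    "title": "An example title",
    "journal": "Journal of Examples",
    "volume": "12",
    "pages": "34-56",
}

plain = render(record)
labeled = render(record, tagged=True)
print(plain)
print(labeled)

# Stripping the tags from the labeled string gives back the plain string,
# so each pair is a consistent (input, expected labels) training example.
assert re.sub(r"</?\w+>", "", labeled) == plain
```

Running many known records through many such templates would give paired plain and labeled strings without any hand tagging, which is the automation suggested above.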
