On Sep 8, 2011, at 4:03 PM, Bruce D'Arcus wrote:

> On Thu, Sep 8, 2011 at 8:08 AM, Sylvester Keil <[email protected]> wrote:
>> Dear Avram,
>>
>> I'm returning to this thread to shamelessly plug the citation parser I wrote
>> in the last couple of weeks:
>>
>> https://github.com/inukshuk/anystyle-parser
>
> Cool!
>
>> I had to parse about 8000 references and was not satisfied with the results
>> I got using ParsCit and FreeCite. The parser follows the same general
>> approach, but I've extended and improved (I hope) much of the feature
>> elicitation; I'm also using wapiti instead of CRF++, both because it has,
>> IMO, a much cleaner codebase and because I personally preferred a C over a
>> C++ implementation. In any case, wapiti is extremely fast, and my models
>> produced very encouraging results for my data once I had trained on about
>> 30 references (in addition to the CORA dataset).
>>
>> Picking up on your idea, it would be extremely easy to adapt CSL styles to
>> generate tagged output. Thus, we could automate the process of producing
>> valid training data, as you suggest.
>
> So just to understand, are you volunteering to work up a
> proof-of-concept of Simon's idea with your new tool? :-)
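The tagged training data described in the quoted message might look something like the following. This is a hypothetical sketch only; the tag names and the helper function are illustrative and do not reflect anystyle-parser's actual training format or API:

```ruby
# Hypothetical sketch: wrapping the segmented fields of a formatted
# reference in XML-style tags, the kind of labelled sequence a CRF
# toolkit like wapiti could be trained on. A CSL style (or the cite
# processor) that knows the field boundaries could emit this directly.

def tag_reference(fields)
  # fields is an ordered list of [label, text] pairs
  fields.map { |label, text| "<#{label}>#{text}</#{label}>" }.join(' ')
end

reference = [
  [:author,  'Lafferty, J., McCallum, A., and Pereira, F.'],
  [:year,    '(2001).'],
  [:title,   'Conditional random fields: probabilistic models for segmenting and labeling sequence data.'],
  [:journal, 'Proc. ICML.']
]

puts tag_reference(reference)
```

Training a model would then only require rendering a bibliography once per style, since the labels travel with the formatted text.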
It is on my (ever-growing) list of ideas to try out, yes. :) However, for the time being, I get satisfactory results by just tagging a few representative references.

Carles just suggested having the cite processor produce tagged output (instead of altering the CSL style), which, now that I think of it, would be the better approach, because it would not involve changing individual styles. I remember that there is at least one feature in CSL that involves tracking the currently processed item and monitoring which of its attributes are being requested. Perhaps it would be possible to use a similar approach during processing and then inject the tags at the end.

Having said that, however, I am not convinced that this is really necessary. In my (brief) experience, it is more efficient to improve the feature elicitation and/or the statistical model than to have a well-nigh unlimited supply of well-formed training data. In fact, badly formatted data may make for more valuable input.

Sylvester

_______________________________________________
xbiblio-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/xbiblio-devel
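The attribute-tracking idea mentioned above could be sketched roughly as follows. The class and method names here are purely illustrative assumptions, not actual citeproc or CSL processor code:

```ruby
# Hypothetical sketch: a thin proxy around a bibliographic item that
# records which attributes the processor requests, so that tags could
# be injected around the corresponding output after rendering.

class TrackedItem
  attr_reader :accessed

  def initialize(attributes)
    @attributes = attributes
    @accessed   = []     # labels in the order the processor asked for them
  end

  def [](name)
    @accessed << name
    @attributes[name]
  end
end

item = TrackedItem.new(title: 'A Theory of Parsing', year: '2011')

# A processor rendering the item would trigger lookups like these:
item[:title]
item[:year]

p item.accessed
```

The recorded access order would let a post-processing step map each rendered substring back to its field label without touching the individual styles.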
