On Sep 8, 2011, at 4:03 PM, Bruce D'Arcus wrote:

> On Thu, Sep 8, 2011 at 8:08 AM, Sylvester Keil <[email protected]> wrote:
>> Dear Avram,
>> 
>> I'm returning to this thread to shamelessly plug the citation parser I wrote 
>> in the last couple of weeks:
>> 
>> https://github.com/inukshuk/anystyle-parser
> 
> Cool!
> 
>> I had to parse about 8000 references and was not satisfied with the results I 
>> got using ParsCit and FreeCite. The parser follows the same general 
>> approach, but I've extended and (I hope) improved much of the feature 
>> elicitation; also, I'm using wapiti instead of libcrf++ because it has, IMO, 
>> a much cleaner codebase and because I personally preferred a C over a C++ 
>> implementation. In any case, wapiti is extremely fast, and my models produced 
>> very encouraging results for my data once I had trained on about 30 
>> references (in addition to the CORA dataset).
>> 
>> Picking up on your idea, it would be extremely easy to adapt CSL styles to 
>> generate tagged output. Thus, we could automate the process of producing 
>> valid training data, as you suggest.
> 
> So just to understand, are you volunteering to work up a
> proof-of-concept of Simon's idea with your new tool? :-)

It is on my (ever-growing) list of ideas to try out, yes. :) However, for the 
time being, I get satisfying results by just tagging a few representative 
references. Carles just suggested having the cite processor produce the tagged 
output directly (instead of altering the CSL style), which, now that I think of 
it, would be a better approach, because it would not require changing 
individual styles.
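To illustrate the idea (this is just a sketch, not actual citeproc code; the 
field names and values are made up): if the processor wrapped each rendered 
field in a tag named after its CSL variable, every formatted reference would 
double as a line of training data in the usual tagged format.

```ruby
# Hypothetical sketch: wrap each rendered field in a tag named after
# its CSL variable, producing one tagged training line. In the real
# processor the fields would come from the rendering step, not a hash.
def tagged_reference(fields)
  fields.map { |tag, value| "<#{tag}>#{value}</#{tag}>" }.join(' ')
end

puts tagged_reference(author: 'Keil, S.', date: '2011',
                      title: 'AnyStyle Parser')
```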

I remember that there is at least one feature in CSL that involves tracking 
the currently processed item and monitoring which of its attributes are 
requested. Perhaps a similar approach could be used during processing, with 
the tags injected at the end.
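Roughly what I have in mind (again a hypothetical sketch, not the actual 
processor internals): wrap the item in a proxy that records every variable 
lookup, so that after rendering we know which attributes the style requested, 
and in what order, and can place the tags accordingly.

```ruby
# Hypothetical sketch of attribute tracking: a proxy that records
# each variable lookup made against the item during rendering.
class TrackedItem
  attr_reader :accessed

  def initialize(attributes)
    @attributes = attributes
    @accessed   = []
  end

  # Record the requested key, then delegate to the underlying data.
  def [](key)
    @accessed << key
    @attributes[key]
  end
end

item = TrackedItem.new(author: 'Keil, S.', title: 'AnyStyle Parser')
item[:author]
item[:title]
p item.accessed  # the order in which the "style" requested variables
```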

That said, I am not convinced that this is really necessary. In my (brief) 
experience, it is more efficient to improve the feature elicitation and/or 
the statistical model than to amass a well-nigh unlimited supply of 
well-formed training data. In fact, badly formatted data may even make for 
more valuable input.
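For what it's worth, this is the kind of feature elicitation I mean (a 
made-up example in the general style of CRF-based reference parsers, not 
AnyStyle's actual feature set): simple surface cues per token that the 
model can combine.

```ruby
# Hypothetical token features for a CRF tagger: surface-shape cues
# such as capitalization, digits, and punctuation.
def token_features(token)
  {
    lower:       token.downcase,
    capitalized: token =~ /\A[A-Z]/          ? 1 : 0,
    all_digits:  token =~ /\A\d+\z/          ? 1 : 0,
    has_digit:   token =~ /\d/               ? 1 : 0,
    punctuation: token =~ /\A[[:punct:]]+\z/ ? 1 : 0,
    length:      token.length
  }
end

p token_features('2011')
```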

Sylvester


_______________________________________________
xbiblio-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/xbiblio-devel
