On Jan 20, 2012, at 4:05 PM, Bruce D'Arcus wrote:
> On Fri, Jan 20, 2012 at 9:35 AM, Sylvester Keil <[email protected]> wrote:
>
>> I wrote anystyle-parser as a freecite replacement; my idea, going forward,
>> was to turn it into a web service, like freecite, too. The ML model and the
>> feature dictionary was optimized for my use cases, but could be easily
>> improved.
>
> So just to clarify, the relevance here is in this approach, we'd need
> a really smart parser, that would allow us to deconstruct a formatting
> bibliographic entry into their component parts, and then to match that
> against CSL macros fragments, to piece together a new style.
>
> This library can provide that.
Basically. The parser is not really smart; it is based on a machine learning
model. It is currently trained mostly on a bibliography that I had to parse, for
which it yielded very good results. Because it is extremely hard to achieve
perfection, I wanted it to be really easy for everyone to train the model. (The
model itself could be further improved, too, as could the feature extraction
algorithms.)
Anyway, here's a quick example:
Anystyle.parse "Harrison, Lowell H. (1975). The Civil War in Kentucky. The
University Press of Kentucky. pp. 20, 22. ISBN 0-8131-1419-5."
Returns:
=> [{:author=>"Harrison, Lowell H.", :title=>"The Civil War in Kentucky",
:publisher=>"The University Press of Kentucky", :pages=>["pp.", "22."],
:volume=>20, :isbn=>"0-8131-1419-5", :year=>1975, "unmatched-pages"=>"22.",
:type=>:book}]
So this is pretty close, but volume 20 is wrong: the 20 belongs to the page
reference "pp. 20, 22." and was mistaken for a volume number.
Anystyle.parse 'Craig, Berry F. (August 1979). "Henry C. Burnett: Champion of
Southern Rights". The Register of the Kentucky Historical Society 77: pp.
266–274.'
This one is spot on:
=> [{:author=>"Craig, Berry F.", :title=>"Henry C. Burnett: Champion of
Southern Rights", :journal=>"The Register of the Kentucky Historical Society",
:volume=>77, :pages=>"266--274", :month=>8, :year=>1979, :type=>:article}]
But:
Anystyle.parse 'Craig, Berry F. (Autumn 2001). "The Jackson Purchase Considers
Secession: The 1861 Mayfield Convention". The Register of the Kentucky
Historical Society 99 (4): pp. 339–361.'
Returns:
=> [{:author=>"Craig, Berry F.", :date=>"(Autumn", :title=>"2001). "The Jackson
Purchase Considers Secession: The 1861 Mayfield Convention", :journal=>"The
Register of the Kentucky Historical Society", :volume=>99, :pages=>"339--361",
:number=>4, :type=>:article}]
So here the year wasn't picked up. What you need to do is train the model to
become smarter at recognizing the season-year combination, like this:
Anystyle.parser.train '<author> Craig, Berry F. </author> <date> (August 1979).
</date> <title> "Henry C. Burnett: Champion of Southern Rights". </title>
<journal> The Register of the Kentucky Historical Society </journal> <volume>
77: </volume> <pages> pp. 266–274. </pages>'
Now, the results are improved for this entry (obviously), but more importantly,
also for similarly formatted entries.
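To make it less tedious to write such training strings by hand, something like the following sketch could assemble them from labelled segments. The helper `tag_reference` is hypothetical (it is not part of anystyle-parser), and the way the Autumn 2001 entry is split into segments, including the use of a `number` tag, is only my assumption based on the keys in the output above.

```ruby
# Hypothetical helper (not part of anystyle-parser): wraps each labelled
# segment in the "<label> text </label>" markup used in the train call above.
def tag_reference(segments)
  segments.map { |label, text| "<#{label}> #{text} </#{label}>" }.join(' ')
end

# Assumed segmentation of the entry that was parsed incorrectly above.
segments = [
  [:author,  'Craig, Berry F.'],
  [:date,    '(Autumn 2001).'],
  [:title,   '"The Jackson Purchase Considers Secession: The 1861 Mayfield Convention".'],
  [:journal, 'The Register of the Kentucky Historical Society'],
  [:volume,  '99'],
  [:number,  '(4):'],
  [:pages,   'pp. 339–361.']
]

TRAINING_LINE = tag_reference(segments)
puts TRAINING_LINE
```

The resulting string could then be handed to Anystyle.parser.train in the same way as the hand-written one.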
The parser is well suited for reference parsing (especially when combined with
discovery). If you want truly perfect results, the machine learning approach is
probably not the best.
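If a known failure pattern matters a lot, one option is a small rule-based repair step applied to the parser's output. The sketch below is not something anystyle-parser does; `repair_season_year` and the `SEASONS` list are hypothetical, and the input hash simply mimics the shape of the faulty result shown above.

```ruby
# Hypothetical post-processing step (not part of anystyle-parser): if the
# parser split "(Autumn 2001)." between :date and :title, move the year back.
SEASONS = %w[Spring Summer Autumn Fall Winter].freeze

def repair_season_year(entry)
  date   = entry[:date].to_s
  title  = entry[:title].to_s
  season = SEASONS.find { |s| date.include?(s) }
  match  = title.match(/\A(\d{4})\)\.?\s*/)
  # Leave the entry untouched unless both halves of the pattern are present.
  return entry unless season && match

  entry.merge(
    date:  "#{season} #{match[1]}",
    year:  match[1].to_i,
    title: match.post_match
  )
end

parsed = {
  author: 'Craig, Berry F.',
  date:   '(Autumn',
  title:  '2001). "The Jackson Purchase Considers Secession: The 1861 Mayfield Convention',
  type:   :article
}

fixed = repair_season_year(parsed)
puts fixed[:date]
puts fixed[:year]
puts fixed[:title]
```

Rules like this only cover the cases you anticipate, so they complement training rather than replace it.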
>
>> Also, in rewriting citeproc-ruby I have started to extract all the CSL
>> functionality into a separate multi-purpose CSL API. This could be extremely
>> useful for a style editor, obviously, but it's far from finished.
>>
>> https://github.com/inukshuk/csl-ruby
>
> I was wondering about that. So what's the relationship between the
> rewritten citeproc-ruby and csl-ruby?
citeproc-ruby became really difficult to maintain and refactor, because I
originally added a lot of functionality for managing the JSON format on the one
hand and CSL elements on the other. For example, you could use the CSL locale
classes to ordinalize numbers (with gender support) etc.
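To give an idea of what gender-aware ordinalizing means, here is a rough, self-contained sketch. It is not the citeproc-ruby or csl-ruby API: the `ORDINAL_SUFFIXES` table and the `ordinalize` method are made up for illustration, with French suffixes loosely modelled on the gendered ordinal terms a CSL locale can define.

```ruby
# Hypothetical suffix table, loosely modelled on gendered ordinal terms in a
# CSL locale; this is not the csl-ruby or citeproc-ruby API.
ORDINAL_SUFFIXES = {
  fr: {
    1        => { masculine: 'er', feminine: 're' },
    :default => { masculine: 'e',  feminine: 'e' }
  }
}.freeze

# Looks up a number-specific suffix first, then falls back to the default.
def ordinalize(number, locale: :fr, gender: :masculine)
  table = ORDINAL_SUFFIXES.fetch(locale)
  forms = table.fetch(number, table[:default])
  "#{number}#{forms.fetch(gender)}"
end

puts ordinalize(1, gender: :masculine)  # prints 1er
puts ordinalize(1, gender: :feminine)   # prints 1re
puts ordinalize(2, gender: :feminine)   # prints 2e
```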
So now my approach is to have a separate processor API (which contains all the
JSON functionality, like date parsing etc.), a processor (the Ruby processor,
which needs to be rewritten, or citeproc-js embedded into Ruby), and the
CSL API. The CSL API basically allows you to parse, create and interact with
the individual CSL elements.
_______________________________________________ xbiblio-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/xbiblio-devel
