On Jan 20, 2012, at 4:05 PM, Bruce D'Arcus wrote:

> On Fri, Jan 20, 2012 at 9:35 AM, Sylvester Keil <[email protected]> wrote:
> 
>> I wrote anystyle-parser as a freecite replacement; my idea, going forward, 
>> was to turn it into a web service, like freecite, too. The ML model and the 
>> feature dictionary were optimized for my use cases, but could easily be 
>> improved.
> 
> So just to clarify, the relevance here is that, in this approach, we'd
> need a really smart parser that would allow us to deconstruct a formatted
> bibliographic entry into its component parts, and then match those
> against CSL macro fragments to piece together a new style.
> 
> This library can provide that.

Basically. The parser is not really smart; it is based on a machine learning 
model. It is currently trained mostly on a bibliography that I had to parse, 
where it yielded very good results. Because it is extremely hard to achieve 
perfection, I wanted it to be really easy for everyone to train the model. (The 
model itself could be further improved, too, as could the feature extraction 
algorithms.)

Anyway, here's a quick example:

Anystyle.parse "Harrison, Lowell H. (1975). The Civil War in Kentucky. The 
University Press of Kentucky. pp. 20, 22. ISBN 0-8131-1419-5."

Returns:

=> [{:author=>"Harrison, Lowell H.", :title=>"The Civil War in Kentucky", 
:publisher=>"The University Press of Kentucky", :pages=>["pp.", "22."], 
:volume=>20, :isbn=>"0-8131-1419-5", :year=>1975, "unmatched-pages"=>"22.", 
:type=>:book}]

So this is pretty close, but volume 20 is wrong: the 20 belongs to the page 
list ("pp. 20, 22"), which is why "22." ends up as unmatched pages.

Anystyle.parse 'Craig, Berry F. (August 1979). "Henry C. Burnett: Champion of 
Southern Rights". The Register of the Kentucky Historical Society 77: pp. 
266–274.'

This one is spot on:

=> [{:author=>"Craig, Berry F.", :title=>"Henry C. Burnett: Champion of 
Southern Rights", :journal=>"The Register of the Kentucky Historical Society", 
:volume=>77, :pages=>"266--274", :month=>8, :year=>1979, :type=>:article}]

But:

Anystyle.parse 'Craig, Berry F. (Autumn 2001). "The Jackson Purchase Considers 
Secession: The 1861 Mayfield Convention". The Register of the Kentucky 
Historical Society 99 (4): pp. 339–361.'

Returns:

=> [{:author=>"Craig, Berry F.", :date=>"(Autumn", :title=>"2001). "The Jackson 
Purchase Considers Secession: The 1861 Mayfield Convention", :journal=>"The 
Register of the Kentucky Historical Society", :volume=>99, :pages=>"339--361", 
:number=>4, :type=>:article}]

So here the year wasn't picked up: the date stops at "(Autumn" and "2001)." 
ends up in the title. What you need to do is train the model to become smarter 
at recognizing the season-year combination, like this:

Anystyle.parser.train '<author> Craig, Berry F. </author> <date> (August 1979). 
</date> <title> "Henry C. Burnett: Champion of Southern Rights". </title> 
<journal> The Register of the Kentucky Historical Society </journal> <volume> 
77: </volume> <pages> pp. 266–274. </pages>'

Now, the results are improved for this entry (obviously), but more importantly, 
also for similarly formatted entries.
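
To make training easy, the tagged string does not have to be typed by hand. 
Here is a minimal sketch, in plain Ruby without loading the parser, of a helper 
that builds one from labelled segments. The helper name tag_reference is 
hypothetical (it is not part of anystyle-parser), and the label names are 
simply taken from the keys in the outputs above; that the model accepts exactly 
these labels is an assumption.

```ruby
# Hypothetical helper, not part of anystyle-parser: builds a tagged
# training string from an ordered list of [label, text] pairs.
def tag_reference(segments)
  segments.map { |label, text| "<#{label}> #{text} </#{label}>" }.join(' ')
end

# The season-year entry from above, segmented by hand.
# Label names are assumed from the keys in the parser output.
SEGMENTS = [
  [:author,  'Craig, Berry F.'],
  [:date,    '(Autumn 2001).'],
  [:title,   '"The Jackson Purchase Considers Secession: The 1861 Mayfield Convention".'],
  [:journal, 'The Register of the Kentucky Historical Society'],
  [:volume,  '99'],
  [:number,  '(4):'],
  [:pages,   'pp. 339–361.']
].freeze

TRAINING_STRING = tag_reference(SEGMENTS)
puts TRAINING_STRING
```

The resulting string has the same shape as the argument to the train call 
above and could be passed to it in the same way.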

The parser is well suited for reference parsing (especially when combined with 
discovery). If you want truly perfect results, the machine learning approach is 
probably not the best choice.
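
Since the results are not perfect, it can help to flag suspicious entries for 
review (and, if needed, for training). The following is only a sketch of such a 
check, run against the three results shown above. The method name 
review_reasons and the three rules are assumptions for illustration; nothing 
like this ships with anystyle-parser.

```ruby
# Hypothetical post-check, not part of anystyle-parser: lists reasons why a
# parsed reference (a Hash shaped like the outputs above) looks suspicious.
def review_reasons(reference)
  reasons = []
  # Keys such as "unmatched-pages" hold tokens the parser could not place.
  unmatched = reference.keys.select { |key| key.to_s.start_with?('unmatched-') }
  reasons << "unmatched tokens: #{unmatched.join(', ')}" unless unmatched.empty?
  reasons << 'no year found' unless reference.key?(:year)
  # A title that begins with a digit often means part of the date leaked into it.
  reasons << 'title starts with a digit' if reference[:title].to_s.match?(/\A\d/)
  reasons
end

# The three results from the examples above, copied as Ruby hashes.
HARRISON = {
  :author => 'Harrison, Lowell H.', :title => 'The Civil War in Kentucky',
  :publisher => 'The University Press of Kentucky', :pages => ['pp.', '22.'],
  :volume => 20, :isbn => '0-8131-1419-5', :year => 1975,
  'unmatched-pages' => '22.', :type => :book
}.freeze

CRAIG_1979 = {
  :author => 'Craig, Berry F.',
  :title => 'Henry C. Burnett: Champion of Southern Rights',
  :journal => 'The Register of the Kentucky Historical Society',
  :volume => 77, :pages => '266--274', :month => 8, :year => 1979,
  :type => :article
}.freeze

CRAIG_2001 = {
  :author => 'Craig, Berry F.', :date => '(Autumn',
  :title => '2001). "The Jackson Purchase Considers Secession: The 1861 Mayfield Convention',
  :journal => 'The Register of the Kentucky Historical Society',
  :volume => 99, :pages => '339--361', :number => 4, :type => :article
}.freeze

[HARRISON, CRAIG_1979, CRAIG_2001].each do |reference|
  reasons = review_reasons(reference)
  status = reasons.empty? ? 'ok' : reasons.join('; ')
  puts "#{reference[:year] || 'no year'}: #{status}"
end
```

Note that a check like this does not catch everything: the wrong volume in the 
first entry is only flagged indirectly, through the unmatched pages.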
> 
>> Also, in rewriting citeproc-ruby I have started to extract all the CSL 
>> functionality into a separate multi-purpose CSL API. This could be extremely 
>> useful for a style editor, obviously, but it's far from finished.
>> 
>> https://github.com/inukshuk/csl-ruby
> 
> I was wondering about that. So what's the relationship between the
> rewritten citeproc-ruby and csl-ruby?

citeproc-ruby became really difficult to maintain and refactor, because I 
originally added a lot of functionality for managing the JSON format on the one 
hand and CSL elements on the other. For example, you could use the CSL locale 
classes to ordinalize numbers (with gender support) etc.

So now my approach is to have a separate processor API (which contains all the 
JSON functionality, like date parsing etc.), a processor (the Ruby processor, 
which needs to be rewritten, or citeproc-js embedded into Ruby), and the 
CSL API. The CSL API basically allows you to parse, create and interact with 
the individual CSL elements.


_______________________________________________
xbiblio-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/xbiblio-devel
