Re: Introducing BookSpotter Enhancement Engine by Sztaki

Mihály Héder Thu, 06 Sep 2012 08:47:21 -0700

Hi!

On 5 September 2012 19:03, harish suvarna <hsuva...@gmail.com> wrote:

> Hi,
> Nice work and thanks for sharing.
>
> You had quite a good store of book titles, around 5.6 million. Why is it
> that the recall is only around 50%?
>

Well, this 5.6M is a rather small set. No one knows the total number of books
ever written, but Google estimates (conservatively) that it is at least 130
million [1].

And as you can imagine, there is a long tail effect when it comes to how
well known certain books are. This is why you won't easily cover, say, 90%
of the books even with a 50M data set.
The 5.6 million set is the smallest one I experimented with - I like this
size because it is easy to handle. To tell you the truth I was quite happy
with the 50% :)

Anyway, in the long run, it would be much more important to include book
sets for different languages. Of course, both BNB and OL have some foreign
titles, but they are mostly English.

> Are the dropped titles (60-28-13=19) missing in the book bank?
>
Most of them are missing, some of them are dropped because the author is
not mentioned (explicitly).

> Are you
> trying any more heuristics to reduce the false positives?
>

The number of false positives is not really a good measure on its own: the
confidence associated with those annotations is even more important. There
is no real problem with a false positive that has 0.001 confidence. We
should have displayed that info (we will next time).
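
To illustrate the point about confidence, here is a minimal Python sketch of
thresholding annotations. It is not the actual BookSpotter code: the
Annotation class, the filter_by_confidence function, the 0.1 threshold and
the sample values are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for an entity annotation produced by an engine."""
    title: str
    confidence: float


def filter_by_confidence(annotations, threshold=0.1):
    """Keep only annotations whose confidence reaches the threshold.

    The threshold value is an arbitrary assumption for illustration.
    """
    return [a for a in annotations if a.confidence >= threshold]


# Invented sample data: one confident match, one near-zero false positive.
annotations = [
    Annotation("Title A", 0.85),
    Annotation("Title B", 0.001),
]
kept = filter_by_confidence(annotations)
```

With a filter like this, a false positive at 0.001 confidence never reaches
the consumer, so counting it against the engine says little.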

Anyway, there are two things on my agenda:
1) restricting by author names. This is a typical false positive from text
22: http://openlibrary.org/works/OL15987840W/New_Haven
It is marked as found (confidence 0.2) because both some parts of the title
and the author can be found (New Haven Area Heritage Association: New
Haven). That is a dumb thing to do because: a) the author includes the
title b) the author and the title occurrence overlap. This can be fixed
easily.

2) better understanding of the role of order and of the token distance
between author and title. I will probably experiment with different numbers
and see how the test results change.
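
The two agenda items could be sketched roughly as follows in Python. This is
only an illustration under stated assumptions, not the engine's code: spans
are half-open (start, end) token offsets, all function names are
hypothetical, and max_distance, wrong_order_penalty and the choice of
"author before title" as the preferred order are arbitrary placeholders of
exactly the kind that would need experimenting with.

```python
def spans_overlap(a, b):
    """True if two half-open (start, end) token spans share a token."""
    return a[0] < b[1] and b[0] < a[1]


def author_contains_title(author, title):
    """True if the title tokens occur as a contiguous run in the author name."""
    a = author.lower().split()
    t = title.lower().split()
    if not t:
        return False
    return any(a[i:i + len(t)] == t for i in range(len(a) - len(t) + 1))


def is_plausible_match(author, title, author_span, title_span):
    """Item 1: reject matches where author and title are not independent."""
    if author_contains_title(author, title):
        return False
    return not spans_overlap(author_span, title_span)


def proximity_score(author_span, title_span,
                    max_distance=20, wrong_order_penalty=0.5):
    """Item 2: score in [0, 1] that decays linearly with the token gap.

    Both parameters are assumed values, not tuned ones.
    """
    a_start, a_end = author_span
    t_start, t_end = title_span
    if a_end <= t_start:        # author mentioned before the title
        gap, author_first = t_start - a_end, True
    elif t_end <= a_start:      # title mentioned before the author
        gap, author_first = a_start - t_end, False
    else:                       # overlapping spans carry no evidence
        return 0.0
    if gap > max_distance:
        return 0.0
    score = 1.0 - gap / max_distance
    return score if author_first else score * wrong_order_penalty
```

For the New Haven case, is_plausible_match("New Haven Area Heritage
Association", "New Haven", (0, 5), (0, 2)) returns False, since the author
name contains the title and the two spans overlap.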

These will happen in the next couple of weeks. Will let you know about the
results.

Cheers
Mihály

> Thanks,
> Harish
>

[1]
http://booksearch.blogspot.hu/2010/08/books-of-world-stand-up-and-be-counted.html

> On Wed, Sep 5, 2012 at 2:22 AM, Fabian Christ
> <christ.fab...@googlemail.com> wrote:
>
> > Hi,
> >
> > nice engine ;) Thanks for sharing!
> >
> > Best,
> >  - Fabian
> >
> > 2012/9/3 Anuj Kumar <anujs...@gmail.com>:
> > > That's great! Thanks for the info.
> > >
> > > Regards,
> > > Anuj
> > >
> > > On Mon, Sep 3, 2012 at 8:49 PM, Mihály Héder <hederm...@gmail.com> wrote:
> > >
> > >> Hi!
> > >>
> > >> Sure, the 5.6M titles in a HashMap take about 1.3-1.5 GB of RAM, so I
> > >> run the whole Stanbol with -Xmx2500M without issues.
> > >>
> > >> In earlier iterations I used ehcache + sophisticated custom hit
> > >> and miss handlers to save memory, but I had to realize that it creates
> > >> more performance issues than it solves in everyday setups, so I gave
> > >> up on that.
> > >>
> > >> Cheers
> > >> Mihály
> > >>
> > >> On 3 September 2012 15:58, Anuj Kumar <anujs...@gmail.com> wrote:
> > >> > Hi Mihály,
> > >> >
> > >> > Thanks a lot for sharing this. Looks good.
> > >> >
> > >> > I was curious to know the memory requirements to load the 5.6 million
> > >> > titles and the whole system to run. If you have any stats, can you
> > >> > please share that?
> > >> >
> > >> > Regards,
> > >> > Anuj
> > >> >
> > >> > On Mon, Sep 3, 2012 at 7:14 PM, Mihály Héder <hederm...@gmail.com> wrote:
> > >> >
> > >> >> Hi!
> > >> >>
> > >> >> Let me introduce BookSpotter Enhancement Engine by Sztaki:
> > >> >>
> > >> >> http://blog.iks-project.eu/introducing-bookspotter-enhancement-engine-by-sztaki/
> > >> >>
> > >> >> BookSpotter uses a selection of 5.6M titles from the British National
> > >> >> Bibliography and the Open Library.
> > >> >> It scans the incoming text, looking for titles, and in case the author
> > >> >> is also mentioned, it produces the corresponding entity annotations
> > >> >> that refer to the proper resource URIs of either BNB or OL.
> > >> >>
> > >> >> You can check the system out here:
> > >> >> http://pedia2.sztaki.hu:9090/enhancer/chain/bookspotter
> > >> >>
> > >> >> Thanks to the Early Adopter Program, I was able to buy some student
> > >> >> work hours for data cleaning and for some basic testing.
> > >> >> You might want to read the report on our test set of 25 tests:
> > >> >> http://pedia2.sztaki.hu/stanbol/bookspotter/Bookspotter_tests.pdf
> > >> >>
> > >> >> For details, see the blog post!
> > >> >>
> > >> >> Any comments are much appreciated!
> > >> >> Cheers,
> > >> >> Mihály
> > >> >>
> > >>
> >
> >
> >
> > --
> > Fabian
> > http://twitter.com/fctwitt
> >
>
>
>
> --
> Thanks
> Harish
>
