Hi!

On 5 September 2012 19:03, harish suvarna <hsuva...@gmail.com> wrote:
> Hi,
> Nice work and thanks for sharing.
>
> You had quite a good store of book titles of around 5.6 million. Why is it
> that the recall is around 50%?
>

Well, this 5.6M is a rather small set. No one knows the total number of
books ever written, but Google estimates (conservatively) that it is at
least 130 million [1]. And as you can imagine, there is a long tail effect
when it comes to how well known certain books are. This is why you won't
easily cover, say, 90% of the books even with a 50M data set.

The 5.6 million set is the smallest one I experimented with - I like this
size because it is easy to handle. To tell you the truth, I was quite happy
with the 50% :)

Anyway, in the long run it would be much more important to include book
sets for different languages. Of course, both BNB and OL have some foreign
titles, but they are mostly English.

> Are the dropped titles (60-28-13=19) missing in the book bank?
>

Most of them are missing; some of them are dropped because the author is
not mentioned (explicitly).

> Are you
> trying any more heuristics to reduce the false positives?
>

The number of false positives is not a really good marker: the associated
confidence measure of those annotations is even more important. There is no
real problem with a false positive that has 0.001 confidence. We should
have displayed that info (next time).

Anyway, there are two things on my agenda:

1) Restricting by author names. This is a typical false positive from
text 22:
http://openlibrary.org/works/OL15987840W/New_Haven
It is marked as found (confidence 0.2) because both some parts of the title
and the author can be found (New Haven Area Heritage Association: New
Haven). That is a dumb thing to do because:
a) the author includes the title
b) the author and the title occurrence overlap.
This can be fixed easily.

2) Better understanding of the role of order and the token distance between
author and title.
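For illustration, the two conditions in point 1 (the author includes the title, and the author and title occurrences overlap) could be checked roughly as below. This is only a sketch, not the BookSpotter code: the class `AuthorTitleFilter`, the `Span` type, and the token-index representation of a match are all assumptions made up for this example.

```java
import java.util.Locale;

// Hypothetical helper, not part of BookSpotter or Apache Stanbol.
final class AuthorTitleFilter {

    /** Half-open token range [start, end) of a match in the scanned text. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            if (start < 0 || end <= start) {
                throw new IllegalArgumentException("span must be non-empty");
            }
            this.start = start;
            this.end = end;
        }
    }

    /** True if the two spans share at least one token (condition b). */
    static boolean overlaps(Span a, Span b) {
        return a.start < b.end && b.start < a.end;
    }

    /** Lower-cases and collapses whitespace so comparisons ignore formatting. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    /** True if the author name contains the title as a whole-word phrase (condition a). */
    static boolean authorContainsTitle(String author, String title) {
        String paddedAuthor = " " + normalize(author) + " ";
        String paddedTitle = " " + normalize(title) + " ";
        return paddedAuthor.contains(paddedTitle);
    }

    /** A candidate is kept only if neither condition a) nor b) applies. */
    static boolean isPlausible(String author, String title, Span authorSpan, Span titleSpan) {
        return !authorContainsTitle(author, title) && !overlaps(authorSpan, titleSpan);
    }

    public static void main(String[] args) {
        // The false positive from text 22: the title sits inside the author match.
        System.out.println(isPlausible(
                "New Haven Area Heritage Association", "New Haven",
                new Span(0, 5), new Span(0, 2))); // prints false

        // Title at token 0, author at tokens 3-4: no containment, no overlap.
        System.out.println(isPlausible(
                "Jane Austen", "Emma",
                new Span(3, 5), new Span(0, 1))); // prints true
    }
}
```

Rejecting such candidates outright is just one option; lowering their confidence instead would fit equally well with the remark above that confidence matters more than the raw false positive count.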
I will probably experiment with different numbers and see how the test
results change. These will happen in the next couple of weeks. I will let
you know about the results.

Cheers
Mihály

> Thanks,
> Harish
>

[1] http://booksearch.blogspot.hu/2010/08/books-of-world-stand-up-and-be-counted.html

> On Wed, Sep 5, 2012 at 2:22 AM, Fabian Christ
> <christ.fab...@googlemail.com> wrote:
>
> > Hi,
> >
> > nice engine ;) Thanks for sharing!
> >
> > Best,
> > - Fabian
> >
> > 2012/9/3 Anuj Kumar <anujs...@gmail.com>:
> > > That's great! Thanks for the info.
> > >
> > > Regards,
> > > Anuj
> > >
> > > On Mon, Sep 3, 2012 at 8:49 PM, Mihály Héder <hederm...@gmail.com> wrote:
> > >
> > >> Hi!
> > >>
> > >> Sure, the 5.6M titles in a HashMap take about 1.3-1.5 GB of RAM, so I run
> > >> the whole Stanbol with -Xmx2500M without issues.
> > >>
> > >> In earlier iterations I used ehcache + sophisticated custom hit
> > >> and miss handlers to save memory, but I had to realize that it creates
> > >> more performance issues than it solves in everyday setups, so I gave
> > >> up on that.
> > >>
> > >> Cheers
> > >> Mihály
> > >>
> > >> On 3 September 2012 15:58, Anuj Kumar <anujs...@gmail.com> wrote:
> > >> > Hi Mihály,
> > >> >
> > >> > Thanks a lot for sharing this. Looks good.
> > >> >
> > >> > I was curious to know the memory requirements to load the 5.6 million
> > >> > titles and the whole system to run. If you have any stats, can you
> > >> > please share that?
> > >> >
> > >> > Regards,
> > >> > Anuj
> > >> >
> > >> > On Mon, Sep 3, 2012 at 7:14 PM, Mihály Héder <hederm...@gmail.com> wrote:
> > >> >
> > >> >> Hi!
> > >> >>
> > >> >> let me introduce BookSpotter Enhancement Engine by Sztaki:
> > >> >>
> > >> >> http://blog.iks-project.eu/introducing-bookspotter-enhancement-engine-by-sztaki/
> > >> >>
> > >> >> Bookspotter uses a selection of 5.6M titles from the British National
> > >> >> Bibliography and the Open Library.
> > >> >> It scans the incoming text, looking for titles, and in case the author
> > >> >> is also mentioned, it produces the corresponding entity annotations
> > >> >> that refer to the proper resource URIs of either BNB or OL.
> > >> >>
> > >> >> You can check the system out here:
> > >> >> http://pedia2.sztaki.hu:9090/enhancer/chain/bookspotter
> > >> >>
> > >> >> Thanks to the Early Adopter Program, I was able to buy some student
> > >> >> work hours for data cleaning and for some basic testing.
> > >> >> You might want to read the report on our test set of 25 tests:
> > >> >> http://pedia2.sztaki.hu/stanbol/bookspotter/Bookspotter_tests.pdf
> > >> >>
> > >> >> For details, see the blog post!
> > >> >>
> > >> >> Any comments are much appreciated!
> > >> >> Cheers,
> > >> >> Mihály
> > >> >>
> >
> > --
> > Fabian
> > http://twitter.com/fctwitt
>
> --
> Thanks
> Harish