We were also experimenting with using MoreLikeThis queries to disambiguate extracted Entities during the Semantic / NLP Hackathon at Berlin Buzzwords [1]. Basically you use the current context within the enhanced text to perform an MLT query against all occurrences of an Entity within Wikipedia. This would allow Stanbol Enhancement Engines to suggest Entities not only because of the label, type and the ranking, but also because of the context in the enhanced text. Maybe one could even use this to suggest related Entities that are not even mentioned in the text (similar to categories).
As far as I can remember this will be based on intermediate results of the data set Olivier uses for his "Universal Topic Classification experiment". So if Olivier finishes his work on this there is also a good chance that we will have the required data to continue work on the Entity disambiguation.

best
Rupert

[1] http://berlinbuzzwords.de/wiki/semantic-nlp-hackathon

On 18.11.2011, at 17:13, Alex Lopez wrote:

> I wanted to share my 2 cents about the classification using Stanbol as I had
> relatively good results applying Olivier's method (using MoreLikeThis to
> compare the input text with wikipedia abstracts) within my Stanbol instance
> running a dbpedia index:
>
> Using RemoteStreaming to classify remote plain text (in this example some RFC
> about mail) on a default Stanbol using full launcher:
>
> http://stanbolserver/solr/default/dbpedia_43k/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/rdfs:comment/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>
> or if a better index has been loaded (dbpedia) with indexed abstracts:
>
> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://www.rfc-editor.org/rfc/rfc6409.txt&mlt.fl=@en/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/rdf:type/+ref/dc:subject/
>
> Then process results: infer common broader categories, etc.
>
> Just to make some tests I extracted the most-repeated broader categories
> using all dc:subject with the above text and yielded:
>
> Internet
> Email
> Internet_protocols
> World_Wide_Web
> Application_layer_protocols
>
> Another example using a Portuguese text (bible fragment):
>
> http://stanbolserver/solr/default/dbpedia/mlt?stream.url=http://scrapmaker.com/data/wordlists/genesis/portuguese.txt&mlt.fl=@pt/dbp-ont:abstract/&mlt.interestingTerms=list&mlt.mintf=0&fl=ref/dc:subject/
>
> Categories_named_after_religious_texts
> Christian_liturgy,_rites,_and_worship_services
> Christian_theology
>
> It works for me :)
>
> However in the on-line instances I tested, the SOLR server didn't seem to be
> exposed (as it is in last Stanbol revisions) so I can't give any ready-to-see
> working example.
>

You are right, the version on dev.iks-project.eu does not yet include this feature. We had a lot of important demonstrations over the last few weeks and therefore decided not to update the version as frequently as usual.

best
Rupert

> Thanks Olivier for the great idea!
>
> On 18-11-2011 15:52, Olivier Grisel wrote:
>> 2011/11/18 Reto Bachmann-Gmür<[email protected]>:
>>> On Tue, Nov 15, 2011 at 2:09 PM, Bertrand Delacretaz<[email protected]
>>>> wrote:
>>>
>>>> On Tue, Nov 15, 2011 at 12:45 PM, Stefane Fermigier<[email protected]> wrote:
>>>>> Is online here:
>>>>>
>>>>>
>>>> http://www.slideshare.net/nuxeo/apache-stanbol-and-the-web-of-data-apachecon-2011
>>>>
>>>> I attended Olivier's presentation and was impressed by the results of
>>>> his Universal Topic Classification experiment (starting at slide 38).
>>>>
>>> The results look very impressive. Is there some documentation on how to set
>>> up this effective topic classification?
>>
>> Right now it's still a prototype using solr directly. I need to
>> refactor a bunch of stuff but that will likely be impacted by the new
>> RDF Path mapper / indexer we are gonna work on during the hackathon.
>>
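
For anyone who wants to script the MoreLikeThis requests quoted in this thread, here is a minimal Python sketch. It only assembles the request URL with the same parameters Alex used and tallies the most frequent dc:subject values from already-parsed result documents. The helper names build_mlt_url and top_categories are hypothetical, the result documents are assumed to be plain dicts keyed by field name, and the tally counts dc:subject values directly instead of inferring the broader categories Alex mentions.

```python
from collections import Counter
from urllib.parse import urlencode


def build_mlt_url(solr_core_url, stream_url, mlt_field, return_fields):
    """Assemble a MoreLikeThis URL with the parameters used in the thread.

    solr_core_url is e.g. "http://stanbolserver/solr/default/dbpedia"
    (the placeholder host from the thread, not a real server).
    """
    params = {
        "stream.url": stream_url,        # remote text to classify
        "mlt.fl": mlt_field,             # field compared against the text
        "mlt.interestingTerms": "list",
        "mlt.mintf": 0,
        "fl": " ".join(return_fields),   # fields returned per similar doc
    }
    # urlencode percent-encodes the values and turns the space into "+".
    return solr_core_url.rstrip("/") + "/mlt?" + urlencode(params)


def top_categories(docs, field="ref/dc:subject/", n=5):
    """Return the n most frequent values of `field` across the given docs.

    Assumption: each doc is a dict; the field may be missing, a single
    string, or a list of strings.
    """
    counts = Counter()
    for doc in docs:
        values = doc.get(field, [])
        if isinstance(values, str):
            values = [values]
        counts.update(values)
    return [category for category, _ in counts.most_common(n)]


url = build_mlt_url(
    "http://stanbolserver/solr/default/dbpedia",
    "http://www.rfc-editor.org/rfc/rfc6409.txt",
    "@en/dbp-ont:abstract/",
    ["ref/rdf:type/", "ref/dc:subject/"],
)
```

The generated URL percent-encodes characters such as "@", "/" and ":" in the parameter values, so it looks different from the hand-written URLs in the thread but carries the same parameters. Fetching it still requires a Stanbol instance that exposes its Solr server with remote streaming enabled.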
