Hi all, and welcome Rafa Haro. Great to see so much interest in this very important topic!

First, let me admit that I am not a specialist in this topic. Nonetheless, I will try to contribute some points to this discussion that might help you to identify suitable methods/algorithms.

(A) Single/multiple vocabularies:

One very important aspect of the Stanbol Enhancer is its adaptability to specific domains (and even more specific company settings). Because of that, it is important to keep in mind that suggested Entities will come from multiple vocabularies. While disambiguation within a single vocabulary is still very important (and useful), one also needs to consider situations where an Engine has to decide between (or rank) Entities originating from different vocabularies.

1. Spotting and disambiguation in a single vocabulary/knowledge base (e.g. extraction against a Company Thesaurus that contains two projects/customers with the same name).
2. Disambiguation of Entities from multiple vocabularies (e.g. a Customer from the Company Thesaurus that has the same name as a Place from DBpedia).

(B) Prerequisites:

Datasets in Stanbol are very heterogeneous, so it will be important to also provide disambiguation algorithms that operate on data that are very easy to obtain. The most powerful algorithm is not much help to a user who cannot provide the data it requires. In the following I will try to come up with three examples:

1. Using literals and/or relations present within the vocabulary/knowledge base: This kind of information would be very easily available (e.g. following links to other entities and using their labels to build a text corpus used for disambiguation).
2. Using mentions of entities in documents: This is available e.g. in Wikipedia. With more and more pages using RDFa, this might also become available for other datasets. Still, one cannot assume this to be available for most company-related vocabularies.
3. Using some kind of feedback service to learn disambiguation (similar to what ogrisel already implemented for the topic classification engine). Such a service could even span entities of multiple vocabularies. For public datasets/documents one could even try to share such examples with others.

(C) What is the goal:

1. Decide between Entity A and B (or in other words: rank them correctly).
2. Provide a better confidence estimation (especially important if no human is in the loop).
3. Grouping of Entities (could be interesting if some RDFa already exists in the parsed content and we want to exploit this to detect further entities).

Possible first steps (feedback very welcome):

* I consider the case (A.1/B.1/C.1) the lowest-hanging fruit and would naively try to implement it using Solr MLT (MoreLikeThis). BTW, I am currently in the middle of adding new functionality to the Entityhub Indexing Tool that would allow building semantic contexts (as described by (B.1)). So tests in this direction should be possible in the near future.
* The other algorithm mentioned by STANBOL-223 [1] could also work for (A.2). However, it would require a normalized way to obtain the "disambiguation-context" for entities originating from different vocabularies. I think it would be very helpful to normalize the retrieval of this context across datasets (e.g. via an ontology or a service). For (B.1) scenarios one could add support for this to the Entityhub Indexing Tool. For (B.2/B.3) scenarios one could think about a registry-like system where different "disambiguation-context" providers could contribute information for Entities.

WDYT?

BTW: I have no idea yet what such a "disambiguation-context" could look like.

Hope this is some food for thought.

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-223

On Mon, Apr 23, 2012 at 4:46 PM, Rafa Haro <[email protected]> wrote:
> Hi all,
>
> First of all, I want to introduce myself.
> I'm Rafa Haro from Spain and I just arrived on the mailing list. I'm
> currently working on integrating Stanbol into Alfresco and at the same time
> I'm doing research on Entity Linking for my PhD. By coincidence, the first
> emails I have received are about this field :-).
>
> As has been noted, Entity Disambiguation is a challenging task. There are
> some simple approaches that usually don't work well with complex documents.
> In response to Fabian's suggestion regarding a scientific network in this
> field, you should take a look at the Entity Linking task in the Knowledge
> Base Population (KBP) track at the NIST Conference:
> http://www.nist.gov/tac/2012/KBP/index.html
>
> This year is the fourth edition. You might be interested in taking a look at
> the best proposals of the last three years. We are participating this year.
>
> I wouldn't mind getting involved in bringing Entity Disambiguation to Stanbol
> and collaborating on the project in general. Is that possible?
>
> Regards
>
> On 23/04/12 16:04, Pablo Mendes wrote:
>
>> Hi all,
>>
>>> I think you should start with a really simple solution for this and then
>>> improve this first simple algorithm.
>>
>> This was exactly the approach taken by the DBpedia Spotlight project. We
>> have built a few entity linkers (a.k.a. disambiguators) based on Lucene
>> first, and started incrementally making them more sophisticated. If you are
>> a fan of not repeating work, please feel free to look at what we've done.
>>
>> http://spotlight.dbpedia.org
>>
>> Our disambiguators will be integrated as EnhancementEngines in Stanbol
>> within the next couple of months.
>>
>> If you're a fan of reimplementing things to make them better, I'd say you
>> should look elsewhere. There are some interesting approaches out there that
>> have not been open sourced, but that have papers describing their
>> algorithms. Implementing them would probably be more beneficial for the
>> community than reimplementing what we did.
>>
>> Cheers,
>> Pablo
>>
>> On Mon, Apr 23, 2012 at 3:56 PM, kritarth anand <[email protected]> wrote:
>>
>>> Thanks a lot Fabian for your inputs. I'll definitely build on them in my
>>> proposal.
>>>
>>> On Mon, Apr 23, 2012 at 7:23 PM, Fabian Christ <[email protected]> wrote:
>>>
>>>> Hi Kritarth,
>>>>
>>>> I have read your proposal and building such a disambiguation engine is
>>>> a challenging task. Here are some thoughts:
>>>>
>>>> - Did you think about restricting the domain, or the kind of text
>>>> that this engine would/should work best for? It is often the case that
>>>> you cannot implement a single engine that always works well. So
>>>> maybe you should think a little bit about the kind of content that you
>>>> would like to support.
>>>>
>>>> - Do you have access to any scientific network? Perhaps looking in the
>>>> scientific world for published papers about entity disambiguation may
>>>> give you some ideas and would widen your view on the problem.
>>>>
>>>> - I think you should start with a really simple solution for this and
>>>> then improve this first simple algorithm. Having a simple trivial
>>>> solution makes it easier to have something to compare against. Sometimes
>>>> it happens that the advanced algorithms are not any better than the
>>>> trivial ones. So try it ;)
>>>>
>>>> Best,
>>>> - Fabian
>>>>
>>>> On 18 April 2012 11:02, kritarth anand <[email protected]> wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> Hope you're doing well. I was advised by my supervisor Dr. Rupert that,
>>>>> to interest people in my application, I should provide a little summary
>>>>> of my proposal. Please do have a look at it below, in case you find it
>>>>> interesting or want to suggest something. You may read
>>>>> the entire documents
>>>>>
>>>>> My proposal is Entity Disambiguation as an Enhancement engine in
>>>>> Stanbol.
>>>>> You can have a look at its JIRA page,
>>>>> https://issues.apache.org/jira/browse/STANBOL-223 . I propose to build it
>>>>> during the summer as part of Google Summer of Code. Any advice from you
>>>>> guys is most welcome.
>>>>>
>>>>> Kritarth
>>>>>
>>>>> On Tue, Apr 17, 2012 at 8:36 PM, kritarth anand <[email protected]> wrote:
>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> Hope you're doing well. Please do take a few minutes and have a look at
>>>>>> my proposal. Your feedback is extremely valuable to me.
>>>>>>
>>>>>> Kritarth
>>>>>>
>>>>>> On Mon, Apr 16, 2012 at 12:23 AM, kritarth anand <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Fabian,
>>>>>>>
>>>>>>> Thanks for pointing it out.
>>>>>>>
>>>>>>> @All
>>>>>>>
>>>>>>> I have attached the PDF versions of my proposal and Background Info to
>>>>>>> this mail. You may also find the proposal in this Google Document:
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1BA0x9craA2kiFn0jM-66HSS7SFCk5Q5U5gyEWaRftIk/edit
>>>>>>>
>>>>>>> It is editable, so you can add comments right there and build on
>>>>>>> someone else's advice too. You can also mail me anytime.
>>>>>>>
>>>>>>> Kritarth Anand
>>>>>>>
>>>>>>> On Mon, Apr 16, 2012 at 12:13 AM, Fabian Christ <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Kritarth,
>>>>>>>>
>>>>>>>> and welcome to Stanbol. Could you share the proposal in any open
>>>>>>>> format like PDF, HTML, plain text or via a URL? Not all of us have
>>>>>>>> access to the newest M$ office suite.
>>>>>>>>
>>>>>>>> Thanks, and looking forward to your contribution!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> - Fabian
>>>>>>>>
>>>>>>>> On 15 April 2012 10:21, kritarth anand <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I would like to convey my warm greetings to the entire Stanbol
>>>>>>>>> community. My name is Kritarth Anand. I study Computer Science at the
>>>>>>>>> Indian Institute of Technology Delhi. I am a potential candidate
>>>>>>>>> working on “Entity disambiguation in Stanbol enhancement engines” as
>>>>>>>>> part of Google Summer of Code. If I am successful, I'll be
>>>>>>>>> coordinating with you guys.
>>>>>>>>>
>>>>>>>>> I am writing to you all to request some feedback on my proposal, which
>>>>>>>>> I have given below. You might be able to give me valuable suggestions
>>>>>>>>> to improve my proposal, incorporate details, omit unnecessary ones and
>>>>>>>>> get more realistic with the timeline that I have suggested.
>>>>>>>>>
>>>>>>>>> Please feel free to discuss any matters whenever you like. I have
>>>>>>>>> attached two documents to this mail. One of the two is the proposal
>>>>>>>>> and the other gives some details about my background.
>>>>>>>>>
>>>>>>>>> Kritarth Anand
>>>>>>>>>
>>>>>>>>> www.cse.iitd.ac.in/~cs5080213 <http://www.cse.iitd.ac.in/%7Ecs5080213>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Fabian
>>>>>>>> http://twitter.com/fctwitt
>>>>
>>>> --
>>>> Fabian
>>>> http://twitter.com/fctwitt

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
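
P.S. A minimal sketch of the (A.1/B.1/C.1) case from the message above, in Python. It does not use Solr MLT or any Stanbol API: plain term-frequency cosine similarity stands in for a "more like this" query. The entity ids, the context strings and the function names (tokenize, cosine, rank_candidates) are all made up for illustration, and the "disambiguation-context" of an entity is simply assumed to be a bag of words built from its own labels and the labels of linked entities, as described under (B.1).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(mention_context, candidates):
    """Rank candidate entities by similarity to the text around the mention.

    candidates maps a (hypothetical) entity id to its context text,
    e.g. labels of the entity and of the entities it links to.
    Returns (entity_id, score) pairs, best match first.
    """
    query = Counter(tokenize(mention_context))
    scored = [
        (entity_id, cosine(query, Counter(tokenize(context))))
        for entity_id, context in candidates.items()
    ]
    # Sort by descending score; break ties on the id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical company thesaurus with two entities that share a name (A.1).
candidates = {
    "thesaurus:project/Phoenix": "Phoenix project migration database server upgrade",
    "thesaurus:customer/Phoenix": "Phoenix customer retail stores invoice contract",
}
ranking = rank_candidates(
    "The Phoenix migration moved the database to a new server.", candidates
)
for entity_id, score in ranking:
    print(f"{entity_id}\t{score:.3f}")
```

The scores are only meaningful relative to each other, so this covers goal (C.1) (ranking) but says little about (C.2) (confidence estimation).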
