Re: Greetings and Feedback on proposal for Google Summer of Code

Rafa Haro Wed, 25 Apr 2012 07:22:59 -0700

Hi all,

Thanks for the welcome.


On 23/04/12 20:14, Rupert Westenthaler wrote:

Hi all and welcome Rafa Haro

great to see all that interest in this very important topic!

First let me admit that I am not a specialist in that topic.
Nonetheless I will try to contribute some bits to this discussion that
might help you to identify suitable methods/algorithms.


(A) single/multiple vocabulary:

One very important aspect of the Stanbol Enhancer is its adaptability to
specific domains (and even more specific company settings). Because of
that it is important to keep in mind that suggested Entities will come
from multiple vocabularies. While disambiguation within a single
vocabulary is still very important (and useful), one also needs to
consider situations where an Engine would need to decide/rank over
Entities originating from different vocabularies.

1. Spotting and Disambiguation in a single vocabulary/knowledge base:
(e.g. extraction against a Company Thesaurus that contains two
projects/customers with the same name).
2. Disambiguation of Entities from multiple Vocabularies (e.g. a
Customer of the Company thesaurus that has the same name as a Place
from DBPedia)

I totally agree. You will certainly need different disambiguation
algorithms for different domains, depending on the kind of
disambiguation information you can harvest in each domain. However,
the whole process, i.e. the architecture of the disambiguation system,
could be the same. In fact, there are some proposals in the literature
that define knowledge-base-independent disambiguation systems.


(B) Pre-requirements:

Datasets in Stanbol are very heterogeneous. So it will be important to
also provide disambiguation algorithms that operate on data that are
very easy to obtain. The most powerful algorithm is not much help
to a user who cannot provide the required data for it. In the
following I will try to come up with three examples:

1. using literals and/or relations present within the
vocabulary/knowledge base: This kind of information would be very
easily available (e.g. following links to other entities and using
their labels to build a text corpus used for disambiguation).

That would be a good first approach. However, the distribution of the
auto-generated data might not be consistent with the real dataset, since
the data generation process can only create some types of training
instances. For example, you can find labels that are always linked to
the same entry in the KB even though other entries would be suitable
link targets too. You could also find false positives. In this paper
<http://aclweb.org/anthology/I/I11/I11-1063.pdf> you can find other
approaches to generating such a dataset.

2. using mentions of entities in documents: This is e.g. available in
Wikipedia. But with more and more pages using RDFa this might also
become available for other datasets. Still, one cannot assume this
to be available for most company-related vocabularies.
3. using some kind of feedback service to learn disambiguation
(similar to what is already implemented for the topic classification
engine by ogrisel). Such a service could even span entities of multiple
vocabularies. For public datasets/documents one could even try to
share such examples with others.
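
Editorial sketch for option 1 above (literals and relations within the
vocabulary): a minimal illustration, assuming a toy in-memory vocabulary
instead of the Entityhub. The entity ids, the VOCABULARY dict and the
build_context function are hypothetical names invented for this example;
none of them are Stanbol APIs.

```python
from collections import Counter

# Hypothetical toy vocabulary: entity id -> its labels and outgoing links.
VOCABULARY = {
    "ex:ParisFrance": {"labels": ["Paris"], "links": ["ex:France", "ex:Seine"]},
    "ex:ParisTexas": {"labels": ["Paris"], "links": ["ex:Texas"]},
    "ex:France": {"labels": ["France"], "links": []},
    "ex:Seine": {"labels": ["Seine", "River Seine"], "links": []},
    "ex:Texas": {"labels": ["Texas"], "links": []},
}


def build_context(entity_id, vocabulary):
    """Collect lower-cased tokens from the labels of the entity itself and
    of every entity it links to (one hop only)."""
    tokens = Counter()
    entry = vocabulary[entity_id]
    for neighbour_id in [entity_id] + entry["links"]:
        neighbour = vocabulary.get(neighbour_id)
        if neighbour is None:
            continue  # dangling link: nothing to collect
        for label in neighbour["labels"]:
            tokens.update(label.lower().split())
    return tokens


# The two entities share the label "Paris" but get different contexts.
paris_france_context = build_context("ex:ParisFrance", VOCABULARY)
paris_texas_context = build_context("ex:ParisTexas", VOCABULARY)
```

The resulting bags of words could serve as the small text corpus per
entity that the item describes; a real vocabulary would need to decide
which properties count as labels and how many hops to follow.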

(C) What is the goal:

1. decide between Entity A and B (or in other words - correctly rank them)
2. provide a better confidence estimation (especially important if no
human is in the loop)
3. Grouping of Entities (could be interesting if there already exists
some RDFa in parsed content and we want to exploit this to detect
further entities)
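
Editorial sketch for goals 1 and 2 above: one simple way to turn raw
candidate scores into a ranking with a relative confidence. This assumes
non-negative scores and uses plain normalisation (each score's share of
the total); that choice, and the function name rank_with_confidence, are
assumptions made for illustration only, not something proposed in the
thread.

```python
def rank_with_confidence(scores):
    """Sort candidates by raw score (highest first) and attach a confidence
    equal to each score's share of the total.

    `scores` maps a candidate entity id to a non-negative raw score.
    """
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if total <= 0:
        # No evidence for any candidate: keep the order, report zero confidence.
        return [(entity, 0.0) for entity, _ in ranked]
    return [(entity, score / total) for entity, score in ranked]


ranking = rank_with_confidence({"ex:B": 1.0, "ex:A": 3.0})
```

A confidence close to 1 / (number of candidates) for the top entry would
signal that the engine could not really separate the candidates, which
matters most when no human is in the loop.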



Possible first steps (feedback very welcome)

* I assume the case (A.1/B.1/C.1) as the lowest hanging fruit and
would naively try to implement it by using Solr MLT. BTW I am
currently in the middle of adding new functionality to the Entityhub
Indexing Tool that would allow building semantic contexts (as
described by (B.1)). So tests in this direction should be possible in
the near future.
* The other algorithm mentioned by STANBOL-223 [1] could also work for
(A.2). However it would require a normalized way to obtain the
"disambiguation-context" for entities originating from different
vocabularies. For this I think it would be very helpful to normalize
the retrieval of the "disambiguation-context" across datasets (e.g.
via an ontology or a service). For (B.1) scenarios one could add
support for this to the Entityhub Indexing tool. For (B.2/B.3)
scenarios one could think about a registry-like system where
different "disambiguation-context" providers could contribute
information for Entities. WDYT? BTW: I have no idea what such a
"disambiguation-context" could look like.

It depends a lot on the knowledge base but, in general, such a
"disambiguation-context" must rely mainly on two concepts:

1. Similarity between the contexts of entity mentions in the documents
and the content of the entries in the knowledge base.

2. Due to a semantic coherence principle, the information of an entity
depends on the information of other entities. In the same way, the
relations between entities in the KB generate a semantic context which
could be used as disambiguation information.
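
Editorial sketch of the two concepts above, under stated assumptions:
contexts are plain bags of words, concept 1 is approximated by cosine
similarity, and concept 2 by the fraction of the other entities in the
document that the candidate is linked to in the KB. The function names,
the toy data and the equal weighting are all hypothetical choices for
illustration, not the method of any system mentioned in this thread.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def coherence(candidate_links, other_entities):
    """Fraction of the other entities found in the document that the
    candidate is linked to in the knowledge base."""
    if not other_entities:
        return 0.0
    return len(set(candidate_links) & set(other_entities)) / len(other_entities)


def score(mention_context, entity_context, candidate_links, other_entities, weight=0.5):
    """Weighted mix of context similarity (concept 1) and coherence (concept 2)."""
    similarity = cosine(mention_context, entity_context)
    return weight * similarity + (1 - weight) * coherence(candidate_links, other_entities)


# Toy data (hypothetical): the mention "Paris" in a sentence about France.
mention = Counter("the seine flows through paris france".split())
others_in_document = ["ex:France"]  # entities already recognised nearby

france_score = score(
    mention,
    Counter({"paris": 1, "france": 1, "seine": 2, "river": 1}),
    ["ex:France", "ex:Seine"],
    others_in_document,
)
texas_score = score(
    mention,
    Counter({"paris": 1, "texas": 1}),
    ["ex:Texas"],
    others_in_document,
)
```

With this toy input the candidate whose KB context and links match the
document scores higher than the other one; how to weight the two signals
would depend a lot on the knowledge base.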


Regards

Rafa Haro


hope this is some food for thought

best
Rupert

[1] https://issues.apache.org/jira/browse/STANBOL-223

On Mon, Apr 23, 2012 at 4:46 PM, Rafa Haro<[email protected]>  wrote:

Hi all,

First of all, I want to introduce myself. I'm Rafa Haro from Spain and I
have just joined the mailing list. I'm currently working on integrating
Stanbol into Alfresco and at the same time I'm doing research on Entity
Linking for my PhD. By coincidence, the first emails I have received are
about this field :-).

As has been noted, Entity Disambiguation is a challenging task. There
are some simple approaches that usually don't work well with complex
documents. In response to Fabian's suggestion regarding a scientific
network in this field, you should take a look at the Entity Linking task
in the Knowledge Base Population (KBP) track at the NIST conference:
http://www.nist.gov/tac/2012/KBP/index.html

This year is the fourth edition. You might be interested in taking a look
at the best proposals from the last three years. We are participating
this year.

I wouldn't mind getting involved in bringing Entity Disambiguation to
Stanbol and collaborating on the project in general. Is that possible?

Regards


On 23/04/12 16:04, Pablo Mendes wrote:

Hi all,

I think you should start with a really simple solution for this and then
improve this first simple algorithm.


This was exactly the approach taken by the DBpedia Spotlight project. We
have built a few entity linkers (a.k.a. disambiguators) based on Lucene
first, and started incrementally making them more sophisticated. If you
are a fan of not repeating work, please feel free to look at what we've
done.

http://spotlight.dbpedia.org

Our disambiguators will be integrated as EnhancementEngines in Stanbol
within the next couple of months.

If you're a fan of reimplementing things to make them better, I'd say you
should look elsewhere. There are some interesting approaches out there
that have not been open sourced, but that have papers describing their
algorithms. Implementing them would probably be more beneficial for the
community than reimplementing what we did.

Cheers,
Pablo

On Mon, Apr 23, 2012 at 3:56 PM, kritarth anand <[email protected]> wrote:

Thanks a lot, Fabian, for your inputs. I'll definitely incorporate them
into my proposal.

On Mon, Apr 23, 2012 at 7:23 PM, Fabian Christ <[email protected]> wrote:
Hi Kritarth,

I have read your proposal and building such a disambiguation engine is
a challenging task. Here are some thoughts:

- Did you think about restricting the domain, or the kind of text
that this engine would/should work best for? It is often the case that
you cannot implement a single engine that always works well. So
maybe you should think a little bit about the kind of content that you
would like to support.

- Do you have access to any scientific network? Perhaps looking in the
scientific world for published papers about entity disambiguation may
give you some ideas and would widen your view on the problem.

- I think you should start with a really simple solution for this and
then improve this first simple algorithm. Having a simple trivial
solution makes it easier to have something to compare against. Sometimes
it happens that the advanced algorithms are not any better than the
trivial ones. So try it ;)

Best,
  - Fabian

On 18 April 2012 11:02, kritarth anand <[email protected]> wrote:

Hi guys,

Hope you're doing well. I was advised by my supervisor, Dr. Rupert, that
to interest people in my application I should provide a little summary
of my proposal. Please do have a look at it below, in case you find it
interesting or want to suggest something. You may read the entire
documents

My proposal is Entity Disambiguation as an Enhancement engine in
Stanbol. You can have a look at its JIRA page,
https://issues.apache.org/jira/browse/STANBOL-223 . I propose to build
it during the summer as a part of Google Summer of Code. Any advice from
you guys is most welcome.

Kritarth

On Tue, Apr 17, 2012 at 8:36 PM, kritarth anand <[email protected]> wrote:

Hi Guys,

Hope you're doing well. Please do take out a few minutes and have a look
at my proposal. Your feedback is extremely valuable to me.

Kritarth


On Mon, Apr 16, 2012 at 12:23 AM, kritarth anand <[email protected]> wrote:
Dear Fabian,

Thanks for pointing it out.

@All

I have attached the PDF versions of my proposal and Background Info with
this mail. You may also find the proposal in this Google Document:

https://docs.google.com/document/d/1BA0x9craA2kiFn0jM-66HSS7SFCk5Q5U5gyEWaRftIk/edit

It is editable, so you can add comments there directly and build on
someone else's advice too. You can also mail me in any case.

Kritarth Anand

On Mon, Apr 16, 2012 at 12:13 AM, Fabian Christ <[email protected]> wrote:

Hi Kritarth,

and welcome to Stanbol. Could you share the proposal in any open
format like PDF, HTML, plain text or via a URL? Not all of us have
access to the newest M$ office suite.

Thanks, and looking forward to your contribution!

Best,
  - Fabian

On 15 April 2012 10:21, kritarth anand <[email protected]> wrote:
Hi,

I would like to convey my warm greetings to the entire Stanbol
community. My name is Kritarth Anand. I study Computer Science at the
Indian Institute of Technology Delhi. I am a potential candidate working
on "Entity disambiguation in Stanbol enhancement engines" as part of
Google Summer of Code. If I am successful, I'll be coordinating with you
guys.

I am writing to you all to request some feedback on my proposal, which I
have given below. You might be able to give me valuable suggestions to
improve my proposal, incorporate details, omit unnecessary ones and make
the timeline that I have suggested more realistic.

Please feel free to discuss any matters whenever you like. I have
attached two documents with this mail. One of the two is the suggested
proposal and the other gives a few details about my background.


Kritarth Anand

www.cse.iitd.ac.in/~cs5080213



--
Fabian
http://twitter.com/fctwitt



This message should be regarded as confidential. If you have received this
email in error please notify the sender and destroy it immediately.
Statements of intent shall only become binding when confirmed in hard copy
by an authorised signatory.

Zaizi Ltd is registered in England and Wales with the registration number
6440931. The Registered Office is 222 Westbourne Studios, 242 Acklam Road,
London W10 5JJ, UK.


