Hi Rupert,
thanks for all your observations. My comments are in the body of the message.
Best,
Luca
On 01/03/2012 23:14, Rupert Westenthaler wrote:
Hi Luca
A really interesting scenario.
On Thu, Mar 1, 2012 at 3:44 PM, Luca Dini <[email protected]> wrote:
The provision to Stanbol of classes allowing the connection with
Linguagrid (www.linguagrid.org) and possibly LanguageGrid
(http://langrid.org/en/index.html).
The verification of the extensibility of Stanbol to languages other than
English (the project will concern CVs written in French).
OK, this answers my question from the other email. Can you maybe provide
some additional information (links) about these services? What is the
license of Language Grid? I was not able to find information related
to that.
The reason is that licensing varies according to the service provider. As
you have seen, we are not the only providers via Linguagrid. As far as
our services are concerned, they are open access but not open source. In
short, this means:
1) unlimited access for research/educational purposes, with support for
integration etc.;
2) free access for "commercial purposes", with no service level guarantee;
3) paid access (subscription or pay per use) if some service level
guarantee is needed. Prices vary, of course, depending on volumes,
constraints, response times, etc.
Concerning Stanbol, as IKS is a research project we are willing to
give unlimited access to all Stanbol instances. Of course, the
limitation is the computational power of the Amazon WS instances where
Linguagrid and the related services are hosted. In case of massive
adoption and the need to activate many instances (they have a cost), we
would be forced to impose some kind of fee. But this is a future
scenario, as currently Linguagrid seems to scale rather well.
The basic goal is to provide them with an open
source document management system able to deal in an intelligent way with
non-structured CVs (or "resumes"), i.e. CVs which come in Microsoft Word,
PDF, OpenOffice, etc.
Apache Stanbol now has two EnhancementEngines for processing non-plain-text
documents:
* MetaxaEngine (mainly based on aperture.sourceforge.net)
* TikaEngine (Apache Tika)
Therefore the kinds of documents you mentioned should be supported by Stanbol.
Great.
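Just to check that I understand the integration path, here is a minimal
sketch of how we could push a binary CV to the enhancer over HTTP. It assumes
a local Stanbol launcher on port 8080 and its RESTful enhancer endpoint (the
exact path, /enhancer or /engines, may depend on the version), and that the
endpoint accepts the binary content directly, with the TikaEngine extracting
the text; please correct me if the details are wrong.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EnhanceCv {
    public static void main(String[] args) throws IOException {
        // Assumption: a local Stanbol launcher listening on port 8080,
        // with the default RESTful enhancer endpoint.
        URL url = new URL("http://localhost:8080/enhancer");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        // The TikaEngine should take care of extracting plain text from the PDF.
        con.setRequestProperty("Content-Type", "application/pdf");
        // Ask for the enhancement metadata as RDF/XML.
        con.setRequestProperty("Accept", "application/rdf+xml");

        byte[] cv = Files.readAllBytes(Paths.get("cv-example.pdf"));
        try (OutputStream out = con.getOutputStream()) {
            out.write(cv);
        }

        // Print the returned enhancement graph.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

If this is roughly correct, the same call should also work for Word and
OpenOffice files by just changing the Content-Type.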
This might represent:
* experiences of the candidate
* skills of the candidate
* education level
* reference data (name, address, etc.)
* contact data
Some of these data might be slightly more structured than just named
entities, but definitely within the representational power of RDF. Some of them
could be even more semantically enriched by providing external information
on companies, places, specific technologies, etc.
It is very easy to import data that are available as RDF into Stanbol
and use them for Entity Extraction and Linking. There is also support
for importing existing vCard files. Such data are converted to RDF by
using the schema.org schema.
As Oliver said, I think that the crucial thing will be to identify the
right reference schema. In some cases (e.g. skills) I guess that we will
be forced to have a mixed approach, as there is no standard vocabulary
for representing them.
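To make the "mixed approach" a bit more concrete, below is a rough sketch of
how a small controlled vocabulary of skills could be produced as RDF and then
imported into Stanbol for entity linking. I use Apache Jena here only as a
convenient RDF API for the example; the skill namespace and labels are
invented, precisely because no standard vocabulary exists.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.vocabulary.RDFS;

public class SkillVocabulary {
    public static void main(String[] args) {
        // Invented namespace for the example; there is no standard skill vocabulary.
        String ns = "http://example.org/hr/skill/";
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("skill", ns);

        // Each skill becomes an entity with (multilingual) labels.
        model.createResource(ns + "SQLServer")
                .addProperty(RDFS.label, "SQL Server", "en")
                .addProperty(RDFS.label, "SQL Server", "fr");
        model.createResource(ns + "Java")
                .addProperty(RDFS.label, "Java", "en");

        // Serialize as RDF/XML; this output is what we would then import into
        // Stanbol so that the linking engines can match skills in the CVs.
        model.write(System.out, "RDF/XML");
    }
}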
As a result of this, personnel at the HR department would be able to
formulate queries such as (just an exemplification):
* all CVs of people living in Paris older than 27 years
* all CVs of people with skills in SQL Server and Java
* all people who have worked in a high tech company since November 2011.
Do you plan to use the Apache Contenthub for Semantic Search, or does
the CMS you use already support such kinds of searches?
On this matter, I will write a separate email. Basically the answer is
that we are open to suggestions.
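Just to make our expectations concrete: once the CV data are available as
RDF, the queries listed above could be expressed, for instance, in SPARQL.
The cv: vocabulary below is invented purely for illustration (Jena is used
here only to check that the query parses):

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryFactory;

public class CvQueryExample {
    public static void main(String[] args) {
        // Invented cv: vocabulary, for illustration only.
        String sparql =
            "PREFIX cv: <http://example.org/hr/cv/>\n" +
            "SELECT ?cv WHERE {\n" +
            "  ?cv cv:city \"Paris\" ;\n" +
            "      cv:age ?age .\n" +
            "  FILTER (?age > 27)\n" +
            "}";
        // Parse the query to make sure it is syntactically valid.
        Query query = QueryFactory.create(sparql);
        System.out.println(query);
    }
}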
Challenges
From a technical point of view, the most interesting challenge consists in
integrating the set of Stanbol enhancers with the semantic web services
provided at www.linguagrid.org. In principle it should not be a different
integration from what has already been done with the OpenCalais WS and the
Zemanta WS. However, there are at least two major challenges:
Multilinguality. The extraction will consider French documents rather
than English ones. Moreover, in a second phase (not covered by the present
project), the whole system could be extended to Italian and French.
Stanbol already nicely supports multilingual scenarios. The LangId
engine can be used to detect the language of a document (internally it
uses Apache Tika) and store the detected language in the metadata.
Other engines can use this language for further processing.
That's great: probably my consideration of multilinguality as a
challenge was due to the fact that most integrated linguistic
engines were dealing with English. I was also wondering if the
strategies for matching a given named entity with e.g. a DBpedia URL are
completely language independent.
When dealing with French you might want to update the configuration of
the SolrCore used to store the controlled vocabulary with French-specific
settings such as stop words, stemmers, etc. This will
improve the results for the NamedEntityTaggingEngine and the
KeywordLinkingEngine.
I understand this for the KeywordLinkingEngine, but not completely for
the NamedEntityTaggingEngine. In our view we will have to integrate a
new French/Italian NamedEntityTaggingEngine which will handle stop words
and all other language-related aspects internally. But this belief
might just be due to the fact that our knowledge of the whole system is
still limited.
Ontological extension. While CVs typically contain quite a lot of named
entities which are already covered by Stanbol (e.g. geographical names, time
expressions, company names, person names), there are entities which will need
some ontology extension, such as skills and education.
Structural complexity. In a CV, instances of entities are linked to each
other in a structurally complex way. For instance, places are not just a flat
list of geographical entities, but they are likely to be connected with
periods, with job types, with companies, etc. Handling this structural
complexity represents an important challenge.
This might indeed be a challenge. I would start by splitting up the
content into smaller pieces (e.g. sentences) and trying to group entities
extracted from such parts.
If you then build a semantic index that stores such pieces as their own
documents, even searches for a job type at a specific company could
work quite nicely.
We will follow the approach you describe: if I understand correctly, you
propose to make use of atomic pieces of information (e.g. an experienceLine)
as a kind of document, in such a way that it is possible to formulate
queries such as "all documents of type experienceLine which contain a job
X and a company Y", right?
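Something along the lines of the following sketch, where each experienceLine
of a CV becomes its own small document to be indexed. The field names and the
experienceLine type are of course our own invention, and the call to the
enhancer is only indicated in a comment.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExperienceLineSplitter {

    // Split the "experience" section of a CV into atomic experience lines and
    // represent each one as its own small document, so that a query such as
    // "experienceLine containing job X and company Y" can be answered by the index.
    public static List<Map<String, String>> split(String experienceSection) {
        List<Map<String, String>> docs = new ArrayList<Map<String, String>>();
        int position = 0;
        for (String line : experienceSection.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            Map<String, String> doc = new HashMap<String, String>();
            doc.put("type", "experienceLine");            // invented document type
            doc.put("position", String.valueOf(position++));
            doc.put("text", line);
            // In the real pipeline the enhancer would be called on "line" here,
            // and the extracted job, company and period entities would be added
            // as separate fields of this small document before indexing it.
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        String section = "2008-2010 Java developer at ACME\n2011- DBA (SQL Server) at Foo SA";
        for (Map<String, String> doc : split(section)) {
            System.out.println(doc);
        }
    }
}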
Such a system would not really "understand" the structural complexity,
but should still be able to present users with good search results.
Thanks,
Luca
best
Rupert
--
*************************************
Luca Dini
CELI France SAS
Grenoble:
12-14 rue Claude Genin
38000 Grenoble
Paris:
33 Avenue Philippe Auguste
75011 Paris
tel: 00 33 476 24 23 80
www.celi-france.com/
www.celi.it/
research.celi.it
*************************************