Hi Rupert,
thanks for all your observations. My comments are in the body of the message.
Best,
Luca
On 01/03/2012 23:14, Rupert Westenthaler wrote:
Hi Luca
A really interesting scenario.
On Thu, Mar 1, 2012 at 3:44 PM, Luca Dini <[email protected]> wrote:
The provision to Stanbol of classes allowing the connection with
Linguagrid (www.linguagrid.org) and possibly LanguageGrid
(http://langrid.org/en/index.html).
The verification of the extensibility of Stanbol to languages other than
English (the project will concern CVs written in French).
OK, this answers my question from the other email. Can you maybe provide
some additional information (links) about these services? What is the
license of Language Grid? I was not able to find information related
to that.
The reason is that licensing varies according to the service provider. As
you have seen, we are not the only providers via Linguagrid. As far as
our services are concerned, they are open access but not open source. In
short, this means:
1) unlimited access for research/educational purposes, with support for
integration etc.;
2) free access for "commercial purposes", with no service level guarantee;
3) paid access (subscription or pay per use) if some service level
guarantee is needed. Prices vary, of course, depending on volumes,
constraints, response times, etc.
Concerning Stanbol, as IKS is a research project we are willing to
give unlimited access to all Stanbol instances. Of course, the
limitation is the computational power of the Amazon WS instances where
Linguagrid and the related services are hosted. In case of massive
adoption and the need to activate many instances (they have a cost), we
would be forced to impose some kind of fee. But this is a future
scenario, as currently Linguagrid seems to scale rather well.
The basic goal is to provide them with an open
source document management system able to deal in an intelligent way with
non-structured CVs (or "resumes"), i.e. CVs which come in Microsoft Word,
PDF, OpenOffice, etc.
Apache Stanbol now has two EnhancementEngines for processing non-plain-text
documents:
* MetaxaEngine (mainly based on aperture.sourceforge.net)
* TikaEngine (Apache Tika)
Therefore the kinds of documents you mentioned should be supported by Stanbol.
Great.
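Just to check that I understand the integration path, here is a minimal
sketch of how we could push a binary CV to the enhancer over HTTP. It assumes
a local Stanbol launcher on port 8080 and its RESTful enhancer endpoint (the
exact path, /enhancer or /engines, may depend on the version), and that the
endpoint accepts the binary content directly, with the TikaEngine extracting
the text; please correct me if the details are wrong.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EnhanceCv {
    public static void main(String[] args) throws IOException {
        // Assumption: a local Stanbol launcher listening on port 8080,
        // with the default RESTful enhancer endpoint.
        URL url = new URL("http://localhost:8080/enhancer");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        // The TikaEngine should take care of extracting plain text from the PDF.
        con.setRequestProperty("Content-Type", "application/pdf");
        // Ask for the enhancement metadata as RDF/XML.
        con.setRequestProperty("Accept", "application/rdf+xml");

        byte[] cv = Files.readAllBytes(Paths.get("cv-example.pdf"));
        try (OutputStream out = con.getOutputStream()) {
            out.write(cv);
        }

        // Print the returned enhancement graph.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

If this is roughly correct, the same call should also work for Word and
OpenOffice files by just changing the Content-Type.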
This might represent:
* experiences of the candidate
* skills of the candidate
* education level
* reference data (name, address, etc.)
* contact data
Some of these data might be slightly more structured than just named
entities, but definitely within the representational power of RDF. Some of them
could be even more semantically enriched by providing external information
on companies, places, specific technologies, etc.
It is very easy to import data that are available as RDF into Stanbol
and use them for Entity Extraction and Linking. There is also support
for importing existing vCard files. Such data are converted to RDF by
using the schema.org schema.
As Oliver said, I think that the crucial thing will be to identify the
right reference schema. In some cases (e.g. skills) I guess that we will
be forced to have a mixed approach, as there is no standard vocabulary
for representing them.
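To make the "mixed approach" a bit more concrete, below is a rough sketch of
how a small controlled vocabulary of skills could be produced as RDF and then
imported into Stanbol for entity linking. I use Apache Jena here only as a
convenient RDF API for the example; the skill namespace and labels are
invented, precisely because no standard vocabulary exists.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.vocabulary.RDFS;

public class SkillVocabulary {
    public static void main(String[] args) {
        // Invented namespace for the example; there is no standard skill vocabulary.
        String ns = "http://example.org/hr/skill/";
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("skill", ns);

        // Each skill becomes an entity with (multilingual) labels.
        model.createResource(ns + "SQLServer")
                .addProperty(RDFS.label, "SQL Server", "en")
                .addProperty(RDFS.label, "SQL Server", "fr");
        model.createResource(ns + "Java")
                .addProperty(RDFS.label, "Java", "en");

        // Serialize as RDF/XML; this output is what we would then import into
        // Stanbol so that the linking engines can match skills in the CVs.
        model.write(System.out, "RDF/XML");
    }
}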
As a result of this, personnel at the HR department would be able to
formulate queries such as (just an exemplification):
* all CVs of people living in Paris older than 27 years
* all CVs of people with skills in SQL Server and Java
* all people who have worked in a high tech company since November 2011.
Do you plan to use the Apache Contenthub for Semantic Search, or does
the CMS you use already support such kinds of searches?
On this matter, I will write a separate email. Basically the answer is
that we are open to suggestions.
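Just to make our expectations concrete: once the CV data are available as
RDF, the queries listed above could be expressed, for instance, in SPARQL.
The cv: vocabulary below is invented purely for illustration (Jena is used
here only to check that the query parses):

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryFactory;

public class CvQueryExample {
    public static void main(String[] args) {
        // Invented cv: vocabulary, for illustration only.
        String sparql =
            "PREFIX cv: <http://example.org/hr/cv/>\n" +
            "SELECT ?cv WHERE {\n" +
            "  ?cv cv:city \"Paris\" ;\n" +
            "      cv:age ?age .\n" +
            "  FILTER (?age > 27)\n" +
            "}";
        // Parse the query to make sure it is syntactically valid.
        Query query = QueryFactory.create(sparql);
        System.out.println(query);
    }
}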
Challenges
From a technical point of view, the most interesting challenge consists in
integrating the set of Stanbol enhancers with the semantic web services
provided at www.linguagrid.org. In principle it should not be a different
integration from what has already been done with the OpenCalais WS and the
Zemanta WS. However, there are at least two major challenges:
Multilinguality. The extraction will consider French documents rather
than English ones. Moreover, in a second phase (not covered by the present
project), the whole system could be extended to Italian and French.
Stanbol already nicely supports multilingual scenarios. The LangId
engine can be used to detect the language of a document (internally it
uses Apache Tika) and store the detected language in the metadata.
Other engines can use this language for further processing.
That's great: probably my consideration of multilinguality as a
challenge was due to the fact that most integrated linguistic
engines were dealing with English. I was also wondering if the
strategies for matching a given named entity with e.g. a DBpedia URL are
completely language independent.
When dealing with French you might want to update the configuration of
the SolrCore used to store the controlled vocabulary with French-specific
settings such as stop words, stemmers, etc. This will
improve the results for the NamedEntityTaggingEngine and the
KeywordLinkingEngine.
I understand this for the KeywordLinkingEngine, but not completely for
the NamedEntityTaggingEngine. In our view we will have to integrate a
new French/Italian NamedEntityTaggingEngine which will handle stop words
and all other language-related aspects internally. But this belief
might just be due to the fact that our knowledge of the whole system is
still limited.
Ontological extension. While CVs typically contain quite a lot of named
entities which are already covered by Stanbol (e.g. geographical names, time
expressions, company names, person names), there are entities which will need
some ontology extension, such as skills and education.
Structural complexity. In a CV, instances of entities are linked to each
other in a structurally complex way. For instance, places are not just a flat
list of geographical entities, but they are likely to be connected with
periods, with job types, with companies, etc. Handling this structural
complexity represents an important challenge.
This might indeed be a challenge. I would start by splitting up the
content into smaller pieces (e.g. sentences) and trying to group entities
extracted from such parts.
If you then build a semantic index that stores such pieces as their own
documents, even searches for a job type at a specific company could
work quite nicely.
We will follow the approach you describe: if I understand correctly, you
propose to make use of atomic pieces of information (e.g. an experienceLine)
as a kind of document, in such a way that it is possible to formulate
queries such as "all documents of type experienceLine which contain a job
X and a company Y", right?
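Something along the lines of the following sketch, where each experienceLine
of a CV becomes its own small document to be indexed. The field names and the
experienceLine type are of course our own invention, and the call to the
enhancer is only indicated in a comment.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExperienceLineSplitter {

    // Split the "experience" section of a CV into atomic experience lines and
    // represent each one as its own small document, so that a query such as
    // "experienceLine containing job X and company Y" can be answered by the index.
    public static List<Map<String, String>> split(String experienceSection) {
        List<Map<String, String>> docs = new ArrayList<Map<String, String>>();
        int position = 0;
        for (String line : experienceSection.split("\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            Map<String, String> doc = new HashMap<String, String>();
            doc.put("type", "experienceLine");            // invented document type
            doc.put("position", String.valueOf(position++));
            doc.put("text", line);
            // In the real pipeline the enhancer would be called on "line" here,
            // and the extracted job, company and period entities would be added
            // as separate fields of this small document before indexing it.
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        String section = "2008-2010 Java developer at ACME\n2011- DBA (SQL Server) at Foo SA";
        for (Map<String, String> doc : split(section)) {
            System.out.println(doc);
        }
    }
}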
Such a system would not really "understand" the structural complexity,
but should still be able to present users with good search results.
Thanks,
Luca
best
Rupert
--
*************************************
Luca Dini
CELI France SAS
Grenoble:
12-14 rue Claude Genin
38000 Grenoble
Paris:
33 Avenue Philippe Auguste
75011 Paris
tel: 00 33 476 24 23 80
www.celi-france.com/
www.celi.it/
research.celi.it
*************************************