Hi Rupert,
thank you very much for your answer; it was very helpful and gave me a
deeper understanding of how the Entityhub indexing works.
I wrote a possible candidate for the text_it field in the schema.xml used
by the various indexers in Stanbol:
<fieldType name="text_it" class="solr.TextField"
    positionIncrementGap="100" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
        synonyms="synonyms_it.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.HyphenatedWordsFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="1"
        catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="italian_stop.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
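Since the definition above only declares an index-time analyzer, a matching query-time analyzer is probably needed inside the same fieldType. Here is a minimal sketch that simply mirrors the index chain without the synonym and hyphenation filters, and without the catenate options of the WordDelimiterFilter (a common practice at query time); whether this split is optimal for Italian is exactly the kind of thing I hope a Solr expert will correct:

```xml
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
      generateWordParts="1" generateNumberParts="1" catenateWords="0"
      catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
      words="italian_stop.txt" enablePositionIncrements="true"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Italian"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```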
As you can see, I used the SnowballPorterFilterFactory as suggested in
http://wiki.apache.org/solr/LanguageAnalysis#Italian;
the stopword list for Italian can be found at this link:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt
If someone more skilled than me in Solr is reading this, feel free to
correct it or to suggest a better definition for the field.
I'm going to test indexing with this configuration, and I'll let you know
about my progress (if any).
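Once the new index is in place I will repeat the comparison on the two public demo chains. A small sketch of the test script I have in mind (the base URL and chain names are the ones from Rupert's mail below; the curl call itself is commented out, so the sketch is harmless to run as-is):

```shell
#!/bin/sh
# Send the same Italian text to the NER-based and the keyword-based demo
# chains and compare the results. BASE and the chain names are taken from
# Rupert's mail; the curl flags are standard curl usage.
BASE="http://dev.iks-project.eu:8081/enhancer/chain"
TEXT="Il Garante Privacy ha aperto un'istruttoria."
for chain in dbpedia-ner dbpedia-keyword; do
  echo "== $BASE/$chain =="
  # Uncomment to actually send the request (needs network access):
  # curl -s -X POST -H "Content-Type: text/plain" \
  #      -H "Accept: application/rdf+xml" --data "$TEXT" "$BASE/$chain"
done
```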
Regards,
Stefano
On Thu, Mar 1, 2012 at 7:47 PM, Rupert Westenthaler <
[email protected]> wrote:
> Hi Stefano, Luca
>
> See my comments inline.
>
> On 01.03.2012, at 15:59, Luca Dini wrote:
>
> > Dear Stefano,
> > I am new to the list as well, and we are also working in the context
> > of the early adoption program. If I understand correctly, the problem
> > is that without an appropriate named-entity extraction engine for
> > Italian the results will always be disappointing. In the context of
> > our project we will integrate enhancement services providing NER for
> > Italian and French (and possibly keyword extraction), so, hopefully,
> > you will be able to profit from the power of Stanbol. There might be
> > some problems in terms of timing, as it is not clear whether, in the
> > short project window, there will be the possibility of feeding our
> > integration into yours. Is the unavailability of Italian NER a
> > blocking factor for you, or can you go on with development while
> > waiting for the integration?
> >
>
> That's true. For datasets such as DBpedia the combination of "NER +
> NamedEntityTaggingEngine" is the way to go. That is simply because
> DBpedia defines entities for nearly all natural-language words, so
> "keyword extraction" (as used by the KeywordLinkingEngine) does not
> really work.
>
> However, note that the KeywordLinkingEngine has support for POS (Part
> of Speech) taggers. So if a POS tagger is available for a given
> language, it will use this information to look up only nouns (see [1]
> for more detailed information on the algorithm used). The bad news is
> that there is no POS tagger available for Italian :(
>
> [1]
> http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/keywordlinkingengine.html
>
>
> The final possibility to improve the results of the
> KeywordLinkingEngine with DBpedia is to filter out all entities with
> types other than Persons, Organizations and Places. However, this also
> has a big disadvantage, because it will exclude all redirects, and such
> entities are very important as they allow linking entities that are
> mentioned by alternate names.
> However if you would like to try this you should have a look at the
>
> org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter
>
> This filter is included in the default configuration of the DBpedia
> indexer and can be activated by changing the configuration within the
>
> {indexing-dir}/indexing/config/entityTypes.properties
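> For illustration only - I am sketching the keys from memory, and the
> commented examples in the shipped file are authoritative - the
> configuration might look like (the type URIs are standard DBpedia
> ontology types):
>
> ```properties
> # index only entities with one of the following rdf:type values
> field=rdf:type
> values=http://dbpedia.org/ontology/Person;http://dbpedia.org/ontology/Organisation;http://dbpedia.org/ontology/Place
> ```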
>
>
>
> @ Luca
> > we will integrate enhancement services of NER for Italian and French
>
> That would be really great. Is the framework you integrate open source?
> Can you provide a link?
>
>
> > Cheers,
> > Luca
> >
> > On 01/03/2012 14:49, Stefano Norcia wrote:
> >> Hi all,
> >>
> >> My name is Stefano Norcia and I'm working on the early adoption
> >> project for Etcware.
> >>
> >> For our early adoption project (Etcware Early Adoption project) we
> >> need to use a DBpedia index in the Italian language in the
> >> enhancement and enrichment process enabled by the Stanbol engines.
> >>
> >> The main problem is that the NLP module does not support the Italian
> >> language directly, so if you feed an Italian text to the enhancement
> >> engines, the DBpedia engine does not detect any concepts, places or
> >> people.
> >>
>
> The NER engine uses the language as detected by the LangID engine and
> deactivates itself if no NER model is available for the detected
> language. In that case the NamedEntityTaggingEngine will also link no
> entities, because no named entities are detected within the text.
>
> However, this does not mean that no Italian labels are present in the
> DBpedia index. In fact, Italian labels ARE present in all the DBpedia
> indexes. There is no need to build your own indexes unless you have
> some special requirement.
>
> You can try this even on the test server. Simply send some Italian
> text first to
>
> http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-ner
>
> This chain uses "NER + NamedEntityTaggingEngine", so you will not get
> any results - as expected. Then try the same text with
>
> http://dev.iks-project.eu:8081/enhancer/chain/dbpedia-keyword
>
> This will return linked entities. But, as I mentioned above and as you
> already experienced yourself, it also gives a lot of false positives.
>
>
> >> We have done some experiments to reach this goal:
> >>
> >> The first attempt was to rebuild the DBpedia index following the
> >> instructions found in the stanbol/entityhub/indexing/dbpedia folder.
> >> In this folder there is a shell script (fetch_prepare.sh) that
> >> describes how to prepare the DBpedia datasets before creating the
> >> index. We followed those instructions and tried to create a new
> >> index and "site", starting from the Italian DBpedia datasets, to
> >> replace the standard English DBpedia index. We are aware that the
> >> Italian datasets are not complete and that some packages are missing
> >> (like persondata_en.nt.bz2 and so on).
> >> These are the packages we used to create the index
> >> (http://downloads.dbpedia.org/3.7/it/):
> >>
> >> o dbpedia_3.7.owl.bz2
> >> o geo_coordinates_it.nt.bz2
> >> o instance_types_it.nt.bz2
> >> o labels_it.nt.bz2
> >> o long_abstracts_it.nt.bz2
> >> o short_abstracts_it.nt.bz2
> >>
>
> You should always include the English versions, as they contain a lot
> of information that is also very useful for other languages.
>
> >> We were also able to create the incoming_links text file from the
> >> package page_links_it.nt.bz2.
> >> After rebuilding the index we replaced the English DBpedia index in
> >> Stanbol with our custom one (simply swapping the old one for the new
> >> one and restarting Stanbol).
> >>
> >> Sadly, after that, the results produced by the enhancement engines
> >> are exactly the same as before: neither Italian concepts are
> >> detected nor possible enhancements returned by any of the other
> >> enhancement engines.
> >>
>
> I assume that this index was completely fine. The reason why you were
> not getting any results is that the NER engine deactivates itself for
> Italian texts.
>
> Note also that the
>
> * NamedEntityTaggingEngine and
> * KeywordLinkingEngine
>
> both use the exact same DBpedia index, so you can/should use the same
> index for both. This is also the case on
> "http://dev.iks-project.eu:8081".
>
> Also note that the DBpedia indexer and the generic RDF indexer create
> the same type of index; the DBpedia indexer just ships a configuration
> that is optimized for DBpedia.
>
> >> As a second attempt, we decided to use the generic RDF indexer
> >> (combined with the standard KeywordLinkingEngine) to process the
> >> Italian DBpedia datasets; in this case the indexing process
> >> succeeded and we were able to get a lot of results when testing the
> >> enhancement engines with Italian content. This time the problem is
> >> that there are simply too many results, and they also contain
> >> stopwords.
> >>
> >> As an example, you can find in the attachment a sample text
> >> submitted for enhancement and the results shown by the
> >> KeywordLinkingEngine.
> >>
> >> The terms shown in bold are clearly stopwords. I don't know if the
> >> problem lies in the dataset indexing, or if there is a way to
> >> eliminate them after the creation of the index.
>
> Using stop words would in fact improve the performance of the
> KeywordLinkingEngine. The current default Solr configuration includes
> optimized Solr field configurations for English and German.
>
> If you can provide such a configuration for Italian, it would be great
> if you could contribute it to Stanbol! I would be happy to work on
> that!
>
> >>
> >> We have also made an attempt to change the stopword filter in the
> >> Solr yard base index zips
> >> (/stanbol/entityhub/yard/solr/src/main/resources/solr/core/default/default.solrindex.zip
> >> and simple.solrindex.zip) and to rebuild the Entityhub (and the
> >> DBpedia indexer too, with mvn assembly:single in
> >> entityhub/indexing/dbpedia) with the right stopwords.
> >>
>
> This would be the place where a Stanbol committer would change the
> configuration. If you use the DBpedia indexer you can simply change
> the Solr configuration in
>
> {indexing-root}/indexing/config/dbpedia/conf/schema.xml
>
> If you use the generic RDF indexer you should extract the
> "default.solrindex.zip" to
>
> {indexing-root}/indexing/config/
>
> and then rename the directory to the same name as your site (this is
> the value of the "name" property in the
> "/indexing/config/indexing.properties" file).
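> For example, if your "indexing.properties" contains the (hypothetical)
> site name
>
> ```properties
> name=dbpediait
> ```
>
> then the extracted Solr configuration has to end up in the directory
> "{indexing-root}/indexing/config/dbpediait".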
>
> >> We've checked the generated JAR and the Italian stopwords are
> >> there, as a file inside the Solr config folder, but the results were
> >> always the same as before (still stopwords in the enhancement
> >> results).
> >>
>
> If you use the RDF indexer the Solr configuration is taken
>
> * from the directory "{indexing-root}/indexing/config/{name}", or, if
> not present,
> * from the classpath used by the indexer.
>
> So the reason why it did not work for you is that you did not create a
> new RDF indexer version after changing the "default.solrindex.zip" and
> rebuilding the Entityhub. For that you would also have needed to
> re-create the indexer by using "mvn assembly:single".
>
> But as I mentioned above, there is a simpler solution for adding
> Italian stop words: simply edit the Solr configuration contained in
>
> {indexing-root}/indexing/config/dbpedia/conf/
>
> of the DBpedia indexer.
>
>
> Hopefully that answers all your questions. If you have additional
> questions feel free to ask.
>
> best
> Rupert Westenthaler
>
>
> >> Do you have any suggestions on how to perform these tasks?
> >>
> >> Thanks in advance.
> >>
> >> -Stefano
> >>
> >> PS: below is an enrichment example from the RDF index we built from
> >> DBpedia with the simplerdfindexer and dblp:
> >>
> >> text:
> >>
> >> *Infermiera con tbc, troppi dettagli sui media. Il Garante apre
> >> un'istruttoria
> >>
> >> Il Garante Privacy ha aperto un'istruttoria in seguito alla
> pubblicazione
> >> di notizie da parte di agenzie di stampa e quotidiani - anche on line -
> >> che, nel riferire di un caso di una infermiera in servizio presso il
> >> reparto di neonatologia del Policlinico Gemelli, risultata positiva ai
> test
> >> sulla tubercolosi, hanno riportato il nome della donna, l'iniziale del
> >> cognome e l'età.
> >>
> >> Il diritto-dovere dei giornalisti di informare sugli sviluppi della
> >> vicenda, di sicura rilevanza per l'opinione pubblica, considerato
> l'elevato
> >> numero di neonati e di famiglie coinvolte, deve essere comunque
> bilanciato,
> >> secondo i principi stabiliti dal Codice deontologico con il rispetto
> delle
> >> persone.
> >>
> >> Il Garante ricorda che, anche quando questi dettagli fossero stati
> forniti
> >> in una sede pubblica, i mezzi di informazione sono tenuti a valutare con
> >> scrupolo l'interesse pubblico delle singole informazioni diffuse.
> >>
> >> I media evitino dunque di riportare informazioni non essenziali che
> possano
> >> ledere la riservatezza delle persone e nello stesso tempo possano
> indurre
> >> ulteriori stati di allarme e di preoccupazione in coloro che si sono
> >> avvalsi dei servizi sanitari dell'ospedale o sono altrimenti entrati in
> >> contatto con la persona.
> >>
> >> Roma, 24 agosto 2011*
> >>
> >> Enrichments :
> >>
> >> 2011 2011
> >>
> >> Agosto Agosto
> >>
> >> *Alla Alla*
> >>
> >> *Anché Anché*
> >>
> >> *Che? Che?*
> >>
> >> Cognome Cognome
> >>
> >> *CON CON*
> >>
> >> *Dal' Dal'*
> >>
> >> Problema dei servizi Problema dei servizi
> >>
> >> *Dell Dell*
> >>
> >> Diritto Diritto
> >>
> >> Donna Donna
> >>
> >> Essere Essere
> >>
> >> Il nome della rosa Il nome della rosa
> >>
> >> Informazione Informazione
> >>
> >> Interesse pubblico Interesse pubblico
> >>
> >> Media Media
> >>
> >> Mezzi di produzione Mezzi di produzione
> >>
> >> *Nello Nello*
> >>
> >> Neonatologia Neonatologia
> >>
> >> *NON NON*
> >>
> >> Numero di coordinazione (chimica) Numero di coordinazione (chimica)
> >>
> >> Opinione pubblica Opinione pubblica
> >>
> >> Ospedale Ospedale
> >>
> >> *PER PER*
> >>
> >> Persona Persona
> >>
> >> Privacy Privacy
> >>
> >> Pubblicazione di matrimonio Pubblicazione di matrimonio
> >>
> >> Secondo Secondo
> >>
> >> Servizio Servizio
> >>
> >> Stampa Stampa
> >>
> >> Stati di immaginazione Stati di immaginazione
> >>
> >> *SUI SUI*
> >>
> >> TBC TBC
> >>
> >> Tempo Tempo
> >>
> >> .test .test
> >>
> >> Tubercolosi Tubercolosi
> >>
> >> *UNA UNA*
> >>
> >>
> >> The terms in bold are stopwords; the other results are good ones.
> >> Still, the stopwords were not eliminated during dataset indexing;
> >> maybe there is a way to eliminate them from the datasets, but I
> >> don't know how.
> >>
> >
>
>