Hi

On Thu, Aug 16, 2012 at 3:01 AM, harish suvarna <[email protected]> wrote:
> Thanks Rupert. I am making some progress here. I am finding that paoding
> breaks words into small segments, especially foreign names. For example,
> motorola is broken into two parts (mot, rola), and similarly michael is
> broken into (mik, kael). The ngram based dbpedia lookup then searches for
> these in the dbpedia index and cannot find them.
> My segmentation process and dbpedia solr index must both use the same
> segmenter. There is a paoding analyzer for solr too. I just need to create
> the solr index for dbpedia using that.
> Actually now, I have more dbpedia hits in character ngram based dbpedia
> lookup for chinese than the number of hits I get if I use paoding.
> We don't know which language analyzers ogrisel used when creating the
> 1.19gb solr dbpedia dump.
>

have a look at

    {dbpedia-indexing-root}/indexing/config/dbpedia/conf/schema.xml

The comments within the file should provide all the information needed
for your adaptations.

I assume ogrisel used the default settings. If you would like to
contribute Chinese-specific field configurations I would be very happy
to include them even in the default Solr configuration for the
Entityhub. This would add support for special indexing of Chinese text
to all Stanbol Entityhub indexes.
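Just as a sketch of what such a contribution could look like: a field
type that delegates analysis to Lucene's SmartChineseAnalyzer (the
package excluded below). Note that the names "text_zh" and "*_zh" are
only my suggestions, and the analyzer is only available if the
corresponding analyzer dependencies are on the classpath:

```xml
<!-- schema.xml fragment (sketch): a field type for Chinese text that
     delegates the whole analysis chain to Lucene's SmartChineseAnalyzer -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>

<!-- a dynamic field using it; check how the default schema names its
     other language-specific fields and follow that pattern -->
<dynamicField name="*_zh" type="text_zh" indexed="true" stored="true"/>
```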

Note also that I have excluded some dependencies of Solr in the
org.apache.stanbol.commons.solr.core module

    !org.apache.lucene.analysis.cn.smart.*,
    !org.apache.lucene.analysis.pl.*,
    !org.apache.lucene.analysis.stempel.*,
    !org.egothor.stemmer.*,

So if you plan to use components from those packages in your field
configurations, we will most likely need to remove those exclusions.
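Those entries are OSGi Import-Package negations. Removing the
smart-chinese exclusion would look roughly like the following in the
maven-bundle-plugin configuration of the module (a sketch from memory;
please check the actual pom of org.apache.stanbol.commons.solr.core for
the exact surrounding instructions):

```xml
<plugin>
  <groupId>org.apache.felix</groupId>
  <artifactId>maven-bundle-plugin</artifactId>
  <configuration>
    <instructions>
      <Import-Package>
        <!-- the "!org.apache.lucene.analysis.cn.smart.*" line was
             deleted here, so the bundle imports that package again -->
        !org.apache.lucene.analysis.pl.*,
        !org.apache.lucene.analysis.stempel.*,
        !org.egothor.stemmer.*,
        *
      </Import-Package>
    </instructions>
  </configuration>
</plugin>
```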

> I also experimented with contenthub search for Chinese. Right now it does
> not work. I need to debug that part as well. Even the UI in the contenthub
> does not display the Chinese characters. The enhancer UI does display the
> characters well.
>

Can you please create an issue for this?

> Also for English Stanbol, I did play with contenthub. I took a small text
> as follows.
> ==============
>  United States produced an Olympic-record time to win gold in the women's
> 200m freestyle relay final. A brilliant final leg from Allison Schmitt led
> the Americans home, ahead of Australia, in a time of seven minutes 42.92
> seconds. Missy Franklin gave them a great start, while Dana Vollmer and
> Shannon Vreeland also produced fast times.
> =====================================================================
>
> The above text is properly processed and I get the dbpedia links for all
> the persons and countries in it. However, the above piece is about
> 'swimming' and this word does not appear at all in the text. In the
> dbpedia link for Allison Schmitt, the dbpedia categories do tell us that
> it is in the swimming category. Did anyone try to process the categories
> inside the link and add them as metadata for this content? If we add this,
> we add more value than a simple solr based search in the content store.
> Someone at the IKS conference demoed this as a semantic search. Any
> hints/clues on this work?
>

That would most likely be Suat. As this thread has become really long I
am not sure how many people are still reading it, so it might be better
to start a new thread for the contenthub related questions.

best
Rupert


-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
