Re: Custom dbpedia indexing without Entity Scores

Alex Lopez Thu, 03 Nov 2011 05:05:14 -0700

Thanks for the comments Rupert.

Yes, I had already moved the dumps elsewhere, so at least I don't have to
reload all the dumps again :)

I think I'll try again to run the script that calculates the scores, to get
the boosts on the entities.


On 03-11-2011 11:46, Rupert Westenthaler wrote:

Hi Alex

see comments inline

On 03.11.2011, at 12:06, Alex Lopez wrote:

Hi stanbolers,

I'm in the middle of creating a custom dbpedia index for Stanbol, using some
24 dumps from dbpedia 3.7 (English and Portuguese) and some custom mappings
(specifically, some special treatment for Portuguese text plus some
additional properties I'd like to see indexed).

I'm following this file:

http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md

and this for processing the broken images_en file

http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh

The process went well up to the point after all triples (some 80M) were loaded
into tdb.

The problem is that the process stops after that and outputs a

Exception in thread "Thread-3" java.lang.IllegalStateException: The file with the Entity Scores is missing
        at org.apache.stanbol.entityhub.indexing.core.source.LineBasedEntityIterator.initialise(LineBasedEntityIterator.java:424)
...
10:12:22,077 [Thread-2] INFO  solryard.SolrYardIndexingDestination - ... create SolrYard

And nothing more happens.

If you use the combination of "entityDataProvider" and "entityIdIterator", then
you need the file with the scores, because it is used to look up the IDs.

Of course, the file is missing because I didn't need it, since I want to index 
all entities. I tried to generate it anyway once but after a lot of time of 
processing it failed with some outOfMem exception (I think in the process of 
sorting).


Even if you plan to index all entities you might want to use entity scores, 
because such scores are also used to boost entities within the Entityhub.

So if you search, the results will be sorted by the number of incoming links
within dbpedia. This ensures that a search for "Paris" returns Paris, France as
the best result. Without such boosts, Paris, Texas could also be returned as
the best result.
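
As a rough illustration of the boosting idea (this is not Stanbol's actual ranking code, and the link counts below are made-up placeholders rather than dbpedia data), ordering search candidates by an incoming-link score could be sketched like this:

```python
# Hypothetical incoming-link counts; placeholder numbers, not real dbpedia data.
incoming_links = {
    "http://dbpedia.org/resource/Paris": 1000,
    "http://dbpedia.org/resource/Paris,_Texas": 10,
}

# Candidates in the order a plain label match might return them.
candidates = [
    "http://dbpedia.org/resource/Paris,_Texas",
    "http://dbpedia.org/resource/Paris",
]


def rank_by_score(uris, scores):
    """Order URIs by descending score; a missing or None score counts as 0."""
    return sorted(uris, key=lambda uri: scores.get(uri) or 0, reverse=True)


ranked = rank_by_score(candidates, incoming_links)  # scores available
unranked = rank_by_score(candidates, {})            # no scores at all
```

With scores, the better-linked entity moves to the front. Without any scores every candidate ties, so the original order is kept and either "Paris" could come first.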

Is there a way to instruct the indexer to ignore the Entity Scores file? Or write some 
simple one in a way that says "all entities are to be indexed"?

If you want to index all entities without Entity Scores, then you need to change
the configuration within the "indexing.properties" file as follows.

Comment out the properties

* entityDataProvider
* entityIdIterator
* scoreNormalizer

Instead, add the following two lines:

# use the Jena TDB as source for indexing the RDF data located within
# "indexing/resource/rdfdata"
entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
# The EntityScore Provider needs to provide the scores for indexed entities
# use the NoEntityScoreProvider if no scores are available
entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider

This will activate the indexing mode that iterates over the data and looks up
scores for the IDs. The NoEntityScoreProvider is a dummy implementation that
returns null as the score for every requested Entity ID. Therefore this
configuration will result in all entities being indexed and no scores being
used.
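
If you prefer to script the change, a minimal sketch could look like the following. The function name `switch_to_iterable_mode` is made up for this example. It assumes the three properties appear as ordinary `key=value` lines (optionally continued with a trailing backslash) and that the two new keys are not already present.

```python
import re

# Properties to comment out, as listed above.
DISABLE = ("entityDataProvider", "entityIdIterator", "scoreNormalizer")

# The two replacement lines, copied from the configuration above.
ADD = [
    "entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata",
    "entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider",
]


def switch_to_iterable_mode(text):
    """Comment out the DISABLE properties and append the ADD lines."""
    out = []
    continued = False  # True while inside a backslash-continued value being commented out
    for line in text.splitlines():
        # The key is everything before the first '=', ':' or whitespace.
        key = re.split(r"[=:\s]", line.strip(), maxsplit=1)[0]
        if continued or key in DISABLE:
            continued = line.rstrip().endswith("\\")
            out.append("#" + line)
        else:
            out.append(line)
    return "\n".join(out + ADD) + "\n"
```

Read the contents of "indexing.properties", pass them through the function and write the result back. Note that running it twice would append the two lines a second time, so keep a backup of the original file.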

Under [1] you can find an indexing.properties file that uses this
configuration. It also provides - as comments - some more background
information about the different configuration options.

Let me also add that the RDF data you have already imported into the TDB store
can also be used for this indexing mode. I would therefore recommend moving the
RDF data that you have already imported from
"{indexing-root}/indexing/resources/rdfdata" to some other directory. This
avoids re-importing it on every new indexing run.

best
Rupert Westenthaler

[1]
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties

Thanks, I can send the complete log if it is needed.
Best,
Alex
