Hi Alex

See my comments inline.

On 03.11.2011, at 12:06, Alex Lopez wrote:

> Hi stanbolers,
> 
> I'm in the middle of the process of creating a custom dbpedia index for 
> Stanbol, using some 24 dumps from dbpedia 3.7, English and Portuguese, and 
> some custom mappings (specifically, some special treatment for Portuguese 
> text plus some additional properties I'd like to see indexed).
> 
> I'm following this file:
> 
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md
> 
> and this for processing the broken images_en file
> 
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh
> 
> The process went well up to the point after all triples (some 80M) were 
> loaded into tdb.
> 
> The problem is that the process stops after that and outputs a
> 
> Exception in thread "Thread-3" java.lang.IllegalStateException: The file with 
> the Entity Scores is missing
>        at 
> org.apache.stanbol.entityhub.indexing.core.source.LineBasedEntityIterator.initialise(LineBasedEntityIterator.java:424)
> ...
> 10:12:22,077 [Thread-2] INFO  solryard.SolrYardIndexingDestination - ... 
> create SolrYard
> 
> And nothing more happens.
> 
If you use the combination of "entityDataProvider" and "entityIdIterator", then 
you need the file with the scores, because it is used to look up the IDs.

> Of course, the file is missing because I didn't need it, since I want to 
> index all entities. I tried to generate it anyway once but after a lot of 
> time of processing it failed with some outOfMem exception (I think in the 
> process of sorting).

Even if you plan to index all entities you might want to use entity scores, 
because such scores are also used to boost entities within the Entityhub.

So if you search, the results will be ranked by the number of incoming links 
within dbpedia. This ensures that a search for "Paris" returns Paris, France as 
the best result. Without such boosts, Paris, Texas could also be returned as the 
best result.

> Is there a way to instruct the indexer to ignore the Entity Scores file? Or 
> write some simple one in a way that says "all entities are to be indexed"?
> 

If you want to index all entities without entity scores, then you need to change 
the configuration within the "indexing.properties" file as follows.

Comment out the properties

* entityDataProvider
* entityIdIterator
* scoreNormalizer

Instead, add the following two properties:

# use the Jena TDB as source for indexing the RDF data located within
# "indexing/resources/rdfdata"
entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
# The EntityScore Provider needs to provide the scores for indexed entities
# use the NoEntityScoreProvider if no scores are available
entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider

This activates the indexing mode that iterates over the data and looks up the 
scores for the IDs. The NoEntityScoreProvider is a dummy implementation that 
returns null as the score for every requested entity ID. This configuration 
therefore results in all entities being indexed and no scores being used.
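
To summarise, the relevant part of "indexing.properties" would then look 
roughly like the sketch below. The values of the three commented-out properties 
are elided on purpose ("..." is a placeholder, not a real value); keep whatever 
your file already contains there and just prefix those lines with "#".

```properties
# Disabled: indexing mode driven by an ID iterator, which requires the
# file with the entity scores. "..." stands for the existing values.
#entityDataProvider=...
#entityIdIterator=...
#scoreNormalizer=...

# Enabled: iterate over all data in the Jena TDB store and use no scores.
entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider
```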

Under [1] you can find an indexing.properties file that uses this 
configuration. It also provides - as comments - some more background 
information about the different configuration options.

Let me also add that the RDF data you have already imported into the TDB store 
can also be used for this indexing mode. I would therefore recommend moving the 
RDF files that you have already imported out of 
"{indexing-root}/indexing/resources/rdfdata" to some other directory. This 
avoids re-importing them on every new indexing run.

best
Rupert Westenthaler

[1] 
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
 

> Thanks, I can send the complete log if it is needed.
> Best,
> Alex
