Thanks for the comments, Rupert.
Yes, I had already moved the dumps elsewhere, so at least I don't have to
reload them all again :)
I think I'll try running the script that calculates the scores again, to get
the boosts on the entities.
On 03-11-2011 11:46, Rupert Westenthaler wrote:
Hi Alex
see comments inline
On 03.11.2011, at 12:06, Alex Lopez wrote:
Hi stanbolers,
I'm in the middle of creating a custom dbpedia index for
Stanbol, using some 24 dumps from dbpedia 3.7, English and Portuguese, and some
custom mappings (specifically, some special handling for Portuguese text plus
some additional properties I'd like to see indexed).
I'm following this file:
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md
and this for processing the broken images_en file
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh
The process went well up to the point where all triples (some 80M) had been loaded
into tdb.
The problem is that the process stops after that and outputs:
Exception in thread "Thread-3" java.lang.IllegalStateException: The file with the Entity Scores is missing
    at org.apache.stanbol.entityhub.indexing.core.source.LineBasedEntityIterator.initialise(LineBasedEntityIterator.java:424)
    ...
10:12:22,077 [Thread-2] INFO solryard.SolrYardIndexingDestination - ... create SolrYard
And nothing more happens.
If you use the combination of "entityDataProvider" and "entityIdIterator", then
you need the file with the scores, because it is used to look up the IDs.
Of course, the file is missing because I didn't need it, since I want to index
all entities. I tried to generate it anyway once, but after a long time
processing it failed with an out-of-memory exception (I think during
sorting).
Even if you plan to index all entities you might want to use entity scores,
because such scores are also used to boost entities within the Entityhub.
So if you search, the results will be sorted by the number of incoming links within
dbpedia. This ensures that a search for "Paris" returns Paris, France as the best
result. Without such boosts Paris, Texas could also be returned as the best result.
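To illustrate the effect of such a boost, here is a minimal Python sketch. It is not Entityhub code: the `rank` helper is hypothetical and the incoming-link counts are made up for illustration. Only the idea (order results by incoming links) comes from the explanation above.

```python
# Hypothetical incoming-link counts; the numbers are invented for illustration
# and are not taken from any dbpedia dump.
INCOMING_LINKS = {
    "http://dbpedia.org/resource/Paris": 50000,
    "http://dbpedia.org/resource/Paris,_Texas": 300,
}


def rank(candidates, scores=None):
    """Order candidates by score, highest first.

    Without scores the input order is kept, so any match may come first.
    """
    if not scores:
        return list(candidates)
    return sorted(candidates, key=lambda uri: scores.get(uri, 0), reverse=True)


# Both entities match the label "Paris"; the order here is arbitrary.
matches = [
    "http://dbpedia.org/resource/Paris,_Texas",
    "http://dbpedia.org/resource/Paris",
]

boosted = rank(matches, INCOMING_LINKS)  # entity with more links comes first
unboosted = rank(matches)                # original (arbitrary) order is kept
```

With the scores, the entity with more incoming links is ranked first; without them, the result order depends only on how the matches happened to be returned.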
Is there a way to instruct the indexer to ignore the Entity Scores file? Or to write a
simple one that says "all entities are to be indexed"?
If you want to index all entities without Entity scores, then you need to change the
configuration within the "indexing.properties" file as follows.
Comment out the properties
* entityDataProvider
* entityIdIterator
* scoreNormalizer
Instead, add the following two lines:
# use the Jena TDB as source for indexing the RDF data located within
# "indexing/resource/rdfdata"
entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
# The EntityScore Provider needs to provide the scores for indexed entities
# use the NoEntityScoreProvider if no scores are available
entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider
This will activate the indexing mode that iterates over the data and looks up
scores for the IDs. The NoEntityScoreProvider is a dummy implementation that
returns null as the score for every requested Entity ID. Therefore this
configuration will result in all entities being indexed and no scores being
used.
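The difference between the two indexing modes can be sketched roughly as follows. This is a conceptual Python illustration only, not the Stanbol implementation: all function names are hypothetical, and the assumption that the first mode skips IDs without data is mine.

```python
def index_by_id_iterator(scored_ids, lookup_data):
    """Mode 1 (entityIdIterator + entityDataProvider): iterate over
    (id, score) pairs and look up the data for each id.
    Only ids listed in the scores source are ever visited."""
    for entity_id, score in scored_ids:
        data = lookup_data(entity_id)
        if data is not None:  # assumption: ids without data are skipped
            yield entity_id, data, score


def index_by_data_iterable(entities, lookup_score):
    """Mode 2 (entityDataIterable + entityScoreProvider): iterate over
    the data itself and look up a score for each id."""
    for entity_id, data in entities:
        yield entity_id, data, lookup_score(entity_id)


def no_entity_score(entity_id):
    """Stand-in for a provider that has no scores: always returns None."""
    return None


# Toy data store with two hypothetical entities.
store = {"ex:a": {"label": "A"}, "ex:b": {"label": "B"}}

# Mode 1 with a scores source that only knows "ex:a": "ex:b" is never indexed.
mode1 = list(index_by_id_iterator([("ex:a", 3.0)], store.get))

# Mode 2 with the dummy provider: every entity is indexed, all without score.
mode2 = list(index_by_data_iterable(store.items(), no_entity_score))
```

In the second mode nothing depends on a scores file, which is why every entity in the data ends up in the index.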
Under [1] you can find an indexing.properties file that uses this
configuration. It also provides - as comments - some more background
information about the different configuration options.
Let me also add that the RDF data you have already imported to the TDB store can also be
used for this indexing mode. Therefore I would recommend that you move the RDF data that you
have already imported out of "{indexing-root}/indexing/resources/rdfdata" to
some other directory. This avoids re-importing it on every new indexing process.
best
Rupert Westenthaler
[1]
http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
Thanks, I can send the complete log if it is needed.
Best,
Alex