Hi Alex,

see my comments inline.
On 03.11.2011, at 12:06, Alex Lopez wrote:

> Hi stanbolers,
>
> I'm in the middle of the process of creating a custom dbpedia index for
> Stanbol, using some 24 dumps from dbpedia 3.7, english and portuguese, and
> some custom mappings (in specific some special treating for Portuguese text
> plus some additional properties I'd like to see indexed).
>
> I'm following this file:
>
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md
>
> and this for processing the broken images_en file
>
> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh
>
> The process went well up to the point after all triples (some 80M) were
> loaded into tdb.
>
> The problem is that the process stops after that and outputs a
>
> Exception in thread "Thread-3" java.lang.IllegalStateException: The file with
> the Entity Scores is missing
> at
> org.apache.stanbol.entityhub.indexing.core.source.LineBasedEntityIterator.initialise(LineBasedEntityIterator.java:424)
> ...
> 10:12:22,077 [Thread-2] INFO solryard.SolrYardIndexingDestination - ...
> create SolrYard
>
> And nothing more happens.
>

If you use the combination of "entityDataProvider" and "entityIdIterator", then you need the file with the scores, because it is used to look up the IDs.

> Of course, the file is missing because I didn't need it, since I want to
> index all entities. I tried to generate it anyway once but after a lot of
> time of processing it failed with some outOfMem exception (I think in the
> process of sorting).

Even if you plan to index all entities, you might still want to use entity scores, because these scores are also used to boost entities within the Entityhub. So if you search, the results will be sorted by the number of incoming links within dbpedia. This ensures that a search for "Paris" returns Paris, France as the best result. Without such boosts, Paris, Texas could just as well be returned as the best result.
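
To make the effect of such boosts a bit more concrete, here is a toy sketch in Python. It is not how the Entityhub or Solr actually compute the ranking: the link counts, the `rank` function and the log-based boost formula are all invented for illustration. It only shows that two entities with an identical text match can only be told apart once a score is available.

```python
import math

# Toy illustration only: the link counts and the boost formula below are
# invented for this example; they are not taken from dbpedia or Stanbol.
def rank(candidates, use_scores=True):
    """Sort candidates by text match, optionally boosted by incoming links."""
    def key(candidate):
        boost = math.log1p(candidate["incoming_links"]) if use_scores else 1.0
        return candidate["text_match"] * boost
    # sorted() is stable, so equal keys keep their input order
    return sorted(candidates, key=key, reverse=True)

candidates = [
    {"label": "Paris, Texas", "text_match": 1.0, "incoming_links": 300},
    {"label": "Paris (France)", "text_match": 1.0, "incoming_links": 50000},
]

print([c["label"] for c in rank(candidates)])
print([c["label"] for c in rank(candidates, use_scores=False)])
```

With scores the entity with more incoming links comes first; without them both candidates tie and the order is simply whatever order they arrived in.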
> Is there a way to instruct the indexer to ignore the Entity Scores file? Or
> write some simple one in a way that says "all entities are to be indexed"?
>

If you want to index all entities without entity scores, then you need to change the configuration within the "indexing.properties" file as follows.

Comment out the properties

* entityDataProvider
* entityIdIterator
* scoreNormalizer

and add the following two lines instead:

# use the Jena TDB as source for indexing the RDF data located within
# "indexing/resources/rdfdata"
entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata

# The EntityScore Provider needs to provide the scores for indexed entities
# use the NoEntityScoreProvider if no scores are available
entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider

This activates the indexing mode that iterates over the data and looks up the scores for the IDs. The NoEntityScoreProvider is a dummy implementation that returns null as the score for every requested entity ID. Therefore this configuration results in all entities being indexed and no scores being used.

Under [1] you can find an indexing.properties file that uses this configuration. It also provides, as comments, some more background information about the different configuration options.

Let me also add that the RDF data you have already imported into the TDB store can also be used for this indexing mode. Therefore I would recommend that you move the RDF data you have already imported out of "{indexing-root}/indexing/resources/rdfdata" to some other directory. This avoids re-importing it on every new indexing run.

best
Rupert Westenthaler

[1] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties

> Thanks, I can send the complete log if it is needed.
> Best,
> Alex
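
PS: if you prefer to script the indexing.properties change described above instead of editing the file by hand, something along these lines might do. This is only a rough sketch: `switch_to_no_scores` is a hypothetical helper name, the `PLACEHOLDER` values stand in for whatever your file currently contains, and it does not handle property values continued over several lines with a trailing backslash.

```python
import re

# Properties to comment out and lines to add, as described in this mail.
DISABLE = ("entityDataProvider", "entityIdIterator", "scoreNormalizer")
ADD = (
    "entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata",
    "entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider",
)

def switch_to_no_scores(text):
    """Return the properties text with DISABLE commented out and ADD appended."""
    out = []
    for line in text.splitlines():
        # the key is everything before the first '=', ':' or whitespace
        key = re.split(r"[=:\s]", line.strip(), maxsplit=1)[0]
        if key in DISABLE:
            line = "#" + line
        out.append(line)
    # do not append a property that is already active in the file
    present = {
        l.split("=", 1)[0].strip()
        for l in out
        if not l.lstrip().startswith("#")
    }
    out.extend(l for l in ADD if l.split("=", 1)[0] not in present)
    return "\n".join(out) + "\n"

# PLACEHOLDER stands in for the values currently in your own file.
sample = (
    "entityDataProvider=PLACEHOLDER\n"
    "entityIdIterator=PLACEHOLDER\n"
    "scoreNormalizer=PLACEHOLDER\n"
)
print(switch_to_no_scores(sample))
```

To apply it you would read your own indexing.properties, pass its content through the function and write the result back. Running it a second time leaves the file unchanged.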
