Re: Custom dbpedia indexing without Entity Scores

Rupert Westenthaler Thu, 03 Nov 2011 06:28:11 -0700

Hi

I think Olivier has already calculated page ranks for dbpedia 3.7. If he still 
has the file around he could upload the compressed file to


http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/

However, one would need to uncompress it before it can be used for indexing.

best
Rupert

On 03.11.2011, at 13:04, Alex Lopez wrote:

> Thanks for the comments Rupert.
> 
> Yes, I had already moved the dumps elsewhere, so at least I don't have to 
> reload all the dumps again :)
> I think I'll try running the script to calculate the scores again, to get 
> the boosts on the entities.
> 
> On 03-11-2011 11:46, Rupert Westenthaler wrote:
>> Hi Alex
>> 
>> see comments inline
>> 
>> On 03.11.2011, at 12:06, Alex Lopez wrote:
>> 
>>> Hi stanbolers,
>>> 
>>> I'm in the middle of creating a custom dbpedia index for Stanbol, using 
>>> some 24 dumps from dbpedia 3.7, English and Portuguese, and some custom 
>>> mappings (specifically, some special treatment of Portuguese text plus 
>>> some additional properties I'd like to see indexed).
>>> 
>>> I'm following this file:
>>> 
>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md
>>> 
>>> and this for processing the broken images_en file
>>> 
>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh
>>> 
>>> The process went well up to the point where all triples (some 80M) had 
>>> been loaded into TDB.
>>> 
>>> The problem is that the process stops after that and outputs a
>>> 
>>> Exception in thread "Thread-3" java.lang.IllegalStateException: The file 
>>> with the Entity Scores is missing
>>>        at 
>>> org.apache.stanbol.entityhub.indexing.core.source.LineBasedEntityIterator.initialise(LineBasedEntityIterator.java:424)
>>> ...
>>> 10:12:22,077 [Thread-2] INFO  solryard.SolrYardIndexingDestination - ... 
>>> create SolrYard
>>> 
>>> And nothing more happens.
>>> 
>> If you use the combination of "entityDataProvider" and "entityIdIterator" 
>> then you need the file with the scores, because it is used to look up the 
>> IDs.
>> 
>>> Of course, the file is missing because I didn't need it, since I want to 
>>> index all entities. I tried to generate it anyway once but after a lot of 
>>> time of processing it failed with some outOfMem exception (I think in the 
>>> process of sorting).
>> 
>> Even if you plan to index all entities you might want to use entity scores, 
>> because such scores are also used to boost entities within the Entityhub.
>> 
>> So if you search, the results will be sorted by the number of incoming 
>> links within dbpedia. This ensures that a search for "Paris" returns Paris, 
>> France as the best result. Without such boosts, Paris, Texas could also be 
>> returned as the best result.
>> 
>>> Is there a way to instruct the indexer to ignore the Entity Scores file? Or 
>>> write some simple one in a way that says "all entities are to be indexed"?
>>> 
>> 
>> If you want to index all entities without Entity Scores then you need to 
>> change the configuration within the "indexing.properties" file as follows.
>> 
>> Comment out the properties
>> 
>> * entityDataProvider
>> * entityIdIterator
>> * scoreNormalizer
>> 
>> Instead, add the following two lines:
>> 
>> # use the Jena TDB as source for indexing the RDF data located within
>> # "indexing/resources/rdfdata"
>> entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
>> # The EntityScore Provider needs to provide the scores for indexed entities
>> # use the NoEntityScoreProvider if no scores are available
>> entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider
>> 
>> This will activate the indexing mode that iterates over the data and looks 
>> up scores for the IDs. The NoEntityScoreProvider is a dummy implementation 
>> that returns null as the score for every requested Entity ID. Therefore 
>> this configuration will result in all entities being indexed and no scores 
>> being used.
>> 
>> Under [1] you can find an indexing.properties file that uses this 
>> configuration. It also provides - as comments - some more background 
>> information about the different configuration options.
>> 
>> Let me also add that the RDF data you have already imported to the TDB 
>> store can also be used for this indexing mode. I would therefore recommend 
>> moving the RDF data you have already imported out of 
>> "{indexing-root}/indexing/resources/rdfdata" into some other directory. 
>> This avoids re-importing it on every new indexing run.
>> 
>> best
>> Rupert Westenthaler
>> 
>> [1] 
>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
>> 
>>> Thanks, I can send the complete log if it is needed.
>>> Best,
>>> Alex
>>
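
The indexing.properties change described in the thread (comment out three properties, add two lines) could be scripted roughly as follows. This is only a sketch: the function name disable_entity_scores and the placeholder values in the usage example are made up for illustration, the two added lines are copied from Rupert's reply, and the parsing assumes simple one-line key=value entries without backslash continuations.

```python
import re

# Properties the thread says to comment out.
TO_COMMENT = ("entityDataProvider", "entityIdIterator", "scoreNormalizer")

# Replacement lines as quoted in the thread.
TO_ADD = (
    "entityDataIterable=org.apache.stanbol.entityhub.indexing.source."
    "jenatdb.RdfIndexingSource,source:rdfdata",
    "entityScoreProvider=org.apache.stanbol.entityhub.indexing.core."
    "source.NoEntityScoreProvider",
)


def disable_entity_scores(text: str) -> str:
    """Return properties text with the score-based settings commented out.

    Hypothetical helper, not part of Stanbol. Assumes each property sits
    on a single line; multi-line values are not handled.
    """
    out = []
    for line in text.splitlines():
        # The key is everything before the first '=', ':' or whitespace.
        key = re.split(r"[=:\s]", line.strip(), maxsplit=1)[0]
        if key in TO_COMMENT:
            line = "#" + line
        out.append(line)
    # Append the two replacement lines unless they are already present,
    # so that running the helper twice changes nothing.
    for new_line in TO_ADD:
        if new_line not in out:
            out.append(new_line)
    return "\n".join(out) + "\n"
```

A possible usage, with placeholder values standing in for whatever the real file contains:

```python
sample = (
    "entityDataProvider=placeholder.A\n"
    "entityIdIterator=placeholder.B\n"
    "scoreNormalizer=placeholder.C\n"
)
print(disable_entity_scores(sample))
```

In practice one would read the existing indexing.properties from the indexing config directory, pass its text through the function, and write the result back after keeping a copy of the original.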
