Hi

I think Olivier has already calculated page ranks for dbpedia 3.7. If he
still has the file around he could upload the compressed file to

http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.7/

However, one would need to uncompress it before it can be used for indexing.

best
Rupert

On 03.11.2011, at 13:04, Alex Lopez wrote:

> Thanks for the comments Rupert.
>
> Yes I already had moved the dumps elsewhere, at least I don't have to reload
> all the dumps again :)
> I think I'll try again to run the script to calculate the scores to get the
> boosts on the entities.
>
> On 03-11-2011 11:46, Rupert Westenthaler wrote:
>> Hi Alex
>>
>> see comments inline
>>
>> On 03.11.2011, at 12:06, Alex Lopez wrote:
>>
>>> Hi stanbolers,
>>>
>>> I'm in the middle of the process of creating a custom dbpedia index for
>>> Stanbol, using some 24 dumps from dbpedia 3.7, English and Portuguese, and
>>> some custom mappings (specifically some special treatment for Portuguese
>>> text plus some additional properties I'd like to see indexed).
>>>
>>> I'm following this file:
>>>
>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/README.md
>>>
>>> and this for processing the broken images_en file
>>>
>>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/dbpedia/fetch_prepare.sh
>>>
>>> The process went well up to the point after all triples (some 80M) were
>>> loaded into tdb.
>>>
>>> The problem is that the process stops after that and outputs a
>>>
>>> Exception in thread "Thread-3" java.lang.IllegalStateException: The file with the Entity Scores is missing
>>> at org.apache.stanbol.entityhub.indexing.core.source.LineBasedEntityIterator.initialise(LineBasedEntityIterator.java:424)
>>> ...
>>> 10:12:22,077 [Thread-2] INFO solryard.SolrYardIndexingDestination - ... create SolrYard
>>>
>>> And nothing more happens.
>>>
>> If you use the combination of "entityDataProvider" and "entityIdIterator"
>> then you need the file with the scores, because it is used to look up the
>> IDs.
>>
>>> Of course, the file is missing because I didn't need it, since I want to
>>> index all entities. I tried to generate it anyway once but after a lot of
>>> time of processing it failed with some outOfMem exception (I think in the
>>> process of sorting).
>>
>> Even if you plan to index all entities you might want to use entity scores,
>> because such scores are also used to boost entities within the Entityhub.
>>
>> So if you search, the results will be sorted by the number of incoming links
>> within dbpedia. This ensures that a search for "Paris" returns Paris, France
>> as the best result. Without such boosts Paris, Texas could also be returned
>> as the best result.
>>
>>> Is there a way to instruct the indexer to ignore the Entity Scores file? Or
>>> write some simple one in a way that says "all entities are to be indexed"?
>>>
>>
>> If you want to index all entities without entity scores then you need to
>> change the configuration within the "indexing.properties" file as follows.
>>
>> Comment out the properties
>>
>> * entityDataProvider
>> * entityIdIterator
>> * scoreNormalizer
>>
>> and add the following two lines instead:
>>
>> # use the Jena TDB as source for indexing the RDF data located within
>> # "indexing/resources/rdfdata"
>> entityDataIterable=org.apache.stanbol.entityhub.indexing.source.jenatdb.RdfIndexingSource,source:rdfdata
>> # The EntityScore Provider needs to provide the scores for indexed entities
>> # use the NoEntityScoreProvider if no scores are available
>> entityScoreProvider=org.apache.stanbol.entityhub.indexing.core.source.NoEntityScoreProvider
>>
>> This will activate the indexing mode that iterates over the data and
>> looks up scores for the IDs. The NoEntityScoreProvider is a dummy
>> implementation that returns null as the score for every requested entity ID.
>> Therefore this configuration will result in all entities being indexed and
>> no scores being used.
>>
>> Under [1] you can find an indexing.properties file that uses this
>> configuration. It also provides - as comments - some more background
>> information about the different configuration options.
>>
>> Let me also add that the RDF data you have already imported to the TDB store
>> can also be used for this indexing mode. Therefore I would recommend that
>> you move the RDF data you have already imported from
>> "{indexing-root}/indexing/resources/rdfdata" to some other directory. This
>> avoids re-importing it on every new indexing run.
>>
>> best
>> Rupert Westenthaler
>>
>> [1]
>> http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/genericrdf/src/main/resources/indexing/config/indexing.properties
>>
>>> Thanks, I can send the complete log if it is needed.
>>> Best,
>>> Alex
>>
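
For anyone who wants to script the "indexing.properties" change described in the quoted reply (comment out the three properties, add the two replacement lines), here is a minimal Python sketch. It is not part of Stanbol. The function name switch_to_no_scores is made up for illustration, and the sketch assumes each property sits on a single line in plain key=value form, with no backslash continuations. Only the three property names and the two added lines come from the mail above.

```python
# Property keys that the quoted reply says to comment out.
KEYS_TO_DISABLE = ("entityDataProvider", "entityIdIterator", "scoreNormalizer")

# The two lines that the quoted reply says to add instead.
LINES_TO_ADD = (
    "entityDataIterable=org.apache.stanbol.entityhub.indexing.source."
    "jenatdb.RdfIndexingSource,source:rdfdata",
    "entityScoreProvider=org.apache.stanbol.entityhub.indexing.core."
    "source.NoEntityScoreProvider",
)


def switch_to_no_scores(properties_text: str) -> str:
    """Return the properties text with the score-based settings commented
    out and the score-free settings appended (hypothetical helper)."""
    out = []
    active_keys = set()
    for line in properties_text.splitlines():
        # Assumption: single-line "key=value" entries only.
        key = line.split("=", 1)[0].strip()
        if key in KEYS_TO_DISABLE:
            line = "#" + line  # comment the property out
        elif key and not key.startswith("#"):
            active_keys.add(key)
        out.append(line)
    # Append each replacement line unless its key is already active,
    # so running the function twice does not duplicate entries.
    for new_line in LINES_TO_ADD:
        if new_line.split("=", 1)[0] not in active_keys:
            out.append(new_line)
    return "\n".join(out) + "\n"
```

One would read the existing file, pass its text through switch_to_no_scores, and write the result back. Keeping a backup copy of the original file first seems prudent, since the sketch does not handle multi-line property values.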
