On Tue, Feb 15, 2011 at 8:05 PM, Olivier Grisel
<[email protected]> wrote:
> Great. Do you think it would be possible to have a default
> configuration for a small index of the top 10000 entities as measured
> by popularity?
Yes, the indexer can be configured to build specialized indexes.
However, to make it really easy to use I would need to implement some
improvements.

To use the current version, have a look at the /entityhub/indexer/dbPedia bundle:

(1) Use "mvn assembly:assembly" to build the jar with all dependencies
(2) Copy the jar to a different directory (because otherwise "mvn clean"
might delete some files you do not want deleted)
(3) Use "java -jar
org.apache.stanbol.entityhub.indexing.dbPedia-0.1-SNAPSHOT-jar-with-dependencies.jar
-h" to see options
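
The three steps above could be scripted roughly as follows. This is only a
sketch: WORK_DIR and the DRY_RUN switch are hypothetical names of mine, and
the "target/" path assumes Maven's default output directory. It has to be
run from within the dbPedia indexer bundle.

```shell
#!/bin/sh
# Jar name as produced by the assembly build (taken from step 3 above).
JAR=org.apache.stanbol.entityhub.indexing.dbPedia-0.1-SNAPSHOT-jar-with-dependencies.jar
# Hypothetical working directory outside of the source tree.
WORK_DIR="${WORK_DIR:-$HOME/dbpedia-indexing}"

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

build_indexer() {
  run mvn assembly:assembly            # (1) build the jar with all dependencies
  run mkdir -p "$WORK_DIR"
  run cp "target/$JAR" "$WORK_DIR/"    # (2) keep it safe from "mvn clean"
  run java -jar "$WORK_DIR/$JAR" -h    # (3) show the available options
}
```

Calling "DRY_RUN=1 build_indexer" only prints the four commands, which is
handy for checking the paths before starting a real build.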

Parameters:
The first parameter is the URL of the Solr Core used for indexing. You
will want to configure a separate core for the dbPedia index.
The second parameter is the path to a directory with the RDF dump of
DBPedia. Files can be found at "http://wiki.dbpedia.org/Downloads36".
Download the files you need and put them into a directory. The indexer
will automatically process all the files.

Options:
 -i : this can be used to provide the file with incoming links. You
should know better than I how to create such files, because you
provided me with the one I used to create my index
("incoming-counts.tsv"). Note that this file is based on an older
version of the dbPedia dump; because of that, newer entities will not be
ignored during indexing.
 -ri : the minimum number of incoming links required for an entity to be
included in the index. This can be used to control the size of the
index.
 -s : This is very handy to resume the indexing if you have already
completed the importing of the RDF data.
 -r : Resume Mode. Can also be used to activate the entity ranking
based indexing mode (see NOTE below)
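
For a "top N" index, one way to pick a value for -ri would be to look up
the incoming-link count of the N-th most linked entity in the file given
to -i. The sketch below assumes two tab-separated columns (entity first,
count second); I am not sure that matches the actual layout of
"incoming-counts.tsv", so adjust the column number if needed. The function
name is made up for this example.

```shell
#!/bin/sh
# Print the incoming-link count of the N-th most linked entity.
# ASSUMPTION: column 2 of the tab-separated file holds the count.
min_links_for_top_n() {
  file=$1
  n=$2
  # Take the count column, sort numerically (largest first), print line N.
  cut -f2 "$file" | sort -rn | sed -n "${n}p"
}
```

For example, "min_links_for_top_n incoming-counts.tsv 10000" prints a
candidate threshold for the top 10000 entities. Entities that tie with
that count would presumably be included as well, so the resulting index
may end up slightly larger than N.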

IMPORTANT NOTE: For building small indices (number of indexed entities
<< number of entities in the dataset) it will be faster to activate
the "-r" switch. The generic RDF Indexer has two modes for iterating
over the entities in the dataset: first, by iterating over all triples,
and second, by using the entity ranking (the file passed with the -i
option). The first method is ~5 times faster than the second, but if one
only indexes a small subset of the entities, the entity ranking based
indexing mode will still be more efficient.
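
As a rough rule of thumb for choosing between the two modes, here is a
small sketch. It rests on a cost model that is my simplification and not
something measured in the indexer: triple iteration touches every entity
at unit cost, while ranking-based iteration touches only the selected
entities at about 5 times that cost. The function name and the numbers in
the usage example are hypothetical.

```shell
#!/bin/sh
# Suggest an iteration mode for a given index size.
# ASSUMPTION: cost(triples) = total, cost(ranking) = selected * slowdown.
suggest_mode() {
  selected=$1        # number of entities to be indexed
  total=$2           # number of entities in the dataset
  slowdown=${3:-5}   # ranking mode is ~5 times slower per entity
  if [ $((selected * slowdown)) -lt "$total" ]; then
    echo "ranking (-r)"
  else
    echo "triples (no -r)"
  fi
}
```

With a made-up dataset of 3000000 entities, "suggest_mode 10000 3000000"
suggests the ranking mode, while "suggest_mode 1000000 3000000" suggests
iterating over the triples. Under this model the break-even point is at
roughly one fifth of the dataset.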

On my laptop it took around 3 days to build the index, but this was
mainly limited by the ~100 IO operations/sec of the hard disk.

>
> I am also thinking of building maven artifacts to embed the opennlp
> models in version 1.5 without checking them in the Stanbol svn repo. I
> could help you bundle a set of small entity indexes.
>

That would be a cool thing to do. I am especially interested in finding a
good way to provide configurations for the Entityhub (in particular to
provide a default config so that the Entityhub can be used without any
required configuration).
Adding new Referenced Sites by copying special bundles to a config
directory (e.g. by using
http://felix.apache.org/site/apache-felix-file-install.html) would be
another great thing to do.

> Also could you write a howto for building indexes? I think such howto
> should better be written as text file in the stanbol source tree or
> better as a new documentation page for the stanbol website (using the
> markdown syntax) rather than a new wikipage on the IKS wiki).
>
I do not plan to update the documentation on the IKS wiki.
Looking at the Stanbol website and starting to move/adapt the existing
documentation has been on my TODO list for some weeks. However, I fear
that I will only have time to start with this after the Semantic
Interaction Framework Hackathon, February 24th-26th, in Vienna.

> As soon as you have such an howto ready I would be glad to write a
> bunch of pig scripts to build indexes for topics (rather than
> entities) so as to be able to perform document level topic assignment
> rather than occurrence-based entity lookups.
>
OK I do not really understand what you mean by that.

best
Rupert


> --
> Olivier
>



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
