Re: dbpedia solr index dump

Rupert Westenthaler Thu, 30 Aug 2012 05:02:17 -0700

Hi Harish,

sorry for the long mail, but this covers a lot of different things ...


On Thu, Aug 30, 2012 at 2:22 AM, harish suvarna <[email protected]> wrote:
> Some insight into what exactly Oliver did would help. Some statistics of
> generating index on my 8gb MacBookPro. Chinese index is taking 3hrs and
> macpro desktop is taking <15mins.

You should not try to index DBpedia on a laptop without an SSD. The
small HDs in laptops have far too low IO performance for indexing (I
had ~100 IO operations per second on my MacBookPro; with an SSD I
consistently get more than 5000). I once tried (and succeeded) to index
DBpedia on my MacBookPro, but it took four full days :(

Desktop computers usually have much faster HDs, so this is most likely
the reason why you see the big difference in indexing time.


On Thu, Aug 30, 2012 at 2:22 AM, harish suvarna <[email protected]> wrote:
> Rupert,
> Thank you.
> More progress now. I see some indexes being added and the resulting chinese
> dbpedia.solr.zip is 17MB. I drop this into Stanbol,

17MB is still not much. As I mentioned already I would expect the file
to be at least ten times bigger.

What you should do is to review the data directly via the RESTful
interface of the Solr core. If you run Stanbol on localhost:8080 this
url should work

    http://localhost:8080/solr/default/dbpedia/select?q=*:*

I would expect data like

    <arr name="@zh/rdfs:label/">
        <str>美國</str>
    </arr>

as part of the returned data.

You can also explicitly query for entities by label (I use '美' here
only as an example):

http://localhost:8080/solr/default/dbpedia/select?q=@zh/rdfs\:label/:%E7%BE%8E*

If you see that data and get the expected results for queries like
this one, then please also try the same query without the '*'
(wildcard) at the end and check whether this makes any difference.
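
To make the two variants easy to compare, here is a minimal Python
sketch (not part of Stanbol) that builds both URLs. It assumes the same
host, port and core name as the example above; the function name
label_query_url is only illustrative.

```python
from urllib.parse import quote

# Assumed base URL: Stanbol on localhost:8080 with the "dbpedia" core.
SOLR_SELECT = "http://localhost:8080/solr/default/dbpedia/select"


def label_query_url(text, lang="zh", wildcard=True):
    """Build a select URL that searches the language-specific rdfs:label field."""
    # The ':' inside the field name is escaped so that Solr does not
    # read it as the separator between field name and value.
    field = "@{}/rdfs\\:label/".format(lang)
    # Percent-encode the (UTF-8) search text; append '*' for a prefix query.
    term = quote(text) + ("*" if wildcard else "")
    return "{}?q={}:{}".format(SOLR_SELECT, field, term)


if __name__ == "__main__":
    print(label_query_url("美"))                  # prefix (wildcard) query
    print(label_query_url("美", wildcard=False))  # same query without '*'
```

Opening both printed URLs in a browser and comparing the number of
results shows whether the wildcard makes a difference.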

> I donot get any results back.

What engines do you use? The KeywordLinkingEngine needs only the
labels, so it should work even when there is no type information. So
if the above queries work for you, then the cause is most likely that
the KeywordLinkingEngine does not create good queries for Chinese.

To debug this I would start Apache Stanbol in DEBUG mode (add '-l
DEBUG' when starting Stanbol). Then open the
'{stanbol-working-dir}/stanbol/logs/error.log' file and filter the
log entries for the
"org.apache.stanbol.entityhub.yard.solr.impl.SolrQueryFactory"
component.

When doing that you will see Solr OR queries for two words of the text. e.g.

    (((@en/rdfs\:label/:"Android")) OR ((@en/rdfs\:label/:"Google")))

Those queries will help you to find out what the KeywordLinkingEngine
is looking for, and also what goes wrong.
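
If the log file is large, a small script can pull out the relevant
lines. The following is only a sketch: it assumes that the log lines
contain the text "SolrQueryFactory" somewhere (the logger name may be
abbreviated in your log layout), and the function name filter_log_lines
is made up for this example.

```python
import sys


def filter_log_lines(lines, needle="SolrQueryFactory"):
    """Return only the log lines that mention the given component."""
    return [line.rstrip("\n") for line in lines if needle in line]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path of your error.log as the first argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for line in filter_log_lines(log):
            print(line)
```

Call it with the path to your own error.log as the only argument.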

Possible causes:

* The KeywordLinkingEngine does not add '*' at the end (e.g. it uses
"google" instead of "google*"), because it expects to search for full
words (and not parts of words). But if the Solr Tokenizer/Stemmer for
Chinese is not configured, then this might be the origin of the problem.
* The KeywordLinkingEngine tokenizes the parsed text using the OpenNLP
Tokenizer. The default OpenNLP Tokenizer (the SimpleTokenizer)
uses

    Character.isWhitespace(charCode)  ||
    Character.getType(charCode) == Character.SPACE_SEPARATOR;

for tokenizing. I have no idea if/how well this works for Chinese,
but if it does not, then you will see Solr queries covering several
words in a single OR.
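
To illustrate the concern, here is a small Python analogue of a
tokenizer that splits only on the separator check quoted above. It is
not the OpenNLP code, and Python's isspace() differs from Java's
Character.isWhitespace() for a few characters (e.g. the non-breaking
space), so take it as a sketch only.

```python
import unicodedata


def is_separator(ch):
    # Python analogue of the Java check quoted above:
    # whitespace, or Unicode general category "Zs" (space separator).
    return ch.isspace() or unicodedata.category(ch) == "Zs"


def whitespace_tokens(text):
    """Split text into tokens at separator characters only."""
    tokens, current = [], []
    for ch in text:
        if is_separator(ch):
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(whitespace_tokens("Google released Android"))
    print(whitespace_tokens("美國是一個國家"))
```

The English sentence is split into three tokens, while the Chinese
sentence, which contains no separators, comes back as one single token.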

> I am using the english persondata and instance_types. These files still
> have

Adding (or editing) the English "instance_types" will not work, as the
zh data uses different URLs. However, the mappings from the Chinese
entities to the English ones are available. So you can copy over the
types of the English versions by adding both

    http://downloads.dbpedia.org/3.8/zh/interlanguage_links_same_as_zh.nt.bz2
    http://downloads.dbpedia.org/3.8/en/instance_types_en.nt.bz2

and then use LDPath to copy the types of the English version over to
the Chinese one

   rdf:type = owl:sameAs/rdf:type;

This follows outgoing 'owl:sameAs' relations (as defined by
'interlanguage_links_same_as_zh.nt.bz2') and then copies over the
rdf:type values (as defined by 'instance_types_en.nt.bz2'). So this
will allow you to have the rdf:type values for all Entities that are
mapped to the English DBpedia. However, NOTE that (1) not all entities
are mapped to the English DBpedia and (2) not all entities in the
English DBpedia have an rdf:type, so you will still have a lot of
Entities without rdf:type information.

For doing that you need to add the "LdpathSourceProcessor" to the
"entityProcessor" property of your "indexing.properties" file.

Here is an example (you need to remove any line breaks from the
following text if you copy it over to your indexing config!)

entityProcessor=org.apache.stanbol.entityhub.indexing.core.processor.LdpathSourceProcessor,ldpath:ldpath-mapping.txt,append:true;org.apache.stanbol.entityhub.indexing.core.processor.FieldValueFilter,config:entityTypes;org.apache.stanbol.entityhub.indexing.core.processor.FiledMapperProcessor

Note that this links to the file "ldpath-mapping.txt". You also need to
create this file in the config directory. It is expected to contain
the LDpath program

    rdf:type = owl:sameAs/rdf:type;

The LDpath program configured in this file will be executed on the
indexing source (the Jena TDB store with the DBpedia data). You can
use this to execute any kind of LDpath programs as described on [1]
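
To make clear what the 'owl:sameAs/rdf:type' path computes, here is a
toy Python sketch that evaluates the same two hops over a list of
in-memory triples. It does not use LDPath or Jena; the function names
and the Chinese example URI are made up for illustration.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def objects(triples, subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def copied_types(triples, entity):
    """Follow owl:sameAs from the entity, then collect the rdf:type values."""
    types = []
    for same in objects(triples, entity, OWL_SAME_AS):
        for rdf_type in objects(triples, same, RDF_TYPE):
            if rdf_type not in types:
                types.append(rdf_type)
    return types


if __name__ == "__main__":
    zh = "http://zh.dbpedia.org/resource/Example"  # hypothetical zh entity
    en = "http://dbpedia.org/resource/Autism"
    triples = [
        (zh, OWL_SAME_AS, en),  # from the interlanguage links
        (en, RDF_TYPE, "http://dbpedia.org/ontology/Disease"),
        (en, RDF_TYPE, "http://www.w3.org/2002/07/owl#Thing"),
    ]
    print(copied_types(triples, zh))
```

An entity without an owl:sameAs link to the English DBpedia yields an
empty list, which corresponds to limitation (1) mentioned above.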


>
> Also the person data has strings like "Aristotle"@en . Do I need
> "Aristotle"@zh
>

For person data you would also need to use the LDpath approach
explained above. However, changing "Aristotle"@en to "Aristotle"@zh
sounds strange to me, because the Chinese label for "Aristotle" should
be "亚里士多德" (at least based on DBpedia.org). So if no person data is
available for Chinese, I would recommend not using it at all.

>
> I am starting the English index to make sure that it works.

Creating a new DBpedia index for version 3.8 is also on my TODO list. So


best
Rupert

[1] http://code.google.com/p/ldpath/wiki/PathLanguage

>
> -harish
>
> ================instance types
> <http://dbpedia.org/resource/Autism> <
> http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
> http://dbpedia.org/ontology/Disease> .
> <http://dbpedia.org/resource/Autism> <
> http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
> http://www.w3.org/2002/07/owl#Thing> .
> <http://dbpedia.org/resource/Animal_Farm> <
> http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
> http://dbpedia.org/ontology/Book> .
> <http://dbpedia.org/resource/Animal_Farm> <
> http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Book> .
>
> ============Person data====================
> <http://dbpedia.org/resource/Aristotle> <http://xmlns.com/foaf/0.1/name>
> "Aristotle"@en .
> <http://dbpedia.org/resource/Aristotle> <
> http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <
> http://xmlns.com/foaf/0.1/Person> .
> <http://dbpedia.org/resource/Aristotle> <
> http://purl.org/dc/elements/1.1/description> "Greek philosopher"@en .
>
>
>
>
>
> On Mon, Aug 27, 2012 at 11:59 PM, Rupert Westenthaler <
> [email protected]> wrote:
>
>> Hi,
>>
>> oh sorry, I completely forgot to answer your question about the problem
>> with the indexing configuration. But it looks like you were on the right
>> track anyway, as the problem is indeed with the format of the
>> "incoming_links.txt", which is caused by the different namespace of the
>> Chinese dump.
>>
>> Here are the details
>>
>> The expected format of the "incoming_links.txt" (based on the
>> configuration in "iditerator.properties") is
>>
>>     {score} {local-name}
>>
>> Note also the 'id-namespace' property that is set to
>> "http://dbpedia.org/resource/" in the "iditerator.properties" file.
>>
>> This configuration corresponds to the 'sed' command in your shell script
>>
>> > curl http://downloads.dbpedia.org/3.6/zh/page_links_zh.nt.bz2 \
>> >         | bzcat \
>> >         | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
>> >         | sort \
>> >         | uniq -c  \
>> >         | sort -nr > incoming_links.txt
>>
>> Because the Chinese dump uses a different namespace, the regex (of
>> the -e parameter) does not match, and so the URIs of the
>> Entities are not correctly extracted from the "page_links_zh.nt.bz2"
>> file. As a result the output of the script is not the expected
>> one.
>>
>> To fix this you need to make the following two changes:
>>
>> 1) change the sed command so that it uses the correct namespace
>>
>>     sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
>>
>> 2) change the value for the 'id-namespace' in
>> {indexing-working-dir}/indexing/config/iditerator.properties to the
>> namespace used by the Chinese dump "http://zh.dbpedia.org/resource/"
>>
>>
>> NOTES:
>>
>> * I noticed that the curl part of the included shell script still
>> refers to version 3.6. You probably want to download the data
>> from "http://downloads.dbpedia.org/3.8/zh/page_links_zh.nt.bz2"
>> instead.
>>
>> * for testing it is nice to add a '| head -n 1000 \' between ' | bzcat
>> \' and the 'sed' command. This causes only the first 'n' lines of the
>> dump to be processed. This will execute in <1sec and allows you to
>> review the results of the command. You can even use the resulting
>> "incoming_links.txt" file for indexing! While this will only index a
>> small fraction of the entities it might still be useful for testing.
>>
>> I ran some tests and the following script looked fine to me (NOTE: it
>> contains the '| head -n 1000 \' line, which you will want to remove
>> after checking the results)
>>
>> curl http://downloads.dbpedia.org/3.8/zh/page_links_zh.nt.bz2 \
>>          | bzcat \
>>          | head -n 1000 \
>>          | sed -e 's/.*<http\:\/\/zh\.dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
>>          | sort \
>>          | uniq -c  \
>>          | sort -nr > incoming_links.txt
>>
>>
>> again, sorry for the late response
>>
>> best
>> Rupert
>>
>> On Mon, Aug 27, 2012 at 5:09 PM, harish suvarna <[email protected]>
>> wrote:
>> > Rupert, any clues on this problem?
>> >
>> > The resources below have http://zh.dbpedia.org. That does not exist.
>> Does
>> > it cause any problems? I did
>> >
>> > curl http://downloads.dbpedia.org/3.6/zh/page_links_zh.nt.bz2 \
>> >         | bzcat \
>> >         | sed -e 's/.*<http\:\/\/dbpedia\.org\/resource\/\([^>]*\)> ./\1/' \
>> >         | sort \
>> >         | uniq -c  \
>> >         | sort -nr > incoming_links.txt
>> >
>> > to generate chinese incoming_links.txt.
>> >
>> > -harish
>> >
>> > On Thu, Aug 23, 2012 at 2:15 PM, harish suvarna <[email protected]>
>> wrote:
>> >
>> >> OK. Great. It may be easy to fix then. here are few lines.
>> >>
>> >> 1192 <
>> >>
>> http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u5916\u8CFC\u7F8E\u570B\u96FB\u5F71\u5217\u8868
>> >
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/\u660E\u73E0\u53F0> .
>> >>  876 <
>> >> http://zh.dbpedia.org/resource/NGC\u5929\u4F53\u5217\u8868_(1000-1999)>
>> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/\u661F\u7CFB> .
>> >>  781 <
>> >>
>> http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u7BC0\u76EE\u5217\u8868
>> >
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .
>> >>  611 <
>> >>
>> http://zh.dbpedia.org/resource/\u7121\u7DAB\u96FB\u8996\u5916\u8CFC\u52D5\u756B\u5217\u8868
>> >
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/\u7FE1\u7FE0\u53F0> .
>> >>  573 <
>> http://zh.dbpedia.org/resource/NGC\u5929\u4F53\u5217\u8868_(1-999)>
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/\u661F\u7CFB> .
>> >>  519 <
>> >>
>> http://zh.dbpedia.org/resource/\u540D\u5075\u63A2\u67EF\u5357\u52D5\u756B\u96C6\u6578\u5217\u8868
>> >
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >>
>> http://zh.dbpedia.org/resource/\u540D\u5075\u63A2\u67EF\u5357\u6F2B\u756B\u5217\u8868
>> >
>> >> .
>> >>  384 <
>> >>
>> http://zh.dbpedia.org/resource/2006\u5E74\u9999\u6E2F\u9078\u8209\u59D4\u54E1\u6703\u754C\u5225\u5206\u7D44\u9078\u8209
>> >
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/File:Black_check.svg> .
>> >>  366 <
>> >>
>> http://zh.dbpedia.org/resource/\u5A1B\u6A02\u767E\u5206\u767E\u7BC0\u76EE\u5217\u8868_(2007\u5E74)
>> >
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/\u5C0F\u9B3C> .
>> >>  365 <
>> >>
>> http://zh.dbpedia.org/resource/\u7C21\u7E41\u8F49\u63DB\u4E00\u5C0D\u591A\u5217\u8868
>> >
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/File:Cmbox_move.png> .
>> >>  355 <
>> >>
>> http://zh.dbpedia.org/resource/\u5A1B\u6A02\u767E\u5206\u767E\u7BC0\u76EE\u5217\u8868_(2007\u5E74)
>> >
>> >> <http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/\u5C0F\u8C6C> .
>> >> 7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category:\u90B5\u9633\u4EBA> .
>> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category:\u8523\u59D3> .
>> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category
>> :\u806F\u5408\u570B\u5B89\u5168\u7406\u4E8B\u6703\u4E3B\u5E2D>
>> >> .
>> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category
>> :\u570B\u7ACB\u6E05\u83EF\u5927\u5B78\u6559\u6388>
>> >> .
>> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category
>> :\u54E5\u502B\u6BD4\u4E9E\u5927\u5B78\u6821\u53CB>
>> >> .
>> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category:\u53F0\u7063\u5916\u7701\u4EBA>
>> .
>> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category
>> :\u5357\u958B\u5927\u5B78\u6559\u6388>
>> >> .
>> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category
>> :\u4E2D\u83EF\u6C11\u570B\u99D0\u8607\u806F\u5927\u4F7F>
>> >> .
>> >>    7 <http://zh.dbpedia.org/resource/\u8523\u5EF7\u9EFB> <
>> >> http://dbpedia.org/ontology/wikiPageWikiLink> <
>> >> http://zh.dbpedia.org/resource/Category
>> :\u4E2D\u83EF\u6C11\u570B\u99D0\u7F8E\u570B\u5927\u4F7F>
>> >> .
>> >>
>> >>
>> >>
>> >> On Thu, Aug 23, 2012 at 1:37 PM, Rupert Westenthaler <
>> >> [email protected]> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> one more thing. Can you please post me the first few lines of
>> >>>
>> >>>  {indexing-source}/indexing/resource/incoming_links.txt
>> >>>
>> >>> so that I can check the data against the configuration of the
>> >>> iditerator.properties file
>> >>>
>> >>> best
>> >>> Rupert
>> >>>
>> >>> On Thu, Aug 23, 2012 at 10:31 PM, Rupert Westenthaler
>> >>> <[email protected]> wrote:
>> >>> > Hi
>> >>> >
>> >>> > The log shows clearly that you only import the triples from the dumps
>> >>> > to the Jena TDB triple store used as Source for the indexing.
>> >>> >
>> >>> > See all the lines such as
>> >>> >
>> >>> >     8:14:08,196 [Thread-5] INFO  tdb.loader - Add: 50,000 triples
>> >>> > (Batch: 3,256 / Avg: 3,256)
>> >>> >     08:14:12,802 [Thread-5] INFO  tdb.loader - Add: 100,000 triples
>> >>> > (Batch: 10,855 / Avg: 5,009)
>> >>> >
>> >>> > BTW: this needs only to be done once. After this initialization step
>> >>> > completes you can remove the RDF files from
>> >>> > "{indexing-root}/indexing/resources/rdfdata/" (I usually just rename
>> >>> > the rdfdata folder to imported-rdfdata).
>> >>> >
>> >>> > The ~1.5hrs are just the time needed to import the data from the RDF
>> >>> > dumps to the Jena TDB store.
>> >>> >
>> >>> > With
>> >>> >
>> >>> >     08:18:04,242 [main] INFO  impl.IndexerImpl - Indexing started ...
>> >>> >
>> >>> > the indexing starts and
>> >>> >
>> >>> >     08:21:03,176 [Indexing: Finished Entity Logger Deamon] INFO
>> >>> > impl.IndexerImpl - Indexed 0 items in 1410320sec (Infinityms/item):
>> >>> > processing:  -1.000ms/item | queue:  -1.000ms
>> >>> >
>> >>> > states clearly that no single Entity was indexed.
>> >>> >
>> >>> > I guess this has to do with the configuration. I will have a look at
>> >>> > it tomorrow morning.
>> >>> >
>> >>> > best
>> >>> > Rupert
>> >>> >
>> >>> > On Thu, Aug 23, 2012 at 9:53 PM, harish suvarna <[email protected]>
>> >>> wrote:
>> >>> >> I am attaching the zip of config folder. The indexing takes quiet
>> some
>> >>> time
>> >>> >> (~1.5hrs). The number of triples it generates is high.
>> >>> >> I am attaching the english indexing output also. I used 10 files
>> >>> (except
>> >>> >> long_abstarcts_en.nt, it is 2.5 GB and I could not save it in utf8
>> on
>> >>> my
>> >>> >> mac.). But for Chinese I had all files.
>> >>> >> -harish
>> >>> >>
>> >>> >>
>> >>> >> On Thu, Aug 23, 2012 at 12:27 PM, Rupert Westenthaler
>> >>> >> <[email protected]> wrote:
>> >>> >>>
>> >>> >>> I would expect the dbpedia.solrindex.zip file to be several
>> hundreds
>> >>> >>> MByte in size (if not gigabytes).
>> >>> >>>
>> >>> >>> The only explanation for this file to be so small is that
>> something is
>> >>> >>> going wrong during indexing.
>> >>> >>>
>> >>> >>> Can you maybe provide the {indexing-root}/indexing/config folder so
>> >>> >>> that I can have a look at your configuration
>> >>> >>>
>> >>> >>> best
>> >>> >>> Rupert
>> >>> >>>
>> >>> >>> On Thu, Aug 23, 2012 at 5:49 PM, harish suvarna <
>> [email protected]>
>> >>> >>> wrote:
>> >>> >>> >
>> >>> >>> > Rupert,
>> >>> >>> > I generated the index for dbpedia3.8 English files only.
>> >>> >>> > One thing that intrigues me is that the dbpedia.solrindex.zip
>> >>> filesize
>> >>> >>> > is
>> >>> >>> > 53kb, same when I generated for chinese. The english files are
>> much
>> >>> >>> > bigger.
>> >>> >>> > In the english zip also, I can't find paris.
>> >>> >>> > I am attaching English dbpedia.solrindex.zip for any clues.
>> >>> >>> > Do I need to load the bundle jar file created by the dbpedia
>> >>> indexing?
>> >>> >>> >
>> >>> >>> > -harish
>> >>> >>>
>> >>> >>>
>> >>> >>>
>> >>> >>> --
>> >>> >>> | Rupert Westenthaler             [email protected]
>> >>> >>> | Bodenlehenstraße 11
>> ++43-699-11108907
>> >>> >>> | A-5500 Bischofshofen
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> --
>> >>> >> Thanks
>> >>> >> Harish
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > | Rupert Westenthaler             [email protected]
>> >>> > | Bodenlehenstraße 11                             ++43-699-11108907
>> >>> > | A-5500 Bischofshofen
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> | Rupert Westenthaler             [email protected]
>> >>> | Bodenlehenstraße 11                             ++43-699-11108907
>> >>> | A-5500 Bischofshofen
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Thanks
>> >> Harish
>> >>
>> >>
>> >
>> >
>> > --
>> > Thanks
>> > Harish
>>
>>
>>
>> --
>> | Rupert Westenthaler             [email protected]
>> | Bodenlehenstraße 11                             ++43-699-11108907
>> | A-5500 Bischofshofen
>>
>
>
>
> --
> Thanks
> Harish



-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
