I am finally successful after converting some Chinese DBpedia dump files to UTF-8, but I can't hit any DBpedia links in Stanbol using this Solr dump. I am just wondering whether I should pre-process the Chinese DBpedia dump files. I uploaded the new jar file successfully as a new bundle (<http://localhost:8080/system/console/bundles/179>). Then I defined a new engine using the reference site 'dbpedia'. I do not have any other DBpedia Solr dump. The chain says it is active and all 3 engines are available. If I use the DBpedia Solr index from Ogrisel (1.19 GB), it works fine and I get some DBpedia links. I did add the instance_types and person_data from the English dump. Am I missing anything else?
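A quick way to decide whether a dump still needs pre-processing is to scan it for lines that fail UTF-8 decoding, since that is exactly what makes the indexing tool choke. This is only a hypothetical sketch; the helper name is my own and is not part of Stanbol or the dbpedia indexing tool:

```python
def find_bad_lines(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8.

    Reads the file in binary mode so decoding errors can be caught
    per line instead of aborting the whole scan.
    """
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad


# Example (file name is illustrative):
# for lineno, raw in find_bad_lines("labels_zh.nt"):
#     print(lineno, raw[:60])
```

If this returns an empty list for every dump file, the encoding itself is fine and the problem lies elsewhere (e.g. in the indexing configuration).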
-harish

On Tue, Aug 21, 2012 at 6:22 PM, harish suvarna <[email protected]> wrote:
>
> On Mon, Aug 20, 2012 at 9:30 PM, Rupert Westenthaler <[email protected]> wrote:
>> On Tue, Aug 21, 2012 at 2:30 AM, harish suvarna <[email protected]> wrote:
>>
>> I have not yet had time to look at dbpedia 3.8. They might have changed
>> the names of some dump files. Generally "instance_types" is very
>> important (it provides the information about the type of an Entity).
>> "person_data" includes additional information for persons; AFAIK that
>> information is not included in the default configuration of the
>> dbpedia indexing tool.
>>
>>> Not all language dumps have these files. Japanese and Italian also do
>>> not have these files. These files are listed in the readme file. Hence
>>> I was looking for them.
>>
>> Types are the same for all languages. Therefore they are only
>> available in English. I am not sure about "person_data", but it might
>> be the same.
>>
>> In other words: if you build an index for a specific language, you
>> need to include the English dumps of those files that are not
>> language specific.
>
> I will try this. Thanks a lot.
>
>>> I get a java exception.
>>
>> The included exceptions look like the RDF file containing the Chinese
>> labels is not well formatted. Experience says that this is most
>> likely related to char encoding issues. This was also the case with
>> some dbpedia 3.7 files (see the special treatment of some files in the
>> shell script of the dbpedia indexing tool).
>
> OK. I will try to debug this.
>
> I converted labels_zh.nt to UTF-8 using MS Word. MS Word adds the BOM
> bytes, though, so I needed to remove them. Then labels_zh.nt went
> through, but long_abstracts has the same problem. So I am still
> working on these other files.
> Thanks a lot for all your patience and all the Stanbol teachings.
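The MS Word round trip (convert, then strip the BOM by hand) can be replaced with a small script that does both in one pass. A minimal sketch, assuming the source file already decodes as UTF-8 with a BOM; the function name is made up, and src_encoding would need to be changed (e.g. to "gb18030") if a file is in a different Chinese encoding:

```python
def to_utf8_without_bom(src, dst, src_encoding="utf-8-sig"):
    """Re-encode a dump file to plain UTF-8 without a BOM.

    The "utf-8-sig" codec transparently strips the leading BOM that
    tools like MS Word prepend; the output is written as plain UTF-8,
    which never carries a BOM.
    """
    with open(src, "r", encoding=src_encoding) as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)


# Example (file names are illustrative):
# to_utf8_without_bom("labels_zh.nt", "labels_zh.utf8.nt")
```

Running each dump file through such a script before indexing avoids the BOM problem for long_abstracts and the other files as well.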
> --
> Thanks
> Harish

--
Thanks
Harish
