On Mon, Aug 20, 2012 at 9:30 PM, Rupert Westenthaler <
[email protected]> wrote:

> On Tue, Aug 21, 2012 at 2:30 AM, harish suvarna <[email protected]>
> wrote:
> >>
> >> I have not yet had time to look at dbpedia 3.8. They might have changed
> >> the names of some dump files. Generally "instance_types" is very
> >> important (it provides the information about the type of an Entity).
> >> "person_data" includes additional information for persons; AFAIK that
> >> information is not included in the default configuration of the
> >> dbpedia indexing tool.
> >>
> >>
> > Not all language dumps have these files. The Japanese and Italian dumps
> > also do not have them. These files are listed in the readme file, hence
> > I was looking for them.
> >
> Types are the same for all languages. Therefore they are only
> available in English.
> I am not sure about "person_data", but it might be the same there.
>
> In other words - if you build an index for a specific language you
> need to include the English dumps of those that are not language
> specific.
>
> >>> I will try this. Thanks a lot.

> >
> >> > I get a java exception.
> >>
> >> The included exceptions suggest that the RDF file containing the Chinese
> >> labels is not well formed. Experience says that this is most likely
> >> related to character encoding issues. This was also the case with
> >> some dbpedia 3.7 files (see the special treatment of some files in the
> >> shell script of the dbpedia).
> >>
> >> OK. I will try to debug this.
> >
>
>>>>

I converted labels_zh.nt to UTF-8 using MS Word. MS Word adds the BOM bytes
though, so I needed to remove them. After that, labels_zh.nt went through.
But the long abstracts file has the same problem, so I am still working on
these other files.
Thanks a lot for all your patience and all the Stanbol teachings.
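
In case it helps with the remaining files, the conversion and BOM removal
described above could probably be scripted instead of going through MS Word.
Below is a minimal sketch in Python. The function names and file paths are
only illustrative, and the source encoding is an assumption that has to be
determined for each dump file; I have not run this against the actual
dbpedia dumps.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def convert_to_utf8(src_path: str, dst_path: str, src_encoding: str) -> None:
    """Re-encode a text file as UTF-8 without a BOM.

    src_encoding is an assumption the caller must supply; this sketch does
    not try to detect it. The file is streamed line by line so that large
    dump files do not have to fit in memory.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        first = True
        for line in src:
            if first:
                # Drop a BOM character left over at the start of the file.
                line = line.lstrip("\ufeff")
                first = False
            dst.write(line)
```

Usage would be something like
convert_to_utf8("labels_zh.nt", "labels_zh.utf8.nt", "utf-16"), where both
the output file name and the "utf-16" source encoding are guesses to adjust.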



-- 
Thanks
Harish
