I am finally successful after converting some Chinese DBpedia dump files to UTF-8, but I can't hit any DBpedia links in Stanbol using this Solr dump. I am just wondering whether I should pre-process the Chinese DBpedia dump files. I uploaded the new jar file successfully as a new bundle (<http://localhost:8080/system/console/bundles/179>). Then I defined a new engine using the reference site 'dbpedia'. I do not have any other DBpedia Solr dump. The chain says it is active and all 3 engines are available. If I use the DBpedia Solr index from Ogrisel (1.19 GB), it works fine and I get some DBpedia links. I did add the instance_types and person_data from the English dump. Am I missing anything else?
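A quick way to decide whether a dump still needs pre-processing is to scan it for lines that fail UTF-8 decoding, since that is exactly what makes the indexing tool choke. This is only a hypothetical sketch; the helper name is my own and is not part of Stanbol or the dbpedia indexing tool:

```python
def find_bad_lines(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8.

    Reads the file in binary mode so decoding errors can be caught
    per line instead of aborting the whole scan.
    """
    bad = []
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad


# Example (file name is illustrative):
# for lineno, raw in find_bad_lines("labels_zh.nt"):
#     print(lineno, raw[:60])
```

If this returns an empty list for every dump file, the encoding itself is fine and the problem lies elsewhere (e.g. in the indexing configuration).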
-harish

On Tue, Aug 21, 2012 at 6:22 PM, harish suvarna <[email protected]> wrote:
>
> On Mon, Aug 20, 2012 at 9:30 PM, Rupert Westenthaler <[email protected]> wrote:
>> On Tue, Aug 21, 2012 at 2:30 AM, harish suvarna <[email protected]> wrote:
>>
>> I have not yet had time to look at dbpedia 3.8. They might have changed
>> the names of some dump files. Generally "instance_types" is very
>> important (it provides the information about the type of an Entity).
>> "person_data" includes additional information for persons; AFAIK that
>> information is not included in the default configuration of the
>> dbpedia indexing tool.
>>
>>> Not all language dumps have these files. Japanese and Italian also do
>>> not have these files. These files are listed in the readme file. Hence
>>> I was looking for them.
>>
>> Types are the same for all languages. Therefore they are only
>> available in English. I am not sure about "person_data", but it might
>> be the same.
>>
>> In other words: if you build an index for a specific language, you
>> need to include the English dumps of those files that are not
>> language specific.
>
> I will try this. Thanks a lot.
>
>>> I get a java exception.
>>
>> The included exceptions look like the RDF file containing the Chinese
>> labels is not well formatted. Experience says that this is most
>> likely related to char encoding issues. This was also the case with
>> some dbpedia 3.7 files (see the special treatment of some files in the
>> shell script of the dbpedia indexing tool).
>
> OK. I will try to debug this.
>
> I converted labels_zh.nt to UTF-8 using MS Word. MS Word adds the BOM
> bytes, though, so I needed to remove them. Then labels_zh.nt went
> through, but long_abstracts has the same problem. So I am still
> working on these other files.
> Thanks a lot for all your patience and all the Stanbol teachings.
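The MS Word round trip (convert, then strip the BOM by hand) can be replaced with a small script that does both in one pass. A minimal sketch, assuming the source file already decodes as UTF-8 with a BOM; the function name is made up, and src_encoding would need to be changed (e.g. to "gb18030") if a file is in a different Chinese encoding:

```python
def to_utf8_without_bom(src, dst, src_encoding="utf-8-sig"):
    """Re-encode a dump file to plain UTF-8 without a BOM.

    The "utf-8-sig" codec transparently strips the leading BOM that
    tools like MS Word prepend; the output is written as plain UTF-8,
    which never carries a BOM.
    """
    with open(src, "r", encoding=src_encoding) as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)


# Example (file names are illustrative):
# to_utf8_without_bom("labels_zh.nt", "labels_zh.utf8.nt")
```

Running each dump file through such a script before indexing avoids the BOM problem for long_abstracts and the other files as well.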
> --
> Thanks
> Harish

--
Thanks
Harish
