Re: [Virtuoso-users] [Dbpedia-discussion] DBPedia 3.2 Load in Virtuoso 5.0.9 - Reporting on results, and some questions

Kingsley Idehen Sat, 22 Nov 2008 16:58:56 +0000

Marvin Lugair wrote:

Hello,


I would like to report back on my loading of dbpedia 3.2 into Open-Source 
Virtuoso 5.0.9.
The good news is that I was successful and have a local DBPedia to play with 
now. Thanks to everyone for their input and suggestions on configuration 
parameters!

Marv

----------------

Running Ubuntu 8.10 (Intrepid)
Kernel 2.6.27-7
8 GB DDR2 RAM
AMD Athlon 2.5 GHz dual core

It took around 22 hours to import the core (21 files) and make a .db database 
file out of them. The import resulted in one dbpedia.db file that is about 
20-something GB in size.
It typically takes about an hour to start that database (load the .db file 
into memory) and start the virtuoso process.
As a reference:
Time to load infobox_en.nt = 52 minutes


Some of the parameters in my dbpedia.ini

MaxCheckpointRemap              = 1000000
MaxMemPoolSize                  = 0
StopCompilerWhenXOverRunTime    = 1
DefaultIsolation                = 2
NumberOfBuffers                 = 550000
MaxDirtyBuffers                 = 320000


Files that had errors
---------------
Three files did not load because of malformed URIs (about 500 of them across 
the three files; 400-something of those lines were in the externallinks file). 
I tried to reload these files with the ttlp_mt bit mask that ignores errors, 
but it did not work.
I deleted the offending triples and reloaded. Basically, you lose those 
triples. Someone needs to fix these in the DBPedia files.


The three files with errors are:
 1> homepage_en.nt
 2> externallinks_en.nt
 3> infobox-mappingbased-loose.nt

The URIs either had spaces, backslashes or even Korean characters (in one 
case) in them. These files need cleaning up.
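
As a rough illustration of the kind of clean-up meant here, the sketch below 
flags N-Triples lines whose URIs contain spaces, backslashes or non-ASCII 
characters, so they can be set aside before loading. The class and method 
names are hypothetical, and treating every backslash as an error is an 
assumption based on the failures described above; if your files use \uXXXX 
escapes inside URIs, that rule would need loosening.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical pre-load check for N-Triples lines (not part of Virtuoso). */
final class NTriplesIriCheck {

    // Matches each <...> reference on a line and captures its contents.
    private static final Pattern IRI_REF = Pattern.compile("<([^>]*)>");

    /** Returns true if no URI on the line has a space, backslash, control or non-ASCII character. */
    static boolean hasCleanIris(String line) {
        // Only inspect the text before any quoted literal, so that angle
        // brackets inside literal text are not mistaken for URIs.
        int quote = line.indexOf('"');
        String head = quote >= 0 ? line.substring(0, quote) : line;
        Matcher m = IRI_REF.matcher(head);
        while (m.find()) {
            String iri = m.group(1);
            for (int i = 0; i < iri.length(); i++) {
                char c = iri.charAt(i);
                // Assumption: any backslash is treated as malformed.
                if (c <= 0x20 || c == '\\' || c > 0x7E) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Collects the lines that fail the check, in input order. */
    static List<String> rejected(List<String> lines) {
        List<String> bad = new ArrayList<String>();
        for (String line : lines) {
            if (!hasCleanIris(line)) {
                bad.add(line);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Made-up sample lines; real input would be read from the .nt file.
        List<String> sample = Arrays.asList(
            "<http://example.org/s> <http://example.org/p> <http://example.org/ok> .",
            "<http://example.org/s> <http://example.org/p> <http://example.org/has space> .");
        for (String line : rejected(sample)) {
            System.out.println("REJECTED: " + line);
        }
    }
}
```

Writing the rejected lines to a separate file, rather than dropping them, 
keeps a record of exactly which triples were lost.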





Some questions
---------------------------------------
* Why does short-abstracts take 4 hours to load though it is only 982 MB, 
whereas long-abstracts took 2 hours to load though its size is 1.7 GB?!
The only difference is that short was loaded a few files after long... does 
performance change as the database file (the one I am creating, dbpedia.db) 
grows larger?

* What is the best way to check for and delete duplicate triples in the 
database?

* Related to this last question, it seems the online dbpedia gateway at 
dbpedia.org/sparql does not return duplicates through the webpage 
interface. However, it does return duplicates for the SAME query when submitted 
through Jena. To reproduce this, paste the following query into the webpage:

select ?s
where {
?s
 <http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}

This will return the following results in my web browser:
http://dbpedia.org/resource/Bill_Cosby
http://dbpedia.org/resource/Dick_Gregory
http://dbpedia.org/resource/Eddie_Murphy
http://dbpedia.org/resource/Flip_Wilson
http://dbpedia.org/resource/George_Carlin
http://dbpedia.org/resource/Mort_Sahl
http://dbpedia.org/resource/Redd_Foxx
http://dbpedia.org/resource/Richard_Pryor
http://dbpedia.org/resource/Rodney_Dangerfield
http://dbpedia.org/resource/Sam_Kinison
http://dbpedia.org/resource/Steve_Martin

No duplicates. Now run the *same* query through a Jena program.

In my Java source, here is how I am connecting to what I assume is the SAME 
gateway!
 QueryExecution qexec = QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);

and here is what I get (again, this is the exact same query):

----------------------------------------------------
| s                                                |
====================================================
| <http://dbpedia.org/resource/Bill_Cosby>         |
| <http://dbpedia.org/resource/Dick_Gregory>       |
| <http://dbpedia.org/resource/Eddie_Murphy>       |
| <http://dbpedia.org/resource/Flip_Wilson>        |
| <http://dbpedia.org/resource/George_Carlin>      |
| <http://dbpedia.org/resource/Mort_Sahl>          |
| <http://dbpedia.org/resource/Redd_Foxx>          |
| <http://dbpedia.org/resource/Richard_Pryor>      |
| <http://dbpedia.org/resource/Rodney_Dangerfield> |
| <http://dbpedia.org/resource/Sam_Kinison>        |
| <http://dbpedia.org/resource/Steve_Martin>       |
| <http://dbpedia.org/resource/Bill_Cosby>         |
| <http://dbpedia.org/resource/Bill_Cosby>         |
| <http://dbpedia.org/resource/Dick_Gregory>       |
| <http://dbpedia.org/resource/Eddie_Murphy>       |
| <http://dbpedia.org/resource/Flip_Wilson>        |
| <http://dbpedia.org/resource/George_Carlin>      |
| <http://dbpedia.org/resource/Mort_Sahl>          |
| <http://dbpedia.org/resource/Redd_Foxx>          |
| <http://dbpedia.org/resource/Richard_Pryor>      |
| <http://dbpedia.org/resource/Rodney_Dangerfield> |
| <http://dbpedia.org/resource/Sam_Kinison>        |
| <http://dbpedia.org/resource/Steve_Martin>       |
| <http://dbpedia.org/resource/Eddie_Murphy>       |
----------------------------------------------------

Duplicates!
Can someone please explain this?

As an aside, when I run this from isql on my newly installed local dbpedia I 
get no duplicates (I haven't tried Jena with my local copy).


<eom>


Marvin,

You will see why when you run:

select *
where {graph ?g {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}}


As you can see, there are two graphs:
1. http://dbpedia.org
2. http://dbpedia.org/resource/<entity> (this one results from cache activity associated with client interactions with Virtuoso)


Solutions:
-- Being specific about source Graph by specifying Graph IRI
select ?s
where {graph <http://dbpedia.org> {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}}

OR

select ?s
from <http://dbpedia.org>
where {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}

-- Using DISTINCT

select distinct ?s
where {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}
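
The SPARQL protocol also lets a client name the default graph as a request 
parameter (default-graph-uri) instead of writing it into the query text. The 
sketch below only builds such a request URL using the Java standard library; 
the class and method names are hypothetical, and it does not send the request 
or depend on Jena.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds a SPARQL protocol GET URL. */
final class SparqlRequestUrl {

    /**
     * Appends the query and, if given, the default graph as URL-encoded
     * parameters. Pass null for defaultGraph to leave the graph unspecified.
     */
    static String build(String endpoint, String query, String defaultGraph) {
        try {
            StringBuilder url = new StringBuilder(endpoint);
            url.append("?query=").append(URLEncoder.encode(query, "UTF-8"));
            if (defaultGraph != null) {
                url.append("&default-graph-uri=")
                   .append(URLEncoder.encode(defaultGraph, "UTF-8"));
            }
            return url.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java runtime.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = "select ?s where { ?s "
            + "<http://dbpedia.org/property/influenced> "
            + "<http://dbpedia.org/resource/Chris_Rock> }";
        System.out.println(build("http://dbpedia.org/sparql", query, "http://dbpedia.org"));
    }
}
```

Restricting the request to one graph this way has the same intent as the 
FROM clause shown above, and it leaves the query text itself unchanged.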

--


Regards,

Kingsley Idehen       Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software     Web: http://www.openlinksw.com
