Marvin Lugair wrote:
Hello,
I would like to report back on my loading of dbpedia 3.2 into Open-Source
Virtuoso 5.0.9.
The good news is that I was successful and have a local DBPedia to play with
now. Thanks to everyone for their input and suggestions on configuration
parameters!
Marv
----------------
Running Ubuntu 8.10 (Intrepid)
Kernel 2.6.27-7
8GB DDR2 RAM
AMD Athlon 2.5 GHz dual core
It took around 22 hours to import the core (21 files) and build a .db database
file from them. The import resulted in a single dbpedia.db file that is about
20-something GB in size.
It typically takes about an hour to start that database (load the .db file
into memory) and start the virtuoso process.
As a reference:
Time to load infobox_en.nt = 52 minutes
Some of the parameters in my dbpedia.ini
MaxCheckpointRemap = 1000000
MaxMemPoolSize = 0
StopCompilerWhenXOverRunTime = 1
DefaultIsolation = 2
NumberOfBuffers = 550000
MaxDirtyBuffers = 320000
Files that had errors
---------------
Three files did not load because of malformed URIs (about 500 of them across
the three files, 400-something lines were in the externallinks file). I tried
to reload these files with the ttlp_mt bit mask that ignores errors but it did
not work.
I deleted the corresponding triples and reloaded. Basically, you lose those
triples. Someone needs to fix these in the DBpedia files.
The three files with errors are:
1> homepage_en.nt
2> externallinks_en.nt
3> infobox-mappingbased-loose.nt
The URIs had spaces, backslashes, or (in one case) even Korean characters in them. These files need cleaning up.
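For anyone who wants to pre-filter the dumps before loading, here is a minimal sketch of how the offending lines could be separated out. It is not what I actually ran; the function names and the example.org sample data are made up for illustration, and the check is a crude heuristic based only on the three symptoms described above (spaces, backslashes, non-ASCII characters inside <...>).

```python
import re

# Terms written between angle brackets are treated as IRIs.
# Assumption: a literal that itself contains "<...>" text would also be
# inspected, so this can produce false positives on abstract-style files.
IRI_PATTERN = re.compile(r"<([^>]*)>")


def has_suspect_iri(line):
    """Return True if any <...> term has whitespace, a backslash, or a non-ASCII character."""
    for iri in IRI_PATTERN.findall(line):
        if any(ch.isspace() or ch == "\\" or ord(ch) > 127 for ch in iri):
            return True
    return False


def split_triples(lines):
    """Split N-Triples lines into (loadable, suspect) lists, preserving order."""
    good, bad = [], []
    for line in lines:
        (bad if has_suspect_iri(line) else good).append(line)
    return good, bad


# Hypothetical sample lines, not taken from the real dump files.
sample = [
    "<http://example.org/a> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/home> .",
    "<http://example.org/b> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/my home> .",
    "<http://example.org/c> <http://xmlns.com/foaf/0.1/homepage> <http://example.org/x\\y> .",
]
good, bad = split_triples(sample)
print(len(good), "loadable,", len(bad), "suspect")
```

Writing the suspect lines to a separate file rather than discarding them would keep a record of exactly which triples were lost.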
Some questions
---------------------------------------
* Why does short-abstracts take 4 hours to load although it is only 982 MB,
whereas long-abstracts took 2 hours to load although it is 1.7 GB?!
The only difference is that short was loaded a few files after long... does
performance change as the database file (the one I am creating, dbpedia.db)
grows larger?
* What is the best way to check for and delete duplicate triples in the
database?
* Related to this last question, it seems the online DBpedia SPARQL gateway at
dbpedia.org/sparql does not return duplicates through the web page
interface. However, it does return duplicates for the SAME query when submitted
through Jena. To reproduce this, paste the following query into the web page:
select ?s
where {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}
This will return the following results in my web browser:
http://dbpedia.org/resource/Bill_Cosby
http://dbpedia.org/resource/Dick_Gregory
http://dbpedia.org/resource/Eddie_Murphy
http://dbpedia.org/resource/Flip_Wilson
http://dbpedia.org/resource/George_Carlin
http://dbpedia.org/resource/Mort_Sahl
http://dbpedia.org/resource/Redd_Foxx
http://dbpedia.org/resource/Richard_Pryor
http://dbpedia.org/resource/Rodney_Dangerfield
http://dbpedia.org/resource/Sam_Kinison
http://dbpedia.org/resource/Steve_Martin
No duplicates.
Now run the *same* query through a Jena program.
In my Java source, here is how I am connecting to what I assume is the SAME
gateway!
QueryExecution qexec =
QueryExecutionFactory.sparqlService("http://DBpedia.org/sparql", q);
And here is what I get (again, this is the exact same query):
----------------------------------------------------
| s |
====================================================
| <http://dbpedia.org/resource/Bill_Cosby> |
| <http://dbpedia.org/resource/Dick_Gregory> |
| <http://dbpedia.org/resource/Eddie_Murphy> |
| <http://dbpedia.org/resource/Flip_Wilson> |
| <http://dbpedia.org/resource/George_Carlin> |
| <http://dbpedia.org/resource/Mort_Sahl> |
| <http://dbpedia.org/resource/Redd_Foxx> |
| <http://dbpedia.org/resource/Richard_Pryor> |
| <http://dbpedia.org/resource/Rodney_Dangerfield> |
| <http://dbpedia.org/resource/Sam_Kinison> |
| <http://dbpedia.org/resource/Steve_Martin> |
| <http://dbpedia.org/resource/Bill_Cosby> |
| <http://dbpedia.org/resource/Bill_Cosby> |
| <http://dbpedia.org/resource/Dick_Gregory> |
| <http://dbpedia.org/resource/Eddie_Murphy> |
| <http://dbpedia.org/resource/Flip_Wilson> |
| <http://dbpedia.org/resource/George_Carlin> |
| <http://dbpedia.org/resource/Mort_Sahl> |
| <http://dbpedia.org/resource/Redd_Foxx> |
| <http://dbpedia.org/resource/Richard_Pryor> |
| <http://dbpedia.org/resource/Rodney_Dangerfield> |
| <http://dbpedia.org/resource/Sam_Kinison> |
| <http://dbpedia.org/resource/Steve_Martin> |
| <http://dbpedia.org/resource/Eddie_Murphy> |
----------------------------------------------------
Duplicates!
Can someone please explain this?
As an aside, when I run this from isql on my newly installed local DBpedia I
get no duplicates (I haven't tried Jena against my local copy).
<eom>
_______________________________________________
Dbpedia-discussion mailing list
dbpedia-discuss...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
Marvin,
You will see why when you run:
select *
where {graph ?g {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}}
As you can see, there are two graphs:
1. http://dbpedia.org
2. http://dbpedia.org/resource/<entity> (this one results from cache
activity associated with client interactions with Virtuoso)
Solutions:
-- Being specific about source Graph by specifying Graph IRI
select ?s
where {graph <http://dbpedia.org> {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}}
OR
select ?s
from <http://dbpedia.org>
where {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}
-- Using DISTINCT
select distinct ?s
where {
?s
<http://dbpedia.org/property/influenced>
<http://dbpedia.org/resource/Chris_Rock>
}
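To make the mechanism concrete, here is a toy sketch in Python, not a description of how Virtuoso works internally. It models the store as (graph, subject, predicate, object) tuples; the data and the helper names are made up for illustration, with one triple repeated in a second, cache-style graph.

```python
INFLUENCED = "http://dbpedia.org/property/influenced"
CHRIS_ROCK = "http://dbpedia.org/resource/Chris_Rock"
COSBY = "http://dbpedia.org/resource/Bill_Cosby"
MURPHY = "http://dbpedia.org/resource/Eddie_Murphy"

# Illustrative quads: the same Bill_Cosby triple is present in two graphs.
quads = [
    ("http://dbpedia.org", COSBY, INFLUENCED, CHRIS_ROCK),
    ("http://dbpedia.org", MURPHY, INFLUENCED, CHRIS_ROCK),
    (COSBY, COSBY, INFLUENCED, CHRIS_ROCK),  # copy living in a cache-style graph
]


def match_subjects(quads, predicate, obj, graph=None):
    """Return subjects matching the pattern; graph=None means 'search every graph'."""
    return [
        s
        for g, s, p, o in quads
        if p == predicate and o == obj and (graph is None or g == graph)
    ]


def distinct(items):
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(items))


all_graphs = match_subjects(quads, INFLUENCED, CHRIS_ROCK)
one_graph = match_subjects(quads, INFLUENCED, CHRIS_ROCK, graph="http://dbpedia.org")

print(len(all_graphs))            # one row per graph in which the triple occurs
print(len(one_graph))             # rows from the named graph only
print(len(distinct(all_graphs)))  # repeated rows collapsed
```

Restricting the graph and collapsing repeats give the same subjects here, but they are not equivalent in general: the first ignores data held only in other graphs, while the second keeps it and merely hides the repetition.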
--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO
OpenLink Software Web: http://www.openlinksw.com