Hi Jörn,

Are you running the stable/7 or develop/7 branch of the Virtuoso open source git tree [1]? I 
would recommend performing the load with the develop/7 branch if it is not 
already being used.

From analysis development performed in-house earlier this year, it was 
found that the latest Freebase datasets need about 13,000,000 buffers, i.e. about 105GB 
RAM, to load entirely in memory without falling back to disk, which significantly reduces the 
load rate.  This is because the dataset contains many large literal values, and 
thus does not compress very well, as well as a lot of duplicate data; so even 
though it is nominally only about 1.9 billion triples, the actual row count is 
considerably higher, as you have observed.
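As a rough guide, the buffer figure above translates into virtuoso.ini settings along these lines (each buffer is roughly 8KB; the MaxDirtyBuffers value of about 3/4 of NumberOfBuffers follows the convention from the shipped ini file, not a figure from the analysis itself):

```ini
[Parameters]
;; ~105GB of ~8KB buffers, per the in-house Freebase sizing above
NumberOfBuffers  = 13000000
;; conventionally about 3/4 of NumberOfBuffers
MaxDirtyBuffers  = 10000000
```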

What I would not expect, though, is for memory consumption to continue to 
increase until the server is killed with an out-of-memory (OOM) error, which would imply a 
possible memory leak. This is another reason I recommend building from the develop/7 
branch, where there have been improvements in memory management.

To speed up the load you should consider using faster disks (ideally SSDs) as a 
trade-off for insufficient memory when loading the dataset, and also database 
striping [2] for improved parallel I/O access to the database files, if 
possible. Another option would be to load the dataset in 4 parts, which should 
leave enough free buffers for each part to load largely in memory.
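For reference, striping is enabled in virtuoso.ini along these lines — the segment sizes and file paths here are purely illustrative, not a recommendation; see [2] for the full details. Spreading each segment's files across separate physical disks is what gives the parallel I/O benefit:

```ini
[Database]
Striping = 1

[Striping]
;; each segment is split across files on separate disks
;; (sizes and paths below are examples only)
Segment1 = 60G, /disk1/db-seg1-1.db, /disk2/db-seg1-2.db
Segment2 = 60G, /disk1/db-seg2-1.db, /disk2/db-seg2-2.db
```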

On our LOD Cache Cloud server [3], which is a 4-node cluster with 768GB RAM and 
60+ billion triples loaded, the Freebase datasets loaded in about 1.7 hrs:

SQL> select min(ll_started) as start, max(ll_done) as finish, 
datediff('second', min(ll_started), max(ll_done)) as delta from load_list where 
ll_graph like 
'http://commondatastorage.googleapis.com/freebase-public/rdf/freebase-rdf-2013-11-17-00-00.gz';
start                finish               delta
TIMESTAMP            TIMESTAMP            INTEGER
_______________________________________________________________________________

2013.12.2 22:34.9 0  2013.12.3 0:16.24 0  6135

1 Rows. -- 74 msec.
SQL>

On the single-server database we tested on with 105GB RAM, it loaded in about 
2 hrs.
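If you want to watch whether a load is staying in memory, the buffer pool state can be checked from a second isql session while rdf_loader_run() is running — a minimal sketch (the exact layout of the status() report varies between builds):

```sql
-- From a second isql session during the bulk load:
status();
-- The report includes total, used and dirty buffer counts;
-- once used buffers reach NumberOfBuffers, the load starts going to disk
-- and the I/O-wait behaviour you describe is to be expected.
```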

Best Regards
Hugh Williams
Professional Services
OpenLink Software, Inc.      //              http://www.openlinksw.com/
Weblog   -- http://www.openlinksw.com/blogs/
LinkedIn -- http://www.linkedin.com/company/openlink-software/
Twitter  -- http://twitter.com/OpenLink
Google+  -- http://plus.google.com/100570109519069333827/
Facebook -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers

[1] http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSGitUsage
[2] http://docs.openlinksw.com/virtuoso/dbadm.html#ini_Striping
[3] http://lod.openlinksw.com

On 22 Aug 2014, at 14:41, Jörn Hees <j_h...@cs.uni-kl.de> wrote:

> Hi,
> 
> TLDR: When importing the Freebase RDF dump Virtuoso seems to consume way more 
> RAM than configured.
> 
> i'm trying to load the Freebase RDF dump ( 
> https://developers.google.com/freebase/data ) into a clean Virtuoso 
> OpenSource 7.1.0 instance running on a VM with 4 cores and 32 GB of RAM, 300+ 
> GB HD free.
> The dump file contains 2,656,580,382 rows (even though the page claims 1.9 
> billion triples, maybe outdated or dups).
> Before attempting to load the whole Freebase dump, i loaded the basekb.com 
> dump which contained 1,205,456,739 triples into the store which was already 
> filled with DBpedia without any noticeable problem.
> 
> The Freebase dump rdf_loader_run() import starts with rapid IO rates (several 
> 100MB/s read and write bursts) and quickly consumes ~ 25 GB of RAM as 
> configured. It then continues to slowly consume more and more RAM ~ 1 
> MB/minute. As this goes on, the IO rates slowly drop down to some KB/s read 
> and no / very very rare writes. htop at this point shows that the process 
> spends nearly all its time on IO wait. After a couple of days Virtuoso is 
> finally killed by the kernel when it consumed all RAM of the machine and 
> wants even more.
> 
> I already tried adding 16 GB swap. This didn't help but made the machine 
> completely unresponsive after 4 days (sshd seems to have been swapped out and 
> never came back over a couple of hour long retries to ssh into the VM).
> 
> Ubuntu 12.04 LTS or 14.04.1 LTS doesn't seem to make a difference.
> 
> A colleague is reporting that the import works fine on a 256 GB RAM, 8 core 
> machine with settings for 64 GB... takes about 1 day to import, the final DB 
> is ~ 130 GB. Mine never gets to > 100 GB before Virtuoso is killed.
> 
> 
> The instance is set up following my tutorial 
> http://joernhees.de/blog/2014/04/23/setting-up-a-local-dbpedia-3-9-mirror-with-virtuoso-7/
>  just substitute the DBpedia Datasets with the Freebase triple dump and 
> Wikidata links.
> 
> The virtuoso.ini values are set as suggested for 32 GB of RAM, there's 
> nothing else running on the VM:
> [Database]
> MaxCheckpointRemap              = 2000  // also tried with 62500, so ~1/4th 
> of NumberOfBuffers as in the blogpost
> [Parameters]
> ;; Uncomment next two lines if there is 32 GB system memory free
> NumberOfBuffers          = 2720000
> MaxDirtyBuffers          = 2000000
> 
> 
> As I already tried a lot of things but can't get this to work, i'd be 
> thankful for feedback or someone looking into why virtuoso is consuming all 
> of the RAM.
> 
> Cheers,
> Jörn
> 
> 
> ------------------------------------------------------------------------------
> Slashdot TV.  
> Video for Nerds.  Stuff that matters.
> http://tv.slashdot.org/
> _______________________________________________
> Virtuoso-users mailing list
> Virtuoso-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/virtuoso-users
