Re: Report on loading wikidata

2017-12-14 Thread Laura Morales
> The loaders work on empty databases.
Yes, my test is on a new empty dataset. The command that I use is `tdbloader2 --loc wikidata wikidata.ttl`
> If you are splitting files, and doing partial loads, things are rather different.
No, I'm using the whole file. I'd only consider splitting it if
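
For anyone reproducing this, a minimal sketch of the load plus a quick sanity check, assuming the tdbquery tool from the same Jena distribution is on the PATH (paths and file names are the ones mentioned above):

    # load the whole Turtle file into a fresh TDB1 database directory
    tdbloader2 --loc wikidata wikidata.ttl
    # rough sanity check afterwards: count the triples in the resulting database
    tdbquery --loc wikidata 'SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }'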

Re: Report on loading wikidata (errata)

2017-12-14 Thread Andy Seaborne
A connection to help out. Dick Original message From: Laura Morales <laure...@mail.com> Date: 14/12/2017 20:09 (GMT+00:00) To: jena-users-ml <users@jena.apache.org> Subject: Re: Report on loading wikidata (errata) ERRATA: I don't know why then. Maybe SSD is making

Re: Report on loading wikidata (errata)

2017-12-14 Thread dandh988
(GMT+00:00) To: jena-users-ml <users@jena.apache.org> Subject: Re: Report on loading wikidata (errata) ERRATA: > I don't know why then. Maybe SSD is making all the difference. Try to load it > (or "latest-all") on a comparable machine using a single SATA disk in

Re: Report on loading wikidata (errata)

2017-12-14 Thread Laura Morales
ERRATA: > I don't know why then. Maybe SSD is making all the difference. Try to load it > (or "latest-all") on a comparable machine using a single SATA disk instead of > SSD. s/SATA/HDD > I loaded 2.2B on a 16G machine which wasn't even server class (i.e. it's

Re: Report on loading wikidata

2017-12-13 Thread Laura Morales
> > Creating the node table index may be amenable to the same approach as > > index building, caveat details. > > Or switch to NodeId being hash based. What blocks certain parallel > processing currently is that NodeIds are allocated sequentially. > > But that has an impact when the loaded data

Re: Report on loading wikidata

2017-12-13 Thread Laura Morales
From: "Andy Seaborne" <a...@apache.org> To: users@jena.apache.org Subject: Re: Report on loading wikidata On 12/12/17 21:06, Laura Morales wrote: > 2) from my tests, tdbloader2 starts by parsing triples rather quickly (130K > TPS) but then it quickly slows down *a lot* over time, That's memory. When

Re: Report on loading wikidata

2017-12-12 Thread Laura Morales
> > And I'm not convinced it's a problem of disk cache either, because I tried > > to flush it several times, but the disk was always getting slower and > > slower as more triples were added (1MB/s writes!!!). So, didn't you > > experience the same issue with your 5400rpm disks? > > Your IO is

Re: Report on loading wikidata

2017-12-12 Thread Laura Morales
… where is a graph split among several independent stores? Sent: Tuesday, December 12, 2017 at 11:27 PM From: "Dick Murray" <dandh...@gmail.com> To: users@jena.apache.org Subject: Re: Report on loading wikidata Correct, Mosaic federates multiple datasets as one. At some point in a

Re: Report on loading wikidata

2017-12-12 Thread Andy Seaborne
On 12/12/17 22:54, Andy Seaborne wrote: Creating the node table index may be amenable to the same approach as index building, caveat details. Or switch to NodeId being hash based. What blocks certain parallel processing currently is that NodeIds are allocated sequentially. But that has

Re: Report on loading wikidata

2017-12-12 Thread Andy Seaborne
On 12/12/17 21:06, Laura Morales wrote: 2) from my tests, tdbloader2 starts by parsing triples rather quickly (130K TPS) but then it quickly slows down *a lot* over time, That's memory. When the node table index exceeds RAM, updating slows down because disk I/O happens on what used to be
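
Not from the thread, but a hedged way to watch the behaviour Andy describes while a load runs: once the node table index no longer fits in the page cache, memory pressure and disk waits show up in the standard Linux tools.

    # free memory, page cache and swap activity, sampled every 5 seconds
    vmstat 5
    # per-device utilisation and wait times; the load has become I/O-bound
    # when %util sits near 100 while the triples/second rate keeps dropping
    iostat -x 5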

Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
… but eventually I'll saturate it. Sent: Tuesday, December 12, 2017 at 9:20 PM From: "Dick Murray" <dandh...@gmail.com> To: users@jena.apache.org Subject: Re: Report on loading wikidata tdbloader2 For anyone still following this thread ;-) latest-truthy supposedly

Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
Correct, Mosaic federates multiple datasets as one. At some point in a query find [G]SPO will get called and Mosaic will concurrently call find on each child dataset and return the set of results. The dataset can be memory or TDB or Thrift (this one's another discussion). Mosaic doesn't care as

Re: Report on loading wikidata

2017-12-12 Thread ajs6f
That's not what Mosaic is doing at all. I'll leave it to Dick to explain after this, because I am not the expert here, he is, but it's federating multiple datasets so that they appear as one to SPARQL. It's got nothing to do with individual graphs within a dataset. ajs6f > On Dec 12, 2017, at

Re: Report on loading wikidata

2017-12-12 Thread Laura Morales
> He can correct me as needed, but it seems that Dick is using (and getting > great results from) > an extension to Jena ("Mosaic") that federates different datasets (in this > cases from > independent TDB instances) and runs queries over them in parallel. We've had > some discussions > (all

Re: Report on loading wikidata

2017-12-12 Thread ajs6f
> …problem of disk cache either, because I tried to flush it several times, but the disk was always getting slower and slower as more triples were added (1MB/s writes!!!). So, didn't you experience the same issue with your 5400rpm disks? Sent: Tuesday, Decembe

Re: Report on loading wikidata

2017-12-12 Thread Laura Morales
(1MB/s writes!!!). So, didn't you experience the same issue with your 5400rpm disks? Sent: Tuesday, December 12, 2017 at 9:20 PM From: "Dick Murray" <dandh...@gmail.com> To: users@jena.apache.org Subject: Re: Report on loading wikidata tdbloader2 For anyone still follo

Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
Sent: Monday, December 11, 2017 at 11:31 AM From: "Dick Murray" <dandh...@gmail.com> To: users@jena.apache.org Subject: Re: Report on loading wikidata Inline... On 10 December 2017 at 23:03, Laura Morales <laure...@mail.com> wrote: > Thank you a lot Dick! Is this

Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
Understand, I'm running sort and uniq on truthy out of interest... On 12 December 2017 at 10:31, Andy Seaborne wrote: > > > On 12/12/17 10:06, Dick Murray wrote: > ... > >> As an aside there are duplicate entries in the data-triples.tmp file, is >> this by design? if you sort

Re: Report on loading wikidata

2017-12-12 Thread Andy Seaborne
On 12/12/17 10:06, Dick Murray wrote: ... As an aside there are duplicate entries in the data-triples.tmp file, is this by design? If you run `sort data-triples.tmp | uniq` it returns a smaller file, and I've checked visually and there are duplicate entries... ... It's expected. data-triples.tmp
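
A minimal sketch of the duplicate check being discussed (the file name is as above; --parallel is a GNU sort option and purely optional):

    # raw line count of the data phase output
    wc -l data-triples.tmp
    # de-duplicated count; a smaller number confirms the duplicates
    sort --parallel=8 data-triples.tmp | uniq | wc -l

As Andy notes, this is expected; presumably the duplicates are dropped later, during the sorts that build the indexes.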

Re: Report on loading wikidata

2017-12-12 Thread Laura Morales
> I hacked (i.e. no checking/setup/params) the data/index scripts to create s, p, o folders soft-linked to three separate devices, moved in the respective .dat and .idn files, hard-linked back to the data-triples.tmp, and ran the three triple indexes in parallel. sort was parallel 8 and

Re: Report on loading wikidata

2017-12-12 Thread Dick Murray
Similar here. I hacked (i.e. no checking/setup/params) the data/index scripts to create s, p, o folders soft-linked to three separate devices, moved in the respective .dat and .idn files, hard-linked back to the data-triples.tmp, and ran the three triple indexes in parallel. sort was parallel 8
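
A rough, untested sketch of that hack, with hypothetical mount points standing in for the three devices; the real change was made inside the tdbloader2 data/index scripts, not from the command line:

    # hypothetical mount points for the three fast devices
    mkdir -p /mnt/dev0/s /mnt/dev1/p /mnt/dev2/o
    # soft-link one work folder per triple index out of the working directory
    ln -s /mnt/dev0/s work/s
    ln -s /mnt/dev1/p work/p
    ln -s /mnt/dev2/o work/o
    # hard links only work within one filesystem; fall back to ln -s across devices
    ln data-triples.tmp work/s/data-triples.tmp
    ln data-triples.tmp work/p/data-triples.tmp
    ln data-triples.tmp work/o/data-triples.tmp
    # then run the three index builds concurrently, each using a parallel sort
    # (e.g. sort --parallel=8), and wait for all of them to finish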

Re: Report on loading wikidata

2017-12-11 Thread ajs6f
> From: "Dick Murray" <dandh...@gmail.com> > To: users@jena.apache.org > Subject: Re: Report on loading wikidata > Inline... > > On 10 December 2017 at 23:03, Laura Morales <laure...@mail.com> wrote: > >> Thank you a lot Dick! Is this test for tdbloader

Re: Report on loading wikidata

2017-12-11 Thread Laura Morales
@jena.apache.org Subject: Re: Report on loading wikidata Inline... On 10 December 2017 at 23:03, Laura Morales <laure...@mail.com> wrote: > Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or > tdb2.tdbloader? > > > 32GB DDR4 quad channel > > 2133 or hi

Re: Report on loading wikidata

2017-12-11 Thread Andy Seaborne
>> "mid-range setup" (iCore/xeon + DDR3) The distinction here is app server class machines and database server classes machines. app servers typically have less RAM, less I/O bandwidth, less disk optimization, and also may have to share hardware. Any virtualization matters - some

Re: Report on loading wikidata

2017-12-11 Thread Andy Seaborne
This is for the large amount of temporary space that tdbloader2 uses? I got "latest-all" to load but I had to do some things with tdbloader2 to work with a compressed data-triples.tmp.gz and also have sort write compressed temporary files (I messed up a bit and set the gzip compression too
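
A hedged sketch of that kind of setup with GNU sort, not Andy's actual script changes (file names and the scratch directory are illustrative):

    # stream the gzipped triples file straight into sort, and have sort
    # gzip its own temporary run files instead of writing them uncompressed
    gzip -dc data-triples.tmp.gz | \
        sort --compress-program=gzip --buffer-size=50% --parallel=4 \
             -T /path/to/scratch | gzip > sorted.tmp.gz

The temporary files use gzip's default level; changing it, as alluded to above, needs a small wrapper script, since --compress-program takes no arguments.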

Re: Report on loading wikidata

2017-12-11 Thread Dick Murray
Inline... On 10 December 2017 at 23:03, Laura Morales wrote: > Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or > tdb2.tdbloader? > > > 32GB DDR4 quad channel > > 2133 or higher? > 2133 > > 3 x M.2 Samsung 960 EVO > > Are these PCI-e disks? Or SATA? Also,

Re: Report on loading wikidata

2017-12-10 Thread Laura Morales
Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or tdb2.tdbloader? > 32GB DDR4 quad channel 2133 or higher? > 3 x M.2 Samsung 960 EVO Are these PCI-e disks? Or SATA? Also, what size and configuration? > Is it possible to split the index files into separate folders? > Or sym link

Re: Report on loading wikidata

2017-12-10 Thread Dick Murray
Ryzen 1920X 3.5GHz, 32GB DDR4 quad channel, 3 x M.2 Samsung 960 EVO, 172K/sec 3h45m for truthy. Is it possible to split the index files into separate folders? Or sym link the files, if I run the data phase, sym link, then run the index phase? Point me in the right direction and I'll extend the

Re: Report on loading wikidata

2017-12-07 Thread Andy Seaborne
On 07/12/17 19:01, Laura Morales wrote: Thank you a lot Andy, very informative (special thanks for specifying the hardware). For anybody reading this, I'd like to highlight the fact that the data source is "latest-truthy" and not "latest-all". From what I understand, truthy leaves out a lot

Re: Report on loading wikidata

2017-12-07 Thread Andy Seaborne
On 07/12/17 18:34, Marco Neumann wrote: did you try to point the wdqs copy to your tdb/fuseki endpoint? "SERVICE wikibase:label" isn't in the data. Andy On Thu, 7 Dec 2017 at 18:58, Andy Seaborne wrote: Dell XPS 13 (model 9350) - the 2015 model. Ubuntu 17.10, not

Re: Report on loading wikidata

2017-12-07 Thread Laura Morales
> == TDB2 > > TDB2 is experimental. The current TDB2 loader is a functional placeholder. > > It is writing all three indexes at the same time. While for SPO this is > not a bad access pattern (subjects are naturally grouped), for POS and > OSP, the I/O is a random pattern, not a stream pattern.

Re: Report on loading wikidata

2017-12-07 Thread Laura Morales
…server-grade hardware. Sent: Thursday, December 07, 2017 at 6:20 PM From: "Andy Seaborne" <a...@apache.org> To: "users@jena.apache.org" <users@jena.apache.org> Subject: Report on loading wikidata Dell XPS 13 (model 9350) - the 2015 model. Ubuntu 17.10, not a VM. 1T SS

Re: Report on loading wikidata

2017-12-07 Thread Marco Neumann
did you try to point the wdqs copy to your tdb/fuseki endpoint? On Thu, 7 Dec 2017 at 18:58, Andy Seaborne wrote: > Dell XPS 13 (model 9350) - the 2015 model. > Ubuntu 17.10, not a VM. > 1T SSD. > 16G RAM. > Two volumes = root and user. > Swappiness = 10 > > java version

Report on loading wikidata

2017-12-07 Thread Andy Seaborne
Dell XPS 13 (model 9350) - the 2015 model.
Ubuntu 17.10, not a VM.
1T SSD.
16G RAM.
Two volumes = root and user.
Swappiness = 10
java version "1.8.0_151" (OpenJDK)
Data: latest-truthy.nt.gz (version of 2017-11-24)
== TDB1, tdbloader2
8 hours // 76,164 TPS
Using SORT_ARGS:
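
The SORT_ARGS actually used are cut off above. Purely as an illustration of the kind of GNU sort options involved, and assuming the tdbloader2 script picks up a SORT_ARGS variable (otherwise the same options can be edited into its index step directly):

    # hypothetical values, not the settings from the report
    export SORT_ARGS="--parallel=4 --buffer-size=50% -T /mnt/scratch"
    tdbloader2 --loc wikidata latest-truthy.nt.gz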