Yes, you're right. /etc/os-release reports "Ubuntu 20.04.2 LTS"
> -Original Message-
> From: Andrii Berezovskyi
> Sent: Friday, 18 February 2022 10:49
> To: users@jena.apache.org
> Subject: Re: Loading Wikidata
>
> I see, thanks. Are you sure
>> To: users@jena.apache.org
>> Subject: Re: Loading Wikidata
>>
>> May I ask an unrelated question: how do you get Ubuntu version in such a
>> format? 'cat /etc/os-release' (or lsb_release, hostnamectl, neofetch) only
>> gives me the '20.04.3' format
I used cat /proc/version
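As an aside, the two commands report different things: /etc/os-release carries the distribution's release fields, while /proc/version is the kernel's build string (its exact wording varies by distro and kernel build). A quick comparison on any Linux box:

```shell
# Distribution release field (the "20.04.3 LTS (...)" form)
grep '^VERSION=' /etc/os-release || true

# Kernel build string -- names the kernel version, build host, and compiler
cat /proc/version

# Kernel release alone
uname -r
```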
> -Original Message-
> From: Andrii Berezovskyi
> Sent: Friday, 18 February 2022 10:35
> To: users@jena.apache.org
> Subject: Re: Loading Wikidata
>
> May I ask an unrelated question: how do you get Ubuntu version in such a
> format?
3.0.
>
> CPU is 4 x Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz / 18 core (144 cores
> in total)
>
> Cheers, Joachim
>
> > -Original Message-
> > From: Marco Neumann
> > Sent: Friday, 18 February 2022 10:00
> > To: users@jena.apache.org
> To: users@jena.apache.org
> Subject: Re: Loading Wikidata
>
> Thank you for the effort Joachim, what CPU and OS was used for the load
> test?
>
> Best,
> Marco
>
> On Fri, Feb 18, 2022 at 8:51 AM Neubert, Joachim
> wrote:
>
Storage of the machine is one 10TB raid6 SSD.
Cheers, Joachim
> -Original Message-
> From: Andy Seaborne
> Sent: Wednesday, 16 February 2022 20:05
> To: users@jena.apache.org
> Subject: Re: Loading Wikidata
>
>
>
> On 16/02/2022 11:56, Neubert, Joachim wrote:
I've loaded the Wikidata "truthy" dataset with 6b triples. Summary stats is:
10:09:29 INFO Load node table = 3 seconds
10:09:29 INFO Load ingest data = 25165 seconds
10:09:29 INFO Build index SPO = 11241 seconds
10:09:29 INFO Build index POS = 14100 seconds
10:09:29 INFO Build index
> The loaders work on empty databases.
Yes my test is on a new empty dataset. The command that I use is `tdbloader2
--loc wikidata wikidata.ttl`
> If you are splitting files, and doing partial loads, things are rather
> different.
No I'm using the whole file. I'd only consider splitting it if
A connection to help out.
Dick
Original message
From: Laura Morales <laure...@mail.com>
Date: 14/12/2017 20:09 (GMT+00:00)
To: jena-users-ml <users@jena.apache.org>
Subject: Re: Report on loading wikidata (errata)
ERRATA:
> I don't know why then. Maybe SSD is making all the difference. Try to load it
> (or "latest-all") on a comparable machine using a single SATA disk instead of
> SSD.
s/SATA/HDD
> I loaded 2.2B on a 16G machine which wasn't even server class (i.e. it's
> > Creating the node table index may be amenable to the same approach as
> > index building, caveat details.
>
> Or switch to NodeId being hash based. What blocks certain parallel
> processing currently is that NodeIds are allocated sequentially.
>
> But that has an impact when the loaded data
From: "Andy Seaborne" <a...@apache.org>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata
On 12/12/17 21:06, Laura Morales wrote:
> 2) from my tests, tdbloader2 starts by parsing triples rather quickly (130K
> TPS) but then it quickly slows down *a lot* over time,
That's memory.
When
> > And I'm not convinced it's a problem of disk cache either, because I tried
> > to flush it several times, but the disk was always getting slower and
> > slower as more triples were added (1MB/s writes!!!). So, didn't you
> > experience the same issue with your 5400rpm disks?
>
> Your IO is
where is a graph split among several independent stores?
Sent: Tuesday, December 12, 2017 at 11:27 PM
From: "Dick Murray" <dandh...@gmail.com>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata
Correct, Mosaic federates multiple datasets as one. At some point in a
On 12/12/17 22:54, Andy Seaborne wrote:
Creating the node table index may be amenable to the same approach as
index building, caveat details.
Or switch to NodeId being hash based. What blocks certain parallel
processing currently is that NodeIds are allocated sequentially.
But that has
On 12/12/17 21:06, Laura Morales wrote:
2) from my tests, tdbloader2 starts by parsing triples rather quickly (130K
TPS) but then it quickly slows down *a lot* over time,
That's memory.
When the node table index exceeds RAM, updating slows down because disk
I/O happens on what used to be
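The crossover point Andy describes can be watched from outside the loader. A rough check of how much RAM the page cache has to play with, using the standard /proc interface (the node-table-vs-RAM comparison is my reading of the explanation above, not something the loader reports):

```shell
# Total RAM, what is still claimable, and what the page cache currently holds.
# Once the node table index outgrows MemAvailable, node lookups start
# turning into random disk reads and load throughput drops.
grep -E '^(MemTotal|MemAvailable|Cached):' /proc/meminfo
```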
n but eventually I'll saturate it.
Sent: Tuesday, December 12, 2017 at 9:20 PM
From: "Dick Murray" <dandh...@gmail.com>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata
tdbloader2
For anyone still following this thread ;-)
latest-truthy supposedly
Correct, Mosaic federates multiple datasets as one. At some point in a
query find [G]SPO will get called and Mosaic will concurrently call find on
each child dataset and return the set of results. The dataset can be memory
or TDB or Thrift (this one's another discussion) Mosaic doesn't care as
That's not what Mosaic is doing at all. I'll leave it to Dick to explain after
this, because I am not the expert here, he is, but it's federating multiple
datasets so that they appear as one to SPARQL. It's got nothing to do with
individual graphs within a dataset.
ajs6f
> On Dec 12, 2017, at
> He can correct me as needed, but it seems that Dick is using (and getting
> great results from)
> an extension to Jena ("Mosaic") that federates different datasets (in this
> cases from
> independent TDB instances) and runs queries over them in parallel. We've had
> some discussions
> (all
Sent: Monday, December 11, 2017 at 11:31 AM
From: "Dick Murray" <dandh...@gmail.com>
To: users@jena.apache.org
Subject: Re: Report on loading wikidata
Inline...
On 10 December 2017 at 23:03, Laura Morales <laure...@mail.com> wrote:
> Thank you a lot Dick! Is this
Understand, I'm running sort and uniq on truthy out of interest...
On 12 December 2017 at 10:31, Andy Seaborne wrote:
>
>
> On 12/12/17 10:06, Dick Murray wrote:
> ...
>
>> As an aside there are duplicate entries in the data-triples.tmp file, is
>> this by design? if you sort
On 12/12/17 10:06, Dick Murray wrote:
...
As an aside there are duplicate entries in the data-triples.tmp file, is
this by design? if you sort data-triples.tmp | uniq > it returns a smaller
file and I've checked visually and there are duplicate entries...
...
It's expected.
data-triples.tmp
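The duplicate check Dick describes is plain sort/uniq plumbing. A dry run on a tiny stand-in file (the real data-triples.tmp is far too large to paste):

```shell
# Three-line stand-in for data-triples.tmp, with one deliberate duplicate
printf '%s\n' '<s> <p> "a" .' '<s> <p> "a" .' '<s> <p> "b" .' > sample-triples.tmp

wc -l < sample-triples.tmp               # 3 lines in
sort sample-triples.tmp | uniq | wc -l   # 2 distinct lines out

# Show only the duplicated entries
sort sample-triples.tmp | uniq -d
```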
Similar here.
I hacked (i.e. no checking/setup/params) the data/index scripts to create
s, p, o folders soft-linked to three separate devices, moved in the
respective .dat and .idn files, hard-linked back to data-triples.tmp,
and ran the three triple indexes in parallel. sort was parallel 8
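The core of that hack is that the three indexes differ only in sort key order, so the sorts can run concurrently. A toy dry-run of the idea (the real temp rows are tdbloader2's encoded node ids, and the real output files would sit on separate devices; the filenames here are made up):

```shell
# Toy stand-in for the shared triples temp file
TMP=data-triples.tmp
printf '%s\n' 'c b a .' 'a b c .' 'b c a .' > "$TMP"

# Three sorts over the same input, different key orders, run in parallel
sort --parallel=8 -k1,1 -k2,2 -k3,3 "$TMP" > spo.srt &   # SPO
sort --parallel=8 -k2,2 -k3,3 -k1,1 "$TMP" > pos.srt &   # POS
sort --parallel=8 -k3,3 -k1,1 -k2,2 "$TMP" > osp.srt &   # OSP
wait

wc -l spo.srt pos.srt osp.srt
```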
>> "mid-range setup" (iCore/xeon + DDR3)
The distinction here is between app-server-class machines and
database-server-class machines.
app servers typically have less RAM, less I/O bandwidth, less disk
optimization, and also may have to share hardware. Any virtualization
matters - some
This is for the large amount of temporary space that tdbloader2 uses?
I got "latest-all" to load but I had to do some things with tdbloader2
to work with a compressed data-triples.tmp.gz and also have sort write
compressed temporary files (I messed up a bit and set the gzip
compression too
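Both halves of that can be done with standard tools: pipe the gzipped temp file through `gzip -dc` instead of decompressing it to disk, and use GNU sort's `--compress-program` so its own spill files are gzipped too. A miniature sketch (filenames made up):

```shell
# Toy gzipped stand-in for data-triples.tmp.gz
printf '%s\n' 'b' 'a' 'c' | gzip > data-triples.tmp.gz

# Decompress on the fly, let sort gzip its temporary spill files,
# and write the sorted result back out compressed
gzip -dc data-triples.tmp.gz \
  | sort --compress-program=gzip -S 64M \
  | gzip > data-triples.sorted.gz

gzip -dc data-triples.sorted.gz
```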
Inline...
On 10 December 2017 at 23:03, Laura Morales wrote:
> Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or
> tdb2.tdbloader?
>
> > 32GB DDR4 quad channel
>
> 2133 or higher?
>
2133
> > 3 x M.2 Samsung 960 EVO
>
> Are these PCI-e disks? Or SATA? Also,
Thank you a lot Dick! Is this test for tdbloader, tdbloader2, or tdb2.tdbloader?
> 32GB DDR4 quad channel
2133 or higher?
> 3 x M.2 Samsung 960 EVO
Are these PCI-e disks? Or SATA? Also, what size and configuration?
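On Linux, that question answers itself from lsblk: the TRAN column shows whether each disk hangs off SATA, NVMe (PCIe), or USB, alongside its size:

```shell
# Name, transport (sata / nvme / usb ...), and size for each physical disk
lsblk -d -o NAME,TRAN,SIZE
```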
> Is it possible to split the index files into separate folders?
> Or sym link
Ryzen 1920X 3.5GHz, 32GB DDR4 quad channel, 3 x M.2 Samsung 960 EVO,
172K/sec 3h45m for truthy.
Is it possible to split the index files into separate folders?
Or sym link the files, if I run the data phase, sym link, then run the
index phase?
Point me in the right direction and I'll extend the
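The symlink variant needs no loader changes at all, only filesystem plumbing: run the data phase, move an index file to the other device, and leave a symlink at the expected path. A dry run with ordinary directories standing in for the hypothetical mount points:

```shell
# Stand-ins for the database dir and a second device's mount point
mkdir -p db otherdev
printf 'dummy' > otherdev/SPO.dat   # pretend this index file lives on the other disk

# Leave a symlink at the filename the loader expects
ln -s ../otherdev/SPO.dat db/SPO.dat

ls -l db/SPO.dat
cat db/SPO.dat                      # reads through the symlink
```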
On 07/12/17 19:01, Laura Morales wrote:
Thank you a lot Andy, very informative (special thanks for specifying the
hardware).
For anybody reading this, I'd like to highlight the fact that the data source is
"latest-truthy" and not "latest-all".
From what I understand, truthy leaves out a lot
On 07/12/17 18:34, Marco Neumann wrote:
did you try to point the wdqs copy to your tdb/fuseki endpoint?
"SERVICE wikibase:label" isn't in the data.
Andy
On Thu, 7 Dec 2017 at 18:58, Andy Seaborne wrote:
Dell XPS 13 (model 9350) - the 2015 model.
Ubuntu 17.10, not
> == TDB2
>
> TDB2 is experimental. The current TDB2 loader is a functional placeholder.
>
> It is writing all three indexes at the same time. While for SPO this is
> not a bad access pattern (subjects are naturally grouped), for POS and
> OSP, the I/O is a random pattern, not a stream pattern.
server-grade hardware.
Sent: Thursday, December 07, 2017 at 6:20 PM
From: "Andy Seaborne" <a...@apache.org>
To: "users@jena.apache.org" <users@jena.apache.org>
Subject: Report on loading wikidata
Dell XPS 13 (model 9350) - the 2015 model.
Ubuntu 17.10, not a VM.
1T SSD.
16G RAM.
Two volumes = root and user.
Swappiness = 10
java version "1.8.0_151" (OpenJDK)
Data: latest-truthy.nt.gz (version of 2017-11-24)
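The "Swappiness = 10" line above is a kernel knob, read and set through sysctl (setting it requires root):

```shell
# Current value (distro default is usually 60; lower values tell the kernel
# to prefer keeping the page cache over swapping process memory out)
cat /proc/sys/vm/swappiness

# To change it on a running system (root required):
#   sysctl vm.swappiness=10
# To persist it, drop "vm.swappiness = 10" into /etc/sysctl.d/
```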
== TDB1, tdbloader2
8 hours // 76,164 TPS
Using SORT_ARGS:
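SORT_ARGS is passed through to the GNU sort that tdbloader2's index phase shells out to, so the values are ordinary sort flags. The ones usually worth tuning for big loads, demonstrated on a toy file:

```shell
printf '%s\n' 'b' 'c' 'a' > toy.txt

# --parallel: sort threads; -S: in-memory buffer before spilling;
# -T: where spill files go (point it at a fast disk with space to spare)
sort --parallel=4 -S 64M -T . toy.txt
```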