Re: About fuseki2 load performance by java API

2019-07-21 Thread Andy Seaborne




On 19/07/2019 08:09, Laura Morales wrote:
>> tdb2.tdbloader --loader=parallel
>>
>> but it still becomes random IO (moves disk heads)
>>
>> I haven't tried it extensively on an HDD - I'd be interested in hearing
>> what happens.
>
> oh nice! I completely missed it. I've tried it with a 67GB .nt file from
> LinkedGeoData on the same 750GB HDD but the end result does not seem very
> different. It's difficult to compare exactly with when I tried to load
> wikidata, because I don't see any progress being reported here. I mean I don't
> see any "X triples loaded (Y per second)" kind of message. Anyway it starts at
> full speed, boiling CPU, HDD cooking up my wrist from beneath the plastic case,
> and fans almost generating enough lift to take off. Then it gradually slows
> down. I stopped it after 1 hour. At this point I was seeing less than 10% CPU
> usage, 90% iowait, TDB2 files size ~15GB.
>
>> The proper solution is either to do caching+write ordering
>
> What does this mean in practice? Can I change my input data (eg. sorting
> triples) so that tdb2.tdbloader can overcome the bottleneck with HDDs?


No, it has nothing to do with the data. What's needed is internal changes,
which is something tdbloader2 (for TDB1) tends to do better on: it
doesn't do a massive amount of random-pattern I/O (as much reading as
writing). Random I/O is bad for HDDs because the physical head has to
move too much.


Andy


RE: About fuseki2 load performance by java API

2019-07-21 Thread Scarlet Remilia
Thank you very much.

I will rebuild my workflow to generate RDF files and try them with the tdb2.tdbloaders.
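
A workflow of this kind could stream triples straight to an N-Triples file instead of building an in-memory Model. The following is only a minimal sketch using the Java standard library: the class `NTriplesFileWriter`, its method names, and the `example.org` IRIs are hypothetical, and it assumes the IRIs passed in are already valid and absolute.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that streams triples to an N-Triples file, one triple per line. */
class NTriplesFileWriter implements AutoCloseable {
    private final BufferedWriter out;
    private long count = 0;

    NTriplesFileWriter(Path file) throws IOException {
        this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
    }

    /** Escape backslash, double quote, LF and CR, which must not appear raw in a literal. */
    static String escapeLiteral(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"':  sb.append("\\\""); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Write a triple whose object is an IRI (assumption: all three IRIs are valid). */
    void writeIriTriple(String s, String p, String o) throws IOException {
        out.write("<" + s + "> <" + p + "> <" + o + "> .\n");
        count++;
    }

    /** Write a triple whose object is a plain string literal. */
    void writeLiteralTriple(String s, String p, String literal) throws IOException {
        out.write("<" + s + "> <" + p + "> \"" + escapeLiteral(literal) + "\" .\n");
        count++;
    }

    /** Number of triples written so far. */
    long count() { return count; }

    @Override
    public void close() throws IOException { out.close(); }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("data.nt"); // hypothetical output file name
        try (NTriplesFileWriter w = new NTriplesFileWriter(file)) {
            w.writeIriTriple("http://example.org/s", "http://example.org/p", "http://example.org/o");
            w.writeLiteralTriple("http://example.org/s", "http://example.org/label", "say \"hi\"");
            System.out.println(w.count() + " triples written to " + file);
        }
    }
}
```

A file produced this way could then be given to the bulk loader mentioned in this thread, for example `tdb2.tdbloader --loader=parallel --loc=./tdb2dataset data.nt`, before the database is added to Fuseki.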







From: Andy Seaborne 
Sent: Friday, July 19, 2019 1:41:34 AM
To: users@jena.apache.org 
Subject: Re: About fuseki2 load performance by java API



On 18/07/2019 13:08, Scarlet Remilia wrote:
> Thank you for reply!
>
>
>
> The server storage is HDD on local with RAID 10.
>
> CPU is 4x 14 cores with 28 threads but only one core is used during the load.
>
> The JVM of fuseki2 is tuned by adding -Xmx=50GB -Xms=50GB and TDB2 used is 
> also tuned by tuning cache size.
>
> I observed disk IO by iostat, but it seems not utilized much disk IO and also 
> it is observed that memory usage of fuseki2 is increasing after loading every 
> 3 millions triples.

If you mean I/O bandwidth, then yes, it will not be high, because the load
becomes random I/O and the effects Laura describes happen.

Memory will increase because Java does not do a GC unless it needs to.

The tdb2.tdbloaders will do better than the Fuseki UI even with a disk,
but for larger datasets, an SSD is preferred.

>
> Fuseki2 is setup as a standalone server by the command below:
>
>
>
> ./fuseki-server –tdb2 –loc=./tdb2dataset –port   -update /fuseki2
>
>
>
> Thank you very much!
>
>
>
>
>
>
> ____
> From: Andy Seaborne 
> Sent: Thursday, July 18, 2019 6:41:56 PM
> To: users@jena.apache.org
> Subject: Re: About fuseki2 load performance by java API
>
> That's quite slow. I get maybe 50-70K triples for a 100m load via the
> Fuseki UI.
>
> The fastest way is to use the bulk loader directly to setup the
> database, then add it to Fuseki.
>
> The hardware of the server makes a big difference. What's the server
> setup? Disk/SSD? Local or remote storage?
>
>   Andy
>
> You don't need the begin/commit in the client - the transaction is in
> the backend server.
>
> On 18/07/2019 09:02, Scarlet Remilia wrote:
>> Hello everyone,
>> I want to load a hundred millions triple into TDB2-backend fuseki2 by Java 
>> API.
>> I used code below:
>>
>> Model model = ModelFactory.createDefaultModel();
>> model.add(model.asStatement(triple));
>> RDFConnectionRemoteBuilder builder = RDFConnectionFuseki.create()
>>   .destination(FusekiURL);
>>   RDFConnection conn = builder.build();
>>   conn.begin(ReadWrite.WRITE);
>>   try {
>>   conn.load(model);
>>   conn.commit();
>>   } finally {
>>   conn.end();
>>   }
>>
>> The code is actually worked but performance is not ideal enough.
>>
>> [2019-07-18 23:29:25] Fuseki INFO  [46] POST 
>> http://192.168.204.244:/fuseki2?default
>> [2019-07-18 23:30:45] Fuseki INFO  [15] Body: Content-Length=-1, 
>> Content-Type=application/rdf+thrift, Charset=null => RDF-THRIFT : 
>> Count=3257309 Triples=3257309 Quads=0
>> [2019-07-18 23:31:12] Fuseki INFO  [15] 200 OK (3,302.546 s)
>>
>> Every 3 millions triples cost 3,302.546 seconds and there are totally 300 
>> millions triples in queue…(One in-mem Model is impossible to contain so much 
>> triples…)
>>
>> Is there any better method to load them quicker?
>>
>> Thanks!
>>
>>
>>
>


Re: About fuseki2 load performance by java API

2019-07-19 Thread Laura Morales
> tdb2.tdbloader --loader=parallel
>
> but it still becomes random IO (moves disk heads)
>
> I haven't tried it extensively on an HDD - I'd be interested in hearing
> what happens.


oh nice! I completely missed it. I've tried it with a 67GB .nt file from 
LinkedGeoData on the same 750GB HDD but the end result does not seem very 
different. It's difficult to compare exactly with when I tried to load 
wikidata, because I don't see any progress being reported here. I mean I don't 
see any "X triples loaded (Y per second)" kind of message. Anyway it starts at 
full speed, boiling CPU, HDD cooking up my wrist from beneath the plastic case, 
and fans almost generating enough lift to take off. Then it gradually slows 
down. I stopped it after 1 hour. At this point I was seeing less than 10% CPU 
usage, 90% iowait, TDB2 files size ~15GB.


> The proper solution is either to do caching+write ordering


What does this mean in practice? Can I change my input data (eg. sorting 
triples) so that tdb2.tdbloader can overcome the bottleneck with HDDs?



Re: About fuseki2 load performance by java API

2019-07-18 Thread Andy Seaborne




On 18/07/2019 13:08, Scarlet Remilia wrote:

> Thank you for reply!
>
> The server storage is HDD on local with RAID 10.
>
> CPU is 4x 14 cores with 28 threads but only one core is used during the load.
>
> The JVM of fuseki2 is tuned by adding -Xmx=50GB -Xms=50GB and TDB2 used is
> also tuned by tuning cache size.
>
> I observed disk IO by iostat, but it seems not utilized much disk IO and also
> it is observed that memory usage of fuseki2 is increasing after loading every
> 3 millions triples.


If you mean I/O bandwidth, then yes, it will not be high, because the load
becomes random I/O and the effects Laura describes happen.

Memory will increase because Java does not do a GC unless it needs to.

The tdb2.tdbloaders will do better than the Fuseki UI even with a disk,
but for larger datasets, an SSD is preferred.










Re: About fuseki2 load performance by java API

2019-07-18 Thread Andy Seaborne




On 18/07/2019 13:49, Laura Morales wrote:

> I had a similar problem when trying to load wikidata on my laptop with 8GB RAM,
> i7 CPU, 750GB HDD. It started fine but then slowed to a crawl after about 100
> million triples. I don't think CPU or RAM are the problem, it's probably to do
> with disk queues or caches or something like that. IIRC when Andy tried to load
> the same dataset on his PC with a 1TB SSD and 16GB RAM, he didn't have those
> problems. Bottom line: try with an SSD/NVMe instead of an HDD.
>
> Besides, it would be nice to have a better way (parallelized) for loading huge
> datasets (trillions of triples).


Already done :-)

tdb2.tdbloader --loader=parallel

but it still becomes random IO (moves disk heads)

I haven't tried it extensively on an HDD - I'd be interested in hearing 
what happens.


The proper solution is either to do caching+write ordering or use a 
different storage system.  A small matter of finding the time to experiment.


Andy
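
The "caching+write ordering" idea can be illustrated generically. The sketch below is not how TDB2 works internally and is not taken from Jena. It is a hypothetical `OrderedWriteCache` that buffers block writes in memory and flushes them in ascending block order, so that a spinning disk would see a mostly sequential pattern instead of random seeks. The block size and class name are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical write-back cache: buffers block writes, then flushes them in offset order. */
class OrderedWriteCache {
    // TreeMap keeps pending blocks sorted by block number.
    private final TreeMap<Long, byte[]> pending = new TreeMap<>();
    private final int blockSize;

    OrderedWriteCache(int blockSize) {
        this.blockSize = blockSize;
    }

    /** Record a write; a later write to the same block replaces the earlier one. */
    void write(long blockNo, byte[] data) {
        if (data.length != blockSize) {
            throw new IllegalArgumentException("data must be exactly one block");
        }
        pending.put(blockNo, data.clone());
    }

    /** Write all pending blocks in ascending order and return the order used. */
    List<Long> flush(RandomAccessFile file) throws IOException {
        List<Long> order = new ArrayList<>();
        for (Map.Entry<Long, byte[]> e : pending.entrySet()) {
            file.seek(e.getKey() * blockSize);
            file.write(e.getValue());
            order.add(e.getKey());
        }
        pending.clear();
        return order;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ordered-write", ".bin");
        OrderedWriteCache cache = new OrderedWriteCache(8);
        // Writes arrive in random block order, with block 2 written twice.
        for (long blockNo : new long[] {7, 2, 9, 2, 0}) {
            cache.write(blockNo, new byte[8]);
        }
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "rw")) {
            System.out.println(cache.flush(raf)); // blocks are written as 0, 2, 7, 9
        }
        Files.delete(tmp);
    }
}
```

The trade-off in a design like this is memory: the more writes are held back before flushing, the more ordered the disk access becomes, but the larger the cache has to be.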












Re: RE: About fuseki2 load performance by java API

2019-07-18 Thread Laura Morales
I had a similar problem when trying to load wikidata on my laptop with 8GB RAM, 
i7 CPU, 750GB HDD. It started fine but then slowed to a crawl after about 100 
million triples. I don't think CPU or RAM are the problem, it's probably to do 
with disk queues or caches or something like that. IIRC when Andy tried to load 
the same dataset on his PC with a 1TB SSD and 16GB RAM, he didn't have those 
problems. Bottom line: try with an SSD/NVMe instead of an HDD.

Besides, it would be nice to have a better way (parallelized) for loading huge 
datasets (trillions of triples).



> Sent: Thursday, July 18, 2019 at 2:08 PM
> From: "Scarlet Remilia" 
> To: "users@jena.apache.org" 
> Subject: RE: About fuseki2 load performance by java API
>
> Thank you for reply!
> 
> 
> 
> The server storage is HDD on local with RAID 10.
> 
> CPU is 4x 14 cores with 28 threads but only one core is used during the load.
> 
> The JVM of fuseki2 is tuned by adding -Xmx=50GB -Xms=50GB and TDB2 used is 
> also tuned by tuning cache size.
> 
> I observed disk IO by iostat, but it seems not utilized much disk IO and also 
> it is observed that memory usage of fuseki2 is increasing after loading every 
> 3 millions triples.
> 
> Fuseki2 is setup as a standalone server by the command below:
> 
> 
> 
> ./fuseki-server –tdb2 –loc=./tdb2dataset –port   -update /fuseki2
> 
> 
> 
> Thank you very much!
> 
> 
> 
> 
> 
> 
> ____________
> From: Andy Seaborne 
> Sent: Thursday, July 18, 2019 6:41:56 PM
> To: users@jena.apache.org
> Subject: Re: About fuseki2 load performance by java API
> 
> That's quite slow. I get maybe 50-70K triples for a 100m load via the
> Fuseki UI.
> 
> The fastest way is to use the bulk loader directly to setup the
> database, then add it to Fuseki.
> 
> The hardware of the server makes a big difference. What's the server
> setup? Disk/SSD? Local or remote storage?
> 
>  Andy
> 
> You don't need the begin/commit in the client - the transaction is in
> the backend server.
> 
> On 18/07/2019 09:02, Scarlet Remilia wrote:
> > Hello everyone,
> > I want to load a hundred millions triple into TDB2-backend fuseki2 by Java 
> > API.
> > I used code below:
> >
> > Model model = ModelFactory.createDefaultModel();
> > model.add(model.asStatement(triple));
> > RDFConnectionRemoteBuilder builder = RDFConnectionFuseki.create()
> >  .destination(FusekiURL);
> >  RDFConnection conn = builder.build();
> >  conn.begin(ReadWrite.WRITE);
> >  try {
> >  conn.load(model);
> >  conn.commit();
> >  } finally {
> >  conn.end();
> >  }
> >
> > The code is actually worked but performance is not ideal enough.
> >
> > [2019-07-18 23:29:25] Fuseki INFO  [46] POST 
> > http://192.168.204.244:/fuseki2?default
> > [2019-07-18 23:30:45] Fuseki INFO  [15] Body: Content-Length=-1, 
> > Content-Type=application/rdf+thrift, Charset=null => RDF-THRIFT : 
> > Count=3257309 Triples=3257309 Quads=0
> > [2019-07-18 23:31:12] Fuseki INFO  [15] 200 OK (3,302.546 s)
> >
> > Every 3 millions triples cost 3,302.546 seconds and there are totally 300 
> > millions triples in queue…(One in-mem Model is impossible to contain so 
> > much triples…)
> >
> > Is there any better method to load them quicker?
> >
> > Thanks!
> >
> >
> >
>


Re: About fuseki2 load performance by java API

2019-07-18 Thread ajs6f
I want to emphasize what Andy said first: 

> The fastest way is to use the bulk loader directly to setup the database, 
> then add it to Fuseki.


This will be very much faster, as well as eliminating any questions of you 
needing to write efficient code. If you can find a workflow that does this, I 
suspect it might be the best immediate choice.

ajs6f

> On Jul 18, 2019, at 8:08 AM, Scarlet Remilia 
>  wrote:
> 
> Thank you for reply!
> 
> 
> 
> The server storage is HDD on local with RAID 10.
> 
> CPU is 4x 14 cores with 28 threads but only one core is used during the load.
> 
> The JVM of fuseki2 is tuned by adding -Xmx=50GB -Xms=50GB and TDB2 used is 
> also tuned by tuning cache size.
> 
> I observed disk IO by iostat, but it seems not utilized much disk IO and also 
> it is observed that memory usage of fuseki2 is increasing after loading every 
> 3 millions triples.
> 
> Fuseki2 is setup as a standalone server by the command below:
> 
> 
> 
> ./fuseki-server –tdb2 –loc=./tdb2dataset –port   -update /fuseki2
> 
> 
> 
> Thank you very much!
> 
> 
> 
> 
> 
> 
> ________
> From: Andy Seaborne 
> Sent: Thursday, July 18, 2019 6:41:56 PM
> To: users@jena.apache.org
> Subject: Re: About fuseki2 load performance by java API
> 
> That's quite slow. I get maybe 50-70K triples for a 100m load via the
> Fuseki UI.
> 
> The fastest way is to use the bulk loader directly to setup the
> database, then add it to Fuseki.
> 
> The hardware of the server makes a big difference. What's the server
> setup? Disk/SSD? Local or remote storage?
> 
> Andy
> 
> You don't need the begin/commit in the client - the transaction is in
> the backend server.
> 
> On 18/07/2019 09:02, Scarlet Remilia wrote:
>> Hello everyone,
>> I want to load a hundred millions triple into TDB2-backend fuseki2 by Java 
>> API.
>> I used code below:
>> 
>> Model model = ModelFactory.createDefaultModel();
>> model.add(model.asStatement(triple));
>> RDFConnectionRemoteBuilder builder = RDFConnectionFuseki.create()
>> .destination(FusekiURL);
>> RDFConnection conn = builder.build();
>> conn.begin(ReadWrite.WRITE);
>> try {
>> conn.load(model);
>> conn.commit();
>> } finally {
>> conn.end();
>> }
>> 
>> The code is actually worked but performance is not ideal enough.
>> 
>> [2019-07-18 23:29:25] Fuseki INFO  [46] POST 
>> http://192.168.204.244:/fuseki2?default
>> [2019-07-18 23:30:45] Fuseki INFO  [15] Body: Content-Length=-1, 
>> Content-Type=application/rdf+thrift, Charset=null => RDF-THRIFT : 
>> Count=3257309 Triples=3257309 Quads=0
>> [2019-07-18 23:31:12] Fuseki INFO  [15] 200 OK (3,302.546 s)
>> 
>> Every 3 millions triples cost 3,302.546 seconds and there are totally 300 
>> millions triples in queue…(One in-mem Model is impossible to contain so much 
>> triples…)
>> 
>> Is there any better method to load them quicker?
>> 
>> Thanks!
>> 
>> 
>> 



RE: About fuseki2 load performance by java API

2019-07-18 Thread Scarlet Remilia
Thank you for the reply!

The server storage is a local HDD array in RAID 10.

The CPU is 4x 14 cores with 28 threads, but only one core is used during the
load.

The Fuseki2 JVM is tuned with -Xmx50g -Xms50g, and TDB2 is also tuned by
adjusting the cache size.

I watched disk I/O with iostat, but the disk does not seem to be heavily
utilized. I also observed that the memory usage of Fuseki2 increases after
every 3 million triples loaded.

Fuseki2 is set up as a standalone server with the command below:

./fuseki-server --tdb2 --loc=./tdb2dataset --port   -update /fuseki2



Thank you very much!







From: Andy Seaborne 
Sent: Thursday, July 18, 2019 6:41:56 PM
To: users@jena.apache.org
Subject: Re: About fuseki2 load performance by java API

That's quite slow. I get maybe 50-70K triples for a 100m load via the
Fuseki UI.

The fastest way is to use the bulk loader directly to setup the
database, then add it to Fuseki.

The hardware of the server makes a big difference. What's the server
setup? Disk/SSD? Local or remote storage?

 Andy

You don't need the begin/commit in the client - the transaction is in
the backend server.

On 18/07/2019 09:02, Scarlet Remilia wrote:
> Hello everyone,
> I want to load a hundred millions triple into TDB2-backend fuseki2 by Java 
> API.
> I used code below:
>
> Model model = ModelFactory.createDefaultModel();
> model.add(model.asStatement(triple));
> RDFConnectionRemoteBuilder builder = RDFConnectionFuseki.create()
>  .destination(FusekiURL);
>  RDFConnection conn = builder.build();
>  conn.begin(ReadWrite.WRITE);
>  try {
>  conn.load(model);
>  conn.commit();
>  } finally {
>  conn.end();
>  }
>
> The code is actually worked but performance is not ideal enough.
>
> [2019-07-18 23:29:25] Fuseki INFO  [46] POST 
> http://192.168.204.244:/fuseki2?default
> [2019-07-18 23:30:45] Fuseki INFO  [15] Body: Content-Length=-1, 
> Content-Type=application/rdf+thrift, Charset=null => RDF-THRIFT : 
> Count=3257309 Triples=3257309 Quads=0
> [2019-07-18 23:31:12] Fuseki INFO  [15] 200 OK (3,302.546 s)
>
> Every 3 millions triples cost 3,302.546 seconds and there are totally 300 
> millions triples in queue…(One in-mem Model is impossible to contain so much 
> triples…)
>
> Is there any better method to load them quicker?
>
> Thanks!
>
>
>


Re: About fuseki2 load performance by java API

2019-07-18 Thread Andy Seaborne
That's quite slow. I get maybe 50-70K triples for a 100m load via the 
Fuseki UI.


The fastest way is to use the bulk loader directly to set up the
database, then add it to Fuseki.


The hardware of the server makes a big difference. What's the server 
setup? Disk/SSD? Local or remote storage?


Andy

You don't need the begin/commit in the client - the transaction is in 
the backend server.
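
For a sense of scale, the figures reported in the original post quoted below (3,257,309 triples in 3,302.546 s, with 300 million triples queued) work out to roughly 1,000 triples per second and about three and a half days for the whole queue. The following sketch only reproduces that arithmetic. The class and method names are hypothetical, and it assumes the rate stays constant, which the rest of this thread suggests is optimistic on an HDD.

```java
/** Hypothetical helper reproducing the load-time arithmetic from the reported figures. */
class LoadEstimate {

    /** Average throughput for one batch. */
    static double triplesPerSecond(long triples, double seconds) {
        return triples / seconds;
    }

    /** Hours needed for the whole queue, assuming the rate stays constant. */
    static double hoursToLoad(long totalTriples, double triplesPerSecond) {
        return totalTriples / triplesPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // Figures taken from the Fuseki log and the post: one batch and the total queue size.
        double rate = triplesPerSecond(3_257_309L, 3302.546);
        double hours = hoursToLoad(300_000_000L, rate);
        System.out.printf("%.0f triples/s, %.1f hours (%.1f days) for 300M triples%n",
                rate, hours, hours / 24.0);
    }
}
```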


On 18/07/2019 09:02, Scarlet Remilia wrote:

Hello everyone,
I want to load a hundred million triples into a TDB2-backed Fuseki2 through
the Java API.
I used the code below:

Model model = ModelFactory.createDefaultModel();
model.add(model.asStatement(triple));
RDFConnectionRemoteBuilder builder = RDFConnectionFuseki.create()
    .destination(FusekiURL);
RDFConnection conn = builder.build();
conn.begin(ReadWrite.WRITE);
try {
    conn.load(model);
    conn.commit();
} finally {
    conn.end();
}

The code works, but the performance is not good enough.

[2019-07-18 23:29:25] Fuseki INFO  [46] POST http://192.168.204.244:/fuseki2?default
[2019-07-18 23:30:45] Fuseki INFO  [15] Body: Content-Length=-1, Content-Type=application/rdf+thrift, Charset=null => RDF-THRIFT : Count=3257309 Triples=3257309 Quads=0
[2019-07-18 23:31:12] Fuseki INFO  [15] 200 OK (3,302.546 s)

Every 3 million triples take 3,302.546 seconds, and there are 300 million
triples in total in the queue… (A single in-memory Model cannot hold that many
triples…)

Is there any better method to load them quicker?

Thanks!
