Re: TDB2 parallel load on cloud SSD and other observations/questions

Andy Seaborne Wed, 24 Jun 2020 06:32:33 -0700

Hi Isroel,

It is worth trying each of the loaders to see how they perform on yourmachine. Theer si a lot of parallel work when unlimited parallelloading of quads happens so may it is a mismatch to the machine setup.Databases created by any loader are compatible.


On 22/06/2020 16:57, Isroel Kogan wrote:

thank you Rob - I have confused the terminology. Indeed each run processes 0.5m 
quads.
what is UOM of the batch loading? M/s?

Looking at the output of iotop - the 3 main threads - which comprise the 
lionshare of the activity - have pretty steady reads - of about 900-950 M/s w 
little variation beyond that. write varies a little more - but 2-6 M/s. What is 
puzzling to me - is that in a configuration with twice the CPU and twice the 
RAM - the load average remained the same in absolute terms.
RAM usage was steady at about 8GB in the 32GB configuration - now its been 
steady at 4.8 (16GB configuration) -
in both cases it is severely underutilized.

That might be due to how much mmap memory the process can have. Thereis a per-process restriction.


----

I'm not clear if you are loading an empty database or an existing one.It makes a difference

The indexes are sorted (they are B+Trees) so as the database grows, thedepth of the tree (size of sorted data) grows and insert rates drop.


----

Another factor is the nature of the data : if there are many "largeliterals" such as long strings, there is more byte-shifting needed whichslows the first phase, not the indexing steps.


    Andy


I'm not conversant in Java - I've been trying jstat and jmap - the latter hasnt 
been working. how best to verify the heap size? what else can affect things?

On 2020/06/22 09:01:36, Rob Vesse <rve...@dotnetrdf.org> wrote:

Isabel

I think there might be a fundamental misunderstanding happening about batch sizes here.  
The batch sizes are fixed for a run and never changes, the "batch size" you 
refer to is a speed calculation e.g

     19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 
3,562 / Avg: 38,030)
     19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 
4,322 / Avg: 37,443)
     19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 
3,649 / Avg: 36,759)
     19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 
4,868 / Avg: 36,283)
     19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 
3,532 / Avg: 35,622)
     19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 
19:11:41 UTC]
     19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 
3,643 / Avg: 35,009)

Note that each batch is 500,000 quads as the number after "Add:" increases by 500,000 each time.  
The "Batch" in brackets is referring to the calculated loading speed for the current batch, where 
"Avg" is the calculated loading speed over the entire load.

So yes the speed is decreasing over time, this is a commonly reported issue on 
these lists but there is no simple fix for this because it depends both on 
hardware and data.  The output you have provided suggests that you are stuck on 
IO, Andy is the primary developer on TDB2 so he may be able to shed more light 
on what might be going on.

Rob

On 22/06/2020, 06:32, "Isroel Kogan" <isroelko...@gmail.com> wrote:

     Hi Andy - thanks for your comments.

Instead of responding point by point, its best if I present a clearer picture -as I also have a better understanding of the factors so far.GCP instance stats:

     $ lscpu

Architecture: x86_64

     CPU op-mode(s):      32-bit, 64-bit
     Byte Order:          Little Endian
     Address sizes:       46 bits physical, 48 bits virtual
     CPU(s):              10
     On-line CPU(s) list: 0-9
     Thread(s) per core:  2
     Core(s) per socket:  5
     Socket(s):           1
     NUMA node(s):        1
     Vendor ID:           GenuineIntel
     CPU family:          6
     Model:               63
     Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
     Stepping:            0
     CPU MHz:             2300.000
     BogoMIPS:            4600.00
     Hypervisor vendor:   KVM
     Virtualization type: full
     L1d cache:           32K
     L1i cache:           32K
     L2 cache:            256K
     L3 cache:            46080K
     NUMA node0 CPU(s):   0-9

16GB RAM

     (I configured less RAM because on a prior iteration - out of 32GB - only 8 
was being used)

3TB local SSDaccording to google cloud - max performance for this size is as follows:

     Storage space Partitions     IOPS                          Throughput 
(MB/s)
                                                 Read      Write            
Read         Write

3 TB 8 680,000 360,000 2,650 1,400https://cloud.google.com/compute/docs/disksI'm not getting that - but performance is an order of magnitude or more better than on my machine - as youll see below -My data is as follows -15 files of varying size (based on the original bz2 files provided by the authors). files range from 2GB (a few million triples) to 129GB. 4 files have close to a half a billion triples each.

     The files are .nq - as you noted - quads. (unzipped as already mentioned).

Currently I have a process running initiated by tdb2.tdbloader command in parallel mode. DB empty initially.I will include some information here that may help you get a sense of the performance issues I alluded tobeginning of log output:17:14:42 INFO loader :: Loader = LoaderParallel

     17:14:42 INFO  loader     :: Start: 15 files
     17:14:49 INFO  loader     :: Add: 500,000 ai_stackexchange_triples.nq 
(Batch: 77,339 / Avg: 77,339)
     17:14:55 INFO  loader     :: Add: 1,000,000 ai_stackexchange_triples.nq 
(Batch: 89,285 / Avg: 82,884)
     17:15:00 INFO  loader     :: Add: 1,500,000 ai_stackexchange_triples.nq 
(Batch: 90,090 / Avg: 85,154)
     17:15:06 INFO  loader     :: Add: 2,000,000 ai_stackexchange_triples.nq 
(Batch: 87,382 / Avg: 85,700)
     17:15:12 INFO  loader     :: Add: 2,500,000 ai_stackexchange_triples.nq 
(Batch: 82,304 / Avg: 84,999)
     17:15:18 INFO  loader     :: Add: 3,000,000 ai_stackexchange_triples.nq 
(Batch: 87,032 / Avg: 85,331)
     17:15:22 INFO  loader     ::   End file: ai_stackexchange_triples.nq 
(triples/quads = 3,402,753)
     17:15:23 INFO  loader     :: Add: 3,500,000 
datascience_stackexchange_triples.nq (Batch: 89,493 / Avg: 85,902)
     17:15:29 INFO  loader     :: Add: 4,000,000 
datascience_stackexchange_triples.nq (Batch: 86,475 / Avg: 85,973)
     17:15:35 INFO  loader     :: Add: 4,500,000 
datascience_stackexchange_triples.nq (Batch: 88,136 / Avg: 86,208)
     17:15:40 INFO  loader     :: Add: 5,000,000 
datascience_stackexchange_triples.nq (Batch: 93,562 / Avg: 86,891)

the batch size increases for the next file - but the third file is one of those w nearly a half a billion triples.the batch size keeps steadily decreasinghere is the output at the 250m triple mark19:03:24 INFO loader :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)

     19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 
4,322 / Avg: 37,443)
     19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 
3,649 / Avg: 36,759)
     19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 
4,868 / Avg: 36,283)
     19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 
3,532 / Avg: 35,622)
     19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 
19:11:41 UTC]
     19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 
3,643 / Avg: 35,009)
     19:16:04 INFO  loader     :: Add: 251,000,000 github_1_fixed.nq (Batch: 
3,976 / Avg: 34,473)
     19:18:41 INFO  loader     :: Add: 251,500,000 github_1_fixed.nq (Batch: 
3,171 / Avg: 33,810)
     19:21:03 INFO  loader     :: Add: 252,000,000 github_1_fixed.nq (Batch: 
3,512 / Avg: 33,241)

The batch size has decreased (I don't understand exactly what the batch size refers to) and conversely the time for a half million triples to be processed has gone from 5s to 2.5m.This is the tail of the log currently:03:01:16 INFO loader :: Add: 305,500,000 github_1_fixed.nq (Batch: 1,468 / Avg: 8,680)

     03:06:47 INFO  loader     :: Add: 306,000,000 github_1_fixed.nq (Batch: 
1,512 / Avg: 8,613)
     03:11:10 INFO  loader     :: Add: 306,500,000 github_1_fixed.nq (Batch: 
1,897 / Avg: 8,564)
     03:16:00 INFO  loader     :: Add: 307,000,000 github_1_fixed.nq (Batch: 
1,723 / Avg: 8,509)
     03:21:36 INFO  loader     :: Add: 307,500,000 github_1_fixed.nq (Batch: 
1,489 / Avg: 8,444)
     03:26:07 INFO  loader     :: Add: 308,000,000 github_1_fixed.nq (Batch: 
1,842 / Avg: 8,395)
     03:31:43 INFO  loader     :: Add: 308,500,000 github_1_fixed.nq (Batch: 
1,491 / Avg: 8,333)
     03:36:08 INFO  loader     :: Add: 309,000,000 github_1_fixed.nq (Batch: 
1,882 / Avg: 8,287)
     03:39:37 INFO  loader     :: Add: 309,500,000 github_1_fixed.nq (Batch: 
2,391 / Avg: 8,254)
     03:44:02 INFO  loader     :: Add: 310,000,000 github_1_fixed.nq (Batch: 
1,887 / Avg: 8,209)
     03:44:02 INFO  loader     ::   Elapsed: 37,759.63 seconds [2020/06/22 
03:44:02 UTC]
     03:47:07 INFO  loader     :: Add: 310,500,000 github_1_fixed.nq (Batch: 
2,711 / Avg: 8,183)
     03:51:58 INFO  loader     :: Add: 311,000,000 github_1_fixed.nq (Batch: 
1,717 / Avg: 8,133)

batch size down below 2k and processing time per 0.5m triples up to around 5 min.What has puzzled me is that in the beginning, CPU utilization was high. At some point the Load Average was over 7 and even 8 (so 70-80% for 10 cores).

     Whereas now Load average sits near 3.5 not crossing 4 for several hours. 
This is based on htop - according to the gcp console dashboard - CPU 
utilization did indeed peak near 80% - but has been steady around 13-14% for 
the last hours - I don't know how to match these two metrics - but its heavily 
underutilized.

Here's a flavor of the output (as I write now -when the batch size has decreased) of$ iotop -obCurrent DISK READ: 2.76 G/s | Current DISK WRITE: 0.00 B/s

       TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
     12184 be/4 isroelko  936.79 M/s    5.72 M/s  0.00 % 80.55 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12186 be/4 isroelko  961.04 M/s    7.28 M/s  0.00 % 78.52 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12185 be/4 isroelko  932.44 M/s    9.45 M/s  0.00 % 77.55 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-

The relevant output of running (for 124s) - this is the last accumulation:$ iotop -abTotal DISK READ: 2.76 G/s | Total DISK WRITE: 16.87 M/s

     Current DISK READ:       2.74 G/s | Current DISK WRITE:       0.00 B/s
       TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
     12184 be/4 isroelko     88.11 G    414.80 M  0.00 % 79.31 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12185 be/4 isroelko     87.82 G    498.31 M  0.00 % 78.74 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12186 be/4 isroelko     88.17 G    435.88 M  0.00 % 78.66 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12187 be/4 isroelko      3.07 G    206.02 M  0.00 %  4.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12156 be/4 isroelko     68.29 M      8.00 K  0.00 %  0.63 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12171 be/4 isroelko   1048.00 K      0.00 B  0.00 %  0.13 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12160 be/4 isroelko   1016.00 K      0.00 B  0.00 %  0.10 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12157 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.09 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12192 be/4 isroelko    388.00 K      0.00 B  0.00 %  0.09 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12162 be/4 isroelko   1704.00 K      8.00 K  0.00 %  0.06 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12181 be/4 isroelko     33.52 M      9.52 M  0.00 %  0.05 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12182 be/4 isroelko     33.29 M      9.69 M  0.00 %  0.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12177 be/4 isroelko    404.00 K      0.00 B  0.00 %  0.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12176 be/4 isroelko    140.00 K      0.00 B  0.00 %  0.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12191 be/4 isroelko    176.00 K      0.00 B  0.00 %  0.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12173 be/4 isroelko     48.00 K      0.00 B  0.00 %  0.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12175 be/4 isroelko    112.00 K      0.00 B  0.00 %  0.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12190 be/4 isroelko     44.00 K      0.00 B  0.00 %  0.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena
     12174 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.04 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12183 be/4 isroelko     33.54 M      9.70 M  0.00 %  0.02 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12159 be/4 isroelko     84.00 K      0.00 B  0.00 %  0.00 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-
     12170 be/4 isroelko      0.00 B    100.00 K  0.00 %  0.00 % java 
-Dlog4j.configuration=file:/usr/local/apache-jena-

and now the output of for pid 12155 - the subprocess running all the threads

     $ pidstat -p 12155 -ut

Linux 4.19.0-9-cloud-amd64 (graph4codev1) 06/22/20 _x86_64_ (10 CPU)04:38:32 UID TGID TID %usr %system %guest %wait %CPU CPU Command

     04:38:32     1001     12155         -   44.41   47.13    0.00    0.00   
91.54     4  java
     04:38:32     1001         -     12155    0.00    0.00    0.00    0.00    
0.00     4  |__java
     04:38:32     1001         -     12156    4.88    0.11    0.00    0.00    
4.99     7  |__java
     04:38:32     1001         -     12157    0.52    0.00    0.00    0.00    
0.52     7  |__GC Thread#0
     04:38:32     1001         -     12158    0.00    0.00    0.00    0.00    
0.00     1  |__G1 Main Marker
     04:38:32     1001         -     12159    0.27    0.00    0.00    0.00    
0.27     2  |__G1 Conc#0
     04:38:32     1001         -     12160    0.34    0.00    0.00    0.01    
0.34     8  |__G1 Refine#0
     04:38:32     1001         -     12161    0.28    0.00    0.00    0.00    
0.28     6  |__G1 Young RemSet
     04:38:32     1001         -     12162    0.02    0.01    0.00    0.00    
0.03     1  |__VM Thread
     04:38:32     1001         -     12163    0.00    0.00    0.00    0.00    
0.00     4  |__Reference Handl
     04:38:32     1001         -     12164    0.00    0.00    0.00    0.00    
0.00     7  |__Finalizer
     04:38:32     1001         -     12165    0.00    0.00    0.00    0.00    
0.00     6  |__Signal Dispatch
     04:38:32     1001         -     12166    0.04    0.00    0.00    0.00    
0.04     1  |__C2 CompilerThre
     04:38:32     1001         -     12167    0.00    0.00    0.00    0.00    
0.00     1  |__C1 CompilerThre
     04:38:32     1001         -     12168    0.00    0.00    0.00    0.00    
0.00     1  |__Sweeper thread
     04:38:32     1001         -     12169    0.00    0.00    0.00    0.00    
0.00     1  |__Service Thread
     04:38:32     1001         -     12170    0.03    0.02    0.00    0.01    
0.05     1  |__VM Periodic Tas
     04:38:32     1001         -     12171    0.00    0.00    0.00    0.00    
0.00     6  |__Common-Cleaner
     04:38:32     1001         -     12173    0.51    0.00    0.00    0.00    
0.51     8  |__GC Thread#1
     04:38:32     1001         -     12174    0.51    0.00    0.00    0.00    
0.51     3  |__GC Thread#2
     04:38:32     1001         -     12175    0.50    0.00    0.00    0.00    
0.50     1  |__GC Thread#3
     04:38:32     1001         -     12176    0.51    0.00    0.00    0.00    
0.52     5  |__GC Thread#4
     04:38:32     1001         -     12177    0.51    0.00    0.00    0.00    
0.51     2  |__GC Thread#5
     04:38:32     1001         -     12178    0.00    0.00    0.00    0.00    
0.00     5  |__Thread-0
     04:38:32     1001         -     12179    0.00    0.00    0.00    0.00    
0.00     6  |__Thread-1
     04:38:32     1001         -     12180    0.00    0.00    0.00    0.00    
0.00     7  |__Thread-2
     04:38:32     1001         -     12181    4.33    0.09    0.00    0.00    
4.42     2  |__Thread-3
     04:38:32     1001         -     12182    5.37    0.17    0.00    0.00    
5.54     3  |__Thread-4
     04:38:32     1001         -     12183    5.12    0.17    0.00    0.00    
5.29     4  |__Thread-5
     04:38:32     1001         -     12184    4.44   14.90    0.00    0.13   
19.34     3  |__Thread-6
     04:38:32     1001         -     12185    5.53   15.30    0.00    0.14   
20.83     2  |__Thread-7
     04:38:32     1001         -     12186    5.08   15.00    0.00    0.14   
20.07     9  |__Thread-8
     04:38:32     1001         -     12187    3.57    1.51    0.00    0.02    
5.08     0  |__Thread-9
     04:38:32     1001         -     12190    0.52    0.00    0.00    0.00    
0.52     6  |__GC Thread#6
     04:38:32     1001         -     12191    0.51    0.00    0.00    0.00    
0.51     0  |__GC Thread#7
     04:38:32     1001         -     12192    0.50    0.00    0.00    0.00    
0.51     4  |__GC Thread#8
     04:38:32     1001         -     12193    0.02    0.00    0.00    0.00    
0.02     5  |__G1 Refine#1
     04:38:32     1001         -     12194    0.01    0.00    0.00    0.00    
0.01     9  |__G1 Refine#2
     04:38:32     1001         -     12195    0.01    0.00    0.00    0.00    
0.01     0  |__G1 Refine#3
     04:38:32     1001         -     12196    0.01    0.00    0.00    0.00    
0.01     7  |__G1 Refine#4
     04:38:32     1001         -     12197    0.00    0.00    0.00    0.00    
0.00     1  |__G1 Refine#5
     04:38:32     1001         -     12198    0.00    0.00    0.00    0.00    
0.00     1  |__G1 Refine#6
     04:38:32     1001         -     12199    0.00    0.00    0.00    0.00    
0.00     6  |__G1 Refine#7
     04:38:32     1001         -     12200    0.00    0.00    0.00    0.00    
0.00     6  |__G1 Refine#8
     04:38:32     1001         -     12205    0.27    0.00    0.00    0.00    
0.27     9  |__G1 Conc#1

you can see threads, 12184, 12185, 12186 have the highest CPU utilization, albeit still at 20% only. they are the same threads in the output of iotop - doing a lot of reading mostly. As you can see in iotop output, the read is at over 900 M/s - which is excellent. All the other threads are idle the overwhelming majority of the timeFinally htop gives a similar picture, the 3 threads have the highest CPU utilization (around 20%) now, and so 7/10 cores are practically idle with short spikes when the other threads pick up. which isnt much as the above shows. Also those threads are either in R or D status (rarely in S) - while the rest spend most of the time in S.I didnt understand what you wrote about the heap size, RAM and read threads - I don't use java - so I'm flying by the seat of my pants.

     Here is the output of

$ jstat -gc 12155 S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT CGC CGCT GCT

      0.0   184320.0  0.0   184320.0 2404352.0 102400.0 1519616.0   642119.0  
20224.0 19386.6 2560.0 2339.3   2974 1039.542   0      0.000  340     2.989 
1042.531

which may be relevant.My very crudely formulated question is - how come it seems that with smaller batch sizes, the processing time increased so much - while at the same time there is actually much less processing going on (much lower CPU utilization) - it seems like everything is waiting for whatever those 3 threads are doing.

     And is three anyway to address it?

thanks for your help!best,IsroelOn 2020/06/21 21:38:50, Andy Seaborne <a...@apache.org> wrote:

     >  > And when it says "nq" is really quads or all data for the default
     >  > graph? (there is more indexing work for named graphs).
     >
     >  >> : Quads = 127,206,280
     >
     >
     > OK - it's quads. There are 6 quad indexes and in full parallel mode it
     > will use 2 more threads to parse and to build the node table.
     >
     > Full parallel loading is going to use up all the cores and HT threads
     > aren't full threads for this purpose.
     >
     > The phased loader (default) uses less threads.
     >
     > Phase 1:
     > one thread to decompress and parse
     > one thread to build the node table.
     > one thread to for the GSPO
     > (and one for SPO but you seem to have no triples)
     > =3
     >
     > Phase 2:
     > two threads
     > =2
     >
     > Phase 3:
     > three threads
     > =3
     >
     >      Andy
     >
     >
     > On 21/06/2020 22:11, Andy Seaborne wrote:
     > > Hi there,
     > >
     > > Thanks for reporting the findings.
     > >
     > > On 20/06/2020 16:10, Isroel Kogan wrote:
     > >> Hi,
     > >>
     > >> I am also a newcomer to the RDF world - and particularly Jena, which I
     > >> started using this week.
     > >>
     > >> A couple of observations I have made over the last few days exploring
     > >> different options.
     > >>
     > >> Local Machine (specs):
     > >>
     > >> Ubuntu 18.04
     > >> Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
     > >
     > > which is 4 core and hyper threading. for this workload that is more 
like
     > > 4 threads. HT is not a completely x2 for this sort of continuous
     > > processing threading.
     > >
     > > And pre-emtptive timeslicing is not nice!
     > >
     > >> 16GB RAM
     > >> 512 SSD (NVMe).
     > >>
     > >>
     > >> the following compares loading a file in compressed vs decompressed
     > >> format -both w parallel loader.
     > >>
     > >> file:
     > >> docstrings_triples.nq
     > >> size: 28GB
     > >>
     > >> cmd:
     > >> time tdb2.tdbloader --loader=parallel --loc=test1graphdb
     > >> docstrings_triples.nq > tdb2.log1 2>&1
     > >>
     > >> :: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
     > >>
     > >> real    22m46.346s
     > >> user    120m46.591s
     > >> sys    3m22.698s
     > >>
     > >>
     > >> file:
     > >> docstrings_triples.nq.bz2
     > >> size: 542M
     > >>
     > >> cmd:
     > >>
     > >> time tdb2.tdbloader --loader=parallel --loc=test2graphdb
     > >> docstrings_triples.nq.bz2 > tdb2.log2 2>&1
     > >>
     > >> :: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
     > >>
     > >>
     > >> real    37m8.182s
     > >> user    109m42.970s
     > >> sys    6m27.426s
     > >>
     > >> resulting DB size
     > >> 30GB
     > >>
     > >> confirmed equal via diff.
     > >>
     > >> pbzip2 ran in 84s
     > >>
     > >> Less rigorously I noticed a similar gain in speed for other files.
     > >
     > > For gz files, the speed of loading of compressed vs uncompressed is
     > > usually not very much.  It does look like bz2
     > >
     > > Using a separate process and faster decompressor may help:
     > >
     > > bzip2 -d < docstrings_triples.nq.bz2 | \
     > > time tdb2.tdbloader --loader=parallel --loc=test2graphdb \
     > >      -- - > tdb2.log2 2>&1
     > >
     > > When Jena decompresses a bz2 file, it uses a Apache Common Compress so
     > > it is a java decompressor which will take time to get optimized by the
     > > JIT and is likely slower than a specialized tool like bzip2.
     > >
     > > But with 4 core, it may have the opposite effect - using more processes
     > > causes preemption timeslicing.
     > >
     > > It maybe one of the other loaders is faster because it is a better 
match
     > > to the hardware.
     > >
     > >> Is this expected behaviour? What factors influence this?
     > >>
     > >> SSD - local vs cloud.
     > >>
     > >> on my local machine, when running parallel loader, cores were working
     > >> at over 70% capacity and there was little IO induced down time.
     > >
     > > How many core were active?
     > > And when it says "nq" is really quads or all data for the default 
graph?
     > > (there is more indexing work for named graphs).
     > >
     > > Some of that will be the bz2 decompression but it looks to me "like 
it's
     > >   "more threads than cores" causing timeslicing.
     > >
     > >>
     > >> GCP instance specs:
     > >>
     > >> 20 CPU
     > >> 32GB RAM
     > >
     > > And same heap size?
     > >
     > > While the parallel loader is using multiple threads it is a fixed 
number
     > > so more CPU will help only if
     > >
     > > More RAM is going to help because the OS will use it for file system
     > > cache, delaying writes.
     > >
     > > But with more read threads, it could be there is less preemptive
     > > scheduling and that could be a big gain.
     > >
     > >> 6TB "local SSD" storage
     > >> the local SSD storage offers the best performance to reduce IO latency
     > >> - it has physical proximity to instance - as per GCP.
     > >>
     > >> a few cores were working at near capacity, while the vast majority
     > >> idle (near 0%) w occasional spikes. average load translates to 20%
     > >> utilization. As I've seen others write here, this is a difference
     > >> others have noted.
     > >> How can this be addressed? buffer size? (I don't have a deep enough
     > >> understanding).
     > >
     > > My guess is that on the GCP instance it is one thread-one core.
     > >
     > >>
     > >>
     > >> Another recurring pattern is the reduction in batch size.
     > >> I've been running a load job on my gcp instance for almost a day 
(23+h).
     > >>
     > >> file size: 93GB
     > >> triples: 472m
     > >>
     > >> batch size decreased from 160k range to under 1k, while processing
     > >> time per batch increased from a few seconds to over 10 min. All this
     > >> time average CPU usage has remained steady, as has RAM usage.
     > >
     > > Not sure I quite understand - this is adding more data to an existing
     > > database? And 10mins for 1k? While it will be slower, that does sound
     > > rather extreme.
     > >
     > >>
     > >> I don't understand how all of this works with indexing. Is this
     > >> expected behaviour? besides a locally proximate SSD, I've thrown an
     > >> overkill of hardware at it.
     > >>
     > >> thanks
     > >>
     > >
     > >      Andy
     > >
     >

Re: TDB2 parallel load on cloud SSD and other observations/questions

Reply via email to