Thank you, Rob - I had confused the terminology. Indeed, each batch covers 0.5m
quads.
What is the unit of measure (UOM) of the batch loading figure? M/s?

Looking at the output of iotop, the 3 main threads - which account for the
lion's share of the activity - have pretty steady reads of about 900-950 M/s,
with little variation beyond that. Writes vary a little more, but stay around
2-6 M/s. What is puzzling to me is that in a configuration with twice the CPU
and twice the RAM, the load average remained the same in absolute terms.
RAM usage was steady at about 8GB in the 32GB configuration - now it has been
steady at 4.8GB in the 16GB configuration -
in both cases it is severely underutilized.

I'm not conversant in Java - I've been trying jstat and jmap - the latter
hasn't been working. How best to verify the heap size? What else can affect
things?
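
In case it helps pin down my confusion: is something along these lines the
right way to check and change the heap? I'm guessing here - <pid> stands for
the loader's java process, and I'm assuming the tdb2.tdbloader wrapper script
honours a JVM_ARGS environment variable.

$ jps -lvm                                  # JVM flags each java process was started with
$ jcmd <pid> VM.flags | tr ' ' '\n' | grep -i heapsize   # MaxHeapSize / InitialHeapSize (JDK 9+)
$ jcmd <pid> GC.heap_info                   # current heap occupancy (JDK 9+)

and, for the next run, something like:

$ JVM_ARGS="-Xmx8G" tdb2.tdbloader --loader=parallel --loc=<dbdir> <files>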

On 2020/06/22 09:01:36, Rob Vesse <rve...@dotnetrdf.org> wrote: 
> Isabel
> 
> I think there might be a fundamental misunderstanding happening about batch 
> sizes here.  The batch size is fixed for a run and never changes; the 
> "batch size" you refer to is a speed calculation, e.g.
> 
>     19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 
> 3,562 / Avg: 38,030)
>     19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 
> 4,322 / Avg: 37,443)
>     19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 
> 3,649 / Avg: 36,759)
>     19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 
> 4,868 / Avg: 36,283)
>     19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 
> 3,532 / Avg: 35,622)
>     19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 
> 19:11:41 UTC]
>     19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 
> 3,643 / Avg: 35,009)
> 
> Note that each batch is 500,000 quads, as the number after "Add:" increases by 
> 500,000 each time.  The "Batch" figure in brackets refers to the calculated 
> loading speed for the current batch, while "Avg" is the calculated loading 
> speed over the entire load.
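> 
> (Concretely: at a batch speed of 3,562, a 500,000 quad batch takes roughly 
> 500,000 / 3,562 ≈ 140 seconds, which matches the roughly two-minute gaps 
> between consecutive log lines above.)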
> 
> So yes, the speed is decreasing over time. This is a commonly reported issue 
> on these lists, but there is no simple fix because it depends on both hardware 
> and data.  The output you have provided suggests that you are stuck on IO; 
> Andy is the primary developer of TDB2, so he may be able to shed more light 
> on what might be going on.
> 
> Rob
> 
> On 22/06/2020, 06:32, "Isroel Kogan" <isroelko...@gmail.com> wrote:
> 
>     Hi Andy - thanks for your comments.
>     
>     Instead of responding point by point, it's best if I present a clearer 
>     picture, as I also have a better understanding of the factors so far.
>     
>     
>     GCP instance stats:
>     $ lscpu
>     
>     Architecture:        x86_64
>     CPU op-mode(s):      32-bit, 64-bit
>     Byte Order:          Little Endian
>     Address sizes:       46 bits physical, 48 bits virtual
>     CPU(s):              10
>     On-line CPU(s) list: 0-9
>     Thread(s) per core:  2
>     Core(s) per socket:  5
>     Socket(s):           1
>     NUMA node(s):        1
>     Vendor ID:           GenuineIntel
>     CPU family:          6
>     Model:               63
>     Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
>     Stepping:            0
>     CPU MHz:             2300.000
>     BogoMIPS:            4600.00
>     Hypervisor vendor:   KVM
>     Virtualization type: full
>     L1d cache:           32K
>     L1i cache:           32K
>     L2 cache:            256K
>     L3 cache:            46080K
>     NUMA node0 CPU(s):   0-9
>     
>     16GB RAM
>     (I configured less RAM because on a prior iteration - out of 32GB - only 
> 8 was being used)
>     
>     3TB local SSD
>     
>     according to google cloud - max performance for this size is as follows:
>     Storage space   Partitions   Read IOPS   Write IOPS   Read throughput (MB/s)   Write throughput (MB/s)
>     3 TB            8            680,000     360,000      2,650                    1,400
>     
>     
>     https://cloud.google.com/compute/docs/disks
>     
>     I'm not getting that - but performance is an order of magnitude or more 
>     better than on my machine, as you'll see below.
>     
>     
>     
>     My data is as follows - 15 files of varying size (based on the original 
>     bz2 files provided by the authors). Files range from 2GB (a few million 
>     triples) to 129GB; 4 files have close to half a billion triples each.
>     The files are .nq - as you noted, quads (unzipped, as already mentioned).
>     
>     Currently I have a process running, initiated by the tdb2.tdbloader 
>     command in parallel mode. The DB was empty initially.
>     
>     I will include some information here that may help you get a sense of the 
>     performance issues I alluded to.
>     
>     
>     beginning of log output:
>     
>     
>     17:14:42 INFO  loader     :: Loader = LoaderParallel
>     17:14:42 INFO  loader     :: Start: 15 files
>     17:14:49 INFO  loader     :: Add: 500,000 ai_stackexchange_triples.nq 
> (Batch: 77,339 / Avg: 77,339)
>     17:14:55 INFO  loader     :: Add: 1,000,000 ai_stackexchange_triples.nq 
> (Batch: 89,285 / Avg: 82,884)
>     17:15:00 INFO  loader     :: Add: 1,500,000 ai_stackexchange_triples.nq 
> (Batch: 90,090 / Avg: 85,154)
>     17:15:06 INFO  loader     :: Add: 2,000,000 ai_stackexchange_triples.nq 
> (Batch: 87,382 / Avg: 85,700)
>     17:15:12 INFO  loader     :: Add: 2,500,000 ai_stackexchange_triples.nq 
> (Batch: 82,304 / Avg: 84,999)
>     17:15:18 INFO  loader     :: Add: 3,000,000 ai_stackexchange_triples.nq 
> (Batch: 87,032 / Avg: 85,331)
>     17:15:22 INFO  loader     ::   End file: ai_stackexchange_triples.nq 
> (triples/quads = 3,402,753)
>     17:15:23 INFO  loader     :: Add: 3,500,000 
> datascience_stackexchange_triples.nq (Batch: 89,493 / Avg: 85,902)
>     17:15:29 INFO  loader     :: Add: 4,000,000 
> datascience_stackexchange_triples.nq (Batch: 86,475 / Avg: 85,973)
>     17:15:35 INFO  loader     :: Add: 4,500,000 
> datascience_stackexchange_triples.nq (Batch: 88,136 / Avg: 86,208)
>     17:15:40 INFO  loader     :: Add: 5,000,000 
> datascience_stackexchange_triples.nq (Batch: 93,562 / Avg: 86,891)
>     
>     
>     The batch size increases for the next file - but the third file is one of 
>     those with nearly half a billion triples.
>     
>     The batch size keeps steadily decreasing.
>     
>     Here is the output at the 250m triple mark:
>     
>     19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 
> 3,562 / Avg: 38,030)
>     19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 
> 4,322 / Avg: 37,443)
>     19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 
> 3,649 / Avg: 36,759)
>     19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 
> 4,868 / Avg: 36,283)
>     19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 
> 3,532 / Avg: 35,622)
>     19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 
> 19:11:41 UTC]
>     19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 
> 3,643 / Avg: 35,009)
>     19:16:04 INFO  loader     :: Add: 251,000,000 github_1_fixed.nq (Batch: 
> 3,976 / Avg: 34,473)
>     19:18:41 INFO  loader     :: Add: 251,500,000 github_1_fixed.nq (Batch: 
> 3,171 / Avg: 33,810)
>     19:21:03 INFO  loader     :: Add: 252,000,000 github_1_fixed.nq (Batch: 
> 3,512 / Avg: 33,241)
>     
>     The batch size has decreased (I don't understand exactly what the batch 
>     size refers to) and correspondingly the time for half a million triples to 
>     be processed has gone from 5s to 2.5 minutes.
>     
>     This is the tail of the log currently:
>     
>     03:01:16 INFO  loader     :: Add: 305,500,000 github_1_fixed.nq (Batch: 
> 1,468 / Avg: 8,680)
>     03:06:47 INFO  loader     :: Add: 306,000,000 github_1_fixed.nq (Batch: 
> 1,512 / Avg: 8,613)
>     03:11:10 INFO  loader     :: Add: 306,500,000 github_1_fixed.nq (Batch: 
> 1,897 / Avg: 8,564)
>     03:16:00 INFO  loader     :: Add: 307,000,000 github_1_fixed.nq (Batch: 
> 1,723 / Avg: 8,509)
>     03:21:36 INFO  loader     :: Add: 307,500,000 github_1_fixed.nq (Batch: 
> 1,489 / Avg: 8,444)
>     03:26:07 INFO  loader     :: Add: 308,000,000 github_1_fixed.nq (Batch: 
> 1,842 / Avg: 8,395)
>     03:31:43 INFO  loader     :: Add: 308,500,000 github_1_fixed.nq (Batch: 
> 1,491 / Avg: 8,333)
>     03:36:08 INFO  loader     :: Add: 309,000,000 github_1_fixed.nq (Batch: 
> 1,882 / Avg: 8,287)
>     03:39:37 INFO  loader     :: Add: 309,500,000 github_1_fixed.nq (Batch: 
> 2,391 / Avg: 8,254)
>     03:44:02 INFO  loader     :: Add: 310,000,000 github_1_fixed.nq (Batch: 
> 1,887 / Avg: 8,209)
>     03:44:02 INFO  loader     ::   Elapsed: 37,759.63 seconds [2020/06/22 
> 03:44:02 UTC]
>     03:47:07 INFO  loader     :: Add: 310,500,000 github_1_fixed.nq (Batch: 
> 2,711 / Avg: 8,183)
>     03:51:58 INFO  loader     :: Add: 311,000,000 github_1_fixed.nq (Batch: 
> 1,717 / Avg: 8,133)
>     
>     Batch size is down below 2k and processing time per 0.5m triples is up to 
>     around 5 minutes.
>     
>     What has puzzled me is that in the beginning, CPU utilization was high. 
>     At some point the load average was over 7 and even 8 (so 70-80% for 10 
>     cores), whereas now the load average sits near 3.5, not crossing 4 for 
>     several hours. This is based on htop. According to the GCP console 
>     dashboard, CPU utilization did indeed peak near 80%, but has been steady 
>     around 13-14% for the last few hours. I don't know how to reconcile these 
>     two metrics, but the CPU is heavily underutilized.
>     
>     Here's a flavor of the output (as I write now, when the batch size has 
>     decreased) of:
>     
>     $ iotop -ob
>     
>     
>     Current DISK READ:       2.76 G/s | Current DISK WRITE:       0.00 B/s
>       TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>     12184 be/4 isroelko  936.79 M/s    5.72 M/s  0.00 % 80.55 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12186 be/4 isroelko  961.04 M/s    7.28 M/s  0.00 % 78.52 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12185 be/4 isroelko  932.44 M/s    9.45 M/s  0.00 % 77.55 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     
>     
>     The relevant output of running the following (for 124s) - this is the 
>     last accumulated snapshot:
>     
>     $ iotop -ab
>     
>     Total DISK READ:         2.76 G/s | Total DISK WRITE:        16.87 M/s
>     Current DISK READ:       2.74 G/s | Current DISK WRITE:       0.00 B/s
>       TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
>     12184 be/4 isroelko     88.11 G    414.80 M  0.00 % 79.31 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12185 be/4 isroelko     87.82 G    498.31 M  0.00 % 78.74 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12186 be/4 isroelko     88.17 G    435.88 M  0.00 % 78.66 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12187 be/4 isroelko      3.07 G    206.02 M  0.00 %  4.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12156 be/4 isroelko     68.29 M      8.00 K  0.00 %  0.63 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12171 be/4 isroelko   1048.00 K      0.00 B  0.00 %  0.13 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12160 be/4 isroelko   1016.00 K      0.00 B  0.00 %  0.10 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12157 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.09 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12192 be/4 isroelko    388.00 K      0.00 B  0.00 %  0.09 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12162 be/4 isroelko   1704.00 K      8.00 K  0.00 %  0.06 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12181 be/4 isroelko     33.52 M      9.52 M  0.00 %  0.05 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12182 be/4 isroelko     33.29 M      9.69 M  0.00 %  0.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12177 be/4 isroelko    404.00 K      0.00 B  0.00 %  0.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12176 be/4 isroelko    140.00 K      0.00 B  0.00 %  0.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12191 be/4 isroelko    176.00 K      0.00 B  0.00 %  0.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12173 be/4 isroelko     48.00 K      0.00 B  0.00 %  0.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12175 be/4 isroelko    112.00 K      0.00 B  0.00 %  0.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12190 be/4 isroelko     44.00 K      0.00 B  0.00 %  0.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena
>     12174 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.04 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12183 be/4 isroelko     33.54 M      9.70 M  0.00 %  0.02 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12159 be/4 isroelko     84.00 K      0.00 B  0.00 %  0.00 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     12170 be/4 isroelko      0.00 B    100.00 K  0.00 %  0.00 % java 
> -Dlog4j.configuration=file:/usr/local/apache-jena-
>     
>     
>     
>     And now the output of pidstat for PID 12155 - the process running all the 
>     threads:
>     $ pidstat -p 12155 -ut
>     
>     
>     Linux 4.19.0-9-cloud-amd64 (graph4codev1)   06/22/20    _x86_64_    (10 
> CPU)
>     
>     04:38:32      UID      TGID       TID    %usr %system  %guest   %wait    
> %CPU   CPU  Command
>     04:38:32     1001     12155         -   44.41   47.13    0.00    0.00   
> 91.54     4  java
>     04:38:32     1001         -     12155    0.00    0.00    0.00    0.00    
> 0.00     4  |__java
>     04:38:32     1001         -     12156    4.88    0.11    0.00    0.00    
> 4.99     7  |__java
>     04:38:32     1001         -     12157    0.52    0.00    0.00    0.00    
> 0.52     7  |__GC Thread#0
>     04:38:32     1001         -     12158    0.00    0.00    0.00    0.00    
> 0.00     1  |__G1 Main Marker
>     04:38:32     1001         -     12159    0.27    0.00    0.00    0.00    
> 0.27     2  |__G1 Conc#0
>     04:38:32     1001         -     12160    0.34    0.00    0.00    0.01    
> 0.34     8  |__G1 Refine#0
>     04:38:32     1001         -     12161    0.28    0.00    0.00    0.00    
> 0.28     6  |__G1 Young RemSet
>     04:38:32     1001         -     12162    0.02    0.01    0.00    0.00    
> 0.03     1  |__VM Thread
>     04:38:32     1001         -     12163    0.00    0.00    0.00    0.00    
> 0.00     4  |__Reference Handl
>     04:38:32     1001         -     12164    0.00    0.00    0.00    0.00    
> 0.00     7  |__Finalizer
>     04:38:32     1001         -     12165    0.00    0.00    0.00    0.00    
> 0.00     6  |__Signal Dispatch
>     04:38:32     1001         -     12166    0.04    0.00    0.00    0.00    
> 0.04     1  |__C2 CompilerThre
>     04:38:32     1001         -     12167    0.00    0.00    0.00    0.00    
> 0.00     1  |__C1 CompilerThre
>     04:38:32     1001         -     12168    0.00    0.00    0.00    0.00    
> 0.00     1  |__Sweeper thread
>     04:38:32     1001         -     12169    0.00    0.00    0.00    0.00    
> 0.00     1  |__Service Thread
>     04:38:32     1001         -     12170    0.03    0.02    0.00    0.01    
> 0.05     1  |__VM Periodic Tas
>     04:38:32     1001         -     12171    0.00    0.00    0.00    0.00    
> 0.00     6  |__Common-Cleaner
>     04:38:32     1001         -     12173    0.51    0.00    0.00    0.00    
> 0.51     8  |__GC Thread#1
>     04:38:32     1001         -     12174    0.51    0.00    0.00    0.00    
> 0.51     3  |__GC Thread#2
>     04:38:32     1001         -     12175    0.50    0.00    0.00    0.00    
> 0.50     1  |__GC Thread#3
>     04:38:32     1001         -     12176    0.51    0.00    0.00    0.00    
> 0.52     5  |__GC Thread#4
>     04:38:32     1001         -     12177    0.51    0.00    0.00    0.00    
> 0.51     2  |__GC Thread#5
>     04:38:32     1001         -     12178    0.00    0.00    0.00    0.00    
> 0.00     5  |__Thread-0
>     04:38:32     1001         -     12179    0.00    0.00    0.00    0.00    
> 0.00     6  |__Thread-1
>     04:38:32     1001         -     12180    0.00    0.00    0.00    0.00    
> 0.00     7  |__Thread-2
>     04:38:32     1001         -     12181    4.33    0.09    0.00    0.00    
> 4.42     2  |__Thread-3
>     04:38:32     1001         -     12182    5.37    0.17    0.00    0.00    
> 5.54     3  |__Thread-4
>     04:38:32     1001         -     12183    5.12    0.17    0.00    0.00    
> 5.29     4  |__Thread-5
>     04:38:32     1001         -     12184    4.44   14.90    0.00    0.13   
> 19.34     3  |__Thread-6
>     04:38:32     1001         -     12185    5.53   15.30    0.00    0.14   
> 20.83     2  |__Thread-7
>     04:38:32     1001         -     12186    5.08   15.00    0.00    0.14   
> 20.07     9  |__Thread-8
>     04:38:32     1001         -     12187    3.57    1.51    0.00    0.02    
> 5.08     0  |__Thread-9
>     04:38:32     1001         -     12190    0.52    0.00    0.00    0.00    
> 0.52     6  |__GC Thread#6
>     04:38:32     1001         -     12191    0.51    0.00    0.00    0.00    
> 0.51     0  |__GC Thread#7
>     04:38:32     1001         -     12192    0.50    0.00    0.00    0.00    
> 0.51     4  |__GC Thread#8
>     04:38:32     1001         -     12193    0.02    0.00    0.00    0.00    
> 0.02     5  |__G1 Refine#1
>     04:38:32     1001         -     12194    0.01    0.00    0.00    0.00    
> 0.01     9  |__G1 Refine#2
>     04:38:32     1001         -     12195    0.01    0.00    0.00    0.00    
> 0.01     0  |__G1 Refine#3
>     04:38:32     1001         -     12196    0.01    0.00    0.00    0.00    
> 0.01     7  |__G1 Refine#4
>     04:38:32     1001         -     12197    0.00    0.00    0.00    0.00    
> 0.00     1  |__G1 Refine#5
>     04:38:32     1001         -     12198    0.00    0.00    0.00    0.00    
> 0.00     1  |__G1 Refine#6
>     04:38:32     1001         -     12199    0.00    0.00    0.00    0.00    
> 0.00     6  |__G1 Refine#7
>     04:38:32     1001         -     12200    0.00    0.00    0.00    0.00    
> 0.00     6  |__G1 Refine#8
>     04:38:32     1001         -     12205    0.27    0.00    0.00    0.00    
> 0.27     9  |__G1 Conc#1
>     
>     
>     You can see that threads 12184, 12185 and 12186 have the highest CPU 
>     utilization, albeit still only around 20%. They are the same threads seen 
>     in the iotop output, mostly doing a lot of reading. As the iotop output 
>     shows, the read rate is over 900 M/s per thread - which is excellent. All 
>     the other threads are idle the overwhelming majority of the time.
>     
>     Finally, htop gives a similar picture: the 3 threads have the highest CPU 
>     utilization (around 20%) now, and so 7 of the 10 cores are practically 
>     idle, with short spikes when the other threads pick up - which isn't much, 
>     as the above shows. Also, those 3 threads are either in R or D status 
>     (rarely in S), while the rest spend most of the time in S.
>     
>     I didn't understand what you wrote about the heap size, RAM and read 
>     threads - I don't use Java, so I'm flying by the seat of my pants. 
>     Here is the output of:
>     
>     $ jstat -gc 12155
>     
>     S0C    S1C       S0U   S1U       EC         EU        OC         OU        MC       MU       CCSC    CCSU    YGC   YGCT      FGC  FGCT   CGC  CGCT   GCT
>     0.0    184320.0  0.0   184320.0  2404352.0  102400.0  1519616.0  642119.0  20224.0  19386.6  2560.0  2339.3  2974  1039.542  0    0.000  340  2.989  1042.531
>     
>     which may be relevant. 
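>     
>     If I'm reading the -gc columns right, adding up the capacity columns 
>     (S0C + S1C + EC + OC ≈ 4,108,288 KB ≈ 3.9 GiB) suggests the heap is capped 
>     around 4 GB - which would match the JVM's usual default of 1/4 of physical 
>     RAM on this 16 GB machine - but I may well be misreading it.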
>     
>     My very crudely formulated question is: how come, with smaller batch 
>     sizes, the processing time has increased so much, while at the same time 
>     there is actually much less processing going on (much lower CPU 
>     utilization)? It seems like everything is waiting for whatever those 3 
>     threads are doing. 
>     And is there any way to address it? 
>     
>     thanks for your help!
>     
>     best,
>     
>     Isroel
>     
>     On 2020/06/21 21:38:50, Andy Seaborne <a...@apache.org> wrote: 
>     >  > And when it says "nq" is really quads or all data for the default
>     >  > graph? (there is more indexing work for named graphs).
>     > 
>     >  >> : Quads = 127,206,280
>     > 
>     > 
>     > OK - it's quads. There are 6 quad indexes and in full parallel mode it 
>     > will use 2 more threads to parse and to build the node table.
>     > 
>     > Full parallel loading is going to use up all the cores and HT threads 
>     > aren't full threads for this purpose.
>     > 
>     > The phased loader (default) uses fewer threads.
>     > 
>     > Phase 1:
>     > one thread to decompress and parse
>     > one thread to build the node table.
>     > one thread for the GSPO index
>     > (and one for SPO but you seem to have no triples)
>     > =3
>     > 
>     > Phase 2:
>     > two threads
>     > =2
>     > 
>     > Phase 3:
>     > three threads
>     > =3
>     > 
>     >      Andy
>     > 
>     > 
>     > On 21/06/2020 22:11, Andy Seaborne wrote:
>     > > Hi there,
>     > > 
>     > > Thanks for reporting the findings.
>     > > 
>     > > On 20/06/2020 16:10, Isroel Kogan wrote:
>     > >> Hi,
>     > >>
>     > >> I am also a newcomer to the RDF world - and particularly Jena, which 
> I 
>     > >> started using this week.
>     > >>
>     > >> A couple of observations I have made over the last few days 
> exploring 
>     > >> different options.
>     > >>
>     > >> Local Machine (specs):
>     > >>
>     > >> Ubuntu 18.04
>     > >> Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
>     > > 
>     > > which is 4 cores and hyperthreading; for this workload that is more 
>     > > like 4 threads. HT is not a complete x2 for this sort of continuous 
>     > > processing.
>     > > 
>     > > And pre-emptive timeslicing is not nice!
>     > > 
>     > >> 16GB RAM
>     > >> 512 SSD (NVMe).
>     > >>
>     > >>
>     > >> the following compares loading a file in compressed vs decompressed 
>     > >> format -both w parallel loader.
>     > >>
>     > >> file:
>     > >> docstrings_triples.nq
>     > >> size: 28GB
>     > >>
>     > >> cmd:
>     > >> time tdb2.tdbloader --loader=parallel --loc=test1graphdb 
>     > >> docstrings_triples.nq > tdb2.log1 2>&1
>     > >>
>     > >> :: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
>     > >>
>     > >> real    22m46.346s
>     > >> user    120m46.591s
>     > >> sys    3m22.698s
>     > >>
>     > >>
>     > >> file:
>     > >> docstrings_triples.nq.bz2
>     > >> size: 542M
>     > >>
>     > >> cmd:
>     > >>
>     > >> time tdb2.tdbloader --loader=parallel --loc=test2graphdb 
>     > >> docstrings_triples.nq.bz2 > tdb2.log2 2>&1
>     > >>
>     > >> :: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
>     > >>
>     > >>
>     > >> real    37m8.182s
>     > >> user    109m42.970s
>     > >> sys    6m27.426s
>     > >>
>     > >> resulting DB size
>     > >> 30GB
>     > >>
>     > >> confirmed equal via diff.
>     > >>
>     > >> pbzip2 ran in 84s
>     > >>
>     > >> Less rigorously I noticed a similar gain in speed for other files.
>     > > 
>     > > For gz files, the difference in loading speed between compressed and 
>     > > uncompressed is usually not very much.  It does look like bz2 is slower.
>     > > 
>     > > Using a separate process and faster decompressor may help:
>     > > 
>     > > bzip2 -d < docstrings_triples.nq.bz2 | \
>     > > time tdb2.tdbloader --loader=parallel --loc=test2graphdb \
>     > >      -- - > tdb2.log2 2>&1
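>     > > 
>     > > (Presumably a parallel decompressor such as the pbzip2 mentioned above 
>     > > can be used the same way, e.g. something like:
>     > > 
>     > > pbzip2 -dc docstrings_triples.nq.bz2 | \
>     > > tdb2.tdbloader --loader=parallel --loc=test2graphdb -- - > tdb2.log2 2>&1 )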
>     > > 
>     > > When Jena decompresses a bz2 file, it uses Apache Commons Compress, so 
>     > > it is a Java decompressor, which will take time to get optimized by the 
>     > > JIT and is likely slower than a specialized tool like bzip2.
>     > > 
>     > > But with 4 cores, it may have the opposite effect - using more 
>     > > processes causes pre-emptive timeslicing.
>     > > 
>     > > It may be that one of the other loaders is faster because it is a 
>     > > better match for the hardware.
>     > > 
>     > >> Is this expected behaviour? What factors influence this?
>     > >>
>     > >> SSD - local vs cloud.
>     > >>
>     > >> on my local machine, when running the parallel loader, cores were 
>     > >> working at over 70% capacity and there was little IO-induced downtime.
>     > > 
>     > > How many core were active?
>     > > And when it says "nq" is really quads or all data for the default 
> graph? 
>     > > (there is more indexing work for named graphs).
>     > > 
>     > > Some of that will be the bz2 decompression, but it looks to me like 
>     > > it's "more threads than cores" causing the timeslicing.
>     > > 
>     > >>
>     > >> GCP instance specs:
>     > >>
>     > >> 20 CPU
>     > >> 32GB RAM
>     > > 
>     > > And same heap size?
>     > > 
>     > > While the parallel loader is using multiple threads, it is a fixed 
>     > > number, so more CPUs will only help up to that number of threads.
>     > > 
>     > > More RAM is going to help because the OS will use it for file system 
>     > > cache, delaying writes.
>     > > 
>     > > But with more read threads, it could be there is less preemptive 
>     > > scheduling and that could be a big gain.
>     > > 
>     > >> 6TB "local SSD" storage
>     > >> the local SSD storage offers the best performance to reduce IO 
> latency 
>     > >> - it has physical proximity to instance - as per GCP.
>     > >>
>     > >> a few cores were working at near capacity, while the vast majority 
>     > >> were idle (near 0%) with occasional spikes. The average load translates 
>     > >> to 20% utilization. As I've seen others write here, this is a 
>     > >> difference others have noted.
>     > >> How can this be addressed? Buffer size? (I don't have a deep enough 
>     > >> understanding.)
>     > > 
>     > > My guess is that on the GCP instance it is one thread-one core.
>     > > 
>     > >>
>     > >>
>     > >> Another recurring pattern is the reduction in batch size.
>     > >> I've been running a load job on my GCP instance for almost a day 
>     > >> (23+h).
>     > >>
>     > >> file size: 93GB
>     > >> triples: 472m
>     > >>
>     > >> batch size decreased from 160k range to under 1k, while processing 
>     > >> time per batch increased from a few seconds to over 10 min. All this 
>     > >> time average CPU usage has remained steady, as has RAM usage.
>     > > 
>     > > Not sure I quite understand - this is adding more data to an existing 
>     > > database? And 10mins for 1k? While it will be slower, that does sound 
>     > > rather extreme.
>     > > 
>     > >>
>     > >> I don't understand how all of this works with indexing. Is this 
>     > >> expected behaviour? Besides a locally proximate SSD, I've thrown an 
>     > >> overkill of hardware at it.
>     > >>
>     > >> thanks
>     > >>
>     > > 
>     > >      Andy
>     > > 
>     > 
>     
> 
> 
> 
> 
> 
