Isroel,

I think there might be a fundamental misunderstanding happening about batch
sizes here. The batch size is fixed for a run and never changes; the "batch
size" you refer to is a speed calculation, e.g.:

    19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)
    19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 4,322 / Avg: 37,443)
    19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 3,649 / Avg: 36,759)
    19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 4,868 / Avg: 36,283)
    19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 3,532 / Avg: 35,622)
    19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 19:11:41 UTC]
    19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 3,643 / Avg: 35,009)

Note that each batch is 500,000 quads, as the number after "Add:" increases by
500,000 each time. The "Batch" figure in brackets is the calculated loading
speed (quads/second) for the current batch, while "Avg" is the calculated
loading speed over the entire load.
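To make the arithmetic concrete: the per-batch figure is just 500,000 divided by the wall-clock time between two consecutive "Add:" lines. Taking the gap from 19:09:19 to 19:11:41 (142 seconds) as an example:

```shell
# Per-batch speed ~= 500,000 quads / seconds between consecutive "Add:" lines.
# Example: 19:09:19 -> 19:11:41 is 142 seconds.
awk 'BEGIN { printf "%d quads/s\n", 500000 / 142 }'
```

which is close to the reported "Batch: 3,532" (the loader's own timing is finer-grained than the one-second log timestamps).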

So yes, the speed is decreasing over time. This is a commonly reported issue on
these lists, but there is no simple fix because it depends on both hardware and
data. The output you have provided suggests that you are stuck on IO. Andy is
the primary developer of TDB2, so he may be able to shed more light on what
might be going on.

Rob

On 22/06/2020, 06:32, "Isroel Kogan" <isroelko...@gmail.com> wrote:

    Hi Andy - thanks for your comments.
    
    Instead of responding point by point, it's best if I present a clearer
picture, as I now have a better understanding of the factors involved.
    
    
    GCP instance stats:
    $ lscpu
    
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    Address sizes:       46 bits physical, 48 bits virtual
    CPU(s):              10
    On-line CPU(s) list: 0-9
    Thread(s) per core:  2
    Core(s) per socket:  5
    Socket(s):           1
    NUMA node(s):        1
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               63
    Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
    Stepping:            0
    CPU MHz:             2300.000
    BogoMIPS:            4600.00
    Hypervisor vendor:   KVM
    Virtualization type: full
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            256K
    L3 cache:            46080K
    NUMA node0 CPU(s):   0-9
    
    16GB RAM
    (I configured less RAM because on a prior iteration, out of 32GB, only 8GB
was being used)
    
    3TB local SSD
    
    According to Google Cloud, max performance for this disk size is as follows:

    Storage space   Partitions   IOPS (Read)   IOPS (Write)   Throughput Read (MB/s)   Throughput Write (MB/s)
    3 TB            8            680,000       360,000        2,650                    1,400
    
    https://cloud.google.com/compute/docs/disks
    
    I'm not getting that, but performance is an order of magnitude or more
better than on my machine, as you'll see below.
    
    
    
    My data is as follows: 15 files of varying size (based on the original bz2
files provided by the authors). Files range from 2GB (a few million triples) to
129GB; 4 files have close to half a billion triples each.
    The files are .nq, i.e., as you noted, quads (and unzipped, as already
mentioned).
    
    Currently I have a process running, initiated by the tdb2.tdbloader command
in parallel mode; the DB was empty initially.
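    For reference, the run was started with something along these lines (the database location and file list below are placeholders, not the exact command):

```shell
# Placeholder paths; the actual run loads all 15 .nq files into an empty DB.
tdb2.tdbloader --loader=parallel --loc=/mnt/disks/ssd/DB *.nq > load.log 2>&1
```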
    
    I will include some information here that may help you get a sense of the
performance issues I alluded to.
    
    
    beginning of log output:
    
    
    17:14:42 INFO  loader     :: Loader = LoaderParallel
    17:14:42 INFO  loader     :: Start: 15 files
    17:14:49 INFO  loader     :: Add: 500,000 ai_stackexchange_triples.nq (Batch: 77,339 / Avg: 77,339)
    17:14:55 INFO  loader     :: Add: 1,000,000 ai_stackexchange_triples.nq (Batch: 89,285 / Avg: 82,884)
    17:15:00 INFO  loader     :: Add: 1,500,000 ai_stackexchange_triples.nq (Batch: 90,090 / Avg: 85,154)
    17:15:06 INFO  loader     :: Add: 2,000,000 ai_stackexchange_triples.nq (Batch: 87,382 / Avg: 85,700)
    17:15:12 INFO  loader     :: Add: 2,500,000 ai_stackexchange_triples.nq (Batch: 82,304 / Avg: 84,999)
    17:15:18 INFO  loader     :: Add: 3,000,000 ai_stackexchange_triples.nq (Batch: 87,032 / Avg: 85,331)
    17:15:22 INFO  loader     ::   End file: ai_stackexchange_triples.nq (triples/quads = 3,402,753)
    17:15:23 INFO  loader     :: Add: 3,500,000 datascience_stackexchange_triples.nq (Batch: 89,493 / Avg: 85,902)
    17:15:29 INFO  loader     :: Add: 4,000,000 datascience_stackexchange_triples.nq (Batch: 86,475 / Avg: 85,973)
    17:15:35 INFO  loader     :: Add: 4,500,000 datascience_stackexchange_triples.nq (Batch: 88,136 / Avg: 86,208)
    17:15:40 INFO  loader     :: Add: 5,000,000 datascience_stackexchange_triples.nq (Batch: 93,562 / Avg: 86,891)
    
    
    The batch figure increases for the next file, but the third file is one of
those with nearly half a billion triples, and the batch figure keeps steadily
decreasing.

    Here is the output at the 250M-triple mark:
    
    19:03:24 INFO  loader     :: Add: 248,000,000 github_1_fixed.nq (Batch: 3,562 / Avg: 38,030)
    19:05:19 INFO  loader     :: Add: 248,500,000 github_1_fixed.nq (Batch: 4,322 / Avg: 37,443)
    19:07:36 INFO  loader     :: Add: 249,000,000 github_1_fixed.nq (Batch: 3,649 / Avg: 36,759)
    19:09:19 INFO  loader     :: Add: 249,500,000 github_1_fixed.nq (Batch: 4,868 / Avg: 36,283)
    19:11:41 INFO  loader     :: Add: 250,000,000 github_1_fixed.nq (Batch: 3,532 / Avg: 35,622)
    19:11:41 INFO  loader     ::   Elapsed: 7,017.95 seconds [2020/06/21 19:11:41 UTC]
    19:13:58 INFO  loader     :: Add: 250,500,000 github_1_fixed.nq (Batch: 3,643 / Avg: 35,009)
    19:16:04 INFO  loader     :: Add: 251,000,000 github_1_fixed.nq (Batch: 3,976 / Avg: 34,473)
    19:18:41 INFO  loader     :: Add: 251,500,000 github_1_fixed.nq (Batch: 3,171 / Avg: 33,810)
    19:21:03 INFO  loader     :: Add: 252,000,000 github_1_fixed.nq (Batch: 3,512 / Avg: 33,241)
    
    The batch figure has decreased (I don't understand exactly what the batch
size refers to), and correspondingly the time to process half a million triples
has gone from ~5 s to ~2.5 min.
    
    This is the tail of the log currently:
    
    03:01:16 INFO  loader     :: Add: 305,500,000 github_1_fixed.nq (Batch: 1,468 / Avg: 8,680)
    03:06:47 INFO  loader     :: Add: 306,000,000 github_1_fixed.nq (Batch: 1,512 / Avg: 8,613)
    03:11:10 INFO  loader     :: Add: 306,500,000 github_1_fixed.nq (Batch: 1,897 / Avg: 8,564)
    03:16:00 INFO  loader     :: Add: 307,000,000 github_1_fixed.nq (Batch: 1,723 / Avg: 8,509)
    03:21:36 INFO  loader     :: Add: 307,500,000 github_1_fixed.nq (Batch: 1,489 / Avg: 8,444)
    03:26:07 INFO  loader     :: Add: 308,000,000 github_1_fixed.nq (Batch: 1,842 / Avg: 8,395)
    03:31:43 INFO  loader     :: Add: 308,500,000 github_1_fixed.nq (Batch: 1,491 / Avg: 8,333)
    03:36:08 INFO  loader     :: Add: 309,000,000 github_1_fixed.nq (Batch: 1,882 / Avg: 8,287)
    03:39:37 INFO  loader     :: Add: 309,500,000 github_1_fixed.nq (Batch: 2,391 / Avg: 8,254)
    03:44:02 INFO  loader     :: Add: 310,000,000 github_1_fixed.nq (Batch: 1,887 / Avg: 8,209)
    03:44:02 INFO  loader     ::   Elapsed: 37,759.63 seconds [2020/06/22 03:44:02 UTC]
    03:47:07 INFO  loader     :: Add: 310,500,000 github_1_fixed.nq (Batch: 2,711 / Avg: 8,183)
    03:51:58 INFO  loader     :: Add: 311,000,000 github_1_fixed.nq (Batch: 1,717 / Avg: 8,133)
    
    The batch figure is down below 2k, and processing time per 0.5M triples is
up to around 5 min.
    
    What has puzzled me is that in the beginning, CPU utilization was high: at
some point the load average was over 7, even 8 (so 70-80% for 10 cores),
whereas now the load average sits near 3.5 and hasn't crossed 4 for several
hours. That is based on htop; according to the GCP console dashboard, CPU
utilization did indeed peak near 80% but has been steady around 13-14% for the
last few hours. I don't know how to reconcile these two metrics, but either way
the CPU is heavily underutilized.
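    For what it's worth, dividing the load average by the core count is how I get the rough utilization figure above; note that (as I understand it) the Linux load average also counts threads blocked in uninterruptible IO wait (D state), which may be part of the htop/GCP discrepancy. A quick sanity check of the arithmetic:

```shell
# Load average 3.5 on 10 cores ~= 35% of total capacity.
# (Approximate: D-state threads inflate load average but not CPU%.)
awk 'BEGIN { printf "%.0f%%\n", 3.5 / 10 * 100 }'
```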
    
    Here's a flavor of the output (as I write now, when the batch figure has
decreased) of
    
    $ iotop -ob
    
    
    Current DISK READ:       2.76 G/s | Current DISK WRITE:       0.00 B/s
      TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    12184 be/4 isroelko  936.79 M/s    5.72 M/s  0.00 % 80.55 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12186 be/4 isroelko  961.04 M/s    7.28 M/s  0.00 % 78.52 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12185 be/4 isroelko  932.44 M/s    9.45 M/s  0.00 % 77.55 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    
    
    The relevant output of running the following (for 124 s) - this is the last accumulation:
    
    $ iotop -ab
    
    Total DISK READ:         2.76 G/s | Total DISK WRITE:        16.87 M/s
    Current DISK READ:       2.74 G/s | Current DISK WRITE:       0.00 B/s
      TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN      IO    COMMAND
    12184 be/4 isroelko     88.11 G    414.80 M  0.00 % 79.31 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12185 be/4 isroelko     87.82 G    498.31 M  0.00 % 78.74 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12186 be/4 isroelko     88.17 G    435.88 M  0.00 % 78.66 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12187 be/4 isroelko      3.07 G    206.02 M  0.00 %  4.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12156 be/4 isroelko     68.29 M      8.00 K  0.00 %  0.63 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12171 be/4 isroelko   1048.00 K      0.00 B  0.00 %  0.13 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12160 be/4 isroelko   1016.00 K      0.00 B  0.00 %  0.10 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12157 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.09 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12192 be/4 isroelko    388.00 K      0.00 B  0.00 %  0.09 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12162 be/4 isroelko   1704.00 K      8.00 K  0.00 %  0.06 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12181 be/4 isroelko     33.52 M      9.52 M  0.00 %  0.05 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12182 be/4 isroelko     33.29 M      9.69 M  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12177 be/4 isroelko    404.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12176 be/4 isroelko    140.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12191 be/4 isroelko    176.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12173 be/4 isroelko     48.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12175 be/4 isroelko    112.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12190 be/4 isroelko     44.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena
    12174 be/4 isroelko    248.00 K      0.00 B  0.00 %  0.04 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12183 be/4 isroelko     33.54 M      9.70 M  0.00 %  0.02 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12159 be/4 isroelko     84.00 K      0.00 B  0.00 %  0.00 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    12170 be/4 isroelko      0.00 B    100.00 K  0.00 %  0.00 % java -Dlog4j.configuration=file:/usr/local/apache-jena-
    
    
    
    And now the output for PID 12155, the process running all the threads:
    $ pidstat -p 12155 -ut
    
    
    Linux 4.19.0-9-cloud-amd64 (graph4codev1)   06/22/20    _x86_64_    (10 CPU)
    
    04:38:32      UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
    04:38:32     1001     12155         -   44.41   47.13    0.00    0.00   91.54     4  java
    04:38:32     1001         -     12155    0.00    0.00    0.00    0.00    0.00     4  |__java
    04:38:32     1001         -     12156    4.88    0.11    0.00    0.00    4.99     7  |__java
    04:38:32     1001         -     12157    0.52    0.00    0.00    0.00    0.52     7  |__GC Thread#0
    04:38:32     1001         -     12158    0.00    0.00    0.00    0.00    0.00     1  |__G1 Main Marker
    04:38:32     1001         -     12159    0.27    0.00    0.00    0.00    0.27     2  |__G1 Conc#0
    04:38:32     1001         -     12160    0.34    0.00    0.00    0.01    0.34     8  |__G1 Refine#0
    04:38:32     1001         -     12161    0.28    0.00    0.00    0.00    0.28     6  |__G1 Young RemSet
    04:38:32     1001         -     12162    0.02    0.01    0.00    0.00    0.03     1  |__VM Thread
    04:38:32     1001         -     12163    0.00    0.00    0.00    0.00    0.00     4  |__Reference Handl
    04:38:32     1001         -     12164    0.00    0.00    0.00    0.00    0.00     7  |__Finalizer
    04:38:32     1001         -     12165    0.00    0.00    0.00    0.00    0.00     6  |__Signal Dispatch
    04:38:32     1001         -     12166    0.04    0.00    0.00    0.00    0.04     1  |__C2 CompilerThre
    04:38:32     1001         -     12167    0.00    0.00    0.00    0.00    0.00     1  |__C1 CompilerThre
    04:38:32     1001         -     12168    0.00    0.00    0.00    0.00    0.00     1  |__Sweeper thread
    04:38:32     1001         -     12169    0.00    0.00    0.00    0.00    0.00     1  |__Service Thread
    04:38:32     1001         -     12170    0.03    0.02    0.00    0.01    0.05     1  |__VM Periodic Tas
    04:38:32     1001         -     12171    0.00    0.00    0.00    0.00    0.00     6  |__Common-Cleaner
    04:38:32     1001         -     12173    0.51    0.00    0.00    0.00    0.51     8  |__GC Thread#1
    04:38:32     1001         -     12174    0.51    0.00    0.00    0.00    0.51     3  |__GC Thread#2
    04:38:32     1001         -     12175    0.50    0.00    0.00    0.00    0.50     1  |__GC Thread#3
    04:38:32     1001         -     12176    0.51    0.00    0.00    0.00    0.52     5  |__GC Thread#4
    04:38:32     1001         -     12177    0.51    0.00    0.00    0.00    0.51     2  |__GC Thread#5
    04:38:32     1001         -     12178    0.00    0.00    0.00    0.00    0.00     5  |__Thread-0
    04:38:32     1001         -     12179    0.00    0.00    0.00    0.00    0.00     6  |__Thread-1
    04:38:32     1001         -     12180    0.00    0.00    0.00    0.00    0.00     7  |__Thread-2
    04:38:32     1001         -     12181    4.33    0.09    0.00    0.00    4.42     2  |__Thread-3
    04:38:32     1001         -     12182    5.37    0.17    0.00    0.00    5.54     3  |__Thread-4
    04:38:32     1001         -     12183    5.12    0.17    0.00    0.00    5.29     4  |__Thread-5
    04:38:32     1001         -     12184    4.44   14.90    0.00    0.13   19.34     3  |__Thread-6
    04:38:32     1001         -     12185    5.53   15.30    0.00    0.14   20.83     2  |__Thread-7
    04:38:32     1001         -     12186    5.08   15.00    0.00    0.14   20.07     9  |__Thread-8
    04:38:32     1001         -     12187    3.57    1.51    0.00    0.02    5.08     0  |__Thread-9
    04:38:32     1001         -     12190    0.52    0.00    0.00    0.00    0.52     6  |__GC Thread#6
    04:38:32     1001         -     12191    0.51    0.00    0.00    0.00    0.51     0  |__GC Thread#7
    04:38:32     1001         -     12192    0.50    0.00    0.00    0.00    0.51     4  |__GC Thread#8
    04:38:32     1001         -     12193    0.02    0.00    0.00    0.00    0.02     5  |__G1 Refine#1
    04:38:32     1001         -     12194    0.01    0.00    0.00    0.00    0.01     9  |__G1 Refine#2
    04:38:32     1001         -     12195    0.01    0.00    0.00    0.00    0.01     0  |__G1 Refine#3
    04:38:32     1001         -     12196    0.01    0.00    0.00    0.00    0.01     7  |__G1 Refine#4
    04:38:32     1001         -     12197    0.00    0.00    0.00    0.00    0.00     1  |__G1 Refine#5
    04:38:32     1001         -     12198    0.00    0.00    0.00    0.00    0.00     1  |__G1 Refine#6
    04:38:32     1001         -     12199    0.00    0.00    0.00    0.00    0.00     6  |__G1 Refine#7
    04:38:32     1001         -     12200    0.00    0.00    0.00    0.00    0.00     6  |__G1 Refine#8
    04:38:32     1001         -     12205    0.27    0.00    0.00    0.00    0.27     9  |__G1 Conc#1
    
    
    You can see that threads 12184, 12185, and 12186 have the highest CPU
utilization, albeit still only around 20%. They are the same threads as in the
iotop output, mostly doing a lot of reading. As the iotop output shows, the
read rate is over 900 M/s, which is excellent. All the other threads are idle
the overwhelming majority of the time.
    
    Finally, htop gives a similar picture: those 3 threads have the highest CPU
utilization (around 20%) now, so 7/10 cores are practically idle, with short
spikes when the other threads pick up, which isn't much, as the above shows.
Also, those 3 threads are usually in R or D status (rarely S), while the rest
spend most of their time in S.
    
    I didn't understand what you wrote about the heap size, RAM, and read
threads. I don't use Java, so I'm flying by the seat of my pants.
    Here is the output of 
    
    $ jstat -gc 12155

     S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
     0.0   184320.0  0.0   184320.0 2404352.0 102400.0 1519616.0   642119.0  20224.0 19386.6 2560.0 2339.3   2974 1039.542   0      0.000  340     2.989 1042.531
    
    which may be relevant. 
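    (I gather from the wrapper scripts that JVM options such as the heap size are picked up from the JVM_ARGS environment variable; if that's right, something like the following would raise it. The location, file name, and 8G value below are just placeholders, not what I actually ran.)

```shell
# Assumed: the tdb2.tdbloader wrapper script reads JVM options from JVM_ARGS.
# Placeholder database location and data file; adjust to the real run.
JVM_ARGS="-Xmx8G" tdb2.tdbloader --loader=parallel --loc=/path/to/DB data.nq
```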
    
    My very crudely formulated question is: how come, as the batch figures
shrink, the processing time has increased so much, while at the same time there
is actually much less processing going on (much lower CPU utilization)? It
seems like everything is waiting for whatever those 3 threads are doing.
    And is there any way to address it?
    
    thanks for your help!
    
    best,
    
    Isroel
    
    On 2020/06/21 21:38:50, Andy Seaborne <a...@apache.org> wrote: 
    >  > And when it says "nq" is really quads or all data for the default
    >  > graph? (there is more indexing work for named graphs).
    > 
    >  >> : Quads = 127,206,280
    > 
    > 
    > OK - it's quads. There are 6 quad indexes and in full parallel mode it 
    > will use 2 more threads to parse and to build the node table.
    > 
    > Full parallel loading is going to use up all the cores and HT threads 
    > aren't full threads for this purpose.
    > 
    > The phased loader (default) uses fewer threads.
    > 
    > Phase 1:
    > one thread to decompress and parse
    > one thread to build the node table.
    > one thread for the GSPO
    > (and one for SPO but you seem to have no triples)
    > =3
    > 
    > Phase 2:
    > two threads
    > =2
    > 
    > Phase 3:
    > three threads
    > =3
    > 
    >      Andy
    > 
    > 
    > On 21/06/2020 22:11, Andy Seaborne wrote:
    > > Hi there,
    > > 
    > > Thanks for reporting the findings.
    > > 
    > > On 20/06/2020 16:10, Isroel Kogan wrote:
    > >> Hi,
    > >>
    > >> I am also a newcomer to the RDF world - and particularly Jena, which I 
    > >> started using this week.
    > >>
    > >> A couple of observations I have made over the last few days exploring 
    > >> different options.
    > >>
    > >> Local Machine (specs):
    > >>
    > >> Ubuntu 18.04
    > >> Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
    > > 
    > > which is 4 cores with hyperthreading; for this workload that is more like
    > > 4 threads. HT is not a complete x2 for this sort of continuous
    > > processing.
    > > 
    > > And pre-emptive timeslicing is not nice!
    > > 
    > >> 16GB RAM
    > >> 512 SSD (NVMe).
    > >>
    > >>
    > >> the following compares loading a file in compressed vs decompressed 
    > >> format -both w parallel loader.
    > >>
    > >> file:
    > >> docstrings_triples.nq
    > >> size: 28GB
    > >>
    > >> cmd:
    > >> time tdb2.tdbloader --loader=parallel --loc=test1graphdb 
    > >> docstrings_triples.nq > tdb2.log1 2>&1
    > >>
    > >> :: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
    > >>
    > >> real    22m46.346s
    > >> user    120m46.591s
    > >> sys    3m22.698s
    > >>
    > >>
    > >> file:
    > >> docstrings_triples.nq.bz2
    > >> size: 542M
    > >>
    > >> cmd:
    > >>
    > >> time tdb2.tdbloader --loader=parallel --loc=test2graphdb 
    > >> docstrings_triples.nq.bz2 > tdb2.log2 2>&1
    > >>
    > >> :: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
    > >>
    > >>
    > >> real    37m8.182s
    > >> user    109m42.970s
    > >> sys    6m27.426s
    > >>
    > >> resulting DB size
    > >> 30GB
    > >>
    > >> confirmed equal via diff.
    > >>
    > >> pbzip2 ran in 84s
    > >>
    > >> Less rigorously I noticed a similar gain in speed for other files.
    > > 
    > > For gz files, the speed of loading compressed vs uncompressed is
    > > usually not very different. It does look like bz2 is what's slow here.
    > > 
    > > Using a separate process and faster decompressor may help:
    > > 
    > > bzip2 -d < docstrings_triples.nq.bz2 | \
    > > time tdb2.tdbloader --loader=parallel --loc=test2graphdb \
    > >      -- - > tdb2.log2 2>&1
    > > 
    > > When Jena decompresses a bz2 file, it uses Apache Commons Compress, so
    > > it is a Java decompressor, which will take time to get optimized by the
    > > JIT and is likely slower than a specialized tool like bzip2.
    > > 
    > > But with 4 cores, it may have the opposite effect: using more processes
    > > causes preemptive timeslicing.
    > > 
    > > It may be that one of the other loaders is faster because it is a
    > > better match for the hardware.
    > > 
    > >> Is this expected behaviour? What factors influence this?
    > >>
    > >> SSD - local vs cloud.
    > >>
    > >> on my local machine, when running parallel loader, cores were working 
    > >> at over 70% capacity and there was little IO induced down time.
    > > 
    > > How many core were active?
    > > And when it says "nq" is really quads or all data for the default 
graph? 
    > > (there is more indexing work for named graphs).
    > > 
    > > Some of that will be the bz2 decompression, but it looks to me like
    > > it's "more threads than cores" causing timeslicing.
    > > 
    > >>
    > >> GCP instance specs:
    > >>
    > >> 20 CPU
    > >> 32GB RAM
    > > 
    > > And same heap size?
    > > 
    > > While the parallel loader is using multiple threads, it is a fixed
    > > number, so more CPUs will help only up to that number of threads.
    > > 
    > > More RAM is going to help because the OS will use it for file system 
    > > cache, delaying writes.
    > > 
    > > But with more read threads, it could be there is less preemptive 
    > > scheduling and that could be a big gain.
    > > 
    > >> 6TB "local SSD" storage
    > >> the local SSD storage offers the best performance to reduce IO latency 
    > >> - it has physical proximity to instance - as per GCP.
    > >>
    > >> a few cores were working at near capacity, while the vast majority
    > >> sat idle (near 0%) with occasional spikes; the average load translates
    > >> to 20% utilization. As I've seen others write here, this is a
    > >> difference others have noted.
    > >> How can this be addressed? buffer size? (I don't have a deep enough 
    > >> understanding).
    > > 
    > > My guess is that on the GCP instance it is one thread-one core.
    > > 
    > >>
    > >>
    > >> Another recurring pattern is the reduction in batch size.
    > >> I've been running a load job on my gcp instance for almost a day 
(23+h).
    > >>
    > >> file size: 93GB
    > >> triples: 472m
    > >>
    > >> batch size decreased from 160k range to under 1k, while processing 
    > >> time per batch increased from a few seconds to over 10 min. All this 
    > >> time average CPU usage has remained steady, as has RAM usage.
    > > 
    > > Not sure I quite understand - this is adding more data to an existing 
    > > database? And 10mins for 1k? While it will be slower, that does sound 
    > > rather extreme.
    > > 
    > >>
    > >> I don't understand how all of this works with indexing. Is this 
    > >> expected behaviour? besides a locally proximate SSD, I've thrown an 
    > >> overkill of hardware at it.
    > >>
    > >> thanks
    > >>
    > > 
    > >      Andy
    > > 
    > 
    



