Re: xloader on large dataset : Data Task with poor load average

Steven Blanchard Wed, 31 May 2023 07:48:31 -0700

On Thu, May 25 2023 at 08:46:31 +0100, Andy Seaborne <[email protected]> wrote:

On 24/05/2023 10:22, Steven Blanchard wrote:
Hi Andy,
I tried it on a local disk and it had no impact on the average speed for the Data stage.
SSD or rotating disk? (It shouldn't make an extreme difference for xloader, because that's part of the point of the xloader.)

On an SSD. The average speed is the same on the SSD and on the Block Storage.

I checked with iostat: there was indeed an increase in the read speed for the input files. This step writes very little data, so there was no difference in the write speed.
I also did a test with only 1 of the uniprot files (291 million tuples) and the average speed was about 160,000 tuples/s. This value corresponds to speeds obtained on other insertions.
On the exact same hardware?

Yes, same hardware, same folder, same time. Only the quantity of data is different.

Could this decrease in average speed be related to the total amount of data? Is it possible to run this Data step file by file and all the other steps with all files?
Not sure - there is a shared node table being built. The slowness is presumably a consequence of the previous stages. Every use of the same URI needs to have the same internal NodeId everywhere - i.e. seeing all the data.

During our tests, we resumed the insertion at the Data stage and we noticed that the decrease in average speed is related to the previous steps. If we pass as an argument the existing directory, in which the Nodes and Terms steps have already been completed, the insertion speed of the Data step is 800 tuples/s. If we pass as an argument an empty directory, the insertion speed of the Data step is 190,000 tuples/s.

The decrease in speed therefore seems to be related to the amount of data and the results of the previous steps. When this step ingests data, is there an optional step that uses previously created files and could take very long because of the total amount of data? What is the link between these 3 steps? What does each of these three steps do for the data insertion?

I'm still not seeing why the data stage starts at a slow rate - I will need to find time to explore the code.
(This is an argument for having NodeIds be hashes because that can be computed without reference to the table unique ids and representation storage. Downside - the NodeIds would be longer, 96 or 128 bits and hashes have bad locality (i.e. none whatsoever)).
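The trade-off in that remark can be illustrated with a minimal Python sketch. It is not how TDB2 allocates NodeIds; `SequentialNodeTable` and `hashed_node_id` are hypothetical names, and the example URIs are made up. The sketch only contrasts an id that depends on a shared table with an id computed from the term alone.

```python
import hashlib
from itertools import count


class SequentialNodeTable:
    """Hypothetical table-allocated ids: an id depends on what was seen before."""

    def __init__(self):
        self._ids = {}
        self._next = count()

    def node_id(self, term: str) -> int:
        # A new term gets the next free number, so the table must be consulted.
        if term not in self._ids:
            self._ids[term] = next(self._next)
        return self._ids[term]


def hashed_node_id(term: str, bits: int = 128) -> int:
    """Hypothetical hash-derived id: computed from the term alone, no table lookup."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


uri = "<http://example.org/protein/P12345>"  # made-up term

# Two independent loaders only agree on an id if they share one table.
loader_a, loader_b = SequentialNodeTable(), SequentialNodeTable()
loader_a.node_id("<http://example.org/other>")  # loader_a saw another term first
same_sequential_id = loader_a.node_id(uri) == loader_b.node_id(uri)

# A hash-derived id is the same wherever it is computed.
same_hashed_id = hashed_node_id(uri) == hashed_node_id(uri)

print(same_sequential_id, same_hashed_id)
```

The hashed ids need no shared state, but they are wider (96 or 128 bits here) and terms that are close together in the data end up with unrelated ids, which is the locality downside mentioned above.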
    Andy
Thank you,

Steven
On Tue, May 23 2023 at 11:30:36 +0100, Andy Seaborne <[email protected]> wrote:
On 22/05/2023 16:38, Steven Blanchard wrote:
On Mon, May 22 2023 at 16:18:21 +0100, Andy Seaborne <[email protected]> wrote:
Hello Andy,
Hi Steven,

How are you running xloader? Default settings?
Yes, we use your default settings.
The command line used is the following:
tdb2.xloader --loc /nfs/uniprot_tmp/tdb2/UniProt_04_2022/ --tmpdir /nfs/uniprot_tmp/ --threads 30 /nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf
Just looking at that, the use of NFS may be related.
NFS is a shared, remote filing system so it has comparatively high overheads on every operation to give the semantics of sharing (visibility on write).
Could you try using local disk to see if that makes a difference?

    Andy
What's the storage being used?
We use Block Storage from a cloud provider, with SSD, on a mounted NFS volume.
On 22/05/2023 10:49, Steven Blanchard wrote:
Hello,
I am currently trying to load a very large dataset (54 billion triples) with the tdb2.xloader command.
The first two steps (Nodes and Terms) are completed with an average load speed of ~ 120,000.
The third stage (Data) has an average load speed of only 800.
Is the "Avg" 800 from the start of the phase, or does the average drop to 800 during the phase?
The Avg is 800 from the start of the phase and it stays at 800.
This average load speed is incompatible with the amount of data to be loaded.
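A back-of-the-envelope calculation shows the scale of the problem. The sketch below is only arithmetic on the figures quoted in this thread; it assumes the rate stays constant and that the Data stage processes roughly one tuple per input triple, neither of which is confirmed here.

```python
TOTAL_TRIPLES = 54_000_000_000  # dataset size quoted in the thread
SECONDS_PER_DAY = 86_400


def load_days(tuples: int, rate_per_second: float) -> float:
    """Days needed to process `tuples` at a constant rate (an assumption)."""
    return tuples / rate_per_second / SECONDS_PER_DAY


# Rates quoted in the thread: 800/s for Data, ~120,000/s for Nodes and Terms.
for label, rate in [("Data at 800/s", 800), ("at ~120,000/s", 120_000)]:
    print(f"{label}: {load_days(TOTAL_TRIPLES, rate):.1f} days")
```

Under those assumptions, 800 tuples/s works out to about 781 days for the Data stage alone, against about 5 days at the rate seen in the first two stages.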
Looking at the status of the job, it is possible that there is an excessive demand on memory which slows down the process extremely.
We saw with top that java required a lot of memory:
```
top
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
# 867362 sblanch+ 20 0 289,0g 90,2g 88,4g S 3,3 72,1 1102:32 java
```
xloader does not have much requirement for java heap memory.
OK, since our email we have tried increasing -Xmx and have not seen any increase in performance.
That space may be mapped files.
But with a free -g, we see that it actually uses very little memory.
```
free -g
#             total used free shared buff/cache available
# Mem: 125 3 0 0 121 120
```
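The two outputs are consistent with the "mapped files" explanation. As a rough sketch, subtracting SHR from RES in the top line approximates the non-shared resident memory of the java process; `parse_top_size` is a hypothetical helper, and treating RES minus SHR as "private" memory is an approximation, not an exact accounting.

```python
def parse_top_size(field: str) -> float:
    """Convert a top size field such as '90,2g' (comma as decimal mark) to GiB."""
    units = {"m": 1 / 1024, "g": 1.0, "t": 1024.0}
    number, unit = field[:-1], field[-1].lower()
    return float(number.replace(",", ".")) * units[unit]


# VIRT, RES and SHR values copied from the top output in this thread.
virt, res, shr = (parse_top_size(f) for f in ("289,0g", "90,2g", "88,4g"))

# Shared resident pages include file-backed mappings; the remainder is
# roughly the memory private to the process (heap, stacks, and so on).
private = res - shr

print(f"resident: {res:.1f} GiB, shared: {shr:.1f} GiB, non-shared: {private:.1f} GiB")
```

With these figures the non-shared part is about 1.8 GiB, which fits the `free -g` output: only 3 GB "used", with 121 GB sitting in buff/cache.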
Are there any possibilities to speed up this step? (Give a -Xms to java?) Can this significant drop in loading speed for this step be due to memory usage? Do you know of any other limiting causes in this loading stage?
For previous insertions on smaller datasets, this Data step was not limiting and the average speed was even slightly higher than the Nodes and Terms steps.
How small is "smaller"?
For example, we loaded the UniRef RDF database (same provider as UniProt), with 12 billion triples, at an average of 230,000 tuples/s for the Data task.
That sounds like what I see when loading.
For information, the machine used has 32 CPUs and 128 GB of RAM.
Thanks for your help,
Regards,

Steven
