On 16/01/2023 17:22, Steven Blanchard wrote:

On Mon, Jan 16, 2023 at 16:01:51 +0000, Andy Seaborne <[email protected]> wrote:

Dear Andy,

For the exponential increase in time: it happens when I load the files one by one, with multiple command-line invocations.

Do you really mean "exponential" or "more than linear"?

Two data points can't distinguish that.

For example, I got the following loading times with tdb2.tdbloader (I have observed the same phenomenon with requests).

Loading the file Uniref_1.nt: 35min
Then loading the file Uniref_2.nt (into the same graph in the same dataset as Uniref_1.nt): 237min

UniRef comes in several variants and seems to be RDF/XML.
Exactly which one are you working with?

File sizes? (in triples and bytes).
Does the data include a large number of large literals?

Hardware? (RAM size; disk: SSD or magnetic?)

And don't max out the heap size! (Most of the space used is outside the heap.)
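For example (a sketch; the 8G figure and the JVM_ARGS variable are assumptions to check against your Jena installation's scripts), keep the Java heap moderate and leave the rest of RAM to the OS file cache:

```shell
# A moderate heap: TDB2's working space is mostly memory-mapped,
# outside the Java heap, so a huge -Xmx starves the OS file cache.
# Paths and the 8G value are illustrative.
JVM_ARGS="-Xmx8G" bin/tdb2.tdbloader --loc databases/uniref data/uniref_*.nt
```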

Loading the Uniref_1.nt file and the Uniref_2.nt file in the same tdb2.tdbloader command line: 150min

Uploading a file into a non-empty dataset/graph takes longer than uploading that same file together with another in a single load.


On 16/01/2023 13:16, Steven Blanchard wrote:
Hello,

I would like to upload a very large dataset (UniRef) to a Fuseki database.

How big (in triples)?

In total, we have ~ 4 000 000 000  triples to insert.

There is an alternative, specialist loader that MAY help: tdb2.xloader

This is another offline database builder that builds a database from empty (it does not load incrementally).

It works on large data on modest hardware (roughly: desktops and portables with less I/O performance, less than 32G RAM, including HDD) and is not faster than tdb2.tdbloader until the hardware maxes out on tdb2.tdbloader. (e.g. I can load Wikidata truthy on a portable and it takes 10s of hours.)
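If you try it, the invocation is along these lines (a sketch: the paths are placeholders, and the options should be checked against the tdb2.xloader documentation; --loc must point at a fresh, empty database directory):

```shell
# Offline bulk build with the scalable loader (assumed paths).
# --tmpdir points the loader's spill files at a disk with plenty of free space.
bin/tdb2.xloader --loc databases/uniref --tmpdir /scratch/tmp data/uniref_*.nt
```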

The only way to know which is faster is to try.

https://lists.apache.org/thread/7d2j83vyjzz2om70nvcvf9c0n7lyzp07

https://lists.apache.org/thread/rphn74r9vbovwjvylxjmrd6qnfvbt4t0

    Andy

I tried to upload file by file, but the upload time increased exponentially with each file added.

Code used:
```python
import requests
from requests_toolbelt import MultipartEncoder  # pip install requests-toolbelt

url: str = f"{jena_url}/{db_name}/data"
multipart_data: MultipartEncoder = MultipartEncoder(
    fields={
        "file": (
            file_name,
            open(path_file, "rb"),
            "text/turtle",
        )
    }
)
# requests.post returns a Response, not a Request
response: requests.Response = requests.post(
    url,
    data=multipart_data,
    headers={"Content-Type": multipart_data.content_type},
    cookies=cookies,
)
```


That is a multi-part file upload.

Does the Fuseki log show a single POST?
Does it have a Content-length? (run Fuseki with "-v" to see headers)
Is the Python client taking time to assemble the request?

It's not the root issue but I'd like to understand what various setups do in practice and what arrives at the server.

For the multipart approach, I found this conversation: https://stackoverflow.com/questions/54549464/programmaticaly-upload-dataset-to-fuseki. Having quickly been blocked by the growing load times, I did not record timings, so I cannot tell you whether it was faster than not using it.
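For comparison (a sketch, not something measured in this thread; host, port and dataset name are placeholders), the same upload can be done without multipart by streaming the file straight to Fuseki's Graph Store Protocol endpoint:

```shell
# Stream an N-Triples file into the default graph of dataset "uniref".
# The server streams and parses the body as it arrives; no multipart
# body has to be assembled in the client.
curl -X POST \
     --data-binary @Uniref_1.nt \
     -H 'Content-Type: application/n-triples' \
     'http://localhost:3030/uniref/data?default'
```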

Then I tried to upload with the command tdb2.tdbloader.
By loading all the files in the same command, the load became much faster. Also, tdb2.tdbloader has an option to parallelize the load.

If you load into a live server, Fuseki does a safe add to the database within a database transaction that does not consume all the server hardware resources. If the data is bad (all too often the case) or the client breaks the connection, the transaction will abort and the database stays intact in its original state.


The UniRef database will not be loaded often and will not be updated afterwards. But I take good note of your warnings; for the other (small) databases the loading times are very acceptable, so I will use the Fuseki server.

This is to keep a balance between upload, integrity, and responding to queries.

Load performance is hardware sensitive.
Is this using an SSD? If so, local or remote?

Code used:
```bash
bin/tdb2.tdbloader --loader=parallel --loc fuseki/base/databases/uniref/ data/uniref_*
```
The problem with tdb2.tdbloader is that it does not work over HTTP.

The parallel loader will saturate the I/O at scale. It's greedy!
I/O is the limiting factor at scale.

tdb2.tdbloader (default and parallel) runs without database transactions. Different parts of the database are in different states (the "parallel" bit). An aborted load, or a program crash, will destroy the database. It is best used for loading from empty.

When you say it destroys the database, is it just the dataset where the data is loaded, or all the datasets on the server?

I create a new dataset for each version of my software, so it is possible for me to load the UniProt graph first and then the other data via requests, to be safe.

I would like to know if it is possible to get the same performance as tdb2.tdbloader (loading all files at once, parallelization...) using an HTTP request?

Not as of today.

Could the functionality be added? Yes: a good case for a Fuseki Module, making the functionality opt-in, because it can take out the server and break the database.

I'm also open to other suggestions to optimize this file loading.

Loading offline and then putting the new database in place is the way to exploit faster loading.
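That workflow might look like this (a sketch; the directory names are hypothetical, and Fuseki must not be serving the database while it is built or swapped):

```shell
# 1. Build the database offline into a fresh directory.
bin/tdb2.tdbloader --loader=parallel --loc databases/uniref-new data/uniref_*

# 2. With Fuseki stopped, swap the new database into place.
mv databases/uniref databases/uniref-old
mv databases/uniref-new databases/uniref

# 3. Restart Fuseki serving the database directory.
fuseki-server --loc databases/uniref /uniref
```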


What explains this exponential growth in upload time when data is added in several batches?

Loading slows down over time.

1. Hardware sizes mean more I/O and a smaller percentage cached as the loaded data grows.
2. Loading is in effect a sort, so the cost is n·log(n) once the scale is large enough to see it.
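As a rough check of point 2 (a sketch; the file size n is an assumed, illustrative figure), the n·log(n) model alone predicts only a modest slowdown for a second, equally sized file, which suggests the caching effect of point 1 is the larger contributor at this scale:

```python
import math

def cost(n: float) -> float:
    # Cost model from point 2: loading behaves like a sort, ~ n * log(n).
    return n * math.log(n)

n = 500_000_000  # assumed triples per file (illustrative only)

first = cost(n)                 # load file 1 into an empty database
second = cost(2 * n) - cost(n)  # incremental cost of adding file 2
combined = cost(2 * n)          # load both files in one run

print(f"second/first:   {second / first:.2f}")    # ~1.07
print(f"combined/first: {combined / first:.2f}")  # ~2.07
```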

    Andy


Thank you for your help,

Steven



