Re: optimization of upload of several files by a curl request

Andy Seaborne Mon, 16 Jan 2023 08:02:07 -0800



On 16/01/2023 13:16, Steven Blanchard wrote:

Hello,

I would like to upload a very large dataset (UniRef) to a fuseki database.


How big (in triples)?

I tried to upload file by file, but the upload time grew exponentially with each file added.


Code used:
```python
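import requests
# MultipartEncoder is assumed to come from the requests_toolbelt package.
from requests_toolbelt import MultipartEncoder

# jena_url, db_name, file_name, path_file and cookies are defined elsewhere.
url: str = f"{jena_url}/{db_name}/data"
# Keep the file open only for the duration of the upload.
with open(path_file, "rb") as turtle_file:
    multipart_data = MultipartEncoder(
        fields={"file": (file_name, turtle_file, "text/turtle")}
    )
    response: requests.Response = requests.post(
        url,
        data=multipart_data,
        headers={"Content-Type": multipart_data.content_type},
        cookies=cookies,
    )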
```



That is a multi-part file upload.

Does the Fuseki log show a single POST?
Does it have a Content-length? (run Fuseki with "-v" to see headers)
Is the Python client taking time to assemble the request?

It's not the root issue but I'd like to understand what various setups do in practice and what arrives at the server.
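For comparison with the multipart request above, here is a minimal sketch of a plain (non-multipart) POST that sends one Turtle file as the raw request body with an explicit Content-Length. It only builds the request so the headers can be inspected before anything reaches the server. The function name `build_turtle_post` and the URL in the usage comment are hypothetical; the sketch assumes the `/data` endpoint accepts a `text/turtle` body, as the Graph Store Protocol describes.

```python
import os
import urllib.request


def build_turtle_post(url: str, path: str) -> urllib.request.Request:
    """Build, but do not send, a POST carrying one Turtle file as the raw body."""
    size = os.path.getsize(path)
    body = open(path, "rb")  # streamed when sent; the caller closes it afterwards
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/turtle", "Content-Length": str(size)},
        method="POST",
    )


# Usage (hypothetical URL and file name):
# req = build_turtle_post("http://localhost:3030/uniref/data", "uniref_0.ttl")
# print(req.header_items())          # what the client intends to send
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
# req.data.close()
```

Comparing these headers with what Fuseki logs under "-v" should show whether the client or the server side is where the time goes.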

Then I tried to upload with the command tdb2.tdbloader.
By uploading all the files in the same command, the upload became very much faster. Also, tdb2.tdbloader has an option to parallelize the upload.

If you load into a live server, Fuseki does a safe add to the database within a database transaction that does not consume all the server hardware resources. If the data is bad (all too often the case) or the client breaks the connection so that the data is corrupt, the transaction will abort and the database is left intact in its original state.

This is to keep a balance between upload, integrity, and responding to queries.


Load performance is hardware sensitive.
Is this using an SSD? If so, local or remote?

Code used:
```bash
bin/tdb2.tdbloader --loader=parallel --loc fuseki/base/databases/uniref/data/uniref_*
```
The problem with tdb2.tdbloader is that it does not work over HTTP.


The parallel loader will saturate the I/O at scale. It's greedy!
I/O is the limiting factor at scale.

tdb2.tdbloader (default and parallel) runs without database transactions. Different parts of the database are in different states (the "parallel" bit). An aborted load, or programme crash, will destroy the database. It is best used loading from empty.

I would like to know if it is possible to get the same performance as tdb2 (loading all files at once, parallelization...) by using an HTTP request?


Not as of today.

Could the functionality be added? Yes - a good case for a Fuseki Module to make the functionality opt-in because it can take out the server and break the database.

I'm also open to other suggestions to optimize this file loading.

Loading offline and putting in the new database is the way to exploit faster loading.

What explains this exponential evolution of the upload time when adding data in several times?
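As a rough sketch of that offline route, the loader can be driven from Python with the same flags as the command shown earlier. The helper names `tdbloader_command` and `load_offline` are hypothetical, and the sketch assumes the target directory is a fresh, empty database location that no running server is using.

```python
import glob
import subprocess


def tdbloader_command(db_dir, data_files, loader="parallel",
                      tool="bin/tdb2.tdbloader"):
    """Build the argv list for one bulk load of all files into db_dir."""
    if not data_files:
        raise ValueError("no data files to load")
    return [tool, f"--loader={loader}", "--loc", db_dir, *data_files]


def load_offline(db_dir, pattern):
    """Load every file matching pattern in a single loader run."""
    files = sorted(glob.glob(pattern))
    # check=True raises if the loader fails; an aborted load can leave the
    # database unusable, so start again from an empty directory in that case.
    subprocess.run(tdbloader_command(db_dir, files), check=True)


# Usage (hypothetical paths):
# load_offline("databases/uniref-new", "data/uniref_*")
```

Once the load has finished, the server can be pointed at the new database directory.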


Loading slows down over time.

1. Hardware sizes mean more I/O and less %-age cached as the loaded data grows.
2. Loading is in effect a sort, so that's n log(n) once the scale is large enough to see.
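To get a feel for the second point, here is a toy model (not a measurement) that treats the total work of loading n triples as n log2(n) and looks at the extra work each equally sized batch adds. The function names and the batch size are illustrative assumptions, and the model ignores the caching effect from the first point, which in practice can dominate.

```python
import math


def sort_cost(n):
    """Modelled total work to load n triples, in arbitrary units."""
    return n * math.log2(n) if n > 1 else 0.0


def per_batch_costs(batch_size, batches):
    """Extra modelled work contributed by each successive batch."""
    return [
        sort_cost(k * batch_size) - sort_cost((k - 1) * batch_size)
        for k in range(1, batches + 1)
    ]


# Usage: relative cost of each of five batches of one million triples.
# costs = per_batch_costs(1_000_000, 5)
# print([round(c / costs[0], 2) for c in costs])
```

In this model each batch costs a little more than the previous one, but the growth is slow rather than exponential; a much steeper slowdown would point towards I/O and cache pressure.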


    Andy


Thank you for your help,

Steven
