Re: optimization of upload of several files by a curl request

Dan Brickley Mon, 16 Jan 2023 05:56:25 -0800

Also interested in answers. I asked Andy a few years ago and I believe he
said it was best done via SPARQL, which went against my intuition that some
fast path shortcuts would help. I guess he meant this:

https://jena.apache.org/documentation/query/update.html

Looking at
https://github.com/apache/jena/blob/main/jena-examples/src/main/java/arq/examples/update/UpdateProgrammatic.java
I realize I am unsure whether the file: URI is resolved on the client or the
server side of the protocol:
UpdateLoad("file:etc/update-data.ttl", "http://example/g2")
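For comparison, here is a minimal sketch of the same LOAD sent as a SPARQL Update over HTTP, using only the Python standard library. The endpoint URL and the helper names build_load_update and post_update are assumptions for illustration, not taken from the Jena example. As far as I understand SPARQL 1.1 Update, the LOAD is carried out by whichever process executes the update, so in this form the file: URI would be read on the server side.

```python
import urllib.request


def build_load_update(source_iri: str, graph_iri: str) -> str:
    """Return a SPARQL 1.1 Update LOAD operation as a string."""
    return f"LOAD <{source_iri}> INTO GRAPH <{graph_iri}>"


def post_update(endpoint: str, update: str) -> int:
    """POST the update to a SPARQL Update endpoint and return the HTTP status."""
    request = urllib.request.Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


update = build_load_update("file:etc/update-data.ttl", "http://example/g2")
print(update)
# Hypothetical endpoint (assumption); uncomment to actually send the request:
# post_update("http://localhost:3030/ds/update", update)
```

Whether a given server permits file: access from an update request is a separate configuration question that this sketch does not address.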

The reason I care is a sense that we now have a lot of interoperable
data in RDF, and many SPARQL implementations with different strengths and
weaknesses. Yet it is hard to systematically mix and match them, e.g. in a
cloud (Docker etc.) environment, without a lot of work on data loading. We
are 25+ years into RDF now, but hopefully things will continue to get easier!

Excuse the top-posting,

Dan

On Mon, 16 Jan 2023 at 13:18, Steven Blanchard
<[email protected]> wrote:

> Hello,
>
> I would like to upload a very large dataset (UniRef) to a Fuseki
> database.
> I tried to upload file by file, but the upload time grew exponentially
> with each file added.
>
> Code used:
> ```python
>     import requests
>     from requests_toolbelt import MultipartEncoder
>
>     # Dataset endpoint that receives the upload
>     url: str = f"{jena_url}/{db_name}/data"
>     # Open the file in a context manager so it is closed after the upload
>     with open(path_file, "rb") as rdf_file:
>         multipart_data: MultipartEncoder = MultipartEncoder(
>             fields={"file": (file_name, rdf_file, "text/turtle")}
>         )
>         response: requests.Response = requests.post(
>             url,
>             data=multipart_data,
>             headers={"Content-Type": multipart_data.content_type},
>             cookies=cookies,
>         )
> ```
>
> Then I tried loading the data with the tdb2.tdbloader command.
> Loading all the files in a single command was much faster. Also,
> tdb2.tdbloader has an option to parallelize the load.
>
> Code used:
> ```bash
> bin/tdb2.tdbloader --loader=parallel \
>     --loc fuseki/base/databases/uniref/ data/uniref_*
> ```
> The problem with tdb2.tdbloader is that it does not work over HTTP.
>
> I would like to know whether it is possible to get the same performance
> as tdb2.tdbloader (loading all files at once, parallelization...) using
> an HTTP request.
> I'm also open to other suggestions to optimize this file loading.
>
> What explains this exponential growth in upload time when the data is
> added in several batches?
>
> Thank you for your help,
>
> Steven
>
>
