Juan Galvez created ARROW-8365:
----------------------------------

             Summary: arrow-cpp: Error when writing files to S3 larger than 5 GB
                 Key: ARROW-8365
                 URL: https://issues.apache.org/jira/browse/ARROW-8365
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.16.0
            Reporter: Juan Galvez


When using only the arrow-cpp library to write a large Arrow table to S3 
(resulting in a file larger than 5 GB), I get the following error:

{{../src/arrow/io/interfaces.cc:219: Error ignored when destroying file of type 
N5arrow2fs12_GLOBAL__N_118ObjectOutputStreamE: IOError: When uploading part for 
key 'test01.parquet/part-00.parquet' in bucket 'test': AWS Error [code 100]: 
Unable to parse ExceptionName: EntityTooLarge Message: Your proposed upload 
exceeds the maximum allowed size with address : 52.219.100.32}}

I have diagnosed the problem by reading and modifying the code in 
*{{s3fs.cc}}*. The code uses multipart upload, with 5 MB chunks for the first 
100 parts. After the first 100 parts have been submitted, it is supposed to 
increase the chunk size to 10 MB (the part upload threshold, 
{{part_upload_threshold_}}). The problem is that the threshold is increased 
inside {{DoWrite}}, and {{DoWrite}} can be called multiple times before the 
current part is uploaded. As a result the threshold keeps increasing 
indefinitely, and the last part ends up exceeding the 5 GB per-part upload 
limit of AWS/S3.
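
To illustrate the mechanism, here is a minimal sketch of the behaviour described above. It is a hypothetical, simplified model and not the actual code from {{s3fs.cc}}: only {{DoWrite}}, {{part_number_}}, {{part_upload_threshold_}} and {{kMinimumPartUpload}} are names from the real code; the class, the other members and the buffering logic are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the output stream. NOT the real s3fs.cc.
constexpr int64_t kMiB = 1024 * 1024;
constexpr int64_t kMinimumPartUpload = 5 * kMiB;

class BuggyStreamModel {
 public:
  // Buffers nbytes and "uploads" a part once the threshold is reached.
  void DoWrite(int64_t nbytes) {
    assert(nbytes >= 0);
    // Modelled bug: this check runs on every call. While part_number_ sits
    // at a multiple of 100, each call bumps the threshold again.
    if (part_number_ % 100 == 0) {
      part_upload_threshold_ += kMinimumPartUpload;
    }
    buffered_ += nbytes;
    if (buffered_ >= part_upload_threshold_) {
      UploadPart();
    }
  }

  // Flushes whatever is still buffered as the final part.
  void Close() {
    if (buffered_ > 0) UploadPart();
  }

  int64_t parts_uploaded() const { return parts_uploaded_; }
  int64_t last_part_size() const { return last_part_size_; }
  int64_t part_upload_threshold() const { return part_upload_threshold_; }

 private:
  // Stand-in for the real upload: records the part size and resets the buffer.
  void UploadPart() {
    last_part_size_ = buffered_;
    buffered_ = 0;
    ++parts_uploaded_;
    ++part_number_;
  }

  int64_t part_upload_threshold_ = kMinimumPartUpload;
  int64_t buffered_ = 0;
  int64_t last_part_size_ = 0;
  int64_t parts_uploaded_ = 0;
  int part_number_ = 1;
};
```

In this model, feeding the stream with writes smaller than 5 MB means that once {{part_number_}} reaches 100 the threshold grows faster than the buffer, so no further part is uploaded until {{Close}}, and everything written after part 99 ends up in one final part.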

I'm fairly sure the last part ends up much larger than it should whenever a 
multipart upload exceeds 100 parts, but the error is only raised if the last 
part is larger than 5 GB. The problem is therefore only observed with very 
large uploads.

I can confirm that the bug does not happen if I move this:

{{if (part_number_ % 100 == 0) {}}
{{ part_upload_threshold_ += kMinimumPartUpload;}}
{{ }}}


to a different method, right before the line that does: 
{{++part_number_}}
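
A minimal sketch of that change, again using a hypothetical simplified model rather than the real {{s3fs.cc}} code (the class, {{UploadPart}}, {{Close}} and the bookkeeping members are illustrative assumptions): the threshold bump sits next to {{++part_number_}}, so it runs at most once per uploaded part.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the output stream. NOT the real s3fs.cc.
constexpr int64_t kMiB = 1024 * 1024;
constexpr int64_t kMinimumPartUpload = 5 * kMiB;

class FixedStreamModel {
 public:
  // Buffers nbytes and "uploads" a part once the threshold is reached.
  // The threshold is no longer touched here.
  void DoWrite(int64_t nbytes) {
    assert(nbytes >= 0);
    buffered_ += nbytes;
    if (buffered_ >= part_upload_threshold_) {
      UploadPart();
    }
  }

  // Flushes whatever is still buffered as the final part.
  void Close() {
    if (buffered_ > 0) UploadPart();
  }

  int64_t parts_uploaded() const { return parts_uploaded_; }
  int64_t last_part_size() const { return last_part_size_; }
  int64_t largest_part_size() const { return largest_part_size_; }
  int64_t part_upload_threshold() const { return part_upload_threshold_; }

 private:
  // Stand-in for the real upload: records the part size and resets the buffer.
  void UploadPart() {
    last_part_size_ = buffered_;
    largest_part_size_ = std::max(largest_part_size_, buffered_);
    buffered_ = 0;
    ++parts_uploaded_;
    // The bump now happens right before the increment, once per part.
    if (part_number_ % 100 == 0) {
      part_upload_threshold_ += kMinimumPartUpload;
    }
    ++part_number_;
  }

  int64_t part_upload_threshold_ = kMinimumPartUpload;
  int64_t buffered_ = 0;
  int64_t last_part_size_ = 0;
  int64_t largest_part_size_ = 0;
  int64_t parts_uploaded_ = 0;
  int part_number_ = 1;
};
```

With this placement the threshold in the model grows by one {{kMinimumPartUpload}} step per 100 uploaded parts, and no part exceeds the threshold in force when it was cut by more than the size of a single write.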

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
