Hi Kostas,

Thanks for the info. That error was caused by building your code against an
out-of-date baseline. After rebasing my branch and rebuilding, the issue is
gone.
I've been testing since then, and so far I have the following questions/issues:

1. I'm not able to write to S3 with the URI format s3://<path>, and had to
use s3a://<path> instead. Is this behaviour expected? (I am running Flink on
AWS EMR, and I thought that EMR provides a wrapper for HDFS over S3 called
EMRFS.)
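
For reference, here is roughly how I build the output path (a minimal
sketch; the bucket name is a placeholder):

import org.apache.flink.core.fs.Path

// Placeholder bucket; this scheme works on my EMR cluster:
val outputPath = new Path("s3a://my-bucket/Test/output")
// ...while this one fails:
// val outputPath = new Path("s3://my-bucket/Test/output")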

2. Occasionally/randomly I get the message below (full log: parquet_error1.log
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/parquet_error1.log>).
I'm using the ParquetAvroWriters.forReflectRecord() method to write Scala
case classes. Re-running the job doesn't hit the error at the same data
location, so I don't think there's an issue with the data.

java.lang.ArrayIndexOutOfBoundsException: <some random number>
    at org.apache.parquet.column.values.dictionary.DictionaryValuesWriter$PlainBinaryDictionaryValuesWriter.fallBackDictionaryEncodedData
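
For context, the sink is created roughly like this (a minimal sketch; the
case class, field names, and paths are simplified placeholders for my actual
job):

import org.apache.flink.core.fs.Path
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala._

// Simplified stand-in for the real case class being written.
case class MyEvent(id: String, dt: String, value: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Bulk-format sink writing Parquet via Avro reflection.
val sink: StreamingFileSink[MyEvent] = StreamingFileSink
  .forBulkFormat(
    new Path("s3a://my-bucket/Test/output"),   // placeholder bucket
    ParquetAvroWriters.forReflectRecord(classOf[MyEvent]))
  .build()

env
  .fromElements(MyEvent("a", "2018-09-20", 1.0))  // dummy records
  .addSink(sink)

env.execute("parquet-sink-sketch")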

3. Sometimes I get this error message when I use a parallelism of 8 for the
sink (full log: parquet_error2.log
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/parquet_error2.log>).
Reducing the parallelism to 2 resolves the issue, but is it possible to
increase the pool size instead? I could not find any place where I can
change the fs.s3.maxconnections parameter.

java.io.InterruptedIOException: initiate MultiPartUpload on
Test/output/dt=2018-09-20/part-7-5:
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Unable
to execute HTTP request: Timeout waiting for connection from pool
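
In case it clarifies what I'm looking for: my (possibly wrong) understanding
is that the flink-s3-fs-hadoop filesystem mirrors flink-conf.yaml entries
into the shaded Hadoop s3a configuration, so I would have expected something
like the following to raise the connection pool size. Is that the right knob?

# flink-conf.yaml -- assuming entries with an "s3." prefix are forwarded
# to the shaded s3a filesystem as "fs.s3a.*" options:
s3.connection.maximum: 64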

4. Where is the temporary folder in which the parquet files are stored
before they are uploaded to S3?

Thanks a lot for your help.

Best regards,
Averell


