Hello,

I have posted the same question on Stack Overflow but didn't get any
response, so I am posting it here for help.

https://stackoverflow.com/questions/62068787/flink-s3-write-performance-optimization?noredirect=1#comment109814428_62068787

Details:

I am working on a Flink application on Kubernetes (EKS) which consumes data
from Kafka and writes it to S3.

We have around 120 million XML messages, about 4 TB in total, in Kafka.
Consuming from Kafka is super fast.

These are just string messages from Kafka.

There is high back pressure while writing to S3. We are not even hitting
the S3 PUT request limit, which is around 3,500 requests/sec per prefix. I am
seeing only about 300 writes per minute to S3, which is very slow.

I am using StreamingFileSink to write to S3, with OnCheckpointRollingPolicy
as the rolling policy.

I am using flink-s3-fs-hadoop-*.jar and s3:// as the path scheme (not s3a:// or s3p://).

Other than this I don't have any configuration related to S3:

    StreamingFileSink<Tuple3<String, String, String>> sink = StreamingFileSink
            .forRowFormat(new Path("s3://BUCKET"),
                    (Tuple3<String, String, String> element, OutputStream stream) -> {
                        PrintStream out = new PrintStream(stream);
                        out.println(element.f2);
                    })
            // Determine component type for each record
            .withBucketAssigner(new CustomBucketAssigner())
            .withRollingPolicy(OnCheckpointRollingPolicy.build())
            .withBucketCheckInterval(TimeUnit.MINUTES.toMillis(1))
            .build();

Is there anything we can optimize for S3, either in the StreamingFileSink or
in flink-conf.yaml?

For example, using a bulk format, or config params like fs.s3.maxThreads,
etc.? A couple of rough sketches of what I have in mind are below.
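
For the bulk format idea, this is roughly what I was thinking of: a
hand-rolled BulkWriter that gzips the XML strings so each part file becomes
one compressed block instead of many small row writes. To be clear,
GzipStringBulkWriter below is just my own sketch and not a Flink class; only
BulkWriter, BulkWriter.Factory and StreamingFileSink.forBulkFormat are Flink
API, and as I understand it bulk formats always roll on checkpoint, so the
rolling policy and bucket check interval go away:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    import org.apache.flink.api.common.serialization.BulkWriter;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.core.fs.FSDataOutputStream;

    public class GzipStringBulkWriter implements BulkWriter<Tuple3<String, String, String>> {

        private final GZIPOutputStream out;

        GzipStringBulkWriter(FSDataOutputStream stream) throws IOException {
            // syncFlush=true so flush() pushes buffered bytes down to the part file
            this.out = new GZIPOutputStream(stream, true);
        }

        public static BulkWriter.Factory<Tuple3<String, String, String>> factory() {
            return GzipStringBulkWriter::new;
        }

        @Override
        public void addElement(Tuple3<String, String, String> element) throws IOException {
            // write the XML payload plus a newline, gzip-compressed
            out.write((element.f2 + "\n").getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void flush() throws IOException {
            out.flush();
        }

        @Override
        public void finish() throws IOException {
            // finish the gzip stream; the sink closes the underlying part file itself
            out.finish();
        }
    }

and the sink would then become something like:

    StreamingFileSink<Tuple3<String, String, String>> sink = StreamingFileSink
            .forBulkFormat(new Path("s3://BUCKET"), GzipStringBulkWriter.factory())
            .withBucketAssigner(new CustomBucketAssigner())
            .build();

On the config side, my assumption (please correct me if this is wrong) is
that s3.-prefixed keys in flink-conf.yaml are mirrored by flink-s3-fs-hadoop
to the Hadoop S3A options, so something like the following would raise the
connection and upload-thread limits (the values are just placeholders I made
up):

    s3.connection.maximum: 200   # would map to fs.s3a.connection.maximum
    s3.threads.max: 64           # would map to fs.s3a.threads.max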

For checkpointing too I am using s3:// instead of s3p:// or s3a://:

    env.setStateBackend((StateBackend) new RocksDBStateBackend("s3://checkpoint_bucket", true));
    env.enableCheckpointing(300000); // checkpoint every 5 minutes
