before records are read or written in that JVM.
From: Fabian Hueske [mailto:fhue...@gmail.com]
Sent: Tuesday, February 07, 2017 4:17 PM
To: user@flink.apache.org
Subject: Re: Strange filter performance with parquet
I'm not familiar with the details of Parquet and Avro, but I know that the
handling of GenericRecord is very inefficient in Flink.
The reason is that they are serialized using Kryo and always contain the
full Avro schema.
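
A common mitigation, sketched below purely as an illustration (the field names
"id", "name", "amount" and their types are made up, not taken from this thread),
is to project the GenericRecords into a typed Tuple directly after the source,
so only the first operator touches Kryo-serialized records and everything
downstream uses Flink's native tuple serializers:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

// Field names and types ("id", "name", "amount") are hypothetical; adjust to the real Avro schema.
public class ToTypedTuple
        implements MapFunction<Tuple2<Void, GenericRecord>, Tuple3<Long, String, Double>> {

    @Override
    public Tuple3<Long, String, Double> map(Tuple2<Void, GenericRecord> in) {
        GenericRecord r = in.f1;
        // Avro strings are Utf8 instances, hence the explicit toString().
        return new Tuple3<>((Long) r.get("id"), r.get("name").toString(), (Double) r.get("amount"));
    }
}

// Applied directly after the source, e.g.:
//   DataSet<Tuple3<Long, String, Double>> typed = records.map(new ToTypedTuple());
// so downstream operators serialize compact tuples instead of schema-carrying GenericRecords.

The earlier this projection happens, the less per-record schema baggage is
serialized and shipped between operators.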
> Reading Parquet files seems very slow to me. Writing is very fast in
> comparison. It takes 60 slots 10 minutes to read 550 million records from a
> Parquet file. We have MR jobs finishing processing in 8.5 minutes with 33
> cores, so it's much slower than what's possible.
>
From: Fabian Hueske [mailto:fhue...@gmail.com]
Sent: Tuesday, February 07, 2017 3:56 PM
To: user@flink.apache.org
Subject: Re: Strange filter performance with parquet
Hmm, the plan you posted does not look like it would need to spill data to
avoid a deadlock.
Not sure what's causing the slowdown.
How do you read Parquet files?
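
For reference, a minimal sketch of one common way to read Avro records out of
Parquet with the DataSet API at that time, using Flink's Hadoop compatibility
wrapper and parquet-avro's AvroParquetInputFormat; the class name and input
path are placeholders, and the actual job may well read the files differently:

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ReadParquetExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Configure the Parquet/Avro input through a Hadoop Job object.
        Job job = Job.getInstance();
        AvroParquetInputFormat.addInputPath(job, new Path("hdfs:///path/to/table"));  // placeholder path

        // Wrap the Hadoop InputFormat so the DataSet API can use it;
        // each element arrives as a (Void, GenericRecord) pair.
        HadoopInputFormat<Void, GenericRecord> inputFormat = new HadoopInputFormat<>(
                new AvroParquetInputFormat<GenericRecord>(), Void.class, GenericRecord.class, job);

        DataSet<Tuple2<Void, GenericRecord>> records = env.createInput(inputFormat);

        // Trigger execution with a trivial sink.
        System.out.println("record count: " + records.count());
    }
}
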
> From: Fabian Hueske [mailto:fhue...@gmail.com]
> Sent: Tuesday, February 07, 2017 3:26 PM
> To: user@flink.apache.org
> Subject: Re: Strange filter performance with parquet
>
> Hi Billy,

From: Fabian Hueske [mailto:fhue...@gmail.com]
Sent: Tuesday, February 07, 2017 3:26 PM
To: user@flink.apache.org
Subject: Re: Strange filter performance with parquet

Hi Billy,
this might depend on what you are doing with the live and dead DataSets
later on.
For example, if you join both data sets, Flink might need to spill one of
them to disk and read it back to avoid a deadlock.
This happens for instance if the join strategy is a HashJoin which blocks
one input.
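
To make the blocking behaviour concrete, here is a sketch (not from the thread)
of how a join hint can steer the optimizer's strategy for the live/dead join;
the Tuple2<Long, String> element type, the key position, and the particular
hint are illustrative assumptions:

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinHintExample {

    // "live" and "dead" are the DataSet names from the thread; the Tuple2<Long, String>
    // element type, the key position, and the hint choice are illustrative only.
    public static DataSet<Tuple2<Tuple2<Long, String>, Tuple2<Long, String>>> joinLiveAndDead(
            DataSet<Tuple2<Long, String>> live, DataSet<Tuple2<Long, String>> dead) {

        // BROADCAST_HASH_SECOND builds the hash table from the (presumably smaller) dead side
        // on every node, so the live side streams through without being blocked or spilled.
        return live
                .join(dead, JoinHint.BROADCAST_HASH_SECOND)
                .where(0)
                .equalTo(0);
    }
}

joinWithTiny() and joinWithHuge() give the optimizer the same kind of size hint
without naming a concrete strategy.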