RE: Strange filter performance with parquet

2017-02-07 Thread Newport, Billy
… before records are read/written in that JVM. From: Fabian Hueske [mailto:fhue...@gmail.com] Sent: Tuesday, February 07, 2017 4:17 PM To: user@flink.apache.org Subject: Re: Strange filter performance with parquet I'm not familiar with the details of Parquet and Avro, but I know that the handling …

RE: Strange filter performance with parquet

2017-02-07 Thread Newport, Billy
… 4:17 PM To: user@flink.apache.org Subject: Re: Strange filter performance with parquet I'm not familiar with the details of Parquet and Avro, but I know that the handling of GenericRecord is very inefficient in Flink. The reason is that they are serialized using Kryo and always contain the …
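A minimal sketch of one commonly suggested mitigation, assuming the DataSet API of that era: force Flink's Avro serializer instead of the Kryo fallback. Whether this helps for GenericRecord specifically depends on the Flink version; generated SpecificRecord classes are generally the faster route.

    import org.apache.flink.api.java.ExecutionEnvironment;

    public class AvroSerializationSetup {
        public static void main(String[] args) {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // Prefer Flink's Avro serializer over the Kryo fallback for Avro types.
            // A GenericRecord still carries a schema per record, so generated
            // SpecificRecord classes usually serialize much more compactly.
            env.getConfig().enableForceAvro();
        }
    }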

Re: Strange filter performance with parquet

2017-02-07 Thread Fabian Hueske
> Reading parquet files seems very slow to me. Writing is very fast in comparison. It takes 60 slots 10 minutes to read 550 million records from a parquet file. We have MR jobs finishing processing in 8.5 minutes with 33 cores, so it’s very much slower than what’s possible. …

RE: Strange filter performance with parquet

2017-02-07 Thread Newport, Billy
… From: Fabian Hueske [mailto:fhue...@gmail.com] Sent: Tuesday, February 07, 2017 3:56 PM To: user@flink.apache.org Subject: Re: Strange filter performance with parquet Hmm, the plan you posted does not look like it would need to spill data to avoid a deadlock. Not sure what's causing the slowdown. How do you read Parquet files? If …
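For context, a typical way to read Parquet into a DataSet at the time was Parquet's Avro Hadoop input format wrapped in Flink's HadoopInputFormat. A hedged sketch; the input path is a placeholder and the parquet-avro and flink-hadoop-compatibility dependencies are assumed:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.parquet.avro.AvroParquetInputFormat;

    public class ParquetReadSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            Job job = Job.getInstance();
            // Parquet's Avro input format yields (Void, GenericRecord) pairs.
            HadoopInputFormat<Void, GenericRecord> inputFormat = new HadoopInputFormat<>(
                    new AvroParquetInputFormat<GenericRecord>(), Void.class, GenericRecord.class, job);
            FileInputFormat.addInputPath(job, new Path("hdfs:///data/records.parquet")); // placeholder
            DataSet<Tuple2<Void, GenericRecord>> records = env.createInput(inputFormat);
            records.first(10).print();
        }
    }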

Re: Strange filter performance with parquet

2017-02-07 Thread Fabian Hueske
> … so it’s very much slower than what’s possible. *From:* Fabian Hueske [mailto:fhue...@gmail.com] *Sent:* Tuesday, February 07, 2017 3:26 PM *To:* user@flink.apache.org *Subject:* Re: Strange filter performance with parquet Hi Billy, …

RE: Strange filter performance with parquet

2017-02-07 Thread Newport, Billy
… To: user@flink.apache.org Subject: Re: Strange filter performance with parquet Hi Billy, this might depend on what you are doing with the live and dead DataSets later on. For example, if you join both data sets, Flink might need to spill one of them to disk and read it back to avoid a deadlock. This happens for instance if the join strategy is a HashJoin, which blocks one input …

Re: Strange filter performance with parquet

2017-02-07 Thread Fabian Hueske
Hi Billy, this might depend on what you are doing with the live and dead DataSets later on. For example, if you join both data sets, Flink might need to spill one of them to disk and read it back to avoid a deadlock. This happens for instance if the join strategy is a HashJoin, which blocks one input …
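To make the spilling point concrete: in the DataSet API the optimizer's join strategy can be steered with a JoinHint, e.g. broadcasting the smaller side so neither input needs to be materialized to disk. A sketch with stand-in data for the live/dead sets discussed in the thread:

    import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class JoinHintSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // Stand-ins for the "live" and "dead" DataSets from the thread.
            DataSet<Tuple2<Long, String>> live = env.fromElements(Tuple2.of(1L, "a"), Tuple2.of(2L, "b"));
            DataSet<Tuple2<Long, String>> dead = env.fromElements(Tuple2.of(2L, "x"));
            // Broadcasting the (presumably small) dead side avoids a blocking
            // HashJoin spilling one input to disk and reading it back.
            live.join(dead, JoinHint.BROADCAST_HASH_SECOND)
                .where(0)
                .equalTo(0)
                .print();
        }
    }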