Re: Very slow complex type column reads from parquet

2018-06-18 Thread Ryan Blue
Jakub, I'm moving the Spark list to bcc and adding the Parquet list, since you're probably more interested in Parquet tuning. It makes sense that you're getting better performance when the matching rows are spread more evenly through the file, especially if those rows carry a huge column that you need to project.
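To illustrate the projection point, here is a minimal PySpark sketch (the path and column names are hypothetical): because Parquet is columnar, selecting only the small columns means the reader never decodes the pages of the huge array column.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("projection-demo").getOrCreate()

# Hypothetical dataset with a huge nested column "data" { f1, f2 }.
df = spark.read.parquet("hdfs:///data/events")

# This scan only reads the pages of "timestamp" and "device_id",
# skipping the 50k-element arrays entirely.
fast = df.select("timestamp", "device_id").filter("timestamp >= '2018-06-01'")

# Projecting the nested field forces every matching row's array to be
# read, decompressed, and materialized.
slow = df.select("timestamp", "data.f1").filter("timestamp >= '2018-06-01'")
```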

Re: Very slow complex type column reads from parquet

2018-06-15 Thread Jakub Wozniak
Hello, I’m sorry to bother you again, but it is quite important for us to understand the problem better. One more finding about our problem is that the performance of queries against a timestamp-sorted file depends a lot on the predicate timestamp. If you are lucky enough to get some records from the start of …
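Whether a given predicate timestamp is cheap or expensive comes down to the row-group statistics: a reader can skip a row group only when the predicate falls outside its min/max range. A sketch for dumping those ranges with pyarrow (file path and column index are assumptions):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # hypothetical file
meta = pf.metadata

for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    col = rg.column(0)  # assuming column 0 is the timestamp
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(f"row group {i}: rows={rg.num_rows} "
              f"min={stats.min} max={stats.max}")
```

Comparing these ranges against the predicate shows which row groups a query can skip outright and which one it has to decode in full.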

Re: Very slow complex type column reads from parquet

2018-06-14 Thread Jakub Wozniak
Dear Ryan, Thanks a lot for your answer. After sending the e-mail we investigated the data itself a bit more. It turned out that for certain days it was very skewed and one of the row groups had many more records than all the others. This was somehow related to the fact that we have sorted it …
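One way to keep a time-sorted layout without letting a skewed day collapse into a single oversized row group is to range-partition first and sort within partitions, capping the row group size at write time. A sketch under assumed paths and column names, not necessarily what was done here:

```python
from pyspark.sql import SparkSession

# spark.hadoop.* settings are copied into the Hadoop configuration,
# so this caps the Parquet row group size at write time.
spark = (SparkSession.builder
         .config("spark.hadoop.parquet.block.size", str(128 * 1024 * 1024))
         .getOrCreate())

df = spark.read.parquet("hdfs:///data/raw")  # hypothetical input

(df.repartitionByRange(200, "timestamp")  # evenly sized, range-ordered splits
   .sortWithinPartitions("timestamp")     # keeps min/max stats tight per group
   .write.mode("overwrite")
   .parquet("hdfs:///data/sorted"))
```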

Re: Very slow complex type column reads from parquet

2018-06-12 Thread Ryan Blue
Jakub, You're right that Spark currently doesn't use the vectorized read path for nested data, but I'm not sure that's the problem here. With 50k elements in the f1 array, it could easily be that you're getting the significant speed-up from not reading or materializing that column. The …
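A quick way to check whether the slowdown comes from materializing the arrays, rather than from the non-vectorized reader itself, is to time the same scan with and without the nested field. A rough sketch with hypothetical names:

```python
import time

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("hdfs:///data/events")  # hypothetical path

def timed(label, frame):
    start = time.time()
    frame.collect()
    print(f"{label}: {time.time() - start:.1f}s")

# Reads only the scalar column.
timed("timestamp only", df.agg(F.count("timestamp")))

# sum(size(f1)) cannot be answered from metadata alone, so the whole
# array column has to be read and materialized row by row.
timed("with nested f1", df.agg(F.sum(F.size("data.f1"))))
```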

Very slow complex type column reads from parquet

2018-06-11 Thread Jakub Wozniak
Hello, We have stumbled upon quite degraded performance when reading complex (struct, array) type columns stored in Parquet. The Parquet file is around 600 MB (snappy) with ~400k rows, with a field of a complex type { f1: array of ints, f2: array of ints } where the f1 array length is 50k …
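For anyone who wants to reproduce the shape of this data, a sketch that writes a file with the schema described above (a struct of two int arrays, f1 with 50k elements); the sizes are scaled down for a quick run, and `array_repeat` requires Spark 2.4+:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Scaled down for a quick demo; use 400_000 rows and 50_000 elements
# to match the sizes in the report.
n_rows, f1_len = 4_000, 500

df = (spark.range(n_rows)
      .withColumn("timestamp", F.col("id").cast("timestamp"))
      .withColumn("data", F.struct(
          F.array_repeat(F.col("id").cast("int"), f1_len).alias("f1"),
          F.array_repeat(F.lit(0), 10).alias("f2"))))

df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/complex_demo")
```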