Re: Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-12 Thread Cheng Lian
Thanks for reporting this! Would you mind opening JIRA tickets for both Spark and Parquet? I'm not sure whether Parquet declares anywhere that users mustn't reuse byte arrays when using the binary type. If it does, then it's a Spark bug. Either way, this should be fixed. Cheng On 4/12/15 1:50 PM,

Parquet File Binary column statistics error when reuse byte[] among rows

2015-04-11 Thread Yijie Shen
Hi, Suppose I create a dataRDD which extends RDD[Row], and each row is a GenericMutableRow(Array(Int, Array[Byte])). The same Array[Byte] object is reused among rows but holds different content each time. When I convert it to a DataFrame and save it as a Parquet file, the file's row group statistics (max
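
For readers following the thread, here is a minimal sketch of one way a reused byte[] could corrupt a max statistic. It assumes the statistics collector keeps a reference to the caller's array instead of a copy; that is an assumption for illustration, not something confirmed in this thread. The class ReusedBufferStats and its methods are hypothetical and are not Parquet or Spark code.

```java
public class ReusedBufferStats {

    // Lexicographic comparison of byte arrays, treating bytes as unsigned.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // Feeds each character of `rows` through ONE shared one-byte buffer,
    // the way a writer might reuse a byte[] among rows, and tracks the max.
    // copyOnUpdate = false keeps a reference to the shared buffer (hypothetical bug);
    // copyOnUpdate = true stores a defensive copy.
    static String maxOverReusedBuffer(String rows, boolean copyOnUpdate) {
        byte[] buffer = new byte[1]; // reused for every row
        byte[] max = null;
        for (int i = 0; i < rows.length(); i++) {
            buffer[0] = (byte) rows.charAt(i); // overwrite in place
            if (max == null || compare(buffer, max) > 0) {
                max = copyOnUpdate ? buffer.clone() : buffer;
            }
        }
        return new String(max, java.nio.charset.StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Rows "b", "c", "a": the true max is "c".
        System.out.println(maxOverReusedBuffer("bca", false)); // prints "a"
        System.out.println(maxOverReusedBuffer("bca", true));  // prints "c"
    }
}
```

When the reference is kept, the tracked max aliases the shared buffer, so every later comparison is the buffer against itself and the reported max is simply whatever the last row wrote. Copying on update (or never reusing the array on the writer side) avoids this.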