Re: Parquet performance tuning for help

2018-02-14 Thread Gabor Szadovszky
Hi, The old statistics have several problems. The sort order was not properly defined, and per the specification it ignores the logical type, which can change the order (e.g. UTF8 vs. DECIMAL for the primitive type BINARY). See PARQUET-686
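
To illustrate the ordering problem, here is a minimal Python sketch (not Parquet code; the helper name `decimal_from_binary` is hypothetical). It assumes UTF8 values are ordered by unsigned bytewise comparison, and that DECIMAL values stored as BINARY are big-endian two's-complement unscaled integers. The same two byte strings then give a different minimum depending on the logical type:

```python
def decimal_from_binary(raw: bytes) -> int:
    """Read bytes as a big-endian two's-complement unscaled decimal value."""
    return int.from_bytes(raw, byteorder="big", signed=True)


a = b"\x01"  # as DECIMAL: unscaled value 1
b = b"\xff"  # as DECIMAL: unscaled value -1

# Python compares bytes objects lexicographically as unsigned bytes,
# which matches the ordering assumed here for UTF8 strings.
bytewise_min = min(a, b)

# Comparing by the signed numeric value instead flips the result.
decimal_min = min(a, b, key=decimal_from_binary)

print("min by unsigned bytes:", bytewise_min)
print("min by decimal value: ", decimal_min)
```

A min/max statistic computed with one ordering is therefore not safe to use for filtering under the other.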

Re: Parquet performance tuning for help

2018-02-13 Thread Siva Gudavalli
Hello, 3 MB files are too small for Parquet; try increasing the file size. Keep an eye on statistics. In our case, we haven’t seen statistics being generated for string data types, so queries on those columns fall back to a scan. Regards Shiv > On Feb 12, 2018, at 9:24 PM, ilegend <511618...@qq.com> wrote: > > Hi
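
As a rough sketch of the "increase the size" advice, the snippet below estimates how many output files a dataset should be written as to reach a target file size. The function name `target_file_count`, the 128 MB target (a common HDFS block size, assumed here rather than taken from the thread), and the example of 10,000 input files are all illustrative assumptions:

```python
import math

MB = 1024 * 1024


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * MB) -> int:
    """Number of output files needed so each is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


# Hypothetical dataset: 10,000 files of 3 MB each.
total = 10_000 * 3 * MB
n = target_file_count(total)
print(n)
```

In Spark, a count like this could be passed to `DataFrame.repartition(n)` before writing, so that each task writes one larger file instead of many small ones.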

Parquet performance tuning for help

2018-02-13 Thread ilegend
Hi guys, We're testing Parquet performance for our big data environment. Parquet is better than ORC, but we believe Parquet has more potential. Any comments and suggestions are welcome. The test environment is as follows: 1. Server: 48 cores + 256 GB memory. 2. Spark 2.1.0 + HDFS 2.6.0 +