Hi all,
We are using Parquet Hive tables, and we are upgrading to Spark 1.3. We
find that even a simple COUNT(*) query is much slower (~100x) than on
Spark 1.2.
Most of the time is spent on the driver fetching HDFS block locations. A
large number of log lines like the following are printed:
15/03/30 23:03:43 DEBUG ProtobufRpcEngine: Call: getBlockLocations took 2097ms
15/03/30 23:03:43 DEBUG DFSClient: newInfo = LocatedBlocks{
fileLength=77153436
underConstruction=false
blocks=[LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275;
getBlockSize()=77153436; corrupt=false; offset=0;
locs=[10.152.116.172:50010, 10.152.116.169:50010,
10.153.125.184:50010]}]
lastLocatedBlock=LocatedBlock{BP-1236294426-10.152.90.181-1425290838173:blk_1075187948_1448275;
getBlockSize()=77153436; corrupt=false; offset=0;
locs=[10.152.116.169:50010, 10.153.125.184:50010,
10.152.116.172:50010]}
isLastBlockComplete=true}
15/03/30 23:03:43 DEBUG DFSClient: Connecting to datanode 10.152.116.172:50010
Comparing with the Spark 1.2 logs, the number of getBlockLocations calls
is similar, but each call now takes 2000~3000 ms instead of 20~30 ms, and
Spark 1.2 did not print the detailed LocatedBlocks info.
Another finding: if I read the Parquet file directly from the spark-shell
with the Scala code below, it works fine, and the computation returns the
result as quickly as before.
sqlContext.parquetFile("data/myparquettable")
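For clarity, a sketch of the two access paths being compared, using the Spark 1.3 DataFrame API; the path is the one above, and "myparquettable_direct" is just an illustrative temp-table name:

// Fast path observed: read the Parquet file directly, bypassing the
// Hive metastore table.
val df = sqlContext.parquetFile("data/myparquettable")
df.registerTempTable("myparquettable_direct")
sqlContext.sql("SELECT COUNT(*) FROM myparquettable_direct").show()

// Slow path observed: the same COUNT(*) against the Parquet-backed
// Hive table (name here is hypothetical):
// sqlContext.sql("SELECT COUNT(*) FROM myparquettable").show()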
Any ideas about this? Thank you!
--
郑旭东
Zheng, Xudong