Hello,
I use Spark 1.4.1 and Hadoop 2.2.0.
This may be a basic question, but I cannot understand why the Hadoop option
"dfs.blocksize" sometimes does not affect the number of blocks.
When I run the script below,
import sqlContext.implicits._ // needed for toDF (the spark-shell imports this automatically)
val BLOCK_SIZE = 1024 * 1024 * 512 // 512 MB; the Hadoop default is 128 MB
sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE)
sc.hadoopConfiguration.setInt("dfs.blocksize", BLOCK_SIZE)
sc.parallelize(1 to 500000000, 24).repartition(3).toDF.saveAsTable("partition_test")
it creates these 3 files:
221.1 M /user/hive/warehouse/partition_test/part-r-00001.gz.parquet
221.1 M /user/hive/warehouse/partition_test/part-r-00002.gz.parquet
221.1 M /user/hive/warehouse/partition_test/part-r-00003.gz.parquet
To check how many blocks a file consists of, I run the command "hdfs fsck
/user/hive/warehouse/partition_test/part-r-00001.gz.parquet -files -blocks", which reports:
Total blocks (validated): 1 (avg. block size 231864402 B)
This is the expected result, because the maximum block size changed from 128 MB to 512 MB.
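(The file is 231864402 B, well under the 512 MB limit, so one block is expected; with the
128 MB default it would have been split into ceil(231864402 / 134217728) = 2 blocks.)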
In my real workload, I have a bunch of files like these:
14.4 M /user/hive/warehouse/data_1g/part-r-00001.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00002.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00003.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00004.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00005.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00006.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00007.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00008.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00009.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00010.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00011.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00012.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00013.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00014.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00015.gz.parquet
14.4 M /user/hive/warehouse/data_1g/part-r-00016.gz.parquet
Each file consists of 1 block (avg. block size 15141395 B), and I run almost the same
code as before:
val BLOCK_SIZE = 1024 * 1024 * 512 // 512 MB; the Hadoop default is 128 MB
sc.hadoopConfiguration.setInt("parquet.block.size", BLOCK_SIZE)
sc.hadoopConfiguration.setInt("dfs.blocksize", BLOCK_SIZE)
sqlContext.table("data_1g").repartition(1).saveAsTable("partition_test2")
It creates one file.
231.0 M /user/hive/warehouse/partition_test2/part-r-00001.gz.parquet
But it consists of 2 blocks, so it seems dfs.blocksize was not applied:
/user/hive/warehouse/partition_test2/part-r-00001.gz.parquet 242202143 bytes, 2 block(s): OK
0. BP-2098986396-192.168.100.1-1389779750403:blk_1080124727_6385839 len=134217728 repl=2
1. BP-2098986396-192.168.100.1-1389779750403:blk_1080124728_6385840 len=107984415 repl=2
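Note that the first block is exactly 134217728 B, i.e. the 128 MB default
(134217728 + 107984415 = 242202143 B), so the file appears to have been written with the
default block size rather than the 512 MB I set.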
Because of this, Spark reads the file as 2 partitions even though I repartitioned the data
into 1 partition. And if the file size after repartitioning is a little more than 128 MB
and I save it again, it writes 2 files of roughly 128 MB and 1 MB.
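For reference, this is roughly how the problem shows up (just a sketch in the spark-shell;
partition_test2 is the table created above, and the commented results reflect the numbers
reported above):

// Reading the table back: the number of input partitions ends up matching the 2 HDFS blocks,
// so the single ~231 MB file comes back as 2 partitions instead of 1.
val df = sqlContext.table("partition_test2")
println(df.rdd.partitions.length) // 2 in my case, even though I repartitioned to 1 before saving

// The value is present in the driver-side Hadoop configuration, so it looks like
// this write path simply does not use it.
println(sc.hadoopConfiguration.get("dfs.blocksize")) // "536870912"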
This is very important for me because I use the repartition method many times. Please help
me figure out what is going on.
Jung