Strange results of running Spark GenSort.scala

Sam Liu Sun, 28 Dec 2014 04:59:32 -0800

Hi Experts,
I am confused about the input parameters of GenSort.scala and have run into
some strange results.
It requires 3 parameters: "[num-parts] [records-per-part] [output-path]".
As with Hadoop, I assume each row (record) of the sort file is 100 bytes. So if
I want to generate and sort 100 GB of data using 4 partitions, is it correct to
set the parameters to '4, 268435456, /tmp/sort-output'? I computed the number
of records (rows) as follows:


100 GB = 107374182400 bytes
       = 1073741824 rows * 100 bytes/row
       = 268435456 rows/partition * 4 partitions * 100 bytes/row

So each partition should generate 268435456 rows (records), right?
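
To double-check the arithmetic, here is a minimal sketch of the sizing
calculation. It assumes 100-byte records and 1 GB = 2**30 bytes; both are my
assumptions, not something I have confirmed in GenSort.scala. The variable
names are only illustrative.

```python
# Sizing sketch: assumes 100-byte records and 1 GB = 2**30 bytes.
RECORD_BYTES = 100           # assumed record size
TARGET_BYTES = 100 * 2**30   # 100 GB = 107374182400 bytes
NUM_PARTS = 4                # number of partitions

# Total records needed, then the share for each partition.
total_records = TARGET_BYTES // RECORD_BYTES
records_per_part = total_records // NUM_PARTS

# The three values in the order GenSort expects them:
# [num-parts] [records-per-part] [output-path]
print(NUM_PARTS, records_per_part, "/tmp/sort-output")
```

This prints the same values I passed: 4, 268435456 and /tmp/sort-output.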


However, if I save the output as a sequence file, the total output size is
only 20.8 GB (5.2 GB * 4 partitions). If I save the output as a text file
instead, the total output size is 309.2 GB (77.3 GB * 4 partitions), not
100 GB. Why?
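
For reference, the sizes above can be turned into an average number of bytes
written per record. This is a rough sketch that assumes the reported GB are
binary (2**30 bytes) and that all requested records were actually written. It
only restates the numbers and does not explain the difference.

```python
# Average on-disk bytes per record, from the sizes reported above.
GIB = 2**30                      # assumed meaning of "GB" in the reported sizes
total_records = 4 * 268435456    # records requested across all 4 partitions

observed_gb = {"sequence file": 20.8, "text file": 309.2}

bytes_per_record = {
    fmt: gb * GIB / total_records for fmt, gb in observed_gb.items()
}

for fmt, value in bytes_per_record.items():
    print(f"{fmt}: {value:.1f} bytes/record (expected 100)")
```

Neither figure is close to the 100 bytes per record I expected.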

Thanks!

--------------------------------
Sam Liu
