Re: parallelize method v.s. textFile method

2015-06-24 Thread Reynold Xin
How did you exclude it? I am not sure if it is possible since each task needs to contain the chunk of data.

Re: parallelize method v.s. textFile method

2015-06-24 Thread xing
When we compared the performance, we had already excluded this part of the time difference.

Re: parallelize method v.s. textFile method

2015-06-24 Thread Reynold Xin
If you read the file chunk by chunk and then use parallelize, it is read by a single thread on a single machine.
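
To make the difference concrete, here is a minimal sketch of the two approaches discussed in this thread. It assumes PySpark and an existing SparkContext passed in as `sc`. The helper names (`read_chunks`, `total_chars_via_parallelize`, `total_chars_via_textfile`), the chunk size, and the per-line `len` computation are hypothetical stand-ins for illustration, not the original poster's code.

```python
from itertools import islice


def read_chunks(path, lines_per_chunk):
    """Yield lists of lines from a local file.

    This loop runs in a single thread on the driver machine.
    """
    with open(path) as f:
        while True:
            chunk = [line.rstrip("\n") for line in islice(f, lines_per_chunk)]
            if not chunk:
                return
            yield chunk


def total_chars_via_parallelize(sc, path, lines_per_chunk=100_000):
    """Chunked approach: the driver reads every chunk itself, the chunk
    data is shipped out with the tasks, and one Spark job runs per chunk."""
    total = 0
    for chunk in read_chunks(path, lines_per_chunk):
        dist_data = sc.parallelize(chunk)   # data travels driver -> executors
        total += dist_data.map(len).sum()   # one job per chunk
    return total


def total_chars_via_textfile(sc, path):
    """textFile approach: executors read their own splits of the file
    in parallel, and the whole computation is a single job."""
    return sc.textFile(path).map(len).sum()
```

With a real cluster, something like `sc = SparkContext(appName="compare")` followed by calls to both functions on the same path should return the same total, but in the chunked version the file read and the shipping of each chunk are serialized through the driver, which is hard to factor out of a timing comparison.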

parallelize method v.s. textFile method

2015-06-24 Thread xing
We have a large file. We used to read it in chunks, call the parallelize method on each chunk (distData = sc.parallelize(chunk)), and then run the map/reduce chunk by chunk. Recently we read the whole file using the textFile method instead and found that the map/reduce job is much faster. Can anybody help us understand why? We