How to disable input split

2014-10-17 Thread Larry Liu
Is it possible to disable input splitting if the input is already small?
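
A minimal sketch of one way to do this, assuming a SparkContext named sc and an
illustrative HDFS path: read as usual, then collapse the result to a single
partition so one task handles the whole file.

    // Collapse the input to a single partition; coalesce(1) avoids a shuffle.
    val whole = sc.textFile("hdfs:///data/small.txt").coalesce(1)

    // Or ask for a single split up front (minPartitions is a lower bound,
    // but a file smaller than one block yields a single split):
    val alsoOne = sc.textFile("hdfs:///data/small.txt", 1)

    println(whole.partitions.size)  // 1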


Re: input split size

2014-10-17 Thread Larry Liu
Thanks, Andrew. What about reading from the local file system?

On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash and...@andrewash.com wrote:

 When reading out of HDFS it's the HDFS block size.

 On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu larryli...@gmail.com wrote:

 What is the default input split size? How to change it?
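
As a sketch of how to change it: the split size comes from the underlying
Hadoop InputFormat, so it can be steered through the Hadoop configuration.
mapred.min.split.size is the classic Hadoop 1 key name, and the 256 MB value
is illustrative.

    // Raise the minimum split size so fewer, larger splits are produced
    sc.hadoopConfiguration.setLong("mapred.min.split.size", 256L * 1024 * 1024)
    val rdd = sc.textFile("hdfs:///data/big.txt")
    println(rdd.partitions.size)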





wordcount job slow while input from NFS mount

2014-12-17 Thread Larry Liu
A word-count job on a roughly 1 GB text file takes 1 hour when the input comes
from an NFS mount. The same job took 30 seconds when the input came from the
local file system.

Is there any tuning required for an NFS-mounted input?

Thanks
Larry
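
For reference, a minimal sketch of the kind of job described, with illustrative
paths; note that a file:// path must be visible at the same location on every
worker node.

    import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark < 1.3

    // Word count over a file on a locally mounted path such as the NFS mount
    val counts = sc.textFile("file:///mnt/nfs/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("file:///mnt/nfs/wordcounts")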



Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Larry Liu
Hi, Matei

Thanks for your response.

I tried copying the file (1 GB) from NFS, and it took 10 seconds. The NFS mount
is on a LAN, and the NFS server runs on the same machine as Spark. So
basically I mount the NFS share on the same bare-metal machine.

Larry

On Wed, Dec 17, 2014 at 11:42 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 The problem is very likely NFS, not Spark. What kind of network is it
 mounted over? You can also test the performance of your NFS by copying a
 file from it to a local disk or to /dev/null and seeing how many bytes per
 second it can copy.

 Matei

  On Dec 17, 2014, at 9:38 AM, Larryliu larryli...@gmail.com wrote:
 
  A word-count job on a roughly 1 GB text file takes 1 hour when the input
  comes from an NFS mount. The same job took 30 seconds when the input came
  from the local file system.
 
  Is there any tuning required for an NFS-mounted input?
 
  Thanks
 
  Larry
 
 
 

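A rough way to run the read-throughput test Matei suggests, as a plain Scala
sketch; the mount path is illustrative.

    import java.io.{BufferedInputStream, FileInputStream}

    // Stream the file once and report MB/s, the moral equivalent of
    // copying it to /dev/null.
    val buf = new Array[Byte](1 << 20)  // 1 MB read buffer
    val in = new BufferedInputStream(new FileInputStream("/mnt/nfs/input.txt"))
    val start = System.nanoTime()
    var total = 0L
    var n = in.read(buf)
    while (n >= 0) { total += n; n = in.read(buf) }
    in.close()
    val secs = (System.nanoTime() - start) / 1e9
    println(f"${total / 1e6 / secs}%.1f MB/s over $secs%.1f s")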



Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Larry Liu
Thanks, Matei.

I will give it a try.

Larry

On Wed, Dec 17, 2014 at 1:01 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 I see, you may have something else configured weirdly then. You should
 look at CPU and disk utilization while your Spark job is reading from NFS
 and, if you see high CPU use, run jstack to see where the process is
 spending time. Also make sure Spark's local work directories
 (spark.local.dir) are not on NFS. They shouldn't be by default, though; that
 defaults to /tmp.

 Matei

 On Dec 17, 2014, at 11:56 AM, Larry Liu larryli...@gmail.com wrote:

 Hi, Matei

 Thanks for your response.

 I tried copying the file (1 GB) from NFS, and it took 10 seconds. The NFS mount
 is on a LAN, and the NFS server runs on the same machine as Spark. So
 basically I mount the NFS share on the same bare-metal machine.

 Larry

 On Wed, Dec 17, 2014 at 11:42 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 The problem is very likely NFS, not Spark. What kind of network is it
 mounted over? You can also test the performance of your NFS by copying a
 file from it to a local disk or to /dev/null and seeing how many bytes per
 second it can copy.

 Matei

  On Dec 17, 2014, at 9:38 AM, Larryliu larryli...@gmail.com wrote:
 
   A word-count job on a roughly 1 GB text file takes 1 hour when the input
   comes from an NFS mount. The same job took 30 seconds when the input came
   from the local file system.
  
   Is there any tuning required for an NFS-mounted input?
 
  Thanks
 
  Larry
 
 
 
 
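A minimal sketch of the spark.local.dir check Matei mentions, keeping Spark's
scratch space on a local disk rather than the NFS mount; the paths and app name
are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    // Shuffle and spill files land under spark.local.dir, so it should
    // always point at a local filesystem path, never at NFS.
    val conf = new SparkConf()
      .setAppName("wordcount")
      .set("spark.local.dir", "/tmp/spark-scratch")
    val sc = new SparkContext(conf)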





Re: How to use more executors

2015-01-21 Thread Larry Liu
Will SPARK-1706 be included in the next release?

On Wed, Jan 21, 2015 at 2:50 PM, Ted Yu yuzhih...@gmail.com wrote:

 Please see SPARK-1706

 On Wed, Jan 21, 2015 at 2:43 PM, Larry Liu larryli...@gmail.com wrote:

  I tried to submit a job with --conf spark.cores.max=6 or
  --total-executor-cores 6 on a standalone cluster, but I don't see more than
  one executor on each worker. I am wondering how to use multiple executors
  when submitting jobs.

 Thanks
 larry
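
A minimal sketch of the settings involved; the values are illustrative. Until
SPARK-1706, a standalone-mode worker runs at most one executor per application,
so spark.cores.max spreads cores across more workers rather than adding
executors on a single worker.

    import org.apache.spark.SparkConf

    // Cap the total cores the application may take on a standalone cluster
    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("more-executors")
      .set("spark.cores.max", "6")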





How to use more executors

2015-01-21 Thread Larry Liu
I tried to submit a job with --conf spark.cores.max=6 or
--total-executor-cores 6 on a standalone cluster, but I don't see more than
one executor on each worker. I am wondering how to use multiple executors
when submitting jobs.

Thanks
larry


where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Larry Liu
I would like to persist an RDD to HDFS or an NFS mount. How do I change the
location?


Re: where storagelevel DISK_ONLY persists RDD to

2015-01-25 Thread Larry Liu
Hi, Charles

Thanks for your reply.

Is it possible to persist an RDD to HDFS? What is the default location when
persisting an RDD with storage level DISK_ONLY?

On Sun, Jan 25, 2015 at 6:26 AM, Charles Feduke charles.fed...@gmail.com
wrote:

 I think you want to instead use `.saveAsSequenceFile` to save an RDD to
 someplace like HDFS or NFS if you are attempting to interoperate with
 another system, such as Hadoop. `.persist` is for keeping the contents of
 an RDD around so future uses of that particular RDD don't need to
 recalculate its composite parts.


 On Sun Jan 25 2015 at 3:36:31 AM Larry Liu larryli...@gmail.com wrote:

 I would like to persist an RDD to HDFS or an NFS mount. How do I change the
 location?
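
A minimal sketch contrasting the two, with illustrative paths: persist with
DISK_ONLY writes scratch copies under spark.local.dir on each executor, while
saveAsSequenceFile writes a durable copy to HDFS.

    import org.apache.spark.SparkContext._  // Writable implicits on Spark < 1.3
    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input.txt")

    // Scratch copy on each executor's local disk, under spark.local.dir;
    // reused by later actions on this same RDD, not meant for sharing.
    rdd.persist(StorageLevel.DISK_ONLY)

    // Durable copy in HDFS that other systems such as Hadoop can read:
    rdd.map(line => (line.length, line)).saveAsSequenceFile("hdfs:///data/out-seq")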




Re: Shuffle to HDFS

2015-01-25 Thread Larry Liu
Hi,Jerry

Thanks for your reply.

The reason I have this question is that, as I understand it, Hadoop stores the
mapper's intermediate (shuffle) output in HDFS. For Spark, I think the default
location is /tmp.

Larry

On Sun, Jan 25, 2015 at 9:44 PM, Shao, Saisai saisai.s...@intel.com wrote:

  Hi Larry,



 I don’t think Spark’s current shuffle can support HDFS as a shuffle
 output. Anyway, is there any specific reason to spill shuffle data to HDFS
 or NFS? This will severely increase the shuffle time.



 Thanks

 Jerry



 *From:* Larry Liu [mailto:larryli...@gmail.com]
 *Sent:* Sunday, January 25, 2015 4:45 PM
 *To:* u...@spark.incubator.apache.org
 *Subject:* Shuffle to HDFS



 How do I change the shuffle output location to HDFS or NFS?
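
Shuffle files go to spark.local.dir regardless, but as a hedged sketch of one
supported way to land intermediate RDD data on HDFS, checkpointing can be used;
the paths are illustrative and a SparkContext named sc is assumed.

    import org.apache.spark.SparkContext._  // pair-RDD implicits on Spark < 1.3

    // Checkpointing materializes an RDD into the checkpoint directory on HDFS
    sc.setCheckpointDir("hdfs:///spark-checkpoints")

    val counts = sc.textFile("hdfs:///data/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.checkpoint()  // written to HDFS when the next action runs
    counts.count()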