Re: Spark performance comparison for research

2016-02-29 Thread Jörn Franke
I am not sure what you are comparing here. You would need to provide additional details, such as the algorithms and functionality supported by your framework. For instance, Spark has built-in fault tolerance and is a generic framework, which has advantages with respect to development and operations, but …

Re: Spark performance comparison for research

2016-02-29 Thread Reynold Xin
That seems reasonable, but it seems pretty unfair to the HPC setup that the master is reading all the data. Basically, you can make HPC perform arbitrarily worse by just adding more nodes to Spark. On Monday, February 29, 2016, yasincelik wrote: …

Spark performance comparison for research

2016-02-29 Thread yasincelik
Hello, I am working on a project as part of my research. The system I am working on is basically an in-memory computing system, and I want to compare its performance with Spark. Here is how I conduct the experiments. For my project: I have a software-defined network (SDN) that allows HPC applications to …

Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
I have created a JIRA for this feature; comments and feedback are welcome on how to improve it and whether it's valuable for users. https://issues.apache.org/jira/browse/SPARK-13587 Here's some background info and the status of this work. Currently, it's not easy for users to add third-party …
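Until something along those lines lands, a common workaround is to zip the dependencies and ship them with the job via --py-files; a minimal sketch of the submit command, with hypothetical file names:

    spark-submit --master yarn --py-files deps.zip my_job.py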

Re: What should be spark.local.dir in spark on yarn?

2016-02-29 Thread Jeff Zhang
In YARN mode, spark.local.dir is superseded by yarn.nodemanager.local-dirs, which is used for shuffle data and block-manager disk data. What do you mean by "But output files to upload to s3 still created in /tmp on slaves"? If that means your job's output, you have control over where it is stored. On Tue, Mar …
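One likely source of the /tmp files is the Hadoop S3 client itself, which buffers blocks on local disk before uploading. A minimal sketch of redirecting that buffer, assuming the s3a scheme and hypothetical paths (fs.s3.buffer.dir is the s3/s3n equivalent):

    import org.apache.spark.SparkConf

    // The spark.hadoop. prefix propagates the setting into the Hadoop
    // configuration; this points the S3 upload buffer at the data disks
    // instead of /tmp.
    val conf = new SparkConf()
      .set("spark.hadoop.fs.s3a.buffer.dir", "/data01/s3tmp,/data02/s3tmp")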

What should be spark.local.dir in spark on yarn?

2016-02-29 Thread Alexander Pivovarov
I have Spark on YARN. I defined yarn.nodemanager.local-dirs as /data01/yarn/nm,/data02/yarn/nm. When I look at a YARN executor container log, I see that the block-manager files are created in /data01/yarn/nm and /data02/yarn/nm, but the output files to be uploaded to S3 are still created in /tmp on the slaves. I do not want …

Re: Mapper side join with DataFrames API

2016-02-29 Thread Deepak Gopalakrishnan
Hello All, Just to add a bit more context to this question: I have a join as stated above, and I see the below in my executor logs: 16/02/29 17:02:35 INFO TaskSetManager: Finished task 198.0 in stage 7.0 (TID 1114) in 20354 ms on localhost (196/200) 16/02/29 17:02:35 INFO …

Mapper side join with DataFrames API

2016-02-29 Thread Deepak Gopalakrishnan
Hello, I'm trying to join two DataFrames A and B with sqlContext.sql("SELECT * FROM A INNER JOIN B ON A.a=B.a"). What I have done is call registerTempTable for A and B after loading these DataFrames from different sources. I need the join to be really fast, and I was wondering if …
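If B is small enough to fit on every executor, a map-side (broadcast) join avoids the shuffle entirely; a minimal sketch with the DataFrame API, assuming dfA and dfB are the frames registered as A and B:

    import org.apache.spark.sql.functions.broadcast

    // Hint the planner to replicate the small side to every executor,
    // turning the shuffle join into a local hash join on the mappers.
    val joined = dfA.join(broadcast(dfB), dfA("a") === dfB("a"))

Spark also broadcasts automatically when a side is below spark.sql.autoBroadcastJoinThreshold (10 MB by default).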

Re: Spark log4j fully qualified class name

2016-02-29 Thread Steve Loughran
On 27 Feb 2016, at 20:40, Prabhu Joseph wrote: Hi All, When I change the Spark log4j.properties conversion pattern to show the fully qualified class name, all the logs have the FQCN org.apache.spark.Logging. The actual …
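In log4j's PatternLayout, %C walks the stack to find the class that issued the logging call, and in Spark that is the shared org.apache.spark.Logging trait; the logger name, however, is set to the real class, so %c reports what you want. A sketch of the pattern change, assuming the default console appender name:

    # Use %c (logger name) rather than %C (calling class)
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n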

Re: Control the stdout and stderr streams in a executor JVM

2016-02-29 Thread Anuruddha Premalal
Hi, You can create a log4j.properties for the executors and use "--files log4j.properties" when submitting. In the case where we are initializing the Spark context via Java, how can we pass the same parameter? jsc = new JavaSparkContext(conf); Is it possible to set this parameter in …
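The --files list maps to the spark.files property, so it can be set on the SparkConf before the context is created; a minimal sketch (in Scala, though the Java SparkConf calls are identical, and the path is hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.api.java.JavaSparkContext

    val conf = new SparkConf()
      // Equivalent of --files: ships log4j.properties to each executor's working dir
      .set("spark.files", "/path/to/log4j.properties")
      // Point the executor JVMs at the shipped file
      .set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=file:log4j.properties")
    val jsc = new JavaSparkContext(conf)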