Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
This will get you started: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Mon, Jul 13, 2015 at 5:29 PM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: Hello everyone, I am interested in contributing to Apache Spark. I

problems with build of the latest master

2015-07-14 Thread Gil Vernik
I just did a checkout of master and tried to build it with mvn -Dhadoop.version=2.6.0 -DskipTests clean package. Got: [ERROR] /Users/gilv/Dev/Spark/spark/core/src/test/java/org/apache/spark/shuffle/unsafe/UnsafeShuffleWriterSuite.java:117: error: cannot find symbol [ERROR]

Re: Contribution and choice of language

2015-07-14 Thread Akhil Das
You can try to resolve some JIRA issues; to start with, try out some newbie JIRAs. Thanks Best Regards On Tue, Jul 14, 2015 at 4:10 PM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: I saw the contribution sections. As a new contributor, should I try to build patches or can I add

Re: Contribution and choice of language

2015-07-14 Thread srinivasraghavansr71
I saw the contribution sections. As a new contributor, should I try to build patches, or can I add some new algorithms to MLlib? I am comfortable with Python and R. Are they enough to contribute to Spark?

Re: question related to partitions of the DataFrame

2015-07-14 Thread Gil Vernik
I see that the most recent code doesn't have RDDApi anymore, but I would still like to understand the logic of partitions of a DataFrame. Does a DataFrame have its own partitions and act as a sort of RDD by itself, or does it depend on the partitions of the underlying RDD that was used to load the data? For

Re: problems with build of the latest master

2015-07-14 Thread Gil Vernik
I figured it out. I tried to build Spark configured to access OpenStack Swift, and hadoop-openstack.jar has the same issue as was described here: https://github.com/apache/spark/pull/7090/commits So for those who want to build Spark 1.5 master with OpenStack Swift support, just remove

Re: problems with build of the latest master

2015-07-14 Thread Ted Yu
Looking at Jenkins, the master branch compiles. Can you try the following command? mvn -Phive -Phadoop-2.6 -DskipTests clean package What version of Java are you using? Cheers On Tue, Jul 14, 2015 at 2:23 AM, Gil Vernik g...@il.ibm.com wrote: I just did checkout of the master and tried to
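
For readability, here is the suggested command pulled out of the thread onto its own line:

    mvn -Phive -Phadoop-2.6 -DskipTests clean package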

Re: Contribution and choice of language

2015-07-14 Thread Rakesh Chalasani
Here is a more specific MLlib related Umbrella for 1.5 that can help you get started https://issues.apache.org/jira/browse/SPARK-8445?jql=text%20~%20%22mllib%201.5%22 Rakesh On Tue, Jul 14, 2015 at 6:52 AM Akhil Das ak...@sigmoidanalytics.com wrote: You can try to resolve some Jira issues, to

Re: BlockMatrix multiplication

2015-07-14 Thread Rakesh Chalasani
BlockMatrix stores the data as key-Matrix pairs, and multiply does a reduceByKey operation, aggregating matrices per key. Since you said each block resides in a separate partition, reduceByKey might effectively be shuffling all of the data. A better way to go about this is to allow multiple
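
For context, a minimal Scala sketch of the API being discussed (the block sizes and values here are made up for illustration; this shows usage, not the internals of multiply):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    object BlockMultiply {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("block-multiply").setMaster("local[2]"))
        // The key-Matrix layout described above: an RDD of
        // ((blockRowIndex, blockColIndex), matrix) pairs.
        val aBlocks = sc.parallelize(Seq(
          ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))),
          ((1, 0), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))))
        val a = new BlockMatrix(aBlocks, 2, 2).cache() // 4 x 2, two row-blocks
        val bBlocks = sc.parallelize(Seq(
          ((0, 0), Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0)))))
        val b = new BlockMatrix(bBlocks, 2, 2) // 2 x 2, a single block
        // multiply shuffles blocks so that matching row/column blocks meet.
        val product = a.multiply(b)
        println(product.toLocalMatrix())
        sc.stop()
      }
    }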

Re: Spark Core and ways of talking to it for enhancing application language support

2015-07-14 Thread Shivaram Venkataraman
Both SparkR and the PySpark API call into the JVM Spark API (i.e. JavaSparkContext, JavaRDD etc.). They use different methods (Py4J vs. the R-Java bridge) to call into the JVM based on libraries available / features supported in each language. So for Haskell, one would need to see what is the best
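
As a rough illustration of that JVM-facing surface (the app name and local master are placeholders): PySpark and SparkR ultimately drive these Java-friendly wrappers rather than the Scala API directly.

    import org.apache.spark.SparkConf
    import org.apache.spark.api.java.JavaSparkContext

    object JvmSurface {
      def main(args: Array[String]): Unit = {
        // This wrapper API (JavaSparkContext, JavaRDD, ...) is what Py4J
        // and the R-Java bridge call into from the other language.
        val jsc = new JavaSparkContext(new SparkConf().setAppName("bridge").setMaster("local[2]"))
        val rdd = jsc.parallelize(java.util.Arrays.asList(1, 2, 3))
        println(rdd.count())
        jsc.stop()
      }
    }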

Re: BlockMatrix multiplication

2015-07-14 Thread Rakesh Chalasani
Hi Alexander: Aw, I missed the 'cogroup' on BlockMatrix multiply! I stand corrected. Check https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala#L361 BlockMatrix multiply uses a custom

Re: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
Hi Rakesh, Thanks for the suggestion. Each block of the original matrix is in a separate partition. Each block of the transposed matrix is also in a separate partition. The partition numbers are the same for the blocks that undergo multiplication. Each partition is on a separate worker. Basically, I want to

Re: Foundation policy on releases and Spark nightly builds

2015-07-14 Thread Sean Busbey
Responses inline, with some liberties on ordering. On Sun, Jul 12, 2015 at 10:32 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Sean B, Would you mind outlining for me how we go about changing this policy - I think it's outdated and doesn't make much sense. Ideally I'd like to propose a

RE: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
Hi Rakesh, I am not interested in a particular case of A^T*A. This case is a handy setup so I don’t need to create another matrix and force the blocks to co-locate. Basically, I am trying to understand the effectiveness of BlockMatrix for multiplication of distributed matrices. It seems that I

Re: BlockMatrix multiplication

2015-07-14 Thread Burak Yavuz
Hi Alexander, From your example code, using the GridPartitioner, you will have 1 column and 5 rows. When you perform an A^T * A multiplication, you will generate a separate GridPartitioner with 5 columns and 5 rows. Therefore you are observing a huge shuffle. If you would generate a diagonal-block
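
A hedged sketch of the layout described in this thread, with 5 row-blocks in a single column of blocks and one block per partition (dimensions are shrunk so it runs locally):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    object TallSkinnyAtA {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ata").setMaster("local[5]"))
        val n = 100 // block size; the real experiment would use much larger blocks
        // 5 row-blocks in one column, each block in its own partition.
        val blocks = sc.parallelize(0 until 5, numSlices = 5).map { i =>
          ((i, 0), Matrices.rand(n, n, new java.util.Random(i)))
        }
        val a = new BlockMatrix(blocks, n, n).cache()
        // A^T * A: every row-block must meet every other, hence the large
        // shuffle observed when each block lives in a different partition.
        val ata = a.transpose.multiply(a)
        println(ata.toLocalMatrix().numRows) // n
        sc.stop()
      }
    }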

Re: question related to partitions of the DataFrame

2015-07-14 Thread Eugene Morozov
Gil, I’d say that a DataFrame is the result of a transformation of some other RDD. Your input RDD might contain strings and numbers, but as a result of the transformation you end up with an RDD that contains GenericRowWithSchema, which is what a DataFrame actually is. So, I’d say that a DataFrame is just sort
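
A minimal sketch of that relationship, assuming a local master and toy data:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object DataFrameAsRdd {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("df-rdd").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // A DataFrame built from an ordinary RDD of tuples.
        val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("key", "value")

        // df.rdd exposes the underlying RDD[Row]; the DataFrame's
        // partitions are the partitions of that RDD.
        println(df.rdd.partitions.length)
        println(df.repartition(4).rdd.partitions.length) // 4 after a shuffle
        sc.stop()
      }
    }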

Re: Spark Core and ways of talking to it for enhancing application language support

2015-07-14 Thread Vasili I. Galchin
thanks. On Tuesday, July 14, 2015, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Both SparkR and the PySpark API call into the JVM Spark API (i.e. JavaSparkContext, JavaRDD etc.). They use different methods (Py4J vs. the R-Java bridge) to call into the JVM based on libraries

Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread Tathagata Das
1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app with updateStateByKey will not start until you set a checkpoint directory. 2. The
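
A minimal sketch of the setup being described (the checkpoint path, host, and port are assumptions):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulWordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("stateful").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))

        // Required for updateStateByKey: state RDDs are periodically
        // checkpointed here to cut the lineage and recover on failure.
        ssc.checkpoint("/tmp/streaming-checkpoints")

        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split(" ")).map((_, 1))
          .updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
            Some(newValues.sum + state.getOrElse(0))
          }
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }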

Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread Tathagata Das
BTW, this is more like a user-list kind of mail than a dev-list one. The dev-list is for Spark developers. On Tue, Jul 14, 2015 at 4:23 PM, Tathagata Das t...@databricks.com wrote: 1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically saves the state RDD (which is a

Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread swetha
OK. Thanks a lot TD. -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Does-RDD-checkpointing-store-the-entire-state-in-HDFS-tp7368p13231.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

RE: BlockMatrix multiplication

2015-07-14 Thread Ulanov, Alexander
Hi Burak, Thank you for the explanation! I will try to make a diagonal block matrix and report the results to you. A column- or row-based partitioner makes sense to me, because it is a direct analogy to column- or row-based data storage in matrices, which is used in BLAS. Best regards, Alexander

PySpark GroupByKey implementation question

2015-07-14 Thread Matt Cheah
Hi everyone, I was examining the PySpark implementation of groupByKey in rdd.py. I would like to submit a patch improving Scala RDD's groupByKey that has a similar robustness against large groups, as PySpark's implementation has logic to spill part of a single group to disk along the way. Its
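
For contrast, a short Scala sketch of the trade-off the proposed patch targets: groupByKey materializes each group in memory, while aggregateByKey combines values incrementally (the data here is made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object GroupVsAggregate {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("grouping").setMaster("local[2]"))
        val pairs = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3)))

        // groupByKey: every value for a key is held in one in-memory
        // buffer, which is what breaks down for very large groups.
        val grouped = pairs.groupByKey()

        // aggregateByKey: values are combined incrementally, map-side and
        // reduce-side, so no single huge group is ever materialized.
        val summed = pairs.aggregateByKey(0)(_ + _, _ + _)

        grouped.collect().foreach(println)
        summed.collect().foreach(println)
        sc.stop()
      }
    }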

Re: Does RDD checkpointing store the entire state in HDFS?

2015-07-14 Thread swetha
Hi TD, I have a question regarding sessionization using updateStateByKey. If near-real-time state needs to be maintained in a streaming application, what happens when the number of RDDs to maintain the state becomes very large? Does it automatically get saved to HDFS and reloaded when needed, or

RestSubmissionClient Basic Auth

2015-07-14 Thread Joel Zambrano
Hi! We have a gateway with basic auth that relays calls to the head node in our cluster. Is adding support for basic auth the wrong approach? Should we use a relay proxy? I've seen the code, and it would probably require adding a few configs and appending the header on the GET and POST request
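
For illustration only, a sketch of what "appending the header" amounts to in plain JDK terms; the URL and credentials are placeholders, and this is not RestSubmissionClient code:

    import java.net.{HttpURLConnection, URL}
    import javax.xml.bind.DatatypeConverter

    object BasicAuthSketch {
      def main(args: Array[String]): Unit = {
        // A Basic Auth header is just "Authorization: Basic base64(user:pass)"
        // added to each request that passes through the gateway.
        val conn = new URL("http://gateway:6066/v1/submissions/status/driver-0")
          .openConnection().asInstanceOf[HttpURLConnection]
        val credentials = DatatypeConverter.printBase64Binary("user:pass".getBytes("UTF-8"))
        conn.setRequestProperty("Authorization", s"Basic $credentials")
        println(conn.getResponseCode)
      }
    }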

Regarding sessionization with updateStateByKey

2015-07-14 Thread swetha
Hi, I have a question regarding sessionization using updateStateByKey. If near-real-time state needs to be maintained in a streaming application, what happens when the number of RDDs to maintain the state becomes very large? Does it automatically get saved to HDFS and reloaded when needed? Thanks,

Re: Should spark-ec2 get its own repo?

2015-07-14 Thread Matt Goodman
I concur with the things Sean said about keeping the same JIRA. Frankly, it's a pretty small part of Spark and, as mentioned by Nicholas, a reference implementation of getting Spark running on EC2. I can see wanting to grow it into a somewhat more general tool that implements launchers for other

Re: Foundation policy on releases and Spark nightly builds

2015-07-14 Thread Sean Busbey
Point well taken. Allow me to walk back a little and move us in a more productive direction. I can personally empathize with the desire to have nightly builds. I'm a passionate advocate for tight feedback cycles between a project and its downstream users. I am personally involved in several

Re: Contribution and choice of language

2015-07-14 Thread Feynman Liang
I would suggest starting with some starter tasks

Re: Foundation policy on releases and Spark nightly builds

2015-07-14 Thread Mark Hamstra
Please keep in mind that you are also ASF people, as is the entire Spark community (users and all)[4]. Phrasing things in terms of us and them by drawing a distinction on "[they] get in a fight on our mailing list" is not helpful. <whine>But they started it!</whine> A bit more seriously, my