Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
There is technically no PMC for the spark-ec2 project (I guess we are kind of establishing one right now). I haven't heard anything from the Spark PMC on the dev list that might suggest a need for a vote so far. I will send another round of email notification to the dev list when we have a JIRA /

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
100% I would love to do it. Who is a good person to review the design with? All I need is a quick chat about the design and approach and I'll create the JIRA and push a patch. Ted Malaska On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Hi Ted, The

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I added the following JIRA: https://issues.apache.org/jira/browse/SPARK-9237 Please help me get it assigned to myself. Thanks. Ted Malaska On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska ted.mala...@cloudera.com wrote: Cool, I will make a JIRA after I check in to my hotel. And try to get a patch

Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Mridul Muralidharan
If I am not wrong, since the code was hosted within the mesos project repo, I assume (at least part of it) is owned by the mesos project and so its PMC? - Mridul On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There is technically no PMC for the spark-ec2

Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 21:32 GMT-07:00 Prashant Sharma scrapco...@gmail.com: +1 Looks like a nice idea (I do not see any harm). Would you like to work on the patch to support it? Prashant Sharma Yes, I would like to contribute to it once we clarify the appropriate path. --Alexey On Tue, Jul 21,

Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
That's part of the confusion we are trying to fix here -- the repository used to live in the mesos github account but was never a part of the Apache Mesos project. It was a remnant part of Spark from when Spark used to live at github.com/mesos/spark. Shivaram On Tue, Jul 21, 2015 at 11:03 AM,

Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 23:29 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com: I agree with this -- basically, to build on Reynold's point, you should be able to get almost the same performance by implementing either the Hadoop FileSystem API or the Spark Data Source API over Ignite in the right way. This

Re: Make off-heap store pluggable

2015-07-21 Thread Zhan Zhang
Hi Alexey, SPARK-6479 (https://issues.apache.org/jira/browse/SPARK-6479) is for the plugin API, and SPARK-6112 (https://issues.apache.org/jira/browse/SPARK-6112) is for the HDFS plugin. Thanks. Zhan Zhang On Jul 21, 2015, at 10:56 AM, Alexey Goncharuk

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Reynold Xin
Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com wrote: 100% I would love to do it. Who is a good person to review the

Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 21:40 GMT-07:00 Reynold Xin r...@databricks.com: I sent it prematurely. They are already pluggable, or at least in the process to be more pluggable. In 1.4, instead of calling the external system's API directly, we added an API for that. There is a patch to add support for HDFS

-Phive-thriftserver when compiling for use in pyspark and JDBC connections

2015-07-21 Thread Aaron
I compile/make a distribution, with either the 1.4 branch or master, using -Phive-thriftserver, and attempt a JDBC connection to a MySQL DB using the latest connector (5.1.36) jar. When I set up the pyspark shell doing: bin/pyspark --jars mysql-connection...jar --driver-class-path

What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?

2015-07-21 Thread Yu Ishikawa
Hi all, When we send a PR, it seems that two requests to run tests are sometimes sent to Jenkins. What is the difference between SparkPullRequestBuilder and SlowSparkPullRequestBuilder? Thanks, Yu Ishikawa

Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Mridul Muralidharan
That sounds good. Thanks for clarifying! Regards, Mridul On Tue, Jul 21, 2015 at 11:09 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: That's part of the confusion we are trying to fix here -- the repository used to live in the mesos github account but was never a part of the

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Look at the implementation for frequent items. It is different from a true count. On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote: Is this just frequent items?

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Yes, and freqItems does not give you an ordered count (right?) + the threshold makes it difficult to calibrate + we noticed some strange behaviour when testing it on small datasets. 2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com: Look at the implementation for frequent
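The gap between a true count and an approximate frequent-items result, raised in the messages above, can be illustrated with a small plain-Python sketch. This is not Spark's `freqItems` implementation; it is a generic Misra-Gries style counter, and the function names, the parameter `k` (which plays a role comparable to a support threshold of 1/k), and the sample data are all assumptions made for illustration.

```python
from collections import Counter

def exact_counts(values):
    """True count: one exact tally per distinct value."""
    return Counter(values)

def approx_frequent_items(values, k):
    """Misra-Gries style sketch keeping at most k - 1 counters.

    Any value occurring more than len(values) / k times is guaranteed
    to survive, but the stored numbers are only lower bounds on the
    real counts, so they cannot be used as an ordered exact count.
    """
    counters = {}
    for v in values:
        if v in counters:
            counters[v] += 1
        elif len(counters) < k - 1:
            counters[v] = 1
        else:
            # No free counter: decrement all and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical sample data.
data = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
exact = exact_counts(data)
approx = approx_frequent_items(data, k=3)
```

On a small input like this one, the approximate counters understate the real frequencies and rare values disappear entirely, which may be one reason such results look odd on small datasets.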

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Cool, I will make a JIRA after I check in to my hotel. And try to get a patch early next week. On Jul 21, 2015 5:15 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: Yes, and freqItems does not give you an ordered count (right?) + the threshold makes it difficult to calibrate + we

Re: Make off-heap store pluggable

2015-07-21 Thread Matei Zaharia
I agree with this -- basically, to build on Reynold's point, you should be able to get almost the same performance by implementing either the Hadoop FileSystem API or the Spark Data Source API over Ignite in the right way. This would let people save data persistently in Ignite in addition to

Re: Make off-heap store pluggable

2015-07-21 Thread Sean Owen
(Related, not important comment: it would also be nice to separate out the Tachyon dependency from core, as it's conceptually pluggable but is still hard-coded into several places in the code, and a lot of the comments/docs in the code.) On Tue, Jul 21, 2015 at 5:40 AM, Reynold Xin

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I'm guessing you want something like what I put in this blog post. http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ This is a very common use case. If there is a +1 I would love to add it to dataframes. Let me know Ted Malaska On Tue, Jul 21,

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Yep, actually the generic part does not work: countByValue on one column gives you the count for each value seen in the column. I would like a generic (multi-column) countByValue to give me the same kind of output for each column, not considering each n-tuple of column values as the key
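The distinction drawn above, counting values separately per column rather than counting each tuple of column values as one key, can be sketched in plain Python. This only illustrates the semantics on in-memory rows and is not Spark code; the function names `count_by_tuple` and `count_cols_by_value` and the sample rows are hypothetical.

```python
from collections import Counter

def count_by_tuple(rows):
    """Count each full row (the tuple of all column values) as a single key."""
    return Counter(tuple(row.values()) for row in rows)

def count_cols_by_value(rows):
    """Return one Counter per column, mapping each value to its count."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts.setdefault(col, Counter())[value] += 1
    return counts

# Hypothetical sample rows with two columns.
rows = [
    {"city": "Paris", "status": "ok"},
    {"city": "Paris", "status": "ko"},
    {"city": "Lyon", "status": "ok"},
]
per_tuple = count_by_tuple(rows)
per_column = count_cols_by_value(rows)
```

Here every row is distinct, so the tuple-keyed count says little, while the per-column result shows that "Paris" and "ok" each occur twice.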

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Jonathan Winandy
Ha ok! Then the generic part would have this signature: def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] +1 for more work (blog / api) for data quality checks. Cheers, Jonathan TopCMSParams and some other monoids from Algebird are really cool for that :

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Hi Ted, The TopNList would be great to see directly in the DataFrame API, and my wish would be to be able to apply it on multiple columns at the same time and get all these statistics. The .describe() function is close to what we want to achieve; maybe we could try to enrich its output. Anyway,