There is technically no PMC for the spark-ec2 project (I guess we are kind
of establishing one right now). I haven't heard anything from the Spark PMC
on the dev list that might suggest a need for a vote so far. I will send
another round of email notification to the dev list when we have a JIRA /
100%, I would love to do it. Who is a good person to review the design with?
All I need is a quick chat about the design and approach, and I'll create
the JIRA and push a patch.
Ted Malaska
On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
I added the following JIRA:
https://issues.apache.org/jira/browse/SPARK-9237
Please help me get it assigned to me. Thanks.
Ted Malaska
On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska ted.mala...@cloudera.com
wrote:
Cool I will make a jira after I check in to my hotel. And try to get a
patch
If I am not wrong, since the code was hosted within the Mesos project
repo, I assume at least part of it is owned by the Mesos project and so
its PMC?
- Mridul
On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
2015-07-20 21:32 GMT-07:00 Prashant Sharma scrapco...@gmail.com:
+1. Looks like a nice idea (I do not see any harm). Would you like to work
on the patch to support it?
Prashant Sharma
Yes, I would like to contribute to it once we clarify the appropriate path.
--Alexey
That's part of the confusion we are trying to fix here -- the repository
used to live in the mesos GitHub account but was never a part of the Apache
Mesos project. It was a remnant of Spark from when Spark used to live
at github.com/mesos/spark.
Shivaram
2015-07-20 23:29 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:
I agree with this -- basically, to build on Reynold's point, you should be
able to get almost the same performance by implementing either the Hadoop
FileSystem API or the Spark Data Source API over Ignite in the right way.
Hi Alexey,
SPARK-6479 (https://issues.apache.org/jira/browse/SPARK-6479) is for the plugin
API, and SPARK-6112 (https://issues.apache.org/jira/browse/SPARK-6112) is for
the HDFS plugin.
Thanks.
Zhan Zhang
On Jul 21, 2015, at 10:56 AM, Alexey Goncharuk
Is this just frequent items?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
wrote:
100% I would love to do it. Who a good person to review the
2015-07-20 21:40 GMT-07:00 Reynold Xin r...@databricks.com:
I sent it prematurely.
They are already pluggable, or at least in the process of becoming more
pluggable. In 1.4, instead of calling the external system's API directly,
we added an API for that. There is a patch to add support for HDFS
I compile/make a distribution with either the 1.4 branch or master,
using the -Phive-thriftserver profile, and attempt a JDBC connection to a
MySQL DB, using the latest connector (5.1.36) jar.
When I set up the pyspark shell doing:
bin/pyspark --jars mysql-connection...jar --driver-class-path
Hi all,
When we send a PR, it seems that two requests to run tests are sometimes
sent to Jenkins.
What is the difference between SparkPullRequestBuilder and
SlowSparkPullRequestBuilder?
Thanks,
Yu
--
Yu Ishikawa
That sounds good. Thanks for clarifying!
Regards,
Mridul
On Tue, Jul 21, 2015 at 11:09 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
Look at the implementation for frequent items. It is different from a
true count.
On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:
Is this just frequent items?
Yes, and freqItems does not give you an ordered count (right?). Also, the
threshold makes it difficult to calibrate, and we noticed some strange
behaviour when testing it on small datasets.
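For context on why freqItems behaves this way: Spark's freqItems is based on a streaming frequent-items summary (the Misra-Gries / Karp et al. family), whose counters are lower bounds, not true counts. Below is a minimal plain-Python sketch of that family of algorithms (not Spark's actual code; the data and `k` are made up for illustration) contrasting it with an exact, ordered count:

```python
from collections import Counter

def misra_gries(items, k):
    """Approximate frequent-items summary using at most k-1 counters.

    Any item occurring more than len(items)/k times is guaranteed to be
    retained, but the returned counts UNDERESTIMATE the true counts.
    """
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

data = ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 5
exact = Counter(data)            # true, ordered counts
approx = misra_gries(data, k=3)  # freqItems-style summary

print(exact.most_common())  # [('a', 50), ('b', 30), ('c', 15), ('d', 5)]
print(approx)               # {'a': 30, 'b': 10} -- lower bounds, not true counts
```

This illustrates both observations in the thread: the summary neither preserves true counts nor gives a reliable ordering, and on small datasets the decrement steps distort the counters noticeably.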
2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com:
Cool, I will make a JIRA after I check in to my hotel, and try to get a
patch early next week.
On Jul 21, 2015 5:15 PM, Olivier Girardot o.girar...@lateral-thoughts.com
wrote:
I agree with this -- basically, to build on Reynold's point, you should be able
to get almost the same performance by implementing either the Hadoop FileSystem
API or the Spark Data Source API over Ignite in the right way. This would let
people save data persistently in Ignite in addition to
(Related, minor comment: it would also be nice to separate out the
Tachyon dependency from core, as it's conceptually pluggable but is still
hard-coded into several places in the code, as well as in a lot of the
comments/docs.)
On Tue, Jul 21, 2015 at 5:40 AM, Reynold Xin
I'm guessing you want something like what I put in this blog post.
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
This is a very common use case. If there is a +1, I would love to add it to
DataFrames.
Let me know
Ted Malaska
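To make the use case concrete, here is a minimal plain-Python sketch of the kind of per-column quality checks the blog post describes (the post itself uses Spark DataFrames; the column names, data, and threshold below are made up for illustration):

```python
def column_stats(rows, column):
    """Return (null_fraction, distinct_count) for one column of a
    list-of-dicts 'table'."""
    values = [r.get(column) for r in rows]
    nulls = sum(1 for v in values if v is None)
    distinct = len({v for v in values if v is not None})
    return nulls / len(values), distinct

rows = [
    {"id": 1, "country": "FR"},
    {"id": 2, "country": "US"},
    {"id": 3, "country": None},
]

null_frac, distinct = column_stats(rows, "country")
assert null_frac <= 0.5, "too many missing country values"
print(null_frac, distinct)  # null_frac is about 0.33, distinct is 2
```

The same checks map naturally onto DataFrame aggregations (null counts, approximate distinct counts) when run at scale.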
Yop,
actually the generic part does not work: countByValue on one column
gives you the count for each value seen in that column.
I would like a generic (multi-column) countByValue that gives the same kind
of output for each column separately, rather than treating each n-tuple of
column values as the key.
Ha, ok!
Then the generic part would have this signature:
def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
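As a hypothetical sketch of those semantics (countColsByValue does not exist in Spark; this is plain Python over a list-of-dicts stand-in for a DataFrame, with made-up data):

```python
from collections import Counter

def count_cols_by_value(rows):
    """Map each column name to a Counter of its values -- one independent
    countByValue per column, rather than counting n-tuples of row values."""
    result = {}
    for row in rows:
        for col, value in row.items():
            result.setdefault(col, Counter())[value] += 1
    return result

rows = [
    {"country": "FR", "browser": "firefox"},
    {"country": "FR", "browser": "chrome"},
    {"country": "US", "browser": "chrome"},
]

counts = count_cols_by_value(rows)
print(counts["country"])  # Counter({'FR': 2, 'US': 1})
print(counts["browser"])  # Counter({'chrome': 2, 'firefox': 1})
```

Note the contrast with a tuple-keyed countByValue, which would instead count the three (country, browser) combinations -- exactly the behaviour the thread wants to avoid.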
+1 for more work (blog / API) on data quality checks.
Cheers,
Jonathan
TopCMSParams and some other monoids from Algebird are really cool for that:
Hi Ted,
The TopNList would be great to have directly in the DataFrame API, and my
wish would be to be able to apply it to multiple columns at the same time
and get all these statistics.
The .describe() function is close to what we want to achieve; maybe we
could try to enrich its output.
Anyway,