Fwd: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Reynold Xin
Adding user list too.
-- Forwarded message --
From: Reynold Xin
Date: Tue, Oct 6, 2015 at 5:54 PM
Subject: Re: multiple count distinct in SQL/DataFrame?
To: "dev@spark.apache.org"
To provide more context, if we do remove this

Re: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Herman van Hövell tot Westerflier
We could also fall back to approximate count distincts when the user requests multiple count distincts. This is less invasive than throwing an AnalysisException, but it could violate the principle of least surprise. Met vriendelijke groet/Kind regards, Herman van Hövell tot Westerflier
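Spark's approximate count distinct is backed by a HyperLogLog-style sketch. As a stdlib-only illustration of the general idea (not Spark's implementation), here is a K-minimum-values (KMV) sketch: hash every value into [0, 1), keep the k smallest hashes, and estimate the cardinality from how tightly they cluster near zero.

```python
import hashlib

def kmv_estimate(values, k=256):
    """Approximate count-distinct via a K-minimum-values sketch.

    Hash every value to a point in [0, 1) and keep the k smallest
    distinct hashes. If n distinct values spread uniformly, the k-th
    smallest hash h_k sits near k/n, so n is roughly (k - 1) / h_k.
    """
    hashes = set()
    for v in values:
        h = int(hashlib.md5(str(v).encode()).hexdigest(), 16) / 2**128
        hashes.add(h)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:          # fewer than k distinct values: exact answer
        return len(smallest)
    return int((k - 1) / smallest[-1])

exact = 10_000
est = kmv_estimate(i % exact for i in range(50_000))
print(exact, est)  # estimate typically lands within a few percent of 10000
```

Relative error shrinks like 1/sqrt(k), which is the trade-off Herman alludes to: cheaper than an exact multi-distinct plan, but surprising if the user expected exact numbers.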

Re: multiple count distinct in SQL/DataFrame?

2015-10-07 Thread Mayank Pradhan
Is this limited only to grand (ungrouped) multiple count distincts, or does it extend to all kinds of multiple count distincts? More precisely, would the following multiple count distinct query also be affected? select a, b, count(distinct x), count(distinct y) from foo group by a, b; It would be unfortunate
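To pin down the grouped semantics Mayank is asking about, here is a stdlib-only Python sketch (toy data, not Spark) of `select a, b, count(distinct x), count(distinct y) from foo group by a, b` — each group keeps one set per distinct-counted column:

```python
from collections import defaultdict

# Toy rows standing in for a table foo(a, b, x, y)
foo = [
    (1, "p", "x1", "y1"),
    (1, "p", "x1", "y2"),   # duplicate x within the (1, "p") group
    (1, "q", "x2", "y1"),
    (2, "p", "x3", "y3"),
]

# select a, b, count(distinct x), count(distinct y) from foo group by a, b
groups = defaultdict(lambda: (set(), set()))
for a, b, x, y in foo:
    xs, ys = groups[(a, b)]
    xs.add(x)
    ys.add(y)

result = {key: (len(xs), len(ys)) for key, (xs, ys) in sorted(groups.items())}
print(result)
# {(1, 'p'): (1, 2), (1, 'q'): (1, 1), (2, 'p'): (1, 1)}
```

The sketch also shows why multiple exact count distincts are expensive at scale: each extra `count(distinct …)` needs its own deduplicated set per group.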

What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread YiZhi Liu
Hi everyone, I'm curious about the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS. Both of them are optimized using LBFGS; the only difference I see is that LogisticRegression takes a DataFrame while LogisticRegressionWithLBFGS takes an RDD. So

RE: GraphX PageRank keeps 3 copies of graph in memory

2015-10-07 Thread Ulanov, Alexander
Hi Ankur, Could you help with an explanation of the problem below? Best regards, Alexander
From: Ulanov, Alexander
Sent: Friday, October 02, 2015 11:39 AM
To: 'Robin East'
Cc: dev@spark.apache.org
Subject: RE: GraphX PageRank keeps 3 copies of graph in memory
Hi Robin, Sounds interesting. I am

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Michael Armbrust
-dev +user
> 1) Is that the reason why it's always slow in the first run? Or are there any other reasons? Apparently it loads data into memory every time, so it shouldn't be something to do with disk reads, should it?
You are probably seeing the effect of the JVM's JIT. The first run is
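Michael's JIT point generalizes into standard benchmarking practice: time several runs and discard the first, which pays one-time costs (JIT compilation on the JVM, code-path warm-up, cache population). A stdlib-only harness sketch — the toy workload fakes the one-time cost with an explicit cache, since CPython itself has no JIT:

```python
import time

def time_runs(fn, runs=5):
    """Time `fn` several times. The first measurement includes any
    one-time warm-up costs, so benchmarks usually report only the
    later runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

# Toy workload with an explicit one-time setup cost, mimicking warm-up
cache = {}
def workload():
    if "data" not in cache:                # first call only: expensive setup
        cache["data"] = [i * i for i in range(200_000)]
    return sum(cache["data"][:1000])

timings = time_runs(workload)
warm = timings[1:]                          # discard the warm-up run
print(f"first: {timings[0]:.4f}s  warm median: {sorted(warm)[len(warm) // 2]:.4f}s")
```

The same discipline applies to the pyspark evaluation below: report the distribution of warm runs, and treat the first execution as a separate warm-up measurement.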

Understanding code/closure shipment to Spark workers

2015-10-07 Thread Arijit
Hi, I want to understand the code flow starting from the Spark jar that I submit through spark-submit: how does Spark identify and extract the closures, clean and serialize them, and ship them to workers to execute as tasks? Can someone point me to any documentation or a pointer to the source
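As I understand it, on the JVM side the entry point is SparkContext.clean, which delegates to org.apache.spark.util.ClosureCleaner to null out unneeded enclosing references before the closure is serialized into each task. PySpark faces the same problem in a different form: the stdlib pickler serializes functions by reference (module + qualified name), so closures over local state cannot be shipped with it — which is why PySpark bundles cloudpickle, which serializes the code object itself. A minimal stdlib demonstration of the limitation:

```python
import pickle

# The stdlib pickler serializes functions *by reference* (module + name),
# so a closure over a local variable has no stable name to reference
# and cannot be pickled:
def make_adder(n):
    return lambda x: x + n          # closes over local variable n

try:
    pickle.dumps(make_adder(1))
    closure_picklable = True
except (pickle.PicklingError, AttributeError, TypeError):
    closure_picklable = False

print("closure picklable with stdlib pickle:", closure_picklable)  # False
```

This is a toy analogy for the serialization step only; the cleaning, task creation, and scheduling live in the Scala sources around SparkContext and the DAGScheduler.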

SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Lloyd Haris
Hi Spark Devs, I am doing a performance evaluation of Spark using pyspark. I am using Spark 1.5 with a Hadoop 2.6 cluster of 4 nodes, and ran these tests in local mode. After a few dozen test executions, it turned out that the very first SparkSQL query execution is always slower than the

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
Sounds good to me. For my purposes, I'm less concerned about old Spark artifacts and more concerned about the consistency of the set of artifacts that get generated with new releases. (e.g. Each new release will always include one artifact each for Hadoop 1, Hadoop 1 + Scala 2.11, etc...) It

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Sean Owen
This is about the s3.amazonaws.com files, not dist.apache.org, right? Or does it affect both? (BTW, you can keep as many old release artifacts around on the apache.org archives as you like; I think the suggestion is to remove all but the most recent releases from the set that's replicated to all

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Patrick Wendell
I don't think we have a firm contract around that. So far we've never removed old artifacts, but the ASF has asked us at times to decrease the size of the binaries we post. In the future, at some point, we may drop older ones since we keep adding new ones. If downstream projects are depending on our

Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
Thanks guys. Regarding this earlier question: More importantly, is there some rough specification for what packages we should be able to expect in this S3 bucket with every release? Is the implied answer that we should continue to expect the same set of artifacts for every release for the

Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-07 Thread Michael Armbrust
Please do.
On Wed, Oct 7, 2015 at 9:49 AM, Russell Spitzer wrote:
> Should I make up a new ticket for this? Or is there something already underway?
> On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer wrote:
>> That sounds fine to me,

Spark standalone hangup during shuffle flatMap or explode in cluster

2015-10-07 Thread Saif.A.Ellafi
When running a standalone cluster mode job, the process hangs up randomly during a DataFrame flatMap or explode operation in HiveContext: df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r). This does not happen with SQLContext in cluster mode, or with Hive/SQL in local mode, where it
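The posted lambda replicates each row r.getInt(ind) times, so a skewed or large multiplier column can inflate a partition enormously — one plausible reason a shuffle appears to hang rather than fail. A stdlib Python sketch of just the replication semantics (hypothetical rows, not Spark):

```python
# Each "row" is a dict; "n" plays the role of the integer column at index
# ind, mirroring df.flatMap(r => for (n <- 1 to r.getInt(ind)) yield r).
rows = [{"id": "a", "n": 1}, {"id": "b", "n": 3}, {"id": "c", "n": 0}]

def flat_map(f, data):
    """flatMap: apply f to each row and concatenate the resulting lists."""
    return [out for row in data for out in f(row)]

# Scala's `1 to 0` is empty, so n = 0 yields no copies -- same as [r] * 0.
exploded = flat_map(lambda r: [r] * r["n"], rows)
print([r["id"] for r in exploded])  # ['a', 'b', 'b', 'b']
```

If a few rows carry a very large n, those rows alone dominate the output size; checking the distribution of the multiplier column would be a cheap first diagnostic.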

Re: Dataframes: PrunedFilteredScan without Spark Side Filtering

2015-10-07 Thread Russell Spitzer
Should I make up a new ticket for this? Or is there something already underway?
On Mon, Oct 5, 2015 at 4:31 PM Russell Spitzer wrote:
> That sounds fine to me, we already do the filtering, so populating that field would be pretty simple.
> On Sun, Sep 27, 2015 at

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu, The spark.ml classes are part of the higher-level "Pipelines" API, which works with DataFrames. When creating this API, we decided to separate it from the old API to avoid confusion. You can read more about it here: http://spark.apache.org/docs/latest/ml-guide.html For (3): We
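Both classes minimize the same logistic loss; the user-facing difference is the API (DataFrame-based Pipeline vs RDD). As a sketch of that shared objective, here is the logistic loss gradient both optimizers work with, driven by plain gradient descent rather than L-BFGS (which targets the same objective but converges faster), on hypothetical toy data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(points, lr=0.5, iters=200):
    """Minimize the logistic loss by plain gradient descent. The
    gradient per point is (sigmoid(w.x) - y) * x, the same quantity
    an L-BFGS driver would consume. points: [(features, label)] with
    label in {0, 1}; no intercept or regularization, for brevity."""
    dim = len(points[0][0])
    w = [0.0] * dim
    for _ in range(iters):
        grad = [0.0] * dim
        for x, y in points:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(dim):
                grad[j] += (p - y) * x[j]
        w = [wi - lr * g / len(points) for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0

# Hypothetical toy data: label is 1 iff the first feature is positive
data = [((1.0, 1.0), 1), ((2.0, 0.5), 1), ((-1.0, 1.0), 0), ((-2.0, 0.3), 0)]
w = fit_logistic(data)
print([predict(w, x) for x, _ in data])  # [1, 1, 0, 0]
```

In Spark terms, spark.ml's LogisticRegression would take these points as a DataFrame of (features, label) columns inside a Pipeline, while mllib's LogisticRegressionWithLBFGS would take an RDD of LabeledPoint — same loss underneath.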