Re: Spark SQL JSON Column Support

2016-09-29 Thread Cody Koeninger
Will this be able to handle projection pushdown if a given job doesn't utilize all the columns in the schema? Or should people have a per-job schema? On Wed, Sep 28, 2016 at 2:17 PM, Michael Armbrust wrote: > Burak, you can configure what happens with corrupt records for

Re: [discuss] Spark 2.x release cadence

2016-09-29 Thread Cody Koeninger
Regarding documentation debt, is there a reason not to deploy documentation updates more frequently than releases? I recall this used to be the case. On Wed, Sep 28, 2016 at 3:35 PM, Joseph Bradley wrote: > +1 for 4 months. With QA taking about a month, that's very

Dynamic allocation / killing executors work? Perhaps it's just web UI?

2016-09-29 Thread Jacek Laskowski
Hi, I'm doubtful that dynamic allocation / killing executors works in Standalone and YARN, at least as far as the web UI is concerned (perhaps it's just the web UI). I can successfully request as many executors as I want using sc.requestTotalExecutors and they show up nicely in the web UI as ACTIVE, but whenever I
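For context, a minimal sketch of the developer API Jacek mentions (both methods are `@DeveloperApi` in Spark 2.0; signatures may differ in other versions, and executor ids are illustrative):

```scala
// Assumes a live SparkContext `sc` on a Standalone or YARN cluster.

// Ask the cluster manager for an absolute target number of executors.
// The 2.0 signature also takes locality hints (task count per host),
// which can be left empty for a plain resize.
sc.requestTotalExecutors(4, 0, Map.empty[String, Int])

// Kill specific executors by id; these are the executors whose
// state in the web UI is in question in this thread.
sc.killExecutors(Seq("1", "2"))
```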

Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-29 Thread Marcin Tustin
And that PR as promised: https://github.com/apache/spark/pull/12456 On Thu, Sep 29, 2016 at 5:18 AM, Grant Digby wrote: > Yeah that would work although I was worried that they used > InheritableThreadLocal vs ThreadLocal because they did want the child > threads to inherit the

Questions about DataFrame's filter()

2016-09-29 Thread Samy Dindane
Hi, I noticed that the following code compiles: val df = spark.read.format("com.databricks.spark.avro").load("/tmp/whatever/output") val count = df.filter(x => x.getAs[Int]("day") == 2).count It surprises me as `filter()` takes a Column, not a `Row => Boolean`. Also, this code returns

Re: [question] Why Spark SQL grammar allows &lt;column name&gt; : &lt;data type&gt; ?

2016-09-29 Thread Reynold Xin
Is there any harm in supporting it? Mostly curious whether we really need to "fix" this. On Thu, Sep 29, 2016 at 7:22 PM, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: > Tejas, > > This is because we use the same rule to parse top level and nested data > fields. For

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread vaquar khan
+1 (non-binding) Regards, Vaquar khan On 29 Sep 2016 23:00, "Denny Lee" wrote: > +1 (non-binding) > > On Thu, Sep 29, 2016 at 9:43 PM Jeff Zhang wrote: > >> +1 >> >> On Fri, Sep 30, 2016 at 9:27 AM, Burak Yavuz wrote: >> >>> +1 >>>

Re: [question] Why Spark SQL grammar allows &lt;column name&gt; : &lt;data type&gt; ?

2016-09-29 Thread Tejas Patil
Herman : Thanks for the explanation. That makes sense. Logged a jira : https://issues.apache.org/jira/browse/SPARK-17741 Reynold : I am not affected by this but it felt odd while I was reading the code. Thanks, Tejas On Thu, Sep 29, 2016 at 7:24 PM, Reynold Xin wrote: > Is

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Denny Lee
+1 (non-binding) On Thu, Sep 29, 2016 at 9:43 PM Jeff Zhang wrote: > +1 > > On Fri, Sep 30, 2016 at 9:27 AM, Burak Yavuz wrote: > >> +1 >> >> On Sep 29, 2016 4:33 PM, "Kyle Kelley" wrote: >> >>> +1 >>> >>> On Thu, Sep 29, 2016 at 4:27 PM,

Re: [question] Why Spark SQL grammar allows &lt;column name&gt; : &lt;data type&gt; ?

2016-09-29 Thread Herman van Hövell tot Westerflier
Tejas, This is because we use the same rule to parse top level and nested data fields. For example: create table tbl_x( id bigint, nested struct&lt;f1: bigint&gt; ) Shows both syntaxes. We should split this rule in a top-level and nested rule. Could you open a ticket? Thanks,
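A sketch of the split Herman proposes, written in the style of Spark's ANTLR grammar (rule names and exact shape are illustrative, not the eventual fix):

```
colType
    : identifier dataType (COMMENT STRING)?        // top level: no ':'
    ;

nestedColType
    : identifier ':'? dataType (COMMENT STRING)?   // nested fields keep the optional ':'
    ;
```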

Re: [discuss] Spark 2.x release cadence

2016-09-29 Thread Weiqing Yang
+1 (non binding) RC4 is compiled and tested on the system: CentOS Linux release 7.0.1406 / openjdk 1.8.0_102 / R 3.3.1 All tests passed. ./build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Dpyspark -Dsparkr -DskipTests clean package ./build/mvn -Pyarn -Phadoop-2.7

Re: [discuss] Spark 2.x release cadence

2016-09-29 Thread Weiqing Yang
Sorry. I think I just replied to the wrong thread. :( WQ On Thu, Sep 29, 2016 at 10:58 AM, Weiqing Yang wrote: > +1 (non binding) > > > > RC4 is compiled and tested on the system: CentOS Linux release > 7.0.1406 / openjdk 1.8.0_102 / R 3.3.1 > > All tests passed. >

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Mridul Muralidharan
+1 Regards, Mridul On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a > majority of at least 3 +1 PMC votes are cast. > >

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Matei Zaharia
+1 Matei > On Sep 29, 2016, at 10:59 AM, Herman van Hövell tot Westerflier > wrote: > > +1 (non binding) > > On Thu, Sep 29, 2016 at 10:59 AM, Weiqing Yang > wrote: > +1 (non binding) > > RC4 is

Re: Questions about DataFrame's filter()

2016-09-29 Thread Michael Armbrust
-dev +user It surprises me as `filter()` takes a Column, not a `Row => Boolean`. There are several overloaded versions of Dataset.filter(...) def filter(func: FilterFunction[T]): Dataset[T] def filter(func: (T) ⇒ Boolean): Dataset[T] def filter(conditionExpr: String): Dataset[T] def
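To make the overload resolution concrete, a sketch of how the three variants behave (illustrative only; assumes `df` is a `DataFrame`, i.e. `Dataset[Row]`):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Column-based predicate: visible to Catalyst, so it can be pushed down.
val byColumn = df.filter(col("day") === 2)

// SQL-string predicate: parsed into the same Column form.
val byExpr = df.filter("day = 2")

// Typed predicate: for Dataset[Row] the (T) => Boolean overload becomes
// Row => Boolean, which is why the code in the question compiles.
// It is a black box to the optimizer.
val byFunc = df.filter((r: Row) => r.getAs[Int]("day") == 2)
```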

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Michael Armbrust
+1 On Thu, Sep 29, 2016 at 11:51 AM, Mridul Muralidharan wrote: > +1 > > Regards, > Mridul > > On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: > > Please vote on releasing the following candidate as Apache Spark version > > 2.0.1. The vote is open

Re: Spark SQL JSON Column Support

2016-09-29 Thread Michael Armbrust
> > Will this be able to handle projection pushdown if a given job doesn't > utilize all the columns in the schema? Or should people have a per-job schema? > As currently written, we will do a little bit of extra work to pull out fields that aren't needed. I think it would be pretty straight
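To Cody's per-job-schema point, the usage pattern looks roughly like this (a sketch against the API in the PR under discussion, where the function landed in master as `from_json`; the input DataFrame `logs` and its string column `json` are hypothetical):

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// A per-job schema listing only the fields this job needs.
val schema = new StructType()
  .add("user", StringType)
  .add("day", IntegerType)

// Parse the JSON string column into a struct with that schema,
// then project out the fields.
val parsed = logs.select(from_json(col("json"), schema).as("event"))
parsed.select("event.user", "event.day")
```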

Re: Spark SQL JSON Column Support

2016-09-29 Thread Cody Koeninger
Totally agree that specifying the schema manually should be the baseline. LGTM, thanks for working on it. Seems like it looks good to others too judging by the comment on the PR that it's getting merged to master :) On Thu, Sep 29, 2016 at 2:13 PM, Michael Armbrust

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Sean Owen
+1 from me too, same result as my RC3 vote/testing. On Wed, Sep 28, 2016 at 10:14 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a > majority of at

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Sameer Agarwal
+1 On Thu, Sep 29, 2016 at 12:04 PM, Sean Owen wrote: > +1 from me too, same result as my RC3 vote/testing. > > On Wed, Sep 28, 2016 at 10:14 PM, Reynold Xin wrote: > > Please vote on releasing the following candidate as Apache Spark version > > 2.0.1.

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Reynold Xin
I will kick it off with my own +1. On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a > majority of at least 3 +1 PMC

Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-29 Thread Grant Digby
Yeah that would work although I was worried that they used InheritableThreadLocal vs ThreadLocal because they did want the child threads to inherit the parent's executionId, maybe to stop the child threads from kicking off their own queries whilst working for the parent. I think the fix would be

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-29 Thread Olivier Girardot
I know that the code itself would not be the same, but it would be useful to at least have the pom/build.sbt transitive dependencies different when fetching the artifact with a specific classifier, don't you think? For now I've overridden them myself using the dependency versions defined in the

Re: [VOTE] Release Apache Spark 2.0.1 (RC2)

2016-09-29 Thread Jacek Laskowski
Hi Marcelo, The reason I asked about the mesos profile was that I thought it was part of the branch already and wondered why nobody used it to compile Spark with all the code available. I do understand no code changes were introduced during this profile maintenance, but with the profile that

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Ricardo Almeida
+1 (non-binding) Built (-Phadoop-2.7 -Dhadoop.version=2.7.3 -Phive -Phive-thriftserver -Pyarn) and tested on: - Ubuntu 16.04 / OpenJDK 1.8.0_91 - CentOS / Oracle Java 1.7.0_55 No regressions from 2.0.0 found while running our workloads (Python API) On 29 September 2016 at 08:10, Reynold Xin

Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-29 Thread Marcin Tustin
That's not possible because inherited primitive values are copied, not shared. Clearing problematic values on thread creation should eliminate this problem. As to your idea as a design goal, that's also not desirable, because Java thread pooling is implemented in a very surprising way. The
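A small sketch of the copy-on-create behaviour Marcin describes (plain JVM, no Spark; explicit `Runnable` for Scala 2.11 compatibility):

```scala
object InheritableDemo {
  def main(args: Array[String]): Unit = {
    val execId = new InheritableThreadLocal[String]
    execId.set("query-1")

    // A child thread created *after* set() receives a copy of the value...
    val child = new Thread(new Runnable {
      def run(): Unit = {
        assert(execId.get == "query-1")
        // ...but it is a copy taken at thread creation: clearing it
        // here does not touch the parent's value.
        execId.remove()
      }
    })
    child.start(); child.join()

    assert(execId.get == "query-1") // parent unaffected

    // Caveat: pooled threads are typically created *before* set() and
    // never see the value at all -- the surprising pooling behaviour
    // Marcin alludes to.
  }
}
```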

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-29 Thread Sean Owen
No, I think that's what dependencyManagement (or equivalent) is definitely for. On Thu, Sep 29, 2016 at 5:37 AM, Olivier Girardot wrote: > I know that the code itself would not be the same, but it would be useful to > at least have the pom/build.sbt transitive
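A sketch of what that looks like in a consumer pom (the artifact id and 2.6.x patch version are illustrative; other Hadoop artifacts on the classpath would need the same pin):

```xml
<dependencyManagement>
  <dependencies>
    <!-- Force the Hadoop version Spark pulls in transitively down to 2.6 -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.4</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```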

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Joseph Bradley
+1 On Thu, Sep 29, 2016 at 2:11 PM, Dongjoon Hyun wrote: > +1 (non-binding) > > At this time, I tested RC4 on the followings. > > - CentOS 6.8 (Final) > - OpenJDK 1.8.0_101 > - Python 2.7.12 > > /build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver >

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Marcelo Vanzin
+1 On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a > majority of at least 3 +1 PMC votes are cast. > > [ ] +1 Release

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Luciano Resende
+1 (non-binding) On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a > majority of at least 3 +1 PMC votes are cast. > > [

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Yin Huai
+1 On Thu, Sep 29, 2016 at 4:07 PM, Luciano Resende wrote: > +1 (non-binding) > > On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.0.1. The vote is open until Sat,

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Kyle Kelley
+1 On Thu, Sep 29, 2016 at 4:27 PM, Yin Huai wrote: > +1 > > On Thu, Sep 29, 2016 at 4:07 PM, Luciano Resende > wrote: > >> +1 (non-binding) >> >> On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: >> >>> Please vote on

Running Spark master/slave instances in non Daemon mode

2016-09-29 Thread jpuro
Hi, I recently tried deploying Spark master and slave instances to container based environments such as Docker, Nomad etc. There are two issues that I've found with how the startup scripts work. The sbin/start-master.sh and sbin/start-slave.sh scripts start a daemon by default, but this isn't as

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Burak Yavuz
+1 On Sep 29, 2016 4:33 PM, "Kyle Kelley" wrote: > +1 > > On Thu, Sep 29, 2016 at 4:27 PM, Yin Huai wrote: > >> +1 >> >> On Thu, Sep 29, 2016 at 4:07 PM, Luciano Resende >> wrote: >> >>> +1 (non-binding) >>> >>> On Wed, Sep 28,

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Jeff Zhang
+1 On Fri, Sep 30, 2016 at 9:27 AM, Burak Yavuz wrote: > +1 > > On Sep 29, 2016 4:33 PM, "Kyle Kelley" wrote: > >> +1 >> >> On Thu, Sep 29, 2016 at 4:27 PM, Yin Huai wrote: >> >>> +1 >>> >>> On Thu, Sep 29, 2016 at 4:07 PM, Luciano

Re: Running Spark master/slave instances in non Daemon mode

2016-09-29 Thread Jakob Odersky
I'm curious, what kind of container solutions require foreground processes? Most init systems work fine with "starter" processes that run other processes. IIRC systemd and start-stop-daemon have an option called "fork", that will expect the main process to run another one in the background and

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Jagadeesan As
+1 (non binding) Ubuntu 14.04.2/openjdk "1.8.0_72" (-Pyarn -Phadoop-2.7 -Psparkr -Pkinesis-asl -Phive-thriftserver) Cheers, Jagadeesan A S From: Ricardo Almeida To: "dev@spark.apache.org" Date: 29-09-16 04:36 PM Subject:

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Dongjoon Hyun
+1 (non-binding) At this time, I tested RC4 on the followings. - CentOS 6.8 (Final) - OpenJDK 1.8.0_101 - Python 2.7.12 /build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Dpyspark -Dsparkr -DskipTests clean package /build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive

Re: Running Spark master/slave instances in non Daemon mode

2016-09-29 Thread Mike Ihbe
Our particular use case is for Nomad, using the "exec" configuration described here: https://www.nomadproject.io/docs/drivers/exec.html. It's not exactly a container, just a cgroup. It performs a simple fork/exec of a command and binds to the output fds from that process, so daemonizing is causing
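For what it's worth, a foreground workaround is to bypass the sbin daemon scripts and run the launcher classes they themselves invoke (a sketch; host and ports are placeholders, and this assumes an unpacked Spark 2.0 distribution):

```shell
# Master in the foreground
./bin/spark-class org.apache.spark.deploy.master.Master \
  --host 127.0.0.1 --port 7077 --webui-port 8080

# Worker in the foreground, pointed at that master
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
```

Because spark-class execs a plain JVM, the process stays attached to its stdout/stderr, which is what fork/exec supervisors like Nomad's exec driver expect.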

[question] Why Spark SQL grammar allows : ?

2016-09-29 Thread Tejas Patil
Is there any reason why Spark SQL supports "&lt;column name&gt;" ":" "&lt;data type&gt;" while specifying columns? e.g. sql("CREATE TABLE t1 (column1:INT)") works fine. Here is the relevant snippet in the grammar [0]: ``` colType : identifier ':'? dataType (COMMENT STRING)? ; ``` I do not see MySQL[1], Hive[2], Presto[3] and

Issues in compiling spark 2.0.0 code using scala-maven-plugin

2016-09-29 Thread satyajit vegesna
Hi All, I am trying to compile code using Maven, which was working with Spark 1.6.2, but when I try Spark 2.0.0 I get the error below: org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on project
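The error message is truncated above, so this is only a guess at the usual suspect: Spark 2.0.0 artifacts are built for Scala 2.11 (and need Java 8), so a pom carried over from 1.6.2 often still pins Scala 2.10 against the `_2.11` Spark artifacts. A sketch of the alignment (version numbers are the common 2016-era pairing, not taken from the poster's build):

```xml
<!-- Spark 2.0 artifacts are _2.11, not _2.10 -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.0.0</version>
</dependency>

<!-- scala-maven-plugin must compile with a matching 2.11.x compiler -->
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>3.2.2</version>
  <configuration>
    <scalaVersion>2.11.8</scalaVersion>
  </configuration>
</plugin>
```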