[jira] [Commented] (SPARK-28182) Spark fails to download Hive 2.2+ jars from maven

2019-06-27 Thread Emlyn Corrin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874029#comment-16874029
 ] 

Emlyn Corrin commented on SPARK-28182:
--

I can work around it by adding {{--packages org.apache.zookeeper:zookeeper:3.4.6}} to the command line, or by setting an earlier Hive metastore version (it seems to work even with the default 1.2.1 jars when connecting to Hive metastore 2.3.0).
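
For reference, a sketch of the first workaround (untested beyond my setup; the extra {{--packages}} presumably just makes Ivy fetch the main zookeeper artifact in addition to the -tests one):
{noformat}
pyspark --packages org.apache.zookeeper:zookeeper:3.4.6 \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.hive.metastore.jars=maven \
  --conf spark.sql.hive.metastore.version=2.3
# plus the same --conf spark.hadoop.hive.metastore.uris=... as in the reproduction below
{noformat}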

> Spark fails to download Hive 2.2+ jars from maven
> -
>
> Key: SPARK-28182
> URL: https://issues.apache.org/jira/browse/SPARK-28182
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Emlyn Corrin
>Priority: Major
>
> When starting Spark with {{spark.sql.hive.metastore.jars=maven}} and a {{spark.sql.hive.metastore.version}} of 2.2 or 2.3, it fails to download the required jars. It looks like it just downloads the -tests version of the zookeeper 3.4.6 jar:
> {noformat}
> > rm -rf ~/.ivy2
> > pyspark --conf spark.hadoop.hive.metastore.uris=thrift://hive-metastore:1 --conf spark.sql.catalogImplementation=hive --conf spark.sql.hive.metastore.jars=maven --conf spark.sql.hive.metastore.version=2.3
> Python 3.7.3 (default, Mar 27 2019, 09:23:39)
> Type 'copyright', 'credits' or 'license' for more information
> IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.
> 19/06/27 12:19:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
>       /_/
> Using Python version 3.7.3 (default, Mar 27 2019 09:23:39)
> SparkSession available as 'spark'.
> In [1]: spark.sql('show databases').show(){noformat}
>  
> {noformat}
> http://www.datanucleus.org/downloads/maven2 added as a remote repository with the name: repo-1
> Ivy Default Cache set to: /Users/emcorrin/.ivy2/cache
> The jars for the packages stored in: /Users/emcorrin/.ivy2/jars
> :: loading settings :: url = jar:file:/usr/local/Cellar/apache-spark/2.4.3/libexec/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hive#hive-metastore added as a dependency
> org.apache.hive#hive-exec added as a dependency
> org.apache.hive#hive-common added as a dependency
> org.apache.hive#hive-serde added as a dependency
> com.google.guava#guava added as a dependency
> org.apache.hadoop#hadoop-client added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent-639456d4-9614-45c9-ad3c-567e4fa69f79;1.0
>  confs: [default]
>  found org.apache.hive#hive-metastore;2.3.3 in central
>  found org.apache.hive#hive-serde;2.3.3 in central
>  found org.apache.hive#hive-common;2.3.3 in central
>  found org.apache.hive#hive-shims;2.3.3 in central
>  found org.apache.hive.shims#hive-shims-common;2.3.3 in central
>  found org.apache.logging.log4j#log4j-slf4j-impl;2.6.2 in central
>  found org.slf4j#slf4j-api;1.7.10 in central
>  found com.google.guava#guava;14.0.1 in central
>  found commons-lang#commons-lang;2.6 in central
>  found org.apache.thrift#libthrift;0.9.3 in central
>  found org.apache.httpcomponents#httpclient;4.4 in central
>  found org.apache.httpcomponents#httpcore;4.4 in central
>  found commons-logging#commons-logging;1.2 in central
>  found commons-codec#commons-codec;1.4 in central
>  found org.apache.zookeeper#zookeeper;3.4.6 in central
>  found org.slf4j#slf4j-log4j12;1.6.1 in central
>  found log4j#log4j;1.2.16 in central
>  found jline#jline;2.12 in central
>  found io.netty#netty;3.7.0.Final in central
>  found org.apache.hive.shims#hive-shims-0.23;2.3.3 in central
>  found org.apache.hadoop#hadoop-yarn-server-resourcemanager;2.7.2 in central
>  found org.apache.hadoop#hadoop-annotations;2.7.2 in central
>  found com.google.inject.extensions#guice-servlet;3.0 in central
>  found com.google.inject#guice;3.0 in central
>  found javax.inject#javax.inject;1 in central
>  found aopalliance#aopalliance;1.0 in central
>  found org.sonatype.sisu.inject#cglib;2.2.1-v20090111 in central
>  found asm#asm;3.2 in central
>  found com.google.protobuf#protobuf-java;2.5.0 in central
>  found commons-io#commons-io;2.4 in central
>  found com.sun.jersey#jersey-json;1.14 in central
>  found org.codehaus.jettison#jettison;1.1 in central
>  found com.sun.xml.bind#jaxb-impl;2.2.3-1 in central
>  found javax.xml.bind#jaxb-api;2.2.2 in central
>  found javax.xml.stream#stax-api;1.0-2 in central
>  found javax.activation#activation;1.1 in central

[jira] [Created] (SPARK-28182) Spark fails to download Hive 2.2+ jars from maven

2019-06-27 Thread Emlyn Corrin (JIRA)
Emlyn Corrin created SPARK-28182:


 Summary: Spark fails to download Hive 2.2+ jars from maven
 Key: SPARK-28182
 URL: https://issues.apache.org/jira/browse/SPARK-28182
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Emlyn Corrin


When starting Spark with {{spark.sql.hive.metastore.jars=maven}} and a {{spark.sql.hive.metastore.version}} of 2.2 or 2.3, it fails to download the required jars. It looks like it just downloads the -tests version of the zookeeper 3.4.6 jar:
{noformat}
> rm -rf ~/.ivy2
> pyspark --conf spark.hadoop.hive.metastore.uris=thrift://hive-metastore:1 --conf spark.sql.catalogImplementation=hive --conf spark.sql.hive.metastore.jars=maven --conf spark.sql.hive.metastore.version=2.3

Python 3.7.3 (default, Mar 27 2019, 09:23:39)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.
19/06/27 12:19:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 09:23:39)
SparkSession available as 'spark'.

In [1]: spark.sql('show databases').show(){noformat}
 
{noformat}
http://www.datanucleus.org/downloads/maven2 added as a remote repository with the name: repo-1
Ivy Default Cache set to: /Users/emcorrin/.ivy2/cache
The jars for the packages stored in: /Users/emcorrin/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/Cellar/apache-spark/2.4.3/libexec/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hive#hive-metastore added as a dependency
org.apache.hive#hive-exec added as a dependency
org.apache.hive#hive-common added as a dependency
org.apache.hive#hive-serde added as a dependency
com.google.guava#guava added as a dependency
org.apache.hadoop#hadoop-client added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-639456d4-9614-45c9-ad3c-567e4fa69f79;1.0
 confs: [default]
 found org.apache.hive#hive-metastore;2.3.3 in central
 found org.apache.hive#hive-serde;2.3.3 in central
 found org.apache.hive#hive-common;2.3.3 in central
 found org.apache.hive#hive-shims;2.3.3 in central
 found org.apache.hive.shims#hive-shims-common;2.3.3 in central
 found org.apache.logging.log4j#log4j-slf4j-impl;2.6.2 in central
 found org.slf4j#slf4j-api;1.7.10 in central
 found com.google.guava#guava;14.0.1 in central
 found commons-lang#commons-lang;2.6 in central
 found org.apache.thrift#libthrift;0.9.3 in central
 found org.apache.httpcomponents#httpclient;4.4 in central
 found org.apache.httpcomponents#httpcore;4.4 in central
 found commons-logging#commons-logging;1.2 in central
 found commons-codec#commons-codec;1.4 in central
 found org.apache.zookeeper#zookeeper;3.4.6 in central
 found org.slf4j#slf4j-log4j12;1.6.1 in central
 found log4j#log4j;1.2.16 in central
 found jline#jline;2.12 in central
 found io.netty#netty;3.7.0.Final in central
 found org.apache.hive.shims#hive-shims-0.23;2.3.3 in central
 found org.apache.hadoop#hadoop-yarn-server-resourcemanager;2.7.2 in central
 found org.apache.hadoop#hadoop-annotations;2.7.2 in central
 found com.google.inject.extensions#guice-servlet;3.0 in central
 found com.google.inject#guice;3.0 in central
 found javax.inject#javax.inject;1 in central
 found aopalliance#aopalliance;1.0 in central
 found org.sonatype.sisu.inject#cglib;2.2.1-v20090111 in central
 found asm#asm;3.2 in central
 found com.google.protobuf#protobuf-java;2.5.0 in central
 found commons-io#commons-io;2.4 in central
 found com.sun.jersey#jersey-json;1.14 in central
 found org.codehaus.jettison#jettison;1.1 in central
 found com.sun.xml.bind#jaxb-impl;2.2.3-1 in central
 found javax.xml.bind#jaxb-api;2.2.2 in central
 found javax.xml.stream#stax-api;1.0-2 in central
 found javax.activation#activation;1.1 in central
 found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
 found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
 found org.codehaus.jackson#jackson-jaxrs;1.9.13 in central
 found org.codehaus.jackson#jackson-xc;1.9.13 in central
 found com.sun.jersey.contribs#jersey-guice;1.9 in central
 found org.apache.hadoop#hadoop-yarn-common;2.7.2 in central
 found org.apache.hadoop#hadoop-yarn-api;2.7.2 in central
 found org.apache.commons#commons-compress;1.9 in central
 found org.mortbay.jetty#jetty-util;6.1.26 in central
 found com.sun.jersey#jersey-core;1.14 in central
 found com.sun.jersey#jersey-client;1.9 in central
 found 

[jira] [Commented] (SPARK-23549) Spark SQL unexpected behavior when comparing timestamp to date

2018-04-30 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458511#comment-16458511
 ] 

Emlyn Corrin commented on SPARK-23549:
--

Will this be included in Spark 2.3.1? The Fix Version only says 2.4.0.

> Spark SQL unexpected behavior when comparing timestamp to date
> --
>
> Key: SPARK-23549
> URL: https://issues.apache.org/jira/browse/SPARK-23549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dong Jiang
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> {code:java}
> scala> spark.version
> res1: String = 2.2.1
> scala> spark.sql("select cast('2017-03-01 00:00:00' as timestamp) between 
> cast('2017-02-28' as date) and cast('2017-03-01' as date)").show
> +---+
> |((CAST(CAST(2017-03-01 00:00:00 AS TIMESTAMP) AS STRING) >= 
> CAST(CAST(2017-02-28 AS DATE) AS STRING)) AND (CAST(CAST(2017-03-01 00:00:00 
> AS TIMESTAMP) AS STRING) <= CAST(CAST(2017-03-01 AS DATE) AS STRING)))|
> +---+
> |                                                                             
>                                                                               
>                                                false|
> +---+{code}
> As shown above, when a timestamp is compared to a date in Spark SQL, both the timestamp and the date are downcast to string, leading to an unexpected result. If I run the same SQL in Presto/Athena, I get the expected result:
> {code:java}
> select cast('2017-03-01 00:00:00' as timestamp) between cast('2017-02-28' as date) and cast('2017-03-01' as date)
>   _col0
> 1 true
> {code}
> Is this a bug in Spark or a feature?
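
A possible workaround on the affected versions, sketched below (this is an assumption based on the analysis above, not something from the original report): cast the date bounds to timestamp explicitly, so the comparison stays in the timestamp domain rather than being downcast to string.
{code:python}
# Hedged sketch: keep the BETWEEN comparison in the timestamp domain on pre-2.4.0 versions
# by casting the DATE bounds to TIMESTAMP explicitly, so no string comparison is involved.
spark.sql("""
  select cast('2017-03-01 00:00:00' as timestamp)
         between cast(cast('2017-02-28' as date) as timestamp)
             and cast(cast('2017-03-01' as date) as timestamp)
""").show()  # expected: true
{code}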






[jira] [Commented] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0

2018-04-24 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449513#comment-16449513
 ] 

Emlyn Corrin commented on SPARK-24051:
--

I've reshuffled the pyspark version to make it even clearer:
{code:python}
from pyspark.sql import functions, Window
cols = [functions.col("a"),
functions.col("b").alias("b"),
functions.count(functions.lit(1)).over(Window.partitionBy()).alias("n")]
ds1 = spark.createDataFrame([[1,42],[2,99]], ["a","b"]).select(cols)
ds1.show()
# +---+---+---+
# |  a|  b|  n|
# +---+---+---+
# |  1| 42|  2|
# |  2| 99|  2|
# +---+---+---+

ds2 = spark.createDataFrame([[3]], ["a"]).withColumn("b", 
functions.lit(0)).select(cols)
ds2.show()
# +---+---+---+
# |  a|  b|  n|
# +---+---+---+
# |  3|  0|  1|
# +---+---+---+

ds1.union(ds2).show() # look at column b
# +---+---+---+
# |  a|  b|  n|
# +---+---+---+
# |  1|  0|  2|
# |  2|  0|  2|
# |  3|  0|  1|
# +---+---+---+

ds1.union(ds2).explain()  # the literal 0 as "b" has been pushed into both branches of the union
# == Physical Plan ==
# Union
# :- *(2) Project [a#102L, 0 AS b#0, n#2L]
# :  +- Window [count(1) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS n#2L]
# :     +- Exchange SinglePartition
# :        +- *(1) Project [a#102L]
# :           +- Scan ExistingRDD[a#102L,b#103L]
# +- *(3) Project [a#22L, 0 AS b#126L, n#2L]
#    +- Window [count(1) windowspecdefinition(specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS n#2L]
#       +- Exchange SinglePartition
#          +- Scan ExistingRDD[a#22L]

ds1.union(ds2).drop("n").show() # if we drop "n", column b is correct again:
# +---+---+
# |  a|  b|
# +---+---+
# |  1| 42|
# |  2| 99|
# |  3|  0|
# +---+---+
{code}

> Incorrect results for certain queries using Java and Python APIs on Spark 
> 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> 

[jira] [Commented] (SPARK-24051) Incorrect results for certain queries using Java API on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448390#comment-16448390
 ] 

Emlyn Corrin commented on SPARK-24051:
--

I've managed to reproduce this in {{pyspark}}:
{code}
from pyspark.sql import functions, Window
ds1 = spark.createDataFrame([[1,42],[1,99]], ["a","b"])
ds2 = spark.createDataFrame([[3]], ["a"]).withColumn("b", functions.lit(0))

cols = [functions.col("a"),
functions.col("b").alias("b"),
functions.count(functions.lit(1)).over(Window.partitionBy()).alias("n")]

ds = ds1.select(cols).union(ds2.select(cols)).where(functions.col("n") >= 
1).drop("n")
ds.show()
{code}
I've also found that (in both Java and Python) I can leave off the final {{where}} clause if I also leave off the following {{drop}}, so that the {{n}} column is included in the output (I suppose as long as it's actually observed, so that it can't be optimised away).
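
A quick sketch of that variant (reusing the {{ds1}}, {{ds2}} and {{cols}} definitions from the snippet above):
{code:python}
# Same union as above, but keeping the "n" column instead of filtering on it and dropping it;
# with "n" still observed in the output, column "b" comes through correctly.
ds1.select(cols).union(ds2.select(cols)).show()
{code}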

> Incorrect results for certain queries using Java API on Spark 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b"),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.






[jira] [Updated] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-24051:
-
Summary: Incorrect results for certain queries using Java and Python APIs 
on Spark 2.3.0  (was: Incorrect results for certain queries using Java API on 
Spark 2.3.0)

> Incorrect results for certain queries using Java and Python APIs on Spark 
> 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b"),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.






[jira] [Updated] (SPARK-24051) Incorrect results for certain queries using Java API on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-24051:
-
Summary: Incorrect results for certain queries using Java API on Spark 
2.3.0  (was: Incorrect results for certain queries in Java API on Spark 2.3.0)

> Incorrect results for certain queries using Java API on Spark 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b"),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.






[jira] [Commented] (SPARK-24051) Incorrect results for certain queries in Java API on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447854#comment-16447854
 ] 

Emlyn Corrin commented on SPARK-24051:
--

[~hyukjin.kwon] I expanded the title a bit, but feel free to improve it further 
if you want.

> Incorrect results for certain queries in Java API on Spark 2.3.0
> 
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b"),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.






[jira] [Updated] (SPARK-24051) Incorrect results for certain queries in Java API on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-24051:
-
Summary: Incorrect results for certain queries in Java API on Spark 2.3.0  
(was: Incorrect results)

> Incorrect results for certain queries in Java API on Spark 2.3.0
> 
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b"),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.






[jira] [Updated] (SPARK-24051) Incorrect results

2018-04-23 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-24051:
-
Description: 
I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
query, demonstrated by the Java program below. It was simplified from a much 
more complex query, but I'm having trouble simplifying it further without 
removing the erroneous behaviour.
{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {

public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.setMaster("local[*]");
SparkSession session = 
SparkSession.builder().config(conf).getOrCreate();

Row[] arr1 = new Row[]{
RowFactory.create(1, 42),
RowFactory.create(2, 99)};
StructType sch1 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty()),
new StructField("b", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
ds1.show();

Row[] arr2 = new Row[]{
RowFactory.create(3)};
StructType sch2 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
.withColumn("b", functions.lit(0));
ds2.show();

Column[] cols = new Column[]{
new Column("a"),
new Column("b").as("b"),
functions.count(functions.lit(1))
.over(Window.partitionBy())
.as("n")};
Dataset ds = ds1
.select(cols)
.union(ds2.select(cols))
.where(new Column("n").geq(1))
.drop("n");
ds.show();
//ds.explain(true);
}
}
{code}
It just calculates the union of 2 datasets,
{code:java}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
+---+---+
{code}
with
{code:java}
+---+---+
|  a|  b|
+---+---+
|  3|  0|
+---+---+
{code}
The expected result is:
{code:java}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
|  3|  0|
+---+---+
{code}
but instead it prints:
{code:java}
+---+---+
|  a|  b|
+---+---+
|  1|  0|
|  2|  0|
|  3|  0|
+---+---+
{code}
Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
Making seemingly trivial changes, such as replacing {{new Column("b").as("b"),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.

  was:
I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
query, demonstrated by the Java program below. It was simplified from a much 
more complex query, but I'm having trouble simplifying it further without 
removing the erroneous behaviour.

{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {

public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.setMaster("local[*]");
SparkSession session = 
SparkSession.builder().config(conf).getOrCreate();

Row[] arr1 = new Row[]{
RowFactory.create(1, 42),
RowFactory.create(2, 99)};
StructType sch1 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty()),
new StructField("b", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
ds1.show();

Row[] arr2 = new Row[]{
RowFactory.create(3)};
StructType sch2 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
.withColumn("b", functions.lit(0));
ds2.show();

Column[] cols = new Column[]{
new Column("a"),
new Column("b").as("b"),
functions.count(functions.lit(1))
  

[jira] [Commented] (SPARK-24051) Incorrect results

2018-04-23 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447847#comment-16447847
 ] 

Emlyn Corrin commented on SPARK-24051:
--

I tried again to reproduce it in the Scala API, keeping as close as possible to the Java calls, but did not get the same error. It looks like the error may be specific to the Java API.
For reference, here is the Scala code I came up with:
{code:scala}
import org.apache.spark.sql.{Column, RowFactory}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
import scala.collection.JavaConverters._

//val ds1 = Seq((1,42),(2,99)).toDF("a", "b")
val arr1 = List(RowFactory.create(Int.box(1), Int.box(42)),
RowFactory.create(Int.box(2), Int.box(99)))
val sch1 = new StructType(Array(new StructField("a", DataTypes.IntegerType),
new StructField("b", DataTypes.IntegerType)))
val ds1 = spark.createDataFrame(arr1.asJava, sch1)
ds1.show

//val ds2 = Seq((3)).toDF("a").withColumn("b",lit(0))
val arr2 = List(RowFactory.create(Int.box(3)))
val sch2 = new StructType(Array(new StructField("a", DataTypes.IntegerType)))
val ds2 = spark.createDataFrame(arr2.asJava, sch2).withColumn("b",lit(0))
ds2.show

//val cols = Array(new Column("a"), new Column("b").as("b"), count(lit(1)).over(Window.partitionBy()).as("n"))

val ds = ds1.select(new Column("a"), new Column("b").as("b"), count(lit(1)).over(Window.partitionBy()).as("n")).union(ds2.select(new Column("a"), new Column("b").as("b"), count(lit(1)).over(Window.partitionBy()).as("n"))).where(new Column("n").geq(1)).drop("n")
ds.show
{code}

> Incorrect results
> -
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing 

[jira] [Commented] (SPARK-24051) Incorrect results

2018-04-23 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447794#comment-16447794
 ] 

Emlyn Corrin commented on SPARK-24051:
--

I have tried before to reproduce it using the Scala API, and was unable to do so. That's not to say it's not present there, though, as I also had trouble translating the original code (from a Clojure wrapper around the Java API) to plain Java. It seems very sensitive to the exact calls made, and even small, seemingly trivial changes cause the behaviour to change.

(btw I just slightly simplified the Java code in the description)

> Incorrect results
> -
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.






[jira] [Updated] (SPARK-24051) Incorrect results

2018-04-23 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-24051:
-
Description: 
I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
query, demonstrated by the Java program below. It was simplified from a much 
more complex query, but I'm having trouble simplifying it further without 
removing the erroneous behaviour.

{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {

public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.setMaster("local[*]");
SparkSession session = 
SparkSession.builder().config(conf).getOrCreate();

Row[] arr1 = new Row[]{
RowFactory.create(1, 42),
RowFactory.create(2, 99)};
StructType sch1 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty()),
new StructField("b", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
ds1.show();

Row[] arr2 = new Row[]{
RowFactory.create(3)};
StructType sch2 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
.withColumn("b", functions.lit(0));
ds2.show();

Column[] cols = new Column[]{
new Column("a"),
new Column("b").as("b"),
functions.count(functions.lit(1))
.over(Window.partitionBy())
.as("n")};
Dataset ds = ds1
.select(cols)
.union(ds2.select(cols))
.where(new Column("n").geq(1))
.drop("n");
ds.show();
//ds.explain(true);
}
}
{code}
It just calculates the union of 2 datasets,
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
+---+---+
{code}
with
{code}
+---+---+
|  a|  b|
+---+---+
|  3|  0|
+---+---+
{code}
The expected result is:
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
|  3|  0|
+---+---+
{code}
but instead it prints:
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  0|
|  2|  0|
|  3|  0|
+---+---+
{code}
Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.

  was:
I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
query, demonstrated by the Java program below. It was simplified from a much 
more complex query, but I'm having trouble simplifying it further without 
removing the erroneous behaviour.

{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {

public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.setMaster("local[*]");
SparkSession session = 
SparkSession.builder().config(conf).getOrCreate();

Row[] arr1 = new Row[]{
RowFactory.create(1, 42),
RowFactory.create(2, 99)};
StructType sch1 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty()),
new StructField("b", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
ds1.show();

Row[] arr2 = new Row[]{
RowFactory.create(3)};
StructType sch2 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
.withColumn("b", functions.lit(0));
ds2.show();

Column[] cols = new Column[]{
new Column("a"),
new Column("b").as("b", Metadata.empty()),

[jira] [Commented] (SPARK-24051) Incorrect results

2018-04-23 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447776#comment-16447776
 ] 

Emlyn Corrin commented on SPARK-24051:
--

Not at all, I was meaning to go back and improve it but forgot before 
submitting.

> Incorrect results
> -
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b", Metadata.empty()),
> functions.count(functions.lit(1))
> .over(Window.partitionBy(new Column("a")))
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b", Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} clause after the union, makes it behave correctly again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24051) Incorrect results

2018-04-23 Thread Emlyn Corrin (JIRA)
Emlyn Corrin created SPARK-24051:


 Summary: Incorrect results
 Key: SPARK-24051
 URL: https://issues.apache.org/jira/browse/SPARK-24051
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Emlyn Corrin


I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
query, demonstrated by the Java program below. It was simplified from a much 
more complex query, but I'm having trouble simplifying it further without 
removing the erroneous behaviour.

{code:java}
package sparktest;

import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;

public class Main {

public static void main(String[] args) {
SparkConf conf = new SparkConf()
.setAppName("SparkTest")
.setMaster("local[*]");
SparkSession session = 
SparkSession.builder().config(conf).getOrCreate();

Row[] arr1 = new Row[]{
RowFactory.create(1, 42),
RowFactory.create(2, 99)};
StructType sch1 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty()),
new StructField("b", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
ds1.show();

Row[] arr2 = new Row[]{
RowFactory.create(3)};
StructType sch2 = new StructType(new StructField[]{
new StructField("a", DataTypes.IntegerType, true, 
Metadata.empty())});
Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
.withColumn("b", functions.lit(0));
ds2.show();

Column[] cols = new Column[]{
new Column("a"),
new Column("b").as("b", Metadata.empty()),
functions.count(functions.lit(1))
.over(Window.partitionBy(new Column("a")))
.as("n")};
Dataset<Row> ds = ds1
.select(cols)
.union(ds2.select(cols))
.where(new Column("n").geq(1))
.drop("n");
ds.show();
//ds.explain(true);
}
}
{code}
It just calculates the union of 2 datasets,
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
+---+---+
{code}
with
{code}
+---+---+
|  a|  b|
+---+---+
|  3|  0|
+---+---+
{code}
The expected result is:
{code}
+---+---+
|  a|  b|
+---+---+
|  1| 42|
|  2| 99|
|  3|  0|
+---+---+
{code}
but instead it prints:
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  0|
|  2|  0|
|  3|  0|
+---+---+
{code}
Notice how the value in column b is always zero, overriding the original values 
in rows 1 and 2.
Making seemingly trivial changes, such as replacing {{new Column("b").as("b", 
Metadata.empty()),}} with just {{new Column("b"),}} or removing the {{where}} 
clause after the union, makes it behave correctly again.
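
For reference, here is a minimal spark-shell (Scala) sketch of the first of 
those variants (referencing {{b}} directly instead of re-aliasing it with empty 
metadata), using the same data as the Java program above; it is purely 
illustrative and not part of the original reproduction:
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Same data as the Java reproduction above
val ds1 = Seq((1, 42), (2, 99)).toDF("a", "b")
val ds2 = Seq(3).toDF("a").withColumn("b", lit(0))

val cols = Seq(
  col("a"),
  col("b"),  // plain reference, no .as("b", Metadata.empty())
  count(lit(1)).over(Window.partitionBy(col("a"))).as("n"))

ds1.select(cols: _*)
  .union(ds2.select(cols: _*))
  .where(col("n") >= 1)
  .drop("n")
  .show()   // expected rows: (1,42), (2,99), (3,0)
{code}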






[jira] [Created] (SPARK-24033) LAG Window function broken in Spark 2.3

2018-04-20 Thread Emlyn Corrin (JIRA)
Emlyn Corrin created SPARK-24033:


 Summary: LAG Window function broken in Spark 2.3
 Key: SPARK-24033
 URL: https://issues.apache.org/jira/browse/SPARK-24033
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Emlyn Corrin


The {{LAG}} window function appears to be broken in Spark 2.3.0, always failing 
with an {{AnalysisException}}. Interestingly, {{LEAD}} is not affected, so the 
issue can be worked around by negating the offset and using {{LEAD}} instead.

Reproduction (run in {{spark-shell}}):
{code:java}
import org.apache.spark.sql.expressions.Window
val ds = Seq((1,1),(1,2),(1,3),(2,1),(2,2)).toDF("n", "i")
// The following works:
ds.withColumn("m", lead("i", 
-1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show
// The following (equivalent) fails:
ds.withColumn("m", lag("i", 
1).over(Window.partitionBy("n").orderBy("i").rowsBetween(-1, -1))).show
{code}

Here is the stacktrace:
{quote}
org.apache.spark.sql.AnalysisException: Window Frame 
specifiedwindowframe(RowFrame, -1, -1) must match the required frame 
specifiedwindowframe(RowFrame, -1, -1);
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2034)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31$$anonfun$applyOrElse$10.applyOrElse(Analyzer.scala:2030)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:120)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:120)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:125)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:85)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:76)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2030)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$31.applyOrElse(Analyzer.scala:2029)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$.apply(Analyzer.scala:2029)
  at 

[jira] [Commented] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-03-29 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946882#comment-15946882
 ] 

Emlyn Corrin commented on SPARK-18971:
--

Will this fix go into Spark 2.1.1?

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.2.0
>
>
> Check https://github.com/netty/netty/issues/6153 for details
> You should be able to see a stack trace similar to the following in the 
> executor thread dump.
> {code}
> "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE
> at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504)
> at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
> at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
> at io.netty.util.Recycler.get(Recycler.java:144)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
> at 
> io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2017-02-23 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880338#comment-15880338
 ] 

Emlyn Corrin commented on SPARK-8480:
-

If anyone just wants a way to identify the RDDs in the storage tab, it turns 
out it's already possible to name them by creating a temporary view (see 
https://github.com/apache/spark/pull/16609#issuecomment-281865742).
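
In case it helps, here is a minimal spark-shell (Scala) sketch of that approach 
(the view name {{my_dataset}} is just an example):
{code:scala}
val df = spark.range(1000).toDF()
df.createOrReplaceTempView("my_dataset")

// Caching the view by name makes the storage tab show the entry as
// "In-memory table my_dataset" rather than an anonymous MapPartitionsRDD.
spark.catalog.cacheTable("my_dataset")
spark.table("my_dataset").count()   // materialise the cache so it appears in the UI
{code}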

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> RDD has a setName method, so in the Spark UI it's easier to understand 
> what a cache is for (e.g. "data for LogisticRegression model"). 
> It would be nice to have the same method for DataFrame, since the cache page 
> displays its logical schema, which could be quite big.






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2017-01-18 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828512#comment-15828512
 ] 

Emlyn Corrin commented on SPARK-8480:
-

[~skp33] OK, I can see that could be useful, but I think it's outside the scope 
of my PR.

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> RDD has a setName method, so in the Spark UI it's easier to understand 
> what a cache is for (e.g. "data for LogisticRegression model"). 
> It would be nice to have the same method for DataFrame, since the cache page 
> displays its logical schema, which could be quite big.






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2017-01-18 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827932#comment-15827932
 ] 

Emlyn Corrin commented on SPARK-8480:
-

[~skp33] with this change, you can do something similar:
{code}
scala> val df = sc.range(1,1000).toDF
df: org.apache.spark.sql.DataFrame = [value: bigint]

scala> df.setName("MyDataset")
res0: df.type = MyDataset

scala> df.cache
res1: df.type = MyDataset

scala> df.count
res2: Long = 999

scala> sc.getPersistentRDDs.foreach(println)
(4,In-memory table MyDataset MapPartitionsRDD[4] at cache at <console>:27)

scala> sc.getPersistentRDDs.filter(_._2.name == "In-memory table 
MyDataset").foreach(_._2.unpersist())

scala> sc.getPersistentRDDs.foreach(println)

scala>
{code}
The name is not quite identical (it has a prefix added), though. Is this good 
enough, or did you have something else in mind?

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> RDD has a setName method, so in the Spark UI it's easier to understand 
> what a cache is for (e.g. "data for LogisticRegression model"). 
> It would be nice to have the same method for DataFrame, since the cache page 
> displays its logical schema, which could be quite big.






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2017-01-17 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826360#comment-15826360
 ] 

Emlyn Corrin commented on SPARK-8480:
-

[~skp33] I'm not sure what you mean; maybe a code snippet would clarify?

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> RDD has a setName method, so in the Spark UI it's easier to understand 
> what a cache is for (e.g. "data for LogisticRegression model"). 
> It would be nice to have the same method for DataFrame, since the cache page 
> displays its logical schema, which could be quite big.






[jira] [Commented] (SPARK-13210) NPE in Sort

2016-12-16 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754237#comment-15754237
 ] 

Emlyn Corrin commented on SPARK-13210:
--

I've just hit this suspiciously similar stack trace in Spark 2.0.2; I'm not sure 
if it's the same thing:
{code}
java.lang.NullPointerException
at 
org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:347)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:197)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:259)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
at 
org.apache.spark.shuffle.sort.ShuffleInMemorySorter.reset(ShuffleInMemorySorter.java:100)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.spill(ShuffleExternalSorter.java:261)
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:171)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:364)
at 
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:387)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:241)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:162)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

> NPE in Sort
> ---
>
> Key: SPARK-13210
> URL: https://issues.apache.org/jira/browse/SPARK-13210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> When running TPCDS query Q78 with scale factor 10:
> {code}
> 16/02/04 22:39:09 ERROR Executor: Managed memory leak detected; size = 
> 268435456 bytes, TID = 143
> 16/02/04 22:39:09 ERROR Executor: Exception in task 0.0 in stage 47.0 (TID 
> 143)
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:333)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:60)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:39)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:239)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:415)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:116)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:87)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:60)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$23.apply(RDD.scala:735)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at 

[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation

2016-11-17 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673931#comment-15673931
 ] 

Emlyn Corrin commented on SPARK-18172:
--

I'm not sure I have the time to build from source at the moment to verify 
this; if it now works for you and [~hvanhovell], it's most likely fixed 
now. If it recurs for me with 2.0.3 or 2.1.0 once they're released, I'll 
reopen this.

Thanks.

> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
> Fix For: 2.0.3, 2.1.0
>
>
> Since Spark 2.0.1, the following pyspark snippet fails with 
> {{AnalysisException: The second argument of First should be a boolean 
> literal}} (but it's not restricted to Python, similar code in Java fails 
> in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for 
> SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
> F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is:
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 
> ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
>  ds._3)).show()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py 
> in show(self, n, truncate)
> 285 +---+-+
> 286 """
> --> 287 print(self._jdf.showString(n, truncate))
> 288
> 289 def __repr__(self):
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o76.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: first(_2#1L)()
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> 

[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation

2016-11-16 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671227#comment-15671227
 ] 

Emlyn Corrin commented on SPARK-18172:
--

It occurs on 2.0.1 and 2.0.2 (on Mac, installed via Homebrew), but I'd need 
more time to compile a more recent version and test that.
Since it was literally just the snippet above that triggered it, I think it's 
probably fixed if you can't reproduce it.
Maybe it is the same underlying problem as SPARK-18300, and the fix for that 
fixed this too?


> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
>
> Since Spark 2.0.1, the following pyspark snippet fails with 
> {{AnalysisException: The second argument of First should be a boolean 
> literal}} (but it's not restricted to Python, similar code in Java fails 
> in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for 
> SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
> F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is:
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 
> ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
>  ds._3)).show()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py 
> in show(self, n, truncate)
> 285 +---+-+
> 286 """
> --> 287 print(self._jdf.showString(n, truncate))
> 288
> 289 def __repr__(self):
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o76.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: first(_2#1L)()
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> 

[jira] [Updated] (SPARK-18300) ClassCastException during count distinct

2016-11-11 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-18300:
-
Component/s: SQL

> ClassCastException during count distinct
> 
>
> Key: SPARK-18300
> URL: https://issues.apache.org/jira/browse/SPARK-18300
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
>
> While trying to reproduce SPARK-18172 in SQL, I found the following SQL query 
> fails in Spark 2.0.1 (via spark-beeline) with a ClassCastException:
> {code}
> select count(distinct b), count(distinct b, c) from (select 1 as a, 2 as b, 3 
> as c) group by a;
> {code}
> Selecting either of the two counts on its own runs fine; it is only when 
> both are selected in the same query that it fails. I also tried the same 
> query in pyspark:
> {code}
> spark.sql('select count(distinct b) as x, count(distinct b, c) as y from 
> (select 1 as a, 2 as b, 3 as c) group by a').show()
> {code}
> I got the same error; the full stack trace is:
> {code}
> : java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.Attribute
>   at 
> org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:388)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:388)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:242)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:242)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$10.apply(WholeStageCodegenExec.scala:454)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$10.apply(WholeStageCodegenExec.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages.supportCodegen(WholeStageCodegenExec.scala:454)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages.org$apache$spark$sql$execution$CollapseCodegenStages$$insertWholeStageCodegen(WholeStageCodegenExec.scala:482)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$org$apache$spark$sql$execution$CollapseCodegenStages$$insertWholeStageCodegen$2.apply(WholeStageCodegenExec.scala:485)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$org$apache$spark$sql$execution$CollapseCodegenStages$$insertWholeStageCodegen$2.apply(WholeStageCodegenExec.scala:485)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages.org$apache$spark$sql$execution$CollapseCodegenStages$$insertWholeStageCodegen(WholeStageCodegenExec.scala:485)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages.org$apache$spark$sql$execution$CollapseCodegenStages$$insertInputAdapter(WholeStageCodegenExec.scala:469)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$org$apache$spark$sql$execution$CollapseCodegenStages$$insertInputAdapter$1.apply(WholeStageCodegenExec.scala:471)
>   at 
> org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$org$apache$spark$sql$execution$CollapseCodegenStages$$insertInputAdapter$1.apply(WholeStageCodegenExec.scala:471)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 

[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation

2016-11-07 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15644358#comment-15644358
 ] 

Emlyn Corrin commented on SPARK-18172:
--

I can also reproduce it in Spark SQL with the following query:
{code:sql}
create table if not exists tmp as select 1 as a, 2 as b, 3 as c;
select first(b), count(distinct b), count(distinct b, c) from tmp group by a;
{code}
As above, removing any of the three expressions from the select allows it to 
succeed.
While investigating this, I ran into another issue that I've reported 
separately in SPARK-18300.

> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
>
> Since Spark 2.0.1, the following pyspark snippet fails with 
> {{AnalysisException: The second argument of First should be a boolean 
> literal}} (but it's not restricted to Python, similar code in Java fails 
> in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for 
> SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
> F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is:
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 
> ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
>  ds._3)).show()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py 
> in show(self, n, truncate)
> 285 +---+-+
> 286 """
> --> 287 print(self._jdf.showString(n, truncate))
> 288
> 289 def __repr__(self):
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o76.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: first(_2#1L)()
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> 

[jira] [Created] (SPARK-18300) ClassCastException during count distinct

2016-11-07 Thread Emlyn Corrin (JIRA)
Emlyn Corrin created SPARK-18300:


 Summary: ClassCastException during count distinct
 Key: SPARK-18300
 URL: https://issues.apache.org/jira/browse/SPARK-18300
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
Reporter: Emlyn Corrin


While trying to reproduce SPARK-18172 in SQL, I found the following SQL query 
fails in Spark 2.0.1 (via spark-beeline) with a ClassCastException:
{code}
select count(distinct b), count(distinct b, c) from (select 1 as a, 2 as b, 3 
as c) group by a;
{code}
Selecting either of the two counts on its own runs fine; it is only when both 
are selected in the same query that it fails. I also tried the same query in 
pyspark:
{code}
spark.sql('select count(distinct b) as x, count(distinct b, c) as y from 
(select 1 as a, 2 as b, 3 as c) group by a').show()
{code}
I got the same error; the full stack trace is:
{code}
: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to 
org.apache.spark.sql.catalyst.expressions.Attribute
at 
org.apache.spark.sql.types.StructType$$anonfun$fromAttributes$1.apply(StructType.scala:388)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.types.StructType$.fromAttributes(StructType.scala:388)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:242)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:242)
at 
org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$10.apply(WholeStageCodegenExec.scala:454)
at 
org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$10.apply(WholeStageCodegenExec.scala:454)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.execution.CollapseCodegenStages.supportCodegen(WholeStageCodegenExec.scala:454)
at 
org.apache.spark.sql.execution.CollapseCodegenStages.org$apache$spark$sql$execution$CollapseCodegenStages$$insertWholeStageCodegen(WholeStageCodegenExec.scala:482)
at 
org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$org$apache$spark$sql$execution$CollapseCodegenStages$$insertWholeStageCodegen$2.apply(WholeStageCodegenExec.scala:485)
at 
org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$org$apache$spark$sql$execution$CollapseCodegenStages$$insertWholeStageCodegen$2.apply(WholeStageCodegenExec.scala:485)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.execution.CollapseCodegenStages.org$apache$spark$sql$execution$CollapseCodegenStages$$insertWholeStageCodegen(WholeStageCodegenExec.scala:485)
at 
org.apache.spark.sql.execution.CollapseCodegenStages.org$apache$spark$sql$execution$CollapseCodegenStages$$insertInputAdapter(WholeStageCodegenExec.scala:469)
at 
org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$org$apache$spark$sql$execution$CollapseCodegenStages$$insertInputAdapter$1.apply(WholeStageCodegenExec.scala:471)
at 
org.apache.spark.sql.execution.CollapseCodegenStages$$anonfun$org$apache$spark$sql$execution$CollapseCodegenStages$$insertInputAdapter$1.apply(WholeStageCodegenExec.scala:471)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 

[jira] [Comment Edited] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-10-30 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613656#comment-15613656
 ] 

Emlyn Corrin edited comment on SPARK-16648 at 10/30/16 10:07 AM:
-

Edit: I've opened a new issue for this at SPARK-18172.

Since Spark 2.0.1, the following pyspark snippet fails (I believe it worked 
under 2.0.0, so this issue seems like the most likely cause of the change in 
behaviour):
{code}
from pyspark.sql import functions as F
ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
F.countDistinct(ds._2, ds._3)).show()
{code}
It works if any of the three arguments to {{.agg}} is removed.

The stack trace is:
{code}
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 
ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
 ds._3)).show()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py in 
show(self, n, truncate)
285 +---+-+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree: first(_2#1L)()
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 

[jira] [Created] (SPARK-18172) AnalysisException in first/last during aggregation

2016-10-30 Thread Emlyn Corrin (JIRA)
Emlyn Corrin created SPARK-18172:


 Summary: AnalysisException in first/last during aggregation
 Key: SPARK-18172
 URL: https://issues.apache.org/jira/browse/SPARK-18172
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
Reporter: Emlyn Corrin


Since Spark 2.0.1, the following pyspark snippet fails with 
{{AnalysisException: The second argument of First should be a boolean literal}} 
(but it's not restricted to Python, similar code in Java fails in the same 
way).
It worked in Spark 2.0.0, so I believe it may be related to the fix for 
SPARK-16648.
{code}
from pyspark.sql import functions as F
ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
F.countDistinct(ds._2, ds._3)).show()
{code}
It works if any of the three arguments to {{.agg}} is removed.

The stack trace is:
{code}
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 
ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
 ds._3)).show()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py in 
show(self, n, truncate)
285 +---+-+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree: first(_2#1L)()
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 

[jira] [Comment Edited] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-10-27 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613656#comment-15613656
 ] 

Emlyn Corrin edited comment on SPARK-16648 at 10/27/16 11:33 PM:
-

Since Spark 2.0.1, the following pyspark snippet fails (I believe it worked 
under 2.0.0, so this issue seems like the most likely cause of the change in 
behaviour):
{code}
from pyspark.sql import functions as F
ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
F.countDistinct(ds._2, ds._3)).show()
{code}
It works if any of the three arguments to {{.agg}} is removed.

The stack trace is:
{code}
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 
ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
 ds._3)).show()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py in 
show(self, n, truncate)
285 +---+-+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree: first(_2#1L)()
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)

[jira] [Commented] (SPARK-16648) LAST_VALUE(FALSE) OVER () throws IndexOutOfBoundsException

2016-10-27 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613656#comment-15613656
 ] 

Emlyn Corrin commented on SPARK-16648:
--

Since Spark 2.0.1, the following snippet fails (I believe it worked under 
2.0.0, so this issue seems like the most likely cause of the change in behaviour):
{code}
from pyspark.sql import functions as F
ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
F.countDistinct(ds._2, ds._3)).show()
{code}
It works if any of the three arguments to {{.agg}} is removed.

The stack trace is:
{code}
Py4JJavaError Traceback (most recent call last)
 in ()
> 1 
ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
 ds._3)).show()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py in 
show(self, n, truncate)
285 +---+-+
286 """
--> 287 print(self._jdf.showString(n, truncate))
288
289 def __repr__(self):

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
tree: first(_2#1L)()
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
at 
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
at 

[jira] [Created] (SPARK-15846) Allow passing a PrintStream to DataFrame.explain

2016-06-09 Thread Emlyn Corrin (JIRA)
Emlyn Corrin created SPARK-15846:


 Summary: Allow passing a PrintStream to DataFrame.explain
 Key: SPARK-15846
 URL: https://issues.apache.org/jira/browse/SPARK-15846
 Project: Spark
  Issue Type: Improvement
Reporter: Emlyn Corrin
Priority: Minor


DataFrame.explain would be more useful if it could take a PrintStream or 
similar as a parameter so that the output can be sent somewhere other than 
standard out.
I'd be willing to make a pull request if there's a reasonable chance of it 
being accepted.
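
As a stopgap, the plan text can be pulled out of {{queryExecution}} (exposed on DataFrame as a {{@DeveloperApi}} field) and written to any stream. A minimal sketch, assuming that field stays accessible from user code; class and file names are illustrative:
{code}
import java.io.PrintStream;

import org.apache.spark.sql.DataFrame;

public final class ExplainToStream {
    // Sketch only: send the extended plan of a DataFrame to an arbitrary stream,
    // instead of relying on explain() printing to standard out.
    public static void explainTo(DataFrame df, PrintStream out) {
        // QueryExecution.toString() contains the parsed/analyzed/optimized/physical
        // plans, roughly what explain(true) prints.
        out.println(df.queryExecution().toString());
    }

    // Illustrative usage, with df built elsewhere:
    //   explainTo(df, new PrintStream(new java.io.FileOutputStream("plans.txt")));
}
{code}
An {{explain(PrintStream)}} overload could essentially wrap the same thing.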






[jira] [Commented] (SPARK-8480) Add setName for Dataframe

2016-02-19 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15153988#comment-15153988
 ] 

Emlyn Corrin commented on SPARK-8480:
-

This would be really useful. We have a fairly large Spark application, and the 
UI becomes very hard to follow because the descriptions all take one of only a 
handful of values (pointing to locations in our wrapper code). It would be much 
easier to understand if we could set meaningful descriptions.
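
For comparison, a minimal sketch of what naming already looks like on the RDD side versus a DataFrame; paths and names are illustrative, and the commented-out call is the method being requested here (it does not exist today):
{code}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public final class StorageTabNames {
    public static void cacheWithNames(JavaSparkContext ctx, SQLContext sqlCtx) {
        // The underlying RDD can be given a readable name, which is what the
        // Storage tab then displays:
        JavaRDD<String> rawEvents = ctx.textFile("/data/events.json");
        rawEvents.rdd().setName("raw events");
        rawEvents.cache();

        // A cached DataFrame has no equivalent, so the cache page falls back to
        // showing its (potentially very large) logical schema:
        DataFrame events = sqlCtx.read().json("/data/events.json");
        events.cache();
        // events.setName("parsed events"); // requested by this issue; not available
    }
}
{code}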

> Add setName for Dataframe
> -
>
> Key: SPARK-8480
> URL: https://issues.apache.org/jira/browse/SPARK-8480
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Peter Rudenko
>Priority: Minor
>
> Rdd has a method setName, so in spark UI, it's more easily to understand 
> what's this cache for. E.g. ("data for LogisticRegression model", etc.). 
> Would be nice to have the same method for Dataframe, since it displays a 
> logical schema, in cache page, which could be quite big.






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-02-16 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148472#comment-15148472
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Thanks, the {{registerTempTable}} + {{sql}} workaround is fine for now; I guess I'll just wait for 2.0 to clean that up.

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-02-16 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148458#comment-15148458
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Any update on this? Should I open a new issue for it so it doesn't fall through 
the cracks?

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-26 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116977#comment-15116977
 ] 

Emlyn Corrin commented on SPARK-9740:
-

I've put together a minimal example to demonstrate the problem:
{code}
package spark_test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.hive.HiveContext;

public class SparkTestMain {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: SparkTest ");
            System.exit(1);
        }
        SparkConf sparkConf = new SparkConf().setAppName("SparkTest");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        SQLContext sqlCtx = new HiveContext(ctx); // I tried both SQLContext and HiveContext here
        DataFrame df = sqlCtx.read().json(args[0]);
        System.out.println(df.schema().simpleString());
        DataFrame df2 = df.select(functions.expr("FIRST(value,true)")
                .over(Window.partitionBy(df.col("id"))
                        .orderBy(df.col("time"))
                        .rowsBetween(Long.MIN_VALUE, 0)));
        System.out.println(df2.take(5));
    }
}
{code}

I ran that with:
{code}
spark-submit --master local[*] spark-test.jar test.json
{code}
And it fails with:
{code}
struct
Exception in thread "main" java.lang.UnsupportedOperationException: 
'FIRST('value,true) is not supported in a window operation.
at 
org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1049)
at spark_test.SparkTestMain.main(SparkTestMain.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.






[jira] [Comment Edited] (SPARK-9740) first/last aggregate NULL behavior

2016-01-26 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116977#comment-15116977
 ] 

Emlyn Corrin edited comment on SPARK-9740 at 1/26/16 9:32 AM:
--

I've put together a minimal example to demonstrate the problem:
{code}
package spark_test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.hive.HiveContext;

public class SparkTestMain {
    public static void main(String[] args) {
        assert args.length == 1;
        SparkConf sparkConf = new SparkConf().setAppName("SparkTest");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        SQLContext sqlCtx = new HiveContext(ctx); // I tried both SQLContext and HiveContext here
        DataFrame df = sqlCtx.read().json(args[0]);
        System.out.println(df.schema().simpleString());
        DataFrame df2 = df.select(functions.expr("FIRST(value,true)")
                .over(Window.partitionBy(df.col("id"))
                        .orderBy(df.col("time"))
                        .rowsBetween(Long.MIN_VALUE, 0)));
        System.out.println(df2.take(5));
    }
}
{code}

I ran that with:
{code}
spark-submit --master local[*] spark-test.jar test.json
{code}
And it fails with:
{code}
struct
Exception in thread "main" java.lang.UnsupportedOperationException: 
'FIRST('value,true) is not supported in a window operation.
at 
org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1049)
at spark_test.SparkTestMain.main(SparkTestMain.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}


was (Author: emlyn):
I've put together a minimal example to demonstrate the problem:
{code}
package spark_test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.hive.HiveContext;

public class SparkTestMain {
public static void main(String[] args) {
if (args.length != 1) {
System.err.println("Usage: SparkTest ");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("SparkTest");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
SQLContext sqlCtx = new HiveContext(ctx); // I tried both SQLContext 
and HiveContext here
DataFrame df = sqlCtx.read().json(args[0]);
System.out.println(df.schema().simpleString());
DataFrame df2 = df.select(functions.expr("FIRST(value,true)")
  .over(Window.partitionBy(df.col("id"))
.orderBy(df.col("time"))
.rowsBetween(Long.MIN_VALUE, 0)));
System.out.println(df2.take(5));
}
}
{code}

I ran that with:
{code}
spark-submit --master local[*] spark-test.jar test.json
{code}
And it fails with:
{code}
struct
Exception in thread "main" java.lang.UnsupportedOperationException: 
'FIRST('value,true) is not supported in a window operation.
at 
org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1049)
at spark_test.SparkTestMain.main(SparkTestMain.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 

[jira] [Comment Edited] (SPARK-9740) first/last aggregate NULL behavior

2016-01-26 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116977#comment-15116977
 ] 

Emlyn Corrin edited comment on SPARK-9740 at 1/26/16 9:33 AM:
--

I've put together a minimal example to demonstrate the problem:
{code}
package spark_test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.hive.HiveContext;

public class SparkTestMain {
    public static void main(String[] args) {
        assert args.length == 1;
        SparkConf sparkConf = new SparkConf().setAppName("SparkTest");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        SQLContext sqlCtx = new HiveContext(ctx); // I tried both SQLContext and HiveContext here
        DataFrame df = sqlCtx.read().json(args[0]);
        System.out.println(df.schema().simpleString());
        DataFrame df2 = df.select(functions.expr("FIRST(value,true)")
                .over(Window.partitionBy(df.col("id"))
                        .orderBy(df.col("time"))
                        .rowsBetween(Long.MIN_VALUE, 0)));
        System.out.println(df2.take(5));
    }
}
{code}

I ran that with:
{code}
spark-submit --master local[*] spark-test.jar test.json
{code}
And it fails with:
{code}
struct
Exception in thread "main" java.lang.UnsupportedOperationException: 
'FIRST('value,true) is not supported in a window operation.
at 
org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1049)
at spark_test.SparkTestMain.main(SparkTestMain.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}


was (Author: emlyn):
I've put together a minimal example to demonstrate the problem:
{code}
package spark_test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.hive.HiveContext;

public class SparkTestMain {
public static void main(String[] args) {
assert args.length == 1;
SparkConf sparkConf = new SparkConf().setAppName("SparkTest");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
SQLContext sqlCtx = new HiveContext(ctx); // I tried both SQLContext 
and HiveContext here
DataFrame df = sqlCtx.read().json(args[0]);
System.out.println(df.schema().simpleString());
DataFrame df2 = df.select(functions.expr("FIRST(value,true)")
  .over(Window.partitionBy(df.col("id"))
.orderBy(df.col("time"))
.rowsBetween(Long.MIN_VALUE, 0)));
System.out.println(df2.take(5));
}
}
{code}

I ran that with:
{code}
spark-submit --master local[*] spark-test.jar test.json
{code}
And it fails with:
{code}
struct
Exception in thread "main" java.lang.UnsupportedOperationException: 
'FIRST('value,true) is not supported in a window operation.
at 
org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1049)
at spark_test.SparkTestMain.main(SparkTestMain.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at 

[jira] [Comment Edited] (SPARK-9740) first/last aggregate NULL behavior

2016-01-26 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116977#comment-15116977
 ] 

Emlyn Corrin edited comment on SPARK-9740 at 1/26/16 9:40 AM:
--

I've put together a minimal example to demonstrate the problem:
{code}
package spark_test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.hive.HiveContext;

public class SparkTestMain {
    public static void main(String[] args) {
        assert args.length == 1;
        SparkConf sparkConf = new SparkConf().setAppName("SparkTest");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        SQLContext sqlCtx = new HiveContext(ctx); // I tried both SQLContext and HiveContext here
        DataFrame df = sqlCtx.read().json(args[0]);
        System.out.println(df.schema().simpleString());
        DataFrame df2 = df.select(df.col("id"),
                df.col("time"),
                functions.expr("LAST(value,true)")
                        .over(Window.partitionBy(df.col("id"))
                                .orderBy(df.col("time"))
                                .rowsBetween(Long.MIN_VALUE, 0)));
        System.out.println(df2.take(5));
    }
}
{code}

I compiled that and ran it with:
{code}
spark-submit --master local[*] spark-test.jar test.json
{code}

And it fails with:
{code}
struct
Exception in thread "main" java.lang.UnsupportedOperationException: 
'LAST('value,true) is not supported in a window operation.
at 
org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1049)
at spark_test.SparkTestMain.main(SparkTestMain.java:22)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

Where {{test.json}} just contains:
{code}
{"id":1,"time":1,"value":null}
{"id":1,"time":2,"value":"a"}
{"id":1,"time":3,"value":null}
{"id":2,"time":1,"value":"b"}
{"id":2,"time":2,"value":"c"}
{"id":2,"time":3,"value":null}
{"id":3,"time":1,"value":null}
{"id":3,"time":2,"value":null}
{"id":3,"time":3,"value":"d"}
{code}


was (Author: emlyn):
I've put together a minimal example to demonstrate the problem:
{code}
package spark_test;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.hive.HiveContext;

public class SparkTestMain {
public static void main(String[] args) {
assert args.length == 1;
SparkConf sparkConf = new SparkConf().setAppName("SparkTest");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
SQLContext sqlCtx = new HiveContext(ctx); // I tried both SQLContext 
and HiveContext here
DataFrame df = sqlCtx.read().json(args[0]);
System.out.println(df.schema().simpleString());
DataFrame df2 = df.select(functions.expr("FIRST(value,true)")
  .over(Window.partitionBy(df.col("id"))
  .orderBy(df.col("time"))
  .rowsBetween(Long.MIN_VALUE, 0)));
System.out.println(df2.take(5));
}
}
{code}

I ran that with:
{code}
spark-submit --master local[*] spark-test.jar test.json
{code}
And it fails with:
{code}
struct
Exception in thread "main" java.lang.UnsupportedOperationException: 
'FIRST('value,true) is not supported in a window operation.
at 
org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1049)
at spark_test.SparkTestMain.main(SparkTestMain.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115904#comment-15115904
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Thanks for the help. I've tried {{callUDF}}, and it gives the same error as {{expr}}. For now I've worked around it by calling {{registerTempTable("tempTable")}} on the DataFrame and then {{SQLContext.sql("SELECT LAST(colName,true) OVER(...) FROM tempTable")}}, which works but feels a bit hacky.
I'll try to put together a minimal example that demonstrates this, as the code is currently buried in the middle of a fairly large Clojure application that calls Spark through Java interop.
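
A minimal sketch of that SQL workaround, using the id/time/value columns from the test data elsewhere in this thread; the temp table name and output alias are illustrative:
{code}
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public final class LastValueWorkaround {
    // Sketch of the registerTempTable + sql workaround described above.
    public static DataFrame lastNonNullValue(SQLContext sqlCtx, DataFrame df) {
        df.registerTempTable("events");
        return sqlCtx.sql(
            "SELECT id, time, " +
            "       LAST(value, true) OVER (" +
            "           PARTITION BY id ORDER BY time " +
            "           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS last_value " +
            "FROM events");
    }
}
{code}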

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116297#comment-15116297
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Here's the stack trace; it fails during compilation:
{code}
java.lang.UnsupportedOperationException: 'last('timestamp,true) is not 
supported in a window operation., compiling:(test.clj:8:1)
Exception in thread "main" java.lang.UnsupportedOperationException: 
'last('timestamp,true) is not supported in a window operation., 
compiling:(test.clj:8:1)
at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3657)
at clojure.lang.Compiler$IfExpr.eval(Compiler.java:2695)
at clojure.lang.Compiler.compile1(Compiler.java:7474)
at clojure.lang.Compiler.compile(Compiler.java:7541)
at clojure.lang.RT.compile(RT.java:406)
at clojure.lang.RT.load(RT.java:451)
at clojure.lang.RT.load(RT.java:419)
at clojure.core$load$fn__5677.invoke(core.clj:5893)
at clojure.core$load.invokeStatic(core.clj:5892)
at clojure.core$load.doInvoke(core.clj:5876)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invokeStatic(core.clj:5697)
at clojure.core$load_one.invoke(core.clj:5692)
at clojure.core$load_lib$fn__5626.invoke(core.clj:5737)
at clojure.core$load_lib.invokeStatic(core.clj:5736)
at clojure.core$load_lib.doInvoke(core.clj:5717)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invokeStatic(core.clj:648)
at clojure.core$load_libs.invokeStatic(core.clj:5774)
at clojure.core$load_libs.doInvoke(core.clj:5758)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invokeStatic(core.clj:648)
at clojure.core$require.invokeStatic(core.clj:5796)
at sparkalytics.core$loading__5569__auto7689.invoke(core.clj:1)
at clojure.lang.AFn.applyToHelper(AFn.java:152)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3652)
at clojure.lang.Compiler.compile1(Compiler.java:7474)
at clojure.lang.Compiler.compile1(Compiler.java:7464)
at clojure.lang.Compiler.compile(Compiler.java:7541)
at clojure.lang.RT.compile(RT.java:406)
at clojure.lang.RT.load(RT.java:451)
at clojure.lang.RT.load(RT.java:419)
at clojure.core$load$fn__5677.invoke(core.clj:5893)
at clojure.core$load.invokeStatic(core.clj:5892)
at clojure.core$load.doInvoke(core.clj:5876)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invokeStatic(core.clj:5697)
at clojure.core$compile$fn__5682.invoke(core.clj:5903)
at clojure.core$compile.invokeStatic(core.clj:5903)
at user$eval13$fn__22.invoke(form-init8921303069087101413.clj:1)
at user$eval13.invokeStatic(form-init8921303069087101413.clj:1)
at user$eval13.invoke(form-init8921303069087101413.clj:1)
at clojure.lang.Compiler.eval(Compiler.java:6927)
at clojure.lang.Compiler.eval(Compiler.java:6917)
at clojure.lang.Compiler.load(Compiler.java:7379)
at clojure.lang.Compiler.loadFile(Compiler.java:7317)
at clojure.main$load_script.invokeStatic(main.clj:275)
at clojure.main$init_opt.invokeStatic(main.clj:277)
at clojure.main$init_opt.invoke(main.clj:277)
at clojure.main$initialize.invokeStatic(main.clj:308)
at clojure.main$null_opt.invokeStatic(main.clj:342)
at clojure.main$null_opt.invoke(main.clj:339)
at clojure.main$main.invokeStatic(main.clj:421)
at clojure.main$main.doInvoke(main.clj:384)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at clojure.lang.Var.invoke(Var.java:383)
at clojure.lang.AFn.applyToHelper(AFn.java:156)
at clojure.lang.Var.applyTo(Var.java:700)
at clojure.main.main(main.java:37)
Caused by: java.lang.UnsupportedOperationException: 'last('timestamp,true) is 
not supported in a window operation.
at 
org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1049)
at sparkalytics.jobs.test$fn__8460.invokeStatic(test.clj:10)
at sparkalytics.jobs.test$fn__8460.invoke(test.clj:8)
at clojure.lang.AFn.applyToHelper(AFn.java:152)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.lang.Compiler$InvokeExpr.eval(Compiler.java:3652)
... 59 more
Compilation failed: Subprocess failed
{code}
(I'm not sure how to recover the 59 stack frames elided at the end of the trace, but hopefully this is enough.)

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark

[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116329#comment-15116329
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Version 1.6.0

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114935#comment-15114935
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Thanks [~yhuai], that's just what I was looking for!

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115115#comment-15115115
 ] 

Emlyn Corrin commented on SPARK-9740:
-

[~yhuai] actually, that doesn't seem to work; when I try it, I get:
{code}
java.lang.UnsupportedOperationException: 'LAST('column,true) is not supported 
in a window operation.
{code}

{{functions.last}}, on the other hand, seems to work fine in window operations.
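
To spell out the contrast, a minimal sketch against a DataFrame with the id/time/value columns used in the examples above; behaviour is as reported in this thread:
{code}
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public final class LastOverWindow {
    public static void compare(DataFrame df) {
        WindowSpec byIdOrderedByTime =
                Window.partitionBy(df.col("id")).orderBy(df.col("time"));

        // Works: the built-in last() column function accepts a window spec,
        // but offers no way to request ignoreNulls here.
        df.select(functions.last(df.col("value")).over(byIdOrderedByTime)).show();

        // Fails as reported above: the parsed LAST(value, true) expression is
        // rejected by WindowSpec with an UnsupportedOperationException.
        df.select(functions.expr("LAST(value, true)").over(byIdOrderedByTime)).show();
    }
}
{code}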

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.






[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-21 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110688#comment-15110688
 ] 

Emlyn Corrin commented on SPARK-9740:
-

How do you use FIRST/LAST from the Java API with ignoreNulls now? I can't find 
a way to specify ignoreNulls=true.

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.


