[jira] [Commented] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-10-22 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968711#comment-14968711
 ] 

kevin yu commented on SPARK-5966:
-

Hello Tathagata & Andrew:
I have coded a possible fix, and the error message will look like this:
$ ./bin/spark-submit --master local[10] --deploy-mode cluster 
examples/src/main/python/pi.py
Error: Cluster deploy mode is not compatible with master "local"
Run with --help for usage help or --verbose for debug output
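
For reference, here is a minimal Scala sketch of the kind of check involved (my own illustration of the idea, not the actual SparkSubmit code; the helper name validateMasterAndDeployMode is hypothetical):
{code}
// Hypothetical sketch of the validation idea (not the actual SparkSubmit code):
// reject --deploy-mode cluster when the master URL is a local[*] master.
def validateMasterAndDeployMode(master: String, deployMode: String): Unit = {
  if (master.startsWith("local") && deployMode == "cluster") {
    // Print the error shown above and exit instead of attempting the submission.
    System.err.println("Error: Cluster deploy mode is not compatible with master \"local\"")
    System.err.println("Run with --help for usage help or --verbose for debug output")
    sys.exit(1)
  }
}
{code}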

Let me know if you have any comments. Otherwise, I am going to submit a PR 
shortly. Thanks.

Kevin


> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}






[jira] [Commented] (SPARK-6043) Error when trying to rename table with alter table after using INSERT OVERWITE to populate the table

2015-10-29 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980899#comment-14980899
 ] 

kevin yu commented on SPARK-6043:
-

Hello Trystan: I tried your test case, and it works on Spark 1.5; the problem 
seems to have been fixed. Can you verify and close this JIRA? Thanks.
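
For reference, a minimal spark-shell sketch of the check (the same statements as in the description, issued through sqlContext.sql; it assumes a HiveContext-backed sqlContext and that the internalsales table exists):
{code}
// Re-run the statements from the description through sqlContext.sql
// (assumes sqlContext is a HiveContext and internalsales already exists).
sqlContext.sql("CREATE TABLE `tmp_table` (salesamount_c1 DOUBLE)")
sqlContext.sql("""INSERT OVERWRITE TABLE tmp_table
  SELECT MIN(sales_customer.salesamount) salesamount_c1
  FROM (SELECT SUM(sales.salesamount) salesamount FROM internalsales sales) sales_customer""")
sqlContext.sql("ALTER TABLE tmp_table RENAME TO not_tmp")
{code}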

> Error when trying to rename table with alter table after using INSERT 
> OVERWITE to populate the table
> 
>
> Key: SPARK-6043
> URL: https://issues.apache.org/jira/browse/SPARK-6043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Trystan Leftwich
>Priority: Minor
>
> If you populate a table using INSERT OVERWRITE and then try to rename the 
> table using alter table it fails with:
> {noformat}
> Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
> Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
> Unable to alter table. (state=,code=0)
> {noformat}
> Using the following SQL statement creates the error:
> {code:sql}
> CREATE TABLE `tmp_table` (salesamount_c1 DOUBLE);
> INSERT OVERWRITE table tmp_table SELECT
>MIN(sales_customer.salesamount) salesamount_c1
> FROM
> (
>   SELECT
>  SUM(sales.salesamount) salesamount
>   FROM
>  internalsales sales
> ) sales_customer;
> ALTER TABLE tmp_table RENAME to not_tmp;
> {code}
> But if you change the 'OVERWRITE' to be 'INTO' the SQL statement works.
> This is happening on our CDH5.3 cluster with multiple workers, If we use the 
> CDH5.3 Quickstart VM the SQL does not produce an error. Both cases were spark 
> 1.2.1 built for hadoop2.4+






[jira] [Commented] (SPARK-11657) Bad Dataframe data read from parquet

2015-11-11 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000763#comment-15000763
 ] 

kevin yu commented on SPARK-11657:
--

Hello Virgil: Can you try toDF().show(), and then toDF().take(2)?
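
For reference, a minimal spark-shell sketch of that check (it assumes the attached sample is available at hdfs:///sample, as in the description):
{code}
// Compare the show() output with take(2) on the same DataFrame
// (reads the attached sample from the path used in the description).
val data = sqlContext.read.parquet("hdfs:///sample")
data.toDF().show()
data.toDF().take(2).foreach(println)
{code}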

Thanks
Kevin

> Bad Dataframe data read from parquet
> 
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report






[jira] [Commented] (SPARK-11772) DataFrame.show() fails with non-ASCII strings

2015-11-17 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15008996#comment-15008996
 ] 

kevin yu commented on SPARK-11772:
--

Hello Greg:
I ran this against the latest Spark, and it works for me.

Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0-SNAPSHOT
      /_/

Using Python version 2.7.10 (default, Jul 14 2015 19:46:27)
SparkContext available as sc, HiveContext available as sqlContext.
>>> df = sqlContext.createDataFrame([[u'ab\u0255']])
>>> df.show()
+---+
| _1|
+---+
|abɕ|
+---+

>>> 


> DataFrame.show() fails with non-ASCII strings
> -
>
> Key: SPARK-11772
> URL: https://issues.apache.org/jira/browse/SPARK-11772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Greg Baker
>Priority: Minor
>
> When given a non-ASCII string (in pyspark at least), the DataFrame.show() 
> method fails.
> {code:none}
> df = sqlContext.createDataFrame([[u'ab\u0255']])
> df.show()
> {code}
> Results in:
> {code:none}
> 15/11/16 21:36:54 INFO DAGScheduler: ResultStage 1 (showString at 
> NativeMethodAccessorImpl.java:-2) finished in 0.148 s
> 15/11/16 21:36:54 INFO DAGScheduler: Job 1 finished: showString at 
> NativeMethodAccessorImpl.java:-2, took 0.192634 s
> Traceback (most recent call last):
>   File ".../show_bug.py", line 8, in 
> df.show()
>   File 
> ".../spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
>  line 256, in show
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u0255' in 
> position 21: ordinal not in range(128)
> 15/11/16 21:36:54 INFO SparkContext: Invoking stop() from shutdown hook
> {code}






[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-10 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1490#comment-1490
 ] 

kevin yu commented on SPARK-11447:
--

Hi Kapil: I have a possible fix ready; now I am working on the test case.

Kevin


> Null comparison requires type information but type extraction fails for 
> complex types
> -
>
> Key: SPARK-11447
> URL: https://issues.apache.org/jira/browse/SPARK-11447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Kapil Singh
>
> While comparing a Column to a null literal, comparison works only if type of 
> null literal matches type of the Column it's being compared to. Example scala 
> code (can be run from spark shell):
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq("abc"),Seq(null),Seq("xyz"))
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", StringType, true)))
> val df = sqlContext.createDataFrame(sc.makeRDD(inputRows), dfSchema)
> //DOESN'T WORK
> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null
> //WORKS
> val filteredDF = df.filter(df("column") <=> (new Column(Literal.create(null, 
> SparkleFunctions.dataType(df("column"))
> Why should type information be required for a null comparison? If it's 
> required, it's not always possible to extract type information from complex  
> types (e.g. StructType). Following scala code (can be run from spark shell), 
> throws org.apache.spark.sql.catalyst.analysis.UnresolvedException:
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq(Row.fromSeq(Seq("abc", 
> "def"))),Seq(Row.fromSeq(Seq(null, "123"))),Seq(Row.fromSeq(Seq("ghi", 
> "jkl"
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", 
> StructType(Seq(StructField("p1", StringType, true), StructField("p2", 
> StringType, true))), true)))
> val filteredDF = df.filter(df("column")("p1") <=> (new 
> Column(Literal.create(null, SparkleFunctions.dataType(df("column")("p1"))
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: column#0[p1]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:243)
>   at 
> org.apache.spark.sql.ArithmeticFunctions$class.dataType(ArithmeticFunctions.scala:76)
>   at 
> org.apache.spark.sql.SparkleFunctions$.dataType(SparkleFunctions.scala:14)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC$$iwC$$iwC.(:61)
>   at $iwC$$iwC$$iwC.(:63)
>   at $iwC$$iwC.(:65)
>   at $iwC.(:67)
>   at (:69)
>   at .(:73)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> 

[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-02 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985569#comment-14985569
 ] 

kevin yu commented on SPARK-11447:
--

Hello Kapil:

When you say it doesn't work, do you mean that you got an exception?

I tried Spark 1.5, and this Scala code works for me.
Can you verify which Spark version you are running? I see you listed 1.5.1.

//DOESN'T WORK
val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))

I ran this in my spark-shell on the latest version:

scala> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]
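
For comparison, here is a sketch with an explicitly typed null literal (my own variant; StringType matches the schema in the description, and it avoids the SparkleFunctions helper):
{code}
// Null literal with an explicit type instead of extracting the type from the column.
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.catalyst.expressions.Literal

val filteredTyped = df.filter(df("column") <=> new Column(Literal.create(null, StringType)))
{code}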

Thanks.



> Null comparison requires type information but type extraction fails for 
> complex types
> -
>
> Key: SPARK-11447
> URL: https://issues.apache.org/jira/browse/SPARK-11447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Kapil Singh
>
> While comparing a Column to a null literal, comparison works only if type of 
> null literal matches type of the Column it's being compared to. Example scala 
> code (can be run from spark shell):
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq("abc"),Seq(null),Seq("xyz"))
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", StringType, true)))
> val df = sqlContext.createDataFrame(sc.makeRDD(inputRows), dfSchema)
> //DOESN'T WORK
> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null
> //WORKS
> val filteredDF = df.filter(df("column") <=> (new Column(Literal.create(null, 
> SparkleFunctions.dataType(df("column"))
> Why should type information be required for a null comparison? If it's 
> required, it's not always possible to extract type information from complex  
> types (e.g. StructType). Following scala code (can be run from spark shell), 
> throws org.apache.spark.sql.catalyst.analysis.UnresolvedException:
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq(Row.fromSeq(Seq("abc", 
> "def"))),Seq(Row.fromSeq(Seq(null, "123"))),Seq(Row.fromSeq(Seq("ghi", 
> "jkl"
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", 
> StructType(Seq(StructField("p1", StringType, true), StructField("p2", 
> StringType, true))), true)))
> val filteredDF = df.filter(df("column")("p1") <=> (new 
> Column(Literal.create(null, SparkleFunctions.dataType(df("column")("p1"))
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: column#0[p1]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:243)
>   at 
> org.apache.spark.sql.ArithmeticFunctions$class.dataType(ArithmeticFunctions.scala:76)
>   at 
> org.apache.spark.sql.SparkleFunctions$.dataType(SparkleFunctions.scala:14)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC$$iwC$$iwC.(:61)
>   at $iwC$$iwC$$iwC.(:63)
>   at $iwC$$iwC.(:65)
>   at $iwC.(:67)
>   at (:69)
>   at .(:73)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> 

[jira] [Commented] (SPARK-11447) Null comparison requires type information but type extraction fails for complex types

2015-11-05 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992271#comment-14992271
 ] 

kevin yu commented on SPARK-11447:
--

Hello Kapil: Thanks a lot. I am looking into it now. Kevin

> Null comparison requires type information but type extraction fails for 
> complex types
> -
>
> Key: SPARK-11447
> URL: https://issues.apache.org/jira/browse/SPARK-11447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Kapil Singh
>
> While comparing a Column to a null literal, comparison works only if type of 
> null literal matches type of the Column it's being compared to. Example scala 
> code (can be run from spark shell):
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq("abc"),Seq(null),Seq("xyz"))
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", StringType, true)))
> val df = sqlContext.createDataFrame(sc.makeRDD(inputRows), dfSchema)
> //DOESN'T WORK
> val filteredDF = df.filter(df("column") <=> (new Column(Literal(null
> //WORKS
> val filteredDF = df.filter(df("column") <=> (new Column(Literal.create(null, 
> SparkleFunctions.dataType(df("column"))
> Why should type information be required for a null comparison? If it's 
> required, it's not always possible to extract type information from complex  
> types (e.g. StructType). Following scala code (can be run from spark shell), 
> throws org.apache.spark.sql.catalyst.analysis.UnresolvedException:
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.catalyst.expressions._
> val inputRowsData = Seq(Seq(Row.fromSeq(Seq("abc", 
> "def"))),Seq(Row.fromSeq(Seq(null, "123"))),Seq(Row.fromSeq(Seq("ghi", 
> "jkl"
> val inputRows = for(seq <- inputRowsData) yield Row.fromSeq(seq)
> val dfSchema = StructType(Seq(StructField("column", 
> StructType(Seq(StructField("p1", StringType, true), StructField("p2", 
> StringType, true))), true)))
> val filteredDF = df.filter(df("column")("p1") <=> (new 
> Column(Literal.create(null, SparkleFunctions.dataType(df("column")("p1"))
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: column#0[p1]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedExtractValue.dataType(unresolved.scala:243)
>   at 
> org.apache.spark.sql.ArithmeticFunctions$class.dataType(ArithmeticFunctions.scala:76)
>   at 
> org.apache.spark.sql.SparkleFunctions$.dataType(SparkleFunctions.scala:14)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:38)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:43)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:59)
>   at $iwC$$iwC$$iwC$$iwC.(:61)
>   at $iwC$$iwC$$iwC.(:63)
>   at $iwC$$iwC.(:65)
>   at $iwC.(:67)
>   at (:69)
>   at .(:73)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>   at 
> 

[jira] [Created] (SPARK-11533) [SPARK-11447] Null comparison requires type information but type extraction fails for complex types

2015-11-05 Thread kevin yu (JIRA)
kevin yu created SPARK-11533:


 Summary: [SPARK-11447] Null comparison requires type information 
but type extraction fails for complex types
 Key: SPARK-11533
 URL: https://issues.apache.org/jira/browse/SPARK-11533
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: kevin yu









[jira] [Commented] (SPARK-11186) Caseness inconsistency between SQLContext and HiveContext

2015-10-19 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963835#comment-14963835
 ] 

kevin yu commented on SPARK-11186:
--

Hello Santiago: How did you run the above code? Did you get any stack trace? I 
tried it in spark-shell and got an error; it seems that SQLContext.catalog is a 
protected field, so the code above cannot access it:

scala> sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation {
     |   override def sqlContext: SQLContext = sqlc
     |   override def schema: StructType = StructType(Nil)
     | }))
:26: error: lazy value catalog in class SQLContext cannot be accessed 
in org.apache.spark.sql.SQLContext
 Access to protected value catalog not permitted because
 enclosing class $iwC is not a subclass of 
 class SQLContext in package sql where target is defined
  sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation {
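
For a check that avoids the protected field, here is a sketch using only public API (my own illustration; it assumes a spark-shell with sc available, and per the description the two contexts should report the table name differently):
{code}
// Register a mixed-case temp table through public API and check how each
// context reports it, instead of going through the protected catalog field.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val sqlc = new SQLContext(sc)
val df = sqlc.createDataFrame(Seq((1, "a"))).toDF("id", "value")
df.registerTempTable("MyTable")
println(sqlc.tableNames().mkString(", "))   // does "MyTable" keep its case?

val hc = new HiveContext(sc)
val df2 = hc.createDataFrame(Seq((1, "a"))).toDF("id", "value")
df2.registerTempTable("MyTable")
println(hc.tableNames().mkString(", "))     // per the description, the behaviour differs here
{code}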

> Caseness inconsistency between SQLContext and HiveContext
> -
>
> Key: SPARK-11186
> URL: https://issues.apache.org/jira/browse/SPARK-11186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Santiago M. Mola
>Priority: Minor
>
> Default catalog behaviour for caseness is different in {{SQLContext}} and 
> {{HiveContext}}.
> {code}
>   test("Catalog caseness (SQL)") {
> val sqlc = new SQLContext(sc)
> val relationName = "MyTable"
> sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new 
> BaseRelation {
>   override def sqlContext: SQLContext = sqlc
>   override def schema: StructType = StructType(Nil)
> }))
> val tables = sqlc.tableNames()
> assert(tables.contains(relationName))
>   }
>   test("Catalog caseness (Hive)") {
> val sqlc = new HiveContext(sc)
> val relationName = "MyTable"
> sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new 
> BaseRelation {
>   override def sqlContext: SQLContext = sqlc
>   override def schema: StructType = StructType(Nil)
> }))
> val tables = sqlc.tableNames()
> assert(tables.contains(relationName))
>   }
> {code}
> Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. 
> But the reason that this is needed seems undocumented (both in the manual or 
> in the source code comments).






[jira] [Commented] (SPARK-7099) Floating point literals cannot be specified using exponent

2015-10-10 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14951661#comment-14951661
 ] 

kevin yu commented on SPARK-7099:
-

Hello Ryan: I tried a similar query with exponent notation on spark-submit and 
spark-shell on 1.5.1, and both worked. Can you try 1.5 or the latest version?
scala> sqlContext.sql("LOAD DATA LOCAL INPATH 
'sql/hive/src/test/resources/data/files/kv1.txt' INTO TABLE src")
res2: org.apache.spark.sql.DataFrame = [result: string]

scala> sqlContext.sql("FROM src select key WHERE key = 1E6 
").collect().foreach(println)

scala> sqlContext.sql("FROM src select key WHERE key < 1E6 
").collect().foreach(println)
[238]
[86]
[311]
[27]
[165]
[409]
[255]
[278]
[98]
 

> Floating point literals cannot be specified using exponent
> --
>
> Key: SPARK-7099
> URL: https://issues.apache.org/jira/browse/SPARK-7099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: Windows, Linux, Mac OS X
>Reporter: Peter Hagelund
>Priority: Minor
>
> Floating point literals cannot be expressed in scientific notation using an 
> exponent, like e.g. 1.23E4.






[jira] [Commented] (SPARK-7099) Floating point literals cannot be specified using exponent

2015-10-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955790#comment-14955790
 ] 

kevin yu commented on SPARK-7099:
-

Hello Ryan: Can you close this JIRA? Thanks.
Kevin

> Floating point literals cannot be specified using exponent
> --
>
> Key: SPARK-7099
> URL: https://issues.apache.org/jira/browse/SPARK-7099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: Windows, Linux, Mac OS X
>Reporter: Peter Hagelund
>Priority: Minor
>
> Floating point literals cannot be expressed in scientific notation using an 
> exponent, like e.g. 1.23E4.






[jira] [Commented] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate

2015-10-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954463#comment-14954463
 ] 

kevin yu commented on SPARK-10777:
--

Hello Campbell: I tried the same query on Hive 1.2.1 and got the same failure. 
It looks like neither Hive nor Spark SQL supports this yet.

@a: Hello, I am new to Spark, and I wish to contribute to the Spark community. 
I can recreate the problem on both Hive and Spark SQL, and I think neither 
supports this yet. Should I look into how to add this support in Spark SQL, or 
is there already another plan for it? Thanks for your advice.
Kevin

> order by fails when column is aliased and projection includes windowed 
> aggregate
> 
>
> Key: SPARK-10777
> URL: https://issues.apache.org/jira/browse/SPARK-10777
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> This statement fails in SPARK (works fine in ORACLE, DB2 )
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input 
> columns c1, c2; line 3 pos 9
> SQLState:  null
> ErrorCode: 0
> Forcing the aliased column name works around the defect
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> These work fine
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> select r as c1, s  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> create table  if not exists TINT ( RNUM int , CINT int   )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
>  STORED AS ORC  ;






[jira] [Commented] (SPARK-7101) Spark SQL should support java.sql.Time

2015-12-07 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15045318#comment-15045318
 ] 

kevin yu commented on SPARK-7101:
-

I will work on this.

> Spark SQL should support java.sql.Time
> --
>
> Key: SPARK-7101
> URL: https://issues.apache.org/jira/browse/SPARK-7101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.1
> Environment: All
>Reporter: Peter Hagelund
>Priority: Minor
>
> Several RDBMSes support the TIME data type; for more exact mapping between 
> those and Spark SQL, support for java.sql.Time with an associated 
> DataType.TimeType would be helpful.






[jira] [Commented] (SPARK-12231) Failed to generate predicate Error when using dropna

2015-12-08 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048237#comment-15048237
 ] 

kevin yu commented on SPARK-12231:
--

Hello Yahsuan: I am looking at this problem now, and I can recreate it. 
But when you say "if write data without partitionBy, the error won't happen", 
did you try it like this?

df1.write.parquet('./data')

df2 = sqlc.read.parquet('./data')
df2.dropna()
df2.count()

I tried it without partitionBy, using

df2 = sqlc.read.parquet('./data')
df2.dropna().count()

and I still get the exception.

I will update with my progress. Thanks.


> Failed to generate predicate Error when using dropna
> 
>
> Key: SPARK-12231
> URL: https://issues.apache.org/jira/browse/SPARK-12231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
> Environment: python version: 2.7.9
> os: ubuntu 14.04
>Reporter: yahsuan, chang
>
> code to reproduce error
> # write.py
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
> # read.py
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
> $ spark-submit write.py
> $ spark-submit read.py
> # error message
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to 
> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: a#0L
> ...
> If write data without partitionBy, the error won't happen






[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-04 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041954#comment-15041954
 ] 

kevin yu commented on SPARK-12128:
--

Hello Philip: Thanks for reporting this problem; it looks like a bug to me, and 
I can recreate it as well. Are you planning to fix it? If not, I can look into 
the code. Thanks.
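
As a diagnostic idea of my own (an assumption on my part, not something stated in the report), you could narrow the decimal precision before multiplying to see whether result-precision overflow is involved:
{code}
// If the null comes from result-precision overflow, narrower inputs should
// multiply cleanly; DecimalType(10, 2) is an arbitrary choice for the check.
import org.apache.spark.sql.types.DecimalType

trades.select(
  (trades("price").cast(DecimalType(10, 2)) *
   trades("quantity").cast(DecimalType(10, 2))).as("product")
).show()
{code}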

> Multiplication on decimals in dataframe returns null
> 
>
> Key: SPARK-12128
> URL: https://issues.apache.org/jira/browse/SPARK-12128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
>Reporter: Philip Dodds
>
> I hit a weird issue when I tried to multiply to decimals in a select (either 
> in scala or as SQL), and Im assuming I must be missing the point.
> The issue is fairly easy to recreate with something like the following:
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> case class Trade(quantity: Decimal,price: Decimal)
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
> trades.select(trades("price")*trades("quantity")).show
> sqlContext.sql("select 
> price/quantity,price*quantity,price+quantity,price-quantity from trades").show
> {code}
> The odd part is if you run it you will see that the addition/division and 
> subtraction works but the multiplication returns a null.
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
> ie. 
> {code}
> +--+
> |(price * quantity)|
> +--+
> |  null|
> |  null|
> |  null|
> |  null|
> |  null|
> +--+
> +++++
> | _c0| _c1| _c2| _c3|
> +++++
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|8.00|
> |1.272727272727272727|null|50.00...|6.00|
> |0.83|null|44.00...|-4.00...|
> |1.00|null|58.00...|   0E-18|
> +++++
> {code}






[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-04 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15042041#comment-15042041
 ] 

kevin yu commented on SPARK-12128:
--

Hello Philip: I see; yes, it seems this could happen on other DBs as well. Thanks.

> Multiplication on decimals in dataframe returns null
> 
>
> Key: SPARK-12128
> URL: https://issues.apache.org/jira/browse/SPARK-12128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
>Reporter: Philip Dodds
>
> I hit a weird issue when I tried to multiply to decimals in a select (either 
> in scala or as SQL), and Im assuming I must be missing the point.
> The issue is fairly easy to recreate with something like the following:
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> case class Trade(quantity: Decimal,price: Decimal)
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
> trades.select(trades("price")*trades("quantity")).show
> sqlContext.sql("select 
> price/quantity,price*quantity,price+quantity,price-quantity from trades").show
> {code}
> The odd part is if you run it you will see that the addition/division and 
> subtraction works but the multiplication returns a null.
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
> ie. 
> {code}
> +--+
> |(price * quantity)|
> +--+
> |  null|
> |  null|
> |  null|
> |  null|
> |  null|
> +--+
> +++++
> | _c0| _c1| _c2| _c3|
> +++++
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|8.00|
> |1.272727272727272727|null|50.00...|6.00|
> |0.83|null|44.00...|-4.00...|
> |1.00|null|58.00...|   0E-18|
> +++++
> {code}






[jira] [Commented] (SPARK-12231) Failed to generate predicate Error when using dropna

2015-12-09 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049498#comment-15049498
 ] 

kevin yu commented on SPARK-12231:
--

Hello Michael: Thanks for the suggestion. Yes, I can recreate the problem in 
Spark 1.6.
 

> Failed to generate predicate Error when using dropna
> 
>
> Key: SPARK-12231
> URL: https://issues.apache.org/jira/browse/SPARK-12231
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
> Environment: python version: 2.7.9
> os: ubuntu 14.04
>Reporter: yahsuan, chang
>
> code to reproduce error
> # write.py
> {code}
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df = sqlc.range(10)
> df1 = df.withColumn('a', df['id'] * 2)
> df1.write.partitionBy('id').parquet('./data')
> {code}
> # read.py
> {code}
> import pyspark
> sc = pyspark.SparkContext()
> sqlc = pyspark.SQLContext(sc)
> df2 = sqlc.read.parquet('./data')
> df2.dropna().count()
> {code}
> $ spark-submit write.py
> $ spark-submit read.py
> # error message
> {code}
> 15/12/08 17:20:34 ERROR Filter: Failed to generate predicate, fallback to 
> interpreted org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: a#0L
> ...
> {code}
> If write data without partitionBy, the error won't happen






[jira] [Commented] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-14 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056948#comment-15056948
 ] 

kevin yu commented on SPARK-12317:
--

I talked with Bo; I will work on this PR. Thanks.
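
As a rough illustration of the unit parsing this needs (my own sketch, not the actual SQLConf change):
{code}
// Parse a size string such as "10MB" into bytes; plain-number input stays in bytes.
def parseBytes(s: String): Long = {
  val pattern = """(?i)^(\d+)\s*(kb|mb|gb|b)?$""".r
  s.trim match {
    case pattern(num, unit) =>
      val factor = Option(unit).map(_.toLowerCase) match {
        case Some("kb") => 1L << 10
        case Some("mb") => 1L << 20
        case Some("gb") => 1L << 30
        case _          => 1L
      }
      num.toLong * factor
    case _ => throw new IllegalArgumentException(s"Cannot parse size: $s")
  }
}

parseBytes("10MB")   // 10485760, the value from the description
{code}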

Kevin

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurated as `10MB` 
> instead of `10485760`, because `10MB` is more easier than `10485760`.






[jira] [Commented] (SPARK-12648) UDF with Option[Double] throws ClassCastException

2016-01-05 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083397#comment-15083397
 ] 

kevin yu commented on SPARK-12648:
--

I can recreate the problem, and I will look into this issue. Thanks.
Kevin

> UDF with Option[Double] throws ClassCastException
> -
>
> Key: SPARK-12648
> URL: https://issues.apache.org/jira/browse/SPARK-12648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Mikael Valot
>
> I can write an UDF that returns an Option[Double], and the DataFrame's  
> schema is correctly inferred to be a nullable double. 
> However I cannot seem to be able to write a UDF that takes an Option as an 
> argument:
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> val conf = new SparkConf().setMaster("local[4]").setAppName("test")
> val sc = new SparkContext(conf)
> val sqlc = new SQLContext(sc)
> import sqlc.implicits._
> val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", 
> "weight")
> import org.apache.spark.sql.functions._
> val addTwo = udf((d: Option[Double]) => d.map(_+2)) 
> df.withColumn("plusTwo", addTwo(df("weight"))).show()
> =>
> 2016-01-05T14:41:52 Executor task launch worker-0 ERROR 
> org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option
>   at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) 
> ~[na:na]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[na:na]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> ~[scala-library-2.10.5.jar:na]






[jira] [Updated] (SPARK-12317) Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf

2016-01-05 Thread kevin yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kevin yu updated SPARK-12317:
-
Summary: Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and 
SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf  
(was: Support configurate value with unit(e.g. kb/mb/gb) in SQL)

> Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and 
> SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf
> 
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurated as `10MB` 
> instead of `10485760`, because `10MB` is more easier than `10485760`.






[jira] [Commented] (SPARK-12648) UDF with Option[Double] throws ClassCastException

2016-01-08 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089758#comment-15089758
 ] 

kevin yu commented on SPARK-12648:
--

Hi Mikael: I see; I am looking into whether it is doable or not.
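
As a workaround sketch in the meantime (my own suggestion, not a fix for the reported cast): take the boxed java.lang.Double instead of Option[Double], so a null weight arrives as a null reference:
{code}
// Boxed-Double UDF: a null weight arrives as a null reference instead of
// being cast to Option, which is what throws in the report.
import org.apache.spark.sql.functions.udf

val addTwoBoxed = udf { (d: java.lang.Double) =>
  if (d == null) null.asInstanceOf[java.lang.Double] else java.lang.Double.valueOf(d + 2)
}
df.withColumn("plusTwo", addTwoBoxed(df("weight"))).show()
{code}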

> UDF with Option[Double] throws ClassCastException
> -
>
> Key: SPARK-12648
> URL: https://issues.apache.org/jira/browse/SPARK-12648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Mikael Valot
>
> I can write an UDF that returns an Option[Double], and the DataFrame's  
> schema is correctly inferred to be a nullable double. 
> However I cannot seem to be able to write a UDF that takes an Option as an 
> argument:
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> val conf = new SparkConf().setMaster("local[4]").setAppName("test")
> val sc = new SparkContext(conf)
> val sqlc = new SQLContext(sc)
> import sqlc.implicits._
> val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", 
> "weight")
> import org.apache.spark.sql.functions._
> val addTwo = udf((d: Option[Double]) => d.map(_+2)) 
> df.withColumn("plusTwo", addTwo(df("weight"))).show()
> =>
> 2016-01-05T14:41:52 Executor task launch worker-0 ERROR 
> org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option
>   at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) 
> ~[na:na]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[na:na]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> ~[scala-library-2.10.5.jar:na]






[jira] [Commented] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs

2015-11-18 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15012165#comment-15012165
 ] 

kevin yu commented on SPARK-11827:
--

I will take a look at this one. Kevin

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> --
>
> Key: SPARK-11827
> URL: https://issues.apache.org/jira/browse/SPARK-11827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Abhilash Srimat Tirumala Pallerlamudi
>Priority: Minor
>
> I get the below exception when creating DataFrame using RDD of JavaBean 
> having a property of type java.math.BigInteger
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see the support for java.math.BigInteger in 
> org.apache.spark.sql.catalyst.JavaTypeInference.scala 






[jira] [Commented] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs

2015-11-19 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15014042#comment-15014042
 ] 

kevin yu commented on SPARK-11827:
--

I think I can recreate the problem now. I am looking into the code. 

scala.MatchError: 1234567 (of class java.math.BigInteger)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at 
org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1314)
at 
org.apache.spark.sql.SQLContext$$anonfun$org$apache$spark$sql$SQLContext$$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1314)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
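
As an interim workaround sketch (my own assumption, not the eventual fix): convert the BigInteger to java.math.BigDecimal, which the type inference already maps to DecimalType, before building the DataFrame; the same idea applies to the JavaBean path in this report:
{code}
// Wrap the BigInteger value in a BigDecimal-typed field before creating the DataFrame.
import java.math.{BigDecimal => JBigDecimal, BigInteger}

case class Record(id: JBigDecimal)

val values = Seq(new BigInteger("1234567"), new BigInteger("7654321"))
val df = sqlContext.createDataFrame(values.map(v => Record(new JBigDecimal(v))))
df.printSchema()   // id: decimal(38,18), the default system precision
{code}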

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> --
>
> Key: SPARK-11827
> URL: https://issues.apache.org/jira/browse/SPARK-11827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Abhilash Srimat Tirumala Pallerlamudi
>Priority: Minor
>
> I get the below exception when creating DataFrame using RDD of JavaBean 
> having a property of type java.math.BigInteger
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see the support for java.math.BigInteger in 
> org.apache.spark.sql.catalyst.JavaTypeInference.scala 






[jira] [Commented] (SPARK-11772) DataFrame.show() fails with non-ASCII strings

2015-11-17 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15010358#comment-15010358
 ] 

kevin yu commented on SPARK-11772:
--

Hello Greg: Glad you found a solution. Can we close this JIRA? Thanks.

> DataFrame.show() fails with non-ASCII strings
> -
>
> Key: SPARK-11772
> URL: https://issues.apache.org/jira/browse/SPARK-11772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Greg Baker
>Priority: Minor
>
> When given a non-ASCII string (in pyspark at least), the DataFrame.show() 
> method fails.
> {code:none}
> df = sqlContext.createDataFrame([[u'ab\u0255']])
> df.show()
> {code}
> Results in:
> {code:none}
> 15/11/16 21:36:54 INFO DAGScheduler: ResultStage 1 (showString at 
> NativeMethodAccessorImpl.java:-2) finished in 0.148 s
> 15/11/16 21:36:54 INFO DAGScheduler: Job 1 finished: showString at 
> NativeMethodAccessorImpl.java:-2, took 0.192634 s
> Traceback (most recent call last):
>   File ".../show_bug.py", line 8, in 
> df.show()
>   File 
> ".../spark-1.5.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
>  line 256, in show
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u0255' in 
> position 21: ordinal not in range(128)
> 15/11/16 21:36:54 INFO SparkContext: Invoking stop() from shutdown hook
> {code}






[jira] [Commented] (SPARK-12754) Data type mismatch on two array values when using filter/where

2016-01-12 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094499#comment-15094499
 ] 

kevin yu commented on SPARK-12754:
--

Hello Jesse: It looks like the nullable checking changed after Spark 1.4. In 
your test case, the default nullable is true for createArrayType in

StructField("point", DataTypes.createArrayType(LongType), false)

while nullable is false for

val targetPoint: Array[Long] = Array(0L, 9L)

and that mismatch is what causes the failure.

If you change the nullable flag to false in createArrayType, it will work:

StructField("point", DataTypes.createArrayType(LongType, false), false)


> Data type mismatch on two array values when using filter/where
> --
>
> Key: SPARK-12754
> URL: https://issues.apache.org/jira/browse/SPARK-12754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0
> Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.5.0+
>Reporter: Jesse English
>
> The following test produces the error 
> _org.apache.spark.sql.AnalysisException: cannot resolve '(point = 
> array(0,9))' due to data type mismatch: differing types in '(point = 
> array(0,9))' (array and array)_
> This is not the case on 1.4.x, but has been introduced with 1.5+.  Is there a 
> preferred method for making this sort of arbitrarily sized array comparison?
> {code:title=test.scala}
> test("test array comparison") {
> val vectors: Vector[Row] =  Vector(
>   Row.fromTuple("id_1" -> Array(0L, 2L)),
>   Row.fromTuple("id_2" -> Array(0L, 5L)),
>   Row.fromTuple("id_3" -> Array(0L, 9L)),
>   Row.fromTuple("id_4" -> Array(1L, 0L)),
>   Row.fromTuple("id_5" -> Array(1L, 8L)),
>   Row.fromTuple("id_6" -> Array(2L, 4L)),
>   Row.fromTuple("id_7" -> Array(5L, 6L)),
>   Row.fromTuple("id_8" -> Array(6L, 2L)),
>   Row.fromTuple("id_9" -> Array(7L, 0L))
> )
> val data: RDD[Row] = sc.parallelize(vectors, 3)
> val schema = StructType(
>   StructField("id", StringType, false) ::
> StructField("point", DataTypes.createArrayType(LongType), false) ::
> Nil
> )
> val sqlContext = new SQLContext(sc)
> var dataframe = sqlContext.createDataFrame(data, schema)
> val  targetPoint:Array[Long] = Array(0L,9L)
> //This is the line where it fails
> //org.apache.spark.sql.AnalysisException: cannot resolve 
> // '(point = array(0,9))' due to data type mismatch:
> // differing types in '(point = array(0,9))' 
> // (array and array).
> val targetRow = dataframe.where(dataframe("point") === 
> array(targetPoint.map(value => lit(value)): _*)).first()
> assert(targetRow != null)
>   }
> {code}






[jira] [Commented] (SPARK-12648) UDF with Option[Double] throws ClassCastException

2016-01-10 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15091484#comment-15091484
 ] 

kevin yu commented on SPARK-12648:
--

Hello Jakob & Liang-Chi: Thanks for the help. Kevin

> UDF with Option[Double] throws ClassCastException
> -
>
> Key: SPARK-12648
> URL: https://issues.apache.org/jira/browse/SPARK-12648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Mikael Valot
>
> I can write an UDF that returns an Option[Double], and the DataFrame's  
> schema is correctly inferred to be a nullable double. 
> However I cannot seem to be able to write a UDF that takes an Option as an 
> argument:
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> val conf = new SparkConf().setMaster("local[4]").setAppName("test")
> val sc = new SparkContext(conf)
> val sqlc = new SQLContext(sc)
> import sqlc.implicits._
> val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", 
> "weight")
> import org.apache.spark.sql.functions._
> val addTwo = udf((d: Option[Double]) => d.map(_+2)) 
> df.withColumn("plusTwo", addTwo(df("weight"))).show()
> =>
> 2016-01-05T14:41:52 Executor task launch worker-0 ERROR 
> org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option
>   at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) 
> ~[na:na]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[na:na]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> ~[scala-library-2.10.5.jar:na]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15731) orc writer directory permissions

2016-06-03 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314364#comment-15314364
 ] 

kevin yu commented on SPARK-15731:
--

Hi Ran: I tried this on my machine with an ORC file, and the partition directories do have the x 
permission. Did I do anything differently from your setup?

scala> spark.createDataFrame(data).toDF("a", 
"b").write.format("orc").mode("append").partitionBy("a").save("/Users/qianyangyu/sparkcp/spark-15731orc2")

Qianyangs-MBP:sparkcp qianyangyu$ ls -al spark-15731orc2/
total 8
drwxr-xr-x  10 qianyangyu  staff  340 Jun  3 09:11 .
drwxr-xr-x  15 qianyangyu  staff  510 Jun  3 09:11 ..
-rw-r--r--   1 qianyangyu  staff8 Jun  3 09:11 ._SUCCESS.crc
-rw-r--r--   1 qianyangyu  staff0 Jun  3 09:11 _SUCCESS
drwxr-xr-x   4 qianyangyu  staff  136 Jun  3 09:11 a=1
drwxr-xr-x   4 qianyangyu  staff  136 Jun  3 09:11 a=3
drwxr-xr-x   4 qianyangyu  staff  136 Jun  3 09:11 a=5
drwxr-xr-x   4 qianyangyu  staff  136 Jun  3 09:11 a=7
drwxr-xr-x   4 qianyangyu  staff  136 Jun  3 09:11 a=9
drwxr-xr-x  12 qianyangyu  staff  408 Jun  3 09:11 a=__HIVE_DEFAULT_PARTITION__
Qianyangs-MBP:sparkcp qianyangyu$ 


> orc writer directory permissions
> 
>
> Key: SPARK-15731
> URL: https://issues.apache.org/jira/browse/SPARK-15731
> Project: Spark
>  Issue Type: Bug
>Reporter: Ran Haim
>
> When saving orc files with partitions, the partition directories created do 
> not have x permission (even tough umask is 002), then no other users can get 
> inside those directories to read the orc file.
> When writing parquet files there is no such issue.
> code example:
> datafrmae.write.format("orc").mode("append").partitionBy("date").save("/path")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-07 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319706#comment-15319706
 ] 

kevin yu commented on SPARK-15804:
--

I will submit a PR soon. Thanks.

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> Adding metadata with col().as(_, metadata) then saving the resultant 
> dataframe does not save the metadata. No error is thrown. Only see the schema 
> contains the metadata before saving and does not contain the metadata after 
> saving and loading the dataframe. Was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15804) Manually added metadata not saving with parquet

2016-06-08 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320180#comment-15320180
 ] 

kevin yu commented on SPARK-15804:
--

https://github.com/apache/spark/pull/13555

> Manually added metadata not saving with parquet
> ---
>
> Key: SPARK-15804
> URL: https://issues.apache.org/jira/browse/SPARK-15804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Charlie Evans
>
> Adding metadata with col().as(_, metadata) then saving the resultant 
> dataframe does not save the metadata. No error is thrown. Only see the schema 
> contains the metadata before saving and does not contain the metadata after 
> saving and loading the dataframe. Was working fine with 1.6.1.
> {code}
> case class TestRow(a: String, b: Int)
> val rows = TestRow("a", 0) :: TestRow("b", 1) :: TestRow("c", 2) :: Nil
> val df = spark.createDataFrame(rows)
> import org.apache.spark.sql.types.MetadataBuilder
> val md = new MetadataBuilder().putString("key", "value").build()
> val dfWithMeta = df.select(col("a"), col("b").as("b", md))
> println(dfWithMeta.schema.json)
> dfWithMeta.write.parquet("dfWithMeta")
> val dfWithMeta2 = spark.read.parquet("dfWithMeta")
> println(dfWithMeta2.schema.json)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15763) Add DELETE FILE command support in spark

2016-06-03 Thread kevin yu (JIRA)
kevin yu created SPARK-15763:


 Summary: Add DELETE FILE command support in spark
 Key: SPARK-15763
 URL: https://issues.apache.org/jira/browse/SPARK-15763
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: kevin yu


Currently Spark SQL supports the "ADD FILE/JAR" command, but not "DELETE FILE/JAR". I am adding 
support for "DELETE FILE" to the Spark context. Hive supports the "ADD/DELETE/LIST FILE/JAR" commands: 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
I will submit DELETE JAR in a separate JIRA.
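
For illustration, a sketch of how the command pair might look from a Spark 2.0 spark-shell, mirroring Hive's CLI commands (ADD FILE already exists in Spark SQL; the DELETE FILE form below is the syntax this JIRA proposes, not a current API, and the path is just a placeholder):

{code}
// Already supported: register a file with the Spark context.
spark.sql("ADD FILE /tmp/lookup.txt")

// Proposed in this JIRA: remove the previously added file again.
spark.sql("DELETE FILE /tmp/lookup.txt")
{code}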



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16152) `In` predicate does not work with null values

2016-06-22 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345424#comment-15345424
 ] 

kevin yu commented on SPARK-16152:
--

Hello Ashar: I think it is working as designed; this behavior was added by 
https://issues.apache.org/jira/browse/SPARK-10323. It follows SQL three-valued logic: a comparison 
involving null is unknown, so an IN predicate over a null value evaluates to null rather than true or false.
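
A minimal sketch of the designed behavior (run in a Spark 2.x spark-shell; illustrative data, not the reporter's):

{code}
import spark.implicits._

// Three rows, one of them null.
val df = Seq(Some(1), Some(5), None).toDF("c1")

// For the row where c1 is null, the IN predicate evaluates to null (unknown),
// so the filter drops that row instead of treating it as a match or a non-match.
df.filter($"c1".isin(1, 2, 3)).show()
// +---+
// | c1|
// +---+
// |  1|
// +---+
{code}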

> `In` predicate does not work with null values
> -
>
> Key: SPARK-16152
> URL: https://issues.apache.org/jira/browse/SPARK-16152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ashar Fuadi
>
> According to 
> https://github.com/apache/spark/blob/v1.6.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L134..L136:
> {code}
>  override def eval(input: InternalRow): Any = {
> val evaluatedValue = value.eval(input)
> if (evaluatedValue == null) {
>   null
> } else {
>   ...
> {code}
> we always return {{null}} when the current value is null, ignoring the 
> elements of {{list}}. Therefore, we cannot have a predicate which tests 
> whether a column contains values in e.g. {{[1, 2, 3, null]}}
> Is this a bug, or is this actually the expected behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12754) Data type mismatch on two array values when using filter/where

2016-01-11 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092268#comment-15092268
 ] 

kevin yu commented on SPARK-12754:
--

I will look into this. 

> Data type mismatch on two array values when using filter/where
> --
>
> Key: SPARK-12754
> URL: https://issues.apache.org/jira/browse/SPARK-12754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0
> Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.5.0+
>Reporter: Jesse English
>
> The following test produces the error 
> _org.apache.spark.sql.AnalysisException: cannot resolve '(point = 
> array(0,9))' due to data type mismatch: differing types in '(point = 
> array(0,9))' (array and array)_
> This is not the case on 1.4.x, but has been introduced with 1.5+.  Is there a 
> preferred method for making this sort of arbitrarily sized array comparison?
> {code:title=test.scala}
> test("test array comparison") {
> val vectors: Vector[Row] =  Vector(
>   Row.fromTuple("id_1" -> Array(0L, 2L)),
>   Row.fromTuple("id_2" -> Array(0L, 5L)),
>   Row.fromTuple("id_3" -> Array(0L, 9L)),
>   Row.fromTuple("id_4" -> Array(1L, 0L)),
>   Row.fromTuple("id_5" -> Array(1L, 8L)),
>   Row.fromTuple("id_6" -> Array(2L, 4L)),
>   Row.fromTuple("id_7" -> Array(5L, 6L)),
>   Row.fromTuple("id_8" -> Array(6L, 2L)),
>   Row.fromTuple("id_9" -> Array(7L, 0L))
> )
> val data: RDD[Row] = sc.parallelize(vectors, 3)
> val schema = StructType(
>   StructField("id", StringType, false) ::
> StructField("point", DataTypes.createArrayType(LongType), false) ::
> Nil
> )
> val sqlContext = new SQLContext(sc)
> var dataframe = sqlContext.createDataFrame(data, schema)
> val  targetPoint:Array[Long] = Array(0L,9L)
> //This is the line where it fails
> //org.apache.spark.sql.AnalysisException: cannot resolve 
> // '(point = array(0,9))' due to data type mismatch:
> // differing types in '(point = array(0,9))' 
> // (array and array).
> val targetRow = dataframe.where(dataframe("point") === 
> array(targetPoint.map(value => lit(value)): _*)).first()
> assert(targetRow != null)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102361#comment-15102361
 ] 

kevin yu commented on SPARK-12783:
--

Hello Muthu: if you do the import first, it seems to work:
scala> import scala.collection.Map
import scala.collection.Map



scala> case class MyMap(map: Map[String, String]) 
defined class MyMap

scala> 

scala> case class TestCaseClass(a: String, b: String)  {
 |   def toMyMap: MyMap = {
 | MyMap(Map(a->b))
 |   }
 | 
 |   def toStr: String = {
 | a
 |   }
 | }
defined class TestCaseClass

scala> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", 
"data1"), TestCaseClass("2015-05-01", "data2"))).toDF()
df1: org.apache.spark.sql.DataFrame = [a: string, b: string]

scala> df1.as[TestCaseClass].map(_.toMyMap).show() 
++  
| map|
++
|Map(2015-05-01 ->...|
|Map(2015-05-01 ->...|
++


> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: 

[jira] [Commented] (SPARK-13253) Error aliasing array columns.

2016-02-10 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141252#comment-15141252
 ] 

kevin yu commented on SPARK-13253:
--

I can recreate the problem; I am looking at it now.

> Error aliasing array columns.
> -
>
> Key: SPARK-13253
> URL: https://issues.apache.org/jira/browse/SPARK-13253
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Rakesh Chalasani
>
> Getting an "UnsupportedOperationException" when trying to alias an
> array column. 
> The issue seems over "toString" on Column. "CreateArray" expression -> 
> dataType, which checks for nullability of its children, while aliasing is 
> creating a PrettyAttribute that does not implement nullability.
> Code to reproduce the error:
> {code}
> import org.apache.spark.sql.SQLContext 
> val sqlContext = new SQLContext(sparkContext) 
> import sqlContext.implicits._ 
> import org.apache.spark.sql.functions 
> case class Test(a:Int, b:Int) 
> val data = sparkContext.parallelize(Array.range(0, 10).map(x => Test(x, 
> x+1))) 
> val df = data.toDF() 
> val arrayCol = functions.array(df("a"), df("b")).as("arrayCol")
> arrayCol.toString()
> {code}
> Error message:
> {code}
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.PrettyAttribute.nullable(namedExpressions.scala:289)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray$$anonfun$dataType$3.apply(complexTypeCreator.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray$$anonfun$dataType$3.apply(complexTypeCreator.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:189)
>   at 
> scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:47)
>   at scala.collection.GenSeqLike$class.prefixLength(GenSeqLike.scala:92)
>   at scala.collection.AbstractSeq.prefixLength(Seq.scala:40)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:40)
>   at scala.collection.mutable.ArrayBuffer.exists(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateArray.dataType(complexTypeCreator.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:136)
>   at 
> org.apache.spark.sql.catalyst.expressions.NamedExpression$class.typeSuffix(namedExpressions.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.typeSuffix(namedExpressions.scala:120)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:155)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:207)
>   at org.apache.spark.sql.Column.toString(Column.scala:138)
>   at java.lang.String.valueOf(String.java:2994)
>   at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:331)
>   at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>   at .(:20)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12987) Drop fails when columns contain dots

2016-02-01 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127349#comment-15127349
 ] 

kevin yu commented on SPARK-12987:
--

@thomas @jayadevan: Are you still working on this problem? I was also looking at it, and here is a 
fix for this critical JIRA; sorry for submitting the PR ahead of you. Can you help review the fix 
and provide any suggestions?

> Drop fails when columns contain dots
> 
>
> Key: SPARK-12987
> URL: https://issues.apache.org/jira/browse/SPARK-12987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a_b").collect()
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'a.c' given input 
> columns a_b, a.c;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
>   at org.apache.spark.sql.DataFrame.drop(DataFrame.scala:1286)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-12987) Drop fails when columns contain dots

2016-02-02 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128659#comment-15128659
 ] 

kevin yu commented on SPARK-12987:
--

It seems this JIRA is a duplicate of SPARK-12988, so I closed my PR.

> Drop fails when columns contain dots
> 
>
> Key: SPARK-12987
> URL: https://issues.apache.org/jira/browse/SPARK-12987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> {code}
> val df = Seq((1, 1)).toDF("a_b", "a.c")
> df.drop("a_b").collect()
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'a.c' given input 
> columns a_b, a.c;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:751)
>   at org.apache.spark.sql.DataFrame.drop(DataFrame.scala:1286)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6

2016-01-20 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108715#comment-15108715
 ] 

kevin yu commented on SPARK-12911:
--

I will look into this. Thanks.
Kevin

> Cacheing a dataframe causes array comparisons to fail (in filter / where) 
> after 1.6
> ---
>
> Key: SPARK-12911
> URL: https://issues.apache.org/jira/browse/SPARK-12911
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
> Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
>Reporter: Jesse English
>
> When doing a *where* operation on a dataframe and testing for equality on an 
> array type, after 1.6 no valid comparisons are made if the dataframe has been 
> cached.  If it has not been cached, the results are as expected.
> This appears to be related to the underlying unsafe array data types.
> {code:title=test.scala|borderStyle=solid}
> test("test array comparison") {
> val vectors: Vector[Row] =  Vector(
>   Row.fromTuple("id_1" -> Array(0L, 2L)),
>   Row.fromTuple("id_2" -> Array(0L, 5L)),
>   Row.fromTuple("id_3" -> Array(0L, 9L)),
>   Row.fromTuple("id_4" -> Array(1L, 0L)),
>   Row.fromTuple("id_5" -> Array(1L, 8L)),
>   Row.fromTuple("id_6" -> Array(2L, 4L)),
>   Row.fromTuple("id_7" -> Array(5L, 6L)),
>   Row.fromTuple("id_8" -> Array(6L, 2L)),
>   Row.fromTuple("id_9" -> Array(7L, 0L))
> )
> val data: RDD[Row] = sc.parallelize(vectors, 3)
> val schema = StructType(
>   StructField("id", StringType, false) ::
> StructField("point", DataTypes.createArrayType(LongType, false), 
> false) ::
> Nil
> )
> val sqlContext = new SQLContext(sc)
> val dataframe = sqlContext.createDataFrame(data, schema)
> val targetPoint:Array[Long] = Array(0L,9L)
> //Cacheing is the trigger to cause the error (no cacheing causes no error)
> dataframe.cache()
> //This is the line where it fails
> //java.util.NoSuchElementException: next on empty iterator
> //However we know that there is a valid match
> val targetRow = dataframe.where(dataframe("point") === 
> array(targetPoint.map(value => lit(value)): _*)).first()
> assert(targetRow != null)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13831) TPC-DS Query 35 fails with the following compile error

2016-03-19 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197944#comment-15197944
 ] 

kevin yu commented on SPARK-13831:
--

The same query also fails on Spark SQL 2.0, and the failure can be reduced to

select c_customer_sk from customer where exists (select cr_refunded_customer_sk from catalog_returns)

or 

select c_customer_sk from customer where exists (select cr_refunded_customer_sk from catalog_returns where cr_refunded_customer_sk = customer.c_customer_sk)

In Hive, this syntax is accepted. 
[~davies] can you confirm that Spark SQL does not support subqueries with EXISTS yet?

> TPC-DS Query 35 fails with the following compile error
> --
>
> Key: SPARK-13831
> URL: https://issues.apache.org/jira/browse/SPARK-13831
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Roy Cecil
>
> TPC-DS Query 35 fails with the following compile error.
> Scala.NotImplementedError: 
> scala.NotImplementedError: No parse rules for ASTNode type: 864, text: 
> TOK_SUBQUERY_EXPR :
> TOK_SUBQUERY_EXPR 1, 439,797, 1370
>   TOK_SUBQUERY_OP 1, 439,439, 1370
> exists 1, 439,439, 1370
>   TOK_QUERY 1, 441,797, 1508
> Pasting Query 35 for easy reference.
> select
>   ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   count(*) cnt1,
>   min(cd_dep_count) cd_dep_count1,
>   max(cd_dep_count) cd_dep_count2,
>   avg(cd_dep_count) cd_dep_count3,
>   cd_dep_employed_count,
>   count(*) cnt2,
>   min(cd_dep_employed_count) cd_dep_employed_count1,
>   max(cd_dep_employed_count) cd_dep_employed_count2,
>   avg(cd_dep_employed_count) cd_dep_employed_count3,
>   cd_dep_college_count,
>   count(*) cnt3,
>   min(cd_dep_college_count) cd_dep_college_count1,
>   max(cd_dep_college_count) cd_dep_college_count2,
>   avg(cd_dep_college_count) cd_dep_college_count3
>  from
>   customer c
>   JOIN customer_address ca ON c.c_current_addr_sk = ca.ca_address_sk
>   JOIN customer_demographics ON cd_demo_sk = c.c_current_cdemo_sk
>   LEFT SEMI JOIN
>   (select ss_customer_sk
>   from store_sales
>JOIN date_dim ON ss_sold_date_sk = d_date_sk
>   where
> d_year = 2002 and
> d_qoy < 4) ss_wh1
>   ON c.c_customer_sk = ss_wh1.ss_customer_sk
>  where
>exists (
> select tmp.customer_sk from (
> select ws_bill_customer_sk  as customer_sk
> from web_sales,date_dim
> where
>   ws_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>UNION ALL
> select cs_ship_customer_sk  as customer_sk
> from catalog_sales,date_dim
> where
>   cs_sold_date_sk = d_date_sk and
>   d_year = 2002 and
>   d_qoy < 4
>   ) tmp where c.c_customer_sk = tmp.customer_sk
> )
>  group by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  order by ca_state,
>   cd_gender,
>   cd_marital_status,
>   cd_dep_count,
>   cd_dep_employed_count,
>   cd_dep_college_count
>  limit 100;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14878) Support Trim characters in the string trim function

2016-04-23 Thread kevin yu (JIRA)
kevin yu created SPARK-14878:


 Summary: Support Trim characters in the string trim function
 Key: SPARK-14878
 URL: https://issues.apache.org/jira/browse/SPARK-14878
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: kevin yu


The current Spark SQL does not support trim characters in the string trim function, which are part of 
the ANSI SQL2003 standard; for example, TRIM(LEADING '0' FROM '00042') evaluates to '42'. IBM DB2 fully 
supports this, as shown in 
https://www.ibm.com/support/knowledgecenter/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0023198.html.
 We propose to implement it in this JIRA.
The ANSI SQL2003 trim syntax:

<trim function> ::= TRIM <left paren> <trim operands> <right paren>
<trim operands> ::= [ [ <trim specification> ] [ <trim character> ] FROM ] <trim source>
<trim source> ::= <character value expression>
<trim specification> ::=
    LEADING
  | TRAILING
  | BOTH
<trim character> ::= <character value expression>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18871) New test cases for IN/NOT IN subquery

2017-01-10 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815876#comment-15815876
 ] 

kevin yu commented on SPARK-18871:
--

Hello Reynold: Sorry, I misunderstood your comment. PR 16337 has been merged and closed; do you mean 
we should submit the other PRs under this JIRA? Thanks

> New test cases for IN/NOT IN subquery
> -
>
> Key: SPARK-18871
> URL: https://issues.apache.org/jira/browse/SPARK-18871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>Assignee: kevin yu
> Fix For: 2.2.0
>
>
> This JIRA is open for submitting a PR for new test cases for IN/NOT IN 
> subquery. We plan to put approximately 100+ test cases under 
> `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with 
> simple SELECT in both parent and subquery to subqueries with more complex 
> constructs in both sides (joins, aggregates, etc.) Test data include null 
> value, and duplicate values. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18871) New test cases for IN/NOT IN subquery

2017-01-05 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15803410#comment-15803410
 ] 

kevin yu commented on SPARK-18871:
--

[~hvanhovell][~smilegator][~rxin][~nsyca][~dongjoon]:  PR #16337 delivered a 
subset of the IN/NOT IN test cases as we discussed, and it has been merged. Should we 
create the remaining PRs under SPARK-18871 and merge all of them under this 
JIRA? Thanks

> New test cases for IN/NOT IN subquery
> -
>
> Key: SPARK-18871
> URL: https://issues.apache.org/jira/browse/SPARK-18871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>Assignee: kevin yu
> Fix For: 2.2.0
>
>
> This JIRA is open for submitting a PR for new test cases for IN/NOT IN 
> subquery. We plan to put approximately 100+ test cases under 
> `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with 
> simple SELECT in both parent and subquery to subqueries with more complex 
> constructs in both sides (joins, aggregates, etc.) Test data include null 
> value, and duplicate values. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18871) New test cases for IN/NOT IN subquery

2017-01-05 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15803712#comment-15803712
 ] 

kevin yu commented on SPARK-18871:
--

Thanks. Will submit soon.

> New test cases for IN/NOT IN subquery
> -
>
> Key: SPARK-18871
> URL: https://issues.apache.org/jira/browse/SPARK-18871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Nattavut Sutyanyong
>Assignee: kevin yu
> Fix For: 2.2.0
>
>
> This JIRA is open for submitting a PR for new test cases for IN/NOT IN 
> subquery. We plan to put approximately 100+ test cases under 
> `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with 
> simple SELECT in both parent and subquery to subqueries with more complex 
> constructs in both sides (joins, aggregates, etc.) Test data include null 
> value, and duplicate values. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22110) Enhance function description trim string function

2017-09-22 Thread kevin yu (JIRA)
kevin yu created SPARK-22110:


 Summary: Enhance function description trim string function
 Key: SPARK-22110
 URL: https://issues.apache.org/jira/browse/SPARK-22110
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 2.3.0
Reporter: kevin yu
Priority: Minor
 Fix For: 2.3.0


This JIRA will enhance the function description for the string function TRIM, 
specifically these three fields: usage, arguments, and examples. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22110) Enhance function description trim string function

2017-09-22 Thread kevin yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kevin yu updated SPARK-22110:
-
Description: This JIRA will enhance the function description for string 
function `trim`, specific for these three fields: usage, argument and examples. 
  (was: This JIRA will enhance the function description for string function 
`trim`
, specific for these three fields: usage, argument and examples. )

> Enhance function description trim string function
> -
>
> Key: SPARK-22110
> URL: https://issues.apache.org/jira/browse/SPARK-22110
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: kevin yu
>Priority: Minor
> Fix For: 2.3.0
>
>
> This JIRA will enhance the function description for string function `trim`, 
> specific for these three fields: usage, argument and examples. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22110) Enhance function description trim string function

2017-09-22 Thread kevin yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kevin yu updated SPARK-22110:
-
Description: 
This JIRA will enhance the function description for string function 
{code:java}
TRIM
{code}
, specific for these three fields: usage, argument and examples. 

  was:This JIRA will enhance the function description for string function TRIM, 
specific for these three fields: usage, argument and examples. 


> Enhance function description trim string function
> -
>
> Key: SPARK-22110
> URL: https://issues.apache.org/jira/browse/SPARK-22110
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: kevin yu
>Priority: Minor
> Fix For: 2.3.0
>
>
> This JIRA will enhance the function description for string function 
> {code:java}
> TRIM
> {code}
> , specific for these three fields: usage, argument and examples. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22110) Enhance function description trim string function

2017-09-22 Thread kevin yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kevin yu updated SPARK-22110:
-
Description: 
This JIRA will enhance the function description for string function `trim`
, specific for these three fields: usage, argument and examples. 

  was:
This JIRA will enhance the function description for string function 
{code:java}
TRIM
{code}
, specific for these three fields: usage, argument and examples. 


> Enhance function description trim string function
> -
>
> Key: SPARK-22110
> URL: https://issues.apache.org/jira/browse/SPARK-22110
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: kevin yu
>Priority: Minor
> Fix For: 2.3.0
>
>
> This JIRA will enhance the function description for string function `trim`
> , specific for these three fields: usage, argument and examples. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22307) NOT condition working incorrectly

2017-10-20 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213432#comment-16213432
 ] 

kevin yu commented on SPARK-22307:
--

It is correct behavior based on the SQL standard, as Marco said. Your query has 
623 records: for 617 of them the predicate evaluates to null, for 2 it evaluates to true, and for 4 
it evaluates to false. So not(col1) returns 4, because not(null) is still null and those 617 rows 
are excluded by both the filter and its negation.
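
A minimal sketch of the three-valued logic involved (illustrative data, not the attached catalog; run in a Spark 2.x spark-shell):

{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(Some(true), Some(true), Some(false), None).toDF("p")

df.filter($"p").count                                // 2 -> rows where p is true
df.filter(not($"p")).count                           // 1 -> rows where p is false; the null row is unknown and dropped
df.filter(when($"p", false).otherwise(true)).count   // 2 -> the workaround coerces null to "not matched"
{code}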

> NOT condition working incorrectly
> -
>
> Key: SPARK-22307
> URL: https://issues.apache.org/jira/browse/SPARK-22307
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Andrey Yakovenko
> Attachments: Catalog.json.gz
>
>
> Suggest test case: table with x record filtered by expression expr returns y 
> records (< x), not(expr) does not returns x-y records. Work around: 
> when(expr, false).otherwise(true) is working fine.
> {code}
> val ctg = spark.sqlContext.read.json("/user/Catalog.json")
> scala> ctg.printSchema
> root
>  |-- Id: string (nullable = true)
>  |-- Name: string (nullable = true)
>  |-- Parent: struct (nullable = true)
>  ||-- Id: string (nullable = true)
>  ||-- Name: string (nullable = true)
>  ||-- Parent: struct (nullable = true)
>  |||-- Id: string (nullable = true)
>  |||-- Name: string (nullable = true)
>  |||-- Parent: struct (nullable = true)
>  ||||-- Id: string (nullable = true)
>  ||||-- Name: string (nullable = true)
>  ||||-- Parent: string (nullable = true)
>  ||||-- SKU: string (nullable = true)
>  |||-- SKU: string (nullable = true)
>  ||-- SKU: string (nullable = true)
>  |-- SKU: string (nullable = true)
> val col1 = expr("Id IN ('13MXIIAA4', '13MXIBAA4')) OR (Parent.Id IN 
> ('13MXIIAA4', '13MXIBAA4'))) OR (Parent.Parent.Id IN ('13MXIIAA4', 
> '13MXIBAA4'))) OR (Parent.Parent.Parent.Id IN ('13MXIIAA4', '13MXIBAA4')))")
> col1: org.apache.spark.sql.Column = Id IN (13MXIIAA4, 13MXIBAA4)) OR 
> (Parent.Id IN (13MXIIAA4, 13MXIBAA4))) OR (Parent.Parent.Id IN (13MXIIAA4, 
> 13MXIBAA4))) OR (Parent.Parent.Parent.Id IN (13MXIIAA4, 13MXIBAA4)))
> scala> ctg.count
> res5: Long = 623
> scala> ctg.filter(col1).count
> res2: Long = 2
> scala> ctg.filter(not(col1)).count
> res3: Long = 4
> scala> ctg.filter(when(col1, false).otherwise(true)).count
> res4: Long = 621
> {code}
> Table is hierarchy like structure and has a records with different number of 
> levels filled up. I have a suspicion that due to partly filled hierarchy 
> condition return null/undefined/failed/nan some times (neither true or false).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24176) The hdfs file path with wildcard can not be identified when loading data

2018-05-07 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466263#comment-16466263
 ] 

kevin yu commented on SPARK-24176:
--

I am looking at this one and will propose a fix soon. 

> The hdfs file path with wildcard can not be identified when loading data
> 
>
> Key: SPARK-24176
> URL: https://issues.apache.org/jira/browse/SPARK-24176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version:2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> # Launch spark-sql
>  # create table wild1 (time timestamp, name string, isright boolean, 
> datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) row 
> format delimited fields terminated by ',' stored as textfile;
>  # loaded data in table as below and it failed some cases not consistent
>  # load data inpath '/user/testdemo1/user1/?ype* ' into table wild1; - Success
> load data inpath '/user/testdemo1/user1/t??eddata60.txt' into table wild1; - 
> *Failed*
> load data inpath '/user/testdemo1/user1/?ypeddata60.txt' into table wild1; - 
> Success
> Exception as below
> > load data inpath '/user/testdemo1/user1/t??eddata61.txt' into table wild1;
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_database: one
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_database: one
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1
> *Error in query: LOAD DATA input path does not exist: 
> /user/testdemo1/user1/t??eddata61.txt;*
> spark-sql>
> Behavior is not consistent. Need to fix with all combination of wild card 
> char as it is not consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363110#comment-16363110
 ] 

kevin yu commented on SPARK-23402:
--

I just tried with PostgreSQL 9.5.6 on x86_64-pc-linux-gnu and JDBC driver postgresql-9.4.1210, 
and I can't reproduce the error.

scala> val df1 = Seq((1)).toDF("c1")
df1: org.apache.spark.sql.DataFrame = [c1: int]

scala> df1.write.mode(SaveMode.Append).jdbc("jdbc:postgresql://9.30.167.220:5432/mydb", "emptytable", destProperties)

scala> val df3 = spark.read.jdbc("jdbc:postgresql://9.30.167.220:5432/mydb", "emptytable", destProperties).show
+---+
| c1|
+---+
|  1|
+---+

df3: Unit = ()

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using spark dataset write to insert data on postgresql existing table. 
> For this I am using  write method mode as append mode. While using i am 
> getting exception like table already exists. But, I gave option as append 
> mode.
> It's strange. When i change options to sqlserver/oracle append mode is 
> working as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  

[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363584#comment-16363584
 ] 

kevin yu commented on SPARK-23402:
--

Thanks, I will install 9.5.8 and try again.




> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using spark dataset write to insert data on postgresql existing table. 
> For this I am using  write method mode as append mode. While using i am 
> getting exception like table already exists. But, I gave option as append 
> mode.
> It's strange. When i change options to sqlserver/oracle append mode is 
> working as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  at 
> com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
>  at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
> com.ads.dqam.Client.main(Client.java:71)}}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23402) Dataset write method not working as expected for postgresql database

2018-02-13 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363539#comment-16363539
 ] 

kevin yu commented on SPARK-23402:
--

Yes, I created an empty table (emptytable) in a database (mydb) in PostgreSQL, 
then ran the above statement from spark-shell, and it works fine. The only 
difference I see is that my Postgres is at 9.5.6 while yours is at 9.5.8+. 
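
For reference, my spark-shell test was roughly the following (a sketch only; 
the connection URL and credentials below are placeholders, not your setup):

import java.util.Properties
import org.apache.spark.sql.SaveMode

// spark-shell already provides `spark` and the toDF implicits.
val props = new Properties()
props.put("driver", "org.postgresql.Driver")
props.put("user", "dbmig")
props.put("password", "dbmig")

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// Append into the existing (empty) table; append mode should not re-issue CREATE TABLE.
df.write.mode(SaveMode.Append)
  .jdbc("jdbc:postgresql://localhost:5432/mydb", "emptytable", props)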

> Dataset write method not working as expected for postgresql database
> 
>
> Key: SPARK-23402
> URL: https://issues.apache.org/jira/browse/SPARK-23402
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
> Environment: PostgreSQL: 9.5.8 (10 + Also same issue)
> OS: Cent OS 7 & Windows 7,8
> JDBC: 9.4-1201-jdbc41
>  
> Spark:  I executed in both 2.1.0 and 2.2.1
> Mode: Standalone
> OS: Windows 7
>Reporter: Pallapothu Jyothi Swaroop
>Priority: Major
> Attachments: Emsku[1].jpg
>
>
> I am using spark dataset write to insert data on postgresql existing table. 
> For this I am using  write method mode as append mode. While using i am 
> getting exception like table already exists. But, I gave option as append 
> mode.
> It's strange. When i change options to sqlserver/oracle append mode is 
> working as expected.
>  
> *Database Properties:*
> {{destinationProps.put("driver", "org.postgresql.Driver"); 
> destinationProps.put("url", "jdbc:postgresql://127.0.0.1:30001/dbmig"); 
> destinationProps.put("user", "dbmig");}}
> {{destinationProps.put("password", "dbmig");}}
>  
> *Dataset Write Code:*
> {{valueAnalysisDataset.write().mode(SaveMode.Append).jdbc(destinationDbMap.get("url"),
>  "dqvalue", destinationdbProperties);}} 
>  
>  
> {{Exception in thread "main" org.postgresql.util.PSQLException: ERROR: 
> relation "dqvalue" already exists at 
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2412)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2125)
>  at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:297) 
> at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:428) at 
> org.postgresql.jdbc.PgStatement.execute(PgStatement.java:354) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:301) at 
> org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:287) at 
> org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:264) at 
> org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:244) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:806)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:469)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) 
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609) 
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at 
> org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:460) at 
> com.ads.dqam.action.impl.PostgresValueAnalysis.persistValueAnalysis(PostgresValueAnalysis.java:25)
>  at 
> com.ads.dqam.action.AbstractValueAnalysis.persistAnalysis(AbstractValueAnalysis.java:81)
>  at com.ads.dqam.Analysis.doAnalysis(Analysis.java:32) at 
> com.ads.dqam.Client.main(Client.java:71)}}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-23486) LookupFunctions should not check the same function name more than once

2018-02-22 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16373851#comment-16373851
 ] 

kevin yu commented on SPARK-23486:
--

Hello [~lian cheng]: I think the easy way is to build a hash map around the 
LookupFunctions rule: the first time a function is found in the external 
catalog, put it into the hash map; on later LookupFunctions calls, check the 
hash map first to avoid the metastore accesses. Does this approach look OK to 
you? If so, I can provide a PR for review. Thanks.

Another approach is to cache the external catalog functions in the shared 
state so that many queries can use it, but the invalidation would be more 
involved.
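
A minimal sketch of the memoization idea, for illustration only (the 
identifier type and the catalog callback here are stand-ins, not the actual 
Catalyst/SessionCatalog APIs):

import scala.collection.mutable

// Stand-in for Catalyst's function identifier.
case class FunctionIdentifier(funcName: String, database: Option[String] = None)

// Wraps a single "does this function exist in the external catalog?" check
// and remembers positive answers, so each name hits the metastore at most once.
class FunctionLookupCache(lookupInCatalog: FunctionIdentifier => Boolean) {
  private val known = mutable.HashSet.empty[FunctionIdentifier]

  def exists(name: FunctionIdentifier): Boolean = {
    if (known.contains(name)) {
      true
    } else if (lookupInCatalog(name)) {
      known += name   // cache the hit; later lookups skip the metastore
      true
    } else {
      false           // unregistered: the rule would raise the analysis error
    }
  }
}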

> LookupFunctions should not check the same function name more than once
> --
>
> Key: SPARK-23486
> URL: https://issues.apache.org/jira/browse/SPARK-23486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Cheng Lian
>Priority: Major
>  Labels: starter
>
> For a query invoking the same function multiple times, the current 
> {{LookupFunctions}} rule performs a check for each invocation. For users 
> using Hive metastore as external catalog, this issues unnecessary metastore 
> accesses and can slow down the analysis phase quite a bit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24977) input_file_name() result can't save and use for partitionBy()

2018-07-31 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16564490#comment-16564490
 ] 

kevin yu commented on SPARK-24977:
--

Hello Srinivasarao: Can you show the steps by which you encountered the 
problem? I just did a quick test and it seems to work fine, but I am not sure 
it is the same as yours.

scala> spark.read.textFile("file:///etc/passwd")
res3: org.apache.spark.sql.Dataset[String] = [value: string]

scala> res3.select(input_file_name() as "input", expr("10 as col2")).write.partitionBy("input").saveAsTable("passwd3")
18/07/31 16:11:59 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

scala> spark.sql("select * from passwd3").show
+----+------------------+
|col2|             input|
+----+------------------+
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
|  10|file:///etc/passwd|
+----+------------------+
only showing top 20 rows

> input_file_name() result can't save and use for partitionBy()
> -
>
> Key: SPARK-24977
> URL: https://issues.apache.org/jira/browse/SPARK-24977
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.3.1
>Reporter: Srinivasarao Padala
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-13 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578766#comment-16578766
 ] 

kevin yu commented on SPARK-25105:
--

I will try to fix it. Thanks. Kevin

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "", line 1, in 
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2018-08-24 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592126#comment-16592126
 ] 

kevin yu commented on SPARK-19335:
--

[~drew222]: I am still working on it; right now I am waiting for the data 
source v2 work to be finished. Thanks.

Kevin

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert is useful, particularly when 
> working with streaming applications which may require revisions to previously 
> stored data. 
> Spark DataFrames/DataSets do not currently support an Update feature via the 
> JDBC Writer allowing only Overwrite or Append.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19737) New analysis rule for reporting unregistered functions without relying on relation resolution

2018-03-11 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394550#comment-16394550
 ] 

kevin yu commented on SPARK-19737:
--

[~LANDAIS Christophe], I submitted a PR under SPARK-23486; can you try it and 
see if it helps?

 

> New analysis rule for reporting unregistered functions without relying on 
> relation resolution
> -
>
> Key: SPARK-19737
> URL: https://issues.apache.org/jira/browse/SPARK-19737
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Major
> Fix For: 2.2.0
>
>
> Let's consider the following simple SQL query that reference an undefined 
> function {{foo}} that is never registered in the function registry:
> {code:sql}
> SELECT foo(a) FROM t
> {code}
> Assuming table {{t}} is a partitioned  temporary view consisting of a large 
> number of files stored on S3, it may take the analyzer a long time before 
> realizing that {{foo}} is not registered yet.
> The reason is that the existing analysis rule {{ResolveFunctions}} requires 
> all child expressions to be resolved first. Therefore, {{ResolveRelations}} 
> has to be executed first to resolve all columns referenced by the unresolved 
> function invocation. This further leads to partition discovery for {{t}}, 
> which may take a long time.
> To address this case, we propose a new lightweight analysis rule 
> {{LookupFunctions}} that
> # Matches all unresolved function invocations
> # Look up the function names from the function registry
> # Report analysis error for any unregistered functions
> Since this rule doesn't actually try to resolve the unresolved functions, it 
> doesn't rely on {{ResolveRelations}} and therefore doesn't trigger partition 
> discovery.
> We may put this analysis rule in a separate {{Once}} rule batch that sits 
> between the "Substitution" batch and the "Resolution" batch to avoid running 
> it repeatedly and make sure it gets executed before {{ResolveRelations}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23162) PySpark ML LinearRegressionSummary missing r2adj

2018-03-08 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392097#comment-16392097
 ] 

kevin yu commented on SPARK-23162:
--

Currently testing the code; will open a PR soon. Kevin

> PySpark ML LinearRegressionSummary missing r2adj
> 
>
> Key: SPARK-23162
> URL: https://issues.apache.org/jira/browse/SPARK-23162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> Missing the Python API for {{r2adj}} in {{LinearRegressionSummary}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-23 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16661324#comment-16661324
 ] 

kevin yu commented on SPARK-25807:
--

I am looking into option 1; option 3 would change existing behavior and 
probably requires more discussion.
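
For option 1, the intended semantics are just a shift of the start position. 
As a strawman only, shown here as a user-side wrapper rather than a committed 
API:

import org.apache.spark.sql.Column

object Substr0Syntax {
  // 0-based start position, delegating to the existing 1-based Column.substr.
  implicit class ColumnSubstr0(private val c: Column) extends AnyVal {
    def substr0(startPos: Int, len: Int): Column = c.substr(startPos + 1, len)
  }
}
// usage: import Substr0Syntax._ ; df.select($"name".substr0(0, 2))  // first two characters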

Kevin

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's 
> {{substr}}, which are zero-based.  Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-24 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662830#comment-16662830
 ] 

kevin yu commented on SPARK-25807:
--

Thanks Sean; OK, I will leave it as it is. 

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's 
> {{substr}}, which are zero-based.  Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-25 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663332#comment-16663332
 ] 

kevin yu edited comment on SPARK-25807 at 10/25/18 6:42 AM:


[~oron.navon]: You can also try to implement a simple Python UDF to do the 
0-based substr(); here is an example:

'def substr0(x, y, z):
    if y < 0:
        y = 0
    if z < 0:
        z = 0
    return x[y:z + y]

from pyspark.sql.functions import UserDefinedFunction

sub_str0 = spark.catalog.registerFunction("subStr0",
    UserDefinedFunction(lambda x, y, z: substr0(x, y, z)))

spark.sql("select substr0('kevin', 0, 2)").collect()
[Row(subStr0(kevin, 0, 2)=u'ke')]

spark.sql("select substr0('kevin', 1, 2)").collect()
[Row(subStr0(kevin, 1, 2)=u'ev')]'

 


was (Author: kevinyu98):
[~oron.navon]: You can also try to implement a simple Python UDF to do the 
0-based substr(); here is an example:

`def substr0(x, y, z):
    if y < 0:
        y = 0
    if z < 0:
        z = 0
    return x[y:z + y]

from pyspark.sql.functions import UserDefinedFunction

sub_str0 = spark.catalog.registerFunction("subStr0",
    UserDefinedFunction(lambda x, y, z: substr0(x, y, z)))

spark.sql("select substr0('kevin', 0, 2)").collect()
[Row(subStr0(kevin, 0, 2)=u'ke')]

spark.sql("select substr0('kevin', 1, 2)").collect()
[Row(subStr0(kevin, 1, 2)=u'ev')]`

 

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's 
> {{substr}}, which are zero-based.  Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-25 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663332#comment-16663332
 ] 

kevin yu commented on SPARK-25807:
--

[~oron.navon]: You can also try to implement a simple Python UDF to do the 
0-based substr(); here is an example:

`def substr0(x, y, z):
    if y < 0:
        y = 0
    if z < 0:
        z = 0
    return x[y:z + y]

from pyspark.sql.functions import UserDefinedFunction

sub_str0 = spark.catalog.registerFunction("subStr0",
    UserDefinedFunction(lambda x, y, z: substr0(x, y, z)))

spark.sql("select substr0('kevin', 0, 2)").collect()
[Row(subStr0(kevin, 0, 2)=u'ke')]

spark.sql("select substr0('kevin', 1, 2)").collect()
[Row(subStr0(kevin, 1, 2)=u'ev')]`

 

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's 
> {{substr}}, which are zero-based.  Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25807) Mitigate 1-based substr() confusion

2018-10-25 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16663332#comment-16663332
 ] 

kevin yu edited comment on SPARK-25807 at 10/25/18 6:43 AM:


[~oron.navon]: You can also try to implement a simple Python UDF to do the 
0-based substr(); here is an example:

def substr0(x, y, z):
    if y < 0:
        y = 0
    if z < 0:
        z = 0
    return x[y:z + y]

from pyspark.sql.functions import UserDefinedFunction

sub_str0 = spark.catalog.registerFunction("subStr0",
    UserDefinedFunction(lambda x, y, z: substr0(x, y, z)))

spark.sql("select substr0('kevin', 0, 2)").collect()
[Row(subStr0(kevin, 0, 2)=u'ke')]

spark.sql("select substr0('kevin', 1, 2)").collect()
[Row(subStr0(kevin, 1, 2)=u'ev')]

 


was (Author: kevinyu98):
[~oron.navon]: You can also try to implement a simple Python UDF to do the 
0-based substr(); here is an example:

'def substr0(x, y, z):
    if y < 0:
        y = 0
    if z < 0:
        z = 0
    return x[y:z + y]

from pyspark.sql.functions import UserDefinedFunction

sub_str0 = spark.catalog.registerFunction("subStr0",
    UserDefinedFunction(lambda x, y, z: substr0(x, y, z)))

spark.sql("select substr0('kevin', 0, 2)").collect()
[Row(subStr0(kevin, 0, 2)=u'ke')]

spark.sql("select substr0('kevin', 1, 2)").collect()
[Row(subStr0(kevin, 1, 2)=u'ev')]'

 

> Mitigate 1-based substr() confusion
> ---
>
> Key: SPARK-25807
> URL: https://issues.apache.org/jira/browse/SPARK-25807
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, PySpark
>Affects Versions: 1.3.0, 2.3.2, 2.4.0, 2.5.0, 3.0.0
>Reporter: Oron Navon
>Priority: Minor
>
> The method {{Column.substr()}} is 1-based, conforming with SQL and Hive's 
> {{SUBSTRING}}, and contradicting both Python's {{substr}} and Java's 
> {{substr}}, which are zero-based.  Both PySpark users and Java API users 
> often naturally expect a 0-based {{substr()}}. Adding to the confusion, 
> {{substr()}} currently allows a {{startPos}} value of 0, which returns the 
> same result as {{startPos==1}}.
> Since changing {{substr()}} to 0-based is probably NOT a reasonable option 
> here, I suggest making one or more of the following changes:
>  # Adding a method {{substr0}}, which would be zero-based
>  # Renaming {{substr}} to {{substr1}}
>  # Making the existing {{substr()}} throw an exception on {{startPos==0}}, 
> which should catch and alert most users who expect zero-based behavior.
> This is my first discussion on this project, apologies for any faux pas.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25892) AttributeReference.withMetadata method should have return type AttributeReference

2018-10-31 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670735#comment-16670735
 ] 

kevin yu commented on SPARK-25892:
--

I am looking into this now. Kevin
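
For context, the workaround the report describes looks roughly like this 
today (a sketch only; the metadata key and value are made up):

import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.MetadataBuilder

// withMetadata is declared to return Attribute, so callers that need an
// AttributeReference (for example to rebuild a LogicalRelation) must cast back:
def withTag(ref: AttributeReference): AttributeReference = {
  val md = new MetadataBuilder()
    .withMetadata(ref.metadata)
    .putString("comment", "example")
    .build()
  ref.withMetadata(md).asInstanceOf[AttributeReference]
}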

> AttributeReference.withMetadata method should have return type 
> AttributeReference
> -
>
> Key: SPARK-25892
> URL: https://issues.apache.org/jira/browse/SPARK-25892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jari Kujansuu
>Priority: Trivial
>
> AttributeReference.withMetadata method should have return type 
> AttributeReference instead of Attribute.
> AttributeReference overrides withMetadata method defined in Attribute super 
> class and returns AttributeReference instance but method's return type is 
> Attribute unlike in other with... methods overridden by AttributeReference.
> In some cases you have to cast the return value back to AttributeReference.
> For example if you want to modify metadata for AttributeReference in 
> LogicalRelation you have to cast return value of withMetadata back to 
> AttributeReference because LogicalRelation takes Seq[AttributeReference] as 
> argument.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-07 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678700#comment-16678700
 ] 

kevin yu commented on SPARK-25967:
--

Hello Victor: I see. By the SQL:2003 standard, the TRIM function removes 
spaces by default. 

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only trailing and leading whitespaces. Removing 
> tabs as well helps use the function for the same use case e.g. artifact 
> cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25967) sql.functions.trim() should remove trailing and leading tabs

2018-11-07 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678614#comment-16678614
 ] 

kevin yu commented on SPARK-25967:
--

Hello Victor: You can specify tab as one of the characters to remove in the 
trim function; the detailed syntax is here:

https://spark.apache.org/docs/2.3.0/api/sql/#trim
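
For example, from spark-shell (the input string here is just an illustration):

// trim with an explicit trim-character set removes tabs as well as spaces.
spark.sql("SELECT trim(BOTH ' \t' FROM '\t  spark  \t') AS cleaned").collect()
// expected: Array([spark])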

> sql.functions.trim() should remove trailing and leading tabs
> 
>
> Key: SPARK-25967
> URL: https://issues.apache.org/jira/browse/SPARK-25967
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.2
>Reporter: Victor Sahin
>Priority: Minor
>
> sql.functions.trim removes only trailing and leading whitespaces. Removing 
> tabs as well helps use the function for the same use case e.g. artifact 
> cleaning.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25993) Add test cases for resolution of ORC table location

2018-11-09 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682141#comment-16682141
 ] 

kevin yu commented on SPARK-25993:
--

I am looking into it now. Kevin

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> Add a test case based on the following example. The behavior was changed in 
> 2.3 release. We also need to upgrade the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int,number int,word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/");
> select * from tab1;
> create external table tab2(folder int,number int,word string) STORED AS ORC 
> LOCATION '/tmp/orctab1/*");
> select * from tab2;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26176) Verify column name when creating table via `STORED AS`

2018-11-26 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699370#comment-16699370
 ] 

kevin yu commented on SPARK-26176:
--

I will look into it. Kevin

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> We can issue a reasonable exception when we creating Parquet native tables, 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 3.0 failed 

[jira] [Commented] (SPARK-26176) Verify column name when creating table via `STORED AS`

2019-01-30 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756245#comment-16756245
 ] 

kevin yu commented on SPARK-26176:
--

Hi Mikhail:
Sorry for the delay, yes, I am still looking into it.

Kevin

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> We can issue a reasonable exception when we creating Parquet native tables, 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to 

[jira] [Commented] (SPARK-28802) Document DESCRIBE DATABASE in SQL Reference.

2019-08-21 Thread kevin yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912469#comment-16912469
 ] 

kevin yu commented on SPARK-28802:
--

[~huaxingao]: No problem; somehow the PR is not linked to this JIRA.

> Document DESCRIBE DATABASE in SQL Reference.
> 
>
> Key: SPARK-28802
> URL: https://issues.apache.org/jira/browse/SPARK-28802
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28828) Document REFRESH statement in SQL Reference.

2019-08-21 Thread kevin yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912502#comment-16912502
 ] 

kevin yu commented on SPARK-28828:
--

I will work on this one

> Document REFRESH statement in SQL Reference.
> 
>
> Key: SPARK-28828
> URL: https://issues.apache.org/jira/browse/SPARK-28828
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.

2019-08-23 Thread kevin yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16914795#comment-16914795
 ] 

kevin yu commented on SPARK-28833:
--

@aman omer: I am about halfway there; can you help review? Thanks.

> Document ALTER VIEW Statement in SQL Reference.
> ---
>
> Key: SPARK-28833
> URL: https://issues.apache.org/jira/browse/SPARK-28833
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28833) Document ALTER VIEW Statement in SQL Reference.

2019-08-21 Thread kevin yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912428#comment-16912428
 ] 

kevin yu commented on SPARK-28833:
--

I will work on this one.

> Document ALTER VIEW Statement in SQL Reference.
> ---
>
> Key: SPARK-28833
> URL: https://issues.apache.org/jira/browse/SPARK-28833
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2019-12-11 Thread kevin yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994166#comment-16994166
 ] 

kevin yu commented on SPARK-19335:
--

[~danny-seismic] [~Vdarshankb] [~nstudenski] [~mrayandutta] [~rinazbelhaj] 
[~drew222]: can you list the reasons why your organization needs this feature? 
We are assessing whether we should resume this work.

 

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert is useful, particularly when 
> working with streaming applications which may require revisions to previously 
> stored data. 
> Spark DataFrames/DataSets do not currently support an Update feature via the 
> JDBC Writer allowing only Overwrite or Append.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30962) Document ALTER TABLE statement in SQL Reference [Phase 2]

2020-02-28 Thread kevin yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047806#comment-17047806
 ] 

kevin yu commented on SPARK-30962:
--

I will work on it.

> Document ALTER TABLE statement in SQL Reference [Phase 2]
> -
>
> Key: SPARK-30962
> URL: https://issues.apache.org/jira/browse/SPARK-30962
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-28791 only covers a subset of 
> ALTER TABLE statements. See the doc in preview-2 
> [https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-ddl-alter-table.html]
>  
> We should add all the supported ALTER TABLE syntax. See 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L157-L198]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31349) Document builtin aggregate function

2020-04-04 Thread kevin yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kevin yu updated SPARK-31349:
-
Parent: SPARK-28588
Issue Type: Sub-task  (was: Task)

> Document builtin aggregate function
> ---
>
> Key: SPARK-31349
> URL: https://issues.apache.org/jira/browse/SPARK-31349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: kevin yu
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31349) Document builtin aggregate function

2020-04-04 Thread kevin yu (Jira)
kevin yu created SPARK-31349:


 Summary: Document builtin aggregate function
 Key: SPARK-31349
 URL: https://issues.apache.org/jira/browse/SPARK-31349
 Project: Spark
  Issue Type: Task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: kevin yu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org