[jira] [Created] (SPARK-14594) Improve error messages for RDD API

2016-04-13 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-14594:
---

 Summary: Improve error messages for RDD API
 Key: SPARK-14594
 URL: https://issues.apache.org/jira/browse/SPARK-14594
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.6.0
Reporter: Marco Gaido


When you have an error in your R code using the RDD API, you always get the 
following error message:

Error in if (returnStatus != 0) { : argument is of length zero

This is not very useful; I think it would be better to catch the R exception 
and show it instead.






[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-22 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253462#comment-15253462
 ] 

Marco Gaido commented on SPARK-14594:
-

Yes, it works with a small amount of data. But if you use a lot of data (say, 10 
million rows per id with about 5 ids), it crashes with the error above.

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.






[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15255186#comment-15255186
 ] 

Marco Gaido commented on SPARK-14594:
-

Yes, I do believe that this is what is happening

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.






[jira] [Updated] (SPARK-14594) Improve error messages for RDD API

2016-04-20 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-14594:

Affects Version/s: (was: 1.6.0)
   1.5.2

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.






[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-20 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249521#comment-15249521
 ] 

Marco Gaido commented on SPARK-14594:
-

I am using Spark 1.5.2. Maybe the issue is resolved now.

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.






[jira] [Commented] (SPARK-14594) Improve error messages for RDD API

2016-04-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251507#comment-15251507
 ] 

Marco Gaido commented on SPARK-14594:
-

The code is quite simple; what I can't give you is the data, since it is on a 
client's cluster and I cannot access it:


df <- sql(sqlContext, "select id, data, time from the_table")
rdd <- SparkR:::toRDD(df)
gb <- SparkR:::groupByKey(rdd, 1000)
# any action performed here will crash with the above error

If you can generate some fake data yourself, I think it will be fine anyway.

> Improve error messages for RDD API
> --
>
> Key: SPARK-14594
> URL: https://issues.apache.org/jira/browse/SPARK-14594
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Marco Gaido
>
> When you have an error in your R code using the RDD API, you always get as 
> error message:
> Error in if (returnStatus != 0) { : argument is of length zero
> This is not very useful and I think it might be better to catch the R 
> exception and show it instead.






[jira] [Created] (SPARK-21738) Thriftserver doesn't cancel jobs when session is closed

2017-08-15 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-21738:
---

 Summary: Thriftserver doesn't cancel jobs when session is closed
 Key: SPARK-21738
 URL: https://issues.apache.org/jira/browse/SPARK-21738
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido


When a session is closed, the jobs launched by that session should be killed in 
order to avoid wasting resources. Currently, this doesn't happen.

So at the moment, if a user launches a query and then closes the connection, 
the query keeps running until completion. This behavior should be changed.






[jira] [Commented] (SPARK-21340) Bring PySpark MLLib evaluation metrics to parity with Scala API

2017-07-17 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089487#comment-16089487
 ] 

Marco Gaido commented on SPARK-21340:
-

[~jake.charland] I submitted a PR but I am not sure it will be merged. If not, 
please use the new ml package.
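
For reference, the {{ml}}-package evaluator mentioned above can be used roughly like this (a sketch in Scala, assuming {{predictions}} is a DataFrame with the default {{rawPrediction}} and {{label}} columns; the PySpark counterpart lives in {{pyspark.ml.evaluation}}):

{code:java}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Evaluate a binary classifier's output with the spark.ml API.
val evaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderPR")            // or "areaUnderROC"
val areaUnderPR = evaluator.evaluate(predictions)
{code}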

> Bring PySpark MLLib evaluation metrics to parity with Scala API
> ---
>
> Key: SPARK-21340
> URL: https://issues.apache.org/jira/browse/SPARK-21340
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Jake Charland
>
> This JIRA is a request to bring PySpark's MLlib evaluation metrics to 
> parity with the Scala API. For example, in BinaryClassificationMetrics there 
> are only two eval metrics exposed to PySpark, areaUnderROC and areaUnderPR, 
> while Scala has support for a much wider set of eval metrics, including 
> precision-recall curves and the ability to set thresholds for recall and 
> precision values. These evaluation metrics are critical for understanding and 
> seeing the performance of trained models and should be available to those 
> using the PySpark APIs.






[jira] [Commented] (SPARK-20990) Multi-line support for JSON

2017-07-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102975#comment-16102975
 ] 

Marco Gaido commented on SPARK-20990:
-

A PR fixing it is ready: https://github.com/apache/spark/pull/18731.
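
For context, the {{multiLine}} option is used like this (a minimal sketch; the path is a placeholder):

{code:java}
// Read JSON files in which a single record may span multiple lines.
val df = spark.read
  .option("multiLine", true)
  .json("/path/to/records.json")  // placeholder path
{code}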

> Multi-line support for JSON
> ---
>
> Key: SPARK-20990
> URL: https://issues.apache.org/jira/browse/SPARK-20990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> When `multiLine` option is on, the existing JSON parser only reads the first 
> record. We should read the other records in the same file.






[jira] [Commented] (SPARK-14516) Clustering evaluator

2017-06-30 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069815#comment-16069815
 ] 

Marco Gaido commented on SPARK-14516:
-

Hello everybody,

I have a proposal for a very efficient Silhouette implementation in a 
distributed environment. Here is the link with all the details of the 
solution. As soon as I finish the implementation and the tests, I will 
post the PR for this: 
https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view.

Please tell me if you have any comments or doubts about it.

Thanks.


> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> MLlib does not have any general purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.






[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118097#comment-16118097
 ] 

Marco Gaido commented on SPARK-21658:
-

[~viirya] ok, thanks.

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
>
> Looks {{na.replace}} missed the default value {{None}}.
> Both docs says they are aliases 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values looks different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take the advantage of SPARK-19454, sounds we should match them.






[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16118040#comment-16118040
 ] 

Marco Gaido commented on SPARK-21658:
-

However, the documentation points out that there is no default for the {{value}} 
parameter in {{DataFrameNaFunctions}}.
Anyway, if that is OK with you, I can work on fixing this.

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
>
> Looks {{na.replace}} missed the default value {{None}}.
> Both docs says they are aliases 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values looks different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take the advantage of SPARK-19454, sounds we should match them.






[jira] [Commented] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-08-20 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134424#comment-16134424
 ] 

Marco Gaido commented on SPARK-21725:
-

[~zhangxin0112zx] I followed your instructions, but I am unable to reproduce 
the problem on the current master. Could you please try and check whether it is 
still present in the current code?


> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> -
> 17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> ..
> ..
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> source 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
> 512282-2/-ext-1/part-0 to destination 
> hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
> at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
> at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
> ... 45 more
> Caused by: java.io.IOException: Filesystem closed
> 
> -
> the doc about the parquet table desc here 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
> Hive metastore Parquet table conversion
> When reading from and writing to Hive metastore Parquet tables, Spark SQL 
> will try to use its own Parquet support instead of Hive SerDe for better 
> performance. This behavior is controlled by the 
> spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
> default.
> I am confused: the problem appears with a partitioned table, but everything is 
> OK with a non-partitioned table. Does this mean Spark does not use its own 
> Parquet support? Could someone suggest how I could avoid the issue?






[jira] [Commented] (SPARK-21772) HiveException unable to move results from srcf to destf in InsertIntoHiveTable

2017-08-22 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137260#comment-16137260
 ] 

Marco Gaido commented on SPARK-21772:
-

Can one of the admins close this as 'Invalid', please? For reference, please 
check the PR. Thanks.

> HiveException unable to move results from srcf to destf in 
> InsertIntoHiveTable 
> ---
>
> Key: SPARK-21772
> URL: https://issues.apache.org/jira/browse/SPARK-21772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: JDK1.7
> CentOS 6.3
> Spark2.1
>Reporter: liupengcheng
>  Labels: sql
>
> Currently, executing {code:java} create table as select {code} returns an 
> exception:
> {code:java}
> 2017-08-17,16:14:18,792 ERROR 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_12.loadTable(HiveShim.scala:346)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:770)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:770)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:770)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:316)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:262)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:261)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:769)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:765)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:763)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:763)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:763)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:323)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:119)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:92)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
> at 
> 

[jira] [Commented] (SPARK-21768) spark.csv.read Empty String Parsed as NULL when nullValue is Set

2017-08-18 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16132191#comment-16132191
 ] 

Marco Gaido commented on SPARK-21768:
-

This is a duplicate of SPARK-17916.

> spark.csv.read Empty String Parsed as NULL when nullValue is Set
> 
>
> Key: SPARK-21768
> URL: https://issues.apache.org/jira/browse/SPARK-21768
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.2, 2.2.0
> Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2)
> PySpark
>Reporter: Andrew Gross
>
> In a CSV with quoted fields, empty strings will be interpreted as NULL even 
> when a nullValue is explicitly set:
> Example CSV with Quoted Fields, Delimiter | and nullValue XXNULLXX
> {{"XXNULLXX"|""|"XXNULLXX"|"foo"}}
> PySpark Script to load the file (from S3):
> {code:title=load.py|borderStyle=solid}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType, StructField, StructType
> spark = SparkSession.builder.appName("test_csv").getOrCreate()
> fields = []
> fields.append(StructField("First Null Field", StringType(), True))
> fields.append(StructField("Empty String Field", StringType(), True))
> fields.append(StructField("Second Null Field", StringType(), True))
> fields.append(StructField("Non Empty String Field", StringType(), True))
> schema = StructType(fields)
> keys = ['s3://mybucket/test/demo.csv']
> bad_data = spark.read.csv(keys, timestampFormat="-MM-dd HH:mm:ss", 
> mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema)
> bad_data.show()
> {code}
> Output
> {noformat}
> +----------------+------------------+-----------------+----------------------+
> |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
> +----------------+------------------+-----------------+----------------------+
> |            null|              null|             null|                   foo|
> +----------------+------------------+-----------------+----------------------+
> {noformat}
> Expected Output:
> {noformat}
> +----------------+------------------+-----------------+----------------------+
> |First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
> +----------------+------------------+-----------------+----------------------+
> |            null|                  |             null|                   foo|
> +----------------+------------------+-----------------+----------------------+
> {noformat}






[jira] [Commented] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS

2017-06-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16051963#comment-16051963
 ] 

Marco Gaido commented on SPARK-19909:
-

IMHO the best option to deal with this problem is to force the setting of 
{{checkpointLocation}} when the default filesystem (the one the metadata dir is 
written to) is different from the filesystem of the temporary directory.
In this way, if {{checkpointLocation}} is not set, we get a much more 
meaningful exception that suggests the proper solution.

I am creating the PR with this implementation.
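
A minimal sketch of the kind of check this could be, assuming a standalone helper that simply compares filesystem schemes (names and message text are illustrative, not the actual PR):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: fail fast when a temporary checkpoint dir would live on a
// filesystem different from the default one (where the stream metadata is written).
def assertCheckpointOnDefaultFs(tempCheckpointDir: String, hadoopConf: Configuration): Unit = {
  val defaultScheme = FileSystem.getDefaultUri(hadoopConf).getScheme            // e.g. "hdfs"
  val checkpointScheme = Option(new Path(tempCheckpointDir).toUri.getScheme)
    .getOrElse("file")                                                          // local temp dirs may have no scheme
  if (defaultScheme != checkpointScheme) {
    throw new IllegalArgumentException(
      s"The temporary checkpoint dir uses '$checkpointScheme' while the default filesystem is " +
      s"'$defaultScheme'; please set the 'checkpointLocation' option or " +
      "'spark.sql.streaming.checkpointLocation' explicitly.")
  }
}
{code}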

> Batches will fail in case that temporary checkpoint dir is on local file 
> system while metadata dir is on HDFS
> -
>
> Key: SPARK-19909
> URL: https://issues.apache.org/jira/browse/SPARK-19909
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When we try to run Structured Streaming in local mode but use HDFS for the 
> storage, batches will fail because of an error like the following.
> {code}
> val handle = stream.writeStream.format("console").start()
> 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata 
> StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to 
> /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kou, access=WRITE, 
> inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x
> {code}
> It's because a temporary checkpoint directory is created on the local file 
> system, but the metadata, whose path is based on the checkpoint directory, 
> will be created on HDFS.






[jira] [Commented] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS

2017-06-14 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16049105#comment-16049105
 ] 

Marco Gaido commented on SPARK-19909:
-

[~rvoyer] there is a workaround and it is easy: you have to set the 
{{checkpointLocation}} option or the {{spark.sql.streaming.checkpointLocation}} 
parameter.
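
For example, something along these lines (reusing the {{stream}} writer from the description; the paths are just placeholders):

{code:java}
// Explicit checkpoint location on the same filesystem as the metadata (HDFS here).
val handle = stream.writeStream
  .format("console")
  .option("checkpointLocation", "hdfs:///user/me/checkpoints/my-query")  // placeholder path
  .start()

// Or once per session, for all streams:
spark.conf.set("spark.sql.streaming.checkpointLocation", "hdfs:///user/me/checkpoints")
{code}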

> Batches will fail in case that temporary checkpoint dir is on local file 
> system while metadata dir is on HDFS
> -
>
> Key: SPARK-19909
> URL: https://issues.apache.org/jira/browse/SPARK-19909
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When we try to run Structured Streaming in local mode but use HDFS for the 
> storage, batches will fail because of an error like the following.
> {code}
> val handle = stream.writeStream.format("console").start()
> 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata 
> StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to 
> /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kou, access=WRITE, 
> inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x
> {code}
> It's because a temporary checkpoint directory is created on the local file 
> system, but the metadata, whose path is based on the checkpoint directory, 
> will be created on HDFS.






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169023#comment-16169023
 ] 

Marco Gaido commented on SPARK-22036:
-

Yes, it is only for multiplications. The reason is that for a multiplication the 
result is expected to have a scale which is the sum of the scales of the two 
operands. When the result of the operation overflows, it is rounded and its scale 
ends up one less than expected; in this situation the result is set to null.
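
A small illustration of the arithmetic, assuming the usual result-type rule for decimal multiplication (precision p1 + p2 + 1 and scale s1 + s2, both capped at 38):

{code:java}
// Both operands default to DecimalType(38, 18), so the result type becomes
//   precision = min(38 + 38 + 1, 38) = 38,  scale = min(18 + 18, 38) = 36
// i.e. Decimal(38, 36): only 38 - 36 = 2 digits are left for the integer part.
// The exact product needs 3 integer digits, so it cannot be represented and Spark returns null.
val product = BigDecimal("-0.1267333984375") * BigDecimal("-1000.1")
println(product)  // 126.74607177734375 -> 3 integer digits, but Decimal(38, 36) allows only 2
{code}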

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. This 
> issue we discovered while doing property based testing for the frameless 
> project. Here is a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1))))
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16168969#comment-16168969
 ] 

Marco Gaido commented on SPARK-22036:
-

This happens because there is an overflow in the operation. I am not sure what 
should be done in this case. The current implementation returns null when an 
operation causes a loss of precision.

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. This 
> issue we discovered while doing property based testing for the frameless 
> project. Here is a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1))))
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169074#comment-16169074
 ] 

Marco Gaido commented on SPARK-22036:
-

Maybe the "bad" part is that by default Spark creates the columns as 
{{Decimal(38, 18)}}. This is the problem: with a multiplication this leads to a 
{{Decimal(38, 36)}}, which, as you can easily see, is the root of the problem 
with your operation. If you cast the two columns before the multiplication, 
e.g. {{ds("a").cast(DecimalType(20,14))}}, you won't have any problem anymore.
Currently you have to tell Spark which are the right values to use.
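
A sketch of that workaround applied to the reproduction above (the precision and scale here are just an example; pick values that fit your data):

{code:java}
import org.apache.spark.sql.types.DecimalType

// Narrower operand types leave enough integer digits in the result type.
val result = ds
  .select(ds("a").cast(DecimalType(20, 14)) * ds("b").cast(DecimalType(20, 14)))
  .collect
  .head
println(result)  // no longer [null]
{code}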

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. This 
> issue we discovered while doing property based testing for the frameless 
> project. Here is a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1))))
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Commented] (SPARK-22040) current_date function with timezone id

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169124#comment-16169124
 ] 

Marco Gaido commented on SPARK-22040:
-

May I work on this?

> current_date function with timezone id
> --
>
> Key: SPARK-22040
> URL: https://issues.apache.org/jira/browse/SPARK-22040
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {{current_date}} function creates {{CurrentDate}} expression that accepts 
> optional timezone id, but there's no function to allow for this.
> This is to have another {{current_date}} with the timezone id, i.e.
> {code}
> def current_date(timeZoneId: String): Column
> {code}






[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-09-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16169059#comment-16169059
 ] 

Marco Gaido commented on SPARK-22036:
-

Honestly I don't know, that is why I said that I don't know what should be done.

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. This 
> issue we discovered while doing property based testing for the frameless 
> project. Here is a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1))))
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}






[jira] [Created] (SPARK-22215) Add a configuration parameter to set max size for generated classes

2017-10-06 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22215:
---

 Summary: Add a configuration parameter to set max size for 
generated classes
 Key: SPARK-22215
 URL: https://issues.apache.org/jira/browse/SPARK-22215
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido


SPARK-18016 introduced an arbitrary threshold for the size of a generated class 
(https://github.com/apache/spark/blob/83488cc3180ca18f829516f550766efb3095881e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L286).
 This value is hardcoded.

Since this is just a guess, in some cases making it smaller can help to avoid 
the error of exceeding the maximum number of entries in the Constant Pool.

Therefore, I suggest introducing a new configuration parameter which defaults to 
the previous value but allows setting it to a smaller one if needed.






[jira] [Commented] (SPARK-22226) Code generation fails for dataframes with 10000 columns

2017-10-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197111#comment-16197111
 ] 

Marco Gaido commented on SPARK-22226:
-

I am not sure what the currently open PR is going to address: in its current 
state it doesn't solve the problem I am facing and would like to address with 
the PR I have prepared.
I think there are many issues around code generation and many things in the 
current implementation which limit scalability in the number of columns. 
Therefore I guess there are cases which need to be handled differently.
Anyway, since that PR is not yet ready, I am unable to state what it will 
address and what it won't.
The only thing I can say is that, as you can see from my branch 
(https://github.com/mgaido91/spark/commits/SPARK-22226), I am doing something 
completely different from what is done in the open PR.

> Code generation fails for dataframes with 10000 columns
> ---
>
> Key: SPARK-22226
> URL: https://issues.apache.org/jira/browse/SPARK-22226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Code generation for very wide datasets can fail because of the Constant Pool 
> limit reached.
> This can be caused by many reasons. One of them is that we are currently 
> splitting the definition of the generated methods among several 
> {{NestedClass}} but all these methods are called in the main class. Since we 
> have entries added to the constant pool for each method invocation, this is 
> limiting the number of rows and is leading for very wide dataset to:
> {noformat}
> org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
>  has grown past JVM limit of 0xFFFF
> {noformat}






[jira] [Created] (SPARK-22226) Code generation fails for dataframes with 10000 columns

2017-10-09 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22226:
---

 Summary: Code generation fails for dataframes with 10000 columns
 Key: SPARK-22226
 URL: https://issues.apache.org/jira/browse/SPARK-22226
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido


Code generation for very wide datasets can fail because of the Constant Pool 
limit reached.

This can be caused by many reasons. One of them is that we are currently 
splitting the definition of the generated methods among several {{NestedClass}} 
instances, but all these methods are called in the main class. Since an entry is 
added to the constant pool for each method invocation, this limits the number of 
rows and, for very wide datasets, leads to:

{noformat}
org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
 has grown past JVM limit of 0xFFFF
{noformat}







[jira] [Commented] (SPARK-22220) Spark SQL: LATERAL VIEW OUTER null pointer exception with GROUP BY

2017-10-10 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16198295#comment-16198295
 ] 

Marco Gaido commented on SPARK-22220:
-

Could you please provide some sample data and simple code to reproduce the issue? 
If you cannot reproduce it without the SparkHBase library, then it is likely a bug 
there and it must be solved there. Thanks.

> Spark SQL: LATERAL VIEW OUTER null pointer exception with GROUP BY
> --
>
> Key: SPARK-22220
> URL: https://issues.apache.org/jira/browse/SPARK-22220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: We have Zeppelin using Spark and Livy (error is 
> reproducible without Livy) on an Ambari cluster.
>Reporter: Dian Fay
>
> Given a DataFrame having the fields name (a string) and tags (an array of 
> strings), the following Spark SQL query fails with a NullPointerException:
> {code}
> SELECT name, tag, COUNT(*)
> FROM records
> LATERAL VIEW OUTER explode(tags) AS tag
> GROUP BY name, tag
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 137.0 failed 4 times, most recent failure: Lost task 0.3 in stage 137.0 
> (TID 9109, $hostname, executor 1): java.lang.NullPointerException
> {code}
> The query is successful without the "outer", but obviously this excludes rows 
> with empty tags arrays. A version with outer but without aggregation also 
> succeeds, making it possible to work around this issue with a subquery:
> {code}
> SELECT name, tag
> FROM records
> LATERAL VIEW OUTER explode(tags) AS tag
> {code}






[jira] [Commented] (SPARK-22226) Code generation fails for dataframes with 10000 columns

2017-10-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197036#comment-16197036
 ] 

Marco Gaido commented on SPARK-22226:
-

[~srowen] I know that there are many tickets for this, but I wanted to submit a 
PR to solve the specific issue I mentioned in the description. This won't 
completely solve the error above (for datasets with 20,000 columns, for 
instance), but it will allow supporting a larger number of columns than now. 
Thus I created this JIRA for that PR. If this is not the right approach, could 
you please tell me what I should do?
Thanks.

> Code generation fails for dataframes with 10000 columns
> ---
>
> Key: SPARK-22226
> URL: https://issues.apache.org/jira/browse/SPARK-22226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Code generation for very wide datasets can fail because of the Constant Pool 
> limit reached.
> This can be caused by many reasons. One of them is that we are currently 
> splitting the definition of the generated methods among several 
> {{NestedClass}} but all these methods are called in the main class. Since we 
> have entries added to the constant pool for each method invocation, this is 
> limiting the number of rows and is leading for very wide dataset to:
> {noformat}
> org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
>  has grown past JVM limit of 0xFFFF
> {noformat}






[jira] [Commented] (SPARK-22226) Code generation fails for dataframes with 10000 columns

2017-10-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197254#comment-16197254
 ] 

Marco Gaido commented on SPARK-22226:
-

[~kiszk] I am not sure that the PR you mentioned solves the same issue. I tried 
it and currently it doesn't.
As you can see in [the branch I 
prepared|https://github.com/mgaido91/spark/tree/SPARK-22226], what I am changing 
is different from what is done in that PR. Despite this, maybe that PR will 
also include a solution to this; of course I don't know what it is going to 
look like.
As [~srowen] pointed out, I chose a bad title for the JIRA. I am updating it 
with a better one.

> Code generation fails for dataframes with 10000 columns
> ---
>
> Key: SPARK-22226
> URL: https://issues.apache.org/jira/browse/SPARK-22226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Code generation for very wide datasets can fail because of the Constant Pool 
> limit reached.
> This can be caused by many reasons. One of them is that we are currently 
> splitting the definition of the generated methods among several 
> {{NestedClass}} but all these methods are called in the main class. Since we 
> have entries added to the constant pool for each method invocation, this is 
> limiting the number of rows and is leading for very wide dataset to:
> {noformat}
> org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
>  has grown past JVM limit of 0xFFFF
> {noformat}






[jira] [Updated] (SPARK-22226) splitExpression can create too many method calls (generating a Constant Pool limit error)

2017-10-09 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-22226:

Summary: splitExpression can create too many method calls (generating a 
Constant Pool limit error)  (was: Code generation fails for dataframes with 
10000 columns)

> splitExpression can create too many method calls (generating a Constant Pool 
> limit error)
> -
>
> Key: SPARK-22226
> URL: https://issues.apache.org/jira/browse/SPARK-22226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Code generation for very wide datasets can fail because of the Constant Pool 
> limit reached.
> This can be caused by many reasons. One of them is that we are currently 
> splitting the definition of the generated methods among several 
> {{NestedClass}} but all these methods are called in the main class. Since we 
> have entries added to the constant pool for each method invocation, this is 
> limiting the number of rows and is leading for very wide dataset to:
> {noformat}
> org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
>  has grown past JVM limit of 0xFFFF
> {noformat}






[jira] [Commented] (SPARK-22226) splitExpression can create too many method calls (generating a Constant Pool limit error)

2017-10-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16197286#comment-16197286
 ] 

Marco Gaido commented on SPARK-22226:
-

Exactly [~kiszk], sorry for the bad initial title of the JIRA. Do you think I 
can/should reopen this JIRA and submit the PR then?

> splitExpression can create too many method calls (generating a Constant Pool 
> limit error)
> -
>
> Key: SPARK-22226
> URL: https://issues.apache.org/jira/browse/SPARK-22226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Code generation for very wide datasets can fail because of the Constant Pool 
> limit reached.
> This can be caused by many reasons. One of them is that we are currently 
> splitting the definition of the generated methods among several 
> {{NestedClass}} but all these methods are called in the main class. Since we 
> have entries added to the constant pool for each method invocation, this is 
> limiting the number of rows and is leading for very wide dataset to:
> {noformat}
> org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
>  has grown past JVM limit of 0xFFFF
> {noformat}






[jira] [Reopened] (SPARK-22226) splitExpression can create too many method calls (generating a Constant Pool limit error)

2017-10-12 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reopened SPARK-22226:
-

> splitExpression can create too many method calls (generating a Constant Pool 
> limit error)
> -
>
> Key: SPARK-6
> URL: https://issues.apache.org/jira/browse/SPARK-6
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Code generation for very wide datasets can fail because of the Constant Pool 
> limit reached.
> This can be caused by many reasons. One of them is that we are currently 
> splitting the definition of the generated methods among several 
> {{NestedClass}} but all these methods are called in the main class. Since we 
> have entries added to the constant pool for each method invocation, this is 
> limiting the number of rows and is leading for very wide dataset to:
> {noformat}
> org.codehaus.janino.JaninoRuntimeException: Constant pool for class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
>  has grown past JVM limit of 0xFFFF
> {noformat}






[jira] [Comment Edited] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406
 ] 

Marco Gaido edited comment on SPARK-21944 at 9/8/17 9:57 AM:
-

[~KevinZwx] you should define the watermark on the column `"time"`, not the 
column `"window"`


was (Author: mgaido):
[~kevinzhang] you should define the watermark on the column `"time"`, not the 
column `"window"`

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated wrong
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a dataframe with a timestamp column "time" and long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results is like the following
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong which are always in 
> 1970-01-01, so the watermark is calculated wrong.






[jira] [Comment Edited] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406
 ] 

Marco Gaido edited comment on SPARK-21944 at 9/8/17 10:31 AM:
--

[~KevinZwx] you should define the watermark on the column {{"time"}}, not the 
column {{"window"}}


was (Author: mgaido):
[~KevinZwx] you should define the watermark on the column `"time"`, not the 
column `"window"`

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated wrong
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a dataframe with a timestamp column "time" and long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results is like the following
> {code:shell}
> -------------------------------------------
> Batch: 0
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> -------------------------------------------
> Batch: 1
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1    |
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> -------------------------------------------
> Batch: 2
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1    |
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong, as they always fall in 
> 1970-01-01, so the watermark is calculated wrong.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406
 ] 

Marco Gaido commented on SPARK-21944:
-

[~kevinzhang] you should define the watermark on the column `"time"`, not the 
column `"window"`

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated wrong
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a dataframe with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following:
> {code:shell}
> -------------------------------------------
> Batch: 0
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> -------------------------------------------
> Batch: 1
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1    |
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> -------------------------------------------
> Batch: 2
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1    |
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong, as they always fall in 
> 1970-01-01, so the watermark is calculated wrong.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21957) Add current_user function

2017-09-08 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-21957:
---

 Summary: Add current_user function
 Key: SPARK-21957
 URL: https://issues.apache.org/jira/browse/SPARK-21957
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido
Priority: Minor


Spark doesn't support the {{current_user}} function.

Although the user can be retrieved in other ways, the function would make it 
easier to migrate existing Hive queries to Spark, and it would also be 
convenient for people who interact with Spark only through SQL.
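For illustration, the intended usage would mirror the Hive built-in of the same 
name (the exact Spark signature is an assumption at this stage):

{code:java}
// hypothetical usage once the function is available:
// returns the name of the user running the query
spark.sql("SELECT current_user()").show()
{code}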



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155047#comment-16155047
 ] 

Marco Gaido commented on SPARK-21918:
-

What I meant is that if we want to support doAs, we shouldn't support it only 
for DDL operations, but also for all DML & DQL. I am pretty sure your fix won't 
affect the DML & DQL behavior, i.e. with your change we would support doAs only 
for DDL operations. This means there would be a hybrid situation: doAs would 
work for DDL, but not for DML & DQL. This is not a desirable condition.

PS Out of curiosity, may I ask how you tested that your DDL commands were run 
as the session user?
Thanks.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true
> The root cause is that Hive object is shared between different thread in 
> HiveClientImpl
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should just share the Hive object inside the 
> thread so that the  metastore client in Hive could be associated with right 
> user.
> we can  pass the Hive object of parent thread to child thread when running 
> the sql to fix it
> I have already had a initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155162#comment-16155162
 ] 

Marco Gaido commented on SPARK-21918:
-

Yes, I think this would be great, thanks.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true
> The root cause is that Hive object is shared between different thread in 
> HiveClientImpl
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should just share the Hive object inside the 
> thread so that the  metastore client in Hive could be associated with right 
> user.
> we can  pass the Hive object of parent thread to child thread when running 
> the sql to fix it
> I have already had a initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153888#comment-16153888
 ] 

Marco Gaido commented on SPARK-21888:
-

[~tgraves] Sorry, I misread. Of course, this doesn't add it to the client, only 
to the driver and the executors. But in the example you made, i.e. writing to 
HBase, I can't see why you would need it: it is enough to load the conf in the 
driver and the executors.

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on Yarn in cluster mode, currently there is no way to add 
> any config files to Client classpath. An example for this is that suppose you 
> want to run an application that uses hbase. Then, unless and until we do not 
> copy the necessary config files required by hbase to Spark Config folder, we 
> cannot specify or set their exact locations in classpath on Client end which 
> we could do so earlier by setting the environment variable "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154033#comment-16154033
 ] 

Marco Gaido commented on SPARK-21918:
-

What do you mean by "works correctly"? Actually, all the jobs are executed as 
the user who started the STS.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true
> The root cause is that Hive object is shared between different thread in 
> HiveClientImpl
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should just share the Hive object inside the 
> thread so that the  metastore client in Hive could be associated with right 
> user.
> we can  pass the Hive object of parent thread to child thread when running 
> the sql to fix it
> I have already had a initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21944) Watermark on window column is wrong

2017-09-07 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157091#comment-16157091
 ] 

Marco Gaido commented on SPARK-21944:
-

Could you please provide some sample data to reproduce the issue? Thanks.

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated wrong
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a dataframe with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following:
> {code:shell}
> -------------------------------------------
> Batch: 0
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> -------------------------------------------
> Batch: 1
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1    |
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> -------------------------------------------
> Batch: 2
> -------------------------------------------
> +---------------------------------------------+-----+
> |window                                       |count|
> +---------------------------------------------+-----+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1    |
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4    |
> +---------------------------------------------+-----+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong, as they always fall in 
> 1970-01-01, so the watermark is calculated wrong.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21938) Spark partial CSV write fails silently

2017-09-06 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16156118#comment-16156118
 ] 

Marco Gaido commented on SPARK-21938:
-

It would be helpful if you could post some sample code to reproduce the issue, 
along with some sample data. Thanks.

> Spark partial CSV write fails silently
> --
>
> Key: SPARK-21938
> URL: https://issues.apache.org/jira/browse/SPARK-21938
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR 5.8, varying instance types
>Reporter: Abbi McClintic
>
> Hello,
> My team has been experiencing a recurring unpredictable bug where only a 
> partial write to CSV in S3 on one partition of our Dataset is performed. For 
> example, in a Dataset of 10 partitions written to CSV in S3, we might see 9 
> of the partitions as 2.8 GB in size, but one of them as 1.6 GB. However, the 
> job does not exit with an error code. 
> This becomes problematic in the following ways:
> 1. When we copy the data to Redshift, we get a bad decrypt error on the 
> partial file, suggesting that the failure occurred at a weird byte in the 
> file. 
> 2. We lose data - sometimes as much as 10%. 
> We don't see this problem with parquet, which we also use, but moving all of 
> our data to parquet is not currently feasible. We're using the Java API.
> Any help on resolving this would be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162764#comment-16162764
 ] 

Marco Gaido commented on SPARK-21981:
-

[~yanboliang] Yes, thanks. I will post a PR asap.

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22119) Add cosine distance to KMeans

2017-09-25 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22119:
---

 Summary: Add cosine distance to KMeans
 Key: SPARK-22119
 URL: https://issues.apache.org/jira/browse/SPARK-22119
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.2.0
Reporter: Marco Gaido
Priority: Minor


Currently, KMeans assumes that the only available distance measure is the 
Euclidean one.

In some use cases, e.g. text mining, other distance measures like the cosine 
distance are widely used. Thus, for such use cases, it would be good to support 
multiple distance measures.

This ticket is about supporting the cosine distance measure in KMeans. Later, 
other algorithms can be extended to support several distance measures, and 
further measures can be added.
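As a reference, a minimal sketch of the cosine distance on two dense vectors 
(illustrative only; the actual change would plug this into KMeans through a 
configurable distance-measure abstraction, which is not shown here):

{code:java}
import org.apache.spark.ml.linalg.Vector

// cosine distance = 1 - cosine similarity; 0 means the vectors point in the same direction
def cosineDistance(a: Vector, b: Vector): Double = {
  val dot   = (0 until a.size).map(i => a(i) * b(i)).sum
  val normA = math.sqrt((0 until a.size).map(i => a(i) * a(i)).sum)
  val normB = math.sqrt((0 until b.size).map(i => b(i) * b(i)).sum)
  1.0 - dot / (normA * normB)
}
{code}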



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22040) current_date function with timezone id

2017-10-02 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22040.
-
Resolution: Invalid

> current_date function with timezone id
> --
>
> Key: SPARK-22040
> URL: https://issues.apache.org/jira/browse/SPARK-22040
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> {{current_date}} function creates {{CurrentDate}} expression that accepts 
> optional timezone id, but there's no function to allow for this.
> This is to have another {{current_date}} with the timezone id, i.e.
> {code}
> def current_date(timeZoneId: String): Column
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21905) ClassCastException when call sqlContext.sql on temp table

2017-09-04 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152955#comment-16152955
 ] 

Marco Gaido commented on SPARK-21905:
-

This is likely caused by a bug in the Magellan package. It expects to 
receive an InternalRow to deserialize, but in this case it doesn't get one. So it 
should be fixed there.

> ClassCastException when call sqlContext.sql on temp table
> -
>
> Key: SPARK-21905
> URL: https://issues.apache.org/jira/browse/SPARK-21905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: bluejoe
>
> {code:java}
> val schema = StructType(List(
>   StructField("name", DataTypes.StringType, true),
>   StructField("location", new PointUDT, true)))
> val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 
> 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) });
> val dataFrame = sqlContext.createDataFrame(rowRdd, schema)
> dataFrame.createOrReplaceTempView("person");
> sqlContext.sql("SELECT * FROM person").foreach(println(_));
> {code}
> the last statement throws exception:
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
>   ... 18 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21888) Cannot add stuff to Client Classpath for Yarn Cluster Mode

2017-09-01 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151128#comment-16151128
 ] 

Marco Gaido commented on SPARK-21888:
-

It is enough to add {{hbase-site.xml}} using {{--files}} in cluster mode to 
have it picked up. The problem is in client mode: in that case it has to be 
placed in the Spark conf dir so that it ends up on the classpath.
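For example (hypothetical application name and paths, just to show the flag):

{noformat}
spark-submit --master yarn --deploy-mode cluster \
  --files /etc/hbase/conf/hbase-site.xml \
  --class com.example.MyHBaseApp my-hbase-app.jar
{noformat}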

> Cannot add stuff to Client Classpath for Yarn Cluster Mode
> --
>
> Key: SPARK-21888
> URL: https://issues.apache.org/jira/browse/SPARK-21888
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> While running Spark on Yarn in cluster mode, currently there is no way to add 
> any config files, jars etc. to Client classpath. An example for this is that 
> suppose you want to run an application that uses hbase. Then, unless and 
> until we do not copy the necessary config files required by hbase to Spark 
> Config folder, we cannot specify or set their exact locations in classpath on 
> Client end which we could do so earlier by setting the environment variable 
> "SPARK_CLASSPATH".



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153672#comment-16153672
 ] 

Marco Gaido commented on SPARK-21918:
-

hive.server2.enable.doAs=true is currently not supported in STS.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true
> The root cause is that Hive object is shared between different thread in 
> HiveClientImpl
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should just share the Hive object inside the 
> thread so that the  metastore client in Hive could be associated with right 
> user.
> we can  pass the Hive object of parent thread to child thread when running 
> the sql to fix it
> I have already had a initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21918) HiveClient shouldn't share Hive object between different thread

2017-09-05 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153672#comment-16153672
 ] 

Marco Gaido edited comment on SPARK-21918 at 9/5/17 1:54 PM:
-

{{hive.server2.enable.doAs=true}} is currently not supported in STS.


was (Author: mgaido):
hive.server2.enable.doAs=true is currently not supported in STS.

> HiveClient shouldn't share Hive object between different thread
> ---
>
> Key: SPARK-21918
> URL: https://issues.apache.org/jira/browse/SPARK-21918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hu Liu,
>
> I'm testing the spark thrift server and found that all the DDL statements are 
> run by user hive even if hive.server2.enable.doAs=true
> The root cause is that Hive object is shared between different thread in 
> HiveClientImpl
> {code:java}
>   private def client: Hive = {
> if (clientLoader.cachedHive != null) {
>   clientLoader.cachedHive.asInstanceOf[Hive]
> } else {
>   val c = Hive.get(conf)
>   clientLoader.cachedHive = c
>   c
> }
>   }
> {code}
> But in impersonation mode, we should just share the Hive object inside the 
> thread so that the  metastore client in Hive could be associated with right 
> user.
> we can  pass the Hive object of parent thread to child thread when running 
> the sql to fix it
> I have already had a initial patch for review and I'm glad to work on it if 
> anyone could assign it to me.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22220) Spark SQL: LATERAL VIEW OUTER null pointer exception with GROUP BY

2017-10-07 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195630#comment-16195630
 ] 

Marco Gaido commented on SPARK-22220:
-

Your version is quite old and Spark 1.6 is no longer supported. Could you check 
whether the issue still exists in the current master branch? Thanks.

> Spark SQL: LATERAL VIEW OUTER null pointer exception with GROUP BY
> --
>
> Key: SPARK-22220
> URL: https://issues.apache.org/jira/browse/SPARK-22220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3
> Environment: We have Zeppelin using Spark and Livy (error is 
> reproducible without Livy) on an Ambari cluster.
>Reporter: Dian Fay
>
> Given a DataFrame having the fields name (a string) and tags (an array of 
> strings), the following Spark SQL query fails with a NullPointerException:
> {code}
> SELECT name, tag, COUNT(*)
> FROM records
> LATERAL VIEW OUTER explode(tags) AS tag
> GROUP BY name, tag
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 137.0 failed 4 times, most recent failure: Lost task 0.3 in stage 137.0 
> (TID 9109, $hostname, executor 1): java.lang.NullPointerException
> {code}
> The query is successful without the "outer", but obviously this excludes rows 
> with empty tags arrays. A version with outer but without aggregation also 
> succeeds, making it possible to work around this issue with a subquery:
> {code}
> SELECT name, tag
> FROM records
> LATERAL VIEW OUTER explode(tags) AS tag
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-27 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22146:
---

 Summary: FileNotFoundException while reading ORC files containing 
'%'
 Key: SPARK-22146
 URL: https://issues.apache.org/jira/browse/SPARK-22146
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido


Reading ORC files containing "strange" characters like '%' fails with a 
FileNotFoundException.

For instance, if you have:

{noformat}
/tmp/orc_test/folder %3Aa/orc1.orc
/tmp/orc_test/folder %3Ab/orc2.orc
{noformat}

and you try to read the ORC files with:


{noformat}
spark.read.format("orc").load("/tmp/orc_test/*/*").show
{noformat}

you will get a:

{noformat}
java.io.FileNotFoundException: File file:/tmp/orc_test/folder%20%253Aa/orc1.orc 
does not exist
  at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
  at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
  at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
  at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
  at 
org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
  at 
org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
  at 
org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
  at scala.Option.orElse(Option.scala:289)
  at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
  ... 48 elided
{noformat}

Note that the same code works for Parquet and text files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16182650#comment-16182650
 ] 

Marco Gaido commented on SPARK-22146:
-

If you look carefully at the file which Spark is looking for, you'll see that 
it doesn't exist because it is the result of an improper encoding.
So, yes, the right file exists, but Spark is looking for the wrong one.
We tried both on HDFS and on the local filesystem; the error is the same, and 
it is due to the encoding of the path in the inferSchema process. I am 
preparing a PR to fix it and will post it as soon as it is ready.

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16182686#comment-16182686
 ] 

Marco Gaido edited comment on SPARK-22146 at 9/27/17 2:58 PM:
--

Yes, that is a local file and I am running `spark-shell` locally on my machine 
from the current master.


was (Author: mgaido):
Yes, that is a local file and I am running `spark-shell` locally from the 
current master.

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16182686#comment-16182686
 ] 

Marco Gaido commented on SPARK-22146:
-

Yes, that is a local file and I am running `spark-shell` locally from the 
current master.

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-10-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210898#comment-16210898
 ] 

Marco Gaido edited comment on SPARK-20617 at 10/19/17 11:43 AM:


This is not a bug. This is the right and expected behavior according to the SQL 
standard. Indeed, every operation involving null is evaluated to null. You 
can easily check this behavior by running:

{code:java}
spark.sql("select null in ('a')")
{code}

Then, in a filter expression, null is treated as false. So the behavior you see 
is the correct one. Your first "workaround" is the right way to go.
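To make the two points concrete (a quick sketch; the results are from memory, 
so treat them as indicative):

{code:java}
spark.sql("select null in ('a') as r").collect()   // Array([null]): NULL IN (...) evaluates to NULL
spark.range(1).filter("null in ('a')").count()     // 0: a NULL predicate behaves like false in a filter
{code}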
Thanks.


was (Author: mgaido):
This is not a bug. This is the right and expected behavior according to SQL 
standards. Indeed, every operation involving null, is evaluated to null. You 
can easily check this behavior running:

{code:java}
// Some comments here
spark.sql("select null in ('a')")
{code}

Then, in a filter expression null is considered to be false. So you have this 
behavior which is the right one. Your first "workaround" is the right way to go.
Thanks.

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-10-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210898#comment-16210898
 ] 

Marco Gaido edited comment on SPARK-20617 at 10/19/17 11:43 AM:


This is not a bug. This is the right and expected behavior according to SQL 
standards. Indeed, every operation involving null, is evaluated to null. You 
can easily check this behavior running:

{code:java}
// Some comments here
spark.sql("select null in ('a')")
{code}

Then, in a filter expression null is considered to be false. So you have this 
behavior which is the right one. Your first "workaround" is the right way to go.
Thanks.


was (Author: mgaido):
This is not a bug. This is the right and expected behavior according to SQL 
standards. Indeed, every operation involving null, is evaluated to null. You 
can easily check this behavior running:
```
spark.sql("select null in ('a')")
```
Then, in a filter expression null is considered to be false. So you have this 
behavior which is the right one. Your first "workaround" is the right way to go.
Thanks.

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-10-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16210898#comment-16210898
 ] 

Marco Gaido commented on SPARK-20617:
-

This is not a bug. This is the right and expected behavior according to SQL 
standards. Indeed, every operation involving null, is evaluated to null. You 
can easily check this behavior running:
```
spark.sql("select null in ('a')")
```
Then, in a filter expression null is considered to be false. So you have this 
behavior which is the right one. Your first "workaround" is the right way to go.
Thanks.

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20617) pyspark.sql filtering fails when using ~isin when there are nulls in column

2017-10-19 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-20617.
-
Resolution: Not A Bug

> pyspark.sql filtering fails when using ~isin when there are nulls in column
> ---
>
> Key: SPARK-20617
> URL: https://issues.apache.org/jira/browse/SPARK-20617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
> Environment: Ubuntu Xenial 16.04, Python 3.5
>Reporter: Ed Lee
>
> Hello encountered a filtering bug using 'isin' in pyspark sql on version 
> 2.2.0, Ubuntu 16.04.
> Enclosed below an example to replicate:
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as sf
> import pandas as pd
> spark = SparkSession.builder.master("local").appName("Word 
> Count").getOrCreate()
> test_df = pd.DataFrame({"col1": [None, None, "a", "b", "c"],
> "col2": range(5)
> })
> test_sdf = spark.createDataFrame(test_df)
> test_sdf.show()
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   a|   2|
>  |   b|   3|
>  |   c|   4|
> # Below shows when filtering col1 NOT in list ['a'] the col1 rows with null 
> are missing:
> test_sdf.filter(sf.col("col1").isin(["a"]) == False).show()
> Or:
> test_sdf.filter(~sf.col("col1").isin(["a"])).show()
> *Expecting*:
>  |col1|col2|
>  |null|   0|
>  |null|   1|
>  |   b|   3|
>  |   c|   4|
> *Got*:
>  |col1|col2|
>  |   b|   3|
>  |   c|   4|
> My workarounds:
> 1.  null is considered 'in', so add OR isNull conditon:
> test_sdf.filter((sf.col("col1").isin(["a"])== False) | (
> sf.col("col1").isNull())).show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> 2.  Use left join and filter
> join_df = pd.DataFrame({"col1": ["a"],
> "isin": 1
> })
> join_sdf = spark.createDataFrame(join_df)
> test_sdf.join(join_sdf, on="col1", how="left") \
> .filter(sf.col("isin").isNull()) \
> .show()
> To get:
>  |col1|col2|isin|
>  |null|   0|null|
>  |null|   1|null|
>  |   c|   4|null|
>  |   b|   3|null|
> Thank you



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22301) Add rule to Optimizer for In with empty list of values

2017-10-17 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22301:
---

 Summary: Add rule to Optimizer for In with empty list of values
 Key: SPARK-22301
 URL: https://issues.apache.org/jira/browse/SPARK-22301
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido


For performance reasons, we should resolve an {{In}} operation with an empty 
list of values to {{false}} during the optimization phase.
For further reference, please look at the discussion on PRs: 
https://github.com/apache/spark/pull/19522 and 
https://github.com/apache/spark/pull/19494.
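A minimal sketch of what such a rule could look like (names, placement, and 
null handling are assumptions here, not the final implementation):

{code:java}
import org.apache.spark.sql.catalyst.expressions.{In, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// rewrite `x IN ()` to a false literal so the predicate is folded away at optimization time
object OptimizeEmptyIn extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case In(_, values) if values.isEmpty => Literal.create(false, BooleanType)
  }
}
{code}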



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22520) Support code generation also for complex CASE WHEN

2017-11-14 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22520:
---

 Summary: Support code generation also for complex CASE WHEN
 Key: SPARK-22520
 URL: https://issues.apache.org/jira/browse/SPARK-22520
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido
Priority: Minor


Code generation is disabled for CaseWhen when the number of branches is higher 
than {{spark.sql.codegen.maxCaseBranches}} (which defaults to 20). This was 
done in SPARK-13242 to prevent the well known 64KB method limit exception.

This ticket proposes to support code generation also in those cases (without 
causing exceptions of course). As a side effect, we could get rid of the 
{{spark.sql.codegen.maxCaseBranches}} configuration.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19268) File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta

2017-11-28 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268602#comment-16268602
 ] 

Marco Gaido commented on SPARK-19268:
-

In my case, deleting `_spark_metadata` solved the issue. Thus this is likely 
caused by a bad state of the `_spark_metadata` dir. [~zsxwing] Should we 
reopen this or create a new ticket?

> File does not exist: 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
> --
>
> Key: SPARK-19268
> URL: https://issues.apache.org/jira/browse/SPARK-19268
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: - hadoop2.7
> - Java 7
>Reporter: liyan
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.1.1, 2.2.0
>
>
> bq. ./run-example sql.streaming.JavaStructuredKafkaWordCount 
> 192.168.3.110:9092 subscribe topic03
> when i run the spark example raises the following error:
> {quote}
> Exception in thread "main" 17/01/17 14:13:41 DEBUG ContextCleaner: Got 
> cleaning task CleanBroadcast(4)
> org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
> stage failure: Task 2 in stage 9.0 failed 1 times, most recent failure: Lost 
> task 2.0 in stage 9.0 (TID 46, localhost, executor driver): 
> java.lang.IllegalStateException: Error reading delta file 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta of 
> HDFSStateStoreProvider[id = (op=0, part=2), dir = 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2]: 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta does 
> not exist
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:354)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:306)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:303)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:303)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:302)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:302)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:151)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: File does not exist: 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
>   at 
> 

[jira] [Created] (SPARK-22635) FileNotFoundException again while reading ORC files containing special characters

2017-11-28 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22635:
---

 Summary: FileNotFoundException again while reading ORC files 
containing special characters
 Key: SPARK-22635
 URL: https://issues.apache.org/jira/browse/SPARK-22635
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.2.1, 2.3.0
Reporter: Marco Gaido


SPARK-22146 fixed the issue only for {{inferSchema}}, i.e. only for schema 
inference, but it doesn't fix the problem when actually reading the data. Thus 
nearly the same exception happens when someone tries to use the data.

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
98, host-172-22-127-77.example.com, executor 3): java.io.FileNotFoundException: 
File does not exist: 
hdfs://XXX/tmp/aaa/start=2017-11-27%2009%253A30%253A00/part-0-c1477c9f-9d48-4341-89de-81056b6b618e.c000.snappy.orc
It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}
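
A minimal repro sketch, under assumptions (an active {{SparkSession}} named {{spark}}, a writable scratch location, and an affected 2.2.x build); the path below is hypothetical:

{code:scala}
import org.apache.spark.sql.functions.lit

// A partition value containing spaces/colons ends up percent-encoded in the directory name.
val path = "/tmp/SPARK-22635-repro"
spark.range(10)
  .withColumn("start", lit("2017-11-27 09:30:00"))
  .write.mode("overwrite").partitionBy("start").format("orc").save(path)

val df = spark.read.format("orc").load(path)
df.printSchema()  // schema inference works after SPARK-22146
df.show()         // reading the rows can still hit the FileNotFoundException above
{code}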



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22627) Fix formatting of headers in configuration.html page

2017-11-28 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268712#comment-16268712
 ] 

Marco Gaido commented on SPARK-22627:
-

This should be fixed by SPARK-19106; I think it is a duplicate. However, 
SPARK-19106 has 2.2.0 as its target version, while this issue reports 2.2.0 as 
affected. [~srowen], do you know why 2.2.0 is affected?

> Fix formatting of headers in configuration.html page
> 
>
> Key: SPARK-22627
> URL: https://issues.apache.org/jira/browse/SPARK-22627
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>
> On the page https://spark.apache.org/docs/latest/configuration.html one can 
> see headers in the HTML which look like left overs from the conversion from 
> Markdown:
> {code}
> ### Execution Behavior
> ...
> ### Networking
> ...
> ### Scheduling
> ...
> etc...
> {code}
> The paragraph with the most formatting problems is 
> {code}
> ### Cluster Managers Each cluster manager in Spark has additional 
> configuration options. Configurations can be found on the pages for each 
> mode:  [YARN](running-on-yarn.html#configuration)  
> [Mesos](running-on-mesos.html#configuration)  [Standalone 
> Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables 
> ...
> {code}
> As a reader of the documentation I want the headers in the text to be 
> formatted correctly and not to show raw Markdown syntax. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22631) Consolidate all configuration properties into one page

2017-11-28 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22631.
-
Resolution: Duplicate

> Consolidate all configuration properties into one page
> --
>
> Key: SPARK-22631
> URL: https://issues.apache.org/jira/browse/SPARK-22631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>
> The page https://spark.apache.org/docs/2.2.0/configuration.html gives the 
> impression as if all configuration properties of Spark are described on this 
> page. Unfortunately this is not true. The description of important properties 
> is spread through the documentation. The following pages list properties, 
> which are not described on the configuration page: 
> https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#performance-tuning
> https://spark.apache.org/docs/2.2.0/monitoring.html#spark-configuration-options
> https://spark.apache.org/docs/2.2.0/security.html#ssl-configuration
> https://spark.apache.org/docs/2.2.0/sparkr.html#starting-up-from-rstudio
> https://spark.apache.org/docs/2.2.0/running-on-yarn.html#spark-properties
> https://spark.apache.org/docs/2.2.0/running-on-mesos.html#configuration
> https://spark.apache.org/docs/2.2.0/spark-standalone.html#cluster-launch-scripts
> As a reader of the documentation I would like to have a single central webpage 
> describing all Spark configuration properties. Alternatively it would be nice 
> to at least add links from the configuration page to the other pages of the 
> documentation, where configuration properties are described. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22627) Fix formatting of headers in configuration.html page

2017-11-28 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268869#comment-16268869
 ] 

Marco Gaido commented on SPARK-22627:
-

[~srowen] the issue no longer seems to be present on branch-2.2 (I checked the 
web page from the latest 2.2.1 release candidate), but it is strange that 2.2.0 
is marked as affected.

> Fix formatting of headers in configuration.html page
> 
>
> Key: SPARK-22627
> URL: https://issues.apache.org/jira/browse/SPARK-22627
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>Priority: Minor
>
> On the page https://spark.apache.org/docs/latest/configuration.html one can 
> see headers in the HTML which look like left overs from the conversion from 
> Markdown:
> {code}
> ### Execution Behavior
> ...
> ### Networking
> ...
> ### Scheduling
> ...
> etc...
> {code}
> The paragraph with the most formatting problems is 
> {code}
> ### Cluster Managers Each cluster manager in Spark has additional 
> configuration options. Configurations can be found on the pages for each 
> mode:  [YARN](running-on-yarn.html#configuration)  
> [Mesos](running-on-mesos.html#configuration)  [Standalone 
> Mode](spark-standalone.html#cluster-launch-scripts) # Environment Variables 
> ...
> {code}
> As a reader of the documentation I want the headers in the text to be 
> formatted correctly and not to show raw Markdown syntax. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22609.
-
Resolution: Invalid

> Reuse CodeGeneration.nullSafeExec when possible
> ---
>
> Key: SPARK-22609
> URL: https://issues.apache.org/jira/browse/SPARK-22609
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> There are several places in the code where `CodeGeneration.nullSafeExec` 
> could be used, but it is not. This makes the generated code contain a lot 
> of useless blocks like:
> {code}
> if (!false) {
>   // some code here
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido closed SPARK-22609.
---

> Reuse CodeGeneration.nullSafeExec when possible
> ---
>
> Key: SPARK-22609
> URL: https://issues.apache.org/jira/browse/SPARK-22609
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Trivial
>
> There are several places in the code where `CodeGeneration.nullSafeExec` 
> could be used, but it is not. This makes the generated code contain a lot 
> of useless blocks like:
> {code}
> if (!false) {
>   // some code here
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache

2017-11-22 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262395#comment-16262395
 ] 

Marco Gaido commented on SPARK-22575:
-

You can use `UNCACHE TABLE` to remove them from cache if you have cached with 
`CACHE TABLE`.
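
For reference, a minimal sketch of the suggestion (the table name is hypothetical; the same SQL statements can also be issued through beeline against the Thrift Server):

{code:scala}
// Cache a table explicitly, use it, then release the cached blocks once done with it.
spark.sql("CACHE TABLE my_table")
spark.sql("SELECT COUNT(*) FROM my_table").show()
spark.sql("UNCACHE TABLE my_table")  // drops the cached data for this table
{code}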

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22582) Spark SQL round throws error with negative precision

2017-11-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16264582#comment-16264582
 ] 

Marco Gaido commented on SPARK-22582:
-

I tried to run
{code}
spark.sql("select round(100.1 , 1) as c3, round(100.1 , -1) as c5").show
{code}
on branch master and it works. Could you try to reproduce the error on a newer 
Spark version?

> Spark SQL round throws error with negative precision
> 
>
> Key: SPARK-22582
> URL: https://issues.apache.org/jira/browse/SPARK-22582
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuxin Cao
>
> select  round(100.1 , 1) as c3,
> round(100.1 , -1) as c5 from orders;
> Error: java.lang.IllegalArgumentException: Error: name expected at the 
> position 10 of 'decimal(4,-1)' but '-' is found. (state=,code=0)
> The same query works fine in Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22609) Reuse CodeGeneration.nullSafeExec when possible

2017-11-26 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22609:
---

 Summary: Reuse CodeGeneration.nullSafeExec when possible
 Key: SPARK-22609
 URL: https://issues.apache.org/jira/browse/SPARK-22609
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido
Priority: Trivial


There are several places in the code where `CodeGeneration.nullSafeExec` could 
be used, but it is not. This makes the generated code contain a lot of useless 
blocks like:
{code}
if (!false) {
  // some code here
}
{code}
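
To illustrate the point, a standalone sketch that mirrors the helper's described behavior (plain Scala, not the actual Catalyst code): the null check is emitted only when the input can actually be null, so non-nullable inputs never get the {{if (!false)}} wrapper.

{code:scala}
// Simplified stand-in for the nullSafeExec helper described above.
def nullSafeExec(nullable: Boolean, isNull: String)(execute: String): String =
  if (nullable) s"if (!$isNull) {\n  $execute\n}" else execute

// Non-nullable input: the body is emitted as-is, with no useless guard.
println(nullSafeExec(nullable = false, isNull = "false")("value = input + 1;"))
// Nullable input: the guard tests the real isNull variable of the child expression.
println(nullSafeExec(nullable = true, isNull = "isNull_0")("value = input + 1;"))
{code}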



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19268) File does not exist: /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta

2017-11-27 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266544#comment-16266544
 ] 

Marco Gaido commented on SPARK-19268:
-

[~zsxwing] I am hitting this too, and I am running 2.2.0. My code looks very 
similar to that of the 
{{org.apache.spark.examples.sql.streaming.StructuredNetworkWordCountWindowed}} 
example.

> File does not exist: 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
> --
>
> Key: SPARK-19268
> URL: https://issues.apache.org/jira/browse/SPARK-19268
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: - hadoop2.7
> - Java 7
>Reporter: liyan
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.1.1, 2.2.0
>
>
> bq. ./run-example sql.streaming.JavaStructuredKafkaWordCount 
> 192.168.3.110:9092 subscribe topic03
> when i run the spark example raises the following error:
> {quote}
> Exception in thread "main" 17/01/17 14:13:41 DEBUG ContextCleaner: Got 
> cleaning task CleanBroadcast(4)
> org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
> stage failure: Task 2 in stage 9.0 failed 1 times, most recent failure: Lost 
> task 2.0 in stage 9.0 (TID 46, localhost, executor driver): 
> java.lang.IllegalStateException: Error reading delta file 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta of 
> HDFSStateStoreProvider[id = (op=0, part=2), dir = 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2]: 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta does 
> not exist
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:354)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:306)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:303)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:303)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:302)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:302)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:151)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.FileNotFoundException: File does not exist: 
> /tmp/temporary-157b89c1-27bb-49f3-a70c-ca1b75022b4d/state/0/2/1.delta
>   at 
> 

[jira] [Created] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions

2017-12-04 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22684:
---

 Summary: Avoid the generation of useless mutable states by 
datetime functions
 Key: SPARK-22684
 URL: https://issues.apache.org/jira/browse/SPARK-22684
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


Some datetime functions define mutable states which are not needed at all. 
This is bad because of the well-known issues related to constant pool limits.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22669) Avoid unnecessary function calls in code generation

2017-12-01 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22669:
---

 Summary: Avoid unnecessary function calls in code generation
 Key: SPARK-22669
 URL: https://issues.apache.org/jira/browse/SPARK-22669
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


In many parts of the code generation codebase, we split the generated code into 
extra methods to avoid exceptions due to the 64KB method size limit. This 
generates a lot of methods which are called every time, even though sometimes 
this is not needed. As pointed out in 
https://github.com/apache/spark/pull/19752#discussion_r153081547, this is a 
non-negligible overhead which can be avoided.

In this JIRA, I propose to use the same approach throughout all the other 
cases, when possible. I am going to submit a PR soon.
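
A standalone sketch of the idea (plain Scala, not the Catalyst code; the 1024-character threshold and names are arbitrary): only factor generated code out into extra methods when the combined body actually approaches the size limit, so short expressions keep a single inlined body with no extra calls.

{code:scala}
// Toy version of "split only when needed".
def maybeSplit(blocks: Seq[String], limit: Int = 1024): String = {
  val inlined = blocks.mkString("\n")
  if (inlined.length < limit) {
    inlined  // small enough: keep everything inline, no extra function calls at runtime
  } else {
    val helpers = blocks.zipWithIndex.map { case (body, i) =>
      s"private void apply_$i(InternalRow i) { $body }"
    }
    val calls = blocks.indices.map(i => s"apply_$i(i);")
    (helpers ++ calls).mkString("\n")
  }
}

println(maybeSplit(Seq("int x = 1;", "int y = x + 1;")))  // stays inline
{code}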



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22693) Avoid the generation of useless mutable states in complexTypeCreator and predicates

2017-12-05 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22693:
---

 Summary: Avoid the generation of useless mutable states in 
complexTypeCreator and predicates
 Key: SPARK-22693
 URL: https://issues.apache.org/jira/browse/SPARK-22693
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


InSet and CreateNamedStruct define mutable states which are not needed. 
This is bad because of the well-known issues related to constant pool limits.

I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22698) Avoid the generation of useless mutable states by GenerateUnsafeProjection

2017-12-05 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22698:
---

 Summary: Avoid the generation of useless mutable states by 
GenerateUnsafeProjection
 Key: SPARK-22698
 URL: https://issues.apache.org/jira/browse/SPARK-22698
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


GenerateUnsafeProjection defines mutable states which are not needed at all. 
This is bad because of the well-known issues related to constant pool limits.

I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22699) Avoid the generation of useless mutable states by GenerateSafeProjection

2017-12-05 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22699:
---

 Summary: Avoid the generation of useless mutable states by 
GenerateSafeProjection
 Key: SPARK-22699
 URL: https://issues.apache.org/jira/browse/SPARK-22699
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


GenerateSafeProjection defines mutable states which are not needed. This is 
bad because of the well-known issues related to constant pool limits.

I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22697) Avoid the generation of useless mutable states by GenerateMutableProjection

2017-12-05 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22697:
---

 Summary: Avoid the generation of useless mutable states by 
GenerateMutableProjection
 Key: SPARK-22697
 URL: https://issues.apache.org/jira/browse/SPARK-22697
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


GenerateMutableProjection defines mutable states which are not needed at all. 
This is bad because of the well-known issues related to constant pool limits.

I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22694) Avoid the generation of useless mutable states by regexp functions

2017-12-05 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22694:
---

 Summary: Avoid the generation of useless mutable states by regexp 
functions
 Key: SPARK-22694
 URL: https://issues.apache.org/jira/browse/SPARK-22694
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


Some regexp functions define mutable states which are not needed. This is 
bad because of the well-known issues related to constant pool limits.

I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22692) Reduce the number of generated mutable states

2017-12-05 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22692:
---

 Summary: Reduce the number of generated mutable states
 Key: SPARK-22692
 URL: https://issues.apache.org/jira/browse/SPARK-22692
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


A large number of mutable states can cause an error during code generation by 
exceeding the constant pool limit. There is an ongoing effort in SPARK-18016 to 
fix the problem; nonetheless, we can also alleviate it by avoiding the creation 
of global variables when they are not needed.

Therefore I am creating this umbrella ticket to track the elimination of global 
variables where they are not needed. This is not a duplicate of or an 
alternative to SPARK-18016: it is a complementary effort which, together with 
it, can help support wider datasets.
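
To make the motivation concrete, a sketch of the kind of query that stresses the constant pool (assumes an active {{SparkSession}} named {{spark}}; whether it actually fails depends on the Spark version and the number of expressions):

{code:scala}
// Each RegExpReplace instance used to register several global mutable states (see SPARK-22694),
// so a projection with thousands of them inflates the generated class and its constant pool.
val exprs = (1 to 2000).map(i => s"regexp_replace(cast(id AS string), 'a', 'b') AS c$i")
val wide = spark.range(10).selectExpr(exprs: _*)
wide.collect()  // on affected versions this kind of query can fail with a constant pool error
{code}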



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22684) Avoid the generation of useless mutable states by datetime functions

2017-12-05 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-22684:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-22692

> Avoid the generation of useless mutable states by datetime functions
> 
>
> Key: SPARK-22684
> URL: https://issues.apache.org/jira/browse/SPARK-22684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>
> Some datetime functions define mutable states which are not needed at all. 
> This is bad because of the well-known issues related to constant pool limits.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22696) Avoid the generation of useless mutable states by objects functions

2017-12-05 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-22696:

Summary: Avoid the generation of useless mutable states by objects 
functions  (was: void the generation of useless mutable states by objects 
functions)

> Avoid the generation of useless mutable states by objects functions
> ---
>
> Key: SPARK-22696
> URL: https://issues.apache.org/jira/browse/SPARK-22696
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>
> Some objects functions define mutable states which are not needed. This 
> is bad because of the well-known issues related to constant pool limits.
> I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22695) Avoid the generation of useless mutable states by scalaUDF

2017-12-05 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22695:
---

 Summary: Avoid the generation of useless mutable states by scalaUDF
 Key: SPARK-22695
 URL: https://issues.apache.org/jira/browse/SPARK-22695
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


ScalaUDF defines mutable states which are not needed. This is bad because of 
the well-known issues related to constant pool limits.

I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22696) void the generation of useless mutable states by objects functions

2017-12-05 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22696:
---

 Summary: void the generation of useless mutable states by objects 
functions
 Key: SPARK-22696
 URL: https://issues.apache.org/jira/browse/SPARK-22696
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


Some objects functions define mutable states which are not needed. This is 
bad because of the well-known issues related to constant pool limits.

I will submit a PR soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293716#comment-16293716
 ] 

Marco Gaido commented on SPARK-22806:
-

This is the right behavior; Postgres works the same way. If you specify an 
ORDER BY clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND 
CURRENT ROW.
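
For completeness, a sketch of how to keep the ORDER BY but still aggregate over the whole partition by making the frame explicit (assumes a DataFrame {{df}} with the columns {{key}}, {{partition}} and {{value}} used in the report below):

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, sum}

val w = Window.partitionBy("partition").orderBy("key")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// Both rows now see the whole partition again: count = 2, sum = 300.0.
df.select(col("key"), count("value").over(w), sum("value").over(w)).show()
{code}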

> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
> Attachments: WindowFunctionsWithGroupByError.scala
>
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered ("Window.partitionBy(column).orderBy(column)") and 
> when it is not ordered ("Window.partitionBy(column)").
> Example:
> {code:java}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition")),
> sum("value").over(Window.partitionBy("partition")),
> stddev_pop("value").over(Window.partitionBy("partition"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
>   test("count, sum, stddev_pop functions over ordered by window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition").orderBy("key")),
> sum("value").over(Window.partitionBy("partition").orderBy("key")),
> 
> stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
> {code}
> The "count, sum, stddev_pop functions over ordered by window" fails with the 
> error:
> {noformat}
> == Results ==
> !== Correct Answer - 2 ==   == Spark Answer - 2 ==
> !struct<>   struct<key:string,count(value) OVER (PARTITION BY partition ORDER BY key ASC NULLS 
> FIRST unspecifiedframe$()):bigint,sum(value) OVER (PARTITION BY partition ORDER BY key ASC NULLS 
> FIRST unspecifiedframe$()):double,stddev_pop(value) OVER (PARTITION BY partition ORDER BY key ASC 
> NULLS FIRST unspecifiedframe$()):double>
> ![a,2,300.0,50.0]   [a,1,100.0,0.0]
>  [b,2,300.0,50.0]   [b,2,300.0,50.0]
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22806) Window Aggregate functions: unexpected result at ordered partition

2017-12-16 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22806.
-
Resolution: Invalid

> Window Aggregate functions: unexpected result at ordered partition
> --
>
> Key: SPARK-22806
> URL: https://issues.apache.org/jira/browse/SPARK-22806
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Attila Zsolt Piros
> Attachments: WindowFunctionsWithGroupByError.scala
>
>
> I got different results for aggregate functions (even for sum and count) when 
> the partition is ordered ("Window.partitionBy(column).orderBy(column)") and 
> when it is not ordered ("Window.partitionBy(column)").
> Example:
> {code:java}
> test("count, sum, stddev_pop functions over window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition")),
> sum("value").over(Window.partitionBy("partition")),
> stddev_pop("value").over(Window.partitionBy("partition"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
>   test("count, sum, stddev_pop functions over ordered by window") {
> val df = Seq(
>   ("a", 1, 100.0),
>   ("b", 1, 200.0)).toDF("key", "partition", "value")
> df.createOrReplaceTempView("window_table")
> checkAnswer(
>   df.select(
> $"key",
> count("value").over(Window.partitionBy("partition").orderBy("key")),
> sum("value").over(Window.partitionBy("partition").orderBy("key")),
> 
> stddev_pop("value").over(Window.partitionBy("partition").orderBy("key"))
>   ),
>   Seq(
> Row("a", 2, 300.0, 50.0),
> Row("b", 2, 300.0, 50.0)))
>   }
> {code}
> The "count, sum, stddev_pop functions over ordered by window" fails with the 
> error:
> {noformat}
> == Results ==
> !== Correct Answer - 2 ==   == Spark Answer - 2 ==
> !struct<>   struct<key:string,count(value) OVER (PARTITION BY partition ORDER BY key ASC NULLS 
> FIRST unspecifiedframe$()):bigint,sum(value) OVER (PARTITION BY partition ORDER BY key ASC NULLS 
> FIRST unspecifiedframe$()):double,stddev_pop(value) OVER (PARTITION BY partition ORDER BY key ASC 
> NULLS FIRST unspecifiedframe$()):double>
> ![a,2,300.0,50.0]   [a,1,100.0,0.0]
>  [b,2,300.0,50.0]   [b,2,300.0,50.0]
> {noformat}
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22793) Memory leak in Spark Thrift Server

2017-12-15 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292213#comment-16292213
 ] 

Marco Gaido commented on SPARK-22793:
-

Have you checked whether the problem still exists on the current master branch?

> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-22793
> URL: https://issues.apache.org/jira/browse/SPARK-22793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: zuotingbing
>Priority: Critical
>
> 1. Start HiveThriftServer2.
> 2. Connect to thriftserver through beeline.
> 3. Close the beeline.
> 4. repeat step2 and step 3 for several times, which caused the leak of Memory.
> we found there are many directories never be dropped under the path
> {code:java}
> hive.exec.local.scratchdir
> {code} and 
> {code:java}
> hive.exec.scratchdir
> {code} , as we know the scratchdir has been added to deleteOnExit when it be 
> created. So it means that the cache size of FileSystem deleteOnExit will keep 
> increasing until JVM terminated.
> In addition, we use 
> {code:java}
> jmap -histo:live [PID]
> {code} to printout the size of objects in HiveThriftServer2 Process, we can 
> find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and 
> "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though 
> we closed all the beeline connections, which caused the leak of Memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22799) Bucketizer should throw exception if single- and multi-column params are both set

2017-12-15 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292540#comment-16292540
 ] 

Marco Gaido commented on SPARK-22799:
-

May I work on this?

> Bucketizer should throw exception if single- and multi-column params are both 
> set
> -
>
> Key: SPARK-22799
> URL: https://issues.apache.org/jira/browse/SPARK-22799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>
> See the related discussion: 
> https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22773) Empty arrays are not equal after transformation

2017-12-13 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16289491#comment-16289491
 ] 

Marco Gaido commented on SPARK-22773:
-

It is not a bug: {{res}} is not an empty array, it contains a single empty 
string. What we might do is improve the {{toString}} of strings, so that these 
things are more evident.
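
A quick standalone check of the point above (plain Scala; the name {{parts}} is just for illustration):

{code:scala}
// Splitting an empty string does not give an empty array; it gives Array("").
val parts = "".split(' ')
println(parts.length)                            // 1
println(parts.sameElements(Array("")))           // true
println(parts.sameElements(Array.empty[String])) // false
{code}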

> Empty arrays are not equal after transformation
> ---
>
> Key: SPARK-22773
> URL: https://issues.apache.org/jira/browse/SPARK-22773
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Scala
>Reporter: Laurent Legrand
>Priority: Minor
>
> The comparison of a transformed column with another one gives inconsistent 
> results when cols contain empty arrays.
> In the following code, the column "equals" of the DF "diff" should have all 
> values to true. But two are false.
> {code:scala}
>  import org.apache.spark.ml.feature.StopWordsRemover
>   
> val tf = new 
> StopWordsRemover().setInputCol("in").setOutputCol("out").setStopWords(Array("a",
>  "b"))
> val df = spark.createDataFrame(Seq(
>   ("foo bar".split(' '), "foo bar".split(' ')),
>   ("foo a bar".split(' '), "foo bar".split(' ')),
>   ("foo bar b".split(' '), "foo bar".split(' ')),
>   ("a foo bar".split(' '), "foo bar".split(' ')),
>   ("a b".split(' '), "".split(' ')),
>   ("a".split(' '), "".split(' ')),
>   ("".split(' '), "".split(' '))
> )).toDF("in", "res")
> 
> val res = tf.transform(df)
> res.show()
> 
> val diff = res.withColumn("equals", res("res") === res("out"))
> 
> diff.show()
> 
> diff.printSchema()
> 
> println(diff.filter(diff("equals") === false).count())
> {code}
> It gives:
> {{+-------------+----------+----------+
> |           in|       res|       out|
> +-------------+----------+----------+
> |   [foo, bar]|[foo, bar]|[foo, bar]|
> |[foo, a, bar]|[foo, bar]|[foo, bar]|
> |[foo, bar, b]|[foo, bar]|[foo, bar]|
> |[a, foo, bar]|[foo, bar]|[foo, bar]|
> |       [a, b]|        []|        []|
> |          [a]|        []|        []|
> |           []|        []|        []|
> +-------------+----------+----------+
> +-------------+----------+----------+------+
> |           in|       res|       out|equals|
> +-------------+----------+----------+------+
> |   [foo, bar]|[foo, bar]|[foo, bar]|  true|
> |[foo, a, bar]|[foo, bar]|[foo, bar]|  true|
> |[foo, bar, b]|[foo, bar]|[foo, bar]|  true|
> |[a, foo, bar]|[foo, bar]|[foo, bar]|  true|
> |       [a, b]|        []|        []| false|
> |          [a]|        []|        []| false|
> |           []|        []|        []|  true|
> +-------------+----------+----------+------+
> root
>  |-- in: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- res: array (nullable = true)
>  ||-- element: string (containsNull = true)
>  |-- out: array (nullable = true)
>  ||-- element: string (containsNull = true)
> 2
> }}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22752) FileNotFoundException while reading from Kafka

2017-12-14 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290659#comment-16290659
 ] 

Marco Gaido commented on SPARK-22752:
-

Thanks [~zsxwing], you are right. I am closing this as a duplicate.

> FileNotFoundException while reading from Kafka
> --
>
> Key: SPARK-22752
> URL: https://issues.apache.org/jira/browse/SPARK-22752
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> We are running a stateful structured streaming job which reads from Kafka and 
> writes to HDFS. And we are hitting this exception:
> {noformat}
> 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): 
> java.lang.IllegalStateException: Error reading delta file 
> /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, 
> part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta 
> does not exist
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
> {noformat}
> Of course, the file doesn't exist in HDFS. And in the {{state/0/0}} directory 
> there is no file at all. While we have some files in the commits and offsets 
> folders. I am not sure about the reason of this behavior. It seems to happen 
> on the second time the job is started, after the first one failed. So it 
> looks like task failures can generate it. Or it might be related to 
> watermarks, since there are some problems related to the incoming data for 
> which the watermark was filtering all the incoming data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22752) FileNotFoundException while reading from Kafka

2017-12-14 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22752.
-
Resolution: Duplicate

> FileNotFoundException while reading from Kafka
> --
>
> Key: SPARK-22752
> URL: https://issues.apache.org/jira/browse/SPARK-22752
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> We are running a stateful structured streaming job which reads from Kafka and 
> writes to HDFS. And we are hitting this exception:
> {noformat}
> 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): 
> java.lang.IllegalStateException: Error reading delta file 
> /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, 
> part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta 
> does not exist
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
> {noformat}
> Of course, the file doesn't exist in HDFS. And in the {{state/0/0}} directory 
> there is no file at all. While we have some files in the commits and offsets 
> folders. I am not sure about the reason of this behavior. It seems to happen 
> on the second time the job is started, after the first one failed. So it 
> looks like task failures can generate it. Or it might be related to 
> watermarks, since there are some problems related to the incoming data for 
> which the watermark was filtering all the incoming data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22841) Select regexp_extract from table with where clause having is null throws indexoutofbounds exception

2017-12-20 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298134#comment-16298134
 ] 

Marco Gaido commented on SPARK-22841:
-

I am not able to reproduce this on current master. Can you check whether you 
can reproduce it there? If not, this has likely been fixed already.

> Select regexp_extract from table with where clause having is null throws 
> indexoutofbounds exception
> ---
>
> Key: SPARK-22841
> URL: https://issues.apache.org/jira/browse/SPARK-22841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Chetan Bhat
>
> Steps :
> Thrift server is started using the command - bin/spark-submit --master 
> yarn-client --executor-memory 10G --executor-cores 5 --driver-memory 5G 
> --num-executors 3 --class 
> org.apache.carbondata.spark.thriftserver.CarbonThriftServer 
> /srv/spark2.2Bigdata/install/spark/sparkJdbc/carbonlib/carbondata_2.11-1.3.0-SNAPSHOT-shade-hadoop2.7.2.jar
>  "hdfs://hacluster/user/sparkhive/warehouse"
> Spark shell is launched using the command - bin/spark-shell --master 
> yarn-client --executor-memory 10G --executor-cores 5 --driver-memory 5G 
> --num-executors 3 --jars 
> /srv/spark2.2Bigdata/install/spark/sparkJdbc/carbonlib/carbondata_2.11-1.3.0-SNAPSHOT-shade-hadoop2.7.2.jar
> From Spark shell the streaming table is created and data is loaded to the 
> streaming table.
> import java.io.
> {File, PrintWriter}
> import java.net.ServerSocket
> import org.apache.spark.sql.
> {CarbonEnv, SparkSession}
> import org.apache.spark.sql.hive.CarbonRelation
> import org.apache.spark.sql.streaming.
> {ProcessingTime, StreamingQuery}
> import org.apache.carbondata.core.constants.CarbonCommonConstants
> import org.apache.carbondata.core.util.CarbonProperties
> import org.apache.carbondata.core.util.path.
> {CarbonStorePath, CarbonTablePath}
> CarbonProperties.getInstance().addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT,
>  "/MM/dd")
> import org.apache.spark.sql.CarbonSession._
> val carbonSession = SparkSession.
> builder().
> appName("StreamExample").
> getOrCreateCarbonSession("hdfs://hacluster/user/hive/warehouse/carbon.store")
> carbonSession.sparkContext.setLogLevel("INFO")
> def sql(sql: String) = carbonSession.sql(sql)
> def writeSocket(serverSocket: ServerSocket): Thread = {
> val thread = new Thread() {
> override def run(): Unit = {
> // wait for client to connection request and accept
> val clientSocket = serverSocket.accept()
> val socketWriter = new PrintWriter(clientSocket.getOutputStream())
> var index = 0
> for (_ <- 1 to 1000) {
> // write 5 records per iteration
> for (_ <- 0 to 100)
> { index = index + 1 socketWriter.println(index.toString + ",name_" + index + 
> ",city_" + index + "," + (index * 1.00).toString + ",school_" + index + 
> ":school_" + index + index + "$" + index) }
> socketWriter.flush()
> Thread.sleep(2000)
> }
> socketWriter.close()
> System.out.println("Socket closed")
> }
> }
> thread.start()
> thread
> }
> def startStreaming(spark: SparkSession, tablePath: CarbonTablePath, 
> tableName: String, port: Int): Thread = {
> val thread = new Thread() {
> override def run(): Unit = {
> var qry: StreamingQuery = null
> try
> { val readSocketDF = spark.readStream .format("socket") .option("host", 
> "10.18.98.34") .option("port", port) .load() qry = readSocketDF.writeStream 
> .format("carbondata") .trigger(ProcessingTime("5 seconds")) 
> .option("checkpointLocation", tablePath.getStreamingCheckpointDir) 
> .option("tablePath", tablePath.getPath).option("tableName", tableName) 
> .start() qry.awaitTermination() }
> catch
> { case ex: Throwable => ex.printStackTrace() println("Done reading and 
> writing streaming data") }
> finally
> { qry.stop() }
> }
> }
> thread.start()
> thread
> }
> val streamTableName = "uniqdata"
> sql(s"CREATE TABLE uniqdata (CUST_ID int,CUST_NAME String,ACTIVE_EMUI_VERSION 
> string, DOB timestamp, DOJ timestamp, BIGINT_COLUMN1 bigint,BIGINT_COLUMN2 
> bigint,DECIMAL_COLUMN1 decimal(30,10), DECIMAL_COLUMN2 
> decimal(36,36),Double_COLUMN1 double, Double_COLUMN2 double,INTEGER_COLUMN1 
> int) STORED BY 'org.apache.carbondata.format' 
> TBLPROPERTIES('streaming'='true')")
> sql(s"LOAD DATA INPATH 'hdfs://hacluster/chetan/2000_UniqData.csv' into table 
> uniqdata OPTIONS( 
> 'BAD_RECORDS_ACTION'='FORCE','FILEHEADER'='CUST_ID,CUST_NAME,ACTIVE_EMUI_VERSION,DOB,DOJ,BIGINT_COLUMN1,BIGINT_COLUMN2,DECIMAL_COLUMN1,DECIMAL_COLUMN2,Double_COLUMN1,Double_COLUMN2,INTEGER_COLUMN1')")
> val carbonTable = CarbonEnv.getInstance(carbonSession).carbonMetastore.
> lookupRelation(Some("default"), 
> streamTableName)(carbonSession).asInstanceOf[CarbonRelation].carbonTable
> val tablePath = 
> 

[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-17 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257088#comment-16257088
 ] 

Marco Gaido commented on SPARK-22516:
-

I am not sure why, but this is caused by the fact that your file contains "CR LF" 
as the line separator instead of only LF.
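
For anyone hitting this before a proper fix lands, a minimal workaround sketch, 
assuming the file is copied to a local path (the paths below are made up): 
normalize the CR LF line endings to LF before handing the file to the CSV reader.

{code}
// Hypothetical local paths; adjust to wherever testCommentChar.csv actually lives.
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

val in = Paths.get("/tmp/testCommentChar.csv")
val out = Paths.get("/tmp/testCommentChar_lf.csv")

// Replace every CR LF sequence with a plain LF and write the result back out.
val normalized = new String(Files.readAllBytes(in), StandardCharsets.UTF_8).replace("\r\n", "\n")
Files.write(out, normalized.getBytes(StandardCharsets.UTF_8))
{code}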

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv
>
>
> Try to read attached CSV file with following parse properties,
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/test
> CommentChar.csv");   *
>   
>   
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]  
>   
>  
>   
>   
>  
> scala> csvFile.show   
>   
>  
> +---+---+ 
>   
>  
> |  a|  b| 
>   
>  
> +---+---+ 
>   
>  
> +---+---+   
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. This 
> happens only if the "comment" character equals the first character of the input 
> dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote escape character=\
> Quote escape escape character=null
> Internal state when error was thrown: line=3, column=0, record=1, charIndex=19
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:475)
> at 
> 

[jira] [Commented] (SPARK-22493) sql null checks for Double.NaN do not work

2017-11-10 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247660#comment-16247660
 ] 

Marco Gaido commented on SPARK-22493:
-

`NaN` is not `null`. They are different things. If you want to filter `NaN` you 
can use the SQL function `isnan`.
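
A minimal sketch of the suggested check, assuming a DataFrame with a double 
column called `value` (the data and column name are made up):

{code}
import org.apache.spark.sql.functions.{col, isnan}
import spark.implicits._

// Made-up data: one NaN row among ordinary doubles.
val df = Seq(1.0, Double.NaN, 3.0).toDF("value")

// A null check keeps the NaN row, because NaN is a regular non-null double value.
df.filter(col("value").isNotNull).count()   // returns 3

// isnan is the right predicate for filtering NaN out.
df.filter(!isnan(col("value"))).show()
{code}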

> sql null checks for Double.NaN do not work
> --
>
> Key: SPARK-22493
> URL: https://issues.apache.org/jira/browse/SPARK-22493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: derby version is 10.14.1.0
>Reporter: MIkhail Osckin
>Priority: Minor
> Attachments: sql_null.png
>
>
> I'm using Spark in standalone mode and DerbyDB for the Hive metastore. For some 
> reason I can't filter NaN values in my SQL queries.
> !sql_null.png|thumbnail!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22493) sql null checks for Double.NaN do not work

2017-11-10 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247660#comment-16247660
 ] 

Marco Gaido edited comment on SPARK-22493 at 11/10/17 3:39 PM:
---

{{NaN}} is not {{null}}. They are different things. If you want to filter 
{{NaN}} you can use the SQL function {{isnan}}.


was (Author: mgaido):
`NaN` is not `null`. They are different things. If you want to filter `NaN` you 
can use the SQL function `isnan`.

> sql null checks for Double.NaN do not work
> --
>
> Key: SPARK-22493
> URL: https://issues.apache.org/jira/browse/SPARK-22493
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: derby version is 10.14.1.0
>Reporter: MIkhail Osckin
>Priority: Minor
> Attachments: sql_null.png
>
>
> I'm using Spark in standalone mode and DerbyDB for the Hive metastore. For some 
> reason I can't filter NaN values in my SQL queries.
> !sql_null.png|thumbnail!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22494) Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode limit exception

2017-11-10 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-22494:
---

 Summary: Coalesce and AtLeastNNonNulls can cause 64KB JVM bytecode 
limit exception
 Key: SPARK-22494
 URL: https://issues.apache.org/jira/browse/SPARK-22494
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


Both `Coalesce` and `AtLeastNNonNulls` can cause the 64KB limit exception when 
used with a lot of arguments and/or complex expressions.
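
For reference, a sketch of the kind of query that can hit the limit; the column 
names and the argument count below are made up, and the exact threshold depends 
on the expressions involved:

{code}
// Generate a few thousand columns, then coalesce all of them in a single expression.
val n = 3000
val wide = spark.range(1).selectExpr((1 to n).map(i => s"id + $i AS c$i"): _*)
wide.selectExpr(s"coalesce(${(1 to n).map(i => s"c$i").mkString(", ")}) AS c").show()
{code}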



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22516) CSV Read breaks: When "multiLine" = "true", if "comment" option is set as last line's first character

2017-11-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16260904#comment-16260904
 ] 

Marco Gaido commented on SPARK-22516:
-

[~crkumaresh24] I can't reproduce the issue with the new file you have 
uploaded. I am running on OSX; maybe it depends on the OS:

{code}
scala> val a = spark.read.option("header","true").option("inferSchema", 
"true").option("multiLine", "true").option("comment", "c").option("parserLib", 
"univocity").csv("/Users/mgaido/Downloads/test_file_without_eof_char.csv")
a: org.apache.spark.sql.DataFrame = [abc: string, def: string]

scala> a.show
+---+---+
|abc|def|
+---+---+
|ghi|jkl|
+---+---+
{code}

> CSV Read breaks: When "multiLine" = "true", if "comment" option is set as 
> last line's first character
> -
>
> Key: SPARK-22516
> URL: https://issues.apache.org/jira/browse/SPARK-22516
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Kumaresh C R
>Priority: Minor
>  Labels: csvparser
> Attachments: testCommentChar.csv, test_file_without_eof_char.csv
>
>
> Try to read attached CSV file with following parse properties,
> scala> *val csvFile = 
> spark.read.option("header","true").option("inferSchema", 
> "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/test
> CommentChar.csv");   *
>   
>   
> csvFile: org.apache.spark.sql.DataFrame = [a: string, b: string]  
>   
>  
>   
>   
>  
> scala> csvFile.show   
>   
>  
> +---+---+ 
>   
>  
> |  a|  b| 
>   
>  
> +---+---+ 
>   
>  
> +---+---+   
> {color:#8eb021}*Noticed that it works fine.*{color}
> If we add the option "multiLine" = "true", it fails with the exception below. This 
> happens only if the "comment" character equals the first character of the input 
> dataset's last line.
> scala> val csvFile = 
> *spark.read.option("header","true").{color:red}{color:#d04437}option("multiLine","true"){color}{color}.option("inferSchema",
>  "true").option("parserLib", "univocity").option("comment", 
> "c").csv("hdfs://localhost:8020/testCommentChar.csv");*
> 17/11/14 14:26:17 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
> com.univocity.parsers.common.TextParsingException: 
> java.lang.IllegalArgumentException - Unable to skip 1 lines from line 2. End 
> of input reached
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> Header extraction enabled=null
> Headers=null
> Ignore leading whitespaces=false
> Ignore trailing whitespaces=false
> Input buffer size=128
> Input reading on separate thread=false
> Keep escape sequences=false
> Keep quotes=false
> Length of content displayed on error=-1
> Line separator detection enabled=false
> Maximum number of characters per column=-1
> Maximum number of columns=20480
> Normalize escaped line separators=true
> Null value=
> Number of records to read=all
> Processor=none
> Restricting data in exceptions=false
> RowProcessor error handler=null
> Selected fields=none
> Skip empty lines=true
> Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
> CsvFormat:
> Comment character=c
> Field delimiter=,
> Line separator (normalized)=\n
> Line separator sequence=\r\n
> Quote character="
> Quote 

[jira] [Commented] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative

2017-11-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261293#comment-16261293
 ] 

Marco Gaido commented on SPARK-22576:
-

Why do you expect locate to work like this rather than as it currently works?

> Spark SQL locate returns incorrect value when the start position is negative
> 
>
> Key: SPARK-22576
> URL: https://issues.apache.org/jira/browse/SPARK-22576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuxin Cao
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22576) Spark SQL locate returns incorrect value when the start position is negative

2017-11-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261347#comment-16261347
 ] 

Marco Gaido commented on SPARK-22576:
-

I see, but that is SAP Sybase. Why do you think Spark should behave like that? 
Hive behaves the same way as Spark does, so I don't see why Spark should work 
differently.

> Spark SQL locate returns incorrect value when the start position is negative
> 
>
> Key: SPARK-22576
> URL: https://issues.apache.org/jira/browse/SPARK-22576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yuxin Cao
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22575) Making Spark Thrift Server clean up its cache

2017-11-21 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261557#comment-16261557
 ] 

Marco Gaido commented on SPARK-22575:
-

Does it happen because you are caching some tables and never uncaching them?
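
If that turns out to be the cause, a minimal sketch of releasing the cached data 
explicitly (the table name is made up; the same statements can also be sent over 
the JDBC connection):

{code}
// Drop the cache for a specific table.
spark.sql("UNCACHE TABLE my_table")

// Or clear every cached table/dataset in the application.
spark.catalog.clearCache()
{code}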

> Making Spark Thrift Server clean up its cache
> -
>
> Key: SPARK-22575
> URL: https://issues.apache.org/jira/browse/SPARK-22575
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, SQL
>Affects Versions: 2.2.0
>Reporter: Oz Ben-Ami
>Priority: Minor
>  Labels: cache, dataproc, thrift, yarn
>
> Currently, Spark Thrift Server accumulates data in its appcache, even for old 
> queries. This fills up the disk (using over 100GB per worker node) within 
> days, and the only way to clear it is to restart the Thrift Server 
> application. Even deleting the files directly isn't a solution, as Spark then 
> complains about FileNotFound.
> I asked about this on [Stack 
> Overflow|https://stackoverflow.com/questions/46893123/how-can-i-make-spark-thrift-server-clean-up-its-cache]
>  a few weeks ago, but it does not seem to be currently doable by 
> configuration.
> Am I missing some configuration option, or some other factor here?
> Otherwise, can anyone point me to the code that handles this, so maybe I can 
> try my hand at a fix?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22501) 64KB JVM bytecode limit problem with in

2017-11-12 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248983#comment-16248983
 ] 

Marco Gaido commented on SPARK-22501:
-

[~kiszk] are you working on this or can I take it?

> 64KB JVM bytecode limit problem with in
> ---
>
> Key: SPARK-22501
> URL: https://issues.apache.org/jira/browse/SPARK-22501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>
> {{In}} can throw an exception due to the 64KB JVM bytecode limit when it is 
> used with a lot of arguments
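
For reference, a sketch of the kind of predicate that can trigger it; the table 
and the number of literals are made up:

{code}
// An IN list with thousands of literals; each literal contributes generated code.
spark.range(10).createOrReplaceTempView("t")
val n = 5000
spark.sql(s"SELECT * FROM t WHERE id IN (${(1 to n).mkString(", ")})").count()
{code}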



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22532) Spark SQL function 'drop_duplicates' throws error when passing in a column that is an element of a struct

2017-11-17 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257020#comment-16257020
 ] 

Marco Gaido commented on SPARK-22532:
-

The reason is that `header.eventId.lo` is not a column name; it is an 
`Expression`. It is as if you were using an arbitrary expression to transform a 
column (e.g. `a + b` or `coalesce(a, 1)`), which is not supported by the 
`dropDuplicates` operation.
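
A sketch of the same workaround described in the report below, written against 
the Scala API and assuming `df` is the DataFrame from the report: materialize 
the nested fields as top-level columns, deduplicate on those, then drop them.

{code}
import org.apache.spark.sql.functions.col

// Pull the struct fields up as plain columns so dropDuplicates can resolve them.
val deduped = df
  .withColumn("lo", col("header.eventId.lo"))
  .withColumn("hi", col("header.eventId.hi"))
  .dropDuplicates("lo", "hi")
  .drop("lo", "hi")
deduped.show()
{code}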

> Spark SQL function 'drop_duplicates' throws error when passing in a column 
> that is an element of a struct
> -
>
> Key: SPARK-22532
> URL: https://issues.apache.org/jira/browse/SPARK-22532
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
> Environment: Attempted on the following versions:
> * Spark 2.1 (CDH 5.9.2 w/ SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904)
> * Spark 2.1 (installed via homebrew)
> * Spark 2.2 (installed via homebrew)
> Also tried on Spark 1.6 that comes with CDH 5.9.2 and it works correctly; 
> this appears to be a regression.
>Reporter: Nicholas Hakobian
>
> When attempting to use drop_duplicates with a subset of columns that exist 
> within a struct, the following error is raised:
> {noformat}
> AnalysisException: u'Cannot resolve column name "header.eventId.lo" among 
> (header);'
> {noformat}
> A complete example (using old sqlContext syntax so the same code can be run 
> with Spark 1.x as well):
> {noformat}
> from pyspark.sql import Row
> from pyspark.sql.functions import *
> data = [
> Row(header=Row(eventId=Row(lo=0, hi=1))),
> Row(header=Row(eventId=Row(lo=0, hi=1))),
> Row(header=Row(eventId=Row(lo=1, hi=2))),
> Row(header=Row(eventId=Row(lo=2, hi=3))),
> ]
> df = sqlContext.createDataFrame(data)
> df.drop_duplicates(['header.eventId.lo', 'header.eventId.hi']).show()
> {noformat}
> produces the following stack trace:
> {noformat}
> ---
> AnalysisException Traceback (most recent call last)
>  in ()
>  11 df = sqlContext.createDataFrame(data)
>  12
> ---> 13 df.drop_duplicates(['header.eventId.lo', 'header.eventId.hi']).show()
> /usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/dataframe.py 
> in dropDuplicates(self, subset)
>1243 jdf = self._jdf.dropDuplicates()
>1244 else:
> -> 1245 jdf = self._jdf.dropDuplicates(self._jseq(subset))
>1246 return DataFrame(jdf, self.sql_ctx)
>1247
> /usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  67  
> e.java_exception.getStackTrace()))
>  68 if s.startswith('org.apache.spark.sql.AnalysisException: 
> '):
> ---> 69 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
>  70 if s.startswith('org.apache.spark.sql.catalyst.analysis'):
>  71 raise AnalysisException(s.split(': ', 1)[1], 
> stackTrace)
> AnalysisException: u'Cannot resolve column name "header.eventId.lo" among 
> (header);'
> {noformat}
> This works _correctly_ in Spark 1.6, but fails in 2.1 (via homebrew and CDH) 
> and 2.2 (via homebrew)
> An inconvenient workaround (but it works) is the following:
> {noformat}
> (
> df
> .withColumn('lo', col('header.eventId.lo'))
> .withColumn('hi', col('header.eventId.hi'))
> .drop_duplicates(['lo', 'hi'])
> .drop('lo')
> .drop('hi')
> .show()
> )
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


