[jira] [Updated] (SPARK-20919) Simplification of CachedKafkaConsumer.

2017-07-28 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-20919:

Description: Use an object pool instead of a cache for recycling objects in the 
Kafka consumer cache.  (was: Along the lines of SPARK-19968, a Guava cache can be 
used to simplify the code in CachedKafkaConsumer as well, with the additional 
feature of automatically cleaning up a consumer that has been unused for a configurable time.)
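
For illustration, a minimal sketch of the object-pool approach, assuming Apache 
Commons Pool 2; the key and consumer types below are placeholders, not the actual 
CachedKafkaConsumer internals:

{code:scala}
import org.apache.commons.pool2.{BaseKeyedPooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericKeyedObjectPool}

// Placeholder types standing in for the real CachedKafkaConsumer internals.
final case class CacheKey(groupId: String, topic: String, partition: Int)
final class PooledKafkaConsumer(val key: CacheKey) { def close(): Unit = () }

// Factory the pool uses to create and destroy consumers per key.
class ConsumerFactory extends BaseKeyedPooledObjectFactory[CacheKey, PooledKafkaConsumer] {
  override def create(key: CacheKey): PooledKafkaConsumer = new PooledKafkaConsumer(key)
  override def wrap(value: PooledKafkaConsumer): PooledObject[PooledKafkaConsumer] =
    new DefaultPooledObject(value)
  override def destroyObject(key: CacheKey, p: PooledObject[PooledKafkaConsumer]): Unit =
    p.getObject.close()
}

val pool = new GenericKeyedObjectPool(new ConsumerFactory)

// A task borrows the consumer for its (group, topic, partition) and returns it
// when done, instead of a cache evicting and re-creating consumers under load.
val key = CacheKey("group-1", "topic-1", 0)
val consumer = pool.borrowObject(key)
try {
  // consumer.poll(...) / fetch records here
} finally {
  pool.returnObject(key, consumer)
}
{code}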

> Simplification of CachedKafkaConsumer.
> --
>
> Key: SPARK-20919
> URL: https://issues.apache.org/jira/browse/SPARK-20919
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>
> Use an object pool instead of a cache for recycling objects in the Kafka 
> consumer cache.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20919) Simplification of CachedKafkaConsumer.

2017-07-28 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-20919:

Summary: Simplification of CachedKafkaConsumer.  (was: Simplification of 
CachedKafkaConsumer using guava cache.)

> Simplification of CachedKafkaConsumer.
> --
>
> Key: SPARK-20919
> URL: https://issues.apache.org/jira/browse/SPARK-20919
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>
> Along the lines of SPARK-19968, a Guava cache can be used to simplify the code in 
> CachedKafkaConsumer as well, with the additional feature of automatically cleaning 
> up a consumer that has been unused for a configurable time.






[jira] [Created] (SPARK-21559) Remove Mesos Fine-grain mode

2017-07-28 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-21559:
---

 Summary: Remove Mesos Fine-grain mode
 Key: SPARK-21559
 URL: https://issues.apache.org/jira/browse/SPARK-21559
 Project: Spark
  Issue Type: Task
  Components: Mesos
Affects Versions: 2.2.0
Reporter: Stavros Kontopoulos


After discussing this with people from Mesosphere, we agreed that it is time to 
remove fine-grained mode. The plan is to improve cluster mode to cover any benefits 
that may have existed when using fine-grained mode.
 [~susanxhuynh]
Previous status of this can be found here:
https://issues.apache.org/jira/browse/SPARK-11857








[jira] [Created] (SPARK-21560) Add hold mode for the LiveListenerBus

2017-07-28 Thread Li Yuanjian (JIRA)
Li Yuanjian created SPARK-21560:
---

 Summary: Add hold mode for the LiveListenerBus
 Key: SPARK-21560
 URL: https://issues.apache.org/jira/browse/SPARK-21560
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.0
Reporter: Li Yuanjian


As noted in the comments on SPARK-18838, we also face the same problem of critical 
events being dropped while the event queue is full. 
There is no doubt that improving the performance of the processing thread is 
important, whether the solution is multithreading or something else like 
SPARK-20776, but we may still need a hold strategy when the event queue is 
full, resuming once some room has been released. Whether the hold strategy is 
enabled, and the empty rate at which to resume, should both be configurable.
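
To make the idea concrete, here is a rough sketch of the hold strategy; this is an 
illustration only, not the actual LiveListenerBus code, and {{holdMode}} / 
{{resumeEmptyRate}} stand in for the two proposed configuration knobs:

{code:scala}
import java.util.concurrent.LinkedBlockingQueue

// Sketch of a bounded event queue that can either drop events when full
// (current behavior) or hold the posting thread until enough room is freed.
class HoldableEventQueue[E](capacity: Int, holdMode: Boolean, resumeEmptyRate: Double) {
  private val queue = new LinkedBlockingQueue[E](capacity)
  private val resumeThreshold = (capacity * resumeEmptyRate).toInt
  @volatile private var dropped = 0L

  def post(event: E): Unit = {
    if (!queue.offer(event)) {
      if (holdMode) {
        // Hold: wait until the processing thread has drained the queue below
        // the configured threshold, then enqueue and continue.
        while (queue.size() > resumeThreshold) Thread.sleep(10)
        queue.put(event)
      } else {
        dropped += 1 // current behavior: the event is silently dropped
      }
    }
  }

  def take(): E = queue.take()
  def droppedCount: Long = dropped
}
{code}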






[jira] [Updated] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21306:

Fix Version/s: (was: 2.1.2)
   (was: 2.0.3)

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Assignee: Yan Facai (颜发才)
>Priority: Critical
>  Labels: classification, ml
> Fix For: 2.2.1, 2.3.0
>
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've encountered a block 
> while trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are being extracted and fed to the underlying classifiers; 
> however, with some configurations, more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
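> For illustration, a minimal sketch of the configuration that hits this (the tiny 
> dataset here is made up, not from an actual run):
> {code}
> import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().master("local[*]").appName("ovr-weight").getOrCreate()
> import spark.implicits._
> 
> // Made-up data: label, features, and the per-row "weight" column.
> val training = Seq(
>   (0.0, Vectors.dense(0.0, 1.0), 5.0),
>   (1.0, Vectors.dense(1.0, 0.0), 1.0),
>   (2.0, Vectors.dense(1.0, 1.0), 1.0)
> ).toDF("label", "features", "weight")
> 
> val lr = new LogisticRegression().setWeightCol("weight")
> val ovr = new OneVsRest().setClassifier(lr)
> 
> // Fails in 2.1.1: OneVsRest.fit selects only the label and features columns
> // before training each binary classifier, so LogisticRegression cannot find
> // the "weight" column it was configured to use.
> val model = ovr.fit(training)
> {code}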
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!






[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-07-28 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104918#comment-16104918
 ] 

Wenchen Fan commented on SPARK-21067:
-

We have many tests for CREATE TABLE inside Spark SQL, so I think this issue is 
thrift-server specific.

However, I'm not familiar with the thrift-server code. cc [~rxin], do you know 
who the maintainer is?

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following JIRA 
> (https://issues.apache.org/jira/browse/SPARK-11021), which states that 
> {{hive.exec.stagingdir}} has to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}".
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our production 
> environment, but since 2.0+ we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at 

[jira] [Issue Comment Deleted] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2017-07-28 Thread SOMASUNDARAM SUDALAIMUTHU (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SOMASUNDARAM SUDALAIMUTHU updated SPARK-14927:
--
Comment: was deleted

(was: Is this fixed in 2.0 version ?)

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make it work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.
> Any help to move this forward is appreciated.






[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2017-07-28 Thread SOMASUNDARAM SUDALAIMUTHU (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105004#comment-16105004
 ] 

SOMASUNDARAM SUDALAIMUTHU commented on SPARK-14927:
---

Is this fixed in the 2.0 version?

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make it work in 
> Spark 1.6.1
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.
> Any help to move this forward is appreciated.






[jira] [Updated] (SPARK-21559) Remove Mesos fine-grained mode

2017-07-28 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-21559:

Description: 
After discussing this with people from Mesosphere, we agreed that it is time to 
remove fine-grained mode. The plan is to improve cluster mode to cover any 
benefits that may have existed when using fine-grained mode.
 [~susanxhuynh]
Previous status of this can be found here:
https://issues.apache.org/jira/browse/SPARK-11857



  was:
After discussing this with people from Mesosphere we agreed that it is time to 
remove fine grain mode. Plans are to improve cluster mode to cover any benefits 
may existed when using fine grain mode.
 [~susanxhuynh]
Previous status of this can be found here:
https://issues.apache.org/jira/browse/SPARK-11857




> Remove Mesos fine-grained mode
> --
>
> Key: SPARK-21559
> URL: https://issues.apache.org/jira/browse/SPARK-21559
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> After discussing this with people from Mesosphere, we agreed that it is time 
> to remove fine-grained mode. The plan is to improve cluster mode to cover any 
> benefits that may have existed when using fine-grained mode.
>  [~susanxhuynh]
> Previous status of this can be found here:
> https://issues.apache.org/jira/browse/SPARK-11857






[jira] [Updated] (SPARK-21559) Remove Mesos fine-grained mode

2017-07-28 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-21559:

Summary: Remove Mesos fine-grained mode  (was: Remove Mesos Fine-grain mode)

> Remove Mesos fine-grained mode
> --
>
> Key: SPARK-21559
> URL: https://issues.apache.org/jira/browse/SPARK-21559
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> After discussing this with people from Mesosphere, we agreed that it is time 
> to remove fine-grained mode. The plan is to improve cluster mode to cover any 
> benefits that may have existed when using fine-grained mode.
>  [~susanxhuynh]
> Previous status of this can be found here:
> https://issues.apache.org/jira/browse/SPARK-11857






[jira] [Commented] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105133#comment-16105133
 ] 

Thomas Graves commented on SPARK-21541:
---

It was merged in 
https://github.com/apache/spark/commit/69ab0e4bddccb461f960fcb48a390a1517e504dd 
but I guess the PR link didn't get picked up.

I missed that the PR title wasn't quite right ([Spark-21541]), so perhaps JIRA didn't 
pick it up properly.

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you run a Spark job without creating the SparkSession or SparkContext, the 
> Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 
> Also, since the Application Master unregisters with the Resource Manager and exits 
> successfully, it deletes the Spark staging directory, so when YARN makes 
> subsequent retries, it fails to find the staging directory and thus the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED






[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-28 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105127#comment-16105127
 ] 

Ruslan Dautkhanov edited comment on SPARK-21274 at 7/28/17 3:46 PM:


[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think {noformat}[1, 2]{noformat} is the correct behavior for the first 
query.

EXCEPT ALL returns all records from the *first* table which are not 
present in the second table, leaving the duplicates as is.

If you believe it should be "1,2", then it's easy to fix by just changing tab1 
to tab2 in the second query.

Or, the other way around, the original queries would return 
{noformat}
[1, 2]
for [1, 2] intersect_all [1, 2, 2]
{noformat}


was (Author: tagar):
[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think [1, 2] is the correct behavior for the first query.
EXCEPT ALL which returns all records from the *first* table which are not 
present in the second table, leaving the duplicates as is.



> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a, b, c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>     ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
> WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temp view under the name "*t1_except_t2_df*"; 
> it can also be used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df we defined 
> above:
> {code}
> SELECT a, b, c
> FROM tab1 t1
> WHERE NOT EXISTS (
>   SELECT 1
>   FROM t1_except_t2_df e
>   WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
> )
> {code}
> So the suggestion is just to use the above query rewrites to implement both the 
> EXCEPT ALL and INTERSECT ALL SQL set operations.






[jira] [Created] (SPARK-21561) spark-streaming-kafka-010 DStream is not pulling anything from Kafka

2017-07-28 Thread Vlad Badelita (JIRA)
Vlad Badelita created SPARK-21561:
-

 Summary: spark-streaming-kafka-010 DStream is not pulling anything 
from Kafka
 Key: SPARK-21561
 URL: https://issues.apache.org/jira/browse/SPARK-21561
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.1
Reporter: Vlad Badelita


I am trying to use spark-streaming-kafka-0.10 to pull messages from a Kafka 
topic (broker version 0.10). I have checked that messages are being produced and 
used a KafkaConsumer to pull them successfully. Now, when I try to use the 
Spark Streaming API, I am not getting anything. If I just use 
KafkaUtils.createRDD and specify some offset ranges manually, it works. But 
when I try to use createDirectStream, all the RDDs are empty, and when I check 
the partition offsets it simply reports that all partitions are at 0. Here is what 
I tried:

{code:scala}
 import org.apache.kafka.common.serialization.StringDeserializer
 import org.apache.spark.{SparkConf, TaskContext}
 import org.apache.spark.streaming.{Seconds, StreamingContext}
 import org.apache.spark.streaming.kafka010._
 import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
 import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

 val sparkConf = new SparkConf().setAppName("kafkastream")
 val ssc = new StreamingContext(sparkConf, Seconds(3))
 val topics = Array("my_topic")

 val kafkaParams = Map[String, Object](
   "bootstrap.servers" -> "hostname:6667",
   "key.deserializer" -> classOf[StringDeserializer],
   "value.deserializer" -> classOf[StringDeserializer],
   "group.id" -> "my_group",
   "auto.offset.reset" -> "earliest",
   "enable.auto.commit" -> (true: java.lang.Boolean)
 )

 val stream = KafkaUtils.createDirectStream[String, String](
   ssc,
   PreferConsistent,
   Subscribe[String, String](topics, kafkaParams)
 )

 stream.foreachRDD { rdd =>
   val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
   rdd.foreachPartition { iter =>
 val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
 println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
   }

   val rddCount = rdd.count()
   println("rdd count: ", rddCount)

   // stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
 }

 ssc.start()
 ssc.awaitTermination()
{code}

All partitions show offset ranges from 0 to 0 and all RDDs are empty. I would 
like it to start from the beginning of a partition but also pick up everything 
that is being produced to it.

I have also tried using spark-streaming-kafka-0.8 and it does work. I think it 
is a 0.10 issue because everything else works fine. Thank you!






[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-28 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105127#comment-16105127
 ] 

Ruslan Dautkhanov commented on SPARK-21274:
---

[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think [1, 2] is the correct behavior for the first query.
EXCEPT ALL returns all records from the *first* table which are not 
present in the second table, leaving the duplicates as is.



> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a, b, c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>     ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
> WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temp view under the name "*t1_except_t2_df*"; 
> it can also be used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df we defined 
> above:
> {code}
> SELECT a, b, c
> FROM tab1 t1
> WHERE NOT EXISTS (
>   SELECT 1
>   FROM t1_except_t2_df e
>   WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
> )
> {code}
> So the suggestion is just to use the above query rewrites to implement both the 
> EXCEPT ALL and INTERSECT ALL SQL set operations.






[jira] [Resolved] (SPARK-21553) Add the description of the default value of master parameter in the spark-shell

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21553.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18755
[https://github.com/apache/spark/pull/18755]

> Add the description of the default value of master parameter in the 
> spark-shell
> ---
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> When I type spark-shell --help, I find that the default value description for 
> the master parameter is missing. The user does not know what the default 
> value is when the master parameter is not included, so we need to add the 
> master parameter default description to the help information.






[jira] [Resolved] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-21541.
---
   Resolution: Fixed
 Assignee: Parth Gandhi
Fix Version/s: 2.3.0

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you run a Spark job without creating the SparkSession or SparkContext, the 
> Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 
> Also, since the Application Master unregisters with the Resource Manager and exits 
> successfully, it deletes the Spark staging directory, so when YARN makes 
> subsequent retries, it fails to find the staging directory and thus the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED






[jira] [Comment Edited] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105072#comment-16105072
 ] 

Li Jin edited comment on SPARK-21190 at 7/28/17 3:06 PM:
-

[~cloud_fan], thanks for pointing out `ArrowColumnVector`. [~bryanc], I think 
#18659 could serve as a basis for future udf work. My work with #18732 has some 
overlap with #18659 but I can work with [~bryanc] to merge. 

[~cloud_fan] and [~rxin], have you got the chance to think more about the API?


was (Author: icexelloss):
[~cloud_fan], thanks for pointing out `ArrowColumnVector`. [~bryanc], I think 
#18659 could serve as a basis for future udf work. My work with #18732 has some 
overlap with #18659 but I can work with [~bryanc] to merge. 

[~cloud_fan] and [~rxin], do you have chance to think more about the API?

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a 

[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105072#comment-16105072
 ] 

Li Jin commented on SPARK-21190:


[~cloud_fan], thanks for pointing out `ArrowColumnVector`. [~bryanc], I think 
#18659 could serve as a basis for future udf work. My work with #18732 has some 
overlap with #18659 but I can work with [~bryanc] to merge. 

[~cloud_fan] and [~rxin], do you have chance to think more about the API?

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  




[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-07-28 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105095#comment-16105095
 ] 

Li Jin commented on SPARK-21190:


I think use case 2 of what [~rxin] proposed originally is a good API to 
enable first. I think it can be a bit better if the input of the user function is 
not a {{pandas.DataFrame}} but {{pandas.Series}}, to match Spark columns. I.e., 
instead of:

{code}
@spark_udf(some way to describe the return schema)
def my_func(input):
  """ Some user-defined function.
 
  :param input: A Pandas DataFrame with two columns, a and b.
  :return: :class: A numpy array
  """
  return input[a] + input[b]
 
df = spark.range(1000).selectExpr("id a", "id / 2 b")
df.withColumn("c", my_func(df.a, df.b))
{code}

I think this is better:
{code}
@spark_udf(some way to describe the return schema)
def my_func(a, b):
  """ Some user-defined function.
 
  :param input: Two Pandas Series, a and b
  :return: :class: A Pandas Series
  """
  return a + b
 
df = spark.range(1000).selectExpr("id a", "id / 2 b")
df.withColumn("c", my_func(df.a, df.b))
{code}

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = 

[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which does not write into HDFS

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21549:
--
Priority: Major  (was: Blocker)

> Spark fails to complete job correctly in case of OutputFormat which does not 
> write into HDFS
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete a job correctly in the case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use the standard 
> Hadoop property *mapreduce.output.fileoutputformat.outputdir*.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:scala}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [commiting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.
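> For context, a minimal sketch (an assumption, not taken from the reporter's code) 
> of the kind of custom OutputFormat affected: one that writes to an external system 
> and never uses *mapreduce.output.fileoutputformat.outputdir*:
> {code:scala}
> import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, OutputFormat, RecordWriter, TaskAttemptContext}
> 
> // Writes records to some external system (e.g. a REST endpoint or a queue),
> // so no output directory is ever configured.
> class ExternalSystemOutputFormat extends OutputFormat[String, String] {
> 
>   override def getRecordWriter(context: TaskAttemptContext): RecordWriter[String, String] =
>     new RecordWriter[String, String] {
>       override def write(key: String, value: String): Unit = {
>         // push the record to the external system here
>       }
>       override def close(context: TaskAttemptContext): Unit = ()
>     }
> 
>   // Nothing to validate: there is no output directory.
>   override def checkOutputSpecs(context: JobContext): Unit = ()
> 
>   // No-op committer: there are no files to move into place on commit.
>   override def getOutputCommitter(context: TaskAttemptContext): OutputCommitter =
>     new OutputCommitter {
>       override def setupJob(jobContext: JobContext): Unit = ()
>       override def setupTask(taskContext: TaskAttemptContext): Unit = ()
>       override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
>       override def commitTask(taskContext: TaskAttemptContext): Unit = ()
>       override def abortTask(taskContext: TaskAttemptContext): Unit = ()
>     }
> }
> {code}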






[jira] [Comment Edited] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-28 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105127#comment-16105127
 ] 

Ruslan Dautkhanov edited comment on SPARK-21274 at 7/28/17 3:47 PM:


[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think {noformat}[1, 2]{noformat} is the correct behavior for the first 
query.

EXCEPT ALL returns all records from the *first* table which are not present in 
the second table, leaving the duplicates as is.

If you believe it should be "1,2", then it's easy to fix by just changing tab1 
to tab2 in the second query.

Or, the other way around, the original queries would return just "1,2" if you 
swap the two datasets:
{noformat}
[1, 2]
for [1, 2] intersect_all [1, 2, 2]
{noformat}


was (Author: tagar):
[~viirya], yes it returns {noformat}[1, 2, 2]{noformat} for both of the 
queries. 

I don't think {noformat}[1, 2]{noformat} is the correct behavior for the first 
query.

EXCEPT ALL which returns all records from the *first* table which are not 
present in the second table, leaving the duplicates as is.

If you believe it should be "1,2", then it's easy to fix by just changing tab1 
to tab2 in the second query.

Or other way around, original queries would return 
{noformat}
[1, 2]
for [1, 2] intersect_all [1, 2, 2]
{noformat}

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a, b, c
> FROM tab1 t1
>   LEFT OUTER JOIN tab2 t2
>     ON (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
> WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temp view under the name "*t1_except_t2_df*"; 
> it can also be used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join using the t1_except_t2_df we defined 
> above:
> {code}
> SELECT a, b, c
> FROM tab1 t1
> WHERE NOT EXISTS (
>   SELECT 1
>   FROM t1_except_t2_df e
>   WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
> )
> {code}
> So the suggestion is just to use the above query rewrites to implement both the 
> EXCEPT ALL and INTERSECT ALL SQL set operations.






[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2017-07-28 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105164#comment-16105164
 ] 

Thomas Graves commented on SPARK-17321:
---

Can you clarify? As stated above, you should not be using 
nodemanager.local-dirs. If you are, you should look at reconfiguring YARN to 
use the proper NM recovery dirs; see 
https://issues.apache.org/jira/browse/SPARK-14963

If you aren't using NM recovery, then yes, we should fix this so Spark doesn't 
use the backup DB at all.



> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0
>Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed some 
> Spark applications failing randomly due to YarnShuffleService.
> From the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505), we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good 
> disks if the first one is broken?






[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which does not write into HDFS

2017-07-28 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-21549:
---
Priority: Blocker  (was: Critical)

> Spark fails to complete job correctly in case of OutputFormat which does not 
> write into HDFS
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Blocker
>
> Spark fails to complete a job correctly in the case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use the standard 
> Hadoop property *mapreduce.output.fileoutputformat.outputdir*.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating the task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.
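
For reference, a minimal sketch of the kind of OutputFormat the report is about, 
i.e. one that writes somewhere other than a file system and therefore never sets 
mapreduce.output.fileoutputformat.outputdir. The class and object names below 
are made up for illustration; this is not code from the Spark code base.

{code:scala}
// A "discarding" OutputFormat that never touches
// mapreduce.output.fileoutputformat.outputdir.
import org.apache.hadoop.mapreduce.{Job, JobContext, OutputCommitter, OutputFormat, RecordWriter, TaskAttemptContext}
import org.apache.spark.sql.SparkSession

class DiscardingOutputFormat extends OutputFormat[String, String] {
  override def getRecordWriter(ctx: TaskAttemptContext): RecordWriter[String, String] =
    new RecordWriter[String, String] {
      override def write(key: String, value: String): Unit = ()  // e.g. send to an external system
      override def close(ctx: TaskAttemptContext): Unit = ()
    }

  override def checkOutputSpecs(ctx: JobContext): Unit = ()      // nothing to validate

  override def getOutputCommitter(ctx: TaskAttemptContext): OutputCommitter =
    new OutputCommitter {
      override def setupJob(jobContext: JobContext): Unit = ()
      override def setupTask(taskContext: TaskAttemptContext): Unit = ()
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = ()
      override def abortTask(taskContext: TaskAttemptContext): Unit = ()
    }
}

object Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SPARK-21549-repro").getOrCreate()
    val job = Job.getInstance(spark.sparkContext.hadoopConfiguration)
    job.setOutputFormatClass(classOf[DiscardingOutputFormat])
    job.setOutputKeyClass(classOf[String])
    job.setOutputValueClass(classOf[String])
    // Note: mapreduce.output.fileoutputformat.outputdir is never set anywhere.
    spark.sparkContext
      .parallelize(Seq("a" -> "1", "b" -> "2"))
      .saveAsNewAPIHadoopDataset(job.getConfiguration)
    spark.stop()
  }
}
{code}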



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-28 Thread Parth Gandhi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105116#comment-16105116
 ] 

Parth Gandhi commented on SPARK-21541:
--

The change has been merged. Thank you.

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you run a Spark job without creating the SparkSession or SparkContext, the 
> Spark job logs say it succeeded, but YARN says it failed and retries 3 times. 
> Also, since the Application Master unregisters with the Resource Manager and 
> exits successfully, it deletes the Spark staging directory, so when YARN makes 
> subsequent retries, they fail to find the staging directory and thus the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 7/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21555) GROUP BY don't work with expressions with NVL and nested objects

2017-07-28 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105092#comment-16105092
 ] 

Liang-Chi Hsieh commented on SPARK-21555:
-

The sync between PRs and JIRA still seems broken. I already submitted a PR for 
this issue at https://github.com/apache/spark/pull/18761.

> GROUP BY don't work with expressions with NVL and nested objects
> 
>
> Key: SPARK-21555
> URL: https://issues.apache.org/jira/browse/SPARK-21555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Vitaly Gerasimov
>
> {code}
> spark.read.json(spark.createDataset("""{"foo":{"foo1":"value"}}""" :: 
> Nil)).createOrReplaceTempView("test")
> spark.sql("select nvl(foo.foo1, \"value\"), count(*) from test group by 
> nvl(foo.foo1, \"value\")")
> {code}
> returns exception:
> {code}
> org.apache.spark.sql.AnalysisException: expression 'test.`foo`' is neither 
> present in the group by, nor is it an aggregate function. Add to group by or 
> wrap in first() (or first_value) if you don't care which value you get.;;
> Aggregate [nvl(foo#4.foo1 AS foo1#8, value)], [nvl(foo#4.foo1 AS foo1#9, 
> value) AS nvl(test.`foo`.`foo1` AS `foo1`, 'value')#11, count(1) AS 
> count(1)#12L]
> +- SubqueryAlias test
>+- LogicalRDD [foo#4]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:247)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$9.apply(CheckAnalysis.scala:280)
>   

[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of custom OutputFormat implementations

2017-07-28 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-21549:
---
Summary: Spark fails to complete job correctly in case of custom 
OutputFormat implementations  (was: Spark fails to abort job correctly in case 
of custom OutputFormat implementations)

> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations
> 
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Critical
>
> Spark fails to abort job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> In that case, if the job fails, Spark executes 
> [committer.abortJob|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L106]
> {code:javascript}
> committer.abortJob(jobContext)
> {code}
> ... and fails with the following exception
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-21549:
---
Priority: Critical  (was: Blocker)

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Critical
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating the task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21556) PySpark, Unable to save pipeline of non-spark transformers

2017-07-28 Thread Saif Addin (JIRA)
Saif Addin created SPARK-21556:
--

 Summary: PySpark, Unable to save pipeline of non-spark transformers
 Key: SPARK-21556
 URL: https://issues.apache.org/jira/browse/SPARK-21556
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.1.1
Reporter: Saif Addin
Priority: Minor


We are working on creating some new ML transformers following the same Spark / 
PySpark design pattern.
In PySpark, though, we are unable to deserialize (read) Pipelines made of such 
new Transformers, due to a hardcoded class path name in *wrapper.py*

https://github.com/apache/spark/blob/master/python/pyspark/ml/wrapper.py#L200

So this line makes pipeline components work only if the JVM classes map to 
Python classes with the package root replaced, but it does not work for more 
general use cases.

The first workaround that comes to mind is to use the same package path on the 
PySpark side as on the JVM side.

The error when trying to load a Pipeline from a path in such circumstances is:

{code:java}

E
==
ERROR: runTest (test.annotators.PipelineTestSpec)
--
Traceback (most recent call last):
  File "/home/saif/IdeaProjects/this_project/test/annotators.py", line 208, in 
runTest
loaded_pipeline = Pipeline.read().load(pipe_path)
  File 
"/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/util.py",
 line 198, in load
return self._clazz._from_java(java_obj)
  File 
"/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
 line 155, in _from_java
py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
  File 
"/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
 line 155, in 
py_stages = [JavaParams._from_java(s) for s in java_stage.getStages()]
  File 
"/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py",
 line 173, in _from_java
py_type = __get_class(stage_name)
  File 
"/home/saif/apps/spark-2.1.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py",
 line 167, in __get_class
m = __import__(module)
ModuleNotFoundError: No module named 'com.frh'

{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21555) GROUP BY don't work with expressions with NVL and nested objects

2017-07-28 Thread Vitaly Gerasimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaly Gerasimov updated SPARK-21555:
-
Description: 
{code}
spark.read.json(spark.createDataset("""{"foo":{"foo1":"value"}}""" :: 
Nil)).createOrReplaceTempView("test")
spark.sql("select nvl(cast(foo.foo1 as string), \"value\"), count(*) from test 
group by nvl(cast(foo.foo1 as string), \"value\")")
{code}

returns exception:
{code}
org.apache.spark.sql.AnalysisException: expression 'test.`foo`' is neither 
present in the group by, nor is it an aggregate function. Add to group by or 
wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [nvl(cast(foo#249.foo1 AS foo1#253 as string), value)], 
[nvl(cast(foo#249.foo1 AS foo1#254 as string), value) AS 
nvl(CAST(test.`foo`.`foo1` AS `foo1` AS STRING), 'value')#256, count(1) AS 
count(1)#257L]
+- SubqueryAlias test
   +- LogicalRDD [foo#249]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:247)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 

[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-21549:
---
Summary: Spark fails to complete job correctly in case of OutputFormat 
which do not write into hdfs  (was: Spark fails to complete job correctly in 
case of custom OutputFormat implementations)

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Critical
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating the task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-21549:
---
Priority: Blocker  (was: Critical)

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Priority: Blocker
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating the task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104533#comment-16104533
 ] 

Sean Owen commented on SPARK-21543:
---

Well, if 1 executor fails, causing a task to fail once, it's no big deal, it 
will retry.
If all executors are failing for some reason, then this would cause the task to 
keep retrying forever.
You are talking about something else: not scheduling a task again on a failed 
executor. This is the blacklisting change, isn't it?
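
For context, the blacklisting referred to above is driven by configuration. A 
minimal sketch of enabling it, e.g. when building the SparkConf (the threshold 
values below are only examples, not recommendations):

{code:scala}
// Enable Spark's blacklisting so that repeated task failures on a bad
// executor or node stop further tasks from being scheduled there.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("blacklist-example")
  .set("spark.blacklist.enabled", "true")
  // max failed attempts of one task on a single executor / node
  .set("spark.blacklist.task.maxTaskAttemptsPerExecutor", "1")
  .set("spark.blacklist.task.maxTaskAttemptsPerNode", "2")
  // per-stage limits before an executor / node is blacklisted for the stage
  .set("spark.blacklist.stage.maxFailedTasksPerExecutor", "2")
  .set("spark.blacklist.stage.maxFailedExecutorsPerNode", "2")
{code}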

> Should not count executor initialize failed towards task failures
> -
>
> Key: SPARK-21543
> URL: https://issues.apache.org/jira/browse/SPARK-21543
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> Till now, when executor init fails and it exits with error code = 1, the 
> failure counts toward task failures. I think executor initialization failures 
> should not count towards task failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21543) Should not count executor initialize failed towards task failures

2017-07-28 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang closed SPARK-21543.

Resolution: Invalid

> Should not count executor initialize failed towards task failures
> -
>
> Key: SPARK-21543
> URL: https://issues.apache.org/jira/browse/SPARK-21543
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> Till now, when executor init fails and it exits with error code = 1, the 
> failure counts toward task failures. I think executor initialization failures 
> should not count towards task failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21548) Support insert into serial columns of table

2017-07-28 Thread LvDongrong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104558#comment-16104558
 ] 

LvDongrong commented on SPARK-21548:


Thank you very much, I've made a PR, and you can see if there are any problems. 

> Support insert into serial columns of table
> ---
>
> Key: SPARK-21548
> URL: https://issues.apache.org/jira/browse/SPARK-21548
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: LvDongrong
>
> When we use the 'insert into ...' statement we can only insert all the 
> columns into the table. But in some cases our table has many columns and we 
> are only interested in some of them, so we want to support the statement 
> "insert into table tbl (column1, column2, ...) values (value1, value2, value3, ...)".
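
A sketch of the difference against a hypothetical table (the first INSERT is 
what works today; the second is the proposed form and is not yet valid syntax):

{code:scala}
// Hypothetical table, used only to illustrate the proposed column-list syntax.
spark.sql("CREATE TABLE tbl (column1 INT, column2 STRING, column3 STRING) USING parquet")

// Today: a value must be supplied for every column, NULLs included.
spark.sql("INSERT INTO tbl VALUES (1, 'a', NULL)")

// Proposed: name only the columns of interest; the remaining ones would default to NULL.
spark.sql("INSERT INTO tbl (column1, column2) VALUES (1, 'a')")
{code}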



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of custom OutputFormat implementations

2017-07-28 Thread Sergey Zhemzhitsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-21549:
---
Description: 
Spark fails to complete job correctly in case of custom OutputFormat 
implementations.

There are OutputFormat implementations which do not need to use 
*mapreduce.output.fileoutputformat.outputdir* standard hadoop property.

[But spark reads this property from the 
configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
 while setting up an OutputCommitter
{code:javascript}
val committer = FileCommitProtocol.instantiate(
  className = classOf[HadoopMapReduceCommitProtocol].getName,
  jobId = stageId.toString,
  outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
  isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
committer.setupJob(jobContext)
{code}
... and then uses this property later on while [committing the 
job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
 [aborting the 
job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
 [creating the task's temp 
path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]

In those cases, when the job completes, the following exception is thrown:
{code}
Can not create a Path from a null string
java.lang.IllegalArgumentException: Can not create a Path from a null string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
  at org.apache.hadoop.fs.Path.(Path.java:135)
  at org.apache.hadoop.fs.Path.(Path.java:89)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
  at 
org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
  ...
{code}

So it seems that all the jobs which use OutputFormats which don't write data 
into HDFS-compatible file systems are broken.

  was:
Spark fails to abort job correctly in case of custom OutputFormat 
implementations.

There are OutputFormat implementations which do not need to use 
*mapreduce.output.fileoutputformat.outputdir* standard hadoop property.

[But spark reads this property from the 
configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
 while setting up an OutputCommitter
{code:javascript}
val committer = FileCommitProtocol.instantiate(
  className = classOf[HadoopMapReduceCommitProtocol].getName,
  jobId = stageId.toString,
  outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
  isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
committer.setupJob(jobContext)
{code}

In that case, if the job fails, Spark executes 
[committer.abortJob|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L106]
{code:javascript}
committer.abortJob(jobContext)
{code}
... and fails with the following exception
{code}
Can not create a Path from a null string
java.lang.IllegalArgumentException: Can not create a Path from a null string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
  at org.apache.hadoop.fs.Path.(Path.java:135)
  at org.apache.hadoop.fs.Path.(Path.java:89)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
  at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
  at 
org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
  at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
  at 

[jira] [Commented] (SPARK-21547) Spark cleaner cost too many time

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104532#comment-16104532
 ] 

Sean Owen commented on SPARK-21547:
---

It depends on your app, what's in your closure, etc. I'm not sure what problem 
this causes you.
"Look into X" isn't suitable as a JIRA. I think this would have to be paired 
with some hint about what the issue is or how it could be addressed.

> Spark cleaner cost too many time
> 
>
> Key: SPARK-21547
> URL: https://issues.apache.org/jira/browse/SPARK-21547
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
>Reporter: DjvuLee
>
> Spark Streaming sometimes spends a lot of time on cleaning, and this can 
> become worse when dynamic allocation is enabled.
> I posted the driver's log in the comments below; we can see that the cleaner 
> costs more than 2 minutes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21543) Should not count executor initialize failed towards task failures

2017-07-28 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104549#comment-16104549
 ] 

zhoukang commented on SPARK-21543:
--

First I will describe my case:
There is a cluster where one node has much more resources, but that node has a 
bad disk. However, YARN will always launch containers on this node even if 
executors cannot initialize successfully on it.
Then the job failed after 4 retries.
I agree with you. This should be considered together with the blacklist, since 
not counting these towards task failures would cause new problems.
I will close this issue and the related PR.
I will think about optimizing the blacklist and adding a disk checker for 
shuffle server registration.
Thanks for your time [~srowen]

> Should not count executor initialize failed towards task failures
> -
>
> Key: SPARK-21543
> URL: https://issues.apache.org/jira/browse/SPARK-21543
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>
> Till now, when executor init fails and it exits with error code = 1, the 
> failure counts toward task failures. I think executor initialization failures 
> should not count towards task failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21555) GROUP BY don't work with expressions with NVL and nested objects

2017-07-28 Thread Vitaly Gerasimov (JIRA)
Vitaly Gerasimov created SPARK-21555:


 Summary: GROUP BY don't work with expressions with NVL and nested 
objects
 Key: SPARK-21555
 URL: https://issues.apache.org/jira/browse/SPARK-21555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Vitaly Gerasimov


{code}
spark.read.json(spark.createDataset("""{"foo":{"foo1":"value"}}""" :: 
Nil)).createOrReplaceTempView("test")
spark.sql("select nvl(cast(foo.foo1 as string), \"value\"), count(*) from test 
group by nvl(cast(foo.foo1 as string), \"value\")")
{code}

returns exception:
{code}
org.apache.spark.sql.AnalysisException: expression 'test.`foo`' is neither 
present in the group by, nor is it an aggregate function. Add to group by or 
wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [nvl(cast(foo#249.foo1 AS foo1#253 as string), value)], 
[nvl(cast(foo#249.foo1 AS foo1#254 as string), value) AS 
nvl(CAST(test.`foo`.`foo1` AS `foo1` AS STRING), 'value')#256, count(1) AS 
count(1)#257L]
+- SubqueryAlias test
   +- LogicalRDD [foo#249]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:247)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 

[jira] [Updated] (SPARK-21555) GROUP BY don't work with expressions with NVL and nested objects

2017-07-28 Thread Vitaly Gerasimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaly Gerasimov updated SPARK-21555:
-
Description: 
{code}
spark.read.json(spark.createDataset("""{"foo":{"foo1":"value"}}""" :: 
Nil)).createOrReplaceTempView("test")
spark.sql("select nvl(foo.foo1, \"value\"), count(*) from test group by 
nvl(foo.foo1, \"value\")")
{code}

returns exception:
{code}
org.apache.spark.sql.AnalysisException: expression 'test.`foo`' is neither 
present in the group by, nor is it an aggregate function. Add to group by or 
wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [nvl(foo#4.foo1 AS foo1#8, value)], [nvl(foo#4.foo1 AS foo1#9, value) 
AS nvl(test.`foo`.`foo1` AS `foo1`, 'value')#11, count(1) AS count(1)#12L]
+- SubqueryAlias test
   +- LogicalRDD [foo#4]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:247)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$9.apply(CheckAnalysis.scala:280)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$9.apply(CheckAnalysis.scala:280)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:280)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
  at 

[jira] [Updated] (SPARK-21555) GROUP BY don't work with expressions with NVL and nested objects

2017-07-28 Thread Vitaly Gerasimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vitaly Gerasimov updated SPARK-21555:
-
Description: 
{code}
spark.read.json(spark.createDataset("""{"foo":{"foo1":"value"}}""" :: 
Nil)).createOrReplaceTempView("test")
spark.sql("select nvl(foo.foo1, \"value\"), count(*) from test group by 
nvl(foo.foo1, \"value\")")
{code}

returns exception:
{code}
org.apache.spark.sql.AnalysisException: expression 'test.`foo`' is neither 
present in the group by, nor is it an aggregate function. Add to group by or 
wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [nvl(cast(foo#249.foo1 AS foo1#253 as string), value)], 
[nvl(cast(foo#249.foo1 AS foo1#254 as string), value) AS 
nvl(CAST(test.`foo`.`foo1` AS `foo1` AS STRING), 'value')#256, count(1) AS 
count(1)#257L]
+- SubqueryAlias test
   +- LogicalRDD [foo#249]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:247)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1$5.apply(CheckAnalysis.scala:253)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:253)
  at 

[jira] [Commented] (SPARK-21554) Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104640#comment-16104640
 ] 

Sean Owen commented on SPARK-21554:
---

The error here doesn't show the actual error. 

> Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: 
> XXX' when run on yarn cluster
> --
>
> Key: SPARK-21554
> URL: https://issues.apache.org/jira/browse/SPARK-21554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: We are deploying pyspark scripts on EMR 5.7
>Reporter: Subhod Lagade
>
> Traceback (most recent call last):
>   File "Test.py", line 7, in 
> hc = HiveContext(sc)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/context.py",
>  line 514, in __init__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/session.py",
>  line 179, in getOrCreate
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/utils.py",
>  line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21557) Debug issues for SparkML(scala.Predef$.any2ArrowAssoc)

2017-07-28 Thread prabir bhowmick (JIRA)
prabir bhowmick created SPARK-21557:
---

 Summary: Debug issues for SparkML(scala.Predef$.any2ArrowAssoc)
 Key: SPARK-21557
 URL: https://issues.apache.org/jira/browse/SPARK-21557
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.1.1
Reporter: prabir bhowmick
Priority: Critical
 Fix For: 2.1.2


Hi Team,

Can you please look at the error below, which I get when running the program 
below with the Maven configuration shown? Kindly tell me which versions I have 
to use. I am running this program from Eclipse Neon.

Error at Runtime:- 

Exception in thread "main" java.lang.NoSuchMethodError: 
scala.Predef$.any2ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at 
org.apache.spark.sql.SparkSession$Builder.config(SparkSession.scala:750)
at 
org.apache.spark.sql.SparkSession$Builder.appName(SparkSession.scala:741)
at com.MLTest.JavaPCAExample.main(JavaPCAExample.java:20)

Java Class:-

package com.MLTest;

import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class JavaPCAExample {
	public static void main(String[] args) {
		SparkSession spark = SparkSession.builder().appName("JavaPCAExample3")
				.config("spark.some.config.option", "some-value").getOrCreate();

		List<Row> data = Arrays.asList(
				RowFactory.create(Vectors.sparse(5, new int[] { 1, 3 }, new double[] { 1.0, 7.0 })),
				RowFactory.create(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
				RowFactory.create(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)));

		StructType schema = new StructType(
				new StructField[] { new StructField("features", new VectorUDT(), false, Metadata.empty()), });

		Dataset<Row> df = spark.createDataFrame(data, schema);

		PCAModel pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3).fit(df);

		Dataset<Row> result = pca.transform(df).select("pcaFeatures");
		result.show(false);

		spark.stop();
	}
}

pom.xml:-

<project xmlns="http://maven.apache.org/POM/4.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>SparkMLTest</groupId>
	<artifactId>SparkMLTest</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<build>
		<sourceDirectory>src</sourceDirectory>
		<plugins>
			<plugin>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.5.1</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
				</configuration>
			</plugin>
		</plugins>
	</build>
	<dependencies>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.10</artifactId>
			<version>2.2.0</version>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-streaming_2.10</artifactId>
			<version>2.1.1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-mllib_2.10</artifactId>
			<version>2.1.1</version>
			<scope>provided</scope>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-sql_2.10</artifactId>
			<version>2.1.1</version>
		</dependency>
		<dependency>
			<groupId>org.scala-lang</groupId>
			<artifactId>scala-library</artifactId>
			<version>2.13.0-M1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.parquet</groupId>
			<artifactId>parquet-hadoop-bundle</artifactId>
			<version>1.8.1</version>
		</dependency>
	</dependencies>
</project>

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21554) Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster

2017-07-28 Thread Subhod Lagade (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104634#comment-16104634
 ] 

Subhod Lagade commented on SPARK-21554:
---

Thanks for the quick reply @Hyukjin Kwon. We have Spark 2.1.1 installed on an 
EMR 5.7 cluster.
- From any of the nodes, when I try to submit a PySpark job we get the above 
error.

Deploy command : spark-submit --master yarn --deploy-mode cluster 
spark_installed_dir\examples\src\main\python\sql\basic.py

> Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: 
> XXX' when run on yarn cluster
> --
>
> Key: SPARK-21554
> URL: https://issues.apache.org/jira/browse/SPARK-21554
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.1
> Environment: We are deploying pyspark scripts on EMR 5.7
>Reporter: Subhod Lagade
>
> Traceback (most recent call last):
>   File "Test.py", line 7, in 
> hc = HiveContext(sc)
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/context.py",
>  line 514, in __init__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/session.py",
>  line 179, in getOrCreate
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1500357225179_0540/container_1500357225179_0540_02_01/pyspark.zip/pyspark/sql/utils.py",
>  line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 
> 'org.apache.spark.sql.hive.HiveSessionState':"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21557) Debug issues for SparkML(scala.Predef$.any2ArrowAssoc)

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21557.
---
  Resolution: Invalid
   Fix Version/s: (was: 2.1.2)
Target Version/s:   (was: 2.2.0)

JIRA isn't for questions - stackoverflow maybe. 

> Debug issues for SparkML(scala.Predef$.any2ArrowAssoc)
> --
>
> Key: SPARK-21557
> URL: https://issues.apache.org/jira/browse/SPARK-21557
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: prabir bhowmick
>Priority: Critical
>
> Hi Team,
> Can you please see the below error ,when I am running the below program using 
> below mvn config.Kindly tell me which version I have to use.I am running this 
> program from eclipse neon.
> Error at Runtime:- 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> scala.Predef$.any2ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>   at 
> org.apache.spark.sql.SparkSession$Builder.config(SparkSession.scala:750)
>   at 
> org.apache.spark.sql.SparkSession$Builder.appName(SparkSession.scala:741)
>   at com.MLTest.JavaPCAExample.main(JavaPCAExample.java:20)
> Java Class:-
> package com.MLTest;
> import org.apache.spark.sql.SparkSession;
> import java.util.Arrays;
> import java.util.List;
> import org.apache.spark.ml.feature.PCA;
> import org.apache.spark.ml.feature.PCAModel;
> import org.apache.spark.ml.linalg.VectorUDT;
> import org.apache.spark.ml.linalg.Vectors;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class JavaPCAExample {
>   public static void main(String[] args) {
>   SparkSession spark = 
> SparkSession.builder().appName("JavaPCAExample3")
>   .config("spark.some.config.option", 
> "some-value").getOrCreate();
>   List data = Arrays.asList(
>   RowFactory.create(Vectors.sparse(5, new int[] { 
> 1, 3 }, new double[] { 1.0, 7.0 })),
>   RowFactory.create(Vectors.dense(2.0, 0.0, 3.0, 
> 4.0, 5.0)),
>   RowFactory.create(Vectors.dense(4.0, 0.0, 0.0, 
> 6.0, 7.0)));
>   StructType schema = new StructType(
>   new StructField[] { new StructField("features", 
> new VectorUDT(), false, Metadata.empty()), });
>   Dataset df = spark.createDataFrame(data, schema);
>   PCAModel pca = new 
> PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3).fit(df);
>   Dataset result = pca.transform(df).select("pcaFeatures");
>   result.show(false);
>   spark.stop();
>   }
> }
> pom.xml:-
> http://maven.apache.org/POM/4.0.0; 
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
>   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/xsd/maven-4.0.0.xsd;>
>   4.0.0
>   SparkMLTest
>   SparkMLTest
>   0.0.1-SNAPSHOT
>   
>   src
>   
>   
>   maven-compiler-plugin
>   3.5.1
>   
>   1.8
>   1.8
>   
>   
>   
>   
>   
>   
>   org.apache.spark
>   spark-core_2.10
>   2.2.0
>   
>   
>   org.apache.spark
>   spark-streaming_2.10
>   2.1.1
>   
>   
>   org.apache.spark
>   spark-mllib_2.10
>   2.1.1
>   provided
>   
>   
>   org.apache.spark
>   spark-sql_2.10
>   2.1.1
>   
>   
>   org.scala-lang
>   scala-library
>   2.13.0-M1
>   
>   
>   org.apache.parquet
>   parquet-hadoop-bundle
>   1.8.1
>   
>   
> 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21558) Kinesis lease failover time should be increased or made configurable

2017-07-28 Thread JIRA
Clément MATHIEU created SPARK-21558:
---

 Summary: Kinesis lease failover time should be increased or made 
configurable
 Key: SPARK-21558
 URL: https://issues.apache.org/jira/browse/SPARK-21558
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.0.2
Reporter: Clément MATHIEU


I have a Spark Streaming application reading from a Kinesis stream which 
exhibits serious shard lease fickleness. The root cause has been identified as 
the KCL default failover time being too low for our typical JVM pause times:

# KinesisClientLibConfiguration#DEFAULT_FAILOVER_TIME_MILLIS is 10 seconds, 
meaning that if a worker does not renew a lease within 10s, other workers will 
steal it
# spark-streaming-kinesis-asl uses the default KCL failover time and does not 
allow configuring it
# The executors' JVM logs show frequent 10+ second pauses

While we could spend some time fine-tuning the GC configuration to reduce pause 
times, I am wondering whether 10 seconds is simply too low. Typical Spark 
executors have very large heaps, and the GCs available in HotSpot are not great 
at ensuring low and deterministic pause times. One might also want to use 
ParallelGC.

What do you think about:

# Increasing the failover time (it might hurt applications with low-latency 
requirements)
# Making it configurable (a sketch of what that could look like follows below)
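
For illustration, here is a minimal sketch of what raising the failover time 
looks like at the KCL level. The KinesisClientLibConfiguration constructor and 
withFailoverTimeMillis do exist in KCL 1.x; wiring such a value through 
spark-streaming-kinesis-asl is exactly the part that does not exist today and 
would be the subject of this ticket.

{code}
// Sketch only: how the failover time is set when building a KCL configuration directly.
// Exposing something like this through spark-streaming-kinesis-asl is the proposal above.
import java.util.UUID
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

val kclConfig = new KinesisClientLibConfiguration(
    "my-streaming-app",                        // application (lease table) name
    "my-stream",                               // Kinesis stream name
    new DefaultAWSCredentialsProviderChain(),  // credentials
    UUID.randomUUID().toString)                // worker id
  .withFailoverTimeMillis(30000L)              // e.g. 30s instead of the 10s default
{code}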




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns

2017-07-28 Thread Abhijit Bhole (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16104708#comment-16104708
 ] 

Abhijit Bhole commented on SPARK-21479:
---

So here is the actual use case - 

{code:java}
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([{ "x" : 'c1', "a": 1, "b" : 2}, { "x" : 'c2', "a": 
3, "b" : 4}])
df2 = spark.createDataFrame([{ "x" : 'c1', "a": 1, "c" : 5}, { "x" : 'c1', "a": 
3, "c" : 6}, { "x" : 'c2', "a": 5, "c" : 8}])

df1.join(df2, ['x', 'a'], 'right_outer').where("b = 2").explain()

df1.join(df2, ['x', 'a'], 'right_outer').where("b = 2").show()

print 

df1 = spark.createDataFrame([{ "x" : 'c1', "a": 1, "b" : 2}, { "x" : 'c2', "a": 
3, "b" : 4}])
df2 = spark.createDataFrame([{ "x" : 'c1', "a": 1, "c" : 5}, { "x" : 'c1', "a": 
3, "c" : 6}, { "x" : 'c2', "a": 5, "c" : 8}])


df1.join(df2, ['x', 'a'], 'right_outer').where("x = 'c1'").explain()

df1.join(df2, ['x', 'a'], 'right_outer').where("x = 'c1'").show()
{code}

Output - 

{code:java}
== Physical Plan ==
*Project [x#458, a#456L, b#450L, c#457L]
+- *SortMergeJoin [x#451, a#449L], [x#458, a#456L], Inner
   :- *Sort [x#451 ASC NULLS FIRST, a#449L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(x#451, a#449L, 4)
   : +- *Filter (((isnotnull(b#450L) && (b#450L = 2)) && isnotnull(x#451)) 
&& isnotnull(a#449L))
   :+- Scan ExistingRDD[a#449L,b#450L,x#451]
   +- *Sort [x#458 ASC NULLS FIRST, a#456L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(x#458, a#456L, 4)
 +- *Filter (isnotnull(x#458) && isnotnull(a#456L))
+- Scan ExistingRDD[a#456L,c#457L,x#458]
+---+---+---+---+
|  x|  a|  b|  c|
+---+---+---+---+
| c1|  1|  2|  5|
+---+---+---+---+


== Physical Plan ==
*Project [x#490, a#488L, b#482L, c#489L]
+- SortMergeJoin [x#483, a#481L], [x#490, a#488L], RightOuter
   :- *Sort [x#483 ASC NULLS FIRST, a#481L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(x#483, a#481L, 4)
   : +- Scan ExistingRDD[a#481L,b#482L,x#483]
   +- *Sort [x#490 ASC NULLS FIRST, a#488L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(x#490, a#488L, 4)
 +- *Filter (isnotnull(x#490) && (x#490 = c1))
+- Scan ExistingRDD[a#488L,c#489L,x#490]
+---+---+----+---+
|  x|  a|   b|  c|
+---+---+----+---+
| c1|  1|   2|  5|
| c1|  3|null|  6|
+---+---+----+---+
{code}

As you can see, the filter on the 'x' column does not get pushed down. In our 
case, 'x' is a company id in a multi-tenant system, and it is extremely 
important to push this filter down to both dataframes; otherwise the entire 
data of both tables is fetched.
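
A manual workaround sketch (Scala syntax for brevity, assuming the same df1/df2 
as above; this is not an optimizer fix): apply the filter on the join column to 
both inputs before joining, which gives the same result for this equi-join 
while pruning both scans.

{code}
// Sketch of a manual pushdown: for an equi-join on "x", pre-filtering both inputs
// with x = 'c1' is equivalent to filtering the joined result with x = 'c1'.
val df1Filtered = df1.filter("x = 'c1'")
val df2Filtered = df2.filter("x = 'c1'")
df1Filtered.join(df2Filtered, Seq("x", "a"), "right_outer").show()
{code}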


> Outer join filter pushdown in null supplying table when condition is on one 
> of the joined columns
> -
>
> Key: SPARK-21479
> URL: https://issues.apache.org/jira/browse/SPARK-21479
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>
> Here are two different query plans - 
> {code:java}
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("b = 2").explain()
> == Physical Plan ==
> *Project [a#16299L, b#16295L, c#16300L]
> +- *SortMergeJoin [a#16294L], [a#16299L], Inner
>:- *Sort [a#16294L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16294L, 4)
>: +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && 
> isnotnull(a#16294L))
>:+- Scan ExistingRDD[a#16294L,b#16295L]
>+- *Sort [a#16299L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16299L, 4)
>  +- *Filter isnotnull(a#16299L)
> +- Scan ExistingRDD[a#16299L,c#16300L]
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("a = 1").explain()
> == Physical Plan ==
> *Project [a#16314L, b#16310L, c#16315L]
> +- SortMergeJoin [a#16309L], [a#16314L], RightOuter
>:- *Sort [a#16309L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16309L, 4)
>: +- Scan ExistingRDD[a#16309L,b#16310L]
>+- *Sort [a#16314L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16314L, 4)
>  +- *Filter (isnotnull(a#16314L) && (a#16314L = 1))
> +- Scan ExistingRDD[a#16314L,c#16315L]
> {code}
> If condition on b can be pushed down on df1 then why not condition on a?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Updated] (SPARK-21553) Add the description of the default value of master parameter in the spark-shell

2017-07-28 Thread Donghui Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Donghui Xu updated SPARK-21553:
---
Summary: Add the description of the default value of master parameter in 
the spark-shell  (was: Added the description of the default value of master 
parameter in the spark-shell)

> Add the description of the default value of master parameter in the 
> spark-shell
> ---
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Priority: Minor
>
> When I type spark-shell --help, I find that the default value description for 
> the master parameter is missing. The user does not know what the default 
> value is when the master parameter is not included, so we need to add the 
> master parameter default description to the help information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21553) Add the description of the default value of master parameter in the spark-shell

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21553:
--
Priority: Trivial  (was: Minor)

> Add the description of the default value of master parameter in the 
> spark-shell
> ---
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Assignee: Donghui Xu
>Priority: Trivial
> Fix For: 2.3.0
>
>
> When I type spark-shell --help, I find that the default value description for 
> the master parameter is missing. The user does not know what the default 
> value is when the master parameter is not included, so we need to add the 
> master parameter default description to the help information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21553) Add the description of the default value of master parameter in the spark-shell

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21553:
-

Assignee: Donghui Xu

> Add the description of the default value of master parameter in the 
> spark-shell
> ---
>
> Key: SPARK-21553
> URL: https://issues.apache.org/jira/browse/SPARK-21553
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.2.0
>Reporter: Donghui Xu
>Assignee: Donghui Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> When I type spark-shell --help, I find that the default value description for 
> the master parameter is missing. The user does not know what the default 
> value is when the master parameter is not included, so we need to add the 
> master parameter default description to the help information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105772#comment-16105772
 ] 

Andrew Ash commented on SPARK-21563:


And for reference, I added this additional logging to assist in debugging: 
https://github.com/palantir/spark/pull/238

> Race condition when serializing TaskDescriptions and adding jars
> 
>
> Key: SPARK-21563
> URL: https://issues.apache.org/jira/browse/SPARK-21563
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing this exception during some running Spark jobs:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readUTF(DataInputStream.java:609)
> at java.io.DataInputStream.readUTF(DataInputStream.java:564)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> After some debugging, we determined that this is due to a race condition in 
> task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
> SPARK-19796
> The race is between adding additional jars to the SparkContext and 
> serializing the TaskDescription.
> Consider this sequence of events:
> - TaskSetManager creates a TaskDescription using a reference to the 
> SparkContext's jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
> - TaskDescription starts serializing, and begins writing jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
> - the size of the jar map is written out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
> - _on another thread_: the application adds a jar to the SparkContext's jars 
> list
> - then the entries in the jars list are serialized out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64
> The problem now is that the jars list is serialized as having N entries, but 
> actually N+1 entries follow that count!
> This causes task deserialization to fail in the executor, with the stacktrace 
> above.
> The same issue also likely exists for files, though I haven't observed that 
> and our application does not stress that codepath the same way it did for jar 
> additions.
> One fix here is that TaskSetManager could make an immutable copy of the jars 
> list that it passes into the TaskDescription constructor, so that list 
> doesn't change mid-serialization.
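
To make the immutable-copy suggestion above concrete, here is a standalone 
sketch of the race and the fix idea in generic Scala; the names are 
illustrative and this is not Spark's actual TaskDescription code.

{code}
// Standalone sketch (not Spark's classes): a map serialized as "size, then entries"
// is corrupted if another thread inserts between the two writes. Snapshotting the
// shared mutable map into an immutable one before encoding removes the race.
import java.io.{ByteArrayOutputStream, DataOutputStream}
import scala.collection.concurrent.TrieMap

val sharedJars = TrieMap("app.jar" -> 1L)   // stands in for SparkContext.addedJars

def encode(jars: collection.Map[String, Long]): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new DataOutputStream(bytes)
  out.writeInt(jars.size)                   // count written first...
  jars.foreach { case (name, ts) =>         // ...entries second
    out.writeUTF(name)
    out.writeLong(ts)
  }
  out.flush()
  bytes.toByteArray
}

// Fix idea: hand the encoder an immutable snapshot instead of the live map, so a
// concurrent sc.addJar() cannot change the entry count mid-serialization.
val snapshot = Map(sharedJars.toSeq: _*)
val payload  = encode(snapshot)
{code}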



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105630#comment-16105630
 ] 

Mridul Muralidharan edited comment on SPARK-21549 at 7/28/17 10:16 PM:
---

This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormats that do not 
set the referenced properties, and is an incompatibility introduced in Spark 2.2.

The workaround is to explicitly set the property to a dummy value that is valid 
and writable by the user, say /tmp (a sketch follows below).

+CC [~WeiqingYang] 
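
For illustration, a minimal sketch of that workaround; "rdd" stands for any 
pair RDD whose job configuration already selects the custom OutputFormat, and 
the path is just a writable dummy location.

{code}
// Workaround sketch: give the committer a valid, writable dummy output directory even
// though the custom OutputFormat never writes there. "rdd" is assumed to be a pair RDD
// whose job configuration already sets the OutputFormat class.
val jobConf = new org.apache.hadoop.conf.Configuration(sc.hadoopConfiguration)
jobConf.set("mapred.output.dir", "/tmp/spark-dummy-output")
jobConf.set("mapreduce.output.fileoutputformat.outputdir", "/tmp/spark-dummy-output")
rdd.saveAsNewAPIHadoopDataset(jobConf)
{code}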




was (Author: mridulm80):
This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormat's which do 
not set the properties referenced and is an incompatibility introduced in spark 
2.2

Workaround is to explicitly set the property to a dummy value (which is valid 
and writable by user).



> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [commiting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In that cases when the job completes then following exception is thrown
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21566) Python method for summary

2017-07-28 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-21566:
--

 Summary: Python method for summary
 Key: SPARK-21566
 URL: https://issues.apache.org/jira/browse/SPARK-21566
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Andrew Ray


Add a Python method for the summary API that was added in SPARK-21100



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-07-28 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105751#comment-16105751
 ] 

Andrew Ash commented on SPARK-20433:


Here's the patch I put in my fork of Spark: 
https://github.com/palantir/spark/pull/241

It addresses CVE-2017-7525 -- http://www.securityfocus.com/bid/99623

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respectful 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find to find a way to get on a patched version of 
> jackson-bind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21567) Dataset with Tuple of type alias throws error

2017-07-28 Thread Tomasz Bartczak (JIRA)
Tomasz Bartczak created SPARK-21567:
---

 Summary: Dataset with Tuple of type alias throws error
 Key: SPARK-21567
 URL: https://issues.apache.org/jira/browse/SPARK-21567
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1
 Environment: verified for spark 2.1.1 and 2.2.0 in sbt build
Reporter: Tomasz Bartczak


Returning from a map a tuple that contains another tuple defined as a type 
alias, we receive an error.

minimal reproducible case:

having a structure like this:
{code}
object C {
  type TwoInt = (Int,Int)
  def tupleTypeAlias: TwoInt = (1,1)
}
{code}

when I do:
{code}
Seq(1).toDS().map(_ => ("",C.tupleTypeAlias))
{code}


I get exception:
{code}
type T1 is not a class
scala.ScalaReflectionException: type T1 is not a class
at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
at 
scala.reflect.internal.Symbols$SymbolContextApiImpl.asClass(Symbols.scala:84)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:682)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:84)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:614)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at 
org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
{code}

In Spark 2.1.1 the last exception was 'head of an empty list'.
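
A workaround that appears to sidestep the error (a sketch only, assuming the 
same implicits as the snippet above; not verified for every variant of this 
bug) is to rebuild the tuple inside the map, so the encoder is derived for the 
concrete type (Int, Int) rather than the alias:

{code}
// Workaround sketch: reconstruct the tuple so the static result type is
// (String, (Int, Int)) rather than (String, C.TwoInt); the encoder then
// never sees the type alias.
Seq(1).toDS().map { _ =>
  val t = C.tupleTypeAlias
  ("", (t._1, t._2))
}
{code}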



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19490) Hive partition columns are case-sensitive

2017-07-28 Thread Taklon Stephen Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105734#comment-16105734
 ] 

Taklon Stephen Wu commented on SPARK-19490:
---

https://github.com/apache/spark/pull/16832 is still open and cenyuhai@ didn't 
tell me the direct commit of the `fixed PR`. Can we reopen this JIRA, or at 
least let me know whether this is still an issue?

> Hive partition columns are case-sensitive
> -
>
> Key: SPARK-19490
> URL: https://issues.apache.org/jira/browse/SPARK-19490
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: cen yuhai
>
> The real partitions columns are lower case (year, month, day)
> {code}
> Caused by: java.lang.RuntimeException: Expected only partition pruning 
> predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
>   at 
> org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
>   at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
>   at 
> org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
>   at 
> org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
>   at 
> org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> {code}
> Use these sql can reproduce this bug:
> CREATE TABLE partition_test (key Int) partitioned by (date string)
> SELECT * FROM partition_test where DATE = '20170101'



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21568) ConsoleProgressBar should only be enabled in shells

2017-07-28 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-21568:
--

 Summary: ConsoleProgressBar should only be enabled in shells
 Key: SPARK-21568
 URL: https://issues.apache.org/jira/browse/SPARK-21568
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin
Priority: Minor


This is the current logic that enables the progress bar:

{code}
_progressBar =
  if (_conf.getBoolean("spark.ui.showConsoleProgress", true) && 
!log.isInfoEnabled) {
Some(new ConsoleProgressBar(this))
  } else {
None
  }
{code}

That is based on the logging level; it just happens to align with the default 
configuration for shells (WARN) and normal apps (INFO).

But if someone changes the default logging config for their app, this may 
break; they may silence logs by setting the default level to WARN or ERROR, and 
a normal application will see a lot of log spam from the progress bar (which is 
especially bad when output is redirected to a file, as is usually done when 
running in cluster mode).

While it's possible to disable the progress bar separately, this behavior is 
not really expected.
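
For illustration, a minimal sketch of the decoupling (not an actual Spark 
change, and the shell check is a purely hypothetical heuristic): gate the 
progress bar on the explicit flag alone, with a default derived from whether 
this looks like a shell instead of from the logging level.

{code}
// Sketch only: same structure as the snippet above, but the decision no longer
// depends on log.isInfoEnabled. "looksLikeShell" is a hypothetical stand-in for
// a proper "started from a shell" signal.
val looksLikeShell = _conf.get("spark.app.name", "") == "Spark shell"
_progressBar =
  if (_conf.getBoolean("spark.ui.showConsoleProgress", looksLikeShell)) {
    Some(new ConsoleProgressBar(this))
  } else {
    None
  }
{code}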



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105630#comment-16105630
 ] 

Mridul Muralidharan edited comment on SPARK-21549 at 7/28/17 10:14 PM:
---

This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormat's which do 
not set the properties referenced and is an incompatibility introduced in spark 
2.2

Workaround is to explicitly set the property to a dummy value (which is valid 
and writable by user).




was (Author: mridulm80):

This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormat's which do 
not set the properties referenced and is an incompatibility introduced in spark 
2.2

Workaround is to explicitly set the property to a dummy value.



> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [commiting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In that cases when the job completes then following exception is thrown
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Security issue with jackson-databind

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105792#comment-16105792
 ] 

Sean Owen commented on SPARK-20433:
---

You updated to 2.6.7 but indicated above that's still vulnerable. Does it 
contain the fix?

Also how does this affect Spark?

> Security issue with jackson-databind
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>  Labels: security
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respectful 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find to find a way to get on a patched version of 
> jackson-bind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
Hi guys,

I found an interesting problem when Spark requests containers from YARN. Here 
is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only when there are executors 
that have not yet been requested.

{code:java}
def updateResourceRequests(): Unit = {
  val pendingAllocate = getPendingAllocate
  val numPendingAllocate = pendingAllocate.size
  val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  if (missing > 0) {
    ...
  }
  ...
}
{code}


2. After the requested containers are allocated (granted through RPC), it 
updates the pending queue.
  
{code:java}
private def matchContainerToRequest(
    allocatedContainer: Container,
    location: String,
    containersToUse: ArrayBuffer[Container],
    remaining: ArrayBuffer[Container]): Unit = {
  ...
  amClient.removeContainerRequest(containerRequest)  // update pending queues
  ...
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
  for (container <- containersToUse) {
    ...
    launcherPool.execute(new Runnable {
      override def run(): Unit = {
        try {
          new ExecutorRunnable(
            Some(container),
            conf,
            sparkConf,
            driverUrl,
            executorId,
            executorHostname,
            executorMemory,
            executorCores,
            appAttemptId.getApplicationId.toString,
            securityMgr,
            localResources
          ).run()
          logInfo(s"has launched $containerId")
          updateInternalState()  // update running queues
          ...
        } catch {
          ...
        }
      }
    })
  }
}
{code}



However, in step 3 a thread is launched that first runs ExecutorRunnable and 
only then updates the running queue. We found it can take almost 1 second 
before the running-queue update (updateInternalState()) is called. So there is 
an inconsistent window: the pending queue has already been updated, but the 
running queue has not, because the launching thread has not reached 
updateInternalState() yet. If an RPC call to amClient.allocate() happens within 
this window, more executors than targetNumExecutors will be requested.


{noformat}
Here is an example:
Initial:
targetNumExecutors   numPendingAllocate       numExecutorsRunning
1                    0                        0
After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate       numExecutorsRunning
1                    1                        0
After the first allocated container is granted by YARN:
targetNumExecutors   numPendingAllocate       numExecutorsRunning
1                    0 (removed in step 2)    0

=> if there is an RPC call to amClient.allocate() here, then more containers are
requested; however, this situation is caused by the inconsistent state.

After the container is launched in step 3:
targetNumExecutors   numPendingAllocate       numExecutorsRunning
1                    0                        1


{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic 
containers (allocation takes about 100ms), which is much faster than guaranteed 
containers (allocation takes almost 1s).


I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have 
misunderstood).
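
To make this concrete, here is a standalone sketch of one way to close the 
window (generic Scala with illustrative names, not the actual YarnAllocator 
fields): also count executors that are "starting", i.e. granted but not yet 
recorded as running, and include them when computing how many are missing.

{code}
// Standalone sketch (not YarnAllocator code): track a "starting" count so that
// missing = target - pending - starting - running stays correct during the gap
// between removeContainerRequest() (step 2) and updateInternalState() (step 3).
import java.util.concurrent.atomic.AtomicInteger

val targetNumExecutors   = 1
val numPendingAllocate   = new AtomicInteger(1)  // after the first allocate() RPC
val numExecutorsStarting = new AtomicInteger(0)
val numExecutorsRunning  = new AtomicInteger(0)

def missing: Int =
  targetNumExecutors - numPendingAllocate.get - numExecutorsStarting.get - numExecutorsRunning.get

// Step 2: the granted container moves from "pending" to "starting" in one step.
numPendingAllocate.decrementAndGet()
numExecutorsStarting.incrementAndGet()
assert(missing == 0)  // an allocate() RPC here no longer sees a shortfall

// Step 3: once the launcher thread finishes, it moves from "starting" to "running".
numExecutorsStarting.decrementAndGet()
numExecutorsRunning.incrementAndGet()
assert(missing == 0)
{code}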


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
Hi guys,

I found an interesting problem when Spark requests containers from YARN. Here 
is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only when there are executors 
that have not yet been requested.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated (granted through RPC), it 
updates the pending queue.
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, in step 3 a thread is launched that first runs ExecutorRunnable and 
only then updates the running queue. We found it can take almost 1 second 
before the running-queue update (updateInternalState()) is called. So there is 
an inconsistent window: the pending queue has already been updated, but the 
running queue has not, because the launching thread has not reached 
updateInternalState() yet. If an RPC call to amClient.allocate() happens within 
this window, more executors than targetNumExecutors will be requested.


{noformat}
Here is an example:
Initial:
targetNumExecutors   numPendingAllocate       numExecutorsRunning
1                    0                        0
After first RPC call to amClient.allocate:
targetNumExecutors   numPendingAllocate       numExecutorsRunning
1                    1                        0
After the first allocated container is granted by YARN:
targetNumExecutors   numPendingAllocate       numExecutorsRunning
1                    0 (removed in step 2)    0

=> if there is an RPC call to amClient.allocate() here, then more containers are
requested; however, this situation is caused by the inconsistent state.

After the container is launched in step 3:
targetNumExecutors   numPendingAllocate       numExecutorsRunning
1                    0                        1

{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic 
containers (allocation takes about 100 ms), which is much faster than guaranteed 
containers (allocation takes almost 1 s).


I am not sure if I have a correct understanding.
I would appreciate anyone's help on this issue (please correct me if I have a misunderstanding).


Wei



  was:
hi guys,

I found an interesting problem when Spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested or started.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested or started.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
  launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, step 3 launches a thread that first runs ExecutorRunnable and only then 
updates the running queue. We found it can take almost 1 second before the function 
that updates the running queue, updateInternalState(), is called. So there is an 
inconsistent window: the pending queue has already been updated, but the running 
queue has not, because the launching thread has not yet reached 
updateInternalState(). If an RPC call to amClient.allocate() happens during this 
window, more executors than targetNumExecutors will be requested.


{noformat}
Here is an example:
Initial:
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  0   0



After first RPC call to amClient.allocate:
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  10



After the first allocated container is granted by YARN
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  0(is removed in step 2)  0


=>if there is a RPC call here to amClient.allocate(), then more containers 
are requested,
however this situation is caused by the inconsistent state.


After the container is launched in step 3
targetNumExecutors  numPendingAllocate numExecutorsRunning
1   0   1


{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic 
containers (allocation takes about 100 ms), which is much faster than guaranteed 
containers (allocation takes almost 1 s).


I am not sure if I have a correct understanding.
I would appreciate anyone's help on this issue (please correct me if I have a misunderstanding).


Wei



  was:
hi guys,

I found an interesting problem when Spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested or started.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi guys,

I found an interesting problem when Spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested or started.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
  launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, step 3 launches a thread that first runs ExecutorRunnable and only then 
updates the running queue. We found it can take almost 1 second before the function 
that updates the running queue, updateInternalState(), is called. So there is an 
inconsistent window: the pending queue has already been updated, but the running 
queue has not, because the launching thread has not yet reached 
updateInternalState(). If an RPC call to amClient.allocate() happens during this 
window, more executors than targetNumExecutors will be requested.


{noformat}
Here is an example:
Initial:
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  0 0



After first RPC call to amClient.allocate:
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  1 0



After the first allocated container is granted by YARN
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  0(is removed in step 2)  0


=>if there is a RPC call here to amClient.allocate(), then more containers 
are requested,
however this situation is caused by the inconsistent state.


After the container is launched in step 3
targetNumExecutors  numPendingAllocate numExecutorsRunning
1   01


{noformat}
===
I found this problem because I am testing the feature on YARN's opportunistic 
containers (allocation takes about 100 ms), which is much faster than guaranteed 
containers (allocation takes almost 1 s).


I am not sure if I have a correct understanding.
I would appreciate anyone's help on this issue (please correct me if I have a misunderstanding).


Wei



  was:
hi guys,

I found an interesting problem when Spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested or started.

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
auncherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, in step 3 it will launch a thread to first launch ExecutorRunnable 
then update running queue. We found it would take almost 1 sec before the 
updating running queue function is called(updateInternalState()). So there 
would be an inconsistent situation here since the pending queue is updated but 
the running queue is not updated yet due to the launching thread does not reach 
updateInternalState() yet. If there is an RPC call to amClient.allocate() 
between this inconsistent interval, then more executors than targetNumExecutors 
would be requested.


{noformat}
Here is an example:
Initial:
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  00



After first RPC call to amClient.allocate:
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  1 0



After the first allocated container is granted by YARN
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  0(is removed in step 2)  0


=>if there is a RPC call here to amClient.allocate(), then more containers 
are requested,
however this situation is caused by the inconsistent state.


After the container is launched in step 3
targetNumExecutors  numPendingAllocate numExecutorsRunning
1   01


{noformat}
===
I found this problem because I am testng the feature on YARN's opportunisitc 
containers(e.g., allocation takes 100ms) which is much faster then guaranteed 
containers(e.g., allocateion takes almost 1s).


I am not sure if I have a correct understanding.
Appreciate anyone's help in this issue(correct me if I have miss understanding)


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the 

[jira] [Resolved] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21562.

Resolution: Duplicate

> Spark may request extra containers if the rpc between YARN and spark is too 
> fast
> 
>
> Key: SPARK-21562
> URL: https://issues.apache.org/jira/browse/SPARK-21562
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Wei Chen
>  Labels: YARN
>
> hi guys,
> I find an interesting problem when spark tries to request containers from 
> YARN. 
> Here is the case:
> In YarnAllocator.scala
> 1. this function requests container from YARN only if there are executors are 
> not be requested. 
> {code:java}def updateResourceRequests(): Unit = {
> val pendingAllocate = getPendingAllocate
> val numPendingAllocate = pendingAllocate.size
> val missing = targetNumExecutors - numPendingAllocate - 
> numExecutorsRunning
>   
> if (missing > 0) {
>  ..
> }
>   .
> }
> {code}
> 2. After the requested containers are allocated(granted through RPC), then it 
> will update the pending queues
>   
> {code:java}
> private def matchContainerToRequest(
>   allocatedContainer: Container,
>   location: String,
>   containersToUse: ArrayBuffer[Container],
>   remaining: ArrayBuffer[Container]): Unit = {
>   .
>  
>amClient.removeContainerRequest(containerRequest) //update pending queues
>
>.
> }
> {code}
> 3. After the allocated containers are launched, it will update the running 
> queue
> {code:java}
> private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
> Unit = {
> for (container <- containersToUse) {
>  
> launcherPool.execute(new Runnable {
> override def run(): Unit = {
>   try {
> new ExecutorRunnable(
>   Some(container),
>   conf,
>   sparkConf,
>   driverUrl,
>   executorId,
>   executorHostname,
>   executorMemory,
>   executorCores,
>   appAttemptId.getApplicationId.toString,
>   securityMgr,
>   localResources
> ).run()
> logInfo(s"has launched $containerId")
> updateInternalState()   //update running queues
>  
>   
> } 
> }{code}
> However, in step 3 it will launch a thread to first launch ExecutorRunnable 
> then update running queue. We found it would take almost 1 sec before the 
> updating running queue function is called(updateInternalState()). So there 
> would be an inconsistent situation here since the pending queue is updated 
> but the running queue is not updated yet due to the launching thread does not 
> reach updateInternalState() yet. If there is an RPC call to 
> amClient.allocate() between this inconsistent interval, then more executors 
> than targetNumExecutors would be requested.
> {noformat}
> Here is an example:
> Initial:
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  00
> After first RPC call to amClient.allocate:
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  1 0
> After the first allocated container is granted by YARN
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  0(is removed in step 2)  0
> =>if there is a RPC call here to amClient.allocate(), then more 
> containers are requested,
> however this situation is caused by the inconsistent state.
> After the container is launched in step 3
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1   01
> {noformat}
> ===
> I found this problem because I am changing requestType to test some features 
> on YARN's opportunistic containers (allocation takes about 100 ms), which is 
> much faster than guaranteed containers (allocation takes almost 1 s).
> I am not sure if I have a correct understanding.
> I would appreciate anyone's help on this issue (please correct me if I have a 
> misunderstanding).
> Wei



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105431#comment-16105431
 ] 

Sean Owen commented on SPARK-17614:
---

Well, it's unrelated to this issue, so this isn't the place. And you seem to be 
reporting syntax that Cassandra doesn't support, which isn't a Spark issue. 

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting
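
For reference, the quoted report asks how the "WHERE 1=0" schema probe could be 
avoided. One possible direction, sketched under the assumption that the 
JdbcDialect.getSchemaQuery hook (available since Spark 2.1) applies here, is to 
register a dialect that issues a query CQL accepts instead. The dialect below is 
illustrative only and has not been tested against Cassandra:

{code:java}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Illustrative only: replaces the default "SELECT * FROM <table> WHERE 1=0" probe.
object CassandraJdbcDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:cassandra")

  // Assumes the target accepts "LIMIT 0" as a zero-row query.
  override def getSchemaQuery(table: String): String =
    s"SELECT * FROM $table LIMIT 0"
}

// Register before calling sparkSession.read().jdbc(...):
// JdbcDialects.registerDialect(CassandraJdbcDialect)
{code}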



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105505#comment-16105505
 ] 

Wei Chen commented on SPARK-21562:
--

Thanks for the comments, I will close this one.




> Spark may request extra containers if the rpc between YARN and spark is too 
> fast
> 
>
> Key: SPARK-21562
> URL: https://issues.apache.org/jira/browse/SPARK-21562
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Wei Chen
>  Labels: YARN
>
> hi guys,
> I find an interesting problem when spark tries to request containers from 
> YARN. 
> Here is the case:
> In YarnAllocator.scala
> 1. this function requests container from YARN only if there are executors are 
> not be requested. 
> {code:java}def updateResourceRequests(): Unit = {
> val pendingAllocate = getPendingAllocate
> val numPendingAllocate = pendingAllocate.size
> val missing = targetNumExecutors - numPendingAllocate - 
> numExecutorsRunning
>   
> if (missing > 0) {
>  ..
> }
>   .
> }
> {code}
> 2. After the requested containers are allocated(granted through RPC), then it 
> will update the pending queues
>   
> {code:java}
> private def matchContainerToRequest(
>   allocatedContainer: Container,
>   location: String,
>   containersToUse: ArrayBuffer[Container],
>   remaining: ArrayBuffer[Container]): Unit = {
>   .
>  
>amClient.removeContainerRequest(containerRequest) //update pending queues
>
>.
> }
> {code}
> 3. After the allocated containers are launched, it will update the running 
> queue
> {code:java}
> private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
> Unit = {
> for (container <- containersToUse) {
>  
> launcherPool.execute(new Runnable {
> override def run(): Unit = {
>   try {
> new ExecutorRunnable(
>   Some(container),
>   conf,
>   sparkConf,
>   driverUrl,
>   executorId,
>   executorHostname,
>   executorMemory,
>   executorCores,
>   appAttemptId.getApplicationId.toString,
>   securityMgr,
>   localResources
> ).run()
> logInfo(s"has launched $containerId")
> updateInternalState()   //update running queues
>  
>   
> } 
> }{code}
> However, in step 3 it will launch a thread to first launch ExecutorRunnable 
> then update running queue. We found it would take almost 1 sec before the 
> updating running queue function is called(updateInternalState()). So there 
> would be an inconsistent situation here since the pending queue is updated 
> but the running queue is not updated yet due to the launching thread does not 
> reach updateInternalState() yet. If there is an RPC call to 
> amClient.allocate() between this inconsistent interval, then more executors 
> than targetNumExecutors would be requested.
> {noformat}
> Here is an example:
> Initial:
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  00
> After first RPC call to amClient.allocate:
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  1 0
> After the first allocated container is granted by YARN
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1  0(is removed in step 2)  0
> =>if there is a RPC call here to amClient.allocate(), then more 
> containers are requested,
> however this situation is caused by the inconsistent state.
> After the container is launched in step 3
> targetNumExecutors  numPendingAllocate numExecutorsRunning
> 1   01
> {noformat}
> ===
> I found this problem because I am changing requestType to test some features 
> on YARN's opportunistic containers (allocation takes about 100 ms), which is 
> much faster than guaranteed containers (allocation takes almost 1 s).
> I am not sure if I have a correct understanding.
> I would appreciate anyone's help on this issue (please correct me if I have a 
> misunderstanding).
> Wei



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


[jira] [Created] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp

2017-07-28 Thread Amit Assudani (JIRA)
Amit Assudani created SPARK-21565:
-

 Summary: aggregate query fails with watermark on eventTime but 
works with watermark on timestamp column generated by current_timestamp
 Key: SPARK-21565
 URL: https://issues.apache.org/jira/browse/SPARK-21565
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Amit Assudani


The aggregation query fails when eventTime is used as the watermark column, but 
it works with a newTimeStamp column generated by running SQL with current_timestamp.

Exception:

Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172)
at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

Code to replicate:

package test

import java.nio.file.{Files, Path, Paths}
import java.text.SimpleDateFormat

import org.apache.spark.sql.types._
import org.apache.spark.sql.{SparkSession}

import scala.collection.JavaConverters._

object Test1 {

  def main(args: Array[String]) {

val sparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val checkpointPath = "target/cp1"
val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath
delete(newEventsPath)
delete(Paths.get(checkpointPath).toAbsolutePath)
Files.createDirectories(newEventsPath)


val dfNewEvents= newEvents(sparkSession)
dfNewEvents.createOrReplaceTempView("dfNewEvents")

//The below works - Start
//val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as 
newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds")
//dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
//val groupEvents = sparkSession.sql("select symbol,newTimeStamp, 
count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp")
// End


//The below doesn't work - Start
val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents 
").withWatermark("eventTime","2 seconds")
 dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
  val groupEvents = sparkSession.sql("select symbol,eventTime, count(price) 
as count1 from dfNewEvents2 group by symbol,eventTime")
// - End


val query1 = groupEvents.writeStream
  .outputMode("append")
.format("console")
  .option("checkpointLocation", checkpointPath)
  .start("./myop")

val newEventFile1=newEventsPath.resolve("eventNew1.json")
Files.write(newEventFile1, List(
  """{"symbol": 
"GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""",
  """{"symbol": 
"GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}"""
).toIterable.asJava)
query1.processAllAvailable()

sparkSession.streams.awaitAnyTermination(1)

  }

  private def newEvents(sparkSession: SparkSession) = {
val newEvents = Paths.get("target/newEvents/").toAbsolutePath
delete(newEvents)
Files.createDirectories(newEvents)

val dfNewEvents = 
sparkSession.readStream.schema(eventsSchema).json(newEvents.toString)//.withWatermark("eventTime","2
 seconds")
dfNewEvents
  }

  private val eventsSchema = StructType(List(
StructField("symbol", StringType, true),
StructField("price", DoubleType, true),
StructField("eventTime", TimestampType, false)
  ))

  private def delete(dir: Path) = {
if(Files.exists(dir)) {
  Files.walk(dir).iterator().asScala.toList
.map(p => p.toFile)
.sortWith((o1, o2) => o1.compareTo(o2) > 0)
.foreach(_.delete)
}
  }

}
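
For comparison, the usual pattern for append-mode aggregation on an event-time 
column is to group by a time window over that column. A sketch of that variant is 
below (the 10-second window is arbitrary, and this is not claimed to resolve the 
NoSuchElementException reported above); it would replace the failing block:

{code:java}
// Sketch only: watermark on eventTime plus a window() grouping, the documented
// pattern for append-mode event-time aggregations.
val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents")
  .withWatermark("eventTime", "2 seconds")
dfNewEvents2.createOrReplaceTempView("dfNewEvents2")

val groupEvents = sparkSession.sql(
  """select symbol, window(eventTime, '10 seconds') as win, count(price) as count1
    |from dfNewEvents2
    |group by symbol, window(eventTime, '10 seconds')""".stripMargin)
{code}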






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp

2017-07-28 Thread Amit Assudani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Assudani updated SPARK-21565:
--
Description: 
*Short Description: *

The aggregation query fails when eventTime is used as the watermark column, but 
it works with a newTimeStamp column generated by running SQL with current_timestamp.

*Exception:*

Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172)
at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
at 
org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

*Code to replicate:*

package test

import java.nio.file.{Files, Path, Paths}
import java.text.SimpleDateFormat

import org.apache.spark.sql.types._
import org.apache.spark.sql.{SparkSession}

import scala.collection.JavaConverters._

object Test1 {

  def main(args: Array[String]) {

val sparkSession = SparkSession
  .builder()
  .master("local[*]")
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
val checkpointPath = "target/cp1"
val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath
delete(newEventsPath)
delete(Paths.get(checkpointPath).toAbsolutePath)
Files.createDirectories(newEventsPath)


val dfNewEvents= newEvents(sparkSession)
dfNewEvents.createOrReplaceTempView("dfNewEvents")

//The below works - Start
//val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as 
newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds")
//dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
//val groupEvents = sparkSession.sql("select symbol,newTimeStamp, 
count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp")
// End


//The below doesn't work - Start
val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents 
").withWatermark("eventTime","2 seconds")
 dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
  val groupEvents = sparkSession.sql("select symbol,eventTime, count(price) 
as count1 from dfNewEvents2 group by symbol,eventTime")
// - End


val query1 = groupEvents.writeStream
  .outputMode("append")
.format("console")
  .option("checkpointLocation", checkpointPath)
  .start("./myop")

val newEventFile1=newEventsPath.resolve("eventNew1.json")
Files.write(newEventFile1, List(
  """{"symbol": 
"GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""",
  """{"symbol": 
"GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}"""
).toIterable.asJava)
query1.processAllAvailable()

sparkSession.streams.awaitAnyTermination(1)

  }

  private def newEvents(sparkSession: SparkSession) = {
val newEvents = Paths.get("target/newEvents/").toAbsolutePath
delete(newEvents)
Files.createDirectories(newEvents)

val dfNewEvents = 
sparkSession.readStream.schema(eventsSchema).json(newEvents.toString)//.withWatermark("eventTime","2
 seconds")
dfNewEvents
  }

  private val eventsSchema = StructType(List(
StructField("symbol", StringType, true),
StructField("price", DoubleType, true),
StructField("eventTime", TimestampType, false)
  ))

  private def delete(dir: Path) = {
if(Files.exists(dir)) {
  Files.walk(dir).iterator().asScala.toList
.map(p => p.toFile)
.sortWith((o1, o2) => o1.compareTo(o2) > 0)
.foreach(_.delete)
}
  }

}




  was:
Aggregation query fails with eventTime as watermark column while works with 
newTimeStamp column generated by running SQL with current_timestamp,

Exception:

Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at 

[jira] [Commented] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-07-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105630#comment-16105630
 ] 

Mridul Muralidharan commented on SPARK-21549:
-


This affects both mapred ("mapred.output.dir") and mapreduce 
("mapreduce.output.fileoutputformat.outputdir") based OutputFormats that do not 
set the referenced properties, and it is an incompatibility introduced in Spark 
2.2.

The workaround is to explicitly set the property to a dummy value.
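
A minimal sketch of that workaround (the dummy path is illustrative; the custom 
OutputFormat never writes there):

{code:java}
import org.apache.hadoop.conf.Configuration

val hadoopConf = new Configuration()
// Dummy values only; they satisfy the committer setup and are never written to.
hadoopConf.set("mapreduce.output.fileoutputformat.outputdir", "/tmp/spark-dummy-output")
hadoopConf.set("mapred.output.dir", "/tmp/spark-dummy-output")  // old mapred API
// ... configure the custom OutputFormat as usual, then e.g.:
// rdd.saveAsNewAPIHadoopDataset(hadoopConf)
{code}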



> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete job correctly in case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need to use 
> *mapreduce.output.fileoutputformat.outputdir* standard hadoop property.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [commiting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In such cases, when the job completes, the following exception is thrown
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all jobs using OutputFormats that don't write data into 
> HDFS-compatible file systems are broken.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-07-28 Thread Diogo Munaro Vieira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105514#comment-16105514
 ] 

Diogo Munaro Vieira commented on SPARK-19720:
-

Yes, but it's a major security bug as described here. Shouldn't it be ported to 2.1.2?

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually expose SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-07-28 Thread James Conner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105528#comment-16105528
 ] 

James Conner commented on SPARK-18016:
--

Thank you for letting me know, Kazuaki!  Please let me know if you need any 
debug or crash information.

The shape of the data that I am using is as follows:
* 1   x StringType (ID)
* 1   x VectorType (features)
* 2656 x DoubleType (SCORE, and feature_{1..2655})

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>

[jira] [Created] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21563:
--

 Summary: Race condition when serializing TaskDescriptions and 
adding jars
 Key: SPARK-21563
 URL: https://issues.apache.org/jira/browse/SPARK-21563
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde introduced in SPARK-19796.  cc [~irashid] [~kayousterhout]

The race is between adding additional jars to the SparkContext and serializing 
the TaskDescription.

Consider this sequence of events:

- TaskSetManager creates a TaskDescription using a reference to the 
SparkContext's jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
- TaskDescription starts serializing, and begins writing jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
- the size of the jar map is written out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
- _on another thread_: the application adds a jar to the SparkContext's jars 
list
- then the entries in the jars list are serialized out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64

The problem now is that the jars list is serialized as having N entries, but 
actually N+1 entries follow that count!

This causes task deserialization to fail in the executor, with the stacktrace 
above.

The same issue also likely exists for files, though I haven't observed that and 
our application does not stress that codepath the same way it did for jar 
additions.

One fix here is that TaskSetManager could make an immutable copy of the jars 
list that it passes into the TaskDescription constructor, so that list doesn't 
change mid-serialization.
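
A rough sketch of that idea (illustrative only, not an actual patch; names 
approximate the 2.2 internals):

{code:java}
// In TaskSetManager.resourceOffer, snapshot the SparkContext's mutable maps
// before constructing the TaskDescription, so the entry count written by
// TaskDescription.encode() cannot diverge from the entries that follow it.
val addedJarsSnapshot: Map[String, Long]  = sched.sc.addedJars.toMap   // immutable copy
val addedFilesSnapshot: Map[String, Long] = sched.sc.addedFiles.toMap
// ... then pass addedJarsSnapshot / addedFilesSnapshot into the TaskDescription
// constructor instead of the live, mutable maps.
{code}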



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-07-28 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-21563:
---
Description: 
cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
SPARK-19796

The race is between adding additional jars to the SparkContext and serializing 
the TaskDescription.

Consider this sequence of events:

- TaskSetManager creates a TaskDescription using a reference to the 
SparkContext's jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
- TaskDescription starts serializing, and begins writing jars: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
- the size of the jar map is written out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
- _on another thread_: the application adds a jar to the SparkContext's jars 
list
- then the entries in the jars list are serialized out: 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64

The problem now is that the jars list is serialized as having N entries, but 
actually N+1 entries follow that count!

This causes task deserialization to fail in the executor, with the stacktrace 
above.

The same issue also likely exists for files, though I haven't observed that and 
our application does not stress that codepath the same way it did for jar 
additions.

One fix here is that TaskSetManager could make an immutable copy of the jars 
list that it passes into the TaskDescription constructor, so that list doesn't 
change mid-serialization.

  was:
cc [~robert3005]

I was seeing this exception during some running Spark jobs:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

After some debugging, we determined that this is due to a race condition in 
task serde introduced in SPARK-19796.  cc [~irashid] [~kayousterhout]

The race is between adding additional jars to the SparkContext and serializing 
the 

[jira] [Created] (SPARK-21564) TaskDescription decoding failure should fail the task

2017-07-28 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21564:
--

 Summary: TaskDescription decoding failure should fail the task
 Key: SPARK-21564
 URL: https://issues.apache.org/jira/browse/SPARK-21564
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]
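
A rough sketch of what that change could look like (illustrative only; method 
names approximate the 2.2 code around the linked line):

{code:java}
// In CoarseGrainedExecutorBackend.receive
case LaunchTask(data) =>
  if (executor == null) {
    exitExecutor(1, "Received LaunchTask command but executor was null")
  } else {
    try {
      val taskDesc = TaskDescription.decode(data.value)
      logInfo("Got assigned task " + taskDesc.taskId)
      executor.launchTask(this, taskDesc)
    } catch {
      case scala.util.control.NonFatal(e) =>
        // Without a decoded task id we cannot send a per-task statusUpdate,
        // so fail fast instead of leaving the driver thinking the task runs.
        exitExecutor(1, s"Could not decode TaskDescription: ${e.getMessage}", e)
    }
  }
{code}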



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task

2017-07-28 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-21564:
---
Description: 
cc [~robert3005]

I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]

  was:
I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at java.io.DataInputStream.readUTF(DataInputStream.java:609)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
at scala.collection.immutable.Range.foreach(Range.scala:160)
at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]


> TaskDescription decoding failure should fail the task
> -
>
> Key: SPARK-21564
> URL: https://issues.apache.org/jira/browse/SPARK-21564
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - 

[jira] [Created] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)
Wei Chen created SPARK-21562:


 Summary: Spark may request extra containers if the rpc between 
YARN and spark is too fast
 Key: SPARK-21562
 URL: https://issues.apache.org/jira/browse/SPARK-21562
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Wei Chen


Hi guys,

I found an interesting problem when Spark tries to request containers from 
YARN. Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested:
{code:java}
def updateResourceRequests(): Unit = {
  val pendingAllocate = getPendingAllocate
  val numPendingAllocate = pendingAllocate.size
  val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  if (missing > 0) {
    ..
  }

  .
}
{code}


2. After the requested containers are allocated (granted through an RPC call), 
it updates the pending queue:
{code:java}
private def matchContainerToRequest(
    allocatedContainer: Container,
    location: String,
    containersToUse: ArrayBuffer[Container],
    remaining: ArrayBuffer[Container]): Unit = {
  .
  amClient.removeContainerRequest(containerRequest) // update pending queue
  .
}
{code}

3. After the allocated containers are launched, it updates the running queue:
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): Unit = {
  for (container <- containersToUse) {
    ...
    launcherPool.execute(new Runnable {
      override def run(): Unit = {
        try {
          new ExecutorRunnable(
            Some(container),
            conf,
            sparkConf,
            driverUrl,
            executorId,
            executorHostname,
            executorMemory,
            executorCores,
            appAttemptId.getApplicationId.toString,
            securityMgr,
            localResources
          ).run()
          logInfo(s"has launched $containerId")
          updateInternalState()   // update running queue
        } ...
      }
    })
  }
}
{code}



However, step 3 hands the work to a separate thread that first runs 
ExecutorRunnable and only then updates the running queue. We found it can take 
almost 1 second before updateInternalState() is called. During that window the 
state is inconsistent: the pending queue has already been updated (step 2), but 
the running queue has not, because the launching thread has not reached 
updateInternalState() yet. If an RPC call to amClient.allocate() happens inside 
this window, more executors than targetNumExecutors are requested.
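One possible mitigation, shown here only as a minimal, self-contained sketch (not the actual YarnAllocator code; all names such as {{AllocatorSketch}} and {{numLaunching}} are hypothetical), is to count a container as "launching" before handing the slow ExecutorRunnable work to the thread pool, so the next allocate() round never sees an artificially low total:

{code:scala}
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger

object AllocatorSketch {
  final case class ContainerLike(id: Int)

  private val launcherPool = Executors.newFixedThreadPool(4)
  private val numPending   = new AtomicInteger(0)
  private val numLaunching = new AtomicInteger(0)   // handed to the pool but not yet "running"
  private val numRunning   = new AtomicInteger(0)
  private val targetNum    = 1

  // The allocator would only ask YARN for this many more containers.
  def missing: Int = targetNum - numPending.get - numLaunching.get - numRunning.get

  def runAllocatedContainer(c: ContainerLike): Unit = {
    numLaunching.incrementAndGet()                  // counted immediately, not ~1s later
    launcherPool.execute(new Runnable {
      override def run(): Unit = {
        Thread.sleep(1000)                          // stand-in for ExecutorRunnable.run()
        numLaunching.decrementAndGet()
        numRunning.incrementAndGet()                // the original updateInternalState() step
      }
    })
  }

  def main(args: Array[String]): Unit = {
    numPending.set(1)                               // one container requested...
    numPending.decrementAndGet()                    // ...and granted by YARN (step 2)
    runAllocatedContainer(ContainerLike(1))         // step 3, asynchronous launch
    println(s"missing = $missing")                  // prints 0, so no extra request is made
    launcherPool.shutdown()
  }
}
{code}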

Here is an example of the problematic sequence:

{noformat}
Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After the first RPC call to amClient.allocate():
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (removed in step 2)     0

  => if there is an RPC call to amClient.allocate() here, more containers are
     requested; this is caused by the inconsistent state.

After the container is launched in step 3:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1
{noformat}

===
I found this problem because I am testing the feature of YARN's opportunistic 
containers (e.g., allocation takes 100 ms), which is much faster than guaranteed 
containers (e.g., allocation takes almost 1 s).

I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have 
misunderstood something).


Wei





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17614.
---
Resolution: Fixed

[~zwu@gmail.com] please don't reopen this

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable = "sql_demo";
> Dataset<Row> jdbcDF = sparkSession.read()
>     .jdbc(CASSANDRA_CONNECTION_URL, dbTable, connectionProperties);
> List<Row> rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-17614.
-

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
Hi guys,

I found an interesting problem when Spark tries to request containers from 
YARN. Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested:

{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{color}

3. After the allocated containers are launched, it will update the running queue
{color:red}private def runAllocatedContainers(containersToUse: 
ArrayBuffer[Container]): Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{color}



However, step 3 hands the work to a separate thread that first runs 
ExecutorRunnable and only then updates the running queue. We found it can take 
almost 1 second before updateInternalState() is called. During that window the 
state is inconsistent: the pending queue has already been updated (step 2), but 
the running queue has not, because the launching thread has not reached 
updateInternalState() yet. If an RPC call to amClient.allocate() happens inside 
this window, more executors than targetNumExecutors are requested.

Here is an example:

{noformat}
Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After the first RPC call to amClient.allocate():
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (removed in step 2)     0

  => if there is an RPC call to amClient.allocate() here, more containers are
     requested; this is caused by the inconsistent state.

After the container is launched in step 3:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1
{noformat}

===
I found this problem because I am testing the feature of YARN's opportunistic 
containers (e.g., allocation takes 100 ms), which is much faster than guaranteed 
containers (e.g., allocation takes almost 1 s).

I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have 
misunderstood something).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 
{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{color}


[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
Hi guys,

I found an interesting problem when Spark tries to request containers from 
YARN. Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested:
{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{color}

3. After the allocated containers are launched, it will update the running queue
{color:red}private def runAllocatedContainers(containersToUse: 
ArrayBuffer[Container]): Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  } 


}{color}



However, step 3 hands the work to a separate thread that first runs 
ExecutorRunnable and only then updates the running queue. We found it can take 
almost 1 second before updateInternalState() is called. During that window the 
state is inconsistent: the pending queue has already been updated (step 2), but 
the running queue has not, because the launching thread has not reached 
updateInternalState() yet. If an RPC call to amClient.allocate() happens inside 
this window, more executors than targetNumExecutors are requested.

Here is an example:

{noformat}
Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After the first RPC call to amClient.allocate():
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (removed in step 2)     0

  => if there is an RPC call to amClient.allocate() here, more containers are
     requested; this is caused by the inconsistent state.

After the container is launched in step 3:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1
{noformat}

===
I found this problem because I am testing the feature of YARN's opportunistic 
containers (e.g., allocation takes 100 ms), which is much faster than guaranteed 
containers (e.g., allocation takes almost 1 s).

I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have 
misunderstood something).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 
{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 amClient.removeContainerRequest(containerRequest) //update pending queues
  .
}
{color}

3. After the 

[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
Hi guys,

I found an interesting problem when Spark tries to request containers from 
YARN. Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested:

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, step 3 hands the work to a separate thread that first runs 
ExecutorRunnable and only then updates the running queue. We found it can take 
almost 1 second before updateInternalState() is called. During that window the 
state is inconsistent: the pending queue has already been updated (step 2), but 
the running queue has not, because the launching thread has not reached 
updateInternalState() yet. If an RPC call to amClient.allocate() happens inside 
this window, more executors than targetNumExecutors are requested.


{noformat}
Here is an example:
Initial:
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  00



After first RPC call to amClient.allocate:
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  1 0



After the first allocated container is granted by YARN
targetNumExecutors  numPendingAllocate numExecutorsRunning
1  0(is removed in step 2)  0


=>if there is a RPC call here to amClient.allocate(), then more containers 
are requested,
however this situation is caused by the inconsistent state.


After the container is launched in step 3
targetNumExecutors  numPendingAllocate numExecutorsRunning
1   01


{noformat}
===
I found this problem because I am changing requestType to test some features on 
YARN's opportunistic containers (e.g., allocation takes 100 ms), which are much 
faster than guaranteed containers (e.g., allocation takes almost 1 s).

I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have 
misunderstood something).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are 

[jira] [Resolved] (SPARK-21561) spark-streaming-kafka-010 DSteam is not pulling anything from Kafka

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21561.
---
Resolution: Invalid

This isn't a place to ask for input on your code -- you'd have to show a 
reproducible bug here that you've narrowed down

> spark-streaming-kafka-010 DSteam is not pulling anything from Kafka
> ---
>
> Key: SPARK-21561
> URL: https://issues.apache.org/jira/browse/SPARK-21561
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Vlad Badelita
>  Labels: kafka-0.10, spark-streaming
>
> I am trying to use spark-streaming-kafka-0.10 to pull messages from a kafka 
> topic(broker version 0.10). I have checked that messages are being produced 
> and used a KafkaConsumer to pull them successfully. Now, when I try to use 
> the spark streaming api, I am not getting anything. If I just use 
> KafkaUtils.createRDD and specify some offset ranges manually it works. But 
> when, I try to use createDirectStream, all the rdds are empty and when I 
> check the partition offsets it simply reports that all partitions are 0. Here 
> is what I tried:
> {code:scala}
>  val sparkConf = new SparkConf().setAppName("kafkastream")
>  val ssc = new StreamingContext(sparkConf, Seconds(3))
>  val topics = Array("my_topic")
>  val kafkaParams = Map[String, Object](
>"bootstrap.servers" -> "hostname:6667"
>"key.deserializer" -> classOf[StringDeserializer],
>"value.deserializer" -> classOf[StringDeserializer],
>"group.id" -> "my_group",
>"auto.offset.reset" -> "earliest",
>"enable.auto.commit" -> (true: java.lang.Boolean)
>  )
>  val stream = KafkaUtils.createDirectStream[String, String](
>ssc,
>PreferConsistent,
>Subscribe[String, String](topics, kafkaParams)
>  )
>  stream.foreachRDD { rdd =>
>val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>rdd.foreachPartition { iter =>
>  val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
>  println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
>}
>val rddCount = rdd.count()
>println("rdd count: ", rddCount)
>// stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
>  }
>  ssc.start()
>  ssc.awaitTermination()
> {code}
> All partitions show offset ranges from 0 to 0 and all rdds are empty. I would 
> like it to start from the beginning of a partition but also pick up 
> everything that is being produced to it.
> I have also tried using spark-streaming-kafka-0.8 and it does work. I think 
> it is a 0.10 issue because everything else works fine. Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105439#comment-16105439
 ] 

Paul Wu commented on SPARK-17614:
-

Oh, sorry. I thought I could use a query here as I do with other RDBMSs. Things 
become more complicated for this Cassandra case now that I have thought about it 
more. I'll accept your comment.
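For anyone hitting the same wall, a possible workaround (a hedged sketch only, assuming the {{JdbcDialect.getSchemaQuery}} hook that appears to exist in Spark 2.1+) is to register a Cassandra-specific dialect whose schema-probe query avoids the {{WHERE 1=0}} form. The {{jdbc:cassandra:}} URL prefix and the {{LIMIT 1}} probe are assumptions to verify against the wrapper driver, and this does not make subquery-style table expressions such as {{(select * from emp)}} work, since CQL cannot execute those at all.

{code:scala}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hedged sketch: a dialect for the Cassandra JDBC wrapper whose schema probe
// fetches at most one row instead of issuing the unsupported "WHERE 1=0" query.
object CassandraDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:cassandra:")   // assumed URL prefix, verify for your driver

  override def getSchemaQuery(table: String): String =
    s"SELECT * FROM $table LIMIT 1"
}

// Register once before calling sparkSession.read().jdbc(...):
// JdbcDialects.registerDialect(CassandraDialectSketch)
{code}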

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19938) java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field

2017-07-28 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105449#comment-16105449
 ] 

Ryan Blue commented on SPARK-19938:
---

[~snavatski], I just hit this problem also and found out what causes it. 
There's a [similar issue with Java serialization on a SO 
question|https://stackoverflow.com/questions/9110677/readresolve-not-working-an-instance-of-guavas-serializedform-appears/18647941]
 that I found helpful.

The cause of this is one of two problems during deserialization:

# The classloader can't find the class of objects in the list
# The classloader used by Java deserialization differs from the one that loaded 
the class of objects in the list

These cases end up causing the deserialization code to take a path where 
{{readResolve}} isn't called on the list's {{SerializationProxy}}. When the 
list is set on the object that contains it, the type doesn't match and you get 
this exception.

To fix this problem, check the following things:

* Make sure Jars loaded on the driver are in the executor's classpath
* Make sure Jars provided by Spark aren't included in your application (to 
avoid loading with different classloaders).
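A quick way to check the second point from inside a job is to compare the loader that defined the problematic class with the context classloader of the task thread; if they differ, you are likely in case 2. A hedged debugging sketch follows ({{com.example.MyRecord}} is a placeholder for the class of the objects in the list):

{code:scala}
// Debugging sketch only: report which classloaders are in play on the executor.
object ClassLoaderCheck {
  def report(clazz: Class[_]): String = {
    val ownLoader = clazz.getClassLoader
    val ctxLoader = Thread.currentThread().getContextClassLoader
    s"class=${clazz.getName} ownLoader=$ownLoader ctxLoader=$ctxLoader same=${ownLoader eq ctxLoader}"
  }
}

// Usage inside a task, e.g.:
// rdd.foreachPartition { _ => println(ClassLoaderCheck.report(Class.forName("com.example.MyRecord"))) }
{code}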


> java.lang.ClassCastException: cannot assign instance of 
> scala.collection.immutable.List$SerializationProxy to field
> ---
>
> Key: SPARK-19938
> URL: https://issues.apache.org/jira/browse/SPARK-19938
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.0.2
>Reporter: srinivas thallam
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105402#comment-16105402
 ] 

Paul Wu edited comment on SPARK-17614 at 7/28/17 6:09 PM:
--

The fix does not support the syntax  like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}


was (Author: zwu@gmail.com):
The fix does not support the syntax on the syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   

[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105425#comment-16105425
 ] 

Paul Wu commented on SPARK-17614:
-

So should I create a new issue? Or is this not an issue to you?

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-07-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21523:
--
Target Version/s: 2.2.1
Priority: Critical  (was: Minor)

I think this is fairly critical actually -- would like to get this into a 2.2.1 
release.

> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Critical
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need to merge this breeze bugfix into Spark because it influences a series 
> of algorithms in MLlib that use LBFGS.
> https://github.com/scalanlp/breeze/pull/651



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21546) dropDuplicates with watermark yields RuntimeException due to binding failure

2017-07-28 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105426#comment-16105426
 ] 

Shixiong Zhu commented on SPARK-21546:
--

Yeah, good catch. The watermark column should be one of the dropDuplicates 
columns. Otherwise, it never evicts states.
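Concretely, a hedged sketch based on the snippet quoted below (not a tested fix for this exact build): include the watermark column in {{dropDuplicates}} so the deduplication state can be evicted once the watermark passes.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Assumes the same `topic1` streaming DataFrame as in the issue below.
def dedupWithWatermark(topic1: DataFrame): DataFrame =
  topic1
    .withColumn("eventtime", col("timestamp"))
    .withWatermark("eventtime", "30 seconds")
    .dropDuplicates("value", "eventtime")   // watermark column included => state can be evicted
{code}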

> dropDuplicates with watermark yields RuntimeException due to binding failure
> 
>
> Key: SPARK-21546
> URL: https://issues.apache.org/jira/browse/SPARK-21546
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>
> With today's master...
> The following streaming query with watermark and {{dropDuplicates}} yields 
> {{RuntimeException}} due to failure in binding.
> {code}
> val topic1 = spark.
>   readStream.
>   format("kafka").
>   option("subscribe", "topic1").
>   option("kafka.bootstrap.servers", "localhost:9092").
>   option("startingoffsets", "earliest").
>   load
> val records = topic1.
>   withColumn("eventtime", 'timestamp).  // <-- just to put the right name 
> given the purpose
>   withWatermark(eventTime = "eventtime", delayThreshold = "30 seconds"). // 
> <-- use the renamed eventtime column
>   dropDuplicates("value").  // dropDuplicates will use watermark
> // only when eventTime column exists
>   // include the watermark column => internal design leak?
>   select('key cast "string", 'value cast "string", 'eventtime).
>   as[(String, String, java.sql.Timestamp)]
> scala> records.explain
> == Physical Plan ==
> *Project [cast(key#0 as string) AS key#169, cast(value#1 as string) AS 
> value#170, eventtime#157-T3ms]
> +- StreamingDeduplicate [value#1], 
> StatefulOperatorStateInfo(,93c3de98-3f85-41a4-8aef-d09caf8ea693,0,0),
>  0
>+- Exchange hashpartitioning(value#1, 200)
>   +- EventTimeWatermark eventtime#157: timestamp, interval 30 seconds
>  +- *Project [key#0, value#1, timestamp#5 AS eventtime#157]
> +- StreamingRelation kafka, [key#0, value#1, topic#2, 
> partition#3, offset#4L, timestamp#5, timestampType#6]
> import org.apache.spark.sql.streaming.{OutputMode, Trigger}
> val sq = records.
>   writeStream.
>   format("console").
>   option("truncate", false).
>   trigger(Trigger.ProcessingTime("10 seconds")).
>   queryName("from-kafka-topic1-to-console").
>   outputMode(OutputMode.Update).
>   start
> {code}
> {code}
> ---
> Batch: 0
> ---
> 17/07/27 10:28:58 ERROR Executor: Exception in task 3.0 in stage 13.0 (TID 
> 438)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: eventtime#157-T3ms
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:977)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:370)
>   at 
> 

[jira] [Commented] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-07-28 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105427#comment-16105427
 ] 

Mark Grover commented on SPARK-19720:
-

I wasn't planning on it. One could argue that this could be backported to 
branch-2.1, given that it's a rather simple change. However, 2.2 brought in some 
changes that were long overdue (dropping support for Java 7 and Hadoop 2.5), and 
even if we got this change backported, you wouldn't be able to make use of the 
goodness down the road if you didn't upgrade to Hadoop 2.6, Java 8, etc. So my 
recommendation here would be to brave the new world of Hadoop 2.6.

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually expose SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21562) Spark may request extra containers if the rpc between YARN and spark is too fast

2017-07-28 Thread Wei Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Chen updated SPARK-21562:
-
Description: 
Hi guys,

I found an interesting problem when Spark tries to request containers from 
YARN. Here is the case:

In YarnAllocator.scala

1. This function requests containers from YARN only if some executors have not 
yet been requested:

{code:java}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}
{code}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{code:java}
private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{code}

3. After the allocated containers are launched, it will update the running queue
{code:java}
private def runAllocatedContainers(containersToUse: ArrayBuffer[Container]): 
Unit = {
for (container <- containersToUse) {
 
launcherPool.execute(new Runnable {
override def run(): Unit = {
  try {
new ExecutorRunnable(
  Some(container),
  conf,
  sparkConf,
  driverUrl,
  executorId,
  executorHostname,
  executorMemory,
  executorCores,
  appAttemptId.getApplicationId.toString,
  securityMgr,
  localResources
).run()
logInfo(s"has launched $containerId")
updateInternalState()   //update running queues
 
  


} 


}{code}



However, step 3 hands the work to a separate thread that first runs 
ExecutorRunnable and only then updates the running queue. We found it can take 
almost 1 second before updateInternalState() is called. During that window the 
state is inconsistent: the pending queue has already been updated (step 2), but 
the running queue has not, because the launching thread has not reached 
updateInternalState() yet. If an RPC call to amClient.allocate() happens inside 
this window, more executors than targetNumExecutors are requested.

Here is an example:

{noformat}
Initial:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         0

After the first RPC call to amClient.allocate():
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    1                         0

After the first allocated container is granted by YARN:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0 (removed in step 2)     0

  => if there is an RPC call to amClient.allocate() here, more containers are
     requested; this is caused by the inconsistent state.

After the container is launched in step 3:
targetNumExecutors   numPendingAllocate        numExecutorsRunning
1                    0                         1
{noformat}

===
I found this problem because I am testing a feature on YARN's opportunistic 
containers (e.g., allocation takes 100 ms), which are much faster than guaranteed 
containers (e.g., allocation takes almost 1 s).

I am not sure whether my understanding is correct.
I would appreciate anyone's help on this issue (please correct me if I have 
misunderstood something).


Wei



  was:
hi huys,

I find an interesting problem when spark tries to request containers from YARN. 
Here is the case:

In YarnAllocator.scala

1. this function requests container from YARN only if there are executors are 
not be requested. 

{color:red}def updateResourceRequests(): Unit = {
val pendingAllocate = getPendingAllocate
val numPendingAllocate = pendingAllocate.size
val missing = targetNumExecutors - numPendingAllocate - numExecutorsRunning

  
if (missing > 0) {
 ..
}

  .
}{color}


2. After the requested containers are allocated(granted through RPC), then it 
will update the pending queues
  
{color:red}private def matchContainerToRequest(
  allocatedContainer: Container,
  location: String,
  containersToUse: ArrayBuffer[Container],
  remaining: ArrayBuffer[Container]): Unit = {
  .
 


   amClient.removeContainerRequest(containerRequest) //update pending queues
   


   .
}
{color}

[jira] [Comment Edited] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105402#comment-16105402
 ] 

Paul Wu edited comment on SPARK-17614 at 7/28/17 6:08 PM:
--

The fix does not support the syntax on the syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}


was (Author: zwu@gmail.com):
The fix does not support the syntax on the syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset<Row> jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List<Row> rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> 

[jira] [Reopened] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu reopened SPARK-17614:
-

The fix does not support syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}
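
For context: the failure above follows from how the JDBC source resolves a schema. The {{dbtable}} argument is interpolated into a probe query, so {{"(select * from emp)"}} becomes roughly {{SELECT * FROM (select * from emp) WHERE 1=0}}, and CQL supports neither the {{WHERE 1=0}} probe nor derived tables, so the subquery form cannot work against Cassandra even with a custom probe. Below is a minimal, hypothetical sketch (not the shipped fix) of a dialect that overrides the probe for the plain-table case, assuming the {{JdbcDialect}}/{{JdbcDialects}} developer API and that the Cassandra JDBC wrapper accepts a LIMIT-based probe:

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical sketch: replace the default schema-probe query
// ("SELECT * FROM <table> WHERE 1=0") for the Cassandra JDBC wrapper.
object CassandraDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:cassandra")

  // Assumption: the wrapper accepts a LIMIT-based probe; adjust if it does not.
  override def getSchemaQuery(table: String): String =
    s"SELECT * FROM $table LIMIT 1"
}

JdbcDialects.registerDialect(CassandraDialect)
{code}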

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset<Row> jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List<Row> rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2017-07-28 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105402#comment-16105402
 ] 

Paul Wu edited comment on SPARK-17614 at 7/28/17 6:14 PM:
--

The fix does not support syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp where empid>2)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}


was (Author: zwu@gmail.com):
The fix does not support syntax like this:

{{.jdbc(JDBC_URL, "(select * from emp)", connectionProperties);}}

Here is the stack trace:

{{Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:14 no viable 
alternative at input 'select' (SELECT * from [select]...)
at 
com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
at 
com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:60)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:113)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at 
com.att.cass.proto.CassJDBCWithSpark.main(CassJDBCWithSpark.java:44)}}

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Sean Owen
>Priority: Minor
>  Labels: cassandra-jdbc, sql
> Fix For: 2.1.0
>
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset<Row> jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List<Row> rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   

[jira] [Created] (SPARK-21569) Internal Spark class needs to be kryo-registered

2017-07-28 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-21569:
-

 Summary: Internal Spark class needs to be kryo-registered
 Key: SPARK-21569
 URL: https://issues.apache.org/jira/browse/SPARK-21569
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Ryan Williams


[Full repro here|https://github.com/ryan-williams/spark-bugs/tree/hf]

As of 2.2.0, {{saveAsNewAPIHadoopFile}} jobs fail (when 
{{spark.kryo.registrationRequired=true}}) with:

{code}
java.lang.IllegalArgumentException: Class is not registered: 
org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
Note: To register this class use: 
kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:458)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:488)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:593)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

This internal Spark class should be kryo-registered by Spark by default.

This was not a problem in 2.1.1.
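
Until Spark registers it by default, a possible user-side workaround is to register the class by name (a minimal sketch, assuming the class name reported in the error above):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Register the internal class explicitly so jobs that set
// spark.kryo.registrationRequired=true keep working on 2.2.0.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage")))

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}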



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21570) File __spark_libs__XXX.zip does not exist on networked file system w/ yarn

2017-07-28 Thread Albert Chu (JIRA)
Albert Chu created SPARK-21570:
--

 Summary: File __spark_libs__XXX.zip does not exist on networked 
file system w/ yarn
 Key: SPARK-21570
 URL: https://issues.apache.org/jira/browse/SPARK-21570
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Albert Chu


I have a set of scripts that run Spark with data in a networked file system.  
One of my unit tests to make sure things don't break between Spark releases is 
to simply run a word count (via org.apache.spark.examples.JavaWordCount) on a 
file in the networked file system.  This test broke with Spark 2.2.0 when I use 
yarn to launch the job (using the spark standalone scheduler things still 
work).  I'm currently using Hadoop 2.7.0.  I get the following error:

{noformat}
Diagnostics: File 
file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
 does not exist
java.io.FileNotFoundException: File 
file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
{noformat}

While debugging, I sat and watched the directory and did see that 
/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
 does show up at some point.

Wondering if it's possible something racy was introduced.  Nothing in the Spark 
2.2.0 release notes suggests any type of configuration change that needs to be 
done.

Thanks





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21570) File __spark_libs__XXX.zip does not exist on networked file system w/ yarn

2017-07-28 Thread Albert Chu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105900#comment-16105900
 ] 

Albert Chu commented on SPARK-21570:


Oh, and because it will likely be asked and may be relevant: in this test setup, HDFS is not used at all.

{noformat}

  fs.defaultFS
  file:///

{noformat}

All temp dirs, staging dirs, etc. are configured to appropriate locations in 
/tmp or somewhere in the networked file system.

> File __spark_libs__XXX.zip does not exist on networked file system w/ yarn
> --
>
> Key: SPARK-21570
> URL: https://issues.apache.org/jira/browse/SPARK-21570
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Albert Chu
>
> I have a set of scripts that run Spark with data in a networked file system.  
> One of my unit tests to make sure things don't break between Spark releases 
> is to simply run a word count (via org.apache.spark.examples.JavaWordCount) 
> on a file in the networked file system.  This test broke with Spark 2.2.0 
> when I use yarn to launch the job (using the spark standalone scheduler 
> things still work).  I'm currently using Hadoop 2.7.0.  I get the following 
> error:
> {noformat}
> Diagnostics: File 
> file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does not exist
> java.io.FileNotFoundException: File 
> file:/p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:819)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:596)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
> While debugging, I sat and watched the directory and did see that 
> /p/lcratery/achu/testing/rawnetworkfs/test/1181015/node-0/spark/node-0/spark-292938be-7ae3-460f-aca7-294083ebb790/__spark_libs__695301535722158702.zip
>  does show up at some point.
> Wondering if it's possible something racy was introduced.  Nothing in the 
> Spark 2.2.0 release notes suggests any type of configuration change that 
> needs to be done.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-28 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105919#comment-16105919
 ] 

Liang-Chi Hsieh commented on SPARK-21274:
-

[~Tagar] I've tried the query on PostgreSQL; the result of [1, 2, 2] intersect_all [1, 2] is [1, 2], so I think it's correct?

How do we know we need to change the tables when rewriting the intersect query?
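
A minimal DataFrame sketch (an illustration, not part of the proposal) of the multiset semantics being discussed: each row is kept min(count in tab1, count in tab2) times, which reproduces the PostgreSQL result above.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, least, lit}

val spark = SparkSession.builder().master("local[*]").appName("intersect-all-sketch").getOrCreate()
import spark.implicits._

val t1 = Seq(1, 2, 2).toDF("a")
val t2 = Seq(1, 2).toDF("a")

// Per-value counts on both sides; INTERSECT ALL keeps min(c1, c2) copies.
val counts = t1.groupBy("a").agg(count(lit(1)).as("c1"))
  .join(t2.groupBy("a").agg(count(lit(1)).as("c2")), "a")
  .select($"a", least($"c1", $"c2").as("n"))

// Expand each value n times: [1, 2, 2] INTERSECT ALL [1, 2] -> [1, 2]
val intersectAll = counts.flatMap(r => Seq.fill(r.getLong(1).toInt)(r.getInt(0))).toDF("a")
intersectAll.show()
{code}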

> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following outer join:
> {code}
> SELECT a,b,c
> FROM tab1 t1
>  LEFT OUTER JOIN 
> tab2 t2
>  ON (
> (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>  )
> WHERE
> COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register this second query as a temp view named "*t1_except_t2_df*"; it can
> also be used to compute INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as the following anti-join, using the t1_except_t2_df we
> defined above:
> {code}
> SELECT a,b,c
> FROM tab1 t1
> WHERE 
>NOT EXISTS
>(SELECT 1
> FROM t1_except_t2_df e
> WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>)
> {code}
> So the suggestion is just to use the above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21541) Spark Logs show incorrect job status for a job that does not create SparkContext

2017-07-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16105049#comment-16105049
 ] 

Sean Owen commented on SPARK-21541:
---

Was this change merged? I don't think it was: 
https://github.com/apache/spark/pull/18741

> Spark Logs show incorrect job status for a job that does not create 
> SparkContext
> 
>
> Key: SPARK-21541
> URL: https://issues.apache.org/jira/browse/SPARK-21541
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Minor
> Fix For: 2.3.0
>
>
> If you run a spark job without creating the SparkSession or SparkContext, the 
> spark job logs say it succeeded but yarn says it fails and retries 3 times. 
> Also, since, Application Master unregisters with Resource Manager and exits 
> successfully, it deletes the spark staging directory, so when yarn makes 
> subsequent retries, it fails to find the staging directory and thus, the 
> retries fail.
> *Steps:*
> 1. For example, run a pyspark job without creating SparkSession or 
> SparkContext. 
> *Example:*
> import sys
> from random import random
> from operator import add
> from pyspark import SparkContext
> if __name__ == "__main__":
>   print("hello world")
> 2. Spark will mark it as FAILED. Go to the UI and check the container logs.
> 3. You will see the following information in the logs:
> spark:
> 17/07/14 13:22:10 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/07/14 13:22:10 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> But yarn logs will show:
> 2017-07-14 01:14:33,203 [AsyncDispatcher event handler] INFO 
> attempt.RMAppAttemptImpl: appattempt_1493735952617_12443844_01 State 
> change from FINAL_SAVING to FAILED



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20090) Add StructType.fieldNames to Python API

2017-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-20090.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18618
[https://github.com/apache/spark/pull/18618]

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.
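
For reference, a minimal sketch of the existing Scala API whose Python counterpart this issue adds:

{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Scala/Java already expose fieldNames on StructType; this issue adds the same to PySpark.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))

schema.fieldNames.foreach(println)  // prints: id, name
{code}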



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20090) Add StructType.fieldNames to Python API

2017-07-28 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-20090:
---

Assignee: Hyukjin Kwon

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


