[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-30 Thread Evelyn Bayes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598284#comment-16598284
 ] 

Evelyn Bayes commented on SPARK-25150:
--

Sorry, my attachment doesn't want to stick. I'll give it another try.

 

[^zombie-analysis.py]

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> a left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.
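For illustration, here is a minimal sketch (spark-shell Scala, made-up column names; the actual reproduction is the attached zombie-analysis.py) of the query shape being described, where two DataFrames derived from the same source are joined back to a third:

{code:scala}
// Hypothetical sketch of the reported pattern, not the attached reproduction.
// B1 and B2 are both derived from the same DataFrame B and then joined to A.
import spark.implicits._

val a = spark.read.option("header", "true").option("inferSchema", "true").csv("states.csv")   // "A"
val b = spark.read.option("header", "true").option("inferSchema", "true").csv("persons.csv")  // "B"

val b1 = b.groupBy("state_id").count()                      // derived "B1"
val b2 = b.filter($"age" > 21).groupBy("state_id").count()  // derived "B2"

// Joining A to both derived DataFrames is the step that reportedly triggers the
// "Join condition is missing or trivial" error quoted above.
val joined = a
  .join(b1, a("id") === b1("state_id"))
  .join(b2, a("id") === b2("state_id"))
{code}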



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-30 Thread Evelyn Bayes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598275#comment-16598275
 ] 

Evelyn Bayes edited comment on SPARK-25150 at 8/31/18 5:55 AM:
---

I'd love the chance to patch this bug.

I've included a simplified version of the Python script which reproduces it; if 
you switch the second join for the commented-out join, it works as it should. 
[^zombie-analysis.py]

What's happening is that the analyzer re-aliases the right side of the join 
because the left and right sides refer to the same base column. When it does 
this it renames all the columns on the right side of the join to the new alias, 
except the column which is actually part of the join condition.

Then, because the join condition refers to the column which hasn't been updated, 
it now resolves to the left side of the join. So Spark does a cartesian join of 
the left side with itself and straps the right side of the join on the end.

The part of the code which is doing the renaming is:
[https://github.com/apache/spark/blob/v2.3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala]
It's using ResolveReferences.dedupRight which, as the name suggests, just 
de-duplicates the right-side references from the left side (this might be a 
naive understanding of it).

Then, if you just alias one of these columns, it's fine. But that really 
shouldn't be required for the logical plan to be accurate.
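As a rough sketch of the aliasing workaround mentioned above (reusing the made-up schema from the sketch earlier in this thread; the real script is the attached zombie-analysis.py):

{code:scala}
// Hypothetical sketch of the workaround described above: renaming the join column
// on one derived DataFrame gives the analyzer a distinct attribute to resolve, so
// the join condition no longer falls back to the left side.
val b2Renamed = b2.withColumnRenamed("state_id", "state_id_2")

val fixed = a
  .join(b1, a("id") === b1("state_id"))
  .join(b2Renamed, a("id") === b2Renamed("state_id_2"))
{code}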

 

 


was (Author: eeveeb):
I'd love the chance to patch this bug.

I've included a simplified version of the Python script which reproduces it; if 
you switch the second join for the commented-out join, it works as it should.

 

 

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> a left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-08-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598277#comment-16598277
 ] 

Hyukjin Kwon commented on SPARK-25293:
--

Does this describe a question, or a bug? If it's a question, it would be better 
to ask it on the mailing list. It might be better to leave this resolved until 
it's clear whether it's a bug or not.

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2
>Reporter: omkar puttagunta
>Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple Test; reading pipe delimited file and writing data to csv. Commands 
> below are executed in spark-shell with master-url set
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the cmds above in spark-shell with master url set.
> {quote}In {{worker1}} -> Each part file is created 
> in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under outputDirectory specified during write.
> {quote}
> *Same thing happens with coalesce(100) or without specifying 
> repartition/coalesce!!! Tried with Java also!*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
> created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-08-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25293:
-
Priority: Major  (was: Critical)

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2
>Reporter: omkar puttagunta
>Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple Test; reading pipe delimited file and writing data to csv. Commands 
> below are executed in spark-shell with master-url set
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the cmds above in spark-shell with master url set.
> {quote}In {{worker1}} -> Each part file is created 
> in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under outputDirectory specified during write.
> {quote}
> *Same thing happens with coalesce(100) or without specifying 
> repartition/coalesce!!! Tried with Java also!*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
> created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-08-30 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598276#comment-16598276
 ] 

Hyukjin Kwon commented on SPARK-25293:
--

Please avoid setting Critical+, which is usually reserved for committers.

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2
>Reporter: omkar puttagunta
>Priority: Major
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple Test; reading pipe delimited file and writing data to csv. Commands 
> below are executed in spark-shell with master-url set
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the cmds above in spark-shell with master url set.
> {quote}In {{worker1}} -> Each part file is created 
> in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under outputDirectory specified during write.
> {quote}
> *Same thing happens with coalesce(100) or without specifying 
> repartition/coalesce!!! Tried with Java also!*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
> created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25292) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-08-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25292.
--
Resolution: Duplicate

> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25292
> URL: https://issues.apache.org/jira/browse/SPARK-25292
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2
>Reporter: omkar puttagunta
>Priority: Critical
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node
> {quote}
> Simple Test; reading pipe delimited file and writing data to csv. Commands 
> below are executed in spark-shell with master-url set
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the cmds above in spark-shell with master url set.
> {quote}In {{worker1}} -> Each part file is created 
> in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
> In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under outputDirectory specified during write.
> {quote}
> *Same thing happens with coalesce(100) or without specifying 
> repartition/coalesce!!!*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
> created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25150) Joining DataFrames derived from the same source yields confusing/incorrect results

2018-08-30 Thread Evelyn Bayes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598275#comment-16598275
 ] 

Evelyn Bayes commented on SPARK-25150:
--

I'd love the chance to patch this bug.

I've included a simplified version of the Python script which reproduces it; if 
you switch the second join for the commented-out join, it works as it should.

 

 

> Joining DataFrames derived from the same source yields confusing/incorrect 
> results
> --
>
> Key: SPARK-25150
> URL: https://issues.apache.org/jira/browse/SPARK-25150
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nicholas Chammas
>Priority: Major
> Attachments: output-with-implicit-cross-join.txt, 
> output-without-implicit-cross-join.txt, persons.csv, states.csv, 
> zombie-analysis.py
>
>
> I have two DataFrames, A and B. From B, I have derived two additional 
> DataFrames, B1 and B2. When joining A to B1 and B2, I'm getting a very 
> confusing error:
> {code:java}
> Join condition is missing or trivial.
> Either: use the CROSS JOIN syntax to allow cartesian products between these
> relations, or: enable implicit cartesian products by setting the configuration
> variable spark.sql.crossJoin.enabled=true;
> {code}
> Then, when I configure "spark.sql.crossJoin.enabled=true" as instructed, 
> Spark appears to give me incorrect answers.
> I am not sure if I am missing something obvious, or if there is some kind of 
> bug here. The "join condition is missing" error is confusing and doesn't make 
> sense to me, and the seemingly incorrect output is concerning.
> I've attached a reproduction, along with the output I'm seeing with and 
> without the implicit cross join enabled.
> I realize the join I've written is not correct in the sense that it should be 
> a left outer join instead of an inner join (since some of the aggregates are 
> not available for all states), but that doesn't explain Spark's behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23789) Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider

2018-08-30 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-23789.
-
Resolution: Duplicate

> Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider
> -
>
> Key: SPARK-23789
> URL: https://issues.apache.org/jira/browse/SPARK-23789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 18/03/23 23:33:35 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
> longer has any effect.  Use hive.hmshandler.retry.* instead
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.attempts does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.interval does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.server2.enable.impersonation does not exist
> 18/03/23 23:33:35 INFO metastore: Trying to connect to metastore with URI 
> thrift://metastore.com:9083
> 18/03/23 23:33:35 ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:124)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
>   at 

[jira] [Reopened] (SPARK-23789) Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider

2018-08-30 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reopened SPARK-23789:
-

> Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider
> -
>
> Key: SPARK-23789
> URL: https://issues.apache.org/jira/browse/SPARK-23789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 18/03/23 23:33:35 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
> longer has any effect.  Use hive.hmshandler.retry.* instead
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.attempts does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.interval does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.server2.enable.impersonation does not exist
> 18/03/23 23:33:35 INFO metastore: Trying to connect to metastore with URI 
> thrift://metastore.com:9083
> 18/03/23 23:33:35 ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:124)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
>   at 

[jira] [Commented] (SPARK-23789) Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider

2018-08-30 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598259#comment-16598259
 ] 

Yuming Wang commented on SPARK-23789:
-

Fixed by https://github.com/apache/spark/pull/20784

> Shouldn't set hive.metastore.uris before invoking HiveDelegationTokenProvider
> -
>
> Key: SPARK-23789
> URL: https://issues.apache.org/jira/browse/SPARK-23789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> 18/03/23 23:33:35 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no 
> longer has any effect.  Use hive.hmshandler.retry.* instead
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name hive.metastore.local does 
> not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.attempts does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.metastore.ds.retry.interval does not exist
> 18/03/23 23:33:35 WARN HiveConf: HiveConf of name 
> hive.server2.enable.impersonation does not exist
> 18/03/23 23:33:35 INFO metastore: Trying to connect to metastore with URI 
> thrift://metastore.com:9083
> 18/03/23 23:33:35 ERROR TSaslTransport: SASL negotiation failure
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>   at 
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>   at 
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>   at 
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>   at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>   at 
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>   at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:124)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
>   at 
> 

[jira] [Updated] (SPARK-25282) Fix support for spark-shell with K8s

2018-08-30 Thread Prashant Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-25282:

Description: 
Spark shell, when run with a Kubernetes master, gives the following errors.
{noformat}
java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local 
class incompatible: stream classdesc serialVersionUID = -3720498261147521051, 
local class serialVersionUID = -6655865447853211720
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)

{noformat}

Special care was taken to ensure that the same compiled jar was used both in the 
images and on the host system, i.e. the system running the driver.

This issue affects the PySpark and R interfaces as well.

  was:
Spark shell, when run with a Kubernetes master, gives the following errors.
{noformat}
java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local 
class incompatible: stream classdesc serialVersionUID = -3720498261147521051, 
local class serialVersionUID = -6655865447853211720
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)

{noformat}

Special care was taken to ensure that the same compiled jar was used both in the 
images and on the host system, i.e. the system running the driver.


> Fix support for spark-shell with K8s
> 
>
> Key: SPARK-25282
> URL: https://issues.apache.org/jira/browse/SPARK-25282
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Spark shell, when run with a Kubernetes master, gives the following errors.
> {noformat}
> java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local 
> class incompatible: stream classdesc serialVersionUID = -3720498261147521051, 
> local class serialVersionUID = -6655865447853211720
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
> {noformat}
> Special care was taken to ensure that the same compiled jar was used both in 
> the images and on the host system, i.e. the system running the driver.
> This issue affects the PySpark and R interfaces as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-08-30 Thread omkar puttagunta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

omkar puttagunta updated SPARK-25293:
-
Description: 
[https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
{quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
node on AWS EC2
{quote}
Simple Test; reading pipe delimited file and writing data to csv. Commands 
below are executed in spark-shell with master-url set

{{val df = 
spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
 val emailDf=df.filter("_c3='EML'") 
emailDf.repartition(100).write.csv("/opt/outputFile/")}}

After executing the cmds above in spark-shell with master url set.
{quote}In {{worker1}} -> Each part file is created 
in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
 In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
directly under outputDirectory specified during write.
{quote}
*Same thing happens with coalesce(100) or without specifying 
repartition/coalesce!!! Tried with Java also!*

*_Question_*

1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
{{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?

  was:
[https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
{quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
node
{quote}
Simple Test; reading pipe delimited file and writing data to csv. Commands 
below are executed in spark-shell with master-url set

{{val df = 
spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
 val emailDf=df.filter("_c3='EML'") 
emailDf.repartition(100).write.csv("/opt/outputFile/")}}

After executing the cmds above in spark-shell with master url set.
{quote}In {{worker1}} -> Each part file is created 
in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
directly under outputDirectory specified during write.
{quote}
*Same thing happens with coalesce(100) or without specifying 
repartition/coalesce!!!*

*_Question_*

1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
{{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?


> Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx 
> instead of directly saving in outputDir
> --
>
> Key: SPARK-25293
> URL: https://issues.apache.org/jira/browse/SPARK-25293
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Java API, Spark Shell, Spark Submit
>Affects Versions: 2.0.2
>Reporter: omkar puttagunta
>Priority: Critical
>
> [https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
> {quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
> node on AWS EC2
> {quote}
> Simple Test; reading pipe delimited file and writing data to csv. Commands 
> below are executed in spark-shell with master-url set
> {{val df = 
> spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
>  val emailDf=df.filter("_c3='EML'") 
> emailDf.repartition(100).write.csv("/opt/outputFile/")}}
> After executing the cmds above in spark-shell with master url set.
> {quote}In {{worker1}} -> Each part file is created 
> in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
>  In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
> directly under outputDirectory specified during write.
> {quote}
> *Same thing happens with coalesce(100) or without specifying 
> repartition/coalesce!!! Tried with Java also!*
> *_Question_*
> 1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
> {{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
> created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25293) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-08-30 Thread omkar puttagunta (JIRA)
omkar puttagunta created SPARK-25293:


 Summary: Dataframe write to csv saves part files in 
outputDirectory/task-xx/part-xxx instead of directly saving in outputDir
 Key: SPARK-25293
 URL: https://issues.apache.org/jira/browse/SPARK-25293
 Project: Spark
  Issue Type: Bug
  Components: EC2, Java API, Spark Shell, Spark Submit
Affects Versions: 2.0.2
Reporter: omkar puttagunta


[https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
{quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
node
{quote}
Simple Test; reading pipe delimited file and writing data to csv. Commands 
below are executed in spark-shell with master-url set

{{val df = 
spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
 val emailDf=df.filter("_c3='EML'") 
emailDf.repartition(100).write.csv("/opt/outputFile/")}}

After executing the cmds above in spark-shell with master url set.
{quote}In {{worker1}} -> Each part file is created 
in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
directly under outputDirectory specified during write.
{quote}
*Same thing happens with coalesce(100) or without specifying 
repartition/coalesce!!!*

*_Question_*

1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
{{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598245#comment-16598245
 ] 

Apache Spark commented on SPARK-25021:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22298

> Add spark.executor.pyspark.memory support to Kubernetes
> ---
>
> Key: SPARK-25021
> URL: https://issues.apache.org/jira/browse/SPARK-25021
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
> allocation for PySpark and updates YARN to add this memory to its container 
> requests. Kubernetes should do something similar to account for the python 
> memory allocation.
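For reference, a hypothetical usage sketch of the setting (the spark.executor.pyspark.memory key comes from SPARK-25004; the master URL, app name, and sizes below are made up):

{code:scala}
// Hypothetical usage sketch, not the proposed Kubernetes implementation: how a
// user would request the extra PySpark memory once the K8s backend accounts for it.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("pyspark-memory-example")
  .master("k8s://https://example-apiserver:6443")    // made-up K8s master URL
  .config("spark.executor.memory", "2g")             // JVM heap for the executor
  .config("spark.executor.pyspark.memory", "512m")   // Python worker memory the K8s
                                                     // backend should add to the pod request
  .getOrCreate()
{code}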



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25021:


Assignee: (was: Apache Spark)

> Add spark.executor.pyspark.memory support to Kubernetes
> ---
>
> Key: SPARK-25021
> URL: https://issues.apache.org/jira/browse/SPARK-25021
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
> allocation for PySpark and updates YARN to add this memory to its container 
> requests. Kubernetes should do something similar to account for the python 
> memory allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25021) Add spark.executor.pyspark.memory support to Kubernetes

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25021:


Assignee: Apache Spark

> Add spark.executor.pyspark.memory support to Kubernetes
> ---
>
> Key: SPARK-25021
> URL: https://issues.apache.org/jira/browse/SPARK-25021
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-25004 adds {{spark.executor.pyspark.memory}} to control the memory 
> allocation for PySpark and updates YARN to add this memory to its container 
> requests. Kubernetes should do something similar to account for the python 
> memory allocation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25292) Dataframe write to csv saves part files in outputDirectory/task-xx/part-xxx instead of directly saving in outputDir

2018-08-30 Thread omkar puttagunta (JIRA)
omkar puttagunta created SPARK-25292:


 Summary: Dataframe write to csv saves part files in 
outputDirectory/task-xx/part-xxx instead of directly saving in outputDir
 Key: SPARK-25292
 URL: https://issues.apache.org/jira/browse/SPARK-25292
 Project: Spark
  Issue Type: Bug
  Components: EC2, Java API, Spark Shell, Spark Submit
Affects Versions: 2.0.2
Reporter: omkar puttagunta


[https://stackoverflow.com/questions/52108335/why-spark-dataframe-writes-part-files-to-temporary-in-instead-directly-creating]
{quote}Running Spark 2.0.2 in Standalone Cluster Mode; 2 workers and 1 master 
node
{quote}
Simple Test; reading pipe delimited file and writing data to csv. Commands 
below are executed in spark-shell with master-url set

{{val df = 
spark.sqlContext.read.option("delimiter","|").option("quote","\u").csv("/home/input-files/")
 val emailDf=df.filter("_c3='EML'") 
emailDf.repartition(100).write.csv("/opt/outputFile/")}}

After executing the cmds above in spark-shell with master url set.
{quote}In {{worker1}} -> Each part file is created 
in {{/opt/outputFile/_temporary/task-x-xxx/part-xxx-xxx}}
In {{worker2}} -> {{/opt/outputFile/part-xxx}} => part files are generated 
directly under outputDirectory specified during write.
{quote}
*Same thing happens with coalesce(100) or without specifying 
repartition/coalesce!!!*

*_Question_*

1) Why doesn't the {{worker1}} {{/opt/outputFile/}} output directory have 
{{part-}} files just like in {{worker2}}? Why is the {{_temporary}} directory 
created, with {{part-xxx-xx}} files residing in the {{task-xxx}} directories?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread yucai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yucai resolved SPARK-25206.
---
Resolution: Won't Fix

Not backporting to 2.3 as per [~cloud_fan]'s summary; closed.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, the query below returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter 
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes FilterApi.gt(intColumn("{color:#ff}ID{color}"), 0: 
> Integer) down into Parquet, but {color:#ff}ID{color} does not exist in 
> /tmp/data (Parquet is case sensitive; it actually has {color:#ff}id{color}).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive 
> metastore schema to do the pushdown, which addresses this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and 
> Parquet schema are in different letter cases, even with spark.sql.caseSensitive 
> set to false.
> SPARK-25132 addressed this issue already.
>  
> The biggest difference is that, in Spark 2.1, the user will get an exception 
> for the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user will get the wrong results silently.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
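Not part of the ticket, but as a hypothetical illustration of sidestepping the mismatch described above: declaring the table with the same letter case as the Parquet schema means the pushed-down filter references a column that actually exists in the files.

{code:scala}
// Hypothetical workaround sketch: match the letter case of the Parquet schema
// when declaring the table, so the pushed-down filter finds the column.
spark.range(10).write.mode("overwrite").parquet("/tmp/data")
sql("DROP TABLE IF EXISTS t_lower")
sql("CREATE TABLE t_lower (id LONG) USING parquet LOCATION '/tmp/data'")
sql("SELECT * FROM t_lower WHERE id > 0").show()   // rows 1 through 9, as expected
{code}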



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25282) Fix support for spark-shell with K8s

2018-08-30 Thread Prashant Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-25282:

Description: 
Spark shell, when run with a Kubernetes master, gives the following errors.
{noformat}
java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local 
class incompatible: stream classdesc serialVersionUID = -3720498261147521051, 
local class serialVersionUID = -6655865447853211720
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)

{noformat}

Special care was taken to ensure that the same compiled jar was used both in the 
images and on the host system, i.e. the system running the driver.

  was:
Spark shell, when run with a Kubernetes master, gives the following errors.
{noformat}
java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local 
class incompatible: stream classdesc serialVersionUID = -3720498261147521051, 
local class serialVersionUID = -6655865447853211720
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)

{noformat}

Special care was taken to ensure that the same compiled jar was used both in the 
images and on the host system.


> Fix support for spark-shell with K8s
> 
>
> Key: SPARK-25282
> URL: https://issues.apache.org/jira/browse/SPARK-25282
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Prashant Sharma
>Priority: Major
>
> Spark shell, when run with a Kubernetes master, gives the following errors.
> {noformat}
> java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local 
> class incompatible: stream classdesc serialVersionUID = -3720498261147521051, 
> local class serialVersionUID = -6655865447853211720
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
> {noformat}
> Special care was taken to ensure that the same compiled jar was used both in 
> the images and on the host system, i.e. the system running the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)

2018-08-30 Thread Ilan Filonenko (JIRA)
Ilan Filonenko created SPARK-25291:
--

 Summary: Flakiness of tests in terms of executor memory 
(SecretsTestSuite)
 Key: SPARK-25291
 URL: https://issues.apache.org/jira/browse/SPARK-25291
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Ilan Filonenko


SecretsTestSuite shows flakiness in terms of correct setting of executor 
memory: 

 

- Run SparkPi with env and mount secrets. *** FAILED ***
 "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)

This happens when run with the default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25291) Flakiness of tests in terms of executor memory (SecretsTestSuite)

2018-08-30 Thread Ilan Filonenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilan Filonenko updated SPARK-25291:
---
Description: 
SecretsTestSuite shows flakiness in terms of correct setting of executor 
memory: 

Run SparkPi with env and mount secrets. *** FAILED ***
 "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)

This happens when run with the default settings.

  was:
SecretsTestSuite shows flakiness in terms of correct setting of executor 
memory: 

 

- Run SparkPi with env and mount secrets. *** FAILED ***
 "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)

This happens when run with the default settings.


> Flakiness of tests in terms of executor memory (SecretsTestSuite)
> -
>
> Key: SPARK-25291
> URL: https://issues.apache.org/jira/browse/SPARK-25291
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> SecretsTestSuite shows flakiness in terms of correct setting of executor 
> memory: 
> Run SparkPi with env and mount secrets. *** FAILED ***
>  "[884]Mi" did not equal "[1408]Mi" (KubernetesSuite.scala:272)
> This happens when run with the default settings.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598235#comment-16598235
 ] 

Wenchen Fan commented on SPARK-25206:
-

It turns out we need to backport 3 non-trivial PRs to entirely fix the problem, 
which is risky. Let's close this JIRA if the problem has been resolved in 
master.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, the query below returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet,
> but ID does not exist in /tmp/data (Parquet is case sensitive; it actually
> has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive
> metastore schema to do the pushdown, which is a perfect fix for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and
> Parquet schema are in different letter cases, even when
> spark.sql.caseSensitive is set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?
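
For intuition, the SPARK-25132 side of the fix comes down to resolving the requested (metastore) column name against the Parquet file's fields case-insensitively whenever spark.sql.caseSensitive is false. A rough sketch of that lookup, using made-up names rather than the actual ParquetReadSupport internals:

{code:scala}
// Hedged sketch of case-(in)sensitive column resolution. `resolve` and
// `parquetFields` are illustrative names, not the real Spark internals.
def resolve(requested: String,
            parquetFields: Seq[String],
            caseSensitive: Boolean): Option[String] = {
  if (caseSensitive) {
    parquetFields.find(_ == requested)
  } else {
    // A real implementation must also reject ambiguous matches, e.g. a file
    // containing both "id" and "ID"; this sketch just takes the first match.
    parquetFields.find(_.equalsIgnoreCase(requested))
  }
}

resolve("ID", Seq("id"), caseSensitive = false) // Some("id"): column is read
resolve("ID", Seq("id"), caseSensitive = true)  // None: NULLs / nothing pushed down
{code}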



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file

2018-08-30 Thread Shirish Tatikonda (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598227#comment-16598227
 ] 

Shirish Tatikonda commented on SPARK-19809:
---

[~dongjoon] I am encountering the same problem even with Spark version 2.3.1.
{code:java}
[local:~] spark-shell
2018-08-30 21:07:25 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1535688452266).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("create table empty_orc(a int) stored as orc location 
'/tmp/empty_orc'").show
2018-08-30 21:07:44 WARN  ObjectStore: - Version information not found in 
metastore. hive.metastore.schema.verification is not enabled so recording the 
schema version 1.2.0
2018-08-30 21:07:44 WARN  ObjectStore:568 - Failed to get database default, 
returning NoSuchObjectException
2018-08-30 21:07:45 WARN  ObjectStore:568 - Failed to get database global_temp, 
returning NoSuchObjectException
++
||
++
++

// in a different terminal, I did "touch /tmp/empty_orc/zero.orc"

scala> sql("select * from empty_orc").show
java.lang.RuntimeException: serious problem
  at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021)
  at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:340)
  at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
  at 

[jira] [Resolved] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12

2018-08-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25256.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22264
[https://github.com/apache/spark/pull/22264]

> Plan mismatch errors in Hive tests in 2.12
> --
>
> Key: SPARK-25256
> URL: https://issues.apache.org/jira/browse/SPARK-25256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Darcy Shen
>Priority: Major
> Fix For: 2.4.0
>
>
> In Hive tests, in the Scala 2.12 build, still seeing a few failures that seem 
> to show mismatching schema inference. Not clear whether it's the same as 
> SPARK-25044. Examples:
> {code:java}
> - SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes 
> *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'Project ['arrayField, 'p]
> +- 'Filter ('p = 1)
> +- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes`
> == Analyzed Logical Plan ==
> arrayField: array, p: int
> Project [arrayField#82569, p#82570]
> +- Filter (p#82570 = 1)
> +- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes`
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Optimized Logical Plan ==
> Project [arrayField#82569, p#82570]
> +- Filter (isnotnull(p#82570) && (p#82570 = 1))
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Physical Plan ==
> *(1) Project [arrayField#82569, p#82570]
> +- *(1) FileScan parquet 
> default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570]
>  Batched: false, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22...,
>  PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], 
> PushedFilters: [], ReadSchema: struct>
> == Results ==
> == Results ==
> !== Correct Answer - 10 == == Spark Answer - 10 ==
> !struct<> struct,p:int>
> ![Range 1 to 1,1] [WrappedArray(1),1]
> ![Range 1 to 10,1] [WrappedArray(1, 2),1]
> ![Range 1 to 2,1] [WrappedArray(1, 2, 3),1]
> ![Range 1 to 3,1] [WrappedArray(1, 2, 3, 4),1]
> ![Range 1 to 4,1] [WrappedArray(1, 2, 3, 4, 5),1]
> ![Range 1 to 5,1] [WrappedArray(1, 2, 3, 4, 5, 6),1]
> ![Range 1 to 6,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7),1]
> ![Range 1 to 7,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1]
> ![Range 1 to 8,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1]
> ![Range 1 to 9,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1] 
> (QueryTest.scala:163){code}
> {code:java}
> - SPARK-2693 udaf aggregates test *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'GlobalLimit 1
> +- 'LocalLimit 1
> +- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)]
> +- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> percentile(key, array(1, 1), 1): array
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, cast(array(1, 1) as array), 1, 
> 0, 0) AS percentile(key, array(1, 1), 1)#205101]
> +- SubqueryAlias `default`.`src`
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Optimized Logical Plan ==
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, 
> array(1, 1), 1)#205101]
> +- Project [key#205098]
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Physical Plan ==
> CollectLimit 1
> +- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 
> 1, 0, 0)], output=[percentile(key, array(1, 1), 1)#205101])
> +- Exchange SinglePartition
> +- ObjectHashAggregate(keys=[], 

[jira] [Assigned] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12

2018-08-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25256:
-

Assignee: Darcy Shen

> Plan mismatch errors in Hive tests in 2.12
> --
>
> Key: SPARK-25256
> URL: https://issues.apache.org/jira/browse/SPARK-25256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Darcy Shen
>Priority: Major
> Fix For: 2.4.0
>
>
> In Hive tests, in the Scala 2.12 build, still seeing a few failures that seem 
> to show mismatching schema inference. Not clear whether it's the same as 
> SPARK-25044. Examples:
> {code:java}
> - SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes 
> *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'Project ['arrayField, 'p]
> +- 'Filter ('p = 1)
> +- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes`
> == Analyzed Logical Plan ==
> arrayField: array, p: int
> Project [arrayField#82569, p#82570]
> +- Filter (p#82570 = 1)
> +- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes`
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Optimized Logical Plan ==
> Project [arrayField#82569, p#82570]
> +- Filter (isnotnull(p#82570) && (p#82570 = 1))
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Physical Plan ==
> *(1) Project [arrayField#82569, p#82570]
> +- *(1) FileScan parquet 
> default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570]
>  Batched: false, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22...,
>  PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], 
> PushedFilters: [], ReadSchema: struct>
> == Results ==
> == Results ==
> !== Correct Answer - 10 == == Spark Answer - 10 ==
> !struct<> struct,p:int>
> ![Range 1 to 1,1] [WrappedArray(1),1]
> ![Range 1 to 10,1] [WrappedArray(1, 2),1]
> ![Range 1 to 2,1] [WrappedArray(1, 2, 3),1]
> ![Range 1 to 3,1] [WrappedArray(1, 2, 3, 4),1]
> ![Range 1 to 4,1] [WrappedArray(1, 2, 3, 4, 5),1]
> ![Range 1 to 5,1] [WrappedArray(1, 2, 3, 4, 5, 6),1]
> ![Range 1 to 6,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7),1]
> ![Range 1 to 7,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1]
> ![Range 1 to 8,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1]
> ![Range 1 to 9,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1] 
> (QueryTest.scala:163){code}
> {code:java}
> - SPARK-2693 udaf aggregates test *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'GlobalLimit 1
> +- 'LocalLimit 1
> +- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)]
> +- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> percentile(key, array(1, 1), 1): array
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, cast(array(1, 1) as array), 1, 
> 0, 0) AS percentile(key, array(1, 1), 1)#205101]
> +- SubqueryAlias `default`.`src`
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Optimized Logical Plan ==
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, 
> array(1, 1), 1)#205101]
> +- Project [key#205098]
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Physical Plan ==
> CollectLimit 1
> +- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 
> 1, 0, 0)], output=[percentile(key, array(1, 1), 1)#205101])
> +- Exchange SinglePartition
> +- ObjectHashAggregate(keys=[], functions=[partial_percentile(key#205098, 
> [1.0,1.0], 1, 0, 0)], output=[buf#205104])
> +- Scan hive 

[jira] [Assigned] (SPARK-25290) BytesToBytesMapOnHeapSuite randomizedStressTest can cause OutOfMemoryError

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25290:


Assignee: (was: Apache Spark)

> BytesToBytesMapOnHeapSuite randomizedStressTest can cause OutOfMemoryError
> --
>
> Key: SPARK-25290
> URL: https://issues.apache.org/jira/browse/SPARK-25290
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> BytesToBytesMapOnHeapSuite randomizedStressTest caused OutOfMemoryError on 
> several test runs. Seems better to reduce memory usage in this test.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95369/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95482/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95501/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25290) BytesToBytesMapOnHeapSuite randomizedStressTest can cause OutOfMemoryError

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25290:


Assignee: Apache Spark

> BytesToBytesMapOnHeapSuite randomizedStressTest can cause OutOfMemoryError
> --
>
> Key: SPARK-25290
> URL: https://issues.apache.org/jira/browse/SPARK-25290
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> BytesToBytesMapOnHeapSuite randomizedStressTest caused OutOfMemoryError on 
> several test runs. Seems better to reduce memory usage in this test.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95369/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95482/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95501/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25290) BytesToBytesMapOnHeapSuite randomizedStressTest can cause OutOfMemoryError

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598162#comment-16598162
 ] 

Apache Spark commented on SPARK-25290:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22297

> BytesToBytesMapOnHeapSuite randomizedStressTest can cause OutOfMemoryError
> --
>
> Key: SPARK-25290
> URL: https://issues.apache.org/jira/browse/SPARK-25290
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> BytesToBytesMapOnHeapSuite randomizedStressTest caused OutOfMemoryError on 
> several test runs. Seems better to reduce memory usage in this test.
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95369/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95482/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95501/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25290) BytesToBytesMapOnHeapSuite randomizedStressTest can cause OutOfMemoryError

2018-08-30 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-25290:
---

 Summary: BytesToBytesMapOnHeapSuite randomizedStressTest can cause 
OutOfMemoryError
 Key: SPARK-25290
 URL: https://issues.apache.org/jira/browse/SPARK-25290
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


BytesToBytesMapOnHeapSuite randomizedStressTest caused OutOfMemoryError on 
several test runs. Seems better to reduce memory usage in this test.

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95369/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95482/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95501/testReport/org.apache.spark.unsafe.map/BytesToBytesMapOnHeapSuite/randomizedStressTest/]

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25289) ChiSqSelector max on empty collection

2018-08-30 Thread Marie Beaulieu (JIRA)
Marie Beaulieu created SPARK-25289:
--

 Summary: ChiSqSelector max on empty collection
 Key: SPARK-25289
 URL: https://issues.apache.org/jira/browse/SPARK-25289
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.3.1
Reporter: Marie Beaulieu


In org.apache.spark.mllib.feature.ChiSqSelector.fit, there is a max taken on a 
possibly empty collection.

I am using Spark 2.3.1.

Here is an example to reproduce.
{code:java}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
implicit val spark = sqlContext.sparkSession

val labeledPoints = (0 to 1).map(n => {
  val v = Vectors.dense((1 to 3).map(_ => n * 1.0).toArray)
  LabeledPoint(n.toDouble, v)
})
val rdd = sc.parallelize(labeledPoints)
val selector = new ChiSqSelector().setSelectorType("fdr").setFdr(0.05)
selector.fit(rdd){code}
Here is the stack trace:
{code:java}
java.lang.UnsupportedOperationException: empty.max
at scala.collection.TraversableOnce$class.max(TraversableOnce.scala:229)
at scala.collection.mutable.ArrayOps$ofInt.max(ArrayOps.scala:234)
at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:280)
{code}
Looking at line 280 in ChiSqSelector, it's pretty obvious how the collection 
can be empty. A simple non-empty validation should do the trick.
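
A guard along those lines could look like the following sketch; the pValues data and the threshold predicate are stand-ins for whatever fit actually filters on, the point is only the empty-collection check before max:

{code:scala}
// Hedged sketch: guard a max over a possibly-empty filtered collection.
// The data and predicate are illustrative, not ChiSqSelector's real logic.
val pValues = Seq(0.2, 0.4, 0.9)
val candidates = pValues.zipWithIndex
  .filter { case (p, i) => p <= 0.05 * (i + 1) } // may select nothing
  .map { case (_, i) => i }

// candidates.max would throw "java.lang.UnsupportedOperationException: empty.max"
val selected: Option[Int] = candidates.reduceOption(_ max _)
{code}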



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598132#comment-16598132
 ] 

Apache Spark commented on SPARK-24748:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22296

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the Structured Streaming sources and sinks do not have a way to
> report custom metrics. Providing an option to report custom metrics and
> making it available via Streaming Query progress can enable sources and sinks
> to report custom progress information (e.g. lag metrics for the Kafka source).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25286) Remove dangerous parmap

2018-08-30 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25286:

Target Version/s: 2.4.0

> Remove dangerous parmap
> ---
>
> Key: SPARK-25286
> URL: https://issues.apache.org/jira/browse/SPARK-25286
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> One of the parmap methods accepts an execution context created outside of
> parmap. If parmap is called recursively on a size-limited thread pool, it can
> lead to deadlocks. See the JIRA tickets SPARK-25240 and SPARK-25283.
> To eliminate such problems in the future, we need to remove the parmap()
> overload with the signature:
> {code:scala}
> def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]]
>   (in: Col[I])
>   (f: I => O)
>   (implicit
> cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map
> cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]], // for Future.sequence
> ec: ExecutionContext
>   ): Col[O]
> {code}
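
For context, the deadlock arises when outer parmap tasks block waiting on inner parmap tasks that can never start because the shared, size-limited pool is already full of blocked outer tasks. A minimal sketch of the safer shape (each call gets its own short-lived pool; this is my reading of the general direction of the replacement, not the exact Spark API):

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hedged sketch, not Spark's actual ThreadUtils code: run each parmap-style
// call on a dedicated pool so nested calls never compete for the caller's
// threads, which removes the deadlock scenario described above.
def parmapIsolated[I, O](in: Seq[I], numThreads: Int = 8)(f: I => O): Seq[O] = {
  val pool = Executors.newFixedThreadPool(numThreads)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    Await.result(Future.sequence(in.map(i => Future(f(i)))), Duration.Inf)
  } finally {
    pool.shutdownNow()
  }
}

// Nested use is safe because the inner call gets a fresh pool:
// parmapIsolated(Seq(1, 2, 3))(x => parmapIsolated(Seq(x, x + 1))(_ * 2).sum)
{code}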



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598115#comment-16598115
 ] 

Saisai Shao commented on SPARK-25206:
-

I see. If it is not going to be merged, let's close this JIRA and add it to the 
release notes.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, the query below returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet,
> but ID does not exist in /tmp/data (Parquet is case sensitive; it actually
> has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive
> metastore schema to do the pushdown, which is a perfect fix for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and
> Parquet schema are in different letter cases, even when
> spark.sql.caseSensitive is set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread yucai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598113#comment-16598113
 ] 

yucai commented on SPARK-25206:
---

Based on our discussion in 
[https://github.com/apache/spark/pull/22184#issuecomment-416840509],

it seems like [~cloud_fan] prefers not to backport; we need his confirmation.

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, the query below returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet,
> but ID does not exist in /tmp/data (Parquet is case sensitive; it actually
> has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive
> metastore schema to do the pushdown, which is a perfect fix for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and
> Parquet schema are in different letter cases, even when
> spark.sql.caseSensitive is set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-30 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598111#comment-16598111
 ] 

Saisai Shao commented on SPARK-25135:
-

What's the ETA of this issue [~yumwang]?

> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: Parquet, correctness
>
> This happens with Parquet.
> Here is how to reproduce it with Parquet:
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) 
> as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet 
> location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > 
> -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> FYI, the following is the ORC behavior, for comparison:
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as 
> bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc 
> location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where 
> col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +++
> |COL1|COL2|
> +++
> |  15|  15|
> |  16|  16|
> |  17|  17|
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25206) wrong records are returned when Hive metastore schema and parquet schema are in different letter cases

2018-08-30 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598110#comment-16598110
 ] 

Saisai Shao commented on SPARK-25206:
-

What is the status of this JIRA? Are we going to backport it, or just mark it 
as a known issue?

[~yucai] [~cloud_fan] [~smilegator]

> wrong records are returned when Hive metastore schema and parquet schema are 
> in different letter cases
> --
>
> Key: SPARK-25206
> URL: https://issues.apache.org/jira/browse/SPARK-25206
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.1
>Reporter: yucai
>Priority: Blocker
>  Labels: Parquet, correctness
> Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png, 
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
>
> In current Spark 2.3.1, the query below returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>  
> *Root Cause*
> After a deep dive, there are two issues, both related to different letter
> cases between the Hive metastore schema and the Parquet schema.
> 1. The wrong column is pushed down.
> Spark pushes down FilterApi.gt(intColumn("ID"), 0: Integer) into Parquet,
> but ID does not exist in /tmp/data (Parquet is case sensitive; it actually
> has id).
> So no records are returned.
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive
> metastore schema to do the pushdown, which is a perfect fix for this issue.
> 2. Spark SQL returns NULL for a column whose Hive metastore schema and
> Parquet schema are in different letter cases, even when
> spark.sql.caseSensitive is set to false.
> SPARK-25132 already addressed this issue.
>  
> The biggest difference is that, in Spark 2.1, the user gets an exception for
> the same query:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> So they will know about the issue and fix the query.
> But in Spark 2.3, the user silently gets wrong results.
>  
> To make the above query work, we need both SPARK-25132 and -SPARK-24716.-
>  
> [~yumwang] , [~cloud_fan], [~smilegator], any thoughts? Should we backport it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598089#comment-16598089
 ] 

Apache Spark commented on SPARK-25255:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22295

> Add getActiveSession to SparkSession in PySpark
> ---
>
> Key: SPARK-25255
> URL: https://issues.apache.org/jira/browse/SPARK-25255
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25255:


Assignee: Apache Spark

> Add getActiveSession to SparkSession in PySpark
> ---
>
> Key: SPARK-25255
> URL: https://issues.apache.org/jira/browse/SPARK-25255
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25255) Add getActiveSession to SparkSession in PySpark

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25255:


Assignee: (was: Apache Spark)

> Add getActiveSession to SparkSession in PySpark
> ---
>
> Key: SPARK-25255
> URL: https://issues.apache.org/jira/browse/SPARK-25255
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> Add getActiveSession to PySpark session API.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-25287.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22294
[https://github.com/apache/spark/pull/22294]

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: infrastructure
> Fix For: 2.4.0
>
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson reassigned SPARK-25287:
--

Assignee: Erik Erlandson

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: infrastructure
> Fix For: 2.4.0
>
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25287:


Assignee: (was: Apache Spark)

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Priority: Minor
>  Labels: infrastructure
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597965#comment-16597965
 ] 

Apache Spark commented on SPARK-25287:
--

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/22294

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Priority: Minor
>  Labels: infrastructure
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25287:


Assignee: Apache Spark

> Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py
> ---
>
> Key: SPARK-25287
> URL: https://issues.apache.org/jira/browse/SPARK-25287
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.1
>Reporter: Erik Erlandson
>Assignee: Apache Spark
>Priority: Minor
>  Labels: infrastructure
>
> I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
> checked, so I get to the end of the {{merge_spark_pr.py}} process and it 
> fails on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25288) Kafka transaction tests are flaky

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25288:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Kafka transaction tests are flaky
> -
>
> Key: SPARK-25288
> URL: https://issues.apache.org/jira/browse/SPARK-25288
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaRelationSuite_name=read+Kafka+transactional+messages%3A+read_committed
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25288) Kafka transaction tests are flaky

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25288:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Kafka transaction tests are flaky
> -
>
> Key: SPARK-25288
> URL: https://issues.apache.org/jira/browse/SPARK-25288
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaRelationSuite_name=read+Kafka+transactional+messages%3A+read_committed
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25288) Kafka transaction tests are flaky

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597958#comment-16597958
 ] 

Apache Spark commented on SPARK-25288:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22293

> Kafka transaction tests are flaky
> -
>
> Key: SPARK-25288
> URL: https://issues.apache.org/jira/browse/SPARK-25288
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaRelationSuite_name=read+Kafka+transactional+messages%3A+read_committed
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25288) Kafka transaction tests are flaky

2018-08-30 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25288:
-
Description: 
http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaRelationSuite_name=read+Kafka+transactional+messages%3A+read_committed
http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed
http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed

> Kafka transaction tests are flaky
> -
>
> Key: SPARK-25288
> URL: https://issues.apache.org/jira/browse/SPARK-25288
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaRelationSuite_name=read+Kafka+transactional+messages%3A+read_committed
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV1SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaMicroBatchV2SourceSuite_name=read+Kafka+transactional+messages%3A+read_committed



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25288) Kafka transaction tests are flaky

2018-08-30 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25288:


 Summary: Kafka transaction tests are flaky
 Key: SPARK-25288
 URL: https://issues.apache.org/jira/browse/SPARK-25288
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25287) Check for JIRA_USERNAME and JIRA_PASSWORD up front in merge_spark_pr.py

2018-08-30 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-25287:
--

 Summary: Check for JIRA_USERNAME and JIRA_PASSWORD up front in 
merge_spark_pr.py
 Key: SPARK-25287
 URL: https://issues.apache.org/jira/browse/SPARK-25287
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.3.1
Reporter: Erik Erlandson


I never remember to set {{JIRA_USERNAME}} and {{JIRA_PASSWORD}}, and it isn't 
checked, so I get to the end of the {{merge_spark_pr.py}} process and it fails 
on the Jira state update. An up-front check for this would be useful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Erik Erlandson (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Erlandson resolved SPARK-25275.

   Resolution: Fixed
Fix Version/s: 2.4.0

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: docker, kubernetes
> Fix For: 2.4.0
>
>
> For improved security, require that users be in the wheel group in order to
> run su.
> See example:
> https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Erik Erlandson (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597935#comment-16597935
 ] 

Erik Erlandson commented on SPARK-25275:


{{merge_spark_pr.py}} failed to close this, closing manually.

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: docker, kubernetes
>
> For improved security, require that users be in the wheel group in order to
> run su.
> See example:
> https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597929#comment-16597929
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 8/30/18 9:11 PM:
--

Btw [~liyinan926], at some point, if you want to keep a healthy community, you 
need to address these kinds of issues; here the wrong message is being 
communicated. Being fair is more important than any PR or any commit at the end 
of the day, if you ask me. Also, the fact that no one from the people of the PR 
explained their intentions is not respectful.


was (Author: skonto):
Btw [~liyinan926], at some point, if you want to keep a healthy community, you 
need to address these kinds of issues; here the wrong message is being 
communicated. Being fair is more important than any PR or any commit at the end 
of the day, if you ask me. Also, the fact that no one from the people of the PR 
explained their intentions is not respectful.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597929#comment-16597929
 ] 

Stavros Kontopoulos commented on SPARK-24434:
-

Btw [~liyinan926], at some point, if you want to keep a healthy community, you need to 
address these kinds of issues; here the wrong message is being communicated. Being 
fair is more important than any PR or any commit at the end of the day, if you ask 
me. Also, the fact that no one from Palantir explained their intentions is 
not respectful either.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24434) Support user-specified driver and executor pod templates

2018-08-30 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597929#comment-16597929
 ] 

Stavros Kontopoulos edited comment on SPARK-24434 at 8/30/18 9:09 PM:
--

Btw [~liyinan926], at some point, if you want to keep a healthy community, you need to 
address these kinds of issues; here the wrong message is being communicated. Being 
fair is more important than any PR or any commit at the end of the day, if you ask 
me. Also, the fact that no one from the people of the PR explained their 
intentions is not respectful either.


was (Author: skonto):
Btw [~liyinan926], at some point, if you want to keep a healthy community, you need to 
address these kinds of issues; here the wrong message is being communicated. Being 
fair is more important than any PR or any commit at the end of the day, if you ask 
me. Also, the fact that no one from Palantir explained their intentions is 
not respectful either.

> Support user-specified driver and executor pod templates
> 
>
> Key: SPARK-24434
> URL: https://issues.apache.org/jira/browse/SPARK-24434
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> With more requests for customizing the driver and executor pods coming, the 
> current approach of adding new Spark configuration options has some serious 
> drawbacks: 1) it means more Kubernetes specific configuration options to 
> maintain, and 2) it widens the gap between the declarative model used by 
> Kubernetes and the configuration model used by Spark. We should start 
> designing a solution that allows users to specify pod templates as central 
> places for all customization needs for the driver and executor pods. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25286) Remove dangerous parmap

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597921#comment-16597921
 ] 

Apache Spark commented on SPARK-25286:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22292

> Remove dangerous parmap
> ---
>
> Key: SPARK-25286
> URL: https://issues.apache.org/jira/browse/SPARK-25286
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> One of the parmap methods accepts an execution context created outside of parmap. 
> If this parmap method is called recursively on a thread pool of limited size, 
> it can lead to deadlocks. See the JIRA tickets SPARK-25240 and 
> SPARK-25283. To eliminate such problems in the future, we need to remove the 
> parmap() overload with the signature:
> {code:scala}
> def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]]
>   (in: Col[I])
>   (f: I => O)
>   (implicit
> cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map
> cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]], // for Future.sequence
> ec: ExecutionContext
>   ): Col[O]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25286) Remove dangerous parmap

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25286:


Assignee: (was: Apache Spark)

> Remove dangerous parmap
> ---
>
> Key: SPARK-25286
> URL: https://issues.apache.org/jira/browse/SPARK-25286
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> One of the parmap methods accepts an execution context created outside of parmap. 
> If this parmap method is called recursively on a thread pool of limited size, 
> it can lead to deadlocks. See the JIRA tickets SPARK-25240 and 
> SPARK-25283. To eliminate such problems in the future, we need to remove the 
> parmap() overload with the signature:
> {code:scala}
> def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]]
>   (in: Col[I])
>   (f: I => O)
>   (implicit
> cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map
> cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]], // for Future.sequence
> ec: ExecutionContext
>   ): Col[O]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25286) Remove dangerous parmap

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25286:


Assignee: Apache Spark

> Remove dangerous parmap
> ---
>
> Key: SPARK-25286
> URL: https://issues.apache.org/jira/browse/SPARK-25286
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> One of the parmap methods accepts an execution context created outside of parmap. 
> If this parmap method is called recursively on a thread pool of limited size, 
> it can lead to deadlocks. See the JIRA tickets SPARK-25240 and 
> SPARK-25283. To eliminate such problems in the future, we need to remove the 
> parmap() overload with the signature:
> {code:scala}
> def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]]
>   (in: Col[I])
>   (f: I => O)
>   (implicit
> cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map
> cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]], // for Future.sequence
> ec: ExecutionContext
>   ): Col[O]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25286) Remove dangerous parmap

2018-08-30 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25286:
--

 Summary: Remove dangerous parmap
 Key: SPARK-25286
 URL: https://issues.apache.org/jira/browse/SPARK-25286
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Maxim Gekk


One of the parmap methods accepts an execution context created outside of parmap. 
If this parmap method is called recursively on a thread pool of limited size, it 
can lead to deadlocks. See the JIRA tickets SPARK-25240 and SPARK-25283. To 
eliminate such problems in the future, we need to remove the parmap() overload 
with the signature:
{code:scala}
def parmap[I, O, Col[X] <: TraversableLike[X, Col[X]]]
  (in: Col[I])
  (f: I => O)
  (implicit
cbf: CanBuildFrom[Col[I], Future[O], Col[Future[O]]], // For in.map
cbf2: CanBuildFrom[Col[Future[O]], O, Col[O]], // for Future.sequence
ec: ExecutionContext
  ): Col[O]
{code}
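To make the failure mode concrete, here is a minimal standalone sketch (not code from Spark) of how nested parallel mapping over a bounded pool deadlocks: the outer tasks occupy every thread and then block waiting on inner tasks that can never be scheduled.

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParmapDeadlockSketch {
  // Simplified parmap: run f over the elements in parallel and block for the results.
  def parmap[I, O](in: Seq[I])(f: I => O)(implicit ec: ExecutionContext): Seq[O] =
    Await.result(Future.sequence(in.map(i => Future(f(i)))), 1.minute)

  def main(args: Array[String]): Unit = {
    // A pool of only 2 threads, shared by both levels of parmap.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(2))
    // The two outer tasks take both threads and block on the inner parmap;
    // the inner futures never get a thread, so this hangs until Await times out.
    parmap(1 to 2)(_ => parmap(1 to 2)(identity).sum)
  }
}
{code}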



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24411) Adding native Java tests for `isInCollection`

2018-08-30 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24411.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22253
[https://github.com/apache/spark/pull/22253]

> Adding native Java tests for `isInCollection`
> -
>
> Key: SPARK-24411
> URL: https://issues.apache.org/jira/browse/SPARK-24411
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: DB Tsai
>Priority: Minor
>  Labels: starter
> Fix For: 2.4.0
>
>
> In the past, some of our Java APIs have been difficult to call from Java. We 
> should add tests written directly in Java to make sure they work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25007) Add array_intersect / array_except /array_union / array_shuffle to SparkR

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25007:


Assignee: Apache Spark

> Add array_intersect / array_except /array_union / array_shuffle to SparkR
> -
>
> Key: SPARK-25007
> URL: https://issues.apache.org/jira/browse/SPARK-25007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Add R version of 
>  * array_intersect -SPARK-23913-
>  * array_except -SPARK-23915- 
>  * array_union -SPARK-23914- 
>  * array_shuffle -SPARK-23928-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25007) Add array_intersect / array_except /array_union / array_shuffle to SparkR

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25007:


Assignee: (was: Apache Spark)

> Add array_intersect / array_except /array_union / array_shuffle to SparkR
> -
>
> Key: SPARK-25007
> URL: https://issues.apache.org/jira/browse/SPARK-25007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R version of 
>  * array_intersect -SPARK-23913-
>  * array_except -SPARK-23915- 
>  * array_union -SPARK-23914- 
>  * array_shuffle -SPARK-23928-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25007) Add array_intersect / array_except /array_union / array_shuffle to SparkR

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597907#comment-16597907
 ] 

Apache Spark commented on SPARK-25007:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22291

> Add array_intersect / array_except /array_union / array_shuffle to SparkR
> -
>
> Key: SPARK-25007
> URL: https://issues.apache.org/jira/browse/SPARK-25007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R version of 
>  * array_intersect -SPARK-23913-
>  * array_except -SPARK-23915- 
>  * array_union -SPARK-23914- 
>  * array_shuffle -SPARK-23928-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25007) Add array_intersect / array_except /array_union / array_shuffle to SparkR

2018-08-30 Thread Huaxin Gao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-25007:
---
Description: 
Add R version of 
 * array_intersect -SPARK-23913-
 * array_except -SPARK-23915- 
 * array_union -SPARK-23914- 
 * array_shuffle -SPARK-23928-

  was:
Add R version of 
 * transform -SPARK-23928-
 * array_except -SPARK-23915- 
 * array_union -SPARK-23914- 
 * array_shuffle -SPARK-23928-

Summary: Add array_intersect / array_except /array_union / 
array_shuffle to SparkR  (was: Add shuffle / array_except /array_union / 
array_shuffle to SparkR)

> Add array_intersect / array_except /array_union / array_shuffle to SparkR
> -
>
> Key: SPARK-25007
> URL: https://issues.apache.org/jira/browse/SPARK-25007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R version of 
>  * array_intersect -SPARK-23913-
>  * array_except -SPARK-23915- 
>  * array_union -SPARK-23914- 
>  * array_shuffle -SPARK-23928-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25007) Add shuffle / array_except /array_union / array_shuffle to SparkR

2018-08-30 Thread Huaxin Gao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-25007:
---
Description: 
Add R version of 
 * transform -SPARK-23928-
 * array_except -SPARK-23915- 
 * array_union -SPARK-23914- 
 * array_shuffle -SPARK-23928-

  was:
Add R version of 
 * transform -SPARK-23908-
 * array_except -SPARK-23915- 
 * array_union -SPARK-23914- 
 * array_shuffle -SPARK-23928-

Summary: Add shuffle / array_except /array_union / array_shuffle to 
SparkR  (was: Add transform / array_except /array_union / array_shuffle to 
SparkR)

> Add shuffle / array_except /array_union / array_shuffle to SparkR
> -
>
> Key: SPARK-25007
> URL: https://issues.apache.org/jira/browse/SPARK-25007
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R version of 
>  * transform -SPARK-23928-
>  * array_except -SPARK-23915- 
>  * array_union -SPARK-23914- 
>  * array_shuffle -SPARK-23928-



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25285) Add executor task metrics to track the number of tasks started and of tasks successfully completed

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25285:


Assignee: Apache Spark

> Add executor task metrics to track the number of tasks started and of tasks 
> successfully completed
> --
>
> Key: SPARK-25285
> URL: https://issues.apache.org/jira/browse/SPARK-25285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Luca Canali
>Assignee: Apache Spark
>Priority: Minor
>
> The motivation for these additional metrics is to help in troubleshooting 
> situations when tasks fail, are killed and/or restarted. Currently available 
> metrics include executor threadpool metrics for completed tasks and for active 
> tasks. The addition of a threadpool tasksStarted metric will allow, for example, 
> collecting info on the (approximate) number of failed tasks by computing the 
> difference tasksStarted – (active tasks + completed and/or successfully 
> completed tasks).
> The proposed metric successfulTasks is also intended for this type of 
> troubleshooting. The difference between successfulTasks and 
> threadpool.completeTasks is that the latter is a (dropwizard library) gauge 
> taken from the threadpool, while the former is a (dropwizard) counter 
> computed in the [[Executor]] class when a task successfully completes, 
> together with several other task metrics counters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25285) Add executor task metrics to track the number of tasks started and of tasks successfully completed

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25285:


Assignee: (was: Apache Spark)

> Add executor task metrics to track the number of tasks started and of tasks 
> successfully completed
> --
>
> Key: SPARK-25285
> URL: https://issues.apache.org/jira/browse/SPARK-25285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> The motivation for these additional metrics is to help in troubleshooting 
> situations when tasks fail, are killed and/or restarted. Currently available 
> metrics include executor threadpool metrics for completed tasks and for active 
> tasks. The addition of a threadpool tasksStarted metric will allow, for example, 
> collecting info on the (approximate) number of failed tasks by computing the 
> difference tasksStarted – (active tasks + completed and/or successfully 
> completed tasks).
> The proposed metric successfulTasks is also intended for this type of 
> troubleshooting. The difference between successfulTasks and 
> threadpool.completeTasks is that the latter is a (dropwizard library) gauge 
> taken from the threadpool, while the former is a (dropwizard) counter 
> computed in the [[Executor]] class when a task successfully completes, 
> together with several other task metrics counters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25285) Add executor task metrics to track the number of tasks started and of tasks successfully completed

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597849#comment-16597849
 ] 

Apache Spark commented on SPARK-25285:
--

User 'LucaCanali' has created a pull request for this issue:
https://github.com/apache/spark/pull/22290

> Add executor task metrics to track the number of tasks started and of tasks 
> successfully completed
> --
>
> Key: SPARK-25285
> URL: https://issues.apache.org/jira/browse/SPARK-25285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> The motivation for these additional metrics is to help in troubleshooting 
> situations when tasks fail, are killed and/or restarted. Currently available 
> metrics include executor threadpool metrics for completed tasks and for active 
> tasks. The addition of a threadpool tasksStarted metric will allow, for example, 
> collecting info on the (approximate) number of failed tasks by computing the 
> difference tasksStarted – (active tasks + completed and/or successfully 
> completed tasks).
> The proposed metric successfulTasks is also intended for this type of 
> troubleshooting. The difference between successfulTasks and 
> threadpool.completeTasks is that the latter is a (dropwizard library) gauge 
> taken from the threadpool, while the former is a (dropwizard) counter 
> computed in the [[Executor]] class when a task successfully completes, 
> together with several other task metrics counters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25285) Add executor task metrics to track the number of tasks started and of tasks successfully completed

2018-08-30 Thread Luca Canali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-25285:

Priority: Minor  (was: Major)

> Add executor task metrics to track the number of tasks started and of tasks 
> successfully completed
> --
>
> Key: SPARK-25285
> URL: https://issues.apache.org/jira/browse/SPARK-25285
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Luca Canali
>Priority: Minor
>
> The motivation for these additional metrics is to help in troubleshooting 
> situations when tasks fail, are killed and/or restarted. Currently available 
> metrics include executor threadpool metrics for completed tasks and for active 
> tasks. The addition of a threadpool tasksStarted metric will allow, for example, 
> collecting info on the (approximate) number of failed tasks by computing the 
> difference tasksStarted – (active tasks + completed and/or successfully 
> completed tasks).
> The proposed metric successfulTasks is also intended for this type of 
> troubleshooting. The difference between successfulTasks and 
> threadpool.completeTasks is that the latter is a (dropwizard library) gauge 
> taken from the threadpool, while the former is a (dropwizard) counter 
> computed in the [[Executor]] class when a task successfully completes, 
> together with several other task metrics counters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25285) Add executor task metrics to track the number of tasks started and of tasks successfully completed

2018-08-30 Thread Luca Canali (JIRA)
Luca Canali created SPARK-25285:
---

 Summary: Add executor task metrics to track the number of tasks 
started and of tasks successfully completed
 Key: SPARK-25285
 URL: https://issues.apache.org/jira/browse/SPARK-25285
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Luca Canali


The motivation for these additional metrics is to help in troubleshooting 
situations when tasks fail, are killed and/or restarted. Currently available 
metrics include executor threadpool metrics for completed tasks and for active 
tasks. The addition of a threadpool tasksStarted metric will allow, for example, 
collecting info on the (approximate) number of failed tasks by computing the 
difference tasksStarted – (active tasks + completed and/or successfully completed 
tasks).

The proposed metric successfulTasks is also intended for this type of 
troubleshooting. The difference between successfulTasks and 
threadpool.completeTasks is that the latter is a (dropwizard library) gauge 
taken from the threadpool, while the former is a (dropwizard) counter computed 
in the [[Executor]] class when a task successfully completes, together with 
several other task metrics counters.
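To illustrate the gauge-versus-counter distinction described above, here is a minimal dropwizard-metrics sketch. It is not Spark's actual ExecutorSource; the class and metric names are illustrative only.

{code:scala}
import java.util.concurrent.ThreadPoolExecutor
import com.codahale.metrics.{Counter, Gauge, MetricRegistry}

class TaskMetricsSketch(registry: MetricRegistry, pool: ThreadPoolExecutor) {
  // Gauge: the value is sampled from the threadpool each time the metric is read.
  registry.register("threadpool.completeTasks", new Gauge[Long] {
    override def getValue: Long = pool.getCompletedTaskCount
  })

  // Counter: incremented explicitly by the executor when a task finishes successfully.
  private val successfulTasks: Counter = registry.counter("successfulTasks")
  def onTaskSuccess(): Unit = successfulTasks.inc()
}
{code}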



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25200) Allow setting HADOOP_CONF_DIR as a spark property

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597808#comment-16597808
 ] 

Apache Spark commented on SPARK-25200:
--

User 'adambalogh' has created a pull request for this issue:
https://github.com/apache/spark/pull/22289

> Allow setting HADOOP_CONF_DIR as a spark property
> -
>
> Key: SPARK-25200
> URL: https://issues.apache.org/jira/browse/SPARK-25200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Adam Balogh
>Priority: Major
>
> When submitting applications to Yarn in cluster mode, using the 
> InProcessLauncher, spark finds the cluster's configuration files based on the 
> HADOOP_CONF_DIR environment variable. This does not make it possible to 
> submit to more than one Yarn cluster concurrently using the 
> InProcessLauncher.
> I think we should make it possible to define HADOOP_CONF_DIR as a spark 
> property, so it can be different for each spark submission.
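For illustration, a sketch of how the proposal could look from the caller's side. The launcher calls are the existing org.apache.spark.launcher API, but the property name, class, and jar path are hypothetical placeholders, not anything this ticket defines.

{code:scala}
import org.apache.spark.launcher.InProcessLauncher

// Sketch only: "spark.hadoopConfDir" is a hypothetical property standing in for
// whatever name the proposal would use; the main class and jar are placeholders.
new InProcessLauncher()
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setMainClass("com.example.MyApp")
  .setAppResource("/path/to/my-app.jar")
  .setConf("spark.hadoopConfDir", "/etc/hadoop/conf-clusterA")
  .startApplication()
{code}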



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25200) Allow setting HADOOP_CONF_DIR as a spark property

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25200:


Assignee: Apache Spark

> Allow setting HADOOP_CONF_DIR as a spark property
> -
>
> Key: SPARK-25200
> URL: https://issues.apache.org/jira/browse/SPARK-25200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Adam Balogh
>Assignee: Apache Spark
>Priority: Major
>
> When submitting applications to Yarn in cluster mode, using the 
> InProcessLauncher, spark finds the cluster's configuration files based on the 
> HADOOP_CONF_DIR environment variable. This does not make it possible to 
> submit to more than one Yarn cluster concurrently using the 
> InProcessLauncher.
> I think we should make it possible to define HADOOP_CONF_DIR as a spark 
> property, so it can be different for each spark submission.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25200) Allow setting HADOOP_CONF_DIR as a spark property

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25200:


Assignee: (was: Apache Spark)

> Allow setting HADOOP_CONF_DIR as a spark property
> -
>
> Key: SPARK-25200
> URL: https://issues.apache.org/jira/browse/SPARK-25200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Adam Balogh
>Priority: Major
>
> When submitting applications to Yarn in cluster mode, using the 
> InProcessLauncher, spark finds the cluster's configuration files based on the 
> HADOOP_CONF_DIR environment variable. This does not make it possible to 
> submit to more than one Yarn cluster concurrently using the 
> InProcessLauncher.
> I think we should make it possible to define HADOOP_CONF_DIR as a spark 
> property, so it can be different for each spark submission.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-08-30 Thread Parker Hegstrom (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597781#comment-16597781
 ] 

Parker Hegstrom commented on SPARK-23435:
-

What is the status of this? My appveyor.yml has the lower version of testthat, 
but I'm still getting the original error.

Looks like it's because I followed 
http://spark.apache.org/docs/latest/building-spark.html#running-r-tests, but 
this command downloads the most recent testthat version.

Can someone change this until your PR goes through?

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817: the latest version of testthat, 2.0.0, was 
> released in Dec 2017, and its API has changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though; we need to check whether it is going to work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25233) Give the user the option of specifying a fixed minimum message per partition per batch when using kafka direct API with backpressure

2018-08-30 Thread Cody Koeninger (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-25233.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 3
[https://github.com/apache/spark/pull/3]

> Give the user the option of specifying a fixed minimum message per partition 
> per batch when using kafka direct API with backpressure
> 
>
> Key: SPARK-25233
> URL: https://issues.apache.org/jira/browse/SPARK-25233
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.0
>Reporter: Reza Safi
>Assignee: Reza Safi
>Priority: Major
> Fix For: 2.4.0
>
>
> After SPARK-18371, it is guaranteed that there will be at least *one* 
> message per partition per batch when using the direct Kafka API with 
> backpressure and new messages exist in the topics. It would be better if the 
> user had the option of setting this minimum instead of a hard-coded limit of 1.
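For context, a small sketch of the settings involved. The backpressure and max-rate keys below are existing settings; the name of the minimum-rate option is an assumption about what this ticket adds, and all values are illustrative.

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  // Existing cap on records per partition per batch.
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
  // Assumed name of the new floor this ticket proposes, replacing the hard-coded minimum of 1.
  .set("spark.streaming.kafka.minRatePerPartition", "10")
{code}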



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25233) Give the user the option of specifying a fixed minimum message per partition per batch when using kafka direct API with backpressure

2018-08-30 Thread Cody Koeninger (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger reassigned SPARK-25233:
--

Assignee: Reza Safi

> Give the user the option of specifying a fixed minimum message per partition 
> per batch when using kafka direct API with backpressure
> 
>
> Key: SPARK-25233
> URL: https://issues.apache.org/jira/browse/SPARK-25233
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.4.0
>Reporter: Reza Safi
>Assignee: Reza Safi
>Priority: Major
>
> After SPARK-18371, it is guaranteed that there will be at least *one* 
> message per partition per batch when using the direct Kafka API with 
> backpressure and new messages exist in the topics. It would be better if the 
> user had the option of setting this minimum instead of a hard-coded limit of 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597750#comment-16597750
 ] 

Apache Spark commented on SPARK-22148:
--

User 'dhruve' has created a pull request for this issue:
https://github.com/apache/spark/pull/22288

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current 
> executors are blacklisted but dynamic allocation is enabled
> -
>
> Key: SPARK-22148
> URL: https://issues.apache.org/jira/browse/SPARK-22148
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Juan Rodríguez Hortalá
>Priority: Major
> Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and 
> the whole Spark job with `task X (partition Y) cannot run anywhere due to 
> node and executor blacklist. Blacklisting behavior can be configured via 
> spark.blacklist.*.` when all the available executors are blacklisted for a 
> pending Task or TaskSet. This makes sense for static allocation, where the 
> set of executors is fixed for the duration of the application, but this might 
> lead to unnecessary job failures when dynamic allocation is enabled. For 
> example, in a Spark application with a single job at a time, when a node 
> fails at the end of a stage attempt, all other executors will complete their 
> tasks, but the tasks running in the executors of the failing node will be 
> pending. Spark will keep waiting for those tasks for 2 minutes by default 
> (spark.network.timeout) until the heartbeat timeout is triggered, and then it 
> will blacklist those executors for that stage. At that point in time, other 
> executors would have been released after being idle for 1 minute by default 
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't 
> started yet and so there are no more tasks available (assuming the default of 
> spark.speculation = false). So Spark will fail because the only executors 
> available are blacklisted for that stage. 
> An alternative is requesting more executors to the cluster manager in this 
> situation. This could be retried a configurable number of times after a 
> configurable wait time between request attempts, so if the cluster manager 
> fails to provide a suitable executor then the job is aborted like in the 
> previous case. 
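For readers following the timing argument above, a sketch of the existing settings whose interplay produces the failure window. All keys are real Spark settings; the timeout values match the defaults cited in the description, and blacklisting and dynamic allocation are shown enabled as in the scenario.

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")                      // per-stage executor/node blacklisting
  .set("spark.network.timeout", "120s")                        // heartbeat timeout that finally marks the node lost
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")   // idle executors are released before the retry
  .set("spark.speculation", "false")
{code}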



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-30 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597742#comment-16597742
 ] 

Yuming Wang commented on SPARK-25135:
-

[~dongjoon] ORC has this issue. Reproduction code:
{code:scala}
withTempPath { dir =>
  val path = dir.getCanonicalPath
  val cnt = 30
  val table1Path = s"$path/table1"
  val table2Path = s"$path/table2"
  val data =
    spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id % 3 as bigint) as col2")
  data.write.mode(SaveMode.Overwrite).orc(table1Path)
  withTable("table1", "table2", "table3") {
    spark.sql(
      s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc location '$table1Path'")
    spark.sql(
      s"CREATE TABLE table2(COL1 bigint, COL2 bigint) using orc location '$table2Path'")

    withView("view1") {
      spark.sql("CREATE VIEW view1 as select col1, col2 from table1 where col1 > -20")
      spark.sql("INSERT OVERWRITE TABLE table2 select COL1, COL2 from view1")
      checkAnswer(spark.table("table2"), data)
      assert(spark.read.orc(table2Path).schema === spark.table("table2").schema)
    }
  }
}
{code}
result should be:

{noformat}
Expected :StructType(StructField(COL1,LongType,true), StructField(COL2,LongType,true))
Actual   :StructType(StructField(col1,LongType,true), StructField(col2,LongType,true))
{noformat}


> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: Parquet, correctness
>
> This happens on parquet.
> How to reproduce in parquet.
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> FYI, the following is orc.
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +----+----+
> |COL1|COL2|
> +----+----+
> |  15|  15|
> |  16|  16|
> |  17|  17|
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25135) insert datasource table may all null when select from view on parquet

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597741#comment-16597741
 ] 

Apache Spark commented on SPARK-25135:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22287

> insert datasource table may all null when select from view on parquet
> -
>
> Key: SPARK-25135
> URL: https://issues.apache.org/jira/browse/SPARK-25135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Yuming Wang
>Priority: Blocker
>  Labels: Parquet, correctness
>
> This happens on parquet.
> How to reproduce in parquet.
> {code:scala}
> val path = "/tmp/spark/parquet"
> val cnt = 30
> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").parquet(path)
> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$path'")
> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using parquet")
> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> spark.table("table2").show
> {code}
> FYI, the following is orc.
> {code}
> scala> val path = "/tmp/spark/orc"
> scala> val cnt = 30
> scala> spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id as bigint) as col2").write.mode("overwrite").orc(path)
> scala> spark.sql(s"CREATE TABLE table1(col1 bigint, col2 bigint) using orc location '$path'")
> scala> spark.sql("create view view1 as select col1, col2 from table1 where col1 > -20")
> scala> spark.sql("create table table2 (COL1 BIGINT, COL2 BIGINT) using orc")
> scala> spark.sql("insert overwrite table table2 select COL1, COL2 from view1")
> scala> spark.table("table2").show
> +----+----+
> |COL1|COL2|
> +----+----+
> |  15|  15|
> |  16|  16|
> |  17|  17|
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25284) Spark UI: make sure skipped stages are updated onJobEnd

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25284:


Assignee: Apache Spark

> Spark UI: make sure skipped stages are updated onJobEnd
> ---
>
> Key: SPARK-25284
> URL: https://issues.apache.org/jira/browse/SPARK-25284
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Major
>
> Tiny bug in onJobEnd not forcing update of skipped stages in KVstore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25284) Spark UI: make sure skipped stages are updated onJobEnd

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597661#comment-16597661
 ] 

Apache Spark commented on SPARK-25284:
--

User 'juliuszsompolski' has created a pull request for this issue:
https://github.com/apache/spark/pull/22286

> Spark UI: make sure skipped stages are updated onJobEnd
> ---
>
> Key: SPARK-25284
> URL: https://issues.apache.org/jira/browse/SPARK-25284
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Tiny bug in onJobEnd not forcing update of skipped stages in KVstore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25284) Spark UI: make sure skipped stages are updated onJobEnd

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25284:


Assignee: (was: Apache Spark)

> Spark UI: make sure skipped stages are updated onJobEnd
> ---
>
> Key: SPARK-25284
> URL: https://issues.apache.org/jira/browse/SPARK-25284
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Tiny bug in onJobEnd not forcing update of skipped stages in KVstore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25284) Spark UI: make sure skipped stages are updated onJobEnd

2018-08-30 Thread Juliusz Sompolski (JIRA)
Juliusz Sompolski created SPARK-25284:
-

 Summary: Spark UI: make sure skipped stages are updated onJobEnd
 Key: SPARK-25284
 URL: https://issues.apache.org/jira/browse/SPARK-25284
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0
Reporter: Juliusz Sompolski


Tiny bug in onJobEnd not forcing update of skipped stages in KVstore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25275:


Assignee: (was: Apache Spark)

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: docker, kubernetes
>
> For improved security, configure the images so that users must be in the wheel 
> group in order to run su.
> See example:
> https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597561#comment-16597561
 ] 

Apache Spark commented on SPARK-25275:
--

User 'erikerlandson' has created a pull request for this issue:
https://github.com/apache/spark/pull/22285

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Priority: Major
>  Labels: docker, kubernetes
>
> For improved security, configure the images so that users must be in the wheel 
> group in order to run su.
> See example:
> https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25275:


Assignee: Apache Spark

> require membership in wheel to run 'su' (in dockerfiles)
> ---
>
> Key: SPARK-25275
> URL: https://issues.apache.org/jira/browse/SPARK-25275
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Erik Erlandson
>Assignee: Apache Spark
>Priority: Major
>  Labels: docker, kubernetes
>
> For improved security, configure the images so that users must be in the wheel 
> group in order to run su.
> See example:
> https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25278:


Assignee: Apache Spark

> Number of output rows metric of union of views is multiplied by their 
> occurrences
> -
>
> Key: SPARK-25278
> URL: https://issues.apache.org/jira/browse/SPARK-25278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Major
> Attachments: union-2-views.png, union-3-views.png
>
>
> When you use a view in a union multiple times (self-union), the {{number of 
> output rows}} metric seems to be the correct {{number of output rows}} 
> multiplied by the occurrences of the view, e.g.
> {code:java}
> scala> spark.version
> res0: String = 2.3.1
> val name = "demo_view"
> sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2")
> assert(spark.catalog.tableExists(name))
> val view = spark.table(name)
> assert(view.count == 2)
> view.union(view).show // gives 4 for every view (as a LocalTableScan), but should be 2
> view.union(view).union(view).show // gives 6{code}
> I think it's because the {{View}} logical operator is a {{MultiInstanceRelation}} 
> (and I think other {{MultiInstanceRelations}} may also be affected).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25278:


Assignee: (was: Apache Spark)

> Number of output rows metric of union of views is multiplied by their 
> occurrences
> -
>
> Key: SPARK-25278
> URL: https://issues.apache.org/jira/browse/SPARK-25278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jacek Laskowski
>Priority: Major
> Attachments: union-2-views.png, union-3-views.png
>
>
> When you use a view in a union multiple times (self-union), the {{number of 
> output rows}} metric seems to be the correct {{number of output rows}} 
> multiplied by the occurrences of the view, e.g.
> {code:java}
> scala> spark.version
> res0: String = 2.3.1
> val name = "demo_view"
> sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2")
> assert(spark.catalog.tableExists(name))
> val view = spark.table(name)
> assert(view.count == 2)
> view.union(view).show // gives 4 for every view (as a LocalTableScan), but should be 2
> view.union(view).union(view).show // gives 6{code}
> I think it's because the {{View}} logical operator is a {{MultiInstanceRelation}} 
> (and I think other {{MultiInstanceRelations}} may also be affected).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25278) Number of output rows metric of union of views is multiplied by their occurrences

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597528#comment-16597528
 ] 

Apache Spark commented on SPARK-25278:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22284

> Number of output rows metric of union of views is multiplied by their 
> occurrences
> -
>
> Key: SPARK-25278
> URL: https://issues.apache.org/jira/browse/SPARK-25278
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jacek Laskowski
>Priority: Major
> Attachments: union-2-views.png, union-3-views.png
>
>
> When you use a view in a union multiple times (self-union), the {{number of 
> output rows}} metric seems to be the correct {{number of output rows}} 
> multiplied by the occurrences of the view, e.g.
> {code:java}
> scala> spark.version
> res0: String = 2.3.1
> val name = "demo_view"
> sql(s"CREATE OR REPLACE VIEW $name AS VALUES 1,2")
> assert(spark.catalog.tableExists(name))
> val view = spark.table(name)
> assert(view.count == 2)
> view.union(view).show // gives 4 for every view (as a LocalTableScan), but should be 2
> view.union(view).union(view).show // gives 6{code}
> I think it's because the {{View}} logical operator is a {{MultiInstanceRelation}} 
> (and I think other {{MultiInstanceRelations}} may also be affected).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-08-30 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24909:
--
Fix Version/s: 2.3.2

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
> Fix For: 2.3.2, 2.4.0
>
>
> The DAGScheduler can hang if the executor was lost (due to fetch failure) and 
> all the tasks in the tasks sets are marked as completed. 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265)]
> It never creates new task attempts in the task scheduler but the dag 
> scheduler still has pendingPartitions.
> {code:java}
> 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33{code}
>  
> In the logs above you will see that task 55769.0 finished after the executor 
> was lost and a new task set was started.  The DAG scheduler says "Ignoring 
> possibly bogus".. but in the TaskSetManager side it has marked those tasks as 
> completed for all stage attempts. The DAGScheduler gets hung here.  I did a 
> heap dump on the process and can see that 55769 is still in the DAGScheduler 
> pendingPartitions list but the tasksetmanagers are all complete
> Note: to reproduce this, you need a situation where you have a shufflemaptask 
> (call it task1) fetching data from an executor where it also has other 
> shufflemaptasks (call it task2) running (fetching from other hosts). The task1 
> fetching the data has to FetchFail, which causes the stage to fail and 
> the executor to be marked as lost due to the fetch failure. It restarts a 
> new task set for the new stage attempt, and then the shufflemaptask task2 that 
> was running on the executor that was marked lost finishes. The scheduler 
> ignores that completion event ("Ignoring possibly bogus ..."). This results in a 
> hang because at this point the TaskSetManager has already marked all tasks 
> for all attempts of that stage as completed.
>  
> Configs needed to be on:
> spark.blacklist.application.fetchFailure.enabled=true
> spark.files.fetchFailure.unRegisterOutputOnHost=true
> spark.shuffle.service.enabled=true



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25261) Standardize the default units of spark.driver|executor.memory

2018-08-30 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25261:
--
Target Version/s: 3.0.0
 Component/s: (was: Documentation)
  YARN
  Spark Core
  Kubernetes
 Summary: Standardize the default units of 
spark.driver|executor.memory  (was: Update configuration.md, correct the 
default units of spark.driver|executor.memory)

Really, these properties are unfortunately parsed differently in different parts 
of the code: YARN and K8S interpret a string without units as MiB, but 
spark-submit does not.

We can fix the docs in a first PR and then correct this inconsistent behavior in 
Spark 3. I would even vote for dropping support for unit-less strings in any 
such property at that point.
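
Until the behavior is unified, the ambiguity can be sidestepped by always passing explicit units. A minimal sketch (the values are arbitrary, and SparkConf is only one of several places these properties can be set):

{code:scala}
import org.apache.spark.SparkConf

// Example values only. With explicit units, the setting means the same thing
// whether it is read by spark-submit, YARN or Kubernetes.
val conf = new SparkConf()
  .set("spark.driver.memory", "4g")
  .set("spark.executor.memory", "8g")
{code}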

> Standardize the default units of spark.driver|executor.memory
> -
>
> Key: SPARK-25261
> URL: https://issues.apache.org/jira/browse/SPARK-25261
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core, YARN
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Minor
>
> From 
> [SparkContext|https://github.com/ivoson/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L464]
>  and 
> [SparkSubmitCommandBuilder|https://github.com/ivoson/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L265],
>  we can see that spark.driver.memory and spark.executor.memory are parsed as 
> bytes if no units are specified. But the docs describe them as MiB by default, 
> which may lead to misunderstanding.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25283) A deadlock in UnionRDD

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25283:


Assignee: (was: Apache Spark)

> A deadlock in UnionRDD
> --
>
> Key: SPARK-25283
> URL: https://issues.apache.org/jira/browse/SPARK-25283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/21913 replaced Scala parallel 
> collections in UnionRDD with the new parmap function. This change causes a 
> deadlock in the partitions method. The following code demonstrates the problem:
> {code:scala}
> val wide = 20
> def unionRDD(num: Int): UnionRDD[Int] = {
>   val rdds = (0 until num).map(_ => sc.parallelize(1 to 10, 1))
>   new UnionRDD(sc, rdds)
> }
> val level0 = (0 until wide).map { _ =>
>   val level1 = (0 until wide).map(_ => unionRDD(wide))
>   new UnionRDD(sc, level1)
> }
> val rdd = new UnionRDD(sc, level0)
> rdd.partitions.length
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25283) A deadlock in UnionRDD

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25283:


Assignee: Apache Spark

> A deadlock in UnionRDD
> --
>
> Key: SPARK-25283
> URL: https://issues.apache.org/jira/browse/SPARK-25283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/21913 replaced Scala parallel 
> collections in UnionRDD with the new parmap function. This change causes a 
> deadlock in the partitions method. The following code demonstrates the problem:
> {code:scala}
> val wide = 20
> def unionRDD(num: Int): UnionRDD[Int] = {
>   val rdds = (0 until num).map(_ => sc.parallelize(1 to 10, 1))
>   new UnionRDD(sc, rdds)
> }
> val level0 = (0 until wide).map { _ =>
>   val level1 = (0 until wide).map(_ => unionRDD(wide))
>   new UnionRDD(sc, level1)
> }
> val rdd = new UnionRDD(sc, level0)
> rdd.partitions.length
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25283) A deadlock in UnionRDD

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597391#comment-16597391
 ] 

Apache Spark commented on SPARK-25283:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22283

> A deadlock in UnionRDD
> --
>
> Key: SPARK-25283
> URL: https://issues.apache.org/jira/browse/SPARK-25283
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The PR https://github.com/apache/spark/pull/21913 replaced Scala parallel 
> collections in UnionRDD with the new parmap function. This change causes a 
> deadlock in the partitions method. The following code demonstrates the problem:
> {code:scala}
> val wide = 20
> def unionRDD(num: Int): UnionRDD[Int] = {
>   val rdds = (0 until num).map(_ => sc.parallelize(1 to 10, 1))
>   new UnionRDD(sc, rdds)
> }
> val level0 = (0 until wide).map { _ =>
>   val level1 = (0 until wide).map(_ => unionRDD(wide))
>   new UnionRDD(sc, level1)
> }
> val rdd = new UnionRDD(sc, level0)
> rdd.partitions.length
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23539:


Assignee: Apache Spark

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>
> Kafka headers were added in Kafka 0.11. We should expose them through our 
> Kafka data source in both batch and streaming queries. 
> This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 
> to 1.0+ (SPARK-18057).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2018-08-30 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23539:


Assignee: (was: Apache Spark)

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Priority: Major
>
> Kafka headers were added in Kafka 0.11. We should expose them through our 
> Kafka data source in both batch and streaming queries. 
> This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 
> to 1.0+ (SPARK-18057).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2018-08-30 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597380#comment-16597380
 ] 

Apache Spark commented on SPARK-23539:
--

User 'dongjinleekr' has created a pull request for this issue:
https://github.com/apache/spark/pull/22282

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Priority: Major
>
> Kafka headers were added in Kafka 0.11. We should expose them through our 
> Kafka data source in both batch and streaming queries. 
> This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 
> to 1.0+ (SPARK-18057).
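
For illustration only, a sketch of what exposing headers might look like from the user side. The {{includeHeaders}} option and the {{headers}} column are hypothetical here, not an existing API at the time of this issue:

{code:scala}
// Hypothetical sketch: the "includeHeaders" option and the "headers" column do
// not exist in Spark 2.3; they are one possible shape for the requested feature.
import org.apache.spark.sql.SparkSession

object KafkaHeadersSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-headers-sketch").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("includeHeaders", "true")   // hypothetical option
      .load()

    // Exposing headers as an extra column alongside key/value is an assumption.
    stream
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
      .writeStream
      .format("console")
      .start()
      .awaitTermination()
  }
}
{code}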



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25273) How to install testthat v1.0.2

2018-08-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25273:
-
Fix Version/s: 2.3.2

> How to install testthat v1.0.2
> --
>
> Key: SPARK-25273
> URL: https://issues.apache.org/jira/browse/SPARK-25273
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.3.2, 2.4.0
>
>
> The command installs testthat v2.0.x:
> {code:R}
> R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
> 'survival'), repos='http://cran.us.r-project.org')"
> {code}
> which prevents the R tests from running. The section 
> http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
> needs to be updated according to https://github.com/apache/spark/pull/20003.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25283) A deadlock in UnionRDD

2018-08-30 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25283:
--

 Summary: A deadlock in UnionRDD
 Key: SPARK-25283
 URL: https://issues.apache.org/jira/browse/SPARK-25283
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The PR https://github.com/apache/spark/pull/21913 replaced Scala parallel 
collections in UnionRDD with the new parmap function. This change causes a 
deadlock in the partitions method. The following code demonstrates the problem:
{code:scala}
val wide = 20
def unionRDD(num: Int): UnionRDD[Int] = {
  val rdds = (0 until num).map(_ => sc.parallelize(1 to 10, 1))
  new UnionRDD(sc, rdds)
}
val level0 = (0 until wide).map { _ =>
  val level1 = (0 until wide).map(_ => unionRDD(wide))
  new UnionRDD(sc, level1)
}
val rdd = new UnionRDD(sc, level0)

rdd.partitions.length
{code}
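
The mechanism is presumably the classic pattern of nested blocking parallelism on a single bounded thread pool. The following is a generic, non-Spark sketch of that pattern (pool size and timeouts are arbitrary, and the stall surfaces as a timeout here rather than an indefinite hang), not a reproduction of the parmap internals:

{code:scala}
// Generic illustration: nested blocking parallel work on one bounded thread
// pool. The outer tasks occupy every worker thread while waiting for inner
// tasks that can never be scheduled, so no progress is made.
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object NestedPoolDeadlock {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(2)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val outer = Future.traverse((1 to 2).toList) { _ =>
        Future {
          // Both pool threads block here; the inner futures never get a thread.
          val inner = Future.traverse((1 to 2).toList)(i => Future(i))
          Await.result(inner, 5.seconds)
        }
      }
      Await.result(outer, 10.seconds) // never completes normally
    } finally {
      pool.shutdownNow() // let the JVM exit after the demonstration
    }
  }
}
{code}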




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25273) How to install testthat v1.0.2

2018-08-30 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25273.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22272
[https://github.com/apache/spark/pull/22272]

> How to install testthat v1.0.2
> --
>
> Key: SPARK-25273
> URL: https://issues.apache.org/jira/browse/SPARK-25273
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> The command installs testthat v2.0.x:
> {code:R}
> R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
> 'survival'), repos='http://cran.us.r-project.org')"
> {code}
> which prevents the R tests from running. The section 
> http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
> needs to be updated according to https://github.com/apache/spark/pull/20003.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


