[jira] [Commented] (SPARK-22055) Port release scripts

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486728#comment-16486728
 ] 

Felix Cheung commented on SPARK-22055:
--

Interesting - I'd definitely be happy to help.

Do you have it scripted to inject the signing key into the Docker image?

 

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054.






[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486719#comment-16486719
 ] 

Felix Cheung edited comment on SPARK-24359 at 5/23/18 5:21 AM:
---

# Could you include the design doc as a Google Doc? It would be easier to 
comment, ask questions, etc.
 # Is the plan to tightly couple the SparkML package to a particular Spark ASF 
release and its jar (like SparkR), or is SparkML going to work with multiple 
Spark releases (like sparklyr)?
 # If SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply work 
with a SparkSession?
 # One of the first comments: please be consistent with the naming convention 
(there is no `.` notation for methods in R). Are both `train.validation.split()` 
and `set_estimator(lr)` methods? Please don't mix `.` and `_` in names, and 
hopefully also avoid mixing in Scala's camelCase.

Releasing onto CRAN takes a lot of work - lots of scripts, tests, and so on 
that would now be "duplicated" for a second R package. The process is much, 
much harder for any R package that depends on the JVM. I hope we keep this in 
mind for this proposal.

Link to https://issues.apache.org/jira/browse/SPARK-18822

 


was (Author: felixcheung):
# Could you include the design doc as a Google Doc? It would be easier to 
comment, ask questions, etc.
 # Is the plan to tightly couple the SparkML package to a particular Spark ASF 
release and its jar (like SparkR), or is SparkML going to work with multiple 
Spark releases (like sparklyr)?
 # If SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply work 
with a SparkSession?
 # One of the first comments: please be consistent with style - 

Releasing onto CRAN takes a lot of work - lots of scripts, tests, and so on 
that would now be "duplicated" for a second R package. The process is much, 
much harder for any R package that depends on the JVM. I hope we keep this in 
mind for this proposal.

Link to https://issues.apache.org/jira/browse/SPARK-18822

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib APIs, but sparklyr's API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do 

[jira] [Comment Edited] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486719#comment-16486719
 ] 

Felix Cheung edited comment on SPARK-24359 at 5/23/18 5:18 AM:
---

# Could you include the design doc as a Google Doc? It would be easier to 
comment, ask questions, etc.
 # Is the plan to tightly couple the SparkML package to a particular Spark ASF 
release and its jar (like SparkR), or is SparkML going to work with multiple 
Spark releases (like sparklyr)?
 # If SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply work 
with a SparkSession?
 # One of the first comments: please be consistent with style - 

Releasing onto CRAN takes a lot of work - lots of scripts, tests, and so on 
that would now be "duplicated" for a second R package. The process is much, 
much harder for any R package that depends on the JVM. I hope we keep this in 
mind for this proposal.

Link to https://issues.apache.org/jira/browse/SPARK-18822

 


was (Author: felixcheung):
# Could you include the design doc as a Google Doc? It would be easier to 
comment, ask questions, etc.
 # Is the plan to tightly couple the SparkML package to a particular Spark ASF 
release and its jar (like SparkR), or is SparkML going to work with multiple 
Spark releases (like sparklyr)?
 # If SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply work 
with a SparkSession?

Releasing onto CRAN takes a lot of work - lots of scripts, tests, and so on 
that would now be "duplicated" for a second R package. The process is much, 
much harder for any R package that depends on the JVM. I hope we keep this in 
mind for this proposal.

Link to https://issues.apache.org/jira/browse/SPARK-18822

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib APIs, but sparklyr's API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that 

[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486719#comment-16486719
 ] 

Felix Cheung commented on SPARK-24359:
--

# Could you include the design doc as a Google Doc? It would be easier to 
comment, ask questions, etc.
 # Is the plan to tightly couple the SparkML package to a particular Spark ASF 
release and its jar (like SparkR), or is SparkML going to work with multiple 
Spark releases (like sparklyr)?
 # If SparkML does not depend on SparkR, how do you propose it communicates 
with the Spark JVM? How do you get data into SparkML (on the JVM side, Spark's 
ML Pipeline Model still depends on Spark's Dataset/DataFrame), or simply work 
with a SparkSession?

Releasing onto CRAN takes a lot of work - lots of scripts, tests, and so on 
that would now be "duplicated" for a second R package. The process is much, 
much harder for any R package that depends on the JVM. I hope we keep this in 
mind for this proposal.

Link to https://issues.apache.org/jira/browse/SPARK-18822

 

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib APIs, but sparklyr's API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> constructors are dot separated (e.g., spark.logistic.regression()) and all 
> setters and getters are snake case (e.g., 

[jira] [Commented] (SPARK-22366) Support ignoreMissingFiles flag parallel to ignoreCorruptFiles

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486717#comment-16486717
 ] 

Hyukjin Kwon commented on SPARK-22366:
--

Oops, sorry. I mistakenly edited the JIRA. I reverted it back.

> Support ignoreMissingFiles flag parallel to ignoreCorruptFiles
> --
>
> Key: SPARK-22366
> URL: https://issues.apache.org/jira/browse/SPARK-22366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will 
> quietly ignore attempted reads from files that have been corrupted, but it 
> still allows the query to fail on missing files. Being able to ignore missing 
> files too is useful in some replication scenarios.
> We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
> functionality.
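
Editor's note: a minimal PySpark sketch of how the existing flag and the flag 
proposed here would be toggled, assuming a running SparkSession; the input 
path below is hypothetical.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignore-files-example").getOrCreate()

# Existing behavior: silently skip files whose contents are corrupted.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Proposed in this ticket: also skip files that disappear between query
# planning and execution, useful in the replication scenarios described above.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

# Reads under these settings drop unreadable or missing files instead of
# failing the whole query.
df = spark.read.parquet("/data/events")
{code}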






[jira] [Updated] (SPARK-22366) Support ignoreMissingFiles flag parallel to ignoreCorruptFiles

2018-05-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22366:
-
Description: 
+underlined text+There's an existing flag "spark.sql.files.ignoreCorruptFiles" 
that will quietly ignore attempted reads from files that have been corrupted, 
but it still allows the query to fail on missing files. Being able to ignore 
missing files too is useful in some replication scenarios.

We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
functionality.

  was:
There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will quietly 
ignore attempted reads from files that have been corrupted, but it still allows 
the query to fail on missing files. Being able to ignore missing files too is 
useful in some replication scenarios.

We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
functionality.


> Support ignoreMissingFiles flag parallel to ignoreCorruptFiles
> --
>
> Key: SPARK-22366
> URL: https://issues.apache.org/jira/browse/SPARK-22366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> +underlined text+There's an existing flag 
> "spark.sql.files.ignoreCorruptFiles" that will quietly ignore attempted reads 
> from files that have been corrupted, but it still allows the query to fail on 
> missing files. Being able to ignore missing files too is useful in some 
> replication scenarios.
> We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
> functionality.






[jira] [Updated] (SPARK-22366) Support ignoreMissingFiles flag parallel to ignoreCorruptFiles

2018-05-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22366:
-
Description: 
There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will quietly 
ignore attempted reads from files that have been corrupted, but it still allows 
the query to fail on missing files. Being able to ignore missing files too is 
useful in some replication scenarios.

We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
functionality.

  was:
+underlined text+There's an existing flag "spark.sql.files.ignoreCorruptFiles" 
that will quietly ignore attempted reads from files that have been corrupted, 
but it still allows the query to fail on missing files. Being able to ignore 
missing files too is useful in some replication scenarios.

We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
functionality.


> Support ignoreMissingFiles flag parallel to ignoreCorruptFiles
> --
>
> Key: SPARK-22366
> URL: https://issues.apache.org/jira/browse/SPARK-22366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will 
> quietly ignore attempted reads from files that have been corrupted, but it 
> still allows the query to fail on missing files. Being able to ignore missing 
> files too is useful in some replication scenarios.
> We should add a "spark.sql.files.ignoreMissingFiles" to fill out the 
> functionality.






[jira] [Closed] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin closed SPARK-24349.
--

> obtainDelegationTokens() exits JVM if Driver use JDBC instead of using 
> metastore 
> -
>
> Key: SPARK-24349
> URL: https://issues.apache.org/jira/browse/SPARK-24349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
> --proxy-user to impersonate will invoke obtainDelegationTokens(). But if the 
> current settings connect to the DB directly via JDBC instead of via RPC with 
> the metastore, it will fail with
> {code}
> WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not 
> exist
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: Hive metastore uri undefined
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
> 18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
> {code}
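
Editor's aside: the requirement that fails above is the undefined Hive 
metastore URI. A minimal sketch of the configuration that check expects is 
shown below; the thrift endpoint is hypothetical, and hive-site.xml is the 
more usual place to set it.

{code:python}
# Sketch only: give the session a metastore URI so the Hive credential code
# path has an RPC endpoint to request delegation tokens from.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metastore-uri-example")
         # hypothetical endpoint; normally configured in hive-site.xml
         .config("hive.metastore.uris", "thrift://metastore-host:9083")
         .enableHiveSupport()
         .getOrCreate())
{code}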






[jira] [Resolved] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin resolved SPARK-24349.

Resolution: Not A Problem

delegationTokensRequired has been checked in SparkSQLCLIDriver.scala

> obtainDelegationTokens() exits JVM if Driver use JDBC instead of using 
> metastore 
> -
>
> Key: SPARK-24349
> URL: https://issues.apache.org/jira/browse/SPARK-24349
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
>
> In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
> --proxy-user to impersonate will invoke obtainDelegationTokens(). But if the 
> current settings connect to the DB directly via JDBC instead of via RPC with 
> the metastore, it will fail with
> {code}
> WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not 
> exist
> Exception in thread "main" java.lang.IllegalArgumentException: requirement 
> failed: Hive metastore uri undefined
> at scala.Predef$.require(Predef.scala:224)
> at 
> org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
> 18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
> /tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
> {code}






[jira] [Assigned] (SPARK-24361) Polish code block manipulation API

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24361:


Assignee: (was: Apache Spark)

> Polish code block manipulation API
> --
>
> Key: SPARK-24361
> URL: https://issues.apache.org/jira/browse/SPARK-24361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Current code block manipulation API is immature and hacky. We should have a 
> formal API to manipulate code blocks.






[jira] [Assigned] (SPARK-24361) Polish code block manipulation API

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24361:


Assignee: Apache Spark

> Polish code block manipulation API
> --
>
> Key: SPARK-24361
> URL: https://issues.apache.org/jira/browse/SPARK-24361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Current code block manipulation API is immature and hacky. We should have a 
> formal API to manipulate code blocks.






[jira] [Commented] (SPARK-24361) Polish code block manipulation API

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486652#comment-16486652
 ] 

Apache Spark commented on SPARK-24361:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21405

> Polish code block manipulation API
> --
>
> Key: SPARK-24361
> URL: https://issues.apache.org/jira/browse/SPARK-24361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Current code block manipulation API is immature and hacky. We should have a 
> formal API to manipulate code blocks.






[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Component/s: (was: Optimizer)
 Spark Core

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Major
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, scheduling the relatively large 
> (long-running) tasks before the small ones can shorten the overall execution 
> time.
> Therefore, Spark should add a feature that schedules the large tasks of a 
> task set first and the small tasks afterwards.
> The time span of a task can be estimated from its historical execution time: 
> a task that took long in the past will likely take long in the future. Record 
> the last execution time together with the task's key in a log file, and load 
> that log file on the next run. Using RangePartitioner-style partitioning to 
> prioritize large tasks can then achieve this concurrent-task optimization.
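
Editor's illustration of the scheduling argument above (not Spark code): a 
greedy longest-processing-time-first assignment onto a fixed number of slots, 
using estimated durations such as those recorded in the proposed log file.

{code:python}
# Compare the makespan when short tasks run first vs. when long tasks run first.
import heapq

def makespan(durations, slots, longest_first):
    order = sorted(durations, reverse=longest_first)
    finish_times = [0.0] * slots                # one entry per executor slot
    heapq.heapify(finish_times)
    for d in order:
        earliest = heapq.heappop(finish_times)  # slot that frees up first
        heapq.heappush(finish_times, earliest + d)
    return max(finish_times)

tasks = [30, 5, 5, 5, 5, 5, 5]                  # one long task, six short ones
print(makespan(tasks, slots=2, longest_first=False))  # short tasks first -> 45
print(makespan(tasks, slots=2, longest_first=True))   # long task first  -> 30
{code}

Running the long task first lets the short tasks fill the other slot, which is 
the effect the proposal aims for.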






[jira] [Created] (SPARK-24361) Polish code block manipulation API

2018-05-22 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-24361:
---

 Summary: Polish code block manipulation API
 Key: SPARK-24361
 URL: https://issues.apache.org/jira/browse/SPARK-24361
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


Current code block manipulation API is immature and hacky. We should have a 
formal API to manipulate code blocks.






[jira] [Updated] (SPARK-24339) spark sql can not prune column in transform/map/reduce query

2018-05-22 Thread xdcjie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xdcjie updated SPARK-24339:
---
Affects Version/s: (was: 2.2.1)
   (was: 2.1.2)
   (was: 2.1.1)

> spark sql can not prune column in transform/map/reduce query
> 
>
> Key: SPARK-24339
> URL: https://issues.apache.org/jira/browse/SPARK-24339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: xdcjie
>Priority: Minor
>  Labels: map, reduce, sql, transform
>
> I was using {{TRANSFORM USING}} with branch-2.1/2.2, and noticed that it will 
> scan all columns of the data, with a query like:
> {code:java}
> SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM 
> test_table;{code}
> Its physical plan looks like:
> {code:java}
> == Physical Plan ==
> ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, 
> u2#786, c2#787], 
> HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,
> )),List((field.delim,   
> )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
> +- FileScan parquet 
> [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,...
>  367 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct {code}
> In our scenario, the Parquet file has 400 columns, so this query takes much more time.
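
Editor's note: until TRANSFORM queries prune columns automatically, a 
workaround sketch is to project only the needed columns in a subquery so the 
FileScan does not have to read all ~400 Parquet columns. Table and script 
names are taken from the report above; running script transformations requires 
Hive support.

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("transform-pruning")
         .enableHiveSupport()
         .getOrCreate())

pruned = spark.sql("""
    SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2)
    FROM (SELECT usid, cch FROM test_table) t
""")
pruned.explain()  # the FileScan should now list only usid and cch
{code}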






[jira] [Commented] (SPARK-21945) pyspark --py-files doesn't work in yarn client mode

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486610#comment-16486610
 ] 

Hyukjin Kwon commented on SPARK-21945:
--

[~vanzin], I tried spark-submit in client mode and cluster mode with both zip 
and .py files, and they look to be working fine. Mind if I ask how you did it? 
I tried as written in the PR description, FYI.

> pyspark --py-files doesn't work in yarn client mode
> ---
>
> Key: SPARK-21945
> URL: https://issues.apache.org/jira/browse/SPARK-21945
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> I tried running pyspark with --py-files pythonfiles.zip  but it doesn't 
> properly add the zip file to the PYTHONPATH.
> I can work around by exporting PYTHONPATH.
> Looking in SparkSubmitCommandBuilder.buildPySparkShellCommand, I don't see 
> this supported at all. If that is the case, perhaps this should be moved to 
> an improvement.
> Note it works via spark-submit in both client and cluster mode to run a 
> python script.
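
Editor's sketch of the workarounds implied in the report (not the eventual 
fix): making a dependency zip visible when the pyspark shell does not honor 
--py-files. The archive name comes from the report.

{code:python}
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py-files-workaround").getOrCreate()

# Option 1: ship the archive at runtime; this also distributes it to executors.
spark.sparkContext.addPyFile("pythonfiles.zip")

# Option 2: the PYTHONPATH workaround from the report, done from inside the
# driver process instead of `export PYTHONPATH=...` before launching.
sys.path.insert(0, "pythonfiles.zip")
{code}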






[jira] [Commented] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486594#comment-16486594
 ] 

Joel Croteau commented on SPARK-24358:
--

Done.

> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.






[jira] [Updated] (SPARK-24358) createDataFrame in Python 3 should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Croteau updated SPARK-24358:
-
 Labels: Python3  (was: )
Description: createDataFrame can infer Python 3's bytearray type as a 
Binary. Since bytes is just the immutable, hashable version of this same 
structure, it makes sense for the same thing to apply there.  (was: 
createDataFrame can infer Python's bytearray type as a Binary. Since bytes is 
just the immutable, hashable version of this same structure, it makes sense for 
the same thing to apply there.)
Summary: createDataFrame in Python 3 should be able to infer bytes type 
as Binary type  (was: createDataFrame in Python should be able to infer bytes 
type as Binary type)

> createDataFrame in Python 3 should be able to infer bytes type as Binary type
> -
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>  Labels: Python3
>
> createDataFrame can infer Python 3's bytearray type as a Binary. Since bytes 
> is just the immutable, hashable version of this same structure, it makes 
> sense for the same thing to apply there.






[jira] [Comment Edited] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486584#comment-16486584
 ] 

Hyukjin Kwon edited comment on SPARK-24358 at 5/23/18 1:50 AM:
---

Yea, let's note that it's specific to Python 3 in the title or description.


was (Author: hyukjin.kwon):
Yea, let's note that it's specific to Python 3.

> createDataFrame in Python should be able to infer bytes type as Binary type
> ---
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>
> createDataFrame can infer Python's bytearray type as a Binary. Since bytes is 
> just the immutable, hashable version of this same structure, it makes sense 
> for the same thing to apply there.






[jira] [Commented] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486584#comment-16486584
 ] 

Hyukjin Kwon commented on SPARK-24358:
--

Yea, let's note that it's specific to Python 3.

> createDataFrame in Python should be able to infer bytes type as Binary type
> ---
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>
> createDataFrame can infer Python's bytearray type as a Binary. Since bytes is 
> just the immutable, hashable version of this same structure, it makes sense 
> for the same thing to apply there.






[jira] [Comment Edited] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486581#comment-16486581
 ] 

Joel Croteau edited comment on SPARK-24358 at 5/23/18 1:47 AM:
---

No, I mean the bytes type in Python 3. This code:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=b'Test string')]


def init_session():
builder = SparkSession.builder.appName("Test bytes serialization")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()
{code}
 Fails under Python 3 with this output:
{noformat}
Traceback (most recent call last):
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1068, in _infer_type
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1094, in _infer_schema
TypeError: Can not infer schema for type: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 18, in 
    __name__ == '__main__' and main()
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 13, in main
    frame = spark.createDataFrame(TEST_DATA)
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 689, in createDataFrame
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 410, in _createFromLocal
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in _inferSchemaFromList
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in _infer_schema
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1070, in _infer_type
TypeError: not supported type: 
{noformat}
but if I change the data type to bytearray:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=bytearray(b'Test string'))]


def init_session():
builder = SparkSession.builder.appName("Use bytearray instead")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()

{code}
it runs fine:
{noformat}
root
 |-- data: binary (nullable = true)

[Row(data=bytearray(b'Test string'))]
{noformat}
bytes in Python 3 is just an immutable version of bytearray, so it should infer 
the type as binary and serialize it the same way it does with bytearray.
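
Editor's sketch of a user-side shim for the behaviour requested here (an 
assumption, not part of the ticket): convert any Python 3 bytes values to 
bytearray before inference, so the existing bytearray-to-BinaryType rule 
applies.

{code:python}
from pyspark.sql import SparkSession, Row

def as_binary_rows(rows):
    """Return copies of the rows with bytes values converted to bytearray."""
    return [Row(**{k: bytearray(v) if isinstance(v, bytes) else v
                   for k, v in r.asDict().items()}) for r in rows]

spark = SparkSession.builder.appName("bytes-shim").getOrCreate()
frame = spark.createDataFrame(as_binary_rows([Row(data=b'Test string')]))
frame.printSchema()   # data: binary (nullable = true)
{code}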


was (Author: tv4fun):
No, I mean the bytes type in Python 3. This code:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=b'Test string')]


def init_session():
builder = SparkSession.builder.appName("Test bytes serialization")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()
{code}
 Fails under Python 3 with this output:
{noformat}
Traceback (most recent call last):
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1068, in _infer_type
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1094, in _infer_schema
TypeError: Can not infer schema for type: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 18, in 
    __name__ == '__main__' and main()
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 13, in main
    frame = spark.createDataFrame(TEST_DATA)
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 689, in createDataFrame
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 410, in _createFromLocal
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in _inferSchemaFromList
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in _infer_schema
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1070, in _infer_type
TypeError: not supported type: 
{noformat}
but if 

[jira] [Commented] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486581#comment-16486581
 ] 

Joel Croteau commented on SPARK-24358:
--

No, I mean the bytes type in Python 3. This code:
{code:java}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=b'Test string')]


def init_session():
builder = SparkSession.builder.appName("Test bytes serialization")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()
{code}
 Fails under Python 3 with this output:
{noformat}
Traceback (most recent call last):
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1068, in _infer_type
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1094, in _infer_schema
TypeError: Can not infer schema for type: 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 18, in 
    __name__ == '__main__' and main()
  File "/home/jcroteau/is/pel_selection/test_row_pair.py", line 13, in main
    frame = spark.createDataFrame(TEST_DATA)
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 689, in createDataFrame
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 410, in _createFromLocal
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in _inferSchemaFromList
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/session.py", 
line 342, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in _infer_schema
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1096, in 
  File 
"/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/types.py", 
line 1070, in _infer_type
TypeError: not supported type: 
{noformat}
but if I change the data type to bytearray:
{code}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=bytearray(b'Test string'))]


def init_session():
builder = SparkSession.builder.appName("Use bytearray instead")
return builder.getOrCreate()


def main():
spark = init_session()
frame = spark.createDataFrame(TEST_DATA)
frame.printSchema()
print(frame.collect())


__name__ == '__main__' and main()

{code}
it runs fine:
{noformat}
root
 |-- data: binary (nullable = true)

[Row(data=bytearray(b'Test string'))]
{noformat}
bytes in Python 3 is just an immutable version of bytearray, so it should infer 
the type as binary and serialize it the same way it does with bytearray.

> createDataFrame in Python should be able to infer bytes type as Binary type
> ---
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>
> createDataFrame can infer Python's bytearray type as a Binary. Since bytes is 
> just the immutable, hashable version of this same structure, it makes sense 
> for the same thing to apply there.






[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-22 Thread Misha Dmitriev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486580#comment-16486580
 ] 

Misha Dmitriev commented on SPARK-24356:


I plan to work on this feature.

> Duplicate strings in File.path managed by FileSegmentManagedBuffer
> --
>
> Key: SPARK-24356
> URL: https://issues.apache.org/jira/browse/SPARK-24356
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Misha Dmitriev
>Priority: Major
>
> I recently analyzed a heap dump of Yarn Node Manager that was suffering from 
> high GC pressure due to high object churn. Analysis was done with the jxray 
> tool ([www.jxray.com|http://www.jxray.com/]) that checks a heap dump for a 
> number of well-known memory issues. One problem that it found in this dump is 
> 19.5% of memory wasted due to duplicate strings. Of these duplicates, more 
> than a half come from {{FileInputStream.path}} and {{File.path}}. All the 
> {{FileInputStream}} objects that JXRay shows are garbage - looks like they 
> are used for a very short period and then discarded (I guess there is a 
> separate question of whether that's a good pattern). But {{File}} instances 
> are traceable to 
> {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. Here 
> is the full reference chain:
>  
> {code:java}
> ↖java.io.File.path
> ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
> ↖{j.u.ArrayList}
> ↖j.u.ArrayList$Itr.this$0
> ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
> ↖{java.util.concurrent.ConcurrentHashMap}.values
> ↖org.apache.spark.network.server.OneForOneStreamManager.streams
> ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
> ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
> ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
> {code}
>  
> Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
> similar, so I think {{FileInputStream}}s are generated by the 
> {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely 
> come from 
> [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
>  
> To avoid duplicate strings in {{File.path}}'s in this case, it is suggested 
> that in the above code we create a File with a complete, normalized pathname, 
> that has been already interned. This will prevent the code inside 
> {{java.io.File}} from modifying this string, and thus it will use the 
> interned copy, and will pass it to FileInputStream. Essentially the current 
> line
> {code:java}
> return new File(new File(localDir, String.format("%02x", subDirId)), 
> filename);{code}
> should be replaced with something like
> {code:java}
> String pathname = localDir + File.separator + String.format(...) + 
> File.separator + filename;
> pathname = fileSystem.normalize(pathname).intern();
> return new File(pathname);{code}
>  






[jira] [Updated] (SPARK-24349) obtainDelegationTokens() exits JVM if Driver use JDBC instead of using metastore

2018-05-22 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-24349:
---
Description: 
In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
--proxy-user to impersonate will invoke obtainDelegationTokens(). But if the 
current settings connect to the DB directly via JDBC instead of via RPC with 
the metastore, it will fail with
{code}
WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Hive metastore uri undefined
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at 
org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
at 
org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
{code}

  was:
In [SPARK-23639|https://issues.apache.org/jira/browse/SPARK-23639], using 
--proxy-user to impersonate will invoke obtainDelegationTokens(), but if the 
current Driver uses JDBC instead of the metastore, it will fail with
{code}
WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist
Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Hive metastore uri undefined
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:288)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at 
org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
at 
org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3
{code}


> obtainDelegationTokens() exits JVM if Driver use JDBC instead of using 
> metastore 
> 

[jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

2018-05-22 Thread Joel Croteau (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486571#comment-16486571
 ] 

Joel Croteau commented on SPARK-24357:
--

Fair enough, here is some code to reproduce it:
{code:python}
from pyspark.sql import SparkSession, Row

TEST_DATA = [Row(data=1 << 65)]


def init_session():
    builder = SparkSession.builder.appName("Demonstrate integer overflow")
    return builder.getOrCreate()


def main():
    spark = init_session()
    frame = spark.createDataFrame(TEST_DATA)
    frame.printSchema()
    print(frame.collect())


__name__ == '__main__' and main()

{code}
This should either infer a type that can hold the 1 << 65 value from TEST_DATA, 
or produce a runtime error about inferring the schema or serializing the data. 
This is the actual output:
{noformat}
root
 |-- data: long (nullable = true)

[Row(data=None)]
{noformat}
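A minimal workaround sketch, assuming an explicit schema is acceptable in the caller: declaring the column as DecimalType(38, 0) (wide enough for 1 << 65, which has 20 digits) keeps the value instead of silently turning it into None.
{code:python}
from decimal import Decimal

from pyspark.sql import Row, SparkSession
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.appName("Large integer workaround sketch").getOrCreate()

# Bypass inference with an explicit DecimalType(38, 0) column; the value is
# passed as a Decimal so it matches the declared type.
schema = StructType([StructField("data", DecimalType(38, 0), True)])
frame = spark.createDataFrame([Row(data=Decimal(1 << 65))], schema=schema)

frame.printSchema()
print(frame.collect())  # [Row(data=Decimal('36893488147419103232'))]
{code}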

> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> -
>
> Key: SPARK-24357
> URL: https://issues.apache.org/jira/browse/SPARK-24357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Major
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark 
> SQL will infer any integral type as a LongType, which is a 64-bit integer, 
> without actually checking whether the values will fit into a 64-bit slot. If 
> the values are larger than 64 bits, then when pickled and unpickled in Java, 
> Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is 
> called, it will ignore the BigInteger type and return Null. This results in 
> any large integers in the resulting DataFrame being silently converted to 
> None. This can create some very surprising and difficult to debug behavior, 
> in particular if you are not aware of this limitation. There should either be 
> a runtime error at some point in this conversion chain, or else _infer_type 
> should infer larger integers as DecimalType with appropriate precision, or as 
> BinaryType. The former would be less convenient, but the latter may be 
> problematic to implement in practice. In any case, we should stop silently 
> converting large integers to None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486569#comment-16486569
 ] 

Hyukjin Kwon commented on SPARK-24358:
--

? do you mean bytes in Python 2? that's an alias for str, isn't it?

> createDataFrame in Python should be able to infer bytes type as Binary type
> ---
>
> Key: SPARK-24358
> URL: https://issues.apache.org/jira/browse/SPARK-24358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Minor
>
> createDataFrame can infer Python's bytearray type as BinaryType. Since bytes is 
> just the immutable, hashable version of the same structure, it makes sense 
> for the same inference to apply there.
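A minimal sketch of the asymmetry described above, assuming a local session on one of the affected versions; only the bytearray case is shown, since that is the inference that already works:
{code:python}
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("bytearray inference sketch").getOrCreate()

# bytearray is already inferred as a binary column by createDataFrame.
df = spark.createDataFrame([Row(payload=bytearray(b"\x00\x01\x02"))])
df.printSchema()  # payload: binary (nullable = true)

# Passing the equivalent immutable value, bytes b"\x00\x01\x02", is what this
# issue asks to support; on the affected versions it is not inferred the same way.
{code}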



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486565#comment-16486565
 ] 

Hyukjin Kwon commented on SPARK-24324:
--

Ah, I meant that a shorter reproducer should make it easier for others to take a look.

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform  the 
>  views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}
> However, the function mixes up the columns and I am not sure what is going 
> on. Below is a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 20071211-08 1
> en Albert_Caquot 20071211-15 3
> en Albert_Caquot 20071211-21 1"""
> import tempfile
> fp = tempfile.NamedTemporaryFile()
> fp.write(input_data)
> fp.seek(0)
> import findspark
> findspark.init()
> import pyspark
> from pyspark.sql.types import StructType, StructField
> from pyspark.sql.types import StringType, IntegerType, TimestampType
> from pyspark.sql import functions
> sc = pyspark.SparkContext(appName="udf_example")
> sqlctx = pyspark.SQLContext(sc)
> schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("timestamp", TimestampType(), False),
>  StructField("views", IntegerType(), False)])
> df = sqlctx.read.csv(fp.name,
>  header=False,
>  schema=schema,
>  timestampFormat="MMdd-HHmmss",
>  sep=' ')
> df.count()
> df.dtypes
> df.show()
> new_schema = StructType([StructField("lang", StringType(), False),
>  StructField("page", StringType(), False),
>  StructField("day", StringType(), False),
>  StructField("enc", StringType(), False)])
> from 

[jira] [Commented] (SPARK-24339) spark sql can not prune column in transform/map/reduce query

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486561#comment-16486561
 ] 

Hyukjin Kwon commented on SPARK-24339:
--

(Don't set the target versions, which are usually reserved for committers, or the fix 
versions, which are usually set after the issue is actually fixed.)

> spark sql can not prune column in transform/map/reduce query
> 
>
> Key: SPARK-24339
> URL: https://issues.apache.org/jira/browse/SPARK-24339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: xdcjie
>Priority: Minor
>  Labels: map, reduce, sql, transform
>
> I was using {{TRANSFORM USING}} with branch-2.1/2.2, and noticed that it 
> scans all columns of the data. A query like:
> {code:java}
> SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM 
> test_table;{code}
> its physical plan looks like:
> {code:java}
> == Physical Plan ==
> ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, 
> u2#786, c2#787], 
> HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,
> )),List((field.delim,   
> )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
> +- FileScan parquet 
> [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,...
>  367 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct {code}
> In our scenario, the Parquet file has 400 columns, so this query takes more time.
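Until column pruning is pushed through script transforms, a possible workaround sketch is to project only the needed columns in a subquery before the TRANSFORM; test.py and test_table below are the hypothetical names from the report:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform pruning workaround sketch").getOrCreate()

# The inner SELECT narrows the relation to the two columns the script needs,
# which should let the file scan avoid reading all 400 columns.
pruned = spark.sql("""
    SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2)
    FROM (SELECT usid, cch FROM test_table) t
""")
pruned.explain()
{code}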



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24339) spark sql can not prune column in transform/map/reduce query

2018-05-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24339:
-
Target Version/s:   (was: 2.1.1, 2.1.2, 2.2.0, 2.2.1)

> spark sql can not prune column in transform/map/reduce query
> 
>
> Key: SPARK-24339
> URL: https://issues.apache.org/jira/browse/SPARK-24339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: xdcjie
>Priority: Minor
>  Labels: map, reduce, sql, transform
>
> I was using {{TRANSFORM USING}} with branch-2.1/2.2, and noticed that it 
> scans all columns of the data. A query like:
> {code:java}
> SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM 
> test_table;{code}
> its physical plan looks like:
> {code:java}
> == Physical Plan ==
> ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, 
> u2#786, c2#787], 
> HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,
> )),List((field.delim,   
> )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
> +- FileScan parquet 
> [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,...
>  367 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct {code}
> In our scenario, the Parquet file has 400 columns, so this query takes more time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24339) spark sql can not prune column in transform/map/reduce query

2018-05-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-24339:
-
Fix Version/s: (was: 2.2.1)
   (was: 2.1.2)
   (was: 2.1.1)
   (was: 2.2.0)

> spark sql can not prune column in transform/map/reduce query
> 
>
> Key: SPARK-24339
> URL: https://issues.apache.org/jira/browse/SPARK-24339
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: xdcjie
>Priority: Minor
>  Labels: map, reduce, sql, transform
>
> I was using {{TRANSFORM USING}} with branch-2.1/2.2, and noticed that it 
> scans all columns of the data. A query like:
> {code:java}
> SELECT TRANSFORM(usid, cch) USING 'python test.py' AS (u1, c1, u2, c2) FROM 
> test_table;{code}
> its physical plan looks like:
> {code:java}
> == Physical Plan ==
> ScriptTransformation [usid#17, cch#9], python test.py, [u1#784, c1#785, 
> u2#786, c2#787], 
> HiveScriptIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim,
> )),List((field.delim,   
> )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
> +- FileScan parquet 
> [sh#0L,clk#1L,chg#2L,qey#3,ship#4,chgname#5,sid#6,bid#7,dis#8L,cch#9,wch#10,wid#11L,arank#12L,rtag#13,iid#14,uid#15,pid#16,usid#17,wdid#18,bid#19,oqleft#20,oqright#21,poqvalue#22,tm#23,...
>  367 more fields] Batched: false, Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/Downloads/part-r-00093-0ef5d59f-2e08-4085-9b46-458a1652932a.g...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct {code}
> In our scenario, the Parquet file has 400 columns, so this query takes more time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24354) Adding support for quoteMode in Spark's build in CSV DataFrameWriter

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486557#comment-16486557
 ] 

Hyukjin Kwon commented on SPARK-24354:
--

Nope, the underlying CSV library was changed and it doesn't have such a quoteMode option. See 
https://github.com/databricks/spark-csv/issues/430#issuecomment-302631438.
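For what it covers, the built-in writer does expose the {{quoteAll}} and {{escapeQuotes}} options, which handle part of what quoteMode used to do; a small sketch (the output path is arbitrary):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv quote options sketch").getOrCreate()
df = spark.createDataFrame([("a", 'say "hi"'), ("b", "plain")], ["k", "v"])

# quoteAll=True quotes every field (similar to the old QUOTE_ALL mode);
# escapeQuotes controls whether embedded quotes are escaped.
(df.write
   .mode("overwrite")
   .option("quoteAll", True)
   .option("escapeQuotes", True)
   .csv("/tmp/csv-quoteall-sketch"))
{code}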

> Adding support for quoteMode in Spark's build in CSV DataFrameWriter
> 
>
> Key: SPARK-24354
> URL: https://issues.apache.org/jira/browse/SPARK-24354
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0
>Reporter: Umesh K
>Priority: Major
>
> Hi All, I feel the quoteMode option which we used to have in 
> [spark-csv|https://github.com/databricks/spark-csv] would be very useful if 
> we implemented it in Spark's built-in CSV DataFrameWriter. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

2018-05-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486554#comment-16486554
 ] 

Hyukjin Kwon commented on SPARK-24357:
--

It would have been much more readable if there were a reproducer, though.

> createDataFrame in Python infers large integers as long type and then fails 
> silently when converting them
> -
>
> Key: SPARK-24357
> URL: https://issues.apache.org/jira/browse/SPARK-24357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Joel Croteau
>Priority: Major
>
> When inferring the schema type of an RDD passed to createDataFrame, PySpark 
> SQL will infer any integral type as a LongType, which is a 64-bit integer, 
> without actually checking whether the values will fit into a 64-bit slot. If 
> the values are larger than 64 bits, then when pickled and unpickled in Java, 
> Unpickler will convert them to BigIntegers. When applySchemaToPythonRDD is 
> called, it will ignore the BigInteger type and return Null. This results in 
> any large integers in the resulting DataFrame being silently converted to 
> None. This can create some very surprising and difficult to debug behavior, 
> in particular if you are not aware of this limitation. There should either be 
> a runtime error at some point in this conversion chain, or else _infer_type 
> should infer larger integers as DecimalType with appropriate precision, or as 
> BinaryType. The former would be less convenient, but the latter may be 
> problematic to implement in practice. In any case, we should stop silently 
> converting large integers to None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24342) Large Task prior scheduling to Reduce overall execution time

2018-05-22 Thread gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gao updated SPARK-24342:

Priority: Major  (was: Minor)

> Large Task prior scheduling to Reduce overall execution time
> 
>
> Key: SPARK-24342
> URL: https://issues.apache.org/jira/browse/SPARK-24342
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.3.0
>Reporter: gao
>Priority: Major
> Attachments: tasktimespan.PNG
>
>
> When executing a set of concurrent tasks, the overall execution time can be 
> shortened if the relatively large (long-running) tasks are executed before the 
> small ones.
> Therefore, Spark should add a feature that schedules the large tasks of a task 
> set first and the small tasks afterwards.
>    A task's time span can be estimated from its historical execution time: a 
> task that took long in the past is likely to take long again in the future. 
> Record the last execution time together with the task's key in a log file, 
> load that file on the next run, and use RangePartitioner-style partitioning to 
> prioritize the large tasks and optimize concurrent execution.
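A toy sketch of the effect being described, with hypothetical historical durations and two executor slots: greedily assigning the longest tasks first gives a shorter makespan than submission order when the long task would otherwise run last.
{code:python}
import heapq

# Hypothetical historical durations (seconds); the long task arrives last.
durations = [5, 5, 5, 5, 5, 5, 30]
num_slots = 2  # concurrent executor slots


def makespan(task_order, slots):
    """Greedily assign each task to the currently least-loaded slot."""
    loads = [0] * slots
    heapq.heapify(loads)
    for d in task_order:
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)


fifo = makespan(durations, num_slots)                       # submission order -> 45
lpt = makespan(sorted(durations, reverse=True), num_slots)  # largest first -> 30
print("FIFO makespan: %ds, largest-first makespan: %ds" % (fifo, lpt))
{code}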



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize the work needed to expose future MLlib APIs, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 * create a pipeline by chaining individual components and specifying their 
parameters
 * tune a pipeline in parallel, taking advantage of Spark
 * inspect a pipeline’s parameters and evaluation metrics
 * repeatedly apply a pipeline
 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, like above example, syntax can nicely translate 
to a natural pipeline style with help from very popular[ magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark.tokenizer() %>%

            set_input_col(“text”) %>%

            set_output_col(“words”)

> tokenized.df <- 

[jira] [Updated] (SPARK-24322) Upgrade Apache ORC to 1.4.4

2018-05-22 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24322:
--
Description: 
ORC 1.4.4 (released on May 14th) includes nine fixes. This issue aims to update 
Spark to use it.

https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4

For example, ORC-306 fixes the timestamp issue.
{code}
scala> spark.version
res0: String = 2.3.0
scala> spark.sql("set spark.sql.orc.impl=native")
scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
12:34:56.000789")).toDF().write.orc("/tmp/orc")
scala> spark.read.orc("/tmp/orc").show(false)
+--------------------------+
|                     value|
+--------------------------+
|1900-05-05 12:34:55.000789|
+--------------------------+
{code}

  was:
ORC 1.4.4 (released on May 14th) includes nine fixes like ORC-301. This issue 
aims to update Spark to use it.

https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4


> Upgrade Apache ORC to 1.4.4
> ---
>
> Key: SPARK-24322
> URL: https://issues.apache.org/jira/browse/SPARK-24322
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> ORC 1.4.4 (released on May 14th) includes nine fixes. This issue aims to 
> update Spark to use it.
> https://issues.apache.org/jira/issues/?filter=12342568=project%20%3D%20ORC%20AND%20resolution%20%3D%20Fixed%20AND%20fixVersion%20%3D%201.4.4
> For example, ORC-306 fixes the timestamp issue.
> {code}
> scala> spark.version
> res0: String = 2.3.0
> scala> spark.sql("set spark.sql.orc.impl=native")
> scala> Seq(java.sql.Timestamp.valueOf("1900-05-05 
> 12:34:56.000789")).toDF().write.orc("/tmp/orc")
> scala> spark.read.orc("/tmp/orc").show(false)
> +--------------------------+
> |                     value|
> +--------------------------+
> |1900-05-05 12:34:55.000789|
> +--------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24360) Support Hive 3.0 metastore

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24360:


Assignee: Apache Spark

> Support Hive 3.0 metastore
> --
>
> Key: SPARK-24360
> URL: https://issues.apache.org/jira/browse/SPARK-24360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24360) Support Hive 3.0 metastore

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16485646#comment-16485646
 ] 

Apache Spark commented on SPARK-24360:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/21404

> Support Hive 3.0 metastore
> --
>
> Key: SPARK-24360
> URL: https://issues.apache.org/jira/browse/SPARK-24360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24360) Support Hive 3.0 metastore

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24360:


Assignee: (was: Apache Spark)

> Support Hive 3.0 metastore
> --
>
> Key: SPARK-24360
> URL: https://issues.apache.org/jira/browse/SPARK-24360
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24360) Support Hive 3.0 metastore

2018-05-22 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-24360:
-

 Summary: Support Hive 3.0 metastore
 Key: SPARK-24360
 URL: https://issues.apache.org/jira/browse/SPARK-24360
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dongjoon Hyun


Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0.
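For reference, once a metastore version is supported, pointing Spark at it is typically just configuration; a sketch assuming this change lands and "3.0" becomes an accepted value (the config keys themselves already exist):
{code:python}
from pyspark.sql import SparkSession

# "3.0" as a metastore version is exactly what this issue proposes to accept;
# "maven" tells Spark to download the matching Hive client jars.
spark = (SparkSession.builder
         .appName("hive-3-metastore-sketch")
         .config("spark.sql.hive.metastore.version", "3.0")
         .config("spark.sql.hive.metastore.jars", "maven")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
{code}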



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize the work needed to expose future MLlib APIs, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 * create a pipeline by chaining individual components and specifying their 
parameters
 * tune a pipeline in parallel, taking advantage of Spark
 * inspect a pipeline’s parameters and evaluation metrics
 * repeatedly apply a pipeline

 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, like above example, syntax can nicely translate 
to a natural pipeline style with help from very popular[ magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark.tokenizer() %>%

            set_input_col(“text”) %>%

            set_output_col(“words”)

> tokenized.df <- 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize the work needed to expose future MLlib APIs, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 * create a pipeline by chaining individual components and specifying their 
parameters
 * tune a pipeline in parallel, taking advantage of Spark
 * inspect a pipeline’s parameters and evaluation metrics
 * repeatedly apply a pipeline

 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, like above example, syntax can nicely translate 
to a natural pipeline style with help from very popular[ magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark.tokenizer() %>%

            set_input_col(“text”) %>%

            set_output_col(“words”)

> tokenized.df <- 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize the work needed to expose future MLlib APIs, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 * create a pipeline by chaining individual components and specifying their 
parameters
 * tune a pipeline in parallel, taking advantage of Spark
 * inspect a pipeline’s parameters and evaluation metrics
 * repeatedly apply a pipeline

 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
When calls need to be chained, like above example, syntax can nicely translate 
to a natural pipeline style with help from very popular[ magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc are S4 objects of a PipelineStage and pipeline is an 
object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

[jira] [Commented] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484679#comment-16484679
 ] 

Apache Spark commented on SPARK-24313:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21403

> Collection functions interpreted execution doesn't work with complex types
> --
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Critical
>  Labels: correctness
> Fix For: 2.4.0
>
>
> Several functions working on collections return incorrect results for complex 
> data types in interpreted mode. In particular, we consider the complex data types 
> BINARY and ARRAY. The list of the affected functions is: {{array_contains}}, 
> {{array_position}}, {{element_at}} and {{GetMapValue}}.
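For reference, a small usage sketch of the affected functions on complex element types, assuming Spark 2.4+ where array_position and element_at are exposed in pyspark.sql.functions (whether the interpreted or codegen path is taken depends on the runtime, so this only illustrates the calls, not the wrong results):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("collection functions on complex types").getOrCreate()

# One ARRAY<BINARY> column and one ARRAY<ARRAY<BIGINT>> column, i.e. the
# complex element types called out in the description.
df = spark.createDataFrame(
    [([bytearray(b"ab"), bytearray(b"cd")], [[1, 2], [3, 4]])],
    ["bins", "nested"],
)

df.select(
    F.array_contains("bins", bytearray(b"cd")).alias("contains_bin"),
    F.array_position("bins", bytearray(b"cd")).alias("position_bin"),
    F.element_at("nested", 2).alias("second_nested"),
).show()
{code}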



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-24359:
--

 Summary: SPIP: ML Pipelines in R
 Key: SPARK-24359
 URL: https://issues.apache.org/jira/browse/SPARK-24359
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 3.0.0
Reporter: Hossein Falaki
 Attachments: SparkML_ ML Pipelines in R.pdf

h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

 

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose 
SparkML’s pipeline APIs and functionality.

 

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

 

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize the work needed to expose future MLlib APIs, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 * create a pipeline by chaining individual components and specifying their 
parameters
 * tune a pipeline in parallel, taking advantage of Spark
 * inspect a pipeline’s parameters and evaluation metrics
 * repeatedly apply a pipeline

 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All 
constructors are dot separated (e.g., spark.logistic.regression()) and all 
setters and getters are snake case (e.g., set_max_iter()). If a constructor 
gets arguments, they will be named arguments. For example:

 

> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1)

 

When calls need to be chained, like above example, syntax can nicely translate 
to a natural pipeline style with help from very popular[ magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:

 

> logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with spark.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  

 

> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3)

 

Where stage1, stage2, etc 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-22 Thread Hossein Falaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hossein Falaki updated SPARK-24359:
---
Attachment: SparkML_ ML Pipelines in R.pdf

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
>  
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of
> Apache Spark. This new package will be built on top of SparkR’s APIs to 
> expose SparkML’s pipeline APIs and functionality.
>  
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
>  
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize the work needed to expose future MLlib APIs, but sparklyr’s API 
> is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  * create a pipeline by chaining individual components and specifying their 
> parameters
>  * tune a pipeline in parallel, taking advantage of Spark
>  * inspect a pipeline’s parameters and evaluation metrics
>  * repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> constructors are dot separated (e.g., spark.logistic.regression()) and all 
> setters and getters are snake case (e.g., set_max_iter()). If a constructor 
> gets arguments, they will be named arguments. For example:
>  
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1)
>  
> When calls need to be chained, like above example, syntax can nicely 
> translate to a natural pipeline style with help from very popular[ magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
>  
> > logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 

[jira] [Commented] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484654#comment-16484654
 ] 

Apache Spark commented on SPARK-24355:
--

User 'Victsm' has created a pull request for this issue:
https://github.com/apache/spark/pull/21402

> Improve Spark shuffle server responsiveness to non-ChunkFetch requests
> --
>
> Key: SPARK-24355
> URL: https://issues.apache.org/jira/browse/SPARK-24355
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
> Environment: Hadoop-2.7.4
> Spark-2.3.0
>Reporter: Min Shen
>Priority: Major
>
> We run Spark on YARN, and deploy the Spark external shuffle service as part of 
> the YARN NM aux service.
> One issue we saw with the Spark external shuffle service is the various 
> timeouts experienced by clients when either registering an executor with the 
> local shuffle server or establishing a connection to a remote shuffle server.
> Example of a timeout for establishing a connection with a remote shuffle server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
> {code}
> Example of a timeout for registering executor with local shuffle server:
> {code:java}
> ava.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> {code}
> While patches such as SPARK-20640 and config parameters such as 
> spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
> spark.authenticate is set to true) can help to alleviate this type of 
> problem, they do not solve the fundamental issue.
> We have observed that, when the shuffle workload gets very busy in peak 
> hours, client requests can time out even after configuring these parameters 
> to very high values. Further investigation revealed the following issue:
> Right now, the default number of server-side Netty handler threads is 2 * # of 
> cores, and it can be further configured with the parameter 
> spark.shuffle.io.serverThreads.
> Processing a client request requires one available server Netty handler thread.
> However, when the server Netty handler threads start to process 
> ChunkFetchRequests, they become blocked on disk I/O, mostly due to disk 
> contention from the random read operations initiated by all the 
> ChunkFetchRequests received from clients.
> As a result, when 

[jira] [Assigned] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24355:


Assignee: Apache Spark

> Improve Spark shuffle server responsiveness to non-ChunkFetch requests
> --
>
> Key: SPARK-24355
> URL: https://issues.apache.org/jira/browse/SPARK-24355
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
> Environment: Hadoop-2.7.4
> Spark-2.3.0
>Reporter: Min Shen
>Assignee: Apache Spark
>Priority: Major
>
> We run Spark on YARN, and deploy the Spark external shuffle service as part of 
> the YARN NM aux service.
> One issue we saw with the Spark external shuffle service is the various 
> timeouts experienced by clients when either registering an executor with the 
> local shuffle server or establishing a connection to a remote shuffle server.
> Example of a timeout for establishing a connection with a remote shuffle server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
> {code}
> Example of a timeout for registering executor with local shuffle server:
> {code:java}
> ava.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> {code}
> While patches such as SPARK-20640 and config parameters such as 
> spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
> spark.authenticate is set to true) can help to alleviate this type of 
> problem, they do not solve the fundamental issue.
> We have observed that, when the shuffle workload gets very busy in peak 
> hours, client requests can time out even after configuring these parameters 
> to very high values. Further investigation revealed the following issue:
> Right now, the default number of server-side Netty handler threads is 2 * # of 
> cores, and it can be further configured with the parameter 
> spark.shuffle.io.serverThreads.
> Processing a client request requires one available server Netty handler thread.
> However, when the server Netty handler threads start to process 
> ChunkFetchRequests, they become blocked on disk I/O, mostly due to disk 
> contention from the random read operations initiated by all the 
> ChunkFetchRequests received from clients.
> As a result, when the shuffle server is serving many concurrent 
> ChunkFetchRequests, the 

[jira] [Assigned] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24355:


Assignee: (was: Apache Spark)

> Improve Spark shuffle server responsiveness to non-ChunkFetch requests
> --
>
> Key: SPARK-24355
> URL: https://issues.apache.org/jira/browse/SPARK-24355
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
> Environment: Hadoop-2.7.4
> Spark-2.3.0
>Reporter: Min Shen
>Priority: Major
>
> We run Spark on YARN, and deploy the Spark external shuffle service as part of 
> the YARN NM aux service.
> One issue we saw with the Spark external shuffle service is the various 
> timeouts experienced by clients when either registering an executor with the 
> local shuffle server or establishing a connection to a remote shuffle server.
> Example of a timeout for establishing a connection with a remote shuffle server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
> {code}
> Example of a timeout for registering executor with local shuffle server:
> {code:java}
> ava.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> {code}
> While patches such as SPARK-20640 and config parameters such as 
> spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
> spark.authenticate is set to true) can help to alleviate this type of 
> problem, they do not solve the fundamental issue.
> We have observed that, when the shuffle workload gets very busy in peak 
> hours, client requests can time out even after configuring these parameters 
> to very high values. Further investigation revealed the following issue:
> Right now, the default number of server-side Netty handler threads is 2 * # of 
> cores, and it can be further configured with the parameter 
> spark.shuffle.io.serverThreads.
> Processing a client request requires one available server Netty handler thread.
> However, when the server Netty handler threads start to process 
> ChunkFetchRequests, they become blocked on disk I/O, mostly due to disk 
> contention from the random read operations initiated by all the 
> ChunkFetchRequests received from clients.
> As a result, when the shuffle server is serving many concurrent 
> ChunkFetchRequests, the server-side Netty 

[jira] [Commented] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-05-22 Thread Min Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484650#comment-16484650
 ] 

Min Shen commented on SPARK-24355:
--

[~felixcheung]  [~jinxing6...@126.com] [~cloud_fan]

Could you please take a look at this PR?

https://github.com/apache/spark/pull/21402

> Improve Spark shuffle server responsiveness to non-ChunkFetch requests
> --
>
> Key: SPARK-24355
> URL: https://issues.apache.org/jira/browse/SPARK-24355
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
> Environment: Hadoop-2.7.4
> Spark-2.3.0
>Reporter: Min Shen
>Priority: Major
>
> We run Spark on YARN, and deploy the Spark external shuffle service as part of 
> the YARN NM aux service.
> One issue we saw with the Spark external shuffle service is the various 
> timeouts experienced by clients when either registering an executor with the 
> local shuffle server or establishing a connection to a remote shuffle server.
> Example of a timeout for establishing a connection with a remote shuffle server:
> {code:java}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
> {code}
> Example of a timeout for registering executor with local shuffle server:
> {code:java}
> ava.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
> {code}
> While patches such as SPARK-20640 and config parameters such as 
> spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
> spark.authenticate is set to true) can help to alleviate this type of 
> problem, they do not solve the fundamental issue.
> We have observed that, when the shuffle workload gets very busy in peak 
> hours, client requests can time out even after configuring these parameters 
> to very high values. Further investigation revealed the following issue:
> Right now, the default number of server-side Netty handler threads is 2 * # of 
> cores, and it can be further configured with the parameter 
> spark.shuffle.io.serverThreads.
> Processing a client request requires one available server Netty handler thread.
> However, when the server Netty handler threads start to process 
> ChunkFetchRequests, they become blocked on disk I/O, mostly due to disk 
> contention from the random read operations initiated by all the 
> ChunkFetchRequests received from 

[jira] [Assigned] (SPARK-24335) Dataset.map schema not applied in some cases

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24335:


Assignee: Apache Spark

> Dataset.map schema not applied in some cases
> 
>
> Key: SPARK-24335
> URL: https://issues.apache.org/jira/browse/SPARK-24335
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Robert Reid
>Assignee: Apache Spark
>Priority: Major
>
> In the following code an {color:#808080}UnsupportedOperationException{color} 
> is thrown in the filter() call just after the Dataset.map() call unless 
> withWatermark() is added between them. The error reports 
> `{color:#808080}fieldIndex on a Row without schema is undefined{color}`.  I 
> expect the map() method to have applied the schema and for it to be 
> accessible in filter().  Without the extra withWatermark() call my debugger 
> reports that the `row` objects in the filter lambda are `GenericRow`.  With 
> the watermark call it reports that they are `GenericRowWithSchema`.
> I should add that I'm new to working with Structured Streaming.  So if I'm 
> overlooking some implied dependency please fill me in.
> I'm encountering this in new code for a new production job. The presented 
> code is distilled down to demonstrate the problem.  While the problem can be 
> worked around simply by adding withWatermark() I'm concerned that this will 
> leave the code in a fragile state.  With this simplified code if this error 
> occurs again it will be easy to identify what change led to the error.  But 
> in the code I'm writing, with this functionality delegated to other classes, 
> it is (and has been) very challenging to identify the cause.
>  
> {code:java}
> public static void main(String[] args) {
> SparkSession sparkSession = 
> SparkSession.builder().master("local").getOrCreate();
> sparkSession.conf().set(
> "spark.sql.streaming.checkpointLocation",
> "hdfs://localhost:9000/search_relevance/checkpoint" // for spark 
> 2.3
> // "spark.sql.streaming.checkpointLocation", "tmp/checkpoint" // 
> for spark 2.1
> );
> StructType inSchema = DataTypes.createStructType(
> new StructField[] {
> DataTypes.createStructField("id", DataTypes.StringType
>   , false),
> DataTypes.createStructField("ts", DataTypes.TimestampType 
>   , false),
> DataTypes.createStructField("f1", DataTypes.LongType  
>   , true)
> }
> );
> Dataset<Row> rawSet = sparkSession.sqlContext().readStream()
> .format("rate")
> .option("rowsPerSecond", 1)
> .load()
> .map(   (MapFunction<Row, Row>) raw -> {
> Object[] fields = new Object[3];
> fields[0] = "id1";
> fields[1] = raw.getAs("timestamp");
> fields[2] = raw.getAs("value");
> return RowFactory.create(fields);
> },
> RowEncoder.apply(inSchema)
> )
> // If withWatermark() is included above the filter() line then 
> this works.  Without it we get:
> //Caused by: java.lang.UnsupportedOperationException: 
> fieldIndex on a Row without schema is undefined.
> // at the row.getAs() call.
> // .withWatermark("ts", "10 seconds")  // <-- This is required 
> for row.getAs("f1") to work ???
> .filter((FilterFunction<Row>) row -> !row.getAs("f1").equals(0L))
> .withWatermark("ts", "10 seconds")
> ;
> StreamingQuery streamingQuery = rawSet
> .select("*")
> .writeStream()
> .format("console")
> .outputMode("append")
> .start();
> try {
> streamingQuery.awaitTermination(30_000);
> } catch (StreamingQueryException e) {
> System.out.println("Caught exception at 'awaitTermination':");
> e.printStackTrace();
> }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24335) Dataset.map schema not applied in some cases

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484649#comment-16484649
 ] 

Apache Spark commented on SPARK-24335:
--

User 'Victsm' has created a pull request for this issue:
https://github.com/apache/spark/pull/21402

> Dataset.map schema not applied in some cases
> 
>
> Key: SPARK-24335
> URL: https://issues.apache.org/jira/browse/SPARK-24335
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> In the following code an {color:#808080}UnsupportedOperationException{color} 
> is thrown in the filter() call just after the Dataset.map() call unless 
> withWatermark() is added between them. The error reports 
> `{color:#808080}fieldIndex on a Row without schema is undefined{color}`.  I 
> expect the map() method to have applied the schema and for it to be 
> accessible in filter().  Without the extra withWatermark() call my debugger 
> reports that the `row` objects in the filter lambda are `GenericRow`.  With 
> the watermark call it reports that they are `GenericRowWithSchema`.
> I should add that I'm new to working with Structured Streaming.  So if I'm 
> overlooking some implied dependency please fill me in.
> I'm encountering this in new code for a new production job. The presented 
> code is distilled down to demonstrate the problem.  While the problem can be 
> worked around simply by adding withWatermark() I'm concerned that this will 
> leave the code in a fragile state.  With this simplified code if this error 
> occurs again it will be easy to identify what change led to the error.  But 
> in the code I'm writing, with this functionality delegated to other classes, 
> it is (and has been) very challenging to identify the cause.
>  
> {code:java}
> public static void main(String[] args) {
> SparkSession sparkSession = 
> SparkSession.builder().master("local").getOrCreate();
> sparkSession.conf().set(
> "spark.sql.streaming.checkpointLocation",
> "hdfs://localhost:9000/search_relevance/checkpoint" // for spark 
> 2.3
> // "spark.sql.streaming.checkpointLocation", "tmp/checkpoint" // 
> for spark 2.1
> );
> StructType inSchema = DataTypes.createStructType(
> new StructField[] {
> DataTypes.createStructField("id", DataTypes.StringType
>   , false),
> DataTypes.createStructField("ts", DataTypes.TimestampType 
>   , false),
> DataTypes.createStructField("f1", DataTypes.LongType  
>   , true)
> }
> );
> Dataset<Row> rawSet = sparkSession.sqlContext().readStream()
> .format("rate")
> .option("rowsPerSecond", 1)
> .load()
> .map(   (MapFunction<Row, Row>) raw -> {
> Object[] fields = new Object[3];
> fields[0] = "id1";
> fields[1] = raw.getAs("timestamp");
> fields[2] = raw.getAs("value");
> return RowFactory.create(fields);
> },
> RowEncoder.apply(inSchema)
> )
> // If withWatermark() is included above the filter() line then 
> this works.  Without it we get:
> //Caused by: java.lang.UnsupportedOperationException: 
> fieldIndex on a Row without schema is undefined.
> // at the row.getAs() call.
> // .withWatermark("ts", "10 seconds")  // <-- This is required 
> for row.getAs("f1") to work ???
> .filter((FilterFunction<Row>) row -> !row.getAs("f1").equals(0L))
> .withWatermark("ts", "10 seconds")
> ;
> StreamingQuery streamingQuery = rawSet
> .select("*")
> .writeStream()
> .format("console")
> .outputMode("append")
> .start();
> try {
> streamingQuery.awaitTermination(30_000);
> } catch (StreamingQueryException e) {
> System.out.println("Caught exception at 'awaitTermination':");
> e.printStackTrace();
> }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24335) Dataset.map schema not applied in some cases

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24335:


Assignee: (was: Apache Spark)

> Dataset.map schema not applied in some cases
> 
>
> Key: SPARK-24335
> URL: https://issues.apache.org/jira/browse/SPARK-24335
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Robert Reid
>Priority: Major
>
> In the following code an {color:#808080}UnsupportedOperationException{color} 
> is thrown in the filter() call just after the Dataset.map() call unless 
> withWatermark() is added between them. The error reports 
> `{color:#808080}fieldIndex on a Row without schema is undefined{color}`.  I 
> expect the map() method to have applied the schema and for it to be 
> accessible in filter().  Without the extra withWatermark() call my debugger 
> reports that the `row` objects in the filter lambda are `GenericRow`.  With 
> the watermark call it reports that they are `GenericRowWithSchema`.
> I should add that I'm new to working with Structured Streaming.  So if I'm 
> overlooking some implied dependency please fill me in.
> I'm encountering this in new code for a new production job. The presented 
> code is distilled down to demonstrate the problem.  While the problem can be 
> worked around simply by adding withWatermark() I'm concerned that this will 
> leave the code in a fragile state.  With this simplified code if this error 
> occurs again it will be easy to identify what change led to the error.  But 
> in the code I'm writing, with this functionality delegated to other classes, 
> it is (and has been) very challenging to identify the cause.
>  
> {code:java}
> public static void main(String[] args) {
> SparkSession sparkSession = 
> SparkSession.builder().master("local").getOrCreate();
> sparkSession.conf().set(
> "spark.sql.streaming.checkpointLocation",
> "hdfs://localhost:9000/search_relevance/checkpoint" // for spark 
> 2.3
> // "spark.sql.streaming.checkpointLocation", "tmp/checkpoint" // 
> for spark 2.1
> );
> StructType inSchema = DataTypes.createStructType(
> new StructField[] {
> DataTypes.createStructField("id", DataTypes.StringType
>   , false),
> DataTypes.createStructField("ts", DataTypes.TimestampType 
>   , false),
> DataTypes.createStructField("f1", DataTypes.LongType  
>   , true)
> }
> );
> Dataset<Row> rawSet = sparkSession.sqlContext().readStream()
> .format("rate")
> .option("rowsPerSecond", 1)
> .load()
> .map(   (MapFunction<Row, Row>) raw -> {
> Object[] fields = new Object[3];
> fields[0] = "id1";
> fields[1] = raw.getAs("timestamp");
> fields[2] = raw.getAs("value");
> return RowFactory.create(fields);
> },
> RowEncoder.apply(inSchema)
> )
> // If withWatermark() is included above the filter() line then 
> this works.  Without it we get:
> //Caused by: java.lang.UnsupportedOperationException: 
> fieldIndex on a Row without schema is undefined.
> // at the row.getAs() call.
> // .withWatermark("ts", "10 seconds")  // <-- This is required 
> for row.getAs("f1") to work ???
> .filter((FilterFunction<Row>) row -> !row.getAs("f1").equals(0L))
> .withWatermark("ts", "10 seconds")
> ;
> StreamingQuery streamingQuery = rawSet
> .select("*")
> .writeStream()
> .format("console")
> .outputMode("append")
> .start();
> try {
> streamingQuery.awaitTermination(30_000);
> } catch (StreamingQueryException e) {
> System.out.println("Caught exception at 'awaitTermination':");
> e.printStackTrace();
> }
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24358) createDataFrame in Python should be able to infer bytes type as Binary type

2018-05-22 Thread Joel Croteau (JIRA)
Joel Croteau created SPARK-24358:


 Summary: createDataFrame in Python should be able to infer bytes 
type as Binary type
 Key: SPARK-24358
 URL: https://issues.apache.org/jira/browse/SPARK-24358
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Joel Croteau


createDataFrame can infer Python's bytearray type as a BinaryType column. Since 
bytes is just the immutable, hashable version of the same structure, it makes 
sense for the same inference to apply there.
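
A minimal sketch of the current vs. proposed behavior (not from the report; it 
assumes a local SparkSession on Spark 2.3):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Today: bytearray values are inferred as BinaryType.
df = spark.createDataFrame([(bytearray(b"abc"),)], ["payload"])
df.printSchema()  # payload: binary (nullable = true)

# Proposed: plain bytes should be inferred the same way, so that
#     spark.createDataFrame([(b"abc",)], ["payload"])
# would also yield a BinaryType column instead of failing schema inference.
{code}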



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24357) createDataFrame in Python infers large integers as long type and then fails silently when converting them

2018-05-22 Thread Joel Croteau (JIRA)
Joel Croteau created SPARK-24357:


 Summary: createDataFrame in Python infers large integers as long 
type and then fails silently when converting them
 Key: SPARK-24357
 URL: https://issues.apache.org/jira/browse/SPARK-24357
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Joel Croteau


When inferring the schema type of an RDD passed to createDataFrame, PySpark SQL 
will infer any integral type as a LongType, which is a 64-bit integer, without 
actually checking whether the values will fit into a 64-bit slot. If the values 
are larger than 64 bits, then when pickled and unpickled in Java, Unpickler 
will convert them to BigIntegers. When applySchemaToPythonRDD is called, it 
will ignore the BigInteger type and return Null. This results in any large 
integers in the resulting DataFrame being silently converted to None. This can 
create some very surprising and difficult to debug behavior, in particular if 
you are not aware of this limitation. There should either be a runtime error at 
some point in this conversion chain, or else _infer_type should infer larger 
integers as DecimalType with appropriate precision, or as BinaryType. The 
former would be less convenient, but the latter may be problematic to implement 
in practice. In any case, we should stop silently converting large integers to 
None.
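
A minimal reproduction sketch based on the description above (hypothetical 
values; assumes a local SparkSession on Spark 2.3):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

small = 42
big = 2 ** 70  # does not fit into a 64-bit LongType slot

# Schema inference maps both Python ints to LongType without a range check.
rdd = spark.sparkContext.parallelize([(small,), (big,)])
df = spark.createDataFrame(rdd, ["value"])
df.printSchema()  # value: long

# Per the report, the oversized value becomes a BigInteger when unpickled on
# the JVM side and is then ignored, so it surfaces as null instead of raising:
df.show()  # expected per the report: 42 and null
{code}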



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer

2018-05-22 Thread Misha Dmitriev (JIRA)
Misha Dmitriev created SPARK-24356:
--

 Summary: Duplicate strings in File.path managed by 
FileSegmentManagedBuffer
 Key: SPARK-24356
 URL: https://issues.apache.org/jira/browse/SPARK-24356
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.3.0
Reporter: Misha Dmitriev


I recently analyzed a heap dump of the YARN Node Manager that was suffering from 
high GC pressure due to high object churn. The analysis was done with the jxray 
tool ([www.jxray.com|http://www.jxray.com/]), which checks a heap dump for a 
number of well-known memory issues. One problem that it found in this dump is 
that 19.5% of memory is wasted due to duplicate strings. Of these duplicates, 
more than half come from {{FileInputStream.path}} and {{File.path}}. All the 
{{FileInputStream}} objects that JXRay shows are garbage - it looks like they are 
used for a very short period and then discarded (I guess there is a separate 
question of whether that's a good pattern). But the {{File}} instances are 
traceable to the {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} 
field. Here is the full reference chain:
 
{code:java}
↖java.io.File.path
↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
↖{j.u.ArrayList}
↖j.u.ArrayList$Itr.this$0
↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
↖{java.util.concurrent.ConcurrentHashMap}.values
↖org.apache.spark.network.server.OneForOneStreamManager.streams
↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
{code}
 
The values of these {{File.path}}'s and {{FileInputStream.path}}'s look very 
similar, so I think the {{FileInputStream}}s are generated by the 
{{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely come 
from 
[https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263]
 
To avoid duplicate strings in {{File.path}} in this case, it is suggested 
that in the above code we create a File with a complete, normalized pathname 
that has already been interned. This will prevent the code inside 
{{java.io.File}} from modifying this string; it will thus use the interned 
copy and pass it on to FileInputStream. Essentially, the current line
{code:java}
return new File(new File(localDir, String.format("%02x", subDirId)), 
filename);{code}
should be replaced with something like
{code:java}
String pathname = localDir + File.separator + String.format(...) + 
File.separator + filename;
pathname = fileSystem.normalize(pathname).intern();
return new File(pathname);{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24355) Improve Spark shuffle server responsiveness to non-ChunkFetch requests

2018-05-22 Thread Min Shen (JIRA)
Min Shen created SPARK-24355:


 Summary: Improve Spark shuffle server responsiveness to 
non-ChunkFetch requests
 Key: SPARK-24355
 URL: https://issues.apache.org/jira/browse/SPARK-24355
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.3.0
 Environment: Hadoop-2.7.4

Spark-2.3.0
Reporter: Min Shen


We run Spark on YARN, and deploy the Spark external shuffle service as part of 
the YARN NM aux service.

One issue we saw with the Spark external shuffle service is the various 
timeouts experienced by clients when either registering an executor with the 
local shuffle server or establishing a connection to a remote shuffle server.

Example of a timeout for establishing a connection with a remote shuffle server:
{code:java}
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
waiting for task.
at 
org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
at 
org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:288)
at 
org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:248)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:106)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:115)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:182)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.org$apache$spark$storage$ShuffleBlockFetcherIterator$$send$1(ShuffleBlockFetcherIterator.scala:396)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:391)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:345)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:57)
{code}
Example of a timeout for registering executor with local shuffle server:
{code:java}
ava.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
waiting for task.
at 
org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
at 
org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
at 
org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at 
org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
at 
org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
at 
org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
{code}
While patches such as SPARK-20640 and config parameters such as 
spark.shuffle.registration.timeout and spark.shuffle.sasl.timeout (when 
spark.authenticate is set to true) can help to alleviate this type of problem, 
they do not solve the fundamental issue.

We have observed that, when the shuffle workload gets very busy in peak hours, 
client requests can time out even after configuring these parameters to very 
high values. Further investigation revealed the following issue:

Right now, the default number of server-side Netty handler threads is 2 * # of 
cores, and it can be further configured with the parameter 
spark.shuffle.io.serverThreads.
Processing a client request requires one available server Netty handler thread.
However, when the server Netty handler threads start to process 
ChunkFetchRequests, they become blocked on disk I/O, mostly due to disk 
contention from the random read operations initiated by all the 
ChunkFetchRequests received from clients.
As a result, when the shuffle server is serving many concurrent 
ChunkFetchRequests, the server-side Netty handler threads can all be blocked 
on reading shuffle files, leaving no handler thread available to process other 
types of requests, which should all be very quick to process.

This issue could potentially be fixed by limiting the number of Netty handler 
threads that can get blocked when processing ChunkFetchRequests. We have a 
patch to do this by using a separate 

[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-05-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-19185:
--

Assignee: Gabor Somogyi

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: streaming, windowing
> Fix For: 2.4.0
>
>
> We've been running into ConcurrentModificationExceptions ("KafkaConsumer is 
> not safe for multi-threaded access") with the CachedKafkaConsumer. I've been 
> working through debugging this issue, and after looking through some of the 
> Spark source code I think this is a bug.
> Our setup is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
> 

[jira] [Resolved] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2018-05-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19185.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20997
[https://github.com/apache/spark/pull/20997]

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: streaming, windowing
> Fix For: 2.4.0
>
>
> We've been running into ConcurrentModificationExceptions ("KafkaConsumer is 
> not safe for multi-threaded access") with the CachedKafkaConsumer. I've been 
> working through debugging this issue, and after looking through some of the 
> Spark source code I think this is a bug.
> Our setup is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> 

[jira] [Comment Edited] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-22 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484138#comment-16484138
 ] 

Cristian Consonni edited comment on SPARK-24324 at 5/22/18 8:17 PM:


[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, I will still get mixed columns:

{noformat}
+----------+----------------+---+-------------+
|      lang|            page|day|          enc|
+----------+----------------+---+-------------+
|2007-12-10|    A150B148C197| en| Albert_Camus|
|2007-12-11|        G1I1P3V1| en|Albert_Caquot|
|2007-12-10|          C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+----------+----------------+---+-------------+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me has absolutely no meaning and/or reason.

It took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).
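
One possible explanation (not confirmed in this thread): under pandas 0.22, 
pd.DataFrame(dict) orders its columns alphabetically, while grouped-map Pandas 
UDFs in Spark 2.3 assign the returned columns to the declared schema by 
position. A sketch of a workaround under that assumption, with hypothetical 
names, is to select the columns explicitly in schema order before returning:
{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType

# Assumed result schema matching the columns discussed above.
result_schema = StructType([
    StructField("lang", StringType()),
    StructField("page", StringType()),
    StructField("day", StringType()),
    StructField("enc", StringType()),
])

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def concat_hours(x):
    # "placeholder" stands in for the real encoded_views_string computation.
    out = pd.DataFrame({"lang": x["lang"], "page": x["page"],
                        "day": x["day"], "enc": "placeholder"})
    # pd.DataFrame(dict) may reorder columns alphabetically, so force the
    # column order to match the declared schema before returning.
    return out[[f.name for f in result_schema.fields]]

# usage (hypothetical grouping keys): df.groupby("lang", "page", "day").apply(concat_hours)
{code}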


was (Author: cristiancantoro):
[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that, even if I change the return statement in the funzion 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, but I still get mixed columns:

{noformat}
+--++---+-+ 
|  lang|page|day|  enc|
+--++---+-+
|2007-12-10|A150B148C197| en| Albert_Camus|
|2007-12-11|G1I1P3V1| en|Albert_Caquot|
|2007-12-10|  C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+--++---+-+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me has absolutely no meaning and/or reason.

it took me several hours to debug this because I was getting all sorts of otehr 
errors in between (e.g type mismatches and the like).

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS

[jira] [Comment Edited] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-22 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484138#comment-16484138
 ] 

Cristian Consonni edited comment on SPARK-24324 at 5/22/18 8:16 PM:


[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, but I still get mixed columns:

{noformat}
+----------+----------------+---+-------------+
|      lang|            page|day|          enc|
+----------+----------------+---+-------------+
|2007-12-10|    A150B148C197| en| Albert_Camus|
|2007-12-11|        G1I1P3V1| en|Albert_Caquot|
|2007-12-10|          C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+----------+----------------+---+-------------+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me has absolutely no meaning and/or reason.

It took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).


was (Author: cristiancantoro):
[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{{concat_hours}} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema I get mixed columns:

{noformat}
+--++---+-+ 
|  lang|page|day|  enc|
+--++---+-+
|2007-12-10|A150B148C197| en| Albert_Camus|
|2007-12-11|G1I1P3V1| en|Albert_Caquot|
|2007-12-10|  C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+--++---+-+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me has absolutely no meaning and/or reason.

it took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  

[jira] [Comment Edited] (SPARK-22055) Port release scripts

2018-05-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483217#comment-16483217
 ] 

Marcelo Vanzin edited comment on SPARK-22055 at 5/22/18 8:11 PM:
-

IMO it would be a better idea to take Felix's docker image [1] and create a 
script based on it, instead of relying on Jenkins jobs for this. As part of 
2.3.1 I have written a small script to help me prepare the release, and it does 
half of the job. All that is left is to automate building the image and running 
the commands inside it.

I'll send my script for review after 2.3.1 is out and I'm happy with it.

[1] https://github.com/felixcheung/spark-build/blob/master/Dockerfile
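
Not the script under review, just a rough sketch of the shape such automation could take (the image tag, mount paths, and release entry point below are placeholders):

{code:python}
import subprocess

IMAGE_TAG = "spark-release:latest"            # placeholder tag
BUILD_CONTEXT = "spark-build"                 # directory holding the Dockerfile [1]
RELEASE_CMD = ["/opt/spark-rm/do-release.sh"] # placeholder release entry point

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.run(cmd, check=True)

# Build the release image from the Dockerfile, then run the release steps
# inside it, mounting the workspace and the GPG home from the host rather
# than baking any keys into the image.
run(["docker", "build", "-t", IMAGE_TAG, BUILD_CONTEXT])
run(["docker", "run", "--rm",
     "-v", "/path/to/workspace:/opt/spark-rm",
     "-v", "/path/to/gnupg:/home/spark-rm/.gnupg:ro",
     IMAGE_TAG] + RELEASE_CMD)
{code}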


was (Author: vanzin):
IMO it would be a better idea to take Felix's docker image [1] and create a 
script based on it, instead of relying on Jenkins jobs for this. As part of 
2.3.1 I have written a small script to help me prepare the release, and it half 
of the job. All that is left is automate building the image, and running the 
commands inside it.

I'll send my script for review after 2.3.1 is out and I'm happy with it.

[1] https://github.com/felixcheung/spark-build/blob/master/Dockerfile

> Port release scripts
> 
>
> Key: SPARK-22055
> URL: https://issues.apache.org/jira/browse/SPARK-22055
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: holdenk
>Priority: Blocker
>
> The current Jenkins jobs are generated from scripts in a private repo. We 
> should port these to enable changes like SPARK-22054 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24348) scala.MatchError in the "element_at" expression

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24348.
-
Resolution: Fixed

> scala.MatchError in the "element_at" expression
> ---
>
> Key: SPARK-24348
> URL: https://issues.apache.org/jira/browse/SPARK-24348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Vayda
>Assignee: Alex Vayda
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling {{element_at}} with a wrong first operand type a 
> {{scala.MatchError}} is thrown instead of {{AnalysisException}}
> *Example:*
> {code:sql}
> select element_at('foo', 1)
> {code}
> results in:
> {noformat}
> scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.inputTypes(collectionOperations.scala:1469)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.checkInputDataTypes(collectionOperations.scala:1478)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24348) scala.MatchError in the "element_at" expression

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24348:
---

Assignee: Alex Vayda

> scala.MatchError in the "element_at" expression
> ---
>
> Key: SPARK-24348
> URL: https://issues.apache.org/jira/browse/SPARK-24348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Vayda
>Assignee: Alex Vayda
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling {{element_at}} with a wrong first operand type a 
> {{scala.MatchError}} is thrown instead of {{AnalysisException}}
> *Example:*
> {code:sql}
> select element_at('foo', 1)
> {code}
> results in:
> {noformat}
> scala.MatchError: StringType (of class org.apache.spark.sql.types.StringType$)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.inputTypes(collectionOperations.scala:1469)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ElementAt.checkInputDataTypes(collectionOperations.scala:1478)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23899) Built-in SQL Function Improvement

2018-05-22 Thread Alex Vayda (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484477#comment-16484477
 ] 

Alex Vayda edited comment on SPARK-23899 at 5/22/18 7:56 PM:
-

What do you guys think about adding another set of convenient functions for 
working with multi-dimensional arrays? E.g. matrix operations like 
{{transpose}}, {{multiply}} and others?
Something similar to {{ml.linalg.Matrix}}
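
For illustration only, a rough sketch of the kind of operation being proposed, written today as a Python UDF over an {{array<array<double>>}} column (all names here are illustrative; the point of the proposal is that this would become a built-in function):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

spark = SparkSession.builder.appName("matrix-sketch").getOrCreate()

@udf(ArrayType(ArrayType(DoubleType())))
def transpose(m):
    # m is a list of rows; zip(*m) flips rows and columns.
    return [list(col) for col in zip(*m)] if m else m

df = spark.createDataFrame([([[1.0, 2.0], [3.0, 4.0]],)], ["m"])
df.select(transpose("m").alias("m_t")).show(truncate=False)
# m_t -> [[1.0, 3.0], [2.0, 4.0]]
{code}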


was (Author: wajda):
What do you guys think about adding another set of convenient functions for 
working with multi-dimentional arrays? E.g. matrix operations like 
{{transpose}}, {{multiply}} and others?
Something similar to {{ml.linalg.Matrix}}

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-23899
> URL: https://issues.apache.org/jira/browse/SPARK-23899
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> This umbrella JIRA is to improve compatibility with the other data processing 
> systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and 
> MS SQL Server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23899) Built-in SQL Function Improvement

2018-05-22 Thread Alex Vayda (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484477#comment-16484477
 ] 

Alex Vayda commented on SPARK-23899:


What do you guys think about adding another set of convenient functions for 
working with multi-dimensional arrays? E.g. matrix operations like 
{{transpose}}, {{multiply}} and others?
Something similar to {{ml.linalg.Matrix}}

> Built-in SQL Function Improvement
> -
>
> Key: SPARK-23899
> URL: https://issues.apache.org/jira/browse/SPARK-23899
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> This umbrella JIRA is to improve compatibility with the other data processing 
> systems, including Hive, Teradata, Presto, Postgres, MySQL, DB2, Oracle, and 
> MS SQL Server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23780:
---
Fix Version/s: (was: 2.3.2)
   2.3.1

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Ivan Dzikovsky
>Assignee: Felix Cheung
>Priority: Major
>  Labels: regression
> Fix For: 2.3.1, 2.4.0
>
>
> I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Then I got the following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But the expected result is to get some HTML code output, as was the case with Spark 
> 2.1.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24353:

Description: 
Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. The aim here is to bring support for this feature to Spark on K8s, in 
detail:
 * Node affinity/taints
 * Pod affinity/anti-affinity

Note that nodeSelector will be deprecated in the future.
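
For context, a minimal sketch (placeholder API server, image, label, and jar path) of how placement is constrained today via the existing spark.kubernetes.node.selector.* settings; affinity/anti-affinity would allow far richer rules than this exact-label match:

{code:python}
import subprocess

confs = {
    "spark.kubernetes.container.image": "spark:2.3.0",        # placeholder image
    # Existing mechanism: schedule driver/executor pods only on nodes labeled disktype=ssd.
    "spark.kubernetes.node.selector.disktype": "ssd",
}

cmd = ["bin/spark-submit",
       "--master", "k8s://https://kubernetes.example.com:443",  # placeholder API server
       "--deploy-mode", "cluster",
       "--class", "org.apache.spark.examples.SparkPi"]
for k, v in confs.items():
    cmd += ["--conf", "{}={}".format(k, v)]
cmd.append("local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar")  # placeholder jar

subprocess.run(cmd, check=True)
{code}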

 

 

  was:
Spark on K8s allows to place driver/executor pods on specific k8s nodes, using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. Aim here is to bring support of this feature to Spark on K8s, in 
detail:
 * Node affinity/ taints
 * Pod affinity/anti-affinity

Note that nodeSelector will be deprecated in the future.

 

 


> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows to place driver/executor pods on specific k8s nodes, 
> using nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. Aim here is to bring support of this feature to Spark on K8s, in 
> detail:
>  * Node affinity/taints
>  * Pod affinity/anti-affinity
> Note that nodeSelector will be deprecated in the future.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24354) Adding support for quoteMode in Spark's built-in CSV DataFrameWriter

2018-05-22 Thread Umesh K (JIRA)
Umesh K created SPARK-24354:
---

 Summary: Adding support for quoteMode in Spark's built-in CSV 
DataFrameWriter
 Key: SPARK-24354
 URL: https://issues.apache.org/jira/browse/SPARK-24354
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.3.0, 2.2.0, 2.1.0
Reporter: Umesh K


Hi All, I feel the quoteMode option which we used to have in 
[spark-csv|https://github.com/databricks/spark-csv] will be very useful if we 
implement it in Spark's built-in CSV DataFrameWriter.
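
A hedged sketch of what this might look like; the {{quoteMode}} option below is the *proposed* (hypothetical) API mirroring spark-csv and does not exist in the built-in writer today, whereas {{quoteAll}} does (paths are placeholders):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quote-mode-sketch").getOrCreate()
df = spark.createDataFrame([("a,b", 1), ("plain", 2)], ["text", "n"])

# Proposed (hypothetical) option, mirroring spark-csv's quoteMode: NOT available today.
# df.write.option("quoteMode", "NON_NUMERIC").csv("/tmp/out")

# Closest existing knob: quote every field.
df.write.option("quoteAll", True).mode("overwrite").csv("/tmp/out")
{code}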



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-05-22 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484376#comment-16484376
 ] 

Huaxin Gao commented on SPARK-24333:


I will work on this. Thanks!

> Add fit with validation set to spark.ml GBT: Python API
> ---
>
> Key: SPARK-24333
> URL: https://issues.apache.org/jira/browse/SPARK-24333
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Python version of API added by [SPARK-7132]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24353:

Description: 
Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. The aim here is to bring support for this feature to Spark on K8s, in 
detail:
 * Node affinity/taints
 * Pod affinity/anti-affinity

Note that nodeSelector will be deprecated in the future.

 

 

  was:
Spark on K8s allows to place driver/executor pods on specific k8s nodes, using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. Aim here is to bring support of this feature to Spark on K8s, in 
detail:
 * Node affinity
 * Pod affinity/anti-affinity

Selecting where to run drivers and executors covers many cases like the ones 
described here:

Note that nodeSelector will be deprecated in the future.

 

 


> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows to place driver/executor pods on specific k8s nodes, 
> using nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. Aim here is to bring support of this feature to Spark on K8s, in 
> detail:
>  * Node affinity/ taints
>  * Pod affinity/anti-affinity
> Note that nodeSelector will be deprecated in the future.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24121) The API for handling expression code generation in expression codegen

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24121:
---

Assignee: Liang-Chi Hsieh

> The API for handling expression code generation in expression codegen
> -
>
> Key: SPARK-24121
> URL: https://issues.apache.org/jira/browse/SPARK-24121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> In order to achieve the replacement of expr values during expression codegen 
> (please see the proposal at 
> [https://github.com/apache/spark/pull/19813#issuecomment-354045400]), we need 
> an API to handle the insertion of temporary symbols for statements generated 
> by expressions. This API must allow us to know what statement expressions are 
> during codegen and to use symbols instead of the actual code when generating 
> Java code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24121) The API for handling expression code generation in expression codegen

2018-05-22 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24121.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21193
[https://github.com/apache/spark/pull/21193]

> The API for handling expression code generation in expression codegen
> -
>
> Key: SPARK-24121
> URL: https://issues.apache.org/jira/browse/SPARK-24121
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> In order to achieve the replacement of expr values during expression codegen 
> (please see the proposal at 
> [https://github.com/apache/spark/pull/19813#issuecomment-354045400]), we need 
> an API to handle the insertion of temporary symbols for statements generated 
> by expressions. This API must allow us to know what statement expressions are 
> during codegen and to use symbols instead of the actual code when generating 
> Java code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484346#comment-16484346
 ] 

Stavros Kontopoulos commented on SPARK-24353:
-

I will create a design doc for this feature.

> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows to place driver/executor pods on specific k8s nodes, 
> using nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. Aim here is to bring support of this feature to Spark on K8s, in 
> detail:
>  * Node affinity
>  * Pod affinity/anti-affinity
> Selecting where to run drivers and executors covers many cases like the ones 
> described here:
> Note that nodeSelector will be deprecated in the future.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24353:

Description: 
Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. The aim here is to bring support for this feature to Spark on K8s, in 
detail:
 * Node affinity
 * Pod affinity/anti-affinity

Selecting where to run drivers and executors covers many cases like the ones 
described here:

Note that nodeSelector will be deprecated in the future.

 

 

  was:
Spark on K8s allows to place driver/executor pods on specific k8s nodes, using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. Aim here is to bring support of this feature to Spark on K8s, in 
detail:
 * Node affinity
 * Pod affinity/anti-affinity

Note that nodeSelector will be deprecated in the future.

 

 


> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows to place driver/executor pods on specific k8s nodes, 
> using nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. Aim here is to bring support of this feature to Spark on K8s, in 
> detail:
>  * Node affinity
>  * Pod affinity/anti-affinity
> Selecting where to run drivers and executors covers many cases like the ones 
> described here:
> Note that nodeSelector will be deprecated in the future.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24353:

Description: 
Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. The aim here is to bring support for this feature to Spark on K8s, in 
detail:
 * Node affinity
 * Pod affinity/anti-affinity

Note that nodeSelector will be deprecated in the future.

 

 

  was:
Spark on K8s allows to place driver/executor pods on specific k8s nodes, using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. Aim here is to bring support of this feature to Spark on K8s. Note 
that nodeSelector will be deprecated in the future.

 

 


> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows to place driver/executor pods on specific k8s nodes, 
> using nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. Aim here is to bring support of this feature to Spark on K8s, in 
> detail:
>  * Node affinity
>  * Pod affinity/anti-affinity
> Note that nodeSelector will be deprecated in the future.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24257:

Target Version/s: 2.3.2

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>
> LongToUnsafeRowMap calculates the new size simply by multiplying the old one by 2.
> The size requested this way may not be enough to store the data, so some data is
> lost and the data read back out is dirty.
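
For intuition, a back-of-the-envelope sketch with made-up numbers of why a blind grow-by-2 step can come up short:

{code:python}
# Made-up sizes, just to show the failure mode of growing by a fixed factor.
used = 1 << 20                  # bytes already occupied in the map's page
needed = 3 * (1 << 20)          # bytes required by the row being appended
new_size = used * 2             # grow-by-2 heuristic

# The new page is still smaller than what the append actually needs, so
# (per the description above) data is lost and later reads come back dirty.
assert new_size < used + needed
print(new_size, used + needed)
{code}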



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24257:

Target Version/s: 2.3.1  (was: 2.3.2)

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>
> LongToUnsafeRowMap calculates the new size simply by multiplying the old one by 2.
> The size requested this way may not be enough to store the data, so some data is
> lost and the data read back out is dirty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24257:

Labels: correctness  (was: )

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Minor
>  Labels: correctness
>
> LongToUnsafeRowMap calculates the new size simply by multiplying the old one by 2.
> The size requested this way may not be enough to store the data, so some data is
> lost and the data read back out is dirty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24257) LongToUnsafeRowMap calculate the new size may be wrong

2018-05-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24257:

Priority: Blocker  (was: Minor)

> LongToUnsafeRowMap calculate the new size may be wrong
> --
>
> Key: SPARK-24257
> URL: https://issues.apache.org/jira/browse/SPARK-24257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Blocker
>  Labels: correctness
>
> LongToUnsafeRowMap calculates the new size simply by multiplying the old one by 2.
> The size requested this way may not be enough to store the data, so some data is
> lost and the data read back out is dirty.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13638) Support for saving with a quote mode

2018-05-22 Thread Umesh K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484255#comment-16484255
 ] 

Umesh K edited comment on SPARK-13638 at 5/22/18 4:36 PM:
--

[~rxin] Just want to confirm: are we going to have quoteMode in the future, or will we 
always have to use quoteAll?
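
For reference, a small sketch (output path is a placeholder) of what {{quoteAll}} gives you today with the built-in writer; it roughly corresponds to the old spark-csv QuoteMode.ALL, while finer-grained modes such as NON_NUMERIC have no direct equivalent:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quote-all-example").getOrCreate()
df = spark.createDataFrame([("a,b", 1), ("plain", 2)], ["text", "n"])

# quoteAll=true forces every field to be quoted, not just the ones that
# contain the separator or quote character.
df.write.option("quoteAll", True).mode("overwrite").csv("/tmp/quoted_csv")
{code}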


was (Author: kachau):
Just want to confirm are we ever going to have quoteMode or we always have to 
use quoteAll?

> Support for saving with a quote mode
> 
>
> Key: SPARK-13638
> URL: https://issues.apache.org/jira/browse/SPARK-13638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Jurriaan Pruis
>Priority: Minor
> Fix For: 2.0.0
>
>
> https://github.com/databricks/spark-csv/pull/254
> tobithiel reported this.
> {quote}
> I'm dealing with some messy csv files and being able to just quote all fields 
> is very useful,
> so that other applications don't misunderstand the file because of some 
> sketchy characters
> {quote}
> When writing there are several quote modes in apache commons csv. (See 
> https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)
> This might have to be supported.
> However, it looks like the univocity parser used for writing (currently the only 
> supported library) does not support this quote mode. I think we can 
> drop this backwards compatibility if we are not going to add Apache Commons 
> CSV.
> This is a reminder that it might break backwards compatibility for the 
> options, {{quoteMode}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13638) Support for saving with a quote mode

2018-05-22 Thread Umesh K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484255#comment-16484255
 ] 

Umesh K commented on SPARK-13638:
-

Just want to confirm: are we ever going to have quoteMode, or will we always have to 
use quoteAll?

> Support for saving with a quote mode
> 
>
> Key: SPARK-13638
> URL: https://issues.apache.org/jira/browse/SPARK-13638
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Jurriaan Pruis
>Priority: Minor
> Fix For: 2.0.0
>
>
> https://github.com/databricks/spark-csv/pull/254
> tobithiel reported this.
> {quote}
> I'm dealing with some messy csv files and being able to just quote all fields 
> is very useful,
> so that other applications don't misunderstand the file because of some 
> sketchy characters
> {quote}
> When writing there are several quote modes in apache commons csv. (See 
> https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/QuoteMode.html)
> This might have to be supported.
> However, it looks like the univocity parser used for writing (currently the only 
> supported library) does not support this quote mode. I think we can 
> drop this backwards compatibility if we are not going to add Apache Commons 
> CSV.
> This is a reminder that it might break backwards compatibility for the 
> options, {{quoteMode}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-24353:

Description: 
Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. The aim here is to bring support for this feature to Spark on K8s. Note 
that nodeSelector will be deprecated in the future.

 

 

  was:
Spark on K8s allows to place driver/executor pods on specific k8s nodes, using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. Aim here is to bring support to Spark on K8s. Note here that 
nodeSelector will be deprecated in the future.

 

 


> Add support for pod affinity/anti-affinity
> --
>
> Key: SPARK-24353
> URL: https://issues.apache.org/jira/browse/SPARK-24353
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Spark on K8s allows to place driver/executor pods on specific k8s nodes, 
> using nodeSelector. NodeSelector is a very simple way to constrain pods to 
> nodes with particular labels. The 
> [affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
>  feature, currently in beta, greatly expands the types of constraints you can 
> express. Aim here is to bring support of this feature to Spark on K8s. Note 
> that nodeSelector will be deprecated in the future.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24353) Add support for pod affinity/anti-affinity

2018-05-22 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-24353:
---

 Summary: Add support for pod affinity/anti-affinity
 Key: SPARK-24353
 URL: https://issues.apache.org/jira/browse/SPARK-24353
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Stavros Kontopoulos


Spark on K8s allows placing driver/executor pods on specific k8s nodes using 
nodeSelector. NodeSelector is a very simple way to constrain pods to nodes with 
particular labels. The 
[affinity/anti-affinity|https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity]
 feature, currently in beta, greatly expands the types of constraints you can 
express. The aim here is to bring support for this to Spark on K8s. Note that 
nodeSelector will be deprecated in the future.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24352) Flaky test: StandaloneDynamicAllocationSuite

2018-05-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24352:
--

 Summary: Flaky test: StandaloneDynamicAllocationSuite
 Key: SPARK-24352
 URL: https://issues.apache.org/jira/browse/SPARK-24352
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Tests
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin


From jenkins:

[https://amplab.cs.berkeley.edu/jenkins/user/vanzin/my-views/view/Spark/job/spark-branch-2.3-test-maven-hadoop-2.6/384/testReport/junit/org.apache.spark.deploy/StandaloneDynamicAllocationSuite/executor_registration_on_a_blacklisted_host_must_fail/]

 
{noformat}
Error Message
There is already an RpcEndpoint called CoarseGrainedScheduler
Stacktrace
  java.lang.IllegalArgumentException: There is already an RpcEndpoint 
called CoarseGrainedScheduler
  at 
org.apache.spark.rpc.netty.Dispatcher.registerRpcEndpoint(Dispatcher.scala:71)
  at 
org.apache.spark.rpc.netty.NettyRpcEnv.setupEndpoint(NettyRpcEnv.scala:130)
  at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.createDriverEndpointRef(CoarseGrainedSchedulerBackend.scala:396)
  at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.start(CoarseGrainedSchedulerBackend.scala:391)
  at 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.start(StandaloneSchedulerBackend.scala:61)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply$mcV$sp(StandaloneDynamicAllocationSuite.scala:512)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
  at 
org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:495)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
{noformat}

This actually looks like a previous test is leaving some stuff running and 
making this one fail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24341) Codegen compile error from predicate subquery

2018-05-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484225#comment-16484225
 ] 

Xiao Li commented on SPARK-24341:
-

cc [~dkbiswal] Could you take a look at this too?

> Codegen compile error from predicate subquery
> -
>
> Key: SPARK-24341
> URL: https://issues.apache.org/jira/browse/SPARK-24341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Juliusz Sompolski
>Priority: Minor
>
> Ran on master:
> {code}
> drop table if exists juleka;
> drop table if exists julekb;
> create table juleka (a integer, b integer);
> create table julekb (na integer, nb integer);
> insert into juleka values (1,1);
> insert into julekb values (1,1);
> select * from juleka where (a, b) not in (select (na, nb) from julekb);
> {code}
> Results in:
> {code}
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 27, Column 29: Cannot compare types "int" and 
> "org.apache.spark.sql.catalyst.InternalRow"
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1415)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:92)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:46)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:380)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:99)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:97)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2$$anonfun$apply$3.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>   at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>   at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:203)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$4$$anonfun$apply$2.apply(BroadcastNestedLoopJoinExec.scala:202)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
>   at 
> org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.encodeUnsafeRows(UnsafeRowBatchUtils.scala:49)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:126)
>   at 
> org.apache.spark.sql.execution.collect.Collector$$anonfun$2.apply(Collector.scala:125)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:111)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  

[jira] [Updated] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread huangtengfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtengfei updated SPARK-24351:
-
Description: 
In structured streaming, there is a conf spark.sql.streaming.minBatchesToRetain 
which is set to specify 'The minimum number of batches that must be retained 
and made recoverable' as described in 
[SQLConf|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802].
 In continuous processing, the metadata purge is triggered when an epoch is 
committed in 
[ContinuousExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306].
 
 Since currentBatchId increases independently in cp mode, the current committed 
epoch may be far behind currentBatchId if some task hangs for some time. It is 
not safe to discard the metadata with thresholdBatchId computed based on 
currentBatchId because we may clean all the metadata in the checkpoint 
directory.
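
A back-of-the-envelope illustration (numbers are made up) of why deriving the threshold from currentBatchId can wipe everything that is still recoverable:

{code:python}
min_batches_to_retain = 20
current_batch_id = 100    # keeps advancing on its own in continuous mode
committed_epoch = 10      # lags behind because some task is hanging

# Threshold derived from currentBatchId is already past the committed epoch,
# so every offset/commit log entry up to epoch 10 would be purged.
threshold_from_current = current_batch_id - min_batches_to_retain    # 80
# Deriving it from the committed epoch keeps the recoverable batches around.
threshold_from_committed = committed_epoch - min_batches_to_retain   # -10 -> purge nothing yet

print(threshold_from_current, threshold_from_committed)
{code}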

  was:
In structured streaming, there is a conf spark.sql.streaming.minBatchesToRetain 
which is set to specify 'The minimum number of batches that must be retained 
and made recoverable' as described in 
[SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
 In continuous processing, the metadata purge is triggered when an epoch is 
committed in 
[ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
 
Since currentBatchId increases independently in cp mode, the current committed 
epoch may be far behind currentBatchId if some task hangs for some time. It is 
not safe to discard the metadata with thresholdBatchId computed based on 
currentBatchId because we may clean all the metadata in the checkpoint 
directory.


> offsetLog/commitLog purge thresholdBatchId should be computed with current 
> committed epoch but not currentBatchId in CP mode
> 
>
> Key: SPARK-24351
> URL: https://issues.apache.org/jira/browse/SPARK-24351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Major
>
> In structured streaming, there is a conf 
> spark.sql.streaming.minBatchesToRetain which is set to specify 'The minimum 
> number of batches that must be retained and made recoverable' as described in 
> [SQLConf|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802].
>  In continuous processing, the metadata purge is triggered when an epoch is 
> committed in 
> [ContinuousExecution|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306].
>  
>  Since currentBatchId increases independently in cp mode, the current 
> committed epoch may be far behind currentBatchId if some task hangs for some 
> time. It is not safe to discard the metadata with thresholdBatchId computed 
> based on currentBatchId because we may clean all the metadata in the 
> checkpoint directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24351:


Assignee: (was: Apache Spark)

> offsetLog/commitLog purge thresholdBatchId should be computed with current 
> committed epoch but not currentBatchId in CP mode
> 
>
> Key: SPARK-24351
> URL: https://issues.apache.org/jira/browse/SPARK-24351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Major
>
> In structured streaming, there is a conf 
> spark.sql.streaming.minBatchesToRetain which is set to specify 'The minimum 
> number of batches that must be retained and made recoverable' as described in 
> [SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
>  In continuous processing, the metadata purge is triggered when an epoch is 
> committed in 
> [ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
>  
> Since currentBatchId increases independently in cp mode, the current 
> committed epoch may be far behind currentBatchId if some task hangs for some 
> time. It is not safe to discard the metadata with thresholdBatchId computed 
> based on currentBatchId because we may clean all the metadata in the 
> checkpoint directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24351:


Assignee: Apache Spark

> offsetLog/commitLog purge thresholdBatchId should be computed with current 
> committed epoch but not currentBatchId in CP mode
> 
>
> Key: SPARK-24351
> URL: https://issues.apache.org/jira/browse/SPARK-24351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Assignee: Apache Spark
>Priority: Major
>
> In structured streaming, there is a conf 
> spark.sql.streaming.minBatchesToRetain which is set to specify 'The minimum 
> number of batches that must be retained and made recoverable' as described in 
> [SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
>  In continuous processing, the metadata purge is triggered when an epoch is 
> committed in 
> [ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
>  
> Since currentBatchId increases independently in cp mode, the current 
> committed epoch may be far behind currentBatchId if some task hangs for some 
> time. It is not safe to discard the metadata with thresholdBatchId computed 
> based on currentBatchId because we may clean all the metadata in the 
> checkpoint directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24350) ClassCastException in "array_position" function

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484172#comment-16484172
 ] 

Apache Spark commented on SPARK-24350:
--

User 'wajda' has created a pull request for this issue:
https://github.com/apache/spark/pull/21401

> ClassCastException in "array_position" function
> ---
>
> Key: SPARK-24350
> URL: https://issues.apache.org/jira/browse/SPARK-24350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling the {{array_position}} function with a wrong type for the 1st operand, 
> a {{ClassCastException}} is thrown instead of an {{AnalysisException}}.
> Example:
> {code:sql}
> select array_position('foo', 'bar')
> {code}
> {noformat}
> java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot 
> be cast to org.apache.spark.sql.types.ArrayType
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484173#comment-16484173
 ] 

Apache Spark commented on SPARK-24351:
--

User 'ivoson' has created a pull request for this issue:
https://github.com/apache/spark/pull/21400

> offsetLog/commitLog purge thresholdBatchId should be computed with current 
> committed epoch but not currentBatchId in CP mode
> 
>
> Key: SPARK-24351
> URL: https://issues.apache.org/jira/browse/SPARK-24351
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: huangtengfei
>Priority: Major
>
> In structured streaming, there is a conf 
> spark.sql.streaming.minBatchesToRetain which is set to specify 'The minimum 
> number of batches that must be retained and made recoverable' as described in 
> [SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
>  In continuous processing, the metadata purge is triggered when an epoch is 
> committed in 
> [ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
>  
> Since currentBatchId increases independently in cp mode, the current 
> committed epoch may be far behind currentBatchId if some task hangs for some 
> time. It is not safe to discard the metadata with thresholdBatchId computed 
> based on currentBatchId because we may clean all the metadata in the 
> checkpoint directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24350) ClassCastException in "array_position" function

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24350:


Assignee: Apache Spark

> ClassCastException in "array_position" function
> ---
>
> Key: SPARK-24350
> URL: https://issues.apache.org/jira/browse/SPARK-24350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Assignee: Apache Spark
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling the {{array_position}} function with a wrong type for the 1st operand, 
> a {{ClassCastException}} is thrown instead of an {{AnalysisException}}.
> Example:
> {code:sql}
> select array_position('foo', 'bar')
> {code}
> {noformat}
> java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot 
> be cast to org.apache.spark.sql.types.ArrayType
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24350) ClassCastException in "array_position" function

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24350:


Assignee: (was: Apache Spark)

> ClassCastException in "array_position" function
> ---
>
> Key: SPARK-24350
> URL: https://issues.apache.org/jira/browse/SPARK-24350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Alex Wajda
>Priority: Major
> Fix For: 2.4.0
>
>
> When calling the {{array_position}} function with a wrong type for the 1st operand, 
> a {{ClassCastException}} is thrown instead of an {{AnalysisException}}.
> Example:
> {code:sql}
> select array_position('foo', 'bar')
> {code}
> {noformat}
> java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot 
> be cast to org.apache.spark.sql.types.ArrayType
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
>   at 
> org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24351) offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode

2018-05-22 Thread huangtengfei (JIRA)
huangtengfei created SPARK-24351:


 Summary: offsetLog/commitLog purge thresholdBatchId should be 
computed with current committed epoch but not currentBatchId in CP mode
 Key: SPARK-24351
 URL: https://issues.apache.org/jira/browse/SPARK-24351
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: huangtengfei


In structured streaming, there is a conf spark.sql.streaming.minBatchesToRetain 
which is set to specify 'The minimum number of batches that must be retained 
and made recoverable' as described in 
[SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802).
 In continuous processing, the metadata purge is triggered when an epoch is 
committed in 
[ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306).
 
Since currentBatchId increases independently in cp mode, the current committed 
epoch may be far behind currentBatchId if some task hangs for some time. It is 
not safe to discard the metadata with thresholdBatchId computed based on 
currentBatchId because we may clean all the metadata in the checkpoint 
directory.
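
To make the failure mode concrete, here is a small illustrative sketch (plain Python, not Spark's actual Scala code; the function names and the retention value are hypothetical stand-ins) of why a purge threshold anchored on currentBatchId can discard metadata that the lagging committed epoch still needs:

{code:python}
# Hypothetical illustration only; the real logic lives in ContinuousExecution (Scala).
MIN_BATCHES_TO_RETAIN = 100  # stands in for spark.sql.streaming.minBatchesToRetain

def threshold_from_current_batch(current_batch_id):
    # Behaviour described in the report: purge all metadata below this id.
    return current_batch_id - MIN_BATCHES_TO_RETAIN

def threshold_from_committed_epoch(committed_epoch):
    # Proposed behaviour: anchor the purge on what has actually been committed.
    return committed_epoch - MIN_BATCHES_TO_RETAIN

# A hung task lets currentBatchId run far ahead of the committed epoch.
current_batch_id, committed_epoch = 1000, 120

print(threshold_from_current_batch(current_batch_id))   # 900: purges epochs 120..899, still needed
print(threshold_from_committed_epoch(committed_epoch))  # 20: the last 100 committed epochs survive
{code}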



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24350) ClassCastException in "array_position" function

2018-05-22 Thread Alex Wajda (JIRA)
Alex Wajda created SPARK-24350:
--

 Summary: ClassCastException in "array_position" function
 Key: SPARK-24350
 URL: https://issues.apache.org/jira/browse/SPARK-24350
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Alex Wajda
 Fix For: 2.4.0


When calling the {{array_position}} function with a wrong type for the 1st operand, a 
{{ClassCastException}} is thrown instead of an {{AnalysisException}}.

Example:

{code:sql}
select array_position('foo', 'bar')
{code}

{noformat}
java.lang.ClassCastException: org.apache.spark.sql.types.StringType$ cannot be 
cast to org.apache.spark.sql.types.ArrayType
at 
org.apache.spark.sql.catalyst.expressions.ArrayPosition.inputTypes(collectionOperations.scala:1398)
at 
org.apache.spark.sql.catalyst.expressions.ExpectsInputTypes$class.checkInputDataTypes(ExpectsInputTypes.scala:44)
at 
org.apache.spark.sql.catalyst.expressions.ArrayPosition.checkInputDataTypes(collectionOperations.scala:1401)
at 
org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:168)
at 
org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:168)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:256)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveAliases$$assignAliases$1$$anonfun$apply$3.applyOrElse(Analyzer.scala:252)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
{noformat}
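
For context, here is a small PySpark sketch (the local session setup is hypothetical; the failure is as described in this report) contrasting the intended call, whose first operand is an array, with the failing call from the example above:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[1]").appName("array-position-demo").getOrCreate()

# Intended usage: the first operand is an array; array_position returns the 1-based
# position of the value, 2 in this case.
spark.range(1).select(
    F.array_position(F.array(F.lit("foo"), F.lit("bar")), "bar").alias("pos")
).show()

# The call from the report: a plain string as the first operand. Per the report this
# currently surfaces a ClassCastException, where an AnalysisException naming the bad
# input type would be expected.
try:
    spark.sql("select array_position('foo', 'bar')").show()
except Exception as e:
    print(type(e).__name__, e)

spark.stop()
{code}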



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22269) Java style checks should be run in Jenkins

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22269:


Assignee: Apache Spark

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Assignee: Apache Spark
>Priority: Major
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than post-merge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22269) Java style checks should be run in Jenkins

2018-05-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484151#comment-16484151
 ] 

Apache Spark commented on SPARK-22269:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21399

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Major
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than post-merge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22269) Java style checks should be run in Jenkins

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22269:


Assignee: (was: Apache Spark)

> Java style checks should be run in Jenkins
> --
>
> Key: SPARK-22269
> URL: https://issues.apache.org/jira/browse/SPARK-22269
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Andrew Ash
>Priority: Major
>
> A few times now I've gone to build the master branch and it's failed due to 
> Java style errors, which I've sent in PRs to fix:
> - https://issues.apache.org/jira/browse/SPARK-22268
> - https://issues.apache.org/jira/browse/SPARK-21875
> Digging through the history a bit, it looks like this check used to run on 
> Jenkins and was previously enabled at 
> https://github.com/apache/spark/pull/10763 but then reverted at 
> https://github.com/apache/spark/commit/4bcea1b8595424678aa6c92d66ba08c92e0fefe5
> We should work out what it takes to enable the Java check in Jenkins so these 
> kinds of errors are caught in CI rather than post-merge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-22 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484138#comment-16484138
 ] 

Cristian Consonni edited comment on SPARK-24324 at 5/22/18 3:30 PM:


[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, I still get mixed columns:

{noformat}
+----------+----------------+---+-------------+
|      lang|            page|day|          enc|
+----------+----------------+---+-------------+
|2007-12-10|    A150B148C197| en| Albert_Camus|
|2007-12-11|        G1I1P3V1| en|Albert_Caquot|
|2007-12-10|          C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+----------+----------------+---+-------------+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me makes no sense at all.

It took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).


was (Author: cristiancantoro):
[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, I still get mixed columns:

{noformat}
+----------+----------------+---+-------------+
|      lang|            page|day|          enc|
+----------+----------------+---+-------------+
|2007-12-10|    A150B148C197| en| Albert_Camus|
|2007-12-11|        G1I1P3V1| en|Albert_Caquot|
|2007-12-10|          C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+----------+----------------+---+-------------+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me makes no sense at all.

It took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> 

[jira] [Commented] (SPARK-24324) UserDefinedFunction mixes column labels

2018-05-22 Thread Cristian Consonni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16484138#comment-16484138
 ] 

Cristian Consonni commented on SPARK-24324:
---

[~hyukjin.kwon] said:
> Can you narrow down the problem? It roughly sounds like because names and 
> rows are mapped by position in CSV or Pandas UDF.

It is not just that: even if I change the return statement in the function 
{code:python}concat_hours{code} to
{code:python}
return pd.DataFrame({'lang': x.lang, 'page': x.page, 'day': x.day,
 'enc': encoded_views_string}, index=[x.index[0]])
{code}
which is ordered in the same way as the schema, I still get mixed columns:

{noformat}
+----------+----------------+---+-------------+
|      lang|            page|day|          enc|
+----------+----------------+---+-------------+
|2007-12-10|    A150B148C197| en| Albert_Camus|
|2007-12-11|        G1I1P3V1| en|Albert_Caquot|
|2007-12-10|          C1C1E1| en|Albert_Caquot|
|2007-12-11|U145V131W154X142| en| Albert_Camus|
+----------+----------------+---+-------------+
{noformat}

The only way I can get the right values in the columns is with the following:
{code:python}
return pd.DataFrame({'enc': x.page, 'day': x.lang, 'lang': x.day,
 'page': encoded_views_string}, index=[x.index[0]])
{code}
which to me makes no sense at all.

It took me several hours to debug this because I was getting all sorts of other 
errors in between (e.g. type mismatches and the like).
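
If the root cause is what it looks like here, namely that the grouped-map Pandas UDF in Spark 2.3 matches the returned columns to the declared schema by position while pandas 0.22 orders dict-built columns alphabetically, then a possible workaround (a sketch under that assumption, with sample values standing in for the real group) is to select the columns explicitly in schema order before returning:

{code:python}
import pandas as pd

# Sample group standing in for the pandas.DataFrame the UDF receives (hypothetical values).
x = pd.DataFrame({"lang": ["en"], "page": ["Albert_Camus"], "day": ["2007-12-10"]})
encoded_views_string = "A150B148C197"

out = pd.DataFrame({"lang": x.lang, "page": x.page, "day": x.day,
                    "enc": encoded_views_string}, index=[x.index[0]])
# With pandas 0.22 the dict keys come out sorted (day, enc, lang, page); reorder them
# explicitly so they line up with the UDF's returnType schema, which is matched by position.
out = out[["lang", "page", "day", "enc"]]
print(out)
{code}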

> UserDefinedFunction mixes column labels
> ---
>
> Key: SPARK-24324
> URL: https://issues.apache.org/jira/browse/SPARK-24324
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Python (using virtualenv):
> {noformat}
> $ python --version 
> Python 3.6.5
> {noformat}
> Modules installed:
> {noformat}
> arrow==0.12.1
> backcall==0.1.0
> bleach==2.1.3
> chardet==3.0.4
> decorator==4.3.0
> entrypoints==0.2.3
> findspark==1.2.0
> html5lib==1.0.1
> ipdb==0.11
> ipykernel==4.8.2
> ipython==6.3.1
> ipython-genutils==0.2.0
> ipywidgets==7.2.1
> jedi==0.12.0
> Jinja2==2.10
> jsonschema==2.6.0
> jupyter==1.0.0
> jupyter-client==5.2.3
> jupyter-console==5.2.0
> jupyter-core==4.4.0
> MarkupSafe==1.0
> mistune==0.8.3
> nbconvert==5.3.1
> nbformat==4.4.0
> notebook==5.5.0
> numpy==1.14.3
> pandas==0.22.0
> pandocfilters==1.4.2
> parso==0.2.0
> pbr==3.1.1
> pexpect==4.5.0
> pickleshare==0.7.4
> progressbar2==3.37.1
> prompt-toolkit==1.0.15
> ptyprocess==0.5.2
> pyarrow==0.9.0
> Pygments==2.2.0
> python-dateutil==2.7.2
> python-utils==2.3.0
> pytz==2018.4
> pyzmq==17.0.0
> qtconsole==4.3.1
> Send2Trash==1.5.0
> simplegeneric==0.8.1
> six==1.11.0
> SQLAlchemy==1.2.7
> stevedore==1.28.0
> terminado==0.8.1
> testpath==0.3.1
> tornado==5.0.2
> traitlets==4.3.2
> virtualenv==15.1.0
> virtualenv-clone==0.2.6
> virtualenvwrapper==4.7.2
> wcwidth==0.1.7
> webencodings==0.5.1
> widgetsnbextension==3.2.1
> {noformat}
>  
> Java:
> {noformat}
> $ java -version 
>  java version "1.8.0_171"
>  Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode){noformat}
> System:
> {noformat}
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.4 LTS
> Release:  16.04
> Codename: xenial
> {noformat}
>Reporter: Cristian Consonni
>Priority: Major
>
> I am working on Wikipedia page views (see [task T188041 on Wikimedia's 
> Phabricator|https://phabricator.wikimedia.org/T188041]). For simplicity, let's 
> say that these are the data:
> {noformat}
>
> {noformat}
> For each combination of (lang, page, day(timestamp)) I need to transform the 
> views for each hour:
> {noformat}
> 00:00 -> A
> 01:00 -> B
> ...
> {noformat}
> and concatenate the number of views for that hour.  So, if a page got 5 views 
> at 00:00 and 7 views at 01:00 it would become:
> {noformat}
> A5B7
> {noformat}
>  
> I have written a UDF called {code:python}concat_hours{code}.
> However, the function is mixing the columns and I am not sure what is going 
> on. I wrote here a minimal complete example that reproduces the issue on my 
> system (the details of my environment are above).
> {code:python}
> #!/usr/bin/env python3
> # coding: utf-8
> input_data = b"""en Albert_Camus 20071210-00 150
> en Albert_Camus 20071210-01 148
> en Albert_Camus 20071210-02 197
> en Albert_Camus 20071211-20 145
> en Albert_Camus 20071211-21 131
> en Albert_Camus 20071211-22 154
> en Albert_Camus 20071211-230001 142
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-02 1
> en Albert_Caquot 20071210-040001 1
> en Albert_Caquot 20071211-06 1
> en Albert_Caquot 
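
For reference, a minimal sketch of the hour encoding described in the issue above (the mapping hour 00 -> 'A', 01 -> 'B', ..., 23 -> 'X' is inferred from the examples; this is not the reporter's actual concat_hours UDF):

{code:python}
def encode_hours(views_by_hour):
    """views_by_hour: iterable of (hour, views) pairs for one (lang, page, day)."""
    return "".join(f"{chr(ord('A') + hour)}{views}"
                   for hour, views in sorted(views_by_hour))

print(encode_hours([(0, 5), (1, 7)]))                # A5B7
print(encode_hours([(0, 150), (1, 148), (2, 197)]))  # A150B148C197
{code}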

[jira] [Assigned] (SPARK-24338) Spark SQL fails to create a Hive table when running in a Apache Sentry-secured Environment

2018-05-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24338:


Assignee: (was: Apache Spark)

> Spark SQL fails to create a Hive table when running in a Apache 
> Sentry-secured Environment
> --
>
> Key: SPARK-24338
> URL: https://issues.apache.org/jira/browse/SPARK-24338
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Chaoran Yu
>Priority: Critical
> Attachments: exception.txt
>
>
> This 
> [commit|https://github.com/apache/spark/commit/ce13c2672318242748f7520ed4ce6bcfad4fb428]
>  introduced a bug that caused the Spark SQL "CREATE TABLE" statement to fail in 
> Hive when Apache Sentry is used to control cluster authorization. This bug 
> exists in Spark 2.1.0 and all later releases. The error message thrown is in 
> the attached file.[^exception.txt]
> Cloudera in their fork of Spark fixed this bug as shown 
> [here|https://github.com/cloudera/spark/blob/spark2-2.2.0-cloudera2/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L229].
>  It would make sense for this fix to be merged back upstream.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


