[jira] [Updated] (SPARK-29932) lint-r should do non-zero exit in case of errors

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29932:
--
Component/s: SparkR

> lint-r should do non-zero exit in case of errors
> 
>
> Key: SPARK-29932
> URL: https://issues.apache.org/jira/browse/SPARK-29932
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Updated] (SPARK-29932) lint-r should do non-zero exit in case of errors

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29932:
--
Summary: lint-r should do non-zero exit in case of errors  (was: lint-r 
should do non-zero exit if there is no R installation)

> lint-r should do non-zero exit in case of errors
> 
>
> Key: SPARK-29932
> URL: https://issues.apache.org/jira/browse/SPARK-29932
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Updated] (SPARK-29932) lint-r should do non-zero exit if there is no R installation

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29932:
--
Summary: lint-r should do non-zero exit if there is no R installation  
(was: lint-r should do non-zero exit if there is no R instation)

> lint-r should do non-zero exit if there is no R installation
> 
>
> Key: SPARK-29932
> URL: https://issues.apache.org/jira/browse/SPARK-29932
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Updated] (SPARK-29932) lint-r should do non-zero exit if there is no R instation

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29932:
--
Summary: lint-r should do non-zero exit if there is no R instation  (was: 
lint-r should do non-zero exit in case of error)

> lint-r should do non-zero exit if there is no R instation
> -
>
> Key: SPARK-29932
> URL: https://issues.apache.org/jira/browse/SPARK-29932
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>







[jira] [Created] (SPARK-29932) lint-r should do non-zero exit in case of error

2019-11-16 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29932:
-

 Summary: lint-r should do non-zero exit in case of error
 Key: SPARK-29932
 URL: https://issues.apache.org/jira/browse/SPARK-29932
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.4, 2.3.4, 3.0.0
Reporter: Dongjoon Hyun









[jira] [Assigned] (SPARK-29858) ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29858:
-

Assignee: Hu Fuwang

> ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands
> -
>
> Key: SPARK-29858
> URL: https://issues.apache.org/jira/browse/SPARK-29858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Assignee: Hu Fuwang
>Priority: Major
>







[jira] [Resolved] (SPARK-29858) ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29858.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26551
[https://github.com/apache/spark/pull/26551]

> ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands
> -
>
> Key: SPARK-29858
> URL: https://issues.apache.org/jira/browse/SPARK-29858
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Assignee: Hu Fuwang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Resolved] (SPARK-29378) Make AppVeyor's SparkR with Arrow tests compatible with Arrow R 0.15

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29378.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26555
[https://github.com/apache/spark/pull/26555]

> Make AppVeyor's SparkR with Arrow tests compatible with Arrow R 0.15
> 
>
> Key: SPARK-29378
> URL: https://issues.apache.org/jira/browse/SPARK-29378
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> AppVeyor tests seem failing with Arrow 0.15 - 
> https://github.com/apache/spark/pull/26041
> We should set {{ARROW_PRE_0_15_IPC_FORMAT}} to {{1}}.
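
For reference, a minimal sketch (an assumption for illustration, not the actual change merged in the PR above) of exporting that environment variable to executors through the generic {{spark.executorEnv.*}} mechanism:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: ARROW_PRE_0_15_IPC_FORMAT=1 tells newer Arrow libraries to keep
// writing the pre-0.15 IPC format that older readers still understand.
val spark = SparkSession.builder()
  .appName("arrow-ipc-compat-sketch")
  .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
  .getOrCreate()
{code}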






[jira] [Updated] (SPARK-29378) Upgrade SparkR to use Arrow 0.15 API

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29378:
--
Issue Type: Improvement  (was: Test)

> Upgrade SparkR to use Arrow 0.15 API
> 
>
> Key: SPARK-29378
> URL: https://issues.apache.org/jira/browse/SPARK-29378
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> AppVeyor tests seem failing with Arrow 0.15 - 
> https://github.com/apache/spark/pull/26041
> We should set {{ARROW_PRE_0_15_IPC_FORMAT}} to {{1}}.






[jira] [Updated] (SPARK-29378) Upgrade SparkR to use Arrow 0.15 API

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29378:
--
Summary: Upgrade SparkR to use Arrow 0.15 API  (was: Make AppVeyor's SparkR 
with Arrow tests compatible with Arrow R 0.15)

> Upgrade SparkR to use Arrow 0.15 API
> 
>
> Key: SPARK-29378
> URL: https://issues.apache.org/jira/browse/SPARK-29378
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> AppVeyor tests seem failing with Arrow 0.15 - 
> https://github.com/apache/spark/pull/26041
> We should set {{ARROW_PRE_0_15_IPC_FORMAT}} to {{1}}.






[jira] [Assigned] (SPARK-29378) Make AppVeyor's SparkR with Arrow tests compatible with Arrow R 0.15

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29378:
-

Assignee: Dongjoon Hyun

> Make AppVeyor's SparkR with Arrow tests compatible with Arrow R 0.15
> 
>
> Key: SPARK-29378
> URL: https://issues.apache.org/jira/browse/SPARK-29378
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Dongjoon Hyun
>Priority: Major
>
> AppVeyor tests seem failing with Arrow 0.15 - 
> https://github.com/apache/spark/pull/26041
> We should set {{ARROW_PRE_0_15_IPC_FORMAT}} to {{1}}.






[jira] [Updated] (SPARK-29924) Document Arrow requirement in JDK9+

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29924:
--
Description: 
At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is 
required for the Arrow runtime on JDK9+ environments.

Also, SparkR's minimum Arrow version is now 0.15.1 due to an Arrow source code 
incompatibility. We need to update the R documentation accordingly, e.g. sparkr.md.

  was:At least, we need to mention `io.netty.tryReflectionSetAccessible=true` 
is required for Arrow runtime on JDK9+ environment


> Document Arrow requirement in JDK9+
> ---
>
> Key: SPARK-29924
> URL: https://issues.apache.org/jira/browse/SPARK-29924
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> At least, we need to mention that `io.netty.tryReflectionSetAccessible=true` is 
> required for the Arrow runtime on JDK9+ environments.
> Also, SparkR's minimum Arrow version is now 0.15.1 due to an Arrow source code 
> incompatibility. We need to update the R documentation accordingly, e.g. sparkr.md.
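
As an illustration of the point being documented (a sketch under the assumption that the property is passed via the standard extraJavaOptions configs; in practice the driver-side flag is normally set in spark-defaults.conf or with --conf before the driver JVM starts):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: pass the Netty system property Arrow needs on JDK 9+.
// Setting spark.driver.extraJavaOptions here does not affect an already
// running driver JVM, so submit-time configuration is the reliable route.
val spark = SparkSession.builder()
  .appName("arrow-jdk9-sketch")
  .config("spark.executor.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
  .config("spark.driver.extraJavaOptions", "-Dio.netty.tryReflectionSetAccessible=true")
  .getOrCreate()
{code}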






[jira] [Resolved] (SPARK-29928) Check parsing timestamps up to microsecond precision by JSON/CSV datasource

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29928.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26558
[https://github.com/apache/spark/pull/26558]

> Check parsing timestamps up to microsecond precision by JSON/CSV datasource
> ---
>
> Key: SPARK-29928
> URL: https://issues.apache.org/jira/browse/SPARK-29928
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Port tests added for 2.4 by the commit: 
> https://github.com/apache/spark/commit/9c7e8be1dca8285296f3052c41f35043699d7d10






[jira] [Assigned] (SPARK-29928) Check parsing timestamps up to microsecond precision by JSON/CSV datasource

2019-11-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29928:
-

Assignee: Maxim Gekk

> Check parsing timestamps up to microsecond precision by JSON/CSV datasource
> ---
>
> Key: SPARK-29928
> URL: https://issues.apache.org/jira/browse/SPARK-29928
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Port tests added for 2.4 by the commit: 
> https://github.com/apache/spark/commit/9c7e8be1dca8285296f3052c41f35043699d7d10






[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-16 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975853#comment-16975853
 ] 

Terry Kim commented on SPARK-29890:
---

{code:java}
scala> p1.join(p2, Seq("nums")).printSchema
root
 |-- nums: integer (nullable = false)
 |-- abc: integer (nullable = false)
 |-- abc: integer (nullable = false)
{code}

Note that `fill` takes in column _names_. Thus, the following is ambiguous.

{code:java}
scala> p1.join(p2, Seq("nums")).na.fill(0, Seq("abc"))
org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could be: 
abc, abc.;
  at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
{code}

[~cloud_fan], I think this is an expected behavior. What do you think?
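
One possible workaround (not from the ticket, and assuming `p1` and `p2` are the DataFrames joined above): rename the colliding column on each side before the join so that `fill` can resolve each name to a single attribute. The names `abc_left`/`abc_right` are made up for this sketch.

{code:scala}
// Hypothetical disambiguation before na.fill; the new column names are illustrative.
val left  = p1.withColumnRenamed("abc", "abc_left")
val right = p2.withColumnRenamed("abc", "abc_right")
left.join(right, Seq("nums")).na.fill(0, Seq("abc_left", "abc_right")).show()
{code}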

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3, 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill out na values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{noformat}
> {noformat}
> 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
> error looking up the name of group 820818257: No such file or directory
> org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could 
> be: abc, abc.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
>   ... 54 elided{noformat}
>  






[jira] [Commented] (SPARK-29931) Declare all SQL legacy configs as will be removed in Spark 4.0

2019-11-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975851#comment-16975851
 ] 

Sean R. Owen commented on SPARK-29931:
--

I think it's OK to deprecate them if they're legacy. I don't think we have to 
commit to 4.0, although that's likely. It's conceivable there could be a reason to 
do it later, or sooner.

> Declare all SQL legacy configs as will be removed in Spark 4.0
> --
>
> Key: SPARK-29931
> URL: https://issues.apache.org/jira/browse/SPARK-29931
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Add a sentence to the descriptions of all legacy SQL configs that existed before 
> Spark 3.0: "This config will be removed in Spark 4.0." Here is the list of 
> such configs:
> * spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName
> * spark.sql.legacy.literal.pickMinimumPrecision
> * spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation
> * spark.sql.legacy.sizeOfNull
> * spark.sql.legacy.replaceDatabricksSparkAvro.enabled
> * spark.sql.legacy.setopsPrecedence.enabled
> * spark.sql.legacy.integralDivide.returnBigint
> * spark.sql.legacy.bucketedTableScan.outputOrdering
> * spark.sql.legacy.parser.havingWithoutGroupByAsWhere
> * spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue
> * spark.sql.legacy.setCommandRejectsSparkCoreConfs
> * spark.sql.legacy.utcTimestampFunc.enabled
> * spark.sql.legacy.typeCoercion.datetimeToString
> * spark.sql.legacy.looseUpcast
> * spark.sql.legacy.ctePrecedence.enabled
> * spark.sql.legacy.arrayExistsFollowsThreeValuedLogic






[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on

2019-11-16 Thread koert kuipers (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers updated SPARK-29906:
--
Priority: Minor  (was: Major)

> Reading of csv file fails with adaptive execution turned on
> ---
>
> Key: SPARK-29906
> URL: https://issues.apache.org/jira/browse/SPARK-29906
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: build from master today nov 14
> commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, 
> upstream/master, upstream/HEAD)
> Author: Kevin Yu 
> Date:   Thu Nov 14 14:58:32 2019 -0600
> build using:
> $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn
> deployed on AWS EMR 5.28 with 10 m5.xlarge slaves 
> in spark-env.sh:
> HADOOP_CONF_DIR=/etc/hadoop/conf
> in spark-defaults.conf:
> spark.master yarn
> spark.submit.deployMode client
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.hadoop.yarn.timeline-service.enabled false
> spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.driver.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
> spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.executor.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
>Reporter: koert kuipers
>Priority: Minor
>  Labels: correctness
>
> We observed an issue where Spark seems to mistake a data line (not the first 
> line of the CSV file) for the CSV header when it creates the schema.
> {code}
> $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP
> $ unzip PGYR13_P062819.ZIP
> $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv
> $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf 
> spark.sql.adaptive.enabled=true --num-executors 10
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.
> Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040
> Spark context available as 'sc' (master = yarn, app id = 
> application_1573772077642_0006).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.format("csv").option("header", 
> true).option("enforceSchema", 
> false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1)
> 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> [Stage 2:>(0 + 10) / 
> 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): 
> java.lang.IllegalArgumentException: CSV header does not conform to the schema.
>  Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, 
> Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, 
> Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, 
> Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, 
> Recipient_Primary_Business_Street_Address_Line2, Recipient_City, 
> Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, 
> Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, 
> Physician_License_State_code1, Physician_License_State_code2, 
> Physician_License_State_code3, Physician_License_State_code4, 
> Physician_License_State_code5, 
> Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, 
> Total_Amount_of_Payment_USDollars, Date_of_Payment, 
> Number_of_Payments_Included_in_Total_Amount, 
> Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, 
> City_of_Travel, State_of_Travel, Country_of_Travel, 
> Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, 
> Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, 
> Charity_Indicator, Third_Party_Equals_Covered_Recipient_Indicator, 
> Contextual_Information, 

[jira] [Commented] (SPARK-29931) Declare all SQL legacy configs as will be removed in Spark 4.0

2019-11-16 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975813#comment-16975813
 ] 

Maxim Gekk commented on SPARK-29931:


[~rxin] [~lixiao] [~srowen] [~dongjoon] [~cloud_fan] [~hyukjin.kwon] Does this 
make sense to you?

> Declare all SQL legacy configs as will be removed in Spark 4.0
> --
>
> Key: SPARK-29931
> URL: https://issues.apache.org/jira/browse/SPARK-29931
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Add a sentence to the descriptions of all legacy SQL configs that existed before 
> Spark 3.0: "This config will be removed in Spark 4.0." Here is the list of 
> such configs:
> * spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName
> * spark.sql.legacy.literal.pickMinimumPrecision
> * spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation
> * spark.sql.legacy.sizeOfNull
> * spark.sql.legacy.replaceDatabricksSparkAvro.enabled
> * spark.sql.legacy.setopsPrecedence.enabled
> * spark.sql.legacy.integralDivide.returnBigint
> * spark.sql.legacy.bucketedTableScan.outputOrdering
> * spark.sql.legacy.parser.havingWithoutGroupByAsWhere
> * spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue
> * spark.sql.legacy.setCommandRejectsSparkCoreConfs
> * spark.sql.legacy.utcTimestampFunc.enabled
> * spark.sql.legacy.typeCoercion.datetimeToString
> * spark.sql.legacy.looseUpcast
> * spark.sql.legacy.ctePrecedence.enabled
> * spark.sql.legacy.arrayExistsFollowsThreeValuedLogic






[jira] [Created] (SPARK-29931) Declare all SQL legacy configs as will be removed in Spark 4.0

2019-11-16 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29931:
--

 Summary: Declare all SQL legacy configs as will be removed in 
Spark 4.0
 Key: SPARK-29931
 URL: https://issues.apache.org/jira/browse/SPARK-29931
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Add a sentence to the descriptions of all legacy SQL configs that existed before 
Spark 3.0: "This config will be removed in Spark 4.0." Here is the list of such 
configs (a sketch of the intended wording follows the list):
* spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName
* spark.sql.legacy.literal.pickMinimumPrecision
* spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation
* spark.sql.legacy.sizeOfNull
* spark.sql.legacy.replaceDatabricksSparkAvro.enabled
* spark.sql.legacy.setopsPrecedence.enabled
* spark.sql.legacy.integralDivide.returnBigint
* spark.sql.legacy.bucketedTableScan.outputOrdering
* spark.sql.legacy.parser.havingWithoutGroupByAsWhere
* spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue
* spark.sql.legacy.setCommandRejectsSparkCoreConfs
* spark.sql.legacy.utcTimestampFunc.enabled
* spark.sql.legacy.typeCoercion.datetimeToString
* spark.sql.legacy.looseUpcast
* spark.sql.legacy.ctePrecedence.enabled
* spark.sql.legacy.arrayExistsFollowsThreeValuedLogic
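
A sketch of what the wording could look like for one of the configs, assuming the usual {{buildConf}} pattern inside {{org.apache.spark.sql.internal.SQLConf}}; the doc text and default below are illustrative, not the verbatim current definition:

{code:scala}
// Inside SQLConf (sketch only): append the removal note to the doc string.
val LEGACY_SIZE_OF_NULL = buildConf("spark.sql.legacy.sizeOfNull")
  .doc("If it is set to true, size of null returns -1. " +
    "This config will be removed in Spark 4.0.")
  .booleanConf
  .createWithDefault(false)  // default shown here is illustrative
{code}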






[jira] [Resolved] (SPARK-29871) Flaky test: ImageFileFormatTest.test_read_images

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29871.
--
Resolution: Invalid

You're not really providing any info here. We aren't observing the failure in 
any test builds either. I'm not sure this is actionable.

> Flaky test: ImageFileFormatTest.test_read_images
> 
>
> Key: SPARK-29871
> URL: https://issues.apache.org/jira/browse/SPARK-29871
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> Running tests...
> --
>  test_read_images (pyspark.ml.tests.test_image.ImageFileFormatTest) ... ERROR 
> (12.050s)
> ==
> ERROR [12.050s]: test_read_images 
> (pyspark.ml.tests.test_image.ImageFileFormatTest)
> --
> Traceback (most recent call last):
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/tests/test_image.py",
>  line 35, in test_read_images
>  self.assertEqual(df.count(), 4)
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/dataframe.py",
>  line 507, in count
>  return int(self._jdf.count())
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py",
>  line 1286, in __call__
>  answer, self.gateway_client, self.target_id, self.name)
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/utils.py",
>  line 98, in deco
>  return f(*a, **kw)
>  File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
>  format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling o32.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
> in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 
> (TID 1, amp-jenkins-worker-05.amp, executor driver): 
> javax.imageio.IIOException: Unsupported Image Type
>  at 
> com.sun.imageio.plugins.jpeg.JPEGImageReader.readInternal(JPEGImageReader.java:1079)
>  at 
> com.sun.imageio.plugins.jpeg.JPEGImageReader.read(JPEGImageReader.java:1050)
>  at javax.imageio.ImageIO.read(ImageIO.java:1448)
>  at javax.imageio.ImageIO.read(ImageIO.java:1352)
>  at org.apache.spark.ml.image.ImageSchema$.decode(ImageSchema.scala:134)
>  at 
> org.apache.spark.ml.source.image.ImageFileFormat.$anonfun$buildReader$2(ImageFileFormat.scala:84)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:147)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(generated.java:33)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:63)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
>  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>  at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
>  at 
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>  at org.apache.spark.scheduler.Task.run(Task.scala:127)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
>  at 
> 

[jira] [Commented] (SPARK-29830) PySpark.context.Sparkcontext.binaryfiles improved memory with buffer

2019-11-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975809#comment-16975809
 ] 

Sean R. Owen commented on SPARK-29830:
--

I don't know how you're going to get a stream from the JVM to Python easily 
though, hm.

> PySpark.context.Sparkcontext.binaryfiles improved memory with buffer
> 
>
> Key: SPARK-29830
> URL: https://issues.apache.org/jira/browse/SPARK-29830
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Jörn Franke
>Priority: Major
>
> At the moment, PySpark reads binary files into a byte array directly. This 
> means it reads the full binary file immediately into memory, which is 1) 
> memory-inefficient and 2) differs from the Scala implementation (see PySpark 
> here: 
> https://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/context.html#SparkContext.binaryFiles).
> In Scala, Spark returns a PortableDataStream, which means the application 
> does not need to read the full content of the stream into memory to work on it 
> (see 
> https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext).
>  
> Hence, it is proposed to adapt the PySpark implementation to return something 
> similar to Scala's PortableDataStream (e.g. 
> BytesIO, https://docs.python.org/3/library/io.html#io.BytesIO).
>  
> Reading binary files in an efficient manner is crucial for many IoT 
> applications, but potentially also other fields (e.g. disk image analysis in 
> forensics).
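
For contrast, a minimal sketch of the Scala behaviour the description refers to (illustrative only; the function name, path and chunk size are made up): {{sc.binaryFiles}} yields a {{PortableDataStream}} per file, so a caller can read incrementally instead of materializing whole files.

{code:scala}
import java.io.DataInputStream
import org.apache.spark.SparkContext

// Read only the first n bytes of each binary file rather than loading the
// full content into memory (sketch; error handling kept minimal).
def firstBytes(sc: SparkContext, path: String, n: Int): Array[(String, Array[Byte])] = {
  sc.binaryFiles(path).mapValues { stream =>
    val in: DataInputStream = stream.open()
    try {
      val buf = new Array[Byte](n)
      val read = in.read(buf)
      buf.take(math.max(read, 0))
    } finally {
      in.close()
    }
  }.collect()
}
{code}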






[jira] [Commented] (SPARK-29903) Add documentation for recursiveFileLookup

2019-11-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975807#comment-16975807
 ] 

Sean R. Owen commented on SPARK-29903:
--

Sure, want to open a PR?

> Add documentation for recursiveFileLookup
> -
>
> Key: SPARK-29903
> URL: https://issues.apache.org/jira/browse/SPARK-29903
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively 
> loading data from a source directory. There is currently no documentation for 
> this option.
> We should document this both for the DataFrame API as well as for SQL.
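
A short usage sketch the documentation could build on (the format and path here are hypothetical):

{code:scala}
// With recursiveFileLookup enabled, files in nested subdirectories of the
// source path are loaded as well, and partition discovery is disabled.
val df = spark.read
  .format("json")
  .option("recursiveFileLookup", "true")
  .load("/data/landing")
{code}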






[jira] [Updated] (SPARK-29930) Remove SQL configs declared to be removed in Spark 3.0

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29930:
-
Priority: Minor  (was: Major)

> Remove SQL configs declared to be removed in Spark 3.0
> --
>
> Key: SPARK-29930
> URL: https://issues.apache.org/jira/browse/SPARK-29930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Need to remove the following SQL configs:
> * spark.sql.fromJsonForceNullableSchema
> * spark.sql.legacy.compareDateTimestampInTimestamp






[jira] [Resolved] (SPARK-29878) Improper cache strategies in GraphX

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29878.
--
Resolution: Duplicate

> Improper cache strategies in GraphX
> ---
>
> Key: SPARK-29878
> URL: https://issues.apache.org/jira/browse/SPARK-29878
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> I have run examples.graphx.SSSPExample and looked through the RDD dependency 
> graphs as well as the persist operations. There are some improper cache 
> strategies in GraphX. The same situations also exist when I run 
> ConnectedComponentsExample.
> 1. vertices.cache() and newEdges.cache() are unnecessary
> In SSSPExample, a graph is initialized by GraphImpl.mapVertices(). In this 
> method, a GraphImpl object is created using GraphImpl.apply(vertices, edges), 
> and the RDDs vertices/newEdges are cached in apply(). But these two RDDs are not 
> directly used anymore in SSSPExample (their child RDDs have been cached), so 
> the persists can be unnecessary here. 
> However, the other examples may need these two persists, so I think they 
> cannot be simply removed. It might be hard to fix.
> {code:scala}
>   def apply[VD: ClassTag, ED: ClassTag](
>   vertices: VertexRDD[VD],
>   edges: EdgeRDD[ED]): GraphImpl[VD, ED] = {
> vertices.cache() // It is unnecessary for SSSPExample and 
> ConnectedComponentsExample
> // Convert the vertex partitions in edges to the correct type
> val newEdges = edges.asInstanceOf[EdgeRDDImpl[ED, _]]
>   .mapEdgePartitions((pid, part) => part.withoutVertexAttributes[VD])
>   .cache() // It is unnecessary for SSSPExample and 
> ConnectedComponentsExample
> GraphImpl.fromExistingRDDs(vertices, newEdges)
>   }
> {code}
> 2. Missing persist on newEdges
> SSSPExample will invoke Pregel to do the execution. Pregel will utilize 
> ReplicatedVertexView.upgrade(). I find that RDD newEdges will be directly used 
> by multiple actions in Pregel, so newEdges should be persisted.
> Same as the above issue, this issue is also found in 
> ConnectedComponentsExample. It is also hard to fix, because the added persist 
> may be unnecessary for other examples.
> {code:scala}
> // Pregel.scala
> // compute the messages
> var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) // 
> newEdges is created here
> val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
>   checkpointInterval, graph.vertices.sparkContext)
> messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
> var activeMessages = messages.count() // The first time use newEdges
> ...
> while (activeMessages > 0 && i < maxIterations) {
>   // Receive the messages and update the vertices.
>   prevG = g
>   g = g.joinVertices(messages)(vprog) // Generating g depends on 
> newEdges
>   ...
>   activeMessages = messages.count() // The second action to use newEdges. 
> newEdges should be unpersisted after this instruction.
> {code}
> {code:scala}
> // ReplicatedVertexView.scala
>   def upgrade(vertices: VertexRDD[VD], includeSrc: Boolean, includeDst: 
> Boolean): Unit = {
>   ...
>val newEdges = 
> edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) {
> (ePartIter, shippedVertsIter) => ePartIter.map {
>   case (pid, edgePartition) =>
> (pid, 
> edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator)))
> }
>   })
>   edges = newEdges // newEdges should be persisted
>   hasSrcId = includeSrc
>   hasDstId = includeDst
> }
>   }
> {code}
> As I don't have much knowledge about GraphX, I don't know how to fix these 
> issues well.
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.






[jira] [Resolved] (SPARK-28781) Unneccesary persist in PeriodicCheckpointer.update()

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-28781.
--
Resolution: Not A Problem

I think the point of this class is to manage RDDs that depend on each other, to 
break lineage, etc. They all need to be persisted, so they are not recomputed 
when child RDDs are materialized. My only question here is why it needs to hold 
3 rather than 2, but that's a different issue. 

> Unneccesary persist in PeriodicCheckpointer.update()
> 
>
> Key: SPARK-28781
> URL: https://issues.apache.org/jira/browse/SPARK-28781
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> Once the function _update()_ is called, the RDD _newData_ is persisted at line 
> 82. However, only when the checkpointing condition is met (at line 94) is the 
> persisted RDD _newData_ used a second time, in _checkpoint()_ (the checkpoint 
> at line 97). Otherwise, _newData_ is only used once and persisting it is 
> unnecessary in that case. Although persistedQueue is checked to avoid keeping 
> too much unnecessarily cached data, it would be better to avoid every 
> unnecessary persist operation.
> {code:scala}
> def update(newData: T): Unit = {
> persist(newData)
> persistedQueue.enqueue(newData)
> // We try to maintain 2 Datasets in persistedQueue to support the 
> semantics of this class:
> // Users should call [[update()]] when a new Dataset has been created,
> // before the Dataset has been materialized.
> while (persistedQueue.size > 3) {
>   val dataToUnpersist = persistedQueue.dequeue()
>   unpersist(dataToUnpersist)
> }
> updateCount += 1
> // Handle checkpointing (after persisting)
> if (checkpointInterval != -1 && (updateCount % checkpointInterval) == 0
>   && sc.getCheckpointDir.nonEmpty) {
>   // Add new checkpoint before removing old checkpoints.
>   checkpoint(newData)
> {code}






[jira] [Created] (SPARK-29930) Remove SQL configs declared to be removed in Spark 3.0

2019-11-16 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29930:
--

 Summary: Remove SQL configs declared to be removed in Spark 3.0
 Key: SPARK-29930
 URL: https://issues.apache.org/jira/browse/SPARK-29930
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Need to remove the following SQL configs:
* spark.sql.fromJsonForceNullableSchema
* spark.sql.legacy.compareDateTimestampInTimestamp






[jira] [Resolved] (SPARK-29827) Wrong persist strategy in mllib.clustering.BisectingKMeans.run

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29827.
--
Resolution: Duplicate

Same general answer - it's not clear that persisting is a win here. input is 
assumed to be persisted if the user wants to spend the resources - you actually 
quote the warning there.

These are also not bugs.

> Wrong persist strategy in mllib.clustering.BisectingKMeans.run
> --
>
> Key: SPARK-29827
> URL: https://issues.apache.org/jira/browse/SPARK-29827
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> There are three persist misuses in mllib.clustering.BisectingKMeans.run.
> * First, the RDD {color:#de350b}_input_{color} should be persisted, because 
> it is not only used by the action _first()_ but also by other 
> actions in the following code.
> * Second, the RDD {color:#de350b}_assignments_{color} should be persisted. 
> It is used in the function _summarize()_ more than once, which contains an 
> action on _assignments_.
> * Third, once the RDD {color:#de350b}_assignments_{color} is persisted, 
> persisting the RDD {color:#de350b}_norms_{color} becomes unnecessary, because 
> {color:#de350b}_norms_{color} is an intermediate RDD whose 
> child RDD {color:#de350b}_assignments_{color} is already persisted.
> {code:scala}
>   private[spark] def run(
>   input: RDD[Vector],
>   instr: Option[Instrumentation]): BisectingKMeansModel = {
> if (input.getStorageLevel == StorageLevel.NONE) {
>   logWarning(s"The input RDD ${input.id} is not directly cached, which 
> may hurt performance if"
> + " its parent RDDs are also not cached.")
> }
> // Needs to persist input
> val d = input.map(_.size).first() 
> logInfo(s"Feature dimension: $d.")
> val dMeasure: DistanceMeasure = 
> DistanceMeasure.decodeFromString(this.distanceMeasure)
> // Compute and cache vector norms for fast distance computation.
> val norms = input.map(v => Vectors.norm(v, 
> 2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
> val vectors = input.zip(norms).map { case (x, norm) => new 
> VectorWithNorm(x, norm) }
> var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
> var activeClusters = summarize(d, assignments, dMeasure)
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.






[jira] [Resolved] (SPARK-29856) Conditional unnecessary persist on RDDs in ML algorithms

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29856.
--
Resolution: Duplicate

> Conditional unnecessary persist on RDDs in ML algorithms
> 
>
> Key: SPARK-29856
> URL: https://issues.apache.org/jira/browse/SPARK-29856
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> When I run examples.ml.GradientBoostedTreeRegressorExample, I find that RDD 
> _{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is 
> persisted, but it is only used once, so this persist operation is unnecessary.
> {code:scala}
> val baggedInput = BaggedPoint
>   .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees, 
> withReplacement,
> (tp: TreePoint) => tp.weight, seed = seed)
>   .persist(StorageLevel.MEMORY_AND_DISK)
>   ...
>while (nodeStack.nonEmpty) {
>   ...
>   timer.start("findBestSplits")
>   RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup, 
> nodesForGroup,
> treeToNodeToIndexInfo, splits, nodeStack, timer, nodeIdCache)
>   timer.stop("findBestSplits")
> }
> baggedInput.unpersist()
> {code}
> However, the action on {color:#DE350B}_baggedInput_{color} is in a while 
> loop. 
> In GradientBoostedTreeRegressorExample, this loop only executes once, so only 
> one action uses {color:#DE350B}_baggedInput_{color}.
> In most ML applications, the loop executes many times, which 
> means {color:#DE350B}_baggedInput_{color} is used by many actions, so 
> the persist is necessary in those cases.
> That is why the persist operation is "conditionally" unnecessary.
> Same situations exist in many other ML algorithms, e.g., RDD 
> {color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit(), RDD 
> {color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.






[jira] [Commented] (SPARK-29810) Missing persist on retaggedInput in RandomForest.run()

2019-11-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975800#comment-16975800
 ] 

Sean R. Owen commented on SPARK-29810:
--

Generally speaking, it's not necessarily true that you want to persist 
something that's used more than one time. The cost of persisting might outweigh 
the benefit.

Also generally speaking, we assume the user will persist the input if desired, 
though whether this is checked is pretty inconsistent. The retag operation here 
isn't even a transformation. Based on this I don't think persisting this is a 
win.

> Missing persist on retaggedInput in RandomForest.run()
> --
>
> Key: SPARK-29810
> URL: https://issues.apache.org/jira/browse/SPARK-29810
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Dong Wang
>Priority: Major
>
> The RDD retaggedInput should be persisted in ml.tree.impl.RandomForest.run(), 
> because it is used by more than one action.
> {code:scala}
>   def run(
>   input: RDD[LabeledPoint],
>   strategy: OldStrategy,
>   numTrees: Int,
>   featureSubsetStrategy: String,
>   seed: Long,
>   instr: Option[Instrumentation],
>   prune: Boolean = true, // exposed for testing only, real trees are 
> always pruned
>   parentUID: Option[String] = None): Array[DecisionTreeModel] = {
> val timer = new TimeTracker()
> timer.start("total")
> timer.start("init")
> val retaggedInput = input.retag(classOf[LabeledPoint]) // it needs to be 
> persisted
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.






[jira] [Commented] (SPARK-29832) Unnecessary persist on instances in ml.regression.IsotonicRegression.fit

2019-11-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975799#comment-16975799
 ] 

Sean R. Owen commented on SPARK-29832:
--

[~spark_cachecheck] some of these may be valid, but a lot of them don't appear 
to be. Let's tackle a few before opening the 30 JIRAs you did -- that's very 
noisy. 

run() is going to use (a transform of) this input many times in a loop. I 
don't think this analysis is accurate. You could say that it's more optimal to 
persist stuff closer to where the loop is; sometimes that's better, sometimes 
not, depending on how big and expensive the result is.
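
A self-contained illustration of that trade-off (not code from Spark or from the ticket): persist right where the repeated use starts and release the data as soon as the loop ends.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistNearLoopSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("persist-near-loop-sketch")
      .master("local[*]")
      .getOrCreate()

    // Persist immediately before the loop that reuses the RDD...
    val derived = spark.sparkContext.parallelize(1 to 1000000)
      .map(x => x.toDouble * x)
      .persist(StorageLevel.MEMORY_AND_DISK)
    try {
      for (_ <- 1 to 5) {
        // ...each action here would otherwise recompute `derived` from scratch...
        println(derived.sum())
      }
    } finally {
      // ...and release the cached data once the reuse is over.
      derived.unpersist()
      spark.stop()
    }
  }
}
{code}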

> Unnecessary persist on instances in ml.regression.IsotonicRegression.fit
> 
>
> Key: SPARK-29832
> URL: https://issues.apache.org/jira/browse/SPARK-29832
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> Persisting instances in ml.regression.IsotonicRegression.fit() is 
> unnecessary, because it is only used once, in run(instances).
> {code:scala}
>   override def fit(dataset: Dataset[_]): IsotonicRegressionModel = 
> instrumented { instr =>
> transformSchema(dataset.schema, logging = true)
> // Extract columns from data.  If dataset is persisted, do not persist 
> oldDataset.
> val instances = extractWeightedLabeledPoints(dataset)
> val handlePersistence = dataset.storageLevel == StorageLevel.NONE
> // Unnecessary persist
> if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
> instr.logPipelineStage(this)
> instr.logDataset(dataset)
> instr.logParams(this, labelCol, featuresCol, weightCol, predictionCol, 
> featureIndex, isotonic)
> instr.logNumFeatures(1)
> val isotonicRegression = new 
> MLlibIsotonicRegression().setIsotonic($(isotonic))
> val oldModel = isotonicRegression.run(instances) // Only use once here
> if (handlePersistence) instances.unpersist()
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detect persist()/unpersist() API misuses.






[jira] [Resolved] (SPARK-29760) Document VALUES statement in SQL Reference.

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29760.
--
Resolution: Won't Fix

> Document VALUES statement in SQL Reference.
> ---
>
> Key: SPARK-29760
> URL: https://issues.apache.org/jira/browse/SPARK-29760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>
> spark-sql also supports *VALUES*.
> {code:java}
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three');
> 1   one
> 2   two
> 3   three
> Time taken: 0.015 seconds, Fetched 3 row(s)
> spark-sql>
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2;
> 1   one
> 2   two
> Time taken: 0.014 seconds, Fetched 2 row(s)
> spark-sql>
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2;
> 1   one
> 3   three
> 2   two
> Time taken: 0.153 seconds, Fetched 3 row(s)
> spark-sql>
> {code}
> *VALUES* can even be used along with INSERT INTO or SELECT.
> Refer to: https://www.postgresql.org/docs/current/sql-values.html
> So please confirm whether VALUES should also be documented.






[jira] [Resolved] (SPARK-29765) Monitoring UI throws IndexOutOfBoundsException when accessing metrics of attempt in stage

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29765.
--
Resolution: Not A Problem

> Monitoring UI throws IndexOutOfBoundsException when accessing metrics of 
> attempt in stage
> -
>
> Key: SPARK-29765
> URL: https://issues.apache.org/jira/browse/SPARK-29765
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: Amazon EMR 5.27
>Reporter: Viacheslav Tradunsky
>Priority: Major
>
> When clicking on one of the largest tasks by input, I get to 
> [http://:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0|http://10.207.110.207:20888/proxy/application_1572992299050_0001/stages/stage/?id=74=0]
>  with 500 error
> {code:java}
> java.lang.IndexOutOfBoundsException: 95745 at 
> scala.collection.immutable.Vector.checkRangeConvert(Vector.scala:132) at 
> scala.collection.immutable.Vector.apply(Vector.scala:122) at 
> org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply$mcDJ$sp(AppStatusStore.scala:255)
>  at 
> org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254)
>  at 
> org.apache.spark.status.AppStatusStore$$anonfun$scanTasks$1$1.apply(AppStatusStore.scala:254)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofLong.foreach(ArrayOps.scala:246) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofLong.map(ArrayOps.scala:246) at 
> org.apache.spark.status.AppStatusStore.scanTasks$1(AppStatusStore.scala:254) 
> at 
> org.apache.spark.status.AppStatusStore.taskSummary(AppStatusStore.scala:287) 
> at org.apache.spark.ui.jobs.StagePage.render(StagePage.scala:321) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at 
> org.apache.spark.ui.WebUI$$anonfun$2.apply(WebUI.scala:84) at 
> org.apache.spark.ui.JettyUtils$$anon$3.doGet(JettyUtils.scala:90) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at 
> org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) 
> at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.doFilter(AmIpFilter.java:166)
>  at 
> org.spark_project.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>  at 
> org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>  at 
> org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>  at 
> org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>  at 
> org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
>  at 
> org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
>  at 
> org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>  at org.spark_project.jetty.server.Server.handle(Server.java:539) at 
> org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333) at 
> org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>  at 
> org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>  at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108) 
> at 
> org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>  at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>  at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>  at 
> org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>  at 
> org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>  at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Created] (SPARK-29929) Allow V2 Datasources to require a data distribution

2019-11-16 Thread Andrew K Long (Jira)
Andrew K Long created SPARK-29929:
-

 Summary: Allow V2 Datasources to require a data distribution
 Key: SPARK-29929
 URL: https://issues.apache.org/jira/browse/SPARK-29929
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Andrew K Long


Currently, users are unable to specify that their V2 data source requires a 
particular data distribution before data is inserted into it.
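For illustration only, a minimal Scala sketch of what such a hook could look like on the DSv2 write path. The trait name and its members are hypothetical, not an existing Spark API:

{code:scala}
// Hypothetical mix-in a V2 write builder could implement to declare that
// incoming rows must be clustered by the given columns before the write runs.
trait RequiresDistribution {
  // Column names the planner should cluster (shuffle) the input by.
  def requiredClustering: Seq[String]

  // Optional hint for the number of partitions; values <= 0 mean "no preference".
  def requiredNumPartitions: Int = 0
}
{code}

Spark's planner could then insert a shuffle that satisfies the declared requirement before invoking the writer, similar in spirit to how physical operators declare required child distributions.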



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29476.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 26386
[https://github.com/apache/spark/pull/26386]

> Add tooltip information for Thread Dump links and Thread details table 
> columns in Executors Tab
> ---
>
> Key: SPARK-29476
> URL: https://issues.apache.org/jira/browse/SPARK-29476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: pavithra ramachandran
>Priority: Trivial
> Fix For: 3.1.0
>
>
> I think it is better to have some tooltips in the Executors tab, especially 
> for the Thread Dump link (tooltips are already added for most of the other 
> columns), to explain what it means, e.g. *thread dump for executors and 
> drivers*.
> After clicking the thread dump link, the next page contains the *search* box 
> and the *thread details table*.
> On this page, also add a *tooltip* for *Search* that mentions what it 
> searches (the table content, including stack trace details), and *tooltips* 
> for the *thread table column headings* for better understanding.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29476:
-
Fix Version/s: (was: 3.1.0)
   3.0.0

> Add tooltip information for Thread Dump links and Thread details table 
> columns in Executors Tab
> ---
>
> Key: SPARK-29476
> URL: https://issues.apache.org/jira/browse/SPARK-29476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: pavithra ramachandran
>Priority: Trivial
> Fix For: 3.0.0
>
>
> I think it is better to have some tooltips in the Executors tab, especially 
> for the Thread Dump link (tooltips are already added for most of the other 
> columns), to explain what it means, e.g. *thread dump for executors and 
> drivers*.
> After clicking the thread dump link, the next page contains the *search* box 
> and the *thread details table*.
> On this page, also add a *tooltip* for *Search* that mentions what it 
> searches (the table content, including stack trace details), and *tooltips* 
> for the *thread table column headings* for better understanding.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29476) Add tooltip information for Thread Dump links and Thread details table columns in Executors Tab

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29476:


Assignee: pavithra ramachandran

> Add tooltip information for Thread Dump links and Thread details table 
> columns in Executors Tab
> ---
>
> Key: SPARK-29476
> URL: https://issues.apache.org/jira/browse/SPARK-29476
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: pavithra ramachandran
>Priority: Trivial
>
> I think it is better to have some tooltips in the Executors tab, especially 
> for the Thread Dump link (tooltips are already added for most of the other 
> columns), to explain what it means, e.g. *thread dump for executors and 
> drivers*.
> After clicking the thread dump link, the next page contains the *search* box 
> and the *thread details table*.
> On this page, also add a *tooltip* for *Search* that mentions what it 
> searches (the table content, including stack trace details), and *tooltips* 
> for the *thread table column headings* for better understanding.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22236) CSV I/O: does not respect RFC 4180

2019-11-16 Thread Santhosh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975776#comment-16975776
 ] 

Santhosh commented on SPARK-22236:
--

The code mentioned above,
spark.read.option('escape', '"').csv('testfile.csv').collect()

errors out with "RuntimeException: escape cannot be more than one character".

I am not sure why " (a double quote) is considered more than one character. 
Please help/suggest/correct.

(My Spark version is 2.4.3)
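For reference, a minimal sketch of the same read through the Scala API in spark-shell (file name as in the description); the exception quoted above suggests the value that actually reached the option was longer than a single character:

{code:scala}
// Set both quote and escape to a single double-quote character so embedded
// quotes are read RFC 4180 style.
val df = spark.read
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("testfile.csv")
df.show(false)
{code}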

> CSV I/O: does not respect RFC 4180
> --
>
> Key: SPARK-22236
> URL: https://issues.apache.org/jira/browse/SPARK-22236
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Ondrej Kokes
>Priority: Minor
>
> When reading or writing CSV files with Spark, double quotes are escaped with 
> a backslash by default. However, the appropriate behaviour as set out by RFC 
> 4180 (and adhered to by many software packages) is to escape using a second 
> double quote.
> This piece of Python code demonstrates the issue
> {code}
> import csv
> with open('testfile.csv', 'w') as f:
> cw = csv.writer(f)
> cw.writerow(['a 2.5" drive', 'another column'])
> cw.writerow(['a "quoted" string', '"quoted"'])
> cw.writerow([1,2])
> with open('testfile.csv') as f:
> print(f.read())
> # "a 2.5"" drive",another column
> # "a ""quoted"" string","""quoted"""
> # 1,2
> spark.read.csv('testfile.csv').collect()
> # [Row(_c0='"a 2.5"" drive"', _c1='another column'),
> #  Row(_c0='"a ""quoted"" string"', _c1='"""quoted"""'),
> #  Row(_c0='1', _c1='2')]
> # explicitly stating the escape character fixed the issue
> spark.read.option('escape', '"').csv('testfile.csv').collect()
> # [Row(_c0='a 2.5" drive', _c1='another column'),
> #  Row(_c0='a "quoted" string', _c1='"quoted"'),
> #  Row(_c0='1', _c1='2')]
> {code}
> The same applies to writes, where reading the file written by Spark may 
> result in garbage.
> {code}
> df = spark.read.option('escape', '"').csv('testfile.csv') # reading the file 
> correctly
> df.write.format("csv").save('testout.csv')
> with open('testout.csv/part-csv') as f:
> cr = csv.reader(f)
> print(next(cr))
> print(next(cr))
> # ['a 2.5\\ drive"', 'another column']
> # ['a \\quoted\\" string"', '\\quoted\\""']
> {code}
> The culprit is in 
> [CSVOptions.scala|https://github.com/apache/spark/blob/7d0a3ef4ced9684457ad6c5924c58b95249419e1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L91],
>  where the default escape character is overridden.
> While it's possible to work with CSV files in a "compatible" manner, it would 
> be useful if Spark had sensible defaults that conform to the above-mentioned 
> RFC (as well as W3C recommendations). I realise this would be a breaking 
> change and thus if accepted, it would probably need to result in a warning 
> first, before moving to a new default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29928) Check parsing timestamps up to microsecond precision by JSON/CSV datasource

2019-11-16 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29928:
--

 Summary: Check parsing timestamps up to microsecond precision by 
JSON/CSV datasource
 Key: SPARK-29928
 URL: https://issues.apache.org/jira/browse/SPARK-29928
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Port tests added for 2.4 by the commit: 
https://github.com/apache/spark/commit/9c7e8be1dca8285296f3052c41f35043699d7d10



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29818) Missing persist on RDD

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29818.
--
Fix Version/s: (was: 3.0.0)
   Resolution: Not A Problem

> Missing persist on RDD
> --
>
> Key: SPARK-29818
> URL: https://issues.apache.org/jira/browse/SPARK-29818
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Aman Omer
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29890) Unable to fill na with 0 with duplicate columns

2019-11-16 Thread sandeshyapuram (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975733#comment-16975733
 ] 

sandeshyapuram commented on SPARK-29890:


[~imback82] This happens even for a normal join:
{noformat}
val p1 = spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", 
"abc")
val p2 = spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", 
"abc")
p1.join(p2, Seq("nums"), "left")
.na.fill(0).show
{noformat}
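Not part of the fix, but a possible workaround sketch assuming a spark-shell session (so the implicits for toDF are in scope): rename the duplicated column on one side before the join, so na.fill no longer hits an ambiguous reference.

{code:scala}
// Workaround sketch: avoid duplicate column names before calling na.fill.
val p1 = spark.sparkContext.parallelize(Seq((1, 2), (3, 4), (5, 6))).toDF("nums", "abc")
val p2 = spark.sparkContext.parallelize(Seq((1, 2), (3, 4), (5, 6))).toDF("nums", "abc")

p1.join(p2.withColumnRenamed("abc", "abc_right"), Seq("nums"), "left")
  .na.fill(0)
  .show()
{code}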

> Unable to fill na with 0 with duplicate columns
> ---
>
> Key: SPARK-29890
> URL: https://issues.apache.org/jira/browse/SPARK-29890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.3, 2.4.3
>Reporter: sandeshyapuram
>Priority: Major
>
> Trying to fill NA values with 0.
> {noformat}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val parent = 
> spark.sparkContext.parallelize(Seq((1,2),(3,4),(5,6))).toDF("nums", "abc")
> val c1 = parent.filter(lit(true))
> val c2 = parent.filter(lit(true))
> c1.join(c2, Seq("nums"), "left")
> .na.fill(0).show{noformat}
> {noformat}
> 9/11/14 04:24:24 ERROR org.apache.hadoop.security.JniBasedUnixGroupsMapping: 
> error looking up the name of group 820818257: No such file or directory
> org.apache.spark.sql.AnalysisException: Reference 'abc' is ambiguous, could 
> be: abc, abc.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:117)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:220)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1246)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:443)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:500)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions$$anonfun$7.apply(DataFrameNaFunctions.scala:492)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fillValue(DataFrameNaFunctions.scala:492)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:171)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:155)
>   at 
> org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:134)
>   ... 54 elided{noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29606) Improve EliminateOuterJoin performance

2019-11-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975732#comment-16975732
 ] 

Yuming Wang commented on SPARK-29606:
-

Our production (Spark 2.3):

{noformat}
=== Metrics of Analyzer/Optimizer Rules ===
Total number of runs: 2226
Total time: 68.124998296 seconds

Rule
  Effective Time / Total Time Effective Runs / Total Runs

org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin  
  0 / 36896767130 0 / 13
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
  15895801138 / 15897389751   8 / 33
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions 
  0 / 10059337284 0 / 9
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions
  2305679490 / 2307334411 1 / 2
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences   
  374062283 / 435074227   12 / 33
org.apache.spark.sql.execution.datasources.DataSourceAnalysis   
  0 / 362099024   0 / 9
org.apache.spark.sql.execution.datasources.FindDataSourceTable  
  224836671 / 226671823   6 / 33
org.apache.spark.sql.catalyst.analysis.DecimalPrecision 
  93147539 / 1320314244 / 33
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions
  42449100 / 95458987 8 / 33
org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings  
  52303790 / 93020492 2 / 33
org.apache.spark.sql.catalyst.analysis.ResolveTimeZone  
  74463214 / 89250928 10 / 33
org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion
  48345991 / 87081245 2 / 33
org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion  
  45851578 / 85275827 2 / 33
org.apache.spark.sql.catalyst.optimizer.ColumnPruning   
  24345686 / 76831540 1 / 15


{noformat}
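For context, a sketch of how a report like the one above can be printed from spark-shell, assuming the internal helper RuleExecutor.dumpTimeSpent(), which produces the "Metrics of Analyzer/Optimizer Rules" text in recent 2.x/3.x builds:

{code:scala}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

// Run some workload first so the analyzer/optimizer rules are exercised,
// then dump the accumulated per-rule timings.
spark.range(10).selectExpr("id * 2").collect()
println(RuleExecutor.dumpTimeSpent())
{code}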


> Improve EliminateOuterJoin performance
> --
>
> Key: SPARK-29606
> URL: https://issues.apache.org/jira/browse/SPARK-29606
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:scala}
> spark.sql(
>   """
> |CREATE TABLE `big_table1`(`adj_type_id` tinyint, `byr_cntry_id` 
> decimal(4,0), `sap_category_id` decimal(9,0), `lstg_site_id` decimal(9,0), 
> `lstg_type_code` decimal(4,0), `offrd_slng_chnl_grp_id` smallint, 
> `slr_cntry_id` decimal(4,0), `sold_slng_chnl_grp_id` smallint, 
> `bin_lstg_yn_id` tinyint, `bin_sold_yn_id` tinyint, `lstg_curncy_id` 
> decimal(4,0), `blng_curncy_id` decimal(4,0), `bid_count` decimal(18,0), 
> `ck_trans_count` decimal(18,0), `ended_bid_count` decimal(18,0), 
> `new_lstg_count` decimal(18,0), `ended_lstg_count` decimal(18,0), 
> `ended_success_lstg_count` decimal(18,0), `item_sold_count` decimal(18,0), 
> `gmv_us_amt` decimal(18,2), `gmv_byr_lc_amt` decimal(18,2), `gmv_slr_lc_amt` 
> decimal(18,2), `gmv_lstg_curncy_amt` decimal(18,2), `gmv_us_m_amt` 
> decimal(18,2), `rvnu_insrtn_fee_us_amt` decimal(18,6), 
> `rvnu_insrtn_fee_lc_amt` decimal(18,6), `rvnu_insrtn_fee_bc_amt` 
> decimal(18,6), `rvnu_insrtn_fee_us_m_amt` decimal(18,6), 
> `rvnu_insrtn_crd_us_amt` decimal(18,6), `rvnu_insrtn_crd_lc_amt` 
> decimal(18,6), `rvnu_insrtn_crd_bc_amt` decimal(18,6), 
> `rvnu_insrtn_crd_us_m_amt` decimal(18,6), `rvnu_fetr_fee_us_amt` 
> decimal(18,6), `rvnu_fetr_fee_lc_amt` decimal(18,6), `rvnu_fetr_fee_bc_amt` 
> decimal(18,6), `rvnu_fetr_fee_us_m_amt` decimal(18,6), `rvnu_fetr_crd_us_amt` 
> decimal(18,6), `rvnu_fetr_crd_lc_amt` decimal(18,6), `rvnu_fetr_crd_bc_amt` 
> decimal(18,6), `rvnu_fetr_crd_us_m_amt` decimal(18,6), `rvnu_fv_fee_us_amt` 
> decimal(18,6), `rvnu_fv_fee_slr_lc_amt` decimal(18,6), 
> `rvnu_fv_fee_byr_lc_amt` decimal(18,6), `rvnu_fv_fee_bc_amt` decimal(18,6), 
> `rvnu_fv_fee_us_m_amt` decimal(18,6), `rvnu_fv_crd_us_amt` decimal(18,6), 
> `rvnu_fv_crd_byr_lc_amt` decimal(18,6), `rvnu_fv_crd_slr_lc_amt` 
> decimal(18,6), `rvnu_fv_crd_bc_amt` decimal(18,6), `rvnu_fv_crd_us_m_amt` 
> decimal(18,6), `rvnu_othr_l_fee_us_amt` decimal(18,6), 
> `rvnu_othr_l_fee_lc_amt` decimal(18,6), `rvnu_othr_l_fee_bc_amt` 
> decimal(18,6), `rvnu_othr_l_fee_us_m_amt` decimal(18,6), 
> `rvnu_othr_l_crd_us_amt` 

[jira] [Updated] (SPARK-29904) Parse timestamps in microsecond precision by JSON/CSV datasources

2019-11-16 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-29904:
---
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3

> Parse timestamps in microsecond precision by JSON/CSV datasources
> -
>
> Key: SPARK-29904
> URL: https://issues.apache.org/jira/browse/SPARK-29904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.5
>
>
> Currently, Spark can parse timestamp strings from JSON/CSV only with 
> millisecond precision, while internally timestamps have microsecond 
> precision. The ticket aims to modify the parsing logic in Spark 2.4 to 
> support microsecond precision. Porting DateFormatter/TimestampFormatter from 
> Spark 3.0-preview is risky, so we need to find another, lighter solution.
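To make the target concrete, a small spark-shell sketch (Scala API) of the kind of input involved, using the existing "timestampFormat" option; before this change the sub-millisecond digits may be lost or misread:

{code:scala}
import org.apache.spark.sql.types._

// A JSON record whose timestamp carries a microsecond fraction, read with an
// explicit pattern via the "timestampFormat" option.
val schema = new StructType().add("ts", TimestampType)
val ds = Seq("""{"ts":"2019-11-16 12:34:56.123456"}""").toDS()
spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")
  .json(ds)
  .show(false)
{code}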



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2019-11-16 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975697#comment-16975697
 ] 

Maxim Gekk commented on SPARK-29927:


[~cloud_fan] WDYT, does it make sense to change the functions as well?

> Parse timestamps in microsecond precision by `to_timestamp`, 
> `to_unix_timestamp`, `unix_timestamp`
> --
>
> Key: SPARK-29927
> URL: https://issues.apache.org/jira/browse/SPARK-29927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, the `to_timestamp`, `to_unix_timestamp`, and `unix_timestamp` 
> functions use SimpleDateFormat to parse strings into timestamps. 
> SimpleDateFormat can parse only to millisecond precision when a user 
> specifies `SSS` in a pattern. The ticket aims to support parsing up to 
> microsecond precision.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29927) Parse timestamps in microsecond precision by `to_timestamp`, `to_unix_timestamp`, `unix_timestamp`

2019-11-16 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29927:
--

 Summary: Parse timestamps in microsecond precision by 
`to_timestamp`, `to_unix_timestamp`, `unix_timestamp`
 Key: SPARK-29927
 URL: https://issues.apache.org/jira/browse/SPARK-29927
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Maxim Gekk


Currently, the `to_timestamp`, `to_unix_timestamp`, and `unix_timestamp` functions 
use SimpleDateFormat to parse strings into timestamps. SimpleDateFormat can parse 
only to millisecond precision when a user specifies `SSS` in a pattern. The ticket 
aims to support parsing up to microsecond precision.
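A small spark-shell sketch of the kind of call the ticket targets (the pattern and the literal are just examples); with SimpleDateFormat-based parsing, the fractional part below cannot be handled at true microsecond precision:

{code:scala}
// to_timestamp with a microsecond fraction in both the input and the pattern.
spark.sql(
  "SELECT to_timestamp('2019-11-16 12:34:56.123456', 'yyyy-MM-dd HH:mm:ss.SSSSSS')"
).show(false)
{code}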



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29923) Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+

2019-11-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29923:
-
Docs Text: 
Spark applications running on JDK 9 or later must set the system property 
{{io.netty.tryReflectionSetAccessible}} to true. 
NOTE: remove this release note if we later find a way to work around this.
   Labels: release-notes  (was: )
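For example, one way to set the property for both the driver and executors is via the standard extraJavaOptions configs (a sketch; the remaining spark-submit arguments are elided):

{code}
./bin/spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  ...
{code}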

> Set `io.netty.tryReflectionSetAccessible` for Arrow on JDK9+
> 
>
> Key: SPARK-29923
> URL: https://issues.apache.org/jira/browse/SPARK-29923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29925) Maven Build fails with Hadoop Version 3.2.0

2019-11-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975681#comment-16975681
 ] 

Yuming Wang edited comment on SPARK-29925 at 11/16/19 11:50 AM:


You should build with the {{hadoop-3.2}} profile. 
https://github.com/apache/spark/commit/90c64ea4194ed7d5e1b315b3287f64dc661c8963#diff-e700812356511df02cda7d3ccd38ca02


was (Author: q79969786):
You should  build with {{hadoop-3.2}} profile. 
https://github.com/apache/spark/blob/f77c10de38d0563b2e42d1200a1fbbdb3018c2e9/pom.xml#L2919-L2942

> Maven Build fails with Hadoop Version 3.2.0
> ---
>
> Key: SPARK-29925
> URL: https://issues.apache.org/jira/browse/SPARK-29925
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
> Environment: The build was tested in two environments. The first was 
> Debian 10 running OpenJDK 11 with Scala 2.12. The second was Debian 9.1 with 
> OpenJDK 8 and Scala 2.12.
> The same error occurred in both environments. 
> Both environments used Linux kernel 4.19. Both environments were VirtualBox 
> VMs running on a MacBook. 
>Reporter: Douglas Colkitt
>Priority: Minor
>
> Build fails at Spark Core stage when using Maven with specified Hadoop 
> version 3.2. The build command run is:
> {code:java}
> ./build/mvn -DskipTests -Dhadoop.version=3.2.0 package
> {code}
> The build error output is
> {code:java}
> [INFO] 
> [INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
> spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiling 262 Scala sources and 27 Java sources to 
> /usr/local/src/spark/core/target/scala-2.12/test-classes ...
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
>  object lang is not a member of package org.apache.commons
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
>  not found: value SerializationUtils
> [ERROR] two errors found{code}
> The problem does _not_ occur when building without Hadoop package 
> specification, i.e. when running:
> {code:java}
> ./build/mvn -DskipTests package
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29925) Maven Build fails with Hadoop Version 3.2.0

2019-11-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-29925.
-
Resolution: Invalid

> Maven Build fails with Hadoop Version 3.2.0
> ---
>
> Key: SPARK-29925
> URL: https://issues.apache.org/jira/browse/SPARK-29925
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
> Environment: The build was tested in two environments. The first was 
> Debian 10 running OpenJDK 11 with Scala 2.12. The second was Debian 9.1 with 
> OpenJDK 8 and Scala 2.12.
> The same error occurred in both environments. 
> Both environments used Linux kernel 4.19. Both environments were VirtualBox 
> VMs running on a MacBook. 
>Reporter: Douglas Colkitt
>Priority: Minor
>
> Build fails at Spark Core stage when using Maven with specified Hadoop 
> version 3.2. The build command run is:
> {code:java}
> ./build/mvn -DskipTests -Dhadoop.version=3.2.0 package
> {code}
> The build error output is
> {code:java}
> [INFO] 
> [INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
> spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiling 262 Scala sources and 27 Java sources to 
> /usr/local/src/spark/core/target/scala-2.12/test-classes ...
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
>  object lang is not a member of package org.apache.commons
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
>  not found: value SerializationUtils
> [ERROR] two errors found{code}
> The problem does _not_ occur when building without Hadoop package 
> specification, i.e. when running:
> {code:java}
> ./build/mvn -DskipTests package
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29925) Maven Build fails with Hadoop Version 3.2.0

2019-11-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975681#comment-16975681
 ] 

Yuming Wang commented on SPARK-29925:
-

You should build with the {{hadoop-3.2}} profile. 
https://github.com/apache/spark/blob/f77c10de38d0563b2e42d1200a1fbbdb3018c2e9/pom.xml#L2919-L2942
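For example, the invocation with that profile would look something like this (a sketch; the hadoop-3.2 profile in the linked pom.xml pins the matching Hadoop 3.2.x version, so passing -Dhadoop.version separately is usually unnecessary):

{code:java}
./build/mvn -Phadoop-3.2 -DskipTests package
{code}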

> Maven Build fails with Hadoop Version 3.2.0
> ---
>
> Key: SPARK-29925
> URL: https://issues.apache.org/jira/browse/SPARK-29925
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
> Environment: The build was tested in two environments. The first was 
> Debian 10 running OpenJDK 11 with Scala 2.12. The second was Debian 9.1 with 
> OpenJDK 8 and Scala 2.12.
> The same error occurred in both environments. 
> Both environments used Linux kernel 4.19. Both environments were VirtualBox 
> VMs running on a MacBook. 
>Reporter: Douglas Colkitt
>Priority: Minor
>
> Build fails at Spark Core stage when using Maven with specified Hadoop 
> version 3.2. The build command run is:
> {code:java}
> ./build/mvn -DskipTests -Dhadoop.version=3.2.0 package
> {code}
> The build error output is
> {code:java}
> [INFO] 
> [INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
> spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiling 262 Scala sources and 27 Java sources to 
> /usr/local/src/spark/core/target/scala-2.12/test-classes ...
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
>  object lang is not a member of package org.apache.commons
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
>  not found: value SerializationUtils
> [ERROR] two errors found{code}
> The problem does _not_ occur when building without Hadoop package 
> specification, i.e. when running:
> {code:java}
> ./build/mvn -DskipTests package
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29926) interval `1. second` should be invalid as PostgreSQL

2019-11-16 Thread Kent Yao (Jira)
Kent Yao created SPARK-29926:


 Summary: interval `1. second` should be invalid as PostgreSQL
 Key: SPARK-29926
 URL: https://issues.apache.org/jira/browse/SPARK-29926
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


Spark 

{code:sql}
-- !query 134
select interval '1. second'
-- !query 134 schema
struct<1 seconds:interval>
-- !query 134 output
1 seconds


-- !query 135
select cast('1. second' as interval)
-- !query 135 schema
struct
-- !query 135 output
1 seconds
{code}

PostgreSQL

{code:sql}
postgres=# select interval '1. seconds';
ERROR:  invalid input syntax for type interval: "1. seconds"
LINE 1: select interval '1. seconds';
{code}







--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29926) interval `1. second` should be invalid as PostgreSQL

2019-11-16 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975657#comment-16975657
 ] 

Kent Yao commented on SPARK-29926:
--

working on this

> interval `1. second` should be invalid as PostgreSQL
> 
>
> Key: SPARK-29926
> URL: https://issues.apache.org/jira/browse/SPARK-29926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Minor
>
> Spark 
> {code:sql}
> -- !query 134
> select interval '1. second'
> -- !query 134 schema
> struct<1 seconds:interval>
> -- !query 134 output
> 1 seconds
> -- !query 135
> select cast('1. second' as interval)
> -- !query 135 schema
> struct
> -- !query 135 output
> 1 seconds
> {code}
> PostgreSQL
> {code:sql}
> postgres=# select interval '1. seconds';
> ERROR:  invalid input syntax for type interval: "1. seconds"
> LINE 1: select interval '1. seconds';
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29807) Rename "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled"

2019-11-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29807:
---

Assignee: Yuanjian Li

> Rename "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled"
> -
>
> Key: SPARK-29807
> URL: https://issues.apache.org/jira/browse/SPARK-29807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> The relation between "spark.sql.ansi.enabled" and "spark.sql.dialect" is 
> confusing, since the "PostgreSQL" dialect should contain the features of 
> "spark.sql.ansi.enabled".
> To make things clearer, we can rename "spark.sql.ansi.enabled" to 
> "spark.sql.dialect.spark.ansi.enabled", so that the option 
> "spark.sql.dialect.spark.ansi.enabled" applies only to the Spark dialect.
> For casting and arithmetic operations, runtime exceptions should be 
> thrown if "spark.sql.dialect" is "spark" and 
> "spark.sql.dialect.spark.ansi.enabled" is true, or if "spark.sql.dialect" is 
> PostgreSQL.
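A tiny sketch of the proposed combination (option names as proposed in this ticket; they may differ from what finally ships):

{code:scala}
// Enable ANSI behavior only under the Spark dialect, per the proposed naming.
spark.conf.set("spark.sql.dialect", "spark")
spark.conf.set("spark.sql.dialect.spark.ansi.enabled", "true")
{code}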



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29925) Maven Build fails with Hadoop Version 3.2.0

2019-11-16 Thread Douglas Colkitt (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Douglas Colkitt updated SPARK-29925:

Description: 
Build fails at Spark Core stage when using Maven with specified Hadoop version 
3.2. The build command run is:
{code:java}
./build/mvn -DskipTests -Dhadoop.version=3.2.0 package
{code}
The build error output is
{code:java}
[INFO] 
[INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiling 262 Scala sources and 27 Java sources to 
/usr/local/src/spark/core/target/scala-2.12/test-classes ...
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
 object lang is not a member of package org.apache.commons
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
 not found: value SerializationUtils
[ERROR] two errors found{code}
The problem does _not_ occur when building without Hadoop package 
specification, i.e. when running:
{code:java}
./build/mvn -DskipTests package
{code}
 

  was:
Build fails at Spark Core stage when using Maven with specified Hadoop Cloud 
package. The build command run is:
{code:java}
./build/mvn -DskipTests -Dhadoop.version=3.2.0 package
{code}
The build error output is
{code:java}
[INFO] 
[INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiling 262 Scala sources and 27 Java sources to 
/usr/local/src/spark/core/target/scala-2.12/test-classes ...
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
 object lang is not a member of package org.apache.commons
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
 not found: value SerializationUtils
[ERROR] two errors found{code}
The problem does _not_ occur when building without Hadoop package 
specification, i.e. when running:
{code:java}
./build/mvn -DskipTests package
{code}
 


> Maven Build fails with Hadoop Version 3.2.0
> ---
>
> Key: SPARK-29925
> URL: https://issues.apache.org/jira/browse/SPARK-29925
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
> Environment: The build was tested in two environments. The first was 
> Debian 10 running OpenJDK 11 with Scala 2.12. The second was Debian 9.1 with 
> OpenJDK 8 and Scala 2.12.
> The same error occurred in both environments. 
> Both environments used Linux kernel 4.19. Both environments were VirtualBox 
> VMs running on a MacBook. 
>Reporter: Douglas Colkitt
>Priority: Minor
>
> Build fails at Spark Core stage when using Maven with specified Hadoop 
> version 3.2. The build command run is:
> {code:java}
> ./build/mvn -DskipTests -Dhadoop.version=3.2.0 package
> {code}
> The build error output is
> {code:java}
> [INFO] 
> [INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
> spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiling 262 Scala sources and 27 Java sources to 
> /usr/local/src/spark/core/target/scala-2.12/test-classes ...
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
>  object lang is not a member of package org.apache.commons
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
>  not found: value SerializationUtils
> [ERROR] two errors found{code}
> The problem does _not_ occur when building without Hadoop package 
> specification, i.e. when running:
> {code:java}
> ./build/mvn -DskipTests package
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29925) Maven Build fails with Hadoop Version 3.2.0

2019-11-16 Thread Douglas Colkitt (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Douglas Colkitt updated SPARK-29925:

Description: 
Build fails at Spark Core stage when using Maven with specified Hadoop Cloud 
package. The build command run is:
{code:java}
./build/mvn -DskipTests -Dhadoop.version=3.2.0 package
{code}
The build error output is
{code:java}
[INFO] 
[INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiling 262 Scala sources and 27 Java sources to 
/usr/local/src/spark/core/target/scala-2.12/test-classes ...
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
 object lang is not a member of package org.apache.commons
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
 not found: value SerializationUtils
[ERROR] two errors found{code}
The problem does _not_ occur when building without Hadoop package 
specification, i.e. when running:
{code:java}
./build/mvn -DskipTests package
{code}
 

  was:
Build fails at Spark Core stage when using Maven with specified Hadoop Cloud 
package. The build command run is:
{code:java}
./build/mvn -DskipTests -Phadoop-cloud -Dhadoop.version=3.2.0 package
{code}
The build error output is
{code:java}
[INFO] 
[INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiling 262 Scala sources and 27 Java sources to 
/usr/local/src/spark/core/target/scala-2.12/test-classes ...
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
 object lang is not a member of package org.apache.commons
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
 not found: value SerializationUtils
[ERROR] two errors found{code}
The problem does _not_ occur when building without Hadoop package 
specification, i.e. when running:
{code:java}
./build/mvn -DskipTests package
{code}
 

Summary: Maven Build fails with Hadoop Version 3.2.0  (was: Maven Build 
fails with flag: -Phadoop-cloud )

> Maven Build fails with Hadoop Version 3.2.0
> ---
>
> Key: SPARK-29925
> URL: https://issues.apache.org/jira/browse/SPARK-29925
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.1.0
> Environment: The build was tested in two environments. The first was 
> Debian 10 running OpenJDK 11 with Scala 2.12. The second was Debian 9.1 with 
> OpenJDK 8 and Scala 2.12.
> The same error occurred in both environments. 
> Both environments used Linux kernel 4.19. Both environments were VirtualBox 
> VMs running on a MacBook. 
>Reporter: Douglas Colkitt
>Priority: Minor
>
> Build fails at Spark Core stage when using Maven with specified Hadoop Cloud 
> package. The build command run is:
> {code:java}
> ./build/mvn -DskipTests -Dhadoop.version=3.2.0 package
> {code}
> The build error output is
> {code:java}
> [INFO] 
> [INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
> spark-core_2.12 ---
> [INFO] Using incremental compilation using Mixed compile order
> [INFO] Compiling 262 Scala sources and 27 Java sources to 
> /usr/local/src/spark/core/target/scala-2.12/test-classes ...
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
>  object lang is not a member of package org.apache.commons
> [ERROR] [Error] 
> /usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
>  not found: value SerializationUtils
> [ERROR] two errors found{code}
> The problem does _not_ occur when building without Hadoop package 
> specification, i.e. when running:
> {code:java}
> ./build/mvn -DskipTests package
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29925) Maven Build fails with flag: -Phadoop-cloud

2019-11-16 Thread Douglas Colkitt (Jira)
Douglas Colkitt created SPARK-29925:
---

 Summary: Maven Build fails with flag: -Phadoop-cloud 
 Key: SPARK-29925
 URL: https://issues.apache.org/jira/browse/SPARK-29925
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.1.0
 Environment: The build was tested in two environments. The first was 
Debian 10 running OpenJDK 11 with Scala 2.12. The second was Debian 9.1 with 
OpenJDK 8 and Scala 2.12.

The same error occurred in both environments. 

Both environments used Linux kernel 4.19. Both environments were VirtualBox VMs 
running on a MacBook. 
Reporter: Douglas Colkitt


Build fails at Spark Core stage when using Maven with specified Hadoop Cloud 
package. The build command run is:
{code:java}
./build/mvn -DskipTests -Phadoop-cloud -Dhadoop.version=3.2.0 package
{code}
The build error output is
{code:java}
[INFO] 
[INFO] --- scala-maven-plugin:4.2.0:testCompile (scala-test-compile-first) @ 
spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiling 262 Scala sources and 27 Java sources to 
/usr/local/src/spark/core/target/scala-2.12/test-classes ...
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23:
 object lang is not a member of package org.apache.commons
[ERROR] [Error] 
/usr/local/src/spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49:
 not found: value SerializationUtils
[ERROR] two errors found{code}
The problem does _not_ occur when building without Hadoop package 
specification, i.e. when running:
{code:java}
./build/mvn -DskipTests package
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org