[jira] [Closed] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-16 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao closed SPARK-25391.


> Make behaviors consistent when converting parquet hive table to parquet data 
> source
> ---
>
> Key: SPARK-25391
> URL: https://issues.apache.org/jira/browse/SPARK-25391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> parquet data source tables and hive parquet tables behave differently in 
> parquet field resolution. So, when 
> {{spark.sql.hive.convertMetastoreParquet}} is true, users might face 
> inconsistent behaviors. The differences are:
>  * Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, 
> neither data source tables nor hive tables respect 
> {{spark.sql.caseSensitive}}: data source tables always do 
> case-sensitive parquet field resolution, while hive tables always do 
> case-insensitive parquet field resolution, no matter whether 
> {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 made data 
> source tables respect {{spark.sql.caseSensitive}}, while hive serde table 
> behavior is not changed.
>  * How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, 
> data source tables do case-sensitive resolution and return columns with the 
> corresponding letter cases, while hive tables always return the first matched 
> column, ignoring case. SPARK-25132 made data source tables throw an exception 
> when there is ambiguity, while hive table behavior is not changed.
> This ticket aims to make behaviors consistent when converting a hive table to 
> a data source table (see the sketch below):
>  * The behavior must be consistent to do the conversion, so we skip the 
> conversion in case-sensitive mode, because hive parquet tables always do 
> case-insensitive field resolution.
>  * In case-insensitive mode, when converting a hive parquet table to the 
> parquet data source, we switch the duplicated-field resolution mode so that 
> the parquet data source picks the first matched field - the same behavior as 
> a hive parquet table - keeping the behaviors consistent.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance

2018-09-16 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617102#comment-16617102
 ] 

Kazuaki Ishizaki commented on SPARK-25437:
--

Is such a feature intended for a major release rather than a maintenance release?

> Using OpenHashMap replace HashMap improve Encoder Performance
> -
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method

2018-09-16 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25444:


 Summary: Refactor GenArrayData.genCodeToCreateArrayData() method
 Key: SPARK-25444
 URL: https://issues.apache.org/jira/browse/SPARK-25444
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.5.0
Reporter: Kazuaki Ishizaki


{{GenArrayData.genCodeToCreateArrayData()}} generates Java code that creates a 
temporary Java array in order to build an {{ArrayData}}. The temporary array 
can be eliminated by using {{ArrayData.createArrayData}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24768) Have a built-in AVRO data source implementation

2018-09-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24768:
---

Assignee: Gengliang Wang

> Have a built-in AVRO data source implementation
> ---
>
> Key: SPARK-24768
> URL: https://issues.apache.org/jira/browse/SPARK-24768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
>
>
> Apache Avro (https://avro.apache.org) is a popular data serialization format. 
> It is widely used in the Spark and Hadoop ecosystem, especially for 
> Kafka-based data pipelines. Using the external package 
> [https://github.com/databricks/spark-avro], Spark SQL can read and write 
> Avro data. Making spark-avro built-in can provide a better experience for 
> first-time users of Spark SQL and Structured Streaming. We expect the 
> built-in Avro data source to further improve the adoption of Structured 
> Streaming. The proposal is to inline code from the spark-avro package 
> ([https://github.com/databricks/spark-avro]). The target release is Spark 
> 2.4.
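A minimal usage sketch of the built-in data source, assuming a Spark 2.4+ session with the spark-avro module on the classpath; the paths are placeholders.

{code:java}
// Read and write Avro through the built-in data source (Spark 2.4+).
val df = spark.read.format("avro").load("/path/to/input.avro")
df.write.format("avro").save("/path/to/output")
{code}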



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24768) Have a built-in AVRO data source implementation

2018-09-16 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24768.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Have a built-in AVRO data source implementation
> ---
>
> Key: SPARK-24768
> URL: https://issues.apache.org/jira/browse/SPARK-24768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
>
>
> Apache Avro (https://avro.apache.org) is a popular data serialization format. 
> It is widely used in the Spark and Hadoop ecosystem, especially for 
> Kafka-based data pipelines. Using the external package 
> [https://github.com/databricks/spark-avro], Spark SQL can read and write 
> Avro data. Making spark-avro built-in can provide a better experience for 
> first-time users of Spark SQL and Structured Streaming. We expect the 
> built-in Avro data source to further improve the adoption of Structured 
> Streaming. The proposal is to inline code from the spark-avro package 
> ([https://github.com/databricks/spark-avro]). The target release is Spark 
> 2.4.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-09-16 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617090#comment-16617090
 ] 

Xiao Li commented on SPARK-23874:
-

[~bryanc] Thanks! It is very helpful.

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported in as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25443) fix issues when building docs with release scripts in docker

2018-09-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25443:
---

 Summary: fix issues when building docs with release scripts in 
docker
 Key: SPARK-25443
 URL: https://issues.apache.org/jira/browse/SPARK-25443
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25430) Add map parameter for withColumnRenamed

2018-09-16 Thread Goun Na (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617066#comment-16617066
 ] 

Goun Na commented on SPARK-25430:
-

Thanks [~hyukjin.kwon]. I will keep your guidance in mind.

> Add map parameter for withColumnRenamed
> ---
>
> Key: SPARK-25430
> URL: https://issues.apache.org/jira/browse/SPARK-25430
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Goun Na
>Priority: Major
>
> The withColumnRenamed method should accept a Map parameter. It removes code 
> redundancy.
> {code:java}
> // example
> df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" 
> )){code}
> {code:java}
> // from abbr columns to desc columns
> val m = Map( "c1" -> "first_column", "c2" -> "second_column" )
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
> It is useful for CJK users when they are working on analysis in notebook 
> environments such as Zeppelin, Databricks, and Apache Toree. 
> {code:java}
> // CJK users can define the dictionary once as a map and reuse it to 
> // translate columns whenever report visualization is required
> val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") 
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
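Until a Map-accepting overload exists, a minimal workaround sketch with the current single-pair API (reusing the df1 and rename map from the examples above):

{code:java}
// Workaround sketch: fold the rename map over the DataFrame with the existing API.
val m = Map("c1" -> "first_column", "c2" -> "second_column")
val renamed = m.foldLeft(df1) { case (df, (from, to)) => df.withColumnRenamed(from, to) }
{code}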



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-25435) df = sqlContext.read.json("examples/src/main/resources/people.json")

2018-09-16 Thread WEI PENG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WEI PENG closed SPARK-25435.


> df = sqlContext.read.json("examples/src/main/resources/people.json")
> 
>
> Key: SPARK-25435
> URL: https://issues.apache.org/jira/browse/SPARK-25435
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: WEI PENG
>Priority: Major
>
> at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution
> .scala:55)
>  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExe
> cution.scala:47)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>  at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSessio
> n.scala:431)
>  at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSour
> ce$.createBaseDataset(JsonDataSource.scala:110)
>  at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSour
> ce$.infer(JsonDataSource.scala:95)
>  at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferS
> chema(JsonDataSource.scala:63)
>  at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferS
> chema(JsonFileFormat.scala:57)
>  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.appl
> y(DataSource.scala:202)
>  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.appl
> y(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileF
> ormatSchema(DataSource.scala:201)
>  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation
> (DataSource.scala:392)
>  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.sca
> la:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
> java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> sorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loade
> r org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2604c310, see 
> th
> e next exception for details.
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAc
> rossDRDA(Unknown Source)
>  ... 123 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> da
> tabase C:\Users\WEI\metastore_db.
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSL
> ockOnDB(Unknown Source)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown
> Source)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockO
> nDB(Unknown Source)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown
>  Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Sourc
> e)
>  at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown
> Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknow
> n Source)
>  at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknow
> n Source)
>  at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unkn
> own Source)
>  at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown So
> urce)
>  at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Sourc
> e)
>  at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown
> Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknow
> n Source)
>  at 

[jira] [Resolved] (SPARK-25435) df = sqlContext.read.json("examples/src/main/resources/people.json")

2018-09-16 Thread WEI PENG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WEI PENG resolved SPARK-25435.
--
Resolution: Incomplete

> df = sqlContext.read.json("examples/src/main/resources/people.json")
> 
>
> Key: SPARK-25435
> URL: https://issues.apache.org/jira/browse/SPARK-25435
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: WEI PENG
>Priority: Major
>
> at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution
> .scala:55)
>  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExe
> cution.scala:47)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>  at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSessio
> n.scala:431)
>  at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSour
> ce$.createBaseDataset(JsonDataSource.scala:110)
>  at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSour
> ce$.infer(JsonDataSource.scala:95)
>  at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferS
> chema(JsonDataSource.scala:63)
>  at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferS
> chema(JsonFileFormat.scala:57)
>  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.appl
> y(DataSource.scala:202)
>  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.appl
> y(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileF
> ormatSchema(DataSource.scala:201)
>  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation
> (DataSource.scala:392)
>  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.sca
> la:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
> java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> sorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loade
> r org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2604c310, see 
> th
> e next exception for details.
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAc
> rossDRDA(Unknown Source)
>  ... 123 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> da
> tabase C:\Users\WEI\metastore_db.
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSL
> ockOnDB(Unknown Source)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown
> Source)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockO
> nDB(Unknown Source)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown
>  Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Sourc
> e)
>  at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown
> Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknow
> n Source)
>  at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknow
> n Source)
>  at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unkn
> own Source)
>  at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown So
> urce)
>  at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Sourc
> e)
>  at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown
> Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknow
> n Source)
>  at 

[jira] [Closed] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path

2018-09-16 Thread WEI PENG (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WEI PENG closed SPARK-25434.


Problem resolved

> failed to locate the winutils binary in the hadoop binary path
> --
>
> Key: SPARK-25434
> URL: https://issues.apache.org/jira/browse/SPARK-25434
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell
>Affects Versions: 2.3.1
>Reporter: WEI PENG
>Priority: Major
>
> C:\Users\WEI>pyspark
> Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC 
> v.
> 1900 64 bit (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in 
> th
> e hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Ha
> doop binaries.
>  at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
>  at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
>  at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
>  at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
>  at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(Secur
> ityUtil.java:611)
>  at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupI
> nformation.java:273)
>  at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(Use
> rGroupInformation.java:261)
>  at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(
> UserGroupInformation.java:791)
>  at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGrou
> pInformation.java:761)
>  at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGr
> oupInformation.java:634)
>  at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils
> .scala:2467)
>  at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils
> .scala:2467)
>  at scala.Option.getOrElse(Option.scala:121)
>  at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467)
>  at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:220)
>  at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit.
> scala:408)
>  at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSub
> mit$$secMgr$1(SparkSubmit.scala:408)
>  at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme
> nt$7.apply(SparkSubmit.scala:416)
>  at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironme
> nt$7.apply(SparkSubmit.scala:416)
>  at scala.Option.map(Option.scala:146)
>  at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(Spark
> Submit.scala:415)
>  at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSu
> bmit.scala:250)
>  at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop 
> lib
> rary for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLeve
> l(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
>       /_/
> Using Python version 3.5.6 (default, Aug 26 2018 16:05:27)
> SparkSession available as 'spark'.
> >>>
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster

2018-09-16 Thread Suryanarayana Garlapati (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617018#comment-16617018
 ] 

Suryanarayana Garlapati commented on SPARK-25442:
-

There was an earlier PR (https://github.com/apache/spark/pull/20272) which 
contained only a partial fix. PR [https://github.com/apache/spark/pull/22433] 
provides the full solution for the same.

> Support STS to run in K8S deployment with spark deployment mode as cluster
> --
>
> Key: SPARK-25442
> URL: https://issues.apache.org/jira/browse/SPARK-25442
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, SQL
>Affects Versions: 2.4.0, 2.5.0
>Reporter: Suryanarayana Garlapati
>Priority: Major
>
> STS fails to start in kubernetes deployments when the spark deploy mode is 
> cluster. Support should be added to make it run in K8S deployments.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23200) Reset configuration when restarting from checkpoints

2018-09-16 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617016#comment-16617016
 ] 

Wenchen Fan commented on SPARK-23200:
-

We should definitely merge it to branch 2.4, but I won't block the release 
since it's not that critical and it's still in progress. After it's merged, 
feel free to vote -1 on the RC voting email to include this change, if 
necessary.

> Reset configuration when restarting from checkpoints
> 
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>
> Streaming workloads and restarting from checkpoints may need additional 
> changes, i.e. resetting properties -  see 
> https://github.com/apache-spark-on-k8s/spark/pull/516



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance

2018-09-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616994#comment-16616994
 ] 

Hyukjin Kwon commented on SPARK-25437:
--

Please fill the JIRA description.

> Using OpenHashMap replace HashMap improve Encoder Performance
> -
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25435) df = sqlContext.read.json("examples/src/main/resources/people.json")

2018-09-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616993#comment-16616993
 ] 

Hyukjin Kwon commented on SPARK-25435:
--

Can you provide reproducible steps to verify this issue rather than just posting 
an error message?

> df = sqlContext.read.json("examples/src/main/resources/people.json")
> 
>
> Key: SPARK-25435
> URL: https://issues.apache.org/jira/browse/SPARK-25435
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.1
>Reporter: WEI PENG
>Priority: Major
>
> at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution
> .scala:55)
>  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExe
> cution.scala:47)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>  at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSessio
> n.scala:431)
>  at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSour
> ce$.createBaseDataset(JsonDataSource.scala:110)
>  at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSour
> ce$.infer(JsonDataSource.scala:95)
>  at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferS
> chema(JsonDataSource.scala:63)
>  at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferS
> chema(JsonFileFormat.scala:57)
>  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.appl
> y(DataSource.scala:202)
>  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.appl
> y(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileF
> ormatSchema(DataSource.scala:201)
>  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation
> (DataSource.scala:392)
>  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.sca
> la:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.
> java:62)
>  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces
> sorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>  at py4j.Gateway.invoke(Gateway.java:282)
>  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>  at py4j.commands.CallCommand.execute(CallCommand.java:79)
>  at py4j.GatewayConnection.run(GatewayConnection.java:238)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loade
> r org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2604c310, see 
> th
> e next exception for details.
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAc
> rossDRDA(Unknown Source)
>  ... 123 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> da
> tabase C:\Users\WEI\metastore_db.
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.iapi.error.StandardException.newException(Unknown So
> urce)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSL
> ockOnDB(Unknown Source)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown
> Source)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockO
> nDB(Unknown Source)
>  at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown
>  Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Sourc
> e)
>  at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown
> Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknow
> n Source)
>  at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknow
> n Source)
>  at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unkn
> own Source)
>  at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown So
> urce)
>  at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
>  at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Sourc
> e)
>  at 

[jira] [Comment Edited] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance

2018-09-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616994#comment-16616994
 ] 

Hyukjin Kwon edited comment on SPARK-25437 at 9/17/18 2:13 AM:
---

Please fill the JIRA description.


was (Author: hyukjin.kwon):
Please feel the JIRA description.

> Using OpenHashMap replace HashMap improve Encoder Performance
> -
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark

2018-09-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616992#comment-16616992
 ] 

Hyukjin Kwon commented on SPARK-25433:
--

What are the advantages of adding this, and what are the alternatives? A JIRA 
should describe what to fix, not how to fix it.

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> This has been partly discussed in SPARK-13587
> I would like to provision the executors with a PEX package. I created a PR 
> with minimal necessary changes in PythonWorkerFactory.
> PR: [https://github.com/apache/spark/pull/22422/files]
> To run it, one needs to set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON 
> variables to the pex file and upload the pex file to the executors via 
> sparkContext.addFile or by setting the spark.yarn.dist.files/spark.files 
> config properties.
> It is also necessary to set the PEX_ROOT environment variable. By default, 
> inside the executors it tries to access /home/.pex, and this fails.
> Ideally, as this configuration is quite cumbersome, it would be interesting 
> to also add a parameter --pexFile to SparkContext and spark-submit in order 
> to directly provide a pexFile and then everything else is handled. Please 
> tell me what you think of this.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616984#comment-16616984
 ] 

Hyukjin Kwon commented on SPARK-25424:
--

Please avoid setting a target version, which is usually reserved for committers. 
See https://spark.apache.org/contributing.html and open a PR, which will 
automatically assign the appropriate one.

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Priority: Major
> Fix For: 2.4.0
>
>
> In the TimeWindow class, window duration and slide duration should not be 
> allowed to take negative values.
> Currently this behaviour is enforced by Catalyst. It could instead be 
> enforced by the constructor of TimeWindow, allowing it to fail fast.
> For example, the code below throws the following error. Note that the error 
> is produced at the time of the count() call instead of the window() call.
> {code:java}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code:java}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> 

[jira] [Commented] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure

2018-09-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616985#comment-16616985
 ] 

Hyukjin Kwon commented on SPARK-25429:
--

PR https://github.com/apache/spark/pull/22420

> SparkListenerBus inefficient due to 
> 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
> 
>
> Key: SPARK-25429
> URL: https://issues.apache.org/jira/browse/SPARK-25429
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Priority: Major
>
> {code:java}
> private def updateStageMetrics(
>   stageId: Int,
>   attemptId: Int,
>   taskId: Long,
>   accumUpdates: Seq[AccumulableInfo],
>   succeeded: Boolean): Unit = {
> Option(stageMetrics.get(stageId)).foreach { metrics =>
>   if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) {
> return
>   }
>   val oldTaskMetrics = metrics.taskMetrics.get(taskId)
>   if (oldTaskMetrics != null && oldTaskMetrics.succeeded) {
> return
>   }
>   val updates = accumUpdates
> .filter { acc => acc.update.isDefined && 
> metrics.accumulatorIds.contains(acc.id) }
> .sortBy(_.id)
>   if (updates.isEmpty) {
> return
>   }
>   val ids = new Array[Long](updates.size)
>   val values = new Array[Long](updates.size)
>   updates.zipWithIndex.foreach { case (acc, idx) =>
> ids(idx) = acc.id
> // In a live application, accumulators have Long values, but when 
> reading from event
> // logs, they have String values. For now, assume all accumulators 
> are Long and covert
> // accordingly.
> values(idx) = acc.update.get match {
>   case s: String => s.toLong
>   case l: Long => l
>   case o => throw new IllegalArgumentException(s"Unexpected: $o")
> }
>   }
>   // TODO: storing metrics by task ID can cause metrics for the same task 
> index to be
>   // counted multiple times, for example due to speculation or 
> re-attempts.
>   metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, 
> succeeded))
> }
>   }
> {code}
> 'metrics.accumulatorIds.contains(acc.id)': if a large SQL application 
> generates many accumulators, using Array#contains here is inefficient.
> In practice, the application may time out while quitting and be killed by 
> the RM in YARN mode.
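A standalone sketch of the idea only (not the actual Spark patch): build a Set once so each membership check is constant time instead of a linear Array scan. The values below are placeholders.

{code:java}
// Standalone sketch: constant-time membership checks instead of Array#contains.
val accumulatorIds: Array[Long] = Array(1L, 2L, 3L)        // placeholder ids
val accumulatorIdSet: Set[Long] = accumulatorIds.toSet     // built once per stage
val updateIds: Seq[Long] = Seq(2L, 5L, 7L)                 // placeholder accumulator updates
val relevant = updateIds.filter(accumulatorIdSet.contains) // O(1) per lookup
{code}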



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25424) Window duration and slide duration with negative values should fail fast

2018-09-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25424:
-
Target Version/s:   (was: 2.4.0)

> Window duration and slide duration with negative values should fail fast
> 
>
> Key: SPARK-25424
> URL: https://issues.apache.org/jira/browse/SPARK-25424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Raghav Kumar Gautam
>Priority: Major
> Fix For: 2.4.0
>
>
> In the TimeWindow class, window duration and slide duration should not be 
> allowed to take negative values.
> Currently this behaviour is enforced by Catalyst. It could instead be 
> enforced by the constructor of TimeWindow, allowing it to fail fast.
> For example, the code below throws the following error. Note that the error 
> is produced at the time of the count() call instead of the window() call.
> {code:java}
> val df = spark.readStream
>   .format("rate")
>   .option("numPartitions", "2")
>   .option("rowsPerSecond", "10")
>   .load()
>   .filter("value % 20 == 0")
>   .withWatermark("timestamp", "10 seconds")
>   .groupBy(window($"timestamp", "-10 seconds", "5 seconds"))
>   .count()
> {code}
> Error:
> {code:java}
> cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data 
> type mismatch: The window duration (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
> org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, 
> -1000, 500, 0)' due to data type mismatch: The window duration 
> (-1000) must be greater than 0.;;
> 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], 
> [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, 
> count(1) AS count#57L]
> +- AnalysisBarrier
>   +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds
>  +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint))
> +- StreamingRelationV2 
> org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, 
> Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], 
> StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond
>  -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
>   at 
> 

[jira] [Commented] (SPARK-25430) Add map parameter for withColumnRenamed

2018-09-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616981#comment-16616981
 ] 

Hyukjin Kwon commented on SPARK-25430:
--

Please avoid setting the target version, which is usually reserved for committers.

> Add map parameter for withColumnRenamed
> ---
>
> Key: SPARK-25430
> URL: https://issues.apache.org/jira/browse/SPARK-25430
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Goun Na
>Priority: Major
>
> The withColumnRenamed method should accept a Map parameter. It removes code 
> redundancy.
> {code:java}
> // example
> df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" 
> )){code}
> {code:java}
> // from abbr columns to desc columns
> val m = Map( "c1" -> "first_column", "c2" -> "second_column" )
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
> It is useful for CJK users when they are working on analysis in notebook 
> environments such as Zeppelin, Databricks, and Apache Toree. 
> {code:java}
> // CJK users can define the dictionary once as a map and reuse it to 
> // translate columns whenever report visualization is required
> val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") 
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25430) Add map parameter for withColumnRenamed

2018-09-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25430:
-
Target Version/s:   (was: 2.4.0)

> Add map parameter for withColumnRenamed
> ---
>
> Key: SPARK-25430
> URL: https://issues.apache.org/jira/browse/SPARK-25430
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Goun Na
>Priority: Major
>
> The withColumnRenamed method should accept a Map parameter. It removes code 
> redundancy.
> {code:java}
> // example
> df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" 
> )){code}
> {code:java}
> // from abbr columns to desc columns
> val m = Map( "c1" -> "first_column", "c2" -> "second_column" )
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}
> It is useful for CJK users when they are working on analysis in notebook 
> environments such as Zeppelin, Databricks, and Apache Toree. 
> {code:java}
> // CJK users can define the dictionary once as a map and reuse it to 
> // translate columns whenever report visualization is required
> val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") 
> df1.withColumnRenamed(m) 
> df2.withColumnRenamed(m)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25420) Dataset.count() every time is different.

2018-09-16 Thread huanghuai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huanghuai resolved SPARK-25420.
---
Resolution: Fixed

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> if I remove dropDuplicates() everything will be ok!!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.

2018-09-16 Thread huanghuai (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616978#comment-16616978
 ] 

huanghuai commented on SPARK-25420:
---

Because every time I use dropDuplicates(), the rows come back in a different 
order, and each time I get a different first row after the group by. That's 
why one or two records are missing every time.

But no matter how I try to use a sort before or after dropDuplicates(), it is 
no use.

Finally, I used dataset.cache() after dropDuplicates().
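A standalone sketch of that workaround, shown in Scala with a placeholder input path and the de-duplication columns from the code in the issue description below; `spark` refers to an active SparkSession.

{code:java}
// Sketch: materialize the dropDuplicates result once so that repeated counts
// and filters see the same rows. A workaround, not a general fix.
val dataset = spark.read.option("header", "true").csv("/path/to/input.csv") // placeholder input
val deduped = dataset.dropDuplicates("DATE", "TIME", "VEL", "COMPANY").cache()
deduped.count()  // forces caching; subsequent counts stay consistent
{code}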

> Dataset.count()  every time is different.
> -
>
> Key: SPARK-25420
> URL: https://issues.apache.org/jira/browse/SPARK-25420
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: spark2.3
> standalone
>Reporter: huanghuai
>Priority: Major
>  Labels: SQL
>
> Dataset dataset = sparkSession.read().format("csv").option("sep", 
> ",").option("inferSchema", "true")
>  .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true")
>  .option("encoding", "UTF-8")
>  .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv");
> System.out.println("source count="+dataset.count());
> Dataset dropDuplicates = dataset.dropDuplicates(new 
> String[]\{"DATE","TIME","VEL","COMPANY"});
> System.out.println("dropDuplicates count1="+dropDuplicates.count());
> System.out.println("dropDuplicates count2="+dropDuplicates.count());
> Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 
> and (status = 0 or status = 1)");
> System.out.println("filter count1="+filter.count());
> System.out.println("filter count2="+filter.count());
> System.out.println("filter count3="+filter.count());
> System.out.println("filter count4="+filter.count());
> System.out.println("filter count5="+filter.count());
>  
>  
> --The above is code 
> ---
>  
>  
> console output:
> source count=459275
> dropDuplicates count1=453987
> dropDuplicates count2=453987
> filter count1=445798
> filter count2=445797
> filter count3=445797
> filter count4=445798
> filter count5=445799
>  
> question:
>  
> Why is filter.count() different every time?
> if I remove dropDuplicates() everything will be ok!!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-16 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616955#comment-16616955
 ] 

Dongjoon Hyun commented on SPARK-25423:
---

[~yumwang]'s PR link is added.

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Maryann Xue
>Assignee: Yuming Wang
>Priority: Trivial
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-16 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616953#comment-16616953
 ] 

Dongjoon Hyun commented on SPARK-25423:
---

For the new feature, `Affects Version/s` should be the version of the master branch.

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Maryann Xue
>Priority: Trivial
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-16 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25423:
-

Assignee: Yuming Wang

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Maryann Xue
>Assignee: Yuming Wang
>Priority: Trivial
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata

2018-09-16 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25423:
--
Affects Version/s: (was: 2.3.1)
   2.5.0

> Output "dataFilters" in DataSourceScanExec.metadata
> ---
>
> Key: SPARK-25423
> URL: https://issues.apache.org/jira/browse/SPARK-25423
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.5.0
>Reporter: Maryann Xue
>Priority: Trivial
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25153) Improve error messages for columns with dots/periods

2018-09-16 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fernando Díaz updated SPARK-25153:
--
Comment: was deleted

(was: I will take a look at it. Quick question:

Given a dataframe with just one column named "Spark.Dots", should the error 
appear for both of the following examples? 
{code:java}
df.col("Spark.Dots")
df.col("Completely.Different.Name"){code}
 In the first case, Spark will try to extract the value "Dots" from the column. 
In the second case, it won't be able to resolve the column because there is no 
column with the qualifier "Completely".)

> Improve error messages for columns with dots/periods
> 
>
> Key: SPARK-25153
> URL: https://issues.apache.org/jira/browse/SPARK-25153
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> When we fail to resolve a column name that contains a dot, and the column
> name is present as a string literal, the error message could mention using
> backticks so that the string is treated as a literal.
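A minimal sketch of the backtick form the description refers to, using the hypothetical dataframe and column name from the deleted comment above (a Dataset<Row> named df with a single column literally called "Spark.Dots"):

{code:java}
// Without backticks, the dot is treated as a nested-field separator, so
// df.col("Spark.Dots") fails to resolve (it looks for a field "Dots" inside
// a column "Spark"). With backticks, the whole string is one column name:
df.col("`Spark.Dots`");
df.select("`Spark.Dots`").show();
{code}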



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-16 Thread Nir Hedvat (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616675#comment-16616675
 ] 

Nir Hedvat edited comment on SPARK-25380 at 9/16/18 11:21 AM:
--

Experiencing the same problem

  !image-2018-09-16-14-21-38-939.png! 


was (Author: nir hedvat):
Same problem here (using Spark 2.3.1)

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code), I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of OOM heap dump opened in JVisualVM.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-09-16 Thread Nir Hedvat (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616675#comment-16616675
 ] 

Nir Hedvat commented on SPARK-25380:


Same problem here (using Spark 2.3.1)

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code), I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of OOM heap dump opened in JVisualVM.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24315) Multiple streaming jobs detected error causing job failure

2018-09-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616643#comment-16616643
 ] 

Marco Gaido commented on SPARK-24315:
-

[~joeyfezster] it was a while ago, so I may be wrong, but IIRC this was caused 
by a corrupted checkpoint directory; deleting it and restarting the job solved 
the issue.
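A minimal sketch of that workaround, assuming an HDFS checkpoint location (the path and the spark/df variables below are hypothetical; note that deleting the checkpoint also discards the query's saved offsets and progress):

{code:java}
// Stop the streaming query first, then remove the corrupted checkpoint directory.
// Assumes org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path and
// org.apache.spark.sql.streaming.Trigger are imported.
FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
fs.delete(new Path("hdfs:///checkpoints/my-query"), true);  // recursive delete

// Restart the query against the now-empty (or a brand-new) checkpoint location.
df.writeStream()
  .trigger(Trigger.ProcessingTime("2 minutes"))
  .option("checkpointLocation", "hdfs:///checkpoints/my-query")
  .start();
{code}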

> Multiple streaming jobs detected error causing job failure
> --
>
> Key: SPARK-24315
> URL: https://issues.apache.org/jira/browse/SPARK-24315
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Major
>
> We are running a simple structured streaming job. It reads data from Kafka 
> and writes it to HDFS. Unfortunately at startup, the application fails with 
> the following error. After some restarts the application finally starts 
> successfully.
> {code}
> org.apache.spark.sql.streaming.StreamingQueryException: assertion failed: 
> Concurrent update to the log. Multiple streaming jobs detected for 1
> === Streaming Query ===
> 
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: java.lang.AssertionError: assertion failed: Concurrent update to 
> the log. Multiple streaming jobs detected for 1
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcV$sp(MicroBatchExecution.scala:339)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:338)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:128)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
> ... 1 more
> {code}
> We have not set any value for `spark.streaming.concurrentJobs`. Our code 
> looks like:
> {code}
>   // read from kafka
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy(window($"timestamp", "1 hour", "30 minutes"), ...).count()
>   // simple select of some fields with casts
>   .coalesce(1)
>   .writeStream
>   .trigger(Trigger.ProcessingTime("2 minutes"))
>   .option("checkpointLocation", checkpointDir)
>   // write to HDFS
>   .start()
>   .awaitTermination()
> {code}
> This may also be related to the presence of some data in the kafka queue to 
> process, so the time for the first batch may be longer than usual (as it is 
> quite common I think).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-24315) Multiple streaming jobs detected error causing job failure

2018-09-16 Thread Joey (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616622#comment-16616622
 ] 

Joey commented on SPARK-24315:
--

[~mgaido], can you please explain why this is not a bug?

What could be the cause of this?

> Multiple streaming jobs detected error causing job failure
> --
>
> Key: SPARK-24315
> URL: https://issues.apache.org/jira/browse/SPARK-24315
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Major
>
> We are running a simple structured streaming job. It reads data from Kafka 
> and writes it to HDFS. Unfortunately at startup, the application fails with 
> the following error. After some restarts the application finally starts 
> successfully.
> {code}
> org.apache.spark.sql.streaming.StreamingQueryException: assertion failed: 
> Concurrent update to the log. Multiple streaming jobs detected for 1
> === Streaming Query ===
> 
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: java.lang.AssertionError: assertion failed: Concurrent update to 
> the log. Multiple streaming jobs detected for 1
> at scala.Predef$.assert(Predef.scala:170)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcV$sp(MicroBatchExecution.scala:339)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:338)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:128)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
> at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
> ... 1 more
> {code}
> We have not set any value for `spark.streaming.concurrentJobs`. Our code 
> looks like:
> {code}
>   // read from kafka
>   .withWatermark("timestamp", "30 minutes")
>   .groupBy(window($"timestamp", "1 hour", "30 minutes"), ...).count()
>   // simple select of some fields with casts
>   .coalesce(1)
>   .writeStream
>   .trigger(Trigger.ProcessingTime("2 minutes"))
>   .option("checkpointLocation", checkpointDir)
>   // write to HDFS
>   .start()
>   .awaitTermination()
> {code}
> This may also be related to the presence of some data in the kafka queue to 
> process, so the time for the first batch may be longer than usual (as it is 
> quite common I think).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Resolved] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source

2018-09-16 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao resolved SPARK-25391.
--
Resolution: Won't Do

> Make behaviors consistent when converting parquet hive table to parquet data 
> source
> ---
>
> Key: SPARK-25391
> URL: https://issues.apache.org/jira/browse/SPARK-25391
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Priority: Major
>
> parquet data source tables and hive parquet tables have different behaviors 
> about parquet field resolution. So, when 
> {{spark.sql.hive.convertMetastoreParquet}} is true, users might face 
> inconsistent behaviors. The differences are:
>  * Whether respect {{spark.sql.caseSensitive}}. Without SPARK-25132, both 
> data source tables and hive tables do NOT respect 
> {{spark.sql.caseSensitive}}. However data source tables always do 
> case-sensitive parquet field resolution, while hive tables always do 
> case-insensitive parquet field resolution no matter whether 
> {{spark.sql.caseSensitive}} is set to true or false. SPARK-25132 let data 
> source tables respect {{spark.sql.caseSensitive}} while hive serde table 
> behavior is not changed.
>  * How to resolve ambiguity in case-insensitive mode. Without SPARK-25132, 
> data source tables do case-sensitive resolution and return columns with the 
> corresponding letter cases, while hive tables always return the first matched 
> column ignoring cases. SPARK-25132 let data source tables throw exception 
> when there is ambiguity while hive table behavior is not changed.
> This ticket aims to make behaviors consistent when converting hive table to 
> data source table.
>  * The behavior must be consistent to do the conversion, so we skip the 
> conversion in case-sensitive mode because hive parquet table always do 
> case-insensitive field resolution.
>  * In case-insensitive mode, when converting hive parquet table to parquet 
> data source, we switch the duplicated fields resolution mode to ask parquet 
> data source to pick the first matched field - the same behavior as hive 
> parquet table - to keep behaviors consistent.
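A minimal sketch of the two settings involved in the conversion described above (both configuration keys are existing Spark SQL settings; the table name and SparkSession setup are hypothetical):

{code:java}
// Requires org.apache.spark.sql.SparkSession and a Hive-enabled build of Spark;
// "parquet_hive_table" is a hypothetical Hive table stored as parquet.
SparkSession spark = SparkSession.builder()
    .enableHiveSupport()
    .getOrCreate();

// Case-insensitive field resolution (the default value) ...
spark.conf().set("spark.sql.caseSensitive", "false");
// ... combined with conversion of Hive parquet tables to the parquet data source.
spark.conf().set("spark.sql.hive.convertMetastoreParquet", "true");

spark.sql("SELECT * FROM parquet_hive_table").show();
{code}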



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org