[jira] [Resolved] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23372.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20579
[https://github.com/apache/spark/pull/20579]

> Writing empty struct in parquet fails during execution. It should fail 
> earlier during analysis.
> ---
>
> Key: SPARK-23372
> URL: https://issues.apache.org/jira/browse/SPARK-23372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 2.4.0
>
>
> *Running*
> spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
> *Results in*
> {code:java}
>  org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with 
> an empty group: message spark_schema {
>  }
> at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
>  at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
>  at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
>  at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.
>  {code}
> We should detect this earlier in the processing and raise the error.
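
For illustration only, a minimal sketch of the kind of analysis-time check this asks for, assuming a standalone helper rather than the actual change in PR 20579 (a plain IllegalArgumentException stands in for Spark's internal AnalysisException):

{code:scala}
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

object SchemaChecks {
  // Walk the schema and reject any empty struct before the write is planned,
  // so the failure surfaces during analysis instead of inside the Parquet
  // writer on an executor.
  def assertNoEmptyStruct(schema: StructType): Unit = {
    def hasEmptyStruct(dt: DataType): Boolean = dt match {
      case s: StructType if s.isEmpty => true
      case s: StructType => s.fields.exists(f => hasEmptyStruct(f.dataType))
      case a: ArrayType  => hasEmptyStruct(a.elementType)
      case m: MapType    => hasEmptyStruct(m.keyType) || hasEmptyStruct(m.valueType)
      case _             => false
    }
    if (hasEmptyStruct(schema)) {
      throw new IllegalArgumentException(
        "Parquet data source does not support writing an empty struct")
    }
  }
}

// e.g. SchemaChecks.assertNoEmptyStruct(spark.emptyDataFrame.schema)  // fails fast
{code}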






[jira] [Assigned] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23372:
---

Assignee: Dilip Biswal

> Writing empty struct in parquet fails during execution. It should fail 
> earlier during analysis.
> ---
>
> Key: SPARK-23372
> URL: https://issues.apache.org/jira/browse/SPARK-23372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 2.4.0
>
>
> *Running*
> spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
> *Results in*
> {code:java}
>  org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with 
> an empty group: message spark_schema {
>  }
> at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
>  at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
>  at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
>  at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.
>  {code}
> We should detect this earlier in the processing and raise the error.






[jira] [Resolved] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23760.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20870
[https://github.com/apache/spark/pull/20870]

> CodegenContext.withSubExprEliminationExprs should save/restore CSE state 
> correctly
> --
>
> Key: SPARK-23760
> URL: https://issues.apache.org/jira/browse/SPARK-23760
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes 
> it effectively always clear the subexpression elimination state after it's 
> called.
> The original intent of this function was that it should save the old state, 
> set the given new state as current and perform codegen (invoke 
> {{Expression.genCode()}}), and at the end restore the subexpression 
> elimination state back to the old state. This ticket tracks a fix to actually 
> implement the original intent.
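
For context, the intended save/restore behaviour reads roughly like the sketch below; the field name follows the ticket, but the types are simplified stand-ins for the real CodegenContext internals:

{code:scala}
// Simplified model of the intended semantics: install the given
// subexpression-elimination state, run code generation, and always restore
// the previous state afterwards, rather than clearing it.
class CodegenContextSketch {
  var subExprEliminationExprs: Map[String, String] = Map.empty  // stand-in type

  def withSubExprEliminationExprs[T](newState: Map[String, String])(genCode: => T): T = {
    val oldState = subExprEliminationExprs   // save the old state
    subExprEliminationExprs = newState       // set the given state as current
    try {
      genCode                                // perform codegen (Expression.genCode())
    } finally {
      subExprEliminationExprs = oldState     // restore, not clear
    }
  }
}
{code}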






[jira] [Assigned] (SPARK-23760) CodegenContext.withSubExprEliminationExprs should save/restore CSE state correctly

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23760:
---

Assignee: Kris Mok

> CodegenContext.withSubExprEliminationExprs should save/restore CSE state 
> correctly
> --
>
> Key: SPARK-23760
> URL: https://issues.apache.org/jira/browse/SPARK-23760
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> There's a bug in {{CodegenContext.withSubExprEliminationExprs()}} that makes 
> it effectively always clear the subexpression elimination state after it's 
> called.
> The original intent of this function was that it should save the old state, 
> set the given new state as current and perform codegen (invoke 
> {{Expression.genCode()}}), and at the end restore the subexpression 
> elimination state back to the old state. This ticket tracks a fix to actually 
> implement the original intent.






[jira] [Comment Edited] (SPARK-23710) Upgrade Hive to 2.3.2

2018-03-21 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408864#comment-16408864
 ] 

Dongjoon Hyun edited comment on SPARK-23710 at 3/22/18 12:56 AM:
-

[~q79969786]. IMHO, `hive-storage-api-2.4.0.jar` is not allowed to be there 
without `-Phive`.

For the conflict issues, what about turning the `nohive` classifier on and off 
based on the `hive` profile?


was (Author: dongjoon):
[~q79969786]. IMHO, `hive-storage-api-2.4.0.jar` is not allowed to be there 
without `-Phive`.

> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h1. Mainly changes
>  * Maven dependency:
>  hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change 
> {{hive.classifier}} to {{core}}
>  calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
>  datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
>  remove {{orc.classifier}}, it means orc use the {{hive.storage.api}}, see: 
> ORC-174
>  add new dependency {{avatica}} and {{hive.storage.api}}
>  * ORC compatibility changes:
>  OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
> OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala
>  * hive-thriftserver java file update:
>  update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2
>  update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
> hive 2.3.2
>  * TestSuite should update:
> ||TestSuite||Reason||
> |StatisticsSuite|HIVE-16098|
> |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
> |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
> SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
> hive-hcatalog-core-2.3.2.jar|
> |SparkExecuteStatementOperationSuite|Interface changed from 
> org.apache.hive.service.cli.Type.NULL_TYPE to 
> org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
> |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
> change to com.esotericsoftware.kryo.Kryo|
> |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
> to Seq("1.100\t1", "2.100\t2")|
> |HiveOrcFilterSuite|Result format changed|
> |HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
> |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
> org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
> org.datanucleus.identity.OID|
>  * Other changes:
> Close hive schema verification:  
> [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
>  and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
> Update 
> [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
> Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't 
> connect to Hive 1.x metastore, We should use 
> {{HiveMetaStoreClient.getDelegationToken}} instead of 
> {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}}
> All changes can be found at 
> [PR-20659|https://github.com/apache/spark/pull/20659].






[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2

2018-03-21 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408864#comment-16408864
 ] 

Dongjoon Hyun commented on SPARK-23710:
---

[~q79969786]. IMHO, `hive-storage-api-2.4.0.jar` is not allowed to be there 
without `-Phive`.

> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h1. Mainly changes
>  * Maven dependency:
>  hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change 
> {{hive.classifier}} to {{core}}
>  calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
>  datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
>  remove {{orc.classifier}}, it means orc use the {{hive.storage.api}}, see: 
> ORC-174
>  add new dependency {{avatica}} and {{hive.storage.api}}
>  * ORC compatibility changes:
>  OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
> OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala
>  * hive-thriftserver java file update:
>  update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2
>  update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
> hive 2.3.2
>  * TestSuite should update:
> ||TestSuite||Reason||
> |StatisticsSuite|HIVE-16098|
> |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
> |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
> SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
> hive-hcatalog-core-2.3.2.jar|
> |SparkExecuteStatementOperationSuite|Interface changed from 
> org.apache.hive.service.cli.Type.NULL_TYPE to 
> org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
> |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
> change to com.esotericsoftware.kryo.Kryo|
> |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
> to Seq("1.100\t1", "2.100\t2")|
> |HiveOrcFilterSuite|Result format changed|
> |HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
> |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
> org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
> org.datanucleus.identity.OID|
>  * Other changes:
> Close hive schema verification:  
> [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
>  and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
> Update 
> [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
> Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't 
> connect to Hive 1.x metastore, We should use 
> {{HiveMetaStoreClient.getDelegationToken}} instead of 
> {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}}
> All changes can be found at 
> [PR-20659|https://github.com/apache/spark/pull/20659].






[jira] [Resolved] (SPARK-23729) Glob resolution breaks remote naming of files/archives

2018-03-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23729.

   Resolution: Fixed
 Assignee: Mihaly Toth
Fix Version/s: 2.4.0
   2.3.1

> Glob resolution breaks remote naming of files/archives
> --
>
> Key: SPARK-23729
> URL: https://issues.apache.org/jira/browse/SPARK-23729
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Mihaly Toth
>Assignee: Mihaly Toth
>Priority: Major
>  Labels: regression
> Fix For: 2.3.1, 2.4.0
>
>
> Given one uses {{spark-submit}} with either of the {{\-\-archives}} or the 
> {{\-\-files}} parameters, in case the file name actually contains glob 
> patterns, the rename part ({{...#nameAs}}) of the filename will eventually be 
> ignored.
> Thinking over the resolution cases, if the resolution results in multiple 
> files, it does not make sense to send all of them under the same remote name. 
> So this should then result in an error.
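
A rough sketch of the behaviour argued for here, using a hypothetical helper (not the actual spark-submit code) in which expandGlob stands in for the filesystem glob lookup:

{code:scala}
import java.net.URI

// Keep the "#nameAs" fragment attached when the glob resolves to a single
// file, and fail when it resolves to several, since they cannot all share
// one remote name.
def resolveWithFragment(raw: String, expandGlob: String => Seq[String]): String = {
  val fragment = Option(new URI(raw).getFragment)      // the "#nameAs" part, if any
  val matches  = expandGlob(raw.takeWhile(_ != '#'))   // glob part only
  fragment match {
    case Some(name) if matches.size == 1 => s"${matches.head}#$name"  // preserve the rename
    case Some(_) =>
      throw new IllegalArgumentException(
        s"$raw matches ${matches.size} files; a #rename fragment requires exactly one match")
    case None => matches.mkString(",")
  }
}
{code}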






[jira] [Updated] (SPARK-23729) Glob resolution breaks remote naming of files/archives

2018-03-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-23729:
---
Labels: regression  (was: )

> Glob resolution breaks remote naming of files/archives
> --
>
> Key: SPARK-23729
> URL: https://issues.apache.org/jira/browse/SPARK-23729
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Mihaly Toth
>Priority: Major
>  Labels: regression
>
> Given one uses {{spark-submit}} with either of the {{\-\-archives}} or the 
> {{\-\-files}} parameters, in case the file name actually contains glob 
> patterns, the rename part ({{...#nameAs}}) of the filename will eventually be 
> ignored.
> Thinking over the resolution cases, if the resolution results in multiple 
> files, it does not make sense to send all of them under the same remote name. 
> So this should then result in an error.






[jira] [Commented] (SPARK-23429) Add executor memory metrics to heartbeat and expose in executors REST API

2018-03-21 Thread assia ydroudj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408723#comment-16408723
 ] 

assia ydroudj commented on SPARK-23429:
---

Hi  [~elu], 

Still interested in getting these values... can you please explain how to do it?

Thanks.

> Add executor memory metrics to heartbeat and expose in executors REST API
> -
>
> Key: SPARK-23429
> URL: https://issues.apache.org/jira/browse/SPARK-23429
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
>
> Add new executor level memory metrics ( jvmUsedMemory, executionMemory, 
> storageMemory, and unifiedMemory), and expose these via the executors REST 
> API. This information will help provide insight into how executor and driver 
> JVM memory is used, and for the different memory regions. It can be used to 
> help determine good values for spark.executor.memory, spark.driver.memory, 
> spark.memory.fraction, and spark.memory.storageFraction.
> Add an ExecutorMetrics class, with jvmUsedMemory, executionMemory, and 
> storageMemory. This will track the memory usage at the executor level. The 
> new ExecutorMetrics will be sent by executors to the driver as part of the 
> Heartbeat. A heartbeat will be added for the driver as well, to collect these 
> metrics for the driver.
> Modify the EventLoggingListener to log ExecutorMetricsUpdate events if there 
> is a new peak value for one of the memory metrics for an executor and stage. 
> Only the ExecutorMetrics will be logged, and not the TaskMetrics, to minimize 
> additional logging. Analysis on a set of sample applications showed an 
> increase of 0.25% in the size of the Spark history log, with this approach.
> Modify the AppStatusListener to collect snapshots of peak values for each 
> memory metric. Each snapshot has the time, jvmUsedMemory, executionMemory and 
> storageMemory, and list of active stages.
> Add the new memory metrics (snapshots of peak values for each memory metric) 
> to the executors REST API.
> This is a subtask for SPARK-23206. Please refer to the design doc for that 
> ticket for more details.
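
A minimal sketch of the shape described above; the field names follow the ticket, while the sampling and heartbeat plumbing are only indicated, not real Spark internals:

{code:scala}
import java.lang.management.ManagementFactory

// Executor-level memory metrics collected on each executor (and the driver)
// and piggybacked on the heartbeat.
case class ExecutorMetrics(
    timestamp: Long,
    jvmUsedMemory: Long,     // bytes currently used on the JVM heap
    executionMemory: Long,   // execution pool usage from the memory manager
    storageMemory: Long) {   // storage pool usage from the memory manager
  def unifiedMemory: Long = executionMemory + storageMemory
}

object ExecutorMetrics {
  // JVM heap usage is sampled directly; execution/storage values would come
  // from the unified memory manager in a real implementation.
  def sample(executionMemory: Long, storageMemory: Long): ExecutorMetrics = {
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
    ExecutorMetrics(System.currentTimeMillis(), heap.getUsed, executionMemory, storageMemory)
  }
}
{code}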






[jira] [Commented] (SPARK-22239) User-defined window functions with pandas udf

2018-03-21 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408639#comment-16408639
 ] 

Li Jin commented on SPARK-22239:


I am looking into this. I will start with the unbounded window first, i.e., 
Window.partitionBy(df.id). Growing/shrinking/moving windows are much more 
complicated because we don't want to send each window to the Python worker. I 
will try to solve the simple case (unbounded window) first.
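
For reference, the two window shapes being distinguished, expressed with Spark's Scala window API (the pandas_udf work itself is on the Python side):

{code:scala}
import org.apache.spark.sql.expressions.Window

// Unbounded window: the frame is the whole partition, so a UDF can be
// evaluated once per group; this is the simpler case tackled first.
val unbounded = Window.partitionBy("id")

// Bounded (growing/shrinking/moving) window: each row has its own frame,
// which is why shipping every frame to a Python worker would be costly.
val moving = Window.partitionBy("id").orderBy("time").rangeBetween(-200, 0)
{code}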

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Priority: Major
>
> Window functions are another place we can benefit from vectorized UDFs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}






[jira] [Comment Edited] (SPARK-22239) User-defined window functions with pandas udf

2018-03-21 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408639#comment-16408639
 ] 

Li Jin edited comment on SPARK-22239 at 3/21/18 9:55 PM:
-

I am looking into this. I will start with unbounded window first, i.e., 
Window.partitionBy(df.id). Growing/Shrinking/Moving windows are much more 
complicated because we don't want to send each window to python worker. I will 
try to solve the simple case (unbounded window) first.


was (Author: icexelloss):
I am looking into this. I will start with unbounded window first, i.e., 
Window.partitionBy(df.id). Growing/Shrinking/Moving windows are much more 
complicated because we don't want to send the each window to python worker. I 
will try to solve the simple case (unbounded window) first.

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Priority: Major
>
> Window functions are another place we can benefit from vectorized UDFs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}






[jira] [Commented] (SPARK-20964) Make some keywords reserved along with the ANSI/SQL standard

2018-03-21 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408578#comment-16408578
 ] 

Frederick Reiss commented on SPARK-20964:
-

SQL2016 prohibits the use of reserved words in unquoted identifiers. See 
[http://jtc1sc32.org/doc/N2301-2350/32N2311T-text_for_ballot-CD_9075-2.pdf], 
page 176:
 _26) The case-normal form of <identifier body> shall not be equal, according 
to the comparison rules in Subclause 8.2, “<comparison predicate>”, to any 
<reserved word> (with every letter that is a lower-case letter replaced by the 
corresponding upper-case letter or letters), treated as the repetition of a 
<character string literal> that specifies a <character set specification> of 
SQL_IDENTIFIER._

That said, strict enforcement of that rule will break existing queries every 
time a new reserved word is added. This factor in turn leads users to start 
quoting *every* identifier as a defensive measure. You end up with SQL like 
this:

{{SELECT "R"."A", "S"."B"}}
{{ FROM "MYSCHEMA"."R" as "R", "MYSCHEMA"."S" as "S"}}

DB2 allows reserved words to be used unquoted; for example, {{SELECT * FROM 
SYSIBM.SYSTABLES WHERE, WHERE}} (see 
[https://www.ibm.com/support/knowledgecenter/en/SSEPEK_11.0.0/sqlref/src/tpc/db2z_reservedwords.html]),
 which lets users use a more literate form of SQL.

> Make some keywords reserved along with the ANSI/SQL standard
> 
>
> Key: SPARK-20964
> URL: https://issues.apache.org/jira/browse/SPARK-20964
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current Spark has many non-reserved words that are essentially reserved 
> in the ANSI/SQL standard 
> (http://developer.mimer.se/validator/sql-reserved-words.tml). 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L709
> This is because there are many datasources (for instance twitter4j) that 
> unfortunately use reserved keywords for column names (See [~hvanhovell]'s 
> comments: https://github.com/apache/spark/pull/18079#discussion_r118842186). 
> We might fix this issue in future major releases.






[jira] [Resolved] (SPARK-18580) Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream

2018-03-21 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-18580.

  Resolution: Fixed
Target Version/s: 2.4.0

> Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream
> ---
>
> Key: SPARK-18580
> URL: https://issues.apache.org/jira/browse/SPARK-18580
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Oleg Muravskiy
>Assignee: Oleksandr Konopko
>Priority: Major
>
> Currently the `spark.streaming.kafka.maxRatePerPartition` is used as the 
> initial rate when the backpressure is enabled. This is too exhaustive for the 
> application while it still warms up.
> This is similar to SPARK-11627, applying the solution provided there to 
> DirectKafkaInputDStream.
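
For orientation, the configuration keys involved (the values are arbitrary examples):

{code:scala}
import org.apache.spark.SparkConf

// With backpressure enabled, the rate for the first batches should come from
// spark.streaming.backpressure.initialRate rather than from the per-partition
// maximum; that is the change this ticket applies to the Kafka direct stream.
val conf = new SparkConf()
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.backpressure.initialRate", "100")        // records/sec while warming up
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")     // hard per-partition cap
{code}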






[jira] [Assigned] (SPARK-18580) Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream

2018-03-21 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger reassigned SPARK-18580:
--

Assignee: Oleksandr Konopko

> Use spark.streaming.backpressure.initialRate in DirectKafkaInputDStream
> ---
>
> Key: SPARK-18580
> URL: https://issues.apache.org/jira/browse/SPARK-18580
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Oleg Muravskiy
>Assignee: Oleksandr Konopko
>Priority: Major
>
> Currently the `spark.streaming.kafka.maxRatePerPartition` is used as the 
> initial rate when the backpressure is enabled. This is too exhaustive for the 
> application while it still warms up.
> This is similar to SPARK-11627, applying the solution provided there to 
> DirectKafkaInputDStream.






[jira] [Comment Edited] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-21 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408316#comment-16408316
 ] 

Darek edited comment on SPARK-23534 at 3/21/18 6:05 PM:


It seems that the Hive upgrade to 2.3.2 
([SPARK-23710|https://issues.apache.org/jira/browse/SPARK-23710]) is almost 
done; once it is done, the Hadoop 3.0 build will hopefully work.


was (Author: bidek):
It seems that Hive upgrade to 2.3.2 is almost done ( 
https://issues.apache.org/jira/browse/SPARK-23710 ), once it's done, hopefully 
Hadoop 3.0 will build.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).






[jira] [Comment Edited] (SPARK-23710) Upgrade Hive to 2.3.2

2018-03-21 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408324#comment-16408324
 ] 

Darek edited comment on SPARK-23710 at 3/21/18 6:04 PM:


Can we merge PR 20659 into master? It's blocking a lot of tickets. Thanks!
[SPARK-23534|https://issues.apache.org/jira/browse/SPARK-23534]
[SPARK-18673|https://issues.apache.org/jira/browse/SPARK-18673]


was (Author: bidek):
Can we merge PR 20659 into master? it's blocking a lot of tickets Thanks

[#SPARK-23534]
[#SPARK-18673]

> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h1. Mainly changes
>  * Maven dependency:
>  hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change 
> {{hive.classifier}} to {{core}}
>  calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
>  datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
>  remove {{orc.classifier}}, it means orc use the {{hive.storage.api}}, see: 
> ORC-174
>  add new dependency {{avatica}} and {{hive.storage.api}}
>  * ORC compatibility changes:
>  OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
> OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala
>  * hive-thriftserver java file update:
>  update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2
>  update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
> hive 2.3.2
>  * TestSuite should update:
> ||TestSuite||Reason||
> |StatisticsSuite|HIVE-16098|
> |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
> |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
> SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
> hive-hcatalog-core-2.3.2.jar|
> |SparkExecuteStatementOperationSuite|Interface changed from 
> org.apache.hive.service.cli.Type.NULL_TYPE to 
> org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
> |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
> change to com.esotericsoftware.kryo.Kryo|
> |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
> to Seq("1.100\t1", "2.100\t2")|
> |HiveOrcFilterSuite|Result format changed|
> |HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
> |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
> org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
> org.datanucleus.identity.OID|
>  * Other changes:
> Close hive schema verification:  
> [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
>  and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
> Update 
> [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
> Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't 
> connect to Hive 1.x metastore, We should use 
> {{HiveMetaStoreClient.getDelegationToken}} instead of 
> {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}}
> All changes can be found at 
> [PR-20659|https://github.com/apache/spark/pull/20659].






[jira] [Comment Edited] (SPARK-23710) Upgrade Hive to 2.3.2

2018-03-21 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408324#comment-16408324
 ] 

Darek edited comment on SPARK-23710 at 3/21/18 6:00 PM:


Can we merge PR 20659 into master? It's blocking a lot of tickets. Thanks!

[#SPARK-23534]
[#SPARK-18673]


was (Author: bidek):
Can we merge PR 20659 into master? it's blocking a lot of tickets Thanks

https://issues.apache.org/jira/browse/SPARK-23534
https://issues.apache.org/jira/browse/SPARK-18673

> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h1. Mainly changes
>  * Maven dependency:
>  hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change 
> {{hive.classifier}} to {{core}}
>  calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
>  datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
>  remove {{orc.classifier}}, it means orc use the {{hive.storage.api}}, see: 
> ORC-174
>  add new dependency {{avatica}} and {{hive.storage.api}}
>  * ORC compatibility changes:
>  OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
> OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala
>  * hive-thriftserver java file update:
>  update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2
>  update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
> hive 2.3.2
>  * TestSuite should update:
> ||TestSuite||Reason||
> |StatisticsSuite|HIVE-16098|
> |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
> |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
> SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
> hive-hcatalog-core-2.3.2.jar|
> |SparkExecuteStatementOperationSuite|Interface changed from 
> org.apache.hive.service.cli.Type.NULL_TYPE to 
> org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
> |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
> change to com.esotericsoftware.kryo.Kryo|
> |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
> to Seq("1.100\t1", "2.100\t2")|
> |HiveOrcFilterSuite|Result format changed|
> |HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
> |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
> org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
> org.datanucleus.identity.OID|
>  * Other changes:
> Close hive schema verification:  
> [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
>  and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
> Update 
> [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
> Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't 
> connect to Hive 1.x metastore, We should use 
> {{HiveMetaStoreClient.getDelegationToken}} instead of 
> {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}}
> All changes can be found at 
> [PR-20659|https://github.com/apache/spark/pull/20659].






[jira] [Commented] (SPARK-23710) Upgrade Hive to 2.3.2

2018-03-21 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408324#comment-16408324
 ] 

Darek commented on SPARK-23710:
---

Can we merge PR 20659 into master? It's blocking a lot of tickets. Thanks!

https://issues.apache.org/jira/browse/SPARK-23534
https://issues.apache.org/jira/browse/SPARK-18673

> Upgrade Hive to 2.3.2
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> h1. Mainly changes
>  * Maven dependency:
>  hive.version from {{1.2.1.spark2}} to {{2.3.2}} and change 
> {{hive.classifier}} to {{core}}
>  calcite.version from {{1.2.0-incubating}} to {{1.10.0}}
>  datanucleus-core.version from {{3.2.10}} to {{4.1.17}}
>  remove {{orc.classifier}}, it means orc use the {{hive.storage.api}}, see: 
> ORC-174
>  add new dependency {{avatica}} and {{hive.storage.api}}
>  * ORC compatibility changes:
>  OrcColumnVector.java, OrcColumnarBatchReader.java, OrcDeserializer.scala, 
> OrcFilters.scala, OrcSerializer.scala, OrcFilterSuite.scala
>  * hive-thriftserver java file update:
>  update {{sql/hive-thriftserver/if/TCLIService.thrift}} to hive 2.3.2
>  update {{sql/hive-thriftserver/src/main/java/org/apache/hive/service/*}} to 
> hive 2.3.2
>  * TestSuite should update:
> ||TestSuite||Reason||
> |StatisticsSuite|HIVE-16098|
> |SessionCatalogSuite|Similar to [VersionsSuite.scala#L427|#L427]|
> |CliSuite, HiveThriftServer2Suites, HiveSparkSubmitSuite, HiveQuerySuite, 
> SQLQuerySuite|Update hive-hcatalog-core-0.13.1.jar to 
> hive-hcatalog-core-2.3.2.jar|
> |SparkExecuteStatementOperationSuite|Interface changed from 
> org.apache.hive.service.cli.Type.NULL_TYPE to 
> org.apache.hadoop.hive.serde2.thrift.Type.NULL_TYPE|
> |ClasspathDependenciesSuite|org.apache.hive.com.esotericsoftware.kryo.Kryo 
> change to com.esotericsoftware.kryo.Kryo|
> |HiveMetastoreCatalogSuite|Result format changed from Seq("1.1\t1", "2.1\t2") 
> to Seq("1.100\t1", "2.100\t2")|
> |HiveOrcFilterSuite|Result format changed|
> |HiveDDLSuite|Remove $ (This change needs to be reconsidered)|
> |HiveExternalCatalogVersionsSuite| java.lang.ClassCastException: 
> org.datanucleus.identity.DatastoreIdImpl cannot be cast to 
> org.datanucleus.identity.OID|
>  * Other changes:
> Close hive schema verification:  
> [HiveClientImpl.scala#L251|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L251]
>  and 
> [HiveExternalCatalog.scala#L58|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala#L58]
> Update 
> [IsolatedClientLoader.scala#L189-L192|https://github.com/wangyum/spark/blob/75e4cc9e80f85517889e87a35da117bc361f2ff3/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L189-L192]
> Because Hive 2.3.2's {{org.apache.hadoop.hive.ql.metadata.Hive}} can't 
> connect to Hive 1.x metastore, We should use 
> {{HiveMetaStoreClient.getDelegationToken}} instead of 
> {{Hive.getDelegationToken}} and update {{HiveClientImpl.toHiveTable}}
> All changes can be found at 
> [PR-20659|https://github.com/apache/spark/pull/20659].






[jira] [Comment Edited] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-21 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408316#comment-16408316
 ] 

Darek edited comment on SPARK-23534 at 3/21/18 5:45 PM:


It seems that Hive upgrade to 2.3.2 is almost done ( 
https://issues.apache.org/jira/browse/SPARK-23710 ), once it's done, hopefully 
Hadoop 3.0 will build.


was (Author: bidek):
It seems that Hive upgrade to 2.3.2 is almost done 
(https://issues.apache.org/jira/browse/SPARK-23710), once it's done, hopefully 
Hadoop 3.0 will build.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).






[jira] [Commented] (SPARK-23534) Spark run on Hadoop 3.0.0

2018-03-21 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408316#comment-16408316
 ] 

Darek commented on SPARK-23534:
---

It seems that the Hive upgrade to 2.3.2 is almost done 
(https://issues.apache.org/jira/browse/SPARK-23710); once it is done, the 
Hadoop 3.0 build will hopefully work.

> Spark run on Hadoop 3.0.0
> -
>
> Key: SPARK-23534
> URL: https://issues.apache.org/jira/browse/SPARK-23534
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Major
>
> Major Hadoop vendors already/will step in Hadoop 3.0. So we should also make 
> sure Spark can run with Hadoop 3.0. This Jira tracks the work to make Spark 
> run on Hadoop 3.0.
> The work includes:
>  # Add a Hadoop 3.0.0 new profile to make Spark build-able with Hadoop 3.0.
>  # Test to see if there's dependency issues with Hadoop 3.0.
>  # Investigating the feasibility to use shaded client jars (HADOOP-11804).






[jira] [Resolved] (SPARK-23288) Incorrect number of written records in structured streaming

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23288.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20745
[https://github.com/apache/spark/pull/20745]

> Incorrect number of written records in structured streaming
> ---
>
> Key: SPARK-23288
> URL: https://issues.apache.org/jira/browse/SPARK-23288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuriy Bondaruk
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: Metrics, metrics
> Fix For: 2.4.0, 2.3.1
>
>
> I'm using SparkListener.onTaskEnd() to capture input and output metrics but 
> it seems that number of written records 
> ('taskEnd.taskMetrics().outputMetrics().recordsWritten()') is incorrect. Here 
> is my stream construction:
>  
> {code:java}
> StreamingQuery writeStream = session
> .readStream()
> .schema(RecordSchema.fromClass(TestRecord.class))
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv(inputFolder.getRoot().toPath().toString())
> .as(Encoders.bean(TestRecord.class))
> .flatMap(
> ((FlatMapFunction<TestRecord, TestVendingRecord>) (u) -> {
> List<TestVendingRecord> resultIterable = new ArrayList<>();
> try {
> TestVendingRecord result = transformer.convert(u);
> resultIterable.add(result);
> } catch (Throwable t) {
> System.err.println("Ooops");
> t.printStackTrace();
> }
> return resultIterable.iterator();
> }),
> Encoders.bean(TestVendingRecord.class))
> .writeStream()
> .outputMode(OutputMode.Append())
> .format("parquet")
> .option("path", outputFolder.getRoot().toPath().toString())
> .option("checkpointLocation", 
> checkpointFolder.getRoot().toPath().toString())
> .start();
> writeStream.processAllAvailable();
> writeStream.stop();
> {code}
> Tested it with one good and one bad (throwing exception in 
> transformer.convert(u)) input records and it produces following metrics:
>  
> {code:java}
> (TestMain.java:onTaskEnd(73)) - ---status--> SUCCESS
> (TestMain.java:onTaskEnd(75)) - ---recordsWritten--> 0
> (TestMain.java:onTaskEnd(76)) - ---recordsRead-> 2
> (TestMain.java:onTaskEnd(83)) - taskEnd.taskInfo().accumulables():
> (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max)
> (TestMain.java:onTaskEnd(85)) - value =  323
> (TestMain.java:onTaskEnd(84)) - name = number of output rows
> (TestMain.java:onTaskEnd(85)) - value =  2
> (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max)
> (TestMain.java:onTaskEnd(85)) - value =  364
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.recordsRead
> (TestMain.java:onTaskEnd(85)) - value =  2
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.bytesRead
> (TestMain.java:onTaskEnd(85)) - value =  157
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.resultSerializationTime
> (TestMain.java:onTaskEnd(85)) - value =  3
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.resultSize
> (TestMain.java:onTaskEnd(85)) - value =  2396
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorCpuTime
> (TestMain.java:onTaskEnd(85)) - value =  633807000
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorRunTime
> (TestMain.java:onTaskEnd(85)) - value =  683
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.executorDeserializeCpuTime
> (TestMain.java:onTaskEnd(85)) - value =  55662000
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.executorDeserializeTime
> (TestMain.java:onTaskEnd(85)) - value =  58
> (TestMain.java:onTaskEnd(89)) - input records 2
> Streaming query made progress: {
>   "id" : "1231f9cb-b2e8-4d10-804d-73d7826c1cb5",
>   "runId" : "bd23b60c-93f9-4e17-b3bc-55403edce4e7",
>   "name" : null,
>   "timestamp" : "2018-01-26T14:44:05.362Z",
>   "numInputRows" : 2,
>   "processedRowsPerSecond" : 0.8163265306122448,
>   "durationMs" : {
> "addBatch" : 1994,
> "getBatch" : 126,
> "getOffset" : 52,
> "queryPlanning" : 220,
> "triggerExecution" : 2450,
> "walCommit" : 41
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : 
> "FileStreamSource[file:/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5_/T/junit3661035412295337071]",
> "startOffset" : null,
> "endOffset" : {
>   "logOffset" : 0
> },
>

[jira] [Assigned] (SPARK-23288) Incorrect number of written records in structured streaming

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23288:
---

Assignee: Gabor Somogyi

> Incorrect number of written records in structured streaming
> ---
>
> Key: SPARK-23288
> URL: https://issues.apache.org/jira/browse/SPARK-23288
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Yuriy Bondaruk
>Assignee: Gabor Somogyi
>Priority: Major
>  Labels: Metrics, metrics
> Fix For: 2.3.1, 2.4.0
>
>
> I'm using SparkListener.onTaskEnd() to capture input and output metrics but 
> it seems that number of written records 
> ('taskEnd.taskMetrics().outputMetrics().recordsWritten()') is incorrect. Here 
> is my stream construction:
>  
> {code:java}
> StreamingQuery writeStream = session
> .readStream()
> .schema(RecordSchema.fromClass(TestRecord.class))
> .option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
> .option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
> .csv(inputFolder.getRoot().toPath().toString())
> .as(Encoders.bean(TestRecord.class))
> .flatMap(
> ((FlatMapFunction<TestRecord, TestVendingRecord>) (u) -> {
> List<TestVendingRecord> resultIterable = new ArrayList<>();
> try {
> TestVendingRecord result = transformer.convert(u);
> resultIterable.add(result);
> } catch (Throwable t) {
> System.err.println("Ooops");
> t.printStackTrace();
> }
> return resultIterable.iterator();
> }),
> Encoders.bean(TestVendingRecord.class))
> .writeStream()
> .outputMode(OutputMode.Append())
> .format("parquet")
> .option("path", outputFolder.getRoot().toPath().toString())
> .option("checkpointLocation", 
> checkpointFolder.getRoot().toPath().toString())
> .start();
> writeStream.processAllAvailable();
> writeStream.stop();
> {code}
> Tested it with one good and one bad (throwing exception in 
> transformer.convert(u)) input records and it produces following metrics:
>  
> {code:java}
> (TestMain.java:onTaskEnd(73)) - ---status--> SUCCESS
> (TestMain.java:onTaskEnd(75)) - ---recordsWritten--> 0
> (TestMain.java:onTaskEnd(76)) - ---recordsRead-> 2
> (TestMain.java:onTaskEnd(83)) - taskEnd.taskInfo().accumulables():
> (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max)
> (TestMain.java:onTaskEnd(85)) - value =  323
> (TestMain.java:onTaskEnd(84)) - name = number of output rows
> (TestMain.java:onTaskEnd(85)) - value =  2
> (TestMain.java:onTaskEnd(84)) - name = duration total (min, med, max)
> (TestMain.java:onTaskEnd(85)) - value =  364
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.recordsRead
> (TestMain.java:onTaskEnd(85)) - value =  2
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.input.bytesRead
> (TestMain.java:onTaskEnd(85)) - value =  157
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.resultSerializationTime
> (TestMain.java:onTaskEnd(85)) - value =  3
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.resultSize
> (TestMain.java:onTaskEnd(85)) - value =  2396
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorCpuTime
> (TestMain.java:onTaskEnd(85)) - value =  633807000
> (TestMain.java:onTaskEnd(84)) - name = internal.metrics.executorRunTime
> (TestMain.java:onTaskEnd(85)) - value =  683
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.executorDeserializeCpuTime
> (TestMain.java:onTaskEnd(85)) - value =  55662000
> (TestMain.java:onTaskEnd(84)) - name = 
> internal.metrics.executorDeserializeTime
> (TestMain.java:onTaskEnd(85)) - value =  58
> (TestMain.java:onTaskEnd(89)) - input records 2
> Streaming query made progress: {
>   "id" : "1231f9cb-b2e8-4d10-804d-73d7826c1cb5",
>   "runId" : "bd23b60c-93f9-4e17-b3bc-55403edce4e7",
>   "name" : null,
>   "timestamp" : "2018-01-26T14:44:05.362Z",
>   "numInputRows" : 2,
>   "processedRowsPerSecond" : 0.8163265306122448,
>   "durationMs" : {
> "addBatch" : 1994,
> "getBatch" : 126,
> "getOffset" : 52,
> "queryPlanning" : 220,
> "triggerExecution" : 2450,
> "walCommit" : 41
>   },
>   "stateOperators" : [ ],
>   "sources" : [ {
> "description" : 
> "FileStreamSource[file:/var/folders/4w/zks_kfls2s3glmrj3f725p7hllyb5_/T/junit3661035412295337071]",
> "startOffset" : null,
> "endOffset" : {
>   "logOffset" : 0
> },
> "numInputRows" : 2,
> "processedRowsPerSecond" : 0.8163265306122448
>   } ],
>   "sink" : {
> "description" : 
> 

[jira] [Resolved] (SPARK-23264) Support interval values without INTERVAL clauses

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23264.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20872
[https://github.com/apache/spark/pull/20872]

> Support interval values without INTERVAL clauses
> 
>
> Key: SPARK-23264
> URL: https://issues.apache.org/jira/browse/SPARK-23264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0, 2.3.1
>
>
> The master currently cannot parse a SQL query below;
> {code:java}
> SELECT cast('2017-08-04' as date) + 1 days;
> {code}
> Since other dbms-like systems support this syntax (e.g., hive and mysql), it 
> might help to support in spark.
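
To make the change concrete, both spellings side by side (illustrative only; spark is an active SparkSession):

{code:scala}
spark.sql("SELECT cast('2017-08-04' as date) + INTERVAL 1 days")  // explicit form, already parsed
spark.sql("SELECT cast('2017-08-04' as date) + 1 days")           // bare form added by this ticket
{code}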






[jira] [Assigned] (SPARK-23264) Support interval values without INTERVAL clauses

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23264:
---

Assignee: Takeshi Yamamuro

> Support interval values without INTERVAL clauses
> 
>
> Key: SPARK-23264
> URL: https://issues.apache.org/jira/browse/SPARK-23264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> The master currently cannot parse the SQL query below:
> {code:java}
> SELECT cast('2017-08-04' as date) + 1 days;
> {code}
> Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it 
> might help to support it in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23577) Supports line separator for text datasource

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23577:
---

Assignee: Hyukjin Kwon

> Supports line separator for text datasource
> ---
>
> Key: SPARK-23577
> URL: https://issues.apache.org/jira/browse/SPARK-23577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Same as its parent, but this JIRA is specific to the text datasource.
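
A rough illustration of what a line-separator option for the text datasource could look like, assuming it is exposed as {{lineSep}} (mirroring the JSON datasource option); the paths are hypothetical.

{code:scala}
// A minimal sketch, assuming the option name "lineSep" and hypothetical paths.
import org.apache.spark.sql.SparkSession

object LineSepExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("linesep-example").getOrCreate()

    // Read records delimited by '|' instead of the default '\n'.
    val df = spark.read.option("lineSep", "|").text("/tmp/records.txt")

    // Write them back out with the same custom separator.
    df.write.option("lineSep", "|").text("/tmp/records-out")

    spark.stop()
  }
}
{code}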



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23577) Supports line separator for text datasource

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23577.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20727
[https://github.com/apache/spark/pull/20727]

> Supports line separator for text datasource
> ---
>
> Key: SPARK-23577
> URL: https://issues.apache.org/jira/browse/SPARK-23577
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> Same as its parent, but this JIRA is specific to the text datasource.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10884) Support prediction on single instance for regression and classification related models

2018-03-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10884.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19381
[https://github.com/apache/spark/pull/19381]

> Support prediction on single instance for regression and classification 
> related models
> --
>
> Key: SPARK-10884
> URL: https://issues.apache.org/jira/browse/SPARK-10884
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Weichen Xu
>Priority: Major
>  Labels: 2.2.0
> Fix For: 2.4.0
>
>
> Support prediction on a single instance for regression and classification 
> related models (i.e., PredictionModel, ClassificationModel, and their 
> subclasses).
> Add corresponding test cases.
> See parent issue for more details.
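
A hedged sketch of the intended usage, assuming the single-instance {{predict(features)}} method is exposed on trained models as proposed here; the libsvm path and the feature vector are placeholders.

{code:scala}
// A minimal sketch, assuming predict(features: Vector) becomes callable on the
// trained model; the data path below is a placeholder.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object SinglePredictExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("single-predict").getOrCreate()
    val training = spark.read.format("libsvm").load("data/sample_linear_regression_data.txt")
    val model = new LinearRegression().fit(training)

    // Single-instance prediction, no DataFrame needed; the vector length must
    // match the number of features in the training data.
    val prediction = model.predict(Vectors.dense(0.5, 1.2, -0.3))
    println(s"prediction for one instance: $prediction")

    spark.stop()
  }
}
{code}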



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23764) Utils.tryWithSafeFinally swallows fatal exceptions in the finally block

2018-03-21 Thread Yavgeni Hotimsky (JIRA)
Yavgeni Hotimsky created SPARK-23764:


 Summary: Utils.tryWithSafeFinally swallows fatal exceptions in the 
finally block
 Key: SPARK-23764
 URL: https://issues.apache.org/jira/browse/SPARK-23764
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Yavgeni Hotimsky


This is from my driver stdout:

{noformat}
[dag-scheduler-event-loop] WARN  org.apache.spark.util.Utils - Suppressing 
exception in finally: Java heap space
java.lang.OutOfMemoryError: Java heap space
   at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
   at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
   at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
   at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:271)
   at 
org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
   at 
org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
   at 
net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205)
   at 
net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158)
   at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
   at 
java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
   at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
   at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
   at 
org.apache.spark.serializer.JavaSerializationStream.close(JavaSerializer.scala:57)
   at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$1.apply$mcV$sp(TorrentBroadcast.scala:278)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
   at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:277)
   at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
   at 
org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
   at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
   at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1488)
   at 
org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1006)
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:933)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:932)
   at scala.collection.immutable.List.foreach(List.scala:381)
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:932)
   at 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1695)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{noformat}

After this, my driver stayed up but all my streaming queries stopped triggering. 
I would of course expect the driver to terminate in this case.

This util shouldn't suppress fatal exceptions in the finally block. The fix is 
as simple as replacing Throwable with NonFatal(e) in the finally block. The util 
tryWithSafeFinallyAndFailureCallbacks should behave the same way.
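
A minimal sketch of the suggested guard, not the actual Spark code: only non-fatal exceptions from the finally block are suppressed, while errors such as OutOfMemoryError propagate.

{code:scala}
// A minimal sketch of the proposed behaviour, not Spark's implementation.
import scala.util.control.NonFatal

def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
  var original: Throwable = null
  try {
    block
  } catch {
    case t: Throwable =>
      original = t
      throw t
  } finally {
    try {
      finallyBlock
    } catch {
      // Suppress only non-fatal secondary failures; fatal ones (e.g. OOM) propagate.
      case NonFatal(t) if original != null =>
        original.addSuppressed(t)
    }
  }
}
{code}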



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22744) Cannot get the submit hostname of application

2018-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22744:


Assignee: (was: Apache Spark)

> Cannot get the submit hostname of application
> -
>
> Key: SPARK-22744
> URL: https://issues.apache.org/jira/browse/SPARK-22744
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Minor
>
> In MapReduce, we can get the submit hostname by checking the value of the 
> configuration {{mapreduce.job.submithostname}}. It can help the infra team or 
> other people to do support work and debugging. But in Spark, this information is 
> not included.
> In a big company, the Spark infra team can get the submit host list of all 
> applications in a cluster. It can help them to control submissions, unify 
> upgrades, and find bad jobs and where they were submitted.
> Then, I suggest setting the submit hostname internally in Spark under the name 
> {{spark.submit.hostname}}; Spark users cannot set it (any manual setting will 
> be overridden).
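
A purely hypothetical sketch of the proposal (none of this exists in Spark today): the submitting side records its own hostname under the proposed key {{spark.submit.hostname}} and overrides any user-supplied value.

{code:scala}
// Hypothetical sketch only; "spark.submit.hostname" is the proposed key, not an
// existing Spark configuration.
import java.net.InetAddress
import org.apache.spark.SparkConf

val conf = new SparkConf()

// Always overwrite with the real submitting host, ignoring manual settings.
conf.set("spark.submit.hostname", InetAddress.getLocalHost.getHostName)

// Infra tooling could later read the value back from the application's config.
println(conf.get("spark.submit.hostname"))
{code}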



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22744) Cannot get the submit hostname of application

2018-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22744:


Assignee: Apache Spark

> Cannot get the submit hostname of application
> -
>
> Key: SPARK-22744
> URL: https://issues.apache.org/jira/browse/SPARK-22744
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Minor
>
> In MapReduce, we can get the submit hostname by checking the value of the 
> configuration {{mapreduce.job.submithostname}}. It can help the infra team or 
> other people to do support work and debugging. But in Spark, this information is 
> not included.
> In a big company, the Spark infra team can get the submit host list of all 
> applications in a cluster. It can help them to control submissions, unify 
> upgrades, and find bad jobs and where they were submitted.
> Then, I suggest setting the submit hostname internally in Spark under the name 
> {{spark.submit.hostname}}; Spark users cannot set it (any manual setting will 
> be overridden).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23568) Silhouette should get number of features from metadata if available

2018-03-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23568.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20719
[https://github.com/apache/spark/pull/20719]

> Silhouette should get number of features from metadata if available
> ---
>
> Key: SPARK-23568
> URL: https://issues.apache.org/jira/browse/SPARK-23568
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> In Silhouette computation we need to know the number of features. This is 
> done by taking the first row and checking the size of the features vector. 
> Although this works fine, if the number of attributes is present in the metadata 
> of the column, we can avoid the additional job that is generated by calling 
> `first`.
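
A minimal sketch of the idea, assuming a DataFrame with a vector column named "features": prefer the attribute count stored in the column metadata and fall back to a {{first()}} job only when the metadata is missing.

{code:scala}
// A minimal sketch; `dataset` and the "features" column name are assumptions.
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

def numFeatures(dataset: DataFrame, featuresCol: String = "features"): Int = {
  val group = AttributeGroup.fromStructField(dataset.schema(featuresCol))
  if (group.size >= 0) {
    group.size                                                  // metadata present: no job
  } else {
    dataset.select(featuresCol).first().getAs[Vector](0).size   // fallback: runs a job
  }
}
{code}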



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23568) Silhouette should get number of features from metadata if available

2018-03-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23568:
-

Assignee: Marco Gaido

> Silhouette should get number of features from metadata if available
> ---
>
> Key: SPARK-23568
> URL: https://issues.apache.org/jira/browse/SPARK-23568
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> In Silhouette computation we need to know the number of features. This is 
> done by taking the first row and checking the size of the features vector. 
> Although this works fine, if the number of attributes is present in the metadata 
> of the column, we can avoid the additional job that is generated by calling 
> `first`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23746) HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF

2018-03-21 Thread Izhar Ahmed (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izhar Ahmed updated SPARK-23746:

Affects Version/s: 1.6.3

> HashMap UserDefinedType giving cast exception in Spark 1.6.2 while 
> implementing UDAF
> 
>
> Key: SPARK-23746
> URL: https://issues.apache.org/jira/browse/SPARK-23746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Izhar Ahmed
>Priority: Major
>
> I am trying to use a custom HashMap implementation as a UserDefinedType instead 
> of MapType in Spark. The code is *working fine in Spark 1.5.2* but gives a 
> {{java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 
> cannot be cast to org.apache.spark.sql.catalyst.util.MapData}} *exception in 
> Spark 1.6.2*.
> The code:
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import scala.collection.immutable.HashMap
> class Test extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(Array(StructField("input", StringType)))
>   def bufferSchema = StructType(Array(StructField("top_n", 
> CustomHashMapType)))
>   def dataType: DataType = CustomHashMapType
>   def deterministic = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = HashMap.empty[String, Long]
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> val buff0 = buffer.getAs[HashMap[String, Long]](0)
> buffer(0) = buff0.updated("test", buff0.getOrElse("test", 0L) + 1L)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.
>   getAs[HashMap[String, Long]](0)
>   .merged(buffer2.getAs[HashMap[String, Long]](0))({ case ((k, v1), (_, 
> v2)) => (k, v1 + v2) })
>   }
>   def evaluate(buffer: Row): Any = {
> buffer(0)
>   }
> }
> private case object CustomHashMapType extends UserDefinedType[HashMap[String, 
> Long]] {
>   override def sqlType: DataType = MapType(StringType, LongType)
>   override def serialize(obj: Any): Map[String, Long] =
> obj.asInstanceOf[Map[String, Long]]
>   override def deserialize(datum: Any): HashMap[String, Long] = {
> datum.asInstanceOf[Map[String, Long]] ++: HashMap.empty[String, Long]
>   }
>   override def userClass: Class[HashMap[String, Long]] = 
> classOf[HashMap[String, Long]]
> }
> {code}
> The wrapper Class to run the UDAF:-
> {code:scala}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object TestJob {
>   def main(args: Array[String]): Unit = {
> val conf = new 
> SparkConf().setMaster("local[4]").setAppName("DataStatsExecution")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> val df = sc.parallelize(Seq(1,2,3,4)).toDF("col")
> val udaf = new Test()
> val outdf = df.agg(udaf(df("col")))
> outdf.show
>   }
> }
> {code}
> Stacktrace:-
> {code:java}
> Caused by: java.lang.ClassCastException: 
> scala.collection.immutable.HashMap$HashMap1 cannot be cast to 
> org.apache.spark.sql.catalyst.util.MapData
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getMap(rows.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getMap(rows.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getMap(JoinedRow.scala:115)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:345)
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:344)
> at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:154)
> at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at 
> 

[jira] [Commented] (SPARK-23746) HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF

2018-03-21 Thread Izhar Ahmed (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407894#comment-16407894
 ] 

Izhar Ahmed commented on SPARK-23746:
-

[~hyukjin.kwon] I get the same exception when testing in *Spark 1.6.3*.

Also, the _UserDefinedType_ API is private in Spark 2.x, so it can't be tested there.

> HashMap UserDefinedType giving cast exception in Spark 1.6.2 while 
> implementing UDAF
> 
>
> Key: SPARK-23746
> URL: https://issues.apache.org/jira/browse/SPARK-23746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3
>Reporter: Izhar Ahmed
>Priority: Major
>
> I am trying to use a custom HashMap implementation as a UserDefinedType instead 
> of MapType in Spark. The code is *working fine in Spark 1.5.2* but gives a 
> {{java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 
> cannot be cast to org.apache.spark.sql.catalyst.util.MapData}} *exception in 
> Spark 1.6.2*.
> The code:
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import scala.collection.immutable.HashMap
> class Test extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(Array(StructField("input", StringType)))
>   def bufferSchema = StructType(Array(StructField("top_n", 
> CustomHashMapType)))
>   def dataType: DataType = CustomHashMapType
>   def deterministic = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = HashMap.empty[String, Long]
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> val buff0 = buffer.getAs[HashMap[String, Long]](0)
> buffer(0) = buff0.updated("test", buff0.getOrElse("test", 0L) + 1L)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.
>   getAs[HashMap[String, Long]](0)
>   .merged(buffer2.getAs[HashMap[String, Long]](0))({ case ((k, v1), (_, 
> v2)) => (k, v1 + v2) })
>   }
>   def evaluate(buffer: Row): Any = {
> buffer(0)
>   }
> }
> private case object CustomHashMapType extends UserDefinedType[HashMap[String, 
> Long]] {
>   override def sqlType: DataType = MapType(StringType, LongType)
>   override def serialize(obj: Any): Map[String, Long] =
> obj.asInstanceOf[Map[String, Long]]
>   override def deserialize(datum: Any): HashMap[String, Long] = {
> datum.asInstanceOf[Map[String, Long]] ++: HashMap.empty[String, Long]
>   }
>   override def userClass: Class[HashMap[String, Long]] = 
> classOf[HashMap[String, Long]]
> }
> {code}
> The wrapper Class to run the UDAF:-
> {code:scala}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object TestJob {
>   def main(args: Array[String]): Unit = {
> val conf = new 
> SparkConf().setMaster("local[4]").setAppName("DataStatsExecution")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> val df = sc.parallelize(Seq(1,2,3,4)).toDF("col")
> val udaf = new Test()
> val outdf = df.agg(udaf(df("col")))
> outdf.show
>   }
> }
> {code}
> Stacktrace:-
> {code:java}
> Caused by: java.lang.ClassCastException: 
> scala.collection.immutable.HashMap$HashMap1 cannot be cast to 
> org.apache.spark.sql.catalyst.util.MapData
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getMap(rows.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getMap(rows.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getMap(JoinedRow.scala:115)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:345)
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:344)
> at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:154)
> at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> 

[jira] [Assigned] (SPARK-23763) OffHeapColumnVector uses MemoryBlock

2018-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23763:


Assignee: (was: Apache Spark)

> OffHeapColumnVector uses MemoryBlock
> 
>
> Key: SPARK-23763
> URL: https://issues.apache.org/jira/browse/SPARK-23763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{OffHeapColumnVector}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23763) OffHeapColumnVector uses MemoryBlock

2018-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23763:


Assignee: Apache Spark

> OffHeapColumnVector uses MemoryBlock
> 
>
> Key: SPARK-23763
> URL: https://issues.apache.org/jira/browse/SPARK-23763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{OffHeapColumnVector}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23763) OffHeapColumnVector uses MemoryBlock

2018-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407879#comment-16407879
 ] 

Apache Spark commented on SPARK-23763:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20874

> OffHeapColumnVector uses MemoryBlock
> 
>
> Key: SPARK-23763
> URL: https://issues.apache.org/jira/browse/SPARK-23763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{OffHeapColumnVector}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23763) OffHeapColumnVector uses MemoryBlock

2018-03-21 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-23763:
-
Description: 
This JIRA entry tries to use {{MemoryBlock}} in {{OffHeapColumnVector}}.


  was:
This JIRA entry tries to use {{MemoryBlock}} in {{BufferHolder}}.


Summary: OffHeapColumnVector uses MemoryBlock  (was: BufferHolder uses 
MemoryBlock)

> OffHeapColumnVector uses MemoryBlock
> 
>
> Key: SPARK-23763
> URL: https://issues.apache.org/jira/browse/SPARK-23763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{OffHeapColumnVector}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

2018-03-21 Thread Natalia Gorchakova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407851#comment-16407851
 ] 

Natalia Gorchakova commented on SPARK-10848:


As I understand it, the intent of calling .asNullable on the schema was to be safe 
(as there is no way to check that a field is present in all files). I don't see any 
reason for that in cases where we have an explicit schema in the files (for example, 
Avro files).

With the current implementation (2.2.x, 2.3.x), a dataframe based on Avro files 
(with required fields) has all fields nullable.

Should some additional logic (a flag) be added to apply nullability only to 
formats without an explicit schema?

> Applied JSON Schema Works for json RDD but not when loading json file
> -
>
> Key: SPARK-10848
> URL: https://issues.apache.org/jira/browse/SPARK-10848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Miklos Christine
>Priority: Minor
>
> Using a defined schema to load a json rdd works as expected. Loading the json 
> records from a file does not apply the supplied schema. Mainly the nullable 
> field isn't applied correctly. Loading from a file uses nullable=true on all 
> fields regardless of applied schema. 
> Code to reproduce:
> {code}
> import  org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
> "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
> "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
> "expressDelivery":true}"""))
> val mySchema = StructType(Array(
>   StructField(name="OrderID"   , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": 
> "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": 
> "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23761) Dataframe filter(udf) followed by groupby in pyspark throws a casting error

2018-03-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407848#comment-16407848
 ] 

Hyukjin Kwon commented on SPARK-23761:
--

It seems this one is fixed in the current master. Would you be able to test this 
in a higher version of Spark?

If you are unable to reproduce this in higher versions, I would rather resolve 
this as {{Cannot Reproduce}}, try to find the JIRA that fixed it, and then 
backport it if applicable.

> Dataframe filter(udf) followed by groupby in pyspark throws a casting error
> ---
>
> Key: SPARK-23761
> URL: https://issues.apache.org/jira/browse/SPARK-23761
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: pyspark 1.6.0
> Python 2.6.6 (r266:84292, Aug 18 2016, 15:13:37) 
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2
> CentOS 6.7
>Reporter: Dhaniram Kshirsagar
>Priority: Major
>
> On PySpark with a dataframe, we are getting the following exception when 
> 'filter (with UDF) is followed by groupby':
> # Snippet of error observed in pyspark
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o56.filter.
> : java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
> org.apache.spark.sql.catalyst.plans.logical.Aggregate{code}
> This one looks like https://issues.apache.org/jira/browse/SPARK-12981, but I am 
> not sure if it is the same.
>  
> Here is gist with pyspark steps to reproduce this issue:
> [https://gist.github.com/dhaniram-kshirsagar/d72545620b6a05d145a1a6bece797b6d]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23707) Don't need shuffle exchange with single partition for 'spark.range'

2018-03-21 Thread Xianyang Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyang Liu updated SPARK-23707:
-
Description: Just like #20726, there is no need for an 'Exchange' when 
`spark.range` produces only one partition.  (was: We should call `ctx.freshName` 
to get the `initRange` to avoid name conflicts.)

> Don't need shuffle exchange with single partition for 'spark.range' 
> 
>
> Key: SPARK-23707
> URL: https://issues.apache.org/jira/browse/SPARK-23707
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xianyang Liu
>Priority: Major
>
> Just like #20726, there is no need for an 'Exchange' when `spark.range` produces 
> only one partition.
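
A small sketch for observing the plan being discussed, assuming a local SparkSession; with {{numPartitions = 1}} the extra Exchange before the aggregate is what this issue wants to avoid.

{code:scala}
// A minimal sketch, assuming a local SparkSession; inspect the plan manually.
import org.apache.spark.sql.SparkSession

object RangeExchangeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("range-exchange").getOrCreate()

    // range(...) with a single partition already satisfies a single-partition
    // distribution requirement, so an extra shuffle Exchange adds no value.
    spark.range(0, 1000, 1, 1)
      .groupBy()
      .sum("id")
      .explain()   // look for Exchange nodes in the printed physical plan

    spark.stop()
  }
}
{code}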



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23707) Don't need shuffle exchange with single partition for 'spark.range'

2018-03-21 Thread Xianyang Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianyang Liu updated SPARK-23707:
-
Summary: Don't need shuffle exchange with single partition for 
'spark.range'   (was: Fresh 'initRange' name to avoid method name conflicts)

> Don't need shuffle exchange with single partition for 'spark.range' 
> 
>
> Key: SPARK-23707
> URL: https://issues.apache.org/jira/browse/SPARK-23707
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xianyang Liu
>Priority: Major
>
> We should call `ctx.freshName` to get the `initRange` to avoid name conflicts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23746) HashMap UserDefinedType giving cast exception in Spark 1.6.2 while implementing UDAF

2018-03-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407846#comment-16407846
 ] 

Hyukjin Kwon commented on SPARK-23746:
--

Would you be able to test this against a higher version of Spark?

> HashMap UserDefinedType giving cast exception in Spark 1.6.2 while 
> implementing UDAF
> 
>
> Key: SPARK-23746
> URL: https://issues.apache.org/jira/browse/SPARK-23746
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2
>Reporter: Izhar Ahmed
>Priority: Major
>
> I am trying to use a custom HashMap implementation as a UserDefinedType instead 
> of MapType in Spark. The code is *working fine in Spark 1.5.2* but gives a 
> {{java.lang.ClassCastException: scala.collection.immutable.HashMap$HashMap1 
> cannot be cast to org.apache.spark.sql.catalyst.util.MapData}} *exception in 
> Spark 1.6.2*.
> The code:
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import scala.collection.immutable.HashMap
> class Test extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(Array(StructField("input", StringType)))
>   def bufferSchema = StructType(Array(StructField("top_n", 
> CustomHashMapType)))
>   def dataType: DataType = CustomHashMapType
>   def deterministic = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = HashMap.empty[String, Long]
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> val buff0 = buffer.getAs[HashMap[String, Long]](0)
> buffer(0) = buff0.updated("test", buff0.getOrElse("test", 0L) + 1L)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.
>   getAs[HashMap[String, Long]](0)
>   .merged(buffer2.getAs[HashMap[String, Long]](0))({ case ((k, v1), (_, 
> v2)) => (k, v1 + v2) })
>   }
>   def evaluate(buffer: Row): Any = {
> buffer(0)
>   }
> }
> private case object CustomHashMapType extends UserDefinedType[HashMap[String, 
> Long]] {
>   override def sqlType: DataType = MapType(StringType, LongType)
>   override def serialize(obj: Any): Map[String, Long] =
> obj.asInstanceOf[Map[String, Long]]
>   override def deserialize(datum: Any): HashMap[String, Long] = {
> datum.asInstanceOf[Map[String, Long]] ++: HashMap.empty[String, Long]
>   }
>   override def userClass: Class[HashMap[String, Long]] = 
> classOf[HashMap[String, Long]]
> }
> {code}
> The wrapper Class to run the UDAF:-
> {code:scala}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object TestJob {
>   def main(args: Array[String]): Unit = {
> val conf = new 
> SparkConf().setMaster("local[4]").setAppName("DataStatsExecution")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
> val df = sc.parallelize(Seq(1,2,3,4)).toDF("col")
> val udaf = new Test()
> val outdf = df.agg(udaf(df("col")))
> outdf.show
>   }
> }
> {code}
> Stacktrace:-
> {code:java}
> Caused by: java.lang.ClassCastException: 
> scala.collection.immutable.HashMap$HashMap1 cannot be cast to 
> org.apache.spark.sql.catalyst.util.MapData
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getMap(rows.scala:50)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getMap(rows.scala:248)
> at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getMap(JoinedRow.scala:115)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:345)
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$31.apply(AggregationIterator.scala:344)
> at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:154)
> at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 

[jira] [Reopened] (SPARK-22744) Cannot get the submit hostname of application

2018-03-21 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin reopened SPARK-22744:


I think it's very useful if we can get the submit host of an application. I have 
changed some code to prevent it from being set manually.

> Cannot get the submit hostname of application
> -
>
> Key: SPARK-22744
> URL: https://issues.apache.org/jira/browse/SPARK-22744
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Minor
>
> In MapReduce, we can get the submit hostname by checking the value of the 
> configuration {{mapreduce.job.submithostname}}. It can help the infra team or 
> other people to do support work and debugging. But in Spark, this information is 
> not included.
> In a big company, the Spark infra team can get the submit host list of all 
> applications in a cluster. It can help them to control submissions, unify 
> upgrades, and find bad jobs and where they were submitted.
> Then, I suggest setting the submit hostname internally in Spark under the name 
> {{spark.submit.hostname}}; Spark users cannot set it (any manual setting will 
> be overridden).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22744) Cannot get the submit hostname of application

2018-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1640#comment-1640
 ] 

Apache Spark commented on SPARK-22744:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20873

> Cannot get the submit hostname of application
> -
>
> Key: SPARK-22744
> URL: https://issues.apache.org/jira/browse/SPARK-22744
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Minor
>
> In MapReduce, we can get the submit hostname by checking the value of the 
> configuration {{mapreduce.job.submithostname}}. It can help the infra team or 
> other people to do support work and debugging. But in Spark, this information is 
> not included.
> In a big company, the Spark infra team can get the submit host list of all 
> applications in a cluster. It can help them to control submissions, unify 
> upgrades, and find bad jobs and where they were submitted.
> Then, I suggest setting the submit hostname internally in Spark under the name 
> {{spark.submit.hostname}}; Spark users cannot set it (any manual setting will 
> be overridden).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22744) Cannot get the submit hostname of application

2018-03-21 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-22744:
---
Summary: Cannot get the submit hostname of application  (was: Set the 
application submit hostname by default)

> Cannot get the submit hostname of application
> -
>
> Key: SPARK-22744
> URL: https://issues.apache.org/jira/browse/SPARK-22744
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Minor
>
> In MapReduce, we can get the submit hostname by checking the value of the 
> configuration {{mapreduce.job.submithostname}}. It can help the infra team or 
> other people to do support work and debugging. But in Spark, this information is 
> not included.
> In a big company, the Spark infra team can get the submit host list of all 
> applications in a cluster. It can help them to control submissions, unify 
> upgrades, and find bad jobs and where they were submitted.
> Then, I suggest setting the submit hostname internally in Spark under the name 
> {{spark.submit.hostname}}; Spark users cannot set it (any manual setting will 
> be overridden).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22744) Set the application submit hostname by default

2018-03-21 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-22744:
---
Description: 
In MapReduce, we can get the submit hostname by checking the value of the 
configuration {{mapreduce.job.submithostname}}. It can help the infra team or other 
people to do support work and debugging. But in Spark, this information is not 
included.

In a big company, the Spark infra team can get the submit host list of all 
applications in a cluster. It can help them to control submissions, unify 
upgrades, and find bad jobs and where they were submitted.
Then, I suggest setting the submit hostname internally in Spark under the name 
{{spark.submit.hostname}}; Spark users cannot set it (any manual setting will be 
overridden).

  was:
In MapReduce, we can get the submit hostname by checking the value of the 
configuration {{mapreduce.job.submithostname}}. It can help the infra team or other 
people to do support work and debugging. But in Spark, this information is not 
included.

Then, I suggest introducing a new configuration parameter, 
{{spark.submit.hostname}}, which will be set automatically by default, but which 
also allows setting a user-defined hostname if needed (e.g., using a user node 
instead of the Livy server node).


> Set the application submit hostname by default
> --
>
> Key: SPARK-22744
> URL: https://issues.apache.org/jira/browse/SPARK-22744
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Minor
>
> In MapReduce, we can get the submit hostname by checking the value of the 
> configuration {{mapreduce.job.submithostname}}. It can help the infra team or 
> other people to do support work and debugging. But in Spark, this information is 
> not included.
> In a big company, the Spark infra team can get the submit host list of all 
> applications in a cluster. It can help them to control submissions, unify 
> upgrades, and find bad jobs and where they were submitted.
> Then, I suggest setting the submit hostname internally in Spark under the name 
> {{spark.submit.hostname}}; Spark users cannot set it (any manual setting will 
> be overridden).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22744) Set the application submit hostname by default

2018-03-21 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-22744:
---
Summary: Set the application submit hostname by default  (was: Add a 
configuration to show the application submit hostname)

> Set the application submit hostname by default
> --
>
> Key: SPARK-22744
> URL: https://issues.apache.org/jira/browse/SPARK-22744
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Minor
>
> In MapReduce, we can get the submit hostname by checking the value of the 
> configuration {{mapreduce.job.submithostname}}. It can help the infra team or 
> other people to do support work and debugging. But in Spark, this information is 
> not included.
> Then, I suggest introducing a new configuration parameter, 
> {{spark.submit.hostname}}, which will be set automatically by default, but which 
> also allows setting a user-defined hostname if needed (e.g., using a user node 
> instead of the Livy server node).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23264) Support interval values without INTERVAL clauses

2018-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407733#comment-16407733
 ] 

Apache Spark commented on SPARK-23264:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20872

> Support interval values without INTERVAL clauses
> 
>
> Key: SPARK-23264
> URL: https://issues.apache.org/jira/browse/SPARK-23264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The master currently cannot parse the SQL query below:
> {code:java}
> SELECT cast('2017-08-04' as date) + 1 days;
> {code}
> Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it 
> might help to support it in Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks

2018-03-21 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407729#comment-16407729
 ] 

sam commented on SPARK-6190:


[~bdolbeare] [~UZiVcbfPXaNrMtT]

I completely agree that it's depressing that many of these basic Spark issues are 
not resolved and that development attention seems to go to useless features purely 
meant to make it easier to market Spark to non-big-data experts (e.g. SparkSQL, or 
poorly designed non-functional APIs like Dataframes/Datasets).

Nevertheless, in the situation where you have a prod job whose input data 
grows over time, you could have a pre-job that determines the size of the input 
data and then calculates the optimal partitioning, as in the sketch below. One may 
even need to consider automatically increasing the size of the cluster too.
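
A minimal sketch of such a pre-job, assuming HDFS-style input and a hypothetical 128 MB target partition size:

{code:scala}
// A minimal sketch; the 128 MB target and the helper name are assumptions.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def partitionsFor(inputDir: String, targetBytes: Long = 128L * 1024 * 1024): Int = {
  val fs = FileSystem.get(new Configuration())
  // Total size of all files under the input directory.
  val totalBytes = fs.getContentSummary(new Path(inputDir)).getLength
  math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
}

// e.g. spark.read.textFile(inputDir).repartition(partitionsFor(inputDir))
{code}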

It's very normal for Big Data jobs to work for months and then start falling over. 
Trust me, it used to be much worse in the Hadoop days.

 

> create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
> --
>
> Key: SPARK-6190
> URL: https://issues.apache.org/jira/browse/SPARK-6190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
> Attachments: LargeByteBuffer_v3.pdf
>
>
> A key component in eliminating the 2GB limit on blocks is creating a proper 
> abstraction for storing more than 2GB.  Currently spark is limited by a 
> reliance on nio ByteBuffer and netty ByteBuf, both of which are limited at 
> 2GB.  This task will introduce the new abstraction and the relevant 
> implementation and utilities, without effecting the existing implementation 
> at all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23762) UTF8StringBuilder uses MemoryBlock

2018-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23762:


Assignee: (was: Apache Spark)

> UTF8StringBuilder uses MemoryBlock
> --
>
> Key: SPARK-23762
> URL: https://issues.apache.org/jira/browse/SPARK-23762
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{UTF8StringBuilder}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23762) UTF8StringBuilder uses MemoryBlock

2018-03-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23762:


Assignee: Apache Spark

> UTF8StringBuilder uses MemoryBlock
> --
>
> Key: SPARK-23762
> URL: https://issues.apache.org/jira/browse/SPARK-23762
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{UTF8StringBuilder}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23762) UTF8StringBuilder uses MemoryBlock

2018-03-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407613#comment-16407613
 ] 

Apache Spark commented on SPARK-23762:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20871

> UTF8StringBuilder uses MemoryBlock
> --
>
> Key: SPARK-23762
> URL: https://issues.apache.org/jira/browse/SPARK-23762
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{UTF8StringBuilder}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23739) Spark structured streaming long running problem

2018-03-21 Thread Florencio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407611#comment-16407611
 ] 

Florencio commented on SPARK-23739:
---

Thanks for your response. I am getting this error after some time; it could be 
hours or days. The spark-submit command is:

{noformat}
export SPARK_MAJOR_VERSION=2
spark-submit --master yarn --deploy-mode client \
  --executor-cores 6 --num-executors 4 --driver-memory 2G --executor-memory 2G \
  --jars /usr/hdp/current/hbase-client/lib/hbase-client-1.1.2.2.6.1.0-129.jar,/usr/hdp/current/hbase-client/lib/hbase-common-1.1.2.2.6.1.0-129.jar,/usr/hdp/current/hbase-client/lib/hbase-server-1.1.2.2.6.1.0-129.jar,/usr/hdp/current/hbase-client/lib/hbase-protocol-1.1.2.2.6.1.0-129.jar \
  --class classname jarname \
  -topic=topicname \
  -tablehbase=namehbasetable \
  -checkpointLocation=/tmp/PassingStructuredStreamingEntrate
{noformat}

 

Some useful information could be the configuration of the structured stream:

{code:scala}
val ds1 = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "serversnames")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .option("fetchOffset.numRetries", 5)
  .option("subscribe", topic)
  .load()

val writer = new HbaseSink(tablehbase)
val query = results.writeStream.foreach(writer)
  .outputMode("complete")
  .option("checkpointLocation", checkpointLocation)
  .start()
{code}

 

 

Thanks.

 

 

> Spark structured streaming long running problem
> ---
>
> Key: SPARK-23739
> URL: https://issues.apache.org/jira/browse/SPARK-23739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Florencio
>Priority: Critical
>  Labels: spark, streaming, structured
>
> I had a problem with long-running Spark structured streaming in Spark 2.1. 
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.kafka.common.requests.LeaveGroupResponse.
> The detailed error is the following:
> 18/03/16 16:10:57 INFO StreamExecution: Committed offsets for batch 2110. 
> Metadata OffsetSeqMetadata(0,1521216656590)
> 18/03/16 16:10:57 INFO KafkaSource: GetBatch called with start = 
> Some(\{"TopicName":{"2":5520197,"1":5521045,"3":5522054,"0":5527915}}), end = 
> \{"TopicName":{"2":5522730,"1":5523577,"3":5524586,"0":5530441}}
> 18/03/16 16:10:57 INFO KafkaSource: Partitions added: Map()
> 18/03/16 16:10:57 ERROR StreamExecution: Query [id = 
> a233b9ff-cc39-44d3-b953-a255986c04bf, runId = 
> 8520e3c0-2455-4ac1-9021-8518fb58b3f8] terminated with error
> java.util.zip.ZipException: invalid code lengths set
>  at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164)
>  at java.io.FilterInputStream.read(FilterInputStream.java:133)
>  at java.io.FilterInputStream.read(FilterInputStream.java:107)
>  at 
> org.apache.spark.util.Utils$$anonfun$copyStream$1.apply$mcJ$sp(Utils.scala:354)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$$anonfun$copyStream$1.apply(Utils.scala:322)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
>  at org.apache.spark.util.Utils$.copyStream(Utils.scala:362)
>  at 
> org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:45)
>  at 
> org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:83)
>  at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:173)
>  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
>  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
>  at org.apache.spark.sql.kafka010.KafkaSource.getBatch(KafkaSource.scala:287)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:503)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$2$$anonfun$apply$6.apply(StreamExecution.scala:499)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at 

[jira] [Updated] (SPARK-23763) BufferHolder uses MemoryBlock

2018-03-21 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-23763:
-
Description: 
This JIRA entry tries to use {{MemoryBlock}} in {{BufferHolder}}.


  was:
This JIRA entry tries to use {{MemoryBlock}} in UTF8StringBuffer.


Summary: BufferHolder uses MemoryBlock  (was: UTF8StringBuilder uses 
MemoryBlock)

> BufferHolder uses MemoryBlock
> -
>
> Key: SPARK-23763
> URL: https://issues.apache.org/jira/browse/SPARK-23763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{BufferHolder}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23763) UTF8StringBuilder uses MemoryBlock

2018-03-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23763:


 Summary: UTF8StringBuilder uses MemoryBlock
 Key: SPARK-23763
 URL: https://issues.apache.org/jira/browse/SPARK-23763
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


This JIRA entry tries to use {{MemoryBlock}} in {{UTF8StringBuilder}}.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23762) UTF8StringBuilder uses MemoryBlock

2018-03-21 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23762:


 Summary: UTF8StringBuilder uses MemoryBlock
 Key: SPARK-23762
 URL: https://issues.apache.org/jira/browse/SPARK-23762
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


This JIRA entry tries to use {{MemoryBlock}} in {{UTF8StringBuilder}}.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23503) continuous execution should sequence committed epochs

2018-03-21 Thread Efim Poberezkin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407545#comment-16407545
 ] 

Efim Poberezkin commented on SPARK-23503:
-

I'd like to work on this one

> continuous execution should sequence committed epochs
> -
>
> Key: SPARK-23503
> URL: https://issues.apache.org/jira/browse/SPARK-23503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>
> Currently, the EpochCoordinator doesn't enforce a commit order. If a message 
> for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for 
> commit earlier, epoch n + 1 will be committed.
>  
> This is either incorrect or needlessly confusing, because it's not safe to 
> start from the end offset of epoch n + 1 until epoch n is committed. 
> EpochCoordinator should enforce this sequencing.
>  
> Note that this is not actually a problem right now, because the commit 
> messages go through the same RPC channel from the same place. But we 
> shouldn't implicitly bake this assumption in.
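A minimal sketch of the kind of in-order gating described above (names and structure are assumptions for illustration; this is not the actual EpochCoordinator code):

{code:java}
import scala.collection.mutable

// Hypothetical sketch: commit epochs strictly in order, buffering any epoch
// that becomes ready before its predecessor has been committed.
class EpochSequencer(firstEpoch: Long, commit: Long => Unit) {
  private var nextToCommit = firstEpoch
  private val ready = mutable.Set.empty[Long]

  def epochReady(epoch: Long): Unit = {
    ready += epoch
    // drain every consecutive ready epoch starting from nextToCommit
    while (ready.remove(nextToCommit)) {
      commit(nextToCommit)
      nextToCommit += 1
    }
  }
}
{code}

With gating like this, epoch n + 1 would only be committed once epoch n's commit message has arrived, matching the restart semantics described above.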



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23761) Dataframe filter(udf) followed by groupby in pyspark throws a casting error

2018-03-21 Thread Dhaniram Kshirsagar (JIRA)
Dhaniram Kshirsagar created SPARK-23761:
---

 Summary: Dataframe filter(udf) followed by groupby in pyspark 
throws a casting error
 Key: SPARK-23761
 URL: https://issues.apache.org/jira/browse/SPARK-23761
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.6.0
 Environment: pyspark 1.6.0

Python 2.6.6 (r266:84292, Aug 18 2016, 15:13:37) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2

CentOS 6.7
Reporter: Dhaniram Kshirsagar


On PySpark with a DataFrame, we are getting the following exception when a filter (with a 
UDF) is followed by a groupBy:

# Snippet of error observed in pyspark
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o56.filter.
: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
org.apache.spark.sql.catalyst.plans.logical.Aggregate{code}
This one looks similar to https://issues.apache.org/jira/browse/SPARK-12981, but I am 
not sure whether it is the same issue.

Here is a gist with the PySpark steps to reproduce this issue:

[https://gist.github.com/dhaniram-kshirsagar/d72545620b6a05d145a1a6bece797b6d]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23666) Undeterministic column name with UDFs

2018-03-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23666.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Undeterministic column name with UDFs
> -
>
> Key: SPARK-23666
> URL: https://issues.apache.org/jira/browse/SPARK-23666
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Daniel Darabos
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0
>
>
> When you access structure fields in Spark SQL, the auto-generated result 
> column name includes an internal ID.
> {code:java}
> scala> import spark.implicits._
> scala> Seq(((1, 2), 3)).toDF("a", "b").createOrReplaceTempView("x")
> scala> spark.udf.register("f", (a: Int) => a)
> scala> spark.sql("select f(a._1) from x").show
> +---------------------+
> |UDF:f(a._1 AS _1#148)|
> +---------------------+
> |                    1|
> +---------------------+
> {code}
> This ID ({{#148}}) is only included for UDFs.
> {code:java}
> scala> spark.sql("select factorial(a._1) from x").show
> +-----------------------+
> |factorial(a._1 AS `_1`)|
> +-----------------------+
> |                      1|
> +-----------------------+
> {code}
> The internal ID is different on every invocation. The problem this causes for 
> us is that the schema of the SQL output is never the same:
> {code:java}
> scala> spark.sql("select f(a._1) from x").schema ==
>spark.sql("select f(a._1) from x").schema
> Boolean = false
> {code}
> We rely on similar schema checks when reloading persisted data.
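Assuming only stable column names are needed, one workaround sketch (continuing the spark-shell session quoted above) is to alias the UDF result explicitly so the auto-generated name, and its internal ID, never reaches the schema:

{code:java}
// Workaround sketch: explicit aliases give deterministic schema field names.
val df1 = spark.sql("select f(a._1) as f_a1 from x")
val df2 = spark.sql("select f(a._1) as f_a1 from x")
df1.schema == df2.schema  // true: both columns are named f_a1
{code}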



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23234) ML python test failure due to default outputCol

2018-03-21 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23234.
--
Resolution: Duplicate

> ML python test failure due to default outputCol
> ---
>
> Key: SPARK-23234
> URL: https://issues.apache.org/jira/browse/SPARK-23234
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Major
>
> SPARK-22799 and SPARK-22797 are causing valid Python test failures. The 
> reason is that Python sets the default params with set, so they are not 
> considered defaults but params passed by the user.
> This means that outputCol is set not as a default but as a real value.
> Anyway, this is a misbehavior of the Python API which can cause serious 
> problems, and I'd suggest rethinking the way this is done.
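The same default-vs-set distinction can be seen from Scala with any Params-based stage; a tiny illustration, using Bucketizer purely as an example:

{code:java}
import org.apache.spark.ml.feature.Bucketizer

val a = new Bucketizer()
a.isSet(a.outputCol)            // false: the uid-based value is only a default

// Re-setting the very same value through a setter turns it into a
// user-supplied param, which is what the Python wrapper ends up doing:
a.setOutputCol(a.getOutputCol)
a.isSet(a.outputCol)            // true
{code}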



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-03-21 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407510#comment-16407510
 ] 

Bryan Cutler commented on SPARK-23244:
--

Just to clarify, the PySpark save/load is just a wrapper making the same calls 
in Java, so that will fix the root cause of the issue

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers which are wrappers around Scala objects. It looks like 
> after deserialization the default values which were based on uid do not get 
> properly restored, and values which were not set are set to their (original) 
> default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-03-21 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23244.
--
Resolution: Duplicate

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers which are wrappers around Scala objects. It looks like 
> after deserialization the default values which were based on uid do not get 
> properly restored, and values which were not set are set to their (original) 
> default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23244) Incorrect handling of default values when deserializing python wrappers of scala transformers

2018-03-21 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16407507#comment-16407507
 ] 

Bryan Cutler commented on SPARK-23244:
--

I looked into this and it is a little bit different because with save/load, 
params are only transferred from Java to Python.  So the actual problem is in 
Scala:
{code:java}
scala> import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.ml.feature.Bucketizer

scala> val a = new Bucketizer()
a: org.apache.spark.ml.feature.Bucketizer = bucketizer_30c66d09db18

scala> a.isSet(a.outputCol)
res2: Boolean = false

scala> a.save("bucketizer0")

scala> val b = Bucketizer.load("bucketizer0")
b: org.apache.spark.ml.feature.Bucketizer = bucketizer_30c66d09db18

scala> b.isSet(b.outputCol)
res4: Boolean = true{code}
It seems this is being worked on in SPARK-23455, so I'll still close this as a 
duplicate
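
Until that lands, here is a rough Scala-side sketch of the distinction involved; it assumes outputCol was never set explicitly on the original object, so it is illustrative only rather than a recommended fix:

{code:java}
import org.apache.spark.ml.feature.Bucketizer

val b = Bucketizer.load("bucketizer0")

// After load the old default leaks back in as if the user had set it:
b.isSet(b.outputCol)   // true, even though outputCol was never set explicitly

// clear() drops the (spurious) user-supplied value, so getOutputCol falls
// back to the uid-based default again:
b.clear(b.outputCol)
b.isSet(b.outputCol)   // false
{code}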

> Incorrect handling of default values when deserializing python wrappers of 
> scala transformers
> -
>
> Key: SPARK-23244
> URL: https://issues.apache.org/jira/browse/SPARK-23244
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Tomas Nykodym
>Priority: Minor
>
> Default values are not handled properly when serializing/deserializing Python 
> transformers which are wrappers around Scala objects. It looks like 
> after deserialization the default values which were based on uid do not get 
> properly restored, and values which were not set are set to their (original) 
> default values.
> Here's a simple code example using Bucketizer:
> {code:python}
> >>> from pyspark.ml.feature import Bucketizer
> >>> a = Bucketizer() 
> >>> a.save("bucketizer0")
> >>> b = load("bucketizer0") 
> >>> a._defaultParamMap[a.outputCol]
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b._defaultParamMap[b.outputCol]
> u'Bucketizer_41cf9afbc559ca2bfc9a__output'
> >>> a.isSet(a.outputCol)
> False 
> >>> b.isSet(b.outputCol)
> True
> >>> a.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> >>> b.getOutputCol()
> u'Bucketizer_440bb49206c148989db7__output'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org