[jira] [Updated] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-02-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23186:

Fix Version/s: 2.2.2

> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to initialize DriverManager first in order to avoid 
> potential deadlock situation like the following or STORM-2527.
> {code}
> Thread 9587: (state = BLOCKED)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
> (Interpreted frame)
>  - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted 
> frame)
>  - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame)
>  - 
> java.security.AccessController.doPrivileged(java.security.PrivilegedAction) 
> @bci=0 (Compiled frame)
>  - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted 
> frame)
>  - java.sql.DriverManager.<clinit>() @bci=32, line=101 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
>  java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 
> (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
>  java.util.Properties) @bci=22, line=57 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
>  org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
>  org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 
> (Interpreted frame)
>  - 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 
> (Interpreted frame)
> Thread 9170: (state = BLOCKED)
>  - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() @bci=35, line=125 
> (Interpreted frame)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
>  @bci=89, line=46 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=7, line=53 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=1, line=52 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 
> (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
>  org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame)
> {code}
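
A minimal sketch of the fix idea in Scala (the object and method names below are illustrative, not Spark's actual DriverRegistry code): referencing java.sql.DriverManager before reflectively instantiating a driver class forces DriverManager.<clinit> (and its loadInitialDrivers call) to complete first, so a driver whose own static initializer calls back into DriverManager cannot deadlock against it.

{code:scala}
import java.sql.{Driver, DriverManager}

// Sketch only: force DriverManager's static initialization to finish
// before any JDBC driver class is loaded reflectively.
object SafeDriverLoading {

  // Touching DriverManager here triggers its <clinit> (which runs
  // loadInitialDrivers) up front, outside any driver's class initializer.
  DriverManager.getDrivers

  def register(driverClassName: String): Unit = {
    // By the time the driver's static initializer runs and calls back into
    // DriverManager, DriverManager is already fully initialized.
    val driver = Class.forName(driverClassName).newInstance().asInstanceOf[Driver]
    DriverManager.registerDriver(driver)
  }
}
{code}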






[jira] [Commented] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.

2018-02-09 Thread Harleen Singh Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359202#comment-16359202
 ] 

Harleen Singh Mann commented on SPARK-23372:


[~dkbiswal] How will it throw the error at compile time?

With reference to your statement 
_"We should detect this earlier and fail during compilation of the query."_: 
the use of "compilation" in that sentence is probably incorrect. I would 
suggest changing it to "during preparation/execution of the query".

 
 
 

> Writing empty struct in parquet fails during execution. It should fail 
> earlier during analysis.
> ---
>
> Key: SPARK-23372
> URL: https://issues.apache.org/jira/browse/SPARK-23372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> *Running*
> spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
> *Results in*
> {code:java}
>  org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with 
> an empty group: message spark_schema {
>  }
> at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
>  at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
>  at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
>  at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.
>  {code}
> We should detect this earlier and fail during compilation of the query.






[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Harleen Singh Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359200#comment-16359200
 ] 

Harleen Singh Mann commented on SPARK-23370:


[~srowen] Yes, this should be implementable in the Oracle JDBC dialect. I want to 
start working on it once we agree it adds value.

Do you mean overhead for Spark? Or for the Oracle DB? Or for the developer? haha

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Minor
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on JDBC read, Spark obtains the schema of a table using 
> {color:#654982}resultSet.getMetaData.getColumnType{color}
> This works 99.99% of the time, except when a column of Number type is added 
> to an Oracle table using an ALTER statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented in the Oracle KB, and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above-mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application, where relevant 
> information may be lost due to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle, which is 
> through the all_tab_columns table. If we use this table to fetch the 
> precision and scale of the Number type, the above issue is mitigated.
>  
> {color:#14892c}{color:#f6c342}I can implement the changes, but require some 
> inputs on the approach from the gatekeepers here{color}.{color}
>  {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (Yes, I am a newbie to 
> this.) Thanks...{color}
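
A rough sketch of the proposed lookup in Scala, assuming a plain JDBC Connection to Oracle (the helper name and exact query are illustrative, not an existing Spark API): read precision and scale for NUMBER columns from ALL_TAB_COLUMNS instead of ResultSetMetaData, which can report a size of 0 for columns added via ALTER TABLE.

{code:scala}
import java.sql.Connection
import scala.collection.mutable

// Sketch only: fetch precision/scale for NUMBER columns from Oracle's
// ALL_TAB_COLUMNS dictionary view.
def numberPrecisionAndScale(conn: Connection, owner: String, table: String)
    : Map[String, (Int, Int)] = {
  val stmt = conn.prepareStatement(
    """SELECT column_name, data_precision, data_scale
      |FROM all_tab_columns
      |WHERE owner = ? AND table_name = ? AND data_type = 'NUMBER'""".stripMargin)
  stmt.setString(1, owner)
  stmt.setString(2, table)
  val rs = stmt.executeQuery()
  val result = mutable.Map.empty[String, (Int, Int)]
  while (rs.next()) {
    // data_precision/data_scale can be NULL for unconstrained NUMBER columns,
    // in which case getInt returns 0 and a default would still be needed.
    result(rs.getString("column_name")) =
      (rs.getInt("data_precision"), rs.getInt("data_scale"))
  }
  rs.close()
  stmt.close()
  result.toMap
}
{code}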






[jira] [Assigned] (SPARK-23379) remove redundant metastore access if the current database name is the same

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23379:


Assignee: (was: Apache Spark)

> remove redundant metastore access if the current database name is the same
> --
>
> Key: SPARK-23379
> URL: https://issues.apache.org/jira/browse/SPARK-23379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Feng Liu
>Priority: Major
>
> We should be able to reduce one metastore access if the target database name 
> is the same as the current database:
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295
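
A minimal sketch of the idea in Scala (the trait below is illustrative, not Spark's actual HiveClient interface): compare against the session's current database and only issue the metastore call when it would actually change something.

{code:scala}
// Sketch only: skip the extra metastore round trip when the requested
// database is already the current one.
trait DatabaseSession {
  def currentDatabase: String
  def setCurrentDatabase(db: String): Unit
}

def setCurrentDatabaseIfNeeded(session: DatabaseSession, db: String): Unit = {
  if (session.currentDatabase != db) {
    session.setCurrentDatabase(db)
  }
}
{code}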






[jira] [Commented] (SPARK-23379) remove redundant metastore access if the current database name is the same

2018-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359177#comment-16359177
 ] 

Apache Spark commented on SPARK-23379:
--

User 'liufengdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/20565

> remove redundant metastore access if the current database name is the same
> --
>
> Key: SPARK-23379
> URL: https://issues.apache.org/jira/browse/SPARK-23379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Feng Liu
>Priority: Major
>
> We should be able to reduce one metastore access if the target database name 
> is the same as the current database:
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295






[jira] [Assigned] (SPARK-23379) remove redundant metastore access if the current database name is the same

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23379:


Assignee: Apache Spark

> remove redundant metastore access if the current database name is the same
> --
>
> Key: SPARK-23379
> URL: https://issues.apache.org/jira/browse/SPARK-23379
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Feng Liu
>Assignee: Apache Spark
>Priority: Major
>
> We should be able to reduce one metastore access if the target database name 
> is the same as the current database:
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295






[jira] [Created] (SPARK-23379) remove redundant metastore access if the current database name is the same

2018-02-09 Thread Feng Liu (JIRA)
Feng Liu created SPARK-23379:


 Summary: remove redundant metastore access if the current database 
name is the same
 Key: SPARK-23379
 URL: https://issues.apache.org/jira/browse/SPARK-23379
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Feng Liu


We should be able to reduce one metastore access if the target database name is 
the same as the current database:

https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295






[jira] [Assigned] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23378:


Assignee: (was: Apache Spark)

> move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
> --
>
> Key: SPARK-23378
> URL: https://issues.apache.org/jira/browse/SPARK-23378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Feng Liu
>Priority: Major
>
> Conceptually, no methods of HiveExternalCatalog, besides the 
> `setCurrentDatabase`, should change the `currentDatabase` in the hive session 
> state. We can enforce this rule by removing the usage of `setCurrentDatabase` 
> in the HiveExternalCatalog.
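
One way to picture the refactoring, as a hedged sketch in Scala (names are illustrative, not Spark's actual code): instead of calling setCurrentDatabase and then issuing an unqualified statement, the catalog can qualify the object with its database, so the Hive session state never needs to change; any remaining "USE db" logic then lives only inside the client implementation.

{code:scala}
// Sketch only: avoid mutating the session's current database from the
// external catalog by qualifying objects with their database explicitly.
trait HiveClientLike {
  def runSqlHive(sql: String): Unit
}

def dropTableWithoutSwitching(client: HiveClientLike, db: String, table: String): Unit = {
  // No setCurrentDatabase call here; the database is part of the statement.
  client.runSqlHive(s"DROP TABLE IF EXISTS `$db`.`$table`")
}
{code}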






[jira] [Commented] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl

2018-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359135#comment-16359135
 ] 

Apache Spark commented on SPARK-23378:
--

User 'liufengdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/20564

> move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
> --
>
> Key: SPARK-23378
> URL: https://issues.apache.org/jira/browse/SPARK-23378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Feng Liu
>Priority: Major
>
> Conceptually, no methods of HiveExternalCatalog, besides the 
> `setCurrentDatabase`, should change the `currentDatabase` in the hive session 
> state. We can enforce this rule by removing the usage of `setCurrentDatabase` 
> in the HiveExternalCatalog.






[jira] [Assigned] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23378:


Assignee: Apache Spark

> move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
> --
>
> Key: SPARK-23378
> URL: https://issues.apache.org/jira/browse/SPARK-23378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Feng Liu
>Assignee: Apache Spark
>Priority: Major
>
> Conceptually, no methods of HiveExternalCatalog, besides the 
> `setCurrentDatabase`, should change the `currentDatabase` in the hive session 
> state. We can enforce this rule by removing the usage of `setCurrentDatabase` 
> in the HiveExternalCatalog.






[jira] [Created] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl

2018-02-09 Thread Feng Liu (JIRA)
Feng Liu created SPARK-23378:


 Summary: move setCurrentDatabase from HiveExternalCatalog to 
HiveClientImpl
 Key: SPARK-23378
 URL: https://issues.apache.org/jira/browse/SPARK-23378
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Feng Liu


Conceptually, no methods of HiveExternalCatalog, besides the 
`setCurrentDatabase`, should change the `currentDatabase` in the hive session 
state. We can enforce this rule by removing the usage of `setCurrentDatabase` 
in the HiveExternalCatalog.






[jira] [Updated] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-23377:

Description: 
A Bucketizer with multiple input/output columns gets "inputCol" set to the 
default value on write -> read, which causes it to throw an error on transform. 
Here's an example.


{code:java}
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketrizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set
bucketizerAfterRead.transform(data)
{code}

And the trace:

{code:java}
java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the 
inputCols Param set for multi-column transform. The following Params are not 
applicable and should not be set: outputCol.
at 
org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
at 
org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
at 
org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
at 
org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
at 
line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)

{code}



  was:
A Bucketizer with multiple input/output columns gets "inputCol" set to the 
default value on write -> read, which causes it to throw an error on transform. 
Here's an example.


{code:java}
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketrizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set
bucketizerAfterRead.transform(data)
{code}


> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
> {code}






[jira] [Created] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-09 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23377:
---

 Summary: Bucketizer with multiple columns persistence bug
 Key: SPARK-23377
 URL: https://issues.apache.org/jira/browse/SPARK-23377
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian


A Bucketizer with multiple input/output columns gets "inputCol" set to the 
default value on write -> read, which causes it to throw an error on transform. 
Here's an example.


{code:java}
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketrizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set
bucketizerAfterRead.transform(data)
{code}






[jira] [Updated] (SPARK-21232) New built-in SQL function - Data_Type

2018-02-09 Thread Mario Molina (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mario Molina updated SPARK-21232:
-
Fix Version/s: 2.3.0
   2.2.2

> New built-in SQL function - Data_Type
> -
>
> Key: SPARK-21232
> URL: https://issues.apache.org/jira/browse/SPARK-21232
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR, SQL
>Affects Versions: 2.1.1
>Reporter: Mario Molina
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> This function returns the data type of a given column.
> {code:java}
> data_type("a")
> // returns string
> {code}
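
For comparison, a hedged sketch in Scala of how the same information can be obtained today from a DataFrame's schema (the helper is illustrative; the proposed built-in would expose this directly in SQL):

{code:scala}
import org.apache.spark.sql.DataFrame

// Sketch only: look up a column's data type from the DataFrame schema.
def dataTypeOf(df: DataFrame, column: String): String =
  df.schema(column).dataType.simpleString

// e.g. dataTypeOf(df, "a") might return "string" or "int"
{code}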






[jira] [Commented] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.

2018-02-09 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359062#comment-16359062
 ] 

Dilip Biswal commented on SPARK-23372:
--

[~mannharleen] Hello, my current plan is to add a validation check when we 
prepare to write to parquet. We have such checks for the text file format; 
I plan to do something similar for parquet.
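
A rough sketch in Scala of the kind of check being described (the helper object and the place it would be called from are assumptions, not Spark's actual implementation): walk the write schema and reject empty structs before any task is launched.

{code:scala}
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

// Sketch only: fail fast if the schema (or any nested field) is an empty
// struct, instead of letting Parquet throw during task execution.
object ParquetSchemaCheck {
  def verify(schema: StructType): Unit = {
    def check(dt: DataType): Unit = dt match {
      case s: StructType =>
        require(s.nonEmpty, "Parquet does not support writing empty structs.")
        s.fields.foreach(f => check(f.dataType))
      case a: ArrayType => check(a.elementType)
      case m: MapType => check(m.keyType); check(m.valueType)
      case _ => // leaf types are fine
    }
    check(schema)
  }
}
{code}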

> Writing empty struct in parquet fails during execution. It should fail 
> earlier during analysis.
> ---
>
> Key: SPARK-23372
> URL: https://issues.apache.org/jira/browse/SPARK-23372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> *Running*
> spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
> *Results in*
> {code:java}
>  org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with 
> an empty group: message spark_schema {
>  }
> at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
>  at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
>  at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
>  at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.
>  {code}
> We should detect this earlier and fail during compilation of the query.






[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359047#comment-16359047
 ] 

Sean Owen commented on SPARK-23370:
---

It's possible to implement that just in the JDBC dialect for Oracle, I suppose. 
Is it extra overhead? That is, I wonder about leaving in a workaround that 
impacts all Oracle users for a long time.

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Minor
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on JDBC read, Spark obtains the schema of a table using 
> {color:#654982}resultSet.getMetaData.getColumnType{color}
> This works 99.99% of the time, except when a column of Number type is added 
> to an Oracle table using an ALTER statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented in the Oracle KB, and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above-mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application, where relevant 
> information may be lost due to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle, which is 
> through the all_tab_columns table. If we use this table to fetch the 
> precision and scale of the Number type, the above issue is mitigated.
>  
> {color:#14892c}{color:#f6c342}I can implement the changes, but require some 
> inputs on the approach from the gatekeepers here{color}.{color}
>  {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (Yes, I am a newbie to 
> this.) Thanks...{color}






[jira] [Commented] (SPARK-23186) Initialize DriverManager first before loading Drivers

2018-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359002#comment-16359002
 ] 

Apache Spark commented on SPARK-23186:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20563

> Initialize DriverManager first before loading Drivers
> -
>
> Key: SPARK-23186
> URL: https://issues.apache.org/jira/browse/SPARK-23186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>
> Since some JDBC Drivers have class initialization code to call 
> `DriverManager`, we need to initialize DriverManager first in order to avoid 
> potential deadlock situation like the following or STORM-2527.
> {code}
> Thread 9587: (state = BLOCKED)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 
> (Interpreted frame)
>  - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted 
> frame)
>  - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame)
>  - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame)
>  - 
> java.security.AccessController.doPrivileged(java.security.PrivilegedAction) 
> @bci=0 (Compiled frame)
>  - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted 
> frame)
>  - java.sql.DriverManager.<clinit>() @bci=32, line=101 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String,
>  java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 
> (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration,
>  java.util.Properties) @bci=22, line=57 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext,
>  org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame)
>  - 
> org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit,
>  org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 
> (Interpreted frame)
>  - 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(org.apache.spark.rdd.NewHadoopRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 
> (Interpreted frame)
> Thread 9170: (state = BLOCKED)
>  - org.apache.phoenix.jdbc.PhoenixDriver.<clinit>() @bci=35, line=125 
> (Interpreted frame)
>  - 
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor,
>  java.lang.Object[]) @bci=0 (Compiled frame)
>  - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=85, line=62 (Compiled frame)
>  - 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) 
> @bci=5, line=45 (Compiled frame)
>  - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, 
> line=423 (Compiled frame)
>  - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String)
>  @bci=89, line=46 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=7, line=53 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply()
>  @bci=1, line=52 (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD,
>  org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 
> (Interpreted frame)
>  - 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition,
>  org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame)
> {code}




[jira] [Commented] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM

2018-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358971#comment-16358971
 ] 

Apache Spark commented on SPARK-23275:
--

User 'liufengdb' has created a pull request for this issue:
https://github.com/apache/spark/pull/20562

> hive/tests have been failing when run locally on the laptop (Mac) with OOM 
> ---
>
> Key: SPARK-23275
> URL: https://issues.apache.org/jira/browse/SPARK-23275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.3.0
>
>
> hive tests have been failing when they are run locally (macOS) after a 
> recent change in the trunk. After running the tests for some time, the test 
> fails with an OOM: Error: unable to create new native thread. 
> I noticed the thread count goes all the way up to 2000+ after which we start 
> getting these OOM errors. Most of the threads seem to be related to the 
> connection pool in hive metastore (BoneCP-x- ). This behaviour change 
> is happening after we made the following change to HiveClientImpl.reset()
> {code}
>  def reset(): Unit = withHiveState {
> try {
>   // code
> } finally {
>   runSqlHive("USE default")  ===> this is causing the issue
> }
> {code}
> I am proposing to temporarily back out part of a fix made to address 
> SPARK-23000 to resolve this issue while we work out the exact reason for this 
> sudden increase in thread counts.






[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-09 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358866#comment-16358866
 ] 

Edwina Lu commented on SPARK-23206:
---

[~irashid], total memory and off-heap memory are also very useful for us, so we 
are interested in the work being done for SPARK-21157 and SPARK-9103. The 
infrastructure (using the heartbeat and selectively logging to the history log) 
is also similar. We are planning to discuss with [~cltlfcjin] on Monday.

For stage level logging, we've modified LiveExecutorStageSummary to store peak 
values for the new memory metrics, and these are checked and updated for active 
stages in AppStatusListener.onExecutorMetricsUpdate(). For history logging, our 
design is a bit simpler: we track the peak values per executor, and immediately 
log if there is a new peak value. The peak values are reinitialized whenever a 
new stage starts, and this would provide the peak value for a memory metric for 
a stage. In the design doc for SPARK-9103, the heartbeats are combined and 
logged at each stage end – this design could work for us as well.
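
A minimal sketch in Scala of the peak-tracking scheme described above (class and method names are illustrative, not the actual AppStatusListener changes): keep a per-executor peak for a metric, reset it at stage start, and report whether an incoming heartbeat value is a new peak and therefore worth logging.

{code:scala}
import scala.collection.mutable

// Sketch only: per-executor peak tracking for one memory metric.
class PeakMetricTracker {
  private val peaks = mutable.Map.empty[String, Long]

  // Called when a new stage starts, so peaks are effectively per stage.
  def onStageStart(): Unit = peaks.clear()

  // Called on each heartbeat-style update; returns true if this value is a
  // new peak for the executor, i.e. the only case worth writing to the log.
  def update(executorId: String, value: Long): Boolean = {
    val isNewPeak = value > peaks.getOrElse(executorId, Long.MinValue)
    if (isNewPeak) peaks(executorId) = value
    isNewPeak
  }
}
{code}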

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, ExecutorsTab2.png, 
> MemoryTuningMetricsDesignDoc.pdf, StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.






[jira] [Assigned] (SPARK-16501) spark.mesos.secret exposed on UI and command line

2018-02-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-16501:
--

Assignee: Rob Vesse  (was: Marcelo Vanzin)

> spark.mesos.secret exposed on UI and command line
> -
>
> Key: SPARK-16501
> URL: https://issues.apache.org/jira/browse/SPARK-16501
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, Web UI
>Affects Versions: 1.6.2
>Reporter: Eric Daniel
>Assignee: Rob Vesse
>Priority: Major
>  Labels: security
> Fix For: 2.4.0
>
>
> There are two related problems with spark.mesos.secret:
> 1) The web UI shows its value in the "environment" tab
> 2) Passing it as a command-line option to spark-submit (or creating a 
> SparkContext from python, with the effect of launching spark-submit)  exposes 
> it to "ps"
> I'll be happy to submit a patch but I could use some advice first.
> The first problem is easy enough, just don't show that value in the UI
> For the second problem, I'm not sure what the best solution is. A 
> "spark.mesos.secret-file" parameter would let the user store the secret in a 
> non-world-readable file. Alternatively, the mesos secret could be obtained 
> from the environment, which other users don't have access to.  Either 
> solution would work in client mode, but I don't know if they're workable in 
> cluster mode.
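
A hedged sketch in Scala of the second idea (the spark.mesos.secret-file key and the environment variable name are hypothetical, taken from the proposal above rather than from Spark): resolve the secret from a file or the environment first, so it never has to appear on the command line.

{code:scala}
import scala.io.Source

// Sketch only: prefer a secret file or an environment variable over a
// plain config value passed on the command line. Key names are hypothetical.
def resolveMesosSecret(conf: Map[String, String]): Option[String] = {
  conf.get("spark.mesos.secret-file")
    .map { path =>
      val src = Source.fromFile(path)
      try src.mkString.trim finally src.close()
    }
    .orElse(sys.env.get("SPARK_MESOS_SECRET"))
    .orElse(conf.get("spark.mesos.secret"))
}
{code}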






[jira] [Assigned] (SPARK-16501) spark.mesos.secret exposed on UI and command line

2018-02-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-16501:
--

Assignee: Marcelo Vanzin

> spark.mesos.secret exposed on UI and command line
> -
>
> Key: SPARK-16501
> URL: https://issues.apache.org/jira/browse/SPARK-16501
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, Web UI
>Affects Versions: 1.6.2
>Reporter: Eric Daniel
>Assignee: Marcelo Vanzin
>Priority: Major
>  Labels: security
> Fix For: 2.4.0
>
>
> There are two related problems with spark.mesos.secret:
> 1) The web UI shows its value in the "environment" tab
> 2) Passing it as a command-line option to spark-submit (or creating a 
> SparkContext from python, with the effect of launching spark-submit)  exposes 
> it to "ps"
> I'll be happy to submit a patch but I could use some advice first.
> The first problem is easy enough, just don't show that value in the UI
> For the second problem, I'm not sure what the best solution is. A 
> "spark.mesos.secret-file" parameter would let the user store the secret in a 
> non-world-readable file. Alternatively, the mesos secret could be obtained 
> from the environment, which other users don't have access to.  Either 
> solution would work in client mode, but I don't know if they're workable in 
> cluster mode.






[jira] [Resolved] (SPARK-16501) spark.mesos.secret exposed on UI and command line

2018-02-09 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16501.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20167
[https://github.com/apache/spark/pull/20167]

> spark.mesos.secret exposed on UI and command line
> -
>
> Key: SPARK-16501
> URL: https://issues.apache.org/jira/browse/SPARK-16501
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, Web UI
>Affects Versions: 1.6.2
>Reporter: Eric Daniel
>Priority: Major
>  Labels: security
> Fix For: 2.4.0
>
>
> There are two related problems with spark.mesos.secret:
> 1) The web UI shows its value in the "environment" tab
> 2) Passing it as a command-line option to spark-submit (or creating a 
> SparkContext from python, with the effect of launching spark-submit)  exposes 
> it to "ps"
> I'll be happy to submit a patch but I could use some advice first.
> The first problem is easy enough, just don't show that value in the UI
> For the second problem, I'm not sure what the best solution is. A 
> "spark.mesos.secret-file" parameter would let the user store the secret in a 
> non-world-readable file. Alternatively, the mesos secret could be obtained 
> from the environment, which other users don't have access to.  Either 
> solution would work in client mode, but I don't know if they're workable in 
> cluster mode.






[jira] [Updated] (SPARK-23360) SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath

2018-02-09 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23360:
-
Summary: SparkSession.createDataFrame timestamps can be incorrect with 
non-Arrow codepath  (was: SparkSession.createDataFrame results in correct 
results with non-Arrow codepath)

> SparkSession.createDataFrame timestamps can be incorrect with non-Arrow 
> codepath
> 
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +-------------------+
> |               time|
> +-------------------+
> |2015-10-31 21:30:00|
> +-------------------+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}






[jira] [Commented] (SPARK-23374) Checkstyle/Scalastyle only work from top level build

2018-02-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358775#comment-16358775
 ] 

Marcelo Vanzin commented on SPARK-23374:


I find it's just easier to run everything from the top level instead of doing 
crazy pom hacking...

e.g. {{mvn -pl :spark-mesos_2.11 verify}} instead of {{cd blah/mesos && mvn 
verify}}

> Checkstyle/Scalastyle only work from top level build
> 
>
> Key: SPARK-23374
> URL: https://issues.apache.org/jira/browse/SPARK-23374
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rob Vesse
>Priority: Trivial
>
> The current Maven plugin definitions for Checkstyle/Scalastyle use fixed XML 
> configs for the style rule locations that are only valid relative to the top 
> level POM.  Therefore if you try and do a {{mvn verify}} in an individual 
> module you get the following error:
> {noformat}
> [ERROR] Failed to execute goal 
> org.scalastyle:scalastyle-maven-plugin:1.0.0:check (default) on project 
> spark-mesos_2.11: Failed during scalastyle execution: Unable to find 
> configuration file at location scalastyle-config.xml
> {noformat}
> As the paths are hardcoded in XML and don't use Maven properties you can't 
> override these settings so you can't style check a single module which makes 
> doing style checking require a full project {{mvn verify}} which is not ideal.
> By introducing Maven properties for these two paths it would become possible 
> to run checks on a single module like so:
> {noformat}
> mvn verify -Dscalastyle.location=../scalastyle-config.xml
> {noformat}
> Obviously the override would need to vary depending on the specific module 
> you are trying to run it against, but this would be a relatively simple change 
> that would streamline dev workflows.






[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358759#comment-16358759
 ] 

Xuefu Zhang commented on SPARK-22683:
-

On a side note, besides the name of the configuration (which is subject to change), 
I think (as mentioned previously) that the value doesn't have to be an 
integer, so as to allow finer control.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor.cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency but wastes 
> resources when tasks are small relative to the executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% reduction in resource usage 
> (versus the 114% increase reported above), for a 37% (versus 43%) reduction 
> in wall clock time for Spark w.r.t. MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per slot, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try in order to solve the issue with existing parameters (summing 
> up a few points mentioned in the comments)?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance, and such old executors could become contention 
> points for other executors trying to remotely access blocks in the old 
> executors (not witnessed in the jobs I'm talking about, but we did see this 
> behavior in other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
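To make the arithmetic behind the proposal concrete, here is a minimal Scala 
sketch (illustrative only; the helper and its defaults are not actual Spark 
code, and the final config name is still open) of how the number of executors 
targeted by dynamic allocation would change with a tasksPerExecutorSlot setting:

{code:scala}
// Illustrative sketch, not Spark source: target executor count under the proposal.
// taskSlots per executor = spark.executor.cores / spark.task.cpus (see the definition above).
def targetExecutors(pendingTasks: Int,
                    executorCores: Int = 5, // spark.executor.cores used in the experiments
                    taskCpus: Int = 1,      // spark.task.cpus, assumed default
                    tasksPerSlot: Int = 1   // proposed tasksPerExecutorSlot
                   ): Int = {
  val slotsPerExecutor = executorCores / taskCpus
  math.ceil(pendingTasks.toDouble / (slotsPerExecutor * tasksPerSlot)).toInt
}

targetExecutors(3000)                    // current behaviour (1 task per slot): 600 executors
targetExecutors(3000, tasksPerSlot = 6)  // proposal at 6 tasks per slot: 100 executors
{code}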



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2018-02-09 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358668#comment-16358668
 ] 

Eyal Farago commented on SPARK-19870:
-

I'll remember to share relevant future logs.
Re. the exception code path missing a cleanup: you're definitely right, but I'm 
less concerned about this one, as this code path is 'reserved' for tasks (I 
don't think Netty threads ever get to this code), hence cleanup (+ warning) is 
guaranteed.

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> 
>
> Key: SPARK-19870
> URL: https://issues.apache.org/jira/browse/SPARK-19870
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.0.2, 2.1.0
> Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>Reporter: Steven Ruppert
>Priority: Major
> Attachments: cs.executor.log, stack.txt
>
>
> Running what I believe to be a fairly vanilla spark job, using the RDD api, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to parquet. I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x7fffb95f3000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
> - waiting to lock <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x0005b12f2290> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and 
> {noformat}
> "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 
> tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 
> org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
> - locked <0x000545736b58> (a 
> org.apache.spark.storage.BlockInfoManager)
> at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
> - locked <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x00059711eb10> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-09 Thread V Luong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358634#comment-16358634
 ] 

V Luong commented on SPARK-2:
-

[~cloud_fan] alternatively, is there any way that 
VectorAssembler.transform(...) can get the "numAttributes" 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88)]
 metadata from somewhere else instead of materializing a row? Does the current 
need to materialize a row mean that some metadata is lacking somewhere?

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358623#comment-16358623
 ] 

Wenchen Fan commented on SPARK-2:
-

This is not a trivial change: we would need to introduce an `AnyRow` operator 
that can eliminate unneeded Sort (and maybe more) operators. If we can get what 
we want from any row, does that mean what we really want is some kind of 
metadata?

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2018-02-09 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358612#comment-16358612
 ] 

Imran Rashid commented on SPARK-19870:
--

To be honest, I'm not really sure what I'm looking for :)

Even INFO logs are pretty useful, though, at helping walk through the code and 
figuring out which parts to look at more suspiciously.  E.g. in the logs you 
uploaded, I can say those WARN msgs are probably benign as it's just related to 
a take / limit in the stage.  Another example is that I noticed that this call 
to {{releaseLocks}}: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L218

doesn't have a corresponding case in the exception path: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L226

Logs would make it clear if you ever hit that exception -- though I don't think 
that's it, as I don't think you should ever actually hit that exception.

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> 
>
> Key: SPARK-19870
> URL: https://issues.apache.org/jira/browse/SPARK-19870
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.0.2, 2.1.0
> Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>Reporter: Steven Ruppert
>Priority: Major
> Attachments: cs.executor.log, stack.txt
>
>
> Running what I believe to be a fairly vanilla spark job, using the RDD api, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to parquet. I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x7fffb95f3000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
> - waiting to lock <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x0005b12f2290> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and 
> {noformat}
> "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 
> tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 
> org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
> - locked <0x000545736b58> (a 
> org.apache.spark.storage.BlockInfoManager)
> at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
> - locked <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x00059711eb10> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> 

[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-09 Thread V Luong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358599#comment-16358599
 ] 

V Luong commented on SPARK-2:
-

[~cloud_fan] there are many scenarios in which oldDF involves sorting in its 
plan, e.g. if certain feature columns are calculated using window functions. 
In general, it would be a pain to always make sure that oldDF doesn't involve 
sorting (e.g. by checkpointing to files) prior to VectorAssembler. Anyway, 
VectorAssembler metadata shouldn't strictly need the first row.

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23376) creating UnsafeKVExternalSorter with BytesToBytesMap may fail

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23376:


Assignee: Apache Spark  (was: Wenchen Fan)

> creating UnsafeKVExternalSorter with BytesToBytesMap may fail
> -
>
> Key: SPARK-23376
> URL: https://issues.apache.org/jira/browse/SPARK-23376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23376) creating UnsafeKVExternalSorter with BytesToBytesMap may fail

2018-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358542#comment-16358542
 ] 

Apache Spark commented on SPARK-23376:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20561

> creating UnsafeKVExternalSorter with BytesToBytesMap may fail
> -
>
> Key: SPARK-23376
> URL: https://issues.apache.org/jira/browse/SPARK-23376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23376) creating UnsafeKVExternalSorter with BytesToBytesMap may fail

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23376:


Assignee: Wenchen Fan  (was: Apache Spark)

> creating UnsafeKVExternalSorter with BytesToBytesMap may fail
> -
>
> Key: SPARK-23376
> URL: https://issues.apache.org/jira/browse/SPARK-23376
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23376) creating UnsafeKVExternalSorter with BytesToBytesMap may fail

2018-02-09 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23376:
---

 Summary: creating UnsafeKVExternalSorter with BytesToBytesMap may 
fail
 Key: SPARK-23376
 URL: https://issues.apache.org/jira/browse/SPARK-23376
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1, 2.1.2, 2.0.2, 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23269) FP-growth: Provide last transaction for each detected frequent pattern

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23269.
---
Resolution: Won't Fix

> FP-growth: Provide last transaction for each detected frequent pattern
> --
>
> Key: SPARK-23269
> URL: https://issues.apache.org/jira/browse/SPARK-23269
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Arseniy Tashoyan
>Priority: Minor
>  Labels: MLlib, fp-growth
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> FP-growth implementation gives patterns and their frequences:
> _model.freqItemsets_:
> ||items||freq||
> |[5]|3|
> |[5, 1]|3|
> It would be great to know when each pattern occurred last time - what is the 
> last transaction having this pattern?
> To do so, it will be necessary to tell FPGrowth what is the timestamp column 
> in the transactions data frame:
> {code:java}
> val fpgrowth = new FPGrowth()
>   .setItemsCol("items")
>   .setTimestampCol("timestamp")
> {code}
> So the data frame with patterns could look like:
> ||items||freq||lastOccurrence||
> |[5]|3|2018-01-01 12:15:00|
> |[5, 1]|3|2018-01-01 12:15:00|
> Without this functionality, it is necessary to traverse the transactions data 
> frame with the set of detected patterns and determine the last transaction 
> for each pattern. Why traverse the transactions once again if it has already 
> been done during FP-growth execution?
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Harleen Singh Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358495#comment-16358495
 ] 

Harleen Singh Mann commented on SPARK-23370:


[~q79969786] your suggestion would work, but only if one knows in advance that 
the Oracle DB contains a column of type Number that was added using an ALTER 
TABLE statement. This information is seldom available to developers.

[~srowen] True, it is an Oracle issue. If everyone agrees that Spark has 
nothing to do with it, we may close this issue as is.

However, I feel there may be merit in evaluating the way Spark fetches schema 
information over JDBC, i.e. resultSet.getMetaData.getColumnType vs. the 
all_tab_columns view.

Thanks.

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Minor
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on JDBC read Spark obtains the schema of a table using 
> resultSet.getMetaData.getColumnType.
> This works 99.99% of the time, except when a column of Number type is added 
> to an Oracle table using an ALTER statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented on Oracle KB and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> As a result of the above-mentioned issue, Spark receives a size of 0 for the 
> field and defaults the field type to BigDecimal(30,10) instead of what it 
> actually should be. This is done in OracleDialect.scala. This may cause 
> issues in downstream applications where relevant information may be missed 
> due to the changed precision and scale.
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: through 
> the all_tab_columns view. If we use it to fetch the precision and scale of 
> the Number type, the above issue is mitigated.
>  
> I can implement the changes, but require some inputs on the approach from 
> the gatekeepers here.
>  PS. This is also my first Jira issue and my first fork of Spark, so I will 
> need some guidance along the way. (Yes, I am a newbie to this.) Thanks...
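A rough sketch of the proposed alternative lookup (illustrative only: the 
connection URL, credentials and table name are placeholders, and where this 
would plug into OracleDialect is exactly the input being asked for):

{code:scala}
import java.sql.DriverManager

// Read precision/scale for NUMBER columns from ALL_TAB_COLUMNS instead of
// trusting ResultSetMetaData, which reports a size of 0 for the affected columns.
val conn = DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/service", "user", "pass")
val ps = conn.prepareStatement(
  "SELECT column_name, data_precision, data_scale FROM all_tab_columns WHERE table_name = ?")
ps.setString(1, "MY_TABLE")
val rs = ps.executeQuery()
while (rs.next()) {
  val name      = rs.getString("COLUMN_NAME")
  val precision = rs.getInt("DATA_PRECISION") // NULL precision would still need a sensible default
  val scale     = rs.getInt("DATA_SCALE")
  println(s"$name -> DecimalType($precision, $scale)")
}
rs.close(); ps.close(); conn.close()
{code}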



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23354) spark jdbc does not maintain length of data type when I move data from MS sql server to Oracle using spark jdbc

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23354.
---
Resolution: Not A Problem

> spark jdbc does not maintain length of data type when I move data from MS sql 
> server to Oracle using spark jdbc
> ---
>
> Key: SPARK-23354
> URL: https://issues.apache.org/jira/browse/SPARK-23354
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
>Reporter: Lav Patel
>Priority: Major
>
> Spark JDBC does not maintain the length of a data type when I move data from 
> MS SQL Server to Oracle using Spark JDBC.
>  
> To fix this, I have written code that figures out the length of the column 
> and does the conversion.
>  
> I can put more details with a code sample if the community is interested. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23358) When the number of partitions is greater than 2^28, it will result in an error result

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23358:
--
Affects Version/s: (was: 2.4.0)
   2.3.0
 Priority: Minor  (was: Major)

> When the number of partitions is greater than 2^28, it will result in an 
> error result
> -
>
> Key: SPARK-23358
> URL: https://issues.apache.org/jira/browse/SPARK-23358
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.2.2, 2.3.0
>
>
> In `checkIndexAndDataFile`, _blocks_ is an _Int_; when it is greater than 
> 2^28, `blocks*8` overflows, which leads to an incorrect result.
>  In fact, `blocks` is actually the number of partitions.
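For reference, a small illustration of the overflow (2^28 * 8 = 2^31, which no 
longer fits in an Int); widening to Long before multiplying is the obvious way 
out, assuming that is what the linked fix does:

{code:scala}
val blocks: Int = (1 << 28) + 1          // more than 2^28 partitions

val overflowed: Int = blocks * 8         // Int multiplication wraps around: -2147483640
val widened: Long   = blocks.toLong * 8L // widen first, then multiply: 2147483656
{code}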



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23371) Parquet Footer data is wrong on window in parquet format partition table

2018-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358488#comment-16358488
 ] 

Sean Owen commented on SPARK-23371:
---

It sounds like you have multiple versions of Parquet on your classpath, or at 
least you're writing with a new version and reading with an old one. That's 
not going to work. This does not look like a Spark problem.

> Parquet Footer data is wrong on window in parquet format partition table 
> -
>
> Key: SPARK-23371
> URL: https://issues.apache.org/jira/browse/SPARK-23371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2
>Reporter: pin_zhang
>Priority: Major
>
> On Windows, run the following SQL in spark-shell:
>  spark.sql("create table part_test (id string) partitioned by (index int) 
> stored as parquet")
>  spark.sql("insert into part_test partition (index=1) values ('1')")
> An exception is raised when querying with spark.sql("select * from 
> part_test").show().
> Because parquet.Version in parquet-hadoop-bundle-1.6.0.jar cannot load the 
> version info in Spark on Windows, the classloader tries to get the version 
> from parquet-format-2.3.0-incubating.jar:
> 18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because 
> created_by
>  could not be parsed (see PARQUET-251): parquet-mr
>  org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_
>  by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*))
>  at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
>  at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptSt
>  atistics.java:60)
>  at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParq
>  uetStatistics(ParquetMetadataConverter.java:263)
>  at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(Parque
>  tFileReader.java:583)
>  at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetF
>  ileReader.java:513)
>  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR
>  ecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
>  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR
>  ecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
>  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR
>  ecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNe
>  xt(RecordReaderIterator.scala:39)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNex
>  t(FileScanRDD.scala:109)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIt
>  erator(FileScanRDD.scala:184)
>  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNex
>  t(FileScanRDD.scala:109)
>  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIte
>  rator.scan_nextBatch$(Unknown Source)
>  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIte
>  rator.processNext(Unknown Source)
>  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRo
>  wIterator.java:43)
>  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon
>  $1.hasNext(WholeStageCodegenExec.scala:377)
>  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.s
>  cala:231)
>  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.s
>  cala:225)
>  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$ap
>  ply$25.apply(RDD.scala:827)
>  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$ap
>  ply$25.apply(RDD.scala:827)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:
>  38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
>  java:1142)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
>  .java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23358) When the number of partitions is greater than 2^28, it will result in an error result

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23358.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

Issue resolved by pull request 20544
[https://github.com/apache/spark/pull/20544]

> When the number of partitions is greater than 2^28, it will result in an 
> error result
> -
>
> Key: SPARK-23358
> URL: https://issues.apache.org/jira/browse/SPARK-23358
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> In `checkIndexAndDataFile`, _blocks_ is an _Int_; when it is greater than 
> 2^28, `blocks*8` overflows, which leads to an incorrect result.
>  In fact, `blocks` is actually the number of partitions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23358) When the number of partitions is greater than 2^28, it will result in an error result

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23358:
-

Assignee: liuxian

> When the number of partitions is greater than 2^28, it will result in an 
> error result
> -
>
> Key: SPARK-23358
> URL: https://issues.apache.org/jira/browse/SPARK-23358
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> In `checkIndexAndDataFile`, _blocks_ is an _Int_; when it is greater than 
> 2^28, `blocks*8` overflows, which leads to an incorrect result.
>  In fact, `blocks` is actually the number of partitions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23347) Introduce buffer between Java data stream and gzip stream

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23347.
---
Resolution: Not A Problem

> Introduce buffer between Java data stream and gzip stream
> -
>
> Key: SPARK-23347
> URL: https://issues.apache.org/jira/browse/SPARK-23347
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ted Yu
>Priority: Minor
>
> Currently GZIPOutputStream is used directly around ByteArrayOutputStream, 
> e.g. in KVStoreSerializer:
> {code}
>   ByteArrayOutputStream bytes = new ByteArrayOutputStream();
>   GZIPOutputStream out = new GZIPOutputStream(bytes);
> {code}
> This seems inefficient.
> GZIPOutputStream does not implement the write(byte) method. It only provides 
> a write(byte[], offset, len) method, which calls the corresponding JNI zlib 
> function.
> BufferedOutputStream can be introduced wrapping GZIPOutputStream for better 
> performance.
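A minimal sketch of the suggested change (the buffer size is arbitrary; 
KVStoreSerializer is only referenced as in the quote above):

{code:scala}
import java.io.{BufferedOutputStream, ByteArrayOutputStream}
import java.util.zip.GZIPOutputStream

// Coalesce many small writes in a buffer before they reach the GZIP/zlib layer.
val bytes = new ByteArrayOutputStream()
val out   = new BufferedOutputStream(new GZIPOutputStream(bytes), 8 * 1024)
out.write("some serialized payload".getBytes("UTF-8"))
out.close() // flushes the buffer and finishes the GZIP stream
val compressed: Array[Byte] = bytes.toByteArray
{code}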



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2:
--
Priority: Minor  (was: Major)

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23374) Checkstyle/Scalastyle only work from top level build

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23374:
--
  Priority: Trivial  (was: Minor)
Issue Type: Improvement  (was: Bug)

This isn't a bug; it's how it's supposed to work, as it's there for Jenkins 
jobs. If you can suggest a clean change that makes it more flexible, sure, but 
otherwise I'd close this.

> Checkstyle/Scalastyle only work from top level build
> 
>
> Key: SPARK-23374
> URL: https://issues.apache.org/jira/browse/SPARK-23374
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rob Vesse
>Priority: Trivial
>
> The current Maven plugin definitions for Checkstyle/Scalastyle use fixed XML 
> configs for the style rule locations that are only valid relative to the top 
> level POM.  Therefore if you try and do a {{mvn verify}} in an individual 
> module you get the following error:
> {noformat}
> [ERROR] Failed to execute goal 
> org.scalastyle:scalastyle-maven-plugin:1.0.0:check (default) on project 
> spark-mesos_2.11: Failed during scalastyle execution: Unable to find 
> configuration file at location scalastyle-config.xml
> {noformat}
> As the paths are hardcoded in XML and don't use Maven properties you can't 
> override these settings so you can't style check a single module which makes 
> doing style checking require a full project {{mvn verify}} which is not ideal.
> By introducing Maven properties for these two paths it would become possible 
> to run checks on a single module like so:
> {noformat}
> mvn verify -Dscalastyle.location=../scalastyle-config.xml
> {noformat}
> Obviously the override would need to vary depending on the specific module 
> you are trying to run it against, but this would be a relatively simple change 
> that would streamline dev workflows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23372:
--
Issue Type: Improvement  (was: Bug)

> Writing empty struct in parquet fails during execution. It should fail 
> earlier during analysis.
> ---
>
> Key: SPARK-23372
> URL: https://issues.apache.org/jira/browse/SPARK-23372
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> *Running*
> spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
> *Results in*
> {code:java}
>  org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with 
> an empty group: message spark_schema {
>  }
> at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
>  at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
>  at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
>  at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.
>  {code}
> We should detect this earlier and fail during analysis of the query.
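A hedged sketch of the kind of early check being asked for (not actual Spark 
code; where it should live, analysis vs. the Parquet write path, is the open 
question here):

{code:scala}
import org.apache.spark.sql.types.StructType

// Illustrative: reject an empty schema up front, before a write job is launched,
// instead of letting Parquet's TypeUtil throw at execution time.
def assertNonEmptySchema(schema: StructType, format: String): Unit = {
  if (schema.isEmpty) {
    throw new IllegalArgumentException(
      s"Datasource $format does not support writing a schema with no columns")
  }
}
{code}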



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23364) 'desc table' command in spark-sql add column head display

2018-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358477#comment-16358477
 ] 

Sean Owen commented on SPARK-23364:
---

[~guoxiaolongzte] please don't reopen JIRAs without any change. You have 
provided no description of the change or reason it's needed.

> 'desc table' command in spark-sql add column head display
> -
>
> Key: SPARK-23364
> URL: https://issues.apache.org/jira/browse/SPARK-23364
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> fix before: 
>  !2.png! 
> fix after:
>  !1.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2018-02-09 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358458#comment-16358458
 ] 

Eyal Farago commented on SPARK-19870:
-

[~irashid], I'm afraid I don't have a record of which executor got into this 
hang, so I can't think of a way to find its logs (on top of this, the Spark UI 
via the history server seems a bit unreliable, i.e. jobs 'running' in the UI 
are reported as complete in the executor logs).

Can you please share what it is you're looking for in the executor logs? As 
you can see in the one I've shared, Spark's logging level is set to WARN, so 
there's not much in it...

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> 
>
> Key: SPARK-19870
> URL: https://issues.apache.org/jira/browse/SPARK-19870
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.0.2, 2.1.0
> Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>Reporter: Steven Ruppert
>Priority: Major
> Attachments: cs.executor.log, stack.txt
>
>
> Running what I believe to be a fairly vanilla spark job, using the RDD api, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to parquet. I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x7fffb95f3000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
> - waiting to lock <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x0005b12f2290> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and 
> {noformat}
> "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 
> tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 
> org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
> - locked <0x000545736b58> (a 
> org.apache.spark.storage.BlockInfoManager)
> at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
> - locked <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x00059711eb10> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> 

[jira] [Commented] (SPARK-23375) Optimizer should remove unneeded Sort

2018-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358435#comment-16358435
 ] 

Apache Spark commented on SPARK-23375:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20560

> Optimizer should remove unneeded Sort
> -
>
> Key: SPARK-23375
> URL: https://issues.apache.org/jira/browse/SPARK-23375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> As pointed out in SPARK-23368, as of now there is no rule to remove the Sort 
> operator on an already sorted plan, i.e. if we have a query like:
> {code}
> SELECT b
> FROM (
> SELECT a, b
> FROM table1
> ORDER BY a
> ) t
> ORDER BY a
> {code}
> The sort is actually executed twice, even though it is not needed.
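A quick spark-shell sketch (illustrative data) that reproduces the plan 
described above:

{code:scala}
// Build a tiny table and look at the plan of the nested ORDER BY query.
val df = spark.range(100).selectExpr("id AS a", "id % 7 AS b")
df.createOrReplaceTempView("table1")

spark.sql("""
  SELECT b
  FROM (
    SELECT a, b
    FROM table1
    ORDER BY a
  ) t
  ORDER BY a
""").explain() // per the description above, the plan contains two Sort operators
{code}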



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23375) Optimizer should remove unneeded Sort

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23375:


Assignee: Apache Spark

> Optimizer should remove unneeded Sort
> -
>
> Key: SPARK-23375
> URL: https://issues.apache.org/jira/browse/SPARK-23375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Minor
>
> As pointed out in SPARK-23368, as of now there is no rule to remove the Sort 
> operator on an already sorted plan, i.e. if we have a query like:
> {code}
> SELECT b
> FROM (
> SELECT a, b
> FROM table1
> ORDER BY a
> ) t
> ORDER BY a
> {code}
> The sort is actually executed twice, even though it is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23375) Optimizer should remove unneeded Sort

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23375:


Assignee: (was: Apache Spark)

> Optimizer should remove unneeded Sort
> -
>
> Key: SPARK-23375
> URL: https://issues.apache.org/jira/browse/SPARK-23375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> As pointed out in SPARK-23368, as of now there is no rule to remove the Sort 
> operator on an already sorted plan, i.e. if we have a query like:
> {code}
> SELECT b
> FROM (
> SELECT a, b
> FROM table1
> ORDER BY a
> ) t
> ORDER BY a
> {code}
> The sort is actually executed twice, even though it is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23375) Optimizer should remove unneeded Sort

2018-02-09 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-23375:

Description: 
As pointed out in SPARK-23368, as of now there is no rule to remove the Sort 
operator on an already sorted plan, i.e. if we have a query like:

{code}
SELECT b
FROM (
SELECT a, b
FROM table1
ORDER BY a
) t
ORDER BY a
{code}


The sort is actually executed twice, even though it is not needed.

  was:
As pointed out in SPARK-23368, as of now there is no rule to remove the Sort 
operator on an already sorted plan, ie. if we have a query like:

{{code}}
SELECT b
FROM (
SELECT a, b
FROM table1
ORDER BY a
) t
ORDER BY a
{{code}}

The sort is actually executed twice, even though it is not needed.


> Optimizer should remove unneeded Sort
> -
>
> Key: SPARK-23375
> URL: https://issues.apache.org/jira/browse/SPARK-23375
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> As pointed out in SPARK-23368, as of now there is no rule to remove the Sort 
> operator on an already sorted plan, i.e. if we have a query like:
> {code}
> SELECT b
> FROM (
> SELECT a, b
> FROM table1
> ORDER BY a
> ) t
> ORDER BY a
> {code}
> The sort is actually executed twice, even though it is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23375) Optimizer should remove unneeded Sort

2018-02-09 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-23375:
---

 Summary: Optimizer should remove unneeded Sort
 Key: SPARK-23375
 URL: https://issues.apache.org/jira/browse/SPARK-23375
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


As pointed out in SPARK-23368, as of now there is no rule to remove the Sort 
operator on an already sorted plan, ie. if we have a query like:

{{code}}
SELECT b
FROM (
SELECT a, b
FROM table1
ORDER BY a
) t
ORDER BY a
{{code}}

The sort is actually executed twice, even though it is not needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23363) Fix spark-sql bug or improvement

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23363.
---
Resolution: Invalid

[~guoxiaolongzte] do not reopen JIRAs with no change. There is no purpose in 
this one; it's an umbrella of one issue, and the umbrella is just about "bugs".

> Fix spark-sql bug or improvement
> 
>
> Key: SPARK-23363
> URL: https://issues.apache.org/jira/browse/SPARK-23363
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23360) SparkSession.createDataFrame results in incorrect results with non-Arrow codepath

2018-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358408#comment-16358408
 ] 

Apache Spark commented on SPARK-23360:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20559

> SparkSession.createDataFrame results in incorrect results with non-Arrow 
> codepath
> ---
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +---+
> |   time|
> +---+
> |2015-10-31 21:30:00|
> +---+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23360) SparkSession.createDataFrame results in incorrect results with non-Arrow codepath

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23360:


Assignee: (was: Apache Spark)

> SparkSession.createDataFrame results in incorrect results with non-Arrow 
> codepath
> ---
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Priority: Major
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +-------------------+
> |               time|
> +-------------------+
> |2015-10-31 21:30:00|
> +-------------------+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-23363) Fix spark-sql bug or improvement

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-23363.
-

> Fix spark-sql bug or improvement
> 
>
> Key: SPARK-23363
> URL: https://issues.apache.org/jira/browse/SPARK-23363
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: guoxiaolongzte
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23360) SparkSession.createDataFrame results in incorrect results with non-Arrow codepath

2018-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23360:


Assignee: Apache Spark

> SparkSession.createDataFrame results in incorrect results with non-Arrow 
> codepath
> ---
>
> Key: SPARK-23360
> URL: https://issues.apache.org/jira/browse/SPARK-23360
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> import datetime
> import pandas as pd
> import os
> dt = [datetime.datetime(2015, 10, 31, 22, 30)]
> pdf = pd.DataFrame({'time': dt})
> os.environ['TZ'] = 'America/New_York'
> df1 = spark.createDataFrame(pdf)
> df1.show()
> +-------------------+
> |               time|
> +-------------------+
> |2015-10-31 21:30:00|
> +-------------------+
> {code}
> Seems to be related to this line here:
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776]
> It appears to be an issue with "tzlocal()"
> Wrong:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "tzlocal()"
> s.apply(lambda ts:  
> ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 21:30:00
> Name: time, dtype: datetime64[ns]
> {code}
> Correct:
> {code:java}
> from_tz = "America/New_York"
> to_tz = "America/New_York"
> s.apply(
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
> if ts is not pd.NaT else pd.NaT)
> 0   2015-10-31 22:30:00
> Name: time, dtype: datetime64[ns]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23370:
--
  Shepherd:   (was: Sean Owen)
 Flags:   (was: Important)
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

(Don't assign shepherds please; I don't accept this even as an issue.)
This is an Oracle problem, as you say, so not a Spark bug.
A clean workaround is OK, but it sounds like there's one that doesn't even 
require code changes. So I'd close this.

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Minor
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on JDBC read Spark obtains the schema of a table by using 
> {color:#654982}resultSet.getMetaData.getColumnType{color}.
> This works 99.99% of the time, except when a column of Number type is added 
> to an Oracle table using an alter statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented on Oracle KB and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above-mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application, where relevant 
> information may be missed due to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: through 
> the all_tab_columns table. If we use this table to fetch the precision and 
> scale of the Number type, the above issue is mitigated.
>  
> {color:#14892c}{color:#f6c342}I can implement the changes, but require some 
> input on the approach from the gatekeepers here{color}.{color}
>  {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (Yes, I am a newbie to 
> this.) Thanks...{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-23373.
-
Resolution: Cannot Reproduce

> Can not execute "count distinct" queries on parquet formatted table
> ---
>
> Key: SPARK-23373
> URL: https://issues.apache.org/jira/browse/SPARK-23373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wang, Gang
>Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation"; table 
> nation is stored in Parquet format, and the error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
>  _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_
>  _Error in query: Table or view not found: nation; line 1 pos 35_
>  _spark-sql> select count(distinct n_name) from nation_parquet;_
>  _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_parquet_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
> array_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
> struct_
>  _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
> true_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as 
> values in memory (estimated size 305.0 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored 
> as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
> broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 
> 366.3 MB)_
>  _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
> processCmd at CliDriver.java:376_
>  _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
> partition pruning:_
>  _PartitionDirectory([empty 
> row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
>  isDirectory=false; length=3216; replication=3; blocksize=134217728; 
> modification_time=1516619879024; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false}))_
>  _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes._
>  _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
> count(distinct n_name) from nation_parquet]_
>  {color:#ff}*_org.apache.spark.SparkException: Task not 
> serializable_*{color}
>  _at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
>  _at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at 
> 

[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358402#comment-16358402
 ] 

Marco Gaido commented on SPARK-23373:
-

Then I think we can close this, thanks.

> Can not execute "count distinct" queries on parquet formatted table
> ---
>
> Key: SPARK-23373
> URL: https://issues.apache.org/jira/browse/SPARK-23373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wang, Gang
>Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation"; table 
> nation is stored in Parquet format, and the error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
>  _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_
>  _Error in query: Table or view not found: nation; line 1 pos 35_
>  _spark-sql> select count(distinct n_name) from nation_parquet;_
>  _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_parquet_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
> array_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
> struct_
>  _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
> true_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as 
> values in memory (estimated size 305.0 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored 
> as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
> broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 
> 366.3 MB)_
>  _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
> processCmd at CliDriver.java:376_
>  _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
> partition pruning:_
>  _PartitionDirectory([empty 
> row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
>  isDirectory=false; length=3216; replication=3; blocksize=134217728; 
> modification_time=1516619879024; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false}))_
>  _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes._
>  _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
> count(distinct n_name) from nation_parquet]_
>  {color:#ff}*_org.apache.spark.SparkException: Task not 
> serializable_*{color}
>  _at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
>  _at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at 
> 

[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358401#comment-16358401
 ] 

Yuming Wang commented on SPARK-23370:
-

Users can configure the column type like below now ({{jdbcUrl}} is a 
placeholder for the JDBC connection string):
{code:scala}
import java.util.Properties

val props = new Properties()
// customSchema overrides the types Spark would otherwise infer from JDBC metadata.
props.put("customSchema", "ID decimal(38, 0), N1 int, N2 boolean")
val dfRead = spark.read.jdbc(jdbcUrl, "tableWithCustomSchema", props)
dfRead.show()
{code}
More details:
https://github.com/apache/spark/pull/18266
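
The same override can also be passed through the data source options API (a 
sketch, untested here, assuming the {{customSchema}} option added by the PR 
above; {{jdbcUrl}} is again a placeholder):
{code:scala}
// Sketch: pass customSchema as a reader option instead of a Properties entry.
val dfRead2 = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable", "tableWithCustomSchema")
  .option("customSchema", "ID decimal(38, 0), N1 int, N2 boolean")
  .load()
dfRead2.show()
{code}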

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Major
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on JDBC read Spark obtains the schema of a table by using 
> {color:#654982}resultSet.getMetaData.getColumnType{color}.
> This works 99.99% of the time, except when a column of Number type is added 
> to an Oracle table using an alter statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented on Oracle KB and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above-mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application, where relevant 
> information may be missed due to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: through 
> the all_tab_columns table. If we use this table to fetch the precision and 
> scale of the Number type, the above issue is mitigated.
>  
> {color:#14892c}{color:#f6c342}I can implement the changes, but require some 
> input on the approach from the gatekeepers here{color}.{color}
>  {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (Yes, I am a newbie to 
> this.) Thanks...{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12378) CREATE EXTERNAL TABLE AS SELECT EXPORT AWS S3 ERROR

2018-02-09 Thread Arun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358393#comment-16358393
 ] 

Arun commented on SPARK-12378:
--

I am also getting the same issue when I am trying to insert data into Hive 
from Spark.

My table is an external table stored in AWS S3.

Although the data gets inserted into the table, it gives this message:

 
{code:java}
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
18/02/09 13:25:56 ERROR KeyProviderCache: Could not find uri with key 
[dfs.encryption.key.provider.uri] to create a keyProvider !!
-chgrp: '' does not match expected pattern for group
Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...{code}
Any resolution please?

> CREATE EXTERNAL TABLE AS SELECT EXPORT AWS S3 ERROR
> ---
>
> Key: SPARK-12378
> URL: https://issues.apache.org/jira/browse/SPARK-12378
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: AWS EMR 4.2.0
> Just Master Running m3.xlarge
> Applications:
> Hive 1.0.0
> Spark 1.5.2
>Reporter: CESAR MICHELETTI
>Priority: Major
>
> I receive the below error when trying to export data to AWS S3 in 
> spark-sql.
> Command:
> CREATE external TABLE export 
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
> -- lines terminated by '\n' 
>  STORED AS TEXTFILE
>  LOCATION 's3://xxx/yyy'
>  AS
> SELECT 
> xxx
> 
> (complete query)
> ;
> Error:
> -chgrp: '' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> -chgrp: '' does not match expected pattern for group
> Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
> 15/12/16 21:09:25 ERROR SparkSQLDriver: Failed in [CREATE external TABLE 
> csvexport
> ...
> (create table + query)
> ...
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:441)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> 

[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358392#comment-16358392
 ] 

Yuming Wang commented on SPARK-23373:
-

I cannot reproduce it on current master either, as you mentioned.

> Can not execute "count distinct" queries on parquet formatted table
> ---
>
> Key: SPARK-23373
> URL: https://issues.apache.org/jira/browse/SPARK-23373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wang, Gang
>Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation"; table 
> nation is stored in Parquet format, and the error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
>  _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_
>  _Error in query: Table or view not found: nation; line 1 pos 35_
>  _spark-sql> select count(distinct n_name) from nation_parquet;_
>  _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_parquet_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
> array_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
> struct_
>  _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
> true_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as 
> values in memory (estimated size 305.0 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored 
> as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
> broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 
> 366.3 MB)_
>  _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
> processCmd at CliDriver.java:376_
>  _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
> partition pruning:_
>  _PartitionDirectory([empty 
> row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
>  isDirectory=false; length=3216; replication=3; blocksize=134217728; 
> modification_time=1516619879024; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false}))_
>  _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes._
>  _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
> count(distinct n_name) from nation_parquet]_
>  {color:#ff}*_org.apache.spark.SparkException: Task not 
> serializable_*{color}
>  _at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
>  _at 
> 

[jira] [Comment Edited] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Wang, Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358355#comment-16358355
 ] 

Wang, Gang edited comment on SPARK-23373 at 2/9/18 1:01 PM:


Yes, it seems related to my test environment.

Still, I tried it in a Spark suite, in class *PruneFileSourcePartitionsSuite*, 
in the method test("SPARK-20986 Reset table's statistics after 
PruneFileSourcePartitions rule"). Adding

_sql("select count(distinct id) from tbl").collect()_

there, I got the same exception. Could you please give it a try on your side?


was (Author: gwang3):
Yes. Seems related to my test environment.

While, I tried in a Spark suite, in class _*PruneFileSourcePartitionsSuite*, 
method_ test("SPARK-20986 Reset table's statistics after 
PruneFileSourcePartitions rule").

Add 

_sql("select count(distinct id) from tbl").collect()_

 __ got the same exception. Could you please have a try in your side?

> Can not execute "count distinct" queries on parquet formatted table
> ---
>
> Key: SPARK-23373
> URL: https://issues.apache.org/jira/browse/SPARK-23373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wang, Gang
>Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation"; table 
> nation is stored in Parquet format, and the error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
>  _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_
>  _Error in query: Table or view not found: nation; line 1 pos 35_
>  _spark-sql> select count(distinct n_name) from nation_parquet;_
>  _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_parquet_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
> array_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
> struct_
>  _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
> true_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as 
> values in memory (estimated size 305.0 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored 
> as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
> broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 
> 366.3 MB)_
>  _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
> processCmd at CliDriver.java:376_
>  _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
> partition pruning:_
>  _PartitionDirectory([empty 
> row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
>  isDirectory=false; length=3216; replication=3; blocksize=134217728; 
> modification_time=1516619879024; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false}))_
>  _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes._
>  _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
> count(distinct n_name) from nation_parquet]_
>  

[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Wang, Gang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358355#comment-16358355
 ] 

Wang, Gang commented on SPARK-23373:


Yes, it seems related to my test environment.

Still, I tried it in a Spark suite, in class *PruneFileSourcePartitionsSuite*, 
in the method test("SPARK-20986 Reset table's statistics after 
PruneFileSourcePartitions rule"). Adding

_sql("select count(distinct id) from tbl").collect()_

there, I got the same exception. Could you please give it a try on your side?
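
For anyone trying this outside the suite, here is a minimal standalone sketch 
of the same query shape (illustrative only: it uses a plain data source table 
rather than the Hive-partitioned setup of that test, so it may well not hit 
the exception):

{code:scala}
import org.apache.spark.sql.SparkSession

object CountDistinctRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-23373 repro sketch")
      .getOrCreate()
    import spark.implicits._

    // Write a small parquet-backed table, then run the failing query shape.
    Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "name")
      .write.mode("overwrite").saveAsTable("tbl")

    spark.sql("select count(distinct id) from tbl").collect().foreach(println)

    spark.stop()
  }
}
{code}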

> Can not execute "count distinct" queries on parquet formatted table
> ---
>
> Key: SPARK-23373
> URL: https://issues.apache.org/jira/browse/SPARK-23373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wang, Gang
>Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation"; table 
> nation is stored in Parquet format, and the error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
>  _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_
>  _Error in query: Table or view not found: nation; line 1 pos 35_
>  _spark-sql> select count(distinct n_name) from nation_parquet;_
>  _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_parquet_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
> array_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
> struct_
>  _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
> true_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as 
> values in memory (estimated size 305.0 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored 
> as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
> broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 
> 366.3 MB)_
>  _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
> processCmd at CliDriver.java:376_
>  _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
> partition pruning:_
>  _PartitionDirectory([empty 
> row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
>  isDirectory=false; length=3216; replication=3; blocksize=134217728; 
> modification_time=1516619879024; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false}))_
>  _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes._
>  _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
> count(distinct n_name) from nation_parquet]_
>  {color:#ff}*_org.apache.spark.SparkException: Task not 
> serializable_*{color}
>  _at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at 

[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358315#comment-16358315
 ] 

Marco Gaido commented on SPARK-23373:
-

I cannot reproduce this on current master... Could you try and check whether 
the issue still exists?

> Can not execute "count distinct" queries on parquet formatted table
> ---
>
> Key: SPARK-23373
> URL: https://issues.apache.org/jira/browse/SPARK-23373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wang, Gang
>Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation"; table 
> nation is stored in Parquet format, and the error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
>  _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_
>  _Error in query: Table or view not found: nation; line 1 pos 35_
>  _spark-sql> select count(distinct n_name) from nation_parquet;_
>  _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_parquet_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
> array_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
> struct_
>  _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
> true_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as 
> values in memory (estimated size 305.0 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored 
> as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
> broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 
> 366.3 MB)_
>  _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
> processCmd at CliDriver.java:376_
>  _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
> partition pruning:_
>  _PartitionDirectory([empty 
> row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
>  isDirectory=false; length=3216; replication=3; blocksize=134217728; 
> modification_time=1516619879024; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false}))_
>  _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes._
>  _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
> count(distinct n_name) from nation_parquet]_
>  {color:#ff}*_org.apache.spark.SparkException: Task not 
> serializable_*{color}
>  _at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
>  _at 
> 

[jira] [Commented] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.

2018-02-09 Thread Harleen Singh Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358294#comment-16358294
 ] 

Harleen Singh Mann commented on SPARK-23372:


What is your proposal for fixing this?

> Writing empty struct in parquet fails during execution. It should fail 
> earlier during analysis.
> ---
>
> Key: SPARK-23372
> URL: https://issues.apache.org/jira/browse/SPARK-23372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> *Running*
> spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
> *Results in*
> {code:java}
>  org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with 
> an empty group: message spark_schema {
>  }
> at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
>  at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
>  at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
>  at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.
>  {code}
> We should detect this earlier and fail during compilation of the query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23374) Checkstyle/Scalastyle only work from top level build

2018-02-09 Thread Rob Vesse (JIRA)
Rob Vesse created SPARK-23374:
-

 Summary: Checkstyle/Scalastyle only work from top level build
 Key: SPARK-23374
 URL: https://issues.apache.org/jira/browse/SPARK-23374
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.1
Reporter: Rob Vesse


The current Maven plugin definitions for Checkstyle/Scalastyle use fixed XML 
configs for the style rule locations that are only valid relative to the 
top-level POM. Therefore, if you try to run {{mvn verify}} in an individual 
module you get the following error:

{noformat}
[ERROR] Failed to execute goal 
org.scalastyle:scalastyle-maven-plugin:1.0.0:check (default) on project 
spark-mesos_2.11: Failed during scalastyle execution: Unable to find 
configuration file at location scalastyle-config.xml
{noformat}

As the paths are hardcoded in XML and don't use Maven properties, you can't 
override these settings, so you can't style-check a single module. This makes 
style checking require a full project {{mvn verify}}, which is not ideal.

By introducing Maven properties for these two paths it would become possible to 
run checks on a single module like so:

{noformat}
mvn verify -Dscalastyle.location=../scalastyle-config.xml
{noformat}

Obviously the override would need to vary depending on the specific module you 
are trying to run it against, but this would be a relatively simple change 
that would streamline dev workflows.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Wang, Gang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Gang updated SPARK-23373:
---
Description: 
I failed to run the SQL "select count(distinct n_name) from nation"; table 
nation is stored in Parquet format, and the error trace is as follows.

_spark-sql> select count(distinct n_name) from nation;_
 _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
count(distinct n_name) from nation_
 _Error in query: Table or view not found: nation; line 1 pos 35_
 _spark-sql> select count(distinct n_name) from nation_parquet;_
 _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
count(distinct n_name) from nation_parquet_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
array_
 _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
 _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
 _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
 _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
struct_
 _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
 _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
 _18/02/09 03:55:39 INFO main HashAggregateExec:54 
spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
version of codegened fast hashmap does not support this aggregate._
 _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
 _18/02/09 03:55:39 INFO main HashAggregateExec:54 
spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
version of codegened fast hashmap does not support this aggregate._
 _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
 _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
true_
 _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
 _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
 _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
 _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as values 
in memory (estimated size 305.0 KB, free 366.0 MB)_
 _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
 _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 366.3 
MB)_
 _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
processCmd at CliDriver.java:376_
 _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
partition pruning:_
 _PartitionDirectory([empty 
row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
 isDirectory=false; length=3216; replication=3; blocksize=134217728; 
modification_time=1516619879024; access_time=0; owner=; group=; 
permission=rw-rw-rw-; isSymlink=false}))_
 _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
bytes._
 _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
count(distinct n_name) from nation_parquet]_
 {color:#ff}*_org.apache.spark.SparkException: Task not 
serializable_*{color}
 _at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
 _at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
 _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
 _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
 _at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
 _at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
 _at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
 _at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)_
 _at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)_
 _at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)_
 _at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:389)_
 _at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
 _at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
 _at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_
 _at 

[jira] [Updated] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Wang, Gang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Gang updated SPARK-23373:
---
Description: 
I failed to run the SQL "select count(distinct n_name) from nation"; table 
nation is stored in Parquet format, and the error trace is as follows.

_spark-sql> select count(distinct n_name) from nation;_
 _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
count(distinct n_name) from nation_
 _Error in query: Table or view not found: nation; line 1 pos 35_
 _spark-sql> select count(distinct n_name) from nation_parquet;_
 _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
count(distinct n_name) from nation_parquet_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
 _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
array_
 _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
 _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
 _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
 _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
struct_
 _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
 _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
 _18/02/09 03:55:39 INFO main HashAggregateExec:54 
spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
version of codegened fast hashmap does not support this aggregate._
 _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
 _18/02/09 03:55:39 INFO main HashAggregateExec:54 
spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
version of codegened fast hashmap does not support this aggregate._
 _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
 _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
true_
 _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
 _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
 _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
 _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as values 
in memory (estimated size 305.0 KB, free 366.0 MB)_
 _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
 _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 366.3 
MB)_
 _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
processCmd at CliDriver.java:376_
 _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
partition pruning:_
 _PartitionDirectory([empty 
row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0;
 isDirectory=false; length=3216; replication=3; blocksize=134217728; 
modification_time=1516619879024; access_time=0; owner=; group=; 
permission=rw-rw-rw-; isSymlink=false}))_
 _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
bytes._
 _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
count(distinct n_name) from nation_parquet]_
 {color:#ff}*_org.apache.spark.SparkException: Task not 
serializable_*{color}
 _at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
 _at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
 _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
 _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
 _at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
 _at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
 _at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
 _at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)_
 _at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)_
 _at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)_
 _at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:389)_
 _at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
 _at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
 _at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_
 _at 

[jira] [Updated] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Wang, Gang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang, Gang updated SPARK-23373:
---
Issue Type: Bug  (was: New Feature)

> Can not execute "count distinct" queries on parquet formatted table
> ---
>
> Key: SPARK-23373
> URL: https://issues.apache.org/jira/browse/SPARK-23373
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wang, Gang
>Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation"; the table nation 
> is stored in Parquet format, and the error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
> _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_
> _Error in query: Table or view not found: nation; line 1 pos 35_
> _spark-sql> select count(distinct n_name) from nation_parquet;_
> _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
> count(distinct n_name) from nation_parquet_
> _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
> _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
> _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
> _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
> _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
> array_
> _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
> _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
> _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
> _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
> struct_
> _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
> _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
> _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
> _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
> _18/02/09 03:55:39 INFO main HashAggregateExec:54 
> spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
> version of codegened fast hashmap does not support this aggregate._
> _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
> _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is 
> true_
> _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
> _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
> _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
> _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as 
> values in memory (estimated size 305.0 KB, free 366.0 MB)_
> _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored 
> as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
> _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
> broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 
> 366.3 MB)_
> _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
> processCmd at CliDriver.java:376_
> _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
> partition pruning:_
>  _PartitionDirectory([empty 
> row],ArrayBuffer(LocatedFileStatus\{path=hdfs://btd-dev-2425209.lvs01.dev.ebayc3.com:8020/apps/hive/warehouse/nation_parquet/00_0;
>  isDirectory=false; length=3216; replication=3; blocksize=134217728; 
> modification_time=1516619879024; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false}))_
> _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
> packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
> bytes._
> _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
> count(distinct n_name) from nation_parquet]_
> {color:#FF}*_org.apache.spark.SparkException: Task not 
> serializable_*{color}
>  _at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
>  _at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
>  _at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at 
> 

[jira] [Created] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

2018-02-09 Thread Wang, Gang (JIRA)
Wang, Gang created SPARK-23373:
--

 Summary: Can not execute "count distinct" queries on parquet 
formatted table
 Key: SPARK-23373
 URL: https://issues.apache.org/jira/browse/SPARK-23373
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wang, Gang


I failed to run the SQL "select count(distinct n_name) from nation"; the table nation 
is stored in Parquet format, and the error trace is as follows.


_spark-sql> select count(distinct n_name) from nation;_
_18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select 
count(distinct n_name) from nation_
_Error in query: Table or view not found: nation; line 1 pos 35_
_spark-sql> select count(distinct n_name) from nation_parquet;_
_18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select 
count(distinct n_name) from nation_parquet_
_18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
_18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
_18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
_18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
_18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: 
array_
_18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
_18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
_18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
_18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: 
struct_
_18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
_18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
_18/02/09 03:55:39 INFO main HashAggregateExec:54 
spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
version of codegened fast hashmap does not support this aggregate._
_18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
_18/02/09 03:55:39 INFO main HashAggregateExec:54 
spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current 
version of codegened fast hashmap does not support this aggregate._
_18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
_18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is true_
_18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
_18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
_18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
_18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as values 
in memory (estimated size 305.0 KB, free 366.0 MB)_
_18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored as 
bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
_18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added 
broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 366.3 
MB)_
_18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from 
processCmd at CliDriver.java:376_
_18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after 
partition pruning:_
 _PartitionDirectory([empty 
row],ArrayBuffer(LocatedFileStatus\{path=hdfs://btd-dev-2425209.lvs01.dev.ebayc3.com:8020/apps/hive/warehouse/nation_parquet/00_0;
 isDirectory=false; length=3216; replication=3; blocksize=134217728; 
modification_time=1516619879024; access_time=0; owner=; group=; 
permission=rw-rw-rw-; isSymlink=false}))_
_18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin 
packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 
bytes._
_18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select 
count(distinct n_name) from nation_parquet]_
{color:#FF}*_org.apache.spark.SparkException: Task not serializable_*{color}
 _at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
 _at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
 _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
 _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
 _at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
 _at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
 _at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
 _at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)_
 _at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)_
 _at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)_
 _at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:389)_
 _at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
 _at 

[jira] [Created] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.

2018-02-09 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-23372:


 Summary: Writing empty struct in parquet fails during execution. 
It should fail earlier during analysis.
 Key: SPARK-23372
 URL: https://issues.apache.org/jira/browse/SPARK-23372
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dilip Biswal


*Running*

spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)

*Results in*
{code:java}
 org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with 
an empty group: message spark_schema {
 }

at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
 at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
 at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
 at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
 at 
org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
 at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
 at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:109)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.
 {code}

We should detect this earlier and fail during analysis of the query.
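
Until such a check exists in Spark itself, a hedged caller-side sketch (the helper name saveParquet and the path are made up for illustration, not part of this report) is to refuse to hand an empty schema to the Parquet writer, which is what triggers the InvalidSchemaException above:

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Caller-side guard: fail fast with a clear message instead of failing at task
// execution time inside the Parquet writer.
def saveParquet(df: DataFrame, path: String): Unit = {
  require(df.schema.nonEmpty, s"Refusing to write an empty schema to $path")
  df.write.format("parquet").mode("overwrite").save(path)
}

// Example usage (hypothetical path):
// val spark = SparkSession.builder().getOrCreate()
// saveParquet(spark.emptyDataFrame, "/tmp/empty")
{code}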



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.

2018-02-09 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358165#comment-16358165
 ] 

Dilip Biswal commented on SPARK-23372:
--

Working on a fix for this.

> Writing empty struct in parquet fails during execution. It should fail 
> earlier during analysis.
> ---
>
> Key: SPARK-23372
> URL: https://issues.apache.org/jira/browse/SPARK-23372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dilip Biswal
>Priority: Minor
>
> *Running*
> spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
> *Results in*
> {code:java}
>  org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with 
> an empty group: message spark_schema {
>  }
> at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
>  at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
>  at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
>  at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
>  at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:109)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.
>  {code}
> We should detect this earlier and fail during analysis of the query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23096) Migrate rate source to v2

2018-02-09 Thread Jose Torres (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358143#comment-16358143
 ] 

Jose Torres commented on SPARK-23096:
-

Sure! Happy to have help.

The "ratev2" source is just something I hacked together to exercise the v2 
streaming execution path. You're right that it can really be replaced with a 
fully migrated version of the v1 source.

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23371) Parquet Footer data is wrong on window in parquet format partition table

2018-02-09 Thread pin_zhang (JIRA)
pin_zhang created SPARK-23371:
-

 Summary: Parquet Footer data is wrong on window in parquet format 
partition table 
 Key: SPARK-23371
 URL: https://issues.apache.org/jira/browse/SPARK-23371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.2, 2.1.1
Reporter: pin_zhang


On Windows:

Run the following SQL in spark-shell:
 spark.sql("create table part_test (id string )partitioned by( index int) stored as parquet")
 spark.sql("insert into part_test partition (index =1) values ('1')")

An exception is raised when querying the table with spark.sql("select * from part_test ").show().

The parquet.Version class in parquet-hadoop-bundle-1.6.0.jar cannot load its version info in Spark on Windows, so the classloader tries to read the version from parquet-format-2.3.0-incubating.jar instead:

18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr
 org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*))
 at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
 at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
 at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
 at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
 at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
 at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
 at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:184)
 at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
 at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:99)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
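
A hedged diagnostic, not part of the original report: in spark-shell one can print which jar actually provides org.apache.parquet.Version on the driver classpath, to confirm the parquet-hadoop-bundle vs parquet-format conflict described above.

{code:scala}
// Prints the artifact that org.apache.parquet.Version is loaded from, revealing
// whether parquet-hadoop-bundle-1.6.0.jar or parquet-format-2.3.0-incubating.jar
// wins the classloading race described above.
val location = classOf[org.apache.parquet.Version]
  .getProtectionDomain.getCodeSource.getLocation
println(s"org.apache.parquet.Version loaded from: $location")
{code}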



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21860) Improve memory reuse for heap memory in `HeapMemoryAllocator`

2018-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358114#comment-16358114
 ] 

Apache Spark commented on SPARK-21860:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20558

> Improve memory reuse for heap memory in `HeapMemoryAllocator`
> -
>
> Key: SPARK-21860
> URL: https://issues.apache.org/jira/browse/SPARK-21860
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.4.0
>
>
> In `HeapMemoryAllocator`, memory is allocated from a pool whose key is the memory size.
> In practice, sizes such as 1025, 1026, ..., 1032 bytes can be treated as the same size, 
> because memory is allocated in multiples of 8 bytes.
> In this case, we can improve memory reuse.
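
A minimal sketch of the idea above (not the actual HeapMemoryAllocator code): round each requested size up to the next multiple of 8 bytes and use the rounded value as the pooling key, so requests for 1025 through 1032 bytes all map to the same pool entry.

{code:scala}
// Round a requested allocation size up to the word-aligned size used as the pool key.
def alignedPoolKey(requestedBytes: Long): Long = {
  val wordSize = 8L
  ((requestedBytes + wordSize - 1) / wordSize) * wordSize
}

// alignedPoolKey(1025) == alignedPoolKey(1032) == 1032, so a buffer freed after a
// 1032-byte request can later be reused for a 1025-byte request.
{code}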



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Harleen Singh Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harleen Singh Mann updated SPARK-23370:
---
Shepherd: Sean Owen  (was: Xiangrui Meng)

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Major
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on jdbc read spark obtains the schema of a table from using 
> {color:#654982} resultSet.getMetaData.getColumnType{color}
> This works 99.99% of the times except when the column of Number type is added 
> on an Oracle table using the alter statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented on Oracle KB and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application where relevant 
> information may be missed to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: Which is 
> through the all_tab_columns table. If we use this table to fetch the 
> precision and scale of Number time, the above issue is mitigated.
>  
> {color:#14892c}{color:#f6c342}I can implement the changes, but require some 
> inputs on the approach from the gatekeepers here{color}.{color}
>  {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (yes, I am a newbee to 
> this) Thanks...{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Harleen Singh Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harleen Singh Mann updated SPARK-23370:
---
Shepherd: Xiangrui Meng

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Major
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on jdbc read spark obtains the schema of a table from using 
> {color:#654982} resultSet.getMetaData.getColumnType{color}
> This works 99.99% of the times except when the column of Number type is added 
> on an Oracle table using the alter statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented on Oracle KB and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application where relevant 
> information may be missed to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
>  _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: Which is 
> through the all_tab_columns table. If we use this table to fetch the 
> precision and scale of Number time, the above issue is mitigated.
>  
> {color:#14892c}{color:#f6c342}I can implement the changes, but require some 
> inputs on the approach from the gatekeepers here{color}.{color}
>  {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (yes, I am a newbee to 
> this) Thanks...{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Harleen Singh Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harleen Singh Mann updated SPARK-23370:
---
Description: 
Currently, on JDBC read, Spark obtains the schema of a table using 
{color:#654982} resultSet.getMetaData.getColumnType{color}

This works 99.99% of the time, except when a column of NUMBER type is added to 
an Oracle table using an ALTER statement. This is essentially an Oracle DB + 
JDBC bug that has been documented in the Oracle KB, and patches exist. [oracle 
KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]

{color:#ff}As a result of the above-mentioned issue, Spark receives a size 
of 0 for the field and defaults the field type to BigDecimal(30,10) instead of 
what it actually should be. This is done in OracleDialect.scala. This may cause 
issues in downstream applications, where relevant information may be lost due 
to the changed precision and scale.{color}

_The versions that are affected are:_ 
 _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later]_
 _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
_[Release: 11.1 to 11.2]_ 

+Proposed approach:+

There is another way of fetching the schema information in Oracle: through the 
all_tab_columns table. If we use this table to fetch the precision and scale of 
the NUMBER column, the above issue is mitigated.

{color:#14892c}{color:#f6c342}I can implement the changes, but require some 
inputs on the approach from the gatekeepers here{color}.{color}

 {color:#14892c}PS. This is also my first Jira issue and my first fork of 
Spark, so I will need some guidance along the way. (Yes, I am a newbie to this.) 
Thanks...{color}

  was:
Currently, on jdbc read spark obtains the schema of a table from using 
{color:#654982} resultSet.getMetaData.getColumnType{color}

This works 99.99% of the times except when the column of Number type is added 
on an Oracle table using the alter statement. This is essentially an Oracle DB 
+ JDBC bug that has been documented on Oracle KB and patches exist. [oracle 
KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]

{color:#FF}As a result of the above mentioned issue, Spark receives a size 
of 0 for the field and defaults the field type to be BigDecimal(30,10) instead 
of what it actually should be. This is done in OracleDialect.scala. This may 
cause issues in the downstream application where relevant information may be 
missed to the changed precision and scale.{color}

_The versions that are affected are:_ 
_JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
_Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
_[Release: 11.1 to 11.2]_ 

+Proposed approach:+

There is another way of fetching the schema information in Oracle: Which is 
through the all_tab_columns table. If we use this table to fetch the precision 
and scale of Number time, the above issue is mitigated.

 

{color:#14892c}I can implement the changes, but require some inputs on the 
approach from the gatekeepers here.{color}
{color:#14892c}PS. This is also my first Jira issue and my first fork for 
Spark, so I will need some guidance along the way. (yes, I am a newbee to this) 
Thanks...{color}


> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Major
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on jdbc read spark obtains the schema of a table from using 
> {color:#654982} resultSet.getMetaData.getColumnType{color}
> This works 99.99% of the times except when the column of Number type is added 
> on an Oracle table using the alter statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented on Oracle KB and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#ff}As a result of the above mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application where relevant 
> information may be missed to the changed precision and scale.{color}
> _The versions that are affected are:_ 
>  _JDBC - Version: 

[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale

2018-02-09 Thread Harleen Singh Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harleen Singh Mann updated SPARK-23370:
---
Summary: Spark receives a size of 0 for an Oracle Number field and defaults 
the field type to be BigDecimal(30,10) instead of the actual precision and 
scale  (was: Spark receives a size of 0 for an Oracle Number field defaults the 
field type to be BigDecimal(30,10))

> Spark receives a size of 0 for an Oracle Number field and defaults the field 
> type to be BigDecimal(30,10) instead of the actual precision and scale
> ---
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Major
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on jdbc read spark obtains the schema of a table from using 
> {color:#654982} resultSet.getMetaData.getColumnType{color}
> This works 99.99% of the times except when the column of Number type is added 
> on an Oracle table using the alter statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented on Oracle KB and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#FF}As a result of the above mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application where relevant 
> information may be missed to the changed precision and scale.{color}
> _The versions that are affected are:_ 
> _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
> _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: Which is 
> through the all_tab_columns table. If we use this table to fetch the 
> precision and scale of Number time, the above issue is mitigated.
>  
> {color:#14892c}I can implement the changes, but require some inputs on the 
> approach from the gatekeepers here.{color}
> {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (yes, I am a newbee to 
> this) Thanks...{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field defaults the field type to be BigDecimal(30,10)

2018-02-09 Thread Harleen Singh Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harleen Singh Mann updated SPARK-23370:
---
Attachment: Oracle KB Document 1266785.pdf

> Spark receives a size of 0 for an Oracle Number field defaults the field type 
> to be BigDecimal(30,10)
> -
>
> Key: SPARK-23370
> URL: https://issues.apache.org/jira/browse/SPARK-23370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
> Environment: Spark 2.2
> Oracle 11g
> JDBC ojdbc6.jar
>Reporter: Harleen Singh Mann
>Priority: Major
> Attachments: Oracle KB Document 1266785.pdf
>
>
> Currently, on jdbc read spark obtains the schema of a table from using 
> {color:#654982} resultSet.getMetaData.getColumnType{color}
> This works 99.99% of the times except when the column of Number type is added 
> on an Oracle table using the alter statement. This is essentially an Oracle 
> DB + JDBC bug that has been documented on Oracle KB and patches exist. 
> [oracle 
> KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]
> {color:#FF}As a result of the above mentioned issue, Spark receives a 
> size of 0 for the field and defaults the field type to be BigDecimal(30,10) 
> instead of what it actually should be. This is done in OracleDialect.scala. 
> This may cause issues in the downstream application where relevant 
> information may be missed to the changed precision and scale.{color}
> _The versions that are affected are:_ 
> _JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later ]_
> _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
> _[Release: 11.1 to 11.2]_ 
> +Proposed approach:+
> There is another way of fetching the schema information in Oracle: Which is 
> through the all_tab_columns table. If we use this table to fetch the 
> precision and scale of Number time, the above issue is mitigated.
>  
> {color:#14892c}I can implement the changes, but require some inputs on the 
> approach from the gatekeepers here.{color}
> {color:#14892c}PS. This is also my first Jira issue and my first fork for 
> Spark, so I will need some guidance along the way. (yes, I am a newbee to 
> this) Thanks...{color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field defaults the field type to be BigDecimal(30,10)

2018-02-09 Thread Harleen Singh Mann (JIRA)
Harleen Singh Mann created SPARK-23370:
--

 Summary: Spark receives a size of 0 for an Oracle Number field 
defaults the field type to be BigDecimal(30,10)
 Key: SPARK-23370
 URL: https://issues.apache.org/jira/browse/SPARK-23370
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
 Environment: Spark 2.2

Oracle 11g

JDBC ojdbc6.jar
Reporter: Harleen Singh Mann


Currently, on JDBC read, Spark obtains the schema of a table using 
{color:#654982} resultSet.getMetaData.getColumnType{color}

This works 99.99% of the time, except when a column of NUMBER type is added to 
an Oracle table using an ALTER statement. This is essentially an Oracle DB + 
JDBC bug that has been documented in the Oracle KB, and patches exist. [oracle 
KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html]

{color:#FF}As a result of the above-mentioned issue, Spark receives a size 
of 0 for the field and defaults the field type to BigDecimal(30,10) instead of 
what it actually should be. This is done in OracleDialect.scala. This may cause 
issues in downstream applications, where relevant information may be lost due 
to the changed precision and scale.{color}

_The versions that are affected are:_ 
_JDBC - Version: 11.2.0.1 and later   [Release: 11.2 and later]_
_Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_  
_[Release: 11.1 to 11.2]_ 

+Proposed approach:+

There is another way of fetching the schema information in Oracle: through the 
all_tab_columns table. If we use this table to fetch the precision and scale of 
the NUMBER column, the above issue is mitigated.

{color:#14892c}I can implement the changes, but require some inputs on the 
approach from the gatekeepers here.{color}
{color:#14892c}PS. This is also my first Jira issue and my first fork of 
Spark, so I will need some guidance along the way. (Yes, I am a newbie to this.) 
Thanks...{color}
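
A hedged sketch of what the proposed fallback could look like over plain JDBC (connection details, owner and table names are illustrative placeholders, not values from this report):

{code:scala}
import java.sql.DriverManager

// Query ALL_TAB_COLUMNS for the declared precision/scale of NUMBER columns; this
// stays correct even when getMetaData reports a size of 0 for an ALTERed column.
val conn = DriverManager.getConnection(
  "jdbc:oracle:thin:@//dbhost:1521/SERVICE", "user", "password") // hypothetical
try {
  val stmt = conn.prepareStatement(
    "SELECT COLUMN_NAME, DATA_PRECISION, DATA_SCALE FROM ALL_TAB_COLUMNS " +
    "WHERE OWNER = ? AND TABLE_NAME = ? AND DATA_TYPE = 'NUMBER'")
  stmt.setString(1, "MYSCHEMA") // hypothetical owner
  stmt.setString(2, "MYTABLE")  // hypothetical table
  val rs = stmt.executeQuery()
  while (rs.next()) {
    // DATA_PRECISION/DATA_SCALE are NULL for unconstrained NUMBER; getInt then returns 0.
    println(s"${rs.getString(1)}: precision=${rs.getInt(2)}, scale=${rs.getInt(3)}")
  }
} finally {
  conn.close()
}
{code}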



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358067#comment-16358067
 ] 

Wenchen Fan commented on SPARK-2:
-

I'm a little confused. If we want to get an arbitrary row, why do we need to sort? Do 
we have a way to get the DataFrame before the sort and call its `first`?

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-2
> URL: https://issues.apache.org/jira/browse/SPARK-2
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Major
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?
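
Along the lines of the question above, a hedged caller-side sketch (column names and the input path are illustrative, not from this report): let VectorAssembler.transform see the DataFrame before any global sort and apply the ordering afterwards, so the internal .first() call does not pay for the sort.

{code:scala}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val rawDF = spark.read.parquet("/path/to/features") // hypothetical, unsorted input

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))                  // hypothetical feature columns
  .setOutputCol("features")

// transform() may call .first() internally to establish metadata; running it on the
// unsorted DataFrame keeps that call cheap.
val assembled = assembler.transform(rawDF)

// Apply the ordering only after the metadata has been established.
val sortedDF = assembled.orderBy("f1")
{code}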



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23096) Migrate rate source to v2

2018-02-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358064#comment-16358064
 ] 

Saisai Shao commented on SPARK-23096:
-

[~joseph.torres] [~tdas] Can I take a crack at this if you're not working on it? In 
the current code base there are two rate stream sources (v1 and v2); I think we can 
consolidate them.

 

 

> Migrate rate source to v2
> -
>
> Key: SPARK-23096
> URL: https://issues.apache.org/jira/browse/SPARK-23096
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org