[jira] [Updated] (SPARK-23186) Initialize DriverManager first before loading Drivers
[ https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23186: Fix Version/s: 2.2.2 > Initialize DriverManager first before loading Drivers > - > > Key: SPARK-23186 > URL: https://issues.apache.org/jira/browse/SPARK-23186 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > Since some JDBC Drivers have class initialization code to call > `DriverManager`, we need to initialize DriverManager first in order to avoid > potential deadlock situation like the following or STORM-2527. > {code} > Thread 9587: (state = BLOCKED) > - > sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, > java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise) > - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) > @bci=85, line=62 (Compiled frame) > - > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) > @bci=5, line=45 (Compiled frame) > - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, > line=423 (Compiled frame) > - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame) > - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 > (Interpreted frame) > - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted > frame) > - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame) > - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame) > - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame) > - > java.security.AccessController.doPrivileged(java.security.PrivilegedAction) > @bci=0 (Compiled frame) > - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted > frame) > - java.sql.DriverManager.() @bci=32, line=101 (Interpreted frame) > - > org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String, > java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 > (Interpreted frame) > - > org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration, > java.util.Properties) @bci=22, line=57 (Interpreted frame) > - > org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext, > org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame) > - > org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit, > org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 > (Interpreted frame) > - > org.apache.spark.rdd.NewHadoopRDD$$anon$1.(org.apache.spark.rdd.NewHadoopRDD, > org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 > (Interpreted frame) > Thread 9170: (state = BLOCKED) > - org.apache.phoenix.jdbc.PhoenixDriver.() @bci=35, line=125 > (Interpreted frame) > - > sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, > java.lang.Object[]) @bci=0 (Compiled frame) > - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) > @bci=85, line=62 (Compiled frame) > - > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) > @bci=5, line=45 (Compiled frame) > - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, > line=423 (Compiled frame) > - java.lang.Class.newInstance() @bci=138, line=442 
(Compiled frame) > - > org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String) > @bci=89, line=46 (Interpreted frame) > - > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() > @bci=7, line=53 (Interpreted frame) > - > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() > @bci=1, line=52 (Interpreted frame) > - > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD, > org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 > (Interpreted frame) > - > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition, > org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
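A minimal Scala sketch of the idea behind the fix (illustrative names, not the actual patch merged for SPARK-23186): force java.sql.DriverManager's class initialization to complete before any JDBC driver class is loaded via reflection, so a driver whose static initializer calls back into DriverManager (as Phoenix does in the stack above) cannot deadlock against loadInitialDrivers().
{code:scala}
import java.sql.{Driver, DriverManager}

object DriverRegistrySketch {
  // Any call into DriverManager triggers its <clinit>, which runs
  // loadInitialDrivers() on this thread before we touch driver classes.
  DriverManager.getDrivers()

  def register(className: String): Unit = {
    // Load the driver class only after DriverManager is fully initialized.
    val driver = Class.forName(className, true,
      Thread.currentThread().getContextClassLoader)
      .newInstance().asInstanceOf[Driver]
    DriverManager.registerDriver(driver)
  }
}
{code}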
[jira] [Commented] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.
[ https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359202#comment-16359202 ] Harleen Singh Mann commented on SPARK-23372: [~dkbiswal] How will it throw the error during compile time? with ref to your statement: _"We should detect this earlier and failed during compilation of the query."_ I mean the use of "compilation" in the sentence is probably incorrect. I will suggest changing it to "during preparing/executing the query". > Writing empty struct in parquet fails during execution. It should fail > earlier during analysis. > --- > > Key: SPARK-23372 > URL: https://issues.apache.org/jira/browse/SPARK-23372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Minor > > *Running* > spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) > *Results in* > {code:java} > org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with > an empty group: message spark_schema { > } > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27) > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37) > at org.apache.parquet.schema.MessageType.accept(MessageType.java:58) > at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread. > {code} > We should detect this earlier and failed during compilation of the query. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359200#comment-16359200 ] Harleen Singh Mann commented on SPARK-23370: [~srowen] Yes, this should be possible to implement in the Oracle JDBC dialect. I want to start working on it once we agree it adds value. Do you mean overhead for Spark? Or for the Oracle DB? Or for the developer? haha > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Minor > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on JDBC read Spark obtains the schema of a table using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the time except when a column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above-mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed due to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of the Number type, the above issue is mitigated. > > {color:#14892c}{color:#f6c342}I can implement the changes, but require some > inputs on the approach from the gatekeepers here{color}.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbie to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
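A hedged Scala sketch of the proposed mitigation, not the actual OracleDialect change: when the JDBC metadata reports a precision of 0, look the precision and scale up in Oracle's ALL_TAB_COLUMNS dictionary view. The helper name and fallback values are illustrative.
{code:scala}
import java.sql.Connection

def numberPrecisionAndScale(conn: Connection, table: String): Map[String, (Int, Int)] = {
  val stmt = conn.prepareStatement(
    """SELECT column_name, data_precision, data_scale
      |FROM all_tab_columns
      |WHERE table_name = ? AND data_type = 'NUMBER'""".stripMargin)
  stmt.setString(1, table.toUpperCase)   // Oracle stores unquoted identifiers in upper case
  val rs = stmt.executeQuery()
  val result = scala.collection.mutable.Map.empty[String, (Int, Int)]
  while (rs.next()) {
    val precision = rs.getInt("DATA_PRECISION")
    val precisionMissing = rs.wasNull()  // NULL for unconstrained NUMBER columns
    val scale = rs.getInt("DATA_SCALE")
    val scaleMissing = rs.wasNull()
    // Fall back to a wide decimal when the dictionary gives no constraint.
    result(rs.getString("COLUMN_NAME")) =
      (if (precisionMissing) 38 else precision, if (scaleMissing) 10 else scale)
  }
  rs.close(); stmt.close()
  result.toMap
}
{code}
The dialect could consult such a map only for columns where getPrecision() returned 0, leaving the normal metadata path untouched for everything else.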
[jira] [Assigned] (SPARK-23379) remove redundant metastore access if the current database name is the same
[ https://issues.apache.org/jira/browse/SPARK-23379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23379: Assignee: (was: Apache Spark) > remove redundant metastore access if the current database name is the same > -- > > Key: SPARK-23379 > URL: https://issues.apache.org/jira/browse/SPARK-23379 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Feng Liu >Priority: Major > > We should be able to reduce one metastore access if the target database name > is as same as the current database: > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23379) remove redundant metastore access if the current database name is the same
[ https://issues.apache.org/jira/browse/SPARK-23379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359177#comment-16359177 ] Apache Spark commented on SPARK-23379: -- User 'liufengdb' has created a pull request for this issue: https://github.com/apache/spark/pull/20565 > remove redundant metastore access if the current database name is the same > -- > > Key: SPARK-23379 > URL: https://issues.apache.org/jira/browse/SPARK-23379 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Feng Liu >Priority: Major > > We should be able to reduce one metastore access if the target database name > is as same as the current database: > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23379) remove redundant metastore access if the current database name is the same
[ https://issues.apache.org/jira/browse/SPARK-23379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23379: Assignee: Apache Spark > remove redundant metastore access if the current database name is the same > -- > > Key: SPARK-23379 > URL: https://issues.apache.org/jira/browse/SPARK-23379 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Feng Liu >Assignee: Apache Spark >Priority: Major > > We should be able to reduce one metastore access if the target database name > is as same as the current database: > https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23379) remove redundant metastore access if the current database name is the same
Feng Liu created SPARK-23379: Summary: remove redundant metastore access if the current database name is the same Key: SPARK-23379 URL: https://issues.apache.org/jira/browse/SPARK-23379 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Feng Liu We should be able to reduce one metastore access if the target database name is as same as the current database: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L295 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
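A hedged sketch of the optimization described above; SessionStateLike and MetastoreLike are illustrative stand-ins for Hive's session state and the metastore client, not the real HiveClientImpl members.
{code:scala}
trait SessionStateLike {
  def getCurrentDatabase: String
  def setCurrentDatabase(db: String): Unit
}
trait MetastoreLike {
  def databaseExists(db: String): Boolean   // remote metastore call
}

def setCurrentDatabase(state: SessionStateLike, metastore: MetastoreLike, db: String): Unit = {
  // The current database is known locally, so compare first and skip the
  // remote existence check (and the switch) when nothing would change.
  if (state.getCurrentDatabase != db) {
    require(metastore.databaseExists(db), s"Database '$db' does not exist")
    state.setCurrentDatabase(db)
  }
}
{code}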
[jira] [Assigned] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-23378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23378: Assignee: (was: Apache Spark) > move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl > -- > > Key: SPARK-23378 > URL: https://issues.apache.org/jira/browse/SPARK-23378 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Feng Liu >Priority: Major > > Conceptually, no methods of HiveExternalCatalog, besides the > `setCurrentDatabase`, should change the `currentDatabase` in the hive session > state. We can enforce this rule by removing the usage of `setCurrentDatabase` > in the HiveExternalCatalog. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-23378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359135#comment-16359135 ] Apache Spark commented on SPARK-23378: -- User 'liufengdb' has created a pull request for this issue: https://github.com/apache/spark/pull/20564 > move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl > -- > > Key: SPARK-23378 > URL: https://issues.apache.org/jira/browse/SPARK-23378 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Feng Liu >Priority: Major > > Conceptually, no methods of HiveExternalCatalog, besides the > `setCurrentDatabase`, should change the `currentDatabase` in the hive session > state. We can enforce this rule by removing the usage of `setCurrentDatabase` > in the HiveExternalCatalog. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
[ https://issues.apache.org/jira/browse/SPARK-23378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23378: Assignee: Apache Spark > move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl > -- > > Key: SPARK-23378 > URL: https://issues.apache.org/jira/browse/SPARK-23378 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Feng Liu >Assignee: Apache Spark >Priority: Major > > Conceptually, no methods of HiveExternalCatalog, besides the > `setCurrentDatabase`, should change the `currentDatabase` in the hive session > state. We can enforce this rule by removing the usage of `setCurrentDatabase` > in the HiveExternalCatalog. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23378) move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl
Feng Liu created SPARK-23378: Summary: move setCurrentDatabase from HiveExternalCatalog to HiveClientImpl Key: SPARK-23378 URL: https://issues.apache.org/jira/browse/SPARK-23378 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Feng Liu Conceptually, no methods of HiveExternalCatalog, besides the `setCurrentDatabase`, should change the `currentDatabase` in the hive session state. We can enforce this rule by removing the usage of `setCurrentDatabase` in the HiveExternalCatalog. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
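A hedged illustration of the rule above, with simplified signatures rather than the real HiveExternalCatalog/HiveClientImpl API: callers pass the database to each operation instead of switching the session's current database first, so the only place that ever mutates it is setCurrentDatabase itself.
{code:scala}
trait HiveClientSketch {
  def setCurrentDatabase(db: String): Unit              // the single sanctioned mutator
  def tableExists(db: String, table: String): Boolean   // db passed explicitly, no mutation
}

def checkTable(client: HiveClientSketch, db: String, table: String): Boolean =
  client.tableExists(db, table)   // no setCurrentDatabase side effect in the external catalog
{code}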
[jira] [Updated] (SPARK-23377) Bucketizer with multiple columns persistence bug
[ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-23377: Description: A Bucketizer with multiple input/output columns get "inputCol" set to the default value on write -> read which causes it to throw an error on transform. Here's an example. {code:java} import org.apache.spark.ml.feature._ val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity) val bucketizer = new Bucketizer() .setSplitsArray(Array(splits, splits)) .setInputCols(Array("foo1", "foo2")) .setOutputCols(Array("bar1", "bar2")) val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") bucketizer.transform(data) val path = "/temp/bucketrizer-persist-test" bucketizer.write.overwrite.save(path) val bucketizerAfterRead = Bucketizer.read.load(path) println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) // This line throws an error because "outputCol" is set bucketizerAfterRead.transform(data) {code} And the trace: {code:java} java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the inputCols Param set for multi-column transform. The following Params are not applicable and should not be set: outputCol. at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) {code} was: A Bucketizer with multiple input/output columns get "inputCol" set to the default value on write -> read which causes it to throw an error on transform. Here's an example. {code:java} import org.apache.spark.ml.feature._ val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity) val bucketizer = new Bucketizer() .setSplitsArray(Array(splits, splits)) .setInputCols(Array("foo1", "foo2")) .setOutputCols(Array("bar1", "bar2")) val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") bucketizer.transform(data) val path = "/temp/bucketrizer-persist-test" bucketizer.write.overwrite.save(path) val bucketizerAfterRead = Bucketizer.read.load(path) println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) // This line throws an error because "outputCol" is set bucketizerAfterRead.transform(data) {code} > Bucketizer with multiple columns persistence bug > > > Key: SPARK-23377 > URL: https://issues.apache.org/jira/browse/SPARK-23377 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > A Bucketizer with multiple input/output columns get "inputCol" set to the > default value on write -> read which causes it to throw an error on > transform. Here's an example. 
> {code:java} > import org.apache.spark.ml.feature._ > val splits = Array(Double.NegativeInfinity, 0, 10, 100, > Double.PositiveInfinity) > val bucketizer = new Bucketizer() > .setSplitsArray(Array(splits, splits)) > .setInputCols(Array("foo1", "foo2")) > .setOutputCols(Array("bar1", "bar2")) > val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") > bucketizer.transform(data) > val path = "/temp/bucketrizer-persist-test" > bucketizer.write.overwrite.save(path) > val bucketizerAfterRead = Bucketizer.read.load(path) > println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) > // This line throws an error because "outputCol" is set > bucketizerAfterRead.transform(data) > {code} > And the trace: > {code:java} > java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has > the inputCols Param set for multi-column transform. The following Params are > not applicable and should not be set: outputCol. > at > org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) > at > org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) > at > org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) > at > org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) > at > line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23377) Bucketizer with multiple columns persistence bug
Bago Amirbekian created SPARK-23377: --- Summary: Bucketizer with multiple columns persistence bug Key: SPARK-23377 URL: https://issues.apache.org/jira/browse/SPARK-23377 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.0 Reporter: Bago Amirbekian A Bucketizer with multiple input/output columns get "inputCol" set to the default value on write -> read which causes it to throw an error on transform. Here's an example. {code:java} import org.apache.spark.ml.feature._ val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity) val bucketizer = new Bucketizer() .setSplitsArray(Array(splits, splits)) .setInputCols(Array("foo1", "foo2")) .setOutputCols(Array("bar1", "bar2")) val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") bucketizer.transform(data) val path = "/temp/bucketrizer-persist-test" bucketizer.write.overwrite.save(path) val bucketizerAfterRead = Bucketizer.read.load(path) println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) // This line throws an error because "outputCol" is set bucketizerAfterRead.transform(data) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21232) New built-in SQL function - Data_Type
[ https://issues.apache.org/jira/browse/SPARK-21232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mario Molina updated SPARK-21232: - Fix Version/s: 2.3.0 2.2.2 > New built-in SQL function - Data_Type > - > > Key: SPARK-21232 > URL: https://issues.apache.org/jira/browse/SPARK-21232 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR, SQL >Affects Versions: 2.1.1 >Reporter: Mario Molina >Priority: Minor > Fix For: 2.2.2, 2.3.0 > > > This function returns the data type of a given column. > {code:java} > data_type("a") > // returns string > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
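Until such a built-in exists, the same information can be read from the DataFrame schema; a minimal Scala helper sketch (the name is illustrative) that mirrors the proposed return value:
{code:scala}
import org.apache.spark.sql.DataFrame

def dataTypeOf(df: DataFrame, column: String): String =
  df.schema(column).dataType.simpleString   // e.g. "string", "int", "decimal(38,18)"
{code}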
[jira] [Commented] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.
[ https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359062#comment-16359062 ] Dilip Biswal commented on SPARK-23372: -- [~mannharleen] Hello, my current plan is to add a validation check when we prepare to write for parquet. We have such checks for text file. I plan to do something similar for parquet. > Writing empty struct in parquet fails during execution. It should fail > earlier during analysis. > --- > > Key: SPARK-23372 > URL: https://issues.apache.org/jira/browse/SPARK-23372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Minor > > *Running* > spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) > *Results in* > {code:java} > org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with > an empty group: message spark_schema { > } > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27) > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37) > at org.apache.parquet.schema.MessageType.accept(MessageType.java:58) > at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread. > {code} > We should detect this earlier and failed during compilation of the query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
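A hedged sketch of the kind of pre-write validation described in the comment above; the method name and call site are illustrative, not the actual Spark change. The point is to reject an empty schema while the query is being prepared, before any Parquet writer task is launched on an executor.
{code:scala}
import org.apache.spark.sql.types.StructType

def verifyWriteSchema(format: String, schema: StructType): Unit = {
  require(schema.nonEmpty,
    s"Datasource $format does not support writing an empty schema (zero columns)")
}

// e.g. verifyWriteSchema("parquet", df.schema) before planning the write job
{code}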
[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359047#comment-16359047 ] Sean Owen commented on SPARK-23370: --- It's possible to implement that just in the JDBC dialect for Oracle, I suppose. Is it extra overhead? That is, I wonder about leaving in the workaround that impacts all Oracle users for a long time. > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Minor > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on JDBC read Spark obtains the schema of a table using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the time except when a column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above-mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed due to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of the Number type, the above issue is mitigated. > > {color:#14892c}{color:#f6c342}I can implement the changes, but require some > inputs on the approach from the gatekeepers here{color}.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbie to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23186) Initialize DriverManager first before loading Drivers
[ https://issues.apache.org/jira/browse/SPARK-23186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359002#comment-16359002 ] Apache Spark commented on SPARK-23186: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/20563 > Initialize DriverManager first before loading Drivers > - > > Key: SPARK-23186 > URL: https://issues.apache.org/jira/browse/SPARK-23186 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > > Since some JDBC Drivers have class initialization code to call > `DriverManager`, we need to initialize DriverManager first in order to avoid > potential deadlock situation like the following or STORM-2527. > {code} > Thread 9587: (state = BLOCKED) > - > sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, > java.lang.Object[]) @bci=0 (Compiled frame; information may be imprecise) > - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) > @bci=85, line=62 (Compiled frame) > - > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) > @bci=5, line=45 (Compiled frame) > - java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, > line=423 (Compiled frame) > - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame) > - java.util.ServiceLoader$LazyIterator.nextService() @bci=119, line=380 > (Interpreted frame) > - java.util.ServiceLoader$LazyIterator.next() @bci=11, line=404 (Interpreted > frame) > - java.util.ServiceLoader$1.next() @bci=37, line=480 (Interpreted frame) > - java.sql.DriverManager$2.run() @bci=21, line=603 (Interpreted frame) > - java.sql.DriverManager$2.run() @bci=1, line=583 (Interpreted frame) > - > java.security.AccessController.doPrivileged(java.security.PrivilegedAction) > @bci=0 (Compiled frame) > - java.sql.DriverManager.loadInitialDrivers() @bci=27, line=583 (Interpreted > frame) > - java.sql.DriverManager.() @bci=32, line=101 (Interpreted frame) > - > org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(java.lang.String, > java.lang.Integer, java.lang.String, java.util.Properties) @bci=12, line=98 > (Interpreted frame) > - > org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(org.apache.hadoop.conf.Configuration, > java.util.Properties) @bci=22, line=57 (Interpreted frame) > - > org.apache.phoenix.mapreduce.PhoenixInputFormat.getQueryPlan(org.apache.hadoop.mapreduce.JobContext, > org.apache.hadoop.conf.Configuration) @bci=61, line=116 (Interpreted frame) > - > org.apache.phoenix.mapreduce.PhoenixInputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit, > org.apache.hadoop.mapreduce.TaskAttemptContext) @bci=10, line=71 > (Interpreted frame) > - > org.apache.spark.rdd.NewHadoopRDD$$anon$1.(org.apache.spark.rdd.NewHadoopRDD, > org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=233, line=156 > (Interpreted frame) > Thread 9170: (state = BLOCKED) > - org.apache.phoenix.jdbc.PhoenixDriver.() @bci=35, line=125 > (Interpreted frame) > - > sun.reflect.NativeConstructorAccessorImpl.newInstance0(java.lang.reflect.Constructor, > java.lang.Object[]) @bci=0 (Compiled frame) > - sun.reflect.NativeConstructorAccessorImpl.newInstance(java.lang.Object[]) > @bci=85, line=62 (Compiled frame) > - > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(java.lang.Object[]) > @bci=5, line=45 (Compiled frame) > - 
java.lang.reflect.Constructor.newInstance(java.lang.Object[]) @bci=79, > line=423 (Compiled frame) > - java.lang.Class.newInstance() @bci=138, line=442 (Compiled frame) > - > org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(java.lang.String) > @bci=89, line=46 (Interpreted frame) > - > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() > @bci=7, line=53 (Interpreted frame) > - > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply() > @bci=1, line=52 (Interpreted frame) > - > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD, > org.apache.spark.Partition, org.apache.spark.TaskContext) @bci=81, line=347 > (Interpreted frame) > - > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(org.apache.spark.Partition, > org.apache.spark.TaskContext) @bci=7, line=339 (Interpreted frame) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
[jira] [Commented] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM
[ https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358971#comment-16358971 ] Apache Spark commented on SPARK-23275: -- User 'liufengdb' has created a pull request for this issue: https://github.com/apache/spark/pull/20562 > hive/tests have been failing when run locally on the laptop (Mac) with OOM > --- > > Key: SPARK-23275 > URL: https://issues.apache.org/jira/browse/SPARK-23275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.3.0 > > > hive tests have been failing when they are run locally (Mac Os) after a > recent change in the trunk. After running the tests for some time, the test > fails with OOM with Error: unable to create new native thread. > I noticed the thread count goes all the way up to 2000+ after which we start > getting these OOM errors. Most of the threads seem to be related to the > connection pool in hive metastore (BoneCP-x- ). This behaviour change > is happening after we made the following change to HiveClientImpl.reset() > {code} > def reset(): Unit = withHiveState { > try { > // code > } finally { > runSqlHive("USE default") ===> this is causing the issue > } > {code} > I am proposing to temporarily back-out part of a fix made to address > SPARK-23000 to resolve this issue while we work-out the exact reason for this > sudden increase in thread counts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358866#comment-16358866 ] Edwina Lu commented on SPARK-23206: --- [~irashid], total memory and off heap memory is also very useful for us, so we are interested in the work being done for SPARK-21157 and SPARK-9103. The infrastructure (using the heartbeat and selectively logging to the history log) is also similar. We are planning to discuss with [~cltlfcjin] on Monday. For stage level logging, we've modified LiveExecutorStageSummary to store peak values for the new memory metrics, and these are checked and updated for active stages in AppStatusListener.onExecutorMetricsUpdate(). For history logging, our design is a bit simpler: we track the peak values per executor, and immediately log if there is a new peak value. The peak values are reinitialized whenever a new stage starts, and this would provide the peak value for a memory metric for a stage. In the design doc for SPARK-9103, the heartbeats are combined and logged at each stage end – this design could work for us as well. > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, ExecutorsTab2.png, > MemoryTuningMetricsDesignDoc.pdf, StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. 
Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. > We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
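A hedged Scala sketch of the peak-tracking idea described in the comment above (field and class names are illustrative, not the actual SPARK-23206 implementation): keep the running peak of each memory metric per executor, report or log only when a heartbeat carries a new peak, and reset on stage start so each stage gets its own maxima.
{code:scala}
class PeakMemoryTracker {
  private val peaks = scala.collection.mutable.Map.empty[(String, String), Long]

  /** Returns true if `value` is a new peak for (executorId, metric). */
  def update(executorId: String, metric: String, value: Long): Boolean = {
    val key = (executorId, metric)
    val isNewPeak = value > peaks.getOrElse(key, Long.MinValue)
    if (isNewPeak) peaks(key) = value   // only new peaks are worth logging
    isNewPeak
  }

  /** Called on stage start so per-stage peaks begin from scratch. */
  def reset(): Unit = peaks.clear()
}
{code}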
[jira] [Assigned] (SPARK-16501) spark.mesos.secret exposed on UI and command line
[ https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-16501: -- Assignee: Rob Vesse (was: Marcelo Vanzin) > spark.mesos.secret exposed on UI and command line > - > > Key: SPARK-16501 > URL: https://issues.apache.org/jira/browse/SPARK-16501 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, Web UI >Affects Versions: 1.6.2 >Reporter: Eric Daniel >Assignee: Rob Vesse >Priority: Major > Labels: security > Fix For: 2.4.0 > > > There are two related problems with spark.mesos.secret: > 1) The web UI shows its value in the "environment" tab > 2) Passing it as a command-line option to spark-submit (or creating a > SparkContext from python, with the effect of launching spark-submit) exposes > it to "ps" > I'll be happy to submit a patch but I could use some advice first. > The first problem is easy enough, just don't show that value in the UI > For the second problem, I'm not sure what the best solution is. A > "spark.mesos.secret-file" parameter would let the user store the secret in a > non-world-readable file. Alternatively, the mesos secret could be obtained > from the environment, which other users don't have access to. Either > solution would work in client mode, but I don't know if they're workable in > cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16501) spark.mesos.secret exposed on UI and command line
[ https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-16501: -- Assignee: Marcelo Vanzin > spark.mesos.secret exposed on UI and command line > - > > Key: SPARK-16501 > URL: https://issues.apache.org/jira/browse/SPARK-16501 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, Web UI >Affects Versions: 1.6.2 >Reporter: Eric Daniel >Assignee: Marcelo Vanzin >Priority: Major > Labels: security > Fix For: 2.4.0 > > > There are two related problems with spark.mesos.secret: > 1) The web UI shows its value in the "environment" tab > 2) Passing it as a command-line option to spark-submit (or creating a > SparkContext from python, with the effect of launching spark-submit) exposes > it to "ps" > I'll be happy to submit a patch but I could use some advice first. > The first problem is easy enough, just don't show that value in the UI > For the second problem, I'm not sure what the best solution is. A > "spark.mesos.secret-file" parameter would let the user store the secret in a > non-world-readable file. Alternatively, the mesos secret could be obtained > from the environment, which other users don't have access to. Either > solution would work in client mode, but I don't know if they're workable in > cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16501) spark.mesos.secret exposed on UI and command line
[ https://issues.apache.org/jira/browse/SPARK-16501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-16501. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20167 [https://github.com/apache/spark/pull/20167] > spark.mesos.secret exposed on UI and command line > - > > Key: SPARK-16501 > URL: https://issues.apache.org/jira/browse/SPARK-16501 > Project: Spark > Issue Type: Improvement > Components: Spark Submit, Web UI >Affects Versions: 1.6.2 >Reporter: Eric Daniel >Priority: Major > Labels: security > Fix For: 2.4.0 > > > There are two related problems with spark.mesos.secret: > 1) The web UI shows its value in the "environment" tab > 2) Passing it as a command-line option to spark-submit (or creating a > SparkContext from python, with the effect of launching spark-submit) exposes > it to "ps" > I'll be happy to submit a patch but I could use some advice first. > The first problem is easy enough, just don't show that value in the UI > For the second problem, I'm not sure what the best solution is. A > "spark.mesos.secret-file" parameter would let the user store the secret in a > non-world-readable file. Alternatively, the mesos secret could be obtained > from the environment, which other users don't have access to. Either > solution would work in client mode, but I don't know if they're workable in > cluster mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
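A hedged sketch of the "secret file" idea from the description; the configuration key, environment variable, and resolution order below are hypothetical illustrations, not the shipped configuration. The goal is that the secret never appears on the command line or in the UI's environment tab.
{code:scala}
import java.nio.file.{Files, Paths}

def resolveMesosSecret(conf: Map[String, String]): Option[String] = {
  conf.get("spark.mesos.secret.file")                    // hypothetical key: path to a user-readable file
    .map(path => new String(Files.readAllBytes(Paths.get(path)), "UTF-8").trim)
    .orElse(sys.env.get("SPARK_MESOS_SECRET"))           // hypothetical environment variable
    .orElse(conf.get("spark.mesos.secret"))              // last resort: plain config value
}
{code}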
[jira] [Updated] (SPARK-23360) SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath
[ https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-23360: - Summary: SparkSession.createDataFrame timestamps can be incorrect with non-Arrow codepath (was: SparkSession.createDataFrame results in correct results with non-Arrow codepath) > SparkSession.createDataFrame timestamps can be incorrect with non-Arrow > codepath > > > Key: SPARK-23360 > URL: https://issues.apache.org/jira/browse/SPARK-23360 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Major > > {code:java} > import datetime > import pandas as pd > import os > dt = [datetime.datetime(2015, 10, 31, 22, 30)] > pdf = pd.DataFrame({'time': dt}) > os.environ['TZ'] = 'America/New_York' > df1 = spark.createDataFrame(pdf) > df1.show() > +---+ > | time| > +---+ > |2015-10-31 21:30:00| > +---+ > {code} > Seems to related to this line here: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776] > It appears to be an issue with "tzlocal()" > Wrong: > {code:java} > from_tz = "America/New_York" > to_tz = "tzlocal()" > s.apply(lambda ts: > ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 21:30:00 > Name: time, dtype: datetime64[ns] > {code} > Correct: > {code:java} > from_tz = "America/New_York" > to_tz = "America/New_York" > s.apply( > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 22:30:00 > Name: time, dtype: datetime64[ns] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23374) Checkstyle/Scalastyle only work from top level build
[ https://issues.apache.org/jira/browse/SPARK-23374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358775#comment-16358775 ] Marcelo Vanzin commented on SPARK-23374: I find it's just easier to run everything from the top level instead of doing crazy pom hacking... e.g. {{mvn -pl :spark-mesos_2.11 verify}} instead of {{cd blah/mesos && mvn verify}} > Checkstyle/Scalastyle only work from top level build > > > Key: SPARK-23374 > URL: https://issues.apache.org/jira/browse/SPARK-23374 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.1 >Reporter: Rob Vesse >Priority: Trivial > > The current Maven plugin definitions for Checkstyle/Scalastyle use fixed XML > configs for the style rule locations that are only valid relative to the top > level POM. Therefore if you try and do a {{mvn verify}} in an individual > module you get the following error: > {noformat} > [ERROR] Failed to execute goal > org.scalastyle:scalastyle-maven-plugin:1.0.0:check (default) on project > spark-mesos_2.11: Failed during scalastyle execution: Unable to find > configuration file at location scalastyle-config.xml > {noformat} > As the paths are hardcoded in XML and don't use Maven properties you can't > override these settings so you can't style check a single module which makes > doing style checking require a full project {{mvn verify}} which is not ideal. > By introducing Maven properties for these two paths it would become possible > to run checks on a single module like so: > {noformat} > mvn verify -Dscalastyle.location=../scalastyle-config.xml > {noformat} > Obviously the override would need to vary depending on the specific module > you are trying to run it against but this would be a relatively simply change > that would streamline dev workflows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358759#comment-16358759 ] Xuefu Zhang commented on SPARK-22683: - On a side note, besides the name of the configuration that's subject to change, I think (and mentioned previously) that the value doesn't have to be an integer, to allow finer control. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. > Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail:
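A hedged Scala sketch of the proposal's core arithmetic (the parameter name is illustrative, matching the PR's intent rather than a merged configuration): instead of requesting one task slot per pending task, divide the backlog by a desired number of tasks per slot so small tasks share executors rather than each triggering a new allocation.
{code:scala}
def targetExecutors(
    pendingTasks: Int,
    executorCores: Int,
    taskCpus: Int,
    tasksPerSlot: Int): Int = {
  val slotsPerExecutor = executorCores / taskCpus
  // e.g. 600 pending tasks, 5 cores/executor, 1 cpu/task, 6 tasks per slot
  // -> ceil(600 / (5 * 6)) = 20 executors instead of 120.
  math.ceil(pendingTasks.toDouble / (slotsPerExecutor * tasksPerSlot)).toInt
}
{code}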
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358668#comment-16358668 ] Eyal Farago commented on SPARK-19870: - I'll remember to share relevant future logs Re. The exception code path missing a cleanup, you're definitely right but I'm less concerned about this one as this code path is 'reserved' to tasks (I don't think Netty threads ever gets to this code) hence cleanup (+warning) is guaranteed. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert >Priority: Major > Attachments: cs.executor.log, stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. > Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358634#comment-16358634 ] V Luong commented on SPARK-23333: - [~cloud_fan] alternatively, is there any way that VectorAssembler.transform(...) can get the "numAttributes" ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88)] metadata from somewhere else instead of materializing a row? Does the current need to materialize a row mean that some metadata is lacking somewhere? > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
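As a possible workaround for the metadata question above (a sketch only, assuming the number of attributes is already known to the caller; the column names and toy data are made up): attaching an AttributeGroup to the vector input column should let VectorAssembler read the size from column metadata instead of materializing a row.
{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("va-metadata-sketch").getOrCreate()
import spark.implicits._

// Toy stand-in for the sorted input described in the issue.
val oldDF = Seq((1L, Vectors.dense(0.1, 0.2, 0.3)), (2L, Vectors.dense(0.4, 0.5, 0.6)))
  .toDF("id", "raw")
  .orderBy("id")

// Declare up front that "raw" has 3 attributes, so the size can come from metadata.
val meta = new AttributeGroup("raw", 3).toMetadata()
val withMeta = oldDF.withColumn("raw", col("raw").as("raw", meta))

val newDF = new VectorAssembler()
  .setInputCols(Array("raw"))
  .setOutputCol("features")
  .transform(withMeta)
{code}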
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358623#comment-16358623 ] Wenchen Fan commented on SPARK-23333: - This is not a trivial change; we need to introduce an `AnyRow` operator that can eliminate unneeded sort (maybe more) operators. If we can get what we want from any row, does it mean we want something like metadata? > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358612#comment-16358612 ] Imran Rashid commented on SPARK-19870: -- to be honest, I'm not really sure what I'm looking for :) even INFO logs are pretty useful though at helping walk through the code and figuring out which parts to look at more suspiciously. Eg. in the logs you uploaded, I can say those WARN msgs are probably benign as its just related to a take / limit in the stage. Another example is that I noticed that this call to {{releaseLocks}}: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L218 doesn't have a corresponding case in the exception path: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala#L226 logs would make it clear if you ever hit that exception -- though I don't think thats it as I don't think you should ever actually hit that exception. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert >Priority: Major > Attachments: cs.executor.log, stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. > Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > 
org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at >
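For readers following the lock discussion above, the general shape of the concern is sketched below. This is not Spark's actual TorrentBroadcast code, only an illustration of why a release that happens solely on the success path can leak a read lock.
{code:scala}
// Illustration only, not Spark's code: a lock released outside a finally block
// stays held if the read throws, which is the kind of leak discussed above.
def readLockedBlock[Handle](acquireReadLock: () => Handle,
                            releaseLock: () => Unit)
                           (read: Handle => Array[Byte]): Array[Byte] = {
  val handle = acquireReadLock()
  try {
    read(handle)
  } finally {
    releaseLock()   // runs on the exception path too, so the lock cannot leak
  }
}
{code}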
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358599#comment-16358599 ] V Luong commented on SPARK-23333: - [~cloud_fan] there are many scenarios in which oldDF involves sorting in its plan, e.g. if certain feature columns are calculated using windowed functions. In general, it would be a pain to always make sure that oldDF doesn't involve sorting (e.g. by checkpointing to files) prior to VectorAssembler. Anyway, VectorAssembler metadata shouldn't strictly need the first row. > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
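An illustration of the scenario mentioned above, with toy column names that are not from the issue: a window-derived feature column puts Sort/Window nodes into oldDF's plan, so any .first() on it has to execute that sorted plan.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder.appName("windowed-feature-sketch").getOrCreate()
import spark.implicits._

val events = Seq((1, "a", 1.0), (2, "a", 2.0), (3, "b", 4.0)).toDF("id", "key", "x")
val w = Window.partitionBy("key").orderBy("id")

// A feature computed with a window function; the plan now contains Sort + Window.
val oldDF = events.withColumn("x_cum", sum(col("x")).over(w))
oldDF.explain()   // a .first() on oldDF pays for the sort inside the window
{code}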
[jira] [Assigned] (SPARK-23376) creating UnsafeKVExternalSorter with BytesToBytesMap may fail
[ https://issues.apache.org/jira/browse/SPARK-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23376: Assignee: Apache Spark (was: Wenchen Fan) > creating UnsafeKVExternalSorter with BytesToBytesMap may fail > - > > Key: SPARK-23376 > URL: https://issues.apache.org/jira/browse/SPARK-23376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23376) creating UnsafeKVExternalSorter with BytesToBytesMap may fail
[ https://issues.apache.org/jira/browse/SPARK-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358542#comment-16358542 ] Apache Spark commented on SPARK-23376: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20561 > creating UnsafeKVExternalSorter with BytesToBytesMap may fail > - > > Key: SPARK-23376 > URL: https://issues.apache.org/jira/browse/SPARK-23376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23376) creating UnsafeKVExternalSorter with BytesToBytesMap may fail
[ https://issues.apache.org/jira/browse/SPARK-23376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23376: Assignee: Wenchen Fan (was: Apache Spark) > creating UnsafeKVExternalSorter with BytesToBytesMap may fail > - > > Key: SPARK-23376 > URL: https://issues.apache.org/jira/browse/SPARK-23376 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23376) creating UnsafeKVExternalSorter with BytesToBytesMap may fail
Wenchen Fan created SPARK-23376: --- Summary: creating UnsafeKVExternalSorter with BytesToBytesMap may fail Key: SPARK-23376 URL: https://issues.apache.org/jira/browse/SPARK-23376 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1, 2.1.2, 2.0.2, 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23269) FP-growth: Provide last transaction for each detected frequent pattern
[ https://issues.apache.org/jira/browse/SPARK-23269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23269. --- Resolution: Won't Fix > FP-growth: Provide last transaction for each detected frequent pattern > -- > > Key: SPARK-23269 > URL: https://issues.apache.org/jira/browse/SPARK-23269 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.1 >Reporter: Arseniy Tashoyan >Priority: Minor > Labels: MLlib, fp-growth > Original Estimate: 120h > Remaining Estimate: 120h > > FP-growth implementation gives patterns and their frequences: > _model.freqItemsets_: > ||items||freq|| > |[5]|3| > |[5, 1]|3| > It would be great to know when each pattern occurred last time - what is the > last transaction having this pattern? > To do so, it will be necessary to tell FPGrowth what is the timestamp column > in the transactions data frame: > {code:java} > val fpgrowth = new FPGrowth() > .setItemsCol("items") > .setTimestampCol("timestamp") > {code} > So the data frame with patterns could look like: > ||items||freq||lastOccurrence|| > |[5]|3|2018-01-01 12:15:00| > |[5, 1]|3|2018-01-01 12:15:00| > Without this functionality, it is necessary to traverse the transactions data > frame with the set of detected patterns and determine the last transaction > for each pattern. Why traverse transactions once again if it has been already > done in FP-growth execution? > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
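For reference, the extra pass over the transactions that the description wants to avoid could look roughly like the sketch below. The toy data, the subset-check UDF and the minimum support value are assumptions for illustration; only the column names ("items", "timestamp", "freq") come from the description.
{code:scala}
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, udf}

val spark = SparkSession.builder.appName("fpgrowth-last-occurrence-sketch").getOrCreate()
import spark.implicits._

// Toy transactions with a timestamp column, as in the description.
val transactions = Seq(
  (Seq("1", "5"), java.sql.Timestamp.valueOf("2018-01-01 12:00:00")),
  (Seq("1", "5"), java.sql.Timestamp.valueOf("2018-01-01 12:15:00")),
  (Seq("5"),      java.sql.Timestamp.valueOf("2018-01-01 11:00:00"))
).toDF("items", "timestamp")

val model = new FPGrowth().setItemsCol("items").setMinSupport(0.5).fit(transactions)

// The second traversal the reporter complains about: for each frequent pattern,
// find the latest transaction containing it.
val isSubset = udf { (pattern: Seq[String], txn: Seq[String]) => pattern.forall(txn.contains) }
val lastOccurrence = model.freqItemsets.as("p")
  .join(transactions.as("t"), isSubset(col("p.items"), col("t.items")))
  .groupBy(col("p.items"), col("p.freq"))
  .agg(max(col("t.timestamp")).as("lastOccurrence"))
{code}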
[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358495#comment-16358495 ] Harleen Singh Mann commented on SPARK-23370: [~q79969786] your suggestion would work but only if one knows in advance that there exists a column in Oracle DB of type Numeric and created using alter table statement. This information is seldom available to developers. [~srowen] True, it is an Oracle issue. If everyone agrees that Spark has nothing to do with it we may close this issue as is. However, I feel there may be merit in evaluating the way spark is fetching schema information from jdbc - i.e. resultSet.getMetaData.getColumnType VS from all_tabs_columns Thanks. > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Minor > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on jdbc read spark obtains the schema of a table from using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the times except when the column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: Which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of Number time, the above issue is mitigated. > > {color:#14892c}{color:#f6c342}I can implement the changes, but require some > inputs on the approach from the gatekeepers here{color}.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbee to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
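For context, the proposed dictionary lookup could be done with plain JDBC as sketched below. This only illustrates the idea, not Spark code; the connection string, schema and table names are placeholders, and the fallback precision/scale values are assumptions.
{code:scala}
import java.sql.DriverManager

// Read precision/scale for NUMBER columns from Oracle's data dictionary instead of
// relying on ResultSetMetaData, which reports a size of 0 for the columns described above.
val conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/service", "user", "password")
val ps = conn.prepareStatement(
  "SELECT column_name, data_precision, data_scale FROM all_tab_columns " +
    "WHERE owner = ? AND table_name = ? AND data_type = 'NUMBER'")
ps.setString(1, "MYSCHEMA")
ps.setString(2, "MYTABLE")
val rs = ps.executeQuery()
while (rs.next()) {
  // data_precision/data_scale are NULL for an unconstrained NUMBER; fall back to a default.
  val precision = Option(rs.getBigDecimal("data_precision")).map(_.intValue).getOrElse(38)
  val scale = Option(rs.getBigDecimal("data_scale")).map(_.intValue).getOrElse(10)
  println(s"${rs.getString("column_name")} -> DecimalType($precision, $scale)")
}
rs.close(); ps.close(); conn.close()
{code}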
[jira] [Resolved] (SPARK-23354) spark jdbc does not maintain length of data type when I move data from MS sql server to Oracle using spark jdbc
[ https://issues.apache.org/jira/browse/SPARK-23354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23354. --- Resolution: Not A Problem > spark jdbc does not maintain length of data type when I move data from MS sql > server to Oracle using spark jdbc > --- > > Key: SPARK-23354 > URL: https://issues.apache.org/jira/browse/SPARK-23354 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.1 >Reporter: Lav Patel >Priority: Major > > spark jdbc does not maintain length of data type when I move data from MS sql > server to Oracle using spark jdbc > > To fix this, I have written code so it will figure out length of column and > it does the conversion. > > I can put more details with a code sample if the community is interested. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
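One existing, code-free mitigation worth noting, sketched under the assumption that sourceDF is the frame read from SQL Server (table and column names below are placeholders): the JDBC writer's createTableColumnTypes option lets the caller pin the column DDL, including VARCHAR lengths, used when Spark creates the target Oracle table.
{code:scala}
// sourceDF is assumed to be the DataFrame read from MS SQL Server.
sourceDF.write
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", "TARGET_TABLE")
  .option("user", "user")
  .option("password", "password")
  // Explicit DDL types so VARCHAR lengths are not replaced by generic defaults.
  .option("createTableColumnTypes", "NAME VARCHAR(128), CODE VARCHAR(16)")
  .mode("overwrite")
  .save()
{code}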
[jira] [Updated] (SPARK-23358) When the number of partitions is greater than 2^28, it will result in an error result
[ https://issues.apache.org/jira/browse/SPARK-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23358: -- Affects Version/s: (was: 2.4.0) 2.3.0 Priority: Minor (was: Major) > When the number of partitions is greater than 2^28, it will result in an > error result > - > > Key: SPARK-23358 > URL: https://issues.apache.org/jira/browse/SPARK-23358 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 2.2.2, 2.3.0 > > > In the `checkIndexAndDataFile`,the _blocks_ is the _Int_ type, when it is > greater than 2^28, `blocks*8` will overflow, and this will result in an error > result. > In fact, `blocks` is actually the number of partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
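The arithmetic behind the report, shown in isolation rather than as the patch itself: with Int arithmetic, 2^28 blocks times 8 bytes already wraps around, while widening to Long first gives the intended value.
{code:scala}
val blocks: Int = 1 << 28              // 268435456 partitions
val overflowed: Int = blocks * 8       // 2^31 does not fit in an Int: -2147483648
val widened: Long = blocks.toLong * 8L // 2147483648, the intended byte count
println(s"$overflowed vs $widened")
{code}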
[jira] [Commented] (SPARK-23371) Parquet Footer data is wrong on window in parquet format partition table
[ https://issues.apache.org/jira/browse/SPARK-23371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358488#comment-16358488 ] Sean Owen commented on SPARK-23371: --- It sounds like you have multiple versions of Parquet on your classpath, or at least, you're writing with a new version and reading with an old version? that's not going to work. This does not look like a Spark problem. > Parquet Footer data is wrong on window in parquet format partition table > - > > Key: SPARK-23371 > URL: https://issues.apache.org/jira/browse/SPARK-23371 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.1.2 >Reporter: pin_zhang >Priority: Major > > On window > Run SQL in spark shell > spark.sql("create table part_test (id string )partitioned by( index int) > stored as parquet") > spark.sql("insert into part_test partition (index =1) values ('1')") > Get exception when query spark.sql("select * from part_test ").show() > For the parquet.Version in parquet-hadoop-bundle-1.6.0.jar cannot load the > version info in spark on window. Classloader try to get version in the > parquet-format-2.3.0-incubating.jar > 18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because > created_by > could not be parsed (see PARQUET-251): parquet-mr > org.apache.parquet.VersionParser$VersionParseException: Could not parse > created_ > by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*)) > at org.apache.parquet.VersionParser.parse(VersionParser.java:112) > at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptSt > atistics.java:60) > at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParq > uetStatistics(ParquetMetadataConverter.java:263) > at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(Parque > tFileReader.java:583) > at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetF > ileReader.java:513) > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR > ecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR > ecordReader.nextBatch(VectorizedParquetRecordReader.java:225) > at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR > ecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) > at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNe > xt(RecordReaderIterator.scala:39) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNex > t(FileScanRDD.scala:109) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIt > erator(FileScanRDD.scala:184) > at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNex > t(FileScanRDD.scala:109) > at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIte > rator.scan_nextBatch$(Unknown Source) > at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIte > rator.processNext(Unknown Source) > at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRo > wIterator.java:43) > at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon > $1.hasNext(WholeStageCodegenExec.scala:377) > at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.s > cala:231) > at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.s > cala:225) > at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$ap > ply$25.apply(RDD.scala:827) > at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$ap > ply$25.apply(RDD.scala:827) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala: > 38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor. > java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor > .java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23358) When the number of partitions is greater than 2^28, it will result in an error result
[ https://issues.apache.org/jira/browse/SPARK-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23358. --- Resolution: Fixed Fix Version/s: 2.3.0 2.2.2 Issue resolved by pull request 20544 [https://github.com/apache/spark/pull/20544] > When the number of partitions is greater than 2^28, it will result in an > error result > - > > Key: SPARK-23358 > URL: https://issues.apache.org/jira/browse/SPARK-23358 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: liuxian >Assignee: liuxian >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > In the `checkIndexAndDataFile`,the _blocks_ is the _Int_ type, when it is > greater than 2^28, `blocks*8` will overflow, and this will result in an error > result. > In fact, `blocks` is actually the number of partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23358) When the number of partitions is greater than 2^28, it will result in an error result
[ https://issues.apache.org/jira/browse/SPARK-23358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23358: - Assignee: liuxian > When the number of partitions is greater than 2^28, it will result in an > error result > - > > Key: SPARK-23358 > URL: https://issues.apache.org/jira/browse/SPARK-23358 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: liuxian >Assignee: liuxian >Priority: Major > Fix For: 2.2.2, 2.3.0 > > > In the `checkIndexAndDataFile`,the _blocks_ is the _Int_ type, when it is > greater than 2^28, `blocks*8` will overflow, and this will result in an error > result. > In fact, `blocks` is actually the number of partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23347) Introduce buffer between Java data stream and gzip stream
[ https://issues.apache.org/jira/browse/SPARK-23347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23347. --- Resolution: Not A Problem > Introduce buffer between Java data stream and gzip stream > - > > Key: SPARK-23347 > URL: https://issues.apache.org/jira/browse/SPARK-23347 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Ted Yu >Priority: Minor > > Currently GZIPOutputStream is used directly around ByteArrayOutputStream > e.g. from KVStoreSerializer : > {code} > ByteArrayOutputStream bytes = new ByteArrayOutputStream(); > GZIPOutputStream out = new GZIPOutputStream(bytes); > {code} > This seems inefficient. > GZIPOutputStream does not implement the write(byte) method. It only provides > a write(byte[], offset, len) method, which calls the corresponding JNI zlib > function. > BufferedOutputStream can be introduced wrapping GZIPOutputStream for better > performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
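For readers of the snippet above, the suggested change amounts to the wrapping shown below. It is given only as an illustration of the proposal; the issue itself was resolved as Not A Problem.
{code:scala}
import java.io.{BufferedOutputStream, ByteArrayOutputStream}
import java.util.zip.GZIPOutputStream

val bytes = new ByteArrayOutputStream()
// Buffer small writes before they reach the gzip/deflate layer.
val out = new BufferedOutputStream(new GZIPOutputStream(bytes), 32 * 1024)
out.write("some serialized record".getBytes("UTF-8"))
out.close()   // flushes the buffer and finishes the gzip stream
{code}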
[jira] [Updated] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23333: -- Priority: Minor (was: Major) > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23374) Checkstyle/Scalastyle only work from top level build
[ https://issues.apache.org/jira/browse/SPARK-23374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23374: -- Priority: Trivial (was: Minor) Issue Type: Improvement (was: Bug) This isn't a bug; it's how it's supposed to work, as it's there for Jenkins jobs. If you can suggest a clean change that makes it more flexible, sure, but otherwise I'd close this. > Checkstyle/Scalastyle only work from top level build > > > Key: SPARK-23374 > URL: https://issues.apache.org/jira/browse/SPARK-23374 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.2.1 >Reporter: Rob Vesse >Priority: Trivial > > The current Maven plugin definitions for Checkstyle/Scalastyle use fixed XML > configs for the style rule locations that are only valid relative to the top > level POM. Therefore if you try and do a {{mvn verify}} in an individual > module you get the following error: > {noformat} > [ERROR] Failed to execute goal > org.scalastyle:scalastyle-maven-plugin:1.0.0:check (default) on project > spark-mesos_2.11: Failed during scalastyle execution: Unable to find > configuration file at location scalastyle-config.xml > {noformat} > As the paths are hardcoded in XML and don't use Maven properties you can't > override these settings so you can't style check a single module which makes > doing style checking require a full project {{mvn verify}} which is not ideal. > By introducing Maven properties for these two paths it would become possible > to run checks on a single module like so: > {noformat} > mvn verify -Dscalastyle.location=../scalastyle-config.xml > {noformat} > Obviously the override would need to vary depending on the specific module > you are trying to run it against but this would be a relatively simply change > that would streamline dev workflows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.
[ https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23372: -- Issue Type: Improvement (was: Bug) > Writing empty struct in parquet fails during execution. It should fail > earlier during analysis. > --- > > Key: SPARK-23372 > URL: https://issues.apache.org/jira/browse/SPARK-23372 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Minor > > *Running* > spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) > *Results in* > {code:java} > org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with > an empty group: message spark_schema { > } > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27) > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37) > at org.apache.parquet.schema.MessageType.accept(MessageType.java:58) > at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread. > {code} > We should detect this earlier and failed during compilation of the query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
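The kind of up-front check the report asks for can be sketched at the caller level as below; the real change would live in the analysis/write path inside Spark, not in user code.
{code:scala}
import org.apache.spark.sql.DataFrame

def saveParquetChecked(df: DataFrame, path: String): Unit = {
  // Fail before any job is launched, instead of inside ParquetFileWriter on an executor.
  require(df.schema.fields.nonEmpty,
    "Parquet data source does not support writing a DataFrame with an empty schema")
  df.write.format("parquet").mode("overwrite").save(path)
}
{code}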
[jira] [Commented] (SPARK-23364) 'desc table' command in spark-sql add column head display
[ https://issues.apache.org/jira/browse/SPARK-23364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358477#comment-16358477 ] Sean Owen commented on SPARK-23364: --- [~guoxiaolongzte] please don't reopen JIRAs without any change. You have provided no description of the change or reason it's needed. > 'desc table' command in spark-sql add column head display > - > > Key: SPARK-23364 > URL: https://issues.apache.org/jira/browse/SPARK-23364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png > > > fix before: > !2.png! > fix after: > !1.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358458#comment-16358458 ] Eyal Farago commented on SPARK-19870: - [~irashid], I'm afraid I don't have a documentation of which executor got to this hang, so I can't think of a way to find its logs (on top of this the spark-ui via history server seems a bit unreliable, i.e. jobs 'running' in the ui are rported to complete in the executor logs). can you please share, what is it you're looking for in the executor logs? as you could see in the one I've shared spark's logging level is set to WARN so there's not much into it... > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert >Priority: Major > Attachments: cs.executor.log, stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. > Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > 
org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at >
[jira] [Commented] (SPARK-23375) Optimizer should remove unneeded Sort
[ https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358435#comment-16358435 ] Apache Spark commented on SPARK-23375: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20560 > Optimizer should remove unneeded Sort > - > > Key: SPARK-23375 > URL: https://issues.apache.org/jira/browse/SPARK-23375 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > As pointed out in SPARK-23368, as of now there is no rule to remove the Sort > operator on an already sorted plan, ie. if we have a query like: > {code} > SELECT b > FROM ( > SELECT a, b > FROM table1 > ORDER BY a > ) t > ORDER BY a > {code} > The sort is actually executed twice, even though it is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
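As a rough illustration of the kind of optimizer rule being discussed (this is not the rule in the PR above; a complete rule would also need to handle intervening operators such as the projection in the example query): an outer global Sort makes a directly nested global Sort redundant, so the inner one can be dropped.
{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

object RemoveRedundantSortSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // The outer sort re-establishes the ordering anyway, so the inner sort is wasted work.
    case outer @ Sort(_, true, Sort(_, true, grandChild)) =>
      outer.copy(child = grandChild)
  }
}
{code}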
[jira] [Assigned] (SPARK-23375) Optimizer should remove unneeded Sort
[ https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23375: Assignee: Apache Spark > Optimizer should remove unneeded Sort > - > > Key: SPARK-23375 > URL: https://issues.apache.org/jira/browse/SPARK-23375 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Minor > > As pointed out in SPARK-23368, as of now there is no rule to remove the Sort > operator on an already sorted plan, ie. if we have a query like: > {code} > SELECT b > FROM ( > SELECT a, b > FROM table1 > ORDER BY a > ) t > ORDER BY a > {code} > The sort is actually executed twice, even though it is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23375) Optimizer should remove unneeded Sort
[ https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23375: Assignee: (was: Apache Spark) > Optimizer should remove unneeded Sort > - > > Key: SPARK-23375 > URL: https://issues.apache.org/jira/browse/SPARK-23375 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > As pointed out in SPARK-23368, as of now there is no rule to remove the Sort > operator on an already sorted plan, ie. if we have a query like: > {code} > SELECT b > FROM ( > SELECT a, b > FROM table1 > ORDER BY a > ) t > ORDER BY a > {code} > The sort is actually executed twice, even though it is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23375) Optimizer should remove unneeded Sort
[ https://issues.apache.org/jira/browse/SPARK-23375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-23375: Description: As pointed out in SPARK-23368, as of now there is no rule to remove the Sort operator on an already sorted plan, ie. if we have a query like: {code} SELECT b FROM ( SELECT a, b FROM table1 ORDER BY a ) t ORDER BY a {code} The sort is actually executed twice, even though it is not needed. was: As pointed out in SPARK-23368, as of now there is no rule to remove the Sort operator on an already sorted plan, ie. if we have a query like: {{code}} SELECT b FROM ( SELECT a, b FROM table1 ORDER BY a ) t ORDER BY a {{code}} The sort is actually executed twice, even though it is not needed. > Optimizer should remove unneeded Sort > - > > Key: SPARK-23375 > URL: https://issues.apache.org/jira/browse/SPARK-23375 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > As pointed out in SPARK-23368, as of now there is no rule to remove the Sort > operator on an already sorted plan, ie. if we have a query like: > {code} > SELECT b > FROM ( > SELECT a, b > FROM table1 > ORDER BY a > ) t > ORDER BY a > {code} > The sort is actually executed twice, even though it is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23375) Optimizer should remove unneeded Sort
Marco Gaido created SPARK-23375: --- Summary: Optimizer should remove unneeded Sort Key: SPARK-23375 URL: https://issues.apache.org/jira/browse/SPARK-23375 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Marco Gaido As pointed out in SPARK-23368, as of now there is no rule to remove the Sort operator on an already sorted plan, ie. if we have a query like: {{code}} SELECT b FROM ( SELECT a, b FROM table1 ORDER BY a ) t ORDER BY a {{code}} The sort is actually executed twice, even though it is not needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23363) Fix spark-sql bug or improvement
[ https://issues.apache.org/jira/browse/SPARK-23363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23363. --- Resolution: Invalid [~guoxiaolongzte] do not reopen JIRAs with no change. There is no purpose in this one; it's an umbrella of one issue, and the umbrella is just about "bugs" > Fix spark-sql bug or improvement > > > Key: SPARK-23363 > URL: https://issues.apache.org/jira/browse/SPARK-23363 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23360) SparkSession.createDataFrame results in correct results with non-Arrow codepath
[ https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358408#comment-16358408 ] Apache Spark commented on SPARK-23360: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20559 > SparkSession.createDataFrame results in correct results with non-Arrow > codepath > --- > > Key: SPARK-23360 > URL: https://issues.apache.org/jira/browse/SPARK-23360 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Major > > {code:java} > import datetime > import pandas as pd > import os > dt = [datetime.datetime(2015, 10, 31, 22, 30)] > pdf = pd.DataFrame({'time': dt}) > os.environ['TZ'] = 'America/New_York' > df1 = spark.createDataFrame(pdf) > df1.show() > +---+ > | time| > +---+ > |2015-10-31 21:30:00| > +---+ > {code} > Seems to related to this line here: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776] > It appears to be an issue with "tzlocal()" > Wrong: > {code:java} > from_tz = "America/New_York" > to_tz = "tzlocal()" > s.apply(lambda ts: > ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 21:30:00 > Name: time, dtype: datetime64[ns] > {code} > Correct: > {code:java} > from_tz = "America/New_York" > to_tz = "America/New_York" > s.apply( > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 22:30:00 > Name: time, dtype: datetime64[ns] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23360) SparkSession.createDataFrame results in correct results with non-Arrow codepath
[ https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23360: Assignee: (was: Apache Spark) > SparkSession.createDataFrame results in correct results with non-Arrow > codepath > --- > > Key: SPARK-23360 > URL: https://issues.apache.org/jira/browse/SPARK-23360 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Priority: Major > > {code:java} > import datetime > import pandas as pd > import os > dt = [datetime.datetime(2015, 10, 31, 22, 30)] > pdf = pd.DataFrame({'time': dt}) > os.environ['TZ'] = 'America/New_York' > df1 = spark.createDataFrame(pdf) > df1.show() > +---+ > | time| > +---+ > |2015-10-31 21:30:00| > +---+ > {code} > Seems to related to this line here: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776] > It appears to be an issue with "tzlocal()" > Wrong: > {code:java} > from_tz = "America/New_York" > to_tz = "tzlocal()" > s.apply(lambda ts: > ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 21:30:00 > Name: time, dtype: datetime64[ns] > {code} > Correct: > {code:java} > from_tz = "America/New_York" > to_tz = "America/New_York" > s.apply( > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 22:30:00 > Name: time, dtype: datetime64[ns] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-23363) Fix spark-sql bug or improvement
[ https://issues.apache.org/jira/browse/SPARK-23363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-23363. - > Fix spark-sql bug or improvement > > > Key: SPARK-23363 > URL: https://issues.apache.org/jira/browse/SPARK-23363 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.4.0 >Reporter: guoxiaolongzte >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23360) SparkSession.createDataFrame results in correct results with non-Arrow codepath
[ https://issues.apache.org/jira/browse/SPARK-23360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23360: Assignee: Apache Spark > SparkSession.createDataFrame results in correct results with non-Arrow > codepath > --- > > Key: SPARK-23360 > URL: https://issues.apache.org/jira/browse/SPARK-23360 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Li Jin >Assignee: Apache Spark >Priority: Major > > {code:java} > import datetime > import pandas as pd > import os > dt = [datetime.datetime(2015, 10, 31, 22, 30)] > pdf = pd.DataFrame({'time': dt}) > os.environ['TZ'] = 'America/New_York' > df1 = spark.createDataFrame(pdf) > df1.show() > +---+ > | time| > +---+ > |2015-10-31 21:30:00| > +---+ > {code} > Seems to related to this line here: > [https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L1776] > It appears to be an issue with "tzlocal()" > Wrong: > {code:java} > from_tz = "America/New_York" > to_tz = "tzlocal()" > s.apply(lambda ts: > ts.tz_localize(from_tz,ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 21:30:00 > Name: time, dtype: datetime64[ns] > {code} > Correct: > {code:java} > from_tz = "America/New_York" > to_tz = "America/New_York" > s.apply( > lambda ts: ts.tz_localize(from_tz, > ambiguous=False).tz_convert(to_tz).tz_localize(None) > if ts is not pd.NaT else pd.NaT) > 0 2015-10-31 22:30:00 > Name: time, dtype: datetime64[ns] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23370: -- Shepherd: (was: Sean Owen) Flags: (was: Important) Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) (Don't assign shepherds please; I don't accept this even as an issue) This is an Oracle problem as you say, so, not a Spark bug. A clean workaround is OK, but, sounds like there's one that doesn't even require code changes. So I'd close this. > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Minor > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on jdbc read spark obtains the schema of a table from using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the times except when the column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: Which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of Number time, the above issue is mitigated. > > {color:#14892c}{color:#f6c342}I can implement the changes, but require some > inputs on the approach from the gatekeepers here{color}.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbee to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-23373. - Resolution: Cannot Reproduce > Can not execute "count distinct" queries on parquet formatted table > --- > > Key: SPARK-23373 > URL: https://issues.apache.org/jira/browse/SPARK-23373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wang, Gang >Priority: Major > > I failed to run sql "select count(distinct n_name) from nation", table nation > is formatted in Parquet, error trace is as following. > _spark-sql> select count(distinct n_name) from nation;_ > _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_ > _Error in query: Table or view not found: nation; line 1 pos 35_ > _spark-sql> select count(distinct n_name) from nation_parquet;_ > _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_parquet_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: > array_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: > struct_ > _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is > true_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as > values in memory (estimated size 305.0 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored > as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added > broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: > 366.3 MB)_ > _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from > processCmd at CliDriver.java:376_ > _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after > partition pruning:_ > _PartitionDirectory([empty > row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; > isDirectory=false; length=3216; replication=3; blocksize=134217728; > 
modification_time=1516619879024; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false}))_ > _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes._ > _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select > count(distinct n_name) from nation_parquet]_ > {color:#ff}*_org.apache.spark.SparkException: Task not > serializable_*{color} > _at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ > _at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ > _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ > _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_ > _at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_ > _at >
[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358402#comment-16358402 ] Marco Gaido commented on SPARK-23373: - Then I think we can close this, thanks. > Can not execute "count distinct" queries on parquet formatted table > --- > > Key: SPARK-23373 > URL: https://issues.apache.org/jira/browse/SPARK-23373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wang, Gang >Priority: Major > > I failed to run sql "select count(distinct n_name) from nation", table nation > is formatted in Parquet, error trace is as following. > _spark-sql> select count(distinct n_name) from nation;_ > _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_ > _Error in query: Table or view not found: nation; line 1 pos 35_ > _spark-sql> select count(distinct n_name) from nation_parquet;_ > _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_parquet_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: > array_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: > struct_ > _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is > true_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as > values in memory (estimated size 305.0 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored > as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added > broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: > 366.3 MB)_ > _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from > processCmd at CliDriver.java:376_ > _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after > partition pruning:_ > _PartitionDirectory([empty > row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; > isDirectory=false; 
length=3216; replication=3; blocksize=134217728; > modification_time=1516619879024; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false}))_ > _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes._ > _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select > count(distinct n_name) from nation_parquet]_ > {color:#ff}*_org.apache.spark.SparkException: Task not > serializable_*{color} > _at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ > _at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ > _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ > _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_ > _at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_ > _at >
[jira] [Commented] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358401#comment-16358401 ] Yuming Wang commented on SPARK-23370: - User can config the column type like below now: {code:scala} val props = new Properties() props.put("customSchema", "ID decimal(38, 0), N1 int, N2 boolean") val dfRead = spark.read.schema(schema).jdbc(jdbcUrl, "tableWithCustomSchema", props) dfRead.show() {code} More details: https://github.com/apache/spark/pull/18266 > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Major > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on jdbc read spark obtains the schema of a table from using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the times except when the column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: Which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of Number time, the above issue is mitigated. > > {color:#14892c}{color:#f6c342}I can implement the changes, but require some > inputs on the approach from the gatekeepers here{color}.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbee to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
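For reference, a PySpark variant of the workaround quoted above, sketched with a placeholder JDBC URL, table name, and credentials, and assuming Spark 2.3+ (where the customSchema option from the linked PR is available) plus an existing SparkSession named spark. Note that the quoted Scala snippet references a `schema` value it never defines; this version relies on the customSchema property alone.
{code:python}
# Hedged sketch: read an Oracle table over JDBC, overriding the inferred NUMBER
# precision/scale via the customSchema property. URL, table name and credentials
# are placeholders.
jdbc_url = "jdbc:oracle:thin:@//dbhost:1521/service"   # placeholder
props = {
    "user": "scott",                                    # placeholder credentials
    "password": "tiger",
    "customSchema": "ID decimal(38, 0), N1 int, N2 boolean",
}

df = spark.read.jdbc(jdbc_url, "tableWithCustomSchema", properties=props)
df.printSchema()   # ID should come back as decimal(38,0) rather than the
                   # BigDecimal(30,10) default described in the issue
df.show()
{code}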
[jira] [Commented] (SPARK-12378) CREATE EXTERNAL TABLE AS SELECT EXPORT AWS S3 ERROR
[ https://issues.apache.org/jira/browse/SPARK-12378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358393#comment-16358393 ] Arun commented on SPARK-12378: -- I am also getting the same issue when I am trying to insert data in hive from spark. My table is an external table stores in AWS S3. Although the data gets inserted in the table, but it gives this message: {code:java} -chgrp: '' does not match expected pattern for group Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... 18/02/09 13:25:56 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !! -chgrp: '' does not match expected pattern for group Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...{code} Any resolution please? > CREATE EXTERNAL TABLE AS SELECT EXPORT AWS S3 ERROR > --- > > Key: SPARK-12378 > URL: https://issues.apache.org/jira/browse/SPARK-12378 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 > Environment: AWS EMR 4.2.0 > Just Master Running m3.xlarge > Applications: > Hive 1.0.0 > Spark 1.5.2 >Reporter: CESAR MICHELETTI >Priority: Major > > I am receive the bellow error during try exporting data to AWS S3, in > spark-sql. > Command: > CREATE external TABLE export > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' > -- lines terminated by '\n' > STORED AS TEXTFILE > LOCATION 's3://xxx/yyy' > AS > SELECT > xxx > > (complete query) > ; > Error: > -chgrp: '' does not match expected pattern for group > Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... > -chgrp: '' does not match expected pattern for group > Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH... > 15/12/16 21:09:25 ERROR SparkSQLDriver: Failed in [CREATE external TABLE > csvexport > ... > (create table + query) > ... 
> java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:441) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) > at > org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.hive.execution.CreateTableAsSelect.run(CreateTableAsSelect.scala:89) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at >
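Not a fix for the chgrp warning itself, but if a Hive CTAS/insert is not strictly required, writing the query result to S3 directly with the DataFrame writer avoids the Hive table-loading path that issues the chgrp call. A rough sketch assuming Spark 2.x, an existing SparkSession named spark, and placeholder query and bucket names:
{code:python}
# Hedged sketch: export a query result to S3 as comma-delimited text without going
# through a Hive CREATE TABLE AS SELECT. Query and destination are placeholders.
result = spark.sql("SELECT * FROM some_source_table")   # placeholder query

(result.write
       .mode("overwrite")
       .option("sep", ",")          # '\054' in the original command is a comma
       .csv("s3://bucket/prefix"))  # placeholder bucket/prefix
{code}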
[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358392#comment-16358392 ] Yuming Wang commented on SPARK-23373: - I cannot reproduce on current master as your mentioned too. > Can not execute "count distinct" queries on parquet formatted table > --- > > Key: SPARK-23373 > URL: https://issues.apache.org/jira/browse/SPARK-23373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wang, Gang >Priority: Major > > I failed to run sql "select count(distinct n_name) from nation", table nation > is formatted in Parquet, error trace is as following. > _spark-sql> select count(distinct n_name) from nation;_ > _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_ > _Error in query: Table or view not found: nation; line 1 pos 35_ > _spark-sql> select count(distinct n_name) from nation_parquet;_ > _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_parquet_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: > array_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: > struct_ > _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is > true_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as > values in memory (estimated size 305.0 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored > as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added > broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: > 366.3 MB)_ > _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from > processCmd at CliDriver.java:376_ > _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after > partition pruning:_ > _PartitionDirectory([empty > row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; > 
isDirectory=false; length=3216; replication=3; blocksize=134217728; > modification_time=1516619879024; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false}))_ > _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes._ > _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select > count(distinct n_name) from nation_parquet]_ > {color:#ff}*_org.apache.spark.SparkException: Task not > serializable_*{color} > _at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ > _at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ > _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ > _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_ > _at >
[jira] [Comment Edited] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358355#comment-16358355 ] Wang, Gang edited comment on SPARK-23373 at 2/9/18 1:01 PM: Yes. Seems related to my test environment. While, I tried in a Spark suite, in class _*PruneFileSourcePartitionsSuite*, method_ test("SPARK-20986 Reset table's statistics after PruneFileSourcePartitions rule"). Add _sql("select count(distinct id) from tbl").collect()_ Got the same exception. Could you please have a try in your side? was (Author: gwang3): Yes. Seems related to my test environment. While, I tried in a Spark suite, in class _*PruneFileSourcePartitionsSuite*, method_ test("SPARK-20986 Reset table's statistics after PruneFileSourcePartitions rule"). Add _sql("select count(distinct id) from tbl").collect()_ __ got the same exception. Could you please have a try in your side? > Can not execute "count distinct" queries on parquet formatted table > --- > > Key: SPARK-23373 > URL: https://issues.apache.org/jira/browse/SPARK-23373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wang, Gang >Priority: Major > > I failed to run sql "select count(distinct n_name) from nation", table nation > is formatted in Parquet, error trace is as following. > _spark-sql> select count(distinct n_name) from nation;_ > _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_ > _Error in query: Table or view not found: nation; line 1 pos 35_ > _spark-sql> select count(distinct n_name) from nation_parquet;_ > _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_parquet_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: > array_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: > struct_ > _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is > true_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as > values in memory (estimated size 305.0 KB, free 
366.0 MB)_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored > as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added > broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: > 366.3 MB)_ > _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from > processCmd at CliDriver.java:376_ > _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after > partition pruning:_ > _PartitionDirectory([empty > row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; > isDirectory=false; length=3216; replication=3; blocksize=134217728; > modification_time=1516619879024; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false}))_ > _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes._ > _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select > count(distinct n_name) from nation_parquet]_ >
[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358355#comment-16358355 ] Wang, Gang commented on SPARK-23373: Yes. Seems related to my test environment. While, I tried in a Spark suite, in class _*PruneFileSourcePartitionsSuite*, method_ test("SPARK-20986 Reset table's statistics after PruneFileSourcePartitions rule"). Add _sql("select count(distinct id) from tbl").collect()_ __ got the same exception. Could you please have a try in your side? > Can not execute "count distinct" queries on parquet formatted table > --- > > Key: SPARK-23373 > URL: https://issues.apache.org/jira/browse/SPARK-23373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wang, Gang >Priority: Major > > I failed to run sql "select count(distinct n_name) from nation", table nation > is formatted in Parquet, error trace is as following. > _spark-sql> select count(distinct n_name) from nation;_ > _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_ > _Error in query: Table or view not found: nation; line 1 pos 35_ > _spark-sql> select count(distinct n_name) from nation_parquet;_ > _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_parquet_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: > array_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: > struct_ > _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is > true_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as > values in memory (estimated size 305.0 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored > as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added > broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: > 366.3 MB)_ > _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 
from > processCmd at CliDriver.java:376_ > _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after > partition pruning:_ > _PartitionDirectory([empty > row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; > isDirectory=false; length=3216; replication=3; blocksize=134217728; > modification_time=1516619879024; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false}))_ > _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes._ > _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select > count(distinct n_name) from nation_parquet]_ > {color:#ff}*_org.apache.spark.SparkException: Task not > serializable_*{color} > _at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ > _at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ > _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ > _at
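For anyone trying to reproduce this outside the test suite, a minimal self-contained attempt, assuming a plain pyspark session (table and column names below are placeholders). On an affected 2.2.0 build this is where the Task not serializable error above was reported; on current master the query is expected to simply return the count.
{code:python}
# Hedged reproduction sketch: build a small Parquet-backed table and run the
# failing aggregate. Table and column names are placeholders.
spark.range(25).selectExpr("cast(id as string) as n_name") \
    .write.mode("overwrite").format("parquet").saveAsTable("nation_parquet")

spark.sql("select count(distinct n_name) from nation_parquet").show()
{code}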
[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358315#comment-16358315 ] Marco Gaido commented on SPARK-23373: - I cannot reproduce on current master... May you try and check whether the issue still exists? > Can not execute "count distinct" queries on parquet formatted table > --- > > Key: SPARK-23373 > URL: https://issues.apache.org/jira/browse/SPARK-23373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wang, Gang >Priority: Major > > I failed to run sql "select count(distinct n_name) from nation", table nation > is formatted in Parquet, error trace is as following. > _spark-sql> select count(distinct n_name) from nation;_ > _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_ > _Error in query: Table or view not found: nation; line 1 pos 35_ > _spark-sql> select count(distinct n_name) from nation_parquet;_ > _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_parquet_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: > array_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: > struct_ > _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is > true_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as > values in memory (estimated size 305.0 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored > as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added > broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: > 366.3 MB)_ > _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from > processCmd at CliDriver.java:376_ > _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after > partition pruning:_ > _PartitionDirectory([empty > 
row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; > isDirectory=false; length=3216; replication=3; blocksize=134217728; > modification_time=1516619879024; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false}))_ > _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes._ > _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select > count(distinct n_name) from nation_parquet]_ > {color:#ff}*_org.apache.spark.SparkException: Task not > serializable_*{color} > _at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ > _at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ > _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ > _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_ > _at >
[jira] [Commented] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.
[ https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358294#comment-16358294 ] Harleen Singh Mann commented on SPARK-23372: what is your proposal on fixing this? > Writing empty struct in parquet fails during execution. It should fail > earlier during analysis. > --- > > Key: SPARK-23372 > URL: https://issues.apache.org/jira/browse/SPARK-23372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Minor > > *Running* > spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) > *Results in* > {code:java} > org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with > an empty group: message spark_schema { > } > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27) > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37) > at org.apache.parquet.schema.MessageType.accept(MessageType.java:58) > at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread. > {code} > We should detect this earlier and failed during compilation of the query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
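Until such a check lands in the analyzer, callers can guard against this themselves. A minimal application-level sketch, assuming an existing SparkSession named spark; the function and path names are illustrative only:
{code:python}
# Hedged sketch: refuse to write a DataFrame with an empty schema instead of letting
# the Parquet writer fail mid-execution as in the stack trace above.
from pyspark.sql.types import StructType

def write_parquet_checked(df, path):
    if len(df.schema.fields) == 0:
        raise ValueError("Refusing to write a DataFrame with an empty schema to Parquet: " + path)
    df.write.mode("overwrite").parquet(path)

empty_df = spark.createDataFrame([], StructType([]))
# write_parquet_checked(empty_df, "/tmp/out")   # raises ValueError before any job runs
{code}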
[jira] [Created] (SPARK-23374) Checkstyle/Scalastyle only work from top level build
Rob Vesse created SPARK-23374: - Summary: Checkstyle/Scalastyle only work from top level build Key: SPARK-23374 URL: https://issues.apache.org/jira/browse/SPARK-23374 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.1 Reporter: Rob Vesse The current Maven plugin definitions for Checkstyle/Scalastyle use fixed XML configs for the style rule locations that are only valid relative to the top level POM. Therefore if you try and do a {{mvn verify}} in an individual module you get the following error: {noformat} [ERROR] Failed to execute goal org.scalastyle:scalastyle-maven-plugin:1.0.0:check (default) on project spark-mesos_2.11: Failed during scalastyle execution: Unable to find configuration file at location scalastyle-config.xml {noformat} As the paths are hardcoded in XML and don't use Maven properties you can't override these settings so you can't style check a single module which makes doing style checking require a full project {{mvn verify}} which is not ideal. By introducing Maven properties for these two paths it would become possible to run checks on a single module like so: {noformat} mvn verify -Dscalastyle.location=../scalastyle-config.xml {noformat} Obviously the override would need to vary depending on the specific module you are trying to run it against but this would be a relatively simply change that would streamline dev workflows -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang, Gang updated SPARK-23373: --- Description: I failed to run sql "select count(distinct n_name) from nation", table nation is formatted in Parquet, error trace is as following. _spark-sql> select count(distinct n_name) from nation;_ _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_ _Error in query: Table or view not found: nation; line 1 pos 35_ _spark-sql> select count(distinct n_name) from nation_parquet;_ _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_parquet_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: array_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: struct_ _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ _18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate._ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ _18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate._ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is true_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as values in memory (estimated size 305.0 KB, free 366.0 MB)_ _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 366.3 MB)_ _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from processCmd at CliDriver.java:376_ _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after partition pruning:_ _PartitionDirectory([empty row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; isDirectory=false; length=3216; replication=3; blocksize=134217728; modification_time=1516619879024; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}))_ _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes._ _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select count(distinct n_name) from nation_parquet]_ 
{color:#ff}*_org.apache.spark.SparkException: Task not serializable_*{color} _at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ _at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_ _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_ _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_ _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_ _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)_ _at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)_ _at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)_ _at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:389)_ _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_ _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_ _at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_ _at
[jira] [Updated] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang, Gang updated SPARK-23373: --- Description: I failed to run sql "select count(distinct n_name) from nation", table nation is formatted in Parquet, error trace is as following. _spark-sql> select count(distinct n_name) from nation;_ _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_ _Error in query: Table or view not found: nation; line 1 pos 35_ _spark-sql> select count(distinct n_name) from nation_parquet;_ _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_parquet_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: array_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: struct_ _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ _18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate._ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ _18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate._ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is true_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as values in memory (estimated size 305.0 KB, free 366.0 MB)_ _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 366.3 MB)_ _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from processCmd at CliDriver.java:376_ _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after partition pruning:_ _PartitionDirectory([empty row],ArrayBuffer(LocatedFileStatus\{path=hdfs://**.com:8020/apps/hive/warehouse/nation_parquet/00_0; isDirectory=false; length=3216; replication=3; blocksize=134217728; modification_time=1516619879024; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}))_ _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes._ _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select count(distinct n_name) from nation_parquet]_ 
{color:#ff}*_org.apache.spark.SparkException: Task not serializable_*{color} _at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ _at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_ _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_ _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_ _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_ _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)_ _at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)_ _at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)_ _at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:389)_ _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_ _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_ _at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_ _at
[jira] [Updated] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
[ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wang, Gang updated SPARK-23373: --- Issue Type: Bug (was: New Feature) > Can not execute "count distinct" queries on parquet formatted table > --- > > Key: SPARK-23373 > URL: https://issues.apache.org/jira/browse/SPARK-23373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wang, Gang >Priority: Major > > I failed to run sql "select count(distinct n_name) from nation", table nation > is formatted in Parquet, error trace is as following. > _spark-sql> select count(distinct n_name) from nation;_ > _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_ > _Error in query: Table or view not found: nation; line 1 pos 35_ > _spark-sql> select count(distinct n_name) from nation_parquet;_ > _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select > count(distinct n_name) from nation_parquet_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ > _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: > array_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ > _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: > struct_ > _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ > _18/02/09 03:55:39 INFO main HashAggregateExec:54 > spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current > version of codegened fast hashmap does not support this aggregate._ > _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is > true_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ > _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as > values in memory (estimated size 305.0 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored > as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ > _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added > broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: > 366.3 MB)_ > _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from > processCmd at CliDriver.java:376_ > _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after > partition pruning:_ > _PartitionDirectory([empty > row],ArrayBuffer(LocatedFileStatus\{path=hdfs://btd-dev-2425209.lvs01.dev.ebayc3.com:8020/apps/hive/warehouse/nation_parquet/00_0; > isDirectory=false; length=3216; 
replication=3; blocksize=134217728; > modification_time=1516619879024; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false}))_ > _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes._ > _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select > count(distinct n_name) from nation_parquet]_ > {color:#FF}*_org.apache.spark.SparkException: Task not > serializable_*{color} > _at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ > _at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ > _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ > _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_ > _at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_ > _at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_ > _at >
[jira] [Created] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table
Wang, Gang created SPARK-23373: -- Summary: Can not execute "count distinct" queries on parquet formatted table Key: SPARK-23373 URL: https://issues.apache.org/jira/browse/SPARK-23373 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.2.0 Reporter: Wang, Gang I failed to run sql "select count(distinct n_name) from nation", table nation is formatted in Parquet, error trace is as following. _spark-sql> select count(distinct n_name) from nation;_ _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_ _Error in query: Table or view not found: nation; line 1 pos 35_ _spark-sql> select count(distinct n_name) from nation_parquet;_ _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_parquet_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_ _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: array_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_ _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: struct_ _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_ _18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate._ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_ _18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate._ _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is true_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_ _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_ _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as values in memory (estimated size 305.0 KB, free 366.0 MB)_ _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_ _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 366.3 MB)_ _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from processCmd at CliDriver.java:376_ _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after partition pruning:_ _PartitionDirectory([empty row],ArrayBuffer(LocatedFileStatus\{path=hdfs://btd-dev-2425209.lvs01.dev.ebayc3.com:8020/apps/hive/warehouse/nation_parquet/00_0; isDirectory=false; length=3216; replication=3; blocksize=134217728; modification_time=1516619879024; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}))_ _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as 
scanning 4194304 bytes._ _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select count(distinct n_name) from nation_parquet]_ {color:#FF}*_org.apache.spark.SparkException: Task not serializable_*{color} _at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_ _at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_ _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_ _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_ _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_ _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_ _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_ _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)_ _at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)_ _at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)_ _at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:389)_ _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_ _at
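For reference, the failure above can be reproduced outside the spark-sql CLI with a short spark-shell session. The sketch below assumes a Hive-enabled SparkSession and a hypothetical TPC-H-style nation_parquet table; only the count(distinct ...) query itself is taken from the log.
{code}
// Reproduction sketch; the table layout is an assumption (TPC-H-style nation columns).
spark.sql("""CREATE TABLE nation_parquet (
  n_nationkey INT, n_name STRING, n_regionkey INT, n_comment STRING)
  STORED AS PARQUET""")
spark.sql("INSERT INTO nation_parquet VALUES (0, 'ALGERIA', 0, 'sample row')")

// For the reporter, the distinct aggregate then goes through
// WholeStageCodegenExec.doExecute and fails with
// org.apache.spark.SparkException: Task not serializable.
spark.sql("SELECT count(DISTINCT n_name) FROM nation_parquet").show()
{code}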
[jira] [Created] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.
Dilip Biswal created SPARK-23372: Summary: Writing empty struct in parquet fails during execution. It should fail earlier during analysis. Key: SPARK-23372 URL: https://issues.apache.org/jira/browse/SPARK-23372 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Dilip Biswal *Running* spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) *Results in* {code:java} org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema { } at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27) at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37) at org.apache.parquet.schema.MessageType.accept(MessageType.java:58) at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:109) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread. {code} We should detect this earlier and failed during compilation of the query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
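The failing call is a one-liner; the sketch below restates it with a placeholder output path, together with one possible shape of the earlier analysis-time check the issue asks for. The check itself is an assumption for illustration, not the committed fix.
{code}
// Reproduction (placeholder output path): emptyDataFrame has a zero-column schema,
// so the Parquet writer receives an empty group and only fails at task execution.
val path = "/tmp/spark-23372-empty-schema"
spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)

// One possible shape of the earlier check (an assumption, not the actual fix):
// reject zero-column schemas before the write is planned.
def checkSchema(df: org.apache.spark.sql.DataFrame): Unit =
  require(df.schema.nonEmpty, "Parquet data source does not support writing an empty schema")
{code}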
[jira] [Commented] (SPARK-23372) Writing empty struct in parquet fails during execution. It should fail earlier during analysis.
[ https://issues.apache.org/jira/browse/SPARK-23372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358165#comment-16358165 ] Dilip Biswal commented on SPARK-23372: -- Working on a fix for this. > Writing empty struct in parquet fails during execution. It should fail > earlier during analysis. > --- > > Key: SPARK-23372 > URL: https://issues.apache.org/jira/browse/SPARK-23372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dilip Biswal >Priority: Minor > > *Running* > spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path) > *Results in* > {code:java} > org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with > an empty group: message spark_schema { > } > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27) > at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37) > at org.apache.parquet.schema.MessageType.accept(MessageType.java:58) > at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23) > at > org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:225) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:37) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread. > {code} > We should detect this earlier and failed during compilation of the query. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23096) Migrate rate source to v2
[ https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358143#comment-16358143 ] Jose Torres commented on SPARK-23096: - Sure! Happy to have help. The "ratev2" source is just something I hacked together to exercise the v2 streaming execution path. You're right that it can really be replaced with a fully migrated version of the v1 source. > Migrate rate source to v2 > - > > Key: SPARK-23096 > URL: https://issues.apache.org/jira/browse/SPARK-23096 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
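For context, the user-facing surface that would carry over to a v2 implementation is small; the sketch below shows typical usage of the existing rate source (option values are arbitrary). A migrated source would keep this surface while using the DataSourceV2 streaming reader APIs underneath.
{code}
// Typical usage of the rate source today (v1 path); option values are arbitrary.
val rateDF = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()                                  // schema: timestamp TIMESTAMP, value LONG

val query = rateDF.writeStream
  .format("console")
  .outputMode("append")
  .start()
{code}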
[jira] [Created] (SPARK-23371) Parquet footer data is wrong on Windows in a Parquet-format partitioned table
pin_zhang created SPARK-23371: - Summary: Parquet Footer data is wrong on window in parquet format partition table Key: SPARK-23371 URL: https://issues.apache.org/jira/browse/SPARK-23371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.2, 2.1.1 Reporter: pin_zhang On window Run SQL in spark shell spark.sql("create table part_test (id string )partitioned by( index int) stored as parquet") spark.sql("insert into part_test partition (index =1) values ('1')") Get exception when query spark.sql("select * from part_test ").show() For the parquet.Version in parquet-hadoop-bundle-1.6.0.jar cannot load the version info in spark on window. Classloader try to get version in the parquet-format-2.3.0-incubating.jar 18/02/09 16:58:48 WARN CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr org.apache.parquet.VersionParser$VersionParseException: Could not parse created_ by: parquet-mr using format: (.+) version ((.*) )?(build ?(.*)) at org.apache.parquet.VersionParser.parse(VersionParser.java:112) at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptSt atistics.java:60) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParq uetStatistics(ParquetMetadataConverter.java:263) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(Parque tFileReader.java:583) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetF ileReader.java:513) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR ecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR ecordReader.nextBatch(VectorizedParquetRecordReader.java:225) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetR ecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNe xt(RecordReaderIterator.scala:39) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNex t(FileScanRDD.scala:109) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIt erator(FileScanRDD.scala:184) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNex t(FileScanRDD.scala:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIte rator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIte rator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRo wIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon $1.hasNext(WholeStageCodegenExec.scala:377) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.s cala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.s cala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$ap ply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$ap ply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala: 38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor. java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor .java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
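The reproduction embedded in the report above is short; restated as a spark-shell session on Windows it is:
{code}
// Reproduction from the report, restated as a spark-shell session on Windows.
spark.sql("create table part_test (id string) partitioned by (index int) stored as parquet")
spark.sql("insert into part_test partition (index = 1) values ('1')")
// Reading back the partition parses the Parquet footer statistics, which is where
// the VersionParseException warning shown above is logged.
spark.sql("select * from part_test").show()
{code}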
[jira] [Commented] (SPARK-21860) Improve memory reuse for heap memory in `HeapMemoryAllocator`
[ https://issues.apache.org/jira/browse/SPARK-21860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358114#comment-16358114 ] Apache Spark commented on SPARK-21860: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20558 > Improve memory reuse for heap memory in `HeapMemoryAllocator` > - > > Key: SPARK-21860 > URL: https://issues.apache.org/jira/browse/SPARK-21860 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 2.4.0 > > > In `HeapMemoryAllocator`, memory is allocated from a pool whose key is the > requested memory size. > In practice, sizes such as 1025 bytes, 1026 bytes, ..., 1032 bytes can be > treated as identical, because memory is allocated in multiples of 8 bytes. > Keying the pool by the aligned size would therefore improve memory reuse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
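To make the proposal concrete, the sketch below illustrates the alignment idea in isolation. It is a simplified model of the pooling scheme, not the actual HeapMemoryAllocator code.
{code}
// Simplified model of the proposal (not the actual HeapMemoryAllocator code):
// key the free-page pool by the 8-byte-aligned size rather than the exact
// requested size, so requests for 1025..1032 bytes all reuse the same pages.
import scala.collection.mutable

def alignedSize(requested: Long): Long = (requested + 7) / 8 * 8

val pool = mutable.Map.empty[Long, List[Array[Long]]]

def allocate(size: Long): Array[Long] = {
  val key = alignedSize(size)
  pool.get(key) match {
    case Some(page :: rest) => pool(key) = rest; page        // reuse a pooled page
    case _                  => new Array[Long]((key / 8).toInt)
  }
}

def free(page: Array[Long]): Unit = {
  val key = page.length.toLong * 8
  pool(key) = page :: pool.getOrElse(key, Nil)
}
{code}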
[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harleen Singh Mann updated SPARK-23370: --- Shepherd: Sean Owen (was: Xiangrui Meng) > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Major > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on jdbc read spark obtains the schema of a table from using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the times except when the column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: Which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of Number time, the above issue is mitigated. > > {color:#14892c}{color:#f6c342}I can implement the changes, but require some > inputs on the approach from the gatekeepers here{color}.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbee to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harleen Singh Mann updated SPARK-23370: --- Shepherd: Xiangrui Meng > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Major > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on jdbc read spark obtains the schema of a table from using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the times except when the column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: Which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of Number time, the above issue is mitigated. > > {color:#14892c}{color:#f6c342}I can implement the changes, but require some > inputs on the approach from the gatekeepers here{color}.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbee to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harleen Singh Mann updated SPARK-23370: --- Description: Currently, on jdbc read spark obtains the schema of a table from using {color:#654982} resultSet.getMetaData.getColumnType{color} This works 99.99% of the times except when the column of Number type is added on an Oracle table using the alter statement. This is essentially an Oracle DB + JDBC bug that has been documented on Oracle KB and patches exist. [oracle KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] {color:#ff}As a result of the above mentioned issue, Spark receives a size of 0 for the field and defaults the field type to be BigDecimal(30,10) instead of what it actually should be. This is done in OracleDialect.scala. This may cause issues in the downstream application where relevant information may be missed to the changed precision and scale.{color} _The versions that are affected are:_ _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ _[Release: 11.1 to 11.2]_ +Proposed approach:+ There is another way of fetching the schema information in Oracle: Which is through the all_tab_columns table. If we use this table to fetch the precision and scale of Number time, the above issue is mitigated. {color:#14892c}{color:#f6c342}I can implement the changes, but require some inputs on the approach from the gatekeepers here{color}.{color} {color:#14892c}PS. This is also my first Jira issue and my first fork for Spark, so I will need some guidance along the way. (yes, I am a newbee to this) Thanks...{color} was: Currently, on jdbc read spark obtains the schema of a table from using {color:#654982} resultSet.getMetaData.getColumnType{color} This works 99.99% of the times except when the column of Number type is added on an Oracle table using the alter statement. This is essentially an Oracle DB + JDBC bug that has been documented on Oracle KB and patches exist. [oracle KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] {color:#FF}As a result of the above mentioned issue, Spark receives a size of 0 for the field and defaults the field type to be BigDecimal(30,10) instead of what it actually should be. This is done in OracleDialect.scala. This may cause issues in the downstream application where relevant information may be missed to the changed precision and scale.{color} _The versions that are affected are:_ _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ _[Release: 11.1 to 11.2]_ +Proposed approach:+ There is another way of fetching the schema information in Oracle: Which is through the all_tab_columns table. If we use this table to fetch the precision and scale of Number time, the above issue is mitigated. {color:#14892c}I can implement the changes, but require some inputs on the approach from the gatekeepers here.{color} {color:#14892c}PS. This is also my first Jira issue and my first fork for Spark, so I will need some guidance along the way. 
(yes, I am a newbee to this) Thanks...{color} > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Major > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on jdbc read spark obtains the schema of a table from using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the times except when the column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#ff}As a result of the above mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version:
[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harleen Singh Mann updated SPARK-23370: --- Summary: Spark receives a size of 0 for an Oracle Number field and defaults the field type to be BigDecimal(30,10) instead of the actual precision and scale (was: Spark receives a size of 0 for an Oracle Number field defaults the field type to be BigDecimal(30,10)) > Spark receives a size of 0 for an Oracle Number field and defaults the field > type to be BigDecimal(30,10) instead of the actual precision and scale > --- > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Major > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on jdbc read spark obtains the schema of a table from using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the times except when the column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#FF}As a result of the above mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: Which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of Number time, the above issue is mitigated. > > {color:#14892c}I can implement the changes, but require some inputs on the > approach from the gatekeepers here.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbee to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field defaults the field type to be BigDecimal(30,10)
[ https://issues.apache.org/jira/browse/SPARK-23370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harleen Singh Mann updated SPARK-23370: --- Attachment: Oracle KB Document 1266785.pdf > Spark receives a size of 0 for an Oracle Number field defaults the field type > to be BigDecimal(30,10) > - > > Key: SPARK-23370 > URL: https://issues.apache.org/jira/browse/SPARK-23370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 > Environment: Spark 2.2 > Oracle 11g > JDBC ojdbc6.jar >Reporter: Harleen Singh Mann >Priority: Major > Attachments: Oracle KB Document 1266785.pdf > > > Currently, on jdbc read spark obtains the schema of a table from using > {color:#654982} resultSet.getMetaData.getColumnType{color} > This works 99.99% of the times except when the column of Number type is added > on an Oracle table using the alter statement. This is essentially an Oracle > DB + JDBC bug that has been documented on Oracle KB and patches exist. > [oracle > KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] > {color:#FF}As a result of the above mentioned issue, Spark receives a > size of 0 for the field and defaults the field type to be BigDecimal(30,10) > instead of what it actually should be. This is done in OracleDialect.scala. > This may cause issues in the downstream application where relevant > information may be missed to the changed precision and scale.{color} > _The versions that are affected are:_ > _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ > _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ > _[Release: 11.1 to 11.2]_ > +Proposed approach:+ > There is another way of fetching the schema information in Oracle: Which is > through the all_tab_columns table. If we use this table to fetch the > precision and scale of Number time, the above issue is mitigated. > > {color:#14892c}I can implement the changes, but require some inputs on the > approach from the gatekeepers here.{color} > {color:#14892c}PS. This is also my first Jira issue and my first fork for > Spark, so I will need some guidance along the way. (yes, I am a newbee to > this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23370) Spark receives a size of 0 for an Oracle Number field defaults the field type to be BigDecimal(30,10)
Harleen Singh Mann created SPARK-23370: -- Summary: Spark receives a size of 0 for an Oracle Number field defaults the field type to be BigDecimal(30,10) Key: SPARK-23370 URL: https://issues.apache.org/jira/browse/SPARK-23370 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Environment: Spark 2.2 Oracle 11g JDBC ojdbc6.jar Reporter: Harleen Singh Mann Currently, on jdbc read spark obtains the schema of a table from using {color:#654982} resultSet.getMetaData.getColumnType{color} This works 99.99% of the times except when the column of Number type is added on an Oracle table using the alter statement. This is essentially an Oracle DB + JDBC bug that has been documented on Oracle KB and patches exist. [oracle KB|https://support.oracle.com/knowledge/Oracle%20Database%20Products/1266785_1.html] {color:#FF}As a result of the above mentioned issue, Spark receives a size of 0 for the field and defaults the field type to be BigDecimal(30,10) instead of what it actually should be. This is done in OracleDialect.scala. This may cause issues in the downstream application where relevant information may be missed to the changed precision and scale.{color} _The versions that are affected are:_ _JDBC - Version: 11.2.0.1 and later [Release: 11.2 and later ]_ _Oracle Server - Enterprise Edition - Version: 11.1.0.6 to 11.2.0.1_ _[Release: 11.1 to 11.2]_ +Proposed approach:+ There is another way of fetching the schema information in Oracle: Which is through the all_tab_columns table. If we use this table to fetch the precision and scale of Number time, the above issue is mitigated. {color:#14892c}I can implement the changes, but require some inputs on the approach from the gatekeepers here.{color} {color:#14892c}PS. This is also my first Jira issue and my first fork for Spark, so I will need some guidance along the way. (yes, I am a newbee to this) Thanks...{color} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
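To make the proposed approach concrete, the sketch below shows how the precision and scale could be read from all_tab_columns over plain JDBC. The helper name and wiring are hypothetical; only the dictionary view and its data_precision/data_scale columns come from the proposal.
{code}
// Hypothetical helper illustrating the proposed fallback: when the driver reports
// precision 0 for a NUMBER column, look up the real precision and scale in
// all_tab_columns instead of defaulting to DecimalType(30, 10).
import java.sql.DriverManager

def numberPrecisionScale(url: String, user: String, password: String,
                         table: String, column: String): Option[(Int, Int)] = {
  val conn = DriverManager.getConnection(url, user, password)
  try {
    // Dictionary names are typically stored upper-case in Oracle.
    val ps = conn.prepareStatement(
      "SELECT data_precision, data_scale FROM all_tab_columns " +
      "WHERE table_name = ? AND column_name = ?")
    ps.setString(1, table.toUpperCase)
    ps.setString(2, column.toUpperCase)
    val rs = ps.executeQuery()
    if (rs.next()) Some((rs.getInt("data_precision"), rs.getInt("data_scale"))) else None
  } finally {
    conn.close()
  }
}
{code}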
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358067#comment-16358067 ] Wenchen Fan commented on SPARK-23333: - I'm a little confused. If we wanna get a random row, why do we need to sort? Do we have a way to get the dataframe before the sort and call its `first`? > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Major > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
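The shape of the problem is easy to reconstruct: transform() on a sorted input may call first() to build the output column metadata, which forces the full sort. The sketch below uses hypothetical column names and sizes.
{code}
// Sketch of the reported scenario (hypothetical column names and sizes).
import org.apache.spark.ml.feature.VectorAssembler

val sortedDF = spark.range(0, 10000000L)
  .selectExpr("cast(id % 97 as double) as f1", "cast(id % 13 as double) as f2")
  .orderBy("f1")                            // the sort that makes first() expensive

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

// transform() may call first() on sortedDF to build output metadata, which
// forces the sort even though any single row would do for that purpose.
val out = assembler.transform(sortedDF)
{code}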
[jira] [Commented] (SPARK-23096) Migrate rate source to v2
[ https://issues.apache.org/jira/browse/SPARK-23096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358064#comment-16358064 ] Saisai Shao commented on SPARK-23096: - [~joseph.torres] [~tdas] Can I take a crack at this if you're not working on it? In the current code base there exist two rate stream sources (v1 and v2); I think we can consolidate them. > Migrate rate source to v2 > - > > Key: SPARK-23096 > URL: https://issues.apache.org/jira/browse/SPARK-23096 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org