[jira] [Closed] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source
[ https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenxiao Mao closed SPARK-25391.

> Make behaviors consistent when converting parquet hive table to parquet data source
>
> Key: SPARK-25391
> URL: https://issues.apache.org/jira/browse/SPARK-25391
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Chenxiao Mao
> Priority: Major
>
> Parquet data source tables and hive parquet tables behave differently in parquet field resolution, so when {{spark.sql.hive.convertMetastoreParquet}} is true, users may see inconsistent behavior. The differences are:
> * Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, neither data source tables nor hive tables respect {{spark.sql.caseSensitive}}: data source tables always do case-sensitive parquet field resolution, while hive tables always do case-insensitive resolution, regardless of whether {{spark.sql.caseSensitive}} is true or false. SPARK-25132 makes data source tables respect {{spark.sql.caseSensitive}}, while hive serde table behavior is unchanged.
> * How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, data source tables do case-sensitive resolution and return columns with the corresponding letter cases, while hive tables always return the first matched column, ignoring case. SPARK-25132 makes data source tables throw an exception when there is ambiguity, while hive table behavior is unchanged.
> This ticket aims to make behaviors consistent when converting a hive table to a data source table:
> * The behavior must be consistent for the conversion to be done, so we skip the conversion in case-sensitive mode, because a hive parquet table always does case-insensitive field resolution.
> * In case-insensitive mode, when converting a hive parquet table to a parquet data source, we switch the duplicated-fields resolution mode so that the parquet data source picks the first matched field, the same behavior as a hive parquet table, to keep behaviors consistent.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
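The two resolution behaviors the ticket contrasts (first-match-wins versus error-on-ambiguity) can be sketched in a few lines of plain Python. This is an illustration only, not Spark's actual resolution code; the function name and signature are invented for the sketch:

```python
def resolve_field(name, parquet_fields, case_sensitive, pick_first=False):
    """Resolve a catalog column name against parquet file fields.

    Illustrative sketch of the behaviors described in SPARK-25391;
    not Spark's real implementation.
    """
    if case_sensitive:
        # exact-case match only, as data source tables do in sensitive mode
        return name if name in parquet_fields else None
    matches = [f for f in parquet_fields if f.lower() == name.lower()]
    if not matches:
        return None
    if len(matches) > 1 and not pick_first:
        # data source behavior after SPARK-25132: ambiguity is an error
        raise ValueError(f"Ambiguous parquet fields for {name!r}: {matches}")
    # hive behavior (and converted tables, per this ticket): first match wins
    return matches[0]
```

With `pick_first=True` the function mimics the hive parquet table (and the converted table after this change): querying `id` against a file containing both `ID` and `Id` returns the first match rather than failing.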
[jira] [Commented] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance
[ https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617102#comment-16617102 ]

Kazuaki Ishizaki commented on SPARK-25437:

Is such a feature for a major release, not for a maintenance release?

> Using OpenHashMap replace HashMap improve Encoder Performance
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: wangjiaochun
> Priority: Minor
[jira] [Created] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method
Kazuaki Ishizaki created SPARK-25444:

Summary: Refactor GenArrayData.genCodeToCreateArrayData() method
Key: SPARK-25444
URL: https://issues.apache.org/jira/browse/SPARK-25444
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.5.0
Reporter: Kazuaki Ishizaki

{{GenArrayData.genCodeToCreateArrayData()}} generates Java code that creates a temporary Java array in order to build an {{ArrayData}}. The temporary array can be eliminated by using {{ArrayData.createArrayData}}.
[jira] [Assigned] (SPARK-24768) Have a built-in AVRO data source implementation
[ https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-24768:

Assignee: Gengliang Wang

> Have a built-in AVRO data source implementation
>
> Key: SPARK-24768
> URL: https://issues.apache.org/jira/browse/SPARK-24768
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
>
> Apache Avro (https://avro.apache.org) is a popular data serialization format. It is widely used in the Spark and Hadoop ecosystems, especially for Kafka-based data pipelines. Using the external package [https://github.com/databricks/spark-avro], Spark SQL can read and write Avro data. Making spark-avro built-in can provide a better experience for first-time users of Spark SQL and Structured Streaming, and we expect the built-in Avro data source to further improve the adoption of Structured Streaming. The proposal is to inline the code from the spark-avro package ([https://github.com/databricks/spark-avro]). The target release is Spark 2.4.
[jira] [Resolved] (SPARK-24768) Have a built-in AVRO data source implementation
[ https://issues.apache.org/jira/browse/SPARK-24768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-24768.

Resolution: Fixed
Fix Version/s: 2.4.0

> Have a built-in AVRO data source implementation
>
> Key: SPARK-24768
> URL: https://issues.apache.org/jira/browse/SPARK-24768
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Major
> Fix For: 2.4.0
>
> Attachments: Built-in AVRO Data Source In Spark 2.4.pdf
[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617090#comment-16617090 ]

Xiao Li commented on SPARK-23874:

[~bryanc] Thanks! It is very helpful.

> Upgrade apache/arrow to 0.10.0
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Xiao Li
> Assignee: Bryan Cutler
> Priority: Major
> Fix For: 2.4.0
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
> * Allow for adding BinaryType support (ARROW-2141)
> * Bug fix related to array serialization (ARROW-1973)
> * Python 2 str will be made into an Arrow string instead of bytes (ARROW-2101)
> * Python bytearrays are supported as input to pyarrow (ARROW-2141)
> * Java has a common interface for reset to clean up complex vectors in Spark ArrowWriter (ARROW-1962)
> * Cleanup of pyarrow type equality checks (ARROW-2423)
> * ArrowStreamWriter should not hold references to ArrowBlocks (ARROW-2632, ARROW-2645)
> * Improved low-level handling of messages for RecordBatch (ARROW-2704)
[jira] [Created] (SPARK-25443) fix issues when building docs with release scripts in docker
Wenchen Fan created SPARK-25443:

Summary: fix issues when building docs with release scripts in docker
Key: SPARK-25443
URL: https://issues.apache.org/jira/browse/SPARK-25443
Project: Spark
Issue Type: Bug
Components: Project Infra
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
[jira] [Commented] (SPARK-25430) Add map parameter for withColumnRenamed
[ https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617066#comment-16617066 ]

Goun Na commented on SPARK-25430:

Thanks [~hyukjin.kwon]. I will remember your guide.

> Add map parameter for withColumnRenamed
>
> Key: SPARK-25430
> URL: https://issues.apache.org/jira/browse/SPARK-25430
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Goun Na
> Priority: Major
>
> The withColumnRenamed method should also accept a Map parameter; this removes code redundancy.
> {code:java}
> // example
> df.withColumnRenamed(Map("c1" -> "first_column", "c2" -> "second_column"))
> {code}
> {code:java}
> // from abbreviated columns to descriptive columns
> val m = Map("c1" -> "first_column", "c2" -> "second_column")
> df1.withColumnRenamed(m)
> df2.withColumnRenamed(m)
> {code}
> It is useful for CJK users working on analysis in notebook environments such as Zeppelin, Databricks, or Apache Toree.
> {code:java}
> // for CJK users: once the dictionary is defined as a map, reuse the column map
> // to translate columns whenever report visualization is required
> val m = Map("c1" -> "컬럼_1", "c2" -> "컬럼_2")
> df1.withColumnRenamed(m)
> df2.withColumnRenamed(m)
> {code}
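The proposed Map-based rename is equivalent to folding the existing single-column rename over the map's entries. A minimal sketch in Python (`with_columns_renamed` is a hypothetical helper name; it assumes only that `df` exposes a `withColumnRenamed(old, new)` method returning a new frame, like Spark's DataFrame):

```python
from functools import reduce

def with_columns_renamed(df, mapping):
    """Rename many columns at once by chaining single renames.

    Sketch of the behavior proposed in SPARK-25430, assuming df has a
    Spark-style withColumnRenamed(old, new) that returns a new frame.
    """
    return reduce(lambda d, kv: d.withColumnRenamed(*kv), mapping.items(), df)
```

The fold keeps the existing API untouched; a built-in overload would simply move this loop inside DataFrame.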
[jira] [Closed] (SPARK-25435) df = sqlContext.read.json("examples/src/main/resources/people.json")
[ https://issues.apache.org/jira/browse/SPARK-25435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

WEI PENG closed SPARK-25435.

> df = sqlContext.read.json("examples/src/main/resources/people.json")
>
> Key: SPARK-25435
> URL: https://issues.apache.org/jira/browse/SPARK-25435
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.3.1
> Reporter: WEI PENG
> Priority: Major
>
> at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
> at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
> at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:431)
> at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.createBaseDataset(JsonDataSource.scala:110)
> at org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.infer(JsonDataSource.scala:95)
> at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
> at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
> at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
> at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
> at scala.Option.orElse(Option.scala:289)
> at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@2604c310, see the next exception for details.
> at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
> at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
> ... 123 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database C:\Users\WEI\metastore_db.
> at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
> at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
> at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
> at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
> at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
> at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
> at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
> at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
> at org.apache.derby.impl.services.monitor.FileMonitor.startModule(Unknown Source)
> at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
> at org.apache.derby.impl.store.raw.RawStore$6.run(Unknown Source)
> at java.security.AccessController.doPrivileged(Native Method)
> at org.apache.derby.impl.store.raw.RawStore.bootServiceModule(Unknown Source)
> at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
> at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
> at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
> at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
> at
[jira] [Resolved] (SPARK-25435) df = sqlContext.read.json("examples/src/main/resources/people.json")
[ https://issues.apache.org/jira/browse/SPARK-25435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

WEI PENG resolved SPARK-25435.

Resolution: Incomplete

> df = sqlContext.read.json("examples/src/main/resources/people.json")
>
> Key: SPARK-25435
> URL: https://issues.apache.org/jira/browse/SPARK-25435
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.3.1
> Reporter: WEI PENG
> Priority: Major
[jira] [Closed] (SPARK-25434) failed to locate the winutils binary in the hadoop binary path
[ https://issues.apache.org/jira/browse/SPARK-25434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

WEI PENG closed SPARK-25434.

Problem resolved

> failed to locate the winutils binary in the hadoop binary path
>
> Key: SPARK-25434
> URL: https://issues.apache.org/jira/browse/SPARK-25434
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Shell
> Affects Versions: 2.3.1
> Reporter: WEI PENG
> Priority: Major
>
> C:\Users\WEI>pyspark
> Python 3.5.6 |Anaconda custom (64-bit)| (default, Aug 26 2018, 16:05:27) [MSC v.1900 64 bit (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> 2018-09-14 21:12:39 ERROR Shell:397 - Failed to locate the winutils binary in the hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
> at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
> at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
> at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:80)
> at org.apache.hadoop.security.SecurityUtil.getAuthenticationMethod(SecurityUtil.java:611)
> at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:273)
> at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:261)
> at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:791)
> at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:761)
> at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:634)
> at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2467)
> at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2467)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2467)
> at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:220)
> at org.apache.spark.deploy.SparkSubmit$.secMgr$lzycompute$1(SparkSubmit.scala:408)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$secMgr$1(SparkSubmit.scala:408)
> at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironment$7.apply(SparkSubmit.scala:416)
> at org.apache.spark.deploy.SparkSubmit$$anonfun$doPrepareSubmitEnvironment$7.apply(SparkSubmit.scala:416)
> at scala.Option.map(Option.scala:146)
> at org.apache.spark.deploy.SparkSubmit$.doPrepareSubmitEnvironment(SparkSubmit.scala:415)
> at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:250)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:171)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 2018-09-14 21:12:39 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
>       /_/
>
> Using Python version 3.5.6 (default, Aug 26 2018 16:05:27)
> SparkSession available as 'spark'.
> >>>
[jira] [Commented] (SPARK-25442) Support STS to run in K8S deployment with spark deployment mode as cluster
[ https://issues.apache.org/jira/browse/SPARK-25442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617018#comment-16617018 ]

Suryanarayana Garlapati commented on SPARK-25442:

There was an earlier PR (https://github.com/apache/spark/pull/20272) that contained a partial fix. PR [https://github.com/apache/spark/pull/22433] provides the complete solution.

> Support STS to run in K8S deployment with spark deployment mode as cluster
>
> Key: SPARK-25442
> URL: https://issues.apache.org/jira/browse/SPARK-25442
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, SQL
> Affects Versions: 2.4.0, 2.5.0
> Reporter: Suryanarayana Garlapati
> Priority: Major
>
> STS fails to start in Kubernetes deployments when the spark deploy mode is cluster. Support should be added to make it run in K8S deployments.
[jira] [Commented] (SPARK-23200) Reset configuration when restarting from checkpoints
[ https://issues.apache.org/jira/browse/SPARK-23200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617016#comment-16617016 ]

Wenchen Fan commented on SPARK-23200:

We should definitely merge it to branch-2.4, but I won't block the release, since it's not that critical and it's still in progress. After it's merged, feel free to vote -1 on the RC voting email to include this change, if necessary.

> Reset configuration when restarting from checkpoints
>
> Key: SPARK-23200
> URL: https://issues.apache.org/jira/browse/SPARK-23200
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 2.4.0
> Reporter: Anirudh Ramanathan
> Priority: Major
>
> Streaming workloads that restart from checkpoints may need additional changes, i.e. resetting properties; see https://github.com/apache-spark-on-k8s/spark/pull/516
[jira] [Commented] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance
[ https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616994#comment-16616994 ]

Hyukjin Kwon commented on SPARK-25437:

Please feel the JIRA description.

> Using OpenHashMap replace HashMap improve Encoder Performance
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: wangjiaochun
> Priority: Minor
[jira] [Commented] (SPARK-25435) df = sqlContext.read.json("examples/src/main/resources/people.json")
[ https://issues.apache.org/jira/browse/SPARK-25435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616993#comment-16616993 ]

Hyukjin Kwon commented on SPARK-25435:

Can you provide reproducible steps to verify this issue, rather than just posting an error message?

> df = sqlContext.read.json("examples/src/main/resources/people.json")
>
> Key: SPARK-25435
> URL: https://issues.apache.org/jira/browse/SPARK-25435
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.3.1
> Reporter: WEI PENG
> Priority: Major
[jira] [Comment Edited] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance
[ https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616994#comment-16616994 ]

Hyukjin Kwon edited comment on SPARK-25437 at 9/17/18 2:13 AM:

Please fill the JIRA description.

was (Author: hyukjin.kwon): Please feel the JIRA description.

> Using OpenHashMap replace HashMap improve Encoder Performance
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: wangjiaochun
> Priority: Minor
[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616992#comment-16616992 ] Hyukjin Kwon commented on SPARK-25433: -- What are the advantages of adding this, and what are the alternatives? A JIRA should describe what to fix, not how to fix it. > Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.2 >Reporter: Fabian Höring >Priority: Minor > > This has been partly discussed in SPARK-13587. > I would like to provision the executors with a PEX package. I created a PR > with the minimal necessary changes in PythonWorkerFactory. > PR: [https://github.com/apache/spark/pull/22422/files] > To run it, one needs to set the PYSPARK_PYTHON & PYSPARK_DRIVER_PYTHON variables > to the pex file and upload the pex file to the executors via > sparkContext.addFile or by setting the spark config > spark.yarn.dist.files/spark.file properties. > It is also necessary to set the PEX_ROOT environment variable. By default > it tries to access /home/.pex inside the executors, and this fails. > Ideally, as this configuration is quite cumbersome, it would be interesting > to also add a parameter --pexFile to SparkContext and spark-submit in order > to directly provide a pex file, with everything else handled automatically. Please > tell me what you think of this. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
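The configuration described in the issue can be sketched as follows. This is a hypothetical illustration, not the proposed patch: the pex file name, the job script, and the YARN master are placeholders, and only the environment variables and config key named in the description are used.

```python
import os

# Hypothetical artifact names; substitute your own pex file and job script.
pex_file = "myproject.pex"

env = dict(os.environ)
# Point both the driver and the executors at the interpreter inside the pex.
env["PYSPARK_PYTHON"] = "./" + pex_file
env["PYSPARK_DRIVER_PYTHON"] = "./" + pex_file
# By default pex unpacks under /home/.pex, which fails inside executors;
# point PEX_ROOT at a writable directory instead.
env["PEX_ROOT"] = "/tmp/.pex"

# Ship the pex file to the executors via spark.yarn.dist.files
# (sparkContext.addFile would be the programmatic alternative).
submit_cmd = [
    "spark-submit",
    "--master", "yarn",
    "--conf", "spark.yarn.dist.files=" + pex_file,
    "my_job.py",
]
```

The `--pexFile` flag proposed at the end of the description would collapse all of the above into a single option.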
[jira] [Commented] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616984#comment-16616984 ] Hyukjin Kwon commented on SPARK-25424: -- Please avoid setting a target version; it is usually reserved for committers. See https://spark.apache.org/contributing.html and open a PR, which will automatically assign the appropriate one. > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Priority: Major > Fix For: 2.4.0 > > > In the TimeWindow class, window duration and slide duration should not be allowed > to take negative values. > Currently this behaviour is enforced by Catalyst. It could instead be enforced by the > constructor of TimeWindow, allowing it to fail fast. > For example, the code below throws the following error. Note that the error is > produced at the time of the count() call instead of the window() call. 
> {code:java} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code:java} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at >
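To illustrate the proposed change in plain terms, here is a minimal Python sketch of constructor-time (fail-fast) validation. It is not Spark's actual TimeWindow class, just the shape of the check: validating eagerly at construction instead of letting the analyzer reject the plan later, at action time.

```python
class TimeWindow:
    """Minimal sketch of fail-fast duration validation (not Spark's class)."""

    def __init__(self, window_duration_ms: int, slide_duration_ms: int):
        # Reject bad durations here, at window() time, rather than
        # surfacing an AnalysisException only when count() runs.
        if window_duration_ms <= 0:
            raise ValueError(
                f"The window duration ({window_duration_ms}) must be greater than 0.")
        if slide_duration_ms <= 0:
            raise ValueError(
                f"The slide duration ({slide_duration_ms}) must be greater than 0.")
        self.window_duration_ms = window_duration_ms
        self.slide_duration_ms = slide_duration_ms
```

With this shape of check, a call equivalent to window($"timestamp", "-10 seconds", "5 seconds") would fail immediately at construction instead of at the count() call.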
[jira] [Commented] (SPARK-25429) SparkListenerBus inefficient due to 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure
[ https://issues.apache.org/jira/browse/SPARK-25429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616985#comment-16616985 ] Hyukjin Kwon commented on SPARK-25429: -- PR https://github.com/apache/spark/pull/22420 > SparkListenerBus inefficient due to > 'LiveStageMetrics#accumulatorIds:Array[Long]' data structure > > > Key: SPARK-25429 > URL: https://issues.apache.org/jira/browse/SPARK-25429 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: DENG FEI >Priority: Major > > {code:java} > private def updateStageMetrics( > stageId: Int, > attemptId: Int, > taskId: Long, > accumUpdates: Seq[AccumulableInfo], > succeeded: Boolean): Unit = { > Option(stageMetrics.get(stageId)).foreach { metrics => > if (metrics.attemptId != attemptId || metrics.accumulatorIds.isEmpty) { > return > } > val oldTaskMetrics = metrics.taskMetrics.get(taskId) > if (oldTaskMetrics != null && oldTaskMetrics.succeeded) { > return > } > val updates = accumUpdates > .filter { acc => acc.update.isDefined && > metrics.accumulatorIds.contains(acc.id) } > .sortBy(_.id) > if (updates.isEmpty) { > return > } > val ids = new Array[Long](updates.size) > val values = new Array[Long](updates.size) > updates.zipWithIndex.foreach { case (acc, idx) => > ids(idx) = acc.id > // In a live application, accumulators have Long values, but when > reading from event > // logs, they have String values. For now, assume all accumulators > are Long and covert > // accordingly. > values(idx) = acc.update.get match { > case s: String => s.toLong > case l: Long => l > case o => throw new IllegalArgumentException(s"Unexpected: $o") > } > } > // TODO: storing metrics by task ID can cause metrics for the same task > index to be > // counted multiple times, for example due to speculation or > re-attempts. 
> metrics.taskMetrics.put(taskId, new LiveTaskMetrics(ids, values, > succeeded)) > } > } > {code} > Regarding 'metrics.accumulatorIds.contains(acc.id)': if a large SQL application generates > many accumulators, it is inefficient to use Array#contains. > In practice, the application may time out while quitting and be killed by the RM in YARN > mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
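The inefficiency the reporter points out — a linear Array#contains per accumulator update — can be demonstrated outside Spark with a small sketch. This is a Python stand-in, not Spark code: membership in a hash-based structure costs O(1) per lookup versus O(n) for an array scan, while producing exactly the same filtering result.

```python
# Hypothetical accumulator IDs; in Spark these would come from LiveStageMetrics.
accumulator_ids = list(range(100_000))     # array-like, linear scan per lookup
accumulator_id_set = set(accumulator_ids)  # hash-based, O(1) per lookup

updates = [{"id": 99_999, "update": 7}, {"id": 100_001, "update": 3}]

# Linear scan, analogous to metrics.accumulatorIds.contains(acc.id).
slow = [u for u in updates
        if u["update"] is not None and u["id"] in accumulator_ids]

# Hash lookup, analogous to switching to an OpenHashSet-style structure.
fast = [u for u in updates
        if u["update"] is not None and u["id"] in accumulator_id_set]

assert slow == fast  # same result, very different cost per accumulator update
```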
[jira] [Updated] (SPARK-25424) Window duration and slide duration with negative values should fail fast
[ https://issues.apache.org/jira/browse/SPARK-25424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25424: - Target Version/s: (was: 2.4.0) > Window duration and slide duration with negative values should fail fast > > > Key: SPARK-25424 > URL: https://issues.apache.org/jira/browse/SPARK-25424 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Raghav Kumar Gautam >Priority: Major > Fix For: 2.4.0 > > > In TimeWindow class window duration and slide duration should not be allowed > to take negative values. > Currently this behaviour enforced by catalyst. It can be enforced by > constructor of TimeWindow allowing it to fail fast. > For e.g. the code below throws following error. Note that the error is > produced at the time of count() call instead of window() call. > {code:java} > val df = spark.readStream > .format("rate") > .option("numPartitions", "2") > .option("rowsPerSecond", "10") > .load() > .filter("value % 20 == 0") > .withWatermark("timestamp", "10 seconds") > .groupBy(window($"timestamp", "-10 seconds", "5 seconds")) > .count() > {code} > Error: > {code:java} > cannot resolve 'timewindow(timestamp, -1000, 500, 0)' due to data > type mismatch: The window duration (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > 
org.apache.spark.sql.AnalysisException: cannot resolve 'timewindow(timestamp, > -1000, 500, 0)' due to data type mismatch: The window duration > (-1000) must be greater than 0.;; > 'Aggregate [timewindow(timestamp#47-T1ms, -1000, 500, 0)], > [timewindow(timestamp#47-T1ms, -1000, 500, 0) AS window#53, > count(1) AS count#57L] > +- AnalysisBarrier > +- EventTimeWatermark timestamp#47: timestamp, interval 10 seconds > +- Filter ((value#48L % cast(20 as bigint)) = cast(0 as bigint)) > +- StreamingRelationV2 > org.apache.spark.sql.execution.streaming.RateSourceProvider@52e44f71, rate, > Map(rowsPerSecond -> 10, numPartitions -> 2), [timestamp#47, value#48L], > StreamingRelation > DataSource(org.apache.spark.sql.SparkSession@221961f2,rate,List(),None,List(),None,Map(rowsPerSecond > -> 10, numPartitions -> 2),None), rate, [timestamp#45, value#46L] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:93) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:95) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122) > at >
[jira] [Commented] (SPARK-25430) Add map parameter for withColumnRenamed
[ https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616981#comment-16616981 ] Hyukjin Kwon commented on SPARK-25430: -- Please avoid setting the target version, which is usually reserved for committers. > Add map parameter for withColumnRenamed > --- > > Key: SPARK-25430 > URL: https://issues.apache.org/jira/browse/SPARK-25430 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Goun Na >Priority: Major > > The withColumnRenamed method should accept a map parameter. This removes code > redundancy. > {code:java} > // example > df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" > )){code} > {code:java} > // from abbr columns to desc columns > val m = Map( "c1" -> "first_column", "c2" -> "second_column" ) > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} > It is useful for CJK users when they are working on analysis in notebook > environments such as Zeppelin, Databricks, and Apache Toree. > {code:java} > // for CJK users once define dictionary into map, reuse column map to > translate columns whenever report visualization is required > val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
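The proposal can be sketched outside Spark as a helper that folds a rename map over a schema. The function name is illustrative; in Spark the same effect could be achieved by repeatedly calling the existing single-pair df.withColumnRenamed(old, new), while here a schema is modelled as a plain list of names.

```python
def with_columns_renamed(columns, mapping):
    """Rename columns via a map, leaving unmapped names untouched.

    Hypothetical stand-in for the proposed DataFrame API; `columns` models
    the schema and `mapping` is {old_name: new_name}.
    """
    return [mapping.get(c, c) for c in columns]


renamed = with_columns_renamed(
    ["c1", "c2", "c3"],
    {"c1": "first_column", "c2": "second_column"},
)
# renamed == ["first_column", "second_column", "c3"]
```

Defining the mapping once and reusing it across several DataFrames is exactly the redundancy reduction the issue asks for.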
[jira] [Updated] (SPARK-25430) Add map parameter for withColumnRenamed
[ https://issues.apache.org/jira/browse/SPARK-25430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25430: - Target Version/s: (was: 2.4.0) > Add map parameter for withColumnRenamed > --- > > Key: SPARK-25430 > URL: https://issues.apache.org/jira/browse/SPARK-25430 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Goun Na >Priority: Major > > WithColumnRenamed method should work with map parameter. It removes code > redundancy. > {code:java} > // example > df.withColumnRenamed(Map( "c1" -> "first_column", "c2" -> "second_column" > )){code} > {code:java} > // from abbr columns to desc columns > val m = Map( "c1" -> "first_column", "c2" -> "second_column" ) > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} > It is useful for CJK users when they are working on analysis in notebook > environment such as Zeppelin, Databricks, Apache Toree. > {code:java} > // for CJK users once define dictionary into map, reuse column map to > translate columns whenever report visualization is required > val m = Map( "c1" -> "컬럼_1", "c2" -> "컬럼_2") > df1.withColumnRenamed(m) > df2.withColumnRenamed(m) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huanghuai resolved SPARK-25420. --- Resolution: Fixed > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25420) Dataset.count() every time is different.
[ https://issues.apache.org/jira/browse/SPARK-25420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616978#comment-16616978 ] huanghuai commented on SPARK-25420: --- Because every time I use dropDuplicates(), the rows are in a different order, and each time a different first row is kept after the group by. That's why one or two records are missing every time. No matter how I use a sort before or after dropDuplicates(), it is no use. Finally, I used dataset.cache() after dropDuplicates(). > Dataset.count() every time is different. > - > > Key: SPARK-25420 > URL: https://issues.apache.org/jira/browse/SPARK-25420 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.0 > Environment: spark2.3 > standalone >Reporter: huanghuai >Priority: Major > Labels: SQL > > Dataset dataset = sparkSession.read().format("csv").option("sep", > ",").option("inferSchema", "true") > .option("escape", Constants.DEFAULT_CSV_ESCAPE).option("header", "true") > .option("encoding", "UTF-8") > .load("hdfs://192.168.1.26:9000/data/caopan/07-08_WithHead30M.csv"); > System.out.println("source count="+dataset.count()); > Dataset dropDuplicates = dataset.dropDuplicates(new > String[]\{"DATE","TIME","VEL","COMPANY"}); > System.out.println("dropDuplicates count1="+dropDuplicates.count()); > System.out.println("dropDuplicates count2="+dropDuplicates.count()); > Dataset filter = dropDuplicates.filter("jd > 120.85 and wd > 30.66 > and (status = 0 or status = 1)"); > System.out.println("filter count1="+filter.count()); > System.out.println("filter count2="+filter.count()); > System.out.println("filter count3="+filter.count()); > System.out.println("filter count4="+filter.count()); > System.out.println("filter count5="+filter.count()); > > > --The above is code > --- > > > console output: > source count=459275 > dropDuplicates count1=453987 > dropDuplicates count2=453987 > filter count1=445798 > filter count2=445797 > filter count3=445797 > filter 
count4=445798 > filter count5=445799 > > question: > > Why is filter.count() different everytime? > if I remove dropDuplicates() everything will be ok!! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
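The behaviour described in the comment can be modelled outside Spark: dropDuplicates keeps one arbitrary row per key, so when the row ordering differs between runs (as it can across shuffles), a different row survives, and a subsequent filter on a non-key column can then match a different number of rows. A hedged sketch with hypothetical data:

```python
def drop_duplicates(rows, keys):
    """Keep the first row seen for each key tuple (order-dependent)."""
    seen, out = set(), []
    for row in rows:
        k = tuple(row[key] for key in keys)
        if k not in seen:
            seen.add(k)
            out.append(row)
    return out


# Two rows with the same dedup key but different non-key values.
rows = [{"DATE": "07-08", "status": 0}, {"DATE": "07-08", "status": 9}]

kept_a = drop_duplicates(rows, ["DATE"])
kept_b = drop_duplicates(list(reversed(rows)), ["DATE"])

# Each "run" keeps exactly one row per key, but WHICH row depends on order,
# so a filter like "status = 0" matches 1 row in one run and 0 in the other.
count_a = sum(1 for r in kept_a if r["status"] == 0)
count_b = sum(1 for r in kept_b if r["status"] == 0)
```

Caching after dropDuplicates (as the reporter did) pins one materialized choice of survivors, which is why it makes the counts stable.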
[jira] [Commented] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616955#comment-16616955 ] Dongjoon Hyun commented on SPARK-25423: --- [~yumwang]'s PR link is added. > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Maryann Xue >Assignee: Yuming Wang >Priority: Trivial > Labels: starter > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616953#comment-16616953 ] Dongjoon Hyun commented on SPARK-25423: --- For a new feature, `Affects Version/s` should be the version of the master branch. > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Maryann Xue >Priority: Trivial > Labels: starter > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-25423: - Assignee: Yuming Wang > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Maryann Xue >Assignee: Yuming Wang >Priority: Trivial > Labels: starter > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25423) Output "dataFilters" in DataSourceScanExec.metadata
[ https://issues.apache.org/jira/browse/SPARK-25423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25423: -- Affects Version/s: (was: 2.3.1) 2.5.0 > Output "dataFilters" in DataSourceScanExec.metadata > --- > > Key: SPARK-25423 > URL: https://issues.apache.org/jira/browse/SPARK-25423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.5.0 >Reporter: Maryann Xue >Priority: Trivial > Labels: starter > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25153) Improve error messages for columns with dots/periods
[ https://issues.apache.org/jira/browse/SPARK-25153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fernando Díaz updated SPARK-25153: -- Comment: was deleted (was: I will take a look at it. Quick question: Given a dataframe with just one column named "Spark.Dots", should the error appear for both of the following examples? {code:java} df.col("Spark.Dots") df.col("Completely.Different.Name"){code} In the first case, Spark will try to extract the value "Dots" from the column. In the second case, it won't be able to resolve the column because there is no column with the qualifier "Completely".) > Improve error messages for columns with dots/periods > > > Key: SPARK-25153 > URL: https://issues.apache.org/jira/browse/SPARK-25153 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > Labels: starter > > When we fail to resolve a column name with a dot in it, and the column name > is present as a string literal the error message could mention using > backticks to have the string treated as a literal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
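The improvement the issue asks for can be sketched with a hypothetical resolver (this is not Spark's analyzer; the function and message wording are illustrative): when a dotted name fails to resolve, the error message should suggest backtick-quoting so the whole string is treated as a single column name rather than a struct-field access.

```python
def resolve_column(columns, name):
    """Resolve `name` against a list of column names (illustrative only)."""
    if name in columns:
        return name
    msg = f"Cannot resolve column name {name!r} among ({', '.join(columns)})"
    if "." in name:
        # The suggested improvement: hint at backtick-quoting for dotted names,
        # since the analyzer otherwise tries to treat the dot as field access.
        msg += f"; if the column name contains dots, try quoting it: `{name}`"
    raise KeyError(msg)
```

With a column literally named "Spark.Dots", resolve_column(["Spark.Dots"], "Spark.Dots") succeeds, while resolving it against unrelated columns raises an error that carries the backtick hint.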
[jira] [Comment Edited] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616675#comment-16616675 ] Nir Hedvat edited comment on SPARK-25380 at 9/16/18 11:21 AM: -- Experiencing the same problem !image-2018-09-16-14-21-38-939.png! was (Author: nir hedvat): Same problem here (using Spark 2.3.1) > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyways. > Attached are screenshots of OOM heap dump opened in JVisualVM. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616675#comment-16616675 ] Nir Hedvat commented on SPARK-25380: Same problem here (using Spark 2.3.1) > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png > > > When debugging an OOM exception during long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyways. > Attached are screenshots of OOM heap dump opened in JVisualVM. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24315) Multiple streaming jobs detected error causing job failure
[ https://issues.apache.org/jira/browse/SPARK-24315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616643#comment-16616643 ] Marco Gaido commented on SPARK-24315: - [~joeyfezster] it has been a while, so I may be wrong, but IIRC this was caused by a corrupted checkpoint dir; deleting it and restarting the job solved the issue. > Multiple streaming jobs detected error causing job failure > -- > > Key: SPARK-24315 > URL: https://issues.apache.org/jira/browse/SPARK-24315 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Major > > We are running a simple structured streaming job. It reads data from Kafka > and writes it to HDFS. Unfortunately at startup, the application fails with > the following error. After some restarts the application finally starts > successfully. > {code} > org.apache.spark.sql.streaming.StreamingQueryException: assertion failed: > Concurrent update to the log. Multiple streaming jobs detected for 1 > === Streaming Query === > > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: java.lang.AssertionError: assertion failed: Concurrent update to > the log. 
Multiple streaming jobs detected for 1 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcV$sp(MicroBatchExecution.scala:339) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:338) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:128) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) > at > 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) > at > org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > ... 1 more > {code} > We have not set any value for `spark.streaming.concurrentJobs`. Our code > looks like: > {code} > // read from kafka > .withWatermark("timestamp", "30 minutes") > .groupBy(window($"timestamp", "1 hour", "30 minutes"), ...).count() > // simple select of some fields with casts > .coalesce(1) > .writeStream > .trigger(Trigger.ProcessingTime("2 minutes")) > .option("checkpointLocation", checkpointDir) > // write to HDFS > .start() > .awaitTermination() > {code} > This may also be related to the presence of some data in the kafka queue to > process, so the time for the first batch may be longer than usual (as it is > quite common I think). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-24315) Multiple streaming jobs detected error causing job failure
[ https://issues.apache.org/jira/browse/SPARK-24315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616622#comment-16616622 ]

Joey commented on SPARK-24315:
------------------------------

[~mgaido], can you please explain why this is not a bug? What could be the cause of this?

> Multiple streaming jobs detected error causing job failure
> ----------------------------------------------------------
>
>                 Key: SPARK-24315
>                 URL: https://issues.apache.org/jira/browse/SPARK-24315
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Marco Gaido
>            Priority: Major
>
> We are running a simple structured streaming job. It reads data from Kafka and writes it to HDFS. Unfortunately, at startup the application fails with the following error; after some restarts the application finally starts successfully.
> {code}
> org.apache.spark.sql.streaming.StreamingQueryException: assertion failed: Concurrent update to the log. Multiple streaming jobs detected for 1
> === Streaming Query ===
>
> 	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
> 	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: java.lang.AssertionError: assertion failed: Concurrent update to the log. Multiple streaming jobs detected for 1
> 	at scala.Predef$.assert(Predef.scala:170)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcV$sp(MicroBatchExecution.scala:339)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338)
> 	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> 	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:338)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:128)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> 	at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> 	at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
> 	at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
> 	at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
> 	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
> 	... 1 more
> {code}
> We have not set any value for `spark.streaming.concurrentJobs`. Our code looks like:
> {code}
> // read from kafka
> .withWatermark("timestamp", "30 minutes")
> .groupBy(window($"timestamp", "1 hour", "30 minutes"), ...).count()
> // simple select of some fields with casts
> .coalesce(1)
> .writeStream
> .trigger(Trigger.ProcessingTime("2 minutes"))
> .option("checkpointLocation", checkpointDir)
> // write to HDFS
> .start()
> .awaitTermination()
> {code}
> This may also be related to the presence of some data in the kafka queue to process, so the time for the first batch may be longer than usual (which I think is quite common).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
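[Editor's note] The assertion in the stack trace above comes from MicroBatchExecution.constructNextBatch: Spark writes the planned offsets for a batch to the checkpoint's offset log, and the write reports failure when an entry for that batch id already exists, which is how a second writer racing on the same checkpoint is detected. The following pure-Python sketch illustrates that mechanism only; the class and function names are hypothetical, not Spark APIs.

```python
class OffsetLog:
    """Toy stand-in for a checkpoint offset log keyed by batch id."""

    def __init__(self):
        self._entries = {}

    def add(self, batch_id, offsets):
        # Refuses to overwrite an existing entry and reports failure,
        # analogous to how Spark's metadata log add behaves.
        if batch_id in self._entries:
            return False
        self._entries[batch_id] = offsets
        return True


def construct_next_batch(log, batch_id, offsets):
    # Mirrors the shape of the assertion seen in the stack trace above.
    assert log.add(batch_id, offsets), (
        f"Concurrent update to the log. "
        f"Multiple streaming jobs detected for {batch_id}"
    )


log = OffsetLog()
construct_next_batch(log, 1, {"topic-0": 100})  # first writer wins

try:
    # A second writer (e.g. a leftover driver on the same checkpoint)
    # attempting the same batch trips the assertion.
    construct_next_batch(log, 1, {"topic-0": 100})
except AssertionError as e:
    print(e)  # Concurrent update to the log. Multiple streaming jobs detected for 1
```

This is consistent with the reporter's observation that restarts eventually succeed: once no second writer is racing on the checkpoint, the batch id is written exactly once.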
[jira] [Resolved] (SPARK-25391) Make behaviors consistent when converting parquet hive table to parquet data source
[ https://issues.apache.org/jira/browse/SPARK-25391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chenxiao Mao resolved SPARK-25391.
----------------------------------
    Resolution: Won't Do

> Make behaviors consistent when converting parquet hive table to parquet data source
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-25391
>                 URL: https://issues.apache.org/jira/browse/SPARK-25391
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Chenxiao Mao
>            Priority: Major
>
> Parquet data source tables and Hive Parquet tables resolve Parquet fields differently, so when {{spark.sql.hive.convertMetastoreParquet}} is true, users might see inconsistent behaviors. The differences are:
> * Whether {{spark.sql.caseSensitive}} is respected. Without SPARK-25132, neither data source tables nor Hive tables respect {{spark.sql.caseSensitive}}: data source tables always do case-sensitive Parquet field resolution, while Hive tables always do case-insensitive Parquet field resolution, regardless of whether {{spark.sql.caseSensitive}} is true or false. SPARK-25132 makes data source tables respect {{spark.sql.caseSensitive}}, while Hive serde table behavior is unchanged.
> * How ambiguity is resolved in case-insensitive mode. Without SPARK-25132, data source tables do case-sensitive resolution and return columns with the corresponding letter cases, while Hive tables always return the first matched column, ignoring case. SPARK-25132 makes data source tables throw an exception when there is ambiguity, while Hive table behavior is unchanged.
> This ticket aims to make behaviors consistent when converting a Hive table to a data source table:
> * The behavior must stay consistent across the conversion, so we skip the conversion in case-sensitive mode, because a Hive Parquet table always does case-insensitive field resolution.
> * In case-insensitive mode, when converting a Hive Parquet table to a Parquet data source, we switch the duplicated-fields resolution mode so that the Parquet data source picks the first matched field (the same behavior as a Hive Parquet table) to keep behaviors consistent.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
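[Editor's note] The three field-resolution behaviors the ticket contrasts (case-sensitive exact match; case-insensitive with an error on duplicates, as data source tables do after SPARK-25132; case-insensitive pick-first, the Hive-compatible mode the conversion would switch to) can be sketched in isolation. This is an illustrative pure-Python model, not Spark code; the function name and parameters are hypothetical.

```python
def resolve_field(name, parquet_fields, case_sensitive, pick_first=False):
    """Resolve a requested column name against physical Parquet field names.

    case_sensitive=True          exact-match resolution
    case_sensitive=False         case-insensitive; duplicates raise an error
    ..., pick_first=True         case-insensitive; first match wins
                                 (Hive-compatible behavior)
    """
    if case_sensitive:
        return name if name in parquet_fields else None
    matches = [f for f in parquet_fields if f.lower() == name.lower()]
    if not matches:
        return None
    if len(matches) > 1 and not pick_first:
        raise ValueError(
            f"Found duplicate field(s) {matches} in case-insensitive mode")
    return matches[0]


fields = ["id", "ID", "name"]

# Case-sensitive mode: only the exact spelling resolves.
assert resolve_field("ID", fields, case_sensitive=True) == "ID"
assert resolve_field("Id", fields, case_sensitive=True) is None

# Case-insensitive data source behavior: ambiguity is an error.
try:
    resolve_field("id", fields, case_sensitive=False)
except ValueError as e:
    print(e)

# Hive-compatible pick-first mode: the first match wins.
assert resolve_field("id", fields, case_sensitive=False, pick_first=True) == "id"
```

The ticket's point is that a converted table must not silently change which of these behaviors applies, which is why the conversion is skipped in case-sensitive mode and forced into pick-first mode in case-insensitive mode.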