[jira] [Created] (SPARK-25111) increment kinesis client/producer lib versions & aws-sdk to match
Steve Loughran created SPARK-25111: -- Summary: increment kinesis client/producer lib versions & aws-sdk to match Key: SPARK-25111 URL: https://issues.apache.org/jira/browse/SPARK-25111 Project: Spark Issue Type: Improvement Components: DStreams Affects Versions: 2.4.0 Reporter: Steve Loughran Move up to a more recent version of the kinesis client lib, matching aws SDK and producer versions. Proposed: move from 1.11.76 to 1.11.271. This is what hadoop-aws has been shipping in 3.1 and has been problem free so far (stable, no random warning messages in logs, etc). The hadoop-aws dependency is the shaded bundle, so not that of the kinesis SDK here, but at least it'll be the same version everywhere. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
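A bump like this is usually a one-line version-property change in the connector's POM. A hypothetical sketch of the shape of that change (the property name here is illustrative; the actual property in the Spark build may be named differently):

```xml
<!-- Illustrative POM fragment only; the real property name in
     Spark's build may differ. -->
<properties>
  <!-- AWS SDK: was 1.11.76; proposed 1.11.271 to match what
       hadoop-aws ships in Hadoop 3.1 -->
  <aws.java.sdk.version>1.11.271</aws.java.sdk.version>
</properties>
```

The kinesis client and producer libraries have their own version lines, so they would be bumped to whichever of their releases is built against this SDK version.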
[jira] [Assigned] (SPARK-22974) CountVectorModel does not attach attributes to output column
[ https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-22974: --- Assignee: Liang-Chi Hsieh > CountVectorModel does not attach attributes to output column > > > Key: SPARK-22974 > URL: https://issues.apache.org/jira/browse/SPARK-22974 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: William Zhang >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > If CountVectorModel transforms columns, the output column will not have > attributes attached to them. If later on, those columns are used in > Interaction transformer, an exception will be thrown: > {quote}"org.apache.spark.SparkException: Vector attributes must be defined > for interaction." > {quote} > To reproduce it: > {quote}import org.apache.spark.ml.feature._ > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq( > (0, Array("a", "b", "c"), Array("1", "2")), > (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3")) > )).toDF("id", "words", "nums") > val cvModel: CountVectorizerModel = new CountVectorizer() > .setInputCol("nums") > .setOutputCol("features2") > .setVocabSize(4) > .setMinDF(0) > .fit(df) > val cvm = new CountVectorizerModel(Array("a", "b", "c")) > .setInputCol("words") > .setOutputCol("features1") > > val df1 = cvm.transform(df) > val df2 = cvModel.transform(df1) > val interaction = new Interaction().setInputCols(Array("features1", > "features2")).setOutputCol("features") > val df3 = interaction.transform(df2) > {quote}
[jira] [Resolved] (SPARK-22974) CountVectorModel does not attach attributes to output column
[ https://issues.apache.org/jira/browse/SPARK-22974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-22974. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20313 [https://github.com/apache/spark/pull/20313] > CountVectorModel does not attach attributes to output column > > > Key: SPARK-22974 > URL: https://issues.apache.org/jira/browse/SPARK-22974 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: William Zhang >Priority: Major > Fix For: 2.4.0 > > > If CountVectorModel transforms columns, the output column will not have > attributes attached to them. If later on, those columns are used in > Interaction transformer, an exception will be thrown: > {quote}"org.apache.spark.SparkException: Vector attributes must be defined > for interaction." > {quote} > To reproduce it: > {quote}import org.apache.spark.ml.feature._ > import org.apache.spark.sql.functions._ > val df = spark.createDataFrame(Seq( > (0, Array("a", "b", "c"), Array("1", "2")), > (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3")) > )).toDF("id", "words", "nums") > val cvModel: CountVectorizerModel = new CountVectorizer() > .setInputCol("nums") > .setOutputCol("features2") > .setVocabSize(4) > .setMinDF(0) > .fit(df) > val cvm = new CountVectorizerModel(Array("a", "b", "c")) > .setInputCol("words") > .setOutputCol("features1") > > val df1 = cvm.transform(df) > val df2 = cvModel.transform(df1) > val interaction = new Interaction().setInputCols(Array("features1", > "features2")).setOutputCol("features") > val df3 = interaction.transform(df2) > {quote}
[jira] [Resolved] (SPARK-25104) Validate user specified output schema
[ https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-25104. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22094 [https://github.com/apache/spark/pull/22094] > Validate user specified output schema > - > > Key: SPARK-25104 > URL: https://issues.apache.org/jira/browse/SPARK-25104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > With the code changes in > [https://github.com/apache/spark/pull/21847], Spark can write out data as per the user-provided output schema. > To make it more robust and user friendly, we should validate the Avro schema > before tasks are launched. > Also we should support outputting the logical decimal type as BYTES (by default we > output it as FIXED)
[jira] [Assigned] (SPARK-25104) Validate user specified output schema
[ https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-25104: --- Assignee: Gengliang Wang > Validate user specified output schema > - > > Key: SPARK-25104 > URL: https://issues.apache.org/jira/browse/SPARK-25104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > With the code changes in > [https://github.com/apache/spark/pull/21847], Spark can write out data as per the user-provided output schema. > To make it more robust and user friendly, we should validate the Avro schema > before tasks are launched. > Also we should support outputting the logical decimal type as BYTES (by default we > output it as FIXED)
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579243#comment-16579243 ] Wenchen Fan commented on SPARK-24771: - It's good to pay more attention to compatibility issues. I've added the release-notes label to this ticket and created https://issues.apache.org/jira/browse/SPARK-25110 to track the Flume streaming connector. I'm not sure how many users would use avro with RDD, but I feel it should be rare as there was a spark-avro package available for Spark SQL. Shall we send an email to the user/dev list? > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > >
[jira] [Commented] (SPARK-23050) Structured Streaming with S3 file source duplicates data because of eventual consistency.
[ https://issues.apache.org/jira/browse/SPARK-23050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579242#comment-16579242 ] bharath kumar avusherla commented on SPARK-23050: - [~ste...@apache.org], I can start working on it. > Structured Streaming with S3 file source duplicates data because of eventual > consistency. > - > > Key: SPARK-23050 > URL: https://issues.apache.org/jira/browse/SPARK-23050 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Yash Sharma >Priority: Major > > Spark Structured streaming with S3 file source duplicates data because of > eventual consistency. > Reproducing the scenario: > - Structured streaming reading from S3 source. Writing back to S3. > - Spark tries to commitTask on completion of a task, by verifying if all the > files have been written to the Filesystem. > {{ManifestFileCommitProtocol.commitTask}}. > - [Eventual consistency issue] Spark finds that the file is not present and > fails the task. {{org.apache.spark.SparkException: Task failed while writing > rows. No such file or directory > 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet'}} > - By this time S3 eventually gets the file. > - Spark reruns the task and completes the task, but gets a new file name this > time. {{ManifestFileCommitProtocol.newTaskTempFile. > part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet.}} > - Data duplicates in results and the same data is processed twice and written > to S3. > - There is no data duplication if spark is able to list the presence of all > committed files and all tasks succeed.
> Code: > {code} > query = selected_df.writeStream \ > .format("parquet") \ > .option("compression", "snappy") \ > .option("path", "s3://path/data/") \ > .option("checkpointLocation", "s3://path/checkpoint/") \ > .start() > {code} > Same sized duplicate S3 Files: > {code} > $ aws s3 ls s3://path/data/ | grep part-00256 > 2018-01-11 03:37:00 17070 > part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet > 2018-01-11 03:37:10 17070 > part-00256-b62fa7a4-b7e0-43d6-8c38-9705076a7ee1-c000.snappy.parquet > {code} > Exception on S3 listing and task failure: > {code} > [Stage 5:>(277 + 100) / > 597]18/01/11 03:36:59 WARN TaskSetManager: Lost task 256.0 in stage 5.0 (TID > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: No such file or directory > 's3://path/data/part-00256-65ae782d-e32e-48fb-8652-e1d0defc370b-c000.snappy.parquet' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:816) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:509) > at > 
org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol$$anonfun$4.apply(ManifestFileCommitProtocol.scala:109) > at > org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol$$anonfun$4.apply(ManifestFileCommitProtocol.scala:109) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.streaming.ManifestFileCommitProtocol.commitTask(ManifestFileCommitProtocol.scala:109) > at >
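The failure mode above suggests one possible mitigation: instead of failing the task the first time the just-written file is not yet visible, re-check its presence with a short backoff before giving up. A minimal sketch of that idea in plain Python (this is not Spark's actual ManifestFileCommitProtocol code; the filesystem callback is hypothetical):

```python
import time

def wait_until_visible(fs_exists, path, attempts=5, delay_s=0.0):
    """Poll an eventually consistent store until `path` is listed,
    retrying a few times instead of failing on the first miss."""
    for _ in range(attempts):
        if fs_exists(path):
            return True
        time.sleep(delay_s)  # back off before re-checking
    return False

# Simulate S3-style eventual consistency: the file only "appears"
# on the third existence check.
calls = {"n": 0}
def eventually_consistent_exists(path):
    calls["n"] += 1
    return calls["n"] >= 3

# With retries, commitTask sees the file, so the task is not re-run
# under a new name and no duplicate output is produced.
assert wait_until_visible(eventually_consistent_exists,
                          "part-00256-c000.snappy.parquet")
```

This only papers over listing inconsistency; a consistent listing layer (or a consistent store) is the real fix.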
[jira] [Created] (SPARK-25110) make sure Flume streaming connector works with Spark 2.4
Wenchen Fan created SPARK-25110: --- Summary: make sure Flume streaming connector works with Spark 2.4 Key: SPARK-25110 URL: https://issues.apache.org/jira/browse/SPARK-25110 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan
[jira] [Updated] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24771: Labels: release-notes (was: ) > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > >
[jira] [Comment Edited] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579213#comment-16579213 ] Marcelo Vanzin edited comment on SPARK-24771 at 8/14/18 3:46 AM: - The main problem pointed out in the original attempt is AVRO-1502; it means that people with code generated by Avro 1.7 might run into problems if Spark ships with Avro 1.8; that basically amounts to a binary compatibility issue, which we always try not to break in minor releases. It may be that it only applies in some specific situations and it may be acceptable to release note it. But it would be nice to be sure, since this affects existing users of Avro - including the Flume streaming connector which is still available in Spark 2.4... (Edit: the flume connector re-compiles the avro interface during the Spark build, but it would be nice to check that the flume libraries themselves don't use avro for anything. And we still should check how much other users of avro would be affected - e.g. people who use avro directly in RDD code.) was (Author: vanzin): The main problem pointed out in the original attempt is AVRO-1502; it means that people with code generated by Avro 1.7 might run into problems if Spark ships with Avro 1.8; that basically amounts to a binary compatibility issue, which we always try not to break in minor releases. It may be that it only applies in some specific situations and it may be acceptable to release note it. But it would be nice to be sure, since this affects existing users of Avro - including the Flume streaming connector which is still available in Spark 2.4... 
> Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > >
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579213#comment-16579213 ] Marcelo Vanzin commented on SPARK-24771: The main problem pointed out in the original attempt is AVRO-1502; it means that people with code generated by Avro 1.7 might run into problems if Spark ships with Avro 1.8; that basically amounts to a binary compatibility issue, which we always try not to break in minor releases. It may be that it only applies in some specific situations and it may be acceptable to release note it. But it would be nice to be sure, since this affects existing users of Avro - including the Flume streaming connector which is still available in Spark 2.4... > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > >
[jira] [Commented] (SPARK-25068) High-order function: exists(array, function) → boolean
[ https://issues.apache.org/jira/browse/SPARK-25068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579210#comment-16579210 ] Takuya Ueshin commented on SPARK-25068: --- I added this because I thought this was a missing primitive operation for arrays, with better performance as you mentioned. I'm not sure whether we need {{forAll}} because generally we can rewrite {{forAll}} with {{exists}}. > High-order function: exists(array, function) → boolean > - > > Key: SPARK-25068 > URL: https://issues.apache.org/jira/browse/SPARK-25068 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.4.0 > > > Tests whether the array has at least one element for which the function returns true.
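The observation that {{forAll}} is expressible via {{exists}} is just De Morgan's law: forall(a, f) holds exactly when NOT exists(a, x -> NOT f(x)). A small illustration in plain Python (not the Spark SQL API):

```python
def exists(xs, p):
    """High-order exists: true if p holds for at least one element."""
    return any(p(x) for x in xs)

def for_all(xs, p):
    """forAll rewritten in terms of exists via De Morgan's law."""
    return not exists(xs, lambda x: not p(x))

assert exists([1, 2, 3], lambda x: x > 2)          # 3 > 2
assert not exists([1, 2, 3], lambda x: x > 5)
assert for_all([2, 4, 6], lambda x: x % 2 == 0)    # all even
assert not for_all([2, 3, 6], lambda x: x % 2 == 0)
```

Both short-circuit on the first witness/counterexample, which is the performance argument for having at least one of them as a primitive.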
[jira] [Updated] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect
[ https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated SPARK-25109: --- Description: We use this code to read parquet files from HDFS: spark.read.parquet('xxx') and get error as below: !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png! What we can get is that one of the replica block cannot be read for some reason, but spark python doesn't try to read another replica which can be read successfully. So the application fails after throwing exception. When I use hadoop fs -text to read the file, I can get content correctly. It would be great that spark python can retry reading another replica block instead of failing. was: We use this code to read parquet files from HDFS: spark.read.parquet('xxx') and get error as below: !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png! What we can get is that one of the replica block cannot be read for some reason, but spark python doesn't try to read another replica which can be read successfully. So the application fails after throwing exception. When I use hadoop fs -text to read the file, I can get content correctly. It would be great if spark python can retry reading another replica block instead of failing. > spark python should retry reading another datanode if the first one fails to > connect > > > Key: SPARK-25109 > URL: https://issues.apache.org/jira/browse/SPARK-25109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Yuanbo Liu >Priority: Major > Attachments: > WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png > > > We use this code to read parquet files from HDFS: > spark.read.parquet('xxx') > and get error as below: > !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png! > > What we can get is that one of the replica block cannot be read for some > reason, but spark python doesn't try to read another replica which can be > read successfully. 
So the application fails after throwing exception. > When I use hadoop fs -text to read the file, I can get content correctly. It > would be great that spark python can retry reading another replica block > instead of failing. >
[jira] [Updated] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect
[ https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated SPARK-25109: --- Description: We use this code to read parquet files from HDFS: spark.read.parquet('xxx') and get error as below: !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png! What we can get is that one of the replica block cannot be read for some reason, but spark python doesn't try to read another replica which can be read successfully. So the application fails after throwing exception. When I use hadoop fs -text to read the file, I can get content correctly. It would be great if spark python can retry reading another replica block instead of failing. was: We used this code to read parquet files from HDFS: spark.read.parquet('xxx') and got error as below: > spark python should retry reading another datanode if the first one fails to > connect > > > Key: SPARK-25109 > URL: https://issues.apache.org/jira/browse/SPARK-25109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Yuanbo Liu >Priority: Major > Attachments: > WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png > > > We use this code to read parquet files from HDFS: > spark.read.parquet('xxx') > and get error as below: > !WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png! > > What we can get is that one of the replica block cannot be read for some > reason, but spark python doesn't try to read another replica which can be > read successfully. So the application fails after throwing exception. > When I use hadoop fs -text to read the file, I can get content correctly. It > would be great if spark python can retry reading another replica block > instead of failing. >
[jira] [Resolved] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException
[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23308. -- Resolution: Won't Fix > ignoreCorruptFiles should not ignore retryable IOException > -- > > Key: SPARK-23308 > URL: https://issues.apache.org/jira/browse/SPARK-23308 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Márcio Furlani Carmona >Priority: Minor > > When `spark.sql.files.ignoreCorruptFiles` is set it totally ignores any kind > of RuntimeException or IOException, but some possible IOExceptions may happen > even if the file is not corrupted. > One example is the SocketTimeoutException which can be retried to possibly > fetch the data without meaning the data is corrupted. > > See: > https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163
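The distinction the ticket draws can be sketched as: retry transient errors such as socket timeouts, and only treat data as corrupt (and skippable under ignoreCorruptFiles) when the error is genuinely unrecoverable or retries are exhausted. A hypothetical sketch in plain Python, not Spark's FileScanRDD logic:

```python
import socket

# Errors worth retrying before concluding anything about the data.
RETRYABLE = (socket.timeout, ConnectionResetError)

def read_with_retry(read_fn, max_attempts=3, ignore_corrupt=False):
    """Retry transient I/O errors; only a persistent or non-retryable
    error is handled under the ignoreCorruptFiles policy."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_fn()
        except RETRYABLE:
            if attempt == max_attempts:
                if ignore_corrupt:
                    return None  # retries exhausted: skip, as today
                raise
        except OSError:
            if ignore_corrupt:
                return None  # non-retryable: skip if configured
            raise

# A read that times out twice, then succeeds, is not "corrupt".
state = {"n": 0}
def flaky_read():
    state["n"] += 1
    if state["n"] < 3:
        raise socket.timeout("read timed out")
    return b"row-data"

assert read_with_retry(flaky_read) == b"row-data"
```

The current behavior corresponds to catching everything in the `OSError` branch on the first attempt, which silently drops data on a mere timeout.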
[jira] [Resolved] (SPARK-24006) ExecutorAllocationManager.onExecutorAdded is an O(n) operation
[ https://issues.apache.org/jira/browse/SPARK-24006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24006. -- Resolution: Won't Fix I am resolving this assuming there's no more update on actual numbers. Please reopen later when it is found there's actual gain from this. > ExecutorAllocationManager.onExecutorAdded is an O(n) operation > -- > > Key: SPARK-24006 > URL: https://issues.apache.org/jira/browse/SPARK-24006 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Xianjin YE >Priority: Major > > The ExecutorAllocationManager.onExecutorAdded is an O(n) operation, I > believe it will be a problem when scaling out with large number of Executors > as it effectively makes adding N executors at time complexity O(N^2). > > I propose to invoke onExecutorIdle guarded by > {code:java} > if (executorIds.size - executorsPendingToRemove.size >= minNumExecutors +1) { > // Since we only need to re-remark idle executors when low bound > executorIds.filter(listener.isExecutorIdle).foreach(onExecutorIdle) > } else { > onExecutorIdle(executorId) > }{code} > cc [~zsxwing]
[jira] [Resolved] (SPARK-25086) Incorrect Default Value For "escape" For CSV Files
[ https://issues.apache.org/jira/browse/SPARK-25086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25086. -- Resolution: Duplicate > Incorrect Default Value For "escape" For CSV Files > -- > > Key: SPARK-25086 > URL: https://issues.apache.org/jira/browse/SPARK-25086 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: David Wilcox >Priority: Major > > The RFC for CSV files ([https://tools.ietf.org/html/rfc4180]) indicates that > the way that a double-quote is escaped is by preceding it with another > double-quote: > {code:java} > 7. If double-quotes are used to enclose fields, then a double-quote appearing > inside a field must be escaped by preceding it with another double quote. For > example: "aaa","b""bb","ccc"{code} > Your default value for "escape" violates the RFC. I think that we should fix > the default value to be {{"}}, and those that want {{\}} to escape can > override for non-RFC-conforming CSV files.
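RFC 4180's doubled-quote escaping is what Python's csv module emits by default (doublequote=True), which makes for a quick illustration of the convention the ticket wants as Spark's default:

```python
import csv
import io

# Write one row whose middle field contains an embedded double-quote.
buf = io.StringIO()
csv.writer(buf).writerow(['aaa', 'b"bb', 'ccc'])

# Per RFC 4180, the embedded quote is escaped by doubling it.
assert buf.getvalue() == 'aaa,"b""bb",ccc\r\n'

# And parsing it back recovers the original fields.
row = next(csv.reader(io.StringIO(buf.getvalue())))
assert row == ['aaa', 'b"bb', 'ccc']
```

In Spark today the RFC-conforming behavior can be opted into per read/write with `.option("escape", "\"")` on the CSV source; the ticket argues that should be the default.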
[jira] [Updated] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect
[ https://issues.apache.org/jira/browse/SPARK-25109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanbo Liu updated SPARK-25109: --- Attachment: WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png > spark python should retry reading another datanode if the first one fails to > connect > > > Key: SPARK-25109 > URL: https://issues.apache.org/jira/browse/SPARK-25109 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Yuanbo Liu >Priority: Major > Attachments: > WeChatWorkScreenshot_86b5-1d19-430a-a138-335e4bd3211c.png > > > We used this code to read parquet files from HDFS: > spark.read.parquet('xxx') > and got error as below: >
[jira] [Commented] (SPARK-24356) Duplicate strings in File.path managed by FileSegmentManagedBuffer
[ https://issues.apache.org/jira/browse/SPARK-24356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579184#comment-16579184 ] Imran Rashid commented on SPARK-24356: -- Somewhat related to SPARK-24938 -- that explains why these buffers are even on the heap at all, as spark configures netty to use offheap buffers by default. > Duplicate strings in File.path managed by FileSegmentManagedBuffer > -- > > Key: SPARK-24356 > URL: https://issues.apache.org/jira/browse/SPARK-24356 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Fix For: 2.4.0 > > Attachments: SPARK-24356.01.patch, dup-file-strings-details.png > > > I recently analyzed a heap dump of Yarn Node Manager that was suffering from > high GC pressure due to high object churn. Analysis was done with the jxray > tool ([www.jxray.com)|http://www.jxray.com)/] that checks a heap dump for a > number of well-known memory issues. One problem that it found in this dump is > 19.5% of memory wasted due to duplicate strings. Of these duplicates, more > than a half come from {{FileInputStream.path}} and {{File.path}}. All the > {{FileInputStream}} objects that JXRay shows are garbage - looks like they > are used for a very short period and then discarded (I guess there is a > separate question of whether that's a good pattern). But {{File}} instances > are traceable to > {{org.apache.spark.network.buffer.FileSegmentManagedBuffer.file}} field. 
Here > is the full reference chain: > > {code:java} > ↖java.io.File.path > ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file > ↖{j.u.ArrayList} > ↖j.u.ArrayList$Itr.this$0 > ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers > ↖{java.util.concurrent.ConcurrentHashMap}.values > ↖org.apache.spark.network.server.OneForOneStreamManager.streams > ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager > ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler > ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance > {code} > > Values of these {{File.path}}'s and {{FileInputStream.path}}'s look very > similar, so I think {{FileInputStream}}s are generated by the > {{FileSegmentManagedBuffer}} code. Instances of {{File}}, in turn, likely > come from > [https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263] > > To avoid duplicate strings in {{File.path}}'s in this case, it is suggested > that in the above code we create a File with a complete, normalized pathname, > that has been already interned. This will prevent the code inside > {{java.io.File}} from modifying this string, and thus it will use the > interned copy, and will pass it to FileInputStream. Essentially the current > line > {code:java} > return new File(new File(localDir, String.format("%02x", subDirId)), > filename);{code} > should be replaced with something like > {code:java} > String pathname = localDir + File.separator + String.format(...) + > File.separator + filename; > pathname = fileSystem.normalize(pathname).intern(); > return new File(pathname);{code} >
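The suggested fix relies on string interning: equal path strings collapse into one shared canonical object, so N references cost one string's worth of heap. The mechanism is easy to demonstrate in Python with sys.intern (Java's String.intern works analogously; the paths below are made up for illustration):

```python
import sys

local_dir, sub_dir, filename = "/data/blockmgr-01", "0a", "shuffle_0.0.data"

# Two independently built, equal path strings: normally distinct objects.
p1 = local_dir + "/" + sub_dir + "/" + filename
p2 = "/".join([local_dir, sub_dir, filename])
assert p1 == p2

# After interning, equal strings resolve to one canonical object,
# which is exactly the deduplication the JIRA proposes for File.path.
assert sys.intern(p1) is sys.intern(p2)
```

The other half of the proposal, passing an already-normalized pathname to `new File(...)`, matters because java.io.File otherwise rebuilds (and thus re-allocates) the path string internally, bypassing the interned copy.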
[jira] [Created] (SPARK-25109) spark python should retry reading another datanode if the first one fails to connect
Yuanbo Liu created SPARK-25109: -- Summary: spark python should retry reading another datanode if the first one fails to connect Key: SPARK-25109 URL: https://issues.apache.org/jira/browse/SPARK-25109 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.3.1 Reporter: Yuanbo Liu We used this code to read parquet files from HDFS: spark.read.parquet('xxx') and got error as below:
[jira] [Comment Edited] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579181#comment-16579181 ] Yuming Wang edited comment on SPARK-25051 at 8/14/18 3:13 AM: -- Yes. The bug only exists in branch-2.3. I can reproduce it by: {code} val df1 = spark.range(4).selectExpr("id", "cast(id as string) as name") val df2 = spark.range(3).selectExpr("id") df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull).show {code} was (Author: q79969786): Yes. The bug still exists. I can reproduce it by: {code:scala} val df1 = spark.range(4).selectExpr("id", "cast(id as string) as name") val df2 = spark.range(3).selectExpr("id") df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull).show {code} > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > > *schemas:* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. 
Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.(Dataset.scala:172) > at org.apache.spark.sql.Dataset.(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579181#comment-16579181 ] Yuming Wang commented on SPARK-25051: - Yes. The bug still exists. I can reproduce it by: {code:scala} val df1 = spark.range(4).selectExpr("id", "cast(id as string) as name") val df2 = spark.range(3).selectExpr("id") df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull).show {code} > where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > > *schemas:* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. 
Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.(Dataset.scala:172) > at org.apache.spark.sql.Dataset.(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579177#comment-16579177 ] Imran Rashid commented on SPARK-24938: -- yeah that's about what I expected. It's worse than 16MB per "service" in some cases, though -- it'll be 16 MB per netty thread, which will max out at 8. You'll see that a lot on the driver. So we could save 384 MB on the driver, and I think 128 MB in the external shuffle service (where only one service is active, I think). Did you see a corresponding increase in the offheap pools? I expected them to *not* grow (as spark actually only needs a tiny bit of space from these pools, so it should be able to find that space in the existing offheap pools). > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses a large amount of onheap memory in its pools, in > addition to the expected offheap memory, when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why it's using that memory, and whether it's really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB, which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
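The arithmetic behind the 128 MB / 384 MB figures in the comment above can be written out explicitly. This is a back-of-the-envelope sketch, not instrumentation output: the 16 MB chunk size and the cap of 8 arenas come from the discussion, while the assumption of roughly three netty services on the driver is inferred from the "could save 384 MB on the driver" estimate.

```java
public class NettyOnHeapEstimate {
    // One allocation burst can grow each arena by one chunk, so the
    // worst-case onheap footprint of a pool is chunkMb * arenas.
    static long perPoolMb(long chunkMb, int arenas) {
        return chunkMb * arenas;
    }

    public static void main(String[] args) {
        long chunkMb = 16;       // netty arena chunk size cited in the issue
        int arenas = 8;          // capped at 8 netty threads per service
        long perPool = perPoolMb(chunkMb, arenas);
        System.out.println(perPool + " MB per service pool");   // 128 MB
        int driverServices = 3;  // assumption: number of netty services on the driver
        System.out.println(perPool * driverServices + " MB on the driver"); // 384 MB
    }
}
```

This matches the comment: one pool in the external shuffle service accounts for the 128 MB estimate, and the driver's several services for the 384 MB one.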
[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuejianbest updated SPARK-25108: Description: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} Before and after fix, see attached pictures please . ![show](/user/desktop/doge.png) was: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} Before and after fix, see attached pictures please . > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell5 >Reporter: xuejianbest >Priority: Critical > Attachments: show.bmp > > > The Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. 
> {code:java} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} > > Before and after fix, see attached pictures please . > ![show](/user/desktop/doge.png) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuejianbest updated SPARK-25108: Description: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:scala} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} Before and after fix, see attached pictures please . ![show](https://issues.apache.org/jira/secure/attachment/12935462/show.bmp) was: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} Before and after fix, see attached pictures please . 
![show](/user/desktop/doge.png) > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell5 >Reporter: xuejianbest >Priority: Critical > Attachments: show.bmp > > > The Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. > {code:scala} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} > > Before and after fix, see attached pictures please . > ![show](https://issues.apache.org/jira/secure/attachment/12935462/show.bmp) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25108: Assignee: Apache Spark > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell5 >Reporter: xuejianbest >Assignee: Apache Spark >Priority: Critical > Attachments: show.bmp > > > The Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. > {code:java} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} > > Before and after fix, see attached pictures please . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25108: Assignee: (was: Apache Spark) > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell5 >Reporter: xuejianbest >Priority: Critical > Attachments: show.bmp > > > The Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. > {code:java} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} > > Before and after fix, see attached pictures please . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuejianbest updated SPARK-25108: Description: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} Before and after fix, see attached pictures please . was: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} Before and after fix, please see attached pictures. > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell5 >Reporter: xuejianbest >Priority: Critical > Attachments: show.bmp > > > The Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. 
> {code:java} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} > > Before and after fix, see attached pictures please . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuejianbest updated SPARK-25108: Description: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} Before and after fix, please see attached pictures. was: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell5 >Reporter: xuejianbest >Priority: Critical > Attachments: show.bmp > > > The Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. 
> {code:java} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} > > Before and after fix, please see attached pictures. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuejianbest updated SPARK-25108: Environment: spark-shell on Xshell5 (was: spark-shell on Xshell) > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell5 >Reporter: xuejianbest >Priority: Critical > Attachments: show.bmp > > > The Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. > {code:java} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuejianbest updated SPARK-25108: Attachment: show.bmp > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell >Reporter: xuejianbest >Priority: Critical > Attachments: show.bmp > > > The Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. > {code:java} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
[ https://issues.apache.org/jira/browse/SPARK-25108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuejianbest updated SPARK-25108: External issue URL: (was: https://github.com/apache/spark/pull/22048) Description: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") df.show /* +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ */{code} was: The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") before: +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ after fix: +--+ | value| +--+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| | jamón serrano| | pêches| |シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +--+ {code} > Dataset.show() generates incorrect padding for Unicode Character > > > Key: SPARK-25108 > URL: https://issues.apache.org/jira/browse/SPARK-25108 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.3.1 > Environment: spark-shell on Xshell >Reporter: xuejianbest >Priority: Critical > > The 
Dataset.show() method generates incorrect space padding since column name > or column value has Unicode Character. > {code:java} > val df = spark.createDataset(Seq( > "γύρος", > "pears", > "linguiça", > "xoriço", > "hamburger", > "éclair", > "smørbrød", > "spätzle", > "包子", > "jamón serrano", > "pêches", > "シュークリーム", > "막걸리", > "寿司", > "おもち", > "crème brûlée", > "fideuà", > "pâté", > "お好み焼き")).toDF("value") > df.show > /* > +-+ > | value| > +-+ > | γύρος| > | pears| > | linguiça| > | xoriço| > | hamburger| > | éclair| > | smørbrød| > | spätzle| > | 包子| > |jamón serrano| > | pêches| > | シュークリーム| > | 막걸리| > | 寿司| > | おもち| > | crème brûlée| > | fideuà| > | pâté| > | お好み焼き| > +-+ > */{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25108) Dataset.show() generates incorrect padding for Unicode Character
xuejianbest created SPARK-25108: --- Summary: Dataset.show() generates incorrect padding for Unicode Character Key: SPARK-25108 URL: https://issues.apache.org/jira/browse/SPARK-25108 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1, 2.3.0 Environment: spark-shell on Xshell Reporter: xuejianbest The Dataset.show() method generates incorrect space padding since column name or column value has Unicode Character. {code:java} val df = spark.createDataset(Seq( "γύρος", "pears", "linguiça", "xoriço", "hamburger", "éclair", "smørbrød", "spätzle", "包子", "jamón serrano", "pêches", "シュークリーム", "막걸리", "寿司", "おもち", "crème brûlée", "fideuà", "pâté", "お好み焼き")).toDF("value") before: +-+ | value| +-+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| |jamón serrano| | pêches| | シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +-+ after fix: +--+ | value| +--+ | γύρος| | pears| | linguiça| | xoriço| | hamburger| | éclair| | smørbrød| | spätzle| | 包子| | jamón serrano| | pêches| |シュークリーム| | 막걸리| | 寿司| | おもち| | crème brûlée| | fideuà| | pâté| | お好み焼き| +--+ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
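The misalignment shown in the before/after tables above comes from counting every code point as one display column, while East Asian wide characters (CJK ideographs, Hangul, fullwidth forms) occupy two terminal cells. A minimal width heuristic illustrating the idea (a sketch: the ranges below are a simplification of the East Asian Width property, not the actual Spark patch):

```java
public class DisplayWidth {
    // Rough rule: code points in the major East Asian wide ranges take
    // two terminal cells; everything else takes one.
    static int charWidth(int cp) {
        boolean wide = cp >= 0x1100 && (cp <= 0x115F          // Hangul Jamo
                || (cp >= 0x2E80 && cp <= 0xA4CF)             // CJK radicals .. Yi
                || (cp >= 0xAC00 && cp <= 0xD7A3)             // Hangul syllables
                || (cp >= 0xF900 && cp <= 0xFAFF)             // CJK compatibility
                || (cp >= 0xFF00 && cp <= 0xFF60));           // fullwidth forms
        return wide ? 2 : 1;
    }

    // Display width of a string = sum of per-code-point widths;
    // padding should use this instead of String.length().
    static int displayWidth(String s) {
        return s.codePoints().map(DisplayWidth::charWidth).sum();
    }

    public static void main(String[] args) {
        System.out.println(displayWidth("pears")); // 5 cells
        System.out.println(displayWidth("寿司"));   // 4 cells, not length() == 2
    }
}
```

Padding each cell to `maxWidth - displayWidth(value)` spaces rather than `maxWidth - value.length()` is what produces the corrected "after fix" table.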
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579133#comment-16579133 ] Michael Heuer commented on SPARK-24771: --- I'm looking forward to testing this with [ADAM|https://github.com/bigdatagenomics/adam] and all of our downstream projects as part of the 2.4.0 release candidate process. If it is worth my time doing so before then, please let me know. Parquet + Avro is at the core of what we do, and having the 1.8 vs 1.7 internal conflict present in Spark resolved would be very welcome. > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24886) Increase Jenkins build time
[ https://issues.apache.org/jira/browse/SPARK-24886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579131#comment-16579131 ] Apache Spark commented on SPARK-24886: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/22098 > Increase Jenkins build time > --- > > Key: SPARK-24886 > URL: https://issues.apache.org/jira/browse/SPARK-24886 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > Currently, looks we hit the time limit time to time. Looks better increasing > the time a bit. > For instance, please see https://github.com/apache/spark/pull/21822 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579128#comment-16579128 ] Sean Owen commented on SPARK-24771: --- I confess I just don't know enough to have a strong opinion. A minor version upgrade isn't out of the question for a minor Spark upgrade. You are right that this is considered a non-core integration. It sounds like there are incompatibility issues. However that cuts two ways; some users are facing problems because Spark _isn't_ on 1.8.x. If spark-avro is already on 1.8, I can see the need for Spark to update as well. Not blessing this so much as saying I don't object. > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579126#comment-16579126 ] Wenchen Fan commented on SPARK-24771: - cc [~r...@databricks.com] [~srowen] > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16617) Upgrade to Avro 1.8.x
[ https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579125#comment-16579125 ] Wenchen Fan commented on SPARK-16617: - Sorry I missed this ticket; the upgrade is now done by https://issues.apache.org/jira/browse/SPARK-24771 > Upgrade to Avro 1.8.x > - > > Key: SPARK-16617 > URL: https://issues.apache.org/jira/browse/SPARK-16617 > Project: Spark > Issue Type: Improvement > Components: Build, Spark Core >Affects Versions: 2.1.0 >Reporter: Ben McCann >Priority: Major > > Avro 1.8 makes Avro objects serializable so that you can easily have an RDD > containing Avro objects. > See https://issues.apache.org/jira/browse/AVRO-1502 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579124#comment-16579124 ] Wenchen Fan commented on SPARK-24771: - Sorry I was not aware of https://issues.apache.org/jira/browse/SPARK-16617 . So we proposed to upgrade Avro 2 years ago, and gave it up because it's not binary compatible and the benefit was not that great. I think things have changed now. This upgrade is super important to the Avro data source, for date/timestamp/decimal support. Also, as people pointed out in https://issues.apache.org/jira/browse/SPARK-16617 , this is an important bug fix for using Parquet and Avro together. BTW I don't think the impact is that large. Spark doesn't have a stable API to plug in Avro support, so Avro users have to do some manual work to migrate to new Spark versions. As an example, I don't think the databricks spark-avro package can work with Spark 2.4 without modification. > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null
[ https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25028: --- Assignee: Marco Gaido > AnalyzePartitionCommand failed with NPE if value is null > > > Key: SPARK-25028 > URL: https://issues.apache.org/jira/browse/SPARK-25028 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > on line 143: val partitionColumnValues = > partitionColumns.indices.map(r.get(_).toString) > If the value is NULL the code will fail with NPE > *sample:* > {code:scala} > val df = List((1, null , "first"), (2, null , "second")).toDF("index", > "name", "value").withColumn("name", $"name".cast("string")) > df.write.partitionBy("name").saveAsTable("df13") > spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS") > {code} > output: > 2018-08-08 09:25:43 WARN BaseSessionStateBuilder$$anon$1:66 - Max iterations > (2) reached for batch Resolution > java.lang.NullPointerException > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143) > at > 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) > ... 49 elided -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
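The failing pattern in AnalyzePartitionCommand is a plain toString on a possibly-null partition value. A minimal Java sketch of the bug and the null-safe alternative (the class and variable names here are illustrative, not Spark's; the sentinel string mirrors Hive's default partition name):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NullSafePartitionValues {
    public static void main(String[] args) {
        // Partition values as read back for a row; a null means the row landed
        // in the default (null) partition, as in the sample above.
        List<Object> partitionValues = Arrays.asList("one", null, "three");

        // partitionValues.stream().map(Object::toString) would throw a
        // NullPointerException on the null entry -- the bug reported here.

        // Null-safe variant: substitute a sentinel before calling toString().
        List<String> rendered = partitionValues.stream()
            .map(v -> v == null ? "__HIVE_DEFAULT_PARTITION__" : v.toString())
            .collect(Collectors.toList());
        System.out.println(rendered);
    }
}
```

The same guard applied inside calculateRowCountsPerPartition is the shape of the eventual fix.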
[jira] [Resolved] (SPARK-25028) AnalyzePartitionCommand failed with NPE if value is null
[ https://issues.apache.org/jira/browse/SPARK-25028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25028. - Resolution: Fixed Fix Version/s: 2.3.2 2.4.0 Issue resolved by pull request 22036 [https://github.com/apache/spark/pull/22036] > AnalyzePartitionCommand failed with NPE if value is null > > > Key: SPARK-25028 > URL: https://issues.apache.org/jira/browse/SPARK-25028 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Izek Greenfield >Priority: Major > Fix For: 2.4.0, 2.3.2 > > > on line 143: val partitionColumnValues = > partitionColumns.indices.map(r.get(_).toString) > If the value is NULL the code will fail with NPE > *sample:* > {code:scala} > val df = List((1, null , "first"), (2, null , "second")).toDF("index", > "name", "value").withColumn("name", $"name".cast("string")) > df.write.partitionBy("name").saveAsTable("df13") > spark.sql("ANALYZE TABLE df13 PARTITION (name) COMPUTE STATISTICS") > {code} > output: > 2018-08-08 09:25:43 WARN BaseSessionStateBuilder$$anon$1:66 - Max iterations > (2) reached for batch Resolution > java.lang.NullPointerException > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1$$anonfun$8.apply(AnalyzePartitionCommand.scala:143) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.Range.foreach(Range.scala:160) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > 
org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:143) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand$$anonfun$calculateRowCountsPerPartition$1.apply(AnalyzePartitionCommand.scala:142) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand.calculateRowCountsPerPartition(AnalyzePartitionCommand.scala:142) > at > org.apache.spark.sql.execution.command.AnalyzePartitionCommand.run(AnalyzePartitionCommand.scala:104) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190) > at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253) > at org.apache.spark.sql.Dataset.(Dataset.scala:190) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641) > ... 
49 elided -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579042#comment-16579042 ] Marcelo Vanzin commented on SPARK-24918: I like the idea in general. On the implementation side, instead of {{spark.executor.plugins}}, how about using {{java.util.ServiceLoader}}? That's one less configuration needed to enable these plugins. The downside is that if the jar is visible to Spark, it will be invoked (so it becomes "opt out" instead of "opt in", if you want to add an option to disable specific plugins). I thought about suggesting a new API in SparkContext to programmatically add plugins, but that might be too messy. Better to start simple. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so you don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the API, and how it's versioned. > Does it just get created when the executor starts, and that's it?
Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, it's still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
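The ServiceLoader suggestion above relies on the standard JDK discovery mechanism: a jar opts in by shipping a META-INF/services/&lt;interface-binary-name&gt; file listing its implementations. A minimal sketch of how discovery would look (the ExecutorPlugin interface here is hypothetical; its actual shape is exactly what this ticket is debating):

```java
import java.util.ServiceLoader;

public class PluginDiscovery {
    // Hypothetical plugin interface -- whether it gets only an init hook or
    // also task-level events is an open question in the JIRA discussion.
    public interface ExecutorPlugin {
        void init();
    }

    public static void main(String[] args) {
        // ServiceLoader scans the classpath for files named
        // META-INF/services/PluginDiscovery$ExecutorPlugin; no Spark
        // configuration key is needed to enable a plugin.
        ServiceLoader<ExecutorPlugin> loader = ServiceLoader.load(ExecutorPlugin.class);
        int discovered = 0;
        for (ExecutorPlugin plugin : loader) {
            plugin.init();
            discovered++;
        }
        // No provider file exists on the classpath in this sketch, so nothing
        // loads. This is also the "opt out" concern: any visible jar that does
        // ship a provider file would be invoked automatically.
        System.out.println("plugins discovered: " + discovered);
    }
}
```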
[jira] [Commented] (SPARK-24938) Understand usage of netty's onheap memory use, even with offheap pools
[ https://issues.apache.org/jira/browse/SPARK-24938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579039#comment-16579039 ] Nihar Sheth commented on SPARK-24938: - After making the change and running the tool on a very simple application both with and without that change, I saw 3 netty services that dropped from 16 MB to 0. They are: netty-rpc-client-usedHeapMem, netty-blockTransfer-client-usedHeapMem, and netty-external-shuffle-client-usedHeapMem. Without the change, 16 MB of on-heap memory was allocated for these three services over their lifetime; with the change it disappears in all 3 cases. Does this sound like the sole source of this particular issue? Or would you expect more memory elsewhere to also be freed up? > Understand usage of netty's onheap memory use, even with offheap pools > -- > > Key: SPARK-24938 > URL: https://issues.apache.org/jira/browse/SPARK-24938 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > We've observed that netty uses a large amount of on-heap memory in its pools, in > addition to the expected off-heap memory, when I added some instrumentation > (using SPARK-24918 and https://github.com/squito/spark-memory). We should > figure out why it's using that memory, and whether it's really necessary. > It might be just this one line: > https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/protocol/MessageEncoder.java#L82 > which means that even with a small burst of messages, each arena will grow by > 16MB which could lead to a 128 MB spike of an almost entirely unused pool. > Switching to requesting a buffer from the default pool would probably fix > this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
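The 128 MB figure in the issue follows from the pooled allocator's geometry: each arena reserves whole 16 MB chunks, so a burst that touches every arena once pins one mostly-unused chunk per arena. A back-of-the-envelope sketch (the arena count of 8 is illustrative; netty's actual default depends on available cores and memory):

```java
public class ArenaSpikeEstimate {
    public static void main(String[] args) {
        int pageSizeKb = 8;     // netty's default page size: 8 KiB
        int chunkOrder = 11;    // chunk = page << order, i.e. 16 MiB
        int chunkSizeMb = pageSizeKb * (1 << chunkOrder) / 1024;
        int arenas = 8;         // illustrative; the default scales with cores

        // Worst case: a small message burst touches every arena, and each
        // arena grows by one full heap chunk that then sits nearly unused.
        System.out.println("chunk size: " + chunkSizeMb + " MB");
        System.out.println("worst-case spike: " + (chunkSizeMb * arenas) + " MB");
    }
}
```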
[jira] [Comment Edited] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578752#comment-16578752 ] MIK edited comment on SPARK-25051 at 8/13/18 11:04 PM: --- Thanks [~yumwang] , with 2.3.2-rc4, not getting the error now but the result is not correct (getting 0 records):
+---+-----+
| id| name|
+---+-----+
+---+-----+
The sample program should return 2 records:
+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+
was (Author: mik1007): Thanks [~yumwang] , with 2.3.2-rc4 the error is gone now but the result is not correct (getting 0 records):
+---+-----+
| id| name|
+---+-----+
+---+-----+
The sample program should return 2 records:
+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+
> where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. 
Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.(Dataset.scala:172) > at org.apache.spark.sql.Dataset.(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
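The intent of the failing query appears to be "rows whose id has no match in df2", which is the semantics of a left anti join and avoids the post-join reference to df2("id") that the analyzer can no longer resolve. A plain-Java sketch of those semantics (ids and names here echo the sample result above; the data layout is illustrative, not Spark's):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LeftAntiSketch {
    public static void main(String[] args) {
        // Left side: id -> name, the rows we want to keep or drop.
        Map<Integer, String> left = Map.of(1, "one", 2, "two", 3, "three");
        // Right side: ids that already have a match.
        List<Integer> rightIds = List.of(2);

        // Anti-join semantics: keep left rows whose id has no match on the
        // right, instead of filtering on a right-side column after the join.
        List<String> result = left.entrySet().stream()
            .filter(e -> !rightIds.contains(e.getKey()))
            .sorted(Map.Entry.comparingByKey())
            .map(e -> e.getKey() + "|" + e.getValue())
            .collect(Collectors.toList());
        System.out.println(result);
    }
}
```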
[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging
[ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579012#comment-16579012 ] Marcelo Vanzin commented on SPARK-24787: Is the slowness really caused by the use of hsync vs. hflush? I'd expect the flushing of the data, not the metadata update, to be the expensive part... In any case, if you have any ideas, feel free to post a PR. > Events being dropped at an alarming rate due to hsync being slow for > eventLogging > - > > Key: SPARK-24787 > URL: https://issues.apache.org/jira/browse/SPARK-24787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: Sanket Reddy >Priority: Minor > > [https://github.com/apache/spark/pull/16924/files] updates the length of the > in-progress files, keeping the history server responsive. > We have a production job with 6 tasks per stage; because hsync is slow, it > starts dropping events, and the history server shows wrong stats due to the > dropped events. > A viable solution is to sync less frequently, or to make the sync frequency > configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579005#comment-16579005 ] Marcelo Vanzin commented on SPARK-24771: Hi guys, why was this accepted? It has been tried in the past and we re-targeted it to 3.0 because Avro 1.8 is not backwards compatible with Avro 1.7: https://issues.apache.org/jira/browse/SPARK-16617 In particular the discussion in the PR is helpful here: https://github.com/apache/spark/pull/17163 > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors
[ https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karan updated SPARK-25107: -- Description: I am in the process of upgrading Spark 1.6 to Spark 2.2. I have a two-stage query and I am running with hiveContext. 1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid ORDER BY date DESC) AS ro FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid ,VC.data,CASE WHEN pcs.score BETWEEN PC.from AND PC.to AND ((PC.csacnt IS NOT NULL AND CC.status = 4 AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS Flag FROM maindata VC INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid INNER JOIN cnfgtable PC ON PC.subid = VC.subid INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID LEFT JOIN casetable CC ON CC.rowid = VC.rowid LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN VPM.StartDate and VPM.EndDate) A WHERE A.Flag =1").createOrReplaceTempView("stage1") 2) hiveContext.sql("SELECT DISTINCT t1.ConfigID As cnfg_id ,vct.* FROM stage1 t1 INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID INNER JOIN maindata vct on vct.recordid = t1.recordid WHERE t2.ro = PCR.datacount”) The same query sequence works fine in Spark 1.6 but fails with the exception below in Spark 2.2. It throws the exception while parsing the 2nd query above. {{+org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:+}} +CatalogRelation `database_name`.`{color:#f79232}maindata{color}`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more fields|#1506... 
89 more fields]+ \{{ at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)}} \{{ at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)}} \{{ at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)}} \{{ at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)}} \{{ at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)}} \{{ at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}} \{{ at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} \{{ 
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} \{{ at
[jira] [Updated] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors
[ https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karan updated SPARK-25107: -- Description: I am in the process of upgrading Spark 1.6 to Spark 2.2. I have a two-stage query and I am running with hiveContext. 1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid ORDER BY date DESC) AS ro FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid ,VC.recordid ,VC.data,CASE WHEN pcs.score BETWEEN PC.from AND PC.to AND ((PC.csacnt IS NOT NULL AND CC.status = 4 AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL)) THEN 1 ELSE 0 END AS Flag FROM maindata VC INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid INNER JOIN cnfgtable PC ON PC.subid = VC.subid INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID LEFT JOIN casetable CC ON CC.rowid = VC.rowid LEFT JOIN prdcnfg mc ON mc.configID = PC.configID WHERE VC.date BETWEEN VPM.StartDate and VPM.EndDate) A WHERE A.Flag =1").registerTempTable("stage1") 2) hiveContext.sql("SELECT DISTINCT t1.ConfigID As cnfg_id ,vct.* FROM stage1 t1 INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND T1.ConfigID = t2.ConfigID INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID INNER JOIN maindata vct on vct.recordid = t1.recordid WHERE t2.ro = PCR.datacount”) The same query sequence works fine in Spark 1.6 but fails with the exception below in Spark 2.2. It throws the exception while parsing the 2nd query above. {{+org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:+}} {{+CatalogRelation `database_name`.`{color:#f79232}maindata{color}`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506... 89 more fields|#1506... 
89 more fields]+ }} \{{ at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)}} \{{ at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)}} \{{ at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)}} \{{ at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)}} \{{ at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)}} \{{ at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)}} \{{ at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} 
\{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)}} \{{ at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)}} \{{ at
[jira] [Updated] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors
[ https://issues.apache.org/jira/browse/SPARK-25107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karan updated SPARK-25107: -- Description: I am in the process of upgrading from Spark 1.6 to Spark 2.2. I have a two-stage query that I run with hiveContext.

{code}
1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid ORDER BY date DESC) AS ro
     FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid, VC.recordid, VC.data,
                  CASE WHEN pcs.score BETWEEN PC.from AND PC.to
                        AND ((PC.csacnt IS NOT NULL AND CC.status = 4
                              AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL))
                       THEN 1 ELSE 0 END AS Flag
           FROM maindata VC
           INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid
           INNER JOIN cnfgtable PC ON PC.subid = VC.subid
           INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID
           LEFT JOIN casetable CC ON CC.rowid = VC.rowid
           LEFT JOIN prdcnfg mc ON mc.configID = PC.configID
           WHERE VC.date BETWEEN VPM.StartDate AND VPM.EndDate) A
     WHERE A.Flag = 1").registerTempTable("stage1")

2) hiveContext.sql("SELECT DISTINCT t1.ConfigID AS cnfg_id, vct.*
     FROM stage1 t1
     INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND t1.ConfigID = t2.ConfigID
     INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID
     INNER JOIN maindata vct ON vct.recordid = t1.recordid
     WHERE t2.ro = PCR.datacount")
{code}

The same query sequence works fine in Spark 1.6 but fails with the exception below in Spark 2.2. The exception is thrown while parsing the second query.

{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
CatalogRelation `database_name`.`maindata`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506, ... 89 more fields]
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
 at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
 at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)
 at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)
 at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)
 at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
 at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 ... (the last six frames repeat as transformUp recurses; trace truncated in the original)
{code}
[jira] [Created] (SPARK-25107) Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors
Karan created SPARK-25107: - Summary: Spark 2.2.0 Upgrade Issue : Throwing TreeNodeException: makeCopy, tree: CatalogRelation Errors Key: SPARK-25107 URL: https://issues.apache.org/jira/browse/SPARK-25107 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.2.0 Environment: Spark Version : 2.2.0.cloudera2 Reporter: Karan

I am in the process of upgrading from Spark 1.6 to Spark 2.2. I have a two-stage query that I run with hiveContext.

{code}
1) hiveContext.sql("SELECT *, ROW_NUMBER() OVER (PARTITION BY ConfigID, rowid ORDER BY date DESC) AS ro
     FROM (SELECT DISTINCT PC.ConfigID, VPM.seqNo, VC.rowid, VC.recordid, VC.data,
                  CASE WHEN pcs.score BETWEEN PC.from AND PC.to
                        AND ((PC.csacnt IS NOT NULL AND CC.status = 4
                              AND CC.cnclusion = mc.ca) OR (PC.csacnt IS NULL))
                       THEN 1 ELSE 0 END AS Flag
           FROM maindata VC
           INNER JOIN scoretable pcs ON VC.s_recordid = pcs.s_recordid
           INNER JOIN cnfgtable PC ON PC.subid = VC.subid
           INNER JOIN prdtable VPM ON PC.configID = VPM.CNFG_ID
           LEFT JOIN casetable CC ON CC.rowid = VC.rowid
           LEFT JOIN prdcnfg mc ON mc.configID = PC.configID
           WHERE VC.date BETWEEN VPM.StartDate AND VPM.EndDate) A
     WHERE A.Flag = 1").createOrReplaceTempView("stage1")

2) hiveContext.sql("SELECT DISTINCT t1.ConfigID AS cnfg_id, vct.*
     FROM stage1 t1
     INNER JOIN stage1 t2 ON t1.rowid = t2.rowid AND t1.ConfigID = t2.ConfigID
     INNER JOIN cnfgtable PCR ON PCR.ConfigID = t2.ConfigID
     INNER JOIN maindata vct ON vct.recordid = t1.recordid
     WHERE t2.ro = PCR.datacount")
{code}

The same query sequence works fine in Spark 1.6 but fails with the exception below in Spark 2.2. The exception is thrown while parsing the second query.

{code}
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
CatalogRelation `ilink_perf_athenaprod`.`maindata`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [recordid#1506, ... 89 more fields]
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
 at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:385)
 at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:300)
 at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)
 at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)
 at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:722)
 at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$5.applyOrElse(Analyzer.scala:721)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
 at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
 at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
 at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
 at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
 ... (the last six frames repeat as transformUp recurses; trace truncated in the original)
{code}
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 0.10.0.1 to 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578962#comment-16578962 ] Apache Spark commented on SPARK-18057: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/22097 > Update structured streaming kafka from 0.10.0.1 to 2.0.0 > > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Assignee: Ted Yu >Priority: Major > Fix For: 2.4.0 > > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24156) Enable no-data micro batches for more eager streaming state clean up
[ https://issues.apache.org/jira/browse/SPARK-24156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578937#comment-16578937 ] Xiao Li commented on SPARK-24156: - [~tdas] Can we mark it done? > Enable no-data micro batches for more eager streaming state clean up > - > > Key: SPARK-24156 > URL: https://issues.apache.org/jira/browse/SPARK-24156 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > > Currently, MicroBatchExecution in Structured Streaming runs batches only when > there is new data to process. This is sensible in most cases as we don't want > to unnecessarily use resources when there is nothing new to process. However, > in some cases of stateful streaming queries, this delays state clean up as > well as clean-up based output. For example, consider a streaming aggregation > query with watermark-based state cleanup. The watermark is updated after > every batch with new data completes. The updated value is used in the next > batch to clean up state, and output finalized aggregates in append mode. > However, if there is no data, then the next batch does not occur, and > cleanup/output gets delayed unnecessarily. This is true for all stateful > streaming operators: aggregation, deduplication, joins, mapGroupsWithState. > This issue tracks the work to enable no-data batches in MicroBatchExecution. > The major challenge is that all the tests of relevant stateful operations add > dummy data to force another batch for testing the state cleanup. So a lot of > the tests are going to be changed. So my plan is to enable no-data batches > for different stateful operators one at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
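The cleanup delay described in the issue above shows up concretely in watermarked append-mode aggregations. The following is a sketch only, not code from the ticket: it assumes a SparkSession named `spark` with a streaming DataFrame `events` carrying an event-time column `ts` and a `userId` column (all illustrative names), and it requires a live Spark runtime to execute.

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._  // assumes a SparkSession named `spark`

// `events` is an assumed streaming DataFrame with columns ts (timestamp) and userId.
val counts = events
  .withWatermark("ts", "10 minutes")                // watermark advances only after a batch with data completes
  .groupBy(window($"ts", "5 minutes"), $"userId")
  .count()

// In append mode a window's final count is emitted (and its state dropped) only
// once the watermark passes the window end. Before no-data batches, that required
// yet another batch containing new data, so an idle stream held state indefinitely.
counts.writeStream.outputMode("append").format("console").start()
```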
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578894#comment-16578894 ] Imran Rashid commented on SPARK-24918: -- With dynamic allocation you don't have a good place to run {code} df.mapPartitions { it => MyResource.initIfNeeded() it.map(...) } {code} Executors can come and go, so you can't ensure that runs everywhere. Even if you make "too many" tasks, it could be that your job starts out with a very small number of tasks for a while before ramping up. So after you run your initialization with the added initResource code, many executors get torn down during the first part of the real job as they sit idle; then when the job ramps up, you get new executors, which never had your initialization run. You'd have to put {{MyResource.initIfNeeded()}} inside *every* task. (Note that for the debug use case, the initializer is totally unnecessary for the task to complete -- if the task actually depended on it, then of course you should have that logic in each task.) I think there is a large class of users who could add "--conf spark.executor.plugins=com.mycompany.WhizzBangDebugPlugin --jars whizzbangdebug.jar" to the command-line arguments, but couldn't add in that code sample (even with static allocation). They're not the ones that are *writing* the plugins; they just need to be able to enable them. {quote}What do you do if init fails? retry or fail?{quote} Good question; Tom asked the same thing on the PR. I suggested the executor just fails to start. If a plugin wanted to be "safe", it could catch exceptions in its own initialization.
> Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. It's hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so you don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how it's versioned. > Does it just get created when the executor starts, and that's it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, it's still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
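The plugin mechanism being debated above can be sketched without Spark at all. The following standalone Scala sketch assumes a hypothetical {{ExecutorPlugin}} trait with init/shutdown hooks and reflective loading of class names taken from a config such as {{spark.executor.plugins}}; none of these names ({{ExecutorPlugin}}, {{MemoryMonitorPlugin}}, {{PluginLoader}}) are the final Spark API, they only illustrate the shape under discussion.

```scala
// Hypothetical sketch of the proposed executor plugin API; names are illustrative.
trait ExecutorPlugin {
  def init(): Unit = {}      // would be called once when the executor JVM starts
  def shutdown(): Unit = {}  // would be called when the executor exits
}

// A debug plugin in the spirit of the spark-memory example: arms itself on init.
class MemoryMonitorPlugin extends ExecutorPlugin {
  @volatile var initialized = false
  override def init(): Unit = { initialized = true }
  override def shutdown(): Unit = { initialized = false }
}

object PluginLoader {
  // Mimics what an executor could do with a comma-separated class list
  // supplied via a config such as spark.executor.plugins.
  def load(classNames: String): Seq[ExecutorPlugin] =
    classNames.split(",").map(_.trim).filter(_.nonEmpty).map { name =>
      Class.forName(name).getDeclaredConstructor().newInstance()
        .asInstanceOf[ExecutorPlugin]
    }.toSeq

  def main(args: Array[String]): Unit = {
    val plugins = load("MemoryMonitorPlugin")
    // A plugin that throws here would, per the comment above, fail the executor.
    plugins.foreach(_.init())
    println(plugins.head.asInstanceOf[MemoryMonitorPlugin].initialized)
  }
}
```

This keeps the user-facing surface at a single config value plus a jar on the classpath, which is the point Imran makes about users who can edit command-line arguments but not application code.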
[jira] [Updated] (SPARK-25091) Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunling Cai updated SPARK-25091: Priority: Critical (was: Major) > Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor > memory > > > Key: SPARK-25091 > URL: https://issues.apache.org/jira/browse/SPARK-25091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yunling Cai >Priority: Critical > > UNCACHE TABLE and CLEAR CACHE does not clean up executor memory. > Through Spark UI, although in Storage, we see the cached table removed. In > Executor, the executors continue to hold the RDD and the memory is not > cleared. This results in huge waste in executor memory usage. As we call > CACHE TABLE, we run into issues where the cached tables are spilled to disk > instead of reclaiming the memory storage. > Steps to reproduce: > CACHE TABLE test.test_cache; > UNCACHE TABLE test.test_cache; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > CACHE TABLE test.test_cache; > CLEAR CACHE; > == Storage shows table is not cached; Executor shows the executor storage > memory does not change == > Similar behavior when using pyspark df.unpersist(). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25091) Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yunling Cai updated SPARK-25091: Component/s: (was: Spark Core) SQL > Spark Thrift Server: UNCACHE TABLE and CLEAR CACHE does not clean up executor > memory > > > Key: SPARK-25091 > URL: https://issues.apache.org/jira/browse/SPARK-25091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yunling Cai >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
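The SQL reproduction steps quoted in the SPARK-25091 report can also be driven from Scala. This is a sketch only: it assumes a live SparkSession named `spark` and an existing `test.test_cache` table, so it must run against a real cluster rather than standalone.

```scala
// Sketch of the reported symptom via the catalog API (requires a live SparkSession `spark`).
spark.catalog.cacheTable("test.test_cache")
spark.table("test.test_cache").count()            // materialize the cached blocks
spark.catalog.uncacheTable("test.test_cache")     // Storage tab shows the table gone...
spark.catalog.clearCache()                        // ...but per the report, executor storage memory is still held

// Executor storage use can be checked programmatically rather than via the UI:
spark.sparkContext.getExecutorMemoryStatus.foreach { case (executor, (max, remaining)) =>
  println(s"$executor storage used: ${max - remaining} bytes")
}
```

Comparing the printed per-executor usage before and after the uncache calls makes the leak visible without the Spark UI.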
[jira] [Commented] (SPARK-24736) --py-files not functional for non local URLs. It appears to pass non-local URL's into PYTHONPATH directly.
[ https://issues.apache.org/jira/browse/SPARK-24736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578882#comment-16578882 ] Ilan Filonenko commented on SPARK-24736: Until a resource staging server is set up, the URL will not resolve to a local file unless you call SparkFiles.get(file_name) in your application; as such, a URL passed to --py-files will remain unresolved. Remote dependencies won't be supported by --py-files just yet, but local files are supported. > --py-files not functional for non local URLs. It appears to pass non-local > URL's into PYTHONPATH directly. > -- > > Key: SPARK-24736 > URL: https://issues.apache.org/jira/browse/SPARK-24736 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 > Environment: Recent 2.4.0 from master branch, submitted on Linux to a > KOPS Kubernetes cluster created on AWS. > >Reporter: Jonathan A Weaver >Priority: Minor > > My spark-submit: > bin/spark-submit \ > --master k8s://https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com \ > --deploy-mode cluster \ > --name pytest \ > --conf spark.kubernetes.container.image=412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest \ > --conf spark.kubernetes.driver.pod.name=spark-pi-driver \ > --conf spark.kubernetes.authenticate.submission.caCertFile=cluster.ca \ > --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \ > --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \ > --py-files "https://s3.amazonaws.com/maxar-ids-fids/screw.zip" \ > https://s3.amazonaws.com/maxar-ids-fids/it.py > > *screw.zip is successfully downloaded and placed in SparkFiles.getRootPath()* > 2018-07-01 07:33:43 INFO
SparkContext:54 - Added file > [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] at > [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] with timestamp > 1530430423297 > 2018-07-01 07:33:43 INFO Utils:54 - Fetching > [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] to > /var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp > *I print out the PYTHONPATH and PYSPARK_FILES environment variables from the > driver script:* > PYTHONPATH > /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:*[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]* > PYSPARK_FILES [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] > > *I print out sys.path* > ['/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', > u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240', > '/opt/spark/python/lib/pyspark.zip', > '/opt/spark/python/lib/py4j-0.10.7-src.zip', > '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', > '/opt/spark/python/lib/py4j-*.zip', *'/opt/spark/work-dir/https', > '//[s3.amazonaws.com/maxar-ids-fids/screw.zip|http://s3.amazonaws.com/maxar-ids-fids/screw.zip]',* > '/usr/lib/python27.zip', '/usr/lib/python2.7', > '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', > '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', > '/usr/lib/python2.7/site-packages'] > > *URL from PYTHONFILES gets placed in sys.path verbatim with obvious results.* > > *Dump of spark config from container.* > Spark config dumped: > [(u'spark.master', > u'k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]'), > 
(u'spark.kubernetes.authenticate.submission.oauthToken', > u''), > (u'spark.kubernetes.authenticate.driver.oauthToken', > u''), (u'spark.kubernetes.executor.podNamePrefix', > u'pytest-1530430411996'), (u'spark.kubernetes.memoryOverheadFactor', u'0.4'), > (u'spark.driver.blockManager.port', u'7079'), > (u'[spark.app.id|http://spark.app.id/]', u'spark-application-1530430424433'), > (u'[spark.app.name|http://spark.app.name/]', u'pytest'), > (u'[spark.executor.id|http://spark.executor.id/]', u'driver'), > (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'), > (u'spark.kubernetes.container.image', >
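The mangled sys.path entries above ('/opt/spark/work-dir/https' followed by '//s3.amazonaws.com/...') are exactly what you get when a URL is appended verbatim to the colon-delimited PYTHONPATH. A minimal plain-Python sketch of the mechanism (the pyspark.zip path and S3 bucket below are illustrative, not the reporter's actual files):

```python
# Plain-Python sketch of the bug: PYTHONPATH is ":"-delimited on POSIX, so a
# URL appended to it verbatim is split at the colon after the scheme.
pythonpath = "/opt/spark/python/lib/pyspark.zip:https://s3.amazonaws.com/bucket/screw.zip"
entries = pythonpath.split(":")
print(entries)
# ['/opt/spark/python/lib/pyspark.zip', 'https', '//s3.amazonaws.com/bucket/screw.zip']
# The relative entry 'https' then resolves against the working directory,
# producing the '/opt/spark/work-dir/https' entry seen in sys.path above.
```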
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578872#comment-16578872 ] Sean Owen commented on SPARK-24918: --- This is just for per-executor initialization right? What's the issue with dynamic allocation – executors still start there, JVMs still initialize; how is it particularly hard? What do you do if init fails? retry or fail? Would SQL-only users meaningfully be able to use this if they don't know about code anyway? Is turning on debug code not something for config options? I guess I don't get why this still can't be solved by a static initializer. I'm not dead-set against this, just think it will add some complexity and not sure it gains a lot. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? 
Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
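The API questions in the ticket (create-only vs. finer-grained lifecycle events, and a base class with no-op implementations for compatibility) can be sketched as follows. This is a hypothetical shape in Python purely for illustration — `ExecutorPlugin`, `on_task_start`, and the other names are invented here, not Spark's actual interface:

```python
# Hypothetical sketch of the "base class with no-op implementations" option:
# every hook defaults to a no-op, so new events can be added later without
# breaking plugins compiled against the older interface.
class ExecutorPlugin:
    def init(self):            # called once when the executor starts
        pass
    def on_task_start(self):   # optional finer-grained lifecycle events
        pass
    def on_task_end(self):
        pass
    def shutdown(self):        # called when the executor exits
        pass

class TaskCounterPlugin(ExecutorPlugin):
    """Example plugin: only overrides the hooks it cares about."""
    def __init__(self):
        self.tasks_seen = 0
    def on_task_end(self):
        self.tasks_seen += 1

# The executor would drive the lifecycle; simulated here with a loop.
plugin = TaskCounterPlugin()
plugin.init()
for _ in range(3):
    plugin.on_task_start()
    plugin.on_task_end()
plugin.shutdown()
print(plugin.tasks_seen)  # 3
```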
[jira] [Commented] (SPARK-23984) PySpark Bindings for K8S
[ https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578871#comment-16578871 ] Apache Spark commented on SPARK-23984: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/22095 > PySpark Bindings for K8S > > > Key: SPARK-23984 > URL: https://issues.apache.org/jira/browse/SPARK-23984 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, PySpark >Affects Versions: 2.3.0 >Reporter: Ilan Filonenko >Priority: Major > Fix For: 2.4.0 > > > This ticket is tracking the ongoing work of moving the upstream work from > [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python > bindings for Spark on Kubernetes. > The points of focus are: dependency management, increased non-JVM memory > overhead default values, and modified Docker images to include Python > support. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578858#comment-16578858 ] Imran Rashid edited comment on SPARK-24918 at 8/13/18 8:15 PM: --- [~lucacanali] OK I see the case for what you're proposing -- its hard to setup that communication between the driver & executors without *some* initial setup message. Still ... I'm a bit reluctant to include that now, until we see someone actually builds something that uses it. I realizes you might be hesitant to do that until you know it can be built on a stable api, but I don't think we can get around that. was (Author: irashid): [~lucacanali] OK I see the case for what you're proposing -- its hard too setup that communication between the driver & executors without *some* initial setup message. Still ... I'm a bit reluctant to include that now, until we see someone actually builds something that uses it. I realizes you might be hesitant to do that until you know it can be built on a stable api, but I don't think we can get around that. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. 
> For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578858#comment-16578858 ] Imran Rashid commented on SPARK-24918: -- [~lucacanali] OK I see the case for what you're proposing -- it's hard to set up that communication between the driver & executors without *some* initial setup message. Still ... I'm a bit reluctant to include that now, until we see someone actually build something that uses it. I realize you might be hesitant to do that until you know it can be built on a stable api, but I don't think we can get around that. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? 
It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578856#comment-16578856 ] Imran Rashid commented on SPARK-650: Folks may be interested in SPARK-24918. Perhaps one should be closed as a duplicate of the other, but for now there is some discussion on both, so I'll leave them open for the time being. > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578853#comment-16578853 ] Imran Rashid commented on SPARK-24918: -- Ah, right, thanks [~vanzin], I knew I had seen this before. [~srowen], you argued the most against SPARK-650 -- have I made the case here? I did indeed at first do exactly what you suggested, using a static initializer, but realized it was not great for a couple of very important reasons: * dynamic allocation * turning on a "debug" mode without any code changes (you'd be surprised how big a hurdle this is for something in production) * "sql only" apps, where the end user barely knows anything about calling a mapPartitions function > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? 
Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. > Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5
[ https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578851#comment-16578851 ] shane knapp commented on SPARK-25079: - question: do we want to upgrade to 3.6 instead? > [PYTHON] upgrade python 3.4 -> 3.5 > -- > > Key: SPARK-25079 > URL: https://issues.apache.org/jira/browse/SPARK-25079 > Project: Spark > Issue Type: Improvement > Components: Build, PySpark >Affects Versions: 2.3.1 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > for the impending arrow upgrade > (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python > 3.4 -> 3.5. > i have been testing this here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69] > my methodology: > 1) upgrade python + arrow to 3.5 and 0.10.0 > 2) run python tests > 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and > upgrade centos workers to python3.5 > 4) simultaneously do the following: > - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that > points to python3.5 (this is currently being tested here: > [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)] > - push a change to python/run-tests.py replacing 3.4 with 3.5 > 5) once the python3.5 change to run-tests.py is merged, we will need to > back-port this to all existing branches > 6) then and only then can i remove the python3.4 -> python3.5 symlink -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
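The symlink in step 4 (and its removal in step 6) can be sketched like this; a temp directory stands in for the workers' actual /home/anaconda/envs/py3k/bin, and the stub script is invented for illustration:

```python
# Hedged sketch of the python3.4 -> python3.5 symlink from step 4, using a
# throwaway temp directory instead of the real anaconda env's bin directory.
import os
import tempfile

bindir = tempfile.mkdtemp()
real = os.path.join(bindir, "python3.5")
with open(real, "w") as f:
    f.write("#!/bin/sh\nexec python3 \"$@\"\n")  # stand-in interpreter stub

link = os.path.join(bindir, "python3.4")
os.symlink(real, link)  # anything invoking "python3.4" now gets python3.5

print(os.path.islink(link))  # True
```

Step 6 then reduces to `os.unlink(link)` (or `rm` on the workers) once nothing references python3.4 anymore.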
[jira] [Commented] (SPARK-22905) Fix ChiSqSelectorModel, GaussianMixtureModel save implementation for Row order issues
[ https://issues.apache.org/jira/browse/SPARK-22905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578850#comment-16578850 ] Apache Spark commented on SPARK-22905: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/22079 > Fix ChiSqSelectorModel, GaussianMixtureModel save implementation for Row > order issues > - > > Key: SPARK-22905 > URL: https://issues.apache.org/jira/browse/SPARK-22905 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.1 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 2.3.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Currently, in `ChiSqSelectorModel`, save: > {code} > spark.createDataFrame(dataArray).repartition(1).write... > {code} > The default partition number used by createDataFrame is "defaultParallelism", > Current RoundRobinPartitioning won't guarantee the "repartition" generating > the same order result with local array. We need fix it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
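A toy model (plain Python, not Spark) of why round-robin partitioning cannot be trusted to preserve the local array's order, which is the core of the bug above:

```python
# Toy illustration: dealing rows round-robin across partitions and reading
# them back partition-by-partition reorders the data, so a write that goes
# through repartition(1) after createDataFrame(dataArray) may not emit rows
# in the original array order.
data = list(range(10))
num_partitions = 3

# round-robin: element i goes to partition i % num_partitions
partitions = [data[i::num_partitions] for i in range(num_partitions)]

# concatenating partitions back in partition order does not restore the input
recombined = [row for part in partitions for row in part]
print(recombined)  # [0, 3, 6, 9, 1, 4, 7, 2, 5, 8]
```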
[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch
[ https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Seigneurin updated SPARK-25106: -- Description: I have a fairly simple piece of code that reads from Kafka, applies some transformations - including applying a UDF - and writes the result to the console. Every time a batch is created, a new consumer is created (and not closed), eventually leading to a "too many open files" error. I created a test case, with the code available here: [https://github.com/aseigneurin/spark-kafka-issue] To reproduce: # Start Kafka and create a topic called "persons" # Run "Producer" to generate data # Run "Consumer" I am attaching the log where you can see a new consumer being initialized between every batch. Please note this issue does *not* appear with Spark 2.2.2, and it does not appear either when I don't apply the UDF. I am suspecting - although I did not go far enough to confirm - that this issue is related to the improvement made in SPARK-23623. was: I have a fairly simple piece of code that reads from Kafka, applies some transformations - including applying a UDF - and writes the result to the console. Every time a batch is created, a new consumer is created (and not closed), eventually leading to a "too many open files" error. I created a test case, with the code available here: [https://github.com/aseigneurin/spark-kafka-issue] To reproduce: # Start Kafka and create a topic called "persons" # Run "Producer" to generate data # Run "Consumer" I am attaching the log where you can see a new consumer being initialized between every batch. Please note this issue does *not* appear with Spark 2.2.2, and it does not appear either when I don't apply the UDF. 
> A new Kafka consumer gets created for every batch > - > > Key: SPARK-25106 > URL: https://issues.apache.org/jira/browse/SPARK-25106 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: console.txt > > > I have a fairly simple piece of code that reads from Kafka, applies some > transformations - including applying a UDF - and writes the result to the > console. Every time a batch is created, a new consumer is created (and not > closed), eventually leading to a "too many open files" error. > I created a test case, with the code available here: > [https://github.com/aseigneurin/spark-kafka-issue] > To reproduce: > # Start Kafka and create a topic called "persons" > # Run "Producer" to generate data > # Run "Consumer" > I am attaching the log where you can see a new consumer being initialized > between every batch. > Please note this issue does *not* appear with Spark 2.2.2, and it does not > appear either when I don't apply the UDF. > I am suspecting - although I did not go far enough to confirm - that this issue > is related to the improvement made in SPARK-23623. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch
[ https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Seigneurin updated SPARK-25106: -- Description: I have a fairly simple piece of code that reads from Kafka, applies some transformations - including applying a UDF - and writes the result to the console. Every time a batch is created, a new consumer is created (and not closed), eventually leading to a "too many open files" error. I created a test case, with the code available here: [https://github.com/aseigneurin/spark-kafka-issue] To reproduce: # Start Kafka and create a topic called "persons" # Run "Producer" to generate data # Run "Consumer" I am attaching the log where you can see a new consumer being initialized between every batch. Please note this issue does *not* appear with Spark 2.2.2, and it does not appear either when I don't apply the UDF. was: I have a fairly simple piece of code that reads from Kafka, applies some transformations - including applying a UDF - and writes the result to the console. Every time a batch is created, a new consumer is created (and not closed), eventually leading to a "too many open files" error. I created a test case, with the code available here: [https://github.com/aseigneurin/spark-kafka-issue] To reproduce: # Start Kafka and create a topic called "persons" # Run "Producer" to generate data # Run "Consumer" I am attaching the log where you can see a new consumer being initialized between every batch. Please note this issue does *not* appear with Spark 2.2.2. 
> A new Kafka consumer gets created for every batch > - > > Key: SPARK-25106 > URL: https://issues.apache.org/jira/browse/SPARK-25106 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: console.txt > > > I have a fairly simple piece of code that reads from Kafka, applies some > transformations - including applying a UDF - and writes the result to the > console. Every time a batch is created, a new consumer is created (and not > closed), eventually leading to a "too many open files" error. > I created a test case, with the code available here: > [https://github.com/aseigneurin/spark-kafka-issue] > To reproduce: > # Start Kafka and create a topic called "persons" > # Run "Producer" to generate data > # Run "Consumer" > I am attaching the log where you can see a new consumer being initialized > between every batch. > Please note this issue does *not* appear with Spark 2.2.2, and it does not > appear either when I don't apply the UDF. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch
[ https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Seigneurin updated SPARK-25106: -- Attachment: console.txt > A new Kafka consumer gets created for every batch > - > > Key: SPARK-25106 > URL: https://issues.apache.org/jira/browse/SPARK-25106 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: console.txt > > > I have a fairly simple piece of code that reads from Kafka, applies some > transformations - including applying a UDF - and writes the result to the > console. Every time a batch is created, a new consumer is created (and not > closed), eventually leading to a "too many open files" error. > I created a test case, with the code available here: > [https://github.com/aseigneurin/spark-kafka-issue] > To reproduce: > # Start Kafka and create a topic called "persons" > # Run "Producer" to generate data > # Run "Consumer" > I am attaching the log where you can see a new consumer being initialized > between every batch. > Please note this issue does *not* appear with Spark 2.2.2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25106) A new Kafka consumer gets created for every batch
Alexis Seigneurin created SPARK-25106: - Summary: A new Kafka consumer gets created for every batch Key: SPARK-25106 URL: https://issues.apache.org/jira/browse/SPARK-25106 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Alexis Seigneurin Attachments: console.txt I have a fairly simple piece of code that reads from Kafka, applies some transformations - including applying a UDF - and writes the result to the console. Every time a batch is created, a new consumer is created (and not closed), eventually leading to a "too many open files" error. I created a test case, with the code available here: [https://github.com/aseigneurin/spark-kafka-issue] To reproduce: # Start Kafka and create a topic called "persons" # Run "Producer" to generate data # Run "Consumer" I am attaching the log where you can see a new consumer being initialized between every batch. Please note this issue does *not* appear with Spark 2.2.2. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
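One way to avoid this kind of leak is to reuse a single consumer per topic-partition instead of creating one per batch. A plain-Python stand-in of that idea (`FakeConsumer` and the cache are invented for illustration; this is not the Kafka client API or Spark's actual consumer cache):

```python
# Toy stand-in for a Kafka consumer: each instance would hold an open socket,
# so creating one per batch without ever closing it leaks file descriptors
# and eventually hits "too many open files".
class FakeConsumer:
    live = 0
    def __init__(self):
        FakeConsumer.live += 1   # pretend a file descriptor is opened here

cache = {}

def get_consumer(key):
    # reuse one consumer per (topic, partition) instead of creating a new one
    if key not in cache:
        cache[key] = FakeConsumer()
    return cache[key]

for _batch in range(100):
    get_consumer(("persons", 0))

print(FakeConsumer.live)  # 1 consumer for 100 batches, instead of 100
```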
[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578789#comment-16578789 ] Eyal Farago commented on SPARK-24410: - [~viirya], my bad :) It seems there are two distinct issues here: one is the general behavior of join/aggregate over unions, the other is the guarantees of bucketed partitioning. Looking more carefully at the results of your query, it seems that the two DFs are not co-partitioned (which is a bit surprising), so my apologies. That said, there's a more general issue with pushing down shuffle-related operations over a union - do you guys think this deserves a separate issue? > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is of a partially aggregated table and daily > increments that we need to further aggregate. we do this by unioning the two > tables and aggregating again. > we tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the tables being bucketed > (like the join operator). 
> for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, count#38L]) > +- Union > :- *(1) Project [key#25L] > : +- *(1) FileScan parquet default.a1[key#25L] Batched: true, > Format: Parquet, Location: > 
InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > +- *(2) Project [key#28L] > +- *(2) FileScan parquet default.a2[key#28L] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
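A toy model (plain Python, not Spark) of why the Exchange is in principle avoidable here: with the same bucketing function and bucket count, key k can only live in bucket hash(k) % n of either table, so matching buckets can be unioned and aggregated locally without moving data. The `bucket` helper below is invented for illustration:

```python
# Toy model of bucketBy(n, "key"): deal (key, value) rows into n hash buckets.
def bucket(rows, n):
    buckets = [[] for _ in range(n)]
    for key, value in rows:
        buckets[hash(key) % n].append((key, value))
    return buckets

N, num_buckets = 10, 3
a1 = bucket([(k, k % 2) for k in range(N)], num_buckets)  # like table a1
a2 = bucket([(k, k % 3) for k in range(N)], num_buckets)  # like table a2

# "union" bucket-wise and count per key within each bucket pair: no shuffle
# needed, because matching buckets of a1 and a2 hold the same key ranges.
counts = {}
for b1, b2 in zip(a1, a2):
    for key, _ in b1 + b2:
        counts[key] = counts.get(key, 0) + 1

print(all(c == 2 for c in counts.values()))  # True: each key once per table
```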
[jira] [Commented] (SPARK-24918) Executor Plugin API
[ https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578767#comment-16578767 ] Marcelo Vanzin commented on SPARK-24918: For reference: this looks kinda similar to SPARK-650. > Executor Plugin API > --- > > Key: SPARK-24918 > URL: https://issues.apache.org/jira/browse/SPARK-24918 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: SPIP, memory-analysis > > It would be nice if we could specify an arbitrary class to run within each > executor for debugging and instrumentation. Its hard to do this currently > because: > a) you have no idea when executors will come and go with DynamicAllocation, > so don't have a chance to run custom code before the first task > b) even with static allocation, you'd have to change the code of your spark > app itself to run a special task to "install" the plugin, which is often > tough in production cases when those maintaining regularly running > applications might not even know how to make changes to the application. > For example, https://github.com/squito/spark-memory could be used in a > debugging context to understand memory use, just by re-running an application > with extra command line arguments (as opposed to rebuilding spark). > I think one tricky part here is just deciding the api, and how its versioned. > Does it just get created when the executor starts, and thats it? Or does it > get more specific events, like task start, task end, etc? Would we ever add > more events? It should definitely be a {{DeveloperApi}}, so breaking > compatibility would be allowed ... but still should be avoided. We could > create a base class that has no-op implementations, or explicitly version > everything. 
> Note that this is not needed in the driver as we already have SparkListeners > (even if you don't care about the SparkListenerEvents and just want to > inspect objects in the JVM, its still good enough). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well
[ https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578766#comment-16578766 ] kevin yu commented on SPARK-25105: -- I will try to fix it. Thanks. Kevin > Importing all of pyspark.sql.functions should bring PandasUDFType in as well > > > Key: SPARK-25105 > URL: https://issues.apache.org/jira/browse/SPARK-25105 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: holdenk >Priority: Trivial > > > {code:java} > >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP) > Traceback (most recent call last): > File "", line 1, in > NameError: name 'PandasUDFType' is not defined > > {code} > When explicitly imported it works fine: > {code:java} > > >>> from pyspark.sql.functions import PandasUDFType > >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP) > {code} > > We just need to make sure it's included in __all__. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
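The {{__all__}} mechanics behind the proposed fix can be demonstrated without Spark. The sketch below builds a throwaway stand-in module (all names here are invented, it is not pyspark.sql.functions) and shows that {{import *}} only brings in what {{__all__}} lists:

```python
import sys
import types

# Stand-in for a module like pyspark.sql.functions: defines two public
# names but exports only one via __all__.
mod = types.ModuleType("fake_functions")
mod.pandas_udf = lambda f, t, kind=None: f
mod.PandasUDFType = type("PandasUDFType", (), {"GROUPED_MAP": 201})
mod.__all__ = ["pandas_udf"]          # PandasUDFType missing, as in this issue
sys.modules["fake_functions"] = mod

ns = {}
exec("from fake_functions import *", ns)
print("pandas_udf" in ns, "PandasUDFType" in ns)   # True False

mod.__all__.append("PandasUDFType")   # the proposed fix: list it in __all__
ns2 = {}
exec("from fake_functions import *", ns2)
print("PandasUDFType" in ns2)                      # True
```

So the fix is a one-line change to the module's {{__all__}} list; no behavior of the names themselves changes.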
[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578763#comment-16578763 ] Liang-Chi Hsieh commented on SPARK-24410: - The above code shows that the two tables in union results are located in logically different partitions, even if you know they might be physically co-partitioned. So we can't just get rid of the shuffle and expect the correct results, because of `SparkContext.union`'s current implementation. That is why cloud-fan suggested implementing Union with RDD.zip for certain cases, to preserve the children's output partitioning. Although we can make Union smarter on its output partitioning, from the discussion you can see we might need to also consider parallelism and locality. > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is of a partially aggregated table and daily > increments that we need to further aggregate. We do this by unioning the two > tables and aggregating again. > We tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the tables being bucketed > (like the join operator). 
> for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > {code} > but for aggregation after union we get a shuffle: > {code} > select key,count(*) from (select * from a1 union all select * from a2)z group > by key > == Physical Plan == > *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, > count(1)#36L]) > +- Exchange hashpartitioning(key#25L, 1) >+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], > output=[key#25L, count#38L]) > +- Union > :- *(1) Project [key#25L] > : +- *(1) FileScan parquet default.a1[key#25L] Batched: true, > Format: Parquet, Location: > 
InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > +- *(2) Project [key#28L] > +- *(2) FileScan parquet default.a2[key#28L] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
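The co-partitioning argument above can be modelled without Spark. In this plain-Python sketch (helper names invented), a naive union of two identically bucketed tables doubles the partition count and loses the key-to-partition mapping, while a zip-style union in the spirit of the RDD.zip suggestion keeps each hash bucket intact:

```python
NUM_BUCKETS = 3

def bucketize(rows, n=NUM_BUCKETS):
    """Toy model of .bucketBy(n, 'key'): hash-partition (key, value) rows."""
    buckets = [[] for _ in range(n)]
    for key, val in rows:
        buckets[key % n].append((key, val))
    return buckets

a1 = bucketize([(k, "a1") for k in range(12)])
a2 = bucketize([(k, "a2") for k in range(12)])

# Naive union: 6 partitions, and partition i no longer determines key % 3,
# so an aggregation by key must shuffle.
naive = a1 + a2

# Zip-style union: still 3 partitions, each holding exactly one hash bucket,
# so a per-key aggregation could run partition-locally with no exchange.
zipped = [p1 + p2 for p1, p2 in zip(a1, a2)]
for i, part in enumerate(zipped):
    assert all(key % NUM_BUCKETS == i for key, _ in part)
print(len(naive), len(zipped))  # 6 3
```

This also illustrates the trade-off mentioned in the comment: the zipped union fixes parallelism at the bucket count, which is why parallelism and locality still need consideration.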
[jira] [Comment Edited] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578752#comment-16578752 ] MIK edited comment on SPARK-25051 at 8/13/18 6:21 PM: -- Thanks [~yumwang], with 2.3.2-rc4 the error is gone now but the result is not correct (getting 0 records):
+---+----+
| id|name|
+---+----+
+---+----+
The sample program should return 2 records:
+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+
was (Author: mik1007): Thanks [~yumwang], with 2.3.2-rc4 the error is gone now but the result is not correct (getting 0 records):
+---+----+
| id|name|
+---+----+
+---+----+
The sample program should return 2 records:
+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+
> where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. 
Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
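Independent of the bug itself, the query's intent, keep the df1 rows that have no match in df2, is a left anti join, which can be modelled in plain Python. The toy rows below are invented for illustration, not the reporter's data:

```python
# Toy frames matching the reported schemas: df1(id, ts), df2(id, name, country)
df1 = [{"id": 1, "ts": 10}, {"id": 2, "ts": 20}, {"id": 3, "ts": 30}]
df2 = [{"id": 2, "name": "two", "country": "x"}]

# left_outer join, then keep rows where the right side produced no match --
# semantically the same as df1.join(df2, Seq("id"), "left_anti") in Spark.
right_ids = {row["id"] for row in df2}
anti = [row for row in df1 if row["id"] not in right_ids]
print([row["id"] for row in anti])  # [1, 3]
```

In Spark itself, rewriting the query with the "left_anti" join type avoids filtering on the ambiguous df2("id") attribute entirely, which sidesteps this class of AnalysisException.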
[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578752#comment-16578752 ] MIK commented on SPARK-25051: - Thanks [~yumwang], with 2.3.2-rc4 the error is gone now but the result is not correct (getting 0 records):
+---+----+
| id|name|
+---+----+
+---+----+
The sample program should return 2 records:
+---+-----+
| id| name|
+---+-----+
|  1|  one|
|  3|three|
+---+-----+
> where clause on dataset gives AnalysisException > --- > > Key: SPARK-25051 > URL: https://issues.apache.org/jira/browse/SPARK-25051 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: MIK >Priority: Major > > *schemas :* > df1 > => id ts > df2 > => id name country > *code:* > val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull) > *error*: > org.apache.spark.sql.AnalysisException: Resolved attribute(s) id#0 missing > from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in > operator !Filter isnull(id#0). Attribute(s) with the same name appear in the > operation: id. 
Please check if the right attribute(s) are used.;; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104) > at > org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178) > at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65) > at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300) > at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458) > at org.apache.spark.sql.Dataset.where(Dataset.scala:1486) > This works fine in spark 2.2.2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-23654: --- Summary: Cut jets3t as a dependency of spark-core (was: Cut jets3t and bouncy castle as dependencies of spark-core) > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superfluous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour of the connectors we're willing to maintain. > JetS3t was wonderful when it came out, but now the amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23654) Cut jets3t and bouncy castle as dependencies of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-23654: --- Summary: Cut jets3t and bouncy castle as dependencies of spark-core (was: Cut jets3t as a dependency of spark-core) > Cut jets3t and bouncy castle as dependencies of spark-core > -- > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superfluous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour of the connectors we're willing to maintain. > JetS3t was wonderful when it came out, but now the amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types
[ https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578719#comment-16578719 ] holdenk commented on SPARK-24735: - So [~bryanc], what do you think of adding an AggregatePythonUDF and using it for grouped_map / grouped_agg so we get treated the correct way by the Scala SQL engine? > Improve exception when mixing up pandas_udf types > - > > Key: SPARK-24735 > URL: https://issues.apache.org/jira/browse/SPARK-24735 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Major > > From the discussion here > https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up > Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = > pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an > exception which is hard to understand. It should tell the user that the UDF > type is wrong. This is the full output: > {code} > >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP) > >>> df.select(foo(df['v'])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", > line 353, in show > print(self._jdf.showString(n, 20, vertical)) > File > "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString. 
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: > (input[0, bigint, false]) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261) > at > org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105) > at scala.Option.getOrElse(Option.scala:121) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
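Whatever the final design, the requested improvement amounts to validating the UDF's eval type at analysis time instead of failing deep in codegen. A minimal sketch of such a fail-fast check; the constants and function names here are invented for illustration, not PySpark's internals:

```python
SQL_SCALAR_PANDAS_UDF = 200       # illustrative eval-type codes
SQL_GROUPED_MAP_PANDAS_UDF = 201

def check_eval_type(eval_type, allowed, context):
    """Fail fast with a readable message instead of surfacing a codegen
    UnsupportedOperationException from the Scala engine."""
    if eval_type not in allowed:
        raise ValueError(
            f"Invalid udf type {eval_type} in {context}: "
            f"expected one of {sorted(allowed)}")

try:
    # Using a GROUPED_MAP udf where only SCALAR is legal, as in the report.
    check_eval_type(SQL_GROUPED_MAP_PANDAS_UDF,
                    {SQL_SCALAR_PANDAS_UDF}, "select()")
except ValueError as err:
    print(err)
```

The key property is that the error names the UDF type and the context, which is exactly what the Py4JJavaError above fails to convey.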
[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types
[ https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578710#comment-16578710 ] holdenk commented on SPARK-24735: - I think we could do better than just improving the exception; if we look at the other aggregates in PySpark, when we call them with select it does the grouping for us:
{code:java}
>>> df.select(sumDistinct(df._1)).show()
+----------------+
|sum(DISTINCT _1)|
+----------------+
|            4950|
+----------------+
{code}
> Improve exception when mixing up pandas_udf types > - > > Key: SPARK-24735 > URL: https://issues.apache.org/jira/browse/SPARK-24735 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Major > > From the discussion here > https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up > Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = > pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an > exception which is hard to understand. It should tell the user that the UDF > type is wrong. This is the full output: > {code} > >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP) > >>> df.select(foo(df['v'])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", > line 353, in show > print(self._jdf.showString(n, 20, vertical)) > File > "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString. 
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: > (input[0, bigint, false]) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261) > at > org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105) > at scala.Option.getOrElse(Option.scala:121) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578709#comment-16578709 ] Liang-Chi Hsieh commented on SPARK-22347: - Agreed. Thanks [~rdblue] > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Assignee: Liang-Chi Hsieh >Priority: Minor > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`, which is the > expected result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
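The pitfall this documentation issue covers is that {{when}} selects among already-computed column values rather than guarding evaluation. The two semantics can be modelled in plain Python; the helper names below are invented for illustration:

```python
def divide10(v):
    return 10 // v    # raises ZeroDivisionError when v == 0

rows = [5, 0]

def when_eager(cond, branch, rows):
    """Model of the reported behavior: the branch is evaluated for every
    row, then the condition picks which value to keep."""
    values = [branch(x) for x in rows]        # blows up on x == 0
    return [v if cond(x) else None for x, v in zip(rows, values)]

def when_lazy(cond, branch, rows):
    """The behavior users expect: evaluate the branch only where cond holds."""
    return [branch(x) if cond(x) else None for x in rows]

print(when_lazy(lambda x: x > 0, divide10, rows))   # [2, None]
try:
    when_eager(lambda x: x > 0, divide10, rows)
except ZeroDivisionError:
    print("eager evaluation raised ZeroDivisionError")
```

The F.lit(False) observation in the report fits this model too: a constant-false condition lets the optimizer prune the branch before it is ever evaluated.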
[jira] [Created] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well
holdenk created SPARK-25105: --- Summary: Importing all of pyspark.sql.functions should bring PandasUDFType in as well Key: SPARK-25105 URL: https://issues.apache.org/jira/browse/SPARK-25105 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: holdenk {code:java} >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP) Traceback (most recent call last): File "", line 1, in NameError: name 'PandasUDFType' is not defined {code} When explicitly imported it works fine: {code:java} >>> from pyspark.sql.functions import PandasUDFType >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP) {code} We just need to make sure it's included in __all__. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24735) Improve exception when mixing up pandas_udf types
[ https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-24735: Summary: Improve exception when mixing up pandas_udf types (was: Improve exception when mixing pandas_udf types) > Improve exception when mixing up pandas_udf types > - > > Key: SPARK-24735 > URL: https://issues.apache.org/jira/browse/SPARK-24735 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Major > > From the discussion here > https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up > Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = > pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an > exception which is hard to understand. It should tell the user that the UDF > type is wrong. This is the full output: > {code} > >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP) > >>> df.select(foo(df['v'])).show() > Traceback (most recent call last): > File "", line 1, in > File > "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", > line 353, in show > print(self._jdf.showString(n, 20, vertical)) > File > "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File > "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString. 
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: > (input[0, bigint, false]) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261) > at > org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108) > at > org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105) > at scala.Option.getOrElse(Option.scala:121) > ... > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25104) Validate user specified output schema
[ https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25104: Assignee: Apache Spark > Validate user specified output schema > - > > Key: SPARK-25104 > URL: https://issues.apache.org/jira/browse/SPARK-25104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > With code changes in > https://github.com/apache/spark/pull/21847, Spark can write out data as per > the user-provided output schema. > To make it more robust and user friendly, we should validate the Avro schema > before tasks are launched. > Also we should support outputting the logical decimal type as BYTES (by > default we output it as FIXED) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25104) Validate user specified output schema
[ https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25104: Assignee: (was: Apache Spark) > Validate user specified output schema > - > > Key: SPARK-25104 > URL: https://issues.apache.org/jira/browse/SPARK-25104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > With code changes in > https://github.com/apache/spark/pull/21847, Spark can write out data as per > the user-provided output schema. > To make it more robust and user friendly, we should validate the Avro schema > before tasks are launched. > Also we should support outputting the logical decimal type as BYTES (by > default we output it as FIXED) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25104) Validate user specified output schema
[ https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578637#comment-16578637 ] Apache Spark commented on SPARK-25104: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22094 > Validate user specified output schema > - > > Key: SPARK-25104 > URL: https://issues.apache.org/jira/browse/SPARK-25104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Major > > With code changes in > https://github.com/apache/spark/pull/21847, Spark can write out data as per > the user-provided output schema. > To make it more robust and user friendly, we should validate the Avro schema > before tasks are launched. > Also we should support outputting the logical decimal type as BYTES (by > default we output it as FIXED) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
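The "validate before tasks launch" idea above reduces to a driver-side compatibility check between the user's schema and the data's schema. A toy sketch of the shape of such a check; the dict-based schema representation and field names are invented, not Spark's Avro code path:

```python
def validate_schema(user_schema, data_schema):
    """Return a list of problems so the job can fail on the driver,
    before any task is launched."""
    errors = []
    for field, ftype in user_schema.items():
        if field not in data_schema:
            errors.append(f"unknown field: {field}")
        elif data_schema[field] != ftype:
            errors.append(
                f"incompatible type for {field}: "
                f"data is {data_schema[field]}, schema says {ftype}")
    return errors

data_schema = {"id": "long", "name": "string"}
user_schema = {"id": "long", "name": "bytes", "extra": "int"}
for problem in validate_schema(user_schema, data_schema):
    print(problem)
```

Failing here, rather than inside a running task, is what makes the feature "robust and user friendly": the user sees one clear schema error instead of per-task failures.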
[jira] [Commented] (SPARK-24736) --py-files not functional for non local URLs. It appears to pass non-local URL's into PYTHONPATH directly.
[ https://issues.apache.org/jira/browse/SPARK-24736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578624#comment-16578624 ] holdenk commented on SPARK-24736: - cc [~ifilonenko] > --py-files not functional for non local URLs. It appears to pass non-local > URL's into PYTHONPATH directly. > -- > > Key: SPARK-24736 > URL: https://issues.apache.org/jira/browse/SPARK-24736 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 > Environment: Recent 2.4.0 from master branch, submitted on Linux to a > KOPS Kubernetes cluster created on AWS. > >Reporter: Jonathan A Weaver >Priority: Minor > > My spark-submit > bin/spark-submit \ > --master > k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/] > \ > --deploy-mode cluster \ > --name pytest \ > --conf > spark.kubernetes.container.image=[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest] > \ > --conf > [spark.kubernetes.driver.pod.name|http://spark.kubernetes.driver.pod.name/]=spark-pi-driver > \ > --conf > spark.kubernetes.authenticate.submission.caCertFile=[cluster.ca|http://cluster.ca/] > \ > --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \ > --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \ > --py-files "[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]; \ > [https://s3.amazonaws.com/maxar-ids-fids/it.py] > > *screw.zip is successfully downloaded and placed in SparkFIles.getRootPath()* > 2018-07-01 07:33:43 INFO SparkContext:54 - Added file > [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] at > [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] with timestamp > 1530430423297 > 2018-07-01 07:33:43 INFO Utils:54 - Fetching > [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] to > 
/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp > *I print out the PYTHONPATH and PYSPARK_FILES environment variables from the > driver script:* > PYTHONPATH > /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:*[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]* > PYSPARK_FILES [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] > > *I print out sys.path* > ['/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', > u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240', > '/opt/spark/python/lib/pyspark.zip', > '/opt/spark/python/lib/py4j-0.10.7-src.zip', > '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', > '/opt/spark/python/lib/py4j-*.zip', *'/opt/spark/work-dir/https', > '//[s3.amazonaws.com/maxar-ids-fids/screw.zip|http://s3.amazonaws.com/maxar-ids-fids/screw.zip]',* > '/usr/lib/python27.zip', '/usr/lib/python2.7', > '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', > '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', > '/usr/lib/python2.7/site-packages'] > > *URL from PYTHONFILES gets placed in sys.path verbatim with obvious results.* > > *Dump of spark config from container.* > Spark config dumped: > [(u'spark.master', > u'k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]'), > (u'spark.kubernetes.authenticate.submission.oauthToken', > u''), > (u'spark.kubernetes.authenticate.driver.oauthToken', > u''), (u'spark.kubernetes.executor.podNamePrefix', > u'pytest-1530430411996'), (u'spark.kubernetes.memoryOverheadFactor', u'0.4'), > 
(u'spark.driver.blockManager.port', u'7079'), > (u'[spark.app.id|http://spark.app.id/]', u'spark-application-1530430424433'), > (u'[spark.app.name|http://spark.app.name/]', u'pytest'), > (u'[spark.executor.id|http://spark.executor.id/]', u'driver'), > (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'), > (u'spark.kubernetes.container.image', > u'[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest'|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest']), > (u'spark.driver.port', u'7078'), > (u'spark.kubernetes.python.mainAppResource', > u'[https://s3.amazonaws.com/maxar-ids-fids/it.py']), >
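The mangled sys.path entries in the report above ('/opt/spark/work-dir/https' and '//s3.amazonaws.com/maxar-ids-fids/screw.zip') are what you get when a remote URL is appended verbatim to a colon-separated PYTHONPATH, which this sketch reproduces (the bucket URL here is illustrative, not the reporter's):

```python
# PYTHONPATH is colon-separated on Linux, so a URL appended verbatim is
# split at its "https:" scheme into two bogus path entries.
pythonpath = ("/opt/spark/python/lib/pyspark.zip:"
              "https://s3.amazonaws.com/some-bucket/screw.zip")
entries = pythonpath.split(":")
print(entries)
# ['/opt/spark/python/lib/pyspark.zip', 'https', '//s3.amazonaws.com/some-bucket/screw.zip']
# The relative leftover 'https' is presumably resolved against the working
# directory later, yielding the '/opt/spark/work-dir/https' entry seen above.
```

So the fix has to substitute the local download location (SparkFiles root) into PYTHONPATH, never the remote URL itself.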
[jira] [Created] (SPARK-25104) Validate user specified output schema
Gengliang Wang created SPARK-25104: -- Summary: Validate user specified output schema Key: SPARK-25104 URL: https://issues.apache.org/jira/browse/SPARK-25104 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Gengliang Wang With the code changes in https://github.com/apache/spark/pull/21847, Spark can write out data as per a user-provided output schema. To make this more robust and user friendly, we should validate the Avro schema before tasks are launched. We should also support outputting the logical decimal type as BYTES (by default it is output as FIXED). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
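The "validate before tasks are launched" idea in SPARK-25104 amounts to a driver-side fail-fast check. A hypothetical sketch of the shape of such a check (pure Python; the function name and field-set representation are illustrative, not Spark's or Avro's actual API):

```python
# Hypothetical fail-fast validation: reject a user-specified output schema
# that references fields the data does not have, on the driver, instead of
# failing mid-write on the executors.

def validate_output_schema(data_fields, user_schema_fields):
    """Raise early if the user schema names fields absent from the data."""
    missing = [f for f in user_schema_fields if f not in data_fields]
    if missing:
        raise ValueError("Output schema references unknown fields: %s" % missing)

# Valid: user schema is a subset of the data's fields.
validate_output_schema({"id", "name", "price"}, ["id", "price"])

# Invalid: caught before any task would be launched.
caught = False
try:
    validate_output_schema({"id", "name"}, ["id", "total"])
except ValueError:
    caught = True
```

The real check would compare Catalyst and Avro types as well, but the point is the same: surface the mismatch before distributed work starts.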
[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-23874: - Description: Version 0.10.0 will allow for the following improvements and bug fixes: * Allow for adding BinaryType support ARROW-2141 * Bug fix related to array serialization ARROW-1973 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 * Python bytearrays are supported as input to pyarrow ARROW-2141 * Java has a common interface for reset to clean up complex vectors in Spark ArrowWriter ARROW-1962 * Cleanup pyarrow type equality checks ARROW-2423 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645 * Improved low-level handling of messages for RecordBatch ARROW-2704 was: Version 0.10.0 will allow for the following improvements and bug fixes: * Allow for adding BinaryType support SPARK-23555 * Bug fix related to array serialization ARROW-1973 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 * Python bytearrays are supported as input to pyarrow ARROW-2141 * Java has a common interface for reset to clean up complex vectors in Spark ArrowWriter ARROW-1962 * Cleanup pyarrow type equality checks ARROW-2423 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645 * Improved low-level handling of messages for RecordBatch ARROW-2704 > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support ARROW-2141 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow 
ARROW-2141 > * Java has a common interface for reset to clean up complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low-level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-22347: Fix Version/s: (was: 2.3.0) > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Assignee: Liang-Chi Hsieh >Priority: Minor > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`, which is the > expected result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22347. - Resolution: Won't Fix > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.3.0 > > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`, which is the > expected result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-22347: - > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.3.0 > > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`, which is the > expected result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578566#comment-16578566 ] Wenchen Fan commented on SPARK-22347: - we changed our mind during code review and this JIRA is no longer valid, we should mark it as won't fix. [~rdblue] thanks for pointing it out! > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.3.0 > > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`, which is the > expected result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
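The behavior discussed in this thread can be reproduced without Spark. A minimal pure-Python sketch that mimics (but is not) Spark's batch evaluation of Python UDFs: the UDF is computed for the whole batch before the `when` condition selects values, so rows the condition would exclude still reach the UDF.

```python
# Mimics how Spark extracts a Python UDF and evaluates it over the whole
# batch *before* the when/CASE expression picks values (illustrative only).

def divide10(value):
    return 10 // int(value)   # fails for value == 0, like the repro above

rows = [5, 0]

batch_eval_failed = False
try:
    # Batch evaluation: the UDF runs for every row, including x == 0,
    # so the division by zero happens even though 'when(x > 0, ...)'
    # would discard that row's result.
    udf_column = [divide10(x) for x in rows]
except ZeroDivisionError:
    batch_eval_failed = True

# Short-circuit evaluation (what the reporter expected) never calls the
# UDF for x <= 0:
short_circuit = [divide10(x) if x > 0 else None for x in rows]
```

This is why the issue was ultimately resolved as documentation rather than a behavior change: guaranteed short-circuiting of conditional branches is not part of the contract for Python UDFs.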
[jira] [Resolved] (SPARK-25060) PySpark UDF in case statement is always run
[ https://issues.apache.org/jira/browse/SPARK-25060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Blue resolved SPARK-25060. --- Resolution: Won't Fix I'm closing this issue as "Won't Fix", the same as the issue this duplicates, SPARK-22347. > PySpark UDF in case statement is always run > --- > > Key: SPARK-25060 > URL: https://issues.apache.org/jira/browse/SPARK-25060 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Ryan Blue >Priority: Major > > When evaluating a case statement with a python UDF, Spark will always run the > UDF even if the case doesn't use the branch with the UDF call. Here's a repro > case: > {code:lang=python} > from pyspark.sql.types import StringType > def fail_if_x(s): > assert s != 'x' > return s > spark.udf.register("fail_if_x", fail_if_x, StringType()) > df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'str']) > df.registerTempTable("data") > spark.sql("select id, case when str <> 'x' then fail_if_x(str) else null end > from data").show() > {code} > This produces the following error: > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py", > line 189, in main > process() > File > "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py", > line 184, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py", > line 104, in > func = lambda _, it: map(mapper, it) > File "", line 1, in > File > "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py", > 
line 71, in > return lambda *a: f(*a) > File "", line 4, in fail_if_x > AssertionError > {code} > This is because Python UDFs are extracted from expressions and run in the > BatchEvalPython node inserted as the child of the expression node: > {code} > == Physical Plan == > CollectLimit 21 > +- *Project [id#0L, CASE WHEN NOT (str#1 = x) THEN pythonUDF0#14 ELSE null > END AS CASE WHEN (NOT (str = x)) THEN fail_if_x(str) ELSE CAST(NULL AS > STRING) END#6] >+- BatchEvalPython [fail_if_x(str#1)], [id#0L, str#1, pythonUDF0#14] > +- Scan ExistingRDD[id#0L,str#1] > {code} > This doesn't affect correctness, but the behavior doesn't match the Scala API > where case can be used to avoid passing data that will cause a UDF to fail > into the UDF. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
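Given the "Won't Fix" resolution, the practical mitigation is to make the UDF itself tolerate the inputs the CASE branch was meant to filter out. A pure-Python sketch of that workaround (illustrative; not Spark API, names taken from the repro above):

```python
# Since Spark's BatchEvalPython may run the UDF on every row before the
# CASE expression selects values, the UDF must not assume the condition
# already filtered its input.

def fail_if_x(s):
    assert s != 'x'           # strict UDF from the repro: fails on 'x'
    return s

def null_safe_fail_if_x(s):
    # Guard inside the UDF: return None for the value the CASE branch
    # was supposed to exclude, instead of asserting.
    return None if s == 'x' else s

rows = ['x', 'y']

# Batch evaluation of the strict UDF over all rows fails, as in the issue.
failed = False
try:
    [fail_if_x(s) for s in rows]
except AssertionError:
    failed = True

# The null-safe variant survives batch evaluation and still produces the
# intended CASE result: NULL for 'x', the value otherwise.
result = [null_safe_fail_if_x(s) for s in rows]
```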
[jira] [Commented] (SPARK-25060) PySpark UDF in case statement is always run
[ https://issues.apache.org/jira/browse/SPARK-25060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578557#comment-16578557 ] Ryan Blue commented on SPARK-25060: --- Thanks, [~hyukjin.kwon], you're right that this is a duplicate. I've closed it. > PySpark UDF in case statement is always run > --- > > Key: SPARK-25060 > URL: https://issues.apache.org/jira/browse/SPARK-25060 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Ryan Blue >Priority: Major > > When evaluating a case statement with a python UDF, Spark will always run the > UDF even if the case doesn't use the branch with the UDF call. Here's a repro > case: > {code:lang=python} > from pyspark.sql.types import StringType > def fail_if_x(s): > assert s != 'x' > return s > spark.udf.register("fail_if_x", fail_if_x, StringType()) > df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'str']) > df.registerTempTable("data") > spark.sql("select id, case when str <> 'x' then fail_if_x(str) else null end > from data").show() > {code} > This produces the following error: > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py", > line 189, in main > process() > File > "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py", > line 184, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py", > line 104, in > func = lambda _, it: map(mapper, it) > File "", line 1, in > File > "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py", > 
line 71, in > return lambda *a: f(*a) > File "", line 4, in fail_if_x > AssertionError > {code} > This is because Python UDFs are extracted from expressions and run in the > BatchEvalPython node inserted as the child of the expression node: > {code} > == Physical Plan == > CollectLimit 21 > +- *Project [id#0L, CASE WHEN NOT (str#1 = x) THEN pythonUDF0#14 ELSE null > END AS CASE WHEN (NOT (str = x)) THEN fail_if_x(str) ELSE CAST(NULL AS > STRING) END#6] >+- BatchEvalPython [fail_if_x(str#1)], [id#0L, str#1, pythonUDF0#14] > +- Scan ExistingRDD[id#0L,str#1] > {code} > This doesn't affect correctness, but the behavior doesn't match the Scala API > where case can be used to avoid passing data that will cause a UDF to fail > into the UDF. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false
[ https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578549#comment-16578549 ] Ryan Blue commented on SPARK-22347: --- [~viirya], [~cloud_fan]: Is there any objection to changing the resolution of this issue to "Won't Fix" instead of "Fixed"? Just documenting the behavior is not a fix. If I don't hear anything in the next day or so, I'll update it. > UDF is evaluated when 'F.when' condition is false > - > > Key: SPARK-22347 > URL: https://issues.apache.org/jira/browse/SPARK-22347 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Nicolas Porter >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.3.0 > > > Here's a simple example on how to reproduce this: > {code} > from pyspark.sql import functions as F, Row, types > def Divide10(): > def fn(value): return 10 / int(value) > return F.udf(fn, types.IntegerType()) > df = sc.parallelize([Row(x=5), Row(x=0)]).toDF() > x = F.col('x') > df2 = df.select(F.when((x > 0), Divide10()(x))) > df2.show(200) > {code} > This raises a division by zero error, even if `F.when` is trying to filter > out all cases where `x <= 0`. I believe the correct behavior should be not to > evaluate the UDF when the `F.when` condition is false. > Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, > then the error is not raised and all rows resolve to `null`, which is the > expected result. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0
[ https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-23874: - Description: Version 0.10.0 will allow for the following improvements and bug fixes: * Allow for adding BinaryType support SPARK-23555 * Bug fix related to array serialization ARROW-1973 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 * Python bytearrays are supported as input to pyarrow ARROW-2141 * Java has a common interface for reset to clean up complex vectors in Spark ArrowWriter ARROW-1962 * Cleanup pyarrow type equality checks ARROW-2423 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645 * Improved low-level handling of messages for RecordBatch ARROW-2704 was: Version 0.10.0 will allow for the following improvements and bug fixes: * Allow for adding BinaryType support * Bug fix related to array serialization ARROW-1973 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 * Python bytearrays are supported as input to pyarrow ARROW-2141 * Java has a common interface for reset to clean up complex vectors in Spark ArrowWriter ARROW-1962 * Cleanup pyarrow type equality checks ARROW-2423 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, ARROW-2645 * Improved low-level handling of messages for RecordBatch ARROW-2704 > Upgrade apache/arrow to 0.10.0 > -- > > Key: SPARK-23874 > URL: https://issues.apache.org/jira/browse/SPARK-23874 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Bryan Cutler >Priority: Major > > Version 0.10.0 will allow for the following improvements and bug fixes: > * Allow for adding BinaryType support SPARK-23555 > * Bug fix related to array serialization ARROW-1973 > * Python2 str will be made into an Arrow string instead of bytes ARROW-2101 > * Python bytearrays are supported as input to pyarrow ARROW-2141 > * 
Java has a common interface for reset to clean up complex vectors in Spark > ArrowWriter ARROW-1962 > * Cleanup pyarrow type equality checks ARROW-2423 > * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, > ARROW-2645 > * Improved low-level handling of messages for RecordBatch ARROW-2704 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25103) CompletionIterator may delay GC of completed resources
[ https://issues.apache.org/jira/browse/SPARK-25103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578519#comment-16578519 ] Eyal Farago commented on SPARK-25103: - CC: [~cloud_fan], [~hvanhovell] > CompletionIterator may delay GC of completed resources > -- > > Key: SPARK-25103 > URL: https://issues.apache.org/jira/browse/SPARK-25103 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0, 2.2.0, 2.3.0 >Reporter: Eyal Farago >Priority: Major > > while working on SPARK-22713, I found (and partially fixed) a scenario in > which an iterator is already exhausted but still holds a reference to some > resources that could be GCed at this point. > However, these resources cannot be GCed because of this reference. > the specific fix applied in SPARK-22713 was to wrap the iterator with a > CompletionIterator that cleans it up when exhausted; the thing is that it's quite > easy to get this wrong by closing over local variables or the _this_ reference in > the cleanup function itself. > I propose solving this by modifying CompletionIterator to discard references > to the wrapped iterator and cleanup function once exhausted. > > * a dive into the code showed that most CompletionIterators are eventually > used by > {code:java} > org.apache.spark.scheduler.ShuffleMapTask#runTask{code} > which does: > {code:java} > writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: > Product2[Any, Any]]]){code} > looking at > {code:java} > org.apache.spark.shuffle.ShuffleWriter#write{code} > implementations, it seems all of them first exhaust the iterator and then > perform some kind of post-processing: i.e. merging spills, sorting, writing > partition files and then concatenating them into a single file... bottom > line: the Iterator may actually be 'sitting' for some time after being > exhausted. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25103) CompletionIterator may delay GC of completed resources
Eyal Farago created SPARK-25103: --- Summary: CompletionIterator may delay GC of completed resources Key: SPARK-25103 URL: https://issues.apache.org/jira/browse/SPARK-25103 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.1 Reporter: Eyal Farago while working on SPARK-22713, I found (and partially fixed) a scenario in which an iterator is already exhausted but still holds a reference to some resources that could be GCed at this point. However, these resources cannot be GCed because of this reference. the specific fix applied in SPARK-22713 was to wrap the iterator with a CompletionIterator that cleans it up when exhausted; the thing is that it's quite easy to get this wrong by closing over local variables or the _this_ reference in the cleanup function itself. I propose solving this by modifying CompletionIterator to discard references to the wrapped iterator and cleanup function once exhausted. * a dive into the code showed that most CompletionIterators are eventually used by {code:java} org.apache.spark.scheduler.ShuffleMapTask#runTask{code} which does: {code:java} writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]]){code} looking at {code:java} org.apache.spark.shuffle.ShuffleWriter#write{code} implementations, it seems all of them first exhaust the iterator and then perform some kind of post-processing: i.e. merging spills, sorting, writing partition files and then concatenating them into a single file... bottom line: the Iterator may actually be 'sitting' for some time after being exhausted. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
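The fix proposed in SPARK-25103 can be sketched in a few lines of Python. This is an illustrative analogue of Scala's CompletionIterator, not the actual Spark class: once exhausted, it runs the completion function and then drops its references to both the wrapped iterator and the closure, so the underlying resources become collectable even while the wrapper itself stays reachable (e.g. during a shuffle writer's post-processing).

```python
class CompletionIterator:
    """Iterator wrapper that runs a cleanup function once exhausted and
    then discards references to both the sub-iterator and the cleanup
    closure, so neither pins resources after completion."""

    def __init__(self, sub, completion_fn):
        self._sub = sub
        self._completion_fn = completion_fn

    def __iter__(self):
        return self

    def __next__(self):
        if self._sub is None:          # already completed
            raise StopIteration
        try:
            return next(self._sub)
        except StopIteration:
            self._completion_fn()
            # Discard references: the sub-iterator and the closure (and
            # anything they captured) can now be garbage-collected even
            # if this wrapper is still held by a long-lived caller.
            self._sub = None
            self._completion_fn = None
            raise

cleaned = []
it = CompletionIterator(iter([1, 2, 3]), lambda: cleaned.append(True))
assert list(it) == [1, 2, 3]
assert cleaned == [True]
assert it._sub is None and it._completion_fn is None
```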
[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables
[ https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578489#comment-16578489 ] Eyal Farago commented on SPARK-24410: - [~viirya], I think your conclusion about co-partitioning is wrong; the following code segment from your comment: {code:java} val df1 = spark.table("a1").select(spark_partition_id(), $"key") val df2 = spark.table("a2").select(spark_partition_id(), $"key") df1.union(df2).select(spark_partition_id(), $"key").collect {code} prints the partition ids as assigned by union. Assuming union simply concatenates the partitions from df1 and df2, assigning them a running-number id, it makes sense that you'd get two partitions per key: one coming from df1 and the other from df2. Applying this select on each dataframe separately, you'd get the exact same results, meaning a given key has the same partition id in both dataframes. I think this code fragment basically shows what's wrong with the current implementation of Union, not that we can't optimize unions of co-partitioned relations. If union were a bit more 'partitioning aware', it'd be able to identify that both children have the same partitioning scheme and 'inherit' it. As you actually showed, this might be a bit tricky since the same logical attribute from different children has a different expression id, but Union eventually maps these child attributes into a single output attribute, so this information can be used to resolve the partitioning columns and determine their equality. Furthermore, Union being smarter about its output partitioning won't cut it on its own; a few rules have to be added/modified: 1. applying an exchange on a union should sometimes be pushed to the children (children can be partitioned into those supporting the required partitioning and those not; the exchange can be applied to a union of the non-supporting children and then unioned with the rest of the children) 2. 
partial aggregate also has to be pushed to the children, resulting in a union of partial aggregations; again, it's possible to partition children according to their support of the required partitioning. 3. final aggregation over a union introduces an exchange which will then be pushed to the children; the aggregation is then applied on top of the partitioning-aware union (think of the way PartitionerAwareUnionRDD handles partitioning). * partition children = partitioning an array by a predicate (scala.collection.TraversableLike#partition) * other operators like join may require additional rules. * some of these ideas were discussed offline with [~hvanhovell] > Missing optimization for Union on bucketed tables > - > > Key: SPARK-24410 > URL: https://issues.apache.org/jira/browse/SPARK-24410 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ohad Raviv >Priority: Major > > A common use-case we have is a partially aggregated table and daily > increments that we need to further aggregate. we do this by unioning the two > tables and aggregating again. > we tried to optimize this process by bucketing the tables, but currently it > seems that the union operator doesn't leverage the tables being bucketed > (like the join operator). 
> for example, for two bucketed tables a1,a2: > {code} > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write > .mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a1") > sparkSession.range(N).selectExpr( > "id as key", > "id % 2 as t1", > "id % 3 as t2") > .repartition(col("key")) > .write.mode(SaveMode.Overwrite) > .bucketBy(3, "key") > .sortBy("t1") > .saveAsTable("a2") > {code} > for the join query we get the "SortMergeJoin" > {code} > select * from a1 join a2 on (a1.key=a2.key) > == Physical Plan == > *(3) SortMergeJoin [key#24L], [key#27L], Inner > :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0 > : +- *(1) Project [key#24L, t1#25L, t2#26L] > : +- *(1) Filter isnotnull(key#24L) > :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], > PushedFilters: [IsNotNull(key)], ReadSchema: > struct > +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0 >+- *(2) Project [key#27L, t1#28L, t2#29L] > +- *(2) Filter isnotnull(key#27L) > +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: > true, Format: Parquet, Location: > InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], > PushedFilters:
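The partitioning-aware union argued for in the comment above can be illustrated without Spark. A pure-Python sketch (plain lists standing in for partitions; not Spark's Union implementation) contrasting the current concatenating behavior with a union that preserves co-partitioning:

```python
# Two "bucketed" datasets represented as lists of partitions, where the
# same key always lands in the same partition index in both inputs.

def naive_union(parts_a, parts_b):
    # Current Union behavior: lay all partitions side by side with running
    # ids. A given key now lives in two different output partitions, so
    # downstream aggregation needs an exchange.
    return parts_a + parts_b

def partition_aware_union(parts_a, parts_b):
    # Co-partitioned inputs: merge matching partitions pairwise (compare
    # PartitionerAwareUnionRDD). The output keeps one partition per bucket
    # and keys stay co-located, so the bucketing is preserved.
    assert len(parts_a) == len(parts_b), "inputs must be co-partitioned"
    return [pa + pb for pa, pb in zip(parts_a, parts_b)]

a = [[("k1", 1)], [("k2", 2)]]    # 2 buckets keyed by "key"
b = [[("k1", 10)], [("k2", 20)]]  # same bucketing scheme

assert len(naive_union(a, b)) == 4          # key k1 appears in 2 partitions
merged = partition_aware_union(a, b)
assert merged == [[("k1", 1), ("k1", 10)], [("k2", 2), ("k2", 20)]]
```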