[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369221#comment-16369221 ] Igor Berman edited comment on SPARK-23423 at 2/20/18 7:33 AM: -- [~skonto] so I've run the application today with the relevant logs at debug level (previously I just had problems with the loggers: they were reconfigured dynamically, so I hadn't seen the TASK_KILLED reports). It seems that with dynamic allocation on, with executors starting and shutting down, the chances that every slave will get 2 failures starting some executor are much higher than in the regular case (without dynamic allocation). SPARK-19755 seems to be the core issue here: after half a day of a long-running driver in client mode, almost 1/3 of all mesos slaves could be marked as blacklisted. The reasons for the executor failures might be different and transient (e.g. port collision). I think I'll close this Jira as a duplicate of SPARK-19755, WDYT? Here is just one example: out of 74 mesos slaves, 16 are already blacklisted {code:java} grep "Blacklisting Mesos slave" /var/log/mycomp/spark-myapp.log | wc -l 16{code} was (Author: igor.berman): [~skonto] so I've run the application today with the relevant logs at debug level. It seems that with dynamic allocation on, with executors starting and shutting down, the chances that every slave will get 2 failures starting some executor are much higher than in the regular case (without dynamic allocation). SPARK-19755 seems to be the core issue here: after half a day of a long-running driver in client mode, almost 1/3 of all mesos slaves could be marked as blacklisted. The reasons for the executor failures might be different and transient (e.g. port collision). I think I'll close this Jira as a duplicate of SPARK-19755, WDYT? 
Here is just one example: out of 74 mesos slaves, 16 are already blacklisted {code:java} grep "Blacklisting Mesos slave" /var/log/mycomp/spark-myapp.log | wc -l 16{code} > Application declines any offers when killed+active executors reach > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > Labels: Mesos, dynamic_allocation > > Hi > Mesos Version: 1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting the number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have a long-running driver that has a cyclic pattern of resource > consumption (with some idle times in between); due to dynamic allocation it > receives offers and then releases them after the current chunk of work is processed. > At > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit, where > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves > holds all slaves ever "met", i.e. 
both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, the number of taskIds should be updated via statusUpdate, > but suppose this update is lost (actually I don't see any 'is now > TASK_KILLED' log entries), so this number of executors might be wrong. > > I've created a test that "reproduces" this behavior; not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: Don't set maxExecutors with dynamicAllocation on > > Please advise > Igor > marking you since you were last to touch this piece of code and > probably can advise something ([~vanzin],
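The accounting problem described above can be sketched in a few lines of plain Python. This is a toy model, not Spark's actual Mesos backend; all names are illustrative. The point is that `slaves` keeps every slave ever seen, so a lost TASK_KILLED status update leaves a phantom task in the count forever, and every subsequent offer is declined.

```python
# Toy model of the executor accounting described in this issue.

class Slave:
    def __init__(self):
        self.task_ids = set()

class Backend:
    def __init__(self, executor_limit):
        self.executor_limit = executor_limit
        self.slaves = {}  # never pruned: holds active and killed slaves alike

    def num_executors(self):
        # mirrors slaves.values.map(_.taskIDs.size).sum
        return sum(len(s.task_ids) for s in self.slaves.values())

    def can_accept_offer(self):
        return self.num_executors() < self.executor_limit

    def launch(self, slave_id, task_id):
        self.slaves.setdefault(slave_id, Slave()).task_ids.add(task_id)

    def status_update_killed(self, slave_id, task_id):
        # only an actually-delivered TASK_KILLED update shrinks the count
        self.slaves[slave_id].task_ids.discard(task_id)

backend = Backend(executor_limit=1)
backend.launch("s1", "0")
# the driver kills executor "0", but the TASK_KILLED update never arrives:
print(backend.can_accept_offer())   # False -> every new offer is declined
backend.status_update_killed("s1", "0")
print(backend.can_accept_offer())   # True -> offers accepted again
```

This mirrors the test quoted above: after `doKillExecutors` the second offer is declined because the killed task is still counted.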
[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369750#comment-16369750 ] Manan Bakshi commented on SPARK-23463: -- Hi Marco, I realized that if you use the following code for the same sample: rdd = sc.textFile("sample").map(lambda x: x.split(", ")) df = rdd.toDF(["dev", "val"]) df = df.filter(df["val"] > 0) the filter condition actually only filters out values less than 1, even though it should only be filtering out values not greater than 0. This is really weird. > Filter operation fails to handle blank values and evicts rows that even > satisfy the filtering condition > --- > > Key: SPARK-23463 > URL: https://issues.apache.org/jira/browse/SPARK-23463 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Manan Bakshi >Priority: Critical > Attachments: sample > > > Filter operations were updated in Spark 2.2.0. The Cost Based Optimizer was > introduced to look at the table stats and decide filter selectivity. However, > since then, filter has behaved unexpectedly for blank values. The > operation not only drops rows with blank values but also filters out > rows that actually meet the filter criteria. > Steps to repro > Consider a simple dataframe with some blank values as below: > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL| | > |ALL|2.5| > |ALL|4.5| > |ALL|45| > Running a simple filter operation over the val column in this dataframe yields > unexpected results. For example, the following query returned an empty dataframe: > df.filter(df["val"] > 0) > ||dev||val|| > However, the filter operation works as expected if the 0 in the filter condition is > replaced by the float 0.0 > df.filter(df["val"] > 0.0) > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL|2.5| > |ALL|4.5| > |ALL|45| > > Note that this bug only exists in Spark 2.2.0 and later. 
The previous > versions filter as expected for both int (0) and float (0.0) values in the > filter condition. > Also, if there are no blank values, the filter operation works as expected > for all versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
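One plausible way to picture the reported behavior is with plain Python. This is an illustration of SQL-style "failed cast becomes NULL" semantics, not a claim about Spark's actual type-coercion rules: if the string column were cast to an integer for the `> 0` comparison, every non-integral string, including the blank, would become NULL and be dropped, while a float cast keeps them.

```python
# Plain-Python illustration (not Spark) of SQL-style casts, where a
# failed cast yields NULL and NULL never satisfies a comparison.

def sql_cast_int(s):
    try:
        return int(s)      # "0.01", "2.5", "" all fail -> NULL
    except ValueError:
        return None

def sql_cast_float(s):
    try:
        return float(s)    # only the blank "" fails -> NULL
    except ValueError:
        return None

vals = ["0.01", "0.02", "0.004", "", "2.5", "4.5", "45"]

kept_int = [v for v in vals
            if (c := sql_cast_int(v)) is not None and c > 0]
kept_float = [v for v in vals
              if (c := sql_cast_float(v)) is not None and c > 0.0]

print(kept_int)    # ['45']  -- most rows silently dropped
print(kept_float)  # ['0.01', '0.02', '0.004', '2.5', '4.5', '45']
```

Under these assumed semantics the integer comparison drops far more rows than intended, matching the shape of the reported symptom, while the `0.0` comparison behaves as the reporter expects.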
[jira] [Assigned] (SPARK-23436) Incorrect Date column Inference in partition discovery
[ https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23436: --- Assignee: Marco Gaido > Incorrect Date column Inference in partition discovery > -- > > Key: SPARK-23436 > URL: https://issues.apache.org/jira/browse/SPARK-23436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Apoorva Sareen >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > If a partition column holds a partial date/timestamp, for example 2018-01-01-23 > (truncated to the hour), then the data type of the partitioning column is > automatically inferred as date; however, the values are loaded as null. > Here is an example to reproduce this behaviour: > {code:java} > val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", > "date_month", "data_hour", "data") > data.write.partitionBy("id","date_month","data_hour").parquet("output/test") > val input = spark.read.parquet("output/test") > input.printSchema() > input.show() > ## Result ### > root > |-- data: string (nullable = true) > |-- id: integer (nullable = true) > |-- date_month: string (nullable = true) > |-- data_hour: date (nullable = true) > ++---+--+-+ > |data| id|date_month|data_hour| > ++---+--+-+ > |test| 1| 2018-01| null| > ++---+--+-+{code} >
[jira] [Resolved] (SPARK-23436) Incorrect Date column Inference in partition discovery
[ https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23436. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20621 [https://github.com/apache/spark/pull/20621] > Incorrect Date column Inference in partition discovery > -- > > Key: SPARK-23436 > URL: https://issues.apache.org/jira/browse/SPARK-23436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Apoorva Sareen >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > If a partition column holds a partial date/timestamp, for example 2018-01-01-23 > (truncated to the hour), then the data type of the partitioning column is > automatically inferred as date; however, the values are loaded as null. > Here is an example to reproduce this behaviour: > {code:java} > val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", > "date_month", "data_hour", "data") > data.write.partitionBy("id","date_month","data_hour").parquet("output/test") > val input = spark.read.parquet("output/test") > input.printSchema() > input.show() > ## Result ### > root > |-- data: string (nullable = true) > |-- id: integer (nullable = true) > |-- date_month: string (nullable = true) > |-- data_hour: date (nullable = true) > ++---+--+-+ > |data| id|date_month|data_hour| > ++---+--+-+ > |test| 1| 2018-01| null| > ++---+--+-+{code} >
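The inference-vs-read mismatch can be mimicked in plain Python. This is a rough analogue only: Spark's actual code infers partition types with Java date parsing, and the fix referenced above effectively requires the parsed value to account for the whole string. The sketch shows a lenient parser "accepting" the truncated value at inference time while the strict read-time parse fails.

```python
from datetime import datetime

def lenient_parse_date(s):
    """Forgiving parser: accepts any string whose first 10 characters
    look like yyyy-MM-dd -- roughly how a lenient parser can accept the
    truncated partition value during type inference."""
    try:
        return datetime.strptime(s[:10], "%Y-%m-%d").date()
    except ValueError:
        return None

def strict_parse_date(s):
    """Strict parser: the whole string must be a date -- roughly the
    conversion applied when the partition value is actually read back."""
    try:
        return datetime.strptime(s, "%Y-%m-%d").date()
    except ValueError:
        return None

value = "2018-01-01-04"
inferred = lenient_parse_date(value)   # parses -> column typed as date
loaded = strict_parse_date(value)      # fails  -> value read back as null
print(inferred, loaded)

# Round-trip guard in the spirit of the fix: only keep the date type if
# formatting the parsed value reproduces the original partition string.
is_safe_date = inferred is not None and inferred.isoformat() == value
print(is_safe_date)  # False -> fall back to string, the values survive
```

With the round-trip guard, `data_hour=2018-01-01-04` stays a string column and `input.show()` no longer reports null.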
[jira] [Resolved] (SPARK-23457) Register task completion listeners first for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23457. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20619 [https://github.com/apache/spark/pull/20619] > Register task completion listeners first for ParquetFileFormat > -- > > Key: SPARK-23457 > URL: https://issues.apache.org/jira/browse/SPARK-23457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > ParquetFileFormat leaks open files in some cases. This issue aims to register > task completion listener first. > {code} > test("SPARK-23390 Register task completion listeners first in > ParquetFileFormat") { > withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> > s"${Int.MaxValue}") { > withTempDir { dir => > val basePath = dir.getCanonicalPath > Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, > "first").toString) > Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, > "second").toString) > val df = spark.read.parquet( > new Path(basePath, "first").toString, > new Path(basePath, "second").toString) > val e = intercept[SparkException] { > df.collect() > } > assert(e.getCause.isInstanceOf[OutOfMemoryError]) > } > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
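The ordering fix can be illustrated with a small Python sketch. The names (`TaskContext`, `add_completion_listener`, `open_reader`) are illustrative, not Spark's actual API: the pattern is simply to register the close callback before any fallible initialization, so cleanup runs even when initialization throws.

```python
# Sketch: register cleanup *before* work that can fail, so the resource
# is released even when initialization throws partway through.

class TaskContext:
    def __init__(self):
        self._listeners = []

    def add_completion_listener(self, fn):
        self._listeners.append(fn)

    def complete(self):
        # runs when the task finishes, successfully or not
        for fn in self._listeners:
            fn()

class Reader:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def prepare(self):
        # simulated failure during initialization (e.g. OOM while
        # allocating a huge vectorized batch)
        raise MemoryError("batch too large")

def open_reader(ctx, register_first):
    reader = Reader()
    if register_first:
        ctx.add_completion_listener(reader.close)
    try:
        reader.prepare()
        if not register_first:
            # too late: prepare() already threw, so this never runs
            ctx.add_completion_listener(reader.close)
    except MemoryError:
        pass  # the task fails, but completion listeners still fire
    return reader

ctx = TaskContext()
leaky = open_reader(ctx, register_first=False)
ctx.complete()
print(leaky.closed)  # False -> the open file leaks

ctx2 = TaskContext()
safe = open_reader(ctx2, register_first=True)
ctx2.complete()
print(safe.closed)   # True -> cleaned up despite the failure
```

This is why the test above forces an OutOfMemoryError with a huge batch size: it exercises exactly the failure window between opening the file and registering the listener.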
[jira] [Assigned] (SPARK-23457) Register task completion listeners first for ParquetFileFormat
[ https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-23457: --- Assignee: Dongjoon Hyun > Register task completion listeners first for ParquetFileFormat > -- > > Key: SPARK-23457 > URL: https://issues.apache.org/jira/browse/SPARK-23457 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > ParquetFileFormat leaks open files in some cases. This issue aims to register > task completion listener first. > {code} > test("SPARK-23390 Register task completion listeners first in > ParquetFileFormat") { > withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> > s"${Int.MaxValue}") { > withTempDir { dir => > val basePath = dir.getCanonicalPath > Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, > "first").toString) > Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, > "second").toString) > val df = spark.read.parquet( > new Path(basePath, "first").toString, > new Path(basePath, "second").toString) > val e = intercept[SparkException] { > df.collect() > } > assert(e.getCause.isInstanceOf[OutOfMemoryError]) > } > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369724#comment-16369724 ] Dongjoon Hyun commented on SPARK-23443: --- Hi, [~ameen.tayy...@gmail.com]. I have the same need for `ExternalCatalog`, and also tried to provide `ExternalCatalog` at SPARK-17767 ([PR|https://github.com/apache/spark/pull/15336/files]). As you see, SPARK-17767 was closed as `Later` and included in `Catalog Federation`. I hope both of us can find a way to achieve that in Apache Spark 2.4. BTW, I removed the target version here because only Spark committers do that. > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive, and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, the Glue > interface supports more advanced partition pruning than the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging the > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. 
[jira] [Updated] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23443: -- Target Version/s: (was: 2.4.0) > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, Glue > interface supports more advanced partition pruning that the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23466) Remove redundant null checks in generated Java code by GenerateUnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369623#comment-16369623 ] Apache Spark commented on SPARK-23466: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/20637 > Remove redundant null checks in generated Java code by > GenerateUnsafeProjection > --- > > Key: SPARK-23466 > URL: https://issues.apache.org/jira/browse/SPARK-23466 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > One of TODOs in {{GenerateUnsafeProjection}} is "if the nullability of field > is correct, we can use it to save null check" to simplify generated code. > When {{nullable=false}} in {{DataType}}, {{GenerateUnsafeProjection}} removed > code for null checks in the generated Java code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
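As a toy illustration of the proposed simplification (this is not Spark's actual GenerateUnsafeProjection code, and the emitted Java snippets are hypothetical), a code generator can consult the schema's nullability and skip emitting the guard when a field can never be null:

```python
# Toy code generator: emit a null check only when the field is nullable.

def gen_write_field(name, ordinal, nullable):
    """Return a (hypothetical) Java snippet for writing one field."""
    if nullable:
        return (f"if ({name}IsNull) {{ rowWriter.setNullAt({ordinal}); }} "
                f"else {{ rowWriter.write({ordinal}, {name}); }}")
    # nullable=false: the null guard is provably redundant, write directly
    return f"rowWriter.write({ordinal}, {name});"

print(gen_write_field("value", 0, nullable=True))
print(gen_write_field("value", 0, nullable=False))
```

The second call emits a strictly smaller snippet, which is the whole point: across wide schemas the removed branches shrink the generated method bodies.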
[jira] [Assigned] (SPARK-23466) Remove redundant null checks in generated Java code by GenerateUnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23466: Assignee: Apache Spark > Remove redundant null checks in generated Java code by > GenerateUnsafeProjection > --- > > Key: SPARK-23466 > URL: https://issues.apache.org/jira/browse/SPARK-23466 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark >Priority: Major > > One of TODOs in {{GenerateUnsafeProjection}} is "if the nullability of field > is correct, we can use it to save null check" to simplify generated code. > When {{nullable=false}} in {{DataType}}, {{GenerateUnsafeProjection}} removed > code for null checks in the generated Java code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23466) Remove redundant null checks in generated Java code by GenerateUnsafeProjection
[ https://issues.apache.org/jira/browse/SPARK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23466: Assignee: (was: Apache Spark) > Remove redundant null checks in generated Java code by > GenerateUnsafeProjection > --- > > Key: SPARK-23466 > URL: https://issues.apache.org/jira/browse/SPARK-23466 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > One of TODOs in {{GenerateUnsafeProjection}} is "if the nullability of field > is correct, we can use it to save null check" to simplify generated code. > When {{nullable=false}} in {{DataType}}, {{GenerateUnsafeProjection}} removed > code for null checks in the generated Java code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369602#comment-16369602 ] Dongjoon Hyun commented on SPARK-15348: --- Since Hive 3.0, Hive ACID 2.0 (HIVE-14035) tables become the default table type in the Hive metastore. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of Hive's transactional tables: > you cannot use Spark to delete/update a table, and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Updated] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition
[ https://issues.apache.org/jira/browse/SPARK-23467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] V Luong updated SPARK-23467: Description: I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]. Also a related ticket is: https://issues.apache.org/jira/browse/SPARK-17998. In many of my use cases, data is appended by date (say, date D) into an S3 subdir s3://bucket/path/to/parquet/date=D, after which I need to run analytics by date. The analytics involves sorts and window functions. But I'm only interested in within-date sorts/windows, and don't care about the between-dates sorts and windows. Currently, if I simply load the entire data set from the parent dir s3://bucket/path/to/parquet, and then write Spark SQL statements involving "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am exploring ways to write analytics code/function per Spark partition, and send such code/function to each partition. The biggest problem now is that Spark's in-memory partitions do not correspond to the physical files loaded from S3, so there is no way to guarantee that the analytics by partition is done by date as desired. Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions? was: I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html] In many of my use cases, data is appended by date (say, date D) into an S3 subdir s3://bucket/path/to/parquet/date=D, after which I need to run analytics by date. The analytics involves sorts and window functions. But I'm only interested in within-date sorts/windows, and don't care about the between-dates sorts and windows. 
Currently, if I simply load the entire data set from the parent dir s3://bucket/path/to/parquet, and then write Spark SQL statements involving "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am exploring ways to write analytics code/function per Spark partition, and send such code/function to each partition. The biggest problem now is that Spark's in-memory partitions do not correspond to the physical files loaded from S3, so there is no way to guarantee that the analytics by partition is done by date as desired. Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions? > Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) > with each in-memory partition mapped to 1 physical file partition > -- > > Key: SPARK-23467 > URL: https://issues.apache.org/jira/browse/SPARK-23467 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Major > > I would like to echo the need described here: > [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]. > Also a related ticket is: https://issues.apache.org/jira/browse/SPARK-17998. > In many of my use cases, data is appended by date (say, date D) into an S3 > subdir s3://bucket/path/to/parquet/date=D, after which I need to run > analytics by date. The analytics involves sorts and window functions. But I'm > only interested in within-date sorts/windows, and don't care about the > between-dates sorts and windows. > Currently, if I simply load the entire data set from the parent dir > s3://bucket/path/to/parquet, and then write Spark SQL statements involving > "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am > exploring ways to write analytics code/function per Spark partition, and send > such code/function to each partition. 
> The biggest problem now is that Spark's in-memory partitions do not > correspond to the physical files loaded from S3, so there is no way to > guarantee that the analytics by partition is done by date as desired. > Is there a way we can explicitly enable a direct correspondence between file > partitions and in-memory partitions? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
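One shape the per-partition workaround being explored can take is to bucket the physical paths by their date= directory before reading, so each date's sorts and windows run independently with no cross-date shuffle. The paths below and the grouping step are hypothetical; in practice each group would still be read with spark.read.parquet.

```python
# Sketch: group physical parquet paths by their date= partition so
# per-date analytics can be dispatched to each group independently.
from collections import defaultdict
import re

paths = [
    "s3://bucket/path/to/parquet/date=2018-01-01/part-0.parquet",
    "s3://bucket/path/to/parquet/date=2018-01-01/part-1.parquet",
    "s3://bucket/path/to/parquet/date=2018-01-02/part-0.parquet",
]

by_date = defaultdict(list)
for p in paths:
    m = re.search(r"date=([^/]+)", p)
    if m:
        by_date[m.group(1)].append(p)

# Each group can now be loaded and sorted/windowed on its own,
# e.g. spark.read.parquet(*group) per date -- no cross-date shuffle.
for d, group in sorted(by_date.items()):
    print(d, len(group))
```

This sidesteps the missing file-partition-to-memory-partition guarantee rather than providing it, which is why a first-class correspondence (as requested in this ticket) would still be preferable.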
[jira] [Updated] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition
[ https://issues.apache.org/jira/browse/SPARK-23467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] V Luong updated SPARK-23467: Description: I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html] In many of my use cases, data is appended by date (say, date X) into an S3 subdir s3://bucket/path/to/parquet/date=X, after which I need to run analytics by date. The analytics involves sorts and window functions. But I'm only interested in within-date sorts/windows, and don't care about the between-dates sorts and windows. Currently, if I simply load the entire data set from the parent dir s3://bucket/path/to/parquet, and then write Spark SQL statements involving "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am exploring ways to write analytics code/function per Spark partition, and send such code/function to each partition. The biggest problem now is that Spark's in-memory partitions do not correspond to the physical files loaded from S3, so there is no way to guarantee that the analytics by partition is done by date as desired. Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions? was: I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html] In many of my use cases, data is appended by date (say, date X) into an S3 subdir `s3://bucket/path/to/parquet/date=X`, after which I need to run analytics by date. The analytics involves sorts and window functions. But I'm only interested in within-date sorts/windows, and don't care about the between-dates sorts and windows. Currently, if I simply load the entire data set from the parent dir `s3://bucket/path/to/parquet`, and then write Spark SQL statements involving "SORT" or "WINDOW", then very expensive shuffles will be invoked. 
Hence I am exploring ways to write analytics code/function per Spark partition, and send such code/function to each partition. The biggest problem now is that Spark's in-memory partitions do not correspond to the physical files loaded from S3, so there is no way to guarantee that the analytics by partition is done by date as desired. Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions? > Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) > with each in-memory partition mapped to 1 physical file partition > -- > > Key: SPARK-23467 > URL: https://issues.apache.org/jira/browse/SPARK-23467 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Major > > I would like to echo the need described here: > [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html] > In many of my use cases, data is appended by date (say, date X) into an S3 > subdir s3://bucket/path/to/parquet/date=X, after which I need to run > analytics by date. The analytics involves sorts and window functions. But I'm > only interested in within-date sorts/windows, and don't care about the > between-dates sorts and windows. > Currently, if I simply load the entire data set from the parent dir > s3://bucket/path/to/parquet, and then write Spark SQL statements involving > "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am > exploring ways to write analytics code/function per Spark partition, and send > such code/function to each partition. > The biggest problem now is that Spark's in-memory partitions do not > correspond to the physical files loaded from S3, so there is no way to > guarantee that the analytics by partition is done by date as desired. > Is there a way we can explicitly enable a direct correspondence between file > partitions and in-memory partitions? 
[jira] [Updated] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition
[ https://issues.apache.org/jira/browse/SPARK-23467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] V Luong updated SPARK-23467: Description: I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html] In many of my use cases, data is appended by date (say, date D) into an S3 subdir s3://bucket/path/to/parquet/date=D, after which I need to run analytics by date. The analytics involves sorts and window functions. But I'm only interested in within-date sorts/windows, and don't care about the between-dates sorts and windows. Currently, if I simply load the entire data set from the parent dir s3://bucket/path/to/parquet, and then write Spark SQL statements involving "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am exploring ways to write analytics code/function per Spark partition, and send such code/function to each partition. The biggest problem now is that Spark's in-memory partitions do not correspond to the physical files loaded from S3, so there is no way to guarantee that the analytics by partition is done by date as desired. Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions? was: I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html] In many of my use cases, data is appended by date (say, date X) into an S3 subdir s3://bucket/path/to/parquet/date=X, after which I need to run analytics by date. The analytics involves sorts and window functions. But I'm only interested in within-date sorts/windows, and don't care about the between-dates sorts and windows. Currently, if I simply load the entire data set from the parent dir s3://bucket/path/to/parquet, and then write Spark SQL statements involving "SORT" or "WINDOW", then very expensive shuffles will be invoked. 
Hence I am exploring ways to write analytics code/function per Spark partition, and send such code/function to each partition. The biggest problem now is that Spark's in-memory partitions do not correspond to the physical files loaded from S3, so there is no way to guarantee that the analytics by partition is done by date as desired. Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions? > Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) > with each in-memory partition mapped to 1 physical file partition > -- > > Key: SPARK-23467 > URL: https://issues.apache.org/jira/browse/SPARK-23467 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Major > > I would like to echo the need described here: > [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html] > In many of my use cases, data is appended by date (say, date D) into an S3 > subdir s3://bucket/path/to/parquet/date=D, after which I need to run > analytics by date. The analytics involves sorts and window functions. But I'm > only interested in within-date sorts/windows, and don't care about the > between-dates sorts and windows. > Currently, if I simply load the entire data set from the parent dir > s3://bucket/path/to/parquet, and then write Spark SQL statements involving > "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am > exploring ways to write analytics code/function per Spark partition, and send > such code/function to each partition. > The biggest problem now is that Spark's in-memory partitions do not > correspond to the physical files loaded from S3, so there is no way to > guarantee that the analytics by partition is done by date as desired. > Is there a way we can explicitly enable a direct correspondence between file > partitions and in-memory partitions? 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition
V Luong created SPARK-23467: --- Summary: Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition Key: SPARK-23467 URL: https://issues.apache.org/jira/browse/SPARK-23467 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.2.1 Reporter: V Luong I would like to echo the need described here: [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html] In many of my use cases, data is appended by date (say, date X) into an S3 subdir `s3://bucket/path/to/parquet/date=X`, after which I need to run analytics by date. The analytics involves sorts and window functions. But I'm only interested in within-date sorts/windows, and don't care about the between-dates sorts and windows. Currently, if I simply load the entire data set from the parent dir `s3://bucket/path/to/parquet`, and then write Spark SQL statements involving "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am exploring ways to write analytics code/function per Spark partition, and send such code/function to each partition. The biggest problem now is that Spark's in-memory partitions do not correspond to the physical files loaded from S3, so there is no way to guarantee that the analytics by partition is done by date as desired. Is there a way we can explicitly enable a direct correspondence between file partitions and in-memory partitions? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
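The mapping the reporter asks for can be sketched outside Spark: if each in-memory partition held exactly one physical file, a per-partition sort would automatically be a within-date sort and no cross-partition shuffle would be needed. A minimal plain-Python sketch of that invariant (the file names and values below are made up, and this is not an existing Spark API):

```python
from collections import defaultdict

# Hypothetical (source_file, value) records, as if each Parquet file under
# s3://bucket/path/to/parquet/date=D/ held exactly one date's data.
records = [
    ("date=2018-02-01/part-00000.parquet", 3.0),
    ("date=2018-02-01/part-00000.parquet", 0.5),
    ("date=2018-02-02/part-00000.parquet", 2.0),
]

# One in-memory partition per physical file: sorting inside each partition
# is then a within-date sort, with no cross-partition (shuffle) step.
partitions = defaultdict(list)
for source_file, value in records:
    partitions[source_file].append(value)

sorted_within_file = {f: sorted(vals) for f, vals in partitions.items()}
```

The point of the sketch is only the invariant: as long as a partition never mixes files, "analytics per partition" and "analytics per date" coincide.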
[jira] [Commented] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure
[ https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369600#comment-16369600 ] Dongjoon Hyun commented on SPARK-23346: --- Thank you for reporting. I adjusted the priority to `Critical` since it's reported in 2.2.0. This might be critical, but cannot be a blocker for Apache Spark 2.3.0. > Failed tasks reported as success if the failure reason is not ExceptionFailure > -- > > Key: SPARK-23346 > URL: https://issues.apache.org/jira/browse/SPARK-23346 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0 >Reporter: 吴志龙 >Priority: Critical > Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png > > > !企业微信截图_15179715023606.png! !企业微信截图_15179714603307.png! We have many other > failure reasons, such as TaskResultLost, but the status is success. In the web > UI, we count non-ExceptionFailure failures as successful tasks, which is > highly misleading. > detail message: > Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most > recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor > 27): TaskResultLost (result lost from block manager)
[jira] [Updated] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure
[ https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23346: -- Priority: Critical (was: Blocker) > Failed tasks reported as success if the failure reason is not ExceptionFailure > -- > > Key: SPARK-23346 > URL: https://issues.apache.org/jira/browse/SPARK-23346 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0 >Reporter: 吴志龙 >Priority: Critical > Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png > > > !企业微信截图_15179715023606.png! !企业微信截图_15179714603307.png! We have many other > failure reasons, such as TaskResultLost, but the status is success. In the web > UI, we count non-ExceptionFailure failures as successful tasks, which is > highly misleading. > detail message: > Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most > recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor > 27): TaskResultLost (result lost from block manager)
[jira] [Commented] (SPARK-23449) Extra java options lose order in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369567#comment-16369567 ] Dongjoon Hyun commented on SPARK-23449: --- I removed the fixed version from this issue. We can update it when this is merged into an Apache Spark 2.3 RC, if one exists. > Extra java options lose order in Docker context > --- > > Key: SPARK-23449 > URL: https://issues.apache.org/jira/browse/SPARK-23449 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8S with supplied Docker image. Passing > along extra java options. >Reporter: Andrew Korzhuev >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`, when > processed in `entrypoint.sh`, do not preserve their ordering, which makes > `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before > any other experimental options. > > Steps to reproduce: > # Set `spark.driver.extraJavaOptions`, e.g. > `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled > -XX:+UseCGroupMemoryLimitForHeap` > # Submit application to k8s cluster. > # Fetch logs and observe that on each run the order of options is different, and > when `-XX:+UnlockExperimentalVMOptions` is not first, startup will fail. > > Expected behaviour: > # Order of `extraJavaOptions` should be preserved. > > Cause: > `entrypoint.sh` fetches environment options with `env`, which doesn't > guarantee ordering. > {code:java} > env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > > /tmp/java_opts.txt{code}
[jira] [Updated] (SPARK-23449) Extra java options lose order in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23449: -- Fix Version/s: (was: 2.3.0) > Extra java options lose order in Docker context > --- > > Key: SPARK-23449 > URL: https://issues.apache.org/jira/browse/SPARK-23449 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8S with supplied Docker image. Passing > along extra java options. >Reporter: Andrew Korzhuev >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`, when > processed in `entrypoint.sh`, do not preserve their ordering, which makes > `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before > any other experimental options. > > Steps to reproduce: > # Set `spark.driver.extraJavaOptions`, e.g. > `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled > -XX:+UseCGroupMemoryLimitForHeap` > # Submit application to k8s cluster. > # Fetch logs and observe that on each run the order of options is different, and > when `-XX:+UnlockExperimentalVMOptions` is not first, startup will fail. > > Expected behaviour: > # Order of `extraJavaOptions` should be preserved. > > Cause: > `entrypoint.sh` fetches environment options with `env`, which doesn't > guarantee ordering. > {code:java} > env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > > /tmp/java_opts.txt{code}
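Since `env` prints variables in no guaranteed order, one way to make the quoted pipeline deterministic is to sort the `SPARK_JAVA_OPT_<N>` variables on their numeric suffix before stripping the names. A plain-Python sketch of that idea (not the actual fix merged into Spark; the option values are just the reporter's examples):

```python
import re

# Hypothetical environment snapshot: `env` may print these in any order.
env = {
    "SPARK_JAVA_OPT_2": "-XX:+UseCGroupMemoryLimitForHeap",
    "SPARK_JAVA_OPT_0": "-XX:+UnlockExperimentalVMOptions",
    "SPARK_JAVA_OPT_1": "-XX:+UseG1GC",
    "HOME": "/root",
}

def ordered_java_opts(environ):
    # Keep only SPARK_JAVA_OPT_<N> entries and sort on the numeric suffix,
    # so -XX:+UnlockExperimentalVMOptions (opt 0) always comes first.
    opts = [(int(m.group(1)), v)
            for k, v in environ.items()
            if (m := re.fullmatch(r"SPARK_JAVA_OPT_(\d+)", k))]
    return [v for _, v in sorted(opts)]
```

The same numeric-suffix sort could be expressed in the shell pipeline itself; the sketch just shows why sorting on the suffix, rather than trusting `env` output order, is enough to keep `-XX:+UnlockExperimentalVMOptions` in front.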
[jira] [Updated] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
[ https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23094: -- Fix Version/s: (was: 2.3.0) > Json Readers choose wrong encoding when bad records are present and fail > > > Key: SPARK-23094 > URL: https://issues.apache.org/jira/browse/SPARK-23094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser > code paths for expressions but not the readers. We should also cover reader > code paths reading files with bad characters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
[ https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369565#comment-16369565 ] Dongjoon Hyun commented on SPARK-23094: --- Since this is reverted and not a regression, I'll remove the fixed version to unblock RC4. > Json Readers choose wrong encoding when bad records are present and fail > > > Key: SPARK-23094 > URL: https://issues.apache.org/jira/browse/SPARK-23094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser > code paths for expressions but not the readers. We should also cover reader > code paths reading files with bad characters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369564#comment-16369564 ] Dongjoon Hyun commented on SPARK-23390: --- Since this is not a regression for both Parquet(not a regression) and ORC (new ORC is disabled by default), I'll remove the target version from this to unblock RC4. cc [~cloud_fan] and [~sameerag] > Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7 > -- > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Major > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. 
> {code} > Here's the full history: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ > From a very quick look, these failures seem to be correlated with > https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation, the frequency of these > test failures has increased considerably in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ > after https://github.com/apache/spark/pull/20562 (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. 
> {code} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code}
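The `DebugFilesystem` pattern visible in these stack traces, recording a `Throwable` when a stream is opened so that an unclosed stream can later be reported together with the stack that created it, can be sketched in plain Python (class and method names here are illustrative, not Spark's):

```python
import traceback

class TrackedFile:
    """Wrap a file handle and remember the stack that opened it, so leaked
    (never-closed) handles can be reported with their origin. Same idea as
    the DebugFilesystem wrapper in the stack traces above."""

    _open_streams = {}

    def __init__(self, path, mode="r"):
        self._f = open(path, mode)
        # Capture the call stack at open time, like `new Throwable` on the JVM.
        TrackedFile._open_streams[id(self)] = "".join(traceback.format_stack())

    def close(self):
        self._f.close()
        TrackedFile._open_streams.pop(id(self), None)

    @classmethod
    def leaked(cls):
        """Return the creation stacks of every stream still open."""
        return list(cls._open_streams.values())
```

A test harness can then assert `leaked()` is empty after each test, which is exactly the "There are 1 possibly leaked file streams" check that is failing here.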
[jira] [Updated] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23390: -- Target Version/s: (was: 2.3.0) > Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7 > -- > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Major > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ > From a very quick look, these failures seem to be correlated with > https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > 
org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > test failures have increased considerably in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ > after https://github.com/apache/spark/pull/20562 (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. > {code} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code}
[jira] [Updated] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
[ https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23390: -- Fix Version/s: (was: 2.3.0) > Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7 > -- > > Key: SPARK-23390 > URL: https://issues.apache.org/jira/browse/SPARK-23390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Wenchen Fan >Priority: Major > > We're seeing multiple failures in {{FileBasedDataSourceSuite}} in > {{spark-branch-2.3-test-sbt-hadoop-2.7}}: > {code} > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01215805999 seconds. Last failure message: There are 1 possibly leaked > file streams.. > {code} > Here's the full history: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/ > From a very quick look, these failures seem to be correlated with > https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from > the following stack trace (full logs > [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]): > > {code} > [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds) > 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in > stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled) > 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem > connection created at: > java.lang.Throwable > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > 
org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173) > at > org.apache.orc.impl.RecordReaderImpl.(RecordReaderImpl.java:254) > at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633) > at > org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138) > {code} > Also, while this might be just a false correlation but the frequency of these > test failures have increased considerably in > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/ > after https://github.com/apache/spark/pull/20562 (cc > [~feng...@databricks.com]) was merged. > The following is Parquet leakage. > {code} > Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null > at > org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36) > at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769) > at > org.apache.parquet.hadoop.ParquetFileReader.(ParquetFileReader.java:538) > at > org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106) > {code}
[jira] [Assigned] (SPARK-23417) pyspark tests give wrong sbt instructions
[ https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23417: Assignee: Apache Spark > pyspark tests give wrong sbt instructions > - > > Key: SPARK-23417 > URL: https://issues.apache.org/jira/browse/SPARK-23417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Apache Spark >Priority: Minor > > When running python/run-tests, the script indicates that I must run > "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or > 'build/mvn -Pkafka-0-8 package'". The sbt command fails: > > [error] Expected ID character > [error] Not a valid command: streaming-kafka-0-8-assembly > [error] Expected project ID > [error] Expected configuration > [error] Expected ':' (if selecting a configuration) > [error] Expected key > [error] Not a valid key: streaming-kafka-0-8-assembly > [error] streaming-kafka-0-8-assembly/assembly > [error] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23417) pyspark tests give wrong sbt instructions
[ https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369500#comment-16369500 ] Apache Spark commented on SPARK-23417: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/20638 > pyspark tests give wrong sbt instructions > - > > Key: SPARK-23417 > URL: https://issues.apache.org/jira/browse/SPARK-23417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > When running python/run-tests, the script indicates that I must run > "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or > 'build/mvn -Pkafka-0-8 package'". The sbt command fails: > > [error] Expected ID character > [error] Not a valid command: streaming-kafka-0-8-assembly > [error] Expected project ID > [error] Expected configuration > [error] Expected ':' (if selecting a configuration) > [error] Expected key > [error] Not a valid key: streaming-kafka-0-8-assembly > [error] streaming-kafka-0-8-assembly/assembly > [error] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23417) pyspark tests give wrong sbt instructions
[ https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23417: Assignee: (was: Apache Spark) > pyspark tests give wrong sbt instructions > - > > Key: SPARK-23417 > URL: https://issues.apache.org/jira/browse/SPARK-23417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Minor > > When running python/run-tests, the script indicates that I must run > "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or > 'build/mvn -Pkafka-0-8 package'". The sbt command fails: > > [error] Expected ID character > [error] Not a valid command: streaming-kafka-0-8-assembly > [error] Expected project ID > [error] Expected configuration > [error] Expected ':' (if selecting a configuration) > [error] Expected key > [error] Not a valid key: streaming-kafka-0-8-assembly > [error] streaming-kafka-0-8-assembly/assembly > [error] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369424#comment-16369424 ] Manan Bakshi commented on SPARK-23463: -- Hi Marco, I have attached a text file along with this comment. If you perform the following operations on the rdd, you should be able to repro the issue: rdd = sc.textFile("sample").map(lambda x: x.split(",")) df = rdd.toDF(["dev", "val"]) df = df.filter(df["val"] > 0) [^sample] > Filter operation fails to handle blank values and evicts rows that even > satisfy the filtering condition > --- > > Key: SPARK-23463 > URL: https://issues.apache.org/jira/browse/SPARK-23463 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Manan Bakshi >Priority: Critical > Attachments: sample > > > Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was > introduced to look at the table stats and decide filter selectivity. However, > since then, filter has started behaving unexpectedly for blank values. The > operation would not only drop rows with blank values but also filter out > rows that actually meet the filter criteria. > Steps to repro > Consider a simple dataframe with some blank values as below: > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL| | > |ALL|2.5| > |ALL|4.5| > |ALL|45| > Running a simple filter operation over the val column in this dataframe yields > unexpected results. E.g., the following query returned an empty dataframe: > df.filter(df["val"] > 0) > ||dev||val|| > However, the filter operation works as expected if the 0 in the filter condition is > replaced by the float 0.0 > df.filter(df["val"] > 0.0) > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL|2.5| > |ALL|4.5| > |ALL|45| > > Note that this bug only exists in Spark 2.2.0 and later. The previous > versions filter as expected for both int (0) and float (0.0) values in the > filter condition. 
> Also, if there are no blank values, the filter operation works as expected > for all versions.
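One plausible mechanism, an assumption rather than anything confirmed in this thread, is type coercion: comparing a string column to the integer literal 0 would coerce the strings to integers, so any value that fails to parse (the blank, but also fractional strings like "0.004") becomes null and fails the comparison, while comparing to 0.0 coerces to floating point and only the blank nulls out. A plain-Python sketch of that coercion rule:

```python
def permissive_cast(value, target):
    # Mimic a SQL-style cast: input that fails to parse becomes None (null)
    # instead of raising, so the row silently fails every comparison.
    try:
        return target(value)
    except ValueError:
        return None

vals = ["0.01", "0.02", "0.004", " ", "2.5", "4.5", "45"]

# df["val"] > 0 -> strings coerced to int: the blank AND fractional strings null out.
gt_int = [v for v in vals
          if (c := permissive_cast(v, int)) is not None and c > 0]

# df["val"] > 0.0 -> strings coerced to float: only the blank nulls out.
gt_float = [v for v in vals
            if (c := permissive_cast(v, float)) is not None and c > 0.0]
```

Note this alone would still keep "45" under the integer comparison, so it cannot fully explain the empty result the reporter sees; it only shows why 0 and 0.0 can legitimately behave differently.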
[jira] [Updated] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manan Bakshi updated SPARK-23463: - Attachment: sample
[jira] [Created] (SPARK-23466) Remove redundant null checks in generated Java code by GenerateUnsafeProjection
Kazuaki Ishizaki created SPARK-23466: Summary: Remove redundant null checks in generated Java code by GenerateUnsafeProjection Key: SPARK-23466 URL: https://issues.apache.org/jira/browse/SPARK-23466 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki One of the TODOs in {{GenerateUnsafeProjection}} is "if the nullability of field is correct, we can use it to save null check" to simplify the generated code. When {{nullable=false}} in {{DataType}}, {{GenerateUnsafeProjection}} can omit the null checks in the generated Java code.
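To illustrate the idea behind this improvement, here is a toy code generator (a schematic sketch in Python, not Spark's actual GenerateUnsafeProjection) that emits a null check only when the field's schema says the value can actually be null:

```python
# Toy sketch: emit Java-like writer code for one field; skip the null check
# when the schema guarantees nullable=false, since it is provably redundant.
def gen_write_field(name, nullable):
    if nullable:
        return (f"if ({name}IsNull) {{ writer.setNullAt(ordinal); }} "
                f"else {{ writer.write(ordinal, {name}); }}")
    # nullable=false: no branch needed, smaller and faster generated code.
    return f"writer.write(ordinal, {name});"

print(gen_write_field("value", nullable=True))
print(gen_write_field("value", nullable=False))
```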
[jira] [Commented] (SPARK-17430) Spark task Hangs after OOM while DAG scheduler tries to serialize a task
[ https://issues.apache.org/jira/browse/SPARK-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369367#comment-16369367 ] Furcy Pin commented on SPARK-17430: --- Same problem here in Spark 2.2.1. The job hangs forever: the GUI is still responsive, but no task advances and the executors are still reserved. This can be really annoying, as it creates zombie jobs eating cluster resources for nothing until they are killed manually. The stack trace in the driver logs gave this error: {code:java} WARN util.Utils: Suppressing exception in finally: Java heap space java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3236) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) at java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875) at java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822) at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719) at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740) at org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:620) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346) at org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619) at org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562) at org.apache.spark.scheduler.DAGScheduler.createShuffleMapStage(DAGScheduler.scala:341) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage$1.apply(DAGScheduler.scala:312) at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage$1.apply(DAGScheduler.scala:305) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.mutable.Stack.foreach(Stack.scala:169) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage(DAGScheduler.scala:305) at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:381) at org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:380) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:45) at scala.collection.SetLike$class.map(SetLike.scala:93) Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:3236) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) at java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) at java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145) at 
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894) at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875) at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) at
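The "Suppressing exception in finally" warning at the top of the trace above comes from Utils.tryWithSafeFinally, whose job is to keep a failure in the finally block from masking the original error. A rough Python sketch of that pattern (names and structure are mine, not a port of the Scala code):

```python
# Sketch (not Spark source): run a body and a cleanup step; if both fail,
# propagate the body's error and only record the cleanup error.
suppressed = []

def try_with_safe_finally(body, finally_block):
    completed = False
    try:
        result = body()
        completed = True
        return result
    finally:
        try:
            finally_block()
        except BaseException as secondary:
            if completed:
                raise  # no original error in flight, so the cleanup error propagates
            # An original error is in flight: swallowing `secondary` here lets
            # the original propagate; Spark logs it as "Suppressing exception
            # in finally".
            suppressed.append(secondary)

def serialize():
    raise MemoryError("Java heap space (while serializing)")

def close_stream():
    raise MemoryError("Java heap space (while closing the stream)")

try:
    try_with_safe_finally(serialize, close_stream)
except MemoryError as e:
    print("propagated:", e)  # the body's error, not the cleanup's
print("suppressed:", suppressed[0])
```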
[jira] [Updated] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Berman updated SPARK-23423: Attachment: (was: no-dyn-allocation-failed-no-statusUpdate.txt) > Application declines any offers when killed+active executors reach > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > Labels: Mesos, dynamic_allocation > > Hi > Mesos Version: 1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting the number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have a long-running driver that has a cyclic pattern of resource > consumption (with some idle time in between); due to dynamic allocation it > receives offers and then releases them after the current chunk of work is processed. > Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit, and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum, and slaves > holds all slaves ever "met", i.e. both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]). > > On the other hand, the number of taskIds should be updated via statusUpdate, > but suppose this update is lost (in fact I don't see logs of 'is now > TASK_KILLED'), so this number of executors might be wrong. > > I've created a test that "reproduces" this behavior; not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: don't set maxExecutors with dynamic allocation on > > Please advise > Igor > marking you, friends, since you were last to touch this piece of code and > can probably advise something ([~vanzin], [~skonto], [~susanxhuynh])
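To make the accounting problem concrete, here is a schematic model (plain Python, not Spark source; all names are illustrative) of why a lost TASK_KILLED update keeps counting against the executor limit:

```python
# `slaves` models the backend's map of every agent ever seen; a lost
# statusUpdate means a killed executor's task ID is never removed.
slaves = {
    "s1": {"task_ids": {"0"}},   # executor killed, but the TASK_KILLED update was lost
    "s2": {"task_ids": set()},   # healthy agent with capacity
}
executor_limit = 1  # spark.dynamicAllocation.maxExecutors

def num_executors():
    # Mirrors slaves.values.map(_.taskIDs.size).sum from the Scala code.
    return sum(len(s["task_ids"]) for s in slaves.values())

def can_accept_offer():
    return num_executors() < executor_limit

print(can_accept_offer())  # False: the stale task ID alone exhausts the limit
```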
[jira] [Commented] (SPARK-23459) Improve the error message when unknown column is specified in partition columns
[ https://issues.apache.org/jira/browse/SPARK-23459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369346#comment-16369346 ] Kazuaki Ishizaki commented on SPARK-23459: -- Have you started working on this? If not, I can take it. > Improve the error message when unknown column is specified in partition > columns > --- > > Key: SPARK-23459 > URL: https://issues.apache.org/jira/browse/SPARK-23459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > Labels: starter > > {noformat} > test("save with an unknown partition column") { > withTempDir { dir => > val path = dir.getCanonicalPath > Seq(1L -> "a").toDF("i", "j").write > .format("parquet") > .partitionBy("unknownColumn") > .save(path) > } > } > {noformat} > We got the following error message: > {noformat} > Partition column unknownColumn not found in schema > StructType(StructField(i,LongType,false), StructField(j,StringType,true)); > {noformat} > We should not call toString, but catalogString in the function > `partitionColumnsSchema` of `PartitioningUtils.scala`
[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.
[ https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369330#comment-16369330 ] Steve Loughran commented on SPARK-23420: Can I note that if there's a colon in the path, it'd still fail, even if the glob is bypassed. Long outstanding issue > Datasource loading not handling paths with regex chars. > --- > > Key: SPARK-23420 > URL: https://issues.apache.org/jira/browse/SPARK-23420 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.1 >Reporter: Mitchell >Priority: Major > > Greetings, during some recent testing I ran across an issue attempting to > load files with regex chars like []()* etc. in them. The files are valid in > the various storages and the normal hadoop APIs all function properly > accessing them. > When my code is executed, I get the following stack trace. > 8/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: > java.io.IOException: Illegal file pattern: Unmatched closing ')' near index > 130 > A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? > ^ java.io.IOException: Illegal file pattern: Unmatched closing ')' near > index 130 > A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? 
> ^ at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71) at > org.apache.hadoop.fs.GlobFilter.(GlobFilter.java:50) at > org.apache.hadoop.fs.Globber.doGlob(Globber.java:210) at > org.apache.hadoop.fs.Globber.glob(Globber.java:149) at > org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955) at > org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477) at > org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234) > at > org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244) > at > org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at > scala.collection.immutable.List.flatMap(List.scala:344) at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at > org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at > org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) at > com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) 
at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635) > Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' > near index 130 > A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? > ^ at java.util.regex.Pattern.error(Pattern.java:1955) at > java.util.regex.Pattern.compile(Pattern.java:1700) at > java.util.regex.Pattern.(Pattern.java:1351) at > java.util.regex.Pattern.compile(Pattern.java:1054) at > org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156) at > org.apache.hadoop.fs.GlobPattern.(GlobPattern.java:42) at > org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67) ... 25 more 18/02/14 > 04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, > (reason: User class threw exception: java.io.IOException: Illegal file > pattern: Unmatched closing ')' near index 130 > A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_?? > ^) 18/02/14 04:52:46 INFO spark.SparkContext: Invoking
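A common workaround for paths that are literal but contain glob metacharacters is to escape them before they reach the glob-expanding API. A sketch in Python (the character set and helper name are my assumptions; Hadoop's GlobPattern may treat additional characters, and notably colons, specially):

```python
import re

# Backslash-escape characters commonly treated as glob metacharacters.
GLOB_META = r"[*?\[\]{}\\]"

def escape_glob(path):
    return re.sub(GLOB_META, lambda m: "\\" + m.group(0), path)

print(escape_glob("data/report(2018)[v2]*.csv"))  # data/report(2018)\[v2\]\*.csv
```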
[jira] [Commented] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one
[ https://issues.apache.org/jira/browse/SPARK-23465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369275#comment-16369275 ] Mihaly Toth commented on SPARK-23465: - I have started working on this.
[jira] [Created] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one
Mihaly Toth created SPARK-23465: --- Summary: Dataset.withAllColumnsRenamed should map all column names to a new one Key: SPARK-23465 URL: https://issues.apache.org/jira/browse/SPARK-23465 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Mihaly Toth Currently one can rename columns only one by one, using the {{withColumnRenamed()}} function. When one would like to rename all or most of the columns, it would be easier to specify an algorithm for mapping from the old to the new name (like prefixing) than to iterate over all the fields. An example usage is joining to a Dataset with the same or similar schema (a special case is self-joining) where the names are the same or overlapping. Such a joined Dataset would fail at {{saveAsTable()}}. With the new function, usage would be as easy as: {code:java} ds.withAllColumnsRenamed("prefix" + _) {code}
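The proposed API boils down to mapping one renaming function over the column list. A sketch of that behavior in plain Python (illustrative only, not the proposed implementation):

```python
# Illustrative only: apply a single renaming function to every column name.
def with_all_columns_renamed(columns, rename):
    return [rename(c) for c in columns]

cols = ["id", "name", "score"]
print(with_all_columns_renamed(cols, lambda c: "prefix_" + c))
# -> ['prefix_id', 'prefix_name', 'prefix_score']
# In PySpark today one can approximate this with:
#   df.toDF(*["prefix_" + c for c in df.columns])
```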
[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369221#comment-16369221 ] Igor Berman edited comment on SPARK-23423 at 2/19/18 3:47 PM: -- [~skonto] so I've run the application today with the relevant logs at debug level. Seems like with dynamic allocation on, with executors starting and shutting down, the chances that every slave will get 2 failures starting some executor are much higher than in the regular case (without dynamic allocation). Seems like SPARK-19755 is the core issue here - after half a day of a long-running driver in client mode, almost 1/3 of all mesos slaves could be marked as blacklisted. The reasons for executor failures might be different and transient (e.g. port collision). I think I'll close this Jira as a duplicate of SPARK-19755, WDYT? Here is just one example: out of 74 mesos slaves, 16 are already blacklisted {code:java} grep "Blacklisting Mesos slave" /var/log/mycomp/spark-myapp.log | wc -l 16{code} was (Author: igor.berman): [~skonto] so I've run the application today with the relevant logs at debug level. Seems like with dynamic allocation on, with executors starting and shutting down, the chances that every slave will get 2 failures starting some executor are much higher than in the regular case (without dynamic allocation). Seems like SPARK-19755 is the core issue here - after half a day of a long-running driver in client mode, almost 1/3 of all mesos slaves could be marked as blacklisted. The reasons for executor failures might be different and transient (e.g. port collision). I think I'll close this Jira as a duplicate of SPARK-19755, WDYT? 
[jira] [Issue Comment Deleted] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Igor Berman updated SPARK-23423: Comment: was deleted (was: [~skonto] today I haven't managed to run with dynamic allocation on; however, I attached details for the following situation: one of the executors failed while running without dynamic allocation. All parties seem to get all the updates: slave agent, master, and even the framework, but not MesosCoarseGrainedSchedulerBackend, even though TaskSchedulerImpl did get "Lost executor 15"; see [^no-dyn-allocation-failed-no-statusUpdate.txt] I'll enable dynamic allocation tomorrow. Meanwhile, do you think the attached behavior might be problematic? )
[jira] [Commented] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As a result, the scheduler cannot create an executor after some time.
[ https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369227#comment-16369227 ] Igor Berman commented on SPARK-19755: - This Jira is very relevant for the case of running with dynamic allocation turned on, where starting and stopping executors is part of the natural lifecycle of the driver. The chances of failing when starting an executor increase (e.g. due to transient port collisions). The threshold of 2 seems too low and artificial for this use case. I've observed a situation where at some point almost 1/3 of mesos-slave nodes were marked as blacklisted (but they were ok). This creates a situation where the cluster has free resources but frameworks can't use them, since they actively decline offers from the master. > Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As a result, > the scheduler cannot create an executor after some time. > --- > > Key: SPARK-19755 > URL: https://issues.apache.org/jira/browse/SPARK-19755 > Project: Spark > Issue Type: Bug > Components: Mesos, Scheduler >Affects Versions: 2.1.0 > Environment: mesos, marathon, docker - driver and executors are > dockerized. >Reporter: Timur Abakumov >Priority: Major > > When a task fails for some reason, MesosCoarseGrainedSchedulerBackend > increases the failure counter for the slave where that task was running. > When the counter is >= 2 (MAX_SLAVE_FAILURES), the mesos slave is excluded. > Over time the scheduler cannot create a new executor - every slave is in the > blacklist. A task failure is not necessarily related to host health - especially for > long-running stream apps. > If accepted as a bug: a possible solution is to use spark.blacklist.enabled to > make that functionality optional, and if it makes sense, MAX_SLAVE_FAILURES > could also be configurable.
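For readers following along, the behavior under discussion reduces to a fixed per-agent failure counter. A schematic model in Python (MAX_SLAVE_FAILURES is 2 in the Mesos backend; everything else here is illustrative, not Spark source):

```python
# Minimal model of per-agent blacklisting on repeated executor launch failures.
MAX_SLAVE_FAILURES = 2
failures = {}

def record_failure(slave_id):
    failures[slave_id] = failures.get(slave_id, 0) + 1

def is_blacklisted(slave_id):
    # Once the threshold is hit, the agent's offers are declined forever:
    # there is no decay and no recovery path.
    return failures.get(slave_id, 0) >= MAX_SLAVE_FAILURES

record_failure("agent-1")  # e.g. a transient port collision
record_failure("agent-1")  # a second transient failure
print(is_blacklisted("agent-1"))  # True: the agent is never offered work again
```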
[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors
[ https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369221#comment-16369221 ] Igor Berman commented on SPARK-23423: - [~skonto] so I've run application today with relevant logs at debug level. Seems like with dynamic allocation on with executors starting and shutting down the chances that every slave will get 2 failures staring some executor are much higher that in regular case(without dynamic allocation) seems like SPARK-19755 is the core issue here - after half day of long running driver in client mode almost 1/3 of slaves out of all mesos slaves could be marked as blacklisted. the reasons for executor failures might be different and transient(e.g. port collision) I think I'll close this Jira as duplicate for SPARK-19755, WDYT? > Application declines any offers when killed+active executors rich > spark.dynamicAllocation.maxExecutors > -- > > Key: SPARK-23423 > URL: https://issues.apache.org/jira/browse/SPARK-23423 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.2.1 >Reporter: Igor Berman >Priority: Major > Labels: Mesos, dynamic_allocation > Attachments: no-dyn-allocation-failed-no-statusUpdate.txt > > > Hi > Mesos Version:1.1.0 > I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend > when running on Mesos with dynamic allocation on and limiting number of max > executors by spark.dynamicAllocation.maxExecutors. > Suppose we have long running driver that has cyclic pattern of resource > consumption(with some idle times in between), due to dyn.allocation it > receives offers and then releases them after current chunk of work processed. 
> Since at > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573] > the backend compares numExecutors < executorLimit and > numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves > holds all slaves ever "met", i.e. both active and killed (see comment > [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)] > > On the other hand, number of taskIds should be updated due to statusUpdate, > but suppose this update is lost(actually I don't see logs of 'is now > TASK_KILLED') so this number of executors might be wrong > > I've created test that "reproduces" this behavior, not sure how good it is: > {code:java} > //MesosCoarseGrainedSchedulerBackendSuite > test("max executors registered stops to accept offers when dynamic allocation > enabled") { > setBackend(Map( > "spark.dynamicAllocation.maxExecutors" -> "1", > "spark.dynamicAllocation.enabled" -> "true", > "spark.dynamicAllocation.testing" -> "true")) > backend.doRequestTotalExecutors(1) > val (mem, cpu) = (backend.executorMemory(sc), 4) > val offer1 = createOffer("o1", "s1", mem, cpu) > backend.resourceOffers(driver, List(offer1).asJava) > verifyTaskLaunched(driver, "o1") > backend.doKillExecutors(List("0")) > verify(driver, times(1)).killTask(createTaskId("0")) > val offer2 = createOffer("o2", "s2", mem, cpu) > backend.resourceOffers(driver, List(offer2).asJava) > verify(driver, times(1)).declineOffer(offer2.getId) > }{code} > > > Workaround: Don't set maxExecutors with dynamicAllocation on > > Please advice > Igor > marking you friends since you were last to touch this piece of code and > probably can advice something([~vanzin], [~skonto], [~susanxhuynh]) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
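The accounting problem described in this report can be illustrated with a small, self-contained sketch. This is plain Python, not the actual Scala backend; the Slave class and function names here are hypothetical stand-ins for the logic around MesosCoarseGrainedSchedulerBackend.scala#L573:

```python
# Minimal model of the executor accounting described above (hypothetical
# names; the real logic lives in MesosCoarseGrainedSchedulerBackend).

class Slave:
    def __init__(self):
        self.task_ids = set()

slaves = {}          # every slave ever "met" stays in this map
executor_limit = 1   # models spark.dynamicAllocation.maxExecutors

def num_executors():
    # mirrors slaves.values.map(_.taskIDs.size).sum
    return sum(len(s.task_ids) for s in slaves.values())

def handle_offer(slave_id, task_id):
    if num_executors() >= executor_limit:
        return "declined"
    slaves.setdefault(slave_id, Slave()).task_ids.add(task_id)
    return "launched"

assert handle_offer("s1", "0") == "launched"
# Executor "0" is killed, but the TASK_KILLED status update is lost, so
# its task id is never removed from slaves["s1"].task_ids. A fresh offer
# from another slave is now declined, even though nothing is running:
assert handle_offer("s2", "1") == "declined"
```

Because killed slaves are never evicted from the map, a single lost status update permanently inflates the count, which matches the behavior the test in the report provokes.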
[jira] [Commented] (SPARK-8529) Set metadata for MinMaxScaler
[ https://issues.apache.org/jira/browse/SPARK-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369194#comment-16369194 ] Barry Becker commented on SPARK-8529: - Complementing the output metadata in what way? Need more info about what needs to be set. Is this still an issue? It's been open for almost 3 years. From a fitted MinMaxScalerModel, can I get the originalMin and originalMax values? Not sure what this comment in the code (on master) means: {code:java}
/**
 * Model fitted by [[MinMaxScaler]].
 *
 * @param originalMin min value for each original column during fitting
 * @param originalMax max value for each original column during fitting
 *
 * TODO: The transformer does not yet set the metadata in the output column (SPARK-8529).
 */
{code} > Set metadata for MinMaxScaler > - > > Key: SPARK-8529 > URL: https://issues.apache.org/jira/browse/SPARK-8529 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > Add this as a reminder for complementing the output metadata for transformer > MinMaxScaler. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5377) Dynamically add jar into Spark Driver's classpath.
[ https://issues.apache.org/jira/browse/SPARK-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369112#comment-16369112 ] Shay Elbaz commented on SPARK-5377: --- +1 This seems like a very useful improvement and would save us many current workarounds. Is there any specific reason why this was closed? > Dynamically add jar into Spark Driver's classpath. > -- > > Key: SPARK-5377 > URL: https://issues.apache.org/jira/browse/SPARK-5377 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Chengxiang Li >Priority: Major > > Spark supports dynamically adding a jar to the executor classpath through > SparkContext::addJar(), but it does not support dynamically adding a jar to the > driver classpath. In most cases (if not all), a user dynamically adds a jar > with SparkContext::addJar() because some classes from the jar will be > referred to in an upcoming Spark job, which means the classes need to be loaded > on the Spark driver side as well, e.g. during serialization. I think it makes > sense to add an API to add a jar to the driver classpath, or just make it part > of SparkContext::addJar(). HIVE-9410 is a real case from Hive on Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23464) MesosClusterScheduler double-escapes parameters to bash command
[ https://issues.apache.org/jira/browse/SPARK-23464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcin Kurczych updated SPARK-23464: Description: Parameters passed to the driver launch command in a Mesos container are escaped using the _shellEscape_ function. In SPARK-18114, additional wrapping in double quotes was introduced. This cancels out the quoting done by _shellEscape_ and makes it impossible to run tasks with whitespace in parameters, as they are interpreted as additional parameters to the in-container spark-submit. This is how a parameter passed to the in-container spark-submit looks now: {code:java} --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"" {code} This is how it looks after reverting the SPARK-18114 commit: {code:java} --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another" {code} In the current version, submitting a job with such extraJavaOptions causes the following error: {code:java} Error: Unrecognized option: -Dbar=another Usage: spark-submit [options] [app arguments] Usage: spark-submit --kill [submission ID] --master [spark://...] Usage: spark-submit --status [submission ID] --master [spark://...] Usage: spark-submit run-example [options] example-class [example args] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). (... further spark-submit help ...) {code} Reverting SPARK-18114 is the solution to the issue. I can create a pull request on GitHub. I thought about adding unit tests for that, but the methods generating the driver launch command are private. was: Parameters passed to driver launching command in Mesos container are escaped using _shellEscape_ function. In SPARK-18114 additional wrapping in double quotes has been introduced. 
This cancels out quoting done by _shellEscape_ and makes in unable to run tasks with whitespaces in parameters, as they are interpreted as additional parameters to in-container spark-submit. This is how parameter passed to in-container spark-submit looks like now: {code:java} --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"" {code} This is how they look after reverting SPARK-18114 related commit: {code:java} --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another" {code} In current version submitting job with such extraJavaOptions causes following error: {code:java} Error: Unrecognized option: -Dbar=another Usage: spark-submit [options] [app arguments] Usage: spark-submit --kill [submission ID] --master [spark://...] Usage: spark-submit --status [submission ID] --master [spark://...] Usage: spark-submit run-example [options] example-class [example args] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). (... further spark-submit help ...) {code} > MesosClusterScheduler double-escapes parameters to bash command > --- > > Key: SPARK-23464 > URL: https://issues.apache.org/jira/browse/SPARK-23464 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 with Mesosphere patches (but the problem > exists in main repo) > DC/OS 1.9.5 >Reporter: Marcin Kurczych >Priority: Major > > Parameters passed to driver launching command in Mesos container are escaped > using _shellEscape_ function. In SPARK-18114 additional wrapping in double > quotes has been introduced. This cancels out quoting done by _shellEscape_ > and makes in unable to run tasks with whitespaces in parameters, as they are > interpreted as additional parameters to in-container spark-submit. 
> This is how parameter passed to in-container spark-submit looks like now: > {code:java} > --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"" > {code} > This is how they look after reverting SPARK-18114 related commit: > {code:java} > --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another" > {code} > In current version submitting job with such extraJavaOptions causes following > error: > {code:java} > Error: Unrecognized option: -Dbar=another > Usage: spark-submit [options] [app arguments] > Usage: spark-submit --kill [submission ID] --master [spark://...] > Usage: spark-submit --status [submission ID] --master [spark://...] > Usage:
[jira] [Created] (SPARK-23464) MesosClusterScheduler double-escapes parameters to bash command
Marcin Kurczych created SPARK-23464: --- Summary: MesosClusterScheduler double-escapes parameters to bash command Key: SPARK-23464 URL: https://issues.apache.org/jira/browse/SPARK-23464 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.2.0 Environment: Spark 2.2.0 with Mesosphere patches (but the problem exists in the main repo) DC/OS 1.9.5 Reporter: Marcin Kurczych Parameters passed to the driver launch command in a Mesos container are escaped using the _shellEscape_ function. In SPARK-18114, additional wrapping in double quotes was introduced. This cancels out the quoting done by _shellEscape_ and makes it impossible to run tasks with whitespace in parameters, as they are interpreted as additional parameters to the in-container spark-submit. This is how a parameter passed to the in-container spark-submit looks now: {code} --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"" {code} This is how it looks after reverting the SPARK-18114 commit: {code} --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another" {code} In the current version, submitting a job with such extraJavaOptions causes the following error: {code} Error: Unrecognized option: -Dfoo=another Usage: spark-submit [options] [app arguments] Usage: spark-submit --kill [submission ID] --master [spark://...] Usage: spark-submit --status [submission ID] --master [spark://...] Usage: spark-submit run-example [options] example-class [example args] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). (... further spark-submit help ...) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
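The double-escaping effect described in this issue is easy to reproduce outside of Mesos. The sketch below uses Python's shlex as a stand-in for POSIX shell tokenization; it models the general problem, not Spark's actual _shellEscape_ implementation:

```python
import shlex

# A parameter value containing whitespace, like the extraJavaOptions example.
value = 'spark.executor.extraJavaOptions=-Dfoo="first value" -Dbar=another'

# Escaping once keeps the value as a single shell word:
escaped = shlex.quote(value)
assert shlex.split(escaped) == [value]

# Wrapping the already-escaped value in an extra pair of double quotes
# (the effect attributed to SPARK-18114 here) changes the tokenization,
# so the shell no longer sees one argument:
double_wrapped = '"' + escaped + '"'
assert shlex.split(double_wrapped) != [value]
```

The stray tokens produced by the second form are exactly what spark-submit then rejects as unrecognized options.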
[jira] [Updated] (SPARK-23464) MesosClusterScheduler double-escapes parameters to bash command
[ https://issues.apache.org/jira/browse/SPARK-23464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcin Kurczych updated SPARK-23464: Description: Parameters passed to the driver launch command in a Mesos container are escaped using the _shellEscape_ function. In SPARK-18114, additional wrapping in double quotes was introduced. This cancels out the quoting done by _shellEscape_ and makes it impossible to run tasks with whitespace in parameters, as they are interpreted as additional parameters to the in-container spark-submit. This is how a parameter passed to the in-container spark-submit looks now: {code:java} --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"" {code} This is how it looks after reverting the SPARK-18114 commit: {code:java} --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another" {code} In the current version, submitting a job with such extraJavaOptions causes the following error: {code:java} Error: Unrecognized option: -Dbar=another Usage: spark-submit [options] [app arguments] Usage: spark-submit --kill [submission ID] --master [spark://...] Usage: spark-submit --status [submission ID] --master [spark://...] Usage: spark-submit run-example [options] example-class [example args] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). (... further spark-submit help ...) {code} was: Parameters passed to driver launching command in Mesos container are escaped using _shellEscape_ function. In SPARK-18114 additional wrapping in double quotes has been introduced. This cancels out quoting done by _shellEscape_ and makes in unable to run tasks with whitespaces in parameters, as they are interpreted as additional parameters to in-container spark-submit. 
This is how parameter passed to in-container spark-submit looks like now: {code} --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"" {code} This is how they look after reverting SPARK-18114 related commit: {code} --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another" {code} In current version submitting job with such extraJavaOptions causes following error: {code} Error: Unrecognized option: -Dfoo=another Usage: spark-submit [options] [app arguments] Usage: spark-submit --kill [submission ID] --master [spark://...] Usage: spark-submit --status [submission ID] --master [spark://...] Usage: spark-submit run-example [options] example-class [example args] Options: --master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. --deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client). (... further spark-submit help ...) {code} > MesosClusterScheduler double-escapes parameters to bash command > --- > > Key: SPARK-23464 > URL: https://issues.apache.org/jira/browse/SPARK-23464 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 with Mesosphere patches (but the problem > exists in main repo) > DC/OS 1.9.5 >Reporter: Marcin Kurczych >Priority: Major > > Parameters passed to driver launching command in Mesos container are escaped > using _shellEscape_ function. In SPARK-18114 additional wrapping in double > quotes has been introduced. This cancels out quoting done by _shellEscape_ > and makes in unable to run tasks with whitespaces in parameters, as they are > interpreted as additional parameters to in-container spark-submit. 
> This is how parameter passed to in-container spark-submit looks like now: > {code:java} > --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"" > {code} > This is how they look after reverting SPARK-18114 related commit: > {code:java} > --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another" > {code} > In current version submitting job with such extraJavaOptions causes following > error: > {code:java} > Error: Unrecognized option: -Dbar=another > Usage: spark-submit [options] [app arguments] > Usage: spark-submit --kill [submission ID] --master [spark://...] > Usage: spark-submit --status [submission ID] --master [spark://...] > Usage: spark-submit run-example [options] example-class [example args] > Options: > --master MASTER_URL spark://host:port, mesos://host:port, yarn, or > local. > --deploy-mode DEPLOY_MODE Whether to launch the
[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368968#comment-16368968 ] Marco Gaido commented on SPARK-23463: - sorry, what do you mean by blank values? Which is the type of "val"? Can you provide some code to generate a DataFrame like the one you posted to reproduce the issue? Thanks. > Filter operation fails to handle blank values and evicts rows that even > satisfy the filtering condition > --- > > Key: SPARK-23463 > URL: https://issues.apache.org/jira/browse/SPARK-23463 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Manan Bakshi >Priority: Critical > > Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was > introduced to look at the table stats and decide filter selectivity. However, > since then, filter has started behaving unexpectedly for blank values. The > operation would not only drop columns with blank values but also filter out > rows that actually meet the filter criteria. > Steps to repro > Consider a simple dataframe with some blank values as below: > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL| | > |ALL|2.5| > |ALL|4.5| > |ALL|45| > Running a simple filter operation over val column in this dataframe yields > unexpected results. For eg. the following query returned an empty dataframe: > df.filter(df["val"] > 0) > ||dev||val|| > However, the filter operation works as expected if 0 in filter condition is > replaced by float 0.0 > df.filter(df["val"] > 0.0) > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL|2.5| > |ALL|4.5| > |ALL|45| > > Note that this bug only exists in Spark 2.2.0 and later. The previous > versions filter as expected for both int (0) and float (0.0) values in the > filter condition. > Also, if there are no blank values, the filter operation works as expected > for all versions. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
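The expected filtering behavior from the report's table can be modeled in a few lines of plain Python. This is NOT Spark's actual casting or CBO logic, only a hedged sketch of why a blank entry in an otherwise numeric column needs special handling; the safe_float helper is hypothetical:

```python
# The "val" column from the report, with the blank row included.
raw = ["0.01", "0.02", "0.004", "", "2.5", "4.5", "45"]

def safe_float(s):
    # Mimic a lenient numeric cast: a blank becomes None, not a number.
    try:
        return float(s)
    except ValueError:
        return None

values = [safe_float(s) for s in raw]

# A correct "val > 0.0" filter should drop only the blank (None) row and
# keep every numeric row, matching the expected output in the report:
kept = [v for v in values if v is not None and v > 0.0]
assert kept == [0.01, 0.02, 0.004, 2.5, 4.5, 45.0]
```

The reported bug is that with an int literal (0) Spark 2.2.x drops all rows instead of only the blank one; this sketch only pins down what the correct result looks like.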
[jira] [Commented] (SPARK-23450) jars option in spark submit is documented in misleading way
[ https://issues.apache.org/jira/browse/SPARK-23450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368953#comment-16368953 ] Gregory Reshetniak commented on SPARK-23450: Hi guys, any update on this, please? > jars option in spark submit is documented in misleading way > --- > > Key: SPARK-23450 > URL: https://issues.apache.org/jira/browse/SPARK-23450 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.2.1 >Reporter: Gregory Reshetniak >Priority: Major > > I am wondering if the {{--jars}} option on spark submit is actually meant for > distributing the dependency jars onto the nodes in cluster? > > In my case I can see it working as a "symlink" of sorts. But the > documentation is written in the way that suggests otherwise. Please help me > figure out if this is a bug or just my reading of the docs. Thanks! > _ > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23415: Assignee: Apache Spark > BufferHolderSparkSubmitSuite is flaky > - > > Key: SPARK-23415 > URL: https://issues.apache.org/jira/browse/SPARK-23415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > The test suite fails due to 60-second timeout sometimes. > {code} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > {code} > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23415: Assignee: (was: Apache Spark) > BufferHolderSparkSubmitSuite is flaky > - > > Key: SPARK-23415 > URL: https://issues.apache.org/jira/browse/SPARK-23415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > The test suite fails due to 60-second timeout sometimes. > {code} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > {code} > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky
[ https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368895#comment-16368895 ] Apache Spark commented on SPARK-23415: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/20636 > BufferHolderSparkSubmitSuite is flaky > - > > Key: SPARK-23415 > URL: https://issues.apache.org/jira/browse/SPARK-23415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > The test suite fails due to 60-second timeout sometimes. > {code} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > failAfter did not complete within 60 seconds. > {code} > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manan Bakshi updated SPARK-23463: - Comment: was deleted (was: I believe that the bug has something to do with CBO that was introduced in Spark 2.2.0. The CBO looks at blank min value in table stats and is unable to decide on filter selectivity.) > Filter operation fails to handle blank values and evicts rows that even > satisfy the filtering condition > --- > > Key: SPARK-23463 > URL: https://issues.apache.org/jira/browse/SPARK-23463 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Manan Bakshi >Priority: Critical > > Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was > introduced to look at the table stats and decide filter selectivity. However, > since then, filter has started behaving unexpectedly for blank values. The > operation would not only drop columns with blank values but also filter out > rows that actually meet the filter criteria. > Steps to repro > Consider a simple dataframe with some blank values as below: > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL| | > |ALL|2.5| > |ALL|4.5| > |ALL|45| > Running a simple filter operation over val column in this dataframe yields > unexpected results. For eg. the following query returned an empty dataframe: > df.filter(df["val"] > 0) > ||dev||val|| > However, the filter operation works as expected if 0 in filter condition is > replaced by float 0.0 > df.filter(df["val"] > 0.0) > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL|2.5| > |ALL|4.5| > |ALL|45| > > Note that this bug only exists in Spark 2.2.0 and later. The previous > versions filter as expected for both int (0) and float (0.0) values in the > filter condition. > Also, if there are no blank values, the filter operation works as expected > for all versions. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manan Bakshi updated SPARK-23463: - Description: Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was introduced to look at the table stats and decide filter selectivity. However, since then, filter has started behaving unexpectedly for blank values. The operation would not only drop columns with blank values but also filter out rows that actually meet the filter criteria. Steps to repro Consider a simple dataframe with some blank values as below: ||dev||val|| |ALL|0.01| |ALL|0.02| |ALL|0.004| |ALL| | |ALL|2.5| |ALL|4.5| |ALL|45| Running a simple filter operation over val column in this dataframe yields unexpected results. For eg. the following query returned an empty dataframe: df.filter(df["val"] > 0) ||dev||val|| However, the filter operation works as expected if 0 in filter condition is replaced by float 0.0 df.filter(df["val"] > 0.0) ||dev||val|| |ALL|0.01| |ALL|0.02| |ALL|0.004| |ALL|2.5| |ALL|4.5| |ALL|45| Note that this bug only exists in Spark 2.2.0 and later. The previous versions filter as expected for both int (0) and float (0.0) values in the filter condition. Also, if there are no blank values, the filter operation works as expected for all versions. was: I have a simple dataframe with some blank values as below ||dev||val|| |ALL|0.01| |ALL|0.02| |ALL|0.004| |ALL| | |ALL|2.5| |ALL|4.5| |ALL|45| Running a simple filter operation over val column in this dataframe yields unexpected results. For eg. the following query returned an empty dataframe: df.filter(df["val"] > 0) ||dev||val|| However, the filter operation works as expected if 0 in filter condition is replaced by float 0.0 df.filter(df["val"] > 0.0) ||dev||val|| |ALL|0.01| |ALL|0.02| |ALL|0.004| |ALL|2.5| |ALL|4.5| |ALL|45| Note that this bug only exists in Spark 2.2.0 and later. 
The previous versions filter as expected for both int (0) and float (0.0) values in the filter condition. Also, if there are no blank values, the filter operation works as expected for all versions. > Filter operation fails to handle blank values and evicts rows that even > satisfy the filtering condition > --- > > Key: SPARK-23463 > URL: https://issues.apache.org/jira/browse/SPARK-23463 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Manan Bakshi >Priority: Critical > > Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was > introduced to look at the table stats and decide filter selectivity. However, > since then, filter has started behaving unexpectedly for blank values. The > operation would not only drop columns with blank values but also filter out > rows that actually meet the filter criteria. > Steps to repro > Consider a simple dataframe with some blank values as below: > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL| | > |ALL|2.5| > |ALL|4.5| > |ALL|45| > Running a simple filter operation over val column in this dataframe yields > unexpected results. For eg. the following query returned an empty dataframe: > df.filter(df["val"] > 0) > ||dev||val|| > However, the filter operation works as expected if 0 in filter condition is > replaced by float 0.0 > df.filter(df["val"] > 0.0) > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL|2.5| > |ALL|4.5| > |ALL|45| > > Note that this bug only exists in Spark 2.2.0 and later. The previous > versions filter as expected for both int (0) and float (0.0) values in the > filter condition. > Also, if there are no blank values, the filter operation works as expected > for all versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manan Bakshi updated SPARK-23463: - Description: I have a simple dataframe with some blank values as below ||dev||val|| |ALL|0.01| |ALL|0.02| |ALL|0.004| |ALL| | |ALL|2.5| |ALL|4.5| |ALL|45| Running a simple filter operation over val column in this dataframe yields unexpected results. For eg. the following query returned an empty dataframe: df.filter(df["val"] > 0) ||dev||val|| However, the filter operation works as expected if 0 in filter condition is replaced by float 0.0 df.filter(df["val"] > 0.0) ||dev||val|| |ALL|0.01| |ALL|0.02| |ALL|0.004| |ALL|2.5| |ALL|4.5| |ALL|45| Note that this bug only exists in Spark 2.2.0 and later. The previous versions filter as expected for both int (0) and float (0.0) values in the filter condition. Also, if there are no blank values, the filter operation works as expected for all versions. was: I have a simple dataframe with some blank values as below ||dev||val|| |ALL|0.01| |ALL|0.02| |ALL|0.004| |ALL| | |ALL|2.5| |ALL|4.5| |ALL|45| Running a simple filter operation over val column in this dataframe yields unexpected results. For eg. the following query returned an empty dataframe: df.filter(df["val"] > 0) ||dev||val|| However, the filter operation works as expected if 0 in filter condition is replaced by float 0.0 df.filter(df["val"] > 0.0) ||dev||val|| |ALL|0.01| |ALL|0.02| |ALL|0.004| |ALL|2.5| |ALL|4.5| |ALL|45| Note that this bug only exists in Spark 2.2.0 and later. The previous versions filter as expected for both int (0) and float (0.0) values in the filter condition. Also, the filter operation works as expected for all versions, if there are no blank values. 
> Filter operation fails to handle blank values and evicts rows that even > satisfy the filtering condition > --- > > Key: SPARK-23463 > URL: https://issues.apache.org/jira/browse/SPARK-23463 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Manan Bakshi >Priority: Critical > > I have a simple dataframe with some blank values as below > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL| | > |ALL|2.5| > |ALL|4.5| > |ALL|45| > Running a simple filter operation over val column in this dataframe yields > unexpected results. For eg. the following query returned an empty dataframe: > df.filter(df["val"] > 0) > ||dev||val|| > However, the filter operation works as expected if 0 in filter condition is > replaced by float 0.0 > df.filter(df["val"] > 0.0) > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL|2.5| > |ALL|4.5| > |ALL|45| > > Note that this bug only exists in Spark 2.2.0 and later. The previous > versions filter as expected for both int (0) and float (0.0) values in the > filter condition. > Also, if there are no blank values, the filter operation works as expected > for all versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368879#comment-16368879 ] Manan Bakshi commented on SPARK-23463: -- I believe that the bug has something to do with CBO that was introduced in Spark 2.2.0. The CBO looks at blank min value in table stats and is unable to decide on filter selectivity. > Filter operation fails to handle blank values and evicts rows that even > satisfy the filtering condition > --- > > Key: SPARK-23463 > URL: https://issues.apache.org/jira/browse/SPARK-23463 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 >Reporter: Manan Bakshi >Priority: Critical > > I have a simple dataframe with some blank values as below > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL| | > |ALL|2.5| > |ALL|4.5| > |ALL|45| > Running a simple filter operation over val column in this dataframe yields > unexpected results. For eg. the following query returned an empty dataframe: > df.filter(df["val"] > 0) > ||dev||val|| > However, the filter operation works as expected if 0 in filter condition is > replaced by float 0.0 > df.filter(df["val"] > 0.0) > ||dev||val|| > |ALL|0.01| > |ALL|0.02| > |ALL|0.004| > |ALL|2.5| > |ALL|4.5| > |ALL|45| > > Note that this bug only exists in Spark 2.2.0 and later. The previous > versions filter as expected for both int (0) and float (0.0) values in the > filter condition. > Also, the filter operation works as expected for all versions, if there are > no blank values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
[ https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manan Bakshi updated SPARK-23463:
---------------------------------
    Description:

I have a simple dataframe with some blank values as below

||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)

||dev||val||

However, the filter operation works as expected if 0 in filter condition is replaced by float 0.0

df.filter(df["val"] > 0.0)

||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Note that this bug only exists in Spark 2.2.0 and later. The previous versions filter as expected for both int (0) and float (0.0) values in the filter condition. Also, the filter operation works as expected for all versions, if there are no blank values.

  was:

I have a simple dataframe as below

||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)

||dev||val||

However, the filter operation works as expected if 0 in filter condition is replaced by float 0.0

df.filter(df["val"] > 0.0)

||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Note that this bug only exists in Spark 2.2.0 and later. The previous versions filter as expected for both int (0) and float (0.0) values in the filter condition.
[jira] [Created] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
Manan Bakshi created SPARK-23463:
------------------------------------

             Summary: Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition
                 Key: SPARK-23463
                 URL: https://issues.apache.org/jira/browse/SPARK-23463
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.1
            Reporter: Manan Bakshi

I have a simple dataframe as below

||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)

||dev||val||

However, the filter operation works as expected if 0 in filter condition is replaced by float 0.0

df.filter(df["val"] > 0.0)

||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Note that this bug only exists in Spark 2.2.0 and later. The previous versions filter as expected for both int (0) and float (0.0) values in the filter condition.
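The reported behavior suggests the blank entry forces the val column to be string-typed, so every comparison goes through an implicit cast. The following pure-Python sketch models the *expected* semantics (it is a hypothetical model for illustration, not Spark's actual code; the names `rows`, `safe_cast`, and `filter_gt` are invented for this example): a blank string casts to NULL, NULL comparisons are dropped, and the result should be identical whether the threshold is the int 0 or the float 0.0.

```python
# Hypothetical model of the expected filter semantics (not Spark internals).
# The "val" column is string-typed because of the blank entry.
rows = ["0.01", "0.02", "0.004", "", "2.5", "4.5", "45"]

def safe_cast(s):
    """Mimic a cast to double: a blank string becomes NULL (None)."""
    try:
        return float(s)
    except ValueError:
        return None

def filter_gt(rows, threshold):
    """Expected semantics of df.filter(col > threshold):
    NULLs are dropped, everything else is compared numerically."""
    return [s for s in rows if safe_cast(s) is not None
            and safe_cast(s) > threshold]

# Under these semantics the threshold's Python type must not matter:
assert filter_gt(rows, 0) == filter_gt(rows, 0.0)
assert filter_gt(rows, 0) == ["0.01", "0.02", "0.004", "2.5", "4.5", "45"]
```

The bug report says Spark 2.2.0+ instead returns an empty result for the int threshold, i.e. `filter_gt(rows, 0)` behaving as if every row were NULL, while the float threshold behaves as modeled here.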
[jira] [Commented] (SPARK-21962) Distributed Tracing in Spark
[ https://issues.apache.org/jira/browse/SPARK-21962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368877#comment-16368877 ]

Paul Doran commented on SPARK-21962:
------------------------------------

I have a working local version for Jaeger implementing the SparkListenerInterface. Is the scope of this issue broader? Either way, I'd be happy to open source what I have to provide something similar to: [https://github.com/JasonMWhite/spark-datadog-relay]

Thanks

> Distributed Tracing in Spark
> ----------------------------
>
>                 Key: SPARK-21962
>                 URL: https://issues.apache.org/jira/browse/SPARK-21962
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Andrew Ash
>            Priority: Major
>
> Spark should support distributed tracing, the mechanism, widely popularized
> by Google in the [Dapper paper|https://research.google.com/pubs/pub36356.html],
> where network requests carry additional metadata used for tracing requests
> between services.
> This would be useful for me since I have OpenZipkin-style tracing in my
> distributed application up to the Spark driver, and from the executors out
> to my other services, but the link is broken in Spark between driver and
> executor since the span IDs aren't propagated across that link.
> An initial implementation could instrument the most important network calls
> with trace IDs (like launching and finishing tasks), and incrementally add
> more tracing to other calls (torrent block distribution, external shuffle
> service, etc.) as the feature matures.
> Search keywords: Dapper, Brave, OpenZipkin, HTrace