[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors reach spark.dynamicAllocation.maxExecutors

2018-02-19 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369221#comment-16369221
 ] 

Igor Berman edited comment on SPARK-23423 at 2/20/18 7:33 AM:
--

[~skonto] so I've run the application today with the relevant logs at debug 
level (previously I just had problems with loggers being reconfigured 
dynamically, so I hadn't seen the TASK_KILLED reports). It seems that with 
dynamic allocation on, with executors starting and shutting down, the chances 
that every slave will get 2 failures starting some executor are much higher 
than in the regular case (without dynamic allocation). SPARK-19755 seems to be 
the core issue here - after half a day of a long-running driver in client mode, 
almost 1/3 of all Mesos slaves could be marked as blacklisted.

The reasons for executor failures might differ and be transient (e.g. port 
collision).

I think I'll close this Jira as a duplicate of SPARK-19755, WDYT?

 

Here is just one example: out of 74 Mesos slaves, 16 are already blacklisted
{code:java}
grep "Blacklisting Mesos slave" /var/log/mycomp/spark-myapp.log | wc -l
16{code}


was (Author: igor.berman):
[~skonto] so I've run the application today with the relevant logs at debug level. 
It seems that with dynamic allocation on, with executors starting and shutting down, 
the chances that every slave will get 2 failures starting some executor are much 
higher than in the regular case (without dynamic allocation). SPARK-19755 seems to 
be the core issue here - after half a day of a long-running driver in client mode, 
almost 1/3 of all Mesos slaves could be marked as blacklisted.

The reasons for executor failures might differ and be transient (e.g. port 
collision).

I think I'll close this Jira as a duplicate of SPARK-19755, WDYT?

 

Here is just one example: out of 74 Mesos slaves, 16 are already blacklisted
{code:java}
grep "Blacklisting Mesos slave" /var/log/mycomp/spark-myapp.log | wc -l
16{code}

> Application declines any offers when killed+active executors reach 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting the maximum 
> number of executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver that has a cyclic pattern of resource 
> consumption (with some idle time in between); due to dynamic allocation it 
> receives offers and then releases them after the current chunk of work is 
> processed.
> At 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit, where 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see the comment at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).
>  
> On the other hand, the number of taskIDs should be updated via statusUpdate, 
> but suppose this update is lost (I actually don't see 'is now TASK_KILLED' in 
> the logs), so this count of executors might be wrong
>  
> I've created a test that "reproduces" this behavior; not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation enabled") {
>   setBackend(Map(
>     "spark.dynamicAllocation.maxExecutors" -> "1",
>     "spark.dynamicAllocation.enabled" -> "true",
>     "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advise
> Igor
> marking you, friends, since you were the last to touch this piece of code and 
> can probably advise something ([~vanzin], 

[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369750#comment-16369750
 ] 

Manan Bakshi commented on SPARK-23463:
--

Hi Marco,

I realized that if you use the following code for the same sample:

rdd = sc.textFile("sample").map(lambda x: x.split(", "))

df = rdd.toDF(["dev", "val"])

df = df.filter(df["val"] > 0)

 

The filter condition actually only filters out values less than 1 even though 
it should actually be filtering out values less than 0. This is really weird. 
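
For illustration, a minimal PySpark sketch (not from the thread; treat the read path and the explanation as assumptions): since the sample is read as plain text, "val" comes back as a string column, and an explicit cast forces a numeric comparison regardless of whether the literal is 0 or 0.0.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# "val" is parsed as a string column here, because the rows come from a text file.
rdd = sc.textFile("sample").map(lambda x: x.split(", "))
df = rdd.toDF(["dev", "val"])

# Casting explicitly to double makes the comparison numeric; blank values
# become null after the cast and are dropped by the filter.
filtered = df.filter(df["val"].cast("double") > 0)
filtered.show()
{code}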

 

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
> Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was 
> introduced to look at the table stats and decide filter selectivity. However, 
> since then, filter has started behaving unexpectedly for blank values. The 
> operation would not only drop columns with blank values but also filter out 
> rows that actually meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over val column in this dataframe yields 
> unexpected results. For eg. the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if 0 in filter condition is 
> replaced by float 0.0
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.






[jira] [Assigned] (SPARK-23436) Incorrect Date column Inference in partition discovery

2018-02-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23436:
---

Assignee: Marco Gaido

> Incorrect Date column Inference in partition discovery
> --
>
> Key: SPARK-23436
> URL: https://issues.apache.org/jira/browse/SPARK-23436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Apoorva Sareen
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> If a partition column value is a partial date/timestamp,
>     example : 2018-01-01-23 
> i.e. truncated to the hour, then the data type of the partitioning column is 
> automatically inferred as date; however, the values are loaded as null. 
> Here is an example code to reproduce this behaviour
>  
>  
> {code:java}
> val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", "date_month", "data_hour", "data")
> data.write.partitionBy("id","date_month","data_hour").parquet("output/test")
> val input = spark.read.parquet("output/test")
> input.printSchema()
> input.show()
> ## Result ###
> root
>  |-- data: string (nullable = true)
>  |-- id: integer (nullable = true)
>  |-- date_month: string (nullable = true)
>  |-- data_hour: date (nullable = true)
> +----+---+----------+---------+
> |data| id|date_month|data_hour|
> +----+---+----------+---------+
> |test|  1|   2018-01|     null|
> +----+---+----------+---------+{code}
>  
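
For illustration, a possible PySpark workaround sketch until the fix lands, assuming the existing spark.sql.sources.partitionColumnTypeInference.enabled flag: with partition-column type inference disabled, the partial timestamp is read back as a plain string instead of a null date.
{code:python}
# Sketch only: disable partition-column type inference before reading,
# so data_hour stays the string "2018-01-01-04" rather than a mis-inferred null date.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
input_df = spark.read.parquet("output/test")
input_df.printSchema()   # data_hour: string
input_df.show()
{code}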






[jira] [Resolved] (SPARK-23436) Incorrect Date column Inference in partition discovery

2018-02-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23436.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20621
[https://github.com/apache/spark/pull/20621]

> Incorrect Date column Inference in partition discovery
> --
>
> Key: SPARK-23436
> URL: https://issues.apache.org/jira/browse/SPARK-23436
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Apoorva Sareen
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> If a partition column value is a partial date/timestamp,
>     example : 2018-01-01-23 
> i.e. truncated to the hour, then the data type of the partitioning column is 
> automatically inferred as date; however, the values are loaded as null. 
> Here is an example code to reproduce this behaviour
>  
>  
> {code:java}
> val data = Seq(("1", "2018-01", "2018-01-01-04", "test")).toDF("id", "date_month", "data_hour", "data")
> data.write.partitionBy("id","date_month","data_hour").parquet("output/test")
> val input = spark.read.parquet("output/test")
> input.printSchema()
> input.show()
> ## Result ###
> root
>  |-- data: string (nullable = true)
>  |-- id: integer (nullable = true)
>  |-- date_month: string (nullable = true)
>  |-- data_hour: date (nullable = true)
> +----+---+----------+---------+
> |data| id|date_month|data_hour|
> +----+---+----------+---------+
> |test|  1|   2018-01|     null|
> +----+---+----------+---------+{code}
>  






[jira] [Resolved] (SPARK-23457) Register task completion listeners first for ParquetFileFormat

2018-02-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23457.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20619
[https://github.com/apache/spark/pull/20619]

> Register task completion listeners first for ParquetFileFormat
> --
>
> Key: SPARK-23457
> URL: https://issues.apache.org/jira/browse/SPARK-23457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> ParquetFileFormat leaks open files in some cases. This issue aims to register 
> task completion listener first.
> {code}
>   test("SPARK-23390 Register task completion listeners first in ParquetFileFormat") {
>     withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> s"${Int.MaxValue}") {
>       withTempDir { dir =>
>         val basePath = dir.getCanonicalPath
>         Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, "first").toString)
>         Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, "second").toString)
>         val df = spark.read.parquet(
>           new Path(basePath, "first").toString,
>           new Path(basePath, "second").toString)
>         val e = intercept[SparkException] {
>           df.collect()
>         }
>         assert(e.getCause.isInstanceOf[OutOfMemoryError])
>       }
>     }
>   }
> {code}






[jira] [Assigned] (SPARK-23457) Register task completion listeners first for ParquetFileFormat

2018-02-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23457:
---

Assignee: Dongjoon Hyun

> Register task completion listeners first for ParquetFileFormat
> --
>
> Key: SPARK-23457
> URL: https://issues.apache.org/jira/browse/SPARK-23457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> ParquetFileFormat leaks open files in some cases. This issue aims to register 
> task completion listener first.
> {code}
>   test("SPARK-23390 Register task completion listeners first in ParquetFileFormat") {
>     withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE.key -> s"${Int.MaxValue}") {
>       withTempDir { dir =>
>         val basePath = dir.getCanonicalPath
>         Seq(0).toDF("a").write.format("parquet").save(new Path(basePath, "first").toString)
>         Seq(1).toDF("a").write.format("parquet").save(new Path(basePath, "second").toString)
>         val df = spark.read.parquet(
>           new Path(basePath, "first").toString,
>           new Path(basePath, "second").toString)
>         val e = intercept[SparkException] {
>           df.collect()
>         }
>         assert(e.getCause.isInstanceOf[OutOfMemoryError])
>       }
>     }
>   }
> {code}






[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog

2018-02-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369724#comment-16369724
 ] 

Dongjoon Hyun commented on SPARK-23443:
---

Hi, [~ameen.tayy...@gmail.com].
I have the same need for `ExternalCatalog`, and also tried to provide 
`ExternalCatalog` at SPARK-17767 
([PR|https://github.com/apache/spark/pull/15336/files]). As you see, 
SPARK-17767 was closed as `Later` and included in `Catalog Federation`. I hope 
both of us can find a way to achieve that in Apache Spark 2.4.

BTW, I removed the target version here because only Spark committers do that.

> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of 
> features that the latest version of Spark supports. For example, the Glue 
> interface supports more advanced partition pruning than the latest version of 
> Hive embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging 
> latest features of Glue, without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This Jira tracks this work.






[jira] [Updated] (SPARK-23443) Spark with Glue as external catalog

2018-02-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23443:
--
Target Version/s:   (was: 2.4.0)

> Spark with Glue as external catalog
> ---
>
> Key: SPARK-23443
> URL: https://issues.apache.org/jira/browse/SPARK-23443
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ameen Tayyebi
>Priority: Major
>
> AWS Glue Catalog is an external Hive metastore backed by a web service. It 
> allows permanent storage of catalog data for BigData use cases.
> To find out more information about AWS Glue, please consult:
>  * AWS Glue - [https://aws.amazon.com/glue/]
>  * Using Glue as a Metastore catalog for Spark - 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html]
> Today, the integration of Glue and Spark is through the Hive layer. Glue 
> implements the IMetaStore interface of Hive and for installations of Spark 
> that contain Hive, Glue can be used as the metastore.
> The feature set that Glue supports does not align 1-1 with the set of 
> features that the latest version of Spark supports. For example, the Glue 
> interface supports more advanced partition pruning than the latest version of 
> Hive embedded in Spark.
> To enable a more natural integration with Spark and to allow leveraging 
> latest features of Glue, without being coupled to Hive, a direct integration 
> through Spark's own Catalog API is proposed. This Jira tracks this work.






[jira] [Commented] (SPARK-23466) Remove redundant null checks in generated Java code by GenerateUnsafeProjection

2018-02-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369623#comment-16369623
 ] 

Apache Spark commented on SPARK-23466:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20637

> Remove redundant null checks in generated Java code by 
> GenerateUnsafeProjection
> ---
>
> Key: SPARK-23466
> URL: https://issues.apache.org/jira/browse/SPARK-23466
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> One of the TODOs in {{GenerateUnsafeProjection}} is "if the nullability of field 
> is correct, we can use it to save null check", to simplify the generated code.
> When {{nullable=false}} in {{DataType}}, {{GenerateUnsafeProjection}} can remove 
> the null-check code from the generated Java code.






[jira] [Assigned] (SPARK-23466) Remove redundant null checks in generated Java code by GenerateUnsafeProjection

2018-02-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23466:


Assignee: Apache Spark

> Remove redundant null checks in generated Java code by 
> GenerateUnsafeProjection
> ---
>
> Key: SPARK-23466
> URL: https://issues.apache.org/jira/browse/SPARK-23466
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Major
>
> One of the TODOs in {{GenerateUnsafeProjection}} is "if the nullability of field 
> is correct, we can use it to save null check", to simplify the generated code.
> When {{nullable=false}} in {{DataType}}, {{GenerateUnsafeProjection}} can remove 
> the null-check code from the generated Java code.






[jira] [Assigned] (SPARK-23466) Remove redundant null checks in generated Java code by GenerateUnsafeProjection

2018-02-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23466:


Assignee: (was: Apache Spark)

> Remove redundant null checks in generated Java code by 
> GenerateUnsafeProjection
> ---
>
> Key: SPARK-23466
> URL: https://issues.apache.org/jira/browse/SPARK-23466
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> One of the TODOs in {{GenerateUnsafeProjection}} is "if the nullability of field 
> is correct, we can use it to save null check", to simplify the generated code.
> When {{nullable=false}} in {{DataType}}, {{GenerateUnsafeProjection}} can remove 
> the null-check code from the generated Java code.






[jira] [Commented] (SPARK-15348) Hive ACID

2018-02-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369602#comment-16369602
 ] 

Dongjoon Hyun commented on SPARK-15348:
---

Since Hive 3.0, Hive ACID 2.0 (HIVE-14035) tables are the default table type in the 
Hive metastore.

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any feature of Hive's transactional tables;
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction was done.
> Also it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'






[jira] [Updated] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition

2018-02-19 Thread V Luong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

V Luong updated SPARK-23467:

Description: 
I would like to echo the need described here: 
[https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html].
 Also a related ticket is: https://issues.apache.org/jira/browse/SPARK-17998.

In many of my use cases, data is appended by date (say, date D) into an S3 
subdir s3://bucket/path/to/parquet/date=D, after which I need to run analytics 
by date. The analytics involves sorts and window functions. But I'm only 
interested in within-date sorts/windows, and don't care about the between-dates 
sorts and windows. 

Currently, if I simply load the entire data set from the parent dir 
s3://bucket/path/to/parquet, and then write Spark SQL statements involving 
"SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
exploring ways to write analytics code/function per Spark partition, and send 
such code/function to each partition.

The biggest problem now is that Spark's in-memory partitions do not correspond 
to the physical files loaded from S3, so there is no way to guarantee that the 
analytics by partition is done by date as desired.

Is there a way we can explicitly enable a direct correspondence between file 
partitions and in-memory partitions?

  was:
I would like to echo the need described here: 
[https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]

In many of my use cases, data is appended by date (say, date D) into an S3 
subdir s3://bucket/path/to/parquet/date=D, after which I need to run analytics 
by date. The analytics involves sorts and window functions. But I'm only 
interested in within-date sorts/windows, and don't care about the between-dates 
sorts and windows. 

Currently, if I simply load the entire data set from the parent dir 
s3://bucket/path/to/parquet, and then write Spark SQL statements involving 
"SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
exploring ways to write analytics code/function per Spark partition, and send 
such code/function to each partition.

The biggest problem now is that Spark's in-memory partitions do not correspond 
to the physical files loaded from S3, so there is no way to guarantee that the 
analytics by partition is done by date as desired.

Is there a way we can explicitly enable a direct correspondence between file 
partitions and in-memory partitions?


> Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) 
> with each in-memory partition mapped to 1 physical file partition
> --
>
> Key: SPARK-23467
> URL: https://issues.apache.org/jira/browse/SPARK-23467
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Major
>
> I would like to echo the need described here: 
> [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html].
>  Also a related ticket is: https://issues.apache.org/jira/browse/SPARK-17998.
> In many of my use cases, data is appended by date (say, date D) into an S3 
> subdir s3://bucket/path/to/parquet/date=D, after which I need to run 
> analytics by date. The analytics involves sorts and window functions. But I'm 
> only interested in within-date sorts/windows, and don't care about the 
> between-dates sorts and windows. 
> Currently, if I simply load the entire data set from the parent dir 
> s3://bucket/path/to/parquet, and then write Spark SQL statements involving 
> "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
> exploring ways to write analytics code/function per Spark partition, and send 
> such code/function to each partition.
> The biggest problem now is that Spark's in-memory partitions do not 
> correspond to the physical files loaded from S3, so there is no way to 
> guarantee that the analytics by partition is done by date as desired.
> Is there a way we can explicitly enable a direct correspondence between file 
> partitions and in-memory partitions?






[jira] [Updated] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition

2018-02-19 Thread V Luong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

V Luong updated SPARK-23467:

Description: 
I would like to echo the need described here: 
[https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]

In many of my use cases, data is appended by date (say, date X) into an S3 
subdir s3://bucket/path/to/parquet/date=X, after which I need to run analytics 
by date. The analytics involves sorts and window functions. But I'm only 
interested in within-date sorts/windows, and don't care about the between-dates 
sorts and windows. 

Currently, if I simply load the entire data set from the parent dir 
s3://bucket/path/to/parquet, and then write Spark SQL statements involving 
"SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
exploring ways to write analytics code/function per Spark partition, and send 
such code/function to each partition.

The biggest problem now is that Spark's in-memory partitions do not correspond 
to the physical files loaded from S3, so there is no way to guarantee that the 
analytics by partition is done by date as desired.

Is there a way we can explicitly enable a direct correspondence between file 
partitions and in-memory partitions?

  was:
I would like to echo the need described here: 
[https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]

In many of my use cases, data is appended by date (say, date X) into an S3 
subdir `s3://bucket/path/to/parquet/date=X`, after which I need to run 
analytics by date. The analytics involves sorts and window functions. But I'm 
only interested in within-date sorts/windows, and don't care about the 
between-dates sorts and windows. 

Currently, if I simply load the entire data set from the parent dir 
`s3://bucket/path/to/parquet`, and then write Spark SQL statements involving 
"SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
exploring ways to write analytics code/function per Spark partition, and send 
such code/function to each partition.

The biggest problem now is that Spark's in-memory partitions do not correspond 
to the physical files loaded from S3, so there is no way to guarantee that the 
analytics by partition is done by date as desired.

Is there a way we can explicitly enable a direct correspondence between file 
partitions and in-memory partitions?


> Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) 
> with each in-memory partition mapped to 1 physical file partition
> --
>
> Key: SPARK-23467
> URL: https://issues.apache.org/jira/browse/SPARK-23467
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Major
>
> I would like to echo the need described here: 
> [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]
> In many of my use cases, data is appended by date (say, date X) into an S3 
> subdir s3://bucket/path/to/parquet/date=X, after which I need to run 
> analytics by date. The analytics involves sorts and window functions. But I'm 
> only interested in within-date sorts/windows, and don't care about the 
> between-dates sorts and windows. 
> Currently, if I simply load the entire data set from the parent dir 
> s3://bucket/path/to/parquet, and then write Spark SQL statements involving 
> "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
> exploring ways to write analytics code/function per Spark partition, and send 
> such code/function to each partition.
> The biggest problem now is that Spark's in-memory partitions do not 
> correspond to the physical files loaded from S3, so there is no way to 
> guarantee that the analytics by partition is done by date as desired.
> Is there a way we can explicitly enable a direct correspondence between file 
> partitions and in-memory partitions?






[jira] [Updated] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition

2018-02-19 Thread V Luong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

V Luong updated SPARK-23467:

Description: 
I would like to echo the need described here: 
[https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]

In many of my use cases, data is appended by date (say, date D) into an S3 
subdir s3://bucket/path/to/parquet/date=D, after which I need to run analytics 
by date. The analytics involves sorts and window functions. But I'm only 
interested in within-date sorts/windows, and don't care about the between-dates 
sorts and windows. 

Currently, if I simply load the entire data set from the parent dir 
s3://bucket/path/to/parquet, and then write Spark SQL statements involving 
"SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
exploring ways to write analytics code/function per Spark partition, and send 
such code/function to each partition.

The biggest problem now is that Spark's in-memory partitions do not correspond 
to the physical files loaded from S3, so there is no way to guarantee that the 
analytics by partition is done by date as desired.

Is there a way we can explicitly enable a direct correspondence between file 
partitions and in-memory partitions?

  was:
I would like to echo the need described here: 
[https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]

In many of my use cases, data is appended by date (say, date X) into an S3 
subdir s3://bucket/path/to/parquet/date=X, after which I need to run analytics 
by date. The analytics involves sorts and window functions. But I'm only 
interested in within-date sorts/windows, and don't care about the between-dates 
sorts and windows. 

Currently, if I simply load the entire data set from the parent dir 
s3://bucket/path/to/parquet, and then write Spark SQL statements involving 
"SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
exploring ways to write analytics code/function per Spark partition, and send 
such code/function to each partition.

The biggest problem now is that Spark's in-memory partitions do not correspond 
to the physical files loaded from S3, so there is no way to guarantee that the 
analytics by partition is done by date as desired.

Is there a way we can explicitly enable a direct correspondence between file 
partitions and in-memory partitions?


> Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) 
> with each in-memory partition mapped to 1 physical file partition
> --
>
> Key: SPARK-23467
> URL: https://issues.apache.org/jira/browse/SPARK-23467
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Major
>
> I would like to echo the need described here: 
> [https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]
> In many of my use cases, data is appended by date (say, date D) into an S3 
> subdir s3://bucket/path/to/parquet/date=D, after which I need to run 
> analytics by date. The analytics involves sorts and window functions. But I'm 
> only interested in within-date sorts/windows, and don't care about the 
> between-dates sorts and windows. 
> Currently, if I simply load the entire data set from the parent dir 
> s3://bucket/path/to/parquet, and then write Spark SQL statements involving 
> "SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
> exploring ways to write analytics code/function per Spark partition, and send 
> such code/function to each partition.
> The biggest problem now is that Spark's in-memory partitions do not 
> correspond to the physical files loaded from S3, so there is no way to 
> guarantee that the analytics by partition is done by date as desired.
> Is there a way we can explicitly enable a direct correspondence between file 
> partitions and in-memory partitions?






[jira] [Created] (SPARK-23467) Enable way to create DataFrame from pre-partitioned files (Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file partition

2018-02-19 Thread V Luong (JIRA)
V Luong created SPARK-23467:
---

 Summary: Enable way to create DataFrame from pre-partitioned files 
(Parquet/ORC/etc.) with each in-memory partition mapped to 1 physical file 
partition
 Key: SPARK-23467
 URL: https://issues.apache.org/jira/browse/SPARK-23467
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.2.1
Reporter: V Luong


I would like to echo the need described here: 
[https://forums.databricks.com/questions/12323/how-to-create-a-dataframe-that-has-one-filepartiti.html]

In many of my use cases, data is appended by date (say, date X) into an S3 
subdir `s3://bucket/path/to/parquet/date=X`, after which I need to run 
analytics by date. The analytics involves sorts and window functions. But I'm 
only interested in within-date sorts/windows, and don't care about the 
between-dates sorts and windows. 

Currently, if I simply load the entire data set from the parent dir 
`s3://bucket/path/to/parquet`, and then write Spark SQL statements involving 
"SORT" or "WINDOW", then very expensive shuffles will be invoked. Hence I am 
exploring ways to write analytics code/function per Spark partition, and send 
such code/function to each partition.

The biggest problem now is that Spark's in-memory partitions do not correspond 
to the physical files loaded from S3, so there is no way to guarantee that the 
analytics by partition is done by date as desired.

Is there a way we can explicitly enable a direct correspondence between file 
partitions and in-memory partitions?
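
For illustration, a rough PySpark sketch of the per-date processing this implies today (the dates, paths, and column names below are hypothetical): each date's subdirectory is read and processed independently, so no sort or window ever spans dates, at the cost of driver-side looping.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dates, paths and column names, for illustration only.
dates = ["2018-01-01", "2018-01-02"]

for d in dates:
    # Each read sees only one date's files, so the subsequent sort/window
    # is confined to that date and never shuffles data across dates.
    df = spark.read.parquet("s3://bucket/path/to/parquet/date={}".format(d))
    result = df.sort("event_time")   # stand-in for the per-date analytics
    result.write.mode("overwrite").parquet("s3://bucket/path/to/output/date={}".format(d))
{code}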






[jira] [Commented] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure

2018-02-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369600#comment-16369600
 ] 

Dongjoon Hyun commented on SPARK-23346:
---

Thank you for reporting. I adjusted the priority to `Critical` since it's 
reported in 2.2.0. This might be critical, but cannot be a blocker for Apache 
Spark 2.3.0.

> Failed tasks reported as success if the failure reason is not ExceptionFailure
> --
>
> Key: SPARK-23346
> URL: https://issues.apache.org/jira/browse/SPARK-23346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0
>Reporter: 吴志龙
>Priority: Critical
> Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png
>
>
>  !企业微信截图_15179715023606.png!  !企业微信截图_15179714603307.png! We have many other 
> failure reasons, such as TaskResultLost, but the status is success. In the web 
> ui, we count non-ExceptionFailure failures as successful tasks, which is 
> highly misleading.
> detail message:
> Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
> recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
> 27): TaskResultLost (result lost from block manager)






[jira] [Updated] (SPARK-23346) Failed tasks reported as success if the failure reason is not ExceptionFailure

2018-02-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23346:
--
Priority: Critical  (was: Blocker)

> Failed tasks reported as success if the failure reason is not ExceptionFailure
> --
>
> Key: SPARK-23346
> URL: https://issues.apache.org/jira/browse/SPARK-23346
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.0
> Environment: HADOOP 2.6 + JDK1.8 + SPARK 2.2.0
>Reporter: 吴志龙
>Priority: Critical
> Attachments: 企业微信截图_15179714603307.png, 企业微信截图_15179715023606.png
>
>
>  !企业微信截图_15179715023606.png!  !企业微信截图_15179714603307.png! We have many other 
> failure reasons, such as TaskResultLost, but the status is success. In the web 
> ui, we count non-ExceptionFailure failures as successful tasks, which is 
> highly misleading.
> detail message:
> Job aborted due to stage failure: Task 0 in stage 7.0 failed 10 times, most 
> recent failure: Lost task 0.9 in stage 7.0 (TID 35, 60.hadoop.com, executor 
> 27): TaskResultLost (result lost from block manager)






[jira] [Commented] (SPARK-23449) Extra java options lose order in Docker context

2018-02-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369567#comment-16369567
 ] 

Dongjoon Hyun commented on SPARK-23449:
---

I removed the fixed version from this issue. We can update it when this merged 
into Apache Spark 2.3 RC if exists.

> Extra java options lose order in Docker context
> ---
>
> Key: SPARK-23449
> URL: https://issues.apache.org/jira/browse/SPARK-23449
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8S with supplied Docker image. Passing 
> along extra java options.
>Reporter: Andrew Korzhuev
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`, when 
> processed in `entrypoint.sh`, do not preserve their ordering, which makes 
> `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before 
> any other experimental options.
>  
> Steps to reproduce:
>  # Set `spark.driver.extraJavaOptions`, e.g. 
> `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled 
> -XX:+UseCGroupMemoryLimitForHeap`
>  # Submit the application to a k8s cluster.
>  # Fetch logs and observe that on each run the order of options is different, and 
> when `-XX:+UnlockExperimentalVMOptions` is not first, startup will fail.
>  
> Expected behaviour:
>  # Order of `extraJavaOptions` should be preserved.
>  
> Cause:
> `entrypoint.sh` fetches environment options with `env`, which doesn't 
> guarantee ordering.
> {code:java}
> env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
> /tmp/java_opts.txt{code}






[jira] [Updated] (SPARK-23449) Extra java options lose order in Docker context

2018-02-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23449:
--
Fix Version/s: (was: 2.3.0)

> Extra java options lose order in Docker context
> ---
>
> Key: SPARK-23449
> URL: https://issues.apache.org/jira/browse/SPARK-23449
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: Running Spark on K8S with supplied Docker image. Passing 
> along extra java options.
>Reporter: Andrew Korzhuev
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`, when 
> processed in `entrypoint.sh`, do not preserve their ordering, which makes 
> `-XX:+UnlockExperimentalVMOptions` unusable, as you have to pass it before 
> any other experimental options.
>  
> Steps to reproduce:
>  # Set `spark.driver.extraJavaOptions`, e.g. 
> `-XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+CMSClassUnloadingEnabled 
> -XX:+UseCGroupMemoryLimitForHeap`
>  # Submit the application to a k8s cluster.
>  # Fetch logs and observe that on each run the order of options is different, and 
> when `-XX:+UnlockExperimentalVMOptions` is not first, startup will fail.
>  
> Expected behaviour:
>  # Order of `extraJavaOptions` should be preserved.
>  
> Cause:
> `entrypoint.sh` fetches environment options with `env`, which doesn't 
> guarantee ordering.
> {code:java}
> env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
> /tmp/java_opts.txt{code}






[jira] [Updated] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-02-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23094:
--
Fix Version/s: (was: 2.3.0)

> Json Readers choose wrong encoding when bad records are present and fail
> 
>
> Key: SPARK-23094
> URL: https://issues.apache.org/jira/browse/SPARK-23094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser 
> code paths for expressions but not the readers. We should also cover reader 
> code paths reading files with bad characters.






[jira] [Commented] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail

2018-02-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369565#comment-16369565
 ] 

Dongjoon Hyun commented on SPARK-23094:
---

Since this is reverted and not a regression, I'll remove the fixed version to 
unblock RC4.

> Json Readers choose wrong encoding when bad records are present and fail
> 
>
> Key: SPARK-23094
> URL: https://issues.apache.org/jira/browse/SPARK-23094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser 
> code paths for expressions but not the readers. We should also cover reader 
> code paths reading files with bad characters.






[jira] [Commented] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369564#comment-16369564
 ] 

Dongjoon Hyun commented on SPARK-23390:
---

Since this is not a regression for Parquet, and the new ORC reader is disabled by 
default, I'll remove the target version from this to unblock RC4.
cc [~cloud_fan] and [~sameerag]

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}






[jira] [Updated] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23390:
--
Target Version/s:   (was: 2.3.0)

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}






[jira] [Updated] (SPARK-23390) Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7

2018-02-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23390:
--
Fix Version/s: (was: 2.3.0)

> Flaky Test Suite: FileBasedDataSourceSuite in Spark 2.3/hadoop 2.7
> --
>
> Key: SPARK-23390
> URL: https://issues.apache.org/jira/browse/SPARK-23390
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Wenchen Fan
>Priority: Major
>
> We're seeing multiple failures in {{FileBasedDataSourceSuite}} in 
> {{spark-branch-2.3-test-sbt-hadoop-2.7}}:
> {code}
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01215805999 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
> {code}
> Here's the full history: 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/history/
> From a very quick look, these failures seem to be correlated with 
> https://github.com/apache/spark/pull/20479 (cc [~dongjoon]) as evident from 
> the following stack trace (full logs 
> [here|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/189/console]):
>  
> {code}
> [info] - Enabling/disabling ignoreMissingFiles using orc (648 milliseconds)
> 15:55:58.673 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in 
> stage 61.0 (TID 85, localhost, executor driver): TaskKilled (Stage cancelled)
> 15:55:58.674 WARN org.apache.spark.DebugFilesystem: Leaked filesystem 
> connection created at:
> java.lang.Throwable
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.open(RecordReaderUtils.java:173)
>   at 
> org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:254)
>   at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:633)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:138)
> {code}
> Also, while this might just be a false correlation, the frequency of these 
> test failures has increased considerably in 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/
>  after https://github.com/apache/spark/pull/20562 (cc 
> [~feng...@databricks.com]) was merged.
> The following is Parquet leakage.
> {code}
> Caused by: sbt.ForkMain$ForkError: java.lang.Throwable: null
>   at 
> org.apache.spark.DebugFilesystem$.addOpenStream(DebugFilesystem.scala:36)
>   at org.apache.spark.DebugFilesystem.open(DebugFilesystem.scala:70)
>   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:538)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:149)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:400)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:356)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:106)
> {code}
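
For background on where these warnings come from: the test harness wraps the filesystem and remembers a Throwable for every stream it opens, then reports whatever was never closed. A simplified sketch of that bookkeeping pattern (illustrative only, not the actual DebugFilesystem code; the object and method names here are made up):
{code:scala}
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

// Simplified open-stream tracker: record where each stream was opened so that
// anything still registered at the end of a test can be reported together with
// the stack trace of its creation site.
object OpenStreamTracker {
  private val openStreams = new ConcurrentHashMap[AnyRef, Throwable]()

  def onOpen(stream: AnyRef): Unit = openStreams.put(stream, new Throwable("stream opened here"))

  def onClose(stream: AnyRef): Unit = openStreams.remove(stream)

  // A non-empty result is what shows up as "possibly leaked file streams" in the test output.
  def leaks(): Seq[Throwable] = openStreams.values().asScala.toSeq
}
{code}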



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23417) pyspark tests give wrong sbt instructions

2018-02-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23417:


Assignee: Apache Spark

> pyspark tests give wrong sbt instructions
> -
>
> Key: SPARK-23417
> URL: https://issues.apache.org/jira/browse/SPARK-23417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>Priority: Minor
>
> When running python/run-tests, the script indicates that I must run 
> "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or 
> 'build/mvn -Pkafka-0-8 package'". The sbt command fails:
>  
> [error] Expected ID character
> [error] Not a valid command: streaming-kafka-0-8-assembly
> [error] Expected project ID
> [error] Expected configuration
> [error] Expected ':' (if selecting a configuration)
> [error] Expected key
> [error] Not a valid key: streaming-kafka-0-8-assembly
> [error] streaming-kafka-0-8-assembly/assembly
> [error] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23417) pyspark tests give wrong sbt instructions

2018-02-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369500#comment-16369500
 ] 

Apache Spark commented on SPARK-23417:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/20638

> pyspark tests give wrong sbt instructions
> -
>
> Key: SPARK-23417
> URL: https://issues.apache.org/jira/browse/SPARK-23417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> When running python/run-tests, the script indicates that I must run 
> "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or 
> 'build/mvn -Pkafka-0-8 package'". The sbt command fails:
>  
> [error] Expected ID character
> [error] Not a valid command: streaming-kafka-0-8-assembly
> [error] Expected project ID
> [error] Expected configuration
> [error] Expected ':' (if selecting a configuration)
> [error] Expected key
> [error] Not a valid key: streaming-kafka-0-8-assembly
> [error] streaming-kafka-0-8-assembly/assembly
> [error] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23417) pyspark tests give wrong sbt instructions

2018-02-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23417:


Assignee: (was: Apache Spark)

> pyspark tests give wrong sbt instructions
> -
>
> Key: SPARK-23417
> URL: https://issues.apache.org/jira/browse/SPARK-23417
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Minor
>
> When running python/run-tests, the script indicates that I must run 
> "'build/sbt assembly/package streaming-kafka-0-8-assembly/assembly' or 
> 'build/mvn -Pkafka-0-8 package'". The sbt command fails:
>  
> [error] Expected ID character
> [error] Not a valid command: streaming-kafka-0-8-assembly
> [error] Expected project ID
> [error] Expected configuration
> [error] Expected ':' (if selecting a configuration)
> [error] Expected key
> [error] Not a valid key: streaming-kafka-0-8-assembly
> [error] streaming-kafka-0-8-assembly/assembly
> [error] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369424#comment-16369424
 ] 

Manan Bakshi commented on SPARK-23463:
--

Hi Marco,

I have attached a text file along with this comment. If you perform the 
following operations on the rdd, you should be able to repro the issue:

 

rdd = sc.textFile("sample").map(lambda x: x.split(","))

df = rdd.toDF(["dev", "val"])

df = df.filter(df["val"] > 0)

 

[^sample]

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
> Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0: the Cost Based Optimizer was 
> introduced to look at table stats and decide filter selectivity. However, 
> since then, filter has started behaving unexpectedly for blank values. The 
> operation not only drops rows with blank values but also filters out 
> rows that actually meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over the val column in this dataframe yields 
> unexpected results. For example, the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if the 0 in the filter condition is 
> replaced by the float 0.0:
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.
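
A workaround that should sidestep the int-vs-float difference is to cast the string column explicitly before comparing, so the comparison type no longer depends on the literal. A minimal Scala sketch (the same {{cast("double")}} call is available on PySpark columns; this is a hedged illustration, not a fix for the underlying coercion issue):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("blank-filter").getOrCreate()
import spark.implicits._

// String column with one blank value, mirroring the table in the description.
val df = Seq(("ALL", "0.01"), ("ALL", ""), ("ALL", "45")).toDF("dev", "val")

// Casting first turns the blank into null, so that row is dropped and the result
// no longer depends on whether the literal is the integer 0 or the double 0.0.
df.filter(col("val").cast("double") > 0).show()
{code}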



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manan Bakshi updated SPARK-23463:
-
Attachment: sample

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
> Attachments: sample
>
>
> Filter operations were updated in Spark 2.2.0: the Cost Based Optimizer was 
> introduced to look at table stats and decide filter selectivity. However, 
> since then, filter has started behaving unexpectedly for blank values. The 
> operation not only drops rows with blank values but also filters out 
> rows that actually meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over the val column in this dataframe yields 
> unexpected results. For example, the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if the 0 in the filter condition is 
> replaced by the float 0.0:
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23466) Remove redundant null checks in generated Java code by GenerateUnsafeProjection

2018-02-19 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23466:


 Summary: Remove redundant null checks in generated Java code by 
GenerateUnsafeProjection
 Key: SPARK-23466
 URL: https://issues.apache.org/jira/browse/SPARK-23466
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


One of the TODOs in {{GenerateUnsafeProjection}} is "if the nullability of field is 
correct, we can use it to save null check", which would simplify the generated code.
When {{nullable=false}} in a {{DataType}}, {{GenerateUnsafeProjection}} can drop the 
null-check code from the generated Java code.
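
As a rough illustration of what "redundant" means here (a sketch of the intent, not the code that GenerateUnsafeProjection actually emits):
{code:scala}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// A field declared non-nullable in the schema:
val schema = StructType(Seq(StructField("id", LongType, nullable = false)))

// The generated projection currently still guards every field with something like
//   if (row.isNullAt(0)) { writer.setNullAt(0) } else { writer.write(0, row.getLong(0)) }
// With nullable = false the isNullAt branch can never be taken, so only the
// write call needs to be emitted for such fields.
{code}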



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17430) Spark task Hangs after OOM while DAG scheduler tries to serialize a task

2018-02-19 Thread Furcy Pin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369367#comment-16369367
 ] 

Furcy Pin commented on SPARK-17430:
---

Same problem here in Spark 2.2.1.

The job hangs forever: the UI is still responsive, but no tasks make progress and the 
executors stay reserved. 
This can be really annoying, as it creates zombie jobs that eat cluster resources 
for nothing until they are killed manually.

The stack trace in the driver logs shows this error:
{code:java}
WARN util.Utils: Suppressing exception in finally: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at 
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1822)
at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:719)
at java.io.ObjectOutputStream.close(ObjectOutputStream.java:740)
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:620)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1346)
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:619)
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:562)
at 
org.apache.spark.scheduler.DAGScheduler.createShuffleMapStage(DAGScheduler.scala:341)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage$1.apply(DAGScheduler.scala:312)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage$1.apply(DAGScheduler.scala:305)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.mutable.Stack.foreach(Stack.scala:169)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getOrCreateShuffleMapStage(DAGScheduler.scala:305)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:381)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$getOrCreateParentStages$1.apply(DAGScheduler.scala:380)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:45)
at scala.collection.SetLike$class.map(SetLike.scala:93)
Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: Java 
heap space
at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253)
at 
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211)
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:145)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1894)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1875)
at 
java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 

[jira] [Updated] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-19 Thread Igor Berman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Berman updated SPARK-23423:

Attachment: (was: no-dyn-allocation-failed-no-statusUpdate.txt)

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23459) Improve the error message when unknown column is specified in partition columns

2018-02-19 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369346#comment-16369346
 ] 

Kazuaki Ishizaki commented on SPARK-23459:
--

Have you started working on this? If not, I can take it.

> Improve the error message when unknown column is specified in partition 
> columns
> ---
>
> Key: SPARK-23459
> URL: https://issues.apache.org/jira/browse/SPARK-23459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
>   test("save with an unknown partition column") {
> withTempDir { dir =>
>   val path = dir.getCanonicalPath
> Seq(1L -> "a").toDF("i", "j").write
>   .format("parquet")
>   .partitionBy("unknownColumn")
>   .save(path)
> }
>   }
> {noformat}
> We got the following error message:
> {noformat}
> Partition column unknownColumn not found in schema 
> StructType(StructField(i,LongType,false), StructField(j,StringType,true));
> {noformat}
> We should not call toString, but catalogString in the function 
> `partitionColumnsSchema` of `PartitioningUtils.scala`
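
A sketch of the difference (illustrative only; the actual change belongs in {{partitionColumnsSchema}} in {{PartitioningUtils.scala}}):
{code:scala}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("i", LongType, nullable = false),
  StructField("j", StringType, nullable = true)))
val colName = "unknownColumn"

// Message built from the default StructType.toString (what users see today):
println(s"Partition column $colName not found in schema $schema")
// Message built from catalogString, the DDL-style rendering the ticket asks for:
println(s"Partition column $colName not found in schema ${schema.catalogString}")
// => ... not found in schema struct<i:bigint,j:string>
{code}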



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23420) Datasource loading not handling paths with regex chars.

2018-02-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369330#comment-16369330
 ] 

Steve Loughran commented on SPARK-23420:


Note that if there's a colon in the path, it'd still fail even if the 
glob is bypassed. That's a long-outstanding issue.

> Datasource loading not handling paths with regex chars.
> ---
>
> Key: SPARK-23420
> URL: https://issues.apache.org/jira/browse/SPARK-23420
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
>Reporter: Mitchell
>Priority: Major
>
> Greetings, during some recent testing I ran across an issue when attempting to 
> load files with regex chars like []()* etc. in their names. The files are valid in 
> the various storages, and the normal Hadoop APIs all access them properly.
> When my code is executed, I get the following stack trace.
> 8/02/14 04:52:46 ERROR yarn.ApplicationMaster: User class threw exception: 
> java.io.IOException: Illegal file pattern: Unmatched closing ')' near index 
> 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ java.io.IOException: Illegal file pattern: Unmatched closing ')' near 
> index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ at org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:71) at 
> org.apache.hadoop.fs.GlobFilter.<init>(GlobFilter.java:50) at 
> org.apache.hadoop.fs.Globber.doGlob(Globber.java:210) at 
> org.apache.hadoop.fs.Globber.glob(Globber.java:149) at 
> org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1955) at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.globStatus(S3AFileSystem.java:2477) at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:234) 
> at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:244)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:618)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
> scala.collection.immutable.List.flatMap(List.scala:344) at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533) at 
> org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412) at 
> com.sap.profile.SparkProfileTask.main(SparkProfileTask.java:95) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
>  Caused by: java.util.regex.PatternSyntaxException: Unmatched closing ')' 
> near index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^ at java.util.regex.Pattern.error(Pattern.java:1955) at 
> java.util.regex.Pattern.compile(Pattern.java:1700) at 
> java.util.regex.Pattern.<init>(Pattern.java:1351) at 
> java.util.regex.Pattern.compile(Pattern.java:1054) at 
> org.apache.hadoop.fs.GlobPattern.set(GlobPattern.java:156) at 
> org.apache.hadoop.fs.GlobPattern.<init>(GlobPattern.java:42) at 
> org.apache.hadoop.fs.GlobFilter.init(GlobFilter.java:67) ... 25 more 18/02/14 
> 04:52:46 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, 
> (reason: User class threw exception: java.io.IOException: Illegal file 
> pattern: Unmatched closing ')' near index 130 
> A_VERY_LONG_DIRECTORY_FOLDER_THAT_INCLUDES_MULTIBYTE_AND_SPECIAL_CHARACTERS_abcdefghijklmnopqrst_0123456789_~@#\$%\^&\(\)-_=\+[(?:]);',\._???_???_???_??
>  ^) 18/02/14 04:52:46 INFO spark.SparkContext: Invoking 

[jira] [Commented] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one

2018-02-19 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369275#comment-16369275
 ] 

Mihaly Toth commented on SPARK-23465:
-

I have started working on this.

> Dataset.withAllColumnsRenamed should map all column names to a new one
> --
>
> Key: SPARK-23465
> URL: https://issues.apache.org/jira/browse/SPARK-23465
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Minor
>
> Currently one can rename columns only one by one using the 
> {{withColumnRenamed()}} function. When one would like to rename all or most 
> of the columns, it would be easier to specify an algorithm for mapping from 
> the old name to the new one (like prefixing) than to iterate over all the fields.
> Example usage is joining to a Dataset with the same or similar schema 
> (a special case is self-joining) where the names are the same or overlapping. 
> Such a joined Dataset would fail at {{saveAsTable()}}.
> With the new function, usage would be as easy as:
> {code:java}
> ds.withAllColumnsRenamed("prefix" + _)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one

2018-02-19 Thread Mihaly Toth (JIRA)
Mihaly Toth created SPARK-23465:
---

 Summary: Dataset.withAllColumnsRenamed should map all column names 
to a new one
 Key: SPARK-23465
 URL: https://issues.apache.org/jira/browse/SPARK-23465
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Mihaly Toth


Currently one can rename columns only one by one using the 
{{withColumnRenamed()}} function. When one would like to rename all or most of 
the columns, it would be easier to specify an algorithm for mapping from the old 
name to the new one (like prefixing) than to iterate over all the fields.

Example usage is joining to a Dataset with the same or similar schema (a special 
case is self-joining) where the names are the same or overlapping. Such a 
joined Dataset would fail at {{saveAsTable()}}.

With the new function, usage would be as easy as:
{code:java}
ds.withAllColumnsRenamed("prefix" + _)
{code}
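
Until such an API exists, the same effect can be achieved with the existing surface; a small sketch ({{withAllColumnsRenamed}} here is a local helper, not a Spark method):
{code:scala}
import org.apache.spark.sql.DataFrame

// Apply the same mapping function to every column name, using only the existing
// API: toDF accepts the full list of new names in column order.
def withAllColumnsRenamed(df: DataFrame, rename: String => String): DataFrame =
  df.toDF(df.columns.map(rename): _*)

// Example: prefix every column before a self join so the names no longer clash.
// val joined = withAllColumnsRenamed(ds.toDF(), "left_" + _).crossJoin(ds)
{code}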




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-19 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369221#comment-16369221
 ] 

Igor Berman edited comment on SPARK-23423 at 2/19/18 3:47 PM:
--

[~skonto] so I've run the application today with the relevant logs at debug level. 
It seems that with dynamic allocation on, with executors starting and shutting down, 
the chances that every slave will get 2 failures starting some executor are much 
higher than in the regular case (without dynamic allocation). SPARK-19755 seems to be 
the core issue here - after half a day of a long-running driver in client mode, 
almost 1/3 of all mesos slaves could be marked as blacklisted.

the reasons for executor failures might be different and transient(e.g. port 
collision)

I think I'll close this Jira as duplicate for SPARK-19755, WDYT?

 

Here just one example that out of 74 mesos slaves 16 already blacklisted
{code:java}
grep "Blacklisting Mesos slave" /var/log/mycomp/spark-myapp.log | wc -l
16{code}


was (Author: igor.berman):
[~skonto] so I've run the application today with the relevant logs at debug level. 
It seems that with dynamic allocation on, with executors starting and shutting down, 
the chances that every slave will get 2 failures starting some executor are much 
higher than in the regular case (without dynamic allocation). SPARK-19755 seems to be 
the core issue here - after half a day of a long-running driver in client mode, 
almost 1/3 of all mesos slaves could be marked as blacklisted.

the reasons for executor failures might be different and transient(e.g. port 
collision)

I think I'll close this Jira as duplicate for SPARK-19755, WDYT?

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
> Attachments: no-dyn-allocation-failed-no-statusUpdate.txt
>
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Issue Comment Deleted] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-19 Thread Igor Berman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor Berman updated SPARK-23423:

Comment: was deleted

(was: [~skonto] today I haven't managed to run with dynamic allocation on; however, 
I attached details for the following situation:

one of the executors failed while running without dyn.allocation. All parties seem 
to get all the updates: slave agent, master, and even the framework, but not 
MesosCoarseGrainedSchedulerBackend, even though TaskSchedulerImpl did get "Lost 
executor 15"

see [^no-dyn-allocation-failed-no-statusUpdate.txt]

 

I'll enable dyn.allocation tomorrow. Meanwhile, do you think the attached 
behavior might be problematic?

 )

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
> Attachments: no-dyn-allocation-failed-no-statusUpdate.txt
>
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.

2018-02-19 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369227#comment-16369227
 ] 

Igor Berman commented on SPARK-19755:
-

This Jira is very relevant for the case of running with dynamic allocation 
turned on, where starting and stopping executors is part of the natural lifecycle 
of the driver. The chances of failing while starting an executor increase (e.g. 
due to transient port collisions).

 

The threshold of 2 seems too low and artificial for these use cases. I've 
observed a situation where at some point almost 1/3 of the mesos-slave nodes were 
marked as blacklisted (but they were fine). This creates a situation where the 
cluster has free resources, but frameworks can't use them since they actively 
decline offers from the master.
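
To make that concrete, a sketch of the kind of knob being argued for ({{spark.mesos.maxSlaveFailures}} is an assumed key used for illustration, not an existing Spark configuration):
{code:scala}
import org.apache.spark.SparkConf

// Today the backend effectively hard-codes MAX_SLAVE_FAILURES = 2. With dynamic
// allocation constantly starting executors, two transient failures per node are
// easy to accumulate, so a configurable limit (or honoring spark.blacklist.enabled)
// would let long-running drivers relax it.
def maxSlaveFailures(conf: SparkConf): Int =
  conf.getInt("spark.mesos.maxSlaveFailures", 2)  // hypothetical key; default mirrors current behavior
{code}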

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result 
> - scheduler cannot create an executor after some time.
> ---
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler
>Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are 
> dockerized.
>Reporter: Timur Abakumov
>Priority: Major
>
> When a task fails for some reason, MesosCoarseGrainedSchedulerBackend 
> increases the failure counter for the slave where that task was running.
> When the counter is >= 2 (MAX_SLAVE_FAILURES), the mesos slave is excluded.  
> Over time the scheduler cannot create a new executor - every slave is in the 
> blacklist.  A task failure is not necessarily related to host health - especially 
> for long-running stream apps.
> If accepted as a bug: a possible solution is to use spark.blacklist.enabled to 
> make that functionality optional, and if it makes sense, MAX_SLAVE_FAILURES 
> could also be made configurable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors

2018-02-19 Thread Igor Berman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369221#comment-16369221
 ] 

Igor Berman commented on SPARK-23423:
-

[~skonto] so I've run the application today with the relevant logs at debug level. 
It seems that with dynamic allocation on, with executors starting and shutting down, 
the chances that every slave will get 2 failures starting some executor are much 
higher than in the regular case (without dynamic allocation). SPARK-19755 seems to be 
the core issue here - after half a day of a long-running driver in client mode, 
almost 1/3 of all mesos slaves could be marked as blacklisted.

the reasons for executor failures might be different and transient(e.g. port 
collision)

I think I'll close this Jira as duplicate for SPARK-19755, WDYT?

> Application declines any offers when killed+active executors rich 
> spark.dynamicAllocation.maxExecutors
> --
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Spark Core
>Affects Versions: 2.2.1
>Reporter: Igor Berman
>Priority: Major
>  Labels: Mesos, dynamic_allocation
> Attachments: no-dyn-allocation-failed-no-statusUpdate.txt
>
>
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend 
> when running on Mesos with dynamic allocation on and limiting number of max 
> executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource 
> consumption(with some idle times in between), due to dyn.allocation it 
> receives offers and then releases them after current chunk of work processed.
> Since at 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
>  the backend compares numExecutors < executorLimit and 
> numExecutors is defined as slaves.values.map(_.taskIDs.size).sum and slaves 
> holds all slaves ever "met", i.e. both active and killed (see comment 
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122)]
>  
> On the other hand, number of taskIds should be updated due to statusUpdate, 
> but suppose this update is lost(actually I don't see logs of 'is now 
> TASK_KILLED') so this number of executors might be wrong
>  
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation 
> enabled") {
>   setBackend(Map(
> "spark.dynamicAllocation.maxExecutors" -> "1",
> "spark.dynamicAllocation.enabled" -> "true",
> "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
>  
>  
> Workaround: Don't set maxExecutors with dynamicAllocation on
>  
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and 
> probably can advice something([~vanzin], [~skonto], [~susanxhuynh])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8529) Set metadata for MinMaxScaler

2018-02-19 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369194#comment-16369194
 ] 

Barry Becker commented on SPARK-8529:
-

Complementing the output metadata in what way? We need more info about what needs 
to be set. Is this still an issue? It's been open for almost 3 years.

From a fitted MinMaxScalerModel, can I get the originalMin and originalMax values?

Not sure what this comment in the code (on master) means:
/**
 * Model fitted by [[MinMaxScaler]].
 *
 * @param originalMin min value for each original column during fitting
 * @param originalMax max value for each original column during fitting
 *
 * TODO: The transformer does not yet set the metadata in the output column (SPARK-8529).
 */
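
If I read the model class correctly, the fitted model does expose those values; a quick Scala sketch (the toy data and column names are made up):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder().master("local[*]").appName("minmax").getOrCreate()

// Fit a MinMaxScaler on a toy vector column and read back the per-column
// min/max observed during fitting.
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 10.0)),
  (1, Vectors.dense(5.0, 20.0)),
  (2, Vectors.dense(3.0, 30.0))
)).toDF("id", "features")

val model = new MinMaxScaler().setInputCol("features").setOutputCol("scaled").fit(df)

println(model.originalMin)  // [1.0,10.0]
println(model.originalMax)  // [5.0,30.0]
{code}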

> Set metadata for MinMaxScaler
> -
>
> Key: SPARK-8529
> URL: https://issues.apache.org/jira/browse/SPARK-8529
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Add this as an reminder for complementing the output metadata for transformer 
> MinMaxScaler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5377) Dynamically add jar into Spark Driver's classpath.

2018-02-19 Thread Shay Elbaz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369112#comment-16369112
 ] 

Shay Elbaz commented on SPARK-5377:
---

+1

This seems like a very useful improvement and would save us many current 
workarounds.

Any specific reason why this was closed?

> Dynamically add jar into Spark Driver's classpath.
> --
>
> Key: SPARK-5377
> URL: https://issues.apache.org/jira/browse/SPARK-5377
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Chengxiang Li
>Priority: Major
>
> Spark supports dynamically adding a jar to the executor classpath through 
> SparkContext::addJar(), but it does not support dynamically adding a jar to the 
> driver classpath. In most cases (if not all), a user dynamically adds a jar 
> with SparkContext::addJar() because some classes from the jar will be 
> referenced in an upcoming Spark job, which means the classes need to be loaded on 
> the Spark driver side as well, e.g. during serialization. I think it makes sense to 
> add an API to add a jar to the driver classpath, or just make it available in 
> SparkContext::addJar(). HIVE-9410 is a real case from Hive on Spark.
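
For reference, the common workaround today is reflection over a URLClassLoader on the driver; a sketch ({{addJarToClassLoader}} is my own helper, not a Spark API, and it assumes a Java 8-style URLClassLoader):
{code:scala}
import java.io.File
import java.net.{URL, URLClassLoader}

// Illustrative workaround only: force a jar onto a driver-side classloader at
// runtime so its classes can be loaded during deserialization on the driver.
// SparkContext.addJar by itself only ships the jar to executors.
def addJarToClassLoader(path: String, loader: URLClassLoader): Unit = {
  val addUrl = classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
  addUrl.setAccessible(true)
  addUrl.invoke(loader, new File(path).toURI.toURL)
}

// e.g. addJarToClassLoader("/path/to/dep.jar",
//   getClass.getClassLoader.asInstanceOf[URLClassLoader])  // works on Java 8 app classloaders
{code}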



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23464) MesosClusterScheduler double-escapes parameters to bash command

2018-02-19 Thread Marcin Kurczych (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Kurczych updated SPARK-23464:

Description: 
Parameters passed to the driver launching command in a Mesos container are escaped 
using the _shellEscape_ function. In SPARK-18114, additional wrapping in double 
quotes was introduced. This cancels out the quoting done by _shellEscape_ and 
makes it impossible to run tasks with whitespace in parameters, as they are 
interpreted as additional parameters to the in-container spark-submit.

This is how a parameter passed to the in-container spark-submit looks now:
{code:java}
--conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another""
{code}
This is how they look after reverting SPARK-18114 related commit:
{code:java}
--conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"
{code}
In the current version, submitting a job with such extraJavaOptions causes the 
following error:
{code:java}
Error: Unrecognized option: -Dbar=another

Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
  on one of the worker machines inside the cluster 
("cluster")
  (Default: client).
(... further spark-submit help ...)
{code}

Reverting SPARK-18114 is the solution to the issue. I can create a pull request 
in GitHub. I thought about adding unit tests for that, but the methods generating 
the driver launch command are private.
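
To illustrate the effect, a stand-alone sketch ({{shellEscapeLike}} only mimics what the scheduler's escaping does to values containing whitespace; it is not the actual private helper):
{code:scala}
// A value with whitespace gets quoted once by the escape helper; the extra pair
// of quotes added in SPARK-18114 then closes immediately when bash parses the
// command, so everything after the first space becomes a separate argument.
def shellEscapeLike(value: String): String =
  if (value.exists(_.isWhitespace)) "\"" + value.replace("\"", "\\\"") + "\"" else value

val opt = """-Dfoo="first value" -Dbar=another"""
// Escaped once (pre-SPARK-18114 shape):
println("--conf spark.executor.extraJavaOptions=" + shellEscapeLike(opt))
// Escaped and then wrapped again (current shape, where the quoting cancels out):
println("--conf \"spark.executor.extraJavaOptions=" + shellEscapeLike(opt) + "\"")
{code}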

  was:
Parameters passed to the driver launching command in a Mesos container are escaped 
using the _shellEscape_ function. In SPARK-18114, additional wrapping in double 
quotes was introduced. This cancels out the quoting done by _shellEscape_ and 
makes it impossible to run tasks with whitespace in parameters, as they are 
interpreted as additional parameters to the in-container spark-submit.

This is how a parameter passed to the in-container spark-submit looks now:
{code:java}
--conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another""
{code}
This is how they look after reverting SPARK-18114 related commit:
{code:java}
--conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"
{code}
In current version submitting job with such extraJavaOptions causes following 
error:
{code:java}
Error: Unrecognized option: -Dbar=another

Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
  on one of the worker machines inside the cluster 
("cluster")
  (Default: client).
(... further spark-submit help ...)
{code}


> MesosClusterScheduler double-escapes parameters to bash command
> ---
>
> Key: SPARK-23464
> URL: https://issues.apache.org/jira/browse/SPARK-23464
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 with Mesosphere patches (but the problem 
> exists in main repo)
> DC/OS 1.9.5
>Reporter: Marcin Kurczych
>Priority: Major
>
> Parameters passed to the driver launching command in a Mesos container are escaped 
> using the _shellEscape_ function. In SPARK-18114, additional wrapping in double 
> quotes was introduced. This cancels out the quoting done by _shellEscape_ 
> and makes it impossible to run tasks with whitespace in parameters, as they are 
> interpreted as additional parameters to the in-container spark-submit.
> This is how a parameter passed to the in-container spark-submit looks now:
> {code:java}
> --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another""
> {code}
> This is how they look after reverting SPARK-18114 related commit:
> {code:java}
> --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"
> {code}
> In current version submitting job with such extraJavaOptions causes following 
> error:
> {code:java}
> Error: Unrecognized option: -Dbar=another
> Usage: spark-submit [options]  [app arguments]
> Usage: spark-submit --kill [submission ID] --master [spark://...]
> Usage: spark-submit --status [submission ID] --master [spark://...]
> Usage: 

[jira] [Created] (SPARK-23464) MesosClusterScheduler double-escapes parameters to bash command

2018-02-19 Thread Marcin Kurczych (JIRA)
Marcin Kurczych created SPARK-23464:
---

 Summary: MesosClusterScheduler double-escapes parameters to bash 
command
 Key: SPARK-23464
 URL: https://issues.apache.org/jira/browse/SPARK-23464
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.2.0
 Environment: Spark 2.2.0 with Mesosphere patches (but the problem 
exists in main repo)

DC/OS 1.9.5
Reporter: Marcin Kurczych


Parameters passed to the driver launching command in a Mesos container are escaped 
using the _shellEscape_ function. In SPARK-18114, additional wrapping in double 
quotes was introduced. This cancels out the quoting done by _shellEscape_ and 
makes it impossible to run tasks with whitespace in parameters, as they are 
interpreted as additional parameters to the in-container spark-submit.

This is how a parameter passed to the in-container spark-submit looks now:
{code}
--conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another""
{code}

This is how they look after reverting SPARK-18114 related commit:
{code}
--conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"
{code}

In current version submitting job with such extraJavaOptions causes following 
error:
{code}
Error: Unrecognized option: -Dfoo=another

Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
  on one of the worker machines inside the cluster 
("cluster")
  (Default: client).
(... further spark-submit help ...)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23464) MesosClusterScheduler double-escapes parameters to bash command

2018-02-19 Thread Marcin Kurczych (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Kurczych updated SPARK-23464:

Description: 
Parameters passed to the driver launching command in a Mesos container are escaped 
using the _shellEscape_ function. In SPARK-18114, additional wrapping in double 
quotes was introduced. This cancels out the quoting done by _shellEscape_ and 
makes it impossible to run tasks with whitespace in parameters, as they are 
interpreted as additional parameters to the in-container spark-submit.

This is how a parameter passed to the in-container spark-submit looks now:
{code:java}
--conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another""
{code}
This is how they look after reverting SPARK-18114 related commit:
{code:java}
--conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"
{code}
In the current version, submitting a job with such extraJavaOptions causes the 
following error:
{code:java}
Error: Unrecognized option: -Dbar=another

Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
  on one of the worker machines inside the cluster 
("cluster")
  (Default: client).
(... further spark-submit help ...)
{code}

  was:
Parameters passed to the driver launch command in the Mesos container are 
escaped using the _shellEscape_ function. In SPARK-18114, additional wrapping 
in double quotes was introduced. This cancels out the quoting done by 
_shellEscape_ and makes it impossible to run tasks with whitespace in their 
parameters, as the whitespace-separated pieces are interpreted as additional 
arguments to the in-container spark-submit.

This is how a parameter passed to the in-container spark-submit looks now:
{code}
--conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another""
{code}

This is how it looks after reverting the SPARK-18114-related commit:
{code}
--conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"
{code}

In the current version, submitting a job with such extraJavaOptions causes the 
following error:
{code}
Error: Unrecognized option: -Dfoo=another

Usage: spark-submit [options]  [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
  on one of the worker machines inside the cluster 
("cluster")
  (Default: client).
(... further spark-submit help ...)
{code}


> MesosClusterScheduler double-escapes parameters to bash command
> ---
>
> Key: SPARK-23464
> URL: https://issues.apache.org/jira/browse/SPARK-23464
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 with Mesosphere patches (but the problem 
> exists in main repo)
> DC/OS 1.9.5
>Reporter: Marcin Kurczych
>Priority: Major
>
> Parameters passed to the driver launch command in the Mesos container are 
> escaped using the _shellEscape_ function. In SPARK-18114, additional wrapping 
> in double quotes was introduced. This cancels out the quoting done by 
> _shellEscape_ and makes it impossible to run tasks with whitespace in their 
> parameters, as the whitespace-separated pieces are interpreted as additional 
> arguments to the in-container spark-submit.
> This is how a parameter passed to the in-container spark-submit looks now:
> {code:java}
> --conf "spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another""
> {code}
> This is how it looks after reverting the SPARK-18114-related commit:
> {code:java}
> --conf spark.executor.extraJavaOptions="-Dfoo=\"first value\" -Dbar=another"
> {code}
> In the current version, submitting a job with such extraJavaOptions causes the 
> following error:
> {code:java}
> Error: Unrecognized option: -Dbar=another
> Usage: spark-submit [options]  [app arguments]
> Usage: spark-submit --kill [submission ID] --master [spark://...]
> Usage: spark-submit --status [submission ID] --master [spark://...]
> Usage: spark-submit run-example [options] example-class [example args]
> Options:
>   --master MASTER_URL spark://host:port, mesos://host:port, yarn, or 
> local.
>   --deploy-mode DEPLOY_MODE   Whether to launch the 

[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368968#comment-16368968
 ] 

Marco Gaido commented on SPARK-23463:
-

Sorry, what do you mean by blank values? What is the type of "val"? Can you 
provide some code to generate a DataFrame like the one you posted, to reproduce 
the issue? Thanks.
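
For what it's worth, a minimal PySpark sketch of a DataFrame shaped like the 
table in the description (this assumes "val" is a string column with one blank 
entry, which is only a guess since the original data was not shared):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Guessed shape of the reporter's data: "val" as a string column, one blank value.
df = spark.createDataFrame(
    [("ALL", "0.01"), ("ALL", "0.02"), ("ALL", "0.004"), ("ALL", ""),
     ("ALL", "2.5"), ("ALL", "4.5"), ("ALL", "45")],
    ["dev", "val"])

df.filter(df["val"] > 0).show()    # reported to come back empty on 2.2.x
df.filter(df["val"] > 0.0).show()  # reported to return the non-blank rows
{code}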

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
>
> Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was 
> introduced to look at the table stats and decide filter selectivity. However, 
> since then, filter has started behaving unexpectedly for blank values. The 
> operation would not only drop rows with blank values but also filter out 
> rows that actually meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over val column in this dataframe yields 
> unexpected results. For eg. the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if 0 in filter condition is 
> replaced by float 0.0
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.






[jira] [Commented] (SPARK-23450) jars option in spark submit is documented in misleading way

2018-02-19 Thread Gregory Reshetniak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368953#comment-16368953
 ] 

Gregory Reshetniak commented on SPARK-23450:


Hi guys, any update on this, please?

> jars option in spark submit is documented in misleading way
> ---
>
> Key: SPARK-23450
> URL: https://issues.apache.org/jira/browse/SPARK-23450
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.1
>Reporter: Gregory Reshetniak
>Priority: Major
>
> I am wondering if the {{--jars}} option on spark submit is actually meant for 
> distributing the dependency jars onto the nodes in cluster?
>  
> In my case I can see it working as a "symlink" of sorts. But the 
> documentation is written in the way that suggests otherwise. Please help me 
> figure out if this is a bug or just my reading of the docs. Thanks!






[jira] [Assigned] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-02-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23415:


Assignee: Apache Spark

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> The test suite sometimes fails due to a 60-second timeout.
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/






[jira] [Assigned] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-02-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23415:


Assignee: (was: Apache Spark)

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The test suite sometimes fails due to a 60-second timeout.
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/






[jira] [Commented] (SPARK-23415) BufferHolderSparkSubmitSuite is flaky

2018-02-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368895#comment-16368895
 ] 

Apache Spark commented on SPARK-23415:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20636

> BufferHolderSparkSubmitSuite is flaky
> -
>
> Key: SPARK-23415
> URL: https://issues.apache.org/jira/browse/SPARK-23415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The test suite sometimes fails due to a 60-second timeout.
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> failAfter did not complete within 60 seconds.
> {code}
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87380/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4206/






[jira] [Issue Comment Deleted] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manan Bakshi updated SPARK-23463:
-
Comment: was deleted

(was: I believe that the bug has something to do with the CBO that was 
introduced in Spark 2.2.0. The CBO looks at the blank min value in the table 
stats and is unable to decide on filter selectivity.)

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
>
> Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was 
> introduced to look at the table stats and decide filter selectivity. However, 
> since then, filter has started behaving unexpectedly for blank values. The 
> operation would not only drop rows with blank values but also filter out 
> rows that actually meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over val column in this dataframe yields 
> unexpected results. For eg. the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if 0 in filter condition is 
> replaced by float 0.0
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.






[jira] [Updated] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manan Bakshi updated SPARK-23463:
-
Description: 
Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was 
introduced to look at the table stats and decide filter selectivity. However, 
since then, filter has started behaving unexpectedly for blank values. The 
operation would not only drop rows with blank values but also filter out 
rows that actually meet the filter criteria.

Steps to repro

Consider a simple dataframe with some blank values as below:
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields 
unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)
||dev||val||

However, the filter operation works as expected if 0 in filter condition is 
replaced by float 0.0

df.filter(df["val"] > 0.0)
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

 

Note that this bug only exists in Spark 2.2.0 and later. The previous versions 
filter as expected for both int (0) and float (0.0) values in the filter 
condition.

Also, if there are no blank values, the filter operation works as expected for 
all versions.

  was:
I have a simple dataframe with some blank values as below
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields 
unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)
||dev||val||

However, the filter operation works as expected if 0 in filter condition is 
replaced by float 0.0

df.filter(df["val"] > 0.0)
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

 

Note that this bug only exists in Spark 2.2.0 and later. The previous versions 
filter as expected for both int (0) and float (0.0) values in the filter 
condition.

Also, if there are no blank values, the filter operation works as expected for 
all versions.


> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
>
> Filter operations were updated in Spark 2.2.0. Cost Based Optimizer was 
> introduced to look at the table stats and decide filter selectivity. However, 
> since then, filter has started behaving unexpectedly for blank values. The 
> operation would not only drop rows with blank values but also filter out 
> rows that actually meet the filter criteria.
> Steps to repro
> Consider a simple dataframe with some blank values as below:
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over val column in this dataframe yields 
> unexpected results. For eg. the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if 0 in filter condition is 
> replaced by float 0.0
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.






[jira] [Updated] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manan Bakshi updated SPARK-23463:
-
Description: 
I have a simple dataframe with some blank values as below
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields 
unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)
||dev||val||

However, the filter operation works as expected if 0 in filter condition is 
replaced by float 0.0

df.filter(df["val"] > 0.0)
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

 

Note that this bug only exists in Spark 2.2.0 and later. The previous versions 
filter as expected for both int (0) and float (0.0) values in the filter 
condition.

Also, if there are no blank values, the filter operation works as expected for 
all versions.

  was:
I have a simple dataframe with some blank values as below
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields 
unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)
||dev||val||

However, the filter operation works as expected if 0 in filter condition is 
replaced by float 0.0

df.filter(df["val"] > 0.0)
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

 

Note that this bug only exists in Spark 2.2.0 and later. The previous versions 
filter as expected for both int (0) and float (0.0) values in the filter 
condition.

Also, the filter operation works as expected for all versions, if there are no 
blank values.


> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
>
> I have a simple dataframe with some blank values as below
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over val column in this dataframe yields 
> unexpected results. For eg. the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if 0 in filter condition is 
> replaced by float 0.0
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, if there are no blank values, the filter operation works as expected 
> for all versions.






[jira] [Commented] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368879#comment-16368879
 ] 

Manan Bakshi commented on SPARK-23463:
--

I believe that the bug has something to do with the CBO that was introduced in 
Spark 2.2.0. The CBO looks at the blank min value in the table stats and is 
unable to decide on filter selectivity.
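
One quick sanity check for this hypothesis would be to flip the CBO flag and 
re-run the filter. Note that CBO statistics are normally only used for tables 
with computed statistics (ANALYZE TABLE), so a plain in-memory DataFrame may 
not exercise them at all; the sketch below only shows how to toggle the flag:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("ALL", "0.01"), ("ALL", ""), ("ALL", "45")], ["dev", "val"])

# spark.sql.cbo.enabled was added in Spark 2.2.0 (it defaults to false there),
# so toggling it is one way to see whether the CBO is involved at all.
for flag in ("false", "true"):
    spark.conf.set("spark.sql.cbo.enabled", flag)
    print(flag, df.filter(df["val"] > 0).count())
{code}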

> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
>
> I have a simple dataframe with some blank values as below
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over val column in this dataframe yields 
> unexpected results. For eg. the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if 0 in filter condition is 
> replaced by float 0.0
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, the filter operation works as expected for all versions, if there are 
> no blank values.






[jira] [Updated] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manan Bakshi updated SPARK-23463:
-
Description: 
I have a simple dataframe with some blank values as below
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields 
unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)
||dev||val||

However, the filter operation works as expected if 0 in filter condition is 
replaced by float 0.0

df.filter(df["val"] > 0.0)
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

 

Note that this bug only exists in Spark 2.2.0 and later. The previous versions 
filter as expected for both int (0) and float (0.0) values in the filter 
condition.

Also, the filter operation works as expected for all versions, if there are no 
blank values.

  was:
I have a simple dataframe as below
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields 
unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)
||dev||val||

However, the filter operation works as expected if 0 in filter condition is 
replaced by float 0.0

df.filter(df["val"] > 0.0)
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

 

Note that this bug only exists in Spark 2.2.0 and later. The previous versions 
filter as expected for both int (0) and float (0.0) values in the filter 
condition.


> Filter operation fails to handle blank values and evicts rows that even 
> satisfy the filtering condition
> ---
>
> Key: SPARK-23463
> URL: https://issues.apache.org/jira/browse/SPARK-23463
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
>Reporter: Manan Bakshi
>Priority: Critical
>
> I have a simple dataframe with some blank values as below
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL| |
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
> Running a simple filter operation over val column in this dataframe yields 
> unexpected results. For eg. the following query returned an empty dataframe:
> df.filter(df["val"] > 0)
> ||dev||val||
> However, the filter operation works as expected if 0 in filter condition is 
> replaced by float 0.0
> df.filter(df["val"] > 0.0)
> ||dev||val||
> |ALL|0.01|
> |ALL|0.02|
> |ALL|0.004|
> |ALL|2.5|
> |ALL|4.5|
> |ALL|45|
>  
> Note that this bug only exists in Spark 2.2.0 and later. The previous 
> versions filter as expected for both int (0) and float (0.0) values in the 
> filter condition.
> Also, the filter operation works as expected for all versions, if there are 
> no blank values.






[jira] [Created] (SPARK-23463) Filter operation fails to handle blank values and evicts rows that even satisfy the filtering condition

2018-02-19 Thread Manan Bakshi (JIRA)
Manan Bakshi created SPARK-23463:


 Summary: Filter operation fails to handle blank values and evicts 
rows that even satisfy the filtering condition
 Key: SPARK-23463
 URL: https://issues.apache.org/jira/browse/SPARK-23463
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.1
Reporter: Manan Bakshi


I have a simple dataframe as below
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL| |
|ALL|2.5|
|ALL|4.5|
|ALL|45|

Running a simple filter operation over val column in this dataframe yields 
unexpected results. For eg. the following query returned an empty dataframe:

df.filter(df["val"] > 0)
||dev||val||

However, the filter operation works as expected if 0 in filter condition is 
replaced by float 0.0

df.filter(df["val"] > 0.0)
||dev||val||
|ALL|0.01|
|ALL|0.02|
|ALL|0.004|
|ALL|2.5|
|ALL|4.5|
|ALL|45|

 

Note that this bug only exists in Spark 2.2.0 and later. The previous versions 
filter as expected for both int (0) and float (0.0) values in the filter 
condition.






[jira] [Commented] (SPARK-21962) Distributed Tracing in Spark

2018-02-19 Thread Paul Doran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16368877#comment-16368877
 ] 

Paul Doran commented on SPARK-21962:


I have a working local version for Jaeger implementing the 
SparkListenerInterface.

Is the scope of this issue broader?

Either way, I'd be happy to open-source what I have to provide something 
similar to: [https://github.com/JasonMWhite/spark-datadog-relay]

 

Thanks

> Distributed Tracing in Spark
> 
>
> Key: SPARK-21962
> URL: https://issues.apache.org/jira/browse/SPARK-21962
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Major
>
> Spark should support distributed tracing, which is the mechanism, widely 
> popularized by Google in the [Dapper 
> Paper|https://research.google.com/pubs/pub36356.html], where network requests 
> have additional metadata used for tracing requests between services.
> This would be useful for me since I have OpenZipkin style tracing in my 
> distributed application up to the Spark driver, and from the executors out to 
> my other services, but the link is broken in Spark between driver and 
> executor since the Span IDs aren't propagated across that link.
> An initial implementation could instrument the most important network calls 
> with trace ids (like launching and finishing tasks), and incrementally add 
> more tracing to other calls (torrent block distribution, external shuffle 
> service, etc) as the feature matures.
> Search keywords: Dapper, Brave, OpenZipkin, HTrace


