[jira] [Comment Edited] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873021#comment-15873021
 ] 

Takeshi Yamamuro edited comment on SPARK-19638 at 2/18/17 6:52 AM:
---

Aha, I got you and you're right; in that case, catalyst does not push down such 
a condition.
{code}
scala> Seq((1, ("a", 0.1)), (2, ("b", 0.3))).toDF("a", 
"b").write.parquet("/Users/maropu/Desktop/data")

scala> val df = spark.read.load("/Users/maropu/Desktop/data")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct<_1: string, _2: double>]

scala> df.where($"a" === 1).explain
== Physical Plan ==
*Project [a#108, b#109]
+- *Filter (isnotnull(a#108) && (a#108 = 1))
   +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, 
Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: 
[], PushedFilters: [IsNotNull(a), EqualTo(a,1)], ReadSchema: 
struct<a:int,b:struct<_1:string,_2:double>>

scala> df.where($"b._1" === "b").explain
== Physical Plan ==
*Filter (b#109._1 = b)
+- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>>
{code}
BTW, this is not a bug but an improvement (I changed the issue Type accordingly), 
because this kind of query does not return incorrect results.
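
For reference, the reason (as I understand it) is that the public data source 
filter API identifies columns only by a flat attribute name, so a predicate on a 
nested field has no pushable representation. A minimal sketch, assuming the 
org.apache.spark.sql.sources API as of 2.1 (this is an illustration, not the 
actual Catalyst code):
{code}
import org.apache.spark.sql.sources.{Filter, IsNotNull, EqualTo}

// What the Parquet relation receives for df.where($"a" === 1):
val pushedForTopLevel: Seq[Filter] = Seq(IsNotNull("a"), EqualTo("a", 1))

// For df.where($"b._1" === "b") there is no Filter case that can name a struct
// field, so the predicate stays in the Spark-side *Filter node and nothing is
// pushed down:
val pushedForNestedField: Seq[Filter] = Seq.empty
{code}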


was (Author: maropu):
Aha, I got you and you're right; in that case, catalyst does not push down such 
a condition.
{code}
scala> Seq((1, ("a", 0.1)), (2, ("b", 0.3))).toDF("a", 
"b").write.parquet("/Users/maropu/Desktop/data")
scala> val df = spark.read.load("/Users/maropu/Desktop/data")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct<_1: string, _2: double>]
scala> df.where($"a" === 1).explain
== Physical Plan ==
*Project [a#108, b#109]
+- *Filter (isnotnull(a#108) && (a#108 = 1))
   +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, 
Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: 
[], PushedFilters: [IsNotNull(a), EqualTo(a,1)], ReadSchema: 
struct<a:int,b:struct<_1:string,_2:double>>
scala> df.where($"b._1" === "b").explain
== Physical Plan ==
*Filter (b#109._1 = b)
+- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>>
{code}
BTW, this is not a bug but an improvement (I changed the issue Type accordingly), 
because this kind of query does not return incorrect results.

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple of 
> column aliases, such as:
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries vs those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873021#comment-15873021
 ] 

Takeshi Yamamuro edited comment on SPARK-19638 at 2/18/17 6:53 AM:
---

Aha, I got you and you're right; in that case, catalyst does not push down such 
a condition.
{code}
scala> Seq((1, ("a", 0.1)), (2, ("b", 0.3))).toDF("a", 
"b").write.parquet("/Users/maropu/Desktop/data")

scala> val df = spark.read.load("/Users/maropu/Desktop/data")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct<_1: string, _2: double>]

scala> df.where($"a" === 1).explain
== Physical Plan ==
*Project [a#108, b#109]
+- *Filter (isnotnull(a#108) && (a#108 = 1))
   +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, 
Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: 
[], PushedFilters: [IsNotNull(a), EqualTo(a,1)], ReadSchema: 
struct<a:int,b:struct<_1:string,_2:double>>

scala> df.where($"b._1" === "b").explain
== Physical Plan ==
*Filter (b#109._1 = b)
+- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>>
{code}
BTW, this is not a bug but an improvement (I changed the issue Type accordingly), 
because this kind of query does not return incorrect results.


was (Author: maropu):
Aha, I got you and you're right; in that case, catalyst does not push down such 
a condition.
{code}
scala> Seq((1, ("a", 0.1)), (2, ("b", 0.3))).toDF("a", 
"b").write.parquet("/Users/maropu/Desktop/data")

scala> val df = spark.read.load("/Users/maropu/Desktop/data")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct<_1: string, _2: double>]

scala> df.where($"a" === 1).explain
== Physical Plan ==
*Project [a#108, b#109]
+- *Filter (isnotnull(a#108) && (a#108 = 1))
   +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, 
Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: 
[], PushedFilters: [IsNotNull(a), EqualTo(a,1)], ReadSchema: 
struct<a:int,b:struct<_1:string,_2:double>>

scala> df.where($"b._1" === "b").explain
== Physical Plan ==
*Filter (b#109._1 = b)
+- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>>
{code}
BTW, this is not a bug but an improvement (I changed the issue Type accordingly), 
because this kind of query does not return incorrect results.

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple of 
> column aliases, such as:
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries vs those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-17 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-19638:
-
Issue Type: Improvement  (was: Bug)

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple of 
> column aliases, such as:
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries vs those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873021#comment-15873021
 ] 

Takeshi Yamamuro commented on SPARK-19638:
--

Aha, I got you and you're right; in that case, catalyst does not push down such 
a condition.
{code}
scala> Seq((1, ("a", 0.1)), (2, ("b", 0.3))).toDF("a", 
"b").write.parquet("/Users/maropu/Desktop/data")
scala> val df = spark.read.load("/Users/maropu/Desktop/data")
df: org.apache.spark.sql.DataFrame = [a: int, b: struct<_1: string, _2: double>]
scala> df.where($"a" === 1).explain
== Physical Plan ==
*Project [a#108, b#109]
+- *Filter (isnotnull(a#108) && (a#108 = 1))
   +- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, 
Location: InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: 
[], PushedFilters: [IsNotNull(a), EqualTo(a,1)], ReadSchema: 
struct<a:int,b:struct<_1:string,_2:double>>
scala> df.where($"b._1" === "b").explain
== Physical Plan ==
*Filter (b#109._1 = b)
+- *FileScan parquet [a#108,b#109] Batched: false, Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/maropu/Desktop/data], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct<a:int,b:struct<_1:string,_2:double>>
{code}
BTW, this is not a bug but an improvement (I changed the issue Type accordingly), 
because this kind of query does not return incorrect results.

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple of 
> column aliases, such as:
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries vs those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guifeng updated SPARK-19645:

Description: 
We are trying to use Structured Streaming in production; however, there is 
currently a bug related to the streaming job restart process. 
  The following is the concrete error message:  
{quote}
   Caused by: java.lang.IllegalStateException: Error committing version 2 into 
HDFSStateStore[id = (op=0, part=136), dir = 
/tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to rename 
/tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 to 
/tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
... 14 more
{quote}

 The bug can be easily reproduced: just modify {color:red} val metadataRoot 
= "hdfs://localhost:8020/tmp/checkpoint" {color} in StreamTest and then run the 
test {color:red} sort after aggregate in complete mode {color} in 
StreamingAggregationSuite. The main reason is that when the streaming job 
restarts, Spark recomputes the WAL offsets and generates the same HDFS delta 
file (the latest delta file generated before the restart, named 
"currentBatchId.delta"). In my opinion, this is a bug. If you also consider this 
a bug, I can fix it.


  was:
We are trying to use Structured Streaming in production; however, there is 
currently a bug related to the streaming job restart process. 
  The following is the concrete error message:  
{quote}
   Caused by: java.lang.IllegalStateException: Error committing version 2 into 
HDFSStateStore[id = (op=0, part=136), dir = 
/tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to rename 
/tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 to 
/tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
at 

[jira] [Assigned] (SPARK-19654) Structured Streaming API for R

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19654:


Assignee: Apache Spark  (was: Felix Cheung)

> Structured Streaming API for R
> --
>
> Key: SPARK-19654
> URL: https://issues.apache.org/jira/browse/SPARK-19654
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>
> As a user I want to be able to process data from a streaming source in R.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19654) Structured Streaming API for R

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873001#comment-15873001
 ] 

Apache Spark commented on SPARK-19654:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/16982

> Structured Streaming API for R
> --
>
> Key: SPARK-19654
> URL: https://issues.apache.org/jira/browse/SPARK-19654
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> As a user I want to be able to process data from a streaming source in R.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19654) Structured Streaming API for R

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19654:


Assignee: Felix Cheung  (was: Apache Spark)

> Structured Streaming API for R
> --
>
> Key: SPARK-19654
> URL: https://issues.apache.org/jira/browse/SPARK-19654
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>
> As a user I want to be able to process data from a streaming source in R.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19654) Structured Streaming API for R

2017-02-17 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-19654:


 Summary: Structured Streaming API for R
 Key: SPARK-19654
 URL: https://issues.apache.org/jira/browse/SPARK-19654
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Felix Cheung
Assignee: Felix Cheung


As a user I want to be able to process data from a streaming source in R.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19637) add to_json APIs to SQL

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19637:


Assignee: (was: Apache Spark)

> add to_json APIs to SQL
> ---
>
> Key: SPARK-19637
> URL: https://issues.apache.org/jira/browse/SPARK-19637
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>
> The method "to_json" is useful for turning a struct into a JSON 
> string. It currently doesn't work in SQL, but adding it should be trivial.
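
For context, {{to_json}} already exists on the Dataset API in 2.1 
({{org.apache.spark.sql.functions.to_json}}, as far as I know); the ticket is 
about also exposing it in SQL. A rough sketch of the existing API and the 
proposed SQL form (the SQL call is the desired end state, not current behaviour, 
and the column/table names are made up):
{code}
import org.apache.spark.sql.functions.{to_json, struct}

// Assuming a SparkSession named `spark` (e.g. in spark-shell):
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.select(to_json(struct($"id", $"name")).as("json")).show(false)

// Proposed once this ticket is implemented:
// spark.sql("SELECT to_json(struct(id, name)) FROM some_table")
{code}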



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19637) add to_json APIs to SQL

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872984#comment-15872984
 ] 

Apache Spark commented on SPARK-19637:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/16981

> add to_json APIs to SQL
> ---
>
> Key: SPARK-19637
> URL: https://issues.apache.org/jira/browse/SPARK-19637
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>
> The method "to_json" is useful for turning a struct into a JSON 
> string. It currently doesn't work in SQL, but adding it should be trivial.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19637) add to_json APIs to SQL

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19637:


Assignee: Apache Spark

> add to_json APIs to SQL
> ---
>
> Key: SPARK-19637
> URL: https://issues.apache.org/jira/browse/SPARK-19637
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> The method "to_json" is useful for turning a struct into a JSON 
> string. It currently doesn't work in SQL, but adding it should be trivial.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19617) Fix the race condition when starting and stopping a query quickly

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872977#comment-15872977
 ] 

Apache Spark commented on SPARK-19617:
--

User 'gf53520' has created a pull request for this issue:
https://github.com/apache/spark/pull/16980

> Fix the race condition when starting and stopping a query quickly
> -
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> The streaming thread in StreamExecution uses the following ways to check if 
> it should exit:
> - Catch an InterruptedException.
> - `StreamExecution.state` is TERMINATED.
> When starting and stopping a query quickly, the above two checks may both 
> fail:
> - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] and 
> swallow the InterruptedException
> - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then 
> [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252]
>  changes the state from `TERMINATED` to `ACTIVE`.
> If the above cases both happen, the query will hang forever.
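
A condensed illustration of the ordering described above (a sketch with made-up 
names, not the real StreamExecution code): if stop() marks the state TERMINATED 
before the streaming thread has marked it ACTIVE, the thread later clobbers 
TERMINATED with ACTIVE, and if the interrupt was also swallowed there is nothing 
left to break the loop.
{code}
import java.util.concurrent.atomic.AtomicReference

object RaceSketch {
  sealed trait State
  case object INITIALIZING extends State
  case object ACTIVE extends State
  case object TERMINATED extends State

  val state = new AtomicReference[State](INITIALIZING)

  // Caller thread: asks the query to stop (the interrupt it would also send
  // may be swallowed, see HADOOP-14084).
  def stop(): Unit = state.set(TERMINATED)

  // Streaming thread: if this runs after stop(), it overwrites TERMINATED
  // and then loops forever.
  def runBatches(): Unit = {
    state.set(ACTIVE)
    while (state.get() == ACTIVE) {
      Thread.sleep(100) // stand-in for processing a batch
    }
  }
}
{code}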



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19617) Fix the race condition when starting and stopping a query quickly

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19617:
-
Fix Version/s: 2.2.0

> Fix the race condition when starting and stopping a query quickly
> -
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> The streaming thread in StreamExecution uses the following ways to check if 
> it should exit:
> - Catch an InterruptedException.
> - `StreamExecution.state` is TERMINATED.
> When starting and stopping a query quickly, the above two checks may both 
> fail:
> - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] and 
> swallow the InterruptedException
> - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then 
> [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252]
>  changes the state from `TERMINATED` to `ACTIVE`.
> If the above cases both happen, the query will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872921#comment-15872921
 ] 

guifeng edited comment on SPARK-19645 at 2/18/17 3:17 AM:
--

[~zsxwing] OK, I think the solution is to determine whether the current 
versionId's delta file already exists before renaming the delta file. When the 
file exists, just skip the rename operation (we may also need to delete the 
current versionId's snapshot).
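
A rough sketch of that guard, using the Hadoop FileSystem API (a hypothetical 
helper for illustration, not the actual HDFSBackedStateStoreProvider patch):
{code}
import org.apache.hadoop.fs.{FileSystem, Path}

def commitDeltaFile(fs: FileSystem, tempDeltaFile: Path, finalDeltaFile: Path): Unit = {
  if (fs.exists(finalDeltaFile)) {
    // The delta for this version was already committed by the run before the
    // restart; drop the freshly written temp file and treat the commit as done.
    fs.delete(tempDeltaFile, false)
  } else if (!fs.rename(tempDeltaFile, finalDeltaFile)) {
    throw new java.io.IOException(s"Failed to rename $tempDeltaFile to $finalDeltaFile")
  }
}
{code}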


was (Author: guifengl...@gmail.com):
[~zsxwing] OK, I think the solution is to determine whether the delta file 
already exists before renaming it. When the file exists, just skip the rename 
operation.

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug related to the streaming job restart process. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can easily be reproduced by restarting a previous streaming job, and 
> the main reason is that when the streaming job restarts, Spark recomputes the 
> WAL offsets and generates the same HDFS delta file (the latest delta file 
> generated before the restart, named "currentBatchId.delta"). In my opinion, 
> this is a bug. If you also consider this a bug, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19653) `Vector` Type Should Be A First-Class Citizen In Spark SQL

2017-02-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872938#comment-15872938
 ] 

Liang-Chi Hsieh edited comment on SPARK-19653 at 2/18/17 3:12 AM:
--

Actually, some Spark SQL functions, like the mentioned {{avg}} and {{sum}}, only 
support {{NumericType}}. The fact that they don't support {{Vector}} is not 
entirely because the {{Vector}} type isn't a first-class citizen in Spark SQL.

Personally I would -1 for this.
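
For what it's worth, element-wise statistics over a {{Vector}} column can already 
be computed by dropping to the RDD API, which is exactly the kind of detour the 
ticket wants to avoid. A sketch assuming a DataFrame with a {{Vector}} column 
named "features" (the column/ids are made up for the example):
{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.stat.Statistics

// Assuming a SparkSession named `spark` (e.g. in spark-shell):
import spark.implicits._

val df = Seq((1, Vectors.dense(1.0, 2.0)), (2, Vectors.dense(3.0, 4.0)))
  .toDF("id", "features")

// avg($"features") is rejected today; element-wise stats need the RDD API:
val summary = Statistics.colStats(
  df.select("features").rdd.map(r => OldVectors.fromML(r.getAs[Vector](0))))
println(summary.mean) // element-wise mean, returned as an mllib Vector
{code}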


was (Author: viirya):
Actually, some Spark SQL functions, like the mentioned {{avg}} and {{sum}}, only 
support {{NumericType}}. The fact that they don't support {{Vector}} is not 
entirely because the {{Vector}} type isn't a first-class citizen in Spark SQL.

> `Vector` Type Should Be A First-Class Citizen In Spark SQL
> --
>
> Key: SPARK-19653
> URL: https://issues.apache.org/jira/browse/SPARK-19653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Mike Dusenberry
>
> *Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
> "Spark ML") should be added as a first-class citizen to Spark SQL.
> *Current Status*:  Currently, Spark MLlib adds a [{{Vector}} SQL datatype | 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
>  to allow DataFrames/DataSets to use {{Vector}} columns, which is necessary 
> for MLlib algorithms.  Although this allows a DataFrame/DataSet to contain 
> vectors, it does not allow one to make complete use of the rich set of 
> features made available by Spark SQL.  For example, it is not possible to use 
> any of the SQL functions, such as {{avg}}, {{sum}}, etc. on a {{Vector}} 
> column, nor is it possible to save a DataFrame with a {{Vector}} column as a 
> CSV file.  In any of these cases, an error message is returned with a note 
> that the operator is not supported on a {{Vector}} type.
> *Benefit*: Allow users to make use of all Spark SQL features that can be 
> reasonably applied to a vector.
> *Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
> first-class citizen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19617) Fix the race condition when starting and stopping a query quickly

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872948#comment-15872948
 ] 

Apache Spark commented on SPARK-19617:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16979

> Fix the race condition when starting and stopping a query quickly
> -
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The streaming thread in StreamExecution uses the following ways to check if 
> it should exit:
> - Catch an InterruptedException.
> - `StreamExecution.state` is TERMINATED.
> When starting and stopping a query quickly, the above two checks may both 
> fail:
> - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] and 
> swallow the InterruptedException
> - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then 
> [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252]
>  changes the state from `TERMINATED` to `ACTIVE`.
> If the above cases both happen, the query will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19653) `Vector` Type Should Be A First-Class Citizen In Spark SQL

2017-02-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872938#comment-15872938
 ] 

Liang-Chi Hsieh commented on SPARK-19653:
-

Actually, some Spark SQL functions, like the mentioned {{avg}} and {{sum}}, only 
support {{NumericType}}. The fact that they don't support {{Vector}} is not 
entirely because the {{Vector}} type isn't a first-class citizen in Spark SQL.

> `Vector` Type Should Be A First-Class Citizen In Spark SQL
> --
>
> Key: SPARK-19653
> URL: https://issues.apache.org/jira/browse/SPARK-19653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Mike Dusenberry
>
> *Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
> "Spark ML") should be added as a first-class citizen to Spark SQL.
> *Current Status*:  Currently, Spark MLlib adds a [{{Vector}} SQL datatype | 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
>  to allow DataFrames/DataSets to use {{Vector}} columns, which is necessary 
> for MLlib algorithms.  Although this allows a DataFrame/DataSet to contain 
> vectors, it does not allow one to make complete use of the rich set of 
> features made available by Spark SQL.  For example, it is not possible to use 
> any of the SQL functions, such as {{avg}}, {{sum}}, etc. on a {{Vector}} 
> column, nor is it possible to save a DataFrame with a {{Vector}} column as a 
> CSV file.  In any of these cases, an error message is returned with a note 
> that the operator is not supported on a {{Vector}} type.
> *Benefit*: Allow users to make use of all Spark SQL features that can be 
> reasonably applied to a vector.
> *Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
> first-class citizen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872921#comment-15872921
 ] 

guifeng edited comment on SPARK-19645 at 2/18/17 2:37 AM:
--

[~zsxwing] OK, I think the solution is to determine whether the delta file 
already exists before renaming it. When the file exists, just skip the rename 
operation.


was (Author: guifengl...@gmail.com):
[~zsxwing] OK, I think the solution is to determine whether the delta file 
already exists before renaming it. When the file exists, ignore the rename 
operation.

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug related to the streaming job restart process. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can easily be reproduced by restarting a previous streaming job, and 
> the main reason is that when the streaming job restarts, Spark recomputes the 
> WAL offsets and generates the same HDFS delta file (the latest delta file 
> generated before the restart, named "currentBatchId.delta"). In my opinion, 
> this is a bug. If you also consider this a bug, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872921#comment-15872921
 ] 

guifeng commented on SPARK-19645:
-

[~zsxwing] OK, I think the solution is to determine whether the delta file 
already exists before renaming it. When the file exists, ignore the rename 
operation.

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug related to the streaming job restart process. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can easily be reproduced by restarting a previous streaming job, and 
> the main reason is that when the streaming job restarts, Spark recomputes the 
> WAL offsets and generates the same HDFS delta file (the latest delta file 
> generated before the restart, named "currentBatchId.delta"). In my opinion, 
> this is a bug. If you also consider this a bug, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19653) `Vector` Type Should Be A First-Class Citizen In Spark SQL

2017-02-17 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872901#comment-15872901
 ] 

Kazuaki Ishizaki commented on SPARK-19653:
--

cc: [~cloud_fan]

> `Vector` Type Should Be A First-Class Citizen In Spark SQL
> --
>
> Key: SPARK-19653
> URL: https://issues.apache.org/jira/browse/SPARK-19653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Mike Dusenberry
>
> *Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
> "Spark ML") should be added as a first-class citizen to Spark SQL.
> *Current Status*:  Currently, Spark MLlib adds a [{{Vector}} SQL datatype | 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
>  to allow DataFrames/DataSets to use {{Vector}} columns, which is necessary 
> for MLlib algorithms.  Although this allows a DataFrame/DataSet to contain 
> vectors, it does not allow one to make complete use of the rich set of 
> features made available by Spark SQL.  For example, it is not possible to use 
> any of the SQL functions, such as {{avg}}, {{sum}}, etc. on a {{Vector}} 
> column, nor is it possible to save a DataFrame with a {{Vector}} column as a 
> CSV file.  In any of these cases, an error message is returned with a note 
> that the operator is not supported on a {{Vector}} type.
> *Benefit*: Allow users to make use of all Spark SQL features that can be 
> reasonably applied to a vector.
> *Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
> first-class citizen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2017-02-17 Thread Evan Chan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evan Chan updated SPARK-13219:
--

Hi Gagan,

That is an interesting optimization, but not the same one that Venu speaks of (I 
worked on those optimizations). Basically, those optimizations are for cases 
where the column name in the WHERE clause is present in both tables, and my 
impression is that this is what this fix is for as well.

Your case would be very useful too. You can do it in two steps, though: first do 
the lookup of postal codes from location, then translate your select from 
address into an IN condition (see the sketch after this comment).

Of course it’s better if Spark does this so that the results don’t have to be 
passed back through the driver.
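
A sketch of that two-step workaround, using the postal_code example from Gagan's 
comment further down in this thread (table and column names are assumptions, the 
tables are assumed to be registered with the session, and this is a manual 
rewrite, not an optimizer change):
{code}
import org.apache.spark.sql.functions.col

// Step 1: resolve the small lookup on the driver.
val postalCodes = spark.table("location")
  .where(col("location_name") === "San Jose")
  .select("postal_code")
  .collect()
  .map(_.getString(0))

// Step 2: turn it into an IN condition, which can be pushed down / used for
// partition pruning on the big table.
val result = spark.table("address")
  .where(col("postal_code").isin(postalCodes: _*))
{code}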




> Pushdown predicate propagation in SparkSQL with join
> 
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and there is an equality clause 
> in the query on attributes used to perform the join, it is useful to apply that 
> clause to the scans of both tables. If this is not done, one of the tables 
> results in a full scan, which can slow the query down dramatically. Consider 
> the following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>Filter (CAST(assetid#13, DoubleType) = 1201.0)
> HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, 
> Some(t)), None
>   Exchange (HashPartitioning 200)
>HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), 
> None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but 
> it becomes cumbersome. It will be helpful if the query planner could improve 
> filter propagation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19653) `Vector` Type Should Be A First-Class Citizen In Spark SQL

2017-02-17 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872847#comment-15872847
 ] 

Mike Dusenberry commented on SPARK-19653:
-

cc [~sethah]

> `Vector` Type Should Be A First-Class Citizen In Spark SQL
> --
>
> Key: SPARK-19653
> URL: https://issues.apache.org/jira/browse/SPARK-19653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Mike Dusenberry
>
> *Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
> "Spark ML") should be added as a first-class citizen to Spark SQL.
> *Current Status*:  Currently, Spark MLlib adds a [{{Vector}} SQL datatype | 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
>  to allow DataFrames/DataSets to use {{Vector}} columns, which is necessary 
> for MLlib algorithms.  Although this allows a DataFrame/DataSet to contain 
> vectors, it does not allow one to make complete use of the rich set of 
> features made available by Spark SQL.  For example, it is not possible to use 
> any of the SQL functions, such as {{avg}}, {{sum}}, etc. on a {{Vector}} 
> column, nor is it possible to save a DataFrame with a {{Vector}} column as a 
> CSV file.  In any of these cases, an error message is returned with a note 
> that the operator is not supported on a {{Vector}} type.
> *Benefit*: Allow users to make use of all Spark SQL features that can be 
> reasonably applied to a vector.
> *Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
> first-class citizen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19652) REST API does not perform user auth for individual apps

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19652:


Assignee: Apache Spark

> REST API does not perform user auth for individual apps
> ---
>
> Key: SPARK-19652
> URL: https://issues.apache.org/jira/browse/SPARK-19652
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> (This goes back further than 2.0.0, btw.)
> The REST API currently only performs authorization at the root of the UI; 
> this works for live UIs, but not for the history server, where the root 
> allows everybody to read data. That means that currently any user can see any 
> application in the SHS through the REST API, when auth is enabled.
> Instead, the REST API should behave like the regular UI and perform 
> authentication at the app level too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19652) REST API does not perform user auth for individual apps

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19652:


Assignee: (was: Apache Spark)

> REST API does not perform user auth for individual apps
> ---
>
> Key: SPARK-19652
> URL: https://issues.apache.org/jira/browse/SPARK-19652
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Marcelo Vanzin
>
> (This goes back further than 2.0.0, btw.)
> The REST API currently only performs authorization at the root of the UI; 
> this works for live UIs, but not for the history server, where the root 
> allows everybody to read data. That means that currently any user can see any 
> application in the SHS through the REST API, when auth is enabled.
> Instead, the REST API should behave like the regular UI and perform 
> authentication at the app level too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19652) REST API does not perform user auth for individual apps

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872845#comment-15872845
 ] 

Apache Spark commented on SPARK-19652:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/16978

> REST API does not perform user auth for individual apps
> ---
>
> Key: SPARK-19652
> URL: https://issues.apache.org/jira/browse/SPARK-19652
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Marcelo Vanzin
>
> (This goes back further than 2.0.0, btw.)
> The REST API currently only performs authorization at the root of the UI; 
> this works for live UIs, but not for the history server, where the root 
> allows everybody to read data. That means that currently any user can see any 
> application in the SHS through the REST API, when auth is enabled.
> Instead, the REST API should behave like the regular UI and perform 
> authentication at the app level too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19653) `Vector` Type Should Be A First-Class Citizen In Spark SQL

2017-02-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872819#comment-15872819
 ] 

Xiao Li commented on SPARK-19653:
-

cc [~mengxr] [~josephkb]

> `Vector` Type Should Be A First-Class Citizen In Spark SQL
> --
>
> Key: SPARK-19653
> URL: https://issues.apache.org/jira/browse/SPARK-19653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Mike Dusenberry
>
> *Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
> "Spark ML") should be added as a first-class citizen to Spark SQL.
> *Current Status*:  Currently, Spark MLlib adds a [{{Vector}} SQL datatype | 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
>  to allow DataFrames/DataSets to use {{Vector}} columns, which is necessary 
> for MLlib algorithms.  Although this allows a DataFrame/DataSet to contain 
> vectors, it does not allow one to make complete use of the rich set of 
> features made available by Spark SQL.  For example, it is not possible to use 
> any of the SQL functions, such as {{avg}}, {{sum}}, etc. on a {{Vector}} 
> column, nor is it possible to save a DataFrame with a {{Vector}} column as a 
> CSV file.  In any of these cases, an error message is returned with a note 
> that the operator is not supported on a {{Vector}} type.
> *Benefit*: Allow users to make use of all Spark SQL features that can be 
> reasonably applied to a vector.
> *Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
> first-class citizen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2017-02-17 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872811#comment-15872811
 ] 

gagan taneja commented on SPARK-13219:
--

This is what we are looking for. For example:
Table address is partitioned on postal_code.
Table location contains location_name and postal_code.
Query: SELECT * FROM address JOIN location ON postal_code WHERE location_name = 
'San Jose'
If the query is rewritten with an IN clause, the optimizer will be able to prune 
the address partitions, which would be significantly faster. A sketch of the 
rewritten query is below.
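A minimal sketch of such a rewrite, using the hypothetical table and column names from the example above (not an actual schema):
{code}
-- Keep the join, but add an IN predicate on the partition column so the scan
-- of address can prune partitions by postal_code.
SELECT *
FROM address a
JOIN location l
  ON a.postal_code = l.postal_code
WHERE l.location_name = 'San Jose'
  AND a.postal_code IN (SELECT postal_code
                        FROM location
                        WHERE location_name = 'San Jose');
{code}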

> Pushdown predicate propagation in SparkSQL with join
> 
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and there is an equality clause 
> in the query on the attributes used to perform the join, it is useful to apply 
> that clause to the scans of both tables. If this is not done, one of the tables 
> results in a full scan, which can slow down the query dramatically. Consider the 
> following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>Filter (CAST(assetid#13, DoubleType) = 1201.0)
> HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, 
> Some(t)), None
>   Exchange (HashPartitioning 200)
>HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), 
> None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but 
> it becomes cumbersome. It will be helpful if the query planner could improve 
> filter propagation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19653) `Vector` Type Should Be A First-Class Citizen In Spark SQL

2017-02-17 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872810#comment-15872810
 ] 

Mike Dusenberry commented on SPARK-19653:
-

cc [~mlnick], [~smilegator]

> `Vector` Type Should Be A First-Class Citizen In Spark SQL
> --
>
> Key: SPARK-19653
> URL: https://issues.apache.org/jira/browse/SPARK-19653
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Mike Dusenberry
>
> *Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
> "Spark ML") should be added as a first-class citizen to Spark SQL.
> *Current Status*:  Currently, Spark MLlib adds a [{{Vector}} SQL datatype | 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
>  to allow DataFrames/DataSets to use {{Vector}} columns, which is necessary 
> for MLlib algorithms.  Although this allows a DataFrame/DataSet to contain 
> vectors, it does not allow one to make complete use of the rich set of 
> features made available by Spark SQL.  For example, it is not possible to use 
> any of the SQL functions, such as {{avg}}, {{sum}}, etc. on a {{Vector}} 
> column, nor is it possible to save a DataFrame with a {{Vector}} column as a 
> CSV file.  In any of these cases, an error message is returned with a note 
> that the operator is not supported on a {{Vector}} type.
> *Benefit*: Allow users to make use of all Spark SQL features that can be 
> reasonably applied to a vector.
> *Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
> first-class citizen.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19653) `Vector` Type Should Be A First-Class Citizen In Spark SQL

2017-02-17 Thread Mike Dusenberry (JIRA)
Mike Dusenberry created SPARK-19653:
---

 Summary: `Vector` Type Should Be A First-Class Citizen In Spark SQL
 Key: SPARK-19653
 URL: https://issues.apache.org/jira/browse/SPARK-19653
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib, SQL
Affects Versions: 2.1.0, 2.2.0
Reporter: Mike Dusenberry


*Issue*: The {{Vector}} type in Spark MLlib (DataFrame-based API, informally 
"Spark ML") should be added as a first-class citizen to Spark SQL.

*Current Status*:  Currently, Spark MLlib adds a [{{Vector}} SQL datatype | 
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.linalg.SQLDataTypes$]
 to allow DataFrames/DataSets to use {{Vector}} columns, which is necessary for 
MLlib algorithms.  Although this allows a DataFrame/DataSet to contain vectors, 
it does not allow one to make complete use of the rich set of features made 
available by Spark SQL.  For example, it is not possible to use any of the SQL 
functions, such as {{avg}}, {{sum}}, etc. on a {{Vector}} column, nor is it 
possible to save a DataFrame with a {{Vector}} column as a CSV file.  In any of 
these cases, an error message is returned with a note that the operator is not 
supported on a {{Vector}} type.

*Benefit*: Allow users to make use of all Spark SQL features that can be 
reasonably applied to a vector.

*Goal*:  Move the {{Vector}} type from Spark MLlib into Spark SQL as a 
first-class citizen.
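To make the limitation concrete, a minimal sketch (assuming a spark-shell style session where a SparkSession named {{spark}} is in scope):
{code}
// A Vector column can live in a DataFrame, but SQL functions such as avg
// are not defined for the Vector type, so the aggregation below is rejected.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.avg
import spark.implicits._

val df = Seq((1, Vectors.dense(1.0, 2.0)), (2, Vectors.dense(3.0, 4.0)))
  .toDF("id", "features")

df.agg(avg($"features"))   // fails: avg does not accept the Vector type
{code}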



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2017-02-17 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872809#comment-15872809
 ] 

gagan taneja commented on SPARK-13219:
--

Venu,
This is very interesting. I would like to look at the code for all the 
optimizations that are in place. 
Do you have plans to contribute it back to Spark?

> Pushdown predicate propagation in SparkSQL with join
> 
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and there is an equality clause 
> in the query on the attributes used to perform the join, it is useful to apply 
> that clause to the scans of both tables. If this is not done, one of the tables 
> results in a full scan, which can slow down the query dramatically. Consider the 
> following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>Filter (CAST(assetid#13, DoubleType) = 1201.0)
> HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, 
> Some(t)), None
>   Exchange (HashPartitioning 200)
>HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), 
> None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table but 
> it becomes cumbersome. It will be helpful if the query planner could improve 
> filter propagation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19609) Broadcast joins should pushdown join constraints as Filter to the larger relation

2017-02-17 Thread gagan taneja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872797#comment-15872797
 ] 

gagan taneja commented on SPARK-19609:
--

This can be further extended to joins on columns that are also partition 
columns. 
For example:
Table address is partitioned on postal_code.
Table location contains location_name and postal_code.
Query: SELECT * FROM address JOIN location ON postal_code WHERE location_name = 
'San Jose'
If the query is rewritten with an IN clause, the optimizer will be able to prune 
the partitions, which would be significantly faster; see the sketch below.
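A rough DataFrame-level sketch of the idea (the helper name and approach are hypothetical, not an actual optimizer rule): collect the distinct join keys of the small relation and apply them as an IN filter to the large relation before the join.
{code}
import org.apache.spark.sql.DataFrame

// Assumes the small side's join keys fit in driver memory, as in the
// broadcast-join case discussed in this ticket.
def pushJoinKeysAsInFilter(small: DataFrame, large: DataFrame, key: String): DataFrame = {
  // Materialize the small relation's distinct join keys on the driver.
  val keys = small.select(key).distinct().collect().map(_.get(0))
  // Filter the large relation with an IN predicate, then join as before.
  large.filter(large(key).isin(keys: _*)).join(small, key)
}
{code}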


> Broadcast joins should pushdown join constraints as Filter to the larger 
> relation
> -
>
> Key: SPARK-19609
> URL: https://issues.apache.org/jira/browse/SPARK-19609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> For broadcast inner-joins, where the smaller relation is known to be small 
> enough to materialize on a worker, the set of values for all join columns is 
> known and fits in memory. Spark should translate these values into a 
> {{Filter}} pushed down to the datasource. The common join condition of 
> equality, i.e. {{lhs.a == rhs.a}}, can be written as an {{a in ...}} clause. 
> An example of pushing such filters is already present in the form of 
> {{IsNotNull}} filters via [~sameerag]'s work on SPARK-12957 subtasks.
> This optimization could even work when the smaller relation does not fit 
> entirely in memory. This could be done by partitioning the smaller relation 
> into N pieces, applying this predicate pushdown for each piece, and unioning 
> the results.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19652) REST API does not perform user auth for individual apps

2017-02-17 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-19652:
--

 Summary: REST API does not perform user auth for individual apps
 Key: SPARK-19652
 URL: https://issues.apache.org/jira/browse/SPARK-19652
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.1.0, 2.0.0
Reporter: Marcelo Vanzin


(This goes back further than 2.0.0, btw.)

The REST API currently only performs authorization at the root of the UI; this 
works for live UIs, but not for the history server, where the root allows 
everybody to read data. That means that currently any user can see any 
application in the SHS through the REST API, when auth is enabled.

Instead, the REST API should behave like the regular UI and perform 
authentication at the app level too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19651) ParallelCollectionRDD.collect should not issue a Spark job

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19651:


Assignee: Wenchen Fan  (was: Apache Spark)

> ParallelCollectionRDD.collect should not issue a Spark job
> --
>
> Key: SPARK-19651
> URL: https://issues.apache.org/jira/browse/SPARK-19651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19651) ParallelCollectionRDD.collect should not issue a Spark job

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872769#comment-15872769
 ] 

Apache Spark commented on SPARK-19651:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16977

> ParallelCollectionRDD.collect should not issue a Spark job
> --
>
> Key: SPARK-19651
> URL: https://issues.apache.org/jira/browse/SPARK-19651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19651) ParallelCollectionRDD.collect should not issue a Spark job

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19651:


Assignee: Apache Spark  (was: Wenchen Fan)

> ParallelCollectionRDD.collect should not issue a Spark job
> --
>
> Key: SPARK-19651
> URL: https://issues.apache.org/jira/browse/SPARK-19651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19651) ParallelCollectionRDD.collect should not issue a Spark job

2017-02-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19651:

Summary: ParallelCollectionRDD.collect should not issue a Spark job  (was: 
ParallelCollectionRDD.collect should not issuse a Spark job)

> ParallelCollectionRDD.collect should not issue a Spark job
> --
>
> Key: SPARK-19651
> URL: https://issues.apache.org/jira/browse/SPARK-19651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19651) ParallelCollectionRDD.collect should not issuse a Spark job

2017-02-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19651:

Summary: ParallelCollectionRDD.collect should not issuse a Spark job  (was: 
ParallelCollectionRDD.colect should not issuse a Spark job)

> ParallelCollectionRDD.collect should not issuse a Spark job
> ---
>
> Key: SPARK-19651
> URL: https://issues.apache.org/jira/browse/SPARK-19651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19651) ParallelCollectionRDD.colect should not issuse a Spark job

2017-02-17 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19651:
---

 Summary: ParallelCollectionRDD.colect should not issuse a Spark job
 Key: SPARK-19651
 URL: https://issues.apache.org/jira/browse/SPARK-19651
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-17 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-19650:
-

Assignee: Sameer Agarwal

> Metastore-only operations shouldn't trigger a spark job
> ---
>
> Key: SPARK-19650
> URL: https://issues.apache.org/jira/browse/SPARK-19650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>
> We currently trigger a Spark job even for simple metastore operations ({{SHOW 
> TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}} etc.). Even though these 
> commands otherwise execute on the driver, triggering a job prevents a user 
> from running them on a driver-only cluster.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job

2017-02-17 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-19650:
--

 Summary: Metastore-only operations shouldn't trigger a spark job
 Key: SPARK-19650
 URL: https://issues.apache.org/jira/browse/SPARK-19650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Sameer Agarwal


We currently trigger a Spark job even for simple metastore operations ({{SHOW 
TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}} etc.). Even though these commands 
otherwise execute on the driver, triggering a job prevents a user from running 
them on a driver-only cluster.
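One rough way to observe the current behavior (a sketch only, assuming a spark-shell style session with a SparkSession named {{spark}}):
{code}
// Count Spark jobs started around a metastore-only command. Per this ticket,
// even SHOW TABLES currently starts a job; with the fix it should not.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

var jobsStarted = 0
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = jobsStarted += 1
})

spark.sql("SHOW TABLES").collect()
Thread.sleep(1000)   // listener events are delivered asynchronously
println(s"jobs started: $jobsStarted")
{code}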



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-17 Thread Aaditya Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872728#comment-15872728
 ] 

Aaditya Ramesh commented on SPARK-19525:


Great, I will get the patch ready.

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3877) The exit code of spark-submit is still 0 when an yarn application fails

2017-02-17 Thread Joshua Caplan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872720#comment-15872720
 ] 

Joshua Caplan commented on SPARK-3877:
--

Done, as SPARK-19649 .

> The exit code of spark-submit is still 0 when an yarn application fails
> ---
>
> Key: SPARK-3877
> URL: https://issues.apache.org/jira/browse/SPARK-3877
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>  Labels: yarn
> Fix For: 1.1.1, 1.2.0
>
>
> When an yarn application fails (yarn-cluster mode), the exit code of 
> spark-submit is still 0. It's hard for people to write some automatic scripts 
> to run spark jobs in yarn because the failure can not be detected in these 
> scripts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872717#comment-15872717
 ] 

Shixiong Zhu commented on SPARK-19525:
--

I see. This is RDD checkpointing. Sounds like a good idea.

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19525:
-
Component/s: (was: Structured Streaming)
 Spark Core

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19525:
-
Summary: Enable Compression of RDD Checkpoints  (was: Enable Compression of 
Spark Streaming Checkpoints)

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19649) Spark YARN client throws exception if job succeeds and max-completed-applications=0

2017-02-17 Thread Joshua Caplan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Caplan updated SPARK-19649:
--
Description: 
I believe the patch in SPARK-3877 created a new race condition between YARN and 
the Spark client.

I typically configure YARN not to keep *any* recent jobs in memory, as some of 
my jobs get pretty large.

{code}
yarn-site   yarn.resourcemanager.max-completed-applications 0
{code}

The once-per-second call to getApplicationReport may thus encounter a RUNNING 
application followed by a not found application, and report a false negative.

(typical) Executor log:
{code}
17/01/09 19:31:23 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
17/01/09 19:31:23 INFO SparkContext: Invoking stop() from shutdown hook
17/01/09 19:31:24 INFO SparkUI: Stopped Spark web UI at http://10.0.0.168:37046
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Asking each executor to 
shut down
17/01/09 19:31:24 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/01/09 19:31:24 INFO MemoryStore: MemoryStore cleared
17/01/09 19:31:24 INFO BlockManager: BlockManager stopped
17/01/09 19:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/09 19:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
17/01/09 19:31:24 INFO SparkContext: Successfully stopped SparkContext
17/01/09 19:31:24 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
17/01/09 19:31:24 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
{code}

Client log:
{code}
17/01/09 19:31:23 INFO Client: Application report for 
application_1483983939941_0056 (state: RUNNING)
17/01/09 19:31:24 ERROR Client: Application application_1483983939941_0056 not 
found.
Exception in thread "main" org.apache.spark.SparkException: Application 
application_1483983939941_0056 is killed
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1038)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

  was:
I have configured YARN not to keep *any* recent jobs in memory, as some of my 
jobs get pretty large.

{code}
yarn-site   yarn.resourcemanager.max-completed-applications 0
{code}

The once-per-second call to getApplicationReport may thus encounter a RUNNING 
application followed by a not found application, and report a false negative.

(typical) Executor log:
{code}
17/01/09 19:31:23 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
17/01/09 19:31:23 INFO SparkContext: Invoking stop() from shutdown hook
17/01/09 19:31:24 INFO SparkUI: Stopped Spark web UI at http://10.0.0.168:37046
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Asking each executor to 
shut down
17/01/09 19:31:24 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/01/09 19:31:24 INFO MemoryStore: MemoryStore cleared
17/01/09 19:31:24 INFO BlockManager: BlockManager stopped
17/01/09 19:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/09 19:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
17/01/09 19:31:24 INFO SparkContext: Successfully stopped SparkContext
17/01/09 19:31:24 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
17/01/09 19:31:24 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
{code}

[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17 Thread Aaditya Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872698#comment-15872698
 ] 

Aaditya Ramesh commented on SPARK-19525:


We are suggesting compressing the data only before we write the checkpoint, not 
in memory. This is not happening right now: we just serialize the elements in the 
partition one by one and add them to the serialization stream, as in 
{{ReliableCheckpointRDD.writePartitionToCheckpointFile}}:

{code}
val fileOutputStream = if (blockSize < 0) {
  fs.create(tempOutputPath, false, bufferSize)
} else {
  // This is mainly for testing purpose
  fs.create(tempOutputPath, false, bufferSize,
fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
}
val serializer = env.serializer.newInstance()
val serializeStream = serializer.serializeStream(fileOutputStream)
Utils.tryWithSafeFinally {
  serializeStream.writeAll(iterator)
} {
  serializeStream.close()
}

{code}

As you can see, we don't do any compression after the serialization step. In 
our patch, we just wrap the serialization stream in a {{CompressionCodec}} output 
stream, and do the corresponding unwrapping on the read path; a sketch is below.
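A minimal sketch of that wrapping on the write path, reusing the variables from the snippet above ({{fs}}, {{tempOutputPath}}, {{bufferSize}}, {{env}}, {{iterator}}); this illustrates the idea rather than the actual patch:
{code}
import org.apache.spark.io.CompressionCodec

val fileOutputStream = fs.create(tempOutputPath, false, bufferSize)
// Wrap the file stream in a compression codec stream (e.g. snappy) before serializing.
val codec = CompressionCodec.createCodec(env.conf)
val compressedStream = codec.compressedOutputStream(fileOutputStream)
val serializeStream = env.serializer.newInstance().serializeStream(compressedStream)
Utils.tryWithSafeFinally {
  serializeStream.writeAll(iterator)
} {
  serializeStream.close()   // closing the serialization stream closes the wrapped streams
}
{code}
The read path would wrap the input stream with {{codec.compressedInputStream}} in the same way.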

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19649) Spark YARN client throws exception if job succeeds and max-completed-applications=0

2017-02-17 Thread Joshua Caplan (JIRA)
Joshua Caplan created SPARK-19649:
-

 Summary: Spark YARN client throws exception if job succeeds and 
max-completed-applications=0
 Key: SPARK-19649
 URL: https://issues.apache.org/jira/browse/SPARK-19649
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.3
 Environment: EMR release label 4.8.x
Reporter: Joshua Caplan
Priority: Minor


I have configured YARN not to keep *any* recent jobs in memory, as some of my 
jobs get pretty large.

{code}
yarn-site   yarn.resourcemanager.max-completed-applications 0
{code}

The once-per-second call to getApplicationReport may thus encounter a RUNNING 
application followed by a not found application, and report a false negative.

(typical) Executor log:
{code}
17/01/09 19:31:23 INFO ApplicationMaster: Final app status: SUCCEEDED, 
exitCode: 0
17/01/09 19:31:23 INFO SparkContext: Invoking stop() from shutdown hook
17/01/09 19:31:24 INFO SparkUI: Stopped Spark web UI at http://10.0.0.168:37046
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Shutting down all executors
17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Asking each executor to 
shut down
17/01/09 19:31:24 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
17/01/09 19:31:24 INFO MemoryStore: MemoryStore cleared
17/01/09 19:31:24 INFO BlockManager: BlockManager stopped
17/01/09 19:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
17/01/09 19:31:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
17/01/09 19:31:24 INFO SparkContext: Successfully stopped SparkContext
17/01/09 19:31:24 INFO ApplicationMaster: Unregistering ApplicationMaster with 
SUCCEEDED
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
17/01/09 19:31:24 INFO AMRMClientImpl: Waiting for application to be 
successfully unregistered.
17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
{code}

Client log:
{code}
17/01/09 19:31:23 INFO Client: Application report for 
application_1483983939941_0056 (state: RUNNING)
17/01/09 19:31:24 ERROR Client: Application application_1483983939941_0056 not 
found.
Exception in thread "main" org.apache.spark.SparkException: Application 
application_1483983939941_0056 is killed
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1038)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
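A hypothetical sketch of the kind of client-side handling that would avoid the false negative (this is not Spark's actual fix; the helper and its policy are assumptions for illustration):
{code}
// If the application was last seen RUNNING and then disappears because the RM
// keeps no completed applications, treat "not found" as completion, not failure.
import org.apache.hadoop.yarn.api.records.YarnApplicationState
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException

def resolveState(lastSeen: YarnApplicationState,
                 lookup: () => YarnApplicationState): YarnApplicationState = {
  try lookup()
  catch {
    case _: ApplicationNotFoundException if lastSeen == YarnApplicationState.RUNNING =>
      YarnApplicationState.FINISHED   // assume the app finished and was already evicted
  }
}
{code}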



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19649) Spark YARN client throws exception if job succeeds and max-completed-applications=0

2017-02-17 Thread Joshua Caplan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872695#comment-15872695
 ] 

Joshua Caplan commented on SPARK-19649:
---

Hadoop encountered the same situation and fixed it in their client.

> Spark YARN client throws exception if job succeeds and 
> max-completed-applications=0
> ---
>
> Key: SPARK-19649
> URL: https://issues.apache.org/jira/browse/SPARK-19649
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3
> Environment: EMR release label 4.8.x
>Reporter: Joshua Caplan
>Priority: Minor
>
> I believe the patch in SPARK-3877 created a new race condition between YARN 
> and the Spark client.
> I typically configure YARN not to keep *any* recent jobs in memory, as some 
> of my jobs get pretty large.
> {code}
> yarn-site yarn.resourcemanager.max-completed-applications 0
> {code}
> The once-per-second call to getApplicationReport may thus encounter a RUNNING 
> application followed by a not found application, and report a false negative.
> (typical) Executor log:
> {code}
> 17/01/09 19:31:23 INFO ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0
> 17/01/09 19:31:23 INFO SparkContext: Invoking stop() from shutdown hook
> 17/01/09 19:31:24 INFO SparkUI: Stopped Spark web UI at 
> http://10.0.0.168:37046
> 17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Shutting down all 
> executors
> 17/01/09 19:31:24 INFO YarnClusterSchedulerBackend: Asking each executor to 
> shut down
> 17/01/09 19:31:24 INFO MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 17/01/09 19:31:24 INFO MemoryStore: MemoryStore cleared
> 17/01/09 19:31:24 INFO BlockManager: BlockManager stopped
> 17/01/09 19:31:24 INFO BlockManagerMaster: BlockManagerMaster stopped
> 17/01/09 19:31:24 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 17/01/09 19:31:24 INFO SparkContext: Successfully stopped SparkContext
> 17/01/09 19:31:24 INFO ApplicationMaster: Unregistering ApplicationMaster 
> with SUCCEEDED
> 17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Shutting 
> down remote daemon.
> 17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remote 
> daemon shut down; proceeding with flushing remote transports.
> 17/01/09 19:31:24 INFO AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 17/01/09 19:31:24 INFO RemoteActorRefProvider$RemotingTerminator: Remoting 
> shut down.
> {code}
> Client log:
> {code}
> 17/01/09 19:31:23 INFO Client: Application report for 
> application_1483983939941_0056 (state: RUNNING)
> 17/01/09 19:31:24 ERROR Client: Application application_1483983939941_0056 
> not found.
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1483983939941_0056 is killed
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:1038)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1081)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming

2017-02-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872672#comment-15872672
 ] 

Shixiong Zhu commented on SPARK-19644:
--

[~deenbandhu] Could you check the GC root, please? These objects are from Scala 
reflection. Did you run the job in the Spark shell?

> Memory leak in Spark Streaming
> --
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xLarge
> Number of cores - 3
> Number of executors 3 
> Memory to each executor 2GB
>Reporter: Deenbandhu Agarwal
>Priority: Critical
>  Labels: memory_leak, performance
> Attachments: heapdump.png
>
>
> I am using streaming in production for some aggregation, fetching data from 
> Cassandra and saving data back to Cassandra. 
> I see a gradual increase in old-generation heap capacity from 1161216 bytes 
> to 1397760 bytes over a period of six hours.
> After 50 hours of processing, the number of instances of class 
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a 
> huge number. 
> I think this is a clear case of a memory leak.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872652#comment-15872652
 ] 

Shixiong Zhu commented on SPARK-19525:
--

Hm, Spark should support compression for data in RDDs. In which code path did 
you find that the data is not compressed?

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19640) Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2

2017-02-17 Thread Stephen Kinser (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Kinser closed SPARK-19640.
--
Resolution: Won't Fix

> Incorrect documentation for MLlib CountVectorizerModel for spark 1.5.2
> --
>
> Key: SPARK-19640
> URL: https://issues.apache.org/jira/browse/SPARK-19640
> Project: Spark
>  Issue Type: Task
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Stephen Kinser
>Priority: Trivial
>  Labels: documentation, easyfix
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> The Spark MLlib documentation for CountVectorizerModel in Spark 1.5.2 currently 
> uses an import statement with a package path that does not exist:
> import org.apache.spark.ml.feature.CountVectorizer
> import org.apache.spark.mllib.util.CountVectorizerModel
> This should be revised to match Spark 1.6+:
> import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19637) add to_json APIs to SQL

2017-02-17 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872470#comment-15872470
 ] 

Michael Armbrust commented on SPARK-19637:
--

From JSON is harder because the second argument is a StructType. We could 
consider accepting a string in the DDL format for declaring a table's schema 
(i.e. {{a: Int, b: struct...}}).
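For reference, a minimal sketch of the existing Scala-side functions in 2.1 (assumes a spark-shell style session with a SparkSession named {{spark}}); exposing these in SQL, and the DDL-string schema, are what this ticket and the comment above propose:
{code}
// to_json / from_json via the DataFrame API. The second argument of from_json
// is a StructType, which is what makes a plain SQL signature awkward.
import org.apache.spark.sql.functions.{from_json, struct, to_json}
import org.apache.spark.sql.types._
import spark.implicits._

val df = Seq((1, "x")).toDF("a", "b")
val asJson = df.select(to_json(struct($"a", $"b")).as("js"))   // struct -> JSON string

val schema = new StructType().add("a", IntegerType).add("b", StringType)
val back = asJson.select(from_json($"js", schema).as("s"))     // JSON string -> struct
{code}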

> add to_json APIs to SQL
> ---
>
> Key: SPARK-19637
> URL: https://issues.apache.org/jira/browse/SPARK-19637
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>
> The method "to_json" is a useful method in turning a struct into a json 
> string. It currently doesn't work in SQL, but adding it should be trivial.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19517) KafkaSource fails to initialize partition offsets

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-19517:


Assignee: Roberto Agostino Vitillo

> KafkaSource fails to initialize partition offsets
> -
>
> Key: SPARK-19517
> URL: https://issues.apache.org/jira/browse/SPARK-19517
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Roberto Agostino Vitillo
>Assignee: Roberto Agostino Vitillo
>Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19517ProposalforfixingKafkaOffsetMetadata.pdf
>
>
> A Kafka source with many partitions can cause the check-pointing logic to 
> fail on restart. I got the following exception when trying to restart a 
> Structured Streaming app that reads from a Kafka topic with a hundred 
> partitions.
> {code}
> 17/02/08 15:10:09 ERROR StreamExecution: Query [id = 
> 24e2a21a-4545-4a3e-80ea-bbe777d883ab, runId = 
> 025609c9-d59c-4de3-88b3-5d5f7eda4a66] terminated with error
> java.lang.IllegalArgumentException: Expected e.g. 
> {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"telemetry":{"92":302854
>   at 
> org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
>   at 
> org.apache.spark.sql.kafka010.KafkaSourceOffset$.apply(KafkaSourceOffset.scala:59)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:134)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:138)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:121)
>…
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-17 Thread Aaditya Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872441#comment-15872441
 ] 

Aaditya Ramesh commented on SPARK-19525:


[~zsxwing] Actually, we are compressing the data in the RDDs, not the streaming 
metadata. We compress all records in a partition together and write them to our 
DFS. In our case, the snappy-compressed size of each RDD partition is around 18 
MB, with 84 partitions, for a total of 1.5 GB per RDD.

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19517) KafkaSource fails to initialize partition offsets

2017-02-17 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19517.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> KafkaSource fails to initialize partition offsets
> -
>
> Key: SPARK-19517
> URL: https://issues.apache.org/jira/browse/SPARK-19517
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Roberto Agostino Vitillo
>Priority: Blocker
> Fix For: 2.1.1, 2.2.0
>
> Attachments: SPARK-19517ProposalforfixingKafkaOffsetMetadata.pdf
>
>
> A Kafka source with many partitions can cause the check-pointing logic to 
> fail on restart. I got the following exception when trying to restart a 
> Structured Streaming app that reads from a Kafka topic with a hundred 
> partitions.
> {code}
> 17/02/08 15:10:09 ERROR StreamExecution: Query [id = 
> 24e2a21a-4545-4a3e-80ea-bbe777d883ab, runId = 
> 025609c9-d59c-4de3-88b3-5d5f7eda4a66] terminated with error
> java.lang.IllegalArgumentException: Expected e.g. 
> {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"telemetry":{"92":302854
>   at 
> org.apache.spark.sql.kafka010.JsonUtils$.partitionOffsets(JsonUtils.scala:74)
>   at 
> org.apache.spark.sql.kafka010.KafkaSourceOffset$.apply(KafkaSourceOffset.scala:59)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:134)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource$$anon$1.deserialize(KafkaSource.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.get(HDFSMetadataLog.scala:237)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets$lzycompute(KafkaSource.scala:138)
>   at 
> org.apache.spark.sql.kafka010.KafkaSource.initialPartitionOffsets(KafkaSource.scala:121)
>…
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18285) approxQuantile in R support multi-column

2017-02-17 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18285.
-
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.2.0

> approxQuantile in R support multi-column
> 
>
> Key: SPARK-18285
> URL: https://issues.apache.org/jira/browse/SPARK-18285
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: zhengruifeng
>Assignee: Yanbo Liang
> Fix For: 2.2.0
>
>
> approxQuantile in R should support multi-column.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-17 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872413#comment-15872413
 ] 

Nick Dimiduk commented on SPARK-19638:
--

Debugging. I'm looking at the match expression in 
[{{DataSourceStrategy#translateFilter(Expression)}}|https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L509].
The predicate comes in as an {{EqualTo(GetStructField, Literal)}}. This doesn't 
match any of the cases. I was expecting it to step into the [{{case 
expressions.EqualTo(a: Attribute, Literal(v, t)) 
=>}}|https://github.com/apache/spark/blob/v2.1.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L511]
 on line 511 but execution steps past this point. Upon investigation, 
{{GetStructField}} does not extend {{Attribute}}.

From this point, the {{EqualTo}} condition involving the struct field is 
dropped from the filter set pushed down to the ES connector. Thus I believe 
this is an issue in Spark, not in the connector.

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple 
> column aliases, such as.
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries vs those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator

2017-02-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18986.

   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.2.0

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its 
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that an 
> iterator is not null in the map. However, the assertion is only true after 
> the map has been asked for an iterator. Before that, if another memory consumer 
> asks for more memory than is currently available, 
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we will 
> see a failure like this:
> {code}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:156)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info]   at 
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info]   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnly
> MapSuite.scala:294)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872319#comment-15872319
 ] 

Shixiong Zhu commented on SPARK-19645:
--

[~guifengl...@gmail.com]  Thanks for reporting. Could you submit a PR to fix it?

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug related to the process of restarting a streaming job. 
>   The following is the concrete error message:
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job, and 
> the main reason is that on restart Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file, generated before 
> the restart and named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you guys consider this a bug as well, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19610) multi line support for CSV

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19610:


Assignee: (was: Apache Spark)

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19610) multi line support for CSV

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19610:


Assignee: Apache Spark

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19610) multi line support for CSV

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872270#comment-15872270
 ] 

Apache Spark commented on SPARK-19610:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16976

> multi line support for CSV
> --
>
> Key: SPARK-19610
> URL: https://issues.apache.org/jira/browse/SPARK-19610
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19500) Fail to spill the aggregated hash map when radix sort is used

2017-02-17 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-19500.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.0.3
   2.1.1

Issue resolved by pull request 16844
[https://github.com/apache/spark/pull/16844]

> Fail to spill the aggregated hash map when radix sort is used
> -
>
> Key: SPARK-19500
> URL: https://issues.apache.org/jira/browse/SPARK-19500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.1.1, 2.0.3, 2.2.0
>
>
> Radix sort requires that at most half of the array be occupied. But the 
> aggregated hash map has an off-by-one bug that can leave one more item than 
> half of the array; when this happens, the spilling fails as:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 
> in stage 10.0 failed 4 times, most recent failure: Lost task 171.3 in stage 
> 10.0 (TID 23899, 10.145.253.180, executor 24): 
> java.lang.IllegalStateException: There is no space for new record 
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:227)
>  
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:130)
>  
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:250)
>  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source) 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>  
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
>  
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) 
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) 
> at org.apache.spark.scheduler.Task.run(Task.scala:99) 
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: 
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) 
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>  
> at scala.Option.foreach(Option.scala:257) 
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>  
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) 
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:137)
>  
> ... 32 more 
> Caused by: java.lang.IllegalStateException: There is no space for new record 
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:227)
>  
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:130)
>  
> at 
> 

[jira] [Assigned] (SPARK-19522) --executor-memory flag doesn't work in local-cluster mode

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19522:


Assignee: Apache Spark  (was: Andrew Or)

> --executor-memory flag doesn't work in local-cluster mode
> -
>
> Key: SPARK-19522
> URL: https://issues.apache.org/jira/browse/SPARK-19522
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> {code}
> bin/spark-shell --master local-cluster[2,1,2048]
> {code}
> doesn't do what you think it does. You'll end up getting executors with 1GB 
> each because that's the default. This is because the executor memory flag 
> isn't actually read in local mode.
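As a hedged aside (my assumption, not something verified in this ticket): since the dedicated flag is ignored here, setting the property explicitly should be a workable way to get bigger executors in local-cluster mode, roughly:
{code}
import org.apache.spark.sql.SparkSession

// Hypothetical workaround sketch: request executor memory via the Spark
// property rather than the --executor-memory flag.
val spark = SparkSession.builder()
  .master("local-cluster[2,1,2048]")       // 2 workers, 1 core, 2048 MB per worker
  .config("spark.executor.memory", "2g")   // executors otherwise default to 1g
  .appName("local-cluster-memory-sketch")
  .getOrCreate()
{code}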



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19522) --executor-memory flag doesn't work in local-cluster mode

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872109#comment-15872109
 ] 

Apache Spark commented on SPARK-19522:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/16975

> --executor-memory flag doesn't work in local-cluster mode
> -
>
> Key: SPARK-19522
> URL: https://issues.apache.org/jira/browse/SPARK-19522
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> bin/spark-shell --master local-cluster[2,1,2048]
> {code}
> doesn't do what you think it does. You'll end up getting executors with 1GB 
> each because that's the default. This is because the executor memory flag 
> isn't actually read in local mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19522) --executor-memory flag doesn't work in local-cluster mode

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19522:


Assignee: Andrew Or  (was: Apache Spark)

> --executor-memory flag doesn't work in local-cluster mode
> -
>
> Key: SPARK-19522
> URL: https://issues.apache.org/jira/browse/SPARK-19522
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> {code}
> bin/spark-shell --master local-cluster[2,1,2048]
> {code}
> doesn't do what you think it does. You'll end up getting executors with 1GB 
> each because that's the default. This is because the executor memory flag 
> isn't actually read in local mode.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19638) Filter pushdown not working for struct fields

2017-02-17 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872060#comment-15872060
 ] 

Nick Dimiduk commented on SPARK-19638:
--

[~maropu] I have placed a breakpoint in the ES connector's implementation of 
{{PrunedFilteredScan#buildScan(Array[String], Array[Filter])}}. Here I see no 
filters for the struct columns. Indeed, this is precisely where the log messages 
are produced. For this reason, I believe this to be an issue with Catalyst, not 
the connector. Perhaps you can guide me through further debugging? Thanks.

> Filter pushdown not working for struct fields
> -
>
> Key: SPARK-19638
> URL: https://issues.apache.org/jira/browse/SPARK-19638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nick Dimiduk
>
> Working with a dataset containing struct fields, and enabling debug logging 
> in the ES connector, I'm seeing the following behavior. The dataframe is 
> created over the ES connector and then the schema is extended with a couple 
> column aliases, such as.
> {noformat}
> df.withColumn("f2", df("foo"))
> {noformat}
> Queries vs those alias columns work as expected for fields that are 
> non-struct members.
> {noformat}
> scala> df.withColumn("f2", df("foo")).where("f2 == '1'").limit(0).show
> 17/02/16 15:06:49 DEBUG DataSource: Pushing down filters 
> [IsNotNull(foo),EqualTo(foo,1)]
> 17/02/16 15:06:49 TRACE DataSource: Transformed filters into DSL 
> [{"exists":{"field":"foo"}},{"match":{"foo":"1"}}]
> {noformat}
> However, try the same with an alias over a struct field, and no filters are 
> pushed down.
> {noformat}
> scala> df.withColumn("bar_baz", df("bar.baz")).where("bar_baz == 
> '1'").limit(1).show
> {noformat}
> In fact, this is the case even when no alias is used at all.
> {noformat}
> scala> df.where("bar.baz == '1'").limit(1).show
> {noformat}
> Basically, pushdown for structs doesn't work at all.
> Maybe this is specific to the ES connector?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14194) spark csv reader not working properly if CSV content contains CRLF character (newline) in the intermediate cell

2017-02-17 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871975#comment-15871975
 ] 

Dongjoon Hyun commented on SPARK-14194:
---

+1 for @srowen 's opinion.

> spark csv reader not working properly if CSV content contains CRLF character 
> (newline) in the intermediate cell
> ---
>
> Key: SPARK-14194
> URL: https://issues.apache.org/jira/browse/SPARK-14194
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 2.1.0
>Reporter: Kumaresh C R
>
> We have CSV content like below,
> Sl.NO, Employee_Name, Company, Address, Country, ZIP_Code\n\r
> "1", "ABCD", "XYZ", "1234", "XZ Street \n\r(CRLF charater), 
> Municapality,","USA", "1234567"
> Since there is a '\n\r' sequence in the middle of a row (to be exact, in the 
> Address column), when we execute the Spark code below it creates the 
> dataframe with two rows (excluding the header row), which is wrong. Since we 
> have specified the quote (") character, why does it treat the character in 
> the middle of the quoted cell as a record-terminating newline? This creates 
> an issue while processing the resulting dataframe.
>  DataFrame df = 
> sqlContextManager.getSqlContext().read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("inferSchema", "true")
> .option("delimiter", delim)
> .option("quote", quote)
> .option("escape", escape)
> .load(sourceFile);
>
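The multi-line CSV support being worked on in SPARK-19610 (elsewhere in this digest) is the kind of change that addresses this. Assuming the option eventually ships under a name like {{multiLine}} (an assumption relative to this thread's timeframe), the read would look roughly like:
{code}
// Hedged sketch (hypothetical file path, spark-shell session assumed): with
// multi-line parsing enabled, a quoted cell containing an embedded newline
// stays inside a single parsed row instead of starting a new record.
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")       // assumed option name
  .csv("/path/to/employees.csv")     // hypothetical path
{code}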



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19593) Records read per each kinesis transaction

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19593.
---
Resolution: Invalid

> Records read per each kinesis transaction
> -
>
> Key: SPARK-19593
> URL: https://issues.apache.org/jira/browse/SPARK-19593
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Sarath Chandra Jiguru
>Priority: Trivial
>
> The question is related to the Spark Streaming + Kinesis integration.
> Is there a way to provide Kinesis consumer configuration, e.g. the number of 
> records read per transaction, etc.? 
> Right now, even though I am eligible to read 2.8G/minute, I am restricted to 
> reading only 100MB/minute, as I am not able to increase the default number of 
> records in each transaction.
> I have raised a question in stackoverflow as well, please look into it:
> http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
> Kinesis stream setup:
> open shards: 24
> write rate: 440K/minute
> read rate: 1.42M/minute
> read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 
> Shards*2 MB*60 Seconds)
> Reference: 
> http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19547) KafkaUtil throw 'No current assignment for partition' Exception

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19547.
---
Resolution: Invalid

> KafkaUtil throw 'No current assignment for partition' Exception
> ---
>
> Key: SPARK-19547
> URL: https://issues.apache.org/jira/browse/SPARK-19547
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 1.6.1
>Reporter: wuchang
>
> Below is my scala code to create spark kafka stream:
> val kafkaParams = Map[String, Object](
>   "bootstrap.servers" -> "server110:2181,server110:9092",
>   "zookeeper" -> "server110:2181",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> classOf[StringDeserializer],
>   "group.id" -> "example",
>   "auto.offset.reset" -> "latest",
>   "enable.auto.commit" -> (false: java.lang.Boolean)
> )
> val topics = Array("ABTest")
> val stream = KafkaUtils.createDirectStream[String, String](
>   ssc,
>   PreferConsistent,
>   Subscribe[String, String](topics, kafkaParams)
> )
> But after running for 10 hours, it throws exceptions:
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Revoking previously assigned partitions [ABTest-0, ABTest-1] for group example
> 2017-02-10 10:56:20,000 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:20,011 INFO  [JobGenerator] internals.AbstractCoordinator: 
> (Re-)joining group example
> 2017-02-10 10:56:40,057 INFO  [JobGenerator] internals.AbstractCoordinator: 
> Successfully joined group example with generation 5
> 2017-02-10 10:56:40,058 INFO  [JobGenerator] internals.ConsumerCoordinator: 
> Setting newly assigned partitions [ABTest-1] for group example
> 2017-02-10 10:56:40,080 ERROR [JobScheduler] scheduler.JobScheduler: Error 
> generating jobs for time 148669538 ms
> java.lang.IllegalStateException: No current assignment for partition ABTest-0
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:231)
> at 
> org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:295)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1169)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.latestOffsets(DirectKafkaInputDStream.scala:179)
> at 
> org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:196)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
> at scala.Option.orElse(Option.scala:289)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:48)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:117)
> at 
> org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at 
> org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
> at 
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:246)
> at 

[jira] [Updated] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19622:
--
Fix Version/s: 2.1.1

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
> Attachments: screenshot-1.png
>
>
> The search function of the paged table is not available because we don't 
> skip the hash data of the request path. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19647) Spark query hive is extremelly slow even the result data is small

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19647.
---
Resolution: Invalid

Questions should go to u...@spark.apache.org

> Spark query hive is extremelly slow even the result data is small
> -
>
> Key: SPARK-19647
> URL: https://issues.apache.org/jira/browse/SPARK-19647
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.0.2
>Reporter: wuchang
>Priority: Critical
>
> I am using spark 2.0.0 to query hive table:
> my sql is:
> select * from app.abtestmsg_v limit 10
> Yes, I want to get the first 10 records from a view app.abtestmsg_v.
> When I run this SQL in spark-shell, it is very fast, taking about 2 seconds.
> But then the problem comes when I try to implement this query by my python 
> code.
> I am using Spark 2.0.0 and write a very simple pyspark program, code is:
> from pyspark.sql import HiveContext
> from pyspark.sql.functions import *
> import json
> hc = HiveContext(sc)
> hc.setConf("hive.exec.orc.split.strategy", "ETL")
> hc.setConf("hive.security.authorization.enabled",false)
> zj_sql = 'select * from app.abtestmsg_v limit 10'
> zj_df = hc.sql(zj_sql)
> zj_df.collect()
> From the info log I find that although I use "limit 10" to tell Spark that I 
> just want the first 10 records, Spark still scans and reads all files of the 
> view (in my case, the source data of this view contains 100 files and each 
> file's size is about 1G). So there are nearly 100 tasks, each task reads a 
> file, and all the tasks are executed serially. It takes nearly 15 minutes to 
> finish these 100 tasks, but what I want is just to get the first 10 
> records.
> So I don't know what to do or what is wrong.
> Could anybody give me some suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19622.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16953
[https://github.com/apache/spark/pull/16953]

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: screenshot-1.png
>
>
> The search function of the paged table is not available because we don't 
> skip the hash data of the request path. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19622) Fix a http error in a paged table when using a `Go` button to search.

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19622:
-

Assignee: StanZhai

> Fix a http error in a paged table when using a `Go` button to search.
> -
>
> Key: SPARK-19622
> URL: https://issues.apache.org/jira/browse/SPARK-19622
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Minor
> Fix For: 2.2.0
>
> Attachments: screenshot-1.png
>
>
> The search function of the paged table is not available because we don't 
> skip the hash data of the request path. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19648) Unable to access column containing '.' for approxQuantile function on DataFrame

2017-02-17 Thread John Compitello (JIRA)
John Compitello created SPARK-19648:
---

 Summary: Unable to access column containing '.' for approxQuantile 
function on DataFrame
 Key: SPARK-19648
 URL: https://issues.apache.org/jira/browse/SPARK-19648
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.0.2
 Environment: Running spark in an ipython prompt on Mac OSX. 
Reporter: John Compitello


It seems that the approxQuantile method does not offer any way to access a 
column with a period in its name. I am aware of the backtick solution, but 
it does not work in this scenario.

For example, let's say I have a column named 'va.x'. Passing approxQuantile 
this string without backticks results in the following error:

'Cannot resolve column name '`va.x`' given input columns: '

Note that backticks seem to have been automatically inserted, but it cannot 
find the column name regardless. 

If I do include backticks, I get a different error. An IllegalArgumentException 
is thrown as follows:

"IllegalArgumentException: 'Field "`va.x`" does not exist."



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19633) FileSource read from FileSink

2017-02-17 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871876#comment-15871876
 ] 

Liwei Lin commented on SPARK-19633:
---

Hi [~marmbrus], I'd like to take this if it's ok by you.

Just to confirm, you meant FileSource should also use MetadataLogFileIndex over 
InMemoryFileIndex whenever possible, right?

> FileSource read from FileSink
> -
>
> Key: SPARK-19633
> URL: https://issues.apache.org/jira/browse/SPARK-19633
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Priority: Critical
>
> Right now, you can't start a streaming query from a location that is being 
> written to by the file sink.
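For context, a rough sketch of the pattern this issue is about (hypothetical paths; {{input}}, {{inputSchema}} and {{spark}} are assumed to exist, and this is not a statement of what currently works):
{code}
// One query writes to a directory with the file sink...
val writer = input.writeStream
  .format("parquet")
  .option("path", "/data/out")                   // sink output, includes _spark_metadata
  .option("checkpointLocation", "/data/out-ckpt")
  .start()

// ...and a second query would like to treat that directory as a streaming
// source, ideally driven by the sink's metadata log rather than raw listings.
val follower = spark.readStream
  .schema(inputSchema)                           // file sources need an explicit schema
  .parquet("/data/out")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/data/follower-ckpt")
  .start()
{code}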



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19647) Spark query hive is extremelly slow even the result data is small

2017-02-17 Thread wuchang (JIRA)
wuchang created SPARK-19647:
---

 Summary: Spark query hive is extremelly slow even the result data 
is small
 Key: SPARK-19647
 URL: https://issues.apache.org/jira/browse/SPARK-19647
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.0.2
Reporter: wuchang
Priority: Critical


I am using spark 2.0.0 to query hive table:

my sql is:

select * from app.abtestmsg_v limit 10
Yes, I want to get the first 10 records from a view app.abtestmsg_v.

When I run this SQL in spark-shell, it is very fast, taking about 2 seconds.

But then the problem comes when I try to implement this query by my python code.

I am using Spark 2.0.0 and write a very simple pyspark program, code is:

from pyspark.sql import HiveContext
from pyspark.sql.functions import *
import json
hc = HiveContext(sc)
hc.setConf("hive.exec.orc.split.strategy", "ETL")
hc.setConf("hive.security.authorization.enabled",false)
zj_sql = 'select * from app.abtestmsg_v limit 10'
zj_df = hc.sql(zj_sql)
zj_df.collect()
From the info log I find that although I use "limit 10" to tell Spark that I 
just want the first 10 records, Spark still scans and reads all files of the 
view (in my case, the source data of this view contains 100 files and each 
file's size is about 1G). So there are nearly 100 tasks, each task reads a 
file, and all the tasks are executed serially. It takes nearly 15 minutes to 
finish these 100 tasks, but what I want is just to get the first 10 
records.

So I don't know what to do or what is wrong.

Could anybody give me some suggestions?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871716#comment-15871716
 ] 

guifeng commented on SPARK-19645:
-

So I think the current workaround is to delete the delta file (if it already 
exists) and then rename the temp file, as in the sketch below.
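A minimal sketch of that workaround using the Hadoop FileSystem API (a hypothetical helper, not the actual {{HDFSBackedStateStoreProvider}} code):
{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// Overwrite semantics by hand: drop any delta file left over from the run
// before the restart, then rename the freshly written temp file into place.
def commitDelta(fs: FileSystem, tempFile: Path, targetFile: Path): Unit = {
  if (fs.exists(targetFile)) {
    fs.delete(targetFile, /* recursive = */ false)
  }
  if (!fs.rename(tempFile, targetFile)) {
    throw new java.io.IOException(s"Failed to rename $tempFile to $targetFile")
  }
}
{code}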

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there 
> currently exists a bug related to the process of restarting a streaming job. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job, and 
> the main reason is that on restart Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file, generated before 
> the restart and named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you guys consider this a bug as well, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18091) Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

2017-02-17 Thread Jose Soltren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871702#comment-15871702
 ] 

Jose Soltren commented on SPARK-18091:
--

FWIW, anyone pulling this fix in the future will also want 
https://github.com/apache/spark/pull/16244, or else a bunch of tests will fail.

> Deep if expressions cause Generated SpecificUnsafeProjection code to exceed 
> JVM code size limit
> ---
>
> Key: SPARK-18091
> URL: https://issues.apache.org/jira/browse/SPARK-18091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Kapil Singh
>Assignee: Kapil Singh
>Priority: Critical
> Fix For: 2.0.3, 2.1.0
>
>
> *Problem Description:*
> I have an application in which a lot of if-else decisioning is involved in 
> generating the output. I'm getting the following exception:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
>   at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
>   at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)
> *Steps to Reproduce:*
> I've come up with a unit test which I was able to run in 
> CodeGenerationSuite.scala:
> {code}
> test("split large if expressions into blocks due to JVM code size limit") {
> val row = 
> create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
> val inputStr = 'a.string.at(0)
> val inputIdx = 'a.int.at(1)
> val length = 10
> val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + 
> i)
> val initCondition = EqualTo(RegExpExtract(inputStr, Literal("category1"), 
> inputIdx), valuesToCompareTo(0))
> var res: Expression = If(initCondition, Literal("category1"), 
> Literal("NULL"))
> var cummulativeCondition: Expression = Not(initCondition)
> for (index <- 1 to length) {
>   val valueExtractedFromInput = RegExpExtract(inputStr, 
> Literal("category" + (index + 1).toString), inputIdx)
>   val currComparee = If(cummulativeCondition, valueExtractedFromInput, 
> Literal("NULL"))
>   val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
>   val combinedCond = And(cummulativeCondition, currCondition)
>   res = If(combinedCond, If(combinedCond, valueExtractedFromInput, 
> Literal("NULL")), res)
>   cummulativeCondition = And(Not(currCondition), cummulativeCondition)
> }
> val expressions = Seq(res)
> val plan = GenerateUnsafeProjection.generate(expressions, true)
> val actual = plan(row).toSeq(expressions.map(_.dataType))
> val expected = Seq(UTF8String.fromString("category2"))
> if (!checkResult(actual, expected)) {
>   fail(s"Incorrect Evaluation: expressions: $expressions, actual: 
> $actual, expected: $expected")
> }
>   }
> {code}
> *Root Cause:*
> The current splitting of projection code doesn't (and can't) take care of 
> splitting the generated code for individual output column expressions, so it 
> can grow to exceed the JVM limit.
> *Note:* This issue seems related to SPARK-14887, but I'm not sure whether the 
> root cause is the same.
>  
> *Proposed Fix:*
> The If expression should place the generated code for its predicate, true 
> value, and false value expressions in separate methods in the codegen 
> context, and call these methods instead of putting the whole code directly in 
> its own generated code.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871678#comment-15871678
 ] 

Sean Owen commented on SPARK-19646:
---

I think it's because the array is copied elsewhere as it moves between the JVM 
and Python anyway
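If that is the cause, a hedged user-side workaround (my assumption, not the actual fix) is to defensively copy each record before holding on to it, since the underlying record reader may be reusing its byte buffer:
{code}
// Hypothetical CIFAR-style read in spark-shell: 3073-byte fixed-length records.
val records = sc.binaryRecords("/path/to/data_batch_1.bin", 3073)
  .map(_.clone())   // copy the bytes before caching/collecting them

records.take(5)     // with the copy, the records are no longer all identical
{code}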

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where in a big binary 
> file each 3073-byte record represents a 32x32x3-byte image plus 1 byte for 
> the label. The file resides on my local filesystem.
> .take(5) returns 5 records that are all the same, and .collect() returns 
> 10,000 records that are all the same.
> What is puzzling is that the pyspark one works perfectly even though 
> underneath it is calling the scala implementation.
> I have tested this on 2.1.0 and 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871680#comment-15871680
 ] 

Sean Owen commented on SPARK-19645:
---

I'm not suggesting it works with Hadoop 2.6; I'm responding to the comment 
above about lack of a rename method. It's available now.

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there 
> currently exists a bug related to the process of restarting a streaming job. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job, and 
> the main reason is that on restart Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file, generated before 
> the restart and named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you guys consider this a bug as well, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-02-17 Thread Mahmoud Rawas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871635#comment-15871635
 ] 

Mahmoud Rawas commented on SPARK-16920:
---

It seems that there is no N^2 complexity issue. As for the stress test, I 
have added a guide on how to perform one, with some explanation of the fix; 
please review the following gist and let me know if you would prefer any changes.

https://gist.github.com/mhmoudr/3681668f0ae56ca70cd95c8602f963e1

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> There were several issues regarding the PR resolving SPARK-15858, my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This issue needs to be investigated for 
> whether it causes problems for large numbers of trees (say, 1000), an 
> appropriate test should be added, and then fixed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871627#comment-15871627
 ] 

guifeng commented on SPARK-19645:
-

But Spark master doesn't set the rename option that supports overwrite, so I 
think the rename will still fail. It remains failing even after using Hadoop 2.6+.

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there 
> currently exists a bug related to the process of restarting a streaming job. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job, and 
> the main reason is that on restart Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file, generated before 
> the restart and named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you guys consider this a bug as well, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guifeng updated SPARK-19645:

Comment: was deleted

(was: But Spark master doesn't set the rename option that supports overwrite, 
so I think the rename will still fail. It remains failing even after using 
Hadoop 2.6+.)

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there 
> currently exists a bug related to the process of restarting a streaming job. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job, and 
> the main reason is that on restart Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file, generated before 
> the restart and named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you guys consider this a bug as well, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871552#comment-15871552
 ] 

guifeng edited comment on SPARK-19645 at 2/17/17 10:36 AM:
---

 But Spark master doesn't set the rename option that supports overwrite, so I 
think the rename will still fail.
[~srowen] It remains failing even after using Hadoop 2.6+. 


was (Author: guifengl...@gmail.com):
But Spark master doesn't set the rename option that supports overwrite, so I 
think the rename will still fail.

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there 
> currently exists a bug related to the process of restarting a streaming job. 
>   The following is the concrete error message:  
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job. 
> The main reason is that, on restart, Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file generated before 
> the restart, named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you also consider this a bug, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread BahaaEddin AlAila (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871597#comment-15871597
 ] 

BahaaEddin AlAila commented on SPARK-19646:
---

What's puzzling, though, is that I looked at PySpark's implementation of 
binaryRecords, and it just calls _jsc.binaryRecords and wraps the result in a 
PySpark RDD. So, if it is indeed calling the Scala implementation, shouldn't 
PySpark have the same problem?
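
For context, a likely cause (an assumption on my part, not something confirmed 
in this ticket) is that Hadoop's fixed-length record reader reuses its byte 
buffer, so on the Scala side every element can end up aliasing the same array 
unless it is copied, while PySpark serializes each record out to the Python 
worker as it is produced and so never observes the reuse. A minimal workaround 
sketch in the spark-shell, with the path and record length taken from the 
CIFAR example below:
{code}
scala> import java.util.Arrays

scala> val recordLength = 3073  // 1 label byte + 32x32x3 image bytes (CIFAR-10 binary format)

scala> val records = sc.binaryRecords("/path/to/cifar/data_batch_1.bin", recordLength)

scala> // Copy each record as it streams past so no two elements share the reader's buffer.
scala> val copied = records.map(bytes => Arrays.copyOf(bytes, recordLength))

scala> copied.take(5).map(_.head)  // with the copy, the label bytes should differ per record
{code}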

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where each 
> 3073-byte record in one big binary file represents a 32x32x3-byte image 
> plus 1 byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 records that are all the same, and .collect() returns 
> 10,000 records that are all the same.
> What is puzzling is that the PySpark version works perfectly even though 
> underneath it is calling the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread BahaaEddin AlAila (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871595#comment-15871595
 ] 

BahaaEddin AlAila commented on SPARK-19646:
---

Thank you very much for the speedy fix!

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where each 
> 3073-byte record in one big binary file represents a 32x32x3-byte image 
> plus 1 byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 records that are all the same, and .collect() returns 
> 10,000 records that are all the same.
> What is puzzling is that the PySpark version works perfectly even though 
> underneath it is calling the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871552#comment-15871552
 ] 

guifeng edited comment on SPARK-19645 at 2/17/17 9:50 AM:
--

But Spark master doesn't set the rename option that supports overwrite, so I 
think the rename will still fail.


was (Author: guifengl...@gmail.com):
But Spark master doesn't set the rename option that supports overwrite. 

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming job restart process. 
>   The following is the concrete error message:
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job. 
> The main reason is that, on restart, Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file generated before 
> the restart, named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you also consider this a bug, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871552#comment-15871552
 ] 

guifeng commented on SPARK-19645:
-

But Spark master doesn't set the rename option that supports overwrite. 

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming job restart process. 
>   The following is the concrete error message:
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job. 
> The main reason is that, on restart, Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file generated before 
> the restart, named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you also consider this a bug, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871545#comment-15871545
 ] 

Apache Spark commented on SPARK-19533:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16961

> Convert Java examples to use lambdas, Java 8 features
> -
>
> Key: SPARK-19533
> URL: https://issues.apache.org/jira/browse/SPARK-19533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> As a subtask of the overall migration to Java 8, we can and probably should 
> use Java 8 lambdas to simplify the Java examples. I'm marking this as a 
> subtask in its own right because it's a pretty big change by lines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19534) Convert Java tests to use lambdas, Java 8 features

2017-02-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19534:
--
Comment: was deleted

(was: User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16961)

> Convert Java tests to use lambdas, Java 8 features
> --
>
> Key: SPARK-19534
> URL: https://issues.apache.org/jira/browse/SPARK-19534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> Likewise, Java tests can be simplified by use of Java 8 lambdas. This is a 
> significant sub-task in its own right. This shouldn't mean that 'old' APIs go 
> untested because there are no separate Java 8 APIs; it's just syntactic sugar 
> for calls to the same APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19533:


Assignee: Sean Owen  (was: Apache Spark)

> Convert Java examples to use lambdas, Java 8 features
> -
>
> Key: SPARK-19533
> URL: https://issues.apache.org/jira/browse/SPARK-19533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> As a subtask of the overall migration to Java 8, we can and probably should 
> use Java 8 lambdas to simplify the Java examples. I'm marking this as a 
> subtask in its own right because it's a pretty big change by lines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19533:


Assignee: Apache Spark  (was: Sean Owen)

> Convert Java examples to use lambdas, Java 8 features
> -
>
> Key: SPARK-19533
> URL: https://issues.apache.org/jira/browse/SPARK-19533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>
> As a subtask of the overall migration to Java 8, we can and probably should 
> use Java 8 lambdas to simplify the Java examples. I'm marking this as a 
> subtask in its own right because it's a pretty big change by lines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871543#comment-15871543
 ] 

Apache Spark commented on SPARK-19646:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16974

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where each 
> 3073-byte record in one big binary file represents a 32x32x3-byte image 
> plus 1 byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 records that are all the same, and .collect() returns 
> 10,000 records that are all the same.
> What is puzzling is that the PySpark version works perfectly even though 
> underneath it is calling the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871542#comment-15871542
 ] 

Sean Owen commented on SPARK-19645:
---

Note that master requires Hadoop 2.6+ now.

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming job restart process. 
>   The following is the concrete error message:
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job. 
> The main reason is that, on restart, Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file generated before 
> the restart, named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you also consider this a bug, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19646:


Assignee: Sean Owen  (was: Apache Spark)

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Sean Owen
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where each 
> 3073-byte record in one big binary file represents a 32x32x3-byte image 
> plus 1 byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 records that are all the same, and .collect() returns 
> 10,000 records that are all the same.
> What is puzzling is that the PySpark version works perfectly even though 
> underneath it is calling the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19646) binaryRecords replicates records in scala API

2017-02-17 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19646:


Assignee: Apache Spark  (was: Sean Owen)

> binaryRecords replicates records in scala API
> -
>
> Key: SPARK-19646
> URL: https://issues.apache.org/jira/browse/SPARK-19646
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0
>Reporter: BahaaEddin AlAila
>Assignee: Apache Spark
>
> The Scala sc.binaryRecords replicates one record across the entire set.
> For example, I am trying to load the CIFAR binary data, where each 
> 3073-byte record in one big binary file represents a 32x32x3-byte image 
> plus 1 byte for the label. The file resides on my local filesystem.
> .take(5) returns 5 records that are all the same, and .collect() returns 
> 10,000 records that are all the same.
> What is puzzling is that the PySpark version works perfectly even though 
> underneath it is calling the Scala implementation.
> I have tested this on 2.1.0 and 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19645) structured streaming job restart bug

2017-02-17 Thread guifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15871536#comment-15871536
 ] 

guifeng commented on SPARK-19645:
-

Spark's default Hadoop version is Hadoop 2.2, whose rename method has no 
overwrite option, so we need to delete the file (if it exists) and then rename.
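
A rough sketch of that delete-then-rename approach, using only the FileSystem 
API that is already available on Hadoop 2.2 (the paths are placeholders, not 
the actual variables in HDFSBackedStateStoreProvider):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val tempDeltaFile = new Path("/tmp/state/0/136/temp-placeholder")
val finalDeltaFile = new Path("/tmp/state/0/136/2.delta")
val fs: FileSystem = finalDeltaFile.getFileSystem(new Configuration())

// Remove a delta file that the run before the restart already committed,
// then rename the freshly written temp file into place.
if (fs.exists(finalDeltaFile)) {
  fs.delete(finalDeltaFile, false)
}
if (!fs.rename(tempDeltaFile, finalDeltaFile)) {
  throw new java.io.IOException(s"Failed to rename $tempDeltaFile to $finalDeltaFile")
}
{code}
Whether to overwrite like this, or to simply skip the commit when the target 
delta already exists (since the recomputed batch presumably produces the same 
contents), is a separate design question.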

> structured streaming job restart bug
> 
>
> Key: SPARK-19645
> URL: https://issues.apache.org/jira/browse/SPARK-19645
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: guifeng
>Priority: Critical
>
> We are trying to use Structured Streaming in production; however, there is 
> currently a bug in the streaming job restart process. 
>   The following is the concrete error message:
> {quote}
>Caused by: java.lang.IllegalStateException: Error committing version 2 
> into HDFSStateStore[id = (op=0, part=136), dir = 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136]
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:173)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to rename 
> /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/temp--5345709896617324284 
> to /tmp/Pipeline_112346-continueagg-bxaxs/state/0/136/2.delta
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$commitUpdates(HDFSBackedStateStoreProvider.scala:259)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:156)
>   ... 14 more
> {quote}
>  The bug can be easily reproduced by restarting a previous streaming job. 
> The main reason is that, on restart, Spark recomputes the WAL offsets and 
> generates the same HDFS delta file (the latest delta file generated before 
> the restart, named "currentBatchId.delta"). In my opinion, this is a bug. 
> If you also consider this a bug, I can fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


