[jira] [Commented] (SPARK-12278) Move the shuffle related test case from Yarn module to Core module
[ https://issues.apache.org/jira/browse/SPARK-12278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675998#comment-15675998 ] Ferdinand Xu commented on SPARK-12278: -- Thanks [~srowen] for pointing this out. The main consideration here is that shuffle file encryption should support all modes, not only Yarn mode. We can lower the priority since now it's not supporting Yarn mode. > Move the shuffle related test case from Yarn module to Core module > -- > > Key: SPARK-12278 > URL: https://issues.apache.org/jira/browse/SPARK-12278 > Project: Spark > Issue Type: Test > Components: Shuffle >Reporter: Ferdinand Xu > > The test in _YarnShuffleEncryptionSuite_ requires _YarnSparkHadoopUtil_, which > is in the Yarn module, so we have to leave it in the Yarn module instead of the Core > module. After SPARK-11807 is resolved, we will move the test back to the Core > module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675985#comment-15675985 ] Jiang Xingbo commented on SPARK-17932: -- I’m working on this, thanks! > Failed to run SQL "show table extended like table_name" in Spark2.0.0 > --- > > Key: SPARK-17932 > URL: https://issues.apache.org/jira/browse/SPARK-17932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: pin_zhang > > SQL "show table extended like table_name " doesn't work in spark 2.0.0 > that works in spark1.5.2 > Error: org.apache.spark.sql.catalyst.parser.ParseException: > missing 'FUNCTIONS' at 'extended'(line 1, pos 11) > == SQL == > show table extended like test > ---^^^ (state=,code=0) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
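Editor's note: as a stopgap only (not a fix for the parser), comparable table metadata can usually be obtained in Spark 2.0.x through syntax the parser does accept. The sketch below is an assumption about an acceptable workaround, not part of the report; `test` is the table name from the reported query.

{code}
// Hedged workaround sketch for Spark 2.0.x while "SHOW TABLE EXTENDED ... LIKE" is unsupported.
// `test` is the table name taken from the report above.
spark.sql("SHOW TABLES LIKE 'test'").show()                  // list matching tables
spark.sql("DESCRIBE EXTENDED test").show(truncate = false)   // detailed table metadata
{code}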
[jira] [Assigned] (SPARK-18500) Make GenericStrategy be able to prune plans by itself after placeholders are replaced.
[ https://issues.apache.org/jira/browse/SPARK-18500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18500: Assignee: Apache Spark > Make GenericStrategy be able to prune plans by itself after placeholders are > replaced. > -- > > Key: SPARK-18500 > URL: https://issues.apache.org/jira/browse/SPARK-18500 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin >Assignee: Apache Spark > > Add a functionality to {{GenericStrategy}} to be able to prune bad physical > plans by itself after their placeholders are replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18500) Make GenericStrategy be able to prune plans by itself after placeholders are replaced.
[ https://issues.apache.org/jira/browse/SPARK-18500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675978#comment-15675978 ] Apache Spark commented on SPARK-18500: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/15927 > Make GenericStrategy be able to prune plans by itself after placeholders are > replaced. > -- > > Key: SPARK-18500 > URL: https://issues.apache.org/jira/browse/SPARK-18500 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > Add a functionality to {{GenericStrategy}} to be able to prune bad physical > plans by itself after their placeholders are replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18500) Make GenericStrategy be able to prune plans by itself after placeholders are replaced.
[ https://issues.apache.org/jira/browse/SPARK-18500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18500: Assignee: (was: Apache Spark) > Make GenericStrategy be able to prune plans by itself after placeholders are > replaced. > -- > > Key: SPARK-18500 > URL: https://issues.apache.org/jira/browse/SPARK-18500 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin > > Add a functionality to {{GenericStrategy}} to be able to prune bad physical > plans by itself after their placeholders are replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675966#comment-15675966 ] Nathan Howell commented on SPARK-18352: --- Sounds good to me. I have an implementation that's passing basic tests but needs to be cleaned up a bit. I'll get a pull request up in the next few days. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18500) Make GenericStrategy be able to prune plans by itself after placeholders are replaced.
Takuya Ueshin created SPARK-18500: - Summary: Make GenericStrategy be able to prune plans by itself after placeholders are replaced. Key: SPARK-18500 URL: https://issues.apache.org/jira/browse/SPARK-18500 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takuya Ueshin Add a functionality to {{GenericStrategy}} to be able to prune bad physical plans by itself after their placeholders are replaced. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675770#comment-15675770 ] Seth Hendrickson commented on SPARK-9478: - I'm going to work on submitting a PR for adding sample weights for 2.2. That pr is for adding class weights, which I think we decided against. > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9478) Add sample weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Seth Hendrickson updated SPARK-9478: Summary: Add sample weights to Random Forest (was: Add class weights to Random Forest) > Add sample weights to Random Forest > --- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18499) Add back support for custom Spark SQL dialects
[ https://issues.apache.org/jira/browse/SPARK-18499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675714#comment-15675714 ] Andrew Ash commented on SPARK-18499: Specifically what I'm most interested in is a strict ANSI SQL dialect, not bending Spark SQL to support a proprietary dialect. > Add back support for custom Spark SQL dialects > -- > > Key: SPARK-18499 > URL: https://issues.apache.org/jira/browse/SPARK-18499 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Ash > > Point 5 from the parent task: > {quote} > 5. I want to be able to use my own customized SQL constructs. An example of > this would supporting my own dialect, or be able to add constructs to the > current SQL language. I should not have to implement a complete parse, and > should be able to delegate to an underlying parser. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18499) Add back support for custom Spark SQL dialects
Andrew Ash created SPARK-18499: -- Summary: Add back support for custom Spark SQL dialects Key: SPARK-18499 URL: https://issues.apache.org/jira/browse/SPARK-18499 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Andrew Ash Point 5 from the parent task: {quote} 5. I want to be able to use my own customized SQL constructs. An example of this would supporting my own dialect, or be able to add constructs to the current SQL language. I should not have to implement a complete parse, and should be able to delegate to an underlying parser. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9478) Add class weights to Random Forest
[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675699#comment-15675699 ] German Eduardo Melo commented on SPARK-9478: [~sethah] I was wondering if you are working on this request... the current PR for the improvement is https://github.com/apache/spark/pull/13851, right? For my research I am looking forward to this feature; thanks a lot in advance for any update! > Add class weights to Random Forest > -- > > Key: SPARK-9478 > URL: https://issues.apache.org/jira/browse/SPARK-9478 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.4.1 >Reporter: Patrick Crenshaw > > Currently, this implementation of random forest does not support class > weights. Class weights are important when there is imbalanced training data > or the evaluation metric of a classifier is imbalanced (e.g. true positive > rate at some false positive threshold). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17450) spark sql rownumber OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675639#comment-15675639 ] cen yuhai commented on SPARK-17450: --- I will upgrade to 2.x, please close this issue > spark sql rownumber OOM > --- > > Key: SPARK-17450 > URL: https://issues.apache.org/jira/browse/SPARK-17450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > spark sql will be OOM when using row_number() over too much sorted records... > There will be only 1 task to handle all records > This sql group by passenger_id, we have 100 million passengers. > {code} > SELECT > passenger_id, > total_order, > (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND > 670800 THEN 'V3' END) AS order_rank > FROM > ( > SELECT > passenger_id, > 1 as total_order > FROM table > GROUP BY passenger_id > ) dd1 > {code} > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > 
_we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project [passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) > +- Project [passenger_id#7L] > +- Filter product#9 IN (kuai,gulf) >+- HiveTableScan [passenger_id#7L,product#9], > MetastoreRelation pbs_dw, dwv_order_whole_day, None, > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
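Editor's note: the plan above routes every row through a TungstenExchange SinglePartition before the window sort, which is why a single task holds all records. One possible direction, sketched below, is to bucket the window with a PARTITION BY clause. This is an illustrative workaround only: it changes the semantics (ranks become per bucket rather than global), and the bucket count of 200 is an arbitrary assumption.

{code}
// Illustrative sketch only (Spark 1.6 HiveContext assumed, matching the report).
// Ranks are computed per pmod bucket, not globally, so the window sort stays
// distributed instead of running in a single task.
val bucketed = sqlContext.sql("""
  SELECT passenger_id, total_order,
         row_number() OVER (PARTITION BY pmod(passenger_id, 200)
                            ORDER BY total_order DESC) AS bucket_rank
  FROM (SELECT passenger_id, 1 AS total_order FROM table GROUP BY passenger_id) dd1
""")
{code}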
[jira] [Resolved] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18462. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.3 > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.3, 2.1.0 > > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675517#comment-15675517 ] Reynold Xin commented on SPARK-18352: - Actually just talked to [~marmbrus] and now I understand more how JSON reader works. I'd say we always turn the top level array into multiple records, and then have only one option: wholeFile. This same option can be used in json and text. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
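Editor's note: for illustration, the user-facing shape of such an option might look like the sketch below. The "wholeFile" option name is the one floated in the comment above and did not exist in Spark at the time, so treat this purely as a hypothetical.

{code}
// Hypothetical usage sketch: the "wholeFile" option is the proposal discussed
// above, not an existing Spark API at the time of this thread.
val df = spark.read
  .option("wholeFile", "true")          // parse each file as a single JSON document
  .json("/data/pretty-printed/*.json")

// A file whose top-level value is an array would then yield one row per element,
// while a top-level object would yield a single row.
df.printSchema()
{code}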
[jira] [Commented] (SPARK-16803) SaveAsTable does not work when source DataFrame is built on a Hive Table
[ https://issues.apache.org/jira/browse/SPARK-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675507#comment-15675507 ] Apache Spark commented on SPARK-16803: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15926 > SaveAsTable does not work when source DataFrame is built on a Hive Table > > > Key: SPARK-16803 > URL: https://issues.apache.org/jira/browse/SPARK-16803 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > {noformat} > scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as > key, 'abc' as value") > res2: org.apache.spark.sql.DataFrame = [] > scala> val df = sql("select key, value as value from sample.sample") > df: org.apache.spark.sql.DataFrame = [key: int, value: string] > scala> df.write.mode("append").saveAsTable("sample.sample") > scala> sql("select * from sample.sample").show() > +---+-+ > |key|value| > +---+-+ > | 1| abc| > | 1| abc| > +---+-+ > {noformat} > In Spark 1.6, it works, but Spark 2.0 does not work. The error message from > Spark 2.0 is > {noformat} > scala> df.write.mode("append").saveAsTable("sample.sample") > org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation > sample, sample > is not supported.; > {noformat} > So far, we do not plan to support it in Spark 2.0. Spark 1.6 works because it > internally uses {{insertInto}}. But, if we change it back it will break the > semantic of {{saveAsTable}} (this method uses by-name resolution instead of > using by-position resolution used by {{insertInto}}). > Instead, users should use {{insertInto}} API. We should correct the error > messages. Users can understand how to bypass it before we support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
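Editor's note: since the description points users at insertInto, a minimal sketch of that path for the example above follows. Note that insertInto resolves columns by position rather than by name, so the select list must match the target table's column order.

{code}
// Workaround sketch using the existing insertInto API instead of saveAsTable.
// Columns are matched by position, so keep the select list in table order.
val df = sql("select key, value as value from sample.sample")
df.write.mode("append").insertInto("sample.sample")
sql("select * from sample.sample").show()
{code}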
[jira] [Updated] (SPARK-18487) Add task completion listener to HashAggregate to avoid memory leak
[ https://issues.apache.org/jira/browse/SPARK-18487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-18487: Summary: Add task completion listener to HashAggregate to avoid memory leak (was: Consume all elements for Dataset.show/take to avoid memory leak) > Add task completion listener to HashAggregate to avoid memory leak > -- > > Key: SPARK-18487 > URL: https://issues.apache.org/jira/browse/SPARK-18487 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Methods such as Dataset.show and take use Limit (CollectLimitExec), which > leverages SparkPlan.executeTake to efficiently collect the required number of > elements back to the driver. > However, under whole-stage codegen, we usually release resources only after all > elements are consumed (e.g., HashAggregate). In this case, we will not > release the resources, causing a memory leak with Dataset.show, for example. > We can add a task completion listener to HashAggregate to avoid the memory leak. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
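Editor's note: the kind of hook the new summary refers to is sketched below. TaskContext.addTaskCompletionListener is a real Spark API, but the releaseResources placeholder only stands in for whatever buffers the aggregation operator holds and is not actual Spark code.

{code}
import org.apache.spark.TaskContext

// Sketch of the proposed hook. `releaseResources` is a placeholder for freeing
// the operator's hash map / buffers, not a real Spark method.
def releaseResources(): Unit = { /* free operator-owned memory here */ }

val taskContext = TaskContext.get()   // null on the driver, non-null inside a task
if (taskContext != null) {
  taskContext.addTaskCompletionListener { _ =>
    // Runs when the task completes, even if a downstream limit (Dataset.show/take)
    // stops consuming the iterator before all rows are produced.
    releaseResources()
  }
}
{code}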
[jira] [Updated] (SPARK-18487) Consume all elements for Dataset.show/take to avoid memory leak
[ https://issues.apache.org/jira/browse/SPARK-18487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh updated SPARK-18487: Description: Methods such as Dataset.show and take use Limit (CollectLimitExec), which leverages SparkPlan.executeTake to efficiently collect the required number of elements back to the driver. However, under whole-stage codegen, we usually release resources only after all elements are consumed (e.g., HashAggregate). In this case, we will not release the resources, causing a memory leak with Dataset.show, for example. We can add a task completion listener to HashAggregate to avoid the memory leak. was: Methods such as Dataset.show and take use Limit (CollectLimitExec), which leverages SparkPlan.executeTake to efficiently collect the required number of elements back to the driver. However, under whole-stage codegen, we usually release resources only after all elements are consumed (e.g., HashAggregate). In this case, we will not release the resources, causing a memory leak with Dataset.show, for example. We should consume all elements in the iterator to avoid the memory leak. > Consume all elements for Dataset.show/take to avoid memory leak > --- > > Key: SPARK-18487 > URL: https://issues.apache.org/jira/browse/SPARK-18487 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Methods such as Dataset.show and take use Limit (CollectLimitExec), which > leverages SparkPlan.executeTake to efficiently collect the required number of > elements back to the driver. > However, under whole-stage codegen, we usually release resources only after all > elements are consumed (e.g., HashAggregate). In this case, we will not > release the resources, causing a memory leak with Dataset.show, for example. > We can add a task completion listener to HashAggregate to avoid the memory leak. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18436) isin causing SQL syntax error with JDBC
[ https://issues.apache.org/jira/browse/SPARK-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18436: Assignee: Apache Spark > isin causing SQL syntax error with JDBC > --- > > Key: SPARK-18436 > URL: https://issues.apache.org/jira/browse/SPARK-18436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Linux, SQL Server 2012 >Reporter: Dan >Assignee: Apache Spark > Labels: jdbc, sql > > When using a JDBC data source, the "isin" function generates invalid SQL > syntax when called with an empty array, which causes the JDBC driver to throw > an exception. > If the array is not empty, it works fine. > In the below example you can assume that SOURCE_CONNECTION, SQL_DRIVER and > TABLE are all correctly defined. > {noformat} > scala> val filter = Array[String]() > filter: Array[String] = Array() > scala> val sortDF = spark.read.format("jdbc").options(Map("url" -> > SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> > TABLE)).load().filter($"cl_ult".isin(filter:_*)) > sortDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [ibi_bulk_id: bigint, ibi_row_id: int ... 174 more fields] > scala> sortDF.show() > 16/11/14 15:35:46 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 205) > com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near ')'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18436) isin causing SQL syntax error with JDBC
[ https://issues.apache.org/jira/browse/SPARK-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18436: Assignee: (was: Apache Spark) > isin causing SQL syntax error with JDBC > --- > > Key: SPARK-18436 > URL: https://issues.apache.org/jira/browse/SPARK-18436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Linux, SQL Server 2012 >Reporter: Dan > Labels: jdbc, sql > > When using a JDBC data source, the "isin" function generates invalid SQL > syntax when called with an empty array, which causes the JDBC driver to throw > an exception. > If the array is not empty, it works fine. > In the below example you can assume that SOURCE_CONNECTION, SQL_DRIVER and > TABLE are all correctly defined. > {noformat} > scala> val filter = Array[String]() > filter: Array[String] = Array() > scala> val sortDF = spark.read.format("jdbc").options(Map("url" -> > SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> > TABLE)).load().filter($"cl_ult".isin(filter:_*)) > sortDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [ibi_bulk_id: bigint, ibi_row_id: int ... 174 more fields] > scala> sortDF.show() > 16/11/14 15:35:46 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 205) > com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near ')'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18436) isin causing SQL syntax error with JDBC
[ https://issues.apache.org/jira/browse/SPARK-18436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675473#comment-15675473 ] Apache Spark commented on SPARK-18436: -- User 'windpiger' has created a pull request for this issue: https://github.com/apache/spark/pull/15925 > isin causing SQL syntax error with JDBC > --- > > Key: SPARK-18436 > URL: https://issues.apache.org/jira/browse/SPARK-18436 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Linux, SQL Server 2012 >Reporter: Dan > Labels: jdbc, sql > > When using a JDBC data source, the "isin" function generates invalid SQL > syntax when called with an empty array, which causes the JDBC driver to throw > an exception. > If the array is not empty, it works fine. > In the below example you can assume that SOURCE_CONNECTION, SQL_DRIVER and > TABLE are all correctly defined. > {noformat} > scala> val filter = Array[String]() > filter: Array[String] = Array() > scala> val sortDF = spark.read.format("jdbc").options(Map("url" -> > SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> > TABLE)).load().filter($"cl_ult".isin(filter:_*)) > sortDF: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = > [ibi_bulk_id: bigint, ibi_row_id: int ... 174 more fields] > scala> sortDF.show() > 16/11/14 15:35:46 ERROR Executor: Exception in task 0.0 in stage 6.0 (TID 205) > com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near ')'. > at > com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:216) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1515) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:404) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:350) > at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:5696) > at > com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1715) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:180) > at > com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:155) > at > com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:285) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:408) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:379) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
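Editor's note: until the JDBC pushdown handles the empty case, one application-side workaround is to short-circuit the empty list before it reaches isin, as sketched below. SOURCE_CONNECTION, SQL_DRIVER, TABLE and filter are the names from the report and are assumed to be defined as described there.

{code}
import org.apache.spark.sql.functions.lit
import spark.implicits._

// Workaround sketch: with an empty `filter` array the result is empty by
// definition, so use a constant false predicate instead of generating "IN ()".
val baseDF = spark.read.format("jdbc")
  .options(Map("url" -> SOURCE_CONNECTION, "driver" -> SQL_DRIVER, "dbtable" -> TABLE))
  .load()

val sortDF =
  if (filter.isEmpty) baseDF.filter(lit(false))
  else baseDF.filter($"cl_ult".isin(filter: _*))
{code}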
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675466#comment-15675466 ] Hyukjin Kwon commented on SPARK-18352: -- Ah, you meant producing each row while parsing the whole text in iteration. I see. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675460#comment-15675460 ] yuhao yang commented on SPARK-18356: I assume the performance improvement depends on the computation costs of uncached RDD lineage. Do you plan to send a PR for the improvement? > Issue + Resolution: Kmeans Spark Performances (ML package) > -- > > Key: SPARK-18356 > URL: https://issues.apache.org/jira/browse/SPARK-18356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1 >Reporter: zakaria hili >Priority: Minor > Labels: easyfix > > Hello, > I'm newbie in spark, but I think that I found a small problem that can affect > spark Kmeans performances. > Before starting to explain the problem, I want to explain the warning that I > faced. > I tried to use Spark Kmeans with Dataframes to cluster my data > df_Part = assembler.transform(df_Part) > df_Part.cache() > while (k<=max_cluster) and (wssse > seuilStop): > kmeans = KMeans().setK(k) > model = kmeans.fit(df_Part) > wssse = model.computeCost(df_Part) > k=k+1 > but when I run the code I receive the warning : > WARN KMeans: The input data is not directly cached, which may hurt > performance if its parent RDDs are also uncached. > I searched in spark source code to find the source of this problem, then I > realized there is two classes responsible for this warning: > (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ) > (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ) > > When my dataframe is cached, the fit method transform my dataframe into an > internally rdd which is not cached. > Dataframe -> rdd -> run Training Kmeans Algo(rdd) > -> The first class (ml package) responsible for converting the dataframe into > rdd then call Kmeans Algorithm > ->The second class (mllib package) implements Kmeans Algorithm, and here > spark verify if the rdd is cached, if not a warning will be generated. > So, the solution of this problem is to cache the rdd before running Kmeans > Algorithm. > https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala > All what we need is to add two lines: > Cache rdd just after dataframe transformation, then uncached it after > training algorithm. > I hope that I was clear. > If you think that I was wrong, please let me know. > Sincerely, > Zakaria HILI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
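Editor's note: a minimal sketch of the change described above (persist the derived RDD before handing it to the mllib trainer, unpersist afterwards) follows. `instances` and `mllibKMeans` are illustrative names standing in for the RDD of feature vectors and the underlying mllib KMeans instance inside ml.KMeans.fit; the actual source may differ.

{code}
import org.apache.spark.storage.StorageLevel

// Sketch of the proposed fix, assuming `instances` is the RDD of vectors that
// ml.KMeans.fit builds from the DataFrame and `mllibKMeans` is the underlying
// mllib trainer; only persist if the caller has not already cached the RDD.
val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

val parentModel = mllibKMeans.run(instances)   // training no longer warns about uncached input

if (handlePersistence) instances.unpersist(blocking = false)
{code}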
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675446#comment-15675446 ] Reynold Xin commented on SPARK-18352: - No that's not sufficient. It doesn't do streaming. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675441#comment-15675441 ] Hyukjin Kwon edited comment on SPARK-18352 at 11/18/16 1:53 AM: Hi [~rxin], I think it seems this can be simply done after https://github.com/apache/spark/pull/14151 and https://github.com/apache/spark/pull/15813 are merged. I guess we could just add another option in `JSONOptions` which sets `wholetext` internally (and of course resembling https://github.com/apache/spark/pull/14151). Would this be what you think in your mind already? If so, I can work on this if anyone is not supposed to do this. (I am fine if anyone is assigned to this internally). was (Author: hyukjin.kwon): Hi [~rxin], I think it seems this can be simply done after https://github.com/apache/spark/pull/14151 and https://github.com/apache/spark/pull/15813 are merged. I guess we could just add another option in `JSONOptions` which sets `wholetext` internally. Would this be what you think in your mind already? If so, I can work on this if anyone is not supposed to do this. (I am fine if anyone is assigned to this internally). > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675441#comment-15675441 ] Hyukjin Kwon commented on SPARK-18352: -- Hi [~rxin], I think this can be done fairly simply once https://github.com/apache/spark/pull/14151 and https://github.com/apache/spark/pull/15813 are merged. I guess we could just add another option in `JSONOptions` which sets `wholetext` internally. Is this what you already had in mind? If so, I can work on this if no one else is supposed to do it. (I am fine if someone is already assigned to this internally.) > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675437#comment-15675437 ] Reynold Xin commented on SPARK-18352: - I guess maybe it should be a user-configurable option? Otherwise Spark on its own doesn't have enough information to disambiguate. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server
[ https://issues.apache.org/jira/browse/SPARK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675434#comment-15675434 ] Narayanan Nachiappan commented on SPARK-13767: -- [~rahul.bhati...@gmail.com] Were you able to figure out the root cause for that issue ? > py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to > the Java server > > > Key: SPARK-13767 > URL: https://issues.apache.org/jira/browse/SPARK-13767 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Poonam Agrawal > > I am trying to create spark context object with the following commands on > pyspark: > from pyspark import SparkContext, SparkConf > conf = > SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host', > 'cassandra-machine-ip').set('spark.storage.memoryFraction', > '0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', > 500).set('spark.serializer', > 'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', > 'FAIR').set('spark.mesos.coarse', 'true') > sc = SparkContext(conf=conf) > but I am getting the following error: > Traceback (most recent call last): > File "", line 1, in > File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in > __init__ > self._jconf = _jvm.SparkConf(loadDefaults) > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 766, in __getattr__ > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 362, in send_command > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 318, in _get_connection > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 325, in _create_connection > File > "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", > line 432, in start > py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to > the Java server > I am getting the same error executing the command : conf = > SparkConf().setAppName("App_name").setMaster("spark://127.0.0.1:7077") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675421#comment-15675421 ] Nathan Howell commented on SPARK-18352: --- Do you have any ideas how to support this? {{DataFrameReader.schema}} currently takes a {{StructType}} and the existing row level json reader flattens arrays out to support this restriction. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18498) Clean up HDFSMetadataLog API for better testing
[ https://issues.apache.org/jira/browse/SPARK-18498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18498: Assignee: (was: Apache Spark) > Clean up HDFSMetadataLog API for better testing > --- > > Key: SPARK-18498 > URL: https://issues.apache.org/jira/browse/SPARK-18498 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tyson Condie >Priority: Minor > Labels: test > Original Estimate: 168h > Remaining Estimate: 168h > > HDFSMetadataLog current conflates metadata log serialization and file writes. > The goal is to separate these two steps to enable more thorough testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18498) Clean up HDFSMetadataLog API for better testing
[ https://issues.apache.org/jira/browse/SPARK-18498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675412#comment-15675412 ] Apache Spark commented on SPARK-18498: -- User 'tcondie' has created a pull request for this issue: https://github.com/apache/spark/pull/15924 > Clean up HDFSMetadataLog API for better testing > --- > > Key: SPARK-18498 > URL: https://issues.apache.org/jira/browse/SPARK-18498 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tyson Condie >Priority: Minor > Labels: test > Original Estimate: 168h > Remaining Estimate: 168h > > HDFSMetadataLog current conflates metadata log serialization and file writes. > The goal is to separate these two steps to enable more thorough testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
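Editor's note: an illustrative shape of the separation the description asks for is sketched below; serialization gets its own seam so it can be unit-tested against in-memory streams, independent of the HDFS write path. The trait and method names here are placeholders, not the real HDFSMetadataLog API.

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}

// Placeholder names only -- not the actual HDFSMetadataLog API. The point is that
// serialize/deserialize become testable without writing any files.
trait MetadataSerde[T] {
  def serialize(metadata: T, out: OutputStream): Unit
  def deserialize(in: InputStream): T
}

// In a test, round-trip a batch's metadata entirely in memory.
def roundTrip[T](serde: MetadataSerde[T], metadata: T): T = {
  val out = new ByteArrayOutputStream()
  serde.serialize(metadata, out)
  serde.deserialize(new ByteArrayInputStream(out.toByteArray))
}
{code}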
[jira] [Assigned] (SPARK-18498) Clean up HDFSMetadataLog API for better testing
[ https://issues.apache.org/jira/browse/SPARK-18498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18498: Assignee: Apache Spark > Clean up HDFSMetadataLog API for better testing > --- > > Key: SPARK-18498 > URL: https://issues.apache.org/jira/browse/SPARK-18498 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 2.0.1 >Reporter: Tyson Condie >Assignee: Apache Spark >Priority: Minor > Labels: test > Original Estimate: 168h > Remaining Estimate: 168h > > HDFSMetadataLog current conflates metadata log serialization and file writes. > The goal is to separate these two steps to enable more thorough testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675405#comment-15675405 ] Reynold Xin commented on SPARK-18352: - Are these actually record delimiters? If the top level structure is an array, would we want to parse a single file as multiple records? > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18360) default table path of tables in default database should depend on the location of default database
[ https://issues.apache.org/jira/browse/SPARK-18360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-18360. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15812 [https://github.com/apache/spark/pull/15812] > default table path of tables in default database should depend on the > location of default database > -- > > Key: SPARK-18360 > URL: https://issues.apache.org/jira/browse/SPARK-18360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes, releasenotes > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18360) default table path of tables in default database should depend on the location of default database
[ https://issues.apache.org/jira/browse/SPARK-18360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-18360: - Labels: release_notes releasenotes (was: ) > default table path of tables in default database should depend on the > location of default database > -- > > Key: SPARK-18360 > URL: https://issues.apache.org/jira/browse/SPARK-18360 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes, releasenotes > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18483) spark on yarn always connect to yarn resourcemanager at 0.0.0.0:8032
[ https://issues.apache.org/jira/browse/SPARK-18483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675388#comment-15675388 ] inred commented on SPARK-18483: --- it failed even when i set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop, i submit it from windows development node to remote Linux yarn cluster > spark on yarn always connect to yarn resourcemanager at 0.0.0.0:8032 > -- > > Key: SPARK-18483 > URL: https://issues.apache.org/jira/browse/SPARK-18483 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.1 > Environment: java8 > SBT0.13 > scala2.11.8 > spark-2.0.1-bin-hadoop2.6 >Reporter: inred > > I have installed the yarn resource manager at 192.168.13.159:8032 and have > set YARN_CONF_DIR environment var and have the yarn-site.xml > configured as the following, but it always connects to 0.0.0.0:8032 instead > of > 192.168.13.159:8032 > set environment > E:\app>set yarn > YARN_CONF_DIR=D:\Documents\download\hadoop > E:\app>set had > HADOOP_HOME=D:\Documents\download\hadoop > E:\app>cat D:\Documents\download\hadoop\yarn-site.xml > > yarn.resourcemanager.address > 192.168.13.159:8032 > > "C:\Program Files\Java\jdk1.8.0_92\bin\java" -Didea.launcher.port=7532 > "-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ IDEA > Community Edition 2016.2.5\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program > Files\Java\jdk1.8.0_92\jre\lib\charsets.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\deploy.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\access-bridge-64.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\cldrdata.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\dnsns.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\jaccess.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\jfxrt.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\localedata.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\nashorn.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\sunec.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\sunjce_provider.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\sunmscapi.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\sunpkcs11.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\ext\zipfs.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\javaws.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\jce.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\jfr.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\jfxswt.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\jsse.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\management-agent.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\plugin.jar;C:\Program > Files\Java\jdk1.8.0_92\jre\lib\resources.jar;C:\Program > 
Files\Java\jdk1.8.0_92\jre\lib\rt.jar;E:\app\adam-ano\target\scala-2.11\classes;C:\Users\feiwu\.ivy2\cache\org.scala-lang\scala-library\jars\scala-library-2.11.8.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\xz-1.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jta-1.1.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jpam-1.1.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\guice-3.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\ivy-2.4.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\lz4-1.3.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\oro-2.0.8.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\ST4-4.0.4.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\avro-1.7.7.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\core-1.1.2.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\gson-2.2.4.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\mail-1.4.7.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\mx4j-3.0.2.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\snappy-0.2.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\antlr-2.7.7.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\opencsv-2.3.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\py4j-0.10.3.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\xmlenc-0.52.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\base64-2.3.8.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\guava-14.0.1.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\janino-2.7.8.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jets3t-0.9.3.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jetty-6.1.26.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jline-2.12.1.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\jsr305-1.3.9.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\log4j-1.2.17.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\minlog-1.3.0.jar;D:\Documents\download\spark-2.0.1-bin-hadoop2.6\jars\pyrolite-4.9.j
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675386#comment-15675386 ] Nathan Howell commented on SPARK-18352: --- Any opinions on configuring this with an option instead of creating a new data source? It looks fairly straightforward to support this as an option. E.g.: {code} // parse one json value per line // this would be the default behavior, for backwards compatibility spark.read.option("recordDelimiter", "line").json(???) // parse one json value per file spark.read.option("recordDelimiter", "file").json(???) {code} The refactoring work would be the same in either case, but it would require less plumbing for Python/Java/etc to enable this with an option. As an aside... it also is straightforward to extend this to support {{Text}} and {{UTF8String}} values directly, avoiding a string conversion of the entire column prior to parsing. > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
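For context on the discussion above, a minimal sketch of how whole-file JSON parsing can be approximated with the existing 2.x APIs, assuming each input file holds exactly one JSON document; the path and session setup are illustrative and are not part of the proposal:
{code}
// Sketch only: approximate "one JSON document per file" parsing by reading each
// file as a single string and handing the strings to the JSON reader.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("whole-file-json-sketch").getOrCreate()

// wholeTextFiles yields one (path, content) pair per file, so a record is
// never split on newline boundaries.
val wholeFiles = spark.sparkContext.wholeTextFiles("/data/json-docs/*.json")

// spark.read.json(RDD[String]) treats every element as one complete JSON document.
val df = spark.read.json(wholeFiles.values)
df.printSchema()
{code}
This keeps every document in memory as one string, which is exactly the cost a built-in whole-file mode (or a {{recordDelimiter}} option) could avoid by streaming through the files.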
[jira] [Created] (SPARK-18498) Clean up HDFSMetadataLog API for better testing
Tyson Condie created SPARK-18498: Summary: Clean up HDFSMetadataLog API for better testing Key: SPARK-18498 URL: https://issues.apache.org/jira/browse/SPARK-18498 Project: Spark Issue Type: Improvement Components: SQL, Structured Streaming Affects Versions: 2.0.1 Reporter: Tyson Condie Priority: Minor HDFSMetadataLog currently conflates metadata log serialization and file writes. The goal is to separate these two steps to enable more thorough testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
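To make the intended separation concrete, a hedged sketch of what splitting serialization from the file write could look like; the trait and class names are illustrative assumptions, not the actual HDFSMetadataLog API:
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream, OutputStream}

// Serialization becomes its own seam, so tests can exercise it entirely in
// memory without touching HDFS or a checkpoint directory.
trait MetadataSerialization[T] {
  def serialize(metadata: T, out: OutputStream): Unit
  def deserialize(in: InputStream): T
}

class StringMetadataSerialization extends MetadataSerialization[String] {
  override def serialize(metadata: String, out: OutputStream): Unit =
    out.write(metadata.getBytes("UTF-8"))
  override def deserialize(in: InputStream): String =
    scala.io.Source.fromInputStream(in, "UTF-8").mkString
}

// A unit test can round-trip metadata without any file system interaction.
val serde = new StringMetadataSerialization
val buffer = new ByteArrayOutputStream()
serde.serialize("batch-0 offsets", buffer)
assert(serde.deserialize(new ByteArrayInputStream(buffer.toByteArray)) == "batch-0 offsets")
{code}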
[jira] [Updated] (SPARK-18497) ForeachSink fails with "assertion failed: No plan for EventTimeWatermark"
[ https://issues.apache.org/jira/browse/SPARK-18497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18497: - Target Version/s: 2.1.0 > ForeachSink fails with "assertion failed: No plan for EventTimeWatermark" > - > > Key: SPARK-18497 > URL: https://issues.apache.org/jira/browse/SPARK-18497 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Aaron Davidson > > I have a pretty standard stream. I call ".writeStream.foreach(...).start()" > and get > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > timestamp#39688: timestamp, interval 1 days > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:107) > at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2751) > at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2290) > at > org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:70) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:449) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcZ$sp(StreamExecution.scala:227) >
[jira] [Updated] (SPARK-18497) ForeachSink fails with "assertion failed: No plan for EventTimeWatermark"
[ https://issues.apache.org/jira/browse/SPARK-18497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18497: - Priority: Critical (was: Major) > ForeachSink fails with "assertion failed: No plan for EventTimeWatermark" > - > > Key: SPARK-18497 > URL: https://issues.apache.org/jira/browse/SPARK-18497 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Aaron Davidson >Priority: Critical > > I have a pretty standard stream. I call ".writeStream.foreach(...).start()" > and get > {code} > java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark > timestamp#39688: timestamp, interval 1 days > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at > scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at > org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) > at > org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) > at > 
org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:107) > at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:232) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2751) > at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2290) > at > org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:70) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:449) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.
[jira] [Created] (SPARK-18497) ForeachSink fails with "assertion failed: No plan for EventTimeWatermark"
Aaron Davidson created SPARK-18497: -- Summary: ForeachSink fails with "assertion failed: No plan for EventTimeWatermark" Key: SPARK-18497 URL: https://issues.apache.org/jira/browse/SPARK-18497 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Aaron Davidson I have a pretty standard stream. I call ".writeStream.foreach(...).start()" and get {code} java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark timestamp#39688: timestamp, interval 1 days at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74) at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92) at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) at org.apache.spark.sql.execution.QueryExecution$$anonfun$toString$3.apply(QueryExecution.scala:232) at org.apache.spark.sql.execution.QueryExecution.stringOrError(QueryExecution.scala:107) at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:232) at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:54) at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2751) at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2290) at org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:70) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:449) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcZ$sp(StreamExecution.scala:227) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:215) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sq
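A hedged reconstruction of the kind of query the report describes, pairing {{withWatermark}} with a {{ForeachWriter}} sink; the socket source, generated event-time column, and one-day watermark are assumptions for illustration, not taken from the reporter's job:
{code}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}
import org.apache.spark.sql.functions.current_timestamp

val spark = SparkSession.builder().appName("foreach-watermark-sketch").getOrCreate()

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Attach an event-time column and a watermark, then write through foreach(...).
val withEventTime = lines
  .withColumn("timestamp", current_timestamp())
  .withWatermark("timestamp", "1 day")

val query = withEventTime.writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = true
    override def process(value: Row): Unit = println(value)
    override def close(errorOrNull: Throwable): Unit = ()
  })
  .start()
{code}
On an affected build, a watermark in the logical plan combined with the foreach sink is the shape that surfaces the "No plan for EventTimeWatermark" assertion shown above.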
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Description: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current Nearest Neighbor implementation to make it align with the MultiProbe NN Search from the origin paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf was: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current MultiProbe implementation to make it align with the MultiProbe NN Search from the origin paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) How we should change the current Nearest Neighbor implementation to make > it align with the MultiProbe NN Search from the origin paper: > http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
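For the first agreed-upon change, a minimal sketch of how {{approxQuantile}} can pick a {{hashDistance}} cutoff without sorting the whole dataset; the column name, quantile, and error tolerance are illustrative assumptions, not the LSH implementation itself:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lsh-threshold-sketch").getOrCreate()
import spark.implicits._

// Pretend hash distances were already computed for every candidate row.
val distances = Seq(0.0, 0.1, 0.3, 0.4, 0.7, 0.9, 1.2).toDF("hashDistance")

// Approximate the 30th percentile with 1% relative error instead of doing a
// full sort to find a cutoff.
val Array(threshold) =
  distances.stat.approxQuantile("hashDistance", Array(0.3), 0.01)

// Keep only candidates at or below the estimated threshold.
val candidates = distances.filter($"hashDistance" <= threshold)
candidates.show()
{code}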
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Description: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) What are the issues and how we should change the current Nearest Neighbor implementation was: We all agree to do the following improvement to Multi-Probe NN Search: (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing full sort on the whole dataset Currently we are still discussing the following: (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} (2) How we should change the current Nearest Neighbor implementation to make it align with the MultiProbe NN Search from the origin paper: http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) What are the issues and how we should change the current Nearest Neighbor > implementation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18454) Changes to fix Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yun Ni updated SPARK-18454: --- Summary: Changes to fix Nearest Neighbor Search for LSH (was: Changes to fix Multi-Probe Nearest Neighbor Search for LSH) > Changes to fix Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) How we should change the current MultiProbe implementation to make it > align with the MultiProbe NN Search from the origin paper: > http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675345#comment-15675345 ] Joseph K. Bradley commented on SPARK-18321: --- I just noticed there are also problems with having 2 copies of many methods in the Java doc, one static and one not. This too is a long-standing problem. > ML 2.1 QA: API: Java compatibility, docs > > > Key: SPARK-18321 > URL: https://issues.apache.org/jira/browse/SPARK-18321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675306#comment-15675306 ] Apache Spark commented on SPARK-4105: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/15923 > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Critical > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[jira] [Commented] (SPARK-18321) ML 2.1 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-18321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675287#comment-15675287 ] Joseph K. Bradley commented on SPARK-18321: --- I'm guessing it's because it's a private class within a file with a different name containing other classes. I think it's OK to leave and hopefully fix in the unidoc gen later on. > ML 2.1 QA: API: Java compatibility, docs > > > Key: SPARK-18321 > URL: https://issues.apache.org/jira/browse/SPARK-18321 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Check Java compatibility for this release: > * APIs in {{spark.ml}} > * New APIs in {{spark.mllib}} (There should be few, if any.) > Checking compatibility means: > * Checking for differences in how Scala and Java handle types. Some items to > look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. These may not > be understood in Java, or they may be accessible only via the weirdly named > Java types (with "$" or "#") which are generated by the Scala compiler. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. (In {{spark.ml}}, we have largely tried to avoid > using enumerations, and have instead favored plain strings.) > * Check for differences in generated Scala vs Java docs. E.g., one past > issue was that Javadocs did not respect Scala's package private modifier. > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > * Remember that we should not break APIs from previous releases. If you find > a problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). > * If needed for complex issues, create small Java unit tests which execute > each method. (Algorithmic correctness can be checked in Scala.) > Recommendations for how to complete this task: > * There are not great tools. In the past, this task has been done by: > ** Generating API docs > ** Building JAR and outputting the Java class signatures for MLlib > ** Manually inspecting and searching the docs and class signatures for issues > * If you do have ideas for better tooling, please say so we can make this > task easier in the future! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-4105: - Assignee: Davies Liu (was: Josh Rosen) > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Critical > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > 
org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's another occurrence of a similar error: > {co
[jira] [Commented] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15675234#comment-15675234 ] Apache Spark commented on SPARK-18462: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15922 > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18462: Assignee: Josh Rosen (was: Apache Spark) > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18462: Assignee: Apache Spark (was: Josh Rosen) > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18462) SparkListenerDriverAccumUpdates event does not deserialize properly in history server
[ https://issues.apache.org/jira/browse/SPARK-18462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-18462: --- Target Version/s: 2.0.3, 2.1.0 > SparkListenerDriverAccumUpdates event does not deserialize properly in > history server > - > > Key: SPARK-18462 > URL: https://issues.apache.org/jira/browse/SPARK-18462 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > The following test fails with a ClassCastException due to oddities in how > Jackson object mapping works, breaking the SQL tab in the history server: > {code} > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala > @@ -19,6 +19,7 @@ package org.apache.spark.sql.execution.ui > import java.util.Properties > +import org.json4s.jackson.JsonMethods._ > import org.mockito.Mockito.mock > import org.apache.spark._ > @@ -35,7 +36,7 @@ import org.apache.spark.sql.execution.{LeafExecNode, > QueryExecution, SparkPlanIn > import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics} > import org.apache.spark.sql.test.SharedSQLContext > import org.apache.spark.ui.SparkUI > -import org.apache.spark.util.{AccumulatorMetadata, LongAccumulator} > +import org.apache.spark.util.{AccumulatorMetadata, JsonProtocol, > LongAccumulator} > class SQLListenerSuite extends SparkFunSuite with SharedSQLContext { > @@ -416,6 +417,20 @@ class SQLListenerSuite extends SparkFunSuite with > SharedSQLContext { > assert(driverUpdates(physicalPlan.longMetric("dummy").id) == > expectedAccumValue) >} > + test("roundtripping SparkListenerDriverAccumUpdates through JsonProtocol") > { > +val event = SparkListenerDriverAccumUpdates(1L, Seq((2L, 3L))) > +val actualJsonString = > compact(render(JsonProtocol.sparkEventToJson(event))) > +val newEvent = JsonProtocol.sparkEventFromJson(parse(actualJsonString)) > +newEvent match { > + case SparkListenerDriverAccumUpdates(executionId, accums) => > +assert(executionId == 1L) > +accums.foreach { case (a, b) => > + assert(a == 2L) > + assert(b == 3L) > +} > +} > + } > + > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18496) java.lang.AssertionError: assertion failed
[ https://issues.apache.org/jira/browse/SPARK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Harish updated SPARK-18496: --- Affects Version/s: 2.0.2 > java.lang.AssertionError: assertion failed > -- > > Key: SPARK-18496 > URL: https://issues.apache.org/jira/browse/SPARK-18496 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.2 > Environment: 2.0.2 snapshot >Reporter: Harish > > I am getting this error when i store the estimates from Julia output to a DF > and then i do df.cache() > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 177.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 177.0 (TID 9722, 10.63.136.108): java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:156) > at > org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) > at > org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361) > at > org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) > at 
org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916) > at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441) > at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) > at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(Gatewa
[jira] [Created] (SPARK-18496) java.lang.AssertionError: assertion failed
Harish created SPARK-18496: -- Summary: java.lang.AssertionError: assertion failed Key: SPARK-18496 URL: https://issues.apache.org/jira/browse/SPARK-18496 Project: Spark Issue Type: Bug Environment: 2.0.2 snapshot Reporter: Harish I am getting this error when i store the estimates from Julia output to a DF and then i do df.cache() py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 177.0 failed 4 times, most recent failure: Lost task 0.3 in stage 177.0 (TID 9722, 10.63.136.108): java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66) at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362) at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361) at scala.Option.foreach(Option.scala:257) at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361) at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356) at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916) at 
org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:441) at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala) at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84) at
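The reported pattern amounts to building a DataFrame from externally computed estimates, caching it, and running a job over it. A minimal sketch of that pattern (in Scala, with hypothetical column names; the original report went through PySpark, and this is not a deterministic reproduction of the assertion failure):

import org.apache.spark.sql.SparkSession

object CacheEstimatesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[4]").appName("spark-18496-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the estimates computed outside Spark (e.g. in Julia).
    val estimates = Seq((1L, 0.42), (2L, 0.17), (3L, 0.99)).toDF("id", "estimate")

    estimates.cache()   // mark the DataFrame for caching
    estimates.count()   // materialize it; the reported failure surfaced while releasing block locks
    spark.stop()
  }
}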
[jira] [Commented] (SPARK-18495) Web UI should document meaning of green dot in DAG visualization
[ https://issues.apache.org/jira/browse/SPARK-18495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674980#comment-15674980 ] Nicholas Chammas commented on SPARK-18495: -- cc [~andrewor14] > Web UI should document meaning of green dot in DAG visualization > > > Key: SPARK-18495 > URL: https://issues.apache.org/jira/browse/SPARK-18495 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Nicholas Chammas >Priority: Trivial > > A green dot in the DAG visualization apparently means that the referenced RDD > is cached. This is not documented anywhere except in [this blog > post|https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html]. > It would be good if the Web UI itself documented this somehow (perhaps in the > tooltip?) so that the user can naturally learn what it means while using the > Web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18495) Web UI should document meaning of green dot in DAG visualization
Nicholas Chammas created SPARK-18495: Summary: Web UI should document meaning of green dot in DAG visualization Key: SPARK-18495 URL: https://issues.apache.org/jira/browse/SPARK-18495 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.0.2 Reporter: Nicholas Chammas Priority: Trivial A green dot in the DAG visualization apparently means that the referenced RDD is cached. This is not documented anywhere except in [this blog post|https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html]. It would be good if the Web UI itself documented this somehow (perhaps in the tooltip?) so that the user can naturally learn what it means while using the Web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
[ https://issues.apache.org/jira/browse/SPARK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-18493: Component/s: PySpark > Add withWatermark and checkpoint to python dataframe > > > Key: SPARK-18493 > URL: https://issues.apache.org/jira/browse/SPARK-18493 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Burak Yavuz > > These two methods were added to Scala Datasets, but are not available in > Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
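For reference, the Scala Dataset surface that the Python API would mirror. A small sketch (the rate source and the checkpoint directory are illustrative choices, not taken from the ticket):

import org.apache.spark.sql.SparkSession

object WatermarkCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("spark-18493-sketch").getOrCreate()
    import spark.implicits._

    // Streaming side: withWatermark bounds how late event-time data may arrive.
    val events = spark.readStream.format("rate").load()            // exposes a 'timestamp' column
    val bounded = events.withWatermark("timestamp", "10 minutes")
    println(bounded.isStreaming)

    // Batch side: checkpoint() eagerly materializes the Dataset and truncates its lineage.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // illustrative path
    val checkpointed = Seq(1, 2, 3).toDF("v").checkpoint()
    println(checkpointed.count())

    spark.stop()
  }
}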
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674801#comment-15674801 ] Reynold Xin commented on SPARK-18252: - Those two methods are pretty inefficient. When we use this in SQL for joins, we'd want a hyper efficient probing without caring too much about the false positive rate (i.e. size will be small), and being able to have all the internal details exposed in a simple way would be critical there. It might be a good idea to create a version of the bloom filter that can be compressed, but I definitely don't want that to be the only implementation. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
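To make the probing point concrete, here is a rough sketch (not Spark's actual implementation) of membership testing against a flat long[] bit array with double hashing; a tight loop like this is what lends itself to a vectorized variant, whereas probing through a compressed bitmap does not:

object BloomProbeSketch {
  // words: the filter's bit vector packed into 64-bit longs.
  // h1, h2: two hashes of the probed item; numHashes: the k hash functions.
  def mightContain(words: Array[Long], numBits: Long, h1: Long, h2: Long, numHashes: Int): Boolean = {
    var i = 0
    while (i < numHashes) {
      var combined = h1 + i * h2               // double hashing derives the i-th probe
      if (combined < 0) combined = ~combined   // keep the index non-negative
      val bit = combined % numBits
      val word = words((bit >>> 6).toInt)      // 64 bits per long word
      if ((word & (1L << (bit & 63))) == 0L) return false
      i += 1
    }
    true
  }
}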
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674821#comment-15674821 ] Aleksey Ponkin commented on SPARK-18252: I do not have any benchmarks, but I believe that iteration over true positions in RB must be pretty fast, since this is essentially run-length encoding, where only random access operations like get and set are pretty painful. We definitely need some benchmarks for this. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674818#comment-15674818 ] Reynold Xin commented on SPARK-18252: - Regarding this - can you find some performance data on how the build time compares between roaring one and the normal bitset? I'm also thinking maybe we can have the default build returning a roaring one, and then have a way to return a simple bloom filter based on normal bitsets and SQL can use that one. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
Burak Yavuz created SPARK-18493: --- Summary: Add withWatermark and checkpoint to python dataframe Key: SPARK-18493 URL: https://issues.apache.org/jira/browse/SPARK-18493 Project: Spark Issue Type: Improvement Components: SQL Reporter: Burak Yavuz These two methods were added to Scala Datasets, but are not available in Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
[ https://issues.apache.org/jira/browse/SPARK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674806#comment-15674806 ] Apache Spark commented on SPARK-18493: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15921 > Add withWatermark and checkpoint to python dataframe > > > Key: SPARK-18493 > URL: https://issues.apache.org/jira/browse/SPARK-18493 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Burak Yavuz > > These two methods were added to Scala Datasets, but are not available in > Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
[ https://issues.apache.org/jira/browse/SPARK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18493: Assignee: Apache Spark > Add withWatermark and checkpoint to python dataframe > > > Key: SPARK-18493 > URL: https://issues.apache.org/jira/browse/SPARK-18493 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Burak Yavuz >Assignee: Apache Spark > > These two methods were added to Scala Datasets, but are not available in > Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18493) Add withWatermark and checkpoint to python dataframe
[ https://issues.apache.org/jira/browse/SPARK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18493: Assignee: (was: Apache Spark) > Add withWatermark and checkpoint to python dataframe > > > Key: SPARK-18493 > URL: https://issues.apache.org/jira/browse/SPARK-18493 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Burak Yavuz > > These two methods were added to Scala Datasets, but are not available in > Python yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674791#comment-15674791 ] Aleksey Ponkin commented on SPARK-18252: Well, I do not know anything about vectorized probing, but you can freely iterate over all true positions in RoaringBitmap with this method [http://static.javadoc.io/org.roaringbitmap/RoaringBitmap/0.6.27/org/roaringbitmap/RoaringBitmap.html#iterator--], and you can get an int[] of true positions in RB with this method [http://static.javadoc.io/org.roaringbitmap/RoaringBitmap/0.6.27/org/roaringbitmap/RoaringBitmap.html#toArray--]. Or have I missed something? > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
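A small sketch of those two calls (org.roaringbitmap.RoaringBitmap; the values are arbitrary):

import org.roaringbitmap.RoaringBitmap

object RoaringIterationSketch {
  def main(args: Array[String]): Unit = {
    val rb = new RoaringBitmap()
    rb.add(3); rb.add(64); rb.add(100000)

    // iterator(): walk the set (true) positions in ascending order.
    val it = rb.iterator()
    while (it.hasNext) println(it.next())

    // toArray(): materialize the set positions as an int[] in one call.
    val positions: Array[Int] = rb.toArray
    println(positions.mkString(", "))
  }
}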
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674767#comment-15674767 ] Reynold Xin commented on SPARK-18252: - For 3, the sketch package has no external dependency, and was created explicitly this way so bloom filter built in Spark can be used in other applications without having to worry about dependency conflicts. For 4, it is much easier to just create a vectorized version of the probing code when all we are dealing with is a simple for loop. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674628#comment-15674628 ] Ofir Manor commented on SPARK-18475: I was just wondering if it actually works, but it seems you found a way to hack it (I thought you would need a different consumer group per worker to avoid coordination by the broker, but it didn't seem like it). If it does provide a big perf boost in some cases, and it is not enabled by default, I personally don't have any objections. [~c...@koeninger.org] - I didn't understand your objection. An RDD / dataset does not have any inherent order guarantees (same as a SQL result set), and the Kafka metadata per message (including topic, partition, offset) is exposed if someone really cares. If you guys have a smart hack that allows you to divide a specific partition into ranges and have different workers read different ranges of a partition in parallel, and if it does provide a significant perf boost, why not have it as an option? I don't think it should break correctness, as the boundaries are decided anyway by the driver. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
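A sketch of the driver-side splitting being discussed; OffsetSlice and splitRange are hypothetical names rather than the actual source, but they show why the driver fixing the boundaries keeps correctness intact:

object OffsetSplitSketch {
  // Hypothetical: split one TopicPartition's [from, until) offset range into n contiguous slices.
  final case class OffsetSlice(topic: String, partition: Int, from: Long, until: Long)

  def splitRange(topic: String, partition: Int, from: Long, until: Long, n: Int): Seq[OffsetSlice] =
    (0 until n).map { i =>
      val start = from + (until - from) * i / n
      val end   = from + (until - from) * (i + 1) / n
      OffsetSlice(topic, partition, start, end)
    }.filter(s => s.until > s.from)

  def main(args: Array[String]): Unit = {
    // Four tasks could then read [1000,2000), [2000,3000), [3000,4000), [4000,5000) in parallel.
    splitRange("events", 0, 1000L, 5000L, 4).foreach(println)
  }
}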
[jira] [Comment Edited] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674599#comment-15674599 ] Kazuaki Ishizaki edited comment on SPARK-18492 at 11/17/16 7:31 PM: I agree with your point. Can you post a small program that can reproduce this issue? was (Author: kiszk): Can you post a small program that can reproduce this issue? > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData projec
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674599#comment-15674599 ] Kazuaki Ishizaki commented on SPARK-18492: -- Can you post a small program that can reproduce this issue? > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext) each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The reason that the 64KB JVM limit for code for a method is exceeded is > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData project_value252 = null; > /* 12269 */ if (!project_isNull252) { > /* 12270 */ project_value252 = project_result249; > /* 12271 */ } > /* 12272 */
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674579#comment-15674579 ] Sean Owen commented on SPARK-9487: -- Agree, it seems like it should not be sensitive to ordering within each batch. This could convert the List<...> to a Set<...> so that they are compared without regard to order. If that's the nature of the difference, then yes it's the test that should be fixed as part of this change. > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674558#comment-15674558 ] Aleksey Ponkin edited comment on SPARK-18252 at 11/17/16 7:23 PM: -- Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding.Current implementation is not very useful, because you can not serialize bloom filter (with 10 millions items for example) - kryo has a limit of 2G object size. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. was (Author: ponkin): Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding.Current implementation is very useful because you can not serialize bloom filter (with 10 millions items for example) - kryo has a limit of 2G object size. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674558#comment-15674558 ] Aleksey Ponkin edited comment on SPARK-18252 at 11/17/16 7:22 PM: -- Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding.Current implementation is very useful because you can not serialize bloom filter (with 10 millions items for example) - kryo has a limit of 2G object size. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. was (Author: ponkin): Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674558#comment-15674558 ] Aleksey Ponkin commented on SPARK-18252: Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Ponkin updated SPARK-18252: --- Comment: was deleted (was: Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. ) > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674559#comment-15674559 ] Aleksey Ponkin commented on SPARK-18252: Hi, good points. 1. Problem with current implementation that bloom filter acquire memory for full bloom filter even if there are no items inside. RoaringBitmap is not compressing a bit vector, it is compact representation of BitSet. Also compressing random bit vector with usual compressions will not gain any effect, since it is random and half filled. To compress bloom filter we need to create "sparse" bitset ([http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf]) and than apply something like arithmetic encoding. 2. This is true, we need to increment version inside bloom filter 3. RoaringBitmap is already used in Spark([https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L135]). What is the problem with dependencies? 4. Please can you tell me more about vectorized probing and why RoaringBitmap is not suitable for this? Thanks in advance. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674543#comment-15674543 ] Saikat Kanjilal commented on SPARK-9487: Sean, I took a look at the code and here it is:
List<List<String>> inputData = Arrays.asList(
  Arrays.asList("hello", "world"),
  Arrays.asList("hello", "moon"),
  Arrays.asList("hello"));
List<List<Tuple2<String, Long>>> expected = Arrays.asList(
  Arrays.asList(new Tuple2<>("hello", 1L), new Tuple2<>("world", 1L)),
  Arrays.asList(new Tuple2<>("hello", 1L), new Tuple2<>("moon", 1L)),
  Arrays.asList(new Tuple2<>("hello", 1L)));
JavaDStream<String> stream = JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
JavaPairDStream<String, Long> counted = stream.countByValue();
JavaTestUtils.attachTestOutputStream(counted);
List<List<Tuple2<String, Long>>> result = JavaTestUtils.runStreams(ssc, 3, 3);
Assert.assertEquals(expected, result);
As you can see, the expected value assumes that the contents of the stream get counted accurately for every word. The output generated by the flaky run just has (hello,1) and (moon,1) reversed, which I don't think matters; unless the goal of the test is to identify words in the order they enter the stream, both the expected and the actual answers are correct. So, net net, the test is flaky. Should I refactor the test to actually look at the word counts and not the order? Thoughts on next steps? > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different result in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674535#comment-15674535 ] Burak Yavuz commented on SPARK-18475: - [~c...@koeninger.org] I don't see where you may need strict ordering in SQL land. The way I see it, I should be able to tell Kafka: Give me point A to B from Stream X. I thought this was the general use case of Kafka. Maybe you want to use it as a FIFO buffer, maybe you want to use it as centralized storage. There are other systems out there that do this kind of processing (e.g. secor -> https://github.com/pinterest/secor/), therefore I don't think that it's pretty appropriate for the general use case. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18252) Improve serialized BloomFilter size
[ https://issues.apache.org/jira/browse/SPARK-18252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674532#comment-15674532 ] Reynold Xin commented on SPARK-18252: - I'm not sure if it is worth fixing this: 1. We already compress data before we send them across the network. 2. This is also not a backward-compatible change and would require different versioning. 3. This brings in an extra dependency for a package that has 0 external dependencies. 4. We very likely will implement vectorized probing for the bloom filter to be used in Spark SQL joins, and using roaring bitmap would make that a lot harder to do. > Improve serialized BloomFilter size > --- > > Key: SPARK-18252 > URL: https://issues.apache.org/jira/browse/SPARK-18252 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1 >Reporter: Aleksey Ponkin >Priority: Minor > > Since version 2.0 Spark has BloomFilter implementation - > org.apache.spark.util.sketch.BloomFilterImpl. I have noticed that current > implementation is using custom class org.apache.spark.util.sketch.BitArray > for bit vector, which is allocating memory for the whole filter no matter how > many elements are set. Since BloomFilter can be serialized and sent over > network in distribution stage may be we need to use some kind of compressed > bloom filters? For example > [https://github.com/RoaringBitmap/RoaringBitmap][RoaringBitmap] or > [javaewah][https://github.com/lemire/javaewah]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
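For concreteness, the up-front allocation under discussion can be observed with the existing sketch API; a small sketch (capacity and false-positive probability are arbitrary):

import java.io.ByteArrayOutputStream
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterSizeSketch {
  def main(args: Array[String]): Unit = {
    // BitArray space is sized up front from the expected capacity, so the serialized
    // form is roughly the same size whether the filter holds one item or millions.
    val bf = BloomFilter.create(10000000L, 0.03)   // expected items, false-positive probability
    bf.putLong(42L)

    val out = new ByteArrayOutputStream()
    bf.writeTo(out)
    println(s"serialized size: ${out.size()} bytes")
  }
}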
[jira] [Created] (SPARK-18492) GeneratedIterator grows beyond 64 KB
Norris Merritt created SPARK-18492: -- Summary: GeneratedIterator grows beyond 64 KB Key: SPARK-18492 URL: https://issues.apache.org/jira/browse/SPARK-18492 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Environment: CentOS release 6.7 (Final) Reporter: Norris Merritt spark-submit fails with ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "(I[Lscala/collection/Iterator;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB Error message is followed by a huge dump of generated source code. The generated code declares 1,454 field sequences like the following: /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF project_scalaUDF1; /* 037 */ private scala.Function1 project_catalystConverter1; /* 038 */ private scala.Function1 project_converter1; /* 039 */ private scala.Function1 project_converter2; /* 040 */ private scala.Function2 project_udf1; (many omitted lines) ... /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF project_scalaUDF1454; /* 6090 */ private scala.Function1 project_catalystConverter1454; /* 6091 */ private scala.Function1 project_converter1695; /* 6092 */ private scala.Function1 project_udf1454; It then proceeds to emit code for several methods (init, processNext) each of which has totally repetitive sequences of statements pertaining to each of the sequences of variables declared in the class. For example: /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { The reason that the 64KB JVM limit for code for a method is exceeded is because the code generator is using an incredibly naive strategy. It emits a sequence like the one shown below for each of the 1,454 groups of variables shown above, in /* 6132 */ this.project_udf = (scala.Function1)project_scalaUDF.userDefinedFunc(); /* 6133 */ this.project_scalaUDF1 = (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; /* 6134 */ this.project_catalystConverter1 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); /* 6135 */ this.project_converter1 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); /* 6136 */ this.project_converter2 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); It blows up after emitting 230 such sequences, while trying to emit the 231st: /* 7282 */ this.project_udf230 = (scala.Function2)project_scalaUDF230.userDefinedFunc(); /* 7283 */ this.project_scalaUDF231 = (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; /* 7284 */ this.project_catalystConverter231 = (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); many omitted lines ... 
Example of repetitive code sequences emitted for processNext method: /* 12253 */ boolean project_isNull247 = project_result244 == null; /* 12254 */ MapData project_value247 = null; /* 12255 */ if (!project_isNull247) { /* 12256 */ project_value247 = project_result244; /* 12257 */ } /* 12258 */ Object project_arg = sort_isNull5 ? null : project_converter489.apply(sort_value5); /* 12259 */ /* 12260 */ ArrayData project_result249 = null; /* 12261 */ try { /* 12262 */ project_result249 = (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); /* 12263 */ } catch (Exception e) { /* 12264 */ throw new org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); /* 12265 */ } /* 12266 */ /* 12267 */ boolean project_isNull252 = project_result249 == null; /* 12268 */ ArrayData project_value252 = null; /* 12269 */ if (!project_isNull252) { /* 12270 */ project_value252 = project_result249; /* 12271 */ } /* 12272 */ Object project_arg1 = project_isNull252 ? null : project_converter488.apply(project_value252); /* 12273 */ /* 12274 */ ArrayData project_result248 = null; /* 12275 */ try { /* 12276 */ project_result248 = (ArrayData)project_catalystConverter247.apply(project_udf247.apply(project_arg1)); /* 12277 */ } catch (Exception e) { /* 12278 */ throw new org.apache.sp
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674459#comment-15674459 ] Cody Koeninger commented on SPARK-18475: This has come up several times, and my answer is consistently the same - as Ofir said, the Kafka model is parallelism bounded by number of partitions. Breaking that model breaks user expectations, e.g. concerning ordering. It's fine for you if this helps your specific use case, but I think it is not appropriate for general use. I'd recommend people fix their skew and/or repartition at the producer level. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-18317: -- Attachment: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html I checked the result from japi-compliance-checker. All binary incompatible changes reported are either private or package private. So we are good to go. > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > Attachments: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-18317. --- Resolution: Done > ML, Graph 2.1 QA: API: Binary incompatible changes > -- > > Key: SPARK-18317 > URL: https://issues.apache.org/jira/browse/SPARK-18317 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Xiangrui Meng >Priority: Blocker > Attachments: spark-graphx_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib-local_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html, > spark-mllib_2.11-2.0.2_to_2.11-2.1.0-SNAPSHOT.html > > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674308#comment-15674308 ] holdenk commented on SPARK-2620: I don't think it's been resolved. Does your code need to be in the REPL, or can you compile it? Do you have a small repro case so we can see if it might be the same issue? If you're working in Jupyter, it's possible that the issue is fixed by using the new Scala kernel, which has a different REPL backend, but I haven't had a chance to investigate whether the same issue is present there. > case class cannot be used as key for reduce > --- > > Key: SPARK-2620 > URL: https://issues.apache.org/jira/browse/SPARK-2620 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.0.0, 1.1.0, 1.3.0, 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0 > Environment: reproduced on spark-shell local[4] >Reporter: Gerard Maas >Assignee: Tobias Schlatter >Priority: Critical > Labels: case-class, core > > Using a case class as a key doesn't seem to work properly on Spark 1.0.0 > A minimal example: > case class P(name:String) > val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) > sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect > [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), > (P(bob),1), (P(abe),1), (P(charly),1)) > In contrast to the expected behavior, that should be equivalent to: > sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect > Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) > groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
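For reference, a compiled (non-REPL) counterpart of the repro in the description might look like the sketch below. Assuming the defect is specific to how the REPL wraps case class definitions, reduceByKey here should merge the two P("bob") records; this is an illustrative check, not a confirmed fix.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Compiled counterpart of the spark-shell repro above: the case class is defined
// at the top level of a source file rather than inside a REPL line object.
case class P(name: String)

object CaseClassKeyCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("case-class-key"))
    val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
    val counts = sc.parallelize(ps).map(x => (x, 1)).reduceByKey(_ + _).collect()
    counts.foreach(println)   // expected when compiled: (P(bob),2) appears exactly once
    sc.stop()
  }
}
{code}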
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674253#comment-15674253 ] Sean Owen commented on SPARK-17436: --- Not sure, I have no failures on OS X, on Ubuntu, and all of the Jenkins tests pass. I don't see reports of problems building. > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
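The order-preserving grouping proposed in the description can be sketched as below (a hypothetical helper, not the actual DynamicPartitionWriterContainer change): rows are bucketed per partition key in arrival order rather than pushed through a sorter, so any pre-existing sort within each key survives.
{code}
import scala.collection.mutable

object OrderPreservingGrouping {
  // Bucket rows by partition key in the order they arrive; each bucket keeps the
  // original row order, so writing buckets out preserves any earlier sort per key.
  def writeGrouped[K, R](rows: Iterator[(K, R)])(write: (K, Seq[R]) => Unit): Unit = {
    val buckets = mutable.LinkedHashMap.empty[K, mutable.ArrayBuffer[R]]
    rows.foreach { case (key, row) =>
      buckets.getOrElseUpdate(key, mutable.ArrayBuffer.empty[R]) += row
    }
    buckets.foreach { case (key, buffered) => write(key, buffered) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Iterator(("p1", 1), ("p2", 2), ("p1", 3), ("p2", 4))
    writeGrouped(rows)((k, rs) => println(s"$k -> ${rs.mkString(",")}"))   // p1 -> 1,3 and p2 -> 2,4
  }
}
{code}
One likely trade-off is memory: the buffered rows for all open keys are held on the heap, whereas the sorter-based path can spill to disk.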
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674254#comment-15674254 ] Burak Yavuz commented on SPARK-18475: - [~ofirm] Thanks for your comment. I've seen significant performance improvements, and here's my explanation on how it happened, and why a simple repartition won't help: You're correct on the fact that for a given group.id, multiple consumers can't consume from the same TopicPartition concurrently. Where you get the benefit is that for skewed partitions, you don't wait for one Spark task to try and read everything from Kafka while all cores wait idle. You achieve better utilization because as tasks read less data, they can move on to the second step of the computation quicker, and while the first CPU has moved on to the second step (writing out to some storage system), your second CPU can start reading from Kafka. It kind of helps you pipeline your operations. If you use repartition, you're still going to have all your cores wait while that one consumer tries to read everything, and then you're going to cause a shuffle on top of it which is even worse. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18490: -- Assignee: Song Jun > duplicate nodename extrainfo of ShuffleExchange > --- > > Key: SPARK-18490 > URL: https://issues.apache.org/jira/browse/SPARK-18490 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Song Jun >Assignee: Song Jun >Priority: Trivial > Fix For: 2.1.0 > > > {noformat} > override def nodeName: String = { > val extraInfo = coordinator match { > case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case None => "" > } > {noformat} > [if exchangeCoordinator.isEstimated ] true or false with the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
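The cleanup this ticket asks for can be sketched as below (illustrative only, not the merged patch): since both guarded cases in the quoted match build the same string, the isEstimated guard can simply be dropped. Here `coordinator` is a stand-in Option, not the real ExchangeCoordinator field.
{code}
object NodeNameSketch {
  def extraInfo(coordinator: Option[AnyRef]): String = coordinator match {
    case Some(_) => s"(coordinator id: ${System.identityHashCode(coordinator)})"
    case None    => ""
  }

  def main(args: Array[String]): Unit = {
    println(extraInfo(Some(new Object())))   // "(coordinator id: ...)"
    println(extraInfo(None))                 // ""
  }
}
{code}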
[jira] [Resolved] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18490. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15920 [https://github.com/apache/spark/pull/15920] > duplicate nodename extrainfo of ShuffleExchange > --- > > Key: SPARK-18490 > URL: https://issues.apache.org/jira/browse/SPARK-18490 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Song Jun >Priority: Trivial > Fix For: 2.1.0 > > > {noformat} > override def nodeName: String = { > val extraInfo = coordinator match { > case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case None => "" > } > {noformat} > [if exchangeCoordinator.isEstimated ] true or false with the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)
[ https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674161#comment-15674161 ] zakaria hili commented on SPARK-18356: -- Hi [~yuhaoyan], I tried to improve the Kmeans using the same concept of caching in Logistic Regression. https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L310 and my result of performances: I used only one VM (Local Mode) with python -> Spark without improvement: the training takes ~0,605s (as a mean value) -> Spark with Kmeans improved: ~0,518s (the warning disappeared) so we can say that we did not gain a lot, but maybe we will see the difference if we run the train method many times. what do you think ? > Issue + Resolution: Kmeans Spark Performances (ML package) > -- > > Key: SPARK-18356 > URL: https://issues.apache.org/jira/browse/SPARK-18356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0, 2.0.1 >Reporter: zakaria hili >Priority: Minor > Labels: easyfix > > Hello, > I'm newbie in spark, but I think that I found a small problem that can affect > spark Kmeans performances. > Before starting to explain the problem, I want to explain the warning that I > faced. > I tried to use Spark Kmeans with Dataframes to cluster my data > df_Part = assembler.transform(df_Part) > df_Part.cache() > while (k<=max_cluster) and (wssse > seuilStop): > kmeans = KMeans().setK(k) > model = kmeans.fit(df_Part) > wssse = model.computeCost(df_Part) > k=k+1 > but when I run the code I receive the warning : > WARN KMeans: The input data is not directly cached, which may hurt > performance if its parent RDDs are also uncached. > I searched in spark source code to find the source of this problem, then I > realized there is two classes responsible for this warning: > (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ) > (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ) > > When my dataframe is cached, the fit method transform my dataframe into an > internally rdd which is not cached. > Dataframe -> rdd -> run Training Kmeans Algo(rdd) > -> The first class (ml package) responsible for converting the dataframe into > rdd then call Kmeans Algorithm > ->The second class (mllib package) implements Kmeans Algorithm, and here > spark verify if the rdd is cached, if not a warning will be generated. > So, the solution of this problem is to cache the rdd before running Kmeans > Algorithm. > https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala > All what we need is to add two lines: > Cache rdd just after dataframe transformation, then uncached it after > training algorithm. > I hope that I was clear. > If you think that I was wrong, please let me know. > Sincerely, > Zakaria HILI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
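The caching pattern described in the comment can be written generically as below (a hypothetical helper, not the patched ml.KMeans.fit): persist the converted RDD only when the caller has not cached it, train, then unpersist so nothing leaks. This mirrors the handlePersistence idea used elsewhere in ML training code, which appears to be the "same concept of caching" the comment refers to.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object CacheAroundTraining {
  // Persist `instances` for the duration of training if it is not already cached,
  // then uncache it afterwards ("cache the rdd before running the algorithm,
  // uncache it after training", as described above).
  def trainWithCache[T, M](instances: RDD[T])(train: RDD[T] => M): M = {
    val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
    if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
    try train(instances)
    finally if (handlePersistence) instances.unpersist(blocking = false)
  }
}
{code}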
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674147#comment-15674147 ] Amit Sela commented on SPARK-10816: --- [~rxin] it might be worth taking into account a more generic implementation of "Merging Windows". Sessions are merging windows that use "gap duration" to determine when to close the session, such that if we say that the first element arrives in window {{[event_time1, event_time1 + gap_duration)}} and the next one at {{[event_time2, event_time2 + gap_duration)}} and {{event_time1 < event_time2}}, their combined value will belong to {{[event_time1, event_time2 + gap_duration)}}, right ? But the same "merge" of windows could very well be determined by a "close session" element (using Kafka for example would guarantee order of messages), or any user defined logic for that matter, as long as the "merge function" is provided by the user. Of course providing Sessions API out-of-the-box would prove most useful as it is the most common, but I don't see any downside to also have a more "advanced" API here. Thanks! > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
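The gap-based merge rule described above can be written down concretely; the sketch below uses plain epoch milliseconds and is purely illustrative, not a proposed Structured Streaming API.
{code}
object SessionMergeSketch {
  // A session window is a half-open interval [start, end) in event time.
  final case class Session(start: Long, end: Long)

  def fromEvent(eventTimeMs: Long, gapMs: Long): Session = Session(eventTimeMs, eventTimeMs + gapMs)

  // Two sessions merge when their intervals overlap; the merged session spans both.
  def merge(a: Session, b: Session): Option[Session] =
    if (a.start < b.end && b.start < a.end)
      Some(Session(math.min(a.start, b.start), math.max(a.end, b.end)))
    else None

  def main(args: Array[String]): Unit = {
    val gap = 30000L
    val first  = fromEvent(1000L, gap)    // [1000, 31000)
    val second = fromEvent(20000L, gap)   // [20000, 50000)
    println(merge(first, second))         // Some(Session(1000,50000)) == [event_time1, event_time2 + gap)
  }
}
{code}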
[jira] [Commented] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674129#comment-15674129 ] Barry Becker commented on SPARK-12965: -- This is a big issue for us because we don't control the names of the columns that we get. One ugly workaround might be to convert . to _ in the columns, but then you need to worry about conflicting with other columns that differ only by their use of . or _. The backquoting works in many places, but there are still many places, like this, where we have seen that it does not work. > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML, Spark Core >Affects Versions: 1.6.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., Given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---++ > |a.b|a_b|a_bIndex| > +---+---++ > |foo|foo| 0.0| > |bar|bar| 1.0| > +---+---++ > {noformat} > but I think it should have another column, {{abIndex}} with the same contents > as a_bIndex. > {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create an empty RDD and DataFrame > List rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
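The rename workaround mentioned above could look like the sketch below (a hypothetical helper): replace dots before handing the column to StringIndexer, and fail fast if the rename would collide with an existing column, which is exactly the conflict the comment warns about.
{code}
import org.apache.spark.sql.DataFrame

object DotColumnWorkaround {
  // Rename every column containing '.' to use '_' instead, refusing to proceed if
  // the new name would clash with a column that already exists.
  def renameDottedColumns(df: DataFrame): DataFrame =
    df.columns.foldLeft(df) { (cur, name) =>
      if (!name.contains(".")) cur
      else {
        val renamed = name.replace('.', '_')
        require(!cur.columns.contains(renamed), s"Cannot rename '$name': '$renamed' already exists")
        cur.withColumnRenamed(name, renamed)
      }
    }
}
{code}
After the rename, setInputCol("a_b") resolves normally, at the cost of losing the original column names downstream.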
[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source
[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674092#comment-15674092 ] Ofir Manor commented on SPARK-18475: Are you sure this is working? Having a visible perf effect? As far as I know, the maximum parallelism of Kafka is the number of topic-partitions, by design. If your consumer group has more consumers than that, some of them will be just idle. This is because when reading, each partition is owned by a single consumer (that allocation of partitions to consumer is dynamic, as consumers joins and leaves). To quote an older source: ??The first thing to understand is that a topic partition is the unit of parallelism in Kafka. On both the producer and the broker side, writes to different partitions can be done fully in parallel. So expensive operations such as compression can utilize more hardware resources. On the consumer side, Kafka always gives a single partition’s data to one consumer thread. Thus, the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed. Therefore, in general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve.?? https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ You could repartition the data in Spark after reading, to increase parallelism of Spark's processing. > Be able to provide higher parallelization for StructuredStreaming Kafka Source > -- > > Key: SPARK-18475 > URL: https://issues.apache.org/jira/browse/SPARK-18475 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Burak Yavuz > > Right now the StructuredStreaming Kafka Source creates as many Spark tasks as > there are TopicPartitions that we're going to read from Kafka. > This doesn't work well when we have data skew, and there is no reason why we > shouldn't be able to increase parallelism further, i.e. have multiple Spark > tasks reading from the same Kafka TopicPartition. > What this will mean is that we won't be able to use the "CachedKafkaConsumer" > for what it is defined for (being cached) in this use case, but the extra > overhead is worth handling data skew and increasing parallelism especially in > ETL use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
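The repartition approach suggested at the end of the comment would look roughly like the sketch below; the broker, topic, and partition count are placeholders, and the spark-sql-kafka-0-10 package is assumed to be on the classpath. Kafka is still read with one task per TopicPartition, but the processing that follows is spread over more tasks.
{code}
import org.apache.spark.sql.SparkSession

object KafkaRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-repartition-sketch").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder broker
      .option("subscribe", "events")                       // placeholder topic
      .load()
      .repartition(32)   // fan processing out beyond the number of TopicPartitions

    val query = stream.writeStream.format("console").start()
    query.awaitTermination()
  }
}
{code}
Note this only widens the work after the read; the read itself remains bounded by one consumer per TopicPartition, which is the limit discussed above.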
[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ] Ran Haim commented on SPARK-17436: -- I have basically cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test". This always fails for me. Can you point me to someone who can help me? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting
[ https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674032#comment-15674032 ] Ran Haim edited comment on SPARK-17436 at 11/17/16 3:48 PM: I have basically cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 clean install". This always fails for me. Can you point me to someone who can help me? was (Author: ran.h...@optimalplus.com): I have basically cloned the repository from https://github.com/apache/spark and ran "build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 test". This always fails for me. Can you point me to someone who can help me? > dataframe.write sometimes does not keep sorting > --- > > Key: SPARK-17436 > URL: https://issues.apache.org/jira/browse/SPARK-17436 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Ran Haim > > When using partition by, datawriter can sometimes mess up an ordered > dataframe. > The problem originates in > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer. > In the writeRows method when too many files are opened (configurable), it > starts inserting rows to UnsafeKVExternalSorter, then it reads all the rows > again from the sorter and writes them to the corresponding files. > The problem is that the sorter actually sorts the rows using the partition > key, and that can sometimes mess up the original sort (or secondary sort if > you will). > I think the best way to fix it is to stop using a sorter, and just put the > rows in a map using key as partition key and value as an arraylist, and then > just walk through all the keys and write it in the original order - this will > probably be faster as there no need for ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation
[ https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673931#comment-15673931 ] Emlyn Corrin commented on SPARK-18172: -- I'm not sure I've got the time to build from source at the moment to verify this, I think if it now works for you and [~hvanhovell] it's most likely fixed now. If it reoccurs for me with 2.0.3 or 2.1.0 once they're released I'll reopen this. Thanks. > AnalysisException in first/last during aggregation > -- > > Key: SPARK-18172 > URL: https://issues.apache.org/jira/browse/SPARK-18172 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Emlyn Corrin > Fix For: 2.0.3, 2.1.0 > > > Since Spark 2.0.1, the following pyspark snippet fails with > {{AnalysisException: The second argument of First should be a boolean > literal}} (but it's not restricted to Python, similar code with in Java fails > in the same way). > It worked in Spark 2.0.0, so I believe it may be related to the fix for > SPARK-16648. > {code} > from pyspark.sql import functions as F > ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]])) > ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), > F.countDistinct(ds._2, ds._3)).show() > {code} > It works if any of the three arguments to {{.agg}} is removed. > The stack trace is: > {code} > Py4JJavaError Traceback (most recent call last) > in () > > 1 > ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2, > ds._3)).show() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py > in show(self, n, truncate) > 285 +---+-+ > 286 """ > --> 287 print(self._jdf.showString(n, truncate)) > 288 > 289 def __repr__(self): > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py > in __call__(self, *args) >1131 answer = self.gateway_client.send_command(command) >1132 return_value = get_return_value( > -> 1133 answer, self.gateway_client, self.target_id, self.name) >1134 >1135 for temp_arg in temp_args: > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in > deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 317 raise Py4JJavaError( > 318 "An error occurred while calling {0}{1}{2}.\n". > --> 319 format(target_id, ".", name), value) > 320 else: > 321 raise Py4JError( > Py4JJavaError: An error occurred while calling o76.showString. 
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, > tree: first(_2#1L)() > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387) > at > org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105) > at > org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggr
[jira] [Commented] (SPARK-18490) duplicate nodename extrainfo of ShuffleExchange
[ https://issues.apache.org/jira/browse/SPARK-18490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673918#comment-15673918 ] Song Jun commented on SPARK-18490: -- it is not a bug, just simplify this code > duplicate nodename extrainfo of ShuffleExchange > --- > > Key: SPARK-18490 > URL: https://issues.apache.org/jira/browse/SPARK-18490 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: Song Jun >Priority: Trivial > > {noformat} > override def nodeName: String = { > val extraInfo = coordinator match { > case Some(exchangeCoordinator) if exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case Some(exchangeCoordinator) if !exchangeCoordinator.isEstimated => > s"(coordinator id: ${System.identityHashCode(coordinator)})" > case None => "" > } > {noformat} > [if exchangeCoordinator.isEstimated ] true or false with the same result. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
[ https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673904#comment-15673904 ] Herman van Hovell commented on SPARK-18004: --- which format should be passed to oracle? > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > > > Key: SPARK-18004 > URL: https://issues.apache.org/jira/browse/SPARK-18004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Suhas Nalapure >Priority: Critical > > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > with Exception java.sql.SQLDataException: ORA-01861: literal does not match > format string: > Java source code (this code works fine for mysql & mssql databases) : > {noformat} > //DataFrame df = create a DataFrame over an Oracle table > df = df.filter(df.col("TS").lt(new > java.sql.Timestamp(System.currentTimeMillis(; > df.explain(); > df.show(); > {noformat} > Log statements with the Exception: > {noformat} > Schema: root > |-- ID: string (nullable = false) > |-- TS: timestamp (nullable = true) > |-- DEVICE_ID: string (nullable = true) > |-- REPLACEMENT: string (nullable = true) > {noformat} > {noformat} > == Physical Plan == > Filter (TS#1 < 1476861841934000) > +- Scan > JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user, > password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, > driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] > PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)] > 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] > org.apache.spark.executor.Executor > Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: ORA-01861: literal does not match format string > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461) > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402) > at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065) > at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681) > at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256) > at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75) > at > oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043) > at > oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:) > at > oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353) > at > oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485) > at > oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566) > at > oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
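Judging from the pushed filter above, the timestamp reaches Oracle as a bare quoted string literal, which only parses when NLS_TIMESTAMP_FORMAT happens to match. A hedged sketch of an Oracle-friendly rendering (illustrative only, not Spark's actual JDBC pushdown code; the column name and predicate shape are placeholders) wraps the literal in TO_TIMESTAMP with an explicit format mask:
{code}
import java.sql.Timestamp
import java.text.SimpleDateFormat

object OracleTimestampLiteral {
  // Render a "column < timestamp" predicate so Oracle parses the literal regardless
  // of the session NLS settings.
  def lessThan(column: String, ts: Timestamp): String = {
    val formatted = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(ts)
    s"$column < TO_TIMESTAMP('$formatted', 'YYYY-MM-DD HH24:MI:SS.FF3')"
  }

  def main(args: Array[String]): Unit = {
    // Prints: TS < TO_TIMESTAMP('2016-10-19 12:54:01.934', 'YYYY-MM-DD HH24:MI:SS.FF3')
    println(lessThan("TS", Timestamp.valueOf("2016-10-19 12:54:01.934")))
  }
}
{code}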
[jira] [Updated] (SPARK-18444) SparkR running in yarn-cluster mode should not download Spark package
[ https://issues.apache.org/jira/browse/SPARK-18444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18444: Target Version/s: 2.1.0 > SparkR running in yarn-cluster mode should not download Spark package > - > > Key: SPARK-18444 > URL: https://issues.apache.org/jira/browse/SPARK-18444 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > > When running SparkR job in yarn-cluster mode, it will download Spark package > from apache website which is not necessary. > {code} > ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R > {code} > The following is output: > {code} > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:stats’: > cov, filter, lag, na.omit, predict, sd, var, window > The following objects are masked from ‘package:base’: > as.data.frame, colnames, colnames<-, drop, endsWith, intersect, > rank, rbind, sample, startsWith, subset, summary, transform, union > Spark not found in SPARK_HOME: > Spark not found in the cache directory. Installation will start. > MirrorUrl not provided. > Looking for preferred site from apache website... > .. > {code} > There's no {{SPARK_HOME}} in yarn-cluster mode since the R process is in a > remote host of the yarn cluster rather than in the client host. The JVM comes > up first and the R process then connects to it. So in such cases we should > never have to download Spark as Spark is already running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18444) SparkR running in yarn-cluster mode should not download Spark package
[ https://issues.apache.org/jira/browse/SPARK-18444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18444: Priority: Critical (was: Major) > SparkR running in yarn-cluster mode should not download Spark package > - > > Key: SPARK-18444 > URL: https://issues.apache.org/jira/browse/SPARK-18444 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Critical > > When running SparkR job in yarn-cluster mode, it will download Spark package > from apache website which is not necessary. > {code} > ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R > {code} > The following is output: > {code} > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:stats’: > cov, filter, lag, na.omit, predict, sd, var, window > The following objects are masked from ‘package:base’: > as.data.frame, colnames, colnames<-, drop, endsWith, intersect, > rank, rbind, sample, startsWith, subset, summary, transform, union > Spark not found in SPARK_HOME: > Spark not found in the cache directory. Installation will start. > MirrorUrl not provided. > Looking for preferred site from apache website... > .. > {code} > There's no {{SPARK_HOME}} in yarn-cluster mode since the R process is in a > remote host of the yarn cluster rather than in the client host. The JVM comes > up first and the R process then connects to it. So in such cases we should > never have to download Spark as Spark is already running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18444) SparkR running in yarn-cluster mode should not download Spark package
[ https://issues.apache.org/jira/browse/SPARK-18444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-18444: --- Assignee: Yanbo Liang > SparkR running in yarn-cluster mode should not download Spark package > - > > Key: SPARK-18444 > URL: https://issues.apache.org/jira/browse/SPARK-18444 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > When running SparkR job in yarn-cluster mode, it will download Spark package > from apache website which is not necessary. > {code} > ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R > {code} > The following is output: > {code} > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:stats’: > cov, filter, lag, na.omit, predict, sd, var, window > The following objects are masked from ‘package:base’: > as.data.frame, colnames, colnames<-, drop, endsWith, intersect, > rank, rbind, sample, startsWith, subset, summary, transform, union > Spark not found in SPARK_HOME: > Spark not found in the cache directory. Installation will start. > MirrorUrl not provided. > Looking for preferred site from apache website... > .. > {code} > There's no {{SPARK_HOME}} in yarn-cluster mode since the R process is in a > remote host of the yarn cluster rather than in the client host. The JVM comes > up first and the R process then connects to it. So in such cases we should > never have to download Spark as Spark is already running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
[ https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673885#comment-15673885 ] Suhas Nalapure commented on SPARK-18004: Date format as per the physical plan logged by Spark Dataframe: PushedFilters: [LessThan(TS,2016-11-17 19:42:01.057)] Confirmed the same from Oracle query logs as well: WHERE TS < '2016-11-17 19:42:01.057' > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > > > Key: SPARK-18004 > URL: https://issues.apache.org/jira/browse/SPARK-18004 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Suhas Nalapure >Priority: Critical > > DataFrame filter Predicate push-down fails for Oracle Timestamp type columns > with Exception java.sql.SQLDataException: ORA-01861: literal does not match > format string: > Java source code (this code works fine for mysql & mssql databases) : > {noformat} > //DataFrame df = create a DataFrame over an Oracle table > df = df.filter(df.col("TS").lt(new > java.sql.Timestamp(System.currentTimeMillis(; > df.explain(); > df.show(); > {noformat} > Log statements with the Exception: > {noformat} > Schema: root > |-- ID: string (nullable = false) > |-- TS: timestamp (nullable = true) > |-- DEVICE_ID: string (nullable = true) > |-- REPLACEMENT: string (nullable = true) > {noformat} > {noformat} > == Physical Plan == > Filter (TS#1 < 1476861841934000) > +- Scan > JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user, > password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, > driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] > PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)] > 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] > org.apache.spark.executor.Executor > Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: ORA-01861: literal does not match format string > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461) > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402) > at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065) > at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681) > at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256) > at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239) > at > oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75) > at > oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043) > at > oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:) > at > oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353) > at > oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485) > at > oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566) > at > oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spar