[jira] [Commented] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567714#comment-15567714
 ] 

Apache Spark commented on SPARK-14804:
--

User 'apivovarov' has created a pull request for this issue:
https://github.com/apache/spark/pull/15447

> Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: 
> ---
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 
> (TID 13, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to 
> scala.Tuple2
>   at 
> com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:91)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> look at the code:
> {code}
>   private[spark] def computeOrReadCheckpoint(split: Partition, context: 
> TaskContext): Iterator[T] =
>   {
> if (isCheckpointedAndMaterialized) {
>   firstParent[T].iterator(split, context)
> } else {
>   compute(split, context)
> }
>   }
>  private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>  override def isCheckpointed: Boolean = {
>firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
>  }
> {code}
> For VertexRDD or EdgeRDD, the first parent is its partitionsRDD, i.e.
> RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])].
> 1. We call vertexRDD.checkpoint(); its partitionsRDD is checkpointed, so
> VertexRDD.isCheckpointedAndMaterialized becomes true.
> 2. We then call vertexRDD.iterator(); because the check is true, it calls
> firstParent.iterator (which is not a CheckpointRDD, but actually the partitionsRDD).
> So the returned iterator is an Iterator[ShippableVertexPartition], not the expected
> Iterator[(VertexId, VD)].
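For completeness, an editorial, self-contained version of the reproducer quoted above (the setup
code, names and checkpoint path are illustrative additions; Spark 1.6.x with GraphX is assumed):

{code}
// Editorial reproduction sketch, not part of the original report.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("SPARK-14804"))
sc.setCheckpointDir("/tmp/spark-checkpoints")   // illustrative path

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "e")))
val graph    = Graph(vertices, edges)

graph.vertices.checkpoint()
graph.vertices.count()              // materializes the checkpoint of partitionsRDD
graph.vertices.map(_._2).count()    // throws the ClassCastException described above on affected versions
{code}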



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17880.
-
   Resolution: Fixed
 Assignee: Kousuke Saruta
Fix Version/s: 2.1.0
   2.0.2

> The url linking to `AccumulatorV2` in the document is incorrect.
> 
>
> Key: SPARK-17880
> URL: https://issues.apache.org/jira/browse/SPARK-17880
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> In `programming-guide.md`, the url which links to `AccumulatorV2` says 
> `api/scala/index.html#org.apache.spark.AccumulatorV2` but 
> `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.






[jira] [Commented] (SPARK-17846) A bad state of Running Applications with spark standalone HA

2016-10-11 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567662#comment-15567662
 ] 

Saisai Shao commented on SPARK-17846:
-

I think this issue should be the same as SPARK-14262.

> A bad state of Running Applications with spark standalone HA 
> -
>
> Key: SPARK-17846
> URL: https://issues.apache.org/jira/browse/SPARK-17846
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: dylanzhou
>Priority: Critical
> Attachments: Problem screenshots.jpg
>
>
> I am using standalone mode. When I use HA (set up in either of the two ways), I found 
> the applications' state was "WAITING". Is this a bug?






[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567457#comment-15567457
 ] 

Cody Koeninger commented on SPARK-17344:


Given the choice between rewriting underlying kafka consumers and having a 
split codebase, I'd rather have a split codebase.  Of course I'd rather not 
sink development effort into an old version of kafka at all, until the 
structured stream for 0.10 is working for my use cases.

But if you want to wrap the 0.8 RDD in a structured stream, go for it; I'll help you 
figure out how to do it. Seriously. Don't expect larger project uptake, but if you just 
need something to work for you...

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Closed] (SPARK-17837) Disaster recovery of offsets from WAL

2016-10-11 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger closed SPARK-17837.
--
Resolution: Duplicate

Duplicate of SPARK-17829

> Disaster recovery of offsets from WAL
> -
>
> Key: SPARK-17837
> URL: https://issues.apache.org/jira/browse/SPARK-17837
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Cody Koeninger
>
> "The SQL offsets are stored in a WAL at $checkpointLocation/offsets/$batchId. 
> As reynold suggests though, we should change this to use a less opaque 
> format."






[jira] [Resolved] (SPARK-17720) Static configurations in SQL

2016-10-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17720.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15295
[https://github.com/apache/spark/pull/15295]

> Static configurations in SQL
> 
>
> Key: SPARK-17720
> URL: https://issues.apache.org/jira/browse/SPARK-17720
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>
> Spark SQL has two kinds of configuration parameters: dynamic configs and 
> static configs. Dynamic configs can be modified after Spark SQL is launched 
> (after the SparkSession is set up), whereas static configs are immutable once the 
> service starts.
> It would be useful to have this separation and to tell the user when they try 
> to set a static config after the service starts.
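As an editorial illustration of the split described above (a sketch assuming the behaviour this
ticket introduces; spark.sql.warehouse.dir is used as an example of a static config):

{code}
import org.apache.spark.sql.SparkSession

// Static configs must be set before the SparkSession (and its shared state) starts.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  // static: fixed for the service's lifetime
  .getOrCreate()

// Dynamic configs can be changed at any time on the live session.
spark.conf.set("spark.sql.shuffle.partitions", "64")

// With the separation this ticket adds, changing a static config on a live session
// should be rejected with a clear error rather than silently ignored:
// spark.conf.set("spark.sql.warehouse.dir", "/another/path")   // expected to fail
{code}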






[jira] [Assigned] (SPARK-17882) RBackendHandler swallowing errors

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17882:


Assignee: Apache Spark

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Assignee: Apache Spark
>Priority: Minor
>
> RBackendHandler is swallowing general exceptions in handleMethodCall, which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error:
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.






[jira] [Assigned] (SPARK-17882) RBackendHandler swallowing errors

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17882:


Assignee: (was: Apache Spark)

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Priority: Minor
>
> RBackendHandler is swallowing general exceptions in handleMethodCall, which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error:
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.






[jira] [Commented] (SPARK-17882) RBackendHandler swallowing errors

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567423#comment-15567423
 ] 

Apache Spark commented on SPARK-17882:
--

User 'jrshust' has created a pull request for this issue:
https://github.com/apache/spark/pull/15446

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Priority: Minor
>
> RBackendHandler is swallowing general exceptions in handleMethodCall, which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error:
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.






[jira] [Commented] (SPARK-17817) PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567418#comment-15567418
 ] 

Apache Spark commented on SPARK-17817:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15445

> PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
> ---
>
> Key: SPARK-17817
> URL: https://issues.apache.org/jira/browse/SPARK-17817
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Mike Dusenberry
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Calling {{repartition}} on a PySpark RDD to increase the number of partitions 
> results in highly skewed partition sizes, with most having 0 rows.  The 
> {{repartition}} method should evenly spread out the rows across the 
> partitions, and this behavior is correctly seen on the Scala side.
> Please reference the following code for a reproducible example of this issue:
> {code}
> # Python
> num_partitions = 2
> a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
> l = a.repartition(num_partitions).glom().map(len).collect()  # get length of each partition
> min(l), max(l), sum(l)/len(l), len(l)  # skewed!
>
> // Scala
> val numPartitions = 2
> val a = sc.parallelize(0 until 1e6.toInt, 2)  // start with 2 even partitions
> val l = a.repartition(numPartitions).glom().map(_.length).collect()  // get length of each partition
> print(l.min, l.max, l.sum/l.length, l.length)  // even!
> {code}
> The issue here is that highly skewed partitions can result in severe memory 
> pressure in subsequent steps of a processing pipeline, resulting in OOM 
> errors.






[jira] [Created] (SPARK-17882) RBackendHandler swallowing errors

2016-10-11 Thread James Shuster (JIRA)
James Shuster created SPARK-17882:
-

 Summary: RBackendHandler swallowing errors
 Key: SPARK-17882
 URL: https://issues.apache.org/jira/browse/SPARK-17882
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.1
Reporter: James Shuster
Priority: Minor


RBackendHandler is swallowing general exceptions in handleMethodCall, which 
makes it impossible to debug certain issues that happen when doing an 
invokeJava call.

In my case this was the following error:
java.lang.IllegalAccessException: Class org.apache.spark.api.r.RBackendHandler 
can not access a member of class with modifiers "public final"

The getCause message that is written back was basically blank.
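An editorial sketch of the kind of change the report implies (the helper below is illustrative,
not the actual RBackendHandler code): report the exception class and full stack trace back to the
R side instead of only getCause's message, which is blank for reflection failures like the
IllegalAccessException above.

{code}
import java.io.{PrintWriter, StringWriter}

object ErrorReporting {
  // Build a non-empty, debuggable description from any Throwable: class name,
  // message (if present), and the full stack trace. How the string is written back
  // to the R process is backend-specific and intentionally left out of this sketch.
  def describe(e: Throwable): String = {
    val sw = new StringWriter()
    e.printStackTrace(new PrintWriter(sw))
    s"${e.getClass.getName}: ${Option(e.getMessage).getOrElse("<no message>")}\n$sw"
  }
}
{code}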






[jira] [Comment Edited] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Jeremy Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567367#comment-15567367
 ] 

Jeremy Smith edited comment on SPARK-17344 at 10/12/16 2:56 AM:


{quote}
By contrast, writing a streaming source shim around the existing simple 
consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't have 
stuff like SSL, dynamic topics, or offset committing.
{quote}

Serious question: Would it be so bad to have a bifurcated codebase here? People 
who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for 
them, and are probably not all that concerned about the features you mentioned. 
In general, structured streaming already provides a lot of the capabilities 
that I for one am concerned about when using Kafka - offsets are tracked 
natively by SS, so offset committing isn't that big of a deal; in a CDH cluster 
specifically, you are probably using network-level security and aren't viewing 
the lack of SSL as a blocker; and finally you're already resigned to static 
topic subscriptions because that's what you're getting with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying 
technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same level, but at least you won't 
have to throw away all your checkpoints just to do something with a new topic 
in the same application. Currently, you have to do this, because the entire 
graph is stored in the checkpoints along with all the topics you're ever going 
to look at. Structured streaming at least gives you separate checkpoints per 
source, rather than for the entire StreamingContext.
* You're already unable to manually commit offsets; you either have to rewind 
to the beginning, or throw away everything from the past, or (as before) rely 
on the incredibly fragile StreamingContext checkpoints. Or, commit the 
topic/partition/offset to the sink so you can recover the actually processed 
messages from there. Again, decoupling each operation from the entire state of 
the StreamingContext is a huge step up, because you can actually upgrade your 
application code (at least in certain ways) without having to worry about 
re-processing stuff due to discarding the checkpoints.
* It will dramatically simplify the usage of Kafka from Spark in general. 9/10 
use cases involve some sort of structured data, the processing of which will 
have dramatically better performance when being used with tungsten than with 
RDD-level operations.

So if the simple-consumer based Kafka source would be so easy, at the expense 
of some features, why not introduce it? I have a tremendous amount of respect 
for the complexity of Kafka and the work you're doing with it, but I also get a 
sense that the conceptual "perfect" here is the enemy of the good. The weekend 
project you mentioned would result in a dramatic improvement in the experience 
for a large percentage of users who are currently using Spark and Kafka 
together. Most companies are using some kind of Hadoop distribution (e.g. HDP 
or CDH) and they are slow to update things like Kafka. HDP does have 0.10 (CDH 
doesn't), but at what rate are people actually able to update HDP? I don't have 
any data on it (ironically) but I'm guessing that 0.9 still represents a fairly 
significant portion of the Kafka install base.

Just my two cents on the matter.


was (Author: jeremyrsmith):
 > By contrast, writing a streaming source shim around the existing simple 
 > consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't 
 > have stuff like SSL, dynamic topics, or offset committing.

Serious question: Would it be so bad to have a bifurcated codebase here? People 
who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for 
them, and are probably not all that concerned about the features you mentioned. 
In general, structured streaming already provides a lot of the capabilities 
that I for one am concerned about when using Kafka - offsets are tracked 
natively by SS, so offset committing isn't that big of a deal; in a CDH cluster 
specifically, you are probably using network-level security and aren't viewing 
the lack of SSL as a blocker; and finally you're already resigned to static 
topic subscriptions because that's what you're getting with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying 
technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same level, but at least you won't 
have to throw away all your checkpoints just to do something with a new topic 
in the same application. Currently, you have to do this, because the entire 
graph is stored in the checkpoints along with all the topics you're ever going 
to look at. Structured streaming at least gives you separate checkpoints per 
source, rather than for the entire StreamingContext.

[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Jeremy Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567367#comment-15567367
 ] 

Jeremy Smith commented on SPARK-17344:
--

 > By contrast, writing a streaming source shim around the existing simple 
 > consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't 
 > have stuff like SSL, dynamic topics, or offset committing.

Serious question: Would it be so bad to have a bifurcated codebase here? People 
who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for 
them, and are probably not all that concerned about the features you mentioned. 
In general, structured streaming already provides a lot of the capabilities 
that I for one am concerned about when using Kafka - offsets are tracked 
natively by SS, so offset committing isn't that big of a deal; in a CDH cluster 
specifically, you are probably using network-level security and aren't viewing 
the lack of SSL as a blocker; and finally you're already resigned to static 
topic subscriptions because that's what you're getting with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying 
technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same level, but at least you won't 
have to throw away all your checkpoints just to do something with a new topic 
in the same application. Currently, you have to do this, because the entire 
graph is stored in the checkpoints along with all the topics you're ever going 
to look at. Structured streaming at least gives you separate checkpoints per 
source, rather than for the entire StreamingContext.
* You're already unable to manually commit offsets; you either have to rewind 
to the beginning, or throw away everything from the past, or (as before) rely 
on the incredibly fragile StreamingContext checkpoints. Or, commit the 
topic/partition/offset to the sink so you can recover the actually processed 
messages from there. Again, decoupling each operation from the entire state of 
the StreamingContext is a huge step up, because you can actually upgrade your 
application code (at least in certain ways) without having to worry about 
re-processing stuff due to discarding the checkpoints.
* It will dramatically simplify the usage of Kafka from Spark in general. 9/10 
use cases involve some sort of structured data, the processing of which will 
have dramatically better performance when being used with tungsten than with 
RDD-level operations.

So if the simple-consumer based Kafka source would be so easy, at the expense 
of some features, why not introduce it? I have a tremendous amount of respect 
for the complexity of Kafka and the work you're doing with it, but I also get a 
sense that the conceptual "perfect" here is the enemy of the good. The weekend 
project you mentioned would result in a dramatic improvement in the experience 
for a large percentage of users who are currently using Spark and Kafka 
together. Most companies are using some kind of Hadoop distribution (e.g. HDP 
or CDH) and they are slow to update things like Kafka. HDP does have 0.10 (CDH 
doesn't), but at what rate are people actually able to update HDP? I don't have 
any data on it (ironically) but I'm guessing that 0.9 still represents a fairly 
significant portion of the Kafka install base.

Just my two cents on the matter.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567362#comment-15567362
 ] 

Liwei Lin commented on SPARK-16845:
---

Thanks for the pointer, let me look into this. :-)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567353#comment-15567353
 ] 

Liwei Lin edited comment on SPARK-16845 at 10/12/16 2:51 AM:
-

[~dondrake] [~Utsumi] Could you provide a simple reproducer?


was (Author: lwlin):
-[~dondrake] [~Utsumi] Could you provide a simple reproducer?-

I've found a reproducer in SPARK-17092; I'll take a look at this, thanks!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567358#comment-15567358
 ] 

Don Drake commented on SPARK-16845:
---

I can't at the moment, mine is not simple.  

But this JIRA has one: https://issues.apache.org/jira/browse/SPARK-17092
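For readers without that context, an editorial sketch (in the spirit of the SPARK-17092 reproducer,
not the exact code from either ticket) of the kind of wide ordering that can push the generated
SpecificOrdering method past the JVM's 64 KB method-size limit:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sorting on many hundreds of columns makes the generated comparison method very large,
// which is the failure mode reported here; the column count needed to trip the limit
// varies by Spark version and data types.
val spark = SparkSession.builder().master("local[*]").appName("wide-sort").getOrCreate()
val numCols = 800
val wide = spark.range(1000)
  .select((0 until numCols).map(i => (col("id") + i).alias(s"c$i")): _*)
wide.sort(wide.columns.map(col): _*).count()   // may fail with "grows beyond 64 KB" on affected versions
{code}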

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567353#comment-15567353
 ] 

Liwei Lin edited comment on SPARK-16845 at 10/12/16 2:50 AM:
-

-[~dondrake] [~Utsumi] Could you provide a simple reproducer?-

I've found a reproducer in SPARK-17092; I'll take a look at this, thanks!


was (Author: lwlin):
[~dondrake] [~Utsumi] Could you provide a simple reproducer? I may help look 
into this, thanks!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567353#comment-15567353
 ] 

Liwei Lin commented on SPARK-16845:
---

[~dondrake] [~Utsumi] Could you provide a simple reproducer? I may help look 
into this, thanks!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Comment Edited] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567341#comment-15567341
 ] 

Leandro Ferrado edited comment on SPARK-11758 at 10/12/16 2:43 AM:
---

Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:  # Having the 2 following lines in the clause
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.


was (Author: leferrad):
Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        ## begin if clause ##
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
        ## end if clause ##
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...




[jira] [Commented] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567341#comment-15567341
 ] 

Leandro Ferrado commented on SPARK-11758:
-

Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        ## begin if clause ##
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
        ## end if clause ##
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...






[jira] [Comment Edited] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567329#comment-15567329
 ] 

Leandro Ferrado edited comment on SPARK-11758 at 10/12/16 2:39 AM:
---

Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        # begin if clause #
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
        # end if clause #
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.


was (Author: leferrad):
Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...




[jira] [Issue Comment Deleted] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leandro Ferrado updated SPARK-11758:

Comment: was deleted

(was: Hi Holden. First, I would add just a single line in order to avoid the bad 
conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a 
Date column into a LongInt column). The idea is to first convert all columns to string 
types, so that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime 
objects. However, that can only be done if we define the pyspark.sql.dataframe.DataFrame 
with a schema of strings, or if no schema was defined (in which case the function 
creates a schema of strings). So the modification only applies under the 'schema is 
None' condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        # begin if clause #
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
        # end if clause #
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.)

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...






[jira] [Commented] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567329#comment-15567329
 ] 

Leandro Ferrado commented on SPARK-11758:
-

Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...






[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567321#comment-15567321
 ] 

Michael Armbrust commented on SPARK-17344:
--

These are good questions.  A few thoughts:

bq. How long would it take CDH to distribute 0.10 if there was a compelling 
Spark client for it?

Even if they were going to release Kafka 0.10 in CDH yesterday, my experience 
is that it will take a long time for people to upgrade. We spent a fair 
amount of effort on multi-version compatibility for Hive in Spark SQL and it 
was a great boost for adoption. I think this could be the same thing.

bq. How are you going to handle SSL?  You can't avoid the complexity of caching 
consumers if you still want the benefits of prefetching, and doing an SSL 
handshake for every batch will kill performance if they aren't cached.

An option here would be to use the internal client directly.  This way we can 
leverage all the work that they did to support SSL, etc., yet make it speak 
specific versions of the protocol as we need. I did a [really rough 
prototype|https://gist.github.com/marmbrus/7d116b0a9672337497ddfccc0657dbf0] 
using the APIs described above and it is not that much code.  There is clearly 
a lot more we'd need to do, but I think we should strongly consider this option.

Caching connections to the specific brokers should probably still be 
implemented for the reasons you describe (and this is already handled by the 
internal client).  An advantage here is you'd actually be able to share 
connections across queries without running into correctness problems.
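To illustrate the consumer-caching point above (an editorial sketch against the standard
kafka-clients API, not Spark's connector code): keep one consumer per topic-partition alive on the
executor and reuse it across batches, so the (possibly SSL) handshake is paid once rather than per
batch.

{code}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.mutable

// Sketch only: kafkaParams is assumed to already carry bootstrap servers, deserializers
// and any SSL settings. Consumers are created lazily, manually assigned to a single
// partition, and reused across batches; offsets would be driven externally via seek().
object CachedConsumers {
  private val cache = mutable.Map.empty[TopicPartition, KafkaConsumer[Array[Byte], Array[Byte]]]

  def getOrCreate(tp: TopicPartition, kafkaParams: Properties): KafkaConsumer[Array[Byte], Array[Byte]] =
    synchronized {
      cache.getOrElseUpdate(tp, {
        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](kafkaParams)
        consumer.assign(Collections.singletonList(tp))
        consumer
      })
    }
}
{code}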

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Comment Edited] (SPARK-12484) DataFrame withColumn() does not work in Java

2016-10-11 Thread Ryan Brant (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567271#comment-15567271
 ] 

Ryan Brant edited comment on SPARK-12484 at 10/12/16 2:04 AM:
--

Was there a resolution to this?  I am also getting this issue in Scala.  I am 
currently using Spark 2.0


was (Author: brantrm):
Was there a resolution to this?  I am also getting this issue in Scala.

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)






[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2016-10-11 Thread Ryan Brant (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567271#comment-15567271
 ] 

Ryan Brant commented on SPARK-12484:


Was there a resolution to this?  I am also getting this issue in Scala.
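One common cause of this error, offered here as an editorial note rather than a confirmed
diagnosis of this ticket: the new Column was built against a different DataFrame than the one
withColumn is called on, so its resolved attribute is missing from the target plan. A minimal
sketch of the usual working pattern (illustrative names; a SparkSession named spark is assumed):

{code}
import org.apache.spark.sql.functions.udf

// Hypothetical data and UDF, only to show the pattern.
val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "labelStr")
val transform = udf((s: String) => s.toUpperCase)

// Build the new column from the same DataFrame it is added to.
val transformed = df.withColumn("transformedByUDF", transform(df("labelStr")))
transformed.show()
{code}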

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17870:


Assignee: Apache Spark

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Assignee: Apache Spark
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567244#comment-15567244
 ] 

Apache Spark commented on SPARK-17870:
--

User 'mpjlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/15444

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17870:


Assignee: (was: Apache Spark)

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17881) Aggregation function for generating string histograms

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567226#comment-15567226
 ] 

Apache Spark commented on SPARK-17881:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15443

> Aggregation function for generating string histograms
> -
>
> Key: SPARK-17881
> URL: https://issues.apache.org/jira/browse/SPARK-17881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This agg function generates equi-width histograms for string type columns, 
> with a maximum number of histogram bins. It returns an empty result if the 
> ndv (number of distinct values) of the column exceeds the maximum number 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17881) Aggregation function for generating string histograms

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17881:


Assignee: Apache Spark

> Aggregation function for generating string histograms
> -
>
> Key: SPARK-17881
> URL: https://issues.apache.org/jira/browse/SPARK-17881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> This agg function generates equi-width histograms for string type columns, 
> with a maximum number of histogram bins. It returns an empty result if the 
> ndv (number of distinct values) of the column exceeds the maximum number 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17881) Aggregation function for generating string histograms

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17881:


Assignee: (was: Apache Spark)

> Aggregation function for generating string histograms
> -
>
> Key: SPARK-17881
> URL: https://issues.apache.org/jira/browse/SPARK-17881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This agg function generates equi-width histograms for string type columns, 
> with a maximum number of histogram bins. It returns an empty result if the 
> ndv (number of distinct values) of the column exceeds the maximum number 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17881) Aggregation function for generating string histograms

2016-10-11 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-17881:


 Summary: Aggregation function for generating string histograms
 Key: SPARK-17881
 URL: https://issues.apache.org/jira/browse/SPARK-17881
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhenhua Wang


This agg function generates equi-width histograms for string type columns, with 
a maximum number of histogram bins. It returns an empty result if the ndv 
(number of distinct values) of the column exceeds the maximum number allowed.
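
To make the intended behaviour concrete, here is a rough sketch only (the function 
name and return type are assumptions, not the proposed implementation): count the 
occurrences of each distinct string and give up with an empty result once the number 
of distinct values exceeds the configured maximum number of bins.

{code}
import org.apache.spark.sql.DataFrame

// Hedged sketch, not the actual aggregate: a string "histogram" as a
// value -> count map, empty when ndv exceeds maxBins.
def stringHistogram(df: DataFrame, column: String, maxBins: Int): Map[String, Long] = {
  val counts = df.groupBy(column).count().collect()
  if (counts.length > maxBins) Map.empty[String, Long]
  else counts.map(r => r.getString(0) -> r.getLong(1)).toMap
}
{code}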



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567199#comment-15567199
 ] 

Apache Spark commented on SPARK-17853:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/15442

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed an OffsetOutOfRangeException 
> reported by the Kafka client. In our scenario we create a single DStream as a 
> union of multiple DStreams, one DStream per Kafka cluster (multi-DC solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After a quick investigation, I found that the class DirectKafkaInputDStream 
> keeps offset state per topic and partition, but it is not aware of the 
> different Kafka clusters.
> For every topic, a single DStream is created as a union over all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
>   def createSource(ssc: StreamingContext, topic: String): DStream[(String, Array[Byte])] = {
>     val streams = configs.map { config =>
>       val kafkaParams = config
>       val kafkaTopics = Set(topic)
>       KafkaUtils.createDirectStream[String, Array[Byte]](
>         ssc,
>         LocationStrategies.PreferConsistent,
>         ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, kafkaParams)
>       ).map { record =>
>         (record.key, record.value)
>       }
>     }
>     ssc.union(streams.toSeq)
>   }
> }
> {code}
> In the end, offsets from one Kafka cluster overwrite offsets from the other. 
> Fortunately an OffsetOutOfRangeException was thrown, because the offsets in the 
> two Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17853:


Assignee: (was: Apache Spark)

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed an OffsetOutOfRangeException 
> reported by the Kafka client. In our scenario we create a single DStream as a 
> union of multiple DStreams, one DStream per Kafka cluster (multi-DC solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After a quick investigation, I found that the class DirectKafkaInputDStream 
> keeps offset state per topic and partition, but it is not aware of the 
> different Kafka clusters.
> For every topic, a single DStream is created as a union over all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
>   def createSource(ssc: StreamingContext, topic: String): DStream[(String, Array[Byte])] = {
>     val streams = configs.map { config =>
>       val kafkaParams = config
>       val kafkaTopics = Set(topic)
>       KafkaUtils.createDirectStream[String, Array[Byte]](
>         ssc,
>         LocationStrategies.PreferConsistent,
>         ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, kafkaParams)
>       ).map { record =>
>         (record.key, record.value)
>       }
>     }
>     ssc.union(streams.toSeq)
>   }
> }
> {code}
> In the end, offsets from one Kafka cluster overwrite offsets from the other. 
> Fortunately an OffsetOutOfRangeException was thrown, because the offsets in the 
> two Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17853:


Assignee: Apache Spark

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>Assignee: Apache Spark
>
> During migration from Spark 1.6 to 2.0 I observed an OffsetOutOfRangeException 
> reported by the Kafka client. In our scenario we create a single DStream as a 
> union of multiple DStreams, one DStream per Kafka cluster (multi-DC solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After a quick investigation, I found that the class DirectKafkaInputDStream 
> keeps offset state per topic and partition, but it is not aware of the 
> different Kafka clusters.
> For every topic, a single DStream is created as a union over all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
>   def createSource(ssc: StreamingContext, topic: String): DStream[(String, Array[Byte])] = {
>     val streams = configs.map { config =>
>       val kafkaParams = config
>       val kafkaTopics = Set(topic)
>       KafkaUtils.createDirectStream[String, Array[Byte]](
>         ssc,
>         LocationStrategies.PreferConsistent,
>         ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, kafkaParams)
>       ).map { record =>
>         (record.key, record.value)
>       }
>     }
>     ssc.union(streams.toSeq)
>   }
> }
> {code}
> In the end, offsets from one Kafka cluster overwrite offsets from the other. 
> Fortunately an OffsetOutOfRangeException was thrown, because the offsets in the 
> two Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567180#comment-15567180
 ] 

Hossein Falaki commented on SPARK-17878:


Sure. If passing a list is possible, it is the better choice. I just don't want 
to block this feature on an API change in Spark SQL.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567170#comment-15567170
 ] 

Peng Meng commented on SPARK-17870:
---

Hi [~avulanov], the question here is not whether to use raw chi2 scores or 
p-values; the question is that if we use raw chi2 scores, the DoF should be the 
same. "The chi2 test is used multiple times" is another problem. According to 
http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html, 
"whenever a statistical test is used multiple times, then the probability of 
getting at least one error increases." This problem is partially solved by 
selecting the p-values corresponding to the family-wise error rate (SelectFwe, 
SPARK-17645). Thanks very much.

Hi [~srowen], I totally agree with your comments. Since the DoF differs across 
the chi-square values Spark computes, we can use the p-values for Spark's 
SelectKBest and SelectPercentile. Thanks very much.

I will submit a PR for this.

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567149#comment-15567149
 ] 

Hyukjin Kwon commented on SPARK-17878:
--

BTW, I will try to investigate further whether it is really possible (setting a 
list as the value in options) if you think both ideas are okay.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567124#comment-15567124
 ] 

Hyukjin Kwon edited comment on SPARK-17878 at 10/12/16 12:50 AM:
-

Oh, I didn't mean that I am against this. I am just wondering whether it is 
possible to deal with this in a more general way. If that is not easy for now, 
I'd rather support this idea, assuming we should deal with this problem. 
(Actually, one of the votes is from me :))


was (Author: hyukjin.kwon):
Oh, I didn't mean I am against this. I am just wondering if it is just possible 
to deal with this in general. If it is not easy for now, I support this idea.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567124#comment-15567124
 ] 

Hyukjin Kwon commented on SPARK-17878:
--

Oh, I didn't mean I am against this. I am just wondering if it is just possible 
to deal with this in general. If it is not easy for now, I support this idea.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567104#comment-15567104
 ] 

Hossein Falaki commented on SPARK-17878:


That would require an API change in Spark SQL. Otherwise, we would need to split 
the string passed to {{nullValue}} on a delimiter. Neither of these is a safe 
choice. I think accepting {{nullValue1}}, {{nullValue2}}, etc. (along with the 
existing {{nullValue}}) is:
* backwards compatible
* clear
* extensible to other options in the future, e.g. quoteCharacter.
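
A rough sketch of the option-collection side, purely to illustrate the idea (none of 
these names are the real {{CSVOptions}} internals):

{code}
// Hedged sketch: gather "nullValue", "nullValue1", "nullValue2", ... into a
// set of null markers; a cell is then treated as null if it matches any of them.
val NullValueOption = "nullValue(\\d*)".r

def collectNullValues(options: Map[String, String]): Set[String] =
  options.collect { case (NullValueOption(_), value) => value }.toSet

def isNullCell(cell: String, nullValues: Set[String]): Boolean =
  nullValues.contains(cell)
{code}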

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567046#comment-15567046
 ] 

Hyukjin Kwon commented on SPARK-17878:
--

Maybe it'd be nicer if options allowed a list or a nested map (if possible). I 
noticed that {{read.csv}} in R can take {{na.strings}} as a list, which is 
currently being mapped to {{nullValue}} as a string.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4411) Add "kill" link for jobs in the UI

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567028#comment-15567028
 ] 

Apache Spark commented on SPARK-4411:
-

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/15441

> Add "kill" link for jobs in the UI
> --
>
> Key: SPARK-4411
> URL: https://issues.apache.org/jira/browse/SPARK-4411
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Kay Ousterhout
>
> SPARK-4145 changes the default landing page for the UI to show jobs. We 
> should have a "kill" link for each job, similar to what we have for each 
> stage, so it's easier for users to kill slow jobs (and the semantics of 
> killing a job are slightly different than killing a stage).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17880:


Assignee: (was: Apache Spark)

> The url linking to `AccumulatorV2` in the document is incorrect.
> 
>
> Key: SPARK-17880
> URL: https://issues.apache.org/jira/browse/SPARK-17880
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In `programming-guide.md`, the URL that links to `AccumulatorV2` points to 
> `api/scala/index.html#org.apache.spark.AccumulatorV2`, but 
> `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is the correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566936#comment-15566936
 ] 

Apache Spark commented on SPARK-17880:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/15439

> The url linking to `AccumulatorV2` in the document is incorrect.
> 
>
> Key: SPARK-17880
> URL: https://issues.apache.org/jira/browse/SPARK-17880
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In `programming-guide.md`, the URL that links to `AccumulatorV2` points to 
> `api/scala/index.html#org.apache.spark.AccumulatorV2`, but 
> `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is the correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17880:


Assignee: Apache Spark

> The url linking to `AccumulatorV2` in the document is incorrect.
> 
>
> Key: SPARK-17880
> URL: https://issues.apache.org/jira/browse/SPARK-17880
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In `programming-guide.md`, the URL that links to `AccumulatorV2` points to 
> `api/scala/index.html#org.apache.spark.AccumulatorV2`, but 
> `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is the correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-17880:
--

 Summary: The url linking to `AccumulatorV2` in the document is 
incorrect.
 Key: SPARK-17880
 URL: https://issues.apache.org/jira/browse/SPARK-17880
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Kousuke Saruta
Priority: Minor


In `programming-guide.md`, the URL that links to `AccumulatorV2` points to 
`api/scala/index.html#org.apache.spark.AccumulatorV2`, but 
`api/scala/index.html#org.apache.spark.util.AccumulatorV2` is the correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-10-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566858#comment-15566858
 ] 

Davies Liu commented on SPARK-15621:


[~rezasafi] We usually do not backport this kind of improvement; it's too large 
and risky for a maintenance release (2.0.x). Sorry for that.

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Assignee: Davies Liu
> Fix For: 2.1.0
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with the 
> identity UDF `lambda x: x`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-10-11 Thread Reza Safi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566752#comment-15566752
 ] 

Reza Safi commented on SPARK-15621:
---

Hi [~davies], can the fix be backported to branch-2.0, since it affects version 
2.0.0 as well? Thank you very much in advance.

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Assignee: Davies Liu
> Fix For: 2.1.0
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with the 
> identity UDF `lambda x: x`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566741#comment-15566741
 ] 

K commented on SPARK-16845:
---

We manually rewrote the parts that were throwing errors (StringIndexer and 
FeatureAssembler) using the RDD API and then converted the result back to a 
DataFrame to run RandomForestClassifier.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the fatal error below occurs.
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf

2016-10-11 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17387.

   Resolution: Fixed
 Assignee: Jeff Zhang
Fix Version/s: 2.1.0

> Creating SparkContext() from python without spark-submit ignores user conf
> --
>
> Key: SPARK-17387
> URL: https://issues.apache.org/jira/browse/SPARK-17387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Consider the following scenario: user runs a python application not through 
> spark-submit, but by adding the pyspark module and manually creating a Spark 
> context. Kinda like this:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext
> >>> from pyspark import SparkConf
> >>> conf = SparkConf().set("spark.driver.memory", "4g")
> >>> sc = SparkContext(conf=conf)
> {noformat}
> If you look at the JVM launched by the pyspark code, it ignores the user's 
> configuration:
> {noformat}
> $ ps ax | grep $(pgrep -f SparkSubmit)
> 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g 
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
> {noformat}
> Note the "1g" of memory. If instead you use "pyspark", you get the correct 
> "4g" in the JVM.
> This also affects other configs; for example, you can't really add jars to 
> the driver's classpath using "spark.jars".
> You can work around this by setting the undocumented env variable Spark 
> itself uses:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf 
> >>> spark.driver.memory=4g"
> >>> from pyspark import SparkContext
> >>> sc = SparkContext()
> {noformat}
> But it would be nicer if the configs were automatically propagated.
> BTW the reason for this is that the {{launch_gateway}} function used to start 
> the JVM does not take any parameters, and the only place where it reads 
> arguments for Spark is that env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false,   /tmp/check still contains only 1 
folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false,   /tmp/check/ contains only 1 
folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false,   /tmp/check still contains only 
> 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9879) OOM in LIMIT clause with large number

2016-10-11 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566639#comment-15566639
 ] 

Dongjoon Hyun commented on SPARK-9879:
--

Hi, all.
The PR seems to have been closed last December.
Can we close this issue if it is no longer reproducible?

> OOM in LIMIT clause with large number
> -
>
> Key: SPARK-9879
> URL: https://issues.apache.org/jira/browse/SPARK-9879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> {code}
> create table spark.tablsetest as select * from dpa_ord_bill_tf order by 
> member_id limit 2000;
> {code}
>  
> {code}
> spark-sql --driver-memory 48g --executor-memory 24g --driver-java-options 
> -XX:PermSize=1024M -XX:MaxPermSize=2048M
> Error logs
> 15/07/27 10:22:43 ERROR ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-20]shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
> at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
> at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.close(Output.java:165)
> at 
> org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)
> at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:139)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$writeObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:65)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition.writeObject(ParallelCollectionRDD.scala:51)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
> at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:168)
> at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:467)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:231)
> 15/07/27 10:22:43 ERROR ErrorMonitor: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-20]shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
> at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
> at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.close(Output.java:165)
> at 
> org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)
> at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:139)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$writeObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:65)
> at 

[jira] [Created] (SPARK-17879) Don't compact metadata logs constantly into a single compacted file

2016-10-11 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17879:
---

 Summary: Don't compact metadata logs constantly into a single 
compacted file
 Key: SPARK-17879
 URL: https://issues.apache.org/jira/browse/SPARK-17879
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Affects Versions: 2.0.1
Reporter: Burak Yavuz


With metadata log compaction, we compact all files into a single file every "n" 
batches. The problem is that, over time, this single file becomes huge, and 
constantly writing it out in the driver could become an issue.

It would be a good idea to cap the compacted file size, so that we don't end up 
writing huge files in the driver.
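
To make the proposal concrete, a hedged sketch of the decision logic only (the names 
and structure are assumptions, not the actual compaction code): compact on the usual 
interval, but roll over to a fresh compact file once the previous one has grown past 
a size cap.

{code}
// Hedged sketch: cap the size of the compacted metadata file.
def shouldStartNewCompactFile(batchId: Long,
                              compactInterval: Int,
                              currentCompactBytes: Long,
                              maxCompactBytes: Long): Boolean =
  batchId % compactInterval == 0 && currentCompactBytes >= maxCompactBytes
{code}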



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566586#comment-15566586
 ] 

Sean Owen commented on SPARK-17870:
---

If the degrees of freedom are the same across the tests, then ranking on 
p-value or statistic should give the same ranking because the p-value is a 
monotonically decreasing function of the statistic. That's the case in what the 
scikit code is effectively doing because there are always (# label classes - 1) 
degrees of freedom. Really the p-value is the comparable quantity, but there's 
no point computing it in this case because it's just for ranking.

The Spark code performs a chi-squared test but applies it to answer a different 
question, where DOF is no longer the same; it's (# label classes - 1) * (# 
feature classes - 1) in the contingency table here. p-value is no longer always 
smaller when the statistic is larger. So it's necessary to actually use the 
p-values for what Spark is doing.
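
As a purely illustrative sketch (toy data and an existing SparkContext {{sc}} are 
assumptions; nothing here is from the ticket), ranking features by p-value with the 
existing MLlib API looks like this; the point is only that the sort key is 
{{pValue}} rather than the raw {{statistic}}:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Toy data: Statistics.chiSqTest returns one ChiSqTestResult per feature.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 8.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 8.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 9.0))))

val results = Statistics.chiSqTest(data)

// Rank feature indices by p-value (smaller is better) instead of by raw
// statistic, since the degrees of freedom can differ from feature to feature.
val rankedFeatures = results.zipWithIndex.sortBy { case (r, _) => r.pValue }.map(_._2)
{code}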

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17878:
--

 Summary: Support for multiple null values when reading CSV data
 Key: SPARK-17878
 URL: https://issues.apache.org/jira/browse/SPARK-17878
 Project: Spark
  Issue Type: Story
  Components: SQL
Affects Versions: 2.0.1
Reporter: Hossein Falaki


There are CSV files out there with multiple values that are supposed to be 
interpreted as null. As a result, multiple Spark users have asked for this 
feature to be built into the CSV data source. It can easily be implemented in a 
backwards-compatible way:

- Currently the CSV data source supports an option named {{nullValue}}.
- We can add logic in {{CSVOptions}} to understand option names that match 
{{nullValue[\d]}}. This way the user can specify one or multiple null values.

{code}
val df = spark.read.format("CSV").option("nullValue1", 
"-").option("nullValue2", "*")
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2016-10-11 Thread Nic Eggert (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Eggert updated SPARK-17455:
---
Priority: Major  (was: Minor)

> IsotonicRegression takes non-polynomial time for some inputs
> 
>
> Key: SPARK-17455
> URL: https://issues.apache.org/jira/browse/SPARK-17455
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
>Reporter: Nic Eggert
>
> The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently 
> in MLlib can take O(N!) time for certain inputs, when it should have 
> worst-case complexity of O(N^2).
> To reproduce this, I pulled the private method poolAdjacentViolators out of 
> mllib.regression.IsotonicRegression and into a benchmarking harness.
> Given this input
> {code}
> val x = (1 to length).toArray.map(_.toDouble)
> val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 
> else yi}
> val w = Array.fill(length)(1d)
> val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, 
> x), w) => (y, x, w)}
> {code}
> I vary the length of the input to get these timings:
> || Input Length || Time (us) ||
> | 100 | 1.35 |
> | 200 | 3.14 | 
> | 400 | 116.10 |
> | 800 | 2134225.90 |
> (tests were performed using 
> https://github.com/sirthias/scala-benchmarking-template)
> I can also confirm that I run into this issue on a real dataset I'm working 
> on when trying to calibrate random forest probability output. Some partitions 
> take > 12 hours to run. This isn't a skew issue, since the largest partitions 
> finish in minutes. I can only assume that some partitions cause something 
> approaching this worst-case complexity.
> I'm working on a patch that borrows the implementation that is used in 
> scikit-learn and the R "iso" package, both of which handle this particular 
> input in linear time and are quadratic in the worst case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566541#comment-15566541
 ] 

Hossein Falaki commented on SPARK-17781:


[~shivaram] Thanks for looking into it. I think the problem applies to 
{{dapply}} as well. For example this fails:
{code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> collect(dapply(df, function(x) {data.frame(res = x$date)}, schema = 
> structType(structField("res", "date"))))
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 52.0 failed 4 times, most recent failure: Lost task 0.3 in stage 52.0 
(TID 10114, 10.0.229.211): java.lang.RuntimeException: java.lang.Double is not 
a valid external type for schema of date
{code}

I spent a few hours getting to the root of it. We have the correct type all the 
way until {{readList}} in {{deserialize.R}}. I instrumented that function. We 
get the correct type from {{readObject()}} but once it is placed in the list it 
loses its type.

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17863) SELECT distinct does not work if there is a order by clause

2016-10-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17863:
-
Description: 
{code}
select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by struct.a, struct.b
{code}
This query generates
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
{code}
The plan is wrong because the analyzer somehow added {{struct#21805}} to the 
project list, which changes the semantics of the distinct (basically, the query 
is changed from {{select distinct struct.a, struct.b}} to {{select distinct 
struct.a, struct.b, struct}}).
{code}
== Parsed Logical Plan ==
'Sort ['struct.a ASC, 'struct.b ASC], true
+- 'Distinct
   +- 'Project ['struct.a, 'struct.b]
  +- 'SubqueryAlias tmp
 +- 'Union
:- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
:  +- OneRowRelation$
+- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
   +- OneRowRelation$

== Analyzed Logical Plan ==
a: int, b: int
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Distinct
  +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
struct#21805]
 +- SubqueryAlias tmp
+- Union
   :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
   :  +- OneRowRelation$
   +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
  +- OneRowRelation$

== Optimized Logical Plan ==
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
struct#21805]
  +- Union
 :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
 :  +- OneRowRelation$
 +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
+- OneRowRelation$

== Physical Plan ==
*Project [a#21819, b#21820]
+- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
   +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
  +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
output=[a#21819, b#21820, struct#21805])
 +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
+- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
functions=[], output=[a#21819, b#21820, struct#21805])
   +- Union
  :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
struct#21805]
  :  +- Scan OneRowRelation[]
  +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
struct#21806]
 +- Scan OneRowRelation[]
{code}
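An untested workaround sketch (an assumption, not verified against this bug): compute 
the distinct over plain columns in a subquery, so the outer {{order by}} never 
references the struct directly:
{code}
val workaround = spark.sql("""
  select a, b from (
    select distinct struct.a as a, struct.b as b
    from (
      select named_struct('a', 1, 'b', 2, 'c', 3) as struct
      union all
      select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
  ) t
  order by a, b
""")
{code}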

  was:
{code}
select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by struct.a, struct.b
{code}
This query generates
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
{code}
The plan is wrong
{code}
== Parsed Logical Plan ==
'Sort ['struct.a ASC, 'struct.b ASC], true
+- 'Distinct
   +- 'Project ['struct.a, 'struct.b]
  +- 'SubqueryAlias tmp
 +- 'Union
:- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
:  +- OneRowRelation$
+- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
   +- OneRowRelation$

== Analyzed Logical Plan ==
a: int, b: int
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Distinct
  +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
struct#21805]
 +- SubqueryAlias tmp
+- Union
   :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
   :  +- OneRowRelation$
   +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
  +- OneRowRelation$

== Optimized Logical Plan ==
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
struct#21805]
  +- Union
 :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
 :  +- OneRowRelation$
 +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
+- OneRowRelation$

== Physical Plan ==
*Project [a#21819, b#21820]
+- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
   +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
  +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
output=[a#21819, b#21820, struct#21805])
 +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
+- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 

[jira] [Comment Edited] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566515#comment-15566515
 ] 

Harish edited comment on SPARK-17463 at 10/11/16 8:34 PM:
--

My second approach was:
def testfunc(keys, vals, columnsToStandardize):
   df= pd.DataFrame(vals, columns = keys)
   df[columnsToStandardize] = df[columnsToStandardize] - 
df[columnsToStandardize].mean() 

df3.rdd.map(keys).groupByKey().flatMap(lambda keyval: testfunc(keys[0], 
keys[1], columnsToStandardize))


was (Author: harishk15):
My second approach was:
def testfunc(keys, vals, columnsToStandardize):
   df= pd.DataFrame(vals, columns = keys)
   df[columnsToStandardize] = df[columnsToStandardize] - 
df[columnsToStandardize].mean() 

df3.rdd.map(keys).groupByKey().flatMap(lambda keyval: testfunc(keys[0], 
keys[1], columnsToStandardize))

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> 

[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566515#comment-15566515
 ] 

Harish commented on SPARK-17463:


My second approach was:
def testfunc(keys, vals, columnsToStandardize):
   df= pd.DataFrame(vals, columns = keys)
   df[columnsToStandardize] = df[columnsToStandardize] - 
df[columnsToStandardize].mean() 

df3.rdd.map(keys).groupByKey().flatMap(lambda keyval: testfunc(keys[0], 
keys[1], columnsToStandardize))

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> 

[jira] [Assigned] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17845:


Assignee: Apache Spark  (was: Reynold Xin)

> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from a SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicit specification 
> of begin/end even in the case of an unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}
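A purely hypothetical sketch of how Option 2's decoupled boundaries might be modeled 
(this is not Spark's actual API; each call narrows one end of an otherwise unbounded 
frame):
{code}
final case class RowFrame(start: Long = Long.MinValue, end: Long = Long.MaxValue) {
  def rowsFrom(n: Long): RowFrame = copy(start = n) // frame start, relative to the current row
  def rowsTo(n: Long): RowFrame   = copy(end = n)   // frame end, relative to the current row
  def rowsFromCurrent(): RowFrame = rowsFrom(0L)
  def rowsToCurrent(): RowFrame   = rowsTo(0L)
}

object RowFrame {
  def rowsFrom(n: Long): RowFrame = RowFrame().rowsFrom(n)
  def rowsTo(n: Long): RowFrame   = RowFrame().rowsTo(n)
}

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
val between = RowFrame.rowsFrom(-3).rowsTo(3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
val upToThreePreceding = RowFrame.rowsTo(-3)
{code}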



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17845:


Assignee: Reynold Xin  (was: Apache Spark)

> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from a SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicit specification 
> of begin/end even in the case of an unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566509#comment-15566509
 ] 

Apache Spark commented on SPARK-17845:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15438

> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from a SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicit specification 
> of begin/end even in the case of an unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17857) SHOW TABLES IN schema throws exception if schema doesn't exist

2016-10-11 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-17857.
-
Resolution: Not A Problem

Although the behavior changed from 1.x, it is better to close this issue as 
'NOT A PROBLEM' for Spark 2.x. I'm closing it now.

You can reopen it if you need to.
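Since the 2.x behavior stays, callers that want the old 1.x-style empty result can 
guard the query themselves. A sketch of such a client-side check (an assumption about 
caller code, not a Spark change):
{code}
import org.apache.spark.sql.Row

val dbName = "badschema"
val exists = spark.catalog.listDatabases().collect().exists(_.name == dbName)
val tables: Array[Row] =
  if (exists) spark.sql(s"SHOW TABLES IN $dbName").collect() else Array.empty[Row]
{code}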

> SHOW TABLES IN schema throws exception if schema doesn't exist
> --
>
> Key: SPARK-17857
> URL: https://issues.apache.org/jira/browse/SPARK-17857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Todd Nemet
>Priority: Minor
>
> SHOW TABLES IN badschema; throws 
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException if badschema 
> doesn't exist. In Spark 1.x it would return an empty result set.
> On Spark 2.0.1:
> {code}
> [683|12:45:56] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/ -n hive
> Connecting to jdbc:hive2://localhost:10006/
> 16/10/10 12:46:00 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/10 12:46:00 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/10 12:46:00 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1.spark2 by Apache Hive
> 0: jdbc:hive2://localhost:10006/> show schemas;
> +---+--+
> | databaseName  |
> +---+--+
> | default   |
> | looker_scratch|
> | spark_jira|
> | spark_looker_scratch  |
> | spark_looker_test |
> +---+--+
> 5 rows selected (0.61 seconds)
> 0: jdbc:hive2://localhost:10006/> show tables in spark_looker_test;
> +--+--+--+
> |  tableName   | isTemporary  |
> +--+--+--+
> | all_types| false|
> | order_items  | false|
> | orders   | false|
> | users| false|
> +--+--+--+
> 4 rows selected (0.611 seconds)
> 0: jdbc:hive2://localhost:10006/> show tables in badschema;
> Error: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: 
> Database 'badschema' not found; (state=,code=0)
> {code}
> On Spark 1.6.2:
> {code}
> [680|12:47:26] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10005/ -n hive
> Connecting to jdbc:hive2://localhost:10005/
> 16/10/10 12:47:29 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/10 12:47:29 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/10 12:47:30 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1.spark2 by Apache Hive
> 0: jdbc:hive2://localhost:10005/> show schemas;
> ++--+
> |   result   |
> ++--+
> | default|
> | spark_jira |
> | spark_looker_test  |
> | spark_scratch  |
> ++--+
> 4 rows selected (0.613 seconds)
> 0: jdbc:hive2://localhost:10005/> show tables in spark_looker_test;
> +--+--+--+
> |  tableName   | isTemporary  |
> +--+--+--+
> | all_types| false|
> | order_items  | false|
> | orders   | false|
> | users| false|
> +--+--+--+
> 4 rows selected (0.575 seconds)
> 0: jdbc:hive2://localhost:10005/> show tables in badschema;
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> ++--+--+
> No rows selected (0.458 seconds)
> {code}
> [Relevant part of Hive QL 
> docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566467#comment-15566467
 ] 

Alexander Ulanov commented on SPARK-17870:
--

[`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)
works with "a Function taking two arrays X and y, and returning a pair of 
arrays (scores, pvalues) or a single array with scores". According to what you 
observe, it uses pvalues for sorting the `chi2` outputs. Indeed, that is the case 
for all functions that return two arrays: 
https://github.com/scikit-learn/scikit-learn/blob/412996f/sklearn/feature_selection/univariate_selection.py#L331.
Alternatively, one can use the raw `chi2` scores for sorting, by passing only 
the first array from `chi2` to `SelectKBest`. As far as I remember, using raw 
chi2 scores is the default in Weka's 
[ChiSquaredAttributeEval](http://weka.sourceforge.net/doc.stable/weka/attributeSelection/ChiSquaredAttributeEval.html).
So, I would not claim that either approach is incorrect. According to 
[Introduction to 
IR](http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html),
there might be an issue with computing p-values, because the chi-squared test is 
used multiple times. Using plain chi2 values does not involve a statistical test, so 
it can be treated as just a ranking with no statistical implications.
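To make the difference concrete, here is a small sketch of the two rankings on the 
MLlib side (a usage illustration only, not the {{ChiSqSelector}} internals): sorting 
by the raw statistic versus by the p-value, which can disagree when features have 
different degrees of freedom:
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

def topKByStatistic(data: RDD[LabeledPoint], k: Int): Array[Int] =
  Statistics.chiSqTest(data).zipWithIndex
    .sortBy { case (res, _) => -res.statistic }.take(k).map(_._2)

def topKByPValue(data: RDD[LabeledPoint], k: Int): Array[Int] =
  Statistics.chiSqTest(data).zipWithIndex
    .sortBy { case (res, _) => res.pValue }.take(k).map(_._2)
{code}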

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to count ChiSqureTestResult in mllib/feature/ChiSqSelector.scala  
> (line 233) is wrong.
> For feature selection method ChiSquareSelector, it is based on the 
> ChiSquareTestResult.statistic (ChiSqure value) to select the features. It 
> select the features with the largest ChiSqure value. But the Degree of 
> Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and 
> for different df, you cannot base on ChiSqure value to select features.
> Because of the wrong method to count ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If use selectKBest to select: the feature 3 will be selected.
> If use selectFpr to select: feature 1 and 2 will be selected. 
> This is strange. 
> I use scikit learn to test the same data with the same parameters. 
> When use selectKBest to select: feature 1 will be selected. 
> When use selectFpr to select: feature 1 and 2 will be selected. 
> This result is make sense. because the df of each feature in scikit learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566457#comment-15566457
 ] 

Shivaram Venkataraman commented on SPARK-17781:
---

[~falaki] I looked at this a bit more today and it looks like this problem is 
specific to dapplyCollect -- what happens here is that we don't have the schema 
for the output table, so we serialize it as a byteArray [1] and rely on the 
driver to do the conversion / deserialization while running collect. I couldn't 
trace this part to the end, but it looks like this gets deserialized in [2], and 
the call to unserialize there interprets the bytes as double instead of date. 
I'm not sure what a good fix for this would be, either.


[1] https://github.com/apache/spark/blob/master/R/pkg/inst/worker/worker.R#L75
[2] https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L1431

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11784) enable Timestamp filter pushdown

2016-10-11 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566439#comment-15566439
 ] 

Ian commented on SPARK-11784:
-

Yes, I meant TimestampType filter pushdown

> enable Timestamp filter pushdown
> 
>
> Key: SPARK-11784
> URL: https://issues.apache.org/jira/browse/SPARK-11784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4411) Add "kill" link for jobs in the UI

2016-10-11 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566430#comment-15566430
 ] 

Alex Bozarth commented on SPARK-4411:
-

I'm currently working on this.
I'm updating the original PR to work with the latest code and to match how the 
kill-stage code works.

> Add "kill" link for jobs in the UI
> --
>
> Key: SPARK-4411
> URL: https://issues.apache.org/jira/browse/SPARK-4411
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Kay Ousterhout
>
> SPARK-4145 changes the default landing page for the UI to show jobs. We 
> should have a "kill" link for each job, similar to what we have for each 
> stage, so it's easier for users to kill slow jobs (and the semantics of 
> killing a job are slightly different than killing a stage).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566427#comment-15566427
 ] 

Harish edited comment on SPARK-17463 at 10/11/16 8:06 PM:
--

No, I don't have any code like that. I use PySpark. Please find my code snippet:
df1 with 60 columns (70M records)
df2 with 3000-7000 (varies) columns (10M)
join df1 and df2 on the key columns (please note df1 is more granular data and 
df2 is one level above, so the data set will grow)

df3 = df1.join(df2, [keys])
aggList = [func.mean(col).alias(col + '_m') for col in df2.columns]

The last part is: df4 = df3.groupBy(keys).agg(*aggList) -- applying mean to 
each column of the df3 data frame, which might be 3000-1 columns.
Let me know if you need the entire stack trace for this issue.

PS: We still have the issue https://issues.apache.org/jira/browse/SPARK-16845, so 
I have to break the columns into chunks of 500.

 


was (Author: harishk15):
No i dont have any code like that. I use pyspark .. Please find my code snippet
df1 with 60 columns (70M records)
df2  with 3000-7000 (varies) columns (10M)
join df1 and df2 with key columns (please note df1 is more granular data and 
df2 one level above. So data set will grow

df3 = df1.join(df2, [keys])
aggList = [func.mean(col).alias(col + '_m') for col in df2.columns]

Last part is i do -- df4 = df3.groupBy(keys).agg(*aggList) --I applying mean to 
each column of the df3 data frame which might be 3000-1 columns.

PS: We still have issue https://issues.apache.org/jira/browse/SPARK-16845 -- So 
i have to break number of columns 500 chunks

 

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> 

[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566427#comment-15566427
 ] 

Harish commented on SPARK-17463:


No, I don't have any code like that. I use PySpark. Please find my code snippet:
df1 with 60 columns (70M records)
df2 with 3000-7000 (varies) columns (10M)
join df1 and df2 on the key columns (please note df1 is more granular data and 
df2 is one level above, so the data set will grow)

df3 = df1.join(df2, [keys])
aggList = [func.mean(col).alias(col + '_m') for col in df2.columns]

The last part is: df4 = df3.groupBy(keys).agg(*aggList) -- applying mean to 
each column of the df3 data frame, which might be 3000-1 columns.

PS: We still have the issue https://issues.apache.org/jira/browse/SPARK-16845, so 
I have to break the columns into chunks of 500.

 

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> 

[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2016-10-11 Thread Jerome Scheuring (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566349#comment-15566349
 ] 

Jerome Scheuring edited comment on SPARK-12216 at 10/11/16 7:59 PM:


_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10, compiled with Scala 2.11 and running 
under Spark 2.0.1.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

_Update:_  The bug also does not occur when run on the installation of Spark 
2.0.1 on the Windows 10 machine running inside "Bash on Ubuntu on Windows", 
i.e. the Linux subsystem running on the Windows 10 machine where the bug _does_ 
occur when the program is executed from Windows.

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}


was (Author: jerome.scheuring):
_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10, compiled with Scala 2.11 and running 
under Spark 2.0.1.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>  

[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false,   /tmp/check/ contains only 1 
folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false,   /tmp/check/ contains only 1 
> folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
> {code}
> I think the last line should return true instead of false
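A small diagnostic sketch that could help narrow this down (it continues the snippet 
above and is an assumption about where to look, not a fix): probe the member RDDs 
directly and compare them with the graph-level flag:
{code}
println(s"graph checkpointed:       ${gg.isCheckpointed}")
println(s"vertices checkpointed:    ${gg.vertices.isCheckpointed}")
println(s"edges checkpointed:       ${gg.edges.isCheckpointed}")
println(s"vertices checkpoint file: ${gg.vertices.getCheckpointFile}")
println(s"edges checkpoint file:    ${gg.edges.getCheckpointFile}")
{code}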



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566413#comment-15566413
 ] 

Harish commented on SPARK-17463:


OK, thanks for the update. Do we have any workaround for the second part of the 
issue? I tried set("spark.rpc.netty.dispatcher.numThreads","2") but no luck.
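
For reference, a minimal sketch of how a setting like the one above is typically applied when building the session (illustrative only; the property is an internal RPC setting read at context startup, and as noted it did not resolve the executor-side race):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// The dispatcher thread count must be set before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("heartbeat-serialization-test")
  .set("spark.rpc.netty.dispatcher.numThreads", "2")
val spark = SparkSession.builder.config(conf).getOrCreate()
{code}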


> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 

[jira] [Updated] (SPARK-17816) JSON serialization of accumulators is failing with ConcurrentModificationException

2016-10-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17816:
-
Fix Version/s: 2.0.2

> JSON serialization of accumulators is failing with 
> ConcurrentModificationException
> --
>
> Key: SPARK-17816
> URL: https://issues.apache.org/jira/browse/SPARK-17816
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ergin Seyfe
>Assignee: Ergin Seyfe
> Fix For: 2.0.2, 2.1.0
>
>
> This is the stack trace: See  {{ConcurrentModificationException}}:
> {code}
> java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
> at java.util.ArrayList$Itr.next(ArrayList.java:851)
> at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
> at scala.collection.AbstractTraversable.to(Traversable.scala:104)
> at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
> at 
> org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
> at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
> at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17816) JSON serialization of accumulators is failing with ConcurrentModificationException

2016-10-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17816:
-
Affects Version/s: 2.0.1

> JSON serialization of accumulators is failing with 
> ConcurrentModificationException
> --
>
> Key: SPARK-17816
> URL: https://issues.apache.org/jira/browse/SPARK-17816
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ergin Seyfe
>Assignee: Ergin Seyfe
> Fix For: 2.0.2, 2.1.0
>
>
> This is the stack trace: See  {{ConcurrentModificationException}}:
> {code}
> java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
> at java.util.ArrayList$Itr.next(ArrayList.java:851)
> at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
> at scala.collection.AbstractTraversable.to(Traversable.scala:104)
> at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
> at 
> org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
> at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
> at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566385#comment-15566385
 ] 

Alexander Pivovarov edited comment on SPARK-17877 at 10/11/16 7:50 PM:
---

Other open issues related to checkpointing are SPARK-14804 and SPARK-12431


was (Author: apivovarov):
Another open issue with checkpointing is SPARK-14804

> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17812:
-
Summary: More granular control of starting offsets (assign)  (was: More 
granular control of starting offsets)

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566391#comment-15566391
 ] 

Shixiong Zhu commented on SPARK-17463:
--

Do you have a reproducer? I saw `at 
java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)`
in the stack trace, so I think the internal ArrayList is being accessed somewhere 
concurrently. Did you use `collectionAccumulator` in your code?

FYI, https://github.com/apache/spark/pull/15371 is for SPARK-17816, which fixes 
an issue on the driver side.
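
For context, a minimal sketch of the kind of usage being asked about (hypothetical user code, not from the reporter). {{collectionAccumulator}} is backed by a synchronized {{java.util.List}}, which is consistent with the {{Collections$SynchronizedCollection.writeObject}} frame mentioned above:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("collection-acc-example").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Values added from tasks can race with the heartbeat thread that serializes
// accumulator updates, which is the thread-safety issue discussed in this ticket.
val badRecords = sc.collectionAccumulator[String]("badRecords")
sc.parallelize(1 to 100000).foreach { i =>
  if (i % 1000 == 0) badRecords.add(s"suspicious value: $i")
}
println(badRecords.value.size())
{code}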

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> 

[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566392#comment-15566392
 ] 

Michael Armbrust commented on SPARK-17812:
--

For the seeking back {{X}} offsets use case, I was interactively querying a 
stream and I wanted *some* data, but not *all available data*.  I did not have 
specific offsets in mind, and under the assumption that items are getting 
hashed across partitions, X offsets back is a very reasonable proxy for time.  
I agree actual time would be better.  However, since there is disagreement on 
this case, I'd propose we break that out into its own ticket and focus on 
assign here.

I'm not sure I understand the concern with the {{startingOffsets}} option naming 
(which we can still change, though it would be nice to do so before a release 
happens). It affects which offsets will be included in the query, and it only 
takes effect when the query is first started. [~ofirm], currently we support (1) 
(though I wouldn't say *all* data, as we are limited by retention / compaction) 
and (2). As you said, we can also support (3), though this must be done after 
the fact by adding a predicate on the timestamp column to the stream. For 
performance it would be nice to push that down into Kafka, but I'd split that 
optimization into another ticket.

Regarding (4), I like the proposed JSON solution. It would be nice if this were 
unified with whatever format we decide to use in [SPARK-17829] so that you could 
easily pick up where another query left off. I'd also suggest we use {{-1}} and 
{{-2}} as special offsets for subscribing to a topicpartition at the earliest or 
latest available offsets at query start time.
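
To make the proposal concrete, a sketch of what an assign-plus-JSON-offsets query could look like, assuming a {{SparkSession}} named {{spark}} (the option names and the meaning of the {{-1}}/{{-2}} sentinels are assumptions based on this discussion, not a settled API):

{code}
// Hypothetical illustration of per-partition assignment with JSON starting offsets.
// Here -2 and -1 stand for "earliest" and "latest" for that partition (assumed convention).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("assign", """{"topicA":[0,1],"topicB":[0]}""")
  .option("startingOffsets", """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""")
  .load()
{code}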

> More granular control of starting offsets
> -
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek back {{X}} offsets in the stream from the moment the query starts
>  - seek to user specified offsets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17812) More granular control of starting offsets

2016-10-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17812:
-
Description: 
Right now you can only run a Streaming Query starting from either the earliest 
or latest offsets available at the moment the query is started. Sometimes this 
is a lot of data. It would be nice to be able to do the following:
 - seek to user specified offsets for manually specified topicpartitions

  was:
Right now you can only run a Streaming Query starting from either the earliest 
or latest offsets available at the moment the query is started. Sometimes this 
is a lot of data. It would be nice to be able to do the following:
 - seek back {{X}} offsets in the stream from the moment the query starts
 - seek to user specified offsets


> More granular control of starting offsets
> -
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566385#comment-15566385
 ] 

Alexander Pivovarov commented on SPARK-17877:
-

Another open issue with checkpointing is SPARK-14804

> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566380#comment-15566380
 ] 

Timothy Hunter commented on SPARK-17845:


I like the {{Window.rowsBetween(Long.MinValue, -3)}} syntax, but it is exposing 
a system implementation detail. How about having some static/singleton values 
that define our notion of plus/minus infinity instead of relying on the system 
values?

Here is a suggestion:

{code}
Window.rowsBetween(Window.unboundedBefore, -3)

object Window {
  def unboundedBefore: Long = Int.MinValue.toLong
}
{code}

To get around the varying sizes of ints across languages, I suggest we say that 
every value at or beyond 2^31 in magnitude is considered unbounded (preceding 
for negative values, following for positive ones). That should be more than 
enough and covers at least Python, Scala, R, and Java.
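
As a sketch of that convention (my own illustration of the normalization; the names are hypothetical, not an API proposal beyond what is written above):

{code}
// Illustrative only: any boundary whose magnitude reaches 2^31 is treated as unbounded.
val UnboundedThreshold: Long = 1L << 31   // 2^31

def normalizeBoundary(b: Long): Long =
  if (b <= -UnboundedThreshold) Long.MinValue      // unbounded preceding
  else if (b >= UnboundedThreshold) Long.MaxValue  // unbounded following
  else b
{code}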


> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicitly specification 
> of begin/end even in the case of unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15153.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15431
[https://github.com/apache/spark/pull/15431]

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes will 
> throw an error. This bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable type can be string or numeric. If it is a 
> string, RFormula transforms it into a DoubleType label via StringIndexer and 
> sets the corresponding column metadata; otherwise, RFormula uses it directly 
> as the label when training the model (and assumes it is numbered 0, ..., 
> maxLabelIndex).
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle them 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]
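
For illustration, a hedged Scala sketch of the label handling the description argues for, i.e. only assume string-indexed labels when the column metadata says the attribute is nominal (the names here are illustrative, not the actual NaiveBayesWrapper code):

{code}
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

def labelValues(data: DataFrame, labelCol: String): Array[String] = {
  Attribute.fromStructField(data.schema(labelCol)) match {
    case nominal: NominalAttribute =>
      // string response: labels were produced by StringIndexer and carry metadata
      nominal.values.getOrElse(Array.empty[String])
    case _ =>
      // numeric response: the column values already serve as labels 0, 1, ..., maxLabelIndex
      data.select(labelCol).distinct().collect().map(_.get(0).toString).sorted
  }
}
{code}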



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)
Alexander Pivovarov created SPARK-17877:
---

 Summary: Can not checkpoint connectedComponents resulting graph
 Key: SPARK-17877
 URL: https://issues.apache.org/jira/browse/SPARK-17877
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.0.1, 1.6.2, 1.5.2
Reporter: Alexander Pivovarov
Priority: Minor


The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents()
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates an issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents()
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2016-10-11 Thread Jerome Scheuring (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566349#comment-15566349
 ] 

Jerome Scheuring edited comment on SPARK-12216 at 10/11/16 7:34 PM:


_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines 
running both Windows 8.1 and Windows 10, with code compiled against Scala 2.11 
and running under Spark 2.0.1.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}


was (Author: jerome.scheuring):
_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 

[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2016-10-11 Thread Jerome Scheuring (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566349#comment-15566349
 ] 

Jerome Scheuring commented on SPARK-12216:
--

_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> 

[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566299#comment-15566299
 ] 

Sean Owen commented on SPARK-17463:
---

No, that change came after, and is part of a different JIRA that addresses 
another part of the same problem. It is not in 2.0.1

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 

[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566284#comment-15566284
 ] 

Sean Owen commented on SPARK-17870:
---

OK, I get it; they're really doing different things. The scikit-learn version 
computes the statistic for count-valued features vs. a categorical label, while 
the Spark version computes it for categorical features vs. categorical labels. 
Although the number of label classes is constant in both cases, the Spark 
computation also depends on the number of feature classes. Yes, it does 
need to be changed in Spark.
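
As an illustration of why ranking by the raw statistic breaks down when the 
degrees of freedom differ, here is a hedged sketch against the public MLlib API 
(assuming a spark-shell where {{sc}} is available; this is not the eventual 
fix): it ranks the same features by raw chi-square statistic and by p-value, 
and the two orderings can disagree when features have different numbers of 
distinct values.

{code}
// Hedged sketch: compare feature rankings by raw statistic vs. by p-value.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Feature 0 has three distinct values, feature 1 has two: different df per feature.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0))))

val results = Statistics.chiSqTest(data)  // one ChiSqTestResult per feature

// Ranking by the raw statistic ignores that df differs between features...
val byStatistic = results.zipWithIndex.sortBy { case (r, _) => -r.statistic }.map(_._2)
// ...whereas p-values are comparable across different df.
val byPValue = results.zipWithIndex.sortBy { case (r, _) => r.pValue }.map(_._2)

results.zipWithIndex.foreach { case (r, i) =>
  println(s"feature $i: statistic=${r.statistic}, df=${r.degreesOfFreedom}, pValue=${r.pValue}")
}
{code}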

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The way the chi-square test results (ChiSqTestResult) are computed and used 
> in mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The feature selection method ChiSquareSelector ranks features by 
> ChiSqTestResult.statistic (the chi-square value) and selects the features 
> with the largest chi-square values. But the degrees of freedom (df) of the 
> chi-square values returned by Statistics.chiSqTest(RDD) differ from feature 
> to feature, and chi-square values with different df are not directly 
> comparable, so they cannot be used on their own to select features.
> Because of this, the feature selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 is selected.
> If selectFpr is used: features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df is the same for 
> every feature.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566251#comment-15566251
 ] 

Joseph K. Bradley commented on SPARK-15153:
---

Note I'm setting the target version for 2.1, not 2.0.x, since the fix requires 
a public API change in the preceding PR.

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes 
> throws an error. The bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable can be of string or numeric type. If it is 
> a string, RFormula transforms it into a label of DoubleType via StringIndexer 
> and sets the corresponding column metadata; otherwise, RFormula uses it 
> directly as the label when training the model (and assumes it is numbered 
> from 0, ..., maxLabelIndex).
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle both cases 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]
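
A hedged sketch of the label-extraction handling described above (illustrative 
only; the helper name extractLabels is hypothetical and this is not the actual 
patch): when the response is a string, StringIndexer records the label values 
as NominalAttribute metadata on the label column, while a numeric response 
carries no such metadata, so a blind cast fails with the ClassCastException 
shown, and both cases need to be handled.

{code}
// Hedged sketch only; names and structure are illustrative, not Spark's NaiveBayesWrapper.
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

def extractLabels(df: DataFrame, labelCol: String): Array[String] =
  Attribute.fromStructField(df.schema(labelCol)) match {
    case nominal: NominalAttribute =>
      // String response: StringIndexer stored the original labels in the column metadata.
      nominal.values.getOrElse(Array.empty)
    case _ =>
      // Numeric response: no nominal metadata, so derive the distinct label values from the data.
      df.select(labelCol).distinct().collect().map(_.getDouble(0)).sorted.map(_.toString)
  }
{code}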



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15153:
--
Target Version/s: 2.1.0

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes 
> throws an error. The bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable can be of string or numeric type. If it is 
> a string, RFormula transforms it into a label of DoubleType via StringIndexer 
> and sets the corresponding column metadata; otherwise, RFormula uses it 
> directly as the label when training the model (and assumes it is numbered 
> from 0, ..., maxLabelIndex).
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle both cases 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566244#comment-15566244
 ] 

Cody Koeninger commented on SPARK-17344:


How long would it take CDH to distribute 0.10 if there was a compelling Spark 
client for it?

How are you going to handle SSL?

You can't avoid the complexity of caching consumers if you still want the 
benefits of prefetching, and doing an SSL handshake for every batch will kill 
performance if they aren't cached.
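
(For reference, a hedged sketch of the consumer-caching pattern referred to 
here, written against the kafka-clients 0.10 API; the object name ConsumerCache 
and its shape are illustrative, not Spark's actual cache: one long-lived 
consumer per group and topic-partition per executor, so the connection and any 
SSL handshake are paid once rather than on every batch.)

{code}
// Hedged sketch of per-executor consumer caching; not Spark's implementation.
import java.{util => ju}
import scala.collection.mutable
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object ConsumerCache {
  private val cache =
    mutable.HashMap.empty[(String, TopicPartition), KafkaConsumer[Array[Byte], Array[Byte]]]

  def getOrCreate(groupId: String, tp: TopicPartition,
                  kafkaParams: ju.Map[String, Object]): KafkaConsumer[Array[Byte], Array[Byte]] =
    synchronized {
      cache.getOrElseUpdate((groupId, tp), {
        // Created once per executor; re-connecting (and re-doing the SSL handshake)
        // for every batch is what kills throughput.
        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](kafkaParams)
        consumer.assign(ju.Collections.singletonList(tp))  // manual assignment, no group rebalancing
        consumer
      })
    }
}
{code}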

Also note that this is a prime example of what I'm talking about in my dev 
mailing list discussion on SIPs. This issue has been raised multiple times, and 
each time the decision went against continuing support for 0.8.

You guys started making promises about structured streaming for Kafka over half 
a year ago, and still don't have it feature complete.  This is a big potential 
detour for uncertain gain.  The real underlying problem is still how you're 
going to do better than simply wrapping a DStream, and I don't see how this is 
directly relevant.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17817) PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes

2016-10-11 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-17817.
--
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.1.0

> PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
> ---
>
> Key: SPARK-17817
> URL: https://issues.apache.org/jira/browse/SPARK-17817
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Mike Dusenberry
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Calling {{repartition}} on a PySpark RDD to increase the number of partitions 
> results in highly skewed partition sizes, with most having 0 rows.  The 
> {{repartition}} method should evenly spread out the rows across the 
> partitions, and this behavior is correctly seen on the Scala side.
> Please reference the following code for a reproducible example of this issue:
> {code}
> # Python
> num_partitions = 2
> a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
> l = a.repartition(num_partitions).glom().map(len).collect()  # length of each partition
> min(l), max(l), sum(l)/len(l), len(l)  # skewed!
>
> // Scala
> val numPartitions = 2
> val a = sc.parallelize(0 until 1e6.toInt, 2)  // start with 2 even partitions
> val l = a.repartition(numPartitions).glom().map(_.length).collect()  // length of each partition
> println((l.min, l.max, l.sum / l.length, l.length))  // even!
> {code}
> The issue here is that highly skewed partitions can result in severe memory 
> pressure in subsequent steps of a processing pipeline, resulting in OOM 
> errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566206#comment-15566206
 ] 

Harish commented on SPARK-17463:


Is this fix part of the https://github.com/apache/spark/pull/15371 pull 
request? I have 2.0.1 in my cluster but I am getting both errors.

16/10/11 00:53:42 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(2,[Lscala.Tuple2;@43f45f95,BlockManagerId(2, HOST, 43256))] in 1 
attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject(ArrayList.java:766)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at 
java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at 
