[jira] [Commented] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567714#comment-15567714
 ] 

Apache Spark commented on SPARK-14804:
--

User 'apivovarov' has created a pull request for this issue:
https://github.com/apache/spark/pull/15447

> Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: 
> ---
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: SuYan
>Priority: Minor
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 
> (TID 13, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to 
> scala.Tuple2
>   at 
> com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:91)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> look at the code:
> {code}
>   private[spark] def computeOrReadCheckpoint(split: Partition, context: 
> TaskContext): Iterator[T] =
>   {
> if (isCheckpointedAndMaterialized) {
>   firstParent[T].iterator(split, context)
> } else {
>   compute(split, context)
> }
>   }
>  private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>  override def isCheckpointed: Boolean = {
>firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
>  }
> {code}
> For VertexRDD or EdgeRDD, the first parent is its partitionsRDD, i.e.
> RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])].
> 1. We call vertexRDD.checkpoint(); its partitionsRDD is checkpointed, so
> VertexRDD.isCheckpointedAndMaterialized becomes true.
> 2. We then call vertexRDD.iterator(); because the check is true, it calls
> firstParent.iterator (which is not a CheckpointRDD, but actually the partitionsRDD).
> So the returned iterator is an Iterator[ShippableVertexPartition], not the expected
> Iterator[(VertexId, VD)].
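For completeness, an editorial, self-contained version of the reproducer quoted above (the setup
code, names and checkpoint path are illustrative additions; Spark 1.6.x with GraphX is assumed):

{code}
// Editorial reproduction sketch, not part of the original report.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("SPARK-14804"))
sc.setCheckpointDir("/tmp/spark-checkpoints")   // illustrative path

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "e")))
val graph    = Graph(vertices, edges)

graph.vertices.checkpoint()
graph.vertices.count()              // materializes the checkpoint of partitionsRDD
graph.vertices.map(_._2).count()    // throws the ClassCastException described above on affected versions
{code}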



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17880.
-
   Resolution: Fixed
 Assignee: Kousuke Saruta
Fix Version/s: 2.1.0
   2.0.2

> The url linking to `AccumulatorV2` in the document is incorrect.
> 
>
> Key: SPARK-17880
> URL: https://issues.apache.org/jira/browse/SPARK-17880
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> In `programming-guide.md`, the url which links to `AccumulatorV2` says 
> `api/scala/index.html#org.apache.spark.AccumulatorV2` but 
> `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is correct.






[jira] [Commented] (SPARK-17846) A bad state of Running Applications with spark standalone HA

2016-10-11 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567662#comment-15567662
 ] 

Saisai Shao commented on SPARK-17846:
-

I think this issue should be the same as SPARK-14262.

> A bad state of Running Applications with spark standalone HA 
> -
>
> Key: SPARK-17846
> URL: https://issues.apache.org/jira/browse/SPARK-17846
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: dylanzhou
>Priority: Critical
> Attachments: Problem screenshots.jpg
>
>
> I am using standalone mode. When I use HA (set up in either of the two ways), I found 
> the applications' state was "WAITING". Is this a bug?






[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567457#comment-15567457
 ] 

Cody Koeninger commented on SPARK-17344:


Given the choice between rewriting underlying kafka consumers and having a 
split codebase, I'd rather have a split codebase.  Of course I'd rather not 
sink development effort into an old version of kafka at all, until the 
structured stream for 0.10 is working for my use cases.

But if you want to wrap the 0.8 RDD in a structured stream, go for it; I'll help you 
figure out how to do it. Seriously. Don't expect larger project uptake, but if you just 
need something to work for you...

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Closed] (SPARK-17837) Disaster recovery of offsets from WAL

2016-10-11 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger closed SPARK-17837.
--
Resolution: Duplicate

Duplicate of SPARK-17829

> Disaster recovery of offsets from WAL
> -
>
> Key: SPARK-17837
> URL: https://issues.apache.org/jira/browse/SPARK-17837
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Cody Koeninger
>
> "The SQL offsets are stored in a WAL at $checkpointLocation/offsets/$batchId. 
> As reynold suggests though, we should change this to use a less opaque 
> format."






[jira] [Resolved] (SPARK-17720) Static configurations in SQL

2016-10-11 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17720.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15295
[https://github.com/apache/spark/pull/15295]

> Static configurations in SQL
> 
>
> Key: SPARK-17720
> URL: https://issues.apache.org/jira/browse/SPARK-17720
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>
> Spark SQL has two kinds of configuration parameters: dynamic configs and 
> static configs. Dynamic configs can be modified after Spark SQL is launched 
> (after the SparkSession is set up), whereas static configs are immutable once the 
> service starts.
> It would be useful to have this separation and to tell the user when they try 
> to set a static config after the service starts.
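As an editorial illustration of the split described above (a sketch assuming the behaviour this
ticket introduces; spark.sql.warehouse.dir is used as an example of a static config):

{code}
import org.apache.spark.sql.SparkSession

// Static configs must be set before the SparkSession (and its shared state) starts.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse")  // static: fixed for the service's lifetime
  .getOrCreate()

// Dynamic configs can be changed at any time on the live session.
spark.conf.set("spark.sql.shuffle.partitions", "64")

// With the separation this ticket adds, changing a static config on a live session
// should be rejected with a clear error rather than silently ignored:
// spark.conf.set("spark.sql.warehouse.dir", "/another/path")   // expected to fail
{code}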






[jira] [Assigned] (SPARK-17882) RBackendHandler swallowing errors

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17882:


Assignee: Apache Spark

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Assignee: Apache Spark
>Priority: Minor
>
> RBackendHandler is swallowing general exceptions in handleMethodCall, which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error:
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.






[jira] [Assigned] (SPARK-17882) RBackendHandler swallowing errors

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17882:


Assignee: (was: Apache Spark)

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Priority: Minor
>
> RBackendHandler is swallowing general exceptions in handleMethodCall, which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error:
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.






[jira] [Commented] (SPARK-17882) RBackendHandler swallowing errors

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567423#comment-15567423
 ] 

Apache Spark commented on SPARK-17882:
--

User 'jrshust' has created a pull request for this issue:
https://github.com/apache/spark/pull/15446

> RBackendHandler swallowing errors
> -
>
> Key: SPARK-17882
> URL: https://issues.apache.org/jira/browse/SPARK-17882
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.1
>Reporter: James Shuster
>Priority: Minor
>
> RBackendHandler is swallowing general exceptions in handleMethodCall, which 
> makes it impossible to debug certain issues that happen when doing an 
> invokeJava call.
> In my case this was the following error:
> java.lang.IllegalAccessException: Class 
> org.apache.spark.api.r.RBackendHandler can not access a member of class with 
> modifiers "public final"
> The getCause message that is written back was basically blank.






[jira] [Commented] (SPARK-17817) PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567418#comment-15567418
 ] 

Apache Spark commented on SPARK-17817:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15445

> PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
> ---
>
> Key: SPARK-17817
> URL: https://issues.apache.org/jira/browse/SPARK-17817
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Mike Dusenberry
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Calling {{repartition}} on a PySpark RDD to increase the number of partitions 
> results in highly skewed partition sizes, with most having 0 rows.  The 
> {{repartition}} method should evenly spread out the rows across the 
> partitions, and this behavior is correctly seen on the Scala side.
> Please reference the following code for a reproducible example of this issue:
> {code}
> # Python
> num_partitions = 2
> a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
> l = a.repartition(num_partitions).glom().map(len).collect()  # get length of each partition
> min(l), max(l), sum(l)/len(l), len(l)  # skewed!
>
> // Scala
> val numPartitions = 2
> val a = sc.parallelize(0 until 1e6.toInt, 2)  // start with 2 even partitions
> val l = a.repartition(numPartitions).glom().map(_.length).collect()  // get length of each partition
> print(l.min, l.max, l.sum/l.length, l.length)  // even!
> {code}
> The issue here is that highly skewed partitions can result in severe memory 
> pressure in subsequent steps of a processing pipeline, resulting in OOM 
> errors.






[jira] [Created] (SPARK-17882) RBackendHandler swallowing errors

2016-10-11 Thread James Shuster (JIRA)
James Shuster created SPARK-17882:
-

 Summary: RBackendHandler swallowing errors
 Key: SPARK-17882
 URL: https://issues.apache.org/jira/browse/SPARK-17882
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.1
Reporter: James Shuster
Priority: Minor


RBackendHandler is swallowing general exceptions in handleMethodCall, which 
makes it impossible to debug certain issues that happen when doing an 
invokeJava call.

In my case this was the following error:
java.lang.IllegalAccessException: Class org.apache.spark.api.r.RBackendHandler 
can not access a member of class with modifiers "public final"

The getCause message that is written back was basically blank.
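An editorial sketch of the kind of change the report implies (the helper below is illustrative,
not the actual RBackendHandler code): report the exception class and full stack trace back to the
R side instead of only getCause's message, which is blank for reflection failures like the
IllegalAccessException above.

{code}
import java.io.{PrintWriter, StringWriter}

object ErrorReporting {
  // Build a non-empty, debuggable description from any Throwable: class name,
  // message (if present), and the full stack trace. How the string is written back
  // to the R process is backend-specific and intentionally left out of this sketch.
  def describe(e: Throwable): String = {
    val sw = new StringWriter()
    e.printStackTrace(new PrintWriter(sw))
    s"${e.getClass.getName}: ${Option(e.getMessage).getOrElse("<no message>")}\n$sw"
  }
}
{code}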






[jira] [Comment Edited] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Jeremy Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567367#comment-15567367
 ] 

Jeremy Smith edited comment on SPARK-17344 at 10/12/16 2:56 AM:


{quote}
By contrast, writing a streaming source shim around the existing simple 
consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't have 
stuff like SSL, dynamic topics, or offset committing.
{quote}

Serious question: Would it be so bad to have a bifurcated codebase here? People 
who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for 
them, and are probably not all that concerned about the features you mentioned. 
In general, structured streaming already provides a lot of the capabilities 
that I for one am concerned about when using Kafka - offsets are tracked 
natively by SS, so offset committing isn't that big of a deal; in a CDH cluster 
specifically, you are probably using network-level security and aren't viewing 
the lack of SSL as a blocker; and finally you're already resigned to static 
topic subscriptions because that's what you're getting with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying 
technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same level, but at least you won't 
have to throw away all your checkpoints just to do something with a new topic 
in the same application. Currently, you have to do this, because the entire 
graph is stored in the checkpoints along with all the topics you're ever going 
to look at. Structured streaming at least gives you separate checkpoints per 
source, rather than for the entire StreamingContext.
* You're already unable to manually commit offsets; you either have to rewind 
to the beginning, or throw away everything from the past, or (as before) rely 
on the incredibly fragile StreamingContext checkpoints. Or, commit the 
topic/partition/offset to the sink so you can recover the actually processed 
messages from there. Again, decoupling each operation from the entire state of 
the StreamingContext is a huge step up, because you can actually upgrade your 
application code (at least in certain ways) without having to worry about 
re-processing stuff due to discarding the checkpoints.
* It will dramatically simplify the usage of Kafka from Spark in general. 9/10 
use cases involve some sort of structured data, the processing of which will 
have dramatically better performance when being used with tungsten than with 
RDD-level operations.

So if the simple-consumer based Kafka source would be so easy, at the expense 
of some features, why not introduce it? I have a tremendous amount of respect 
for the complexity of Kafka and the work you're doing with it, but I also get a 
sense that the conceptual "perfect" here is the enemy of the good. The weekend 
project you mentioned would result in a dramatic improvement in the experience 
for a large percentage of users who are currently using Spark and Kafka 
together. Most companies are using some kind of Hadoop distribution (e.g. HDP 
or CDH) and they are slow to update things like Kafka. HDP does have 0.10 (CDH 
doesn't), but at what rate are people actually able to update HDP? I don't have 
any data on it (ironically) but I'm guessing that 0.9 still represents a fairly 
significant portion of the Kafka install base.

Just my two cents on the matter.


was (Author: jeremyrsmith):
 > By contrast, writing a streaming source shim around the existing simple 
 > consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't 
 > have stuff like SSL, dynamic topics, or offset committing.

Serious question: Would it be so bad to have a bifurcated codebase here? People 
who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for 
them, and are probably not all that concerned about the features you mentioned. 
In general, structured streaming already provides a lot of the capabilities 
that I for one am concerned about when using Kafka - offsets are tracked 
natively by SS, so offset committing isn't that big of a deal; in a CDH cluster 
specifically, you are probably using network-level security and aren't viewing 
the lack of SSL as a blocker; and finally you're already resigned to static 
topic subscriptions because that's what you're getting with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying 
technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same level, but at least you won't 
have to throw away all your checkpoints just to do something with a new topic 
in the same application. Currently, you have to do this, because the entire 
graph is stored in the checkpoints along with all the topics you're ever going 
to look at. Structured streaming at least gives you separate checkpoints per 
source, rather than for the entire StreamingContext.

[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Jeremy Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567367#comment-15567367
 ] 

Jeremy Smith commented on SPARK-17344:
--

 > By contrast, writing a streaming source shim around the existing simple 
 > consumer-based 0.8 spark rdd would be a weekend project, it just wouldn't 
 > have stuff like SSL, dynamic topics, or offset committing.

Serious question: Would it be so bad to have a bifurcated codebase here? People 
who are tied to Kafka 0.8/0.9 will typically know that this is a limitation for 
them, and are probably not all that concerned about the features you mentioned. 
In general, structured streaming already provides a lot of the capabilities 
that I for one am concerned about when using Kafka - offsets are tracked 
natively by SS, so offset committing isn't that big of a deal; in a CDH cluster 
specifically, you are probably using network-level security and aren't viewing 
the lack of SSL as a blocker; and finally you're already resigned to static 
topic subscriptions because that's what you're getting with the DStream API.

A simple Structured Streaming source for Kafka, even using the same underlying 
technology, would be a HUGE step up:

* You won't have "dynamic topics" to the same level, but at least you won't 
have to throw away all your checkpoints just to do something with a new topic 
in the same application. Currently, you have to do this, because the entire 
graph is stored in the checkpoints along with all the topics you're ever going 
to look at. Structured streaming at least gives you separate checkpoints per 
source, rather than for the entire StreamingContext.
* You're already unable to manually commit offsets; you either have to rewind 
to the beginning, or throw away everything from the past, or (as before) rely 
on the incredibly fragile StreamingContext checkpoints. Or, commit the 
topic/partition/offset to the sink so you can recover the actually processed 
messages from there. Again, decoupling each operation from the entire state of 
the StreamingContext is a huge step up, because you can actually upgrade your 
application code (at least in certain ways) without having to worry about 
re-processing stuff due to discarding the checkpoints.
* It will dramatically simplify the usage of Kafka from Spark in general. 9/10 
use cases involve some sort of structured data, the processing of which will 
have dramatically better performance when being used with tungsten than with 
RDD-level operations.

So if the simple-consumer based Kafka source would be so easy, at the expense 
of some features, why not introduce it? I have a tremendous amount of respect 
for the complexity of Kafka and the work you're doing with it, but I also get a 
sense that the conceptual "perfect" here is the enemy of the good. The weekend 
project you mentioned would result in a dramatic improvement in the experience 
for a large percentage of users who are currently using Spark and Kafka 
together. Most companies are using some kind of Hadoop distribution (e.g. HDP 
or CDH) and they are slow to update things like Kafka. HDP does have 0.10 (CDH 
doesn't), but at what rate are people actually able to update HDP? I don't have 
any data on it (ironically) but I'm guessing that 0.9 still represents a fairly 
significant portion of the Kafka install base.

Just my two cents on the matter.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567362#comment-15567362
 ] 

Liwei Lin commented on SPARK-16845:
---

Thanks for the pointer, let me look into this. :-)

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567353#comment-15567353
 ] 

Liwei Lin edited comment on SPARK-16845 at 10/12/16 2:51 AM:
-

[~dondrake] [~Utsumi] Could you provide a simple reproducer?


was (Author: lwlin):
-[~dondrake] [~Utsumi] Could you provide a simple reproducer?-

I've found a reproducer in SPARK-17092; I'll take a look at this, thanks!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567358#comment-15567358
 ] 

Don Drake commented on SPARK-16845:
---

I can't at the moment, mine is not simple.  

But this JIRA has one: https://issues.apache.org/jira/browse/SPARK-17092
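For readers without that context, an editorial sketch (in the spirit of the SPARK-17092 reproducer,
not the exact code from either ticket) of the kind of wide ordering that can push the generated
SpecificOrdering method past the JVM's 64 KB method-size limit:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sorting on many hundreds of columns makes the generated comparison method very large,
// which is the failure mode reported here; the column count needed to trip the limit
// varies by Spark version and data types.
val spark = SparkSession.builder().master("local[*]").appName("wide-sort").getOrCreate()
val numCols = 800
val wide = spark.range(1000)
  .select((0 until numCols).map(i => (col("id") + i).alias(s"c$i")): _*)
wide.sort(wide.columns.map(col): _*).count()   // may fail with "grows beyond 64 KB" on affected versions
{code}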

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Comment Edited] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567353#comment-15567353
 ] 

Liwei Lin edited comment on SPARK-16845 at 10/12/16 2:50 AM:
-

-[~dondrake] [~Utsumi] Could you provide a simple reproducer?-

I've found a reproducer in SPARK-17092; I'll take a look at this, thanks!


was (Author: lwlin):
[~dondrake] [~Utsumi] Could you provide a simple reproducer? I may help look 
into this, thanks!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567353#comment-15567353
 ] 

Liwei Lin commented on SPARK-16845:
---

[~dondrake] [~Utsumi] Could you provide a simple reproducer? I may help look 
into this, thanks!

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the following fatal error occurs:
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)






[jira] [Comment Edited] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567341#comment-15567341
 ] 

Leandro Ferrado edited comment on SPARK-11758 at 10/12/16 2:43 AM:
---

Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:  # Having the 2 following lines in the clause
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.


was (Author: leferrad):
Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        ## begin if clause ##
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
        ## end if clause ##
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...




[jira] [Commented] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567341#comment-15567341
 ] 

Leandro Ferrado commented on SPARK-11758:
-

Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        ## begin if clause ##
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
        ## end if clause ##
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...






[jira] [Comment Edited] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567329#comment-15567329
 ] 

Leandro Ferrado edited comment on SPARK-11758 at 10/12/16 2:39 AM:
---

Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        # begin if clause #
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
        # end if clause #
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.


was (Author: leferrad):
Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...




[jira] [Issue Comment Deleted] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leandro Ferrado updated SPARK-11758:

Comment: was deleted

(was: Hi Holden. First, I would add just a single line in order to avoid the bad 
conversion of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a 
Date column into a LongInt column). The idea is to first convert all columns to string 
types, so that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime 
objects. However, that can only be done if we define the pyspark.sql.dataframe.DataFrame 
with a schema of strings, or if no schema was defined (in which case the function 
creates a schema of strings). So the modification only applies under the 'schema is 
None' condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        # begin if clause #
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
        # end if clause #
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.)

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...






[jira] [Commented] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas

2016-10-11 Thread Leandro Ferrado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567329#comment-15567329
 ] 

Leandro Ferrado commented on SPARK-11758:
-

Hi Holden. First, I would add just a single line in order to avoid the bad conversion 
of 'datetime' objects (so far, DataFrame.to_records(index=False) converts a Date column 
into a LongInt column). The idea is to first convert all columns to string types, so 
that DataFrame.to_records(index=False) doesn't mis-convert datetime.datetime objects. 
However, that can only be done if we define the pyspark.sql.dataframe.DataFrame with a 
schema of strings, or if no schema was defined (in which case the function creates a 
schema of strings). So the modification only applies under the 'schema is None' 
condition, and the snippet would be:

---
if has_pandas and isinstance(data, pandas.DataFrame):
    if schema is None:
        schema = [str(x) for x in data.columns]
        data = data.astype(str)  # Converting all fields to string objects because we don't have a defined schema
    data = [r.tolist() for r in data.to_records(index=False)]
---

If the schema has timestamps (e.g. TimestampType() or DateType()), a prior conversion 
is needed from Python's datetime.datetime objects to a format convenient for pyspark 
DataFrames.
Regarding the 'index=False' argument, so far I can't think of a scenario in which an 
index per row is needed on a DataFrame, so that argument may be fine as it is; I'm not 
sure.

> Missing Index column while creating a DataFrame from Pandas 
> 
>
> Key: SPARK-11758
> URL: https://issues.apache.org/jira/browse/SPARK-11758
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Linux Debian, PySpark, in local testing.
>Reporter: Leandro Ferrado
>Priority: Minor
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> In PySpark's SQLContext, when createDataFrame() is invoked from a 
> pandas.DataFrame while indicating a 'schema' with StructFields, the function 
> _createFromLocal() converts the pandas.DataFrame but ignores two points:
> - the index column, because of the flag index=False
> - timestamp records, because a Date column can't be the index and Pandas 
> doesn't convert its records to a Timestamp type.
> So converting a DataFrame from Pandas to SQL is poor in scenarios with 
> temporal records.
> Doc: 
> http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
> Affected code:
> def _createFromLocal(self, data, schema):
>     """
>     Create an RDD for DataFrame from a list or pandas.DataFrame, returns
>     the RDD and schema.
>     """
>     if has_pandas and isinstance(data, pandas.DataFrame):
>         if schema is None:
>             schema = [str(x) for x in data.columns]
>         data = [r.tolist() for r in data.to_records(index=False)]  # HERE
>         # ...






[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567321#comment-15567321
 ] 

Michael Armbrust commented on SPARK-17344:
--

These are good questions.  A few thoughts:

bq. How long would it take CDH to distribute 0.10 if there was a compelling 
Spark client for it?

Even if they were going to release Kafka 0.10 in CDH yesterday, my experience 
is that it will take a long time for people to upgrade. We spent a fair 
amount of effort on multi-version compatibility for Hive in Spark SQL and it 
was a great boost for adoption. I think this could be the same thing.

bq. How are you going to handle SSL?  You can't avoid the complexity of caching 
consumers if you still want the benefits of prefetching, and doing an SSL 
handshake for every batch will kill performance if they aren't cached.

An option here would be to use the internal client directly.  This way we can 
leverage all the work that they did to support SSL, etc., yet make it speak 
specific versions of the protocol as we need. I did a [really rough 
prototype|https://gist.github.com/marmbrus/7d116b0a9672337497ddfccc0657dbf0] 
using the APIs described above and it is not that much code.  There is clearly 
a lot more we'd need to do, but I think we should strongly consider this option.

Caching connections to the specific brokers should probably still be 
implemented for the reasons you describe (and this is already handled by the 
internal client).  An advantage here is you'd actually be able to share 
connections across queries without running into correctness problems.
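To illustrate the consumer-caching point above (an editorial sketch against the standard
kafka-clients API, not Spark's connector code): keep one consumer per topic-partition alive on the
executor and reuse it across batches, so the (possibly SSL) handshake is paid once rather than per
batch.

{code}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.mutable

// Sketch only: kafkaParams is assumed to already carry bootstrap servers, deserializers
// and any SSL settings. Consumers are created lazily, manually assigned to a single
// partition, and reused across batches; offsets would be driven externally via seek().
object CachedConsumers {
  private val cache = mutable.Map.empty[TopicPartition, KafkaConsumer[Array[Byte], Array[Byte]]]

  def getOrCreate(tp: TopicPartition, kafkaParams: Properties): KafkaConsumer[Array[Byte], Array[Byte]] =
    synchronized {
      cache.getOrElseUpdate(tp, {
        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](kafkaParams)
        consumer.assign(Collections.singletonList(tp))
        consumer
      })
    }
}
{code}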

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.






[jira] [Comment Edited] (SPARK-12484) DataFrame withColumn() does not work in Java

2016-10-11 Thread Ryan Brant (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567271#comment-15567271
 ] 

Ryan Brant edited comment on SPARK-12484 at 10/12/16 2:04 AM:
--

Was there a resolution to this?  I am also getting this issue in Scala.  I am 
currently using Spark 2.0


was (Author: brantrm):
Was there a resolution to this?  I am also getting this issue in Scala.

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)






[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2016-10-11 Thread Ryan Brant (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567271#comment-15567271
 ] 

Ryan Brant commented on SPARK-12484:


Was there a resolution to this?  I am also getting this issue in Scala.
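One common cause of this error, offered here as an editorial note rather than a confirmed
diagnosis of this ticket: the new Column was built against a different DataFrame than the one
withColumn is called on, so its resolved attribute is missing from the target plan. A minimal
sketch of the usual working pattern (illustrative names; a SparkSession named spark is assumed):

{code}
import org.apache.spark.sql.functions.udf

// Hypothetical data and UDF, only to show the pattern.
val df = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "labelStr")
val transform = udf((s: String) => s.toUpperCase)

// Build the new column from the same DataFrame it is added to.
val transformed = df.withColumn("transformedByUDF", transform(df("labelStr")))
transformed.show()
{code}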

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17870:


Assignee: Apache Spark

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Assignee: Apache Spark
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567244#comment-15567244
 ] 

Apache Spark commented on SPARK-17870:
--

User 'mpjlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/15444

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17870:


Assignee: (was: Apache Spark)

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17881) Aggregation function for generating string histograms

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567226#comment-15567226
 ] 

Apache Spark commented on SPARK-17881:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15443

> Aggregation function for generating string histograms
> -
>
> Key: SPARK-17881
> URL: https://issues.apache.org/jira/browse/SPARK-17881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This agg function generates equi-width histograms for string type columns, 
> with a maximum number of histogram bins. It returns an empty result if the 
> ndv (number of distinct values) of the column exceeds the maximum number 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17881) Aggregation function for generating string histograms

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17881:


Assignee: Apache Spark

> Aggregation function for generating string histograms
> -
>
> Key: SPARK-17881
> URL: https://issues.apache.org/jira/browse/SPARK-17881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> This agg function generates equi-width histograms for string type columns, 
> with a maximum number of histogram bins. It returns an empty result if the 
> ndv (number of distinct values) of the column exceeds the maximum number 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17881) Aggregation function for generating string histograms

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17881:


Assignee: (was: Apache Spark)

> Aggregation function for generating string histograms
> -
>
> Key: SPARK-17881
> URL: https://issues.apache.org/jira/browse/SPARK-17881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This agg function generates equi-width histograms for string type columns, 
> with a maximum number of histogram bins. It returns an empty result if the 
> ndv (number of distinct values) of the column exceeds the maximum number 
> allowed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17881) Aggregation function for generating string histograms

2016-10-11 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-17881:


 Summary: Aggregation function for generating string histograms
 Key: SPARK-17881
 URL: https://issues.apache.org/jira/browse/SPARK-17881
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Zhenhua Wang


This agg function generates equi-width histograms for string type columns, with 
a maximum number of histogram bins. It returns an empty result if the ndv 
(number of distinct values) of the column exceeds the maximum number allowed.
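
To make the intended behaviour concrete, here is a rough sketch only (the function 
name and return type are assumptions, not the proposed implementation): count the 
occurrences of each distinct string and give up with an empty result once the number 
of distinct values exceeds the configured maximum number of bins.

{code}
import org.apache.spark.sql.DataFrame

// Hedged sketch, not the actual aggregate: a string "histogram" as a
// value -> count map, empty when ndv exceeds maxBins.
def stringHistogram(df: DataFrame, column: String, maxBins: Int): Map[String, Long] = {
  val counts = df.groupBy(column).count().collect()
  if (counts.length > maxBins) Map.empty[String, Long]
  else counts.map(r => r.getString(0) -> r.getLong(1)).toMap
}
{code}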



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567199#comment-15567199
 ] 

Apache Spark commented on SPARK-17853:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/15442

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed an OffsetOutOfRangeException 
> reported by the Kafka client. In our scenario we create a single DStream as a 
> union of multiple DStreams, one DStream per Kafka cluster (multi-DC solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After a quick investigation, I found that the class DirectKafkaInputDStream 
> keeps offset state per topic and partition, but it is not aware of the 
> different Kafka clusters.
> For every topic, a single DStream is created as a union over all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
>   def createSource(ssc: StreamingContext, topic: String): DStream[(String, Array[Byte])] = {
>     val streams = configs.map { config =>
>       val kafkaParams = config
>       val kafkaTopics = Set(topic)
>       KafkaUtils.createDirectStream[String, Array[Byte]](
>         ssc,
>         LocationStrategies.PreferConsistent,
>         ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, kafkaParams)
>       ).map { record =>
>         (record.key, record.value)
>       }
>     }
>     ssc.union(streams.toSeq)
>   }
> }
> {code}
> In the end, offsets from one Kafka cluster overwrite offsets from the other. 
> Fortunately an OffsetOutOfRangeException was thrown, because the offsets in the 
> two Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17853:


Assignee: (was: Apache Spark)

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>
> During migration from Spark 1.6 to 2.0 I observed an OffsetOutOfRangeException 
> reported by the Kafka client. In our scenario we create a single DStream as a 
> union of multiple DStreams, one DStream per Kafka cluster (multi-DC solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After a quick investigation, I found that the class DirectKafkaInputDStream 
> keeps offset state per topic and partition, but it is not aware of the 
> different Kafka clusters.
> For every topic, a single DStream is created as a union over all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
>   def createSource(ssc: StreamingContext, topic: String): DStream[(String, Array[Byte])] = {
>     val streams = configs.map { config =>
>       val kafkaParams = config
>       val kafkaTopics = Set(topic)
>       KafkaUtils.createDirectStream[String, Array[Byte]](
>         ssc,
>         LocationStrategies.PreferConsistent,
>         ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, kafkaParams)
>       ).map { record =>
>         (record.key, record.value)
>       }
>     }
>     ssc.union(streams.toSeq)
>   }
> }
> {code}
> In the end, offsets from one Kafka cluster overwrite offsets from the other. 
> Fortunately an OffsetOutOfRangeException was thrown, because the offsets in the 
> two Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17853) Kafka OffsetOutOfRangeException on DStreams union from separate Kafka clusters with identical topic names.

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17853:


Assignee: Apache Spark

> Kafka OffsetOutOfRangeException on DStreams union from separate Kafka 
> clusters with identical topic names.
> --
>
> Key: SPARK-17853
> URL: https://issues.apache.org/jira/browse/SPARK-17853
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Marcin Kuthan
>Assignee: Apache Spark
>
> During migration from Spark 1.6 to 2.0 I observed an OffsetOutOfRangeException 
> reported by the Kafka client. In our scenario we create a single DStream as a 
> union of multiple DStreams, one DStream per Kafka cluster (multi-DC solution). 
> Both Kafka clusters have the same topics and number of partitions.
> After a quick investigation, I found that the class DirectKafkaInputDStream 
> keeps offset state per topic and partition, but it is not aware of the 
> different Kafka clusters.
> For every topic, a single DStream is created as a union over all configured 
> Kafka clusters.
> {code}
> class KafkaDStreamSource(configs: Iterable[Map[String, String]]) {
>   def createSource(ssc: StreamingContext, topic: String): DStream[(String, Array[Byte])] = {
>     val streams = configs.map { config =>
>       val kafkaParams = config
>       val kafkaTopics = Set(topic)
>       KafkaUtils.createDirectStream[String, Array[Byte]](
>         ssc,
>         LocationStrategies.PreferConsistent,
>         ConsumerStrategies.Subscribe[String, Array[Byte]](kafkaTopics, kafkaParams)
>       ).map { record =>
>         (record.key, record.value)
>       }
>     }
>     ssc.union(streams.toSeq)
>   }
> }
> {code}
> In the end, offsets from one Kafka cluster overwrite offsets from the other. 
> Fortunately an OffsetOutOfRangeException was thrown, because the offsets in the 
> two Kafka clusters are significantly different.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567180#comment-15567180
 ] 

Hossein Falaki commented on SPARK-17878:


Sure. If passing a list is possible, it is the better choice. I just don't want 
to block this feature on an API change in Spark SQL.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567170#comment-15567170
 ] 

Peng Meng commented on SPARK-17870:
---

Hi [~avulanov], the question here is not whether to use raw chi2 scores or 
p-values; the question is that if we use raw chi2 scores, the DoF should be the 
same. "The chi2 test is used multiple times" is another problem. According to 
http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html, 
"whenever a statistical test is used multiple times, then the probability of 
getting at least one error increases." This problem is partially solved by 
selecting the p-values corresponding to the family-wise error rate (SelectFwe, 
SPARK-17645). Thanks very much.

Hi [~srowen], I totally agree with your comments. Since the DoF differs across 
the chi-square values Spark computes, we can use the p-values for Spark's 
SelectKBest and SelectPercentile. Thanks very much.

I will submit a PR for this.

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567149#comment-15567149
 ] 

Hyukjin Kwon commented on SPARK-17878:
--

BTW, I will try to investigate further whether it is really possible (setting a 
list as the value in options) if you think both ideas are okay.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567124#comment-15567124
 ] 

Hyukjin Kwon edited comment on SPARK-17878 at 10/12/16 12:50 AM:
-

Oh, I didn't mean that I am against this. I am just wondering whether it is 
possible to deal with this in a more general way. If that is not easy for now, 
I'd rather support this idea, assuming we should deal with this problem. 
(Actually, one of the votes is from me :))


was (Author: hyukjin.kwon):
Oh, I didn't mean I am against this. I am just wondering if it is just possible 
to deal with this in general. If it is not easy for now, I support this idea.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567124#comment-15567124
 ] 

Hyukjin Kwon commented on SPARK-17878:
--

Oh, I didn't mean I am against this. I am just wondering if it is just possible 
to deal with this in general. If it is not easy for now, I support this idea.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567104#comment-15567104
 ] 

Hossein Falaki commented on SPARK-17878:


That would require an API change in Spark SQL. Otherwise, we would need to split 
the string passed to {{nullValue}} on a delimiter. Neither of these is a safe 
choice. I think accepting {{nullValue1}}, {{nullValue2}}, etc. (along with the 
existing {{nullValue}}) is:
* backwards compatible
* clear
* extensible to other options in the future, e.g. quoteCharacter.
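
A rough sketch of the option-collection side, purely to illustrate the idea (none of 
these names are the real {{CSVOptions}} internals):

{code}
// Hedged sketch: gather "nullValue", "nullValue1", "nullValue2", ... into a
// set of null markers; a cell is then treated as null if it matches any of them.
val NullValueOption = "nullValue(\\d*)".r

def collectNullValues(options: Map[String, String]): Set[String] =
  options.collect { case (NullValueOption(_), value) => value }.toSet

def isNullCell(cell: String, nullValues: Set[String]): Boolean =
  nullValues.contains(cell)
{code}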

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567046#comment-15567046
 ] 

Hyukjin Kwon commented on SPARK-17878:
--

Maybe it'd be nicer if options allowed a list or a nested map (if possible). I 
noticed that {{read.csv}} in R can take {{na.strings}} as a list, which is 
currently being mapped to {{nullValue}} as a string.

> Support for multiple null values when reading CSV data
> --
>
> Key: SPARK-17878
> URL: https://issues.apache.org/jira/browse/SPARK-17878
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> There are CSV files out there with multiple values that are supposed to be 
> interpreted as null. As a result, multiple Spark users have asked for this 
> feature to be built into the CSV data source. It can easily be implemented in a 
> backwards-compatible way:
> - Currently the CSV data source supports an option named {{nullValue}}.
> - We can add logic in {{CSVOptions}} to understand option names that match 
> {{nullValue[\d]}}. This way the user can specify one or multiple null values.
> {code}
> val df = spark.read.format("CSV").option("nullValue1", 
> "-").option("nullValue2", "*")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4411) Add "kill" link for jobs in the UI

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567028#comment-15567028
 ] 

Apache Spark commented on SPARK-4411:
-

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/15441

> Add "kill" link for jobs in the UI
> --
>
> Key: SPARK-4411
> URL: https://issues.apache.org/jira/browse/SPARK-4411
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Kay Ousterhout
>
> SPARK-4145 changes the default landing page for the UI to show jobs. We 
> should have a "kill" link for each job, similar to what we have for each 
> stage, so it's easier for users to kill slow jobs (and the semantics of 
> killing a job are slightly different than killing a stage).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17880:


Assignee: (was: Apache Spark)

> The url linking to `AccumulatorV2` in the document is incorrect.
> 
>
> Key: SPARK-17880
> URL: https://issues.apache.org/jira/browse/SPARK-17880
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In `programming-guide.md`, the URL that links to `AccumulatorV2` points to 
> `api/scala/index.html#org.apache.spark.AccumulatorV2`, but 
> `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is the correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566936#comment-15566936
 ] 

Apache Spark commented on SPARK-17880:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/15439

> The url linking to `AccumulatorV2` in the document is incorrect.
> 
>
> Key: SPARK-17880
> URL: https://issues.apache.org/jira/browse/SPARK-17880
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Kousuke Saruta
>Priority: Minor
>
> In `programming-guide.md`, the URL that links to `AccumulatorV2` points to 
> `api/scala/index.html#org.apache.spark.AccumulatorV2`, but 
> `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is the correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17880:


Assignee: Apache Spark

> The url linking to `AccumulatorV2` in the document is incorrect.
> 
>
> Key: SPARK-17880
> URL: https://issues.apache.org/jira/browse/SPARK-17880
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> In `programming-guide.md`, the URL that links to `AccumulatorV2` points to 
> `api/scala/index.html#org.apache.spark.AccumulatorV2`, but 
> `api/scala/index.html#org.apache.spark.util.AccumulatorV2` is the correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17880) The url linking to `AccumulatorV2` in the document is incorrect.

2016-10-11 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-17880:
--

 Summary: The url linking to `AccumulatorV2` in the document is 
incorrect.
 Key: SPARK-17880
 URL: https://issues.apache.org/jira/browse/SPARK-17880
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Kousuke Saruta
Priority: Minor


In `programming-guide.md`, the URL that links to `AccumulatorV2` points to 
`api/scala/index.html#org.apache.spark.AccumulatorV2`, but 
`api/scala/index.html#org.apache.spark.util.AccumulatorV2` is the correct one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-10-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566858#comment-15566858
 ] 

Davies Liu commented on SPARK-15621:


[~rezasafi] We usually do not backport this kind of improvement; it's too large 
and risky for a maintenance release (2.0.x). Sorry for that.

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Assignee: Davies Liu
> Fix For: 2.1.0
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with the 
> identity UDF `lambda x: x`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-10-11 Thread Reza Safi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566752#comment-15566752
 ] 

Reza Safi commented on SPARK-15621:
---

Hi [~davies], can the fix be backported to branch-2.0, since it affects version 
2.0.0 as well? Thank you very much in advance.

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Assignee: Davies Liu
> Fix For: 2.1.0
>
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with the 
> identity UDF `lambda x: x`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2016-10-11 Thread K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566741#comment-15566741
 ] 

K commented on SPARK-16845:
---

We manually rewrote the parts that were throwing errors (StringIndexer and 
FeatureAssembler) using the RDD API and then converted the result back to a 
DataFrame to run RandomForestClassifier.

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, ML, MLlib
>Affects Versions: 2.0.0
>Reporter: hejie
>
> I have a wide table (400 columns). When I try fitting the training data on all 
> columns, the fatal error below occurs.
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17387) Creating SparkContext() from python without spark-submit ignores user conf

2016-10-11 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17387.

   Resolution: Fixed
 Assignee: Jeff Zhang
Fix Version/s: 2.1.0

> Creating SparkContext() from python without spark-submit ignores user conf
> --
>
> Key: SPARK-17387
> URL: https://issues.apache.org/jira/browse/SPARK-17387
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Consider the following scenario: user runs a python application not through 
> spark-submit, but by adding the pyspark module and manually creating a Spark 
> context. Kinda like this:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from pyspark import SparkContext
> >>> from pyspark import SparkConf
> >>> conf = SparkConf().set("spark.driver.memory", "4g")
> >>> sc = SparkContext(conf=conf)
> {noformat}
> If you look at the JVM launched by the pyspark code, it ignores the user's 
> configuration:
> {noformat}
> $ ps ax | grep $(pgrep -f SparkSubmit)
> 12283 pts/2Sl+0:03 /apps/java7/bin/java -cp ... -Xmx1g 
> -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit pyspark-shell
> {noformat}
> Note the "1g" of memory. If instead you use "pyspark", you get the correct 
> "4g" in the JVM.
> This also affects other configs; for example, you can't really add jars to 
> the driver's classpath using "spark.jars".
> You can work around this by setting the undocumented env variable Spark 
> itself uses:
> {noformat}
> $ SPARK_HOME=$PWD PYTHONPATH=python:python/lib/py4j-0.10.3-src.zip python
> Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import os
> >>> os.environ['PYSPARK_SUBMIT_ARGS'] = "pyspark-shell --conf 
> >>> spark.driver.memory=4g"
> >>> from pyspark import SparkContext
> >>> sc = SparkContext()
> {noformat}
> But it would be nicer if the configs were automatically propagated.
> BTW the reason for this is that the {{launch_gateway}} function used to start 
> the JVM does not take any parameters, and the only place where it reads 
> arguments for Spark is that env variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false,   /tmp/check still contains only 1 
folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false,   /tmp/check/ contains only 1 
folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false,   /tmp/check still contains only 
> 1 folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9879) OOM in LIMIT clause with large number

2016-10-11 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566639#comment-15566639
 ] 

Dongjoon Hyun commented on SPARK-9879:
--

Hi, all.
The PR seems to have been closed last December.
Can we close this issue if it is no longer reproducible?

> OOM in LIMIT clause with large number
> -
>
> Key: SPARK-9879
> URL: https://issues.apache.org/jira/browse/SPARK-9879
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> {code}
> create table spark.tablsetest as select * from dpa_ord_bill_tf order by 
> member_id limit 2000;
> {code}
>  
> {code}
> spark-sql --driver-memory 48g --executor-memory 24g --driver-java-options 
> -XX:PermSize=1024M -XX:MaxPermSize=2048M
> Error logs
> 15/07/27 10:22:43 ERROR ActorSystemImpl: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-20]shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
> at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
> at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.close(Output.java:165)
> at 
> org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)
> at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:139)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$writeObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:65)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1239)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition.writeObject(ParallelCollectionRDD.scala:51)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
> at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)
> at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:168)
> at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:467)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:231)
> 15/07/27 10:22:43 ERROR ErrorMonitor: Uncaught fatal error from thread 
> [sparkDriver-akka.actor.default-dispatcher-20]shutting down ActorSystem 
> [sparkDriver]
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> at java.util.Arrays.copyOf(Arrays.java:2271)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
> at 
> java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1852)
> at java.io.ObjectOutputStream.write(ObjectOutputStream.java:708)
> at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:134)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.close(Output.java:165)
> at 
> org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:162)
> at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:139)
> at 
> org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$writeObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:65)
> at 

[jira] [Created] (SPARK-17879) Don't compact metadata logs constantly into a single compacted file

2016-10-11 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17879:
---

 Summary: Don't compact metadata logs constantly into a single 
compacted file
 Key: SPARK-17879
 URL: https://issues.apache.org/jira/browse/SPARK-17879
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Affects Versions: 2.0.1
Reporter: Burak Yavuz


With metadata log compaction, we compact all files into a single file every "n" 
batches. The problem is that, over time, this single file becomes huge, and 
constantly writing it out in the driver could become an issue.

It would be a good idea to cap the compacted file size, so that we don't end up 
writing huge files in the driver.
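
To make the proposal concrete, a hedged sketch of the decision logic only (the names 
and structure are assumptions, not the actual compaction code): compact on the usual 
interval, but roll over to a fresh compact file once the previous one has grown past 
a size cap.

{code}
// Hedged sketch: cap the size of the compacted metadata file.
def shouldStartNewCompactFile(batchId: Long,
                              compactInterval: Int,
                              currentCompactBytes: Long,
                              maxCompactBytes: Long): Boolean =
  batchId % compactInterval == 0 && currentCompactBytes >= maxCompactBytes
{code}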



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566586#comment-15566586
 ] 

Sean Owen commented on SPARK-17870:
---

If the degrees of freedom are the same across the tests, then ranking on 
p-value or statistic should give the same ranking because the p-value is a 
monotonically decreasing function of the statistic. That's the case in what the 
scikit code is effectively doing because there are always (# label classes - 1) 
degrees of freedom. Really the p-value is the comparable quantity, but there's 
no point computing it in this case because it's just for ranking.

The Spark code performs a chi-squared test but applies it to answer a different 
question, where DOF is no longer the same; it's (# label classes - 1) * (# 
feature classes - 1) in the contingency table here. p-value is no longer always 
smaller when the statistic is larger. So it's necessary to actually use the 
p-values for what Spark is doing.
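
As a purely illustrative sketch (toy data and an existing SparkContext {{sc}} are 
assumptions; nothing here is from the ticket), ranking features by p-value with the 
existing MLlib API looks like this; the point is only that the sort key is 
{{pValue}} rather than the raw {{statistic}}:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Toy data: Statistics.chiSqTest returns one ChiSqTestResult per feature.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 8.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 8.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 9.0))))

val results = Statistics.chiSqTest(data)

// Rank feature indices by p-value (smaller is better) instead of by raw
// statistic, since the degrees of freedom can differ from feature to feature.
val rankedFeatures = results.zipWithIndex.sortBy { case (r, _) => r.pValue }.map(_._2)
{code}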

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method used to compute ChiSqTestResult in mllib/feature/ChiSqSelector.scala 
> (line 233) is wrong.
> The feature selection method ChiSquareSelector is based on 
> ChiSqTestResult.statistic (the chi-square value): it selects the features with 
> the largest chi-square values. But the degrees of freedom (df) of the chi-square 
> values differ across features in Statistics.chiSqTest(RDD), and with different 
> df you cannot rank features by the raw chi-square value.
> Because of this wrong way of computing the chi-square value, the feature 
> selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If we use selectKBest, feature 3 is selected.
> If we use selectFpr, features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters.
> With selectKBest, feature 1 is selected.
> With selectFpr, features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df of each feature is the 
> same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17878) Support for multiple null values when reading CSV data

2016-10-11 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-17878:
--

 Summary: Support for multiple null values when reading CSV data
 Key: SPARK-17878
 URL: https://issues.apache.org/jira/browse/SPARK-17878
 Project: Spark
  Issue Type: Story
  Components: SQL
Affects Versions: 2.0.1
Reporter: Hossein Falaki


There are CSV files out there with multiple values that are supposed to be 
interpreted as null. As a result, multiple Spark users have asked for this 
feature to be built into the CSV data source. It can easily be implemented in a 
backwards-compatible way:

- Currently the CSV data source supports an option named {{nullValue}}.
- We can add logic in {{CSVOptions}} to understand option names that match 
{{nullValue[\d]}}. This way the user can specify one or multiple null values.

{code}
val df = spark.read.format("CSV").option("nullValue1", 
"-").option("nullValue2", "*")
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17455) IsotonicRegression takes non-polynomial time for some inputs

2016-10-11 Thread Nic Eggert (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Eggert updated SPARK-17455:
---
Priority: Major  (was: Minor)

> IsotonicRegression takes non-polynomial time for some inputs
> 
>
> Key: SPARK-17455
> URL: https://issues.apache.org/jira/browse/SPARK-17455
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.2, 2.0.0
>Reporter: Nic Eggert
>
> The Pool Adjacent Violators Algorithm (PAVA) implementation that's currently 
> in MLlib can take O(N!) time for certain inputs, when it should have 
> worst-case complexity of O(N^2).
> To reproduce this, I pulled the private method poolAdjacentViolators out of 
> mllib.regression.IsotonicRegression and into a benchmarking harness.
> Given this input
> {code}
> val x = (1 to length).toArray.map(_.toDouble)
> val y = x.reverse.zipWithIndex.map{ case (yi, i) => if (i % 2 == 1) yi - 1.5 
> else yi}
> val w = Array.fill(length)(1d)
> val input: Array[(Double, Double, Double)] = (y zip x zip w) map{ case ((y, 
> x), w) => (y, x, w)}
> {code}
> I vary the length of the input to get these timings:
> || Input Length || Time (us) ||
> | 100 | 1.35 |
> | 200 | 3.14 | 
> | 400 | 116.10 |
> | 800 | 2134225.90 |
> (tests were performed using 
> https://github.com/sirthias/scala-benchmarking-template)
> I can also confirm that I run into this issue on a real dataset I'm working 
> on when trying to calibrate random forest probability output. Some partitions 
> take > 12 hours to run. This isn't a skew issue, since the largest partitions 
> finish in minutes. I can only assume that some partitions cause something 
> approaching this worst-case complexity.
> I'm working on a patch that borrows the implementation that is used in 
> scikit-learn and the R "iso" package, both of which handle this particular 
> input in linear time and are quadratic in the worst case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-11 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566541#comment-15566541
 ] 

Hossein Falaki commented on SPARK-17781:


[~shivaram] Thanks for looking into it. I think the problem applies to 
{{dapply}} as well. For example this fails:
{code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> collect(dapply(df, function(x) {data.frame(res = x$date)}, schema = 
> structType(structField("res", "date"))))
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 52.0 failed 4 times, most recent failure: Lost task 0.3 in stage 52.0 
(TID 10114, 10.0.229.211): java.lang.RuntimeException: java.lang.Double is not 
a valid external type for schema of date
{code}

I spent a few hours getting to the root of it. We have the correct type all the 
way until {{readList}} in {{deserialize.R}}. I instrumented that function. We 
get the correct type from {{readObject()}} but once it is placed in the list it 
loses its type.

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17863) SELECT distinct does not work if there is a order by clause

2016-10-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17863:
-
Description: 
{code}
select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by struct.a, struct.b
{code}
This query generates
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
{code}
The plan is wrong because the analyzer somehow added {{struct#21805}} to the 
project list, which changes the semantics of the distinct (basically, the query 
is changed from {{select distinct struct.a, struct.b}} to {{select distinct 
struct.a, struct.b, struct}}).
{code}
== Parsed Logical Plan ==
'Sort ['struct.a ASC, 'struct.b ASC], true
+- 'Distinct
   +- 'Project ['struct.a, 'struct.b]
  +- 'SubqueryAlias tmp
 +- 'Union
:- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
:  +- OneRowRelation$
+- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
   +- OneRowRelation$

== Analyzed Logical Plan ==
a: int, b: int
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Distinct
  +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
struct#21805]
 +- SubqueryAlias tmp
+- Union
   :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
   :  +- OneRowRelation$
   +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
  +- OneRowRelation$

== Optimized Logical Plan ==
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
struct#21805]
  +- Union
 :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
 :  +- OneRowRelation$
 +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
+- OneRowRelation$

== Physical Plan ==
*Project [a#21819, b#21820]
+- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
   +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
  +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
output=[a#21819, b#21820, struct#21805])
 +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
+- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 
functions=[], output=[a#21819, b#21820, struct#21805])
   +- Union
  :- *Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS 
struct#21805]
  :  +- Scan OneRowRelation[]
  +- *Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS 
struct#21806]
 +- Scan OneRowRelation[]
{code}
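An untested workaround sketch (an assumption, not verified against this bug): compute 
the distinct over plain columns in a subquery, so the outer {{order by}} never 
references the struct directly:
{code}
val workaround = spark.sql("""
  select a, b from (
    select distinct struct.a as a, struct.b as b
    from (
      select named_struct('a', 1, 'b', 2, 'c', 3) as struct
      union all
      select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
  ) t
  order by a, b
""")
{code}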

  was:
{code}
select distinct struct.a, struct.b
from (
  select named_struct('a', 1, 'b', 2, 'c', 3) as struct
  union all
  select named_struct('a', 1, 'b', 2, 'c', 4) as struct) tmp
order by struct.a, struct.b
{code}
This query generates
{code}
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  2|
+---+---+
{code}
The plan is wrong
{code}
== Parsed Logical Plan ==
'Sort ['struct.a ASC, 'struct.b ASC], true
+- 'Distinct
   +- 'Project ['struct.a, 'struct.b]
  +- 'SubqueryAlias tmp
 +- 'Union
:- 'Project ['named_struct(a, 1, b, 2, c, 3) AS struct#21805]
:  +- OneRowRelation$
+- 'Project ['named_struct(a, 1, b, 2, c, 4) AS struct#21806]
   +- OneRowRelation$

== Analyzed Logical Plan ==
a: int, b: int
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Distinct
  +- Project [struct#21805.a AS a#21819, struct#21805.b AS b#21820, 
struct#21805]
 +- SubqueryAlias tmp
+- Union
   :- Project [named_struct(a, 1, b, 2, c, 3) AS struct#21805]
   :  +- OneRowRelation$
   +- Project [named_struct(a, 1, b, 2, c, 4) AS struct#21806]
  +- OneRowRelation$

== Optimized Logical Plan ==
Project [a#21819, b#21820]
+- Sort [struct#21805.a ASC, struct#21805.b ASC], true
   +- Aggregate [a#21819, b#21820, struct#21805], [a#21819, b#21820, 
struct#21805]
  +- Union
 :- Project [1 AS a#21819, 2 AS b#21820, [1,2,3] AS struct#21805]
 :  +- OneRowRelation$
 +- Project [1 AS a#21819, 2 AS b#21820, [1,2,4] AS struct#21806]
+- OneRowRelation$

== Physical Plan ==
*Project [a#21819, b#21820]
+- *Sort [struct#21805.a ASC, struct#21805.b ASC], true, 0
   +- Exchange rangepartitioning(struct#21805.a ASC, struct#21805.b ASC, 200)
  +- *HashAggregate(keys=[a#21819, b#21820, struct#21805], functions=[], 
output=[a#21819, b#21820, struct#21805])
 +- Exchange hashpartitioning(a#21819, b#21820, struct#21805, 200)
+- *HashAggregate(keys=[a#21819, b#21820, struct#21805], 

[jira] [Comment Edited] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566515#comment-15566515
 ] 

Harish edited comment on SPARK-17463 at 10/11/16 8:34 PM:
--

My second approach was:
def testfunc(keys, vals, columnsToStandardize):
   df= pd.DataFrame(vals, columns = keys)
   df[columnsToStandardize] = df[columnsToStandardize] - 
df[columnsToStandardize].mean() 

df3.rdd.map(keys).groupByKey().flatMap(lambda keyval: testfunc(keys[0], 
keys[1], columnsToStandardize))


was (Author: harishk15):
My second approach was:
def testfunc(keys, vals, columnsToStandardize):
   df= pd.DataFrame(vals, columns = keys)
   df[columnsToStandardize] = df[columnsToStandardize] - 
df[columnsToStandardize].mean() 

df3.rdd.map(keys).groupByKey().flatMap(lambda keyval: testfunc(keys[0], 
keys[1], columnsToStandardize))

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> 

[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566515#comment-15566515
 ] 

Harish commented on SPARK-17463:


My second approach was:
def testfunc(keys, vals, columnsToStandardize):
   df= pd.DataFrame(vals, columns = keys)
   df[columnsToStandardize] = df[columnsToStandardize] - 
df[columnsToStandardize].mean() 

df3.rdd.map(keys).groupByKey().flatMap(lambda keyval: testfunc(keys[0], 
keys[1], columnsToStandardize))

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> 

[jira] [Assigned] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17845:


Assignee: Apache Spark  (was: Reynold Xin)

> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from a SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicit specification 
> of begin/end even in the case of an unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}
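A purely hypothetical sketch of how Option 2's decoupled boundaries might be modeled 
(this is not Spark's actual API; each call narrows one end of an otherwise unbounded 
frame):
{code}
final case class RowFrame(start: Long = Long.MinValue, end: Long = Long.MaxValue) {
  def rowsFrom(n: Long): RowFrame = copy(start = n) // frame start, relative to the current row
  def rowsTo(n: Long): RowFrame   = copy(end = n)   // frame end, relative to the current row
  def rowsFromCurrent(): RowFrame = rowsFrom(0L)
  def rowsToCurrent(): RowFrame   = rowsTo(0L)
}

object RowFrame {
  def rowsFrom(n: Long): RowFrame = RowFrame().rowsFrom(n)
  def rowsTo(n: Long): RowFrame   = RowFrame().rowsTo(n)
}

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
val between = RowFrame.rowsFrom(-3).rowsTo(3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
val upToThreePreceding = RowFrame.rowsTo(-3)
{code}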



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17845:


Assignee: Reynold Xin  (was: Apache Spark)

> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from a SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicit specification 
> of begin/end even in the case of an unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566509#comment-15566509
 ] 

Apache Spark commented on SPARK-17845:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15438

> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue means "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from a SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicit specification 
> of begin/end even in the case of an unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17857) SHOW TABLES IN schema throws exception if schema doesn't exist

2016-10-11 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-17857.
-
Resolution: Not A Problem

Although the behavior changed from 1.x, it is better to close this issue as 
'NOT A PROBLEM' for Spark 2.x. I'm closing it now.

You can reopen it if you need to.
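Since the 2.x behavior stays, callers that want the old 1.x-style empty result can 
guard the query themselves. A sketch of such a client-side check (an assumption about 
caller code, not a Spark change):
{code}
import org.apache.spark.sql.Row

val dbName = "badschema"
val exists = spark.catalog.listDatabases().collect().exists(_.name == dbName)
val tables: Array[Row] =
  if (exists) spark.sql(s"SHOW TABLES IN $dbName").collect() else Array.empty[Row]
{code}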

> SHOW TABLES IN schema throws exception if schema doesn't exist
> --
>
> Key: SPARK-17857
> URL: https://issues.apache.org/jira/browse/SPARK-17857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Todd Nemet
>Priority: Minor
>
> SHOW TABLES IN badschema; throws 
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException if badschema 
> doesn't exist. In Spark 1.x it would return an empty result set.
> On Spark 2.0.1:
> {code}
> [683|12:45:56] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10006/ -n hive
> Connecting to jdbc:hive2://localhost:10006/
> 16/10/10 12:46:00 INFO jdbc.Utils: Supplied authorities: localhost:10006
> 16/10/10 12:46:00 INFO jdbc.Utils: Resolved authority: localhost:10006
> 16/10/10 12:46:00 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10006/
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1.spark2 by Apache Hive
> 0: jdbc:hive2://localhost:10006/> show schemas;
> +---+--+
> | databaseName  |
> +---+--+
> | default   |
> | looker_scratch|
> | spark_jira|
> | spark_looker_scratch  |
> | spark_looker_test |
> +---+--+
> 5 rows selected (0.61 seconds)
> 0: jdbc:hive2://localhost:10006/> show tables in spark_looker_test;
> +--+--+--+
> |  tableName   | isTemporary  |
> +--+--+--+
> | all_types| false|
> | order_items  | false|
> | orders   | false|
> | users| false|
> +--+--+--+
> 4 rows selected (0.611 seconds)
> 0: jdbc:hive2://localhost:10006/> show tables in badschema;
> Error: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: 
> Database 'badschema' not found; (state=,code=0)
> {code}
> On Spark 1.6.2:
> {code}
> [680|12:47:26] ~/Documents/spark/spark$ bin/beeline -u 
> jdbc:hive2://localhost:10005/ -n hive
> Connecting to jdbc:hive2://localhost:10005/
> 16/10/10 12:47:29 INFO jdbc.Utils: Supplied authorities: localhost:10005
> 16/10/10 12:47:29 INFO jdbc.Utils: Resolved authority: localhost:10005
> 16/10/10 12:47:30 INFO jdbc.HiveConnection: Will try to open client transport 
> with JDBC Uri: jdbc:hive2://localhost:10005/
> Connected to: Spark SQL (version 1.6.2)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1.spark2 by Apache Hive
> 0: jdbc:hive2://localhost:10005/> show schemas;
> ++--+
> |   result   |
> ++--+
> | default|
> | spark_jira |
> | spark_looker_test  |
> | spark_scratch  |
> ++--+
> 4 rows selected (0.613 seconds)
> 0: jdbc:hive2://localhost:10005/> show tables in spark_looker_test;
> +--+--+--+
> |  tableName   | isTemporary  |
> +--+--+--+
> | all_types| false|
> | order_items  | false|
> | orders   | false|
> | users| false|
> +--+--+--+
> 4 rows selected (0.575 seconds)
> 0: jdbc:hive2://localhost:10005/> show tables in badschema;
> ++--+--+
> | tableName  | isTemporary  |
> ++--+--+
> ++--+--+
> No rows selected (0.458 seconds)
> {code}
> [Relevant part of Hive QL 
> docs|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowTables]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566467#comment-15566467
 ] 

Alexander Ulanov commented on SPARK-17870:
--

[`SelectKBest`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest)
works with "a Function taking two arrays X and y, and returning a pair of 
arrays (scores, pvalues) or a single array with scores". According to what you 
observe, it uses pvalues for sorting the `chi2` outputs. Indeed, that is the case 
for all functions that return two arrays: 
https://github.com/scikit-learn/scikit-learn/blob/412996f/sklearn/feature_selection/univariate_selection.py#L331.
Alternatively, one can use the raw `chi2` scores for sorting, by passing only 
the first array from `chi2` to `SelectKBest`. As far as I remember, using raw 
chi2 scores is the default in Weka's 
[ChiSquaredAttributeEval](http://weka.sourceforge.net/doc.stable/weka/attributeSelection/ChiSquaredAttributeEval.html).
So, I would not claim that either approach is incorrect. According to 
[Introduction to 
IR](http://nlp.stanford.edu/IR-book/html/htmledition/assessing-as-a-feature-selection-methodassessing-chi-square-as-a-feature-selection-method-1.html),
there might be an issue with computing p-values, because the chi-squared test is 
used multiple times. Using plain chi2 values does not involve a statistical test, so 
it can be treated as just a ranking with no statistical implications.
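To make the difference concrete, here is a small sketch of the two rankings on the 
MLlib side (a usage illustration only, not the {{ChiSqSelector}} internals): sorting 
by the raw statistic versus by the p-value, which can disagree when features have 
different degrees of freedom:
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

def topKByStatistic(data: RDD[LabeledPoint], k: Int): Array[Int] =
  Statistics.chiSqTest(data).zipWithIndex
    .sortBy { case (res, _) => -res.statistic }.take(k).map(_._2)

def topKByPValue(data: RDD[LabeledPoint], k: Int): Array[Int] =
  Statistics.chiSqTest(data).zipWithIndex
    .sortBy { case (res, _) => res.pValue }.take(k).map(_._2)
{code}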

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The method to count ChiSqureTestResult in mllib/feature/ChiSqSelector.scala  
> (line 233) is wrong.
> For feature selection method ChiSquareSelector, it is based on the 
> ChiSquareTestResult.statistic (ChiSqure value) to select the features. It 
> select the features with the largest ChiSqure value. But the Degree of 
> Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and 
> for different df, you cannot base on ChiSqure value to select features.
> Because of the wrong method to count ChiSquare value, the feature selection 
> results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If use selectKBest to select: the feature 3 will be selected.
> If use selectFpr to select: feature 1 and 2 will be selected. 
> This is strange. 
> I use scikit learn to test the same data with the same parameters. 
> When use selectKBest to select: feature 1 will be selected. 
> When use selectFpr to select: feature 1 and 2 will be selected. 
> This result is make sense. because the df of each feature in scikit learn is 
> the same.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17781) datetime is serialized as double inside dapply()

2016-10-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566457#comment-15566457
 ] 

Shivaram Venkataraman commented on SPARK-17781:
---

[~falaki] I looked at this a bit more today and it looks like this problem is 
specific to dapplyCollect -- what happens here is that we don't have the schema 
for the output table, so we serialize it as a byteArray [1] and rely on the 
driver to do the conversion / deserialization while running collect. I couldn't 
trace this part to the end, but it looks like this gets deserialized in [2], and 
the call to unserialize there interprets the bytes as double instead of date. 
I'm not sure what a good fix for this would be, either.


[1] https://github.com/apache/spark/blob/master/R/pkg/inst/worker/worker.R#L75
[2] https://github.com/apache/spark/blob/master/R/pkg/R/DataFrame.R#L1431

> datetime is serialized as double inside dapply()
> 
>
> Key: SPARK-17781
> URL: https://issues.apache.org/jira/browse/SPARK-17781
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> When we ship a SparkDataFrame to workers for dapply family functions, inside 
> the worker DateTime objects are serialized as double.
> To reproduce:
> {code}
> df <- createDataFrame(data.frame(id = 1:10, date = Sys.Date()))
> dapplyCollect(df, function(x) { return(x$date) })
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11784) enable Timestamp filter pushdown

2016-10-11 Thread Ian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566439#comment-15566439
 ] 

Ian commented on SPARK-11784:
-

Yes, I meant TimestampType filter pushdown

> enable Timestamp filter pushdown
> 
>
> Key: SPARK-11784
> URL: https://issues.apache.org/jira/browse/SPARK-11784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4411) Add "kill" link for jobs in the UI

2016-10-11 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566430#comment-15566430
 ] 

Alex Bozarth commented on SPARK-4411:
-

I'm currently working on this.
I'm updating the original PR to work with the latest code and to match how the 
kill-stage code works.

> Add "kill" link for jobs in the UI
> --
>
> Key: SPARK-4411
> URL: https://issues.apache.org/jira/browse/SPARK-4411
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Kay Ousterhout
>
> SPARK-4145 changes the default landing page for the UI to show jobs. We 
> should have a "kill" link for each job, similar to what we have for each 
> stage, so it's easier for users to kill slow jobs (and the semantics of 
> killing a job are slightly different than killing a stage).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566427#comment-15566427
 ] 

Harish edited comment on SPARK-17463 at 10/11/16 8:06 PM:
--

No, I don't have any code like that. I use PySpark. Please find my code snippet:
df1 with 60 columns (70M records)
df2 with 3000-7000 (varies) columns (10M)
join df1 and df2 on the key columns (please note df1 is more granular data and 
df2 is one level above, so the data set will grow)

df3 = df1.join(df2, [keys])
aggList = [func.mean(col).alias(col + '_m') for col in df2.columns]

The last part is: df4 = df3.groupBy(keys).agg(*aggList) -- applying mean to 
each column of the df3 data frame, which might be 3000-1 columns.
Let me know if you need the entire stack trace for this issue.

PS: We still have the issue https://issues.apache.org/jira/browse/SPARK-16845, so 
I have to break the columns into chunks of 500.

 


was (Author: harishk15):
No i dont have any code like that. I use pyspark .. Please find my code snippet
df1 with 60 columns (70M records)
df2  with 3000-7000 (varies) columns (10M)
join df1 and df2 with key columns (please note df1 is more granular data and 
df2 one level above. So data set will grow

df3 = df1.join(df2, [keys])
aggList = [func.mean(col).alias(col + '_m') for col in df2.columns]

Last part is i do -- df4 = df3.groupBy(keys).agg(*aggList) --I applying mean to 
each column of the df3 data frame which might be 3000-1 columns.

PS: We still have issue https://issues.apache.org/jira/browse/SPARK-16845 -- So 
i have to break number of columns 500 chunks

 

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> 

[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566427#comment-15566427
 ] 

Harish commented on SPARK-17463:


No, I don't have any code like that. I use PySpark. Please find my code snippet:
df1 with 60 columns (70M records)
df2 with 3000-7000 (varies) columns (10M)
join df1 and df2 on the key columns (please note df1 is more granular data and 
df2 is one level above, so the data set will grow)

df3 = df1.join(df2, [keys])
aggList = [func.mean(col).alias(col + '_m') for col in df2.columns]

The last part is: df4 = df3.groupBy(keys).agg(*aggList) -- applying mean to 
each column of the df3 data frame, which might be 3000-1 columns.

PS: We still have the issue https://issues.apache.org/jira/browse/SPARK-16845, so 
I have to break the columns into chunks of 500.

 

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> 

[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2016-10-11 Thread Jerome Scheuring (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566349#comment-15566349
 ] 

Jerome Scheuring edited comment on SPARK-12216 at 10/11/16 7:59 PM:


_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10, compiled with Scala 2.11 and running 
under Spark 2.0.1.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

_Update:_  The bug also does not occur when run on the installation of Spark 
2.0.1 on the Windows 10 machine running inside "Bash on Ubuntu on Windows", 
i.e. the Linux subsystem running on the Windows 10 machine where the bug _does_ 
occur when the program is executed from Windows.

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}


was (Author: jerome.scheuring):
_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10, compiled with Scala 2.11 and running 
under Spark 2.0.1.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>  

[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false,   /tmp/check/ contains only 1 
folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
-> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint   // /tmp/check/b1f46ba5-357a-4d6d-8f4d-411b64b27c2f appears
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false,   /tmp/check/ contains only 1 
> folder b1f46ba5-357a-4d6d-8f4d-411b64b27c2f
> {code}
> I think the last line should return true instead of false
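A small diagnostic sketch that could help narrow this down (it continues the snippet 
above and is an assumption about where to look, not a fix): probe the member RDDs 
directly and compare them with the graph-level flag:
{code}
println(s"graph checkpointed:       ${gg.isCheckpointed}")
println(s"vertices checkpointed:    ${gg.vertices.isCheckpointed}")
println(s"edges checkpointed:       ${gg.edges.isCheckpointed}")
println(s"vertices checkpoint file: ${gg.vertices.getCheckpointFile}")
println(s"edges checkpoint file:    ${gg.edges.getCheckpointFile}")
{code}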



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566413#comment-15566413
 ] 

Harish commented on SPARK-17463:


OK, thanks for the update. Do we have any workaround for the second part of the 
issue? I tried set("spark.rpc.netty.dispatcher.numThreads","2") but no luck.
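
For reference, a minimal sketch of how a setting like the one above is typically applied when building the session (illustrative only; the property is an internal RPC setting read at context startup, and as noted it did not resolve the executor-side race):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// The dispatcher thread count must be set before the SparkContext is created.
val conf = new SparkConf()
  .setAppName("heartbeat-serialization-test")
  .set("spark.rpc.netty.dispatcher.numThreads", "2")
val spark = SparkSession.builder.config(conf).getOrCreate()
{code}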


> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 

[jira] [Updated] (SPARK-17816) JSON serialization of accumulators is failing with ConcurrentModificationException

2016-10-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17816:
-
Fix Version/s: 2.0.2

> JSON serialization of accumulators is failing with 
> ConcurrentModificationException
> --
>
> Key: SPARK-17816
> URL: https://issues.apache.org/jira/browse/SPARK-17816
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ergin Seyfe
>Assignee: Ergin Seyfe
> Fix For: 2.0.2, 2.1.0
>
>
> This is the stack trace: See  {{ConcurrentModificationException}}:
> {code}
> java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
> at java.util.ArrayList$Itr.next(ArrayList.java:851)
> at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
> at scala.collection.AbstractTraversable.to(Traversable.scala:104)
> at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
> at 
> org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
> at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
> at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17816) JSON serialization of accumulators is failing with ConcurrentModificationException

2016-10-11 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17816:
-
Affects Version/s: 2.0.1

> JSON serialization of accumulators is failing with 
> ConcurrentModificationException
> --
>
> Key: SPARK-17816
> URL: https://issues.apache.org/jira/browse/SPARK-17816
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ergin Seyfe
>Assignee: Ergin Seyfe
> Fix For: 2.0.2, 2.1.0
>
>
> This is the stack trace: See  {{ConcurrentModificationException}}:
> {code}
> java.util.ConcurrentModificationException
> at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
> at java.util.ArrayList$Itr.next(ArrayList.java:851)
> at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
> at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
> at scala.collection.AbstractTraversable.to(Traversable.scala:104)
> at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
> at scala.collection.AbstractTraversable.toList(Traversable.scala:104)
> at 
> org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:314)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$5.apply(JsonProtocol.scala:291)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:291)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> org.apache.spark.util.JsonProtocol$$anonfun$taskInfoToJson$12.apply(JsonProtocol.scala:283)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:283)
> at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
> at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:137)
> at 
> org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:157)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus.doPostEvent(LiveListenerBus.scala:35)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at 
> org.apache.spark.scheduler.LiveListenerBus.postToAll(LiveListenerBus.scala:35)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:81)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:66)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:65)
> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1244)
> at 
> org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566385#comment-15566385
 ] 

Alexander Pivovarov edited comment on SPARK-17877 at 10/11/16 7:50 PM:
---

Other open issues related to checkpointing are SPARK-14804 and SPARK-12431


was (Author: apivovarov):
Another open issue with checkpointing is SPARK-14804

> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17812) More granular control of starting offsets (assign)

2016-10-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17812:
-
Summary: More granular control of starting offsets (assign)  (was: More 
granular control of starting offsets)

> More granular control of starting offsets (assign)
> --
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566391#comment-15566391
 ] 

Shixiong Zhu commented on SPARK-17463:
--

Do you have a reproducer? I saw `at 
java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)`
in the stack trace, so I think the internal ArrayList is being accessed somewhere 
concurrently. Did you use `collectionAccumulator` in your code?

FYI, https://github.com/apache/spark/pull/15371 is for SPARK-17816, which fixes 
an issue on the driver side.
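
For context, a minimal sketch of the kind of usage being asked about (hypothetical user code, not from the reporter). {{collectionAccumulator}} is backed by a synchronized {{java.util.List}}, which is consistent with the {{Collections$SynchronizedCollection.writeObject}} frame mentioned above:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("collection-acc-example").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Values added from tasks can race with the heartbeat thread that serializes
// accumulator updates, which is the thread-safety issue discussed in this ticket.
val badRecords = sc.collectionAccumulator[String]("badRecords")
sc.parallelize(1 to 100000).foreach { i =>
  if (i % 1000 == 0) badRecords.add(s"suspicious value: $i")
}
println(badRecords.value.size())
{code}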

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> 

[jira] [Commented] (SPARK-17812) More granular control of starting offsets

2016-10-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566392#comment-15566392
 ] 

Michael Armbrust commented on SPARK-17812:
--

For the seeking back {{X}} offsets use case, I was interactively querying a 
stream and I wanted *some* data, but not *all available data*.  I did not have 
specific offsets in mind, and under the assumption that items are getting 
hashed across partitions, X offsets back is a very reasonable proxy for time.  
I agree actual time would be better.  However, since there is disagreement on 
this case, I'd propose we break that out into its own ticket and focus on 
assign here.

I'm not sure I understand the concern with the {{startingOffsets}} option naming 
(which we can still change, though it would be nice to do so before a release 
happens). It affects which offsets will be included in the query, and it only 
takes effect when the query is first started. [~ofirm], currently we support (1) 
(though I wouldn't say *all* data, as we are limited by retention / compaction) 
and (2). As you said, we can also support (3), though this must be done after 
the fact by adding a predicate on the timestamp column to the stream. For 
performance it would be nice to push that down into Kafka, but I'd split that 
optimization into another ticket.

Regarding (4), I like the proposed JSON solution. It would be nice if this were 
unified with whatever format we decide to use in [SPARK-17829] so that you could 
easily pick up where another query left off. I'd also suggest we use {{-1}} and 
{{-2}} as special offsets for subscribing to a topicpartition at the earliest or 
latest available offsets at query start time.
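
To make the proposal concrete, a sketch of what an assign-plus-JSON-offsets query could look like, assuming a {{SparkSession}} named {{spark}} (the option names and the meaning of the {{-1}}/{{-2}} sentinels are assumptions based on this discussion, not a settled API):

{code}
// Hypothetical illustration of per-partition assignment with JSON starting offsets.
// Here -2 and -1 stand for "earliest" and "latest" for that partition (assumed convention).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("assign", """{"topicA":[0,1],"topicB":[0]}""")
  .option("startingOffsets", """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""")
  .load()
{code}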

> More granular control of starting offsets
> -
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek back {{X}} offsets in the stream from the moment the query starts
>  - seek to user specified offsets



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17812) More granular control of starting offsets

2016-10-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17812:
-
Description: 
Right now you can only run a Streaming Query starting from either the earliest 
or latest offsets available at the moment the query is started. Sometimes this 
is a lot of data. It would be nice to be able to do the following:
 - seek to user specified offsets for manually specified topicpartitions

  was:
Right now you can only run a Streaming Query starting from either the earliest 
or latest offsets available at the moment the query is started. Sometimes this 
is a lot of data. It would be nice to be able to do the following:
 - seek back {{X}} offsets in the stream from the moment the query starts
 - seek to user specified offsets


> More granular control of starting offsets
> -
>
> Key: SPARK-17812
> URL: https://issues.apache.org/jira/browse/SPARK-17812
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>
> Right now you can only run a Streaming Query starting from either the 
> earliest or latests offsets available at the moment the query is started.  
> Sometimes this is a lot of data.  It would be nice to be able to do the 
> following:
>  - seek to user specified offsets for manually specified topicpartitions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566385#comment-15566385
 ] 

Alexander Pivovarov commented on SPARK-17877:
-

Another open issue with checkpointing is SPARK-14804

> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566380#comment-15566380
 ] 

Timothy Hunter commented on SPARK-17845:


I like the {{Window.rowsBetween(Long.MinValue, -3)}} syntax, but it is exposing 
a system implementation detail. How about having some static/singleton values 
that define our notion of plus/minus infinity instead of relying on the system 
values?

Here is a suggestion:

{code}
Window.rowsBetween(Window.unboundedBefore, -3)

object Window {
  def unboundedBefore: Long = Int.MinValue.toLong
}
{code}

To get around the varying sizes of ints across languages, I suggest we say that 
every value at or beyond 2^31 in magnitude is considered unbounded (preceding 
for negative values, following for positive ones). That should be more than 
enough and covers at least Python, Scala, R, and Java.
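
As a sketch of that convention (my own illustration of the normalization; the names are hypothetical, not an API proposal beyond what is written above):

{code}
// Illustrative only: any boundary whose magnitude reaches 2^31 is treated as unbounded.
val UnboundedThreshold: Long = 1L << 31   // 2^31

def normalizeBoundary(b: Long): Long =
  if (b <= -UnboundedThreshold) Long.MinValue      // unbounded preceding
  else if (b >= UnboundedThreshold) Long.MaxValue  // unbounded following
  else b
{code}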


> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicitly specification 
> of begin/end even in the case of unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15153.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15431
[https://github.com/apache/spark/pull/15431]

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes will 
> throw an error. This bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable type can be string or numeric. If it is a 
> string, RFormula transforms it into a DoubleType label via StringIndexer and 
> sets the corresponding column metadata; otherwise, RFormula uses it directly 
> as the label when training the model (and assumes it is numbered 0, ..., 
> maxLabelIndex).
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle them 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]
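
For illustration, a hedged Scala sketch of the label handling the description argues for, i.e. only assume string-indexed labels when the column metadata says the attribute is nominal (the names here are illustrative, not the actual NaiveBayesWrapper code):

{code}
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

def labelValues(data: DataFrame, labelCol: String): Array[String] = {
  Attribute.fromStructField(data.schema(labelCol)) match {
    case nominal: NominalAttribute =>
      // string response: labels were produced by StringIndexer and carry metadata
      nominal.values.getOrElse(Array.empty[String])
    case _ =>
      // numeric response: the column values already serve as labels 0, 1, ..., maxLabelIndex
      data.select(labelCol).distinct().collect().map(_.get(0).toString).sorted
  }
}
{code}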



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed  // res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed  // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L -> "kelly"))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List(3L -> "lucas", 7L -> "john", 5L -> "matt", 2L 
> -> "kelly"))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)
Alexander Pivovarov created SPARK-17877:
---

 Summary: Can not checkpoint connectedComponents resulting graph
 Key: SPARK-17877
 URL: https://issues.apache.org/jira/browse/SPARK-17877
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.0.1, 1.6.2, 1.5.2
Reporter: Alexander Pivovarov
Priority: Minor


The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates the issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates the issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents()
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17877) Can not checkpoint connectedComponents resulting graph

2016-10-11 Thread Alexander Pivovarov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Pivovarov updated SPARK-17877:

Description: 
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false

  was:
The following code demonstrates an issue
{code}
import org.apache.spark.graphx._
val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", 
"postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"
val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), 
Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
sc.setCheckpointDir("/tmp/check")

val g = Graph(users, rel)
g.checkpoint

val gg = g.connectedComponents()
gg.checkpoint

gg.vertices.collect
gg.edges.collect
gg.isCheckpointed
// res5: Boolean = false
{code}
I think the last line should return true instead of false


> Can not checkpoint connectedComponents resulting graph
> --
>
> Key: SPARK-17877
> URL: https://issues.apache.org/jira/browse/SPARK-17877
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
>Reporter: Alexander Pivovarov
>Priority: Minor
>
> The following code demonstrates an issue
> {code}
> import org.apache.spark.graphx._
> val users = sc.parallelize(List((3L, ("lucas", "student")), (7L, ("john", "postdoc")), (5L, ("matt", "prof")), (2L, ("kelly", "prof"))))
> val rel = sc.parallelize(List(Edge(3L, 7L, "collab"), Edge(5L, 3L, 
> "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
> sc.setCheckpointDir("/tmp/check")
> val g = Graph(users, rel)
> g.checkpoint
> val gg = g.connectedComponents()
> gg.checkpoint
> gg.vertices.collect
> gg.edges.collect
> gg.isCheckpointed
> // res5: Boolean = false
> {code}
> I think the last line should return true instead of false



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12216) Spark failed to delete temp directory

2016-10-11 Thread Jerome Scheuring (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566349#comment-15566349
 ] 

Jerome Scheuring edited comment on SPARK-12216 at 10/11/16 7:34 PM:


_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines 
running both Windows 8.1 and Windows 10, with code compiled against Scala 2.11 
and running under Spark 2.0.1.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}


was (Author: jerome.scheuring):
_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 

[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2016-10-11 Thread Jerome Scheuring (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566349#comment-15566349
 ] 

Jerome Scheuring commented on SPARK-12216:
--

_Note that I am entirely new to the process of submitting issues on this 
system: if this needs to be a new issue, I would appreciate someone letting me 
know._

A bug very similar to this one is 100% reproducible across multiple machines, 
running both Windows 8.1 and Windows 10.

It occurs

* in Scala, but not Python (have not tried R)
* only when reading CSV files (and not, for example, when reading Parquet files)
* only when running local, not submitted to a cluster

This program will produce the bug (if {{poemData}} is defined per the 
commented-out section, rather than being read from a CSV file, the bug does not 
occur):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object SparkBugDemo {
  def main(args: Array[String]): Unit = {

val poemSchema = StructType(
  Seq(
StructField("label",IntegerType), 
StructField("line",StringType)
  )
)

val sparkSession = SparkSession.builder()
  .appName("Spark Bug Demonstration")
  .master("local[*]")
  .getOrCreate()

//val poemData = sparkSession.createDataFrame(Seq(
//  (0, "There's many a strong farmer"),
//  (0, "Who's heart would break in two"),
//  (1, "If he could see the townland"),
//  (1, "That we are riding to;")
//)).toDF("label", "line")

val poemData = sparkSession.read
  .option("quote", value="")
  .schema(poemSchema)
  .csv(args(0))

println(s"Record count: ${poemData.count()}")

  }
}
{code}

Assuming that {{args(0)}} contains the path to a file with comma-separated 
integer/string pairs, as in:

{noformat}
0,There's many a strong farmer
0,Who's heart would break in two
1,If he could see the townland
1,That we are riding to;
{noformat}

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.5.2
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> 

[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566299#comment-15566299
 ] 

Sean Owen commented on SPARK-17463:
---

No, that change came after, and is part of a different JIRA that addresses 
another part of the same problem. It is not in 2.0.1

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 

[jira] [Commented] (SPARK-17870) ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong

2016-10-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566284#comment-15566284
 ] 

Sean Owen commented on SPARK-17870:
---

OK, I get it; they're really doing different things. The scikit-learn version 
computes the statistic for count-valued features vs. a categorical label, while 
the Spark version computes it for categorical features vs. categorical labels. 
Although the number of label classes is constant in both cases, the Spark 
computation also depends on the number of feature classes. Yes, it does 
need to be changed in Spark.
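
As an illustration of why ranking by the raw statistic breaks down when the 
degrees of freedom differ, here is a hedged sketch against the public MLlib API 
(assuming a spark-shell where {{sc}} is available; this is not the eventual 
fix): it ranks the same features by raw chi-square statistic and by p-value, 
and the two orderings can disagree when features have different numbers of 
distinct values.

{code}
// Hedged sketch: compare feature rankings by raw statistic vs. by p-value.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

// Feature 0 has three distinct values, feature 1 has two: different df per feature.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0))))

val results = Statistics.chiSqTest(data)  // one ChiSqTestResult per feature

// Ranking by the raw statistic ignores that df differs between features...
val byStatistic = results.zipWithIndex.sortBy { case (r, _) => -r.statistic }.map(_._2)
// ...whereas p-values are comparable across different df.
val byPValue = results.zipWithIndex.sortBy { case (r, _) => r.pValue }.map(_._2)

results.zipWithIndex.foreach { case (r, i) =>
  println(s"feature $i: statistic=${r.statistic}, df=${r.degreesOfFreedom}, pValue=${r.pValue}")
}
{code}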

> ML/MLLIB: ChiSquareSelector based on Statistics.chiSqTest(RDD) is wrong 
> 
>
> Key: SPARK-17870
> URL: https://issues.apache.org/jira/browse/SPARK-17870
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Peng Meng
>Priority: Critical
>
> The way the chi-square test results (ChiSqTestResult) are computed and used 
> in mllib/feature/ChiSqSelector.scala (line 233) is wrong.
> The feature selection method ChiSquareSelector ranks features by 
> ChiSqTestResult.statistic (the chi-square value) and selects the features 
> with the largest chi-square values. But the degrees of freedom (df) of the 
> chi-square values returned by Statistics.chiSqTest(RDD) differ from feature 
> to feature, and chi-square values with different df are not directly 
> comparable, so they cannot be used on their own to select features.
> Because of this, the feature selection results are strange.
> Take the test suite in ml/feature/ChiSqSelectorSuite.scala as an example:
> If selectKBest is used: feature 3 is selected.
> If selectFpr is used: features 1 and 2 are selected.
> This is strange.
> I used scikit-learn to test the same data with the same parameters:
> With selectKBest: feature 1 is selected.
> With selectFpr: features 1 and 2 are selected.
> This result makes sense, because in scikit-learn the df is the same for 
> every feature.
> I plan to submit a PR for this problem.
>  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566251#comment-15566251
 ] 

Joseph K. Bradley commented on SPARK-15153:
---

Note I'm setting the target version for 2.1, not 2.0.x, since the fix requires 
a public API change in the preceding PR.

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes 
> throws an error. The bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable can be of string or numeric type. If it is 
> a string, RFormula transforms it into a label of DoubleType via StringIndexer 
> and sets the corresponding column metadata; otherwise, RFormula uses it 
> directly as the label when training the model (and assumes it is numbered 
> from 0, ..., maxLabelIndex).
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle both cases 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]
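
A hedged sketch of the label-extraction handling described above (illustrative 
only; the helper name extractLabels is hypothetical and this is not the actual 
patch): when the response is a string, StringIndexer records the label values 
as NominalAttribute metadata on the label column, while a numeric response 
carries no such metadata, so a blind cast fails with the ClassCastException 
shown, and both cases need to be handled.

{code}
// Hedged sketch only; names and structure are illustrative, not Spark's NaiveBayesWrapper.
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

def extractLabels(df: DataFrame, labelCol: String): Array[String] =
  Attribute.fromStructField(df.schema(labelCol)) match {
    case nominal: NominalAttribute =>
      // String response: StringIndexer stored the original labels in the column metadata.
      nominal.values.getOrElse(Array.empty)
    case _ =>
      // Numeric response: no nominal metadata, so derive the distinct label values from the data.
      df.select(labelCol).distinct().collect().map(_.getDouble(0)).sorted.map(_.toString)
  }
{code}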



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15153) SparkR spark.naiveBayes throws error when label is numeric type

2016-10-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15153:
--
Target Version/s: 2.1.0

> SparkR spark.naiveBayes throws error when label is numeric type
> ---
>
> Key: SPARK-15153
> URL: https://issues.apache.org/jira/browse/SPARK-15153
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> When the label of the dataset is of numeric type, SparkR spark.naiveBayes 
> throws an error. The bug is easy to reproduce:
> {code}
> t <- as.data.frame(Titanic)
> t1 <- t[t$Freq > 0, -5]
> t1$NumericSurvived <- ifelse(t1$Survived == "No", 0, 1)
> t2 <- t1[-4]
> df <- suppressWarnings(createDataFrame(sqlContext, t2))
> m <- spark.naiveBayes(df, NumericSurvived ~ .)
> 16/05/05 03:26:17 ERROR RBackendHandler: fit on 
> org.apache.spark.ml.r.NaiveBayesWrapper failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
>   java.lang.ClassCastException: 
> org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to 
> org.apache.spark.ml.attribute.NominalAttribute
>   at 
> org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:66)
>   at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at io.netty.channel.AbstractChannelHandlerContext.invo
> {code}
> In RFormula, the response variable can be of string or numeric type. If it is 
> a string, RFormula transforms it into a label of DoubleType via StringIndexer 
> and sets the corresponding column metadata; otherwise, RFormula uses it 
> directly as the label when training the model (and assumes it is numbered 
> from 0, ..., maxLabelIndex).
> When we extract labels in ml.r.NaiveBayesWrapper, we should handle both cases 
> according to the type of the response variable (string or numeric).
> cc [~mengxr] [~josephkb]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2016-10-11 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566244#comment-15566244
 ] 

Cody Koeninger commented on SPARK-17344:


How long would it take CDH to distribute 0.10 if there was a compelling Spark 
client for it?

How are you going to handle SSL?

You can't avoid the complexity of caching consumers if you still want the 
benefits of prefetching, and doing an SSL handshake for every batch will kill 
performance if they aren't cached.
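
(For reference, a hedged sketch of the consumer-caching pattern referred to 
here, written against the kafka-clients 0.10 API; the object name ConsumerCache 
and its shape are illustrative, not Spark's actual cache: one long-lived 
consumer per group and topic-partition per executor, so the connection and any 
SSL handshake are paid once rather than on every batch.)

{code}
// Hedged sketch of per-executor consumer caching; not Spark's implementation.
import java.{util => ju}
import scala.collection.mutable
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object ConsumerCache {
  private val cache =
    mutable.HashMap.empty[(String, TopicPartition), KafkaConsumer[Array[Byte], Array[Byte]]]

  def getOrCreate(groupId: String, tp: TopicPartition,
                  kafkaParams: ju.Map[String, Object]): KafkaConsumer[Array[Byte], Array[Byte]] =
    synchronized {
      cache.getOrElseUpdate((groupId, tp), {
        // Created once per executor; re-connecting (and re-doing the SSL handshake)
        // for every batch is what kills throughput.
        val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](kafkaParams)
        consumer.assign(ju.Collections.singletonList(tp))  // manual assignment, no group rebalancing
        consumer
      })
    }
}
{code}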

Also note that this is a prime example of what I'm talking about in my dev 
mailing list discussion on SIPs. This issue has been raised multiple times, and 
each time the decision went against continuing support for 0.8.

You guys started making promises about structured streaming for Kafka over half 
a year ago, and still don't have it feature complete.  This is a big potential 
detour for uncertain gain.  The real underlying problem is still how you're 
going to do better than simply wrapping a DStream, and I don't see how this is 
directly relevant.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17817) PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes

2016-10-11 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-17817.
--
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.1.0

> PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes
> ---
>
> Key: SPARK-17817
> URL: https://issues.apache.org/jira/browse/SPARK-17817
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1
>Reporter: Mike Dusenberry
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> Calling {{repartition}} on a PySpark RDD to increase the number of partitions 
> results in highly skewed partition sizes, with most having 0 rows.  The 
> {{repartition}} method should evenly spread out the rows across the 
> partitions, and this behavior is correctly seen on the Scala side.
> Please reference the following code for a reproducible example of this issue:
> {code}
> # Python
> num_partitions = 2
> a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
> l = a.repartition(num_partitions).glom().map(len).collect()  # length of each partition
> min(l), max(l), sum(l)/len(l), len(l)  # skewed!
>
> // Scala
> val numPartitions = 2
> val a = sc.parallelize(0 until 1e6.toInt, 2)  // start with 2 even partitions
> val l = a.repartition(numPartitions).glom().map(_.length).collect()  // length of each partition
> println((l.min, l.max, l.sum / l.length, l.length))  // even!
> {code}
> The issue here is that highly skewed partitions can result in severe memory 
> pressure in subsequent steps of a processing pipeline, resulting in OOM 
> errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-10-11 Thread Harish (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566206#comment-15566206
 ] 

Harish commented on SPARK-17463:


Is this fix part of the https://github.com/apache/spark/pull/15371 pull 
request? I have 2.0.1 in my cluster but I am getting both errors.

16/10/11 00:53:42 WARN NettyRpcEndpointRef: Error sending message [message = 
Heartbeat(2,[Lscala.Tuple2;@43f45f95,BlockManagerId(2, HOST, 43256))] in 1 
attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at 
org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1877)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject(ArrayList.java:766)
at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
at 
java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at 
