Re: java.lang.NoClassDefFoundError, is this a bug?

2016-09-22 Thread Xiang Gao
Yes, I mean local here. Thanks for pointing this out. Also thanks for explaining the problem.

Re: [SPARK-15717][GraphX] status

2016-09-22 Thread Reynold Xin
Did you try the proposed fix? Would be good to know whether it fixes the issue.

On Thu, Sep 22, 2016 at 2:49 PM, Asher Krim wrote:
> Does anyone know what the status of SPARK-15717 is? It's a simple enough
> looking PR, but there has been no activity on it since June 16th.

[SPARK-15717][GraphX] status

2016-09-22 Thread Asher Krim
Does anyone know what the status of SPARK-15717 is? It's a simple enough looking PR, but there has been no activity on it since June 16th. I believe that we are hitting that bug with checkpointed distributed LDA. It's a blocker for us and we would really appreciate getting it fixed. Jira:

Re: What's the use of RangePartitioner.hashCode

2016-09-22 Thread Jakob Odersky
Hash codes should try to avoid collisions between objects that are not equal. Integer overflow is not an issue by itself.

On Wed, Sep 21, 2016 at 10:49 PM, WangJianfei wrote:
> Thank you very much sir! But what I want to know is whether the hashcode
> overflow will
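
For illustration, a minimal self-contained sketch (hypothetical RangeKey class, not Spark's actual RangePartitioner) showing why overflow is harmless as long as equal objects hash equally:

class RangeKey(val lower: Int, val upper: Int) {
  override def equals(other: Any): Boolean = other match {
    case that: RangeKey => lower == that.lower && upper == that.upper
    case _              => false
  }
  // 31 * lower can overflow Int; the JVM wraps it silently, so equal
  // objects still produce equal hash codes, which is the only contract
  // hashCode must uphold. Collisions of unequal objects only cost speed.
  override def hashCode: Int = 31 * lower + upper
}

object HashDemo extends App {
  val a = new RangeKey(Int.MaxValue, 42)
  val b = new RangeKey(Int.MaxValue, 42)
  println(a.hashCode == b.hashCode) // true, despite the overflow
}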

Re: R docs no longer building for branch-2.0

2016-09-22 Thread Shivaram Venkataraman
I looked into this and found the problem. I will send a PR now to fix it. If you are curious about what is happening here: when we build the docs separately, we don't have the JAR files from the Spark build in the same tree. We recently added a new set of docs in SparkR, called an R vignette, that

Re: R docs no longer building for branch-2.0

2016-09-22 Thread Sean Owen
FWIW it worked for me, but I may not be executing the same thing. I was running the commands given in R/DOCUMENTATION.md, and it succeeded in creating the vignette on branch-2.0. Maybe it's a version or library issue? What R version do you have installed, and are you up to date with packages like

A Spark resource scheduling order question

2016-09-22 Thread Bi Linfeng
Hi, I have a question about Spark's resource scheduling order from reading this code: github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala In the schedule() function, Spark starts drivers first, then starts executors. I'm wondering why we schedule in this order.
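
For intuition, a toy model of that ordering (hypothetical names and numbers, not the real Master code): executors exist to serve an application, and the application's driver must be up before executors can register with it, so drivers get resources first.

case class Worker(id: Int, var coresFree: Int)

object ScheduleOrderDemo extends App {
  val workers = Seq(Worker(1, 8), Worker(2, 8))

  // Drivers are placed first: an executor has nothing to connect to
  // until its application's driver is running.
  workers.find(_.coresFree >= 1).foreach { w =>
    w.coresFree -= 1
    println(s"driver launched on worker ${w.id}")
  }

  // Executors then take whatever cores remain.
  for (w <- workers if w.coresFree >= 4) {
    w.coresFree -= 4
    println(s"executor launched on worker ${w.id}")
  }
}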

Deserializing InternalRow using a case class - how to avoid creating attrs manually?

2016-09-22 Thread Jacek Laskowski
Hi, I've just discovered* that I can SerDe my case classes. What a nice feature, which I can use in spark-shell too! Thanks a lot for offering me so much fun! What I don't really like about the code is the following part (esp. that it conflicts with the implicit for Column): import
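
For reference, a minimal sketch of the round trip, assuming the Spark 2.0-era internal ExpressionEncoder API (toRow/fromRow; not a stable public interface). Calling resolveAndBind() with no arguments defaults to the encoder's own schema attributes, which avoids building the attribute list by hand:

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Person(name: String, age: Int)

object EncoderDemo extends App {
  val enc = ExpressionEncoder[Person]()

  // Serialize the case class to Catalyst's InternalRow...
  val row = enc.toRow(Person("Ada", 36))

  // ...and deserialize it back; resolveAndBind() with no arguments uses
  // the encoder's own schema, so no manual attribute list is needed.
  val person = enc.resolveAndBind().fromRow(row)
  println(person) // Person(Ada,36)
}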

Open source Spark based projects

2016-09-22 Thread tahirhn
I am planning to write a thesis on certain aspects (i.e., testing, performance optimisation, security) of Apache Spark. I need to study some projects that are based on Apache Spark and are available as open source. If you know of any such project (an open source Spark-based project), please share it

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-22 Thread Sean Owen
There can be just one published version of the Spark artifacts and they have to depend on something, though in truth they'd be binary-compatible with anything 2.2+. So you merely manage the dependency versions up to the desired version in your build.

On Thu, Sep 22, 2016 at 7:05 AM, Olivier Girardot <
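
If you happen to build with sbt rather than Maven, the equivalent of managing the version yourself looks roughly like this (version numbers illustrative):

// build.sbt — depend on Spark as published, then pin the Hadoop client
// artifact to the version you actually deploy against.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0"
dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "2.6.0"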

Re: Memory usage by Spark jobs

2016-09-22 Thread Jörn Franke
You should also take into account that Spark has different options for representing data in memory, such as Java serialized objects, Kryo serialized objects, Tungsten (columnar, optionally compressed), etc. The Tungsten format depends heavily on the underlying data and sorting, especially if compressed. Then,
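
A minimal sketch of how those representations are selected in practice (app name and sizes are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingDemo extends App {
  val spark = SparkSession.builder()
    .appName("caching-demo")
    .master("local[*]")
    // Kryo is usually more compact than Java serialization for RDD data.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()

  val rdd = spark.sparkContext.parallelize(1 to 100000)
  rdd.persist(StorageLevel.MEMORY_ONLY) // deserialized objects: fast access, more heap
  // StorageLevel.MEMORY_ONLY_SER would store serialized bytes instead:
  // more compact, but pays CPU on every access.

  // Cached Datasets/DataFrames use Tungsten's compressed columnar format.
  val df = spark.range(100000).toDF("id").cache()

  println(rdd.count() + df.count())
  spark.stop()
}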

Re: CSV Reader with row numbers

2016-09-22 Thread Hemant Bhanawat
zipWithIndex is fine. It will give you unique row IDs across your various partitions. You can also use zipWithUniqueId, which saves the extra job that zipWithIndex fires. However, there are some differences in how indexes are assigned to the rows. You can read more about the two APIs in the
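
A short sketch contrasting the two (local master and toy data are illustrative):

import org.apache.spark.sql.SparkSession

object RowIdDemo extends App {
  val spark = SparkSession.builder().appName("row-ids").master("local[2]").getOrCreate()

  // Two partitions so the id assignment schemes are visible.
  val lines = spark.sparkContext.parallelize(Seq("a", "b", "c", "d"), 2)

  // Consecutive 0-based ids, but runs an extra job first to count the
  // elements in each partition.
  lines.zipWithIndex().collect().foreach(println)    // (a,0) (b,1) (c,2) (d,3)

  // Single pass, ids unique but not consecutive: element i of partition p
  // gets id p + i * numPartitions.
  lines.zipWithUniqueId().collect().foreach(println) // (a,0) (b,2) (c,1) (d,3)

  spark.stop()
}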

R docs no longer building for branch-2.0

2016-09-22 Thread Reynold Xin
I'm working on packaging the 2.0.1 RC but encountered a problem: the R docs fail to build. Can somebody take a look at the issue ASAP?

** knitting documentation of write.parquet
** knitting documentation of write.text
** knitting documentation of year
~/workspace/spark-release-docs/spark/R

Memory usage by Spark jobs

2016-09-22 Thread Hemant Bhanawat
I am working on profiling TPCH queries for Spark 2.0. I see a lot of temporary object creation (sometimes as large as the data itself), which is justified for the kind of processing Spark does. But, from a production perspective, is there a guideline on how much memory should be allocated for
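
Not a sizing guideline, but for context, a sketch of the main knobs in Spark 2.0's unified memory manager (values illustrative; spark.executor.memory is normally passed to spark-submit on a cluster rather than set in code):

import org.apache.spark.sql.SparkSession

object MemoryKnobsDemo extends App {
  val spark = SparkSession.builder()
    .appName("memory-knobs")
    .master("local[*]")
    // Heap per executor.
    .config("spark.executor.memory", "4g")
    // Fraction of usable heap shared by execution and storage (default 0.6).
    .config("spark.memory.fraction", "0.6")
    // Portion of that fraction protected for cached blocks (default 0.5).
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()

  // Run the workload here and watch the Storage/Executors tabs in the UI.
  spark.stop()
}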