Additional fix for Avro IncompatibleClassChangeError (SPARK-3039)

2015-02-02 Thread M. Dale
SPARK-3039 ("Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API") was marked resolved with the Spark 1.2.0 release. However, when I download the pre-built Spark distro for Hadoop 2.4 and later (spark-1.2.0-bin-hadoop2.4.tgz) and run it against Avro code compiled against
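
For anyone hitting the same IncompatibleClassChangeError, a commonly cited workaround (a sketch under assumptions, not taken from this thread; version numbers are illustrative) is to depend explicitly on the hadoop2-classified avro-mapred artifact in the application build, e.g. in sbt:

    // build.sbt (hypothetical application build)
    libraryDependencies ++= Seq(
      // avro-mapred built against the new (hadoop2) mapreduce API, not the hadoop1 default
      "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2",
      "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"
    )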

Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
It's already fixed in the master branch. Sorry that we forgot to update this before releasing 1.2.0 and caused you trouble... Cheng On 2/2/15 2:03 PM, ankits wrote: Great, thank you very much. I was confused because this is in the docs:

Re: Get size of rdd in memory

2015-02-02 Thread Cheng Lian
Actually |SchemaRDD.cache()| behaves exactly the same as |cacheTable| since Spark 1.2.0. The reason why your web UI didn’t show you the cached table is that both |cacheTable| and |sql(SELECT ...)| are lazy :-) Simply add a |.collect()| after the |sql(...)| call. Cheng On 2/2/15 12:23 PM,
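
A minimal sketch of the point being made (class and table names assumed for illustration): both cacheTable and sql(...) are lazy, so nothing appears under the web UI's Storage tab until an action runs:

    // spark-shell against Spark 1.2; `sc` is provided by the shell
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    case class KV(key: Int, value: String)
    val rdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))
    rdd.registerTempTable("records")

    sqlContext.cacheTable("records")            // lazy: marks the table, materializes nothing
    sql("SELECT * FROM records").collect()      // the action scans the table and fills the cache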

Re: Get size of rdd in memory

2015-02-02 Thread ankits
Great, thank you very much. I was confused because this is in the docs: https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the branch-1.2 branch, https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md Note that if you call schemaRDD.cache() rather than

Performance test for sort shuffle

2015-02-02 Thread Kannan Rajah
Is there a recommended performance test for sort based shuffle? Something similar to terasort on Hadoop. I couldn't find one on the spark-perf code base. https://github.com/databricks/spark-perf -- Kannan
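
In the absence of a canned spark-perf job, one crude baseline (only a sketch; data sizes and partition counts are arbitrary) is to time a large sortByKey() with the sort shuffle manager set explicitly:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD implicits, needed in compiled 1.2 apps

    // "sort" is already the default shuffle manager in Spark 1.2; set explicitly for clarity
    val conf = new SparkConf()
      .setAppName("SortShuffleBaseline")
      .set("spark.shuffle.manager", "sort")
    val sc = new SparkContext(conf)

    val data = sc.parallelize(1L to 100000000L, 200).map(i => (scala.util.Random.nextLong(), i))
    val t0 = System.nanoTime()
    data.sortByKey().count()                    // forces a full shuffle plus sort
    println(s"sort took ${(System.nanoTime() - t0) / 1e9}s")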

Re: Spark Master Maven with YARN build is broken

2015-02-02 Thread Patrick Wendell
It's my fault, I'm sending a hot fix now. On Mon, Feb 2, 2015 at 1:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/ Is this a known issue? It seems to have

[spark-sql] JsonRDD

2015-02-02 Thread Daniil Osipov
Hey Spark developers, Is there a good reason for JsonRDD being a Scala object as opposed to a class? Seems most other RDDs are classes and can be extended. The reason I'm asking is that there is a problem with Hive interoperability with JSON DataFrames, where jsonFile generates case-sensitive

Re: Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
I'm asking from an experimental standpoint; this is not happening anytime soon. Of course, if the experiment turns out very well, Pants would replace both sbt and Maven (like it has at Twitter, for example). Pants also works with IDEs http://pantsbuild.github.io/index.html#using-pants-with. On

Re: Building Spark with Pants

2015-02-02 Thread Stephen Boesch
There is a significant investment in sbt and maven - and they are not at all likely to be going away. A third build tool? Note that there is also the perspective of building within an IDE - which actually works presently for sbt and, with a little bit of tweaking, for maven as well. 2015-02-02

Re: [spark-sql] JsonRDD

2015-02-02 Thread Reynold Xin
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util methods. The case sensitivity issues seem orthogonal, and would be great to be able to control that with a flag. On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Hey Spark developers, Is
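
The object-vs-class point is plain Scala rather than anything Spark-specific: a singleton object is not a type you can subclass, so its helpers cannot be overridden. A generic illustration (all names made up):

    // a singleton object: its methods cannot be overridden from outside
    object JsonUtil {
      def inferName(field: String): String = field.toLowerCase
    }
    // class MyJsonUtil extends JsonUtil {}     // does not compile: JsonUtil is not a type

    // a class leaves the door open for extension
    class JsonInferrer {
      def inferName(field: String): String = field.toLowerCase
    }
    class CaseSensitiveInferrer extends JsonInferrer {
      override def inferName(field: String): String = field   // preserve original casing
    }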

Re: Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
To reiterate, I'm asking from an experimental perspective. I'm not proposing we change Spark to build with Pants or anything like that. I'm interested in trying Pants out and I'm wondering if anyone else shares my interest or already has experience with Pants that they can share. On Mon Feb 02

Building Spark with Pants

2015-02-02 Thread Nicholas Chammas
Does anyone here have experience with Pants http://pantsbuild.github.io/index.html or interest in trying to build Spark with it? Pants has an interesting story. It was born at Twitter to help them build their Scala, Java, and Python projects as several independent components in one monolithic

Temporary jenkins issue

2015-02-02 Thread Patrick Wendell
Hey All, I made a change to the Jenkins configuration that caused most builds to fail (attempting to enable a new plugin), I've reverted the change effective about 10 minutes ago. If you've seen recent build failures like below, this was caused by that change. Sorry about that. ERROR:

Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Krishna Sankar
+1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 11:13 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11 2. Tested pyspark, mllib - running as well as comparing results with 1.1.x / 1.2.0 2.1.

[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
This is cancelled in favor of RC3. On Mon, Feb 2, 2015 at 8:50 PM, Patrick Wendell pwend...@gmail.com wrote: The windows issue reported only affects actually running Spark on Windows (not job submission). However, I agree it's worth cutting a new RC. I'm going to cancel this vote and propose

Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-02-02 Thread Patrick Wendell
The windows issue reported only affects actually running Spark on Windows (not job submission). However, I agree it's worth cutting a new RC. I'm going to cancel this vote and propose RC3 with a single additional patch. Let's try to vote that through so we can ship Spark 1.2.1. - Patrick On Sat,

[VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-02 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1! The tag to be voted on is v1.2.1-rc3 (commit b6eaf77): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97 The release files, including signatures, digests, etc.

IDF for ml pipeline

2015-02-02 Thread masaki rikitoku
Hi all, I am trying the ML pipeline for text classification now. Recently, I succeeded in executing the pipeline processing in the ml package, which consists of an original Japanese tokenizer, HashingTF, and LogisticRegression. Then, I failed to execute the pipeline with the IDF in the mllib package directly.
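
For context, a minimal sketch of the kind of spark.ml pipeline being described, against the alpha API as of Spark 1.2 (the custom Japanese tokenizer is stood in for by the stock Tokenizer; column names and `training` are assumptions):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // stand-in for the custom Japanese tokenizer from the thread
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)   // `training`: an assumed SchemaRDD with text/label columns

The mllib IDF, unlike the stages above, is not an ml Transformer, which is consistent with it not slotting into setStages directly.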

Re: Performance test for sort shuffle

2015-02-02 Thread Ewan Higgs
Hi Kannan, I have a branch here: https://github.com/ehiggs/spark/tree/terasort The code is in the examples. I don't do any fancy partitioning so it could be made quicker, I'm sure. But it should be a good baseline. I have a WIP PR for spark-perf but I'm having trouble building it there[1].

Re: Get size of rdd in memory

2015-02-02 Thread ankits
Thanks for your response. So AFAICT calling parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count() will allow me to see the size of the SchemaRDD in memory, and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will show me the size of a regular RDD. But

Can spark provide an option to start reduce stage early?

2015-02-02 Thread Xuelin Cao
In hadoop MR, there is an option *mapred.reduce.slowstart.completed.maps* which can be used to start the reducer stage when X% of the mappers have completed. By doing this, the data shuffling process can run in parallel with the map process. In a large multi-tenancy cluster, this option is usually tuned
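
For reference, the Hadoop MR knob being cited is per-job configuration; a sketch (the 0.50 value is illustrative, and Spark itself exposes no equivalent option, which is the point of the question):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.Job

    val conf = new Configuration()
    // launch reducers once 50% of the map tasks have completed (Hadoop's default is 0.05)
    conf.set("mapred.reduce.slowstart.completed.maps", "0.50")
    val job = Job.getInstance(conf, "slowstart-example")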

Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard that some users require multi-tenant support in standalone mode, like multi-user management, resource management and isolation, and whitelisting of users. Seems current

Re: Questions about Spark standalone resource scheduler

2015-02-02 Thread Patrick Wendell
Hey Jerry, I think standalone mode will still add more features over time, but the goal isn't really for it to become equivalent to what Mesos/YARN are today. Or at least, I doubt Spark Standalone will ever attempt to manage _other_ frameworks outside of Spark and become a general purpose

RE: Questions about Spark standalone resource scheduler

2015-02-02 Thread Shao, Saisai
Hi Patrick, Thanks a lot for your detailed explanation. For now we have such requirements: whitelisting of application submitters, per-user resource (CPU, memory) quotas, and resource allocation in Spark standalone mode. These are quite specific requirements for production use; generally these problem