Re: Biggest spark.akka.framesize possible

2013-12-08 Thread Matei Zaharia
Hey Matt, This setting shouldn’t really affect groupBy operations, because they don’t go through Akka. The frame size setting is for messages from the master to workers (specifically, sending out tasks), and for results that go directly from workers to the application (e.g. collect()). So it
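For readers configuring this in the Spark 0.8 era: settings like this were plain Java system properties read when the SparkContext was constructed. The key is spelled spark.akka.frameSize in the docs of that period, and the unit is megabytes; the 1024 below is purely illustrative, not a recommendation. A minimal sketch:

```scala
// Sketch: setting the Akka frame size, Spark 0.8 style, where config
// keys were Java system properties read at SparkContext construction.
// Value is in megabytes; 1024 here is illustrative only.
System.setProperty("spark.akka.frameSize", "1024")

// A SparkContext created after this point would pick the value up:
//   val sc = new SparkContext("local", "frame-size-demo")
val frameSizeMb = System.getProperty("spark.akka.frameSize").toInt
println(s"frame size: $frameSizeMb MB")
```

As the thread notes, this bounds task descriptions sent to workers and results returned to the driver, not the shuffle data moved by transformations like groupBy.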

Re: Spark Import Issue

2013-12-08 Thread Matei Zaharia
I’m not sure you can have a star inside that quoted classpath argument (the double quotes may cancel the *). Try using the JAR through its full name, or link to Spark through Maven (http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-java). Matei On Dec 6, 2013,

Re: Newbie questions

2013-12-08 Thread Matei Zaharia
Hi Kenneth, 1. Is Spark suited for online learning algorithms? From what I've read so far (mainly from this slide), it seems not, but I could be wrong. You can probably use Spark Streaming (http://spark.incubator.apache.org/docs/latest/streaming-programming-guide.html) to implement
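Independent of Spark itself, the core of an online learner is a small per-record state update, and a streaming job would apply the same update to each mini-batch. A Spark-free sketch of one SGD step for 1-D linear regression (all names and values are illustrative):

```scala
// Minimal online SGD update for 1-D linear regression, y ~ w * x.
// This is the per-record update a streaming job would fold over each
// mini-batch; no Spark involved, names are illustrative.
def sgdStep(w: Double, x: Double, y: Double, lr: Double): Double = {
  val prediction = w * x
  val gradient = (prediction - y) * x // d/dw of 0.5 * (w*x - y)^2
  w - lr * gradient
}

// Feed a stream of (x, y) samples drawn from y = 3x; w drifts toward 3.
val samples = (1 to 200).map { i => val x = (i % 10) + 1.0; (x, 3.0 * x) }
val learned = samples.foldLeft(0.0) { case (w, (x, y)) => sgdStep(w, x, y, 0.01) }
println(s"learned weight: $learned")
```

In Spark Streaming, an update like this would typically run inside a foreachRDD over each batch, folding new samples into the model state.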

Re: Biggest spark.akka.framesize possible

2013-12-08 Thread Matei Zaharia
As I said, it should not affect performance of transformations on RDDs, only of sending tasks to the workers and getting results back. In general, you want the Akka frame size to be as small as possible while still holding your largest task or result; as long as your application isn’t throwing

Re: Biggest spark.akka.framesize possible

2013-12-08 Thread Shangyu Luo
OK, that is clear. But what about collect() and collectAsMap()? Is it possible that Spark throws a 'java heap space' or 'communication error' because spark.akka.framesize is too small? Currently I set it to 1024. Thank you! Best, Shangyu 2013/12/8 Matei Zaharia matei.zaha...@gmail.com As I
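One way to sanity-check a frame-size setting is to measure the serialized size of a would-be collect()/collectAsMap() result, since it is the serialized bytes, not the in-memory RDD, that must fit in a frame. A rough, Spark-free estimate using plain Java serialization (Spark's actual serializer may differ, so treat this as approximate):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Estimate the serialized size of a would-be collect() result using
// plain Java serialization. Spark's serializer (Java or Kryo) will
// differ somewhat, so this is only a rough check against the frame size.
def serializedSizeBytes(obj: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  bytes.size()
}

// A million-element array, standing in for a large collect() result.
val result: Array[Long] = Array.fill(1000000)(42L)
val sizeMb = serializedSizeBytes(result) / (1024.0 * 1024.0)
println(f"approx serialized size: $sizeMb%.1f MB")
```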

Bump: on disk storage formats

2013-12-08 Thread Ankur Chauhan
Hi all, Sorry for posting this again, but I am interested in finding out what different on-disk data formats are available for storing timeline event and analytics aggregate data. Currently I am just using newline-delimited JSON in gzipped files. I was wondering if there were any recommendations. -- Ankur

Re: Bump: on disk storage formats

2013-12-08 Thread Andrew Ash
LZO compression at a minimum, and Parquet as a second step, seems like the way to go, though I haven't tried either personally yet. Sent from my mobile phone On Dec 8, 2013, at 16:54, Ankur Chauhan achau...@brightcove.com wrote: Hi all, Sorry for posting this again but I am interested

Re: Build Spark with maven

2013-12-08 Thread Azuryy Yu
Any thoughts here? I still cannot compile Spark using Maven; thanks for any input. On 2013-12-07 2:31 PM, Azuryy Yu azury...@gmail.com wrote: Hey dears, Can you give me a Maven repo, so I can compile Spark with Maven? I'm using http://repo1.maven.org/maven2/ currently but it complains

Re: Build Spark with maven

2013-12-08 Thread Matei Zaharia
Yeah, maybe you have weird versions of something published locally. Try deleting your ~/.m2 and ~/.ivy2 directories and redoing the build. Unfortunately this will take a while to re-download stuff, but it should work out. Matei On Dec 8, 2013, at 5:21 PM, Mark Hamstra m...@clearstorydata.com

Re: Build Spark with maven

2013-12-08 Thread Azuryy Yu
I did not check out from the repository; I downloaded the source package and built from that. On 2013-12-09 9:22 AM, Mark Hamstra m...@clearstorydata.com wrote: I don't believe that is true of the Spark 0.8.1 code. I just got done building Spark from the v0.8.1-incubating tag after first removing anything to do

Re: Build Spark with maven

2013-12-08 Thread Azuryy Yu
Hi Mark, I built the current release candidate, and it complained during the build: Downloading: http://repo1.maven.org/maven2/com/typesafe/akka/akka-actor/2.0.5/akka-actor-2.0.5.pom [WARNING] The POM for com.typesafe.akka:akka-actor:jar:2.0.5 is missing, no dependency information available Downloading:

Re: Build Spark with maven

2013-12-08 Thread Azuryy Yu
@Mark, it works now after I changed settings.xml, but it would be better to improve the Spark documentation a little in the Building Spark with Maven section (http://spark.incubator.apache.org/docs/latest/building-with-maven.html). On Mon, Dec 9, 2013 at 10:45 AM, Azuryy Yu azury...@gmail.com wrote:
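For anyone hitting the same missing-POM warnings, the usual culprit is a settings.xml mirror that does not actually proxy Maven Central. A hedged sketch of the relevant fragment — the mirror id and URL here are illustrative and should match your environment:

```xml
<!-- ~/.m2/settings.xml (fragment, illustrative) -->
<settings>
  <mirrors>
    <mirror>
      <id>central-proxy</id>
      <!-- Point at Maven Central, or an internal mirror that proxies it -->
      <url>https://repo1.maven.org/maven2/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
  </mirrors>
</settings>
```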

Re: Bump: on disk storage formats

2013-12-08 Thread Ankur Chauhan
Hi Patrick, I agree this is a very open-ended question, but I was trying to get a general answer anyway, and I think you did hint at some nuances. 1. My workload is definitely bottlenecked by disk IO, because even with a projection onto a single column (mostly 2-3 out of 20) there is a lot of

Re: Build Spark with maven

2013-12-08 Thread Rajika Kumarasiri
Try to see if that dependency comes in via a transitive dependency using mvn dependency:tree. Rajika On Sat, Dec 7, 2013 at 1:31 AM, Azuryy Yu azury...@gmail.com wrote: Hey dears, Can you give me a maven repo, so I can compile Spark with Maven. I'm using http://repo1.maven.org/maven2/

Re: Bump: on disk storage formats

2013-12-08 Thread Patrick Wendell
Parquet might be a good fit for you then... it's pretty new and I don't have a lot of direct experience working with it, but I've seen examples of people using Spark with Parquet. You might want to check out Matt Massie's post here: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ This

Re: Bump: on disk storage formats

2013-12-08 Thread Azuryy Yu
Thanks for sharing. On 2013-12-09 11:50 AM, Patrick Wendell pwend...@gmail.com wrote: Parquet might be a good fit for you then... it's pretty new and I don't have a lot of direct experience working with it. But I've seen examples of people using Spark with Parquet. You might want to check out

Key Class - NotSerializableException

2013-12-08 Thread Archit Thakur
Hi, When I did sc.sequenceFile(file, classOf[Text], classOf[Text]).flatMap(map_func).count() it gave me a result of 365. However, when I did sc.sequenceFile(file, classOf[Text], classOf[Text]).flatMap(map_func).sortByKey().count(), it threw java.io.NotSerializableException for the key class returned

Re: Spark Import Issue

2013-12-08 Thread Andrew Ash
Also note that when you add entries to the -cp flag on the JVM and want to include multiple jars, the only way to do that is by including an entire directory with dir/* -- you can't use dir/*.jar or dir/spark*.jar or anything else like that.

Re: Key Class - NotSerializableException

2013-12-08 Thread Archit Thakur
I did make the classes Serializable. But now running the same command sc.sequenceFile(file, classOf[Text], classOf[Text]).flatMap(map_func).sortByKey().count() gives me java.lang.NoSuchMethodError. The Collection class which I made Serializable accesses one static variable, and that static

Re: Key Class - NotSerializableException

2013-12-08 Thread Archit Thakur
And since sortByKey serializes the keys, I guess it has something to do with serialization. On Mon, Dec 9, 2013 at 11:19 AM, Archit Thakur archit279tha...@gmail.com wrote: I did make the classes Serializable. But now running the same command sc.sequenceFile(file, classOf[Text],

Fwd: Key Class - NotSerializableException

2013-12-08 Thread Archit Thakur
Hi Nick, Yeah, I saw that. I actually used sc.sequenceFile to load the data from HDFS into an RDD. Also, both my key class and value class implement WritableComparable from Hadoop. Still I got java.io.NotSerializableException when I used sortByKey. Hierarchy of my classes: Collection
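The underlying issue in this thread is that Hadoop's Writable/WritableComparable protocol is unrelated to java.io.Serializable, which is what a Java-serialization-based shuffle (as triggered by sortByKey here) requires of the key class. A Spark-free sketch of the failure mode, using stand-in classes (names are illustrative):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Stand-in for a Hadoop Writable-style key: it has its own read/write
// protocol but does NOT extend java.io.Serializable.
class WritableLikeKey(val value: String)

// The same key made Java-serializable, as the default Java serializer
// requires for shuffle keys.
class SerializableKey(val value: String) extends Serializable

// Returns true if the object survives Java serialization.
def javaSerializable(obj: AnyRef): Boolean =
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(obj)
    out.close()
    true
  } catch {
    case _: NotSerializableException => false
  }

val writableOk = javaSerializable(new WritableLikeKey("k"))
val serializableOk = javaSerializable(new SerializableKey("k"))
println(s"writable-like: $writableOk, serializable: $serializableOk")
```

A common workaround in this situation is to map the Writable keys to plain serializable values (e.g. Text to String) before calling sortByKey.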