log4j question

2014-01-09 Thread Shay Seng
Hey, In https://spark.incubator.apache.org/docs/0.8.1/configuration.html#configuring-logging it states: "Spark uses log4j (http://logging.apache.org/log4j/) for logging. You can configure it by adding a log4j.properties file in the conf directory. One way to start is to copy the existing
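
For readers who land here: the template the docs mention lives at conf/log4j.properties.template; copy it to conf/log4j.properties and edit the root category. A minimal sketch in the template's own log4j 1.x format (WARN is chosen here just for illustration):

    # conf/log4j.properties -- quiet Spark down to WARN on the console
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n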

What version of protobuf should I be using?

2014-01-03 Thread Shay Seng
Hi. I'm using Spark 0.8.0 and launching an AWS cluster using the spark-ec2 script. Typically I can launch spark-shell with no problem. However, if I add protobuf-java-2.5.0.jar to my spark-classpath, I am not able to launch Spark - the workers fail to connect and spark-shell quits. If I do not add
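
Context for anyone hitting the same wall: Spark 0.8's Akka dependency pulls in protobuf 2.4.1, and protobuf 2.4 and 2.5 are not binary compatible, so putting a 2.5.0 jar ahead of Spark's on the classpath can break the worker/master wiring. A hedged sketch of one workaround, keeping the conflicting jar out of your own build (the library coordinates are placeholders):

    // build.sbt fragment: exclude the transitive protobuf-java 2.5.x that
    // some other dependency drags in, so Spark's 2.4.1 wins on the classpath.
    libraryDependencies += "com.example" % "needs-protobuf" % "1.0" exclude("com.google.protobuf", "protobuf-java")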

Re: What version of protobuf should I be using?

2014-01-03 Thread Shay Seng
On Fri, Jan 3, 2014 at 12:11 PM, Shay Seng s...@1618labs.com wrote: Hi. I'm using Spark 0.8.0 and launching an AWS cluster using the spark-ec2 script. Typically I can launch spark-shell with no problem. However, if I add protobuf-java-2.5.0.jar to my spark-classpath, I am not able to launch

Re: Where can I find more information about the R interface for Spark?

2014-01-02 Thread Shay Seng
I've been using JRI to communicate with R from Spark, with some utils to convert from Scala data types into R datatypes/dataframes etc. http://www.rforge.net/JRI/ I've been using mapPartitions to push R closures thru JRI and collecting back the results in Spark. This works reasonably well, though
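
A rough sketch of the pattern Shay describes, under assumptions: the RDD is an RDD[Array[Double]], R plus JRI's native library are installed on every worker, and sum(x) stands in for a real R closure.

    import org.rosuda.JRI.Rengine

    val scored = rdd.mapPartitions { iter =>
      // JRI allows only one Rengine per JVM, so reuse any existing engine.
      val re = Option(Rengine.getMainEngine).getOrElse(
        new Rengine(Array("--vanilla"), false, null))
      iter.map { xs =>
        re.assign("x", xs)             // copy the Scala doubles into R as vector x
        re.eval("sum(x)").asDouble()   // evaluate an R expression, pull the result back
      }
    }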

DataFrame RDDs

2013-11-15 Thread Shay Seng
Hi, Is there some way to get R-style Data.Frame data structures into RDDs? I've been using RDD[Seq[]] but this is getting quite error-prone and the code gets pretty hard to read especially after a few joins, maps etc. Rather than access columns by index, I would prefer to access them by name.
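
One lightweight workaround (a sketch, not necessarily what the list recommended; field names and path are invented): give each row a case class so columns are accessed by name rather than index.

    // Name the columns once; the rest of the pipeline then reads naturally.
    case class Trip(userId: Int, segId: Int, dist: Double)

    val trips = sc.textFile("hdfs:///data/trips.csv").map { line =>
      val f = line.split(",")
      Trip(f(0).toInt, f(1).toInt, f(2).toDouble)
    }
    val long = trips.filter(t => t.dist > 10.0)   // t.dist, not row(2)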

Re: DataFrame RDDs

2013-11-15 Thread Shay Seng
on that in-memory table structure. We're planning to harmonize that with the MLBase work in the near future. Just a matter of prioritization on limited resources. If there's enough interest we'll accelerate that. Sent while mobile. Pls excuse typos etc. On Nov 16, 2013 1:11 AM, Shay Seng s

Recommended way to join 2 RDDs - one large, the other small

2013-11-14 Thread Shay Seng
Hi, Just wondering what people suggest for joining 2 RDDs of very different sizes. I have a sequence of map/reduce steps that will in the end yield an RDD of ~500MB-800MB that typically has a couple hundred partitions. After that I want to join that RDD with 2 smaller RDDs; 1 will be 50MB
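
The usual answer for this shape of join, sketched with invented types: collect the small side to the driver, broadcast it, and do a map-side join so the large RDD is never shuffled.

    // Ship the ~50MB side to every worker once via broadcast.
    val smallMap = smallRdd.collect().toMap     // assumes smallRdd: RDD[(K, V2)]
    val smallBc  = sc.broadcast(smallMap)

    val joined = largeRdd.map { case (k, v) =>
      (k, (v, smallBc.value.get(k)))            // Option[V2], like a left outer join
    }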

Re: Recommended way to join 2 RDDs - one large, the other small

2013-11-14 Thread Shay Seng
on a cellphone so I'm not sure why RDDs are involved. On Thu, Nov 14, 2013 at 11:14 AM, Shay Seng s...@1618labs.com wrote: Hi, Just wondering what people suggest for joining of 2 RDDs of very different sizes I have a sequence of map reduce that will in the end yield me a RDD ~ 500MB

Re: suppressing logging in REPL

2013-11-07 Thread Shay Seng
It seems that I need to have the log4j.properties file in the current directory. So if I launch spark-shell in spark/conf, I see that INFO is not displayed. On Thu, Nov 7, 2013 at 2:16 PM, Shay Seng s...@1618labs.com wrote: When is the log4j.properties file read... and how can I verify
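
If editing files is awkward, the level can also be forced from inside the REPL with the log4j 1.x API (a sketch; this bypasses log4j.properties entirely and only lasts for the session):

    import org.apache.log4j.{Level, Logger}
    // Silence Spark's INFO chatter for the current session.
    Logger.getRootLogger.setLevel(Level.WARN)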

Re: suppressing logging in REPL

2013-11-06 Thread Shay Seng
available as sc. Type in expressions to have them evaluated. Type :help for more information. scala sc.parallelize(1 to 10, 2).count res0: Long = 10 On Tue, Nov 5, 2013 at 2:36 PM, Shay Seng s...@1618labs.com wrote: Hi, I added a log4j.properties file in spark/conf more ./spark/conf

value join is not a member of org.apache.spark.rdd.RDD

2013-11-06 Thread Shay Seng
Hi, I'm having some problems getting a piece of code that I can run in the REPL to compile: val aDay = day.map( n => ... ((aInt,bInt),(cInt,dInt,eDbl,fInt,gDbl)) ) val seg = segments.map( n => ... ((aInt,bInt), (..)) ) val allSegs = aDay.join(seg) error: value join is not a member of
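
For the archives, the usual cause of this error in compiled code (the REPL pulls it in for you): in Spark 0.8, join lives on PairRDDFunctions and is reached through an implicit conversion that needs the following import.

    // Brings in the implicit conversion RDD[(K, V)] => PairRDDFunctions,
    // where join, reduceByKey, etc. are defined.
    import org.apache.spark.SparkContext._

    val allSegs = aDay.join(seg)   // compiles once the implicit is in scope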

Save RDDs as CSV

2013-10-30 Thread Shay Seng
What's the recommended way to save an RDD as a CSV on, say, HDFS? Do I have to collect the RDD and save it from the master, or is there some way I can write out the CSV file in parallel to HDFS? tks shay
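
Sketch of the parallel route (record type and output path invented): format each record as a comma-separated line, then saveAsTextFile writes one part-file per partition straight to HDFS, no collect on the master needed.

    // records: RDD[(Int, String, Double)]; each task writes its own part-NNNNN file.
    val lines = records.map(r => List(r._1, r._2, r._3).mkString(","))
    lines.saveAsTextFile("hdfs:///output/records-csv")

Note this does no quoting or escaping; fields that may contain commas need real CSV handling.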

Spark REPL produces error on a piece of scala code that works in pure Scala REPL

2013-10-11 Thread Shay Seng
Hey, I'm seeing a funny situation where a piece of code executes in a pure Scala REPL but not in a Spark-shell. I'm using Scala 2.9.3 with Spark 0.8.0. In Spark I see: class Animal() { def says():String = ??? } val amimal = new Animal amimal: this.Animal = Animal@df27cd5 class Zoo[A :

How would I start writing a RDD[ProtoBuf] and/or sc.newAPIHadoopFile??

2013-10-08 Thread Shay Seng
Hi, I would like to store some data as a seq of protobuf objects. I would of course need to be able to read that into an RDD and write the RDD back out in some binary format. First of all, is this supported natively (or through some download)? If not, are there examples of how I might write my
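
One way this can be done (a sketch under assumptions: MyProto is a hypothetical generated protobuf class, and the messages are small enough for a bytes-in/bytes-out model): serialize each message and store the bytes in a SequenceFile, then parse on the way back in.

    import org.apache.spark.SparkContext._    // for saveAsSequenceFile
    import org.apache.hadoop.io.{BytesWritable, NullWritable}

    // Write: RDD[MyProto] => SequenceFile of raw protobuf bytes.
    protos.map(p => (NullWritable.get(), new BytesWritable(p.toByteArray)))
          .saveAsSequenceFile("hdfs:///data/protos")

    // Read: Hadoop reuses Writables, so copy the bytes out before parsing.
    val back = sc.sequenceFile("hdfs:///data/protos",
                               classOf[NullWritable], classOf[BytesWritable])
                 .map { case (_, bw) =>
                   MyProto.parseFrom(java.util.Arrays.copyOf(bw.getBytes, bw.getLength))
                 }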

spark-ec2 launch script ... some issues and comments

2013-10-04 Thread Shay Seng
Hi, I've been trying to use the spark-ec2 launch scripts and have some comments on them, not sure if this is the best place to post ... (1) On the AMI image, most of the modules' init.sh files have the following idiom: if [ -d spark ]; then echo Spark seems to be installed. Exiting. exit 0

Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Shay Seng
Inlined. On Wed, Oct 2, 2013 at 1:00 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Shangyu, (1) When we read in a local file by SparkContext.textFile and do some map/reduce job on it, how will Spark decide to send data to which worker node? Will the data be divided/partitioned equally
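
On the partitioning half of the question, a concrete sketch (path and split count invented): textFile takes a minimum split count, and with HDFS input the scheduler tries to place each task near its block; as the thread goes on to note, a plain local path must be readable from every worker for this to work at all.

    // Ask for at least 8 input splits; with HDFS, tasks prefer the
    // nodes that hold the corresponding block (data locality).
    val data = sc.textFile("hdfs:///data/big.txt", 8)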

Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Shay Seng
called on a local file will magically turn that local file into a distributed file and allow more than just the node where the file is local to process that file. On Thu, Oct 3, 2013 at 11:05 AM, Shay Seng s...@1618labs.com wrote: Inlined. On Wed, Oct 2, 2013 at 1:00 PM, Matei Zaharia

Re: Some questions about task distribution and execution in Spark

2013-10-03 Thread Shay Seng
PM, Mark Hamstra m...@clearstorydata.com wrote: But the worker has to be on a node that has local access to the file. On Thu, Oct 3, 2013 at 12:30 PM, Shay Seng s...@1618labs.com wrote: Ok, even if my understanding of allowLocal is incorrect, nevertheless (1) I'm loading a local file (2

Re: RemoteClientError@akka://spark@10.232.35.179:44283: Error[java.net.ConnectException:Connection refused

2013-09-25 Thread Shay Seng
can ssh to the worker machine, and look at the work folder in Spark. -- Reynold Xin, AMPLab, UC Berkeley http://rxin.org On Sat, Sep 21, 2013 at 12:30 PM, Shay Seng s...@1618labs.com wrote: Hey, I've been struggling to set up a work flow with spark. I'm basically using the AMI

Configuring memory used in Spark

2013-09-20 Thread Shay Seng
Hey all. I've been getting Java OutOfMemory errors. 13/09/20 18:54:37 ERROR actor.ActorSystemImpl: Uncaught error from thread [spark-akka.actor.default-dispatcher-5] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled java.lang.OutOfMemoryError: Java heap space at
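
For anyone landing here with the same error, the 0.8-era knobs, sketched with illustrative values: executor heap is set through Java system properties before the SparkContext is constructed (or via SPARK_MEM in spark-env.sh).

    import org.apache.spark.SparkContext

    // Must run before the SparkContext is created.
    System.setProperty("spark.executor.memory", "4g")
    val sc = new SparkContext("spark://master:7077", "MyApp")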

Re: Configuring memory used in Spark

2013-09-20 Thread Shay Seng
Please ignore this, was being dumb, mixture of typo and mis(not)reading the docs. On Fri, Sep 20, 2013 at 12:04 PM, Shay Seng s...@1618labs.com wrote: Hey all. I've been getting OutOfMemory Java errors. 13/09/20 18:54:37 ERROR actor.ActorSystemImpl: Uncaught error from thread [spark