customized comparator in groupByKey

2014-05-06 Thread Ameet Kini
I'd like to override the logic of comparing keys for equality in groupByKey. Kinda like how combineByKey allows you to pass in the combining logic for "values", I'd like to do the same for keys. My code looks like this: val res = rdd.groupBy(myPartitioner) Here, rdd is of type RDD[(MyKey, MyValue)
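One common workaround (a sketch only, not taken from the thread's replies; MyKey and rdd are the names from the question, normalize is an assumed helper): since groupBy/groupByKey compare keys with the key type's own equals/hashCode, wrapping each key in a class that implements the desired equality makes grouping use that logic.

    import org.apache.spark.SparkContext._   // pair RDD functions (groupByKey)

    // Hypothetical wrapper: two MyKey values compare equal whenever their
    // normalized forms match.
    case class KeyWrapper(k: MyKey) {
      override def hashCode: Int = normalize(k).hashCode
      override def equals(o: Any): Boolean = o match {
        case KeyWrapper(other) => normalize(k) == normalize(other)
        case _ => false
      }
    }

    val res = rdd
      .map { case (k, v) => (KeyWrapper(k), v) }    // re-key with custom equality
      .groupByKey()                                 // groups via KeyWrapper.equals/hashCode
      .map { case (KeyWrapper(k), vs) => (k, vs) }  // unwrap the original keys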

Re: How to read a multipart s3 file?

2014-05-06 Thread Andre Kuhnen
Try using s3n instead of s3. On 06/05/2014 21:19, "kamatsuoka" wrote: > I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt. > > Behind the scenes, the S3 driver creates a bunch of files like > s3://mybucket//mydir/myfile.txt/part-, as well as the block files like > s3

Re: Easy one

2014-05-06 Thread Aaron Davidson
If you're using standalone mode, you need to make sure the Spark Workers know about the extra memory. This can be configured in spark-env.sh on the workers as export SPARK_WORKER_MEMORY=4g On Tue, May 6, 2014 at 5:29 PM, Ian Ferreira wrote: > Hi there, > > Why can’t I seem to kick the executor

Easy one

2014-05-06 Thread Ian Ferreira
Hi there, Why can't I seem to kick the executor memory higher? See below from an EC2 deployment using m1.large. In spark-env.sh: export SPARK_MEM=6154m And in the spark context: sconf.setExecutorEnv("spark.executor.memory", "4g") Cheers - Ian
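For reference, the usual way to request executor memory programmatically (rather than via setExecutorEnv) is the spark.executor.memory property on SparkConf; a minimal sketch, not from the thread itself:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.executor.memory sets the executor heap; on standalone clusters the
    // workers' SPARK_WORKER_MEMORY (spark-env.sh) must be at least this large.
    val conf = new SparkConf()
      .setAppName("memory-example")
      .setMaster("local[2]")              // placeholder master for the sketch
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)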

Re: How to read a multipart s3 file?

2014-05-06 Thread Matei Zaharia
There’s a difference between s3:// and s3n:// in the Hadoop S3 access layer. Make sure you use the right one when reading stuff back. In general s3n:// ought to be better because it will create things that look like files in other S3 tools. s3:// was present when the file size limit in S3 was mu

How to read a multipart s3 file?

2014-05-06 Thread kamatsuoka
I have a Spark app that writes out a file, s3://mybucket/mydir/myfile.txt. Behind the scenes, the S3 driver creates a bunch of files like s3://mybucket//mydir/myfile.txt/part-, as well as the block files like s3://mybucket/block_3574186879395643429. How do I construct a URL to use this file
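Per the replies elsewhere in this digest (use s3n:// instead of s3://, and point at the output directory rather than any individual part file), a minimal read-back sketch using the hypothetical bucket/path from the question:

    // textFile on the directory picks up all of the part- files beneath it.
    val lines = sc.textFile("s3n://mybucket/mydir/myfile.txt")
    println(lines.count())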

Re: Spark and Java 8

2014-05-06 Thread Dean Wampler
Cloudera customers will need to put pressure on them to support Java 8. They only officially supported Java 7 when Oracle stopped supporting Java 6. dean On Wed, May 7, 2014 at 5:05 AM, Matei Zaharia wrote: > Java 8 support is a feature in Spark, but vendors need to decide for > themselves when

Re: maven for building scala simple program

2014-05-06 Thread Ryan Compton
I've been using this (you'll need Maven 3): <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>com.mycompany.app</groupId> <artifactId>my-app</artifactId> <version>1.0-SN

maven for building scala simple program

2014-05-06 Thread Laeeq Ahmed
Hi all, If anyone is using Maven for building Scala classes with all dependencies for Spark, please provide a sample pom.xml here. I am having trouble using Maven for a simple Scala job, though it was working properly for Java. I have added the scala-maven-plugin but am still getting some issues. La
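A minimal pom.xml sketch of the kind being requested, assuming Spark 0.9.1 on Scala 2.10 and the scala-maven-plugin (group/artifact IDs and version numbers are illustrative assumptions, not taken from the thread):

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.example</groupId>
      <artifactId>spark-simple-job</artifactId>
      <version>1.0-SNAPSHOT</version>
      <dependencies>
        <!-- Spark core for Scala 2.10 -->
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.10</artifactId>
          <version>0.9.1</version>
        </dependency>
      </dependencies>
      <build>
        <plugins>
          <!-- Compiles the Scala sources -->
          <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.1.6</version>
            <executions>
              <execution>
                <goals>
                  <goal>compile</goal>
                  <goal>testCompile</goal>
                </goals>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
    </project>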

Re: logging in pyspark

2014-05-06 Thread Nicholas Chammas
I think you're looking for RDD.foreach(). According to the programming guide: Run a function func on each element of the dataset. This is usually

logging in pyspark

2014-05-06 Thread Diana Carroll
What should I do if I want to log something as part of a task? This is what I tried. To set up a logger, I followed the advice here: http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off logger = logging.getLogger("py4j") logger.setLevel(logging.INFO) logger.addHandler(logging.StreamHa

Spark Summit 2014 (Hotel suggestions)

2014-05-06 Thread Jerry Lam
Hi Spark users, Do you guys plan to go to the Spark Summit? Can you recommend any hotel near the conference? I'm not familiar with the area. Thanks! Jerry

Re: Spark and Java 8

2014-05-06 Thread Matei Zaharia
Java 8 support is a feature in Spark, but vendors need to decide for themselves when they’d like to support Java 8 commercially. You can still run Spark on Java 7 or 6 without taking advantage of the new features (indeed our builds are always against Java 6). Matei On May 6, 2014, at 8:59 AM, Ian

Re: No space left on device error when pulling data from s3

2014-05-06 Thread Han JU
After some investigation, I found out that there are lots of temp files under /tmp/hadoop-root/s3/. But this is strange, since in both conf files, ~/ephemeral-hdfs/conf/core-site.xml and ~/spark/conf/core-site.xml, the setting `hadoop.tmp.dir` is set to `/mnt/ephemeral-hdfs/`. Why do spark jobs still wr

Re: No space left on device error when pulling data from s3

2014-05-06 Thread Akhil Das
I wonder why your / is full. Try clearing out /tmp, and also make sure that in spark-env.sh you have put SPARK_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark" Thanks Best Regards On Tue, May 6, 2014 at 9:35 PM, Han JU wrote: > Hi, > > I've a `no space left on device` exception when pulling some 22

RE: run spark0.9.1 on yarn with hadoop CDH4

2014-05-06 Thread Andrew Lee
Please check JAVA_HOME. Usually it should point to /usr/java/default on CentOS/Linux. FYI: http://stackoverflow.com/questions/1117398/java-home-directory > Date: Tue, 6 May 2014 00:23:02 -0700 > From: sln-1...@163.com > To: u...@spark.incubator.apache.org > Subject: run spark0.9.1 on yarn wit

Re: is Mesos falling out of favor?

2014-05-06 Thread deric
I guess it's due to missing documentation and a quite complicated setup. Continuous integration would be nice! Btw, is it possible to use Spark as a shared library and not fetch the Spark tarball for each task? Do you point SPARK_EXECUTOR_URI to an HDFS URL? -- View this message in context: http:/

No space left on device error when pulling data from s3

2014-05-06 Thread Han JU
Hi, I get a `no space left on device` exception when pulling some 22GB of data from s3 block storage into the ephemeral HDFS. The cluster is on EC2, launched by the spark-ec2 script with 4 m1.large. The code is basically: val in = sc.textFile("s3://...") in.saveAsTextFile("hdfs://...") Spark creates 750 inpu
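As suggested in the reply above, the usual fix is to point spark.local.dir (where shuffle and spill files land) at the large /mnt volume instead of the small root disk; a sketch under that assumption:

    import org.apache.spark.{SparkConf, SparkContext}

    // Keep Spark's temporary/shuffle files on the big ephemeral disk rather
    // than the small root filesystem that was filling up.
    val conf = new SparkConf()
      .setAppName("s3-to-hdfs")
      .set("spark.local.dir", "/mnt/spark")
    val sc = new SparkContext(conf)
    sc.textFile("s3://...").saveAsTextFile("hdfs://...")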

Re: Spark and Java 8

2014-05-06 Thread Ian O'Connell
I think the distinction there might be that they never said they ran that code under CDH5, just that Spark supports it and Spark runs under CDH5 -- not that you can use these features while running under CDH5. They could use Mesos or the standalone scheduler to run them. On Tue, May 6, 2014 at 6:16 AM,

Re: Spark and Java 8

2014-05-06 Thread Marcelo Vanzin
Hi Kristoffer, You're correct that CDH5 only supports up to Java 7 at the moment. But Yarn apps do not run in the same JVM as Yarn itself (and I believe MR1 doesn't either), so it might be possible to pass arguments in a way that tells Yarn to launch the application master / executors with the Jav

Re: Comprehensive Port Configuration reference?

2014-05-06 Thread Jacob Eisinger
Howdy Scott, Please see the discussions about securing the Spark network [1] [2]. In a nutshell, Spark opens up a couple of well-known ports. And then the workers and the shell open up dynamic ports for each job. These dynamic ports make securing the Spark network difficult. Jacob [1] http:

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-06 Thread Jacob Eisinger
Howdy, You might find the discussion Andrew and I have been having about Docker and network security [1] applicable. Also, I posted an answer [2] to your stackoverflow question. [1] http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-fire

Spark and Java 8

2014-05-06 Thread Kristoffer Sjögren
Hi, I just read an article [1] about Spark, CDH5 and Java 8, but did not get exactly how Spark can run Java 8 on a YARN cluster at runtime. Does Spark use a separate JVM that runs on the data nodes, or does it reuse the YARN JVM runtime somehow, like hadoop1? CDH5 only supports Java 7 [2] as far as I kno

Re: If it due to my file has been breakdown?

2014-05-06 Thread Sophia
I have modified it in spark-env.sh, but it turns out that it does not work. So confused. Best Regards -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/If-it-due-to-my-file-has-been-breakdown-tp5438p5442.html Sent from the Apache Spark User List mailing list arc

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-06 Thread Jacob Eisinger
Howdy Andrew, Agreed - if that subnet is configured to only allow THOSE docker images onto it, then, yeah, I figure it would be secure. Great setup, in my opinion! (And, I think we both agree - a better one would be to have Spark only listen on well-known ports to allow for a secured firewall/n

Re: If it due to my file has been breakdown?

2014-05-06 Thread Mayur Rustagi
Most likely your JAVA_HOME variable is wrong. Can you configure that in the spark-env file? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, May 6, 2014 at 5:53 PM, Sophia wrote: > Hi all, > [root@sophia spark-0.9.1]# > > SPA

Re: about broadcast

2014-05-06 Thread randylu
I found that the small broadcast variable always takes about 10s, not 5s or any other value. Is there some property/conf (whose default is 10) that controls this timeout? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/about-broadcast-tp5416p5439.html Sent from the Apac

If it due to my file has been breakdown?

2014-05-06 Thread Sophia
Hi all, [root@sophia spark-0.9.1]# SPARK_JAR=.assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar ./bin/spark-class org.apache.spark.deploy.yarn.Client \ --jar examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.1.jar \ --class org.apache.spark.examples.SparkPi \ --args yarn-stan

Re: "sbt/sbt run" command returns a JVM problem

2014-05-06 Thread Carter
Hi Akhil, Thanks for your reply. I have tried this option with different values, but it still doesn't work. The Java version I am using is jre1.7.0_55; does the Java version matter for this problem? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

KryoSerializer Exception

2014-05-06 Thread Andrea Esposito
Hi there, sorry if I'm posting a lot lately. I'm trying to add the KryoSerializer but I receive this exception: 2014-05-06 11:45:23 WARN TaskSetManager:62 - Loss was due to java.io.EOFException java.io.EOFException at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSer
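For context, a minimal Kryo configuration sketch along the lines of the Spark 0.9 tuning docs (MyRegistrator and MyClass are hypothetical placeholders):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Registers the application's classes with Kryo.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyClass])   // hypothetical application class
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")   // fully-qualified name in a real app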

Re: Incredible slow iterative computation

2014-05-06 Thread Andrea Esposito
Thanks all for helping. Following Earthson's tip, I resolved it. I have to report that if you materialize the RDD and only afterwards try to checkpoint it, the operation doesn't take effect. newRdd = oldRdd.map(myFun).persist(myStorageLevel) newRdd.foreach(x => myFunLogic(x)) // Here materialized for other
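In other words, a sketch of the ordering the fix relies on (same names as in the message above): checkpoint() has to be requested before the first action materializes the RDD, otherwise the checkpoint is not performed on the already-computed data.

    val newRdd = oldRdd.map(myFun).persist(myStorageLevel)
    newRdd.checkpoint()                   // mark for checkpointing BEFORE any action
    newRdd.foreach(x => myFunLogic(x))    // the first action materializes and checkpoints it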

Re: Spark GCE Script

2014-05-06 Thread Akhil Das
Hi Matei, Will clean up the code a little bit and send the pull request :) Thanks Best Regards On Tue, May 6, 2014 at 1:00 AM, François Le lay wrote: > Has anyone considered using jclouds tooling to support multiple cloud > providers? Maybe using Pallet? > > François > > On May 5, 2014, at 3:

Re: How can I run sbt?

2014-05-06 Thread Akhil Das
Hi Sophia, Make sure your installation isn't corrupted. It may happen that the download did not complete. Thanks Best Regards On Tue, May 6, 2014 at 1:53 PM, Sophia wrote: > Hi all, > #./sbt/sbt assembly > Launching sbt from sbt/sbt-launch-0.12.4.jar > Invalid or corrupt

Re: Storage information about an RDD from the API

2014-05-06 Thread Andras Nemeth
Thanks Koert, very useful! On Tue, Apr 29, 2014 at 6:41 PM, Koert Kuipers wrote: > SparkContext.getRDDStorageInfo > > > On Tue, Apr 29, 2014 at 12:34 PM, Andras Nemeth < > andras.nem...@lynxanalytics.com> wrote: > >> Hi, >> >> Is it possible to know from code about an RDD if it is cached, and m
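A small usage sketch of the suggestion above (the RDDInfo field names used here are an assumption on my part):

    val data = sc.parallelize(1 to 1000).cache()
    data.count()   // materialize the RDD so that storage info is populated
    sc.getRDDStorageInfo.foreach { info =>
      // One entry per RDD with blocks in the block manager.
      println(info.name + " (id=" + info.id + "): memSize=" + info.memSize +
        " diskSize=" + info.diskSize)
    }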

Re: Spark's behavior

2014-05-06 Thread Eduardo Costa Alfaia
Ok Andrew, thanks. I sent the results of a test with 8 workers, and the gap has grown. On May 4, 2014, at 2:31, Andrew Ash wrote: >>> From the logs, I see that the print() starts printing stuff 10 seconds >>> after the context is started. And that 10 seconds is taken by the initial >>> empty

How can I run sbt?

2014-05-06 Thread Sophia
Hi all, #./sbt/sbt assembly Launching sbt from sbt/sbt-launch-0.12.4.jar Invalid or corrupt jarfile sbt/sbt-launch-0.12.4.jar Why can't I run sbt? Best regards, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-run-sbt-tp5429.html Sent from the

RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-06 Thread Andrew Lee
Hi Jacob, I agree, we need to address both driver and workers bidirectionally. If the subnet is isolated and self-contained, and only limited ports are configured to access the driver via a dedicated gateway for the user, could you explain your concern, or what doesn't satisfy the security criteria?

run spark0.9.1 on yarn with hadoop CDH4

2014-05-06 Thread Sophia
Hi all, I have made HADOOP_CONF_DIR or YARN_CONF_DIR point to the directory which contains the (client-side) configuration files for the Hadoop cluster. The command I run to launch the YARN Client is like this: # SPARK_JAR=./~/spark-0.9.1/assembly/target/scala-2.10/spark-assembly_2.10-0.9