Re: Spark Streaming compilation error: algebird not a member of package com.twitter

2014-09-21 Thread Tathagata Das
There is no artifact called spark-streaming-algebird. To use Algebird, you will have to add the following dependency (in Maven format): <dependency> <groupId>com.twitter</groupId> <artifactId>algebird-core_${scala.binary.version}</artifactId> <version>0.1.11</version> </dependency> This is
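For anyone building with sbt instead of Maven, a rough sbt equivalent of the dependency above would be the single line below (a sketch; the %% operator appends the Scala binary version, mirroring ${scala.binary.version} in the Maven form):

    // build.sbt (sketch) -- sbt equivalent of the Maven dependency above
    libraryDependencies += "com.twitter" %% "algebird-core" % "0.1.11"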

Re: Distributed dictionary building

2014-09-21 Thread Sean Owen
Reference - https://issues.apache.org/jira/browse/SPARK-3098 I imagine zipWithUniqueID is also affected, but it may simply not have shown up in your test. On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das debasish.da...@gmail.com wrote: Some more debugging revealed that, as Sean said, I have to keep the

Saving RDD with array of strings

2014-09-21 Thread Sarath Chandra
Hi All, If my RDD has an array/sequence of strings, how can I save it as an HDFS file with each string on a separate line? For example, if I write code as below, the output should get saved as an HDFS file with one string per line ... ... var newLines = lines.map(line => myfunc(line));

Setting up Spark 1.1 on Windows 7

2014-09-21 Thread Khaja M
Hi: I am trying to set up Spark 1.1 on a Windows 7 box. I am running the sbt assembly command and this is the error that I am seeing. [error] (streaming-flume-sink/*:update) sbt.ResolveException: unresolved dependency: commons-lang#commons-lang;2.6: configuration not found in

Issues with partitionBy: FetchFailed

2014-09-21 Thread Julien Carme
Hello, I am facing an issue with partitionBy; it is not clear whether it is a problem with my code or with my Spark setup. I am using Spark 1.1, standalone, and my other Spark projects work fine. So I have to repartition a relatively large file (about 70 million lines). Here is a minimal version

RE: Issues with partitionBy: FetchFailed

2014-09-21 Thread Shao, Saisai
Hi, I’ve also met this problem before. I think you can try setting “spark.core.connection.ack.wait.timeout” to a larger value to avoid ack timeouts; the default is 60 seconds. Sometimes, because of a GC pause or some other reason, the acknowledgement message will time out, which will lead to this
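For reference, a minimal sketch of bumping that timeout when building the SparkConf (the 600-second value is only an illustration; the property takes a number of seconds and can equally go in conf/spark-defaults.conf or be passed with --conf to spark-submit):

    // Sketch: raise the ack timeout well above the default 60 seconds
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("repartition-job")                           // placeholder app name
      .set("spark.core.connection.ack.wait.timeout", "600")
    val sc = new SparkContext(conf)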

Re: Saving RDD with array of strings

2014-09-21 Thread Julien Carme
Just use flatMap, it does exactly what you need: newLines.flatMap { lines => lines }.saveAsTextFile(...) 2014-09-21 11:26 GMT+02:00 Sarath Chandra sarathchandra.jos...@algofusiontech.com: Hi All, If my RDD has an array/sequence of strings, how can I save it as an HDFS file with each
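Putting the two mails together, a minimal end-to-end sketch (myfunc is assumed, as in the original mail, to turn one input line into a sequence of strings; the HDFS paths are placeholders):

    // Sketch: flatten an RDD of string sequences so each string becomes one output line
    val lines = sc.textFile("hdfs:///input/path")            // placeholder input path
    val newLines = lines.map(line => myfunc(line))           // myfunc: String => Seq[String] (assumed)
    newLines.flatMap(strings => strings)                     // RDD[Seq[String]] -> RDD[String]
            .saveAsTextFile("hdfs:///output/path")           // one string per line in the part files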

Re: Issues with partitionBy: FetchFailed

2014-09-21 Thread David Rowe
Hi, I've seen this problem before, and I'm not convinced it's GC. When Spark shuffles, it writes a lot of small files to store the data to be sent to other executors (AFAICT). According to what I've read around the place, the intention is that these files be stored in disk buffers, and since

Re: Setting up Spark 1.1 on Windows 7

2014-09-21 Thread Khaja Mohideen
I was able to move past this error by deleting the .ivy2/cache folder. However, I am running into an out of memory error: [error] java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space [error] Use 'last' for the full log. This is despite the fact that I have set

Re: Avoid broacasting huge variables

2014-09-21 Thread octavian.ganea
Using mapPartitions and passing the big index object as a parameter to it was not the best option, given the size of the big object and my RAM. The workers died before starting the actual computation. Anyway, creating a singleton object worked for me:
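The singleton approach being described can look roughly like the sketch below: an object holding the index as a lazy val, so it is loaded once per executor JVM on first use rather than shipped inside the task closure or broadcast (the loader, the path and the Map type are all illustrative; rdd is assumed to be an existing RDD of lookup keys):

    // Sketch: load a large index once per executor JVM via a singleton object
    object BigIndex {
      // lazy val: initialized on first access inside each executor process
      lazy val index: Map[String, Long] = loadIndex("/local/path/on/each/worker")
      private def loadIndex(path: String): Map[String, Long] =
        scala.io.Source.fromFile(path).getLines()
          .map(_.split("\t"))
          .map(fields => fields(0) -> fields(1).toLong)
          .toMap
    }

    val resolved = rdd.mapPartitions { iter =>
      val idx = BigIndex.index            // touches the singleton, no broadcast involved
      iter.map(key => idx.getOrElse(key, -1L))
    }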

Re: Setting up Spark 1.1 on Windows 7

2014-09-21 Thread Khaja Mohideen
Setting JAVA_OPTS helped me fix the problem. Thanks, -Khaja On Sun, Sep 21, 2014 at 9:25 AM, Khaja Mohideen kha...@gmail.com wrote: I was able to move past this error by deleting the .ivy2/cache folder. However, I am running into an out of memory error: [error]
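For anyone hitting the same heap error, the fix amounts to giving sbt a larger JVM before running sbt assembly; on Windows that would look something like the following (the sizes are only a guess and should be tuned to the machine):

    set JAVA_OPTS=-Xmx2g -XX:MaxPermSize=512m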

Shuffle size difference - operations on RDD vs. operations on SchemaRDD

2014-09-21 Thread Grega Kešpret
Hi, I am seeing different shuffle write sizes when using SchemaRDD (versus normal RDD). I'm doing the following: case class DomainObj(a: String, b: String, c: String, d: String) val logs: RDD[String] = sc.textFile(...) val filtered: RDD[String] = logs.filter(...) val myDomainObjects:

Re: Distributed dictionary building

2014-09-21 Thread Debasish Das
zipWithUniqueId is also affected... I had to persist the dictionaries to make use of the indices lower down in the flow... On Sun, Sep 21, 2014 at 1:15 AM, Sean Owen so...@cloudera.com wrote: Reference - https://issues.apache.org/jira/browse/SPARK-3098 I imagine zipWithUniqueID is also
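A sketch of that workaround: persist the dictionary right after zipWithUniqueId so the assigned IDs are materialized once and reused downstream, instead of possibly being recomputed differently on a later action (terms is assumed to be an existing RDD[String] of dictionary entries):

    // Sketch: pin down the term -> id assignment before using it downstream
    import org.apache.spark.storage.StorageLevel

    val dictionary = terms.distinct()
                          .zipWithUniqueId()                      // (term, id) pairs
                          .persist(StorageLevel.MEMORY_AND_DISK)  // keep the ids stable across reuse
    dictionary.count()                                            // force materialization once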

Can SparkContext shared across nodes/drivers

2014-09-21 Thread 林武康
Hi all, As far as I know, a SparkContext instance takes charge of some of the resources of a cluster that the master assigned to it, and it can hardly be shared between different SparkContexts; meanwhile, scheduling between applications is also not easy. To address this without introducing extra resource schedule

Re: pyspark on yarn - lost executor

2014-09-21 Thread Sandy Ryza
Hi Oleg, Those parameters control the number and size of Spark's daemons on the cluster. If you're interested in how these daemons relate to each other and interact with YARN, I wrote a post on this a little while ago -
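For context, a sketch of how those knobs are usually passed when submitting a PySpark job to YARN (the numbers and the script name are placeholders, not recommendations):

    ./bin/spark-submit \
      --master yarn-client \
      --num-executors 10 \
      --executor-memory 4g \
      --executor-cores 2 \
      my_script.py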

Re: Shuffle size difference - operations on RDD vs. operations on SchemaRDD

2014-09-21 Thread Michael Armbrust
Spark SQL always uses a custom configuration of Kryo under the hood to improve shuffle performance: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlSerializer.scala Michael On Sun, Sep 21, 2014 at 9:04 AM, Grega Kešpret gr...@celtra.com
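Building on that, part of the difference may simply be the serializer: the plain-RDD pipeline can be switched to Kryo as well to compare like with like. A sketch (the registrator class is hypothetical and only needed if you want to register your case classes explicitly):

    // Sketch: use Kryo for ordinary RDD shuffles too
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("shuffle-size-comparison")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "my.pkg.MyKryoRegistrator")   // hypothetical registrator class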

java.lang.ClassNotFoundException on driver class in executor

2014-09-21 Thread Barrington Henry
Hi, I am running Spark from my IDE (IntelliJ) using YARN as my cluster manager. However, the executor node is not able to find my main driver class “LascoScript”. I keep getting java.lang.ClassNotFoundException. I tried adding the jar of the main class by running the snippet below val
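Since the snippet in the mail is cut off, here is a sketch of one common way to make the driver's classes visible to the YARN executors when launching from an IDE: point the SparkConf at the application jar (the jar path is hypothetical and the jar has to be built beforehand, e.g. with sbt package):

    // Sketch: ship the application jar containing the driver class to the executors
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("LascoScript")
      .setMaster("yarn-client")
      .setJars(Seq("/path/to/lascoscript.jar"))    // hypothetical path to the built jar
    val sc = new SparkContext(conf)
    // alternatively, after the context exists: sc.addJar("/path/to/lascoscript.jar")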

Re: Found both spark.driver.extraClassPath and SPARK_CLASSPATH

2014-09-21 Thread Koert Kuipers
I have found no way around this. Basically this makes SPARK_CLASSPATH unusable, and the alternative for enabling LZO on a cluster is not reasonable. One has to set in spark-defaults.conf: spark.executor.extraClassPath /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar
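For completeness, the conf-file alternative being described would look roughly like the two lines below in conf/spark-defaults.conf (the jar path is the one from the mail; whether the driver line is also required depends on where the driver runs):

    spark.executor.extraClassPath  /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar
    spark.driver.extraClassPath    /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar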

Re: Spark and disk usage.

2014-09-21 Thread Andrew Ash
Thanks for the info Burak! I filed a bug on myself at https://issues.apache.org/jira/browse/SPARK-3631 to turn this information into a new section in the programming guide. Thanks for the explanation, it's very helpful. Andrew On Wed, Sep 17, 2014 at 12:08 PM, Burak Yavuz bya...@stanford.edu

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-21 Thread Andrew Or
Hi Didata, An alternative to what Sandy proposed is to set the Spark properties in a special file `conf/spark-defaults.conf`. That way you don't have to specify all the configs through the command line every time. The `--conf` option is mostly intended to change one or two parameters, but it
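As an illustration, conf/spark-defaults.conf is just whitespace-separated property/value pairs, along the lines of the sketch below (the properties and values are examples only, not the original poster's settings):

    spark.master              spark://master-host:7077
    spark.executor.memory     4g
    spark.serializer          org.apache.spark.serializer.KryoSerializer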

Re: java.lang.ClassNotFoundException on driver class in executor

2014-09-21 Thread Andrew Or
Hi Barrington, Have you tried running it from the command line? (i.e. bin/spark-submit --master yarn-client --class YOUR_CLASS YOUR_JAR) Does it still fail? I am not super familiar with running Spark through IntelliJ, but AFAIK the classpaths are set up a little differently there. Also, Spark