Re: CDH 5.0 and Spark 0.9.0

2014-05-01 Thread Sean Owen
This codec does require native libraries to be installed, IIRC, but they are installed with CDH 5. The error you show does not look related, though. Are you sure your HA setup is working and that you have configured it correctly in whatever config Spark is seeing? -- Sean Owen | Director, Data

Re: Multiple Streams with Spark Streaming

2014-05-01 Thread Mayur Rustagi
File as a stream? I think you are confusing Spark Streaming with a buffered reader. Spark Streaming is meant to process batches of data (files, packets, messages) as they come in, in fact utilizing the time of packet reception as a way to create windows etc. In your case you are better off reading the
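
A minimal sketch of that batch/window model, assuming a StreamingContext watching an input directory (the path, app name, and durations are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Each batch interval yields one RDD of newly arrived lines.
    val ssc = new StreamingContext("local[2]", "WindowDemo", Seconds(10))
    val lines = ssc.textFileStream("hdfs:///incoming")

    // Slide a 60-second window over the batches, recomputing every 20 seconds.
    lines.window(Seconds(60), Seconds(20)).count().print()

    ssc.start()
    ssc.awaitTermination()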

Re: Broadcast RDD Lookup

2014-05-01 Thread Mayur Rustagi
Most likely none of the items in the PairRDD match your input; hence the error. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, May 1, 2014 at 2:06 PM, vivek.ys vivek...@gmail.com wrote: Hi All, I am facing an issue

Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-05-01 Thread Mayur Rustagi
A broadcast variable is meant to be shared across nodes, not per map task. The process you are using should work; however, a 6 GB broadcast variable could be an issue. Does the broadcast variable eventually transfer, or does it always stay stuck? Mayur Rustagi Ph: +1 (760) 203 3257
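
For reference, a minimal sketch of the intended broadcast pattern, assuming an existing SparkContext sc and with a small map standing in for the large shared data:

    // One read-only copy is shipped to each node, not to each map task.
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    val result = sc.parallelize(Seq("a", "b", "c"))
      .map(k => lookup.value.getOrElse(k, 0))   // tasks read the node-local copy
      .collect()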

Re: Broadcast RDD Lookup

2014-05-01 Thread Vivek YS
No, I am sure the items match, because userCluster and productCluster are prepared from the data. The cross product of userCluster and productCluster is a superset of the data. On Thu, May 1, 2014 at 3:41 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Mostly none of the items in PairRDD match your input.

Re: update of RDDs

2014-05-01 Thread Mayur Rustagi
RDDs are immutable, so they cannot be updated. You can create a new RDD containing the updated entries (often not what you want to do). Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, May 1, 2014 at 4:42 AM, narayanabhatla
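
A small sketch of the derive-a-new-RDD approach (assuming an existing SparkContext sc; data is illustrative):

    // RDDs cannot be mutated in place; an "update" is a new RDD derived from the old one.
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val updated = rdd.map { case (k, v) =>
      if (k == "a") (k, v + 10) else (k, v)   // changed entries live in the new RDD
    }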

RE: update of RDDs

2014-05-01 Thread NN Murthy
Thanks a lot for the very prompt response. Then the next questions are the following. 1. Can we conclude that Spark is NOT the solution for our requirement? Or 2. Is there a design approach to meet such requirements using Spark? From: Mayur Rustagi [mailto:mayur.rust...@gmail.com]

Re: GraphX. How to remove vertex or edge?

2014-05-01 Thread Daniel Darabos
Graph.subgraph() allows you to apply a filter to edges and/or vertices. On Thu, May 1, 2014 at 8:52 AM, Николай Кинаш peroksi...@gmail.com wrote: Hello. How to remove a vertex or edges from a graph in GraphX?
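
A minimal sketch of subgraph() in use, assuming an existing Graph[String, Int] named graph (the predicates are illustrative):

    import org.apache.spark.graphx._

    // Keep only the vertices/edges that pass the predicates; everything else is "removed".
    val filtered = graph.subgraph(
      epred = triplet => triplet.attr != 0,     // drop edges whose attribute is 0
      vpred = (id, attr) => attr != "deleted"   // drop vertices marked "deleted"
    )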

Re: My talk on Spark: The Next Top (Compute) Model

2014-05-01 Thread Daniel Darabos
Cool intro, thanks! One question. On slide 23 it says Standalone (local mode). That sounds a bit confusing without hearing the talk. Standalone mode is not local; it just does not depend on separate cluster-management software. I think it's the best mode for EC2/GCE, because they provide a distributed filesystem
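
The distinction in terms of master URLs, roughly (host name illustrative):

    import org.apache.spark.SparkContext

    // Local mode: everything runs in one JVM on one machine.
    val localSc = new SparkContext("local[4]", "myApp")

    // Standalone mode: Spark's own cluster manager spread over many machines,
    // with no dependency on YARN or Mesos.
    val clusterSc = new SparkContext("spark://master-host:7077", "myApp")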

Re: My talk on Spark: The Next Top (Compute) Model

2014-05-01 Thread Dean Wampler
Thanks for the clarification. I'll fix the slide. I've done a lot of Scalding/Cascading programming where the two concepts are synonymous, but clearly I was imposing my prejudices here ;) dean On Thu, May 1, 2014 at 8:18 AM, Daniel Darabos daniel.dara...@lynxanalytics.com wrote: Cool intro,

Re: My talk on Spark: The Next Top (Compute) Model

2014-05-01 Thread ZhangYi
Very useful material. Currently, I am trying to persuade my client to choose Spark instead of Hadoop MapReduce. Your slides give me more evidence to support my opinion. -- ZhangYi (张逸) Developer tel: 15023157626 blog: agiledon.github.com weibo: tw张逸 Sent with Sparrow

Re: My talk on Spark: The Next Top (Compute) Model

2014-05-01 Thread Dean Wampler
That's great! Thanks. Let me know if it works ;) or what I could improve to make it work. dean On Thu, May 1, 2014 at 8:45 AM, ZhangYi yizh...@thoughtworks.com wrote: Very useful material. Currently, I am trying to persuade my client to choose Spark instead of Hadoop MapReduce. Your slides give

sbt/sbt run command returns a JVM problem

2014-05-01 Thread Carter
Hi, I have a very simple Spark program written in Scala: /*** testApp.scala ***/ object testApp { def main(args: Array[String]) { println("Hello! World!") } } Then I use the following command to compile it: $ sbt/sbt package The compilation finished successfully and I got a JAR file. But

Re: sbt/sbt run command returns a JVM problem

2014-05-01 Thread Sean Owen
Here's how I configure SBT, which I think is the usual way: export SBT_OPTS="-XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256m -Xmx1g" See if that takes. But your error is that you're already asking for too much memory for your machine. So maybe you are setting the value successfully, but it's

RE: What is Seq[V] in updateStateByKey?

2014-05-01 Thread Adrian Mocanu
So Seq[V] contains only new tuples. I initially thought that whenever a new tuple was found, it would add it to the Seq and call the update function immediately, so there wouldn't be more than 1 update to the Seq per function call. Say I want to sum tuples with the same key in an RDD using

Spark Training

2014-05-01 Thread Nicholas Chammas
There are many freely-available resources for the enterprising individual to use if they want to Spark up their life. For others, some structured training is in order. Say I want everyone from my department at my company to get something like the AMP Camp (http://ampcamp.berkeley.edu/) experience,

Spark profiler

2014-05-01 Thread Punya Biswal
Hi all, I am thinking of starting work on a profiler for Spark clusters. The current idea is that it would collect jstacks from executor nodes and put them into a central index (either a database or elasticsearch), and it would present them to people in a UI that would let people slice and dice

RE: Spark Training

2014-05-01 Thread Huang, Roger
If you're in the Bay Area, the Spark Summit would be a great source of information. http://spark-summit.org/2014 -Roger From: Nicholas Chammas [mailto:nicholas.cham...@gmail.com] Sent: Thursday, May 01, 2014 10:12 AM To: u...@spark.incubator.apache.org Subject: Spark Training There are many

Re: Spark profiler

2014-05-01 Thread Mayur Rustagi
Something like Twitter's Ambrose would be lovely to integrate :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Thu, May 1, 2014 at 8:44 PM, Punya Biswal pbis...@palantir.com wrote: Hi all, I am thinking of starting

Re: Spark Training

2014-05-01 Thread Denny Lee
You may also want to check out Paco Nathan's Introduction to Spark courses: http://liber118.com/pxn/ On May 1, 2014, at 8:20 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi Nicholas, We provide hands-on training on Spark and the associated ecosystem. We gave it recently at a

Equally weighted partitions in Spark

2014-05-01 Thread deenar.toraskar
Hi, I am using Spark to distribute computationally intensive tasks across the cluster. Currently I partition my RDD of tasks randomly. There is a large variation in how long each of the jobs takes to complete, leading to most partitions being processed quickly while a couple of partitions take
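
One possible direction, sketched as an assumption rather than a known fix: pre-bucket tasks by estimated cost and route them with a custom Partitioner (all names, including bucketFor, are hypothetical):

    import org.apache.spark.Partitioner

    // Route each task to a partition chosen from a precomputed cost bucket,
    // so that the estimated total work per partition comes out roughly equal.
    class WeightedPartitioner(buckets: Int) extends Partitioner {
      def numPartitions: Int = buckets
      def getPartition(key: Any): Int = key match {
        case bucket: Int => bucket % buckets   // key is assumed to be a bucket id
        case _           => 0
      }
    }

    // Usage sketch: bucketFor() is a hypothetical cost estimator.
    // tasks.map(t => (bucketFor(t), t)).partitionBy(new WeightedPartitioner(16))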

Re: Spark Training

2014-05-01 Thread Dean Wampler
I'm working on a 1-day workshop that I'm giving in Australia next week and a few other conferences later in the year. I'll post a link when it's ready. dean On Thu, May 1, 2014 at 10:30 AM, Denny Lee denny.g@gmail.com wrote: You may also want to check out Paco Nathan's Introduction to

Re: Efficient Aggregation over DB data

2014-05-01 Thread Andrea Esposito
Hi Sai, I honestly can't figure out where you are using the RDDs (the split method isn't defined on them). In any case, you should use the map function instead of foreach, because if your function is NOT idempotent it can misbehave: some partitions could be recomputed, executing the function multiple times.
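
A sketch of that distinction (rdd is assumed to exist; dbWrite and transform are hypothetical helpers):

    // Risky: if a partition is recomputed, the side effect runs again.
    // rdd.foreach(row => dbWrite(row))

    // Safer: express the work as a transformation whose output Spark tracks.
    val results = rdd.map(row => transform(row))
    results.saveAsTextFile("hdfs:///out")   // illustrative path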

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-01 Thread Peter
Thank you Patrick. I took a quick stab at it: val s3Client = new AmazonS3Client(...) val copyObjectResult = s3Client.copyObject(upload, outputPrefix + "/part-0", "rolled-up-logs", "2014-04-28.csv") val objectListing = s3Client.listObjects(upload, outputPrefix)

Spark streaming

2014-05-01 Thread Mohit Singh
Hi, I guess Spark uses streaming in the context of streaming live data, but what I mean is something more along the lines of Hadoop Streaming, where one can code in any programming language. Or is something along those lines on the cards? Thanks -- Mohit When you want success as badly as you

Re: Spark streaming

2014-05-01 Thread Tathagata Das
Take a look at the RDD.pipe() operation. That allows you to pipe the data in an RDD to any external shell command (just like a Unix shell pipe). On May 1, 2014 10:46 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, I guess Spark is using streaming in context of streaming live data but what I mean
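
A minimal sketch of pipe(), assuming an existing SparkContext sc:

    // Each partition's elements are streamed through the external command's
    // stdin/stdout, one line per element -- similar in spirit to Hadoop Streaming.
    val rdd = sc.parallelize(Seq("spark", "streaming", "pipe"))
    val upper = rdd.pipe("tr a-z A-Z")   // any shell command or script works here
    upper.collect()                      // Array(SPARK, STREAMING, PIPE)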

permission problem

2014-05-01 Thread Livni, Dana
I'm working with Spark 0.9.0 on CDH 5. I'm running a Spark application written in Java in yarn-client mode. Because of the OP installed on the cluster, I need to run the application using the hdfs user; otherwise I have a permission problem and get the following error:

Re: permission problem

2014-05-01 Thread Sean Owen
Yeah, actually it's hdfs that has superuser privileges on HDFS, not root. It looks like you're trying to access a nonexistent user directory like /user/foo, and it fails because root can't create it; you inherit root's privileges since that is what your app runs as. I don't think you want to
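
If that diagnosis fits, one common fix is to create the user directory as the HDFS superuser (the user and path here are illustrative):

    sudo -u hdfs hadoop fs -mkdir /user/root
    sudo -u hdfs hadoop fs -chown root /user/root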

Re: Reading multiple S3 objects, transforming, writing back one

2014-05-01 Thread Nicholas Chammas
The fastest way to save to S3 should be to leave the RDD with many partitions, because all partitions will be written out in parallel. Then, once the various parts are in S3, somehow concatenate the files together into one file. If this can be done within S3 (I don't know if this is possible),
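
The parallel-write half of that advice, sketched (bucket names are illustrative; s3n was the usual scheme at the time):

    // Each partition becomes its own part-NNNNN object, written in parallel.
    rdd.saveAsTextFile("s3n://my-bucket/output/")

    // The single-file alternative gives up that parallelism:
    rdd.coalesce(1).saveAsTextFile("s3n://my-bucket/output-single/")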

ClassNotFoundException

2014-05-01 Thread Joe L
Hi, I am getting the following error. How could I fix this problem? Joe 14/05/02 03:51:48 WARN TaskSetManager: Lost TID 12 (task 2.0:1) 14/05/02 03:51:48 INFO TaskSetManager: Loss was due to java.lang.ClassNotFoundException: org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$4

Can't be built on Mac

2014-05-01 Thread Zhige Xin
Hi dear all, When I tried to build Spark 0.9.1 on my Mac OS X 10.9.2 with Java 8, I found the following errors: [error] error while loading CharSequence, class file '/Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk/Contents/Home/jre/lib/rt.jar(java/lang/CharSequence.class)' is broken [error]

Re: Can't be built on Mac

2014-05-01 Thread Ian Ferreira
Hi Zhige, I had the same issue and reverted to using JDK 1.7.0_55. From: Zhige Xin xinzhi...@gmail.com Reply-To: user@spark.apache.org Date: Thursday, May 1, 2014 at 12:32 PM To: user@spark.apache.org Subject: Can't be built on Mac Hi dear all, When I tried to build Spark 0.9.1 on my Mac OS X

Re: Can't be built on Mac

2014-05-01 Thread Zhige Xin
Thank you, Ian! Zhige On Thu, May 1, 2014 at 12:35 PM, Ian Ferreira ianferre...@hotmail.com wrote: Hi Zhige, I had the same issue and reverted to using JDK 1.7.0_55 From: Zhige Xin xinzhi...@gmail.com Reply-To: user@spark.apache.org Date: Thursday, May 1, 2014 at 12:32 PM To:

updateStateByKey example not using correct input data?

2014-05-01 Thread Adrian Mocanu
I'm trying to understand updateStateByKey. Here's an example I'm testing with: Input data: DStream( RDD( (a,2) ), RDD( (a,3) ), RDD( (a,4) ), RDD( (a,5) ), RDD( (a,6) ), RDD( (a,7) ) ) Code: val updateFunc = (values: Seq[Int], state: Option[StateClass]) => { val previousState =

Running Spark jobs via oozie

2014-05-01 Thread Shivani Rao
Hello Spark Fans, I am trying to run a Spark job via Oozie as a Java action. The Spark code is packaged as MySparkJob.jar, compiled using sbt assembly (excluding Spark and Hadoop dependencies). I am able to invoke the Spark job from any client using java -cp

Setting the Scala version in the EC2 script?

2014-05-01 Thread Ian Ferreira
Is this possible? It is very annoying to have such a great script but still have to manually update stuff afterwards.

Re: Equally weighted partitions in Spark

2014-05-01 Thread Andrew Ash
The problem is that equally-sized partitions take variable time to complete based on their contents? Sent from my mobile phone On May 1, 2014 8:31 AM, deenar.toraskar deenar.toras...@db.com wrote: Hi I am using Spark to distribute computationally intensive tasks across the cluster. Currently

range partitioner with updateStateByKey

2014-05-01 Thread Adrian Mocanu
If I use a range partitioner, will this make updateStateByKey take the tuples in order? Right now I see them not being taken in order (most of them are ordered but not all) -Adrian

java.lang.ClassNotFoundException

2014-05-01 Thread İbrahim Rıza HALLAÇ
Hello. I followed the A Standalone App in Java part of the tutorial https://spark.apache.org/docs/0.8.1/quick-start.html The Spark standalone cluster looks like it's running without a problem: http://i.stack.imgur.com/7bFv8.png I have built a fat jar for running this JavaApp on the cluster. Before maven

Re: How to handle this situation: Huge File Shared by All maps and Each Computer Has one copy?

2014-05-01 Thread PengWeiPRC
Thanks, Rustagi. Yes, the global data is read-only and stays around from the beginning to the end of the whole Spark task. Actually, it is not only identical for one Map/Reduce task, but used by a lot of my map/reduce tasks. That's why I intend to put the data into each node of my cluster, and hope

Task not serializable: collect, take

2014-05-01 Thread SK
Hi, I have the following code structure. It compiles OK, but at runtime it aborts with the error: Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: I am running in local (standalone) mode. trait A { def input(...):

Re: Task not serializable: collect, take

2014-05-01 Thread Marcelo Vanzin
Have you tried making A extend Serializable? On Thu, May 1, 2014 at 3:47 PM, SK skrishna...@gmail.com wrote: Hi, I have the following code structure. I compiles ok, but at runtime it aborts with the error: Exception in thread main org.apache.spark.SparkException: Job aborted: Task not
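
For context, a minimal sketch of why that helps (the members are illustrative):

    // Closures shipped to executors are serialized, and a closure that uses a
    // field captures its enclosing instance -- so the trait must be Serializable.
    trait A extends Serializable {
      val factor = 2
      def scale(rdd: org.apache.spark.rdd.RDD[Int]) = rdd.map(_ * factor)
    }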

Re: Opinions stratosphere

2014-05-01 Thread Christopher Nguyen
Someone (Ze Ni, https://www.sics.se/people/ze-ni) has actually attempted such a comparative study as a Master's thesis: http://www.diva-portal.org/smash/get/diva2:605106/FULLTEXT01.pdf According to this snapshot (c. 2013), Stratosphere differs from Spark in not having an explicit concept of

Question regarding doing aggregation over custom partitions

2014-05-01 Thread Arun Swami
Hi, I am a newbie to Spark. I looked for documentation or examples to answer my question but came up empty-handed. I don't know whether I am using the right terminology, but here goes. I have a file of records. Initially, I had the following Spark program (I am omitting all the surrounding code

configure spark history server for running on Yarn

2014-05-01 Thread Jenny Zhao
Hi, I have installed spark 1.0 from the branch-1.0, build went fine, and I have tried running the example on Yarn client mode, here is my command: /home/hadoop/spark-branch-1.0/bin/spark-submit /home/hadoop/spark-branch-1.0/examples/target/scala-2.10/spark-examples-1.0.0-hadoop2.2.0.jar --master

Re: same partition id means same location?

2014-05-01 Thread wxhsdp
Can anyone say something about this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/same-partition-id-means-same-location-tp5136p5200.html

YARN issues with resourcemanager.scheduler.address

2014-05-01 Thread zsterone
Hi, I'm trying to connect to a YARN cluster by running these commands: export HADOOP_CONF_DIR=/hadoop/var/hadoop/conf/ export YARN_CONF_DIR=$HADOOP_CONF_DIR export SPARK_YARN_MODE=true export SPARK_JAR=./assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar export

Re: What is Seq[V] in updateStateByKey?

2014-05-01 Thread Tathagata Das
Depends on your code. Referring to the earlier example, if you do words.map(x => (x,1)).updateStateByKey(...) then for a particular word, if a batch contains 6 occurrences of that word, then the Seq[V] will be [1, 1, 1, 1, 1, 1]. Instead if you do words.map(x => (x,1)).reduceByKey(_ +
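
A sketch of the first variant, assuming words is a DStream[String]:

    // With map(x => (x, 1)), Seq[V] holds one 1 per occurrence in the batch;
    // with reduceByKey(_ + _) first, it would hold a single pre-summed count.
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      Some(state.getOrElse(0) + values.sum)   // running total per key
    }
    val counts = words.map(x => (x, 1)).updateStateByKey[Int](updateFunc)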

Re: range partitioner with updateStateByKey

2014-05-01 Thread Tathagata Das
Ordered by what? Arrival order? Sort order? TD On Thu, May 1, 2014 at 2:35 PM, Adrian Mocanu amoc...@verticalscope.com wrote: If I use a range partitioner, will this make updateStateByKey take the tuples in order? Right now I see them not being taken in order (most of them are ordered

Getting the following error using EC2 deployment

2014-05-01 Thread Ian Ferreira
I have a custom app that was compiled with Scala 2.10.3, which I believe is what the latest spark-ec2 script installs. However, running it on the master yields this cryptic error, which according to the web implies incompatible jar versions. Exception in thread "main" java.lang.NoClassDefFoundError: