Re: Using Spark on Data size larger than Memory size

2014-05-30 Thread Vibhor Banga
Any inputs would be really helpful. Thanks, -Vibhor On Fri, May 30, 2014 at 7:51 PM, Vibhor Banga wrote: > Hi all, > > I am planning to use spark with HBase, where I generate RDD by reading > data from HBase Table. > > I want to know that in the case when the size of HBase Table grows larger >

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-30 Thread Mayur Rustagi
We are migrating our scripts to r3. The lineage is in spark-ec2; would be happy to migrate those too. Having trouble with the Ganglia setup currently :) Regards Mayur On 31 May 2014 09:07, "Patrick Wendell" wrote: > Hi Jeremy, > > That's interesting, I don't think anyone has ever reported an issue > r

Re: Monitoring / Instrumenting jobs in 1.0

2014-05-30 Thread Patrick Wendell
The main change here was refactoring the SparkListener interface, which is where we expose internal state about a Spark job to other applications. We've cleaned up these APIs a bunch and also added a way to log all data as JSON for post-hoc analysis: https://github.com/apache/spark/blob/master/cor
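For illustration, a minimal PySpark sketch of turning on the JSON event logging mentioned above; the config keys are standard Spark 1.0 settings, and the log directory path is only an example.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("EventLogSketch")
            # Record job/stage/task events as JSON so they can be analyzed
            # (or the web UI replayed) after the application finishes.
            .set("spark.eventLog.enabled", "true")
            # Example directory only; pick a location your deployment can write to.
            .set("spark.eventLog.dir", "/tmp/spark-events"))

    sc = SparkContext(conf=conf)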

Re: Yay for 1.0.0! EC2 Still has problems.

2014-05-30 Thread Patrick Wendell
Hi Jeremy, That's interesting, I don't think anyone has ever reported an issue running these scripts due to Python incompatibility, but they may require Python 2.7+. I regularly run them from the AWS Ubuntu 12.04 AMI... that might be a good place to start. But if there is a straightforward way to

possible typos in spark 1.0 documentation

2014-05-30 Thread Yadid Ayzenberg
Congrats on the new 1.0 release. Amazing work! It looks like there may be some typos in the latest http://spark.apache.org/docs/latest/sql-programming-guide.html in the "Running SQL on RDDs" section when choosing the Java example: 1. ctx is an instance of JavaSQLContext but the textFile method

Failed to remove RDD error

2014-05-30 Thread Michael Chang
I'm running some Kafka streaming Spark contexts (on 0.9.1), and they seem to be dying after 10 or so minutes with a lot of these errors. I can't really tell what's going on here, except that maybe the driver is unresponsive somehow? Has anyone seen this before? 14/05/31 01:13:30 ERROR BlockMan

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-30 Thread Nicholas Chammas
YES, your hunches were correct. I’ve identified at least one file among the hundreds I’m processing that is indeed not a valid gzip file. Does anyone know of an easy way to exclude a specific file or files when calling sc.textFile() on a pattern? e.g. Something like: sc.textFile('s3n://bucket/stuf
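One workaround, sketched below under the assumption that you can enumerate the objects yourself (for example with boto): filter out the known-bad keys and hand the survivors to sc.textFile(), which accepts a comma-separated list of paths. The paths shown are placeholders.

    from pyspark import SparkContext

    sc = SparkContext(appName="ExcludeFilesSketch")

    # Placeholder listing; in practice this would come from boto or "hadoop fs -ls".
    all_paths = ["s3n://bucket/stuff/part-0000.gz",
                 "s3n://bucket/stuff/part-0001.gz",
                 "s3n://bucket/stuff/corrupt.gz"]

    # Drop the file(s) known to be bad, then join the rest --
    # sc.textFile() accepts several paths separated by commas.
    good_paths = [p for p in all_paths if not p.endswith("corrupt.gz")]
    lines = sc.textFile(",".join(good_paths))
    print(lines.count())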

Re: apache whirr for spark

2014-05-30 Thread Matei Zaharia
I don’t think Whirr provides support for this, but Spark’s own EC2 scripts also launch a Hadoop cluster: http://spark.apache.org/docs/latest/ec2-scripts.html. Matei On May 30, 2014, at 12:59 PM, chirag lakhani wrote: > Does anyone know if it is possible to use Whirr to setup a Spark cluster on

apache whirr for spark

2014-05-30 Thread chirag lakhani
Does anyone know if it is possible to use Whirr to set up a Spark cluster on AWS? I would like to be able to use Whirr to set up a cluster that has all of the standard Hadoop and Spark tools. I want to automate this process because I anticipate I will have to create and destroy often enough that I

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2014-05-30 Thread David Belling
Hi Sandy, We have more than 20 nodes each with around 80GB of RAM and the memory usage across the cluster is low, <20GB. The FairScheduler is being used. Thanks, Dave On Fri, May 30, 2014 at 12:42 PM, Sandy Ryza wrote: > Hi Dave, > > How many nodes do you have in your cluster and how much mem

Re: Spark shell never leaves ACCEPTED state in YARN CDH5

2014-05-30 Thread Sandy Ryza
Hi Dave, How many nodes do you have in your cluster and how much memory on each? If you go to the RM web UI at port 8088, how much memory is used? Which YARN scheduler are you using? -Sandy On Fri, May 30, 2014 at 12:38 PM, David Belling wrote: > Hi, > > I'm running CDH5 and its bundled Spa

Re: A Standalone App in Scala: Standalone mode issues

2014-05-30 Thread T.J. Alumbaugh
These tips were very helpful! By setting SPARK_MASTER_IP as you suggest, I was able to make progress. Unfortunately, it is unclear to me how to specify the hadoop-client dependency for a pyspark stand-alone application. So, I still get the EOFException, since I used a non-default Hadoop distributio

Spark shell never leaves ACCEPTED state in YARN CDH5

2014-05-30 Thread David Belling
Hi, I'm running CDH5 and its bundled Spark (0.9.0). The Spark shell has been coming up fine over the last couple of weeks. However today it doesn't come up and I just see this message over and over: 14/05/30 12:06:05 INFO YarnClientSchedulerBackend: Application report from ASM: appMasterRpcPort

Re: Local file being referenced in mapper function

2014-05-30 Thread Rahul Bhojwani
Thanks Jey, that was helpful. On Sat, May 31, 2014 at 12:45 AM, Rahul Bhojwani < rahulbhojwani2...@gmail.com> wrote: > Thanks Marcelo, > > It actually made my few concepts clear. (y). > > > On Fri, May 30, 2014 at 10:14 PM, Marcelo Vanzin > wrote: > >> Hello there, >> >> On Fri, May 30, 2014 at 9

Re: Local file being referenced in mapper function

2014-05-30 Thread Rahul Bhojwani
Thanks Marcelo, It actually made my few concepts clear. (y). On Fri, May 30, 2014 at 10:14 PM, Marcelo Vanzin wrote: > Hello there, > > On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin > wrote: > > workbook = xlsxwriter.Workbook('output_excel.xlsx') > > worksheet = workbook.add_worksheet() > >

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-30 Thread Andrei
Thanks, Stephen. I have eventually decided to go with assembly, but left the Spark and Hadoop jars out of it, instead using `spark-submit` to automatically provide these dependencies. This way no resource conflicts arise and mergeStrategy needs no modification. To memorize this stable setup and also share

Trouble with EC2

2014-05-30 Thread PJ$
Hey Folks, I'm really having quite a bit of trouble getting Spark running on EC2. I'm not using the scripts at https://github.com/apache/spark/tree/master/ec2 because I'd like to know how everything works. But I'm going a little crazy. I think that something about the networking configuration must be

Re: Local file being referenced in mapper function

2014-05-30 Thread Jey Kottalam
Hi Rahul, Marcelo's explanation is correct. Here's a possible approach to your program, in pseudo-Python: # connect to Spark cluster sc = SparkContext(...) # load input data input_data = load_xls(file("input.xls")) input_rows = input_data['Sheet1'].rows # create RDD on cluster input_rdd = sc.p
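A fleshed-out version of that pseudo-Python, as a sketch only: the input rows are hard-coded stand-ins for whatever the spreadsheet loader returns, and the output side uses xlsxwriter on the driver after collect(), so no worker ever touches the local file.

    import xlsxwriter
    from pyspark import SparkContext

    sc = SparkContext(appName="ExcelSketch")

    # Stand-in for rows read from the input spreadsheet on the driver.
    input_rows = [["a", 1], ["b", 2], ["c", 3]]

    # Ship the rows to the cluster as an RDD, transform there, bring results back.
    input_rdd = sc.parallelize(input_rows)
    result_rows = input_rdd.map(lambda row: [row[0].upper(), row[1] * 10]).collect()

    # Write the collected results locally on the driver only.
    workbook = xlsxwriter.Workbook("output_excel.xlsx")
    worksheet = workbook.add_worksheet()
    for r, row in enumerate(result_rows):
        for c, value in enumerate(row):
            worksheet.write(r, c, value)
    workbook.close()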

Re: Local file being referenced in mapper function

2014-05-30 Thread Marcelo Vanzin
Hello there, On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin wrote: > workbook = xlsxwriter.Workbook('output_excel.xlsx') > worksheet = workbook.add_worksheet() > > data = sc.textFile("xyz.txt") > # xyz.txt is a file whose each line contains string delimited by > > row=0 > > def mapperFunc(x): >

Re: Local file being referenced in mapper function

2014-05-30 Thread Marcelo Vanzin
Hi Rahul, I'll just copy & paste your question here to aid with context, and reply afterwards. - Can I write the RDD data in excel file along with mapping in apache-spark? Is that a correct way? Isn't that a writing will be a local function and can't be passed over the clusters?? Below is g

Re: Spark 1.0.0 - Java 8

2014-05-30 Thread Aaron Davidson
Also, the Spark examples can run out of the box on a single machine, as well as a cluster. See the "Master URLs" heading here: http://spark.apache.org/docs/latest/submitting-applications.html#master-urls On Fri, May 30, 2014 at 9:24 AM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > Wi
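For example, the master URL is the only thing that changes between a single machine and a cluster; a minimal PySpark sketch (the cluster URL would be your own master's address):

    from pyspark import SparkContext

    # "local[*]" runs everything in one JVM using all local cores;
    # swap in e.g. "spark://host:7077" (hypothetical host) to target a standalone cluster.
    sc = SparkContext(master="local[*]", appName="SingleMachineSketch")
    print(sc.parallelize(range(100)).sum())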

Re: Spark 1.0.0 - Java 8

2014-05-30 Thread Surendranauth Hiraman
With respect to virtual hosts, my team uses Vagrant/Virtualbox. We have 3 CentOS VMs with 4 GB RAM each - 2 worker nodes and a master node. Everything works fine, though if you are using MapR, you have to make sure they are all on the same subnet. -Suren On Fri, May 30, 2014 at 12:20 PM, Upend

Spark 1.0.0 - Java 8

2014-05-30 Thread Upender Nimbekar
Great news! I've been awaiting this release to start doing some coding with Spark using Java 8. Can I run the Spark 1.0 examples on a virtual host with 16 GB RAM and a fairly decent amount of hard disk? Or do I really need to use a cluster of machines? Second, are there any good examples of using MLlib o

RE: Announcing Spark 1.0.0

2014-05-30 Thread giive chen
Great work! On May 30, 2014 10:15 PM, "Ian Ferreira" wrote: > Congrats > > Sent from my Windows Phone > -- > From: Dean Wampler > Sent: 5/30/2014 6:53 AM > To: user@spark.apache.org > Subject: Re: Announcing Spark 1.0.0 > > Congratulations!! > > > On Fri, May 30,

Subscribing to news releases

2014-05-30 Thread Nick Chammas
Is there a way to subscribe to news releases? That would be swell. Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Subscribing-to-news-releases-tp6592.html Sent from the Apache Spark User List mailing list arc

Re: Announcing Spark 1.0.0

2014-05-30 Thread Nicholas Chammas
You guys were up late, eh? :) I'm looking forward to using this latest version. Is there any place we can get a list of the new functions in the Python API? The release notes don't enumerate them. Nick On Fri, May 30, 2014 at 10:15 AM, Ian Ferreira wrote: > Congrats > > Sent from my Windows

Re: Problem using Spark with Hbase

2014-05-30 Thread Vibhor Banga
Thanks Mayur for the reply. Actually the issue was that I was running the Spark application on hadoop-2.2.0 and the HBase version there was 0.95.2. But Spark by default gets built against an older HBase version. So I had to build Spark again with the HBase version set to 0.95.2 in the Spark build file. And it worked. Thanks,

Using Spark on Data size larger than Memory size

2014-05-30 Thread Vibhor Banga
Hi all, I am planning to use Spark with HBase, where I generate an RDD by reading data from an HBase table. I want to know, in the case when the size of the HBase table grows larger than the size of the RAM available in the cluster, will the application fail, or will there be an impact on performance? An
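For context, RDDs do not have to fit entirely in RAM; a persistence level such as MEMORY_AND_DISK lets partitions spill to local disk rather than being recomputed or failing the job. A minimal PySpark sketch, where the input path is a placeholder:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="SpillSketch")

    # Placeholder path; the point is the storage level, not the source.
    rdd = sc.textFile("hdfs:///some/large/input")

    # Partitions that do not fit in memory are written to local disk
    # instead of being dropped and recomputed on every access.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    print(rdd.count())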

RE: Announcing Spark 1.0.0

2014-05-30 Thread Ian Ferreira
Congrats Sent from my Windows Phone From: Dean Wampler Sent: 5/30/2014 6:53 AM To: user@spark.apache.org Subject: Re: Announcing Spark 1.0.0 Congratulations!! On Fri, May 30, 2014 at 5:12 AM, Patrick

Monitoring / Instrumenting jobs in 1.0

2014-05-30 Thread Daniel Siegmann
The Spark 1.0.0 release notes state "Internal instrumentation has been added to allow applications to monitor and instrument Spark jobs." Can anyone point me to the docs for this? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, N

Local file being referenced in mapper function

2014-05-30 Thread Rahul Bhojwani
Hi, I recently posted a question on Stack Overflow but didn't get any reply. I have joined the mailing list now. Can any of you point me toward a solution for the problem described in http://stackoverflow.com/questions/23923966/writing-the-rdd-data-in-excel-file-along-mapping-in-apache-spark Thanks in advance

Re: Announcing Spark 1.0.0

2014-05-30 Thread Dean Wampler
Congratulations!! On Fri, May 30, 2014 at 5:12 AM, Patrick Wendell wrote: > I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 > is a milestone release as the first in the 1.0 line of releases, > providing API stability for Spark's core interfaces. > > Spark 1.0.0 is Spark's

Re: KryoSerializer Exception

2014-05-30 Thread Andrea Esposito
Hi, I just migrated to 1.0. Still having the same issue, either with or without the custom registrator. Just the usage of the KryoSerializer triggers the exception immediately. I set the Kryo settings through the property: System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoS
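As of 1.0 the same settings can also be passed through SparkConf rather than System.setProperty; a PySpark sketch, where the registrator class name is purely illustrative:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("KryoSketch")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            # Hypothetical registrator; only needed when registering custom classes.
            .set("spark.kryo.registrator", "com.example.MyRegistrator"))

    sc = SparkContext(conf=conf)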

Re: SparkContext startup time out

2014-05-30 Thread Pierre B
I was annoyed by this as well. It appears that just permuting the order of dependency inclusion solves this problem: first Spark, then your CDH Hadoop distro. HTH, Pierre -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-tp1753p

Re: Announcing Spark 1.0.0

2014-05-30 Thread Chanwit Kaewkasi
Congratulations !! -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Fri, May 30, 2014 at 5:12 PM, Patrick Wendell wrote: > I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 > is a milestone release as the first in the 1.0 line of releases, > providing API stability f

Re: Announcing Spark 1.0.0

2014-05-30 Thread Ognen Duzlevski
How exciting! Congratulations! :-) Ognen On 5/30/14, 5:12 AM, Patrick Wendell wrote: I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's

Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-30 Thread jamborta
Thanks for the reply. I am definitely running 1.0.0; I set it up manually. To answer my own question, I found out from the examples that it needs a new data type called LabeledPoint instead of a numpy array. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/py
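For anyone hitting the same thing, a small sketch of the LabeledPoint usage; the two training points are made-up toy data.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    sc = SparkContext(appName="LabeledPointSketch")

    # Toy data: (label, feature vector) pairs wrapped in LabeledPoint,
    # which is what the Python MLlib trainers expect instead of bare numpy arrays.
    points = sc.parallelize([LabeledPoint(1.0, [1.0, 0.5]),
                             LabeledPoint(0.0, [0.1, 0.2])])

    model = LogisticRegressionWithSGD.train(points)
    print(model.predict([1.0, 0.4]))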

Yay for 1.0.0! EC2 Still has problems.

2014-05-30 Thread Jeremy Lee
Hi there! I'm relatively new to the list, so sorry if this is a repeat: I just wanted to mention there are still problems with the EC2 scripts. Basically, they don't work. First, if you run the scripts on Amazon's own suggested version of Linux, they break because Amazon installs Python 2.6.9, and

Re: Selecting first ten values in a RDD/partition

2014-05-30 Thread nilmish
My primary goal: to get the top 10 hashtags for every 5-minute interval. I want to do this efficiently. I have already done this by using reduceByKeyAndWindow() and then sorting all hashtags in the 5-minute interval, taking only the top 10 elements. But this is very slow. So now I am thinking of retaining only
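For illustration only, one way to express this in the Python streaming API (which appeared in releases after the versions discussed in this thread); the socket source and port are placeholders, and takeOrdered keeps just the top 10 per window instead of sorting everything.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="TopHashtagsSketch")
    ssc = StreamingContext(sc, 10)          # 10-second batches
    ssc.checkpoint("/tmp/checkpoints")      # required for the inverse-reduce form below

    # Placeholder source: one hashtag per line on a local socket.
    hashtags = ssc.socketTextStream("localhost", 9999)

    # Sliding 5-minute window, updated every 10 seconds, maintained incrementally.
    counts = (hashtags.map(lambda tag: (tag, 1))
              .reduceByKeyAndWindow(lambda a, b: a + b,   # add arriving batches
                                    lambda a, b: a - b,   # subtract departing batches
                                    300, 10))

    # Keep only the 10 largest counts per window rather than fully sorting.
    top10 = counts.transform(
        lambda rdd: rdd.context.parallelize(rdd.takeOrdered(10, key=lambda kv: -kv[1])))
    top10.pprint()

    ssc.start()
    ssc.awaitTermination()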

Re: Announcing Spark 1.0.0

2014-05-30 Thread John Omernik
By the way: This is great work. I am new to the Spark world, and have been like a kid in a candy store learning all it can do. Is there a good list of build variables? What I mean is something like the SPARK_HIVE variable described on the Spark SQL page. I'd like to include that, but once I found that I won

Re: Announcing Spark 1.0.0

2014-05-30 Thread jose farfan
Awesome work On Fri, May 30, 2014 at 12:12 PM, Patrick Wendell wrote: > I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 > is a milestone release as the first in the 1.0 line of releases, > providing API stability for Spark's core interfaces. > > Spark 1.0.0 is Spark's lar

Re: Announcing Spark 1.0.0

2014-05-30 Thread John Omernik
All: In the pom.xml file I see the MapR repository, but it's not included in the ./project/SparkBuild.scala file. Is this expected? I know to build I have to add it there otherwise sbt hates me with evil red messages and such. John On Fri, May 30, 2014 at 6:24 AM, Kousuke Saruta wrote: > Hi

RE: Announcing Spark 1.0.0

2014-05-30 Thread Kousuke Saruta
Hi all, In the announcement, the URL for the 1.0.0 release notes seems to be wrong. The URL should be https://spark.apache.org/releases/spark-release-1-0-0.html but links to https://spark.apache.org/releases/spark-release-1.0.0.html Best Regards, Kousuke From:

Re: Announcing Spark 1.0.0

2014-05-30 Thread prabeesh k
I forgot to hard refresh. thanks On Fri, May 30, 2014 at 4:18 PM, Patrick Wendell wrote: > It is updated - try holding "Shift + refresh" in your browser, you are > probably caching the page. > > On Fri, May 30, 2014 at 3:46 AM, prabeesh k wrote: > > Please update the http://spark.apache.org/d

Re: Announcing Spark 1.0.0

2014-05-30 Thread Margusja
Now I can download. Thanks. Best regards, Margus (Margusja) Roo +372 51 48 780 http://margus.roo.ee http://ee.linkedin.com/in/margusroo skype: margusja ldapsearch -x -h ldap.sk.ee -b c=EE "(serialNumber=37303140314)" On 30/05/14 13:48, Patrick Wendell wrote: It is updated - try holding "Shift +

Re: Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
It is updated - try holding "Shift + refresh" in your browser, you are probably caching the page. On Fri, May 30, 2014 at 3:46 AM, prabeesh k wrote: > Please update the http://spark.apache.org/docs/latest/ link > > > On Fri, May 30, 2014 at 4:03 PM, Margusja wrote: >> >> Is it possible to downl

Re: Announcing Spark 1.0.0

2014-05-30 Thread prabeesh k
Please update the http://spark.apache.org/docs/latest/ link On Fri, May 30, 2014 at 4:03 PM, Margusja wrote: > Is it possible to download pre build package? > http://mirror.symnds.com/software/Apache/incubator/ > spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz - gives me 404 > > Best regards, Ma

Re: Announcing Spark 1.0.0

2014-05-30 Thread Margusja
Is it possible to download pre build package? http://mirror.symnds.com/software/Apache/incubator/spark/spark-1.0.0/spark-1.0.0-bin-hadoop2.tgz - gives me 404 Best regards, Margus (Margusja) Roo +372 51 48 780 http://margus.roo.ee http://ee.linkedin.com/in/margusroo skype: margusja ldapsearch -x

Re: Announcing Spark 1.0.0

2014-05-30 Thread Christopher Nguyen
Awesome work, Pat et al.! -- Christopher T. Nguyen Co-founder & CEO, Adatao linkedin.com/in/ctnguyen On Fri, May 30, 2014 at 3:12 AM, Patrick Wendell wrote: > I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 > is a milestone release as the first in the

Announcing Spark 1.0.0

2014-05-30 Thread Patrick Wendell
I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0 is a milestone release as the first in the 1.0 line of releases, providing API stability for Spark's core interfaces. Spark 1.0.0 is Spark's largest release ever, with contributions from 117 developers. I'd like to thank everyon

Exception failure: java.lang.ClassNotFoundException: org.apache.spark.streaming.kafka.KafkaReceiver

2014-05-30 Thread Margusja
Hi, the Spark version I am using is spark-0.9.1-bin-hadoop2. I built spark-assembly_2.10-0.9.1-hadoop2.2.0.jar. I moved JavaKafkaWordCount.java from examples to a new directory to play with it. My compile commands: javac -cp libs/spark-streaming_2.10-0.9.1.jar:libs/spark-assembly_2.10-0.9.1-hadoop2.2

Web frontend issue

2014-05-30 Thread Limbeck, Philip
Hi! I am currently facing an issue with the Spark 0.9.1 web frontend. Although the cluster seems to work fine and all workers are registered, the links pointing to the worker-specific frontend do not work because the link targets look like this: http://:8081 However, I am still able to ma