Re: Requirements for Spark cluster

2014-07-08 Thread Akhil Das
You can use the spark-ec2/bdutil scripts to set it up on the AWS/GCE cloud quickly. If you want to set it up on your own, then these are the things that you will need to do: 1. Make sure you have Java (7) installed on all machines. 2. Install and configure Spark (add all slave nodes in conf/slaves
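For a standalone install, a minimal sketch of the slave list and the launch step (hostnames are placeholders, assuming the sbin scripts shipped with Spark 1.0):

    # conf/slaves -- one worker hostname per line
    slave1
    slave2

    # run on the master node; starts the master plus a worker on every host listed above
    ./sbin/start-all.sh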

Need advice to create an objectfile of set of images from Spark

2014-07-08 Thread Jaonary Rabarisoa
Hi all, I need to run a Spark job that needs a set of images as input. I need something that loads these images as an RDD, but I just don't know how to do that. Does anyone have an idea? Cheers, Jao
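One simple sketch, assuming the image paths are known up front and every worker can read them (e.g. a shared filesystem); the paths, partition count and app name are placeholders:

    import java.nio.file.{Files, Paths}
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("LoadImages"))
    val imagePaths = Seq("/data/images/img1.png", "/data/images/img2.png")  // placeholder paths
    // RDD of (path, raw image bytes); decode to BufferedImage etc. in a later map if needed
    val images = sc.parallelize(imagePaths, 4)
      .map(p => (p, Files.readAllBytes(Paths.get(p))))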

Re: spark Driver

2014-07-08 Thread amin mohebbi
This is exactly what I got  Spark Executor Command: "java" "-cp" ":: /usr/local/spark-1.0.0/conf: /usr/local/spark-1.0.0 /assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf" " -XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrained

Re: Spark Installation

2014-07-08 Thread 田毅
Hi Srikrishna, the reason for this issue is that you uploaded the assembly jar to HDFS twice. Pasting your command would allow a better diagnosis. 田毅 === Orange Cloud Platform Product Line, Big Data Products Department, AsiaInfo-Linkage Technologies (China) Co., Ltd. Mobile: 13910177261 Tel: 010-82166322 Fax: 010-82166617 QQ: 20057509 MSN: yi.t...@hotmail.c

Re: spark Driver

2014-07-08 Thread Akhil Das
Can you also paste a little bit more stacktrace? Thanks Best Regards On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi wrote: > I have the following in spark-env.sh > > SPARK_MASTER_IP=master > SPARK_MASTER_port=7077 > > > Best Regards > > ... >

Re: spark Driver

2014-07-08 Thread amin mohebbi
I have the following in spark-env.sh  SPARK_MASTER_IP=master SPARK_MASTER_port=7077   Best Regards ... Amin Mohebbi PhD candidate in Software Engineering   at university of Malaysia   H/P : +60 18 2040 017 E-Mail : tp025...@ex.api

Re: spark Driver

2014-07-08 Thread Akhil Das
Can you try setting SPARK_MASTER_IP in the spark-env.sh file? Thanks Best Regards On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi wrote: > > Hi all, > I have one master and two slave node, I did not set any ip for spark > driver because I thought it uses its default ( localhost). In my etc/host

Standalone cluster on Windows

2014-07-08 Thread Chitturi Padma
Hi, I wanted to set up a standalone cluster on a Windows machine. But unfortunately, the spark-master.cmd file is not available. Can someone suggest how to proceed, or is the spark-master.cmd file missing from spark-1.0.0? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

spark Driver

2014-07-08 Thread amin mohebbi
Hi all, I have one master and two slave nodes. I did not set any IP for the Spark driver because I thought it uses its default (localhost). In my etc/hosts I have the following: 192.168.0.1 master, 192.168.0.2 slave, 192.168.03 slave2, 127.0.0.0 local host and 127.0.1.1 virtualbox. Should I do s

Kryo is slower, and the size saving is minimal

2014-07-08 Thread innowireless TaeYun Kim
Hi, For my test case, using the Kryo serializer does not help. It is slower than the default Java serializer, and the size saving is minimal. I've registered almost all classes with the Kryo registrator. What is happening in my test case? Has anyone experienced a case like this?
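For reference, a registrator wired up the Spark 1.0-era way looks roughly like this; the record class and package name are placeholders:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class MyRecord(id: Long, text: String)          // placeholder application class

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyRecord])                  // register every heavily-serialized class
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")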

Re: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Aaron Davidson
By the way, you can run the sc.getConf.get("spark.driver.host") thing inside spark-shell, whether or not the Executors actually start up successfully. On Tue, Jul 8, 2014 at 8:23 PM, Aaron Davidson wrote: > You actually should avoid setting SPARK_PUBLIC_DNS unless necessary, I > thought you mig

Re: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Aaron Davidson
You actually should avoid setting SPARK_PUBLIC_DNS unless necessary, I thought you might have preemptively done so. I think the issue is actually related to your network configuration, as Spark probably failed to find your driver's ip address. Do you see a warning on the driver that looks something

Re: Document page load fault

2014-07-08 Thread Matei Zaharia
Thanks for catching this. For now you can just access the page through http:// instead of https:// to avoid this. Matei On Jul 8, 2014, at 10:46 PM, binbinbin915 wrote: > https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression > on Chrome 35 with MacX > > Says: > [b

Document page load fault

2014-07-08 Thread binbinbin915
https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression on Chrome 35 with MacX Says: [blocked] The page at 'https://spark.apache.org/docs/latest/mllib-linear-methods.html' was loaded over HTTPS, but ran insecure content from 'http://cdn.mathjax.org/mathjax/latest/MathJax

slower worker node in the cluster

2014-07-08 Thread haopu
In a standalone cluster, is there a way to specify the stage to run on a faster worker? That stage is reading an HDFS file and then doing some filter operations. The tasks are also assigned to the slower worker, but the slower worker is late to launch because it's running some tasks from other s

Re: error when spark access hdfs with Kerberos enable

2014-07-08 Thread Cheney Sun
Hi Sandy, We are also going to grep data from a security-enabled (with Kerberos) HDFS in our Spark application. Per your answer, we have to switch to Spark on YARN to achieve this. We plan to deploy a different Hadoop cluster (with YARN) only to run Spark. Is it necessary to deploy YARN with security e

RE: Spark Streaming and Storm

2014-07-08 Thread Shao, Saisai
You may get the performance comparison results from the Spark Streaming paper and meetup slides; just Google them. Actually, performance comparison is case by case and relies on your workload design and hardware and software configurations. There is no single winner across all scenarios. Thanks Jerry F

Requirements for Spark cluster

2014-07-08 Thread Robert James
I have a Spark app which runs well on local master. I'm now ready to put it on a cluster. What needs to be installed on the master? What needs to be installed on the workers? If the cluster already has Hadoop or YARN or Cloudera, does it still need an install of Spark?

Purpose of spark-submit?

2014-07-08 Thread Robert James
What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ?
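Beyond building a SparkConf/SparkContext, spark-submit sets up the classpath against the Spark assembly, selects the master and deploy mode, and ships the application jar (and any --jars) to the cluster. An illustrative invocation, where the class name, master URL and jar path are placeholders:

    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://master:7077 \
      --executor-memory 2G \
      target/scala-2.10/myapp-assembly.jar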

RE: error when spark access hdfs with Kerberos enable

2014-07-08 Thread 许晓炜
Thanks a lot Marcelo and Sandy. I will try spark on yarn . Xiaowei From: Sandy Ryza [mailto:sandy.r...@cloudera.com] Sent: Wednesday, July 09, 2014 4:20 AM To: user@spark.apache.org Subject: Re: error when spark access hdfs with Kerberos enable That's correct. Only Spark on YARN supports Kerber

Spark Streaming and Storm

2014-07-08 Thread xichen_tju@126
hi all, I am a newbie to Spark Streaming, and used Storm before. Have you tested the performance of both of them, and which one is better? xichen_tju@126

RE: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Shao, Saisai
Yes, that would be the Java equivalence to use static class member, but you should carefully program to prevent resource leakage. A good choice is to use third-party DB connection library which supports connection pool, that will alleviate your programming efforts. Thanks Jerry From: Juan Rodr
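A sketch of the per-partition pattern with a JVM-wide lazy JDBC connection standing in for a real pool; the JDBC URL, table name and DStream element type are placeholders:

    import java.sql.{Connection, DriverManager}
    import org.apache.spark.streaming.dstream.DStream

    // One lazily created connection per executor JVM (a pooling library would manage several).
    object ConnectionHolder {
      lazy val connection: Connection =
        DriverManager.getConnection("jdbc:postgresql://db-host/mydb")   // placeholder URL
    }

    def writeStream(dstream: DStream[String]): Unit =
      dstream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val stmt = ConnectionHolder.connection
            .prepareStatement("INSERT INTO events(payload) VALUES (?)") // placeholder table
          records.foreach { r => stmt.setString(1, r); stmt.executeUpdate() }
          stmt.close()
        }
      }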

Re: Comparative study

2014-07-08 Thread Keith Simmons
Santosh, To add a bit more to what Nabeel said, Spark and Impala are very different tools. Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the hadoop e

Spark Streaming using File Stream in Java

2014-07-08 Thread Aravind
Hi all, I am trying to run the NetworkWordCount.java file in the streaming examples. The example shows how to read from a network socket. But my use case is that I have a local log file which is a stream and continuously updated (say /Users/.../Desktop/mylog.log). I would like to write the same
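A sketch using textFileStream, which watches a directory for newly created files rather than tailing one file that keeps growing, so a log appended in place would need to be rotated into the watched directory; the path and batch interval are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (Spark 1.x)

    val conf = new SparkConf().setAppName("FileStreamWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("/Users/me/Desktop/logs/")      // placeholder directory
    val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()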

Re: Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
Hi Tobias, Currently, I do not bundle any dependencies into my application jar. I will try that. Thanks a lot! Bill On Tue, Jul 8, 2014 at 5:22 PM, Tobias Pfeiffer wrote: > Bill, > > have you packaged "org.apache.spark" % "spark-streaming-kafka_2.10" % > "1.0.0" into your application jar? I

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Tobias Pfeiffer
Bill, do the additional 100 nodes receive any tasks at all? (I don't know which cluster you use, but with Mesos you could check client logs in the web interface.) You might want to try something like repartition(N) or repartition(N*2) (with N the number of your nodes) after you receive your data.

Re: Spark-streaming-kafka error

2014-07-08 Thread Tobias Pfeiffer
Bill, have you packaged "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.0.0" into your application jar? If I remember correctly, it's not bundled with the downloadable compiled version of Spark. Tobias On Wed, Jul 9, 2014 at 8:18 AM, Bill Jay wrote: > Hi all, > > I used sbt to package
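A build.sbt sketch along these lines; versions are illustrative, the Spark core/streaming artifacts are marked provided because the cluster supplies them, and sbt-assembly (or similar) builds the fat jar that bundles the Kafka connector:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"
    )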

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Mikhail Strebkov
Thanks Andrew, ./bin/spark-shell --master spark://10.2.1.5:7077 --total-executor-cores 30 --executor-memory 20g --driver-memory 10g works well, just wanted to make sure that I'm not missing anything -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-wi

Re: Cannot create dir in Tachyon when running Spark with OFF_HEAP caching (FileDoesNotExistException)

2014-07-08 Thread Teng Long
More updates: Seems in TachyonBlockManager.scala(line 118) of Spark 1.1.0, the TachyonFS.mkdir() method is called, which creates a directory in Tachyon. Right after that, TachyonFS.getFile() method is called. In all the versions of Tachyon I tried (0.4.1, 0.4.0), the second method will return a nu

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Andrew Or
>> "The proper way to specify this is through "spark.master" in your config or the "--master" parameter to spark-submit." By "this" I mean configuring which master the driver connects to (not which port and address the standalone Master binds to). 2014-07-08 16:43 GMT-07:00 Andrew Or : > Hi Mik

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Andrew Or
Hi Mikhail, It looks like the documentation is a little out-dated. Neither is true anymore. In general, we try to shift away from short options ("-em", "-dm" etc.) in favor of more explicit ones ("--executor-memory", "--driver-memory"). These options, and "--cores", refer to the arguments passed i

issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Mikhail Strebkov
Hi! I've been using Spark compiled from 1.0 branch at some point (~2 month ago). The setup is a standalone cluster with 4 worker machines and 1 master machine. I used to run spark shell like this: ./bin/spark-shell -c 30 -em 20g -dm 10g Today I've finally updated to Spark 1.0 release. Now I can

Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
Hi all, I used sbt to package a code that uses spark-streaming-kafka. The packaging succeeded. However, when I submitted to yarn, the job ran for 10 seconds and there was an error in the log file as follows: Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$

Re: Spark job tracker.

2014-07-08 Thread abhiguruvayya
Hello Mayur, How can I implement the methods mentioned below? Do you have any clue on this? Please let me know. public void onJobStart(SparkListenerJobStart arg0) { } @Override public void onStageCompleted(SparkListenerStageCompleted arg0) { }
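A Scala sketch of those callbacks, registered on the SparkContext; the field names reflect the 1.0-era listener API and should be checked against the version in use:

    import org.apache.spark.SparkContext
    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

    class JobTrackingListener extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart) {
        println(s"Job ${jobStart.jobId} started")
      }
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
        println(s"Stage ${stageCompleted.stageInfo.stageId} completed")
      }
    }

    def register(sc: SparkContext): Unit = sc.addSparkListener(new JobTrackingListener)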

Re: Spark: All masters are unresponsive!

2014-07-08 Thread Andrew Or
It seems that your driver (which I'm assuming you launched on the master node) can now connect to the Master, but your executors cannot. Did you make sure that all nodes have the same conf/spark-defaults.conf, conf/spark-env.sh, and conf/slaves? It would be good if you can post the stderr of the ex

spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-08 Thread Andrew Lee
Build: Spark 1.0.0 rc11 (git commit tag: 2f1dc868e5714882cf40d2633fb66772baf34789) Hi All, When I enabled the spark-defaults.conf for the eventLog, spark-shell broke while spark-submit works. I'm trying to create a separate directory per user to keep track with their own Spark job event

Re: Comparative study

2014-07-08 Thread Robert James
As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in between all using more or less straight Scala is unarguable. But the experience of deploying Spark has been quite painful, mainly about gaps between compile time and run

RE: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Sameer Tilak
Hi Aaron, Would really appreciate your help if you can point me to the documentation. Is this something that I need to do with /etc/hosts on each of the worker machines? Or do I set SPARK_PUBLIC_DNS (if yes, what is the format?) or something else? I have the following set up: master node: pzxnv

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Have you tried the obvious (increase the heap size of your JVM)? On Tue, Jul 8, 2014 at 2:02 PM, Rahul Bhojwani wrote: > Thanks Marcelo. > I was having another problem. My code was running properly and then it > suddenly stopped with the error: > > java.lang.OutOfMemoryError: Java heap space >
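For reference, two illustrative ways to raise the heap (the values, file name and config key placement are placeholders, assuming Spark 1.0's spark-submit):

    # pass bigger heaps on the command line
    ./bin/spark-submit --driver-memory 4G --executor-memory 4G my_app.py

    # or persist the setting in conf/spark-defaults.conf
    spark.executor.memory   4g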

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Thanks Marcelo. I was having another problem. My code was running properly and then it suddenly stopped with the error: java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.(Unknown Source) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Aaron, I don't think anyone was saying Spark can't handle this data size, given testimony from the Spark team, Bizo, etc., on large datasets. This has kept us trying different things to get our flow to work over the course of several weeks. Agreed that the first instinct should be "what did I do

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Sorry, that would be sc.stop() (not close). On Tue, Jul 8, 2014 at 1:31 PM, Marcelo Vanzin wrote: > Hi Rahul, > > Can you try calling "sc.close()" at the end of your program, so Spark > can clean up after itself? > > On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani > wrote: >> Here I am adding my

Re: Error and doubts in using Mllib Naive bayes for text clasification

2014-07-08 Thread Rahul Bhojwani
Thanks Xiangrui. You have solved almost all my problems :) On Wed, Jul 9, 2014 at 1:47 AM, Xiangrui Meng wrote: > 1) The feature dimension should be a fixed number before you run > NaiveBayes. If you use bag of words, you need to handle the > word-to-index dictionary by yourself. You can either

Re: Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Rahul Bhojwani
Thanks a lot Xiangrui for the help. On Wed, Jul 9, 2014 at 1:39 AM, Xiangrui Meng wrote: > Well, I believe this is a correct implementation but please let us > know if you run into problems. The NaiveBayes implementation in MLlib > v1.0 supports sparse data, which is usually the case for text >

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Hi Rahul, Can you try calling "sc.close()" at the end of your program, so Spark can clean up after itself? On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani wrote: > Here I am adding my code. If you can have a look to help me out. > Thanks > ### > > import tokenizer > import ge

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Rahul Bhojwani
Thanks a lot Xiangrui. This will help. On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng wrote: > Hi Rahul, > > We plan to add online model updates with Spark Streaming, perhaps in > v1.1, starting with linear methods. Please open a JIRA for Naive > Bayes. For Naive Bayes, we need to update the pri

OutOfMemory : Java heap space error

2014-07-08 Thread Rahul Bhojwani
Hi, My code was running properly but then it suddenly gave this error. Can you shed some light on it? ### 0 KB, free: 38.7 MB) 14/07/09 01:46:12 INFO BlockManagerMaster: Updated info of block rdd_2212_4 14/07/09 01:46:13 INFO PythonRDD: Times: total = 1486, boot = 698, ini

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Pierre B
Cool Thanks Michael! Message sent from a mobile device - excuse typos and abbreviations > Le 8 juil. 2014 à 22:17, Michael Armbrust [via Apache Spark User List] > a écrit : > >> On Tue, Jul 8, 2014 at 12:43 PM, Pierre B <[hidden email]> wrote: >> 1/ Is there a way to convert a SchemaRDD (for i

Re: Help for the large number of the input data files

2014-07-08 Thread Xiangrui Meng
You can either use sc.wholeTextFiles and then a flatMap to reduce the number of partitions, or give more memory to the driver process by using --driver-memory 20g and then call RDD.repartition(small number) after you load the data in. -Xiangrui On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun K
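A rough sketch of both suggestions; the paths and partition count are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ManySmallFiles"))

    // Option 1: read whole files as (path, content) pairs and split the contents yourself
    val lines = sc.wholeTextFiles("hdfs:///data/small-files/*")
      .flatMap { case (_, content) => content.split("\n") }

    // Option 2: read normally, then shrink the partition count after loading
    val repartitioned = sc.textFile("hdfs:///data/small-files/*").repartition(100)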

Re: error when spark access hdfs with Kerberos enable

2014-07-08 Thread Sandy Ryza
That's correct. Only Spark on YARN supports Kerberos. -Sandy On Tue, Jul 8, 2014 at 12:04 PM, Marcelo Vanzin wrote: > Someone might be able to correct me if I'm wrong, but I don't believe > standalone mode supports kerberos. You'd have to use Yarn for that. > > On Tue, Jul 8, 2014 at 1:40 AM,

Re: Error and doubts in using Mllib Naive bayes for text clasification

2014-07-08 Thread Xiangrui Meng
1) The feature dimension should be a fixed number before you run NaiveBayes. If you use bag of words, you need to handle the word-to-index dictionary by yourself. You can either ignore the words that never appear in training (because they have no effect in prediction), or use hashing to randomly pr
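A minimal sketch of the hand-built dictionary route; the RDD shape of (label, tokenized document) and the helper name are assumptions, not from the thread:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    def toLabeledPoints(docs: RDD[(Double, Seq[String])]): RDD[LabeledPoint] = {
      // vocabulary built from the training data only; unseen words are simply dropped
      val dict: Map[String, Int] = docs.flatMap(_._2).distinct().collect().zipWithIndex.toMap
      val numFeatures = dict.size
      docs.map { case (label, words) =>
        val counts = words.flatMap(dict.get)
          .groupBy(identity).map { case (i, xs) => (i, xs.size.toDouble) }
        LabeledPoint(label, Vectors.sparse(numFeatures, counts.toSeq))
      }
    }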

Re: Comparative study

2014-07-08 Thread Aaron Davidson
> > Not sure exactly what is happening but perhaps there are ways to > restructure your program for it to work better. Spark is definitely able to > handle much, much larger workloads. +1 @Reynold Spark can handle big "big data". There are known issues with informing the user about what went wro

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Michael Armbrust
On Tue, Jul 8, 2014 at 12:43 PM, Pierre B < pierre.borckm...@realimpactanalytics.com> wrote: > > 1/ Is there a way to convert a SchemaRDD (for instance loaded from a > parquet > file) back to a RDD of a given case class? > There may be someday, but doing so will either require a lot of reflection
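Until such support exists, a SchemaRDD is still an RDD[Row], so a manual mapping works. A sketch, where the case class, field order and types are assumptions about the stored schema and sc is an existing SparkContext:

    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)   // placeholder case class

    val sqlContext = new SQLContext(sc)
    val schemaRDD = sqlContext.parquetFile("people.parquet")          // placeholder path
    val people = schemaRDD.map(row => Person(row.getString(0), row.getInt(1)))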

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I think we're missing the point a bit. Everything was actually flowing through smoothly and in a reasonable time. Until it reached the last two tasks (out of over a thousand in the final stage alone), at which point it just fell into a coma. Not so much as a cranky message in the logs. I don't kno

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
Yin (cc-ed) is working on it as we speak. We'll post to the JIRA as soon as a PR is up. On Tue, Jul 8, 2014 at 1:04 PM, Ionized wrote: > Thanks for the heads-up. > > In the meantime, we'd like to test this out ASAP - are there any open PR's > we could take to try it out? (or do you have an est

Re: got java.lang.AssertionError when run sbt/sbt compile

2014-07-08 Thread Xiangrui Meng
try sbt/sbt clean first On Tue, Jul 8, 2014 at 8:25 AM, bai阿蒙 wrote: > Hi guys, > when i try to compile the latest source by sbt/sbt compile, I got an error. > Can any one help me? > > The following is the detail: it may cause by TestSQLContext.scala > [error] > [error] while compiling: > /d

Re: Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Xiangrui Meng
Well, I believe this is a correct implementation but please let us know if you run into problems. The NaiveBayes implementation in MLlib v1.0 supports sparse data, which is usually the case for text classification. I would recommend upgrading to v1.0. -Xiangrui On Tue, Jul 8, 2014 at 7:20 AM, Rah

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Ionized
Thanks for the heads-up. In the meantime, we'd like to test this out ASAP - are there any open PR's we could take to try it out? (or do you have an estimate on when some will be available?) On Tue, Jul 8, 2014 at 12:24 AM, Michael Armbrust wrote: > This is on the roadmap for the next release (

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Xiangrui Meng
Hi Rahul, We plan to add online model updates with Spark Streaming, perhaps in v1.1, starting with linear methods. Please open a JIRA for Naive Bayes. For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for the

Re: Powered By Spark: Can you please add our org?

2014-07-08 Thread Reynold Xin
I added you to the list. Cheers. On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio wrote: > Hi, > > Sailthru is also using Spark. Could you please add us to the Powered By > Spark > page > when you have a chance? > > Organization

Re: Scheduling in spark

2014-07-08 Thread Andrew Or
Here's the most updated version of the same page: http://spark.apache.org/docs/latest/job-scheduling 2014-07-08 12:44 GMT-07:00 Sujeet Varakhedi : > This is a good start: > > http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html > > > On Tue, Jul 8, 2014 at 9:11 AM, rapelly kartheek

Re: Comparative study

2014-07-08 Thread Reynold Xin
Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also run something that shuffled 5TB/node and stuffed m

[Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Pierre B
Hi there! 1/ Is there a way to convert a SchemaRDD (for instance loaded from a parquet file) back to a RDD of a given case class? 2/ Even better, is there a way to get the schema information from a SchemaRDD ? I am trying to figure out how to properly get the various fields of the Rows of a Schem

Re: Scheduling in spark

2014-07-08 Thread Sujeet Varakhedi
This is a good start: http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html On Tue, Jul 8, 2014 at 9:11 AM, rapelly kartheek wrote: > Hi, > I am a post graduate student, new to spark. I want to understand how > Spark scheduler works. I just have theoretical understanding of DAG >

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
I have pasted the logs below: PS F:\spark-0.9.1\codes\sentiment analysis> pyspark .\naive_bayes_analyser.py Running python with PYTHONPATH=F:\spark-0.9.1\spark-0.9.1\bin\..\python; SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/F:/spark-0.9.1/spark-0.9.1/as

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Here I am adding my code. If you can have a look to help me out. Thanks ### import tokenizer import gettingWordLists as gl from pyspark.mllib.classification import NaiveBayes from numpy import array from pyspark import SparkContext, SparkConf conf = (SparkConf().setMaster("loc

Re: Comparative study

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > > Libraries like Scoobi, Scrunch and Scalding (and their associated Java > versions) provide a Spark-like wrapper around Map/Reduce but my guess is > that, since they are limited to Map/Reduce under the covers,

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
We kind of hijacked Santos' original thread, so apologies for that and let me try to get back to Santos' original question on Map/Reduce versus Spark. I would say it's worth migrating from M/R, with the following thoughts. Just my opinion but I would summarize the latest emails in this thread as

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Note I didn't say that was your problem - it would be if (i) you're running your job on Yarn and (ii) you look at the Yarn NodeManager logs and see that it's actually killing your process. I just said that the exception shows up in those kinds of situations. You haven't provided enough information

RE: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread sstilak
Hi Aaron, I have 4 nodes - 1 master and 3 workers. I am not setting the driver public DNS name anywhere. I didn't see that step in the documentation -- maybe I missed it. Can you please point me in the right direction? Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone Origina

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Hi Marcelo. Thanks for the quick reply. Can you suggest me how to increase the memory limits or how to tackle this problem. I am a novice. If you want I can post my code here. Thanks On Wed, Jul 9, 2014 at 12:50 AM, Marcelo Vanzin wrote: > This is generally a side effect of your executor bein

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
This is generally a side effect of your executor being killed. For example, Yarn will do that if you're going over the requested memory limits. On Tue, Jul 8, 2014 at 12:17 PM, Rahul Bhojwani wrote: > HI, > > I am getting this error. Can anyone help out to explain why is this error > coming. > >

Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
HI, I am getting this error. Can anyone help out to explain why is this error coming. Exception in thread "delete Spark temp dir C:\Users\shawn\AppData\Local\Temp\spark-27f60467-36d4-4081-aaf5-d0ad42dda560" java.io.IOException: Failed to delete: C:\Users\shawn\AppData\Local\Temp\spark-

Re: error when spark access hdfs with Kerberos enable

2014-07-08 Thread Marcelo Vanzin
Someone might be able to correct me if I'm wrong, but I don't believe standalone mode supports kerberos. You'd have to use Yarn for that. On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 wrote: > Hi all, > > > > I encounter a strange issue when using spark 1.0 to access hdfs with > Kerberos > > I just have on

Re: Spark Installation

2014-07-08 Thread Srikrishna S
Hi All, I tried the make-distribution script and it worked well. I was able to compile the Spark binary on our CDH5 cluster. Once I compiled Spark, I copied over the binaries in the dist folder to all the other machines in the cluster. However, I ran into an issue while submitting a job in yarn-clie

Re: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Aaron Davidson
Hmm, looks like the Executor is trying to connect to the driver on localhost, from this line: 14/07/08 11:07:13 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@localhost:39701/user/CoarseGrainedScheduler What is your setup? Standalone mode with 4 separate machines? Are yo

CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Sameer Tilak
Dear All, When I look inside the following directory on my worker node:$SPARK_HOME/work/app-20140708110707-0001/3 I see the following error message: log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).log4j:WARN Please initialize the log4j system properly.log

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Also, our exact same flow but with 1 GB of input data completed fine. -Suren On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > How wide are the rows of data, either the raw input data or any generated > intermediate data? > > We are at a loss as to why our

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
How wide are the rows of data, either the raw input data or any generated intermediate data? We are at a loss as to why our flow doesn't complete. We banged our heads against it for a few weeks. -Suren On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey wrote: > Nothing particularly custom. We've

Re: Comparative study

2014-07-08 Thread Kevin Markey
Nothing particularly custom.  We've tested with small (4 node) development clusters, single-node pseudoclusters, and bigger, using plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master, Spark local, Spark Yarn (client and cluster) modes, with total me

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Bill Jay
Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer wrote: > Bill, > > can't you just add more nodes in order to speed up the processing? > > Tobias > > > On Thu, J

Further details on spark cluster set up

2014-07-08 Thread Sameer Tilak
Hi All, I used ip addresses in my scripts (spark-env.sh) and slaves contain ip addresses of master and slave nodes respectively. However, I still have no luck. Here is the relevant log file snippet: Master node log:14/07/08 10:56:19 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
To clarify, we are not persisting to disk. That was just one of the experiments we did because of some issues we had along the way. At this time, we are NOT using persist but cannot get the flow to complete in Standalone Cluster mode. We do not have a YARN-capable cluster at this time. We agree w

Join two Spark Streaming

2014-07-08 Thread Bill Jay
Hi all, I am working on a pipeline that needs to join two Spark streams. The input is a stream of integers. And the output is the number of integer's appearance divided by the total number of unique integers. Suppose the input is: 1 2 3 1 2 2 There are 3 unique integers and 1 appears twice. Ther
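A per-batch sketch of that ratio; whether it should span batches (via windowing or updateStateByKey) depends on the actual requirement, and the names here are placeholders:

    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations (Spark 1.x)
    import org.apache.spark.streaming.dstream.DStream

    def appearanceRatio(ints: DStream[Int]): DStream[(Int, Double)] = {
      val counts = ints.map(i => (i, 1L)).reduceByKey(_ + _)
      counts.transform { rdd =>
        val unique = rdd.count()                                  // distinct integers in this batch
        rdd.map { case (i, c) => (i, c.toDouble / unique) }       // empty batch => nothing to map
      }
    }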

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
This seems almost equivalent to a heap size error -- since GCs are stop-the-world events, the fact that we were unable to release more than 2% of the heap suggests that almost all the memory is *currently in use *(i.e., live). Decreasing the number of cores is another solution which decreases memo

Re: Comparative study

2014-07-08 Thread Kevin Markey
It seems to me that you're not taking full advantage of the lazy evaluation, especially persisting to disk only.  While it might be true that the cumulative size of the RDDs looks like it's 300GB, only a small portion of that should be resident at any one time.  We've eva

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Jerry Lam
Hi Konstantin, I just ran into the same problem. I mitigated the issue by reducing the number of cores when I executed the job, which otherwise wouldn't be able to finish. Unlike what many people believe, it might not mean that you were running out of memory. A better answer can be found here: http:/

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
There is a difference from actual GC overhead, which can be reduced by reusing objects, versus this error, which actually means you ran out of memory. This error can probably be relieved by increasing your executor heap size, unless your data is corrupt and it is allocating huge arrays, or you are

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I believe our full 60 days of data contains over ten million unique entities. Across 10 days I'm not sure, but it should be in the millions. I haven't verified that myself though. So that's the scale of the RDD we're writing to disk (each entry is entityId -> profile). I think it's hard to know ho

Re: how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Mark Hamstra
See Working with Key-Value Pairs . In particular: "In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)), as long as you import org

Re: how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Sean Owen
If your RDD contains pairs, like an RDD[(String,Integer)] or something, then you get to use the functions in PairRDDFunctions as if they were declared on RDD. On Tue, Jul 8, 2014 at 6:25 PM, Konstantin Kudryavtsev < kudryavtsev.konstan...@gmail.com> wrote: > Hi all, > > sorry for fooly question,

how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Konstantin Kudryavtsev
Hi all, sorry for the silly question, but how can I get a PairRDDFunctions RDD? I'm doing it to perform a leftOuterJoin afterwards. Currently I do it this way (it seems incorrect): val parRDD = new PairRDDFunctions( oldRdd.map(i => (i.key, i)) ) I guess this constructor is definitely wrong... Thank you,
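As the replies note, the constructor is not needed; importing the implicit conversion is enough. A small sketch, where Record and its key field are placeholders for the poster's own type:

    import org.apache.spark.SparkContext._   // brings the RDD-to-PairRDDFunctions implicit into scope
    import org.apache.spark.rdd.RDD

    case class Record(key: String, value: Int)

    def joinWith(oldRdd: RDD[Record], other: RDD[(String, String)]): RDD[(String, (Record, Option[String]))] =
      oldRdd.map(r => (r.key, r)).leftOuterJoin(other)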

RE: Spark: All masters are unresponsive!

2014-07-08 Thread Sameer Tilak
Hi Akhil et al., I made the following changes: In spark-env.sh I added the following three entries (standalone mode): export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org export SPARK_WORKER_MEMORY=4G export SPARK_WORKER_CORES=3 I then use the start-master and start-slaves commands to start the services. Anoth

Re: NoSuchMethodError in KafkaReciever

2014-07-08 Thread Michael Chang
To be honest I'm a scala newbie too. I just copied it from createStream. I assume it's the canonical way to convert a java map (JMap) to a scala map (Map) On Mon, Jul 7, 2014 at 1:40 PM, mcampbell wrote: > xtrahotsauce wrote > > I had this same problem as well. I ended up just adding the nec

Please add Talend to "Powered By Spark" page

2014-07-08 Thread Daniel Kulp
We are looking to add a note about Talend Open Studio's support for Spark components to the page at: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark Name: Talend Open Studio URL: http://www.talendforge.org/exchange/ Description: Talend Labs are building open source tooling t

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
I'll respond for Dan. Our test dataset was a total of 10 GB of input data (full production dataset for this particular dataflow would be 60 GB roughly). I'm not sure what the size of the final output data was but I think it was on the order of 20 GBs for the given 10 GB of input data. Also, I can

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread DB Tsai
We're doing a similar thing to launch Spark jobs in Tomcat, and I opened a JIRA for this. There are a couple of technical discussions there. https://issues.apache.org/jira/browse/SPARK-2100 In the end, we realized that Spark uses Jetty not only for the Spark WebUI, but also for distributing the jars and task

Re: Spark Installation

2014-07-08 Thread Sandy Ryza
Hi Srikrishna, The binaries are built with something like mvn package -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 -Dyarn.version=2.3.0-cdh5.0.1 -Sandy On Tue, Jul 8, 2014 at 3:14 AM, 田毅 wrote: > try this command: > > make-distribution.sh --hadoop 2.3.0-cdh5.0.0 --with-yarn --with-hive > > > > > 田毅

java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Konstantin Kudryavtsev
Hi all, I faced with the next exception during map step: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded) java.lang.reflect.Array.newInstance(Array.java:70) com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySeria

Scheduling in spark

2014-07-08 Thread rapelly kartheek
Hi, I am a post graduate student, new to spark. I want to understand how Spark scheduler works. I just have theoretical understanding of DAG scheduler and the underlying task scheduler. I want to know, given a job to the framework, after the DAG scheduler phase, how the scheduling happens?? Can
