Re: Spark and HDFS ( Worker and Data Nodes Combination )

2015-06-22 Thread Akhil Das
Option 1 should be fine; Option 2 would put a heavy load on the network as the data grows over time. Thanks Best Regards On Mon, Jun 22, 2015 at 5:59 PM, Ashish Soni asoni.le...@gmail.com wrote: Hi All, What is the best way to install a Spark cluster alongside a Hadoop cluster, Any

Re: Spark Titan

2015-06-21 Thread Akhil Das
Have a look at http://s3.thinkaurelius.com/docs/titan/0.5.0/titan-io-format.html You could use those Input/Output formats with the newAPIHadoopRDD api call. Thanks Best Regards On Sun, Jun 21, 2015 at 8:50 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi, How to connect Titan

Re: Local spark jars not being detected

2015-06-20 Thread Akhil Das
Not sure, but try removing the provided scope, or create a lib directory in the project home and put that jar there. On 20 Jun 2015 18:08, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote: Hi, I'm using the IntelliJ IDE for my spark project. I've compiled spark 1.3.0 for scala 2.11.4 and

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-19 Thread Akhil Das
restarting from checkpoint even for graceful shutdown. I think usually the file is expected to be processed only once. Maybe this is a bug in fileStream? Or do you know any approach to work around it? Much thanks! -- From: Akhil Das [mailto:ak

Re: Build spark application into uber jar

2015-06-19 Thread Akhil Das
This is how I used to build an assembly jar with sbt. Your build.sbt file would look like this: import AssemblyKeys._ assemblySettings name := "FirstScala" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" libraryDependencies +=
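
For reference, a minimal, self-contained sketch of such a build.sbt (the plugin version and the "provided" scope are assumptions, not from the original mail; sbt-assembly 0.11.x also needs addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2") in project/plugins.sbt):

    import AssemblyKeys._

    assemblySettings

    name := "FirstScala"

    version := "1.0"

    scalaVersion := "2.10.4"

    // marking spark-core "provided" keeps it out of the uber jar,
    // since the cluster already ships Spark on the classpath
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1" % "provided"

Running sbt assembly then produces the uber jar under target/scala-2.10/.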

Re: how to change /tmp folder for spark ut use sbt

2015-06-19 Thread Akhil Das
You can try setting these properties: .set("spark.local.dir", "/mnt/spark/") .set("java.io.tmpdir", "/mnt/spark/") Thanks Best Regards On Fri, Jun 19, 2015 at 8:28 AM, yuemeng (A) yueme...@huawei.com wrote: hi all, if I want to change the /tmp folder to any other folder for spark ut use
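
A small sketch of the same settings in context (app name hypothetical; note that java.io.tmpdir is a JVM system property, so on executors it may need to be passed as -Djava.io.tmpdir=... via spark.executor.extraJavaOptions instead):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("tmpdir-example")
      .set("spark.local.dir", "/mnt/spark/")   // scratch space for shuffle and spill files
      .set("java.io.tmpdir", "/mnt/spark/")
    val sc = new SparkContext(conf)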

Re: N kafka topics vs N spark Streaming

2015-06-19 Thread Akhil Das
Like this? val add_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Array("add").toSet) val delete_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Array("delete").toSet) val
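
A fuller sketch, assuming ssc and kafkaParams (with at least "metadata.broker.list") are already defined; topic names are from the snippet:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // one direct stream per topic
    val add_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("add"))
    val delete_msgs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("delete"))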

Re: kafka spark streaming working example

2015-06-18 Thread Akhil Das
.setMaster("local") - set it to "local[2]" or "local[*]" Thanks Best Regards On Thu, Jun 18, 2015 at 5:59 PM, Bartek Radziszewski bar...@scalaric.com wrote: hi, I'm trying to run a simple kafka spark streaming example over spark-shell: sc.stop import org.apache.spark.SparkConf import
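
The point is that a receiver occupies one thread, so "local" (a single thread) leaves nothing for processing. A minimal sketch (app name hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // "local[2]" gives one thread to the receiver and one to processing;
    // "local[*]" uses all available cores
    val conf = new SparkConf().setAppName("kafka-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))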

Re: connect mobile app with Spark backend

2015-06-18 Thread Akhil Das
Why not something like this: your mobile app pushes data to your webserver, which pushes the data to Kafka or Cassandra or any other database, and a Spark Streaming job runs all the time operating on the incoming data and pushing the calculated values back. This way, you don't have to start a spark

Re: understanding on the waiting batches and scheduling delay in Streaming UI

2015-06-18 Thread Akhil Das
Which version of Spark, and what is your data source? For some reason, your processing delay is exceeding the batch duration. And it's strange that you are not seeing any scheduling delay. Thanks Best Regards On Thu, Jun 18, 2015 at 7:29 AM, Mike Fang chyfan...@gmail.com wrote: Hi, I have a

Re: Machine Learning on GraphX

2015-06-18 Thread Akhil Das
This might give you a good start http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html it's a bit old though. Thanks Best Regards On Thu, Jun 18, 2015 at 2:33 PM, texol t.rebo...@gmail.com wrote: Hi, I'm new to GraphX and I'd like to use Machine Learning

Re: Web UI vs History Server Bugs

2015-06-18 Thread Akhil Das
You could possibly open up a JIRA and shoot an email to the dev list. Thanks Best Regards On Wed, Jun 17, 2015 at 11:40 PM, jcai jonathon@yale.edu wrote: Hi, I am running this on Spark stand-alone mode. I find that when I examine the web UI, a couple bugs arise: 1. There is a

Re: Shuffle produces one huge partition

2015-06-17 Thread Akhil Das
Can you try repartitioning the rdd after creating the K,V pairs? And also, while calling rdd1.join(rdd2, ...), pass the number-of-partitions argument too. Thanks Best Regards On Wed, Jun 17, 2015 at 12:15 PM, Al M alasdair.mcbr...@gmail.com wrote: I have 2 RDDs I want to Join. We will call them RDD A and RDD

Re: ClassNotFound exception from closure

2015-06-17 Thread Akhil Das
Not sure why spark-submit isn't shipping your project jar (maybe try with --jars). You can also do sc.addJar("/path/to/your/project.jar"); it should solve it. Thanks Best Regards On Wed, Jun 17, 2015 at 6:37 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, running into a pretty

Re: Spark History Server pointing to S3

2015-06-16 Thread Akhil Das
Not quite sure, but try pointing spark.history.fs.logDirectory to your S3 Thanks Best Regards On Tue, Jun 16, 2015 at 6:26 PM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: In the Spark website it's stated in the View After the Fact section (

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-16 Thread Akhil Das
, thanks again! -- From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, June 15, 2015 3:48 PM To: Haopu Wang Cc: user Subject: Re: If not stop StreamingContext gracefully, will checkpoint data be consistent? I think it should be fine

Re: tasks won't run on mesos when using fine grained

2015-06-16 Thread Akhil Das
Did you look inside all logs? Mesos logs and executor logs? Thanks Best Regards On Mon, Jun 15, 2015 at 7:09 PM, Gary Ogden gog...@gmail.com wrote: My Mesos cluster has 1.5 CPU and 17GB free. If I set: conf.set("spark.mesos.coarse", "true"); conf.set("spark.cores.max", "1"); in the SparkConf

Re: Optimizing Streaming from Websphere MQ

2015-06-16 Thread Akhil Das
am not experiencing any performance benefit from it. Is it something related to the bottleneck of MQ or the Reliable Receiver? From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Saturday, June 13, 2015 1:10 AM To: Chaudhary, Umesh Cc: user@spark.apache.org Subject: Re

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning. Thanks Best Regards On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote: Thanks very much, Akhil. That solved my problem. Best, Rex On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das ak

Re: settings from props file seem to be ignored in mesos

2015-06-16 Thread Akhil Das
What's in your executor's (that .tgz file) conf/spark-defaults.conf file? Thanks Best Regards On Mon, Jun 15, 2015 at 7:14 PM, Gary Ogden gog...@gmail.com wrote: I'm loading these settings from a properties file: spark.executor.memory=256M spark.cores.max=1 spark.shuffle.consolidateFiles=true

Re: About HostName display in SparkUI

2015-06-15 Thread Akhil Das
In the conf/slaves file, do you have the ip addresses or the hostnames? Thanks Best Regards On Sat, Jun 13, 2015 at 9:51 PM, Sea 261810...@qq.com wrote: In spark 1.4.0, I find that the Address is an ip (it was a hostname in v1.3.0), why? who did it?

Re: How to set up a Spark Client node?

2015-06-15 Thread Akhil Das
I'm assuming by spark-client you mean the spark driver program. In that case you can pick any machine (say Node 7), create your driver program on it and use spark-submit to submit it to the cluster; or if you create the SparkContext within your driver program (specifying all the properties) then

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this? val huge_data = sc.textFile("/path/to/first.csv").map(x => (x.split("\t")(1), x.split("\t")(0))) val gender_data = sc.textFile("/path/to/second.csv").map(x => (x.split("\t")(0), x)) val joined_data = huge_data.join(gender_data) joined_data.take(1000) It's Scala btw, the python api should

Re: Spark DataFrame Reduce Job Took 40s for 6000 Rows

2015-06-15 Thread Akhil Das
Have a look here https://spark.apache.org/docs/latest/tuning.html Thanks Best Regards On Mon, Jun 15, 2015 at 11:27 AM, Proust GZ Feng pf...@cn.ibm.com wrote: Hi, Spark Experts I have played with Spark for several weeks; after some testing, a reduce operation on a DataFrame costs 40s on a

Re: If not stop StreamingContext gracefully, will checkpoint data be consistent?

2015-06-15 Thread Akhil Das
I think it should be fine, that's the whole point of check-pointing (in case of driver failure etc). Thanks Best Regards On Mon, Jun 15, 2015 at 6:54 AM, Haopu Wang hw...@qilinsoft.com wrote: Hi, can someone help to confirm the behavior? Thank you! -Original Message- From: Haopu

Re: Reliable SQS Receiver for Spark Streaming

2015-06-13 Thread Akhil Das
Yes, if you have enabled WAL and checkpointing, then after the store you can simply delete the SQS messages from your receiver. Thanks Best Regards On Sat, Jun 13, 2015 at 6:14 AM, Michal Čizmazia mici...@gmail.com wrote: I would like to have a Spark Streaming SQS Receiver which deletes SQS
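
A hypothetical sketch of such a receiver (class and thread names invented; assumes the AWS SDK v1 on the classpath and spark.streaming.receiver.writeAheadLog.enable=true, so store(ArrayBuffer) only returns once the data is reliably saved):

    import scala.collection.JavaConverters._
    import scala.collection.mutable.ArrayBuffer
    import com.amazonaws.services.sqs.AmazonSQSClient
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class SQSReceiver(queueUrl: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

      def onStart(): Unit = {
        new Thread("sqs-receiver") {
          override def run(): Unit = {
            val sqs = new AmazonSQSClient() // credentials from the default provider chain
            while (!isStopped()) {
              val msgs = sqs.receiveMessage(new ReceiveMessageRequest(queueUrl)).getMessages.asScala
              for (m <- msgs) {
                store(ArrayBuffer(m.getBody))                    // blocks until stored (WAL)
                sqs.deleteMessage(queueUrl, m.getReceiptHandle)  // safe to delete only now
              }
            }
          }
        }.start()
      }

      def onStop(): Unit = {}
    }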

Re: Contribution

2015-06-13 Thread Akhil Das
This is a good start, if you haven't seen this already https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Thanks Best Regards On Sat, Jun 13, 2015 at 8:46 AM, srinivasraghavansr71 sreenivas.raghav...@gmail.com wrote: Hi everyone, I am interested to

Re: How to split log data into different files according to severity

2015-06-13 Thread Akhil Das
Are you looking for something like filter? See a similar example here https://spark.apache.org/examples.html Thanks Best Regards On Sat, Jun 13, 2015 at 3:11 PM, Hao Wang bill...@gmail.com wrote: Hi, I have a bunch of large log files on Hadoop. Each line contains a log and its severity. Is
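
A minimal sketch of the filter approach, assuming each line starts with its severity and the severities are known up front (paths hypothetical):

    val logs = sc.textFile("hdfs:///logs/input")
    logs.cache()                                  // reused once per severity
    for (sev <- Seq("INFO", "WARN", "ERROR")) {
      logs.filter(_.startsWith(sev))
          .saveAsTextFile(s"hdfs:///logs/by-severity/$sev")
    }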

Re: Are there ways to restrict what parameters users can set for a Spark job?

2015-06-13 Thread Akhil Das
I think the straight answer would be No, but yes you can actually hardcode these parameters if you want. Look in the SparkContext.scala https://github.com/apache/spark/blob/master/core%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2FSparkContext.scala#L364 where all these properties are being

Re: Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-12 Thread Akhil Das
Looks like your spark is not able to pick up the HADOOP_CONF. To fix this, you can actually add jets3t-0.9.0.jar to the classpath (sc.addJar("/path/to/jets3t-0.9.0.jar")). Thanks Best Regards On Thu, Jun 11, 2015 at 6:44 PM, shahab shahab.mok...@gmail.com wrote: Hi, I tried to read a csv file

Re: spark stream and spark sql with data warehouse

2015-06-12 Thread Akhil Das
This is a good start, if you haven't read it already http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations Thanks Best Regards On Thu, Jun 11, 2015 at 8:17 PM, 唐思成 jadetan...@qq.com wrote: Hi all: We are trying to use spark to do some real

Re: Limit Spark Shuffle Disk Usage

2015-06-12 Thread Akhil Das
You can disable shuffle spill (spark.shuffle.spill http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior) if you have enough memory to hold that much data. I believe adding more resources would be your only choice. Thanks Best Regards On Thu, Jun 11, 2015 at 9:46 PM, Al M

Re: --jars not working?

2015-06-12 Thread Akhil Das
You can verify if the jars are shipped properly by looking at the driver UI (running on 4040) Environment tab. Thanks Best Regards On Sat, Jun 13, 2015 at 12:43 AM, Jonathan Coveney jcove...@gmail.com wrote: Spark version is 1.3.0 (will upgrade as soon as we upgrade past mesos 0.19.0)...

Re: Optimizing Streaming from Websphere MQ

2015-06-12 Thread Akhil Das
How many cores are you allocating for your job? And how many receivers do you have? It would be good if you can post your custom receiver code; it will help people understand it better and shed some light. Thanks Best Regards On Fri, Jun 12, 2015 at 12:58 PM, Chaudhary, Umesh

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
4040 is your driver port, you need to run some application. Log in to your cluster, start a spark-shell and try accessing 4040. Thanks Best Regards On Wed, Jun 10, 2015 at 3:51 PM, mrm ma...@skimlinks.com wrote: Hi, I am using Spark 1.3.1 standalone and I have a problem where my cluster is

Re: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Akhil Das
RDDs are immutable; why not join two DStreams? Not sure, but you can try something like this also: kvDstream.foreachRDD(rdd => { val file = ssc.sparkContext.textFile("/sigmoid/") val kvFile = file.map(x => (x.split(",")(0), x)) rdd.join(kvFile) }) Thanks Best Regards On

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
Opening port 4040 manually or ssh tunneling (ssh -L 4040:127.0.0.1:4040 master-ip, and then open localhost:4040 in a browser) will work for you then. Thanks Best Regards On Wed, Jun 10, 2015 at 5:10 PM, mrm ma...@skimlinks.com wrote: Hi Akhil, Thanks for your reply! I still cannot see port

Re: About akka used in spark

2015-06-10 Thread Akhil Das
If you look at the maven repo, you can see it's from Typesafe only http://mvnrepository.com/artifact/org.spark-project.akka/akka-actor_2.10/2.3.4-spark For sbt, you can download the sources by adding withSources() like: libraryDependencies += "org.spark-project.akka" % "akka-actor_2.10" % "2.3.4-spark"

Re: Spark's Scala shell killing itself

2015-06-10 Thread Akhil Das
Maybe you should update your Spark version to the latest one. Thanks Best Regards On Wed, Jun 10, 2015 at 11:04 AM, Chandrashekhar Kotekar shekhar.kote...@gmail.com wrote: Hi, I have configured Spark to run on YARN. Whenever I start the spark shell using the 'spark-shell' command, it

Re: How to use Apache spark mllib Model output in C++ component

2015-06-10 Thread Akhil Das
Swig (http://www.swig.org/index.php) and JNA (https://github.com/twall/jna/) might help for accessing C++ libraries from Java. Thanks Best Regards On Wed, Jun 10, 2015 at 11:50 AM, mahesht mahesh.s.tup...@gmail.com wrote: There is a C++ component which uses some model which we want to replace

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-10 Thread Akhil Das
standalone mode. Any ideas? Thanks Dong Lei From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, June 9, 2015 4:46 PM To: Dong Lei Cc: user@spark.apache.org Subject: Re: ClassNotDefException when using spark-submit with multiple jars and files located

Re: spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Akhil Das
Delete the checkpoint directory, you might have modified your driver program. Thanks Best Regards On Wed, Jun 10, 2015 at 9:44 PM, Ashish Nigam ashnigamt...@gmail.com wrote: Hi, If checkpoint data is already present in HDFS, driver fails to load as it is performing lookup on previous

Re: Can't access Ganglia on EC2 Spark cluster

2015-06-10 Thread Akhil Das
Looks like libphp version is 5.6 now, which version of spark are you using? Thanks Best Regards On Thu, Jun 11, 2015 at 3:46 AM, barmaley o...@solver.com wrote: Launching using spark-ec2 script results in: Setting up ganglia RSYNC'ing /etc/ganglia to slaves... ... Shutting down GANGLIA

Re: Spark standalone mode and kerberized cluster

2015-06-10 Thread Akhil Das
This might help http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_installing-kerb-spark-quickstart.html Thanks Best Regards On Wed, Jun 10, 2015 at 6:49 PM, kazeborja kazebo...@gmail.com wrote: Hello all. I've been reading some old mails and

Re: Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
+ "/pgs/sample/samplechr" + i + ".txt /opt/data/shellcompare/data/user" + j + "/pgs/intermediateResult/result" + i + ".txt 600") pipeModify2.collect() sc.stop() } } Thanks & Best regards! San.Luo ----- Original Message ----- From: Akhil Das ak

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread Akhil Das
? On Tue, Jun 9, 2015 at 1:07 PM, amit tewari amittewar...@gmail.com wrote: Thanks Akhil, as you suggested, I have to go the keyBy() route as I need the columns intact. But will keyBy() accept multiple fields (e.g. x(0), x(1))? Thanks Amit On Tue, Jun 9, 2015 at 12:26 PM, Akhil Das ak

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
downloaded the application jar but not the other jars specified by "--jars". Or do I misunderstand the usage of "--jars", and the jars should already be on every worker, so the driver will not download them? Are there some useful docs? Thanks Dong Lei From: Akhil Das [mailto:ak

Re: Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
Regards On Tue, Jun 9, 2015 at 2:09 PM, luohui20...@sina.com wrote: Only 1 minor GC, 0.07s. Thanks & Best regards! San.Luo ----- Original Message ----- From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: How

Re: Spark error value join is not a member of org.apache.spark.rdd.RDD[((String, String), String, String)]

2015-06-09 Thread Akhil Das
Try this way: scala> val input1 = sc.textFile("/test7").map(line => line.split(",").map(_.trim)); scala> val input2 = sc.textFile("/test8").map(line => line.split(",").map(_.trim)); scala> val input11 = input1.map(x => ((x(0) + x(1)), x(2), x(3))) scala> val input22 = input2.map(x => ((x(0) + x(1)), x(2), x(3))) scala

Re: Saving compressed textFiles from a DStream in Scala

2015-06-09 Thread Akhil Das
Like this? myDStream.foreachRDD(rdd => rdd.saveAsTextFile("/sigmoid/", codec)) Thanks Best Regards On Mon, Jun 8, 2015 at 8:06 PM, Bob Corsaro rcors...@gmail.com wrote: It looks like saveAsTextFiles doesn't support the compression parameter of RDD.saveAsTextFile. Is there a way to add the
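
For example, with the Hadoop GzipCodec (output path hypothetical; a per-batch suffix avoids "directory already exists" errors on the second batch):

    import org.apache.hadoop.io.compress.GzipCodec

    myDStream.foreachRDD { rdd =>
      // the second argument of saveAsTextFile is the compression codec class
      rdd.saveAsTextFile(s"/sigmoid/output-${System.currentTimeMillis}", classOf[GzipCodec])
    }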

Re: How to decrease the time of storing block in memory

2015-06-09 Thread Akhil Das
Maybe you should check your driver UI and see if there's any GC time involved etc. Thanks Best Regards On Mon, Jun 8, 2015 at 5:45 PM, luohui20...@sina.com wrote: hi there I am trying to decrease my app's running time on the worker node. I checked the log and found the most

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-09 Thread Akhil Das
Once you submit the application, you can check the driver UI (running on port 4040) Environment tab to see whether those jars you added got shipped or not. If they are shipped and you are still getting NoClassDef exceptions then it means that you have a jar conflict, which you can resolve

Re: Driver crash at the end with InvocationTargetException when running SparkPi

2015-06-08 Thread Akhil Das
Can you look in your worker logs for a more detailed stack-trace? If it's about winutils.exe you can look at these links to get it resolved. - http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7 - https://issues.apache.org/jira/browse/SPARK-2356 Thanks Best Regards On Mon, Jun 8,

Re: spark ssh to slave

2015-06-08 Thread Akhil Das
it just lets me straight in. On Mon, Jun 8, 2015 at 11:58 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you do ssh -v 192.168.1.16 from the Master machine and make sure it's able to log in without a password? Thanks Best Regards On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin

Re: spark ssh to slave

2015-06-08 Thread Akhil Das
Can you do ssh -v 192.168.1.16 from the Master machine and make sure it's able to log in without a password? Thanks Best Regards On Mon, Jun 8, 2015 at 2:51 PM, James King jakwebin...@gmail.com wrote: I have two hosts 192.168.1.15 (Master) and 192.168.1.16 (Worker) These two hosts have

Re: Scheduler question: stages with non-arithmetic numbering

2015-06-07 Thread Akhil Das
Are you seeing the same behavior on the driver UI (the one running on port 4040)? If you click on the stage id header you can sort the stages based on IDs. Thanks Best Regards On Fri, Jun 5, 2015 at 10:21 PM, Mike Hynes 91m...@gmail.com wrote: Hi folks, When I look at the output logs for an

Re: Monitoring Spark Jobs

2015-06-07 Thread Akhil Das
It could be a CPU, IO, or network bottleneck; you need to figure out where exactly it's choking. You can use certain monitoring utilities (like top) to understand it better. Thanks Best Regards On Sun, Jun 7, 2015 at 4:07 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi All, I have a Spark

Re: Accumulator map

2015-06-07 Thread Akhil Das
Another approach would be to use ZooKeeper. If you have ZooKeeper running somewhere in the cluster you can simply create a path like /dynamic-list in it and then write objects/values to it; you can even create/access nested objects. Thanks Best Regards On Fri, Jun 5, 2015 at 7:06 PM,

Re: Spark Streaming Stuck After 10mins Issue...

2015-06-07 Thread Akhil Das
Which consumer are you using? If you can paste the complete code then maybe I can try reproducing it. Thanks Best Regards On Sun, Jun 7, 2015 at 1:53 AM, EH eas...@gmail.com wrote: And here is the Thread Dump, where it seems every worker is waiting for Executor #6 Thread 95:

Re: Setting S3 output file grantees for spark output files

2015-06-05 Thread Akhil Das
You could try adding the configuration in the spark-defaults.conf file. And once you run the application you can actually check on the driver UI (runs on 4040) Environment tab to see if the configuration is set properly. Thanks Best Regards On Thu, Jun 4, 2015 at 8:40 PM, Justin Steigel

Re: Saving calculation to single local file

2015-06-05 Thread Akhil Das
You can simply do rdd.repartition(1).saveAsTextFile(...), but it might not be efficient if your output data is huge since one task will be doing the whole writing. Thanks Best Regards On Fri, Jun 5, 2015 at 3:46 PM, marcos rebelo ole...@gmail.com wrote: Hi all I'm running spark in a single local

Re: StreamingListener, anyone?

2015-06-04 Thread Akhil Das
Hi Here's a working example: https://gist.github.com/akhld/b10dc491aad1a2007183 Thanks Best Regards On Wed, Jun 3, 2015 at 10:09 PM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, I've got a Spark Streaming driver job implemented and in it, I register a streaming

Re: Python Image Library and Spark

2015-06-04 Thread Akhil Das
Replace this line: img_data = sc.parallelize( list(im.getdata()) ) with: img_data = sc.parallelize( list(im.getdata()), 3 * <number of cores you have> ) Thanks Best Regards On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur jmspar...@gmail.com wrote: Hi all, I'm playing around with

Re: Adding new Spark workers on AWS EC2 - access error

2015-06-04 Thread Akhil Das
That's because you need to add the master's public key (~/.ssh/id_rsa.pub) to the newly added slaves' ~/.ssh/authorized_keys. I add slaves this way: - Launch a new instance by clicking on the slave instance and choosing 'launch more like this' - Once it's launched, ssh into it and add the master

Re: Spark Client

2015-06-03 Thread Akhil Das
is, Is there an alternate api through which a spark application can be launched which can return an exit status back to the caller as opposed to initiating a JVM halt. On Wed, Jun 3, 2015 at 12:58 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Run it as a standalone application. Create an sbt project

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
You need to look into your executor/worker logs to see what's going on. Thanks Best Regards On Wed, Jun 3, 2015 at 12:01 PM, patcharee patcharee.thong...@uni.no wrote: Hi, What can be the cause of this ERROR cluster.YarnScheduler: Lost executor? How can I fix it? Best, Patcharee

Re: Scripting with groovy

2015-06-03 Thread Akhil Das
I think when you do ssc.stop it will stop your entire application; by 'update a transformation function' do you mean modifying the driver program? In that case, even if you checkpoint your application, it won't be able to recover from its previous state. A simpler approach would be to add certain

Re: Spark Client

2015-06-03 Thread Akhil Das
Run it as a standalone application. Create an sbt project and do sbt run? Thanks Best Regards On Wed, Jun 3, 2015 at 11:36 AM, pavan kumar Kolamuri pavan.kolam...@gmail.com wrote: Hi guys, I am new to spark. I am using spark-submit to submit spark jobs. But for my use case I don't want it

Re: ERROR cluster.YarnScheduler: Lost executor

2015-06-03 Thread Akhil Das
) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ... 1 more Best, Patcharee On 03. juni 2015 09:21, Akhil Das wrote: You need to look into your executor/worker logs to see whats going on. Thanks Best Regards On Wed, Jun 3, 2015 at 12:01 PM, patcharee

Re: using pyspark with standalone cluster

2015-06-02 Thread Akhil Das
If you want to submit applications to a remote cluster where your port 7077 is opened publicly, then you would need to set spark.driver.host (with the public ip of your laptop) and spark.driver.port (optional, if there's no firewall between your laptop and the remote cluster). Keeping

Re: flatMap output on disk / flatMap memory overhead

2015-06-02 Thread Akhil Das
You could try rdd.persist(MEMORY_AND_DISK or DISK_ONLY).flatMap(...). StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in memory, and if there isn't sufficient space then it will be spilled to disk. Thanks Best Regards On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea
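
A short sketch of the idea (the flatMap body is a placeholder):

    import org.apache.spark.storage.StorageLevel

    // persist before the expensive flatMap so blocks that don't fit in memory
    // spill to disk instead of being recomputed or blowing the heap
    val persisted = rdd.persist(StorageLevel.MEMORY_AND_DISK)   // or StorageLevel.DISK_ONLY
    val expanded  = persisted.flatMap(x => Seq(x, x))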

Re: HDFS Rest Service not available

2015-06-02 Thread Akhil Das
It says your namenode is down (connection refused on 8020); you can restart your HDFS by going into the hadoop directory and typing sbin/stop-dfs.sh and then sbin/start-dfs.sh Thanks Best Regards On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote: Hello All, A bit scared I did

Re: What is shuffle read and what is shuffle write ?

2015-06-02 Thread Akhil Das
I found an interesting presentation http://www.slideshare.net/colorant/spark-shuffle-introduction and also go through this thread http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html Thanks Best Regards On Tue, Jun 2, 2015 at 3:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)

Re: Spark 1.3.1 bundle does not build - unresolved dependency

2015-06-02 Thread Akhil Das
You can try to skip the tests, try with: mvn -Dhadoop.version=2.4.0 -Pyarn -DskipTests clean package Thanks Best Regards On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch java...@gmail.com wrote: I downloaded the 1.3.1 distro tarball $ll ../spark-1.3.1.tar.gz -rw-r-@ 1 steve staff

Re: Shared / NFS filesystems

2015-06-02 Thread Akhil Das
You can run/submit your code from one of the workers which has access to the file system, and it should be fine I think. Give it a try. Thanks Best Regards On Tue, Jun 2, 2015 at 3:22 PM, Pradyumna Achar pradyumna.ac...@gmail.com wrote: Hello! I have Spark running in standalone mode, and there

Re: How to read sequence File.

2015-06-02 Thread Akhil Das
Basically, you need to convert it to a serializable format before doing the collect/take. You can fire up a spark shell and paste this: val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid").map(_._2.toString) sFile.take(5).foreach(println) Use the
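
The same snippet with the imports spelled out, for pasting into a shell (path as in the mail):

    import org.apache.hadoop.io.{LongWritable, Text}

    val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid")
      .map(_._2.toString)   // Text isn't serializable, so convert before take/collect
    sFile.take(5).foreach(println)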

Re: SparkSQL can't read S3 path for hive external table

2015-06-01 Thread Akhil Das
This thread http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application has various methods on accessing S3 from spark, it might help you. Thanks Best Regards On Sun, May 24, 2015 at 8:03 AM, ogoh oke...@gmail.com wrote: Hello, I am

Re: RDD boundaries and triggering processing using tags in the data

2015-06-01 Thread Akhil Das
Maybe you can make use of the Window operations https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#window-operations. Another approach would be to keep your incoming data in an HBase/Redis/Cassandra kind of database, and then whenever you need to average it you just query the

Re: Cassanda example

2015-06-01 Thread Akhil Das
Here's more detailed documentation https://github.com/datastax/spark-cassandra-connector from DataStax. You can also shoot an email directly to their mailing list http://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user since it's more related to their code. Thanks Best
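
With the connector on the classpath, reading a table is a one-liner; a sketch with hypothetical keyspace and table names:

    import com.datastax.spark.connector._

    val rows = sc.cassandraTable("test_keyspace", "users")  // RDD[CassandraRow]
    rows.take(5).foreach(println)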

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
I do it this way: - Launch a new instance by clicking on the slave instance and choosing 'launch more like this' - Once it's launched, ssh into it and add the master public key to .ssh/authorized_keys - Add the slave's internal IP to the master's conf/slaves file - do sbin/start-all.sh and it will

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
MEMORY_AND_DISK_SER_2. Besides, the driver's memory is also growing. I don't think Kafka messages will be cached in driver. On Thu, May 28, 2015 at 12:24 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Are you using the createStream or createDirectStream api? If its the former, you can try setting

Re: Adding slaves on spark standalone on ec2

2015-05-28 Thread Akhil Das
aggressive? Let's say I have 20 slaves up, and I want to add one more, why should we stop the entire cluster for this? thanks, nizan On Thu, May 28, 2015 at 10:19 AM, Akhil Das ak...@sigmoidanalytics.com wrote: I do this way: - Launch a new instance by clicking on the slave instance

Re: Get all servers in security group in bash(ec2)

2015-05-28 Thread Akhil Das
You can use the Python boto library for that; in fact the spark-ec2 script uses it underneath. Here's the call https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L706 that spark-ec2 makes to get all machines under a given security group. Thanks Best Regards On Thu, May 28, 2015 at 2:22 PM,

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-28 Thread Akhil Das
=> rdd.foreachPartition { iter => iter.foreach(count => logger.info(count.toString)) } } It receives messages from Kafka, parses the json, filters and counts the records, and then prints to logs. Thanks. On Thu, May 28, 2015 at 3:07 PM, Akhil Das ak

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-27 Thread Akhil Das
After submitting the job, if you do a ps aux | grep spark-submit you can see all the JVM params. Are you using the high-level consumer (receiver based) for receiving data from Kafka? In that case, if your throughput is high and the processing delay exceeds the batch interval then you will hit this

Re: How to give multiple directories as input ?

2015-05-27 Thread Akhil Das
How about creating two RDDs and unioning [ sc.union(first, second) ] them? Thanks Best Regards On Wed, May 27, 2015 at 11:51 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have this piece sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](
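
A minimal sketch (paths hypothetical); for plain text input, sc.textFile also accepts a comma-separated list of paths, which avoids the explicit union:

    val first  = sc.textFile("/data/dir1")
    val second = sc.textFile("/data/dir2")
    val both   = sc.union(first, second)
    // equivalent shortcut for textFile: sc.textFile("/data/dir1,/data/dir2")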

Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-26 Thread Akhil Das
files like result1.txt,result2.txt...result21.txt. Sorry for not adding some comments to my code. Thanks & Best regards! San.Luo ----- Original Message ----- From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉 luohui20...@sina.com Cc: user user@spark.apache.org Subject: Re: Re

Re: Remove COMPLETED applications and shuffle data

2015-05-26 Thread Akhil Das
Try these: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size") .set("spark.executor.logs.rolling.size.maxBytes", "1024") .set("spark.executor.logs.rolling.maxRetainedFiles", "3") You can also look into

Re: Re: Re: how to distributed run a bash shell in spark

2015-05-25 Thread Akhil Das
Best regards! San.Luo ----- Original Message ----- From: madhu phatak phatak@gmail.com To: luohui20...@sina.com Cc: Akhil Das ak...@sigmoidanalytics.com, user user@spark.apache.org Subject: Re: Re: how to distributed run a bash shell in spark Date: 2015-05-25 14:11 Hi, You can use the pipe operator, if you

Re: Using Log4j for logging messages inside lambda functions

2015-05-25 Thread Akhil Das
Try this way: object Holder extends Serializable { @transient lazy val log = Logger.getLogger(getClass.getName) } val someRdd = spark.parallelize(List(1, 2, 3)) someRdd.map { element => Holder.log.info(s"$element will be processed") element + 1

Re: How to use zookeeper in Spark Streaming

2015-05-25 Thread Akhil Das
If you want to be notified after every batch is completed, then you can simply implement the StreamingListener https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener interface, which has methods like onBatchCompleted, onBatchStarted, etc., in which
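
A sketch of a listener that fires after each batch (the println stands in for whatever notification, e.g. a ZooKeeper write, you need; assumes ssc: StreamingContext):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class BatchNotifier extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        // called once per completed batch
        println(s"batch done, processing took ${batch.batchInfo.processingDelay.getOrElse(-1L)} ms")
      }
    }

    ssc.addStreamingListener(new BatchNotifier)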

Re: IPv6 support

2015-05-25 Thread Akhil Das
Hi Kevin, Did you try adding a host name for the ipv6? I have a few ipv6 boxes; spark failed for me when I used just the ipv6 addresses, but it works fine when I use the host names. Here's an entry in my /etc/hosts: 2607:5300:0100:0200::::0a4d hacked.work My spark-env.sh file:

Re: how to distributed run a bash shell in spark

2015-05-24 Thread Akhil Das
You mean you want to execute some shell commands from Spark? Here's something I tried a while back: https://github.com/akhld/spark-exploit Thanks Best Regards On Sun, May 24, 2015 at 4:53 PM, luohui20...@sina.com wrote: hello there I am trying to run an app in which part of it needs to
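
Besides spawning processes yourself, RDD.pipe streams each partition's elements through an external command (which must exist on every worker); a tiny sketch:

    val lines = sc.parallelize(Seq("foo", "bar", "foobar"))
    // each element goes to the command's stdin; its stdout becomes the new RDD
    val upper = lines.pipe("tr a-z A-Z")
    upper.collect().foreach(println)   // FOO, BAR, FOOBAR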

Re: Trying to connect to many topics with several DirectConnect

2015-05-24 Thread Akhil Das
I used to hit an NPE when I didn't add all the dependency jars to my context while running it in standalone mode. Can you try adding all these dependencies to your context? sc.addJar("/home/akhld/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/spark-streaming-kafka_2.10-1.3.1.jar")

Re: Spark Memory management

2015-05-22 Thread Akhil Das
You can look at the logic for offloading data from memory by looking at the ensureFreeSpace call https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L416 and at dropFromMemory

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Akhil Das
without problem. Best Regards, Allan On 21 May 2015 at 01:30, Akhil Das ak...@sigmoidanalytics.com wrote: This is more like an issue with your HDFS setup, can you check in the datanode logs? Also try putting a new file in HDFS and see if that works. Thanks Best Regards On Wed, May 20

Re: How to set the file size for parquet Part

2015-05-21 Thread Akhil Das
How many part files do you have? Did you try repartitioning to a smaller number of partitions so that you get fewer, bigger files? Thanks Best Regards On Wed, May 20, 2015 at 3:06 AM, Richard Grossman richie...@gmail.com wrote: Hi I'm using spark 1.3.1 and now I can't set the size of

Re: Read multiple files from S3

2015-05-21 Thread Akhil Das
textFile does read all the files in a directory. We have modified the Spark Streaming code base to read nested files from S3; you can check this function
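
Note that textFile also accepts globs and comma-separated paths, which covers many nested-S3 layouts without custom code (bucket and layout hypothetical):

    val rdd = sc.textFile("s3n://my-bucket/logs/2015/05/*/*.log")
    // or an explicit list:
    // sc.textFile("s3n://my-bucket/a.log,s3n://my-bucket/b.log")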

Re: rdd.saveAsTextFile problem

2015-05-21 Thread Akhil Das
This thread is from a year back; can you please share what issue you are facing? Which version of spark are you using? What is your system environment? Exception stack-trace? Thanks Best Regards On Thu, May 21, 2015 at 12:19 PM, Keerthi keerthi.reddy1...@gmail.com wrote: Hi, I had tried

Re: Resource usage of a spark application

2015-05-21 Thread Akhil Das
Yes Peter that's correct, you need to identify the processes and with that you can pull the actual usage metrics. Thanks Best Regards On Thu, May 21, 2015 at 2:52 PM, Peter Prettenhofer peter.prettenho...@gmail.com wrote: Thanks Akhil, Ryan! @Akhil: YARN can only tell me how much vcores my
