Re: rdd.saveAsTextFile problem

2015-05-21 Thread Akhil Das
* and edit *Path* Variable to add *bin* directory of *HADOOP_HOME* (say *C:\hadoop\bin*) fixed this issue in my env 2015-05-21 9:55 GMT+03:00 Akhil Das ak...@sigmoidanalytics.com: This thread happened a year back, can you please share what issue you are facing? Which version of Spark are you using

Re: java program got Stuck at broadcasting

2015-05-21 Thread Akhil Das
Can you try commenting out the saveAsTextFile and doing a simple count()? If it's a broadcast issue, then it would throw up the same error. On 21 May 2015 14:21, allanjie allanmcgr...@gmail.com wrote: Sure, the code is very simple. I think u guys can understand from the main function. public class
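A minimal sketch of that debugging step, swapping the save for a count() so a broadcast problem can be told apart from an HDFS write problem (paths and the shape of the broadcast variable are hypothetical; the 468-element size comes from later in this thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-debug"))
    val lookup = sc.broadcast((1 to 468).map(_.toDouble))   // stand-in for the variable being broadcast
    val data   = sc.textFile("hdfs:///input/data.txt")
    val joined = data.map(line => (line, lookup.value.size))
    // joined.saveAsTextFile("hdfs:///output/result")       // original action, commented out
    println(joined.count())                                 // if this also hangs, the broadcast (not HDFS) is the suspect
    sc.stop()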

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
-packages.org/package/dibbhatt/kafka-spark-consumer Regards, Dibyendu On Tue, May 19, 2015 at 9:00 PM, Akhil Das ak...@sigmoidanalytics.com wrote: On Tue, May 19, 2015 at 8:10 PM, Shushant Arora shushantaror...@gmail.com wrote: So for Kafka+spark streaming, Receiver based streaming used

Re: Reading Binary files in Spark program

2015-05-20 Thread Akhil Das
of JavaPairRDD is as expected. It is when we are calling collect() or toArray() methods, the exception is coming. Something to do with Text class even though I haven't used it in the program. Regards Tapan On Tue, May 19, 2015 at 6:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Try something

Re: spark streaming doubt

2015-05-20 Thread Akhil Das
the rate with spark.streaming.kafka.maxRatePerPartition) Read more here https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md On Wed, May 20, 2015 at 12:36 PM, Akhil Das ak...@sigmoidanalytics.com wrote: One receiver basically runs on 1 core, so if your single node is having

Re: Reading Binary files in Spark program

2015-05-20 Thread Akhil Das
-files Regards Tapan On Wed, May 20, 2015 at 12:42 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If you can share the complete code and a sample file, maybe I can try to reproduce it on my end. Thanks Best Regards On Wed, May 20, 2015 at 7:00 AM, Tapan Sharma tapan.sha

Re: java program Get Stuck at broadcasting

2015-05-20 Thread Akhil Das
This is more like an issue with your HDFS setup, can you check in the datanode logs? Also try putting a new file in HDFS and see if that works. Thanks Best Regards On Wed, May 20, 2015 at 11:47 AM, allanjie allanmcgr...@gmail.com wrote: ​Hi All, The variable I need to broadcast is just 468

Re: Spark users

2015-05-20 Thread Akhil Das
Yes, this is the user group. Feel free to ask your questions in this list. Thanks Best Regards On Wed, May 20, 2015 at 5:58 AM, Ricardo Goncalves da Silva ricardog.si...@telefonica.com wrote: Hi I'm learning spark focused on data and machine learning. Migrating from SAS. There is a group

Re: TwitterUtils on Windows

2015-05-19 Thread Akhil Das
Hi Justin, Can you try with sbt, maybe that will help. - Install sbt for windows http://www.scala-sbt.org/0.13/tutorial/Installing-sbt-on-Windows.html - Create a lib directory in your project directory - Place these jars in it: - spark-streaming-twitter_2.10-1.3.1.jar -

Re: group by and distinct performance issue

2015-05-19 Thread Akhil Das
Hi Peer, If you open the driver UI (running on port 4040) you can see the stages and the tasks happening inside it. The best way to identify the bottleneck for a stage is to see if there's any time spent on GC, and how many tasks there are per stage (it should be a number >= the total # of cores to achieve

Re: org.apache.spark.shuffle.FetchFailedException :: Migration from Spark 1.2 to 1.3

2015-05-19 Thread Akhil Das
There were some similar discussion happened on JIRA https://issues.apache.org/jira/browse/SPARK-3633 may be that will give you some insights. Thanks Best Regards On Mon, May 18, 2015 at 10:49 PM, zia_kayani zia.kay...@platalytics.com wrote: Hi, I'm getting this exception after shifting my code

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
It will be a single job running at a time by default (you can also configure the spark.streaming.concurrentJobs to run jobs parallel which is not recommended to put in production). Now, your batch duration being 1 sec and processing time being 2 minutes, if you are using a receiver based

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
not be started at its desired interval. And Whats the difference and usage of Receiver vs non-receiver based streaming. Is there any documentation for that? On Tue, May 19, 2015 at 1:35 PM, Akhil Das ak...@sigmoidanalytics.com wrote: It will be a single job running at a time by default (you can also

Re: Reading Real Time Data only from Kafka

2015-05-19 Thread Akhil Das
in result. Deciding where to save offsets (or not) is up to you. You can checkpoint, or store them yourself. On Mon, May 18, 2015 at 12:00 PM, Akhil Das ak...@sigmoidanalytics.com wrote: I have played a bit with the directStream kafka api. Good work cody. These are my findings and also can you

Re: spark streaming doubt

2015-05-19 Thread Akhil Das
specify the number of receivers that you want to spawn for consuming the messages. On Tue, May 19, 2015 at 2:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: spark.streaming.concurrentJobs takes an integer value, not boolean. If you set it as 2 then 2 jobs will run in parallel. Default value
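For reference, a small sketch of where that property would go (app name and batch interval are placeholders; as noted in this thread, concurrent jobs are not recommended for production):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("concurrent-jobs-demo")
      .set("spark.streaming.concurrentJobs", "2")   // integer value; lets 2 streaming jobs run in parallel
    val ssc = new StreamingContext(conf, Seconds(1))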

Re: Reading Binary files in Spark program

2015-05-19 Thread Akhil Das
Try something like: JavaPairRDD<IntWritable, Text> output = sc.newAPIHadoopFile(inputDir, org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class, IntWritable.class, Text.class, new Job().getConfiguration()); With the type of input format that you require. Thanks Best
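A Scala rendering of the same call, as a sketch (the input path is hypothetical). Hadoop reuses Writable objects, so copying values out before collect()/toArray() avoids the Text-related surprises mentioned elsewhere in this thread:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("read-sequencefile"))
    val output = sc.newAPIHadoopFile(
      "hdfs:///data/seqfiles",
      classOf[SequenceFileInputFormat[IntWritable, Text]],
      classOf[IntWritable],
      classOf[Text],
      new Job().getConfiguration)

    // Copy the reused Writables into plain Scala values before collecting
    val rows = output.map { case (k, v) => (k.get(), v.toString) }
    rows.take(10).foreach(println)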

Re: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
assure you that at least as of Spark Streaming 1.2.0, as Evo says Spark Streaming DOES crash in an “unceremonious way” when the free RAM available for In-Memory Cached RDDs gets exhausted *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* Monday, May 18, 2015 2:03 PM *To:* Evo Eftimov

Re: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
of the condition is: Loss was due to java.lang.Exception java.lang.Exception: *Could not compute split, block* *input-4-1410542878200 not found* *From:* Evo Eftimov [mailto:evo.efti...@isecc.com] *Sent:* Monday, May 18, 2015 12:13 PM *To:* 'Dmitry Goldenberg'; 'Akhil Das' *Cc:* 'user@spark.apache.org

Re: Reading Real Time Data only from Kafka

2015-05-18 Thread Akhil Das
are processed is order ( and offsets commits in order ) .. etc .. So whoever use whichever consumer need to study pros and cons of both approach before taking a call .. Regards, Dibyendu On Tue, May 12, 2015 at 8:10 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Cody, I was just

RE: Spark Streaming and reducing latency

2015-05-18 Thread Akhil Das
Streaming does “NOT” crash UNCEREMONIOUSLY – please maintain responsible and objective communication and facts *From:* Akhil Das [mailto:ak...@sigmoidanalytics.com] *Sent:* Monday, May 18, 2015 2:28 PM *To:* Evo Eftimov *Cc:* Dmitry Goldenberg; user@spark.apache.org *Subject:* Re: Spark

Re: Spark streaming over a rest API

2015-05-18 Thread Akhil Das
Why not use sparkstreaming to do the computation and dump the result somewhere in a DB perhaps and take it from there? Thanks Best Regards On Mon, May 18, 2015 at 7:51 PM, juandasgandaras juandasganda...@gmail.com wrote: Hello, I would like to use spark streaming over a REST api to get

Re: number of executors

2015-05-17 Thread Akhil Das
Did you try --executor-cores param? While you submit the job, do a ps aux | grep spark-submit and see the exact command parameters. Thanks Best Regards On Sat, May 16, 2015 at 12:31 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi, I have a 5 nodes yarn cluster, I used spark-submit to submit

Re: Resource usage of a spark application

2015-05-17 Thread Akhil Das
You can either pull the high level information from your resource manager, or if you want more control/specific information you can write a script and pull the resource usage information from the OS. Something like this

Re: Forbidded : Error Code: 403

2015-05-17 Thread Akhil Das
I think you can try this way also: DataFrame df = sqlContext.load("s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro", "com.databricks.spark.avro"); Thanks Best Regards On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq donta...@gmail.com wrote: Thanks for the suggestion Steve. I'll try that out.
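The same idea in Scala, as a rough sketch (bucket, keys and file name are placeholders; it assumes an existing SparkContext sc and the spark-avro package on the classpath). Credentials with characters like '/' tend to break the in-URL form, in which case setting them on the Hadoop configuration is the usual fallback:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Alternative to embedding credentials in the URL:
    // sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "ACCESS-KEY")
    // sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "SECRET-KEY")

    val df = sqlContext.load(
      "s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro",
      "com.databricks.spark.avro")
    df.printSchema()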

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Akhil Das
Why not just trigger your batch job with that event? If you really need streaming, then you can create a custom receiver and make the receiver sleep till the event has happened. That will obviously run your streaming pipelines without having any data to process. Thanks Best Regards On Fri, May

Re: textFileStream Question

2015-05-17 Thread Akhil Das
With file timestamp, you can actually see the finding new files logic from here https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L172 Thanks Best Regards On Fri, May 15, 2015 at 2:25 AM, Vadim Bichutskiy

Re: Spark Streaming and reducing latency

2015-05-17 Thread Akhil Das
With receiver based streaming, you can actually specify spark.streaming.blockInterval which is the interval at which the receiver will fetch data from the source. Default value is 200ms and hence if your batch duration is 1 second, it will produce 5 blocks of data. And yes, with sparkstreaming
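A sketch of the knob being described (values are illustrative): with a 1-second batch and the default 200 ms block interval, each batch is cut into 5 blocks, and each block becomes one task:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("block-interval-demo")
      .set("spark.streaming.blockInterval", "200")   // milliseconds in Spark 1.x; 1000 ms batch / 200 ms blocks = 5 blocks per batch
    val ssc = new StreamingContext(conf, Seconds(1))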

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
What do you mean by not detected? Maybe you forgot to trigger some action on the stream to get it executed. Like: val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString) *list_join_action_stream.count().print()*

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-14 Thread Akhil Das
Did you happen to have a look at the spark job server? https://github.com/ooyala/spark-jobserver Someone wrote a python wrapper https://github.com/wangqiang8511/spark_job_manager around it, give it a try. Thanks Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW

Re: spark-streaming whit flume error

2015-05-14 Thread Akhil Das
Can you share the client code that you used to send the data? May be this discussion would give you some insights http://apache-avro.679487.n3.nabble.com/Avro-RPC-Python-to-Java-isn-t-working-for-me-td4027454.html Thanks Best Regards On Thu, May 14, 2015 at 8:44 AM, 鹰 980548...@qq.com wrote:

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
at 1:04 PM, lisendong lisend...@163.com wrote: I have an action on the DStream, because when I put a text file into the hdfs, it runs normally, but if I put a lz4 file, it does nothing. On 14 May 2015, at 3:32 PM, Akhil Das ak...@sigmoidanalytics.com wrote: What do you mean by not detected? Maybe you forgot

Re: Unsubscribe

2015-05-14 Thread Akhil Das
Have a look https://spark.apache.org/community.html Send an email to user-unsubscr...@spark.apache.org Thanks Best Regards On Thu, May 14, 2015 at 1:08 PM, Saurabh Agrawal saurabh.agra...@markit.com wrote: How do I unsubscribe from this mailing list please? Thanks!! Regards,

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
: LzoTextInputFormat where is this class? What is the maven dependency? On 14 May 2015, at 3:40 PM, Akhil Das ak...@sigmoidanalytics.com wrote: That's because you are using TextInputFormat I think, try with LzoTextInputFormat like: val list_join_action_stream = ssc.fileStream[LongWritable, Text

Re: force the kafka consumer process to different machines

2015-05-13 Thread Akhil Das
With this lowlevel Kafka API https://github.com/dibbhatt/kafka-spark-consumer/, you can actually specify how many receivers that you want to spawn and most of the time it spawns evenly, usually you can put a sleep just after creating the context for the executors to connect to the driver and then

Re: s3 vfs on Mesos Slaves

2015-05-13 Thread Akhil Das
Did you happen to have a look at this https://github.com/abashev/vfs-s3 Thanks Best Regards On Tue, May 12, 2015 at 11:33 PM, Stephen Carman scar...@coldlight.com wrote: We have a small mesos cluster and these slaves need to have a vfs setup on them so that the slaves can pull down the data

Re: How to speed up data ingestion with Spark

2015-05-12 Thread Akhil Das
This article http://www.virdata.com/tuning-spark/ gives you a pretty good start on the Spark streaming side. And this article https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines is for the kafka, it has nice explanation how message size and

Re: Getting Access is denied error while cloning Spark source using Eclipse

2015-05-12 Thread Akhil Das
May be you should check where exactly its throwing up permission denied (possibly trying to write to some directory). Also you can try manually cloning the git repo to a directory and then try opening that in eclipse. Thanks Best Regards On Tue, May 12, 2015 at 3:46 PM, Chandrashekhar Kotekar

Re: how to load some of the files in a dir and monitor new file in that dir in spark streaming without missing?

2015-05-12 Thread Akhil Das
I believe fileStream would pick up the new files (maybe you should increase the batch duration). You can see the implementation details for finding new files from here

Re: Spark and RabbitMQ

2015-05-12 Thread Akhil Das
I found two examples Java version https://github.com/deepakkashyap/Spark-Streaming-with-RabbitMQ-/blob/master/example/Spark_project/CustomReceiver.java, and Scala version. https://github.com/d1eg0/spark-streaming-toy Thanks Best Regards On Tue, May 12, 2015 at 2:31 AM, dgoldenberg

Re: Master HA

2015-05-12 Thread Akhil Das
Mesos has a HA option (of course it includes zookeeper) Thanks Best Regards On Tue, May 12, 2015 at 4:53 PM, James King jakwebin...@gmail.com wrote: I know that it is possible to use Zookeeper and File System (not for production use) to achieve HA. Are there any other options now or in the

Re: TwitterPopularTags Long Processing Delay

2015-05-12 Thread Akhil Das
Are you using checkpointing/WAL etc? If yes, then it could be blocking on disk IO. Thanks Best Regards On Mon, May 11, 2015 at 10:33 PM, Seyed Majid Zahedi zah...@cs.duke.edu wrote: Hi, I'm running TwitterPopularTags.scala on a single node. Everything works fine for a while (about 30min),

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
Yep, you can try this lowlevel Kafka receiver https://github.com/dibbhatt/kafka-spark-consumer. It's much more flexible/reliable than the one that comes with Spark. Thanks Best Regards On Tue, May 12, 2015 at 5:15 PM, James King jakwebin...@gmail.com wrote: What I want is if the driver dies for some

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
before that, only took it down to change code. http://tinypic.com/r/2e4vkht/8 Regarding flexibility, both of the apis available in spark will do what James needs, as I described. On Tue, May 12, 2015 at 8:55 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Cody, If you are so sure

Re: Reading Real Time Data only from Kafka

2015-05-12 Thread Akhil Das
wrote: Very nice! will try and let you know, thanks. On Tue, May 12, 2015 at 2:25 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Yep, you can try this lowlevel Kafka receiver https://github.com/dibbhatt/kafka-spark-consumer. Its much more flexible/reliable than the one comes with Spark

Re: Is it possible to set the akka specify properties (akka.extensions) in spark

2015-05-11 Thread Akhil Das
Try SparkConf.set("spark.akka.extensions", "Whatever"), underneath I think Spark won't ship properties which don't start with spark.* to the executors. Thanks Best Regards On Mon, May 11, 2015 at 8:33 AM, Terry Hole hujie.ea...@gmail.com wrote: Hi all, I'd like to monitor the akka using kamon,

Re: Cassandra number of Tasks

2015-05-11 Thread Akhil Das
Did you try repartitioning? You might end up with a lot of time spent on GC though. Thanks Best Regards On Fri, May 8, 2015 at 11:59 PM, Vijay Pawnarkar vijaypawnar...@gmail.com wrote: I am using the Spark Cassandra connector to work with a table with 3 million records. Using .where() API
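As a rough sketch of the repartitioning suggestion, assuming the Spark Cassandra connector and an existing SparkContext sc (keyspace, table, predicate and target partition count are all made up):

    import com.datastax.spark.connector._

    val rows = sc.cassandraTable("my_keyspace", "my_table")
      .where("bucket = ?", "some-value")
    val spread = rows.repartition(200)   // more, smaller tasks; the shuffle and extra GC are the trade-off
    println(spread.count())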

Re: EVent generation

2015-05-11 Thread Akhil Das
Have a look over here https://storm.apache.org/community.html Thanks Best Regards On Sun, May 10, 2015 at 3:21 PM, anshu shukla anshushuk...@gmail.com wrote: http://stackoverflow.com/questions/30149868/generate-events-tuples-using-csv-file-with-timestamps -- Thanks Regards, Anshu Shukla

Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread Akhil Das
Have a look at this SO http://stackoverflow.com/questions/24048729/how-to-read-input-from-s3-in-a-spark-streaming-ec2-cluster-application question, it has discussion on various ways of accessing S3. Thanks Best Regards On Fri, May 8, 2015 at 1:21 AM, in4maniac sa...@skimlinks.com wrote: Hi

Re: Master node memory usage question

2015-05-08 Thread Akhil Das
Whats your usecase and what are you trying to achieve? May be there's a better way of doing it. Thanks Best Regards On Fri, May 8, 2015 at 10:20 AM, Richard Alex Hofer rho...@andrew.cmu.edu wrote: Hi, I'm working on a project in Spark and am trying to understand what's going on. Right now to

Re: (no subject)

2015-05-08 Thread Akhil Das
Since it's loading 24 records, it could be that your CSV is corrupted? (Maybe the newline char isn't \n, but \r\n if it comes from a windows environment. You can check this with *cat -v yourcsvfile.csv | more*). Thanks Best Regards On Fri, May 8, 2015 at 11:23 AM, luohui20...@sina.com wrote:

Re: Getting data into Spark Streaming

2015-05-08 Thread Akhil Das
I don't think you can use rawSocketStream since the RSVP is from a web server and you will have to send a GET request first to initialize the communication. You are better off writing a custom receiver https://spark.apache.org/docs/latest/streaming-custom-receivers.html for your usecase. For a

Re: NoClassDefFoundError with Spark 1.3

2015-05-08 Thread Akhil Das
Looks like the jar you provided has some missing classes. Try this: scalaVersion := "2.10.4" libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.3.0", "org.apache.spark" %% "spark-sql" % "1.3.0" % "provided", "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided", "log4j" % "log4j" %
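The snippet above is cut off mid-list; a cleaned-up build.sbt along the same lines might look like this (the project name and log4j version are assumptions):

    name := "my-spark-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.3.0",
      "org.apache.spark" %% "spark-sql"   % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-mllib" % "1.3.0" % "provided",
      "log4j"             % "log4j"       % "1.2.17"
    )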

Re: Back-pressure for Spark Streaming

2015-05-08 Thread Akhil Das
We had a similar issue while working on one of our usecase where we were processing at a moderate throughput (around 500MB/S). When the processing time exceeds the batch duration, it started to throw up blocknotfound exceptions, i made a workaround for that issue and is explained over here

SparkStreaming Workaround for BlockNotFound Exceptions

2015-05-07 Thread Akhil Das
Hi With Spark streaming (all versions), when my processing delay (around 2-4 seconds) exceeds the batch duration (being 1 second) and on a decent scale/throughput (consuming around 100MB/s on 1+2 node standalone 15GB, 4 cores each) the job will start to throw block not found exceptions when the

Re: Troubling Logging w/Simple Example (spark-1.2.2-bin-hadoop2.4)...

2015-05-06 Thread Akhil Das
You have an issue with your cluster setup. Can you paste your conf/spark-env.sh and the conf/slaves files here? The reason why your job is running fine is because you set the master inside the job as local[*] which runs in local mode (not in standalone cluster mode). Thanks Best Regards On

Re: com.datastax.spark % spark-streaming_2.10 % 1.1.0 in my build.sbt ??

2015-05-06 Thread Akhil Das
I don't see a spark-streaming dependency at com.datastax.spark http://mvnrepository.com/artifact/com.datastax.spark, but it does have a kafka-streaming dependency though. Thanks Best Regards On Tue, May 5, 2015 at 12:42 AM, Eric Ho eric...@intel.com wrote: Can I specify this in my build file?

Re: Spark Mongodb connection

2015-05-06 Thread Akhil Das
Here's a complete example https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html Thanks Best Regards On Mon, May 4, 2015 at 12:57 PM, Yasemin Kaya godo...@gmail.com wrote: Hi! I am new at Spark and I want to begin Spark with simple wordCount example in Java. But I want to give

Re: java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-05 Thread Akhil Das
It could be filling up your /tmp directory. You need to set your spark.local.dir or you can also specify SPARK_WORKER_DIR to another location which has sufficient space. Thanks Best Regards On Mon, May 4, 2015 at 7:27 PM, shahab shahab.mok...@gmail.com wrote: Hi, I am getting No space left
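A small sketch of the first option (the directory is a placeholder; on a standalone cluster, SPARK_WORKER_DIR in conf/spark-env.sh serves a similar purpose for the worker side):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("local-dir-demo")
      .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")   // comma-separate several disks if you have them
    val sc = new SparkContext(conf)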

Re: Problem in Standalone Mode

2015-05-04 Thread Akhil Das
Can you paste the complete stacktrace? It looks like you are having a version incompatibility with hadoop. Thanks Best Regards On Sat, May 2, 2015 at 4:36 PM, drarse drarse.a...@gmail.com wrote: When I run my program with Spark-Submit everything is ok. But when I try to run in standalone mode I

Re: spark filestrea problem

2015-05-04 Thread Akhil Das
With filestream you can actually pass a filter parameter to avoid loading up .tmp file/directories. Also, when you move/rename a file, the file creation date doesn't change and hence spark won't detect them i believe. Thanks Best Regards On Sat, May 2, 2015 at 9:37 PM, Evo Eftimov

Re: Remoting warning when submitting to cluster

2015-05-04 Thread Akhil Das
Looks like a version incompatibility, just make sure you have the proper version of spark. Also look further in the stacktrace what is causing Futures timed out (it could be a network issue also if the ports aren't opened properly) Thanks Best Regards On Sat, May 2, 2015 at 12:04 AM,

Re: Hardware requirements

2015-05-04 Thread Akhil Das
500GB of data will have nearly 3900 partitions and if you can have nearly that many cores and around 500GB of memory then things will be lightning fast. :) Thanks Best Regards On Sun, May 3, 2015 at 12:49 PM, sherine ahmed sherine.sha...@hotmail.com wrote: I need to use spark to

Re: Hardware requirements

2015-05-04 Thread Akhil Das
and block sizes are the same, shouldn't we end up with 8k partitions? On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote: 500GB of data will have nearly 3900 partitions and if you can have nearly that many cores and around 500GB of memory then things will be lightning fast
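For what it's worth, the arithmetic behind both figures: the default partition count of an HDFS-backed RDD follows the input split (block) count, so with 128 MB blocks 500 GB / 128 MB ≈ 4,000 splits (the ~3,900 figure above), while with 64 MB blocks 500 GB / 64 MB ≈ 8,000 splits, which is where the 8k estimate comes from.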

Re: Exiting driver main() method...

2015-05-02 Thread Akhil Das
It used to exit without any problem for me. You can basically check in the driver UI (that runs on 4040) and see what exactly its doing. Thanks Best Regards On Fri, May 1, 2015 at 6:22 PM, James Carman ja...@carmanconsulting.com wrote: In all the examples, it seems that the spark application

Re: spark.logConf with log4j.rootCategory=WARN

2015-05-02 Thread Akhil Das
It could be. Thanks Best Regards On Fri, May 1, 2015 at 9:11 PM, roy rp...@njit.edu wrote: Hi, I have recently enable log4j.rootCategory=WARN, console in spark configuration. but after that spark.logConf=True has becomes ineffective. So just want to confirm if this is because

Re: Spark Streaming Kafka Avro NPE on deserialization of payload

2015-05-02 Thread Akhil Das
There was a similar discussion over here http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccakz4c0s_cuo90q2jxudvx9wc4fwu033kx3-fjujytxxhr7p...@mail.gmail.com%3E Thanks Best Regards On Fri, May 1, 2015 at 7:12 PM, Todd Nist tsind...@gmail.com wrote: *Resending as I do not

Re: how to pass configuration properties from driver to executor?

2015-05-02 Thread Akhil Das
In fact, sparkConf.set("spark.whateverPropertyYouWant", "Value") gets shipped to the executors. Thanks Best Regards On Fri, May 1, 2015 at 2:55 PM, Michael Ryabtsev mich...@totango.com wrote: Hi, We've had a similar problem, but with a log4j properties file. The only working way we've found, was
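A quick sketch of that behaviour (the property name is made up, and SparkEnv is a developer API, so treat this as illustrative only):

    import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

    val conf = new SparkConf()
      .setAppName("conf-shipping-demo")
      .set("spark.myapp.greeting", "hello")          // spark.* keys travel with the conf to the executors
    val sc = new SparkContext(conf)

    val seen = sc.parallelize(1 to 4).map { _ =>
      SparkEnv.get.conf.get("spark.myapp.greeting", "missing")   // read back on the executor side
    }
    println(seen.collect().mkString(","))            // hello,hello,hello,hello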

Re: Spark worker error on standalone cluster

2015-05-02 Thread Akhil Das
Just make sure you are having the same version of spark in your cluster and the project's build file. Thanks Best Regards On Fri, May 1, 2015 at 2:43 PM, Michael Ryabtsev (Totango) mich...@totango.com wrote: Hi everyone, I have a spark application that works fine on a standalone Spark

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-02 Thread Akhil Das
-memory 12g --executor-cores 4 12G is the limit imposed by the YARN cluster, I can't go beyond this. Any suggestions? Regards, Deepak On Thu, Apr 30, 2015 at 6:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Did not work. Same problem. On Thu, Apr 30, 2015 at 1:28 PM, Akhil Das ak

Re: default number of reducers

2015-04-30 Thread Akhil Das
This is the Spark mailing list :/ Yes, you can configure the following in the mapred-site.xml for that: <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> </property> Thanks Best Regards On Tue, Apr 28, 2015 at 11:00 PM, Shushant Arora shushantaror...@gmail.com wrote: In

Re: Performance advantage by loading data from local node over S3.

2015-04-30 Thread Akhil Das
If the data is too huge and is in S3, that'll be a lot of network traffic, instead, if the data is available in HDFS (with proper replication available) then it will be faster as most of the time, data will be available as PROCESS_LOCAL/NODE_LOCAL to the executor. Thanks Best Regards On Wed, Apr

Re: Spark - Timeout Issues - OutOfMemoryError

2015-04-30 Thread Akhil Das
You could try increasing your heap space explicitly, like export _JAVA_OPTIONS=-Xmx10g; it's not the correct approach but try it. Thanks Best Regards On Tue, Apr 28, 2015 at 10:35 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have a SparkApp that completes in 45 mins for 5 files (5*750MB

Re: rdd.count with 100 elements taking 1 second to run

2015-04-30 Thread Akhil Das
Does this speed it up? val rdd = sc.parallelize(1 to 100, 30) rdd.count Thanks Best Regards On Wed, Apr 29, 2015 at 1:47 AM, Anshul Singhle ans...@betaglide.com wrote: Hi, I'm running the following code in my cluster (standalone mode) via spark shell - val rdd = sc.parallelize(1 to

Re: How to stream all data out of a Kafka topic once, then terminate job?

2015-04-30 Thread Akhil Das
Have a look at KafkaRDD https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kafka/KafkaRDD.html Thanks Best Regards On Wed, Apr 29, 2015 at 10:04 AM, dgoldenberg dgoldenberg...@gmail.com wrote: Hi, I'm wondering about the use-case where you're not doing continuous,

Re: How to run customized Spark on EC2?

2015-04-30 Thread Akhil Das
This is how i used to do it: - Login to the ec2 cluster (master) - Make changes to the spark, and build it. - Stop the old installation of spark (sbin/stop-all.sh) - Copy old installation conf/* to modified version's conf/ - Rsync modified version to all slaves - do sbin/start-all.sh from the

Re: How to run self-build spark on EC2?

2015-04-30 Thread Akhil Das
You can replace your cluster's (on master and workers) assembly jar with your custom-built assembly jar. Thanks Best Regards On Tue, Apr 28, 2015 at 9:45 PM, Bo Fu b...@uchicago.edu wrote: Hi all, I have an issue. I added some timestamps in Spark source code and built it using: mvn package

Re: External Application Run Status

2015-04-30 Thread Akhil Das
One way you could try would be, Inside the map, you can have a synchronized thread and you can block the map till the thread finishes up processing. Thanks Best Regards On Wed, Apr 29, 2015 at 9:38 AM, Nastooh Avessta (navesta) nave...@cisco.com wrote: Hi In a multi-node setup, I am

Re: Re: Spark streaming - textFileStream/fileStream - Get file name

2015-04-29 Thread Akhil Das
It is possible to access the filename, it's a bit tricky though. val fstream = ssc.fileStream[LongWritable, IntWritable, SequenceFileInputFormat[LongWritable, IntWritable]]("/home/akhld/input/") fstream.foreach(x => { //You can get it with this object.

Re: Spark 1.3.1 JavaStreamingContext - fileStream compile error

2015-04-28 Thread Akhil Das
How about: JavaPairDStream<LongWritable, Text> input = jssc.fileStream(inputDirectory, LongWritable.class, Text.class, TextInputFormat.class); See the complete example over here

Re: Understanding Spark's caching

2015-04-28 Thread Akhil Das
Option B would be fine, as in the SO itself the answer says, Since RDD transformations merely build DAG descriptions without execution, in Option A by the time you call unpersist, you still only have job descriptions and not a running execution. Also note, In Option A, you are not specifying any
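A minimal sketch of that ordering: persist, run the dependent jobs, then unpersist. Paths are placeholders, the storage level is chosen arbitrarily, and an existing SparkContext sc is assumed:

    import org.apache.spark.storage.StorageLevel

    val parsed = sc.textFile("hdfs:///input/events").map(_.split(","))
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    val total  = parsed.count()                                          // first action materializes the cache
    val errors = parsed.filter(a => a.nonEmpty && a(0) == "ERROR").count()   // reuses the cached partitions

    parsed.unpersist()                                                   // only after the jobs that need it have run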

Re: java.lang.StackOverflowError when recovery from checkpoint in Streaming

2015-04-28 Thread Akhil Das
There's a similar issue reported over here https://issues.apache.org/jira/browse/SPARK-6847 Thanks Best Regards On Tue, Apr 28, 2015 at 7:35 AM, wyphao.2007 wyphao.2...@163.com wrote: Hi everyone, I am using val messages = KafkaUtils.createDirectStream[String, String, StringDecoder,

Re: Spark timeout issue

2015-04-27 Thread Akhil Das
You need to look deeper into your worker logs; you may find GC errors, IO exceptions etc. if you look closely, which could be triggering the timeout. Thanks Best Regards On Mon, Apr 27, 2015 at 3:18 AM, Deepak Gopalakrishnan dgk...@gmail.com wrote: Hello Patrick, Sure. I've posted this on user as

Re: Understand the running time of SparkSQL queries

2015-04-27 Thread Akhil Das
Isn't it already available on the driver UI (that runs on 4040)? Thanks Best Regards On Mon, Apr 27, 2015 at 9:55 AM, Wenlei Xie wenlei@gmail.com wrote: Hi, I am wondering how should we understand the running time of SparkSQL queries? For example the physical query plan and the running

Re: Convert DStream[Long] to Long

2015-04-25 Thread Akhil Das
Like this? messages.foreachRDD(rdd => { if (rdd.count() > 0) //Do whatever you want. }) Thanks Best Regards On Fri, Apr 24, 2015 at 11:20 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Hi, I need to compare the count of messages received, if it is 0 or not, but messages.count() returns a

Re: Contributing Documentation Changes

2015-04-25 Thread Akhil Das
I also want to add mine :/ Everyone wants to add it seems. Thanks Best Regards On Fri, Apr 24, 2015 at 8:58 PM, madhu phatak phatak@gmail.com wrote: Hi, I understand that. The following page http://spark.apache.org/documentation.html has a external tutorials,blogs section which points

Re: DAG

2015-04-25 Thread Akhil Das
May be this will give you a good start https://github.com/apache/spark/pull/2077 Thanks Best Regards On Sat, Apr 25, 2015 at 1:29 AM, Giovanni Paolo Gibilisco gibb...@gmail.com wrote: Hi, I would like to know if it is possible to build the DAG before actually executing the application. My

Re: StreamingContext.textFileStream issue

2015-04-25 Thread Akhil Das
Make sure you are having >= 2 cores for your streaming application. Thanks Best Regards On Sat, Apr 25, 2015 at 3:02 AM, Yang Lei genia...@gmail.com wrote: I hit the same issue as if the directory has no files at all when running the sample examples/src/main/python/streaming/hdfs_wordcount.py
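The usual local-mode version of that requirement, as a sketch: one core for the receiver plus at least one for processing, hence local[2] or more (app name and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-app")
    val ssc  = new StreamingContext(conf, Seconds(1))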

Re: problem writing to s3

2015-04-24 Thread Akhil Das
, 2015 at 1:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try writing to a different S3 bucket and confirm that? Thanks Best Regards On Thu, Apr 23, 2015 at 12:11 AM, Daniel Mahler dmah...@gmail.com wrote: Hi Akhil, It works fine when outprefix is a hdfs:///localhost/... url

Re: Multiple HA spark clusters managed by 1 ZK cluster?

2015-04-22 Thread Akhil Das
The directory in ZooKeeper to store recovery state (default: /spark). -Jeff From: Sean Owen so...@cloudera.com To: Akhil Das ak...@sigmoidanalytics.com Cc: Michal Klos michal.klo...@gmail.com, User user@spark.apache.org Date: Wed, 22 Apr 2015 11:05:46 +0100 Subject: Re: Multiple HA spark

Re: Graphical display of metrics on application UI page

2015-04-22 Thread Akhil Das
There were some PRs about graphical representation with D3.js, you can possibly see them on GitHub. Here are a few of them https://github.com/apache/spark/pulls?utf8=%E2%9C%93q=d3 Thanks Best Regards On Wed, Apr 22, 2015 at 8:08 AM, Punyashloka Biswal punya.bis...@gmail.com wrote: Dear

Re: sparksql - HiveConf not found during task deserialization

2015-04-22 Thread Akhil Das
are in that dir. For me the most confusing thing is that the executor can actually create HiveConf objects, but then it cannot find the class when the task deserializer is at work. On 20 April 2015 at 14:18, Akhil Das ak...@sigmoidanalytics.com wrote: Can you try sc.addJar(/path/to/your/hive/jar), i

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Akhil Das
You can enable this flag to run multiple jobs concurrently. It might not be production ready, but you can give it a try: sc.set("spark.streaming.concurrentJobs", "2") Refer to TD's answer here

Re: Spark and accumulo

2015-04-21 Thread Akhil Das
You can simply use a custom inputformat (AccumuloInputFormat) with the hadoop RDDs (sc.newAPIHadoopFile etc) for that, all you need to do is to pass the jobConfs. Here's a pretty clean discussion:

Re: Understanding the build params for spark with sbt.

2015-04-21 Thread Akhil Das
With maven you could do it like: mvn -Dhadoop.version=2.3.0 -DskipTests clean package -pl core Thanks Best Regards On Mon, Apr 20, 2015 at 8:10 PM, Shiyao Ma i...@introo.me wrote: Hi. My usage is only about the spark core and hdfs, so no spark sql or mlib or other components involved. I saw

Re: meet weird exception when studying rdd caching

2015-04-21 Thread Akhil Das
It could be a similar issue as https://issues.apache.org/jira/browse/SPARK-4300 Thanks Best Regards On Tue, Apr 21, 2015 at 8:09 AM, donhoff_h 165612...@qq.com wrote: Hi, I am studying the RDD Caching function and write a small program to verify it. I run the program in a Spark1.3.0

Re: Custom paritioning of DSTream

2015-04-21 Thread Akhil Das
I think DStream.transform is the one that you are looking for. Thanks Best Regards On Mon, Apr 20, 2015 at 9:42 PM, Evo Eftimov evo.efti...@isecc.com wrote: Is the only way to implement a custom partitioning of DStream via the foreach approach so to gain access to the actual RDDs comprising

Re: Running spark over HDFS

2015-04-21 Thread Akhil Das
Your spark master should be spark://swetha:7077 :) Thanks Best Regards On Mon, Apr 20, 2015 at 2:44 PM, madhvi madhvi.gu...@orkash.com wrote: PFA screenshot of my cluster UI Thanks On Monday 20 April 2015 02:27 PM, Akhil Das wrote: Are you seeing your task being submitted to the UI
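A sketch of wiring that up from the driver side (the master URL comes from this thread; the HDFS namenode URI and input path are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://swetha:7077")   // exactly as shown in the master web UI's top-left corner
      .setAppName("hdfs-over-standalone")
    val sc = new SparkContext(conf)
    println(sc.textFile("hdfs://swetha:9000/data/input.txt").count())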

Re: Running spark over HDFS

2015-04-20 Thread Akhil Das
2015 12:28 PM, Akhil Das wrote: In your eclipse, while you create your SparkContext, set the master uri as shown in the web UI's top left corner like: spark://someIPorHost:7077 and it should be fine. Thanks Best Regards On Mon, Apr 20, 2015 at 12:22 PM, madhvi madhvi.gu...@orkash.com

Re: NEWBIE/not able to connect to postgresql using jdbc

2015-04-20 Thread Akhil Das
try doing a sc.addJar(path\to\your\postgres\jar) Thanks Best Regards On Mon, Apr 20, 2015 at 12:26 PM, shashanksoni shashankso...@gmail.com wrote: I am using spark 1.3 standalone cluster on my local windows and trying to load data from one of our server. Below is my code - import os
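A sketch of the whole round trip on Spark 1.3, assuming an existing SparkContext sc (the jar path, connection URL, credentials and table name are all placeholders):

    sc.addJar("C:\\jars\\postgresql-9.4-1201.jdbc41.jar")   // make the driver visible to the executors

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret",
      "dbtable" -> "public.my_table"))
    df.show()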

Re: sparksql - HiveConf not found during task deserialization

2015-04-20 Thread Akhil Das
was suspecting some foul play with classloaders. On 20 April 2015 at 12:20, Akhil Das ak...@sigmoidanalytics.com wrote: Looks like a missing jar, try to print the classpath and make sure the hive jar is present. Thanks Best Regards On Mon, Apr 20, 2015 at 11:52 AM, Manku Timma manku.tim
