Updating shared data structure between executors

2014-08-19 Thread Tim Smith
Hi, I am writing some Scala code to normalize a stream of logs using an input configuration file (multiple regex patterns). To avoid restarting the job, I can read in a new config file using fileStream and then turn the config file into a map. But I am unsure about how to update a shared map
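
A rough sketch of one way this is often handled (not tested against this exact setup; configDir is a placeholder, parseConfigLine is a hypothetical helper, and kInMsg/normalizeLog are borrowed from the later threads). Collect the tiny config stream to the driver, keep the latest map in a driver-side variable, and snapshot it inside transform(), which runs on the driver for every batch:

    var configMap: Map[String, String] = Map.empty

    // Config files land in a watched directory; each is small enough to collect.
    ssc.textFileStream("hdfs:///path/to/configDir").foreachRDD { rdd =>
      val lines = rdd.collect()
      if (lines.nonEmpty) configMap = lines.map(parseConfigLine).toMap
    }

    // transform() invokes this function on the driver for every batch, so cfg is a
    // fresh snapshot of the latest config each time.
    val normalized = kInMsg.transform { rdd =>
      val cfg = configMap
      rdd.map(x => normalizeLog(x._2, cfg))
    }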

Kafka stream receiver stops input

2014-08-27 Thread Tim Smith
Hi, I have Spark (1.0.0 on CDH5) running with Kafka 0.8.1.1. I have a streaming job that reads from a Kafka topic and writes output to another Kafka topic. The job starts fine but after a while the input stream stops getting any data. I think these messages show no incoming data on the stream:

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tim Smith
errors. Please take a look at the executor logs of the lost executor to find the root cause of the executor failure. TD On Thu, Aug 28, 2014 at 3:54 PM, Tim Smith secs...@gmail.com wrote: Hi, I have a Spark 1.0.0 (CDH5) streaming job reading from Kafka that died

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tim Smith
of partitions (try setting it to 2x the number of cores given to the application). Yeah, in 1.0.0, ttl should be unnecessary. On Thu, Aug 28, 2014 at 5:17 PM, Tim Smith secs...@gmail.com wrote: On Thu, Aug 28, 2014 at 4:19 PM, Tathagata Das tathagata.das1...@gmail.com wrote: If you

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-29 Thread Tim Smith
The "too many open files" error is a sign that you need to increase the system-wide limit of open files. Try adding ulimit -n 16000 to your conf/spark-env.sh. TD On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith secs...@gmail.com wrote: It appeared after running for a while. I re-ran the job and this time

Re: DStream repartitioning, performance tuning processing

2014-08-29 Thread Tim Smith
I set partitions to 64: // kInMsg.repartition(64) val outdata = kInMsg.map(x => normalizeLog(x._2, configMap)) // Still see all activity only on the two nodes that seem to be receiving from Kafka. On Thu, Aug 28, 2014 at 5:47 PM, Tim Smith secs...@gmail.com wrote: TD - Apologies, didn't realize

Re: DStream repartitioning, performance tuning processing

2014-08-29 Thread Tim Smith
, this app is assigned 252.5GB of memory, 128 VCores and 9 containers. Am I missing something here? Thanks, Tim On Thu, Aug 28, 2014 at 11:55 PM, Tim Smith secs...@gmail.com wrote: I set partitions to 64: // kInMsg.repartition(64) val outdata = kInMsg.map(x => normalizeLog(x._2, configMap

Re: DStream repartitioning, performance tuning processing

2014-08-29 Thread Tim Smith
is timestamped 19:04:51, which tells me the executor was killed for some reason right before the driver noticed that executor/task failure. How come my task failed after only 4 attempts although my config says the failure threshold is 64? On Fri, Aug 29, 2014 at 12:00 PM, Tim Smith secs...@gmail.com

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
Good to see I am not the only one who cannot get incoming DStreams to repartition. I tried repartition(512) but still no luck - the app stubbornly runs only on two nodes. Now this is 1.0.0, but looking at the release notes for 1.0.1 and 1.0.2, I don't see anything that says this was an issue and has

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
for each receiver. You need multiple partitions in the queue, each consumed by a DStream, if you mean to parallelize consuming the queue. On Fri, Aug 29, 2014 at 11:08 PM, Tim Smith secs...@gmail.com wrote: Good to see I am not the only one who cannot get incoming DStreams to repartition. I tried

Re: Low Level Kafka Consumer for Spark

2014-08-29 Thread Tim Smith
) now which actually has me confused. If Streams are active only on 3 nodes then how/why did a 4th node get work? If a 4th got work why aren't more nodes getting work? On Fri, Aug 29, 2014 at 4:11 PM, Tim Smith secs...@gmail.com wrote: I create my DStream very simply as: val kInMsg

Re: Low Level Kafka Consumer for Spark

2014-08-30 Thread Tim Smith
I'd be interested to understand this mechanism as well. But this is the error recovery part of the equation. Consuming from Kafka has two aspects - parallelism and error recovery and I am not sure how either works. For error recovery, I would like to understand how: - A failed receiver gets

Re: Publishing a transformed DStream to Kafka

2014-09-02 Thread Tim Smith
I'd be interested in finding the answer too. Right now, I do: val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam)) kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => { writer.output(rec) }) }) // where writer.output is a method that takes a string and writer is an instance of a
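
A variation that usually scales better (a sketch only; KafkaWriter, brokerList and outputTopic are placeholders, not existing classes or values): create the producer once per partition inside foreachPartition, so the connection object never has to be serialized from the driver and is reused across records:

    kafkaOutMsgs.foreachRDD { rdd =>
      rdd.foreachPartition { recs =>
        val writer = new KafkaWriter(brokerList, outputTopic)  // one producer per partition
        recs.foreach(rec => writer.output(rec))
        writer.close()
      }
    }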

Re: Low Level Kafka Consumer for Spark

2014-09-08 Thread Tim Smith
Thanks TD. Someone already pointed out to me that /repartition(...)/ isn't the right way. You have to /val partedStream = repartition(...)/. Would be nice to have it fixed in the docs. On Fri, Sep 5, 2014 at 10:44 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Some thoughts on this
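
For reference, the difference boils down to this (repartition returns a new DStream rather than mutating the receiver stream, so the result has to be assigned and used downstream; names borrowed from the earlier thread):

    // Has no effect: the repartitioned stream is thrown away.
    kInMsg.repartition(64)
    val out1 = kInMsg.map(x => normalizeLog(x._2, configMap))

    // Works: keep the returned DStream and build on it.
    val parted = kInMsg.repartition(64)
    val out2 = parted.map(x => normalizeLog(x._2, configMap))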

Re: How to scale more consumer to Kafka stream

2014-09-10 Thread Tim Smith
How are you creating your Kafka streams in Spark? If you have 10 partitions for a topic, you can call createStream ten times to create 10 parallel receivers/executors and then use union to combine all the DStreams. On Wed, Sep 10, 2014 at 7:16 AM, richiesgr richie...@gmail.com wrote: Hi (my
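
A sketch of that pattern (topic, ZooKeeper quorum and group id are placeholders; ssc is an existing StreamingContext):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 10   // one receiver per Kafka partition
    val kInStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "myGroup", Map("myTopic" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER)
    }
    val kInMsg = ssc.union(kInStreams)   // one logical DStream backed by 10 receivers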

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Tim Smith
I am using Spark 1.0.0 (on CDH 5.1) and have a similar issue. In my case, the receivers die within an hour because YARN kills the containers for high memory usage. I set spark.cleaner.ttl to 30 seconds but that didn't help. So I don't think stale RDDs are an issue here. I did a jmap -histo on a couple
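
For anyone trying the same knobs, they would normally be set on the SparkConf (or in spark-defaults) before the StreamingContext is created; a sketch, with the values shown only as examples:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("log-normalizer")
      .set("spark.cleaner.ttl", "30")            // seconds; aggressively expire old metadata/RDDs
      .set("spark.streaming.unpersist", "true")  // let Spark unpersist batch RDDs it no longer needs
    val ssc = new StreamingContext(conf, Seconds(10))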

Re: how to choose right DStream batch interval

2014-09-10 Thread Tim Smith
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617 Slide 39 covers it. On Tue, Sep 9, 2014 at 9:23 PM, qihong qc...@pivotal.io wrote: Hi Mayur, Thanks for your response. I did write a simple test that set up a DStream with 5 batches; The

Re: spark-streaming Could not compute split exception

2014-09-10 Thread Tim Smith
I had a similar issue and many others - all were basically symptoms of YARN killing the container for high memory usage. Haven't gotten to the root cause yet. On Tue, Sep 9, 2014 at 3:18 PM, Marcelo Vanzin van...@cloudera.com wrote: Your executor is exiting or crashing unexpectedly: On Tue, Sep

Re: spark.cleaner.ttl and spark.streaming.unpersist

2014-09-10 Thread Tim Smith
process. They should be about the same, right? Also, in the heap dump, 99% of the heap seems to be occupied with unreachable objects (and most of it is byte arrays). On Wed, Sep 10, 2014 at 12:06 PM, Tim Smith secs...@gmail.com wrote: Actually, I am not doing any explicit shuffle/updateByKey

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Tim Smith
Thanks for all the good work. Very excited about seeing more features and better stability in the framework. On Thu, Sep 11, 2014 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote: I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the

Stable spark streaming app

2014-09-12 Thread Tim Smith
Hi, Anyone have a stable streaming app running in production? Can you share some overview of the app and setup like number of nodes, events per second, broad stream processing workflow, config highlights etc? Thanks, Tim

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-12 Thread Tim Smith
Similar issue (Spark 1.0.0). Streaming app runs for a few seconds before these errors start to pop all over the driver logs: 14/09/12 17:30:23 WARN TaskSetManager: Loss was due to java.lang.Exception java.lang.Exception: Could not compute split, block input-4-1410542878200 not found at

Where do logs go in StandAlone mode

2014-09-12 Thread Tim Smith
Spark 1.0.0 I write logs out from my app using this object: object LogService extends Logging { /** Set reasonable logging levels for streaming if the user has not configured log4j. */ def setStreamingLogLevels() { val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
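
The snippet above is cut off by the archive; the usual shape of that helper (roughly as it appears in the Spark streaming examples of that era) is:

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.Logging

    object LogService extends Logging {
      /** Set reasonable logging levels for streaming if the user has not configured log4j. */
      def setStreamingLogLevels() {
        val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
        if (!log4jInitialized) {
          logInfo("Setting log level to [WARN] for streaming. " +
            "To override add a custom log4j.properties to the classpath.")
          Logger.getRootLogger.setLevel(Level.WARN)
        }
      }
    }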

Executor garbage collection

2014-09-12 Thread Tim Smith
Hi, Anyone setting any explicit GC options for the executor jvm? If yes, what and how did you arrive at them? Thanks, - Tim
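
One way such options are typically passed (a sketch; the specific flags are only illustrative starting points, not a recommendation):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    // GC output then shows up in each executor's stdout/stderr logs.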

Re: Low Level Kafka Consumer for Spark

2014-09-15 Thread Tim Smith
Hi Dibyendu, I am a little confused about the need for rate limiting input from kafka. If the stream coming in from kafka has higher message/second rate than what a Spark job can process then it should simply build a backlog in Spark if the RDDs are cached on disk using persist(). Right? Thanks,

Re: Stable spark streaming app

2014-09-17 Thread Tim Smith
! Adding to it, please share experiences of building an enterprise grade product based on Spark Streaming. I am exploring Spark Streaming for enterprise software and am cautiously optimistic about it. I see huge potential to improve debuggability of Spark. - Original Message - From: Tim

Re: Stable spark streaming app

2014-09-17 Thread Tim Smith
PM, Tim Smith secs...@gmail.com wrote: I don't have anything in production yet but I now at least have a stable (running for more than 24 hours) streaming app. Earlier, the app would crash for all sorts of reasons. Caveats/setup: - Spark 1.0.0 (I have no input flow control unlike Spark 1.1

Re: Stable spark streaming app

2014-09-18 Thread Tim Smith
at 5:50 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi Tim Just curious to know ; Which Kafka Consumer you have used ? Dib On Sep 18, 2014 4:40 AM, Tim Smith secs...@gmail.com wrote: Thanks :) On Wed, Sep 17, 2014 at 2:10 PM, Paul Wais pw...@yelp.com wrote: Thanks Tim

Re: Kafka Spark Streaming on Spark 1.1

2014-09-18 Thread Tim Smith
What kafka receiver are you using? Did you build a new jar for your app with the latest streaming-kafka code for 1.1? On Thu, Sep 18, 2014 at 11:47 AM, JiajiaJing jj.jing0...@gmail.com wrote: Hi Spark Users, We just upgrade our spark version from 1.0 to 1.1. And we are trying to re-run all
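
For the jar question, matching the Kafka integration artifact to the Spark version in the build usually does it; a sketch in sbt (version strings shown only as examples):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
    )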

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Tim Smith
Posting your code would be really helpful in figuring out gotchas. On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell matt.narr...@gmail.com wrote: Hey, Spark 1.1.0 Kafka 0.8.1.1 Hadoop (YARN/HDFS) 2.5.1 I have a five partition Kafka topic. I can create a single Kafka receiver via

Re: Multiple Kafka Receivers and Union

2014-09-23 Thread Tim Smith
); } }); … and further Spark functions ... On Sep 23, 2014, at 2:55 PM, Tim Smith secs...@gmail.com wrote: Posting your code would be really helpful in figuring out gotchas. On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell matt.narr...@gmail.com wrote: Hey, Spark 1.1.0 Kafka

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Tim Smith
once, I don’t receive any messages. I’ll dig through the logs, but at first glance yesterday I didn’t see anything suspect. I’ll have to look closer. mn On Sep 23, 2014, at 6:14 PM, Tim Smith secs...@gmail.com wrote: Maybe post the before-code as in what was the code before you did

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Tim Smith
(that are purported to work), but no luck. mn On Sep 24, 2014, at 11:27 AM, Tim Smith secs...@gmail.com wrote: Maybe differences between JavaPairDStream and JavaPairReceiverInputDStream? On Wed, Sep 24, 2014 at 7:46 AM, Matt Narrell matt.narr...@gmail.com wrote: The part that works

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
of the 'knobs' I describe here to see if that would help? http://www.virdata.com/tuning-spark/ -kr, Gerard. On Thu, Feb 12, 2015 at 8:44 AM, Tim Smith secs...@gmail.com wrote: Just read the thread Are these numbers abnormal for spark streaming? and I think I am seeing similar results

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
On Thu, Feb 12, 2015 at 6:29 PM, Tim Smith secs...@gmail.com wrote: Hi Gerard, Great write-up and really good guidance in there. I have to be honest, I don't know why but setting # of partitions for each dStream to a low number (5-10) just causes the app to choke/crash. Setting it to 20 gets

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
: 3.596 s) 15/02/13 06:27:03 INFO JobScheduler: Total delay: 3.905 s for time 142380882 ms (execution: 3.861 s) 15/02/13 06:27:24 INFO JobScheduler: Total delay: 4.068 s for time 142380884 ms (execution: 4.026 s) On Thu, Feb 12, 2015 at 9:54 PM, Tim Smith secs...@gmail.com wrote: TD - I

Re: Streaming scheduling delay

2015-02-12 Thread Tim Smith
. Besides, does setting the partition count to 1 for each DStream mean dstream.repartition(1)? If so, I think it will still introduce a shuffle and move all the data into one partition. Thanks Saisai 2015-02-13 13:54 GMT+08:00 Tim Smith secs...@gmail.com: TD - I will try count() and report back. Meanwhile

Accumulator in SparkUI for streaming

2015-02-20 Thread Tim Smith
On Spark 1.2: I am trying to capture # records read from a kafka topic: val inRecords = ssc.sparkContext.accumulator(0, "InRecords") .. kInStreams.foreach( k => { k.foreachRDD ( rdd => inRecords += rdd.count().toInt ) inRecords.value Question
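
For context, the pattern being attempted looks roughly like this (a sketch using the same names; whether the named accumulator then shows up in the streaming UI is exactly the open question):

    val inRecords = ssc.sparkContext.accumulator(0, "InRecords")

    kInStreams.foreach { k =>
      k.foreachRDD { rdd =>
        inRecords += rdd.count().toInt   // count() runs a job; the add happens on the driver
      }
    }
    // inRecords.value can be read on the driver, e.g. logged once per batch.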

Re: Streaming scheduling delay

2015-02-11 Thread Tim Smith
at 11:16 PM, Tim Smith secs...@gmail.com wrote: On Spark 1.2 (have been seeing this behaviour since 1.0), I have a streaming app that consumes data from Kafka and writes it back to Kafka (different topic). My big problem has been Total Delay. While execution time is usually window size (in seconds

Streaming scheduling delay

2015-02-11 Thread Tim Smith
On Spark 1.2 (have been seeing this behaviour since 1.0), I have a streaming app that consumes data from Kafka and writes it back to Kafka (different topic). My big problem has been Total Delay. While execution time is usually window size (in seconds), the total delay ranges from minutes to

Re: Spark Streaming output cannot be used as input?

2015-02-18 Thread Tim Smith
+1 for writing the Spark output to Kafka. You can then hang off multiple compute/storage framework from kafka. I am using a similar pipeline to feed ElasticSearch and HDFS in parallel. Allows modularity, you can take down ElasticSearch or HDFS for maintenance without losing (except for some edge

How to diagnose could not compute split errors and failed jobs?

2015-02-19 Thread Tim Smith
My streaming app runs fine for a few hours and then starts spewing Could not compute split, block input-xx-xxx not found errors. After this, jobs start to fail and batches start to pile up. My question isn't so much about why this error but rather, how do I trace what leads to this error? I

Re: Accumulator in SparkUI for streaming

2015-02-28 Thread Tim Smith
acc.value res1: Int = 1000 The Stage details page shows: On 20.2.2015. 9:25, Tim Smith wrote: On Spark 1.2: I am trying to capture # records read from a kafka topic: val inRecords = ssc.sparkContext.accumulator(0, "InRecords") .. kInStreams.foreach( k

Re: createDirectStream and Stats

2015-06-19 Thread Tim Smith
provide / relevant code? On Fri, Jun 19, 2015 at 1:23 PM, Tim Smith secs...@gmail.com wrote: Update on performance of the new API: the new code using the createDirectStream API ran overnight and when I checked the app state in the morning, there were massive scheduling delays :( Not sure why

Re: createDirectStream and Stats

2015-06-19 Thread Tim Smith
. I'd remove the repartition. If you weren't doing any shuffles in the old job, and are doing a shuffle in the new job, it's not really comparable. On Fri, Jun 19, 2015 at 8:16 PM, Tim Smith secs...@gmail.com wrote: On Fri, Jun 19, 2015 at 5:15 PM, Tathagata Das t...@databricks.com wrote

Re: createDirectStream and Stats

2015-06-19 Thread Tim Smith
)) dataout.foreachRDD ( rdd => rdd.foreachPartition(rec => { myOutputFunc.write(rec) }) Thanks, Tim If that's the case I'd try direct stream without the repartitioning. On Fri, Jun 19, 2015 at 6:43 PM, Tim Smith secs...@gmail.com wrote: Essentially, I went from: k = createStream . val

Re: createDirectStream and Stats

2015-06-18 Thread Tim Smith
in Spark 1.4.0. Bonus you get a fancy new streaming UI with more awesome stats. :) On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith secs...@gmail.com wrote: Hi, I just switched from createStream to the createDirectStream API for kafka and while things otherwise seem happy, the first thing I noticed

createDirectStream and Stats

2015-06-18 Thread Tim Smith
Hi, I just switched from createStream to the createDirectStream API for kafka and while things otherwise seem happy, the first thing I noticed is that stream/receiver stats are gone from the Spark UI :( Those stats were very handy for keeping an eye on health of the app. What's the best way to
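
For reference, the switch in question looks roughly like this (a sketch of the Spark 1.3-era API; broker list and topic are placeholders, ssc is an existing StreamingContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("myTopic")

    // No receivers: each Kafka partition maps directly to an RDD partition.
    val kInMsg = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)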

Re: createDirectStream and Stats

2015-06-19 Thread Tim Smith
app. Yes, for the record, this is with CDH 5.4.1 and Spark 1.3. On Thu, Jun 18, 2015 at 7:05 PM, Tim Smith secs...@gmail.com wrote: Thanks for the super-fast response, TD :) I will now go bug my hadoop vendor to upgrade from 1.3 to 1.4. Cloudera, are you listening? :D On Thu, Jun 18

Controlling output fileSize in SparkSQL

2015-07-27 Thread Tim Smith
Hi, I am using Spark 1.3 (CDH 5.4.4). What's the recipe for setting a minimum output file size when writing out from SparkSQL? So far, I have tried: --x- import sqlContext.implicits._ sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
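
The knob that usually matters here is the number of partitions at write time rather than an HDFS setting; a commonly suggested approach on Spark 1.3 looks roughly like this (paths and partition count are placeholders):

    val df = sqlContext.parquetFile("/input/path")   // or any other DataFrame
    // Fewer partitions => fewer, larger output files (each partition becomes one file).
    df.repartition(16).saveAsParquetFile("/output/path")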

Re: Spark REST Job server feedback?

2015-10-08 Thread Tim Smith
I am curious too - any comparison between the two. Looks like one is Datastax sponsored and the other is Cloudera. Other than that, any major/core differences in design/approach? Thanks, Tim On Mon, Sep 28, 2015 at 8:32 AM, Ramirez Quetzal wrote: > Anyone has

Alter table fails to find table

2015-09-02 Thread Tim Smith
Spark 1.3.0 (CDH 5.4.4) scala> sqlContext.sql("SHOW TABLES").collect res18: Array[org.apache.spark.sql.Row] = Array([allactivitydata,true], [sample_07,false], [sample_08,false]) sqlContext.sql("SELECT COUNT(*) from allactivitydata").collect res19: Array[org.apache.spark.sql.Row] =

Consuming AWS Cloudwatch logs from Kinesis into Spark

2017-04-05 Thread Tim Smith
I am sharing this code snippet since I spent quite some time figuring it out and I couldn't find any examples online. Between the Kinesis documentation, tutorial on AWS site and other code snippets on the Internet, I was confused about structure/format of the messages that Spark fetches from
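
Since the snippet itself is truncated by the archive, here is the general shape of such a consumer (a sketch only: the exact KinesisUtils signature varies across Spark versions, ssc is an existing StreamingContext, stream/app/region names are placeholders, and the assumption that CloudWatch Logs subscription records arrive as gzip-compressed JSON should be verified against the AWS docs):

    import java.io.ByteArrayInputStream
    import java.util.zip.GZIPInputStream
    import scala.io.Source

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val stream = KinesisUtils.createStream(
      ssc, "cwlogs-app", "my-kinesis-stream",
      "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

    // Each Kinesis record is assumed to be a gzip-compressed JSON document from the
    // CloudWatch Logs subscription; decompress it back to a string for parsing.
    val logEvents = stream.map { bytes =>
      val gz = new GZIPInputStream(new ByteArrayInputStream(bytes))
      Source.fromInputStream(gz, "UTF-8").mkString
    }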

Re: Assigning a unique row ID

2017-04-07 Thread Tim Smith
http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson wrote: > Hi, > > What's the best way to assign a truly unique row ID (rather than a hash) > to a
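
The two options that come up most often (a sketch assuming Spark 2.x DataFrames; df stands in for the actual DataFrame): a UUID column via a UDF as in the linked answer, or monotonically_increasing_id(), which is unique but neither consecutive nor stable across recomputation:

    import java.util.UUID
    import org.apache.spark.sql.functions.{monotonically_increasing_id, udf}

    val uuid = udf(() => UUID.randomUUID().toString)

    val withIds = df
      .withColumn("row_uuid", uuid())                        // truly unique string id
      .withColumn("row_num", monotonically_increasing_id())  // unique Long, not consecutive
    // Note: re-evaluating the plan can regenerate the UUIDs; persist or write out if that matters.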

Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-04-27 Thread Tim Smith
Hi, I am trying to figure out the API to initialize a Gaussian mixture model using either centroids created by K-means or a previously calculated GMM model (I am aware that you can "save" a model and "load" it later, but I am not interested in saving a model to a filesystem). The Spark MLlib API
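
For reference, the RDD-based spark.mllib API does accept an initial model even though the DataFrame-based spark.ml API did not at the time; a rough sketch of seeding a GMM from K-means centroids (data is assumed to be an RDD[Vector]; uniform weights and identity covariances are arbitrary starting choices):

    import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel, KMeans}
    import org.apache.spark.mllib.linalg.Matrices
    import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

    val k = 5
    val kmeansModel = KMeans.train(data, k, 20)
    val dim = kmeansModel.clusterCenters.head.size

    // Build an initial GMM: one Gaussian per K-means centroid.
    val initialGmm = new GaussianMixtureModel(
      Array.fill(k)(1.0 / k),
      kmeansModel.clusterCenters.map(c => new MultivariateGaussian(c, Matrices.eye(dim))))

    val gmm = new GaussianMixture()
      .setK(k)
      .setInitialModel(initialGmm)
      .run(data)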

Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-02 Thread Tim Smith
sh we can get this feature in Spark 2.3. Thanks Yanbo On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith <secs...@gmail.com> wrote: Hi, I am trying to figure out the API to initialize a Gaussian mixture model using either centro

Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Tim Smith
One, I think, you should take this to the spark developer list. Two, I suspect broadcast variables aren't the best solution for the use case, you describe. Maybe an in-memory data/object/file store like tachyon is a better fit. Thanks, Tim On Tue, May 2, 2017 at 11:56 AM, Nipun Arora