Hi,
I am writing some Scala code to normalize a stream of logs using an
input configuration file (multiple regex patterns). To avoid
restarting the job, I can read in a new config file using fileStream
and then turn the config file into a map. But I am unsure about how to
update a shared map.
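A rough sketch of one way to wire this up, assuming the config is a simple name<TAB>pattern file dropped into a directory watched by textFileStream (ssc, logStream and the path are placeholder names, not from the original mail):

@volatile var configMap: Map[String, String] = Map.empty

// Rebuild the driver-side map whenever a new config file shows up.
val configStream = ssc.textFileStream("hdfs:///config/normalizer/")
configStream.foreachRDD { rdd =>
  val lines = rdd.collect()                  // config files are tiny
  if (lines.nonEmpty) {
    configMap = lines.map { l =>
      val Array(name, pattern) = l.split("\t", 2)
      name -> pattern
    }.toMap
  }
}

// transform() re-evaluates its closure every batch, so each batch picks up
// the latest map; cfg is a plain val and ships to the executors normally.
val normalized = logStream.transform { rdd =>
  val cfg = configMap
  rdd.map(rec => normalizeLog(rec, cfg))
}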
Hi,
I have Spark (1.0.0 on CDH5) running with Kafka 0.8.1.1.
I have a streaming job that reads from a Kafka topic and writes
output to another kafka topic. The job starts fine but after a while
the input stream stops getting any data. I think these messages show
no incoming data on the stream:
errors. Please take a look
at the executor logs of the lost executor to find the root cause of
the failure.
TD
On Thu, Aug 28, 2014 at 3:54 PM, Tim Smith secs...@gmail.com wrote:
Hi,
Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died
of partitions (try setting it to 2x the number of
cores given to the application).
Yeah, in 1.0.0, ttl should be unnecessary.
On Thu, Aug 28, 2014 at 5:17 PM, Tim Smith secs...@gmail.com wrote:
On Thu, Aug 28, 2014 at 4:19 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
If you
, the too many open files error is a sign that you need to increase the
system-wide limit on open files.
Try adding ulimit -n 16000 to your conf/spark-env.sh.
TD
On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith secs...@gmail.com wrote:
Appeared after running for a while. I re-ran the job and this time
I set partitions to 64:
//
kInMsg.repartition(64)
val outdata = kInMsg.map(x => normalizeLog(x._2, configMap))
//
Still see all activity only on the two nodes that seem to be receiving
from Kafka.
On Thu, Aug 28, 2014 at 5:47 PM, Tim Smith secs...@gmail.com wrote:
TD - Apologies, didn't realize
, this app is assigned 252.5GB of
memory, 128 VCores and 9 containers. Am I missing something here?
Thanks,
Tim
On Thu, Aug 28, 2014 at 11:55 PM, Tim Smith secs...@gmail.com wrote:
I set partitions to 64:
//
kInMsg.repartition(64)
val outdata = kInMsg.map(x => normalizeLog(x._2, configMap))
is timestamped 19:04:51, which tells
me the executor was killed for some reason right before the driver noticed
that executor/task failure.
How come my task failed after only 4 attempts although my config says the
failure threshold is 64?
On Fri, Aug 29, 2014 at 12:00 PM, Tim Smith secs...@gmail.com
Good to see I am not the only one who cannot get incoming DStreams to
repartition. I tried repartition(512) but still no luck - the app
stubbornly runs only on two nodes. Now this is 1.0.0 but looking at
release notes for 1.0.1 and 1.0.2, I don't see anything that says this
was an issue and has been fixed.
for each receiver. You need
multiple partitions in the queue, each consumed by a DStream, if you
mean to parallelize consuming the queue.
On Fri, Aug 29, 2014 at 11:08 PM, Tim Smith secs...@gmail.com wrote:
Good to see I am not the only one who cannot get incoming DStreams to
repartition. I tried
) now, which actually has me confused. If streams are active only
on 3 nodes, then how/why did a 4th node get work? If a 4th got work, why
aren't more nodes getting work?
On Fri, Aug 29, 2014 at 4:11 PM, Tim Smith secs...@gmail.com wrote:
I create my DStream very simply as:
val kInMsg
I'd be interested to understand this mechanism as well. But this is the
error recovery part of the equation. Consuming from Kafka has two aspects -
parallelism and error recovery and I am not sure how either works. For
error recovery, I would like to understand how:
- A failed receiver gets
I'd be interested in finding the answer too. Right now, I do:
val kafkaOutMsgs = kafkInMessages.map(x => myFunc(x._2, someParam))
kafkaOutMsgs.foreachRDD((rdd, time) => { rdd.foreach(rec => {
writer.output(rec) }) } ) // where writer.output is a method that takes a
string and writer is an instance of a
Thanks TD. Someone already pointed out to me that calling repartition(...) by
itself isn't the right way - you have to capture the result, e.g.
val partedStream = stream.repartition(...). Would be nice to have it fixed in the docs.
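For illustration, a minimal sketch of the difference, reusing the names from the snippets above:

// No effect: DStream.repartition returns a new DStream, and the result is discarded.
kInMsg.repartition(64)
val outdata = kInMsg.map(x => normalizeLog(x._2, configMap))

// Capture the repartitioned stream and transform that instead.
val parted = kInMsg.repartition(64)
val outdata2 = parted.map(x => normalizeLog(x._2, configMap))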
On Fri, Sep 5, 2014 at 10:44 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Some thoughts on this
How are you creating your kafka streams in Spark?
If you have 10 partitions for a topic, you can call createStream ten
times to create 10 parallel receivers/executors and then use union to
combine all the dStreams.
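A rough sketch of that pattern with the receiver-based API (ZooKeeper quorum, group and topic names are placeholders, not from the thread):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver per createStream call; ten calls for a ten-partition topic.
val kafkaStreams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group",
    Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)
}

// Union the per-receiver DStreams into one before further processing.
val unified = ssc.union(kafkaStreams)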
On Wed, Sep 10, 2014 at 7:16 AM, richiesgr richie...@gmail.com wrote:
Hi (my
I am using Spark 1.0.0 (on CDH 5.1) and have a similar issue. In my case,
the receivers die within an hour because Yarn kills the containers for high
memory usage. I set spark.cleaner.ttl to 30 seconds but that didn't help. So I
don't think stale RDDs are an issue here. I did a jmap -histo on a couple
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
Slide 39 covers it.
On Tue, Sep 9, 2014 at 9:23 PM, qihong qc...@pivotal.io wrote:
Hi Mayur,
Thanks for your response. I did write a simple test that set up a DStream
with
5 batches; The
I had a similar issue and many others - all were basically symptoms for
yarn killing the container for high memory usage. Haven't gotten to root
cause yet.
On Tue, Sep 9, 2014 at 3:18 PM, Marcelo Vanzin van...@cloudera.com wrote:
Your executor is exiting or crashing unexpectedly:
On Tue, Sep
process.
They should be about the same, right?
Also, in the heap dump, 99% of the heap seems to be occupied with
unreachable objects (and most of it is byte arrays).
On Wed, Sep 10, 2014 at 12:06 PM, Tim Smith secs...@gmail.com wrote:
Actually, I am not doing any explicit shuffle/updateByKey
Thanks for all the good work. Very excited about seeing more features and
better stability in the framework.
On Thu, Sep 11, 2014 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is
the second release on the
Hi,
Anyone have a stable streaming app running in production? Can you
share some overview of the app and setup like number of nodes, events
per second, broad stream processing workflow, config highlights etc?
Thanks,
Tim
Similar issue (Spark 1.0.0). Streaming app runs for a few seconds
before these errors start to pop all over the driver logs:
14/09/12 17:30:23 WARN TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block
input-4-1410542878200 not found
at
Spark 1.0.0
I write logs out from my app using this object:
// needs org.apache.log4j.{Level, Logger} and org.apache.spark.Logging
object LogService extends Logging {
  /** Set reasonable logging levels for streaming if the user has not configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}
Hi,
Anyone setting any explicit GC options for the executor jvm? If yes,
what and how did you arrive at them?
Thanks,
- Tim
Hi Dibyendu,
I am a little confused about the need for rate limiting input from
Kafka. If the stream coming in from Kafka has a higher messages-per-second
rate than the Spark job can process, then it should simply build up a
backlog in Spark, provided the RDDs are cached on disk using persist().
Right?
Thanks,
Adding to it, please share experiences of building an enterprise grade
product based on Spark Streaming.
I am exploring Spark Streaming for enterprise software and am cautiously
optimistic about it. I see huge potential to improve debuggability of Spark.
- Original Message -
From: Tim
PM, Tim Smith secs...@gmail.com wrote:
I don't have anything in production yet but I now at least have a
stable (running for more than 24 hours) streaming app. Earlier, the
app would crash for all sorts of reasons. Caveats/setup:
- Spark 1.0.0 (I have no input flow control unlike Spark 1.1
at 5:50 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi Tim
Just curious to know ; Which Kafka Consumer you have used ?
Dib
On Sep 18, 2014 4:40 AM, Tim Smith secs...@gmail.com wrote:
Thanks :)
On Wed, Sep 17, 2014 at 2:10 PM, Paul Wais pw...@yelp.com wrote:
Thanks Tim
What kafka receiver are you using? Did you build a new jar for your
app with the latest streaming-kafka code for 1.1?
On Thu, Sep 18, 2014 at 11:47 AM, JiajiaJing jj.jing0...@gmail.com wrote:
Hi Spark Users,
We just upgraded our Spark version from 1.0 to 1.1, and we are trying to
re-run all
Posting your code would be really helpful in figuring out gotchas.
On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell matt.narr...@gmail.com wrote:
Hey,
Spark 1.1.0
Kafka 0.8.1.1
Hadoop (YARN/HDFS) 2.5.1
I have a five partition Kafka topic. I can create a single Kafka receiver
via
);
}
});
… and further Spark functions ...
On Sep 23, 2014, at 2:55 PM, Tim Smith secs...@gmail.com wrote:
Posting your code would be really helpful in figuring out gotchas.
On Tue, Sep 23, 2014 at 9:19 AM, Matt Narrell matt.narr...@gmail.com wrote:
Hey,
Spark 1.1.0
Kafka
once, I
don’t receive any messages.
I’ll dig through the logs, but at first glance yesterday I didn’t see
anything suspect. I’ll have to look closer.
mn
On Sep 23, 2014, at 6:14 PM, Tim Smith secs...@gmail.com wrote:
Maybe post the before-code as in what was the code before you did
(that are
purported to work), but no luck.
mn
On Sep 24, 2014, at 11:27 AM, Tim Smith secs...@gmail.com wrote:
Maybe differences between JavaPairDStream and JavaPairReceiverInputDStream?
On Wed, Sep 24, 2014 at 7:46 AM, Matt Narrell matt.narr...@gmail.com
wrote:
The part that works
of the 'knobs' I describe here to see if that would
help?
http://www.virdata.com/tuning-spark/
-kr, Gerard.
On Thu, Feb 12, 2015 at 8:44 AM, Tim Smith secs...@gmail.com wrote:
Just read the thread Are these numbers abnormal for spark streaming?
and I think I am seeing similar results
On Thu, Feb 12, 2015 at 6:29 PM, Tim Smith secs...@gmail.com wrote:
Hi Gerard,
Great write-up and really good guidance in there.
I have to be honest, I don't know why but setting # of partitions for
each dStream to a low number (5-10) just causes the app to choke/crash.
Setting it to 20 gets
: 3.596 s)
15/02/13 06:27:03 INFO JobScheduler: Total delay: 3.905 s for time
142380882 ms (execution: 3.861 s)
15/02/13 06:27:24 INFO JobScheduler: Total delay: 4.068 s for time
142380884 ms (execution: 4.026 s)
On Thu, Feb 12, 2015 at 9:54 PM, Tim Smith secs...@gmail.com wrote:
TD - I
.
Besides, does setting the partition count to 1 for each dStream mean
dstream.repartition(1)? If so, I think it will still introduce a shuffle and
move all the data into one partition.
Thanks
Saisai
2015-02-13 13:54 GMT+08:00 Tim Smith secs...@gmail.com:
TD - I will try count() and report back. Meanwhile
On Spark 1.2:
I am trying to capture # records read from a kafka topic:
val inRecords = ssc.sparkContext.accumulator(0, "InRecords")
..
kInStreams.foreach( k =>
{
k.foreachRDD ( rdd => inRecords += rdd.count().toInt )
inRecords.value
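For what it's worth, a minimal sketch of reading the value on the driver after each batch's count has been folded in (same names as above):

kInStreams.foreach { k =>
  k.foreachRDD { rdd =>
    inRecords += rdd.count().toInt
    println(s"records so far: ${inRecords.value}")   // .value is only readable on the driver
  }
}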
Question
at 11:16 PM, Tim Smith secs...@gmail.com wrote:
On Spark 1.2 (have been seeing this behaviour since 1.0), I have a
streaming app that consumes data from Kafka and writes it back to Kafka
(different topic). My big problem has been Total Delay. While execution
time is usually window size (in seconds
On Spark 1.2 (have been seeing this behaviour since 1.0), I have a
streaming app that consumes data from Kafka and writes it back to Kafka
(different topic). My big problem has been Total Delay. While execution
time is usually window size (in seconds), the total delay ranges from
minutes to
+1 for writing the Spark output to Kafka. You can then hang multiple
compute/storage frameworks off Kafka. I am using a similar pipeline to feed
ElasticSearch and HDFS in parallel. Allows modularity, you can take down
ElasticSearch or HDFS for maintenance without losing (except for some edge
My streaming app runs fine for a few hours and then starts spewing Could
not compute split, block input-xx-xxx not found errors. After this,
jobs start to fail and batches start to pile up.
My question isn't so much about why this error occurs but rather, how do I
trace what leads to this error? I
acc.value
res1: Int = 1000
The Stage details page shows:
On 20.2.2015. 9:25, Tim Smith wrote:
On Spark 1.2:
I am trying to capture # records read from a kafka topic:
val inRecords = ssc.sparkContext.accumulator(0, "InRecords")
..
kInStreams.foreach( k
provide / relevant code?
On Fri, Jun 19, 2015 at 1:23 PM, Tim Smith secs...@gmail.com wrote:
Update on performance of the new API: the new code using the
createDirectStream API ran overnight and when I checked the app state in
the morning, there were massive scheduling delays :(
Not sure why
.
I'd remove the repartition. If you weren't doing any shuffles in the old
job, and are doing a shuffle in the new job, it's not really comparable.
On Fri, Jun 19, 2015 at 8:16 PM, Tim Smith secs...@gmail.com wrote:
On Fri, Jun 19, 2015 at 5:15 PM, Tathagata Das t...@databricks.com
wrote
))
dataout.foreachRDD ( rdd => rdd.foreachPartition(rec => {
myOutputFunc.write(rec) })
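For reference, a sketch of the per-partition write pattern the snippet above is going for; the writer class is hypothetical, and the point is to construct it inside foreachPartition so it lives on the executor instead of being serialized from the driver:

dataout.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val writer = new MyKafkaWriter()                 // hypothetical executor-local writer
    partition.foreach(rec => writer.write(rec))
    writer.close()
  }
}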
Thanks,
Tim
If that's the case I'd try direct stream without the repartitioning.
On Fri, Jun 19, 2015 at 6:43 PM, Tim Smith secs...@gmail.com wrote:
Essentially, I went from:
k = createStream .
val
in
Spark 1.4.0. Bonus: you get a fancy new streaming UI with more awesome
stats. :)
On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith secs...@gmail.com wrote:
Hi,
I just switched from createStream to the createDirectStream API for
kafka and while things otherwise seem happy, the first thing I noticed
Hi,
I just switched from createStream to the createDirectStream API for
kafka and while things otherwise seem happy, the first thing I noticed is
that stream/receiver stats are gone from the Spark UI :( Those stats were
very handy for keeping an eye on health of the app.
What's the best way to
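One hedged workaround sketch (not from the thread): with the direct approach you can log the consumed offsets per batch yourself via HasOffsetRanges and feed them to whatever monitoring you use; directStream below is a placeholder for the DStream returned by createDirectStream.

import org.apache.spark.streaming.kafka.HasOffsetRanges

directStream.foreachRDD { rdd =>
  // The cast must be done on the direct stream's RDD, before any shuffle.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r =>
    println(s"${r.topic} ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}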
app. Yes, for the record, this is with
CDH 5.4.1 and Spark 1.3.
On Thu, Jun 18, 2015 at 7:05 PM, Tim Smith secs...@gmail.com wrote:
Thanks for the super-fast response, TD :)
I will now go bug my hadoop vendor to upgrade from 1.3 to 1.4. Cloudera,
are you listening? :D
On Thu, Jun 18
Hi,
I am using Spark 1.3 (CDH 5.4.4). What's the recipe for setting a minimum
output file size when writing out from SparkSQL? So far, I have tried:
--x-
import sqlContext.implicits._
sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
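For what it's worth, the usual workaround is a sketch like the following (table name and output path are placeholders): cut the number of output partitions before writing, so each file comes out larger.

val df = sqlContext.sql("SELECT * FROM allactivitydata")
df.repartition(10).saveAsParquetFile("/output/path")   // fewer partitions => fewer, larger files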
I am curious too - any comparison between the two? Looks like one is
Datastax-sponsored and the other is from Cloudera. Other than that, any
major/core differences in design/approach?
Thanks,
Tim
On Mon, Sep 28, 2015 at 8:32 AM, Ramirez Quetzal wrote:
> Anyone has
Spark 1.3.0 (CDH 5.4.4)
scala> sqlContext.sql("SHOW TABLES").collect
res18: Array[org.apache.spark.sql.Row] = Array([allactivitydata,true],
[sample_07,false], [sample_08,false])
sqlContext.sql("SELECT COUNT(*) from allactivitydata").collect
res19: Array[org.apache.spark.sql.Row] =
I am sharing this code snippet since I spent quite some time figuring it
out and I couldn't find any examples online. Between the Kinesis
documentation, tutorial on AWS site and other code snippets on the
Internet, I was confused about structure/format of the messages that Spark
fetches from
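For context, a hedged sketch of the kind of stream setup the mail refers to (app/stream names, endpoint and region are placeholders); the records arrive as Array[Byte] and typically need explicit decoding:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val kinesisStream = KinesisUtils.createStream(
  ssc, "myApp", "myStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.LATEST, Seconds(10), StorageLevel.MEMORY_AND_DISK_2)

val lines = kinesisStream.map(bytes => new String(bytes, "UTF-8"))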
http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
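A quick sketch of the UDF approach from that link (df is a placeholder DataFrame):

import org.apache.spark.sql.functions.udf

val uuid = udf(() => java.util.UUID.randomUUID().toString)
val withId = df.withColumn("row_id", uuid())   // a new UUID per row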
On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson
wrote:
> Hi,
>
> What's the best way to assign a truly unique row ID (rather than a hash)
> to a
Hi,
I am trying to figure out the API to initialize a gaussian mixture model
using either centroids created by K-means or previously calculated GMM
model (I am aware that you can "save" a model and "load" in later but I am
not interested in saving a model to a filesystem).
The Spark MLlib API
sh
> we can get this feature in Spark 2.3.
>
> Thanks
> Yanbo
>
> On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith <secs...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to figure out the API to initialize a gaussian mixture model
>> using either centro
One, I think you should take this to the Spark developer list.
Two, I suspect broadcast variables aren't the best solution for the use
case you describe. Maybe an in-memory data/object/file store like Tachyon
is a better fit.
Thanks,
Tim
On Tue, May 2, 2017 at 11:56 AM, Nipun Arora