Limit Kafka batch size with Spark Streaming

2016-10-11 Thread Samy Dindane
Hi, Is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them. Thank you. Samy

Re: Can mapWithState state func be called every batchInterval?

2016-10-11 Thread Daan Debie
That's nice and all, but I'd rather have a solution involving mapWithState of course :) I'm just wondering why it doesn't support this use case yet. On Tue, Oct 11, 2016 at 3:41 PM, Cody Koeninger wrote: > They're telling you not to use the old function because it's linear

Re: Limit Kafka batch size with Spark Streaming

2016-10-11 Thread Cody Koeninger
http://spark.apache.org/docs/latest/configuration.html "This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below)." On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane wrote: > Hi, > > Is it
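
A minimal sketch of the two knobs mentioned above for the direct Kafka stream; the rate values here are illustrative assumptions, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("rate-limited-stream")
      // cap each Kafka partition at ~10,000 records per second per batch
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")
      // let the backpressure mechanism adapt the rate once batches complete
      .set("spark.streaming.backpressure.enabled", "true")

Note that backpressure only adjusts the rate after it has feedback from completed batches, so maxRatePerPartition is what bounds the very first (potentially huge) batch.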

Re: Can mapWithState state func be called every batchInterval?

2016-10-11 Thread Cody Koeninger
What are you expecting? If you want to update every key on every batch, it's going to be linear on the number of keys... there's no real way around that. On Tue, Oct 11, 2016 at 9:49 AM, Daan Debie wrote: > That's nice and all, but I'd rather have a solution involving

mllib model in production web API

2016-10-11 Thread Nicolas Long
Hi all, so I have a model which has been stored in S3. And I have a Scala webapp which for certain requests loads the model and transforms submitted data against it. I'm not sure how to run this quickly on a single instance though. At the moment Spark is being bundled up with the web app in an

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-11 Thread Cody Koeninger
Just out of curiosity, have you tried using separate group ids for the separate streams? On Oct 11, 2016 4:46 AM, "Matthias Niehoff" wrote: > I stripped down the job to just consume the stream and print it, without > avro deserialization. When I only consume one

Re: Can mapWithState state func be called every batchInterval?

2016-10-11 Thread Cody Koeninger
They're telling you not to use the old function because it's linear on the total number of keys, not keys in the batch, so it's slow. But if that's what you really want, go ahead and do it, and see if it performs well enough. On Oct 11, 2016 6:28 AM, "DandyDev" wrote: Hi

Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
Hello team, I found and resolved the issue. In case someone runs into the same problem, this was the cause: >> Each node was allocated 1397.3 MB of memory for storage. 16/10/11 13:16:58 INFO storage.MemoryStore: MemoryStore started with capacity 1397.3 MB >> However, my RDD exceeded the storage
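
For anyone hitting the same wall, a minimal sketch of the two usual remedies (rdd here is a stand-in for the actual dataset, and the memory figure is an assumption):

    import org.apache.spark.storage.StorageLevel

    // Option 1: give each executor more memory at submit time, e.g. --executor-memory 4g
    // Option 2: let partitions that do not fit in memory spill to disk instead of
    // being left uncached
    val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()  // an action is still needed to actually materialize the cache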

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-11 Thread Cody Koeninger
Similarly a Java DStream has a dstream method you can call to get the underlying dstream. On Oct 11, 2016 2:54 AM, "static-max" wrote: > Hi Cody, > > thanks, rdd.rdd() did the trick. I now have the offsets via > OffsetRange[] offsets = ((HasOffsetRanges)

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-11 Thread Matthias Niehoff
I re-ran the job with DEBUG Log Level for org.apache.spark, kafka.consumer and org.apache.kafka. Please find the output here: http://pastebin.com/VgtRUQcB most of the delay is introduced by *16/10/11 13:20:12 DEBUG RecurringTimer: Callback for JobGenerator called at time x*, which repeats

Re: Exception while shutting down the Spark master.

2016-10-11 Thread Gimantha Bandara
To explain further: when our server is started, we also start a Spark cluster. Our server is an OSGi environment, and we are calling the SqlContext.SqlContext().stop() method to stop the master in the OSGi deactivate method. On Tue, Oct 11, 2016 at 3:27 PM, Gimantha Bandara

Can mapWithState state func be called every batchInterval?

2016-10-11 Thread DandyDev
Hi there, I've built a Spark Streaming app that accepts certain events from Kafka, and I want to keep some state between the events. So I've successfully used mapWithState for that. The problem is that I want the state for keys to be updated on every batchInterval, because "lack" of events is
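
For reference, mapWithState only invokes the mapping function for keys that appear in the current batch, plus keys whose state is timing out; a timeout is the closest built-in hook for reacting to idle keys. A minimal sketch, assuming a DStream[(String, Int)] called eventStream:

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    val mappingFunc = (key: String, value: Option[Int], state: State[Int]) => {
      if (state.isTimingOut()) {
        // invoked once when the key has been idle for the timeout period
        (key, state.getOption.getOrElse(0))
      } else {
        val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
        state.update(sum)
        (key, sum)
      }
    }

    val stateStream = eventStream.mapWithState(
      StateSpec.function(mappingFunc).timeout(Minutes(10)))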

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-11 Thread Matthias Niehoff
Good point, I changed the group id to be unique for the separate streams and now it works. Thanks! Although changing this is OK for us, I am interested in the why :-) With the old connector this was not a problem, nor is it, afaik, with the pure Kafka consumer API 2016-10-11 14:30 GMT+02:00 Cody

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-11 Thread Cody Koeninger
The new underlying Kafka consumer prefetches data and is generally heavier weight, so it is cached on executors. Group id is part of the cache key. I assumed Kafka users would use different group ids for consumers they wanted to be distinct, since otherwise it would cause problems even with the
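
A minimal sketch of the workaround discussed above for the 0.10 integration: give each stream its own group.id so the cached consumers on the executors do not collide. The broker address, topic names, and ssc are assumptions:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

    val commonParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer])

    val streamA = KafkaUtils.createDirectStream[String, String](ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("topicA"),
        commonParams + ("group.id" -> "job-topicA")))

    val streamB = KafkaUtils.createDirectStream[String, String](ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("topicB"),
        commonParams + ("group.id" -> "job-topicB")))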

one executor runs multiple parallel tasks VS multiple executors each runs one task

2016-10-11 Thread Xiaoye Sun
Hi, Currently, I am running Spark using the standalone scheduler with 3 machines in our cluster. For these three machines, one runs Spark Master and the other two run Spark Worker. We run a machine learning application on this small-scale testbed. A particular stage in my application is divided
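
In standalone mode the split between "few fat executors" and "many slim executors" is driven by spark.executor.cores and spark.cores.max. A sketch of the two layouts, assuming 8-core workers (all numbers are illustrative):

    import org.apache.spark.SparkConf

    // one executor per worker, 8 parallel tasks inside it
    val fewFatExecutors = new SparkConf()
      .set("spark.cores.max", "16")
      .set("spark.executor.cores", "8")
      .set("spark.executor.memory", "16g")

    // eight executors per worker, each running a single task
    val manySlimExecutors = new SparkConf()
      .set("spark.cores.max", "16")
      .set("spark.executor.cores", "1")
      .set("spark.executor.memory", "2g")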

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-11 Thread Matthias Niehoff
I stripped down the job to just consume the stream and print it, without Avro deserialization. When I only consume one topic, everything is fine. As soon as I add a second stream it gets stuck after about 5 minutes. So it basically boils down to: val kafkaParams = Map[String, String](

Exception while shutting down the Spark master.

2016-10-11 Thread Gimantha Bandara
Hi all, When we forcefully shut down the server, we observe the following exception intermittently. What could be the reason? Please note that this exception is printed in the logs multiple times and stops our server from shutting down as soon as we trigger a forceful shutdown. What could be the

Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-11 Thread Nick Pentreath
That's a good point about shuffle data compression. Still, it would be good to benchmark the ideas behind https://github.com/apache/spark/pull/12761 I think. For many datasets, even within one partition the gradient sums etc can remain very sparse. For example Criteo DAC data is extremely sparse

Re: [Spark] RDDs are not persisting in memory

2016-10-11 Thread diplomatic Guru
@Song, I have called an action but it did not cache, as you can see in the screenshot provided in my original email. It has cached to disk but not to memory. @Chin Wei Low, I have 15GB of memory allocated, which is more than the dataset size. Any other suggestions, please? Kind regards, Guru On 11

Re: ClassCastException while running a simple wordCount

2016-10-11 Thread kant kodali
You should have just tried it and let us know what your experience was! Anyway, after spending long hours on this problem I realized this is actually a classloader problem. If you use spark-submit this exception should go away, but you haven't told us how you are submitting a job such that

Spark 2.0.0 Error Caused by: java.lang.IllegalArgumentException: requirement failed: Block broadcast_21_piece0 is already present in the MemoryStore

2016-10-11 Thread sandesh deshmane
I am getting this error sometimes when I run pyspark with Spark 2.0.0 App > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) App > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) App > at

Re: mllib model in production web API

2016-10-11 Thread Sean Owen
I don't believe it will ever scale to spin up a whole distributed job to serve one request. You can look possibly at the bits in mllib-local. You might do well to export as something like PMML either with Spark's export or JPMML and then load it into a web container and score it, without Spark
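
A minimal sketch of the PMML route for mllib model types that mix in PMMLExportable (linear models, k-means, binary logistic regression); the paths are assumptions:

    import org.apache.spark.mllib.classification.LogisticRegressionModel

    val model = LogisticRegressionModel.load(sc, "s3a://my-bucket/models/lr")
    val pmml: String = model.toPMML()     // PMML document as a String
    model.toPMML("/tmp/lr-model.pmml")    // or write it straight to a local path
    // The exported PMML can then be evaluated inside the web app (e.g. with JPMML)
    // without spinning up a Spark job per request.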

Recommended way to run spark streaming in production in EMR

2016-10-11 Thread pandees waran
All, We have a use case in which 2 Spark Streaming jobs run in the same EMR cluster. I am thinking of allowing multiple streaming contexts and running them as 2 separate spark-submits with wait-for-app-completion set to false. With this, failure detection and monitoring seem obscure and don't seem

Re: Map with state keys serialization

2016-10-11 Thread Joey Echeverria
I tried wrapping my Tuple class (which is generated by Avro) in a class that implements Serializable, but now I'm getting a ClassNotFoundException in my Spark application. The exception is thrown while trying to deserialize checkpoint state:

Spark Docker Container - Jars problem when deploying my app

2016-10-11 Thread doruchiulan
Hi, I have a problem that's been bothering me for a few days, and I'm pretty much out of ideas. I built a Spark Docker container where Spark runs in standalone mode. Both master and worker are started there. Now I tried to deploy my Spark Scala app in a separate container (same machine) where I pass the

Re: Map with state keys serialization

2016-10-11 Thread Shixiong(Ryan) Zhu
There are some known issues in 1.6.0, e.g., https://issues.apache.org/jira/browse/SPARK-12591 Could you try 1.6.1? On Tue, Oct 11, 2016 at 9:55 AM, Joey Echeverria wrote: > I tried wrapping my Tuple class (which is generated by Avro) in a > class that implements Serializable,
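
If upgrading alone does not help, a common companion step is to register the Avro-generated class with Kryo so the state objects are serialized with Kryo rather than Java serialization. A sketch only; the class name is hypothetical:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[com.example.avro.Tuple]))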

Anyone attending spark summit?

2016-10-11 Thread Andrew James
Hey, I just found a promo code for Spark Summit Europe that saves 20%. It’s "Summit16" - I love Brussels and just registered! Who’s coming with me to get their Spark on?! Cheers, Andrew

textFileStream dStream to DataFrame issues

2016-10-11 Thread Nick
I have a question about how to use textFileStream; I have included a code snippet below. I am trying to read .gz files that are getting put into my bucket. I do not want to specify the schema; I have a similar feature that just does spark.read.json(inputBucket). This works great and if I can get
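
A minimal sketch of keeping the schema-inference behaviour of spark.read.json while streaming: hand each micro-batch's lines back to the JSON reader inside foreachRDD. The bucket paths, batch interval, and the spark session are assumptions:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
    val lines = ssc.textFileStream("s3a://my-bucket/incoming/")

    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val df = spark.read.json(rdd)   // schema inferred from this batch only
        df.write.mode("append").parquet("s3a://my-bucket/out/")
      }
    }

    ssc.start()
    ssc.awaitTermination()

One caveat: because the schema is inferred per batch, batches with different fields produce DataFrames with different schemas.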

Re: one executor runs multiple parallel tasks VS multiple executors each runs one task

2016-10-11 Thread Denis Bolshakov
Look here http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications Probably it will help a bit. Best regards, Denis On 11 Oct 2016 at 23:49, "Xiaoye Sun" wrote: > Hi, > > Currently, I am running Spark using the

Re: Spark Docker Container - Jars problem when deploying my app

2016-10-11 Thread Denis Bolshakov
Try to build a flat (uber) jar which includes all dependencies. On 11 Oct 2016 at 22:11, "doruchiulan" wrote: > Hi, > > I have a problem that's been bothering me for a few days, and I'm pretty much out of > ideas. > > I built a Spark Docker container where Spark runs in
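
A sketch of what that can look like with sbt-assembly (versions and the extra dependency are assumptions); Spark itself stays "provided" because the cluster already ships it:

    // project/plugins.sbt
    // addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "2.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided",
      // everything else keeps the default scope so it ends up inside the
      // fat jar produced by `sbt assembly`
      "com.typesafe" % "config" % "1.3.0")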

Re: converting hBaseRDD to DataFrame

2016-10-11 Thread Divya Gehlot
Hi Mich, you can also create a DataFrame from an RDD in the following ways: val df = sqlContext.createDataFrame(rdd, schema) val df = sqlContext.createDataFrame(rdd) The article below may also help you: http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/ On 11 October 2016 at
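
A concrete sketch of the first form, assuming the HBase RDD has already been mapped to (rowKey, value) string pairs:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("rowKey", StringType, nullable = false),
      StructField("value", StringType, nullable = true)))

    val rowRDD = hBaseRDD.map { case (key, value) => Row(key, value) }
    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.printSchema()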

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-11 Thread Matthias Niehoff
This job will fail after about 5 minutes:

    object SparkJobMinimal {
      // read Avro schemas
      var stream = getClass.getResourceAsStream("/avro/AdRequestLog.avsc")
      val avroSchemaAdRequest = scala.io.Source.fromInputStream(stream).getLines.mkString
      stream.close
      stream =

Re: ClassCastException while running a simple wordCount

2016-10-11 Thread vaibhav thapliyal
Hi, I am currently running this code from my IDE(Eclipse). I tried adding the scope "provided" to the dependency without any effect. Should I build this and submit it using the spark-submit command? Thanks Vaibhav On 11 October 2016 at 04:36, Jakob Odersky wrote: > Just
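
A sketch of the usual split, assuming the job is otherwise unchanged: keep IDE runs in a single local JVM, and only point at the cluster through spark-submit once the jar is built, which avoids the classloader mismatch described above:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("wordCount")
      .setMaster("local[*]")   // for IDE runs only; drop this when submitting to the cluster
    val sc = new SparkContext(conf)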

Re: Manually committing offset in Spark 2.0 with Kafka 0.10 and Java

2016-10-11 Thread static-max
Hi Cody, thanks, rdd.rdd() did the trick. I now have the offsets via OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges(); But how can you commit the offsets to Kafka? Casting the JavaInputDStream throws a CCE: CanCommitOffsets cco = (CanCommitOffsets) directKafkaStream; //
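
For reference, the Scala shape of the commit pattern looks like the sketch below; in the Java API the CanCommitOffsets cast goes on the underlying stream returned by the dstream() accessor Cody mentions, not on the JavaInputDStream wrapper itself. stream here stands for the direct stream returned by KafkaUtils.createDirectStream:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process and output the batch ...
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }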

Re: Spark S3

2016-10-11 Thread Abhinay Mehta
Hi Selvam, Is your 35GB Parquet file split up into multiple S3 objects or just one big Parquet file? If it's just one big file, then I believe only one executor will be able to work on it until some job action partitions the data into smaller chunks. On 11 October 2016 at 06:03, Selvam Raman
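
A sketch of forcing that split early, right after the read (paths and partition count are illustrative):

    val df = spark.read.parquet("s3a://my-bucket/big-file.parquet")
    val spread = df.repartition(200)   // shuffle the data across many tasks
    spread.write.parquet("s3a://my-bucket/big-file-repartitioned/")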