I generally use the Play Framework APIs for complex JSON structures.
https://www.playframework.com/documentation/2.5.x/ScalaJson#Json
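For reference, a minimal sketch of parsing a nested structure with Play JSON; the Event case class and field names are made up for illustration, not from the thread:

import play.api.libs.json._

// Hypothetical record type for illustration
case class Event(id: Long, payload: Option[String])
object Event {
  // Macro-generated Reads/Writes for the case class
  implicit val format: Format[Event] = Json.format[Event]
}

val raw = """{"id": 1, "payload": "hello", "meta": {"source": "s3"}}"""
val js: JsValue = Json.parse(raw)

// Navigate nested fields with \ and validate into the case class
val source: JsLookupResult = js \ "meta" \ "source"
val event: JsResult[Event] = js.validate[Event]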
On Wed, Oct 12, 2016 at 11:34 AM, Kappaganthu, Sivaram (ES) <
sivaram.kappagan...@adp.com> wrote:
> Hi,
>
>
>
> Does this mean that handling any Json with kind of bel
Hi,
I am upgrading my jobs to Spark 1.6 and am running into shuffle issues. I
have tried all the options and am now falling back to the legacy memory
model, but I am still running into the same issue.
I have set spark.shuffle.blockTransferService to nio.
16/10/12 06:00:10 INFO MapOutputTrackerMaster: Size of outp
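For context, the settings described above would typically be applied on the SparkConf like this on 1.6 (a sketch only; the fraction values are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")           // fall back to the pre-1.6 memory model
  .set("spark.shuffle.memoryFraction", "0.4")          // only honoured in legacy mode
  .set("spark.storage.memoryFraction", "0.4")          // only honoured in legacy mode
  .set("spark.shuffle.blockTransferService", "nio")    // default is netty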
No, I meant each record should be on a single line, but it also supports an
array type as a root wrapper of JSON objects.
If you need to parse multiple lines, I have a reference here.
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/
2016-10-12 15:04 GMT+09:00 Kappaganthu, Siva
Hi,
Does this mean that handling any JSON with the kind of schema below is not a
good fit for Spark? I have a requirement to parse JSON like the one below,
which spans multiple lines. What's the best way to parse JSON of this kind?
Please suggest.
root
|-- maindate: struct (nullable = true)
|
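One common workaround for multi-line JSON (along the lines of the link above) is to read each file whole and hand Spark an RDD of single-line strings; a rough sketch, assuming Spark 2.0 and one JSON document per file (the path is a placeholder):

// Read each file as a single string, collapse newlines, then let Spark infer the schema
val jsonPerFile = spark.sparkContext
  .wholeTextFiles("s3a://my-bucket/multiline-json/")
  .map { case (_, content) => content.replace("\n", " ") }

val df = spark.read.json(jsonPerFile)   // RDD[String] overload
df.printSchema()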
Hey, I just found a promo code for Spark Summit Europe that saves 20%. It’s
"Summit16" - I love Brussels and just registered! Who’s coming with me to
get their Spark on?!
Cheers,
Andrew
There are some known issues in 1.6.0, e.g.,
https://issues.apache.org/jira/browse/SPARK-12591
Could you try 1.6.1?
On Tue, Oct 11, 2016 at 9:55 AM, Joey Echeverria wrote:
> I tried wrapping my Tuple class (which is generated by Avro) in a
> class that implements Serializable, but now I'm gettin
I have a question about how to use textFileStream; I have included a code
snippet below. I am trying to read .gz files that are being put into my
bucket. I do not want to specify the schema; I have a similar feature that
just does spark.read.json(inputBucket). This works great and if I can get t
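A rough sketch of the textFileStream approach, assuming Spark 2.0, schema inference per batch, and that .gz is handled by the usual Hadoop text input codecs (bucket paths are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
val lines = ssc.textFileStream("s3a://my-bucket/incoming/")   // picks up files newly landing in the directory

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val df = spark.read.json(rdd)                             // schema inferred per batch
    df.write.mode("append").parquet("s3a://my-bucket/parsed/")
  }
}

ssc.start()
ssc.awaitTermination()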
Have a look here:
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
It will probably help a bit.
Best regards,
Denis
On 11 Oct 2016 at 23:49, "Xiaoye Sun" wrote:
> Hi,
>
> Currently, I am running Spark using the standalone scheduler with 3
> m
Hi,
Currently, I am running Spark using the standalone scheduler with 3
machines in our cluster. For these three machines, one runs Spark Master
and the other two run Spark Worker.
We run a machine learning application on this small-scale testbed. A
particular stage in my application is divided i
Try to build a fat (uber) jar which includes all dependencies.
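A minimal build.sbt sketch for that using sbt-assembly (just one common way to do it; the names and versions below are placeholders):

// build.sbt
name := "spark-app"
scalaVersion := "2.11.8"

// Spark is provided by the cluster, so keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1" % "provided"

// project/plugins.sbt:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
// then build with: sbt assembly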
On 11 Oct 2016 at 22:11, "doruchiulan" wrote:
> Hi,
>
> I have a problem that's bothering me for a few days, and I'm pretty out of
> ideas.
>
> I built a Spark docker container where Spark runs in standalone mode. Both
>
Hi,
I have a problem that's bothering me for a few days, and I'm pretty out of
ideas.
I built a Spark docker container where Spark runs in standalone mode. Both
master and worker are started there.
Now I tried to deploy my Spark Scala app in a separate container (same
machine) where I pass the Sp
All,
We have a use case with 2 Spark Streaming jobs in the same EMR cluster.
I am thinking of allowing multiple streaming contexts and running them as 2
separate spark-submits with wait-for-app-completion set to false.
With this, the failure detection and monitoring seem obscure and don't
seem to
I tried wrapping my Tuple class (which is generated by Avro) in a
class that implements Serializable, but now I'm getting a
ClassNotFoundException in my Spark application. The exception is
thrown while trying to deserialize checkpoint state:
https://gist.github.com/joey/7b374a2d483e25f15e20c0c4cb8
I don't believe it will ever scale to spin up a whole distributed job to
serve one request. You could possibly look at the bits in mllib-local. You
might do well to export as something like PMML, either with Spark's export
or JPMML, and then load it into a web container and score it, without Spark
(pos
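For what it's worth, a tiny sketch of Spark's built-in PMML export (it works for a subset of mllib models such as KMeans and the linear models; the data and paths here are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data  = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
val model = KMeans.train(data, k = 2, maxIterations = 10)

// Write PMML, then score it outside Spark (e.g. with a JPMML evaluator in a web container)
model.toPMML("/tmp/kmeans.pmml")
val pmmlXml: String = model.toPMML()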
I am getting this error sometimes when I run PySpark with Spark 2.0.0:
App > at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
App > at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
App > at java.lang.reflect.Method.invoke(Meth
http://spark.apache.org/docs/latest/configuration.html
"This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see
below)."
On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane wrote:
> Hi,
>
> Is it possible to limit th
Hi,
Is it possible to limit the size of the batches returned by the Kafka consumer
for Spark Streaming?
I am asking because the first batch I get has hundred of millions of records
and it takes ages to process and checkpoint them.
Thank you.
Samy
-
Hi all,
so I have a model which has been stored in S3. And I have a Scala webapp
which for certain requests loads the model and transforms submitted data
against it.
I'm not sure how to run this quickly on a single instance though. At the
moment Spark is being bundled up with the web app in an ub
What are you expecting? If you want to update every key on every
batch, it's going to be linear on the number of keys... there's no
real way around that.
On Tue, Oct 11, 2016 at 9:49 AM, Daan Debie wrote:
> That's nice and all, but I'd rather have a solution involving mapWithState
> of course :)
That's nice and all, but I'd rather have a solution involving mapWithState
of course :) I'm just wondering why it doesn't support this use case yet.
On Tue, Oct 11, 2016 at 3:41 PM, Cody Koeninger wrote:
> They're telling you not to use the old function because it's linear on the
> total number
They're telling you not to use the old function because it's linear on the
total number of keys, not keys in the batch, so it's slow.
But if that's what you really want, go ahead and do it, and see if it
performs well enough.
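For reference, a sketch of the old-style approach with updateStateByKey, which does touch every key in the state on every batch (the DStream name and types are made up):

// eventsByKey: DStream[(String, Int)] -- hypothetical keyed event counts
def update(newEvents: Seq[Int], state: Option[Int]): Option[Int] = {
  // Called every batch for every known key, even when newEvents is empty
  Some(state.getOrElse(0) + newEvents.sum)
}

val running = eventsByKey.updateStateByKey[Int](update _)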
On Oct 11, 2016 6:28 AM, "DandyDev" wrote:
Hi there,
I've built a Sp
The new underlying Kafka consumer prefetches data and is generally heavier
weight, so it is cached on executors. The group id is part of the cache key. I
assumed Kafka users would use different group ids for consumers they wanted
to be distinct, since anything else would cause problems even with the norma
Good point, I changed the group id to be unique for the separate streams
and now it works. Thanks!
Although changing this is OK for us, I am interested in the why :-) With
the old connector this was not a problem, nor is it, AFAIK, with the pure
Kafka consumer API.
2016-10-11 14:30 GMT+02:00 Cody Koe
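For anyone hitting the same thing, the fix boils down to a distinct group.id per stream, roughly as follows (ssc is the StreamingContext; broker and topic names are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils

def params(groupId: String): Map[String, Object] = Map(
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> groupId)

// Different group.id per stream, so the cached consumers on the executors don't collide
val streamA = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), params("app-topic-a")))
val streamB = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicB"), params("app-topic-b")))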
Just out of curiosity, have you tried using separate group ids for the
separate streams?
On Oct 11, 2016 4:46 AM, "Matthias Niehoff"
wrote:
> I stripped down the job to just consume the stream and print it, without
> avro deserialization. When I only consume one topic, everything is fine. As
> s
Similarly a Java DStream has a dstream method you can call to get the
underlying dstream.
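On the Scala side, the documented commit pattern looks roughly like this, with stream being the direct stream returned by KafkaUtils.createDirectStream:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // commit back to Kafka only after the work for this batch is done
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}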
On Oct 11, 2016 2:54 AM, "static-max" wrote:
> Hi Cody,
>
> thanks, rdd.rdd() did the trick. I now have the offsets via
> OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
>
> But how ca
Hello team, I found and resolved the issue. In case someone runs into the
same problem, this was the cause:
>> Each node was allocated 1397 MB of memory for storage.
16/10/11 13:16:58 INFO storage.MemoryStore: MemoryStore started with
capacity 1397.3 MB
>> However, my RDD exceeded the storage lim
I re-ran the job with DEBUG Log Level for org.apache.spark, kafka.consumer
and org.apache.kafka. Please find the output here:
http://pastebin.com/VgtRUQcB
most of the delay is introduced by *16/10/11 13:20:12 DEBUG RecurringTimer:
Callback for JobGenerator called at time x*, which repeats multiple
Hi there,
I've built a Spark Streaming app that accepts certain events from Kafka, and
I want to keep some state between the events. So I've successfully used
mapWithState for that. The problem is, that I want the state for keys to be
updated on every batchInterval, because "lack" of events is als
To explain further: when our server is started, we also start a Spark
cluster. Our server is an OSGi environment, and we are calling the
SqlContext.SqlContext().stop() method to stop the master in the OSGi
deactivate method.
On Tue, Oct 11, 2016 at 3:27 PM, Gimantha Bandara wrote:
> H
You should have just tried it and let us know what your experience was!
Anyway, after spending long hours on this problem, I realized this is
actually a classloader problem.
If you use spark-submit this exception should go away, but you haven't told
us how you are submitting a job, such that yo
@Song, I have called an action but it did not cache, as you can see in the
screenshot provided in my original email. It has cached to disk but not
memory.
@Chin Wei Low, I have 15GB memory allocated which is more than the dataset
size.
Any other suggestion please?
Kind regards,
Guru
On 11 Oc
That's a good point about shuffle data compression. Still, it would be good
to benchmark the ideas behind https://github.com/apache/spark/pull/12761 I
think.
For many datasets, even within one partition the gradient sums etc can
remain very sparse. For example Criteo DAC data is extremely sparse -
Hi all,
When we forcefully shut down the server, we observe the following
exception intermittently. What could be the reason? Please note that this
exception is printed in the logs multiple times and stops our server from
shutting down as soon as we trigger a forceful shutdown. What could be the r
I stripped down the job to just consume the stream and print it, without
Avro deserialization. When I only consume one topic, everything is fine. As
soon as I add a second stream, it gets stuck after about 5 minutes. So it
basically boils down to:
val kafkaParams = Map[String, String](
"bootstrap
Hi Selvam,
Is your 35GB parquet file split up into multiple S3 objects or just one big
Parquet file?
If it's just one big file, then I believe only one executor will be able to
work on it until some job action partitions the data into smaller chunks.
On 11 October 2016 at 06:03, Selvam Raman wr
Hi Cody,
thanks, rdd.rdd() did the trick. I now have the offsets via
OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
But how can you commit the offset to Kafka?
Casting the JavaInputDStream throws an CCE:
CanCommitOffsets cco = (CanCommitOffsets) directKafkaStream; // Throw
Hi,
I am currently running this code from my IDE (Eclipse). I tried adding the
scope "provided" to the dependency without any effect. Should I build this
and submit it using the spark-submit command?
Thanks
Vaibhav
On 11 October 2016 at 04:36, Jakob Odersky wrote:
> Just thought of another pote
This Job will fail after about 5 minutes:
object SparkJobMinimal {
  // read Avro schemas
  var stream = getClass.getResourceAsStream("/avro/AdRequestLog.avsc")
  val avroSchemaAdRequest =
    scala.io.Source.fromInputStream(stream).getLines.mkString
  stream.close
  stream = getClass.getResourceAsSt