I generally use the Play Framework APIs for complex JSON structures.
https://www.playframework.com/documentation/2.5.x/ScalaJson#Json
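For reference, a minimal sketch of parsing a nested structure with Play JSON; the Event case class and field names are made up for illustration, not from the thread:

import play.api.libs.json._

// Hypothetical record type for illustration
case class Event(id: Long, payload: Option[String])
object Event {
  // Macro-generated Reads/Writes for the case class
  implicit val format: Format[Event] = Json.format[Event]
}

val raw = """{"id": 1, "payload": "hello", "meta": {"source": "s3"}}"""
val js: JsValue = Json.parse(raw)

// Navigate nested fields with \ and validate into the case class
val source: JsLookupResult = js \ "meta" \ "source"
val event: JsResult[Event] = js.validate[Event]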
On Wed, Oct 12, 2016 at 11:34 AM, Kappaganthu, Sivaram (ES) <
sivaram.kappagan...@adp.com> wrote:
> Hi,
>
>
>
> Does this mean that handling any Json with kind of bel
Hi,
I am upgrading my jobs to Spark 1.6 and am running into shuffle issues. I
have tried all the options and am now falling back to the legacy memory
model, but I am still running into the same issue.
I have set spark.shuffle.blockTransferService to nio.
16/10/12 06:00:10 INFO MapOutputTrackerMaster: Size of outp
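For context, the settings described above would typically be applied on the SparkConf like this on 1.6 (a sketch only; the fraction values are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.useLegacyMode", "true")           // fall back to the pre-1.6 memory model
  .set("spark.shuffle.memoryFraction", "0.4")          // only honoured in legacy mode
  .set("spark.storage.memoryFraction", "0.4")          // only honoured in legacy mode
  .set("spark.shuffle.blockTransferService", "nio")    // default is netty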
No, I meant each record should be on a single line, but it also supports an
array type as a root wrapper of JSON objects.
If you need to parse multiple lines, I have a reference here.
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/
2016-10-12 15:04 GMT+09:00 Kappaganthu, Siva
Hi,
Does this mean that handling any JSON with the kind of schema below is not a
good fit for Spark? I have a requirement to parse JSON like the one below,
which spans multiple lines. What's the best way to parse JSON of this kind?
Please suggest.
root
|-- maindate: struct (nullable = true)
|
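One common workaround for multi-line JSON (along the lines of the link above) is to read each file whole and hand Spark an RDD of single-line strings; a rough sketch, assuming Spark 2.0 and one JSON document per file (the path is a placeholder):

// Read each file as a single string, collapse newlines, then let Spark infer the schema
val jsonPerFile = spark.sparkContext
  .wholeTextFiles("s3a://my-bucket/multiline-json/")
  .map { case (_, content) => content.replace("\n", " ") }

val df = spark.read.json(jsonPerFile)   // RDD[String] overload
df.printSchema()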
Hey, I just found a promo code for Spark Summit Europe that saves 20%. It’s
"Summit16" - I love Brussels and just registered! Who’s coming with me to
get their Spark on?!
Cheers,
Andrew
There are some known issues in 1.6.0, e.g.,
https://issues.apache.org/jira/browse/SPARK-12591
Could you try 1.6.1?
On Tue, Oct 11, 2016 at 9:55 AM, Joey Echeverria wrote:
> I tried wrapping my Tuple class (which is generated by Avro) in a
> class that implements Serializable, but now I'm gettin
I have a question about how to use textFileStream; I have included a code
snippet below. I am trying to read .gz files that are being put into my
bucket. I do not want to specify the schema; I have a similar feature that
just does spark.read.json(inputBucket). This works great and if I can get t
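A rough sketch of the textFileStream approach, assuming Spark 2.0, schema inference per batch, and that .gz is handled by the usual Hadoop text input codecs (bucket paths are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
val lines = ssc.textFileStream("s3a://my-bucket/incoming/")   // picks up files newly landing in the directory

lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val df = spark.read.json(rdd)                             // schema inferred per batch
    df.write.mode("append").parquet("s3a://my-bucket/parsed/")
  }
}

ssc.start()
ssc.awaitTermination()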
Have a look here:
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
It will probably help a bit.
Best regards,
Denis
On 11 Oct 2016 at 23:49, "Xiaoye Sun" wrote:
> Hi,
>
> Currently, I am running Spark using the standalone scheduler with 3
> m
Hi,
Currently, I am running Spark using the standalone scheduler with 3
machines in our cluster. For these three machines, one runs Spark Master
and the other two run Spark Worker.
We run a machine learning application on this small-scale testbed. A
particular stage in my application is divided i
Try to build a fat (uber) jar which includes all dependencies.
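A minimal build.sbt sketch for that using sbt-assembly (just one common way to do it; the names and versions below are placeholders):

// build.sbt
name := "spark-app"
scalaVersion := "2.11.8"

// Spark is provided by the cluster, so keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1" % "provided"

// project/plugins.sbt:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
// then build with: sbt assembly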
On 11 Oct 2016 at 22:11, "doruchiulan" wrote:
> Hi,
>
> I have a problem that's bothering me for a few days, and I'm pretty out of
> ideas.
>
> I built a Spark docker container where Spark runs in standalone mode. Both
>
Hi,
I have a problem that's bothering me for a few days, and I'm pretty out of
ideas.
I built a Spark docker container where Spark runs in standalone mode. Both
master and worker are started there.
Now I tried to deploy my Spark Scala app in a separate container (same
machine) where I pass the Sp
All,
We have a use case with 2 Spark Streaming jobs in the same EMR cluster.
I am thinking of allowing multiple streaming contexts and running them as 2
separate spark-submits with wait-for-app-completion set to false.
With this, the failure detection and monitoring seem obscure and don't
seem to
I tried wrapping my Tuple class (which is generated by Avro) in a
class that implements Serializable, but now I'm getting a
ClassNotFoundException in my Spark application. The exception is
thrown while trying to deserialize checkpoint state:
https://gist.github.com/joey/7b374a2d483e25f15e20c0c4cb8
I don't believe it will ever scale to spin up a whole distributed job to
serve one request. You could possibly look at the bits in mllib-local. You
might do well to export as something like PMML, either with Spark's export
or JPMML, and then load it into a web container and score it, without Spark
(pos
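For what it's worth, a tiny sketch of Spark's built-in PMML export (it works for a subset of mllib models such as KMeans and the linear models; the data and paths here are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data  = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
val model = KMeans.train(data, k = 2, maxIterations = 10)

// Write PMML, then score it outside Spark (e.g. with a JPMML evaluator in a web container)
model.toPMML("/tmp/kmeans.pmml")
val pmmlXml: String = model.toPMML()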
I am getting this error sometimes when I run PySpark with Spark 2.0.0:
App > at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
App > at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
App > at java.lang.reflect.Method.invoke(Meth
http://spark.apache.org/docs/latest/configuration.html
"This rate is upper bounded by the values
spark.streaming.receiver.maxRate and
spark.streaming.kafka.maxRatePerPartition if they are set (see
below)."
On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane wrote:
> Hi,
>
> Is it possible to limit th
Hi,
Is it possible to limit the size of the batches returned by the Kafka consumer
for Spark Streaming?
I am asking because the first batch I get has hundred of millions of records
and it takes ages to process and checkpoint them.
Thank you.
Samy
-
Hi all,
so I have a model which has been stored in S3. And I have a Scala webapp
which for certain requests loads the model and transforms submitted data
against it.
I'm not sure how to run this quickly on a single instance though. At the
moment Spark is being bundled up with the web app in an ub
What are you expecting? If you want to update every key on every
batch, it's going to be linear on the number of keys... there's no
real way around that.
On Tue, Oct 11, 2016 at 9:49 AM, Daan Debie wrote:
> That's nice and all, but I'd rather have a solution involving mapWithState
> of course :)
That's nice and all, but I'd rather have a solution involving mapWithState
of course :) I'm just wondering why it doesn't support this use case yet.
On Tue, Oct 11, 2016 at 3:41 PM, Cody Koeninger wrote:
> They're telling you not to use the old function because it's linear on the
> total number
They're telling you not to use the old function because it's linear on the
total number of keys, not keys in the batch, so it's slow.
But if that's what you really want, go ahead and do it, and see if it
performs well enough.
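For reference, a sketch of the old-style approach with updateStateByKey, which does touch every key in the state on every batch (the DStream name and types are made up):

// eventsByKey: DStream[(String, Int)] -- hypothetical keyed event counts
def update(newEvents: Seq[Int], state: Option[Int]): Option[Int] = {
  // Called every batch for every known key, even when newEvents is empty
  Some(state.getOrElse(0) + newEvents.sum)
}

val running = eventsByKey.updateStateByKey[Int](update _)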
On Oct 11, 2016 6:28 AM, "DandyDev" wrote:
Hi there,
I've built a Sp
The new underlying Kafka consumer prefetches data and is generally heavier
weight, so it is cached on executors. The group id is part of the cache key. I
assumed Kafka users would use different group ids for consumers they wanted
to be distinct, since anything else would cause problems even with the norma
Good point, I changed the group id to be unique for the separate streams
and now it works. Thanks!
Although changing this is OK for us, I am interested in the why :-) With
the old connector this was not a problem, nor is it, AFAIK, with the pure
Kafka consumer API.
2016-10-11 14:30 GMT+02:00 Cody Koe
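For anyone hitting the same thing, the fix boils down to a distinct group.id per stream, roughly as follows (ssc is the StreamingContext; broker and topic names are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils

def params(groupId: String): Map[String, Object] = Map(
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> groupId)

// Different group.id per stream, so the cached consumers on the executors don't collide
val streamA = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), params("app-topic-a")))
val streamB = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicB"), params("app-topic-b")))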
Just out of curiosity, have you tried using separate group ids for the
separate streams?
On Oct 11, 2016 4:46 AM, "Matthias Niehoff"
wrote:
> I stripped down the job to just consume the stream and print it, without
> avro deserialization. When I only consume one topic, everything is fine. As
> s
Similarly a Java DStream has a dstream method you can call to get the
underlying dstream.
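On the Scala side, the documented commit pattern looks roughly like this, with stream being the direct stream returned by KafkaUtils.createDirectStream:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // commit back to Kafka only after the work for this batch is done
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}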
On Oct 11, 2016 2:54 AM, "static-max" wrote:
> Hi Cody,
>
> thanks, rdd.rdd() did the trick. I now have the offsets via
> OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
>
> But how ca
Hello team, I found and resolved the issue. In case someone runs into the
same problem, this was the cause:
>> Each node was allocated 1397 MB of memory for storage.
16/10/11 13:16:58 INFO storage.MemoryStore: MemoryStore started with
capacity 1397.3 MB
>> However, my RDD exceeded the storage lim
I re-ran the job with DEBUG Log Level for org.apache.spark, kafka.consumer
and org.apache.kafka. Please find the output here:
http://pastebin.com/VgtRUQcB
most of the delay is introduced by *16/10/11 13:20:12 DEBUG RecurringTimer:
Callback for JobGenerator called at time x*, which repeats multiple
Hi there,
I've built a Spark Streaming app that accepts certain events from Kafka, and
I want to keep some state between the events. So I've successfully used
mapWithState for that. The problem is, that I want the state for keys to be
updated on every batchInterval, because "lack" of events is als
To explain further: when our server is started, we also start a Spark
cluster. Our server is an OSGi environment, and we are calling the
SqlContext.SqlContext().stop() method to stop the master in the OSGi
deactivate method.
On Tue, Oct 11, 2016 at 3:27 PM, Gimantha Bandara wrote:
> H
You should have just tried it and let us know what your experience was!
Anyway, after spending long hours on this problem, I realized this is
actually a classloader problem.
If you use spark-submit this exception should go away, but you haven't told
us how you are submitting a job, such that yo
@Song, I have called an action but it did not cache, as you can see in the
screenshot provided in my original email. It has cached to disk but not
memory.
@Chin Wei Low, I have 15GB memory allocated which is more than the dataset
size.
Any other suggestion please?
Kind regards,
Guru
On 11 Oc
That's a good point about shuffle data compression. Still, it would be good
to benchmark the ideas behind https://github.com/apache/spark/pull/12761 I
think.
For many datasets, even within one partition the gradient sums etc can
remain very sparse. For example Criteo DAC data is extremely sparse -
Hi all,
When we forcefully shut down the server, we observe the following
exception intermittently. What could be the reason? Please note that this
exception is printed in the logs multiple times and stops our server from
shutting down as soon as we trigger a forceful shutdown. What could be the r
I stripped down the job to just consume the stream and print it, without
Avro deserialization. When I only consume one topic, everything is fine. As
soon as I add a second stream, it gets stuck after about 5 minutes. So it
basically boils down to:
val kafkaParams = Map[String, String](
"bootstrap
Hi Selvam,
Is your 35GB parquet file split up into multiple S3 objects or just one big
Parquet file?
If it's just one big file, then I believe only one executor will be able to
work on it until some job action partitions the data into smaller chunks.
On 11 October 2016 at 06:03, Selvam Raman wr
Hi Cody,
thanks, rdd.rdd() did the trick. I now have the offsets via
OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
But how can you commit the offset to Kafka?
Casting the JavaInputDStream throws an CCE:
CanCommitOffsets cco = (CanCommitOffsets) directKafkaStream; // Throw
Hi,
I am currently running this code from my IDE (Eclipse). I tried adding the
scope "provided" to the dependency without any effect. Should I build this
and submit it using the spark-submit command?
Thanks
Vaibhav
On 11 October 2016 at 04:36, Jakob Odersky wrote:
> Just thought of another pote
This Job will fail after about 5 minutes:
object SparkJobMinimal {
  // read Avro schemas
  var stream = getClass.getResourceAsStream("/avro/AdRequestLog.avsc")
  val avroSchemaAdRequest =
    scala.io.Source.fromInputStream(stream).getLines.mkString
  stream.close
  stream = getClass.getResourceAsSt