I detected the error. The final step is to index data in Elasticsearch, and
Elasticsearch in one of the clusters is overwhelmed, so it doesn't work
correctly.
I pointed the cluster that doesn't work at another ES and don't get any
delay.
Sorry, it wasn't related to Spark!
2015-07-31 9:15
It doesn't make sense to me, because the other cluster processes all the
data in less than a second.
Anyway, I'm going to set that parameter.
2015-07-31 0:36 GMT+02:00 Tathagata Das :
Yes, and that is indeed the problem. It is trying to process all the data
in Kafka, and therefore taking 60 seconds. You need to set the rate limits
for that.
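For reference, a minimal sketch of what setting that limit might look like, assuming a Spark 1.4-era setup; the value 1000 is only an illustrative cap, not a recommendation:

```scala
import org.apache.spark.SparkConf

// Sketch: cap direct-stream ingest per Kafka partition so a large backlog
// is spread across many batches instead of one huge 60 s batch.
// "1000" records/sec/partition is an illustrative value; tune it to your load.
val conf = new SparkConf()
  .setAppName("MetricsSpark")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")
```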
On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger wrote:
If you don't set it, there is no maximum rate, it will get everything from
the end of the last batch to the maximum available offset
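In other words, without a rate limit each batch simply covers the whole offset span; a tiny illustrative sketch (the offsets are hypothetical numbers):

```scala
// Without a maximum rate, the records in a batch are just the offset span:
//   records in batch = latest available offset - end of last batch
def recordsInBatch(lastBatchEnd: Long, latestAvailable: Long): Long =
  latestAvailable - lastBatchEnd

// A topic with a 1.2M-message backlog lands entirely in one batch,
// which is why one job can take 60 s while the others take under a second.
```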
On Thu, Jul 30, 2015 at 10:46 AM, Guillermo Ortiz
wrote:
The difference is that one receives more data than the other two. I can
pass the topics through parameters, so I could run the code with one topic
at a time and figure out which one it is, although I guess it's the topic
that gets more data.
Anyway, those delays are pretty weird.
If the jobs are running on different topicpartitions, what's different
about them? Is one of them 120x the throughput of the other, for
instance? You should be able to eliminate cluster config as a difference
by running the same topic partition on the different clusters and comparing
the results.
I have three topics with one partition each, so each job runs on one topic.
2015-07-30 16:20 GMT+02:00 Cody Koeninger :
Just so I'm clear, the difference in timing you're talking about is this:
15/07/30 14:33:59 INFO DAGScheduler: Job 24 finished: foreachRDD at
MetricsSpark.scala:67, took 60.391761 s
15/07/30 14:37:35 INFO DAGScheduler: Job 93 finished: foreachRDD at
MetricsSpark.scala:67, took 0.531323 s
Are th
I read about the maxRatePerPartition parameter; I haven't set it. Could
that be the problem? Although this wouldn't explain why it doesn't work in
only one of the clusters.
2015-07-30 14:47 GMT+02:00 Guillermo Ortiz :
They just share Kafka; the rest of the resources are independent. I tried
stopping one cluster and running only the cluster that isn't working, but
the same thing happens.
2015-07-30 14:41 GMT+02:00 Guillermo Ortiz :
I have some problem with the JobScheduler. I have executed the same code in
two clusters. I read from three topics in Kafka with DirectStream, so I
have three tasks.
I have checked YARN and there aren't more jobs launched.
In the cluster where I have troubles I got these logs:
15/07/30 14:32:58 INFO TaskSet
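For context, the setup described in this thread (three one-partition topics read with the direct API, giving three tasks per batch) would look roughly like this. This is a sketch assuming the Spark 1.4-era KafkaUtils API; the broker address and topic names are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MetricsSpark")
val ssc = new StreamingContext(conf, Seconds(5)) // batch interval is illustrative

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder
val topics = Set("topicA", "topicB", "topicC") // placeholders, one partition each

// The direct stream creates one RDD partition per Kafka topicpartition,
// hence three tasks per batch for three one-partition topics.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
```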