> On Thu, Aug 19, 2021 at 5:43 PM Pedro Tuero wrote:
>
>> Hi, I'm sorry, the problem was really silly: in the test the number of
> On Tue, Aug 17, 2021 at 4:14 PM Pedro Tuero wrote:
Context: spark-core_2.12-3.1.1
Testing with Maven and Eclipse.

I'm modifying a project and a test stops working as expected.
The difference is in the parameters passed to the aggregateByKey function
of JavaPairRDD.
JavaSparkContext is created this way:

    new JavaSparkContext(new SparkConf()
        .setMas
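The construction is cut off above. For reference, here is a minimal local
sketch of the two aggregateByKey overloads on JavaPairRDD; the master,
sample data, and partition count are assumptions for illustration, not
values from the original test:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    import java.util.Arrays;

    public class AggregateByKeyDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("AggregateByKeyDemo")
                .setMaster("local[2]");   // assumed; the original conf is truncated
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

                // Overload 1: zero value + seqFunc + combFunc.
                JavaPairRDD<String, Integer> sums =
                    pairs.aggregateByKey(0, Integer::sum, Integer::sum);

                // Overload 2: an extra int fixes the number of output
                // partitions (4 here is illustrative).
                JavaPairRDD<String, Integer> sumsFixed =
                    pairs.aggregateByKey(0, 4, Integer::sum, Integer::sum);

                System.out.println(sums.collect());
                System.out.println(sumsFixed.getNumPartitions());
            }
        }
    }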
I was reviewing a Spark Java application running on AWS EMR.
The code was like:

    RDD.reduceByKey(func).coalesce(number).saveAsTextFile()

That stage took hours to complete.
I changed it to:

    RDD.reduceByKey(func, number).saveAsTextFile()

and it now takes less than 2 minutes, and the final output is the same.
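A hedged sketch of the two shapes (the data, names, output paths, and the
partition count are illustrative, not the original job):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    import java.util.Arrays;

    public class ReduceThenWrite {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("ReduceThenWrite")
                .setMaster("local[4]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)), 8);

                // Slow shape: reduce at the inherited width, then coalesce
                // the result down for output.
                pairs.reduceByKey(Integer::sum)
                     .coalesce(2)
                     .saveAsTextFile("/tmp/out-coalesced");

                // Fast shape: pass numPartitions to reduceByKey so the
                // shuffle writes directly into the target partition count.
                pairs.reduceByKey(Integer::sum, 2)
                     .saveAsTextFile("/tmp/out-direct");
            }
        }
    }

One plausible mechanism: coalesce without a shuffle folds into the same
stage as the reduce, so it also changes how many tasks perform the
reduce-side work. Per the report above, the second shape cut this stage
from hours to under two minutes with identical output.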
Good question. What I have read is that Spark is not a magician and
can't know how many tasks will be best for your input, so it can fail.
Spark sets the default parallelism to twice the number of cores on the
cluster.
In my jobs, it seemed that using the parallelism inherited from the input
partitions worked better than the default.
* It is not getPartitions() but getNumPartitions().
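As a quick check, a minimal local sketch (setup values assumed) that
prints the default parallelism next to an RDD's partition count, using the
corrected method name:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.Arrays;

    public class PartitionCheck {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("PartitionCheck")
                .setMaster("local[4]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // 6 input partitions stand in for whatever the input yields.
                JavaRDD<Integer> rdd =
                    sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 6);
                System.out.println("defaultParallelism = " + sc.defaultParallelism());
                System.out.println("rdd partitions     = " + rdd.getNumPartitions());
            }
        }
    }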
On Tue, Feb 12, 2019 at 1:08 PM, Pedro Tuero (tuerope...@gmail.com)
wrote:
> And this is happening in every job I run. It is not just one case. If I
> add a forced repartition it works fine, even better than before. But
> On Fri, Feb 8, 2019 at 5:09 PM Pedro Tuero wrote:
I did a repartition to 1 (hardcoded) before the keyBy and it finishes in
1.2 minutes.
The questions remain open, because I don't want to hardcode parallelism.
On Fri, Feb 8, 2019 at 12:50 PM, Pedro Tuero (tuerope...@gmail.com)
wrote:
128 is the default parallelism defined for the cluster.
The question now is why the keyBy operation uses the default parallelism
instead of the number of partitions of the RDD created by the previous step
(5580).
Any clues?
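A hedged diagnostic sketch for this: log the partition count after each
step to see where it drops to the default. The data and key function are
illustrative, and 16 input partitions stand in for the 5580 mentioned
above:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.Arrays;

    public class KeyByPartitions {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("KeyByPartitions")
                .setMaster("local[4]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines =
                    sc.parallelize(Arrays.asList("a:1", "b:2", "a:3"), 16);
                JavaPairRDD<String, String> keyed =
                    lines.keyBy(s -> s.split(":")[0]);

                System.out.println("before keyBy: " + lines.getNumPartitions());
                System.out.println("after keyBy:  " + keyed.getNumPartitions());

                // Workaround reported in the thread: force the count explicitly.
                JavaPairRDD<String, String> forced = keyed.repartition(16);
                System.out.println("after repartition: " + forced.getNumPartitions());
            }
        }
    }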
On Thu, Feb 7, 2019 at 3:30 PM, Pedro Tuero (tuerope...@gmail.com)
wrote:
>> I tested the maximizeResourceAllocation option. When it's enabled, it
>> seems Spark utilizes its cores fully. However, the performance is not so
>> different from the default setting.
>>
>> I'm considering using s3-distcp for uploading files. And, I think
Hi,
I am running a job in Spark (using AWS EMR) and some stages are taking a
lot longer with Spark 2.4 than with Spark 2.3.1:

Spark 2.4: [stage timing screenshot omitted]
Spark 2.3.1: [stage timing screenshot omitted]

With Spark 2.4, the keyBy operation takes more than 10X what it took with
Spark 2.3.1.
It seems to be related
> ...ds for performance tuning.
>
> Do you configure dynamic allocation?
>
> FYI:
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> I've not tested it yet. I guess spark-submit needs to specify the number
> of executors.
>
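For reference, a minimal sketch of the configuration keys involved; all
values are illustrative, and on a real cluster the master and deploy mode
would come from spark-submit:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DynamicAllocationConf {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("DynamicAllocationConf")
                .set("spark.dynamicAllocation.enabled", "true")
                // The external shuffle service is required for dynamic
                // allocation on YARN in this Spark generation.
                .set("spark.shuffle.service.enabled", "true")
                .set("spark.dynamicAllocation.minExecutors", "2")
                .set("spark.dynamicAllocation.maxExecutors", "50");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job body ...
            sc.stop();
        }
    }

With dynamic allocation enabled, spark-submit does not need a fixed
executor count; the min/max bounds replace it.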
Hi guys,
I usually run Spark jobs on AWS EMR.
Recently I switched from AWS EMR label 5.16 to 5.20 (which uses Spark 2.4.0).
I've noticed that a lot of steps are taking longer than before.
I think it is related to the automatic configuration of cores per executor.
In version 5.16, some executors took mo
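One way around the automatic configuration is to pin executor sizing
explicitly; a hedged sketch with illustrative, untuned values:

    import org.apache.spark.SparkConf;

    public class ExplicitExecutorSizing {
        public static void main(String[] args) {
            // Illustrative values only; tune per instance type.
            SparkConf conf = new SparkConf()
                .setAppName("ExplicitExecutorSizing")
                .set("spark.executor.cores", "4")
                .set("spark.executor.instances", "10")
                .set("spark.executor.memory", "8g");
        }
    }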
Hi,
I'm using Spark 2.1.0 on AWS EMR, with the Kryo serializer.
I'm broadcasting a Java class:

    public class NameMatcher {
        private static final Logger LOG =
            LoggerFactory.getLogger(NameMatcher.class);
        private final Splitter splitter;
        private final SetMultimap itemsByWord;
        private final Mu
Hi, I'm trying to broadcast a 2.6 GB map, but I'm getting a weird Kryo
exception.
I tried to set -XX:hashCode=0 on the executor and driver, following this
comment:
https://github.com/broadinstitute/gatk/issues/1524#issuecomment-189368808
But it didn't change anything.
Are you aware of this problem?
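For reference, a minimal sketch of broadcasting with Kryo registration;
the class, data, and buffer value are illustrative. Note that Kryo's
serialization buffer (spark.kryoserializer.buffer.max) is capped below
2 GB, so a single 2.6 GB object cannot be serialized in one piece, which
by itself could explain the exception:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import java.util.HashMap;

    public class KryoBroadcastSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("KryoBroadcastSketch")
                .setMaster("local[2]")
                .set("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
                // Must stay below 2 GB; 1g here is illustrative.
                .set("spark.kryoserializer.buffer.max", "1g");
            conf.registerKryoClasses(new Class<?>[]{ HashMap.class });
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                HashMap<String, String> index = new HashMap<>();
                index.put("word", "postings");
                Broadcast<HashMap<String, String>> bc = sc.broadcast(index);
                System.out.println(bc.value().get("word"));
            }
        }
    }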
Hi guys,
I'm trying to do a job with Spark, using Java.
The thing is, I need to have an index of words of about 3 GB on each
machine, so I'm trying to broadcast custom objects to represent the index
and the interface to it.
I'm using Java standard serialization, so I tried to implement Serializable.
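A minimal sketch of the Java-standard-serialization route; the class name
and fields are hypothetical, not from the original post:

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    // Every field must itself be serializable (or marked transient),
    // or broadcast will fail with a NotSerializableException.
    public class WordIndex implements Serializable {
        private static final long serialVersionUID = 1L;

        private final Map<String, int[]> postingsByWord = new HashMap<>();

        public void add(String word, int[] postings) {
            postingsByWord.put(word, postings);
        }

        public int[] lookup(String word) {
            return postingsByWord.get(word);
        }
    }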