> On Thu, Aug 19, 2021 at 5:43 PM Pedro Tuero wrote:
>
>> Hi, I'm sorry, the problem was really silly: in the test the number of
> On Tue, Aug 17, 2021 at 4:14 PM Pedro Tuero wrote:
Context: spark-core_2.12-3.1.1
Testing with Maven and Eclipse.

I'm modifying a project and a test stops working as expected.
The difference is in the parameters passed to the aggregateByKey function
of JavaPairRDD.
JavaSparkContext is created this way:

    new JavaSparkContext(new SparkConf()
        .setMas
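The construction is cut off above. For reference, here is a minimal local
sketch of the two aggregateByKey overloads on JavaPairRDD; the master,
sample data, and partition count are assumptions for illustration, not
values from the original test:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    import java.util.Arrays;

    public class AggregateByKeyDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("AggregateByKeyDemo")
                .setMaster("local[2]");   // assumed; the original conf is truncated
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("b", 3)));

                // Overload 1: zero value + seqFunc + combFunc.
                JavaPairRDD<String, Integer> sums =
                    pairs.aggregateByKey(0, Integer::sum, Integer::sum);

                // Overload 2: an extra int fixes the number of output
                // partitions (4 here is illustrative).
                JavaPairRDD<String, Integer> sumsFixed =
                    pairs.aggregateByKey(0, 4, Integer::sum, Integer::sum);

                System.out.println(sums.collect());
                System.out.println(sumsFixed.getNumPartitions());
            }
        }
    }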
I was reviewing a Spark Java application running on AWS EMR.
The code was like:

    RDD.reduceByKey(func).coalesce(number).saveAsTextFile()

That stage took hours to complete.
I changed it to:

    RDD.reduceByKey(func, number).saveAsTextFile()

and it now takes less than 2 minutes, and the final output is the same.
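A hedged sketch of the two shapes (the data, names, output paths, and the
partition count are illustrative, not the original job):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    import java.util.Arrays;

    public class ReduceThenWrite {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("ReduceThenWrite")
                .setMaster("local[4]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                    new Tuple2<>("a", 1), new Tuple2<>("b", 2), new Tuple2<>("a", 3)), 8);

                // Slow shape: reduce at the inherited width, then coalesce
                // the result down for output.
                pairs.reduceByKey(Integer::sum)
                     .coalesce(2)
                     .saveAsTextFile("/tmp/out-coalesced");

                // Fast shape: pass numPartitions to reduceByKey so the
                // shuffle writes directly into the target partition count.
                pairs.reduceByKey(Integer::sum, 2)
                     .saveAsTextFile("/tmp/out-direct");
            }
        }
    }

One plausible mechanism: coalesce without a shuffle folds into the same
stage as the reduce, so it also changes how many tasks perform the
reduce-side work. Per the report above, the second shape cut this stage
from hours to under two minutes with identical output.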
Good question. What I have read is that Spark is not a magician and
can't know how many tasks will be best for your input, so it can fail.
Spark sets the default parallelism to twice the number of cores on the
cluster.
In my jobs, it seemed that using the parallelism inherited from the input
partitions worked better than the default.
* It is not getPartitions() but getNumPartitions().
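As a quick check, a minimal local sketch (setup values assumed) that
prints the default parallelism next to an RDD's partition count, using the
corrected method name:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.Arrays;

    public class PartitionCheck {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("PartitionCheck")
                .setMaster("local[4]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // 6 input partitions stand in for whatever the input yields.
                JavaRDD<Integer> rdd =
                    sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 6);
                System.out.println("defaultParallelism = " + sc.defaultParallelism());
                System.out.println("rdd partitions     = " + rdd.getNumPartitions());
            }
        }
    }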
On Tue, Feb 12, 2019 at 1:08 PM, Pedro Tuero (tuerope...@gmail.com)
wrote:
> And this is happening in every job I run. It is not just one case. If I
> add a forced repartition it works fine, even better than before. But
> On Fri, Feb 8, 2019 at 5:09 PM Pedro Tuero wrote:
I did a repartition to 1 (hardcoded) before the keyBy and it finishes in
1.2 minutes.
The questions remain open, because I don't want to hardcode parallelism.
On Fri, Feb 8, 2019 at 12:50 PM, Pedro Tuero (tuerope...@gmail.com)
wrote:
128 is the default parallelism defined for the cluster.
The question now is why the keyBy operation uses the default parallelism
instead of the number of partitions of the RDD created by the previous step
(5580).
Any clues?
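A hedged diagnostic sketch for this: log the partition count after each
step to see where it drops to the default. The data and key function are
illustrative, and 16 input partitions stand in for the 5580 mentioned
above:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import java.util.Arrays;

    public class KeyByPartitions {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("KeyByPartitions")
                .setMaster("local[4]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaRDD<String> lines =
                    sc.parallelize(Arrays.asList("a:1", "b:2", "a:3"), 16);
                JavaPairRDD<String, String> keyed =
                    lines.keyBy(s -> s.split(":")[0]);

                System.out.println("before keyBy: " + lines.getNumPartitions());
                System.out.println("after keyBy:  " + keyed.getNumPartitions());

                // Workaround reported in the thread: force the count explicitly.
                JavaPairRDD<String, String> forced = keyed.repartition(16);
                System.out.println("after repartition: " + forced.getNumPartitions());
            }
        }
    }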
On Thu, Feb 7, 2019 at 3:30 PM, Pedro Tuero (tuerope...@gmail.com)
wrote:
>> I tested the maximizeResourceAllocation option. When it's enabled, it
>> seems Spark utilizes its cores fully. However, the performance is not so
>> different from the default setting.
>>
>> I'm considering using s3-distcp for uploading files. And, I think
Hi,
I am running a job in Spark (using AWS EMR) and some stages are taking a
lot longer with Spark 2.4 than with Spark 2.3.1:

Spark 2.4: [stage timing screenshot omitted]
Spark 2.3.1: [stage timing screenshot omitted]

With Spark 2.4, the keyBy operation takes more than 10X what it took with
Spark 2.3.1.
It seems to be related
> ...ds for performance tuning.
>
> Do you configure dynamic allocation?
>
> FYI:
> https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
>
> I've not tested it yet. I guess spark-submit needs to specify the number
> of executors.
>
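For reference, a minimal sketch of the configuration keys involved; all
values are illustrative, and on a real cluster the master and deploy mode
would come from spark-submit:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DynamicAllocationConf {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("DynamicAllocationConf")
                .set("spark.dynamicAllocation.enabled", "true")
                // The external shuffle service is required for dynamic
                // allocation on YARN in this Spark generation.
                .set("spark.shuffle.service.enabled", "true")
                .set("spark.dynamicAllocation.minExecutors", "2")
                .set("spark.dynamicAllocation.maxExecutors", "50");
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job body ...
            sc.stop();
        }
    }

With dynamic allocation enabled, spark-submit does not need a fixed
executor count; the min/max bounds replace it.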
Hi guys,
I usually run Spark jobs on AWS EMR.
Recently I switched from AWS EMR label 5.16 to 5.20 (which uses Spark 2.4.0).
I've noticed that a lot of steps are taking longer than before.
I think it is related to the automatic configuration of cores per executor.
In version 5.16, some executors took mo
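One way around the automatic configuration is to pin executor sizing
explicitly; a hedged sketch with illustrative, untuned values:

    import org.apache.spark.SparkConf;

    public class ExplicitExecutorSizing {
        public static void main(String[] args) {
            // Illustrative values only; tune per instance type.
            SparkConf conf = new SparkConf()
                .setAppName("ExplicitExecutorSizing")
                .set("spark.executor.cores", "4")
                .set("spark.executor.instances", "10")
                .set("spark.executor.memory", "8g");
        }
    }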
Hi,
I'm using Spark 2.1.0 on AWS EMR, with the Kryo serializer.
I'm broadcasting a Java class:

    public class NameMatcher {
        private static final Logger LOG =
            LoggerFactory.getLogger(NameMatcher.class);
        private final Splitter splitter;
        private final SetMultimap itemsByWord;
        private final Mu
Hi, I'm trying to broadcast a 2.6 GB map, but I'm getting a weird Kryo
exception.
I tried to set -XX:hashCode=0 on the executor and driver, following this
comment:
https://github.com/broadinstitute/gatk/issues/1524#issuecomment-189368808
But it didn't change anything.
Are you aware of this problem?
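For reference, a minimal sketch of broadcasting with Kryo registration;
the class, data, and buffer value are illustrative. Note that Kryo's
serialization buffer (spark.kryoserializer.buffer.max) is capped below
2 GB, so a single 2.6 GB object cannot be serialized in one piece, which
by itself could explain the exception:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.broadcast.Broadcast;
    import java.util.HashMap;

    public class KryoBroadcastSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("KryoBroadcastSketch")
                .setMaster("local[2]")
                .set("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
                // Must stay below 2 GB; 1g here is illustrative.
                .set("spark.kryoserializer.buffer.max", "1g");
            conf.registerKryoClasses(new Class<?>[]{ HashMap.class });
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                HashMap<String, String> index = new HashMap<>();
                index.put("word", "postings");
                Broadcast<HashMap<String, String>> bc = sc.broadcast(index);
                System.out.println(bc.value().get("word"));
            }
        }
    }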
Hi guys,
I'm trying to do a job with Spark, using Java.
The thing is, I need to have an index of words of about 3 GB on each
machine, so I'm trying to broadcast custom objects to represent the index
and the interface to it.
I'm using Java standard serialization, so I tried to implement Serializable.
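A minimal sketch of the Java-standard-serialization route; the class name
and fields are hypothetical, not from the original post:

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    // Every field must itself be serializable (or marked transient),
    // or broadcast will fail with a NotSerializableException.
    public class WordIndex implements Serializable {
        private static final long serialVersionUID = 1L;

        private final Map<String, int[]> postingsByWord = new HashMap<>();

        public void add(String word, int[] postings) {
            postingsByWord.put(word, postings);
        }

        public int[] lookup(String word) {
            return postingsByWord.get(word);
        }
    }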