Re: Isolate 1 partition and perform computations

2018-04-16 Thread Thodoris Zois
Hello, Thank you very much for your response Anastasie! Today I think I made it work by dropping partitions in runJob or submitJob (I don’t remember exactly which) in DAGScheduler. If it doesn’t work properly after some tests, I will follow your approach. Thank you, Thodoris

Re: Isolate 1 partition and perform computations

2018-04-16 Thread Anastasios Zouzias
Hi all, I think this is doable using the mapPartitionsWithIndex method of RDD. Example:

val partitionIndex = 0 // Your favorite partition index here
val rdd = spark.sparkContext.parallelize(Array.range(0, 1000))
// Replace elements of partitionIndex with [-10, .. ,0]
val fixed =
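The snippet above is cut off in the archive; a minimal, self-contained sketch of the same idea follows. The partition count of 4, the local master, and the object/app names are assumptions for illustration, not part of the original message.

```scala
import org.apache.spark.sql.SparkSession

object IsolateOnePartition {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; in a real job the session already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("isolate-one-partition")
      .getOrCreate()

    val partitionIndex = 0 // Your favorite partition index here
    val rdd = spark.sparkContext.parallelize(Array.range(0, 1000), 4)

    // Replace the elements of partitionIndex with [-10, ..., 0];
    // every other partition passes through untouched.
    val fixed = rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == partitionIndex) Iterator.range(-10, 1) else iter
    }

    println(fixed.take(3).mkString(",")) // prints -10,-9,-8
    spark.stop()
  }
}
```

Because mapPartitionsWithIndex is a narrow transformation, this still scans every partition; it only changes what each task emits.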

Re: Isolate 1 partition and perform computations

2018-04-14 Thread Thodoris Zois
I forgot to mention that I would like my approach to be independent of the application that the user is going to submit to Spark. Assume that I don’t know anything about the user’s application… I expected to find a simpler approach. I saw in RDD.scala that an RDD is characterized by a list of

Re: Isolate 1 partition and perform computations

2018-04-14 Thread Matthias Boehm
you might want to have a look at using a PartitionPruningRDD to select a subset of partitions by ID. This approach worked very well for multi-key lookups for us [1]. A major advantage over scan-based operations is that, if your source RDD has an existing partitioner, only the relevant
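A hedged sketch of the PartitionPruningRDD approach (note that PartitionPruningRDD is a @DeveloperApi in org.apache.spark.rdd, so it is not part of the stable API; the partition index 0, the 4 slices, and the names below are illustrative assumptions):

```scala
import org.apache.spark.rdd.PartitionPruningRDD
import org.apache.spark.sql.SparkSession

object PrunePartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("prune-partitions")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Array.range(0, 1000), 4)

    // Keep only partition 0; the scheduler launches tasks solely for the
    // partitions that pass the filter, so the other three are never computed.
    val pruned = PartitionPruningRDD.create(rdd, partitionId => partitionId == 0)

    println(pruned.getNumPartitions) // 1
    println(pruned.count())          // 250 (elements 0..249)
    spark.stop()
  }
}
```

Unlike a filter over the whole dataset, pruning happens at the partition level, which is what makes it attractive when the source RDD already has a partitioner that maps keys to known partition IDs.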

Isolate 1 partition and perform computations

2018-04-14 Thread Thodoris Zois
Hello list, I am sorry for sending this message here, but I could not manage to get any response in “users”. For specific purposes I would like to isolate 1 partition of the RDD and perform computations only on it. For instance, suppose that a user asks Spark to create 500 partitions for