Ah, we should just add this directly in PySpark - it's as simple as the code Shivaram just wrote.
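In the meantime, a one-line helper in user code can paper over the gap. A minimal sketch built on Shivaram's trick below (the name num_partitions is just illustrative, and _jrdd is an internal PySpark attribute, so treat this as a stopgap rather than a stable API):

    def num_partitions(rdd):
        # Delegate to the underlying Java RDD's partition list.
        # _jrdd is a PySpark internal; don't lean on it long-term.
        return rdd._jrdd.splits().size()

    num_partitions(sc.parallelize([1, 2, 3, 4], 2))  # -> 2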
- Patrick

On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
<shivaram.venkatara...@gmail.com> wrote:
> There is no direct way to get this in pyspark, but you can get it from the
> underlying Java RDD. For example:
>
> a = sc.parallelize([1, 2, 3, 4], 2)
> a._jrdd.splits().size()
>
> On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>>
>> Mark,
>>
>> This appears to be a Scala-only feature. :(
>>
>> Patrick,
>>
>> Are we planning to add this to PySpark?
>>
>> Nick
>>
>> On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>>
>>> It's much simpler: rdd.partitions.size
>>>
>>> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> Hey there fellow Dukes of Data,
>>>>
>>>> How can I tell how many partitions my RDD is split into?
>>>>
>>>> I'm interested in knowing because, from what I gather, having a good
>>>> number of partitions is good for performance. If I'm looking to
>>>> understand how my pipeline is performing, say for a parallelized write
>>>> out to HDFS, knowing how many partitions an RDD has would be a good
>>>> thing to check.
>>>>
>>>> Is that correct?
>>>>
>>>> I could not find an obvious method or property to see how my RDD is
>>>> partitioned. Instead, I devised the following thingy:
>>>>
>>>> def f(idx, itr): yield idx
>>>>
>>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>>> rdd.mapPartitionsWithIndex(f).count()
>>>>
>>>> Frankly, I'm not sure what I'm doing here, but this seems to give me
>>>> the answer I'm looking for. Derp. :)
>>>>
>>>> So in summary, should I care about how finely my RDDs are partitioned?
>>>> And how would I check on that?
>>>>
>>>> Nick
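P.S. For what it's worth, Nick's mapPartitionsWithIndex trick above is sound: the function is called once per partition with the partition's index and an iterator over its elements, so counting the yielded indices counts the partitions. A self-contained version of it (the local[4] master and app name are just for this demo):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-count-demo")  # demo-only master/app name

    def f(idx, itr):
        # Runs once per partition; yield only the partition index
        # and ignore the data itself.
        yield idx

    rdd = sc.parallelize([1, 2, 3, 4], 4)
    print(rdd.mapPartitionsWithIndex(f).count())  # 4 -- one element yielded per partition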