Ah, we should just add this directly in PySpark - it's as simple as the code Shivaram just wrote.
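In the meantime, a one-line helper in user code can paper over the gap. A minimal sketch built on Shivaram's trick below (the name num_partitions is just illustrative, and _jrdd is an internal PySpark attribute, so treat this as a stopgap rather than a stable API):

    def num_partitions(rdd):
        # Delegate to the underlying Java RDD's partition list.
        # _jrdd is a PySpark internal; don't lean on it long-term.
        return rdd._jrdd.splits().size()

    num_partitions(sc.parallelize([1, 2, 3, 4], 2))  # -> 2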
- Patrick

On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
<shivaram.venkatara...@gmail.com> wrote:
> There is no direct way to get this in pyspark, but you can get it from the
> underlying Java RDD. For example:
>
> a = sc.parallelize([1, 2, 3, 4], 2)
> a._jrdd.splits().size()
>
> On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>>
>> Mark,
>>
>> This appears to be a Scala-only feature. :(
>>
>> Patrick,
>>
>> Are we planning to add this to PySpark?
>>
>> Nick
>>
>> On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>>>
>>> It's much simpler: rdd.partitions.size
>>>
>>> On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>>>
>>>> Hey there fellow Dukes of Data,
>>>>
>>>> How can I tell how many partitions my RDD is split into?
>>>>
>>>> I'm interested in knowing because, from what I gather, having a good
>>>> number of partitions is good for performance. If I'm looking to
>>>> understand how my pipeline is performing, say for a parallelized write
>>>> out to HDFS, knowing how many partitions an RDD has would be a good
>>>> thing to check.
>>>>
>>>> Is that correct?
>>>>
>>>> I could not find an obvious method or property to see how my RDD is
>>>> partitioned. Instead, I devised the following thingy:
>>>>
>>>> def f(idx, itr): yield idx
>>>>
>>>> rdd = sc.parallelize([1, 2, 3, 4], 4)
>>>> rdd.mapPartitionsWithIndex(f).count()
>>>>
>>>> Frankly, I'm not sure what I'm doing here, but this seems to give me
>>>> the answer I'm looking for. Derp. :)
>>>>
>>>> So in summary, should I care about how finely my RDDs are partitioned?
>>>> And how would I check on that?
>>>>
>>>> Nick
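P.S. For what it's worth, Nick's mapPartitionsWithIndex trick above is sound: the function is called once per partition with the partition's index and an iterator over its elements, so counting the yielded indices counts the partitions. A self-contained version of it (the local[4] master and app name are just for this demo):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "partition-count-demo")  # demo-only master/app name

    def f(idx, itr):
        # Runs once per partition; yield only the partition index
        # and ignore the data itself.
        yield idx

    rdd = sc.parallelize([1, 2, 3, 4], 4)
    print(rdd.mapPartitionsWithIndex(f).count())  # 4 -- one element yielded per partition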