Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Oh, glad to know it's that simple!

Patrick, in your last comment did you mean filter *in*? As in, I start with
one year of data and filter it so I have one day left? I'm assuming in that
case the empty partitions would be for all the days that got filtered out.

Nick

On Monday, March 24, 2014, Patrick Wendell pwend...@gmail.com wrote:

 As Mark said, you can actually access this easily. The main issue I've
 seen from a performance perspective is people having a bunch of really
 small partitions. This will still work, but performance will
 improve if you coalesce the partitions using rdd.coalesce().

 This can happen, for example, if you do a highly selective filter on an
 RDD - for instance, you filter out one day of data from a dataset of a
 year.

 - Patrick




Re: How many partitions is my RDD split into?

2014-03-24 Thread Nicholas Chammas
Mark,

This appears to be a Scala-only feature. :(

Patrick,

Are we planning to add this to PySpark?

Nick


On Mon, Mar 24, 2014 at 12:53 AM, Mark Hamstra m...@clearstorydata.com wrote:

 It's much simpler: rdd.partitions.size







Re: How many partitions is my RDD split into?

2014-03-24 Thread Shivaram Venkataraman
There is no direct way to get this in PySpark, but you can get it from the
underlying Java RDD. For example:

a = sc.parallelize([1,2,3,4], 2)
a._jrdd.splits().size()
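
As a quick sanity check, this agrees with the mapPartitionsWithIndex trick
from earlier in the thread. A minimal sketch, assuming a live SparkContext sc:

a = sc.parallelize([1, 2, 3, 4], 2)
a._jrdd.splits().size()  # 2, via the underlying Java RDD
a.mapPartitionsWithIndex(lambda i, it: [i]).count()  # also 2, pure PySpark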


On Mon, Mar 24, 2014 at 7:46 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Mark,

 This appears to be a Scala-only feature. :(

 Patrick,

 Are we planning to add this to PySpark?

 Nick








Re: How many partitions is my RDD split into?

2014-03-24 Thread Patrick Wendell
Ah, we should just add this directly to PySpark - it's as simple as the
code Shivaram just wrote.

- Patrick
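
A minimal sketch of what that method could look like on PySpark's RDD class,
wrapping the underlying Java RDD exactly as Shivaram's snippet does. The name
getNumPartitions is illustrative here (it is what later Spark releases
eventually shipped):

    def getNumPartitions(self):
        """Return the number of partitions in this RDD."""
        # Delegate to the underlying Java RDD, as in Shivaram's workaround.
        return self._jrdd.splits().size()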

On Mon, Mar 24, 2014 at 1:25 PM, Shivaram Venkataraman
shivaram.venkatara...@gmail.com wrote:
 There is no direct way to get this in PySpark, but you can get it from the
 underlying Java RDD. For example:

 a = sc.parallelize([1,2,3,4], 2)
 a._jrdd.splits().size()








Re: How many partitions is my RDD split into?

2014-03-23 Thread Mark Hamstra
It's much simpler: rdd.partitions.size


On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Hey there fellow Dukes of Data,

 How can I tell how many partitions my RDD is split into?

 I'm interested in knowing because, from what I gather, having an appropriate
 number of partitions is important for performance. If I'm looking to understand
 how my pipeline is performing, say for a parallelized write out to HDFS,
 knowing how many partitions an RDD has would be a good thing to check.

 Is that correct?

 I could not find an obvious method or property to see how my RDD is
 partitioned. Instead, I devised the following thingy:

 # f yields each partition's index exactly once, so counting
 # the results gives the number of partitions.
 def f(idx, itr): yield idx

 rdd = sc.parallelize([1, 2, 3, 4], 4)
 rdd.mapPartitionsWithIndex(f).count()  # returns 4

 Frankly, I'm not sure what I'm doing here, but this seems to give me the
 answer I'm looking for. Derp. :)

 So in summary, should I care about how finely my RDDs are partitioned? And
 how would I check on that?

 Nick


 --
 View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-many-partitions-is-my-RDD-split-into-tp3072.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
As Mark said, you can actually access this easily. The main issue I've
seen from a performance perspective is people having a bunch of really
small partitions. This will still work, but performance will
improve if you coalesce the partitions using rdd.coalesce().

This can happen, for example, if you do a highly selective filter on an
RDD - for instance, you filter out one day of data from a dataset of a
year.

- Patrick
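
A minimal PySpark sketch of that pattern, assuming a live SparkContext sc;
the partition counts here are illustrative:

# Pretend each of 365 partitions holds one day's worth of records.
rdd = sc.parallelize(range(365 * 1000), 365)
one_day = rdd.filter(lambda x: x < 1000)  # highly selective: keeps ~1 day
# one_day still has 365 partitions, almost all of them now empty.
compact = one_day.coalesce(4)  # merge into 4 partitions (no shuffle)
compact.count()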

On Sun, Mar 23, 2014 at 9:53 PM, Mark Hamstra m...@clearstorydata.com wrote:
 It's much simpler: rdd.partitions.size

