Re: rdd only with one partition

2015-12-21 Thread Zhiliang Zhu
You may refer to my other message, titled:
[Beg for help] spark job with very low efficiency

 

On Tuesday, December 22, 2015 1:49 AM, Ted Yu  wrote:
 

I am not familiar with your use case; is it possible to perform the randomized 
combination operation on subsets of the rows in rdd0? That way you can 
increase the parallelism.
Cheers
On Mon, Dec 21, 2015 at 9:40 AM, Zhiliang Zhu  wrote:

Hi Ted,
Thanks a lot for your kind reply.
I need to convert this rdd0 into another rdd1; the rows of rdd1 are generated 
by randomly combining rows of rdd0.
From that perspective, rdd0 would need a single partition so that the operation 
can randomly access all of its rows; however, that would also lose Spark's 
parallelism benefit.
Best Wishes!
Zhiliang


On Monday, December 21, 2015 11:17 PM, Ted Yu  wrote:

Have you tried the following method?

   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
Cheers
On Mon, Dec 21, 2015 at 2:47 AM, Zhiliang Zhu  wrote:

Dear All,
For some RDD with just one partition, the operations and arithmetic would run 
as a single task, so the RDD has lost all the parallelism benefit of the Spark 
system ...
Is it exactly like that?
Thanks very much in advance!
Zhiliang

Re: rdd only with one partition

2015-12-21 Thread Ted Yu
Have you tried the following method?

   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
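
For illustration, a minimal usage sketch (assuming an existing RDD named
rdd0; the variable names are hypothetical, not from the thread):

  // With shuffle = true a full shuffle runs, so the partition count can grow.
  val rdd1 = rdd0.coalesce(1000, shuffle = true)
  // Equivalent shorthand: repartition(n) is defined as coalesce(n, shuffle = true).
  val rdd2 = rdd0.repartition(1000)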

Cheers

On Mon, Dec 21, 2015 at 2:47 AM, Zhiliang Zhu  wrote:

> Dear All,
>
> For some RDD with just one partition, the operations and arithmetic would
> run as a single task, so the RDD has lost all the parallelism benefit of
> the Spark system ...
>
> Is it exactly like that?
>
> Thanks very much in advance!
> Zhiliang
>


Re: rdd only with one partition

2015-12-21 Thread Zhiliang Zhu
Hi Ted,
Thanks a lot for your kind reply.
I need to convert this rdd0 into another rdd1; the rows of rdd1 are generated 
by randomly combining rows of rdd0.
From that perspective, rdd0 would need a single partition so that the operation 
can randomly access all of its rows; however, that would also lose Spark's 
parallelism benefit.
Best Wishes!
Zhiliang
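
A minimal sketch of that trade-off (assuming rdd0 is an RDD[String]; the
pairing logic below is a hypothetical stand-in for the real combination):

  import scala.util.Random
  import org.apache.spark.rdd.RDD

  def combineGlobally(rdd0: RDD[String]): RDD[(String, String)] = {
    val single = rdd0.coalesce(1)       // one partition: a single task sees every row
    val combined = single.mapPartitions { it =>
      val rows = it.toArray
      val rand = new Random(42)
      // hypothetical combination: pair each row with a randomly chosen row
      rows.iterator.map(r => (r, rows(rand.nextInt(rows.length))))
    }
    combined.repartition(8)             // restore parallelism for downstream stages
  }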

 

On Monday, December 21, 2015 11:17 PM, Ted Yu  wrote:

Have you tried the following method?

   * Note: With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner.
   */
  def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
Cheers
On Mon, Dec 21, 2015 at 2:47 AM, Zhiliang Zhu  wrote:

Dear All,
For some RDD with just one partition, the operations and arithmetic would run 
as a single task, so the RDD has lost all the parallelism benefit of the Spark 
system ...
Is it exactly like that?
Thanks very much in advance!
Zhiliang

Re: rdd only with one partition

2015-12-21 Thread Ted Yu
I am not familiar with your use case; is it possible to perform the
randomized combination operation on subsets of the rows in rdd0?
That way you can increase the parallelism.
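
A minimal sketch of that idea (assuming rdd0 is an RDD[String]; the group
count and the pairing logic are hypothetical stand-ins):

  import scala.util.Random
  import org.apache.spark.HashPartitioner
  import org.apache.spark.rdd.RDD

  def combineInSubsets(rdd0: RDD[String], numGroups: Int = 100): RDD[(String, String)] =
    rdd0
      .map(row => (Random.nextInt(numGroups), row))  // scatter rows into random groups
      .partitionBy(new HashPartitioner(numGroups))   // one random subset per partition
      .values
      .mapPartitions { it =>
        val subset = it.toArray
        val rand = new Random()
        // combine only within this subset; partitions run in parallel
        subset.iterator.map(r => (r, subset(rand.nextInt(subset.length))))
      }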

Cheers

On Mon, Dec 21, 2015 at 9:40 AM, Zhiliang Zhu  wrote:

> Hi Ted,
>
> Thanks a lot for your kind reply.
>
> I need to convert this rdd0 into another rdd1; the rows of rdd1 are
> generated by randomly combining rows of rdd0.
> From that perspective, rdd0 would need a single partition so that the
> operation can randomly access all of its rows; however, that would also
> lose Spark's parallelism benefit.
>
> Best Wishes!
> Zhiliang
>
>
>
>
> On Monday, December 21, 2015 11:17 PM, Ted Yu  wrote:
>
>
> Have you tried the following method?
>
>    * Note: With shuffle = true, you can actually coalesce to a larger number
>    * of partitions. This is useful if you have a small number of partitions,
>    * say 100, potentially with a few partitions being abnormally large. Calling
>    * coalesce(1000, shuffle = true) will result in 1000 partitions with the
>    * data distributed using a hash partitioner.
>    */
>   def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null)
>
> Cheers
>
> On Mon, Dec 21, 2015 at 2:47 AM, Zhiliang Zhu  wrote:
>
> Dear All,
>
> For some RDD with just one partition, the operations and arithmetic would
> run as a single task, so the RDD has lost all the parallelism benefit of
> the Spark system ...
>
> Is it exactly like that?
>
> Thanks very much in advance!
> Zhiliang
>