Re: Random Shuffling

2015-06-24 Thread Maximilian Alber
That's not the point. In machine learning one often divides a data set X
into, e.g., three sets: one for training, one for validation, and one for
final testing. The sets are usually created randomly according to some
ratio. It is therefore important both to keep the ratio and to make the
whole process random.

Cheers,
Max

On Wed, Jun 24, 2015 at 9:51 AM, Stephan Ewen se...@apache.org wrote:

 If you do rebalance(), it will redistribute elements in a round-robin
 fashion, which should give you very even partition sizes.


 On Tue, Jun 23, 2015 at 11:51 AM, Maximilian Alber 
 alber.maximil...@gmail.com wrote:

 Thank you!

 Still, I cannot guarantee the size of each partition, or can I?
 I am looking for something like randomSplit in Spark.

 Cheers,
 Max

 On Mon, Jun 15, 2015 at 5:46 PM, Matthias J. Sax 
 mj...@informatik.hu-berlin.de wrote:

 Hi,

 when using partitionCustom, the data distribution depends only on your
 probability distribution. If it is uniform, you should be fine, i.e.,
 choosing the channel like

  // Members of a class implementing Flink's Partitioner<K>.
  private final Random random = new Random(System.currentTimeMillis());

  @Override
  public int partition(K key, int numPartitions) {
    // The key is ignored; the channel is drawn uniformly at random.
    return random.nextInt(numPartitions);
  }

 should do the trick.

 -Matthias



Re: Random Shuffling

2015-06-24 Thread Sebastian
A very simple way to achieve this is to generate a random variate on the
driver that describes a mapping of data points to samples. Then you
simply join the data set with this mapping to generate the samples.


This approach requires you to know the size of the dataset in advance, 
but has the advantage that you can guarantee the sizes of the samples 
and can easily support more involved techniques such as sampling with 
replacement.
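
A minimal plain-Java sketch of this idea, with no Flink dependency: the sample labels are laid out with exact counts, shuffled once, and then looked up per ID, which stands in for the join. The class name `ExactSplit`, its methods, the seed, and the rule that the last sample absorbs rounding are all assumptions made for illustration, not something from this thread or from the Flink API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Illustrative sketch only; every name in this class is hypothetical. */
final class ExactSplit {

    /** Converts ratios into exact sample sizes summing to n; the last sample absorbs rounding. */
    static int[] sizes(int n, double[] ratios) {
        int[] sizes = new int[ratios.length];
        int assigned = 0;
        for (int i = 0; i < ratios.length - 1; i++) {
            sizes[i] = (int) Math.floor(n * ratios[i]);
            assigned += sizes[i];
        }
        sizes[ratios.length - 1] = n - assigned;
        return sizes;
    }

    /** Builds the mapping from element ID to sample index by shuffling a list of labels. */
    static Map<Long, Integer> mapping(List<Long> ids, double[] ratios, long seed) {
        int[] sizes = sizes(ids.size(), ratios);
        List<Integer> labels = new ArrayList<>(ids.size());
        for (int sample = 0; sample < sizes.length; sample++) {
            for (int k = 0; k < sizes[sample]; k++) {
                labels.add(sample);
            }
        }
        // The shuffle is the only source of randomness; the counts per label stay fixed.
        Collections.shuffle(labels, new Random(seed));
        Map<Long, Integer> result = new HashMap<>();
        for (int i = 0; i < ids.size(); i++) {
            result.put(ids.get(i), labels.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 10; id++) {
            ids.add(id);
        }
        // Training / validation / test with ratios 0.5 / 0.25 / 0.25.
        Map<Long, Integer> assignment = mapping(ids, new double[] {0.5, 0.25, 0.25}, 42L);
        System.out.println(assignment);
    }
}
```

In a distributed setting the map would presumably be turned into a data set of (id, sample) pairs and joined on the ID; the lookup here only mimics that step.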


--sebastian



Re: Random Shuffling

2015-06-23 Thread Maximilian Alber
Thank you!

Still, I cannot guarantee the size of each partition, or can I?
I am looking for something like randomSplit in Spark.

Cheers,
Max


Random Shuffling

2015-06-15 Thread Maximilian Alber
Hi Flinksters,

I would like to shuffle the elements in my data set and then split it in
two according to some ratio. Each element in the data set has a unique ID.
Is there a nice way to do this with the Flink API?
(It would be nice to have guaranteed random shuffling.)
Thanks!

Cheers,
Max


Re: Random Shuffling

2015-06-15 Thread Till Rohrmann
Hi Max,

you can always shuffle your elements using the rebalance method. What Flink
does here is distribute the elements of each partition among all
available TaskManagers. This happens in a round-robin fashion and is thus
not completely random.

Another option is the partitionCustom method, which lets you specify
for each element the partition it is sent to. You would have to
specify a Partitioner to do this.
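
To make the difference between the two options concrete, here is a small plain-Java simulation, not Flink code, that counts how many elements land in each partition under round-robin assignment versus uniformly random assignment. It assumes a single sender; `ChannelBalance` and its methods are hypothetical names for illustration.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative simulation only; it does not use or model the Flink runtime. */
final class ChannelBalance {

    /** Round-robin: element i goes to partition i mod numPartitions. */
    static int[] roundRobin(int numElements, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (int i = 0; i < numElements; i++) {
            counts[i % numPartitions]++;
        }
        return counts;
    }

    /** Uniformly random: each element picks its partition independently. */
    static int[] uniformRandom(int numElements, int numPartitions, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[numPartitions];
        for (int i = 0; i < numElements; i++) {
            counts[random.nextInt(numPartitions)]++;
        }
        return counts;
    }

    /** Difference between the largest and the smallest partition. */
    static int spread(int[] counts) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int c : counts) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        return max - min;
    }

    public static void main(String[] args) {
        int[] rr = roundRobin(1000, 8);
        int[] rnd = uniformRandom(1000, 8, 42L);
        System.out.println("round-robin: " + Arrays.toString(rr) + " spread=" + spread(rr));
        System.out.println("random:      " + Arrays.toString(rnd) + " spread=" + spread(rnd));
    }
}
```

With one sender, round-robin sizes differ by at most one element, whereas random assignment is only balanced in expectation. Neither variant by itself produces a split with a prescribed ratio.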

For the splitting there is at the moment no syntactic sugar. What you can do,
though, is assign each item a split ID and then use a filter operation
to extract the individual splits. Depending on your split ID distribution, you
will have differently sized splits.
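
The split-ID idea can be sketched in plain Java as follows, using a list and a stream filter in place of a Flink data set. The names `SplitByFilter`, `Tagged`, `tag`, and `select` are hypothetical, and drawing the split ID with a fixed probability is just one possible distribution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

/** Illustrative sketch only; every name in this class is hypothetical. */
final class SplitByFilter {

    /** An element ID together with the split it was assigned to. */
    static final class Tagged {
        final long id;
        final int splitId;

        Tagged(long id, int splitId) {
            this.id = id;
            this.splitId = splitId;
        }
    }

    /** Assigns split 0 with probability firstRatio and split 1 otherwise. */
    static List<Tagged> tag(List<Long> ids, double firstRatio, long seed) {
        Random random = new Random(seed);
        List<Tagged> tagged = new ArrayList<>(ids.size());
        for (long id : ids) {
            int splitId = random.nextDouble() < firstRatio ? 0 : 1;
            tagged.add(new Tagged(id, splitId));
        }
        return tagged;
    }

    /** Keeps only the IDs that carry the requested split ID. */
    static List<Long> select(List<Tagged> tagged, int splitId) {
        return tagged.stream()
                .filter(t -> t.splitId == splitId)
                .map(t -> t.id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 0; id < 1000; id++) {
            ids.add(id);
        }
        List<Tagged> tagged = tag(ids, 0.8, 42L);
        System.out.println("split 0: " + select(tagged, 0).size());
        System.out.println("split 1: " + select(tagged, 1).size());
    }
}
```

Because every element draws its split ID independently, the split sizes only approximate the 80/20 ratio; they are not exact.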

Cheers,
Till



Re: Random Shuffling

2015-06-15 Thread Matthias J. Sax
I think you need to implement your own Partitioner and pass it in via
DataSet.partitionCustom(partitioner, field)

(Just specify any field you like; since you don't want to group by key, it
doesn't matter.)

When implementing the partitioner, you can ignore the key parameter and
compute the output channel randomly.

This is something of a workaround, but it should work.


-Matthias






Re: Random Shuffling

2015-06-15 Thread Maximilian Alber
Thanks!

OK, so for a random shuffle I need partitionCustom. But in that case the
data might end up out of balance?

As for the splitting: is there no way to get exact sizes?

Cheers,
Max
