Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hi Ryan, On Mon, Feb 4, 2019 at 12:17 PM Ryan Blue wrote: > > To partition by a condition, you would need to create a column with the > result of that condition. Then you would partition by that column. The sort > option would also work here. We actually do something similar to filter based

Re: Feature request: split dataset based on condition

2019-02-04 Thread Thakrar, Jayesh
12:16 PM To: Andrew Melo Cc: Moein Hosseini, dev Subject: Re: Feature request: split dataset based on condition To partition by a condition, you would need to create a column with the result of that condition. Then you would partition by that column. The sort option would also work her

Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
To partition by a condition, you would need to create a column with the result of that condition. Then you would partition by that column. The sort option would also work here. I don't think that there is much of a use case for this. You have a set of conditions on which to partition your data,
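As a rough illustration of the approach described above (derive a column from the condition, then partition the output by that column), a minimal PySpark-style sketch might look like the following. The function name, the `matches_condition` column name, the overwrite mode, and the Parquet output format are assumptions made for illustration; `df` is assumed to be a Spark DataFrame and `condition` a boolean Column expression.

```python
def write_partitioned_by_condition(df, condition, output_path,
                                   column_name="matches_condition"):
    """Write `df` so that rows are split on disk by the value of `condition`.

    Assumes `df` is a Spark DataFrame and `condition` is a boolean Column
    expression; the column name and output format are illustrative choices.
    """
    # Materialize the result of the condition as an ordinary column.
    labeled = df.withColumn(column_name, condition)
    # DataFrameWriter.partitionBy writes one sub-directory per distinct value
    # of the partition column, so a single write job splits the rows.
    labeled.write.partitionBy(column_name).mode("overwrite").parquet(output_path)


# Hypothetical usage with an existing SparkSession and DataFrame:
#   from pyspark.sql import functions as F
#   write_partitioned_by_condition(df, F.col("value") > 100, "/tmp/split_output")
```

Each subset can then be read back by filtering on the partition column, which lets Spark skip the directories that do not match.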

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello Ryan, On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue wrote: > > Andrew, can you give us more information about why partitioning the output > data doesn't work for your use case? > > It sounds like all you need to do is to create a table partitioned by A and > B, then you would automatically

Re: Feature request: split dataset based on condition

2019-02-04 Thread Ryan Blue
Andrew, can you give us more information about why partitioning the output data doesn't work for your use case? It sounds like all you need to do is to create a table partitioned by A and B, then you would automatically get the divisions you want. If what you're looking for is a way to scale the

Re: Feature request: split dataset based on condition

2019-02-04 Thread Andrew Melo
Hello On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote: > > I've seen many applications that need to split a dataset into multiple datasets based > on some conditions. As there is no method to do it in one place, developers > use the filter method multiple times. I think it would be useful to have a method

Re: Feature request: split dataset based on condition

2019-02-03 Thread Maciej Szymkiewicz
If the goal is to split the output, then `DataFrameWriter.partitionBy` should do what you need, and no additional methods are required. If not, you can also check Silex's `muxPartitions` implementation (see https://stackoverflow.com/a/37956034), but the applications are rather limited, due to high

Re: Feature request: split dataset based on condition

2019-02-03 Thread Sean Owen
I don't think Spark supports this model, where N inputs depending on a parent are computed once at the same time. You can of course cache the parent and filter N times and do the same amount of work. One problem is, where would the N inputs live? They'd have to be stored if not used immediately, and
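The "cache the parent and filter N times" workaround mentioned above could be sketched roughly as follows. This is only an illustration: the helper name `split_by_filters` and the `conditions` mapping are hypothetical, and `df` is assumed to be a Spark DataFrame (only its `cache` and `filter` methods are used).

```python
def split_by_filters(df, conditions):
    """Return one filtered DataFrame per named condition.

    Assumes `df` is a Spark DataFrame and `conditions` maps a label to a
    filter condition. Both names are illustrative, not part of Spark.
    """
    # cache() marks the parent for reuse, so it is not recomputed from its
    # source for every filter once it has been materialized.
    parent = df.cache()
    # Each filter is still its own pass over the cached parent, so this is
    # N passes rather than a single-pass split.
    return {name: parent.filter(cond) for name, cond in conditions.items()}


# Hypothetical usage:
#   parts = split_by_filters(df, {"a": "A = 1", "b": "B = 1"})
#   parts["a"].count()
```

The result is a plain dictionary of independent DataFrames, so each one can be recomputed on its own, which is the property that a single multi-output operation would make harder to keep.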

Re: Feature request: split dataset based on condition

2019-02-02 Thread Moein Hosseini
I don't see it as a method that applies filtering multiple times; instead, I'd use it as a semi-action, not just a transformation. Suppose we had something like map-partition that accepts multiple lambdas, each of which collects the rows for its own dataset (or something like it). Is it possible? On

Re: Feature request: split dataset based on condition

2019-02-02 Thread Sean Owen
I think the problem is that you can't produce multiple Datasets from one source in one operation - consider that reproducing one of them would mean reproducing all of them. You can write a method that would do the filtering multiple times, but it wouldn't be faster. What do you have in mind that's

Feature request: split dataset based on condition

2019-02-01 Thread Moein Hosseini
I've seen many applications that need to split a dataset into multiple datasets based on some conditions. As there is no method to do it in one place, developers use the *filter* method multiple times. I think it would be useful to have a method to split a dataset based on a condition in one iteration, something like