Hi Ryan,
On Mon, Feb 4, 2019 at 12:17 PM Ryan Blue wrote:
>
> To partition by a condition, you would need to create a column with the
> result of that condition. Then you would partition by that column. The sort
> option would also work here.
We actually do something similar to filter based
To: Andrew Melo
Cc: Moein Hosseini, dev
Subject: Re: Feature request: split dataset based on condition
To partition by a condition, you would need to create a column with the
result of that condition. Then you would partition by that column. The sort
option would also work here.
I don't think that there is much of a use case for this. You have a set of
conditions on which to partition your data,
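Ryan's suggestion above, stripped of Spark specifics, can be sketched in plain Python: materialize the condition as a new column, then group by that column the way `DataFrameWriter.partitionBy` would. The rows and the condition here are hypothetical, purely for illustration; this is not Spark API.

```python
# Illustration of "create a column with the result of the condition,
# then partition by that column", using plain dicts as rows.
rows = [{"x": 1}, {"x": 5}, {"x": 3}, {"x": 8}]
condition = lambda row: row["x"] > 4

# Step 1: materialize the condition as a new "column".
for row in rows:
    row["cond"] = condition(row)

# Step 2: partition by that column, as partitionBy("cond") would on write.
partitions = {}
for row in rows:
    partitions.setdefault(row["cond"], []).append(row)
```

Each distinct value of the derived column ends up in its own partition, which is exactly the split the original request asks for.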
Hello Ryan,
On Mon, Feb 4, 2019 at 10:52 AM Ryan Blue wrote:
>
> Andrew, can you give us more information about why partitioning the output
> data doesn't work for your use case?
>
> It sounds like all you need to do is to create a table partitioned by A and
> B, then you would automatically get the divisions you want. If what you're
> looking for is a way to scale the
Hello
On Sat, Feb 2, 2019 at 12:19 AM Moein Hosseini wrote:
>
> I've seen many application need to split dataset to multiple datasets based
> on some conditions. As there is no method to do it in one place, developers
> use filter method multiple times. I think it can be useful to have method
If the goal is to split the output, then `DataFrameWriter.partitionBy`
should do what you need, and no additional methods are required. If not, you
can also check Silex's muxPartitions implementation (see
https://stackoverflow.com/a/37956034), but the applications are rather
limited, due to high
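For reference, the "mux" idea from the linked answer boils down to producing N outputs from a single pass over the data. A minimal stand-alone sketch in plain Python (not the Silex API; the predicates are hypothetical):

```python
# One-pass split into N lists, one list per predicate -- the shape of
# the muxPartitions idea, sketched without Spark.
def mux_split(data, predicates):
    """Return one list per predicate, all filled in a single iteration."""
    outputs = [[] for _ in predicates]
    for item in data:
        for out, pred in zip(outputs, predicates):
            if pred(item):
                out.append(item)
    return outputs

evens, odds = mux_split(range(10),
                        [lambda n: n % 2 == 0, lambda n: n % 2 == 1])
```

The source is traversed once, but all N outputs must be held somewhere at the same time, which hints at the memory cost mentioned above.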
I don't think Spark supports this model, where N outputs depending on one
parent are computed at the same time. You can of course cache the parent and
filter it N times, doing the same amount of work. One problem is: where would
the N outputs live? They'd have to be stored if not used immediately, and
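The cache-then-filter-N-times approach described here can be sketched with a plain list standing in for the cached parent (in Spark this would be `parent.cache()` followed by N `filter` calls; the data is illustrative):

```python
# Stand-in for caching a parent dataset and filtering it N times.
# The list plays the role of the cached parent: computed once, reused
# by each downstream filter.
parent = list(range(100))

small = [n for n in parent if n < 10]    # first filter pass
large = [n for n in parent if n >= 90]   # second filter pass
```

Each output is an independent pass over the cached parent, so re-deriving one of them never forces recomputation of the others.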
I don't consider it a method for applying a filter multiple times; instead,
it would be a semi-action, not just a transformation. Suppose we had
something like mapPartitions that accepts multiple lambdas, where each one
collects the rows for its own dataset (or something like that). Is that possible?
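Something in the spirit of this suggestion, a mapPartitions-like helper that takes several collecting lambdas, could look like the sketch below. It operates on a plain iterable of partitions; the helper's name and shape are hypothetical, not a proposed Spark API.

```python
# Hypothetical helper: iterate each partition once, letting several
# lambdas each collect the rows belonging to their own output dataset.
def map_partitions_multi(partitions, collectors):
    """partitions: iterable of row-iterables; collectors: one predicate
    per desired output. Returns one list of rows per collector."""
    outputs = [[] for _ in collectors]
    for part in partitions:
        for row in part:
            for out, collect in zip(outputs, collectors):
                if collect(row):
                    out.append(row)
    return outputs

low, high = map_partitions_multi([[1, 7], [4, 9]],
                                 [lambda r: r < 5, lambda r: r >= 5])
```

Note that, as pointed out elsewhere in the thread, all outputs are materialized together, which is why this behaves more like an action than a transformation.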
I think the problem is that Spark can't produce multiple Datasets from one
source in one operation - consider that reproducing one of them would mean
reproducing all of them. You could write a method that does the filtering
multiple times, but it wouldn't be faster. What do you have in mind that's
I've seen that many applications need to split a dataset into multiple
datasets based on some conditions. As there is no method to do this in one
place, developers use the `filter` method multiple times. I think it would be
useful to have a method to split a dataset based on conditions in one
iteration, something like