Hi Nic,
I think you can do this with just the Scanner [1]:
import pyarrow as pa
import pyarrow.dataset as ds

taxi_ds = ds.dataset(
    "~/Datasets/nyc-taxi",
    partitioning=ds.partitioning(pa.schema([("year", pa.int16())]), flavor="hive"),
)
expr = ...  # some expression equivalent to your case_when above
# Keep "year" in the projection so it is still available to partition on when writing.
scanner = taxi_ds.scanner(columns={"year": ds.field("year"), "new_col": expr})
output_partitioning = ds.partitioning(pa.schema([("year", pa.int16())]), flavor="hive")
ds.write_dataset(scanner, "outdir", format="parquet", partitioning=output_partitioning)
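For the expression itself, here's a rough sketch using the case_when compute kernel; the AGEP column name and breakpoints are taken from your R snippet, and I believe compute functions called with Expression arguments return an Expression in recent pyarrow versions, so double-check on the version you're running:

```
import pyarrow.compute as pc

# Conditions are evaluated in order; the extra final value is the "else" branch.
expr = pc.case_when(
    pc.make_struct(
        ds.field("AGEP") < 25,
        ds.field("AGEP") < 35,
        ds.field("AGEP") < 45,
        ds.field("AGEP") < 55,
        ds.field("AGEP") < 65,
    ),
    "Under 25",
    "25-34",
    "35-44",
    "45-54",
    "55-64",
    "65+",
)
```

Since write_dataset consumes the scanner batch by batch, the dataset is never fully materialized in memory. If you want to partition on the new column as in your R example, include it in the projection and in the output partitioning schema (as pa.string()).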
[1]
https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data
On Mon, Mar 11, 2024 at 1:09 PM Nic Crane <[email protected]> wrote:
>
> Hey folks,
>
> In the process of trying to work out if something is a bug or not, I'm
> trying to work out the PyArrow equivalent for some R code (see [1]),
> which opens a dataset, projects a new column, and then writes the
> dataset to disk using the new column as a partitioning variable.
>
> All of the examples I have found so far in docs involve converting the
> dataset to a table in an intermediate step before projecting the new
> column, which I don't want to do, as the dataset is larger than
> memory.
>
> Is it possible to do this without converting the dataset into a table,
> and if so, how?
>
> Thanks,
>
> Nic
>
>
> [1]
> ```
> open_dataset("data/pums/person") |>
>   mutate(
>     age_group = case_when(
>       AGEP < 25 ~ "Under 25",
>       AGEP < 35 ~ "25-34",
>       AGEP < 45 ~ "35-44",
>       AGEP < 55 ~ "45-54",
>       AGEP < 65 ~ "55-64",
>       TRUE ~ "65+"
>     )
>   ) |>
>   write_dataset(
>     path = "./data/pums/person-age-partitions",
>     partitioning = c("year", "location", "age_group")
>   )
> ```