Hi Nic,

I think you can do this with just the Scanner [1]:

import pyarrow as pa
import pyarrow.dataset as ds

taxi_ds = ds.dataset(
    "~/Datasets/nyc-taxi",
    partitioning=ds.partitioning(pa.schema([("year", pa.int16())]), flavor="hive"),
)
expr = ...  # some expression equivalent to your case_when above
# A projection dict replaces the column list, so keep the partition column:
scanner = taxi_ds.scanner(columns={"year": ds.field("year"), "new_col": expr})
output_partitioning = ds.partitioning(pa.schema([("year", pa.int16())]),
                                      flavor="hive")
ds.write_dataset(scanner, "outdir", format="parquet",
                 partitioning=output_partitioning)
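
The write streams batches through the scanner, so the projection is
applied as the data is written and the dataset is never materialized
as a table.

For the expression itself, nested if_else calls are one way to
translate your case_when (a sketch, assuming the AGEP column name
from your R snippet):

import pyarrow.compute as pc
import pyarrow.dataset as ds

age = ds.field("AGEP")
# Each if_else mirrors one case_when arm; "65+" is the TRUE ~ fallback.
expr = pc.if_else(age < 25, "Under 25",
       pc.if_else(age < 35, "25-34",
       pc.if_else(age < 45, "35-44",
       pc.if_else(age < 55, "45-54",
       pc.if_else(age < 65, "55-64", "65+")))))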

[1] https://arrow.apache.org/docs/python/dataset.html#writing-large-amounts-of-data

On Mon, Mar 11, 2024 at 1:09 PM Nic Crane <[email protected]> wrote:
>
> Hey folks,
>
> In the process of working out whether something is a bug, I'm trying
> to find the PyArrow equivalent of some R code (see [1]), which opens
> a dataset, projects a new column, and then writes the dataset to
> disk using the new column as a partitioning variable.
>
> All of the examples I have found so far in docs involve converting the
> dataset to a table in an intermediate step before projecting the new
> column, which I don't want to do, as the dataset is larger than
> memory.
>
> Is it possible to do this without converting the dataset into a table,
> and if so, how?
>
> Thanks,
>
> Nic
>
>
> [1]
> ```
> open_dataset("data/pums/person") |>
>   mutate(
>     age_group = case_when(
>       AGEP < 25 ~ "Under 25",
>       AGEP < 35 ~ "25-34",
>       AGEP < 45 ~ "35-44",
>       AGEP < 55 ~ "45-54",
>       AGEP < 65 ~ "55-64",
>       TRUE ~ "65+"
>     )
>   )|>
>   write_dataset(
>     path = "./data/pums/person-age-partitions",
>     partitioning = c("year", "location", "age_group")
>   )
> ```
