Hi Weston,

Thanks for the information. That will help me make an informed decision.

I wish I was better with C++ so I could help parallelize the CSV parsing.

It could be useful to have some additional examples in the documentation on 
workflow patterns. If there is interest, that is something I would be up for 
contributing (assuming I can bounce some of these questions off everyone with 
more experience).
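
In the meantime, I’ve annotated the quoted thread below with a few rough,
untested sketches of what I have in mind, in case they are useful for the
archives.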

Thanks again,

Ryan

> On Nov 9, 2022, at 7:55 PM, Weston Pace <[email protected]> wrote:
> 
> 
>> 
>> 2. Some of the files end up being larger than memory when unzipped. In this 
>> case I’m using the file size to switch over and use open_csv instead of 
>> read_csv. Is there any plan for open_csv to be multithreaded in a future 
>> release (didn’t see anything on Jira, but I’m not great at searching on it)?
> 
> There is a PR in progress[1] which will add parallel reads to
> `open_csv` when provided with a random access file (e.g. not just an
> input stream).  This is important when reading from S3 but perhaps not
> as big of a deal when reading from a local disk (which usually doesn't
> support a ton of parallelism).  The streaming CSV reader's parsing is
> also not very parallel and could be improved (I would presume to get
> pretty close to read_csv performance).  However, I don't know anyone
> currently working on this.
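
That distinction is helpful. For my own notes, here is how I read it (an
untested sketch; the file name is a placeholder): a seekable, random-access
file could benefit from the parallel reads in [1], while a bare input stream
could not, and the parse itself is still streaming either way.

    import pyarrow as pa
    import pyarrow.csv as pacsv

    # Untested sketch: open the CSV as a random-access (seekable) file
    # rather than a bare input stream, then consume it batch by batch.
    random_access = pa.OSFile("big_file.csv", "rb")
    reader = pacsv.open_csv(random_access)
    for batch in reader:
        pass  # process each RecordBatch as it is produced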
> 
>> If I go with this approach, will the dataset to batches read be 
>> single-threaded (reading csv format) like open_csv? That is obviously not an 
>> issue for large files I would have had to use open_csv for anyway, but if 
>> the eventual dataset API read is single threaded, I might still want to use 
>> read_csv and process columns post read for smaller datasets.
> 
> dataset.to_batches is built on top of the streaming CSV reader (e.g.
> open_csv).  However, any compute work done by datasets (e.g.
> dictionary encoding, joins, casts, etc.) will be done in parallel.
> 
> [1] https://github.com/apache/arrow/pull/14269
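
Good to know about the compute work. If I understand correctly, something
like the sketch below would stream the CSV scan while the cast in the
projection still gets run in parallel across batches (untested; the path and
column name are placeholders).

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Untested sketch: scan CSVs as a dataset; the scan itself streams,
    # while compute in the projection (the cast here) is parallelized.
    dataset = ds.dataset("unzipped_csvs/", format="csv")
    scanner = dataset.scanner(
        columns={"balance": ds.field("balance").cast(pa.float64())}
    )
    for batch in scanner.to_batches():
        pass  # consume batches as they are produced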
> 
>> On Wed, Nov 9, 2022 at 3:21 PM Ryan Kuhns <[email protected]> wrote:
>> 
>> Hi Everyone,
>> 
>> Adam’s reply got me thinking about using the dataset API to overcome the 
>> problem I was facing in my third question. It seems like I could use a 
>> column projection to provide a mapping from strings to integer lookup 
>> values. Then, similar to the “writing large amounts of data” example 
>> (https://arrow.apache.org/docs/python/dataset.html), I can pass the dataset 
>> to write_dataset and never have everything in memory.
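
Concretely, I am imagining something like this (untested sketch; the column
names, lookup values, and paths are all made up):

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    # Untested sketch: project a string dimension column to an integer code
    # via a fixed lookup array, then stream the result into write_dataset so
    # the full data set never has to sit in memory at once.
    state_lookup = pa.array(["CA", "NY", "TX"])  # fixed, shared across files

    dataset = ds.dataset("unzipped_csvs/", format="csv")
    scanner = dataset.scanner(columns={
        "state_code": pc.index_in(ds.field("state"), value_set=state_lookup),
        "balance": ds.field("balance"),
    })
    ds.write_dataset(scanner, "out_dir/", format="parquet")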
>> 
>> If I go with this approach, will the dataset to batches read be 
>> single-threaded (reading csv format) like open_csv? That is obviously not an 
>> issue for large files I would have had to use open_csv for anyway, but if 
>> the eventual dataset API read is single threaded, I might still want to use 
>> read_csv and process columns post read for smaller datasets.
>> 
>> Thanks,
>> 
>> Ryan
>> 
>> On Nov 9, 2022, at 4:07 PM, Ryan Kuhns <[email protected]> wrote:
>> 
>> 
>> Adam,
>> 
>> Thanks for pointing me to that. The fsspec approach looks like it will be 
>> helpful, and the code snippet gives me a good starting point.
>> 
>> -Ryan
>> 
>> On Nov 9, 2022, at 2:42 PM, Kirby, Adam <[email protected]> wrote:
>> 
>> 
>> Hi Ryan,
>> 
>> For your first question of a ZIP of multiple CSVs, I've had good luck [2] 
>> combining fsspec [1] with pyarrow dataset to process ZIPs of multiple CSVs. 
>> fsspec allows you to manage how much RAM you use on the read side with a few 
>> different cache configs.
>> 
>> In case helpful, I sent a python snippet earlier. [3]
>> 
>> [1] 
>> https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/zip.html
>> 
>> [2] The idea was proposed by [email protected] on this list and proved 
>> very helpful.
>> 
>> [3] https://www.mail-archive.com/[email protected]/msg02176.html
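
For the archives, here is roughly how I understand the fsspec approach would
look (untested sketch; the archive and member names are placeholders):

    import fsspec
    import pyarrow.dataset as ds

    # Untested sketch: treat the ZIP archive as an fsspec filesystem, list
    # its members, and point a CSV dataset at the member we want.
    zip_fs = fsspec.filesystem("zip", fo="quarterly_update.zip")
    print(zip_fs.ls("/"))  # see which CSVs are inside the archive
    dataset = ds.dataset("loans.csv", format="csv", filesystem=zip_fs)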
>> 
>> 
>>> On Wed, Nov 9, 2022, 12:15 PM Ryan Kuhns <[email protected]> wrote:
>>> 
>>> Hi Everyone,
>>> 
>>> I’m using pyarrow to read, process, store, and analyze some large files 
>>> (~460 GB zipped across 400+ files, updated quarterly).
>>> 
>>> I’ve had a couple of thoughts/questions come up as I’ve worked through the 
>>> process. The first two questions are mainly informational; I want to confirm 
>>> what I’ve inferred from the existing docs.
>>> 
>>> 1. I know pyarrow has functionality to decompress a zipped file with a 
>>> single CSV in it, but in my case I have 3 files in the zip. I’m currently 
>>> using Python’s zipfile to find and open the file I want in the zip and then 
>>> reading it with pyarrow.read_csv (rough sketch of my current approach below 
>>> the questions). I wanted to confirm there isn’t pyarrow functionality that 
>>> can list the files in the zip and let me select the one to unzip and read.
>>> 
>>> 2. Some of the files end up being larger than memory when unzipped. In that 
>>> case I’m using the file size to switch over and use open_csv instead of 
>>> read_csv (sketch of the size-based switch below the questions). Is there any 
>>> plan for open_csv to be multithreaded in a future release (I didn’t see 
>>> anything on Jira, but I’m not great at searching it)?
>>> 
>>> 3. My data has lots of dimension columns (low cardinality) with longish 
>>> string values, and a large number of rows. Since some files get close to or 
>>> above my available memory when unzipped, I need to be as memory efficient as 
>>> possible. Converting these columns to dictionaries via ConvertOptions helps 
>>> with the in-memory size, but I then get errors when joining tables together 
>>> later (because the functionality to unify dictionaries is not implemented 
>>> yet). Is that something that will be added?
>>> 
>>> How about the ability to provide a user-supplied dictionary to use in the 
>>> encoding (as an optional parameter, falling back to the current behavior 
>>> when not provided)? That would remove the need to infer the dictionary from 
>>> the data when encoding, and it would let me ensure the same dictionary 
>>> mapping is used for a column across every file I read in, which I can’t 
>>> guarantee currently.
>>> 
>>> A related feature that would solve my issue would be a way to easily map a 
>>> column’s values to other values on read. I’d imagine this would live in 
>>> ConvertOptions, where you could specify a column and the mapping to use 
>>> (perhaps a parameter accepting a list of (name, mapping) tuples?). The end 
>>> result would be the ability to convert a string column to something like 
>>> int16 on read via the mapping. That would be more space efficient and would 
>>> also avoid the inability to join on dictionary columns that I am seeing 
>>> currently. (Rough sketch of both halves of this question below.)
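
To make question 1 concrete, this is roughly what I am doing today (untested
sketch; the archive and member names are placeholders):

    import zipfile
    import pyarrow.csv as pacsv

    # Untested sketch: list the members of the archive, pick the CSV I want,
    # and hand the open (uncompressed) file object to pyarrow's read_csv.
    with zipfile.ZipFile("quarterly_update.zip") as zf:
        print(zf.namelist())  # the three members of the archive
        with zf.open("loans.csv") as f:
            table = pacsv.read_csv(f)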
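
And for question 2, the size-based switch looks roughly like this (the
cutoff value is made up):

    import pyarrow.csv as pacsv

    SIZE_CUTOFF = 20 * 1024**3  # bytes, uncompressed; made-up threshold

    def read_member(f, uncompressed_size):
        # Untested sketch: small members get the multithreaded in-memory
        # reader, very large members fall back to the streaming reader.
        if uncompressed_size < SIZE_CUTOFF:
            return pacsv.read_csv(f)  # whole table in memory
        return pacsv.open_csv(f)      # iterate the returned reader's batches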
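
And for question 3, this is what I do now versus the kind of fixed mapping I
am asking about (untested sketch; column names and lookup values are made
up):

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.csv as pacsv

    # (a) What I do now (roughly): have the CSV reader dictionary-encode
    #     string columns to shrink the in-memory size.
    convert = pacsv.ConvertOptions(auto_dict_encode=True)

    # (b) The kind of fixed mapping I would like to apply on read; today I
    #     can only approximate it after read_csv, on a plain string column.
    servicer_lookup = pa.array(["ACME BANK", "BIG CREDIT UNION"])
    table = pacsv.read_csv("loans.csv")
    codes = pc.index_in(
        table["servicer_name"], value_set=servicer_lookup
    ).cast(pa.int16())
    table = table.set_column(
        table.schema.get_field_index("servicer_name"), "servicer_code", codes
    )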
>>> 
>>> Thanks,
>>> 
>>> Ryan
>>> 
