I suspect this is a combination of [1] and [2]/[3]. We do not currently allow you to specify a filter during discovery. We could, and that should allow us to reduce the amount of reading we need to do.
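
In the meantime, one possible workaround is to narrow discovery yourself by pointing the dataset at just the partition directory you need, so only that prefix is listed. This is a rough sketch, not a tested recipe for your bucket: it assumes hive-style `key=value` directories and Parquet files (neither is stated in your message), and `partition_prefix` and `open_partition` are hypothetical helper names, not part of pyarrow.

```python
def partition_prefix(root, partition_values):
    """Build the path of one hive-style partition, e.g. root/year=2023/month=3."""
    parts = [root.strip("/")]
    parts += [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join(parts)


def open_partition(root, partition_values, filesystem):
    """Run discovery under a single partition directory instead of the whole root."""
    # Imported here so the path helper above has no dependencies.
    import pyarrow.dataset as ds

    return ds.dataset(
        partition_prefix(root, partition_values),
        filesystem=filesystem,
        format="parquet",      # assumption: the files are Parquet
        partitioning="hive",   # assumption: key=value directory names
        # Keep the original root as the base so the partition keys in the
        # narrowed path should still be parsed into columns.
        partition_base_dir=root.strip("/"),
    )
```

With your variables this would be called as something like `open_partition(f"{bucket}/{partition_root}", {"year": 2023}, s3fs)`, where the `year` key is only a placeholder for your real partition fields. You can still pass a filter to `to_batches` on the result.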
Also, when no filter is supplied, we can be more efficient with our usage of S3.

[1] https://github.com/apache/arrow/issues/31174
[2] https://github.com/apache/arrow/issues/34213
[3] https://github.com/apache/arrow/issues/25019

On Wed, Mar 29, 2023 at 1:17 AM Oxlade, Dan <[email protected]> wrote:

> Hi all,
>
> I'm fairly new to Arrow.
>
> I'm trying to create an Arrow Flight service that reads data from an S3
> bucket. On the face of it, that appears to be quite simple. Unfortunately, I
> have a very large bucket with thousands of files across partitions.
>
> I'm trying the following in Python:
>
>     dataset = ds.dataset(
>         f"{bucket}/{partition_root}/",
>         filesystem=s3fs,
>         partitioning=my_partitioning_def,
>     )
>     batches = dataset.to_batches(
>         filter=my_filter_which_would_select_a_tiny_subset_of_files
>     )
>
> From my testing, it seems as though the S3 bucket is scanned at the first
> step, which is extremely inefficient in my use case. Is there a way to delay
> the scan until the filter is applied? This could reduce the scan of many
> thousands of objects to a single object in S3.
>
> Hopefully that makes sense.
>
> Thanks
> Dan
