It looks like https://github.com/dask/s3fs implements these methods. Would there need to be a wrapper over this for Arrow, or is it compatible as is?

-Luke

On Fri, Oct 12, 2018 at 9:13 AM Uwe L. Korn <[email protected]> wrote:

> That looks nice. Once you have wrapped that in a class that implements
> read and seek like a Python file object, you should be able to pass this to
> `pyarrow.parquet.read_table`. When you then set the columns argument on
> that function, only the respective byte ranges are requested from S3.
> To minimise the number of requests, I would suggest that you implement the
> S3 file with the exact ranges provided from the outside, but when using
> pyarrow, wrap your S3 file in an io.BufferedReader.
> pyarrow.parquet requests exactly the ranges it needs, but that can
> sometimes be too fine-grained for object stores like S3. There you often
> prefer the trade-off of requesting a few more bytes in exchange for a
> smaller number of requests.
>
> Uwe
>
> On Thu, Oct 11, 2018, at 11:27 PM, Luke wrote:
>
> This works in boto3:
>
> import boto3
>
> obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
> stream = obj.get(Range='bytes=10-100')['Body']
> print(stream.read())
>
> On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <[email protected]> wrote:
>
> Hello Luke,
>
> this is only partly implemented. You can do this, and I already did, but
> it is sadly not in a perfect state.
>
> boto3 itself seems to be lacking a proper file-like class. You can get the
> contents of a file in S3 as
> https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
> This sadly seems to be missing a seek method.
>
> In my case I accessed Parquet files on S3 with per-column access using
> the simplekv project. There a small file-like class is implemented on top
> of boto (but not boto3):
> https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93
> This is what you are looking for, just in the wrong boto package. Also,
> as far as I know, this implementation is sadly leaking HTTP connections,
> so when you access too many files (even serially), your network will
> suffer.
>
> Cheers
> Uwe
>
> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
>
> I have Parquet files (each self-contained) in S3 and I want to read
> certain columns into a pandas DataFrame without reading the entire
> object out of S3.
>
> Is this implemented? boto3 in Python supports reading from offsets in an
> S3 object, but I wasn't sure whether anyone has made that work with a
> Parquet file so that only certain columns are fetched.
>
> thanks,
> Luke
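For reference, here is a minimal, untested sketch of the approach Uwe describes: a seekable read-only file object whose reads are served by byte-range fetches, wrapped in an io.BufferedReader so that pyarrow's many small reads are coalesced into fewer, larger requests. The class and function names (`RangeFile`, `s3_range_file`) and the bucket/key/column names are made up for illustration; only `boto3.resource('s3').Object(...).get(Range=...)`, `Object.content_length`, and `pyarrow.parquet.read_table(..., columns=...)` come from the thread.

```python
import io


class RangeFile(io.RawIOBase):
    """Seekable, read-only file built from a byte-range fetch function.

    fetch(start, stop) must return the bytes for the half-open range
    [start, stop); size is the total object length in bytes.
    """

    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def readinto(self, b):
        # BufferedReader calls readinto(); RawIOBase derives read() from it,
        # so each buffered refill becomes exactly one range request.
        stop = min(self._pos + len(b), self._size)
        if stop <= self._pos:
            return 0
        data = self._fetch(self._pos, stop)
        b[: len(data)] = data
        self._pos += len(data)
        return len(data)


def s3_range_file(bucket, key, buffer_size=1 << 20):
    """Wire RangeFile up to boto3 ranged GETs, buffered at 1 MiB by default."""
    import boto3

    obj = boto3.resource("s3").Object(bucket, key)

    def fetch(start, stop):
        # S3 Range headers are inclusive, hence the stop - 1.
        return obj.get(Range=f"bytes={start}-{stop - 1}")["Body"].read()

    return io.BufferedReader(RangeFile(fetch, obj.content_length),
                             buffer_size=buffer_size)


# Hypothetical usage -- bucket, key, and column names are placeholders:
# import pyarrow.parquet as pq
# df = pq.read_table(s3_range_file("mybucketfoo", "foo"),
#                    columns=["a", "b"]).to_pandas()
```

Because only `readinto` touches the network, the buffer size on the BufferedReader is the single knob for the request-count-versus-bytes-fetched trade-off Uwe mentions.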
