It looks like https://github.com/dask/s3fs implements these methods. Would there need to be a wrapper over this for Arrow, or is it compatible as is?

-Luke

On Fri, Oct 12, 2018 at 9:13 AM Uwe L. Korn <[email protected]> wrote:

> That looks nice. Once you have wrapped that in a class that implements
> read and seek like a Python file object, you should be able to pass this to
> `pyarrow.parquet.read_table`. When you then set the columns argument on
> that function, only the respective byte ranges are requested from S3.
> To minimise the number of requests, I would suggest that you implement the
> S3 file with the exact ranges provided from the outside, but when using
> pyarrow, wrap your S3 file in an io.BufferedReader.
> pyarrow.parquet requests exactly the ranges it needs, but that can
> sometimes be too fine-grained for object stores like S3. There you often
> prefer the trade-off of requesting a few more bytes in exchange for a
> smaller number of requests.
>
> Uwe
>
> On Thu, Oct 11, 2018, at 11:27 PM, Luke wrote:
>
> This works in boto3:
>
> import boto3
>
> obj = boto3.resource('s3').Object('mybucketfoo', 'foo')
> stream = obj.get(Range='bytes=10-100')['Body']
> print(stream.read())
>
> On Thu, Oct 11, 2018 at 2:22 PM Uwe L. Korn <[email protected]> wrote:
>
> Hello Luke,
>
> this is only partly implemented. You can do this, and I already did, but
> it is sadly not in a perfect state.
>
> boto3 itself seems to be lacking a proper file-like class. You can get the
> contents of a file in S3 as
> https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
> This sadly seems to be missing a seek method.
>
> In my case I accessed Parquet files on S3 with per-column access using
> the simplekv project. There a small file-like class is implemented on top
> of boto (but not boto3):
> https://github.com/mbr/simplekv/blob/master/simplekv/net/botostore.py#L93
> This is what you are looking for, just in the wrong boto package. Also,
> as far as I know, this implementation is sadly leaking HTTP connections,
> so when you access too many files (even serially), your network will
> suffer.
>
> Cheers
> Uwe
>
> On Thu, Oct 11, 2018, at 8:01 PM, Luke wrote:
>
> I have Parquet files (each self-contained) in S3 and I want to read
> certain columns into a pandas DataFrame without reading the entire
> object out of S3.
>
> Is this implemented? boto3 in Python supports reading from offsets in an
> S3 object, but I wasn't sure whether anyone has made that work with a
> Parquet file so that only certain columns are fetched.
>
> thanks,
> Luke
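For reference, here is a minimal, untested sketch of the approach Uwe describes: a seekable read-only file object whose reads are served by byte-range fetches, wrapped in an io.BufferedReader so that pyarrow's many small reads are coalesced into fewer, larger requests. The class and function names (`RangeFile`, `s3_range_file`) and the bucket/key/column names are made up for illustration; only `boto3.resource('s3').Object(...).get(Range=...)`, `Object.content_length`, and `pyarrow.parquet.read_table(..., columns=...)` come from the thread.

```python
import io


class RangeFile(io.RawIOBase):
    """Seekable, read-only file built from a byte-range fetch function.

    fetch(start, stop) must return the bytes for the half-open range
    [start, stop); size is the total object length in bytes.
    """

    def __init__(self, fetch, size):
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self._pos = offset
        elif whence == io.SEEK_CUR:
            self._pos += offset
        elif whence == io.SEEK_END:
            self._pos = self._size + offset
        return self._pos

    def tell(self):
        return self._pos

    def readinto(self, b):
        # BufferedReader calls readinto(); RawIOBase derives read() from it,
        # so each buffered refill becomes exactly one range request.
        stop = min(self._pos + len(b), self._size)
        if stop <= self._pos:
            return 0
        data = self._fetch(self._pos, stop)
        b[: len(data)] = data
        self._pos += len(data)
        return len(data)


def s3_range_file(bucket, key, buffer_size=1 << 20):
    """Wire RangeFile up to boto3 ranged GETs, buffered at 1 MiB by default."""
    import boto3

    obj = boto3.resource("s3").Object(bucket, key)

    def fetch(start, stop):
        # S3 Range headers are inclusive, hence the stop - 1.
        return obj.get(Range=f"bytes={start}-{stop - 1}")["Body"].read()

    return io.BufferedReader(RangeFile(fetch, obj.content_length),
                             buffer_size=buffer_size)


# Hypothetical usage -- bucket, key, and column names are placeholders:
# import pyarrow.parquet as pq
# df = pq.read_table(s3_range_file("mybucketfoo", "foo"),
#                    columns=["a", "b"]).to_pandas()
```

Because only `readinto` touches the network, the buffer size on the BufferedReader is the single knob for the request-count-versus-bytes-fetched trade-off Uwe mentions.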
