Sorry for the slow reply.
> This could be something on the Azure side but I find I am being bottlenecked
> on the download speed and have noticed if I spin up multiple Python sessions
> (or in my case interactive windows) I can increase my throughput. Hence I can
> download each year of the taxinyc dataset in separate interactive windows and
> increase my bandwidth consumed.
I would say that this is always a problem: the datasets API aims to
maximize resource usage within a single process, so needing multiple
processes to increase throughput suggests something is off. Now, it may
be a known or expected problem :)
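
For what it's worth, here is roughly how I would experiment with the I/O
thread count, starting from your snippet (a sketch, not a tested fix;
whether it helps at all against adlfs is something you would need to
measure):

```
import adlfs
import pyarrow as pa
import pyarrow.dataset as ds

# Raise the size of Arrow's I/O thread pool (default is 8). This is the
# knob you already tried; since these reads are network-bound, values
# well above 16 may still be worth testing.
pa.set_io_thread_count(32)
print(pa.io_thread_count())

fs = adlfs.AzureBlobFileSystem(account_name='azureopendatastorage')
dataset = ds.dataset('nyctlc/green/', filesystem=fs, format='parquet')

# use_threads=True (the default) lets the scanner work on multiple
# fragments concurrently using the CPU and I/O thread pools.
scanner = dataset.scanner(use_threads=True)
ds.write_dataset(scanner, 'taxinyc/green/feather/', format='feather')
```

If the requests are being serialized on the adlfs/Azure side, though, no
Arrow-side setting will change the throughput of a single process.
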
> Does the Dataset API make use of memory mapping? Do I have the correct
> understanding that memory mapping is only intended for dealing with large
> data stored on a local file system, whereas data stored on a cloud file
> system in the feather format effectively cannot be memory mapped?
I think the benefits of memory mapping are rather subtle and often
misleading. Datasets can make use of memory mapping for local
filesystems. Doing so will, at best, give a slight performance
benefit (avoiding a memcpy), but it will most likely decrease performance
(by introducing I/O where it is not expected), and it will have no
effect whatsoever on the amount of RAM used.
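
As a purely local illustration of what memory mapping does (and does not)
change, this is the kind of comparison I have in mind (a sketch; the file
name is made up, and as noted above this only applies to a local
filesystem, not to adlfs):

```
import pyarrow.feather as feather

# memory_map=True: for an uncompressed file the Arrow buffers can point
# directly into the mapped region instead of being copied (the avoided
# memcpy mentioned above); pages are faulted in as they are touched.
t1 = feather.read_table('green_2016.feather', memory_map=True)

# memory_map=False: the file is read into Arrow-allocated buffers up
# front. For most workloads the end-to-end difference is small.
t2 = feather.read_table('green_2016.feather', memory_map=False)
```
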
> This works as well, as noted previously, so I assume the python operators are
> mapped across, similar to what happens when you use the operators against a
> numpy or pandas series: it just executes a np.multiply or pd.multiply in the
> background.
Yes. However the functions that get mapped can sometimes be
surprising. Specifically, logical operations map to the _kleene
variation and arithmetic maps to the _checked variation. You can find
the implementation at [1]. For multiplication this boils down to:
```
@staticmethod
cdef Expression _expr_or_scalar(object expr):
    if isinstance(expr, Expression):
        return (<Expression> expr)
    return (<Expression> Expression._scalar(expr))

...

def __mul__(Expression self, other):
    other = Expression._expr_or_scalar(other)
    return Expression._call("multiply_checked", [self, other])
```
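
To make that concrete, both spellings below should build the same deferred
expression (a sketch; whether the pc.* functions accept expressions
directly depends on your pyarrow version):

```
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Operator form: __mul__ routes to the multiply_checked kernel, as above.
expr_a = ds.field('passengerCount') * 1000

# Explicit form: wrap the literal in pc.scalar() so that both arguments
# are expressions rather than a mix of expression and plain int.
expr_b = pc.multiply_checked(ds.field('passengerCount'), pc.scalar(1000))

print(expr_a)  # something like: multiply_checked(passengerCount, 1000)
```

Either one can be passed as a projection in `columns=` and is evaluated
batch by batch during the scan, so it does not require materializing the
whole column first.
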
On Mon, Sep 19, 2022 at 12:52 AM Jacek Pliszka <[email protected]> wrote:
>
> Re 2: In the Python Azure SDK there is logic for partial blob reads:
>
> https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-query-blob
>
> However I was unable to use it as it does not support parquet files
> with decimal columns and these are the ones I have.
>
> BR
>
> J
>
> On Fri, Sep 16, 2022 at 2:26 AM Aldrin <[email protected]> wrote:
> >
> > For Question 2:
> > At a glance, I don't see anything in adlfs or Azure that is able to do
> > partial reads of a blob. If you're using block blobs, then likely you would
> > want to store blocks of your file as separate blocks of a blob, and then
> > you can do partial data transfers that way. I could be misunderstanding the
> > SDKs or how Azure stores data, but my guess is that a whole blob is
> > retrieved and then the local file is able to support partial, block-based
> > reads as you expect from local filesystems. You may be able to double check
> > how much data is being retrieved by looking at where adlfs is mounting your
> > blob storage.
> >
> > For Question 3:
> > You can memory-map remote files; it's just that every page fault will be
> > even more expensive than for local files. I am not sure how to tell the
> > dataset API to do memory mapping, and I'm not sure how well that would work
> > over adlfs.
> >
> > For Question 4:
> > Can you try using `pc.scalar(1000)` as shown in the first code excerpt in
> > [1]:
> >
> > >> x, y = pa.scalar(7.8), pa.scalar(9.3)
> > >> pc.multiply(x, y)
> > <pyarrow.DoubleScalar: 72.54>
> >
> > [1]:
> > https://arrow.apache.org/docs/python/compute.html#standard-compute-functions
> >
> > Aldrin Montana
> > Computer Science PhD Student
> > UC Santa Cruz
> >
> >
> > On Thu, Sep 8, 2022 at 8:26 PM Nikhil Makan <[email protected]> wrote:
> >>
> >> Hi There,
> >>
> >> I have been experimenting with Tabular Datasets for data that can be
> >> larger than memory and had a few questions related to what's going on
> >> under the hood and how to work with it (I understand it is still
> >> experimental).
> >>
> >> Question 1: Reading Data from Azure Blob Storage
> >> Now I know the filesystems don't fully support this yet, but there is an
> >> fsspec-compatible library (adlfs), which is shown in the filesystem
> >> example and which I have used. Example below with the NYC taxi dataset,
> >> where I am pulling the whole dataset through and writing it to disk in
> >> the feather format.
> >>
> >> import adlfs
> >> import pyarrow.dataset as ds
> >>
> >> fs = adlfs.AzureBlobFileSystem(account_name='azureopendatastorage')
> >>
> >> dataset = ds.dataset('nyctlc/green/', filesystem=fs, format='parquet')
> >>
> >> scanner = dataset.scanner()
> >> ds.write_dataset(scanner, 'taxinyc/green/feather/', format='feather')
> >>
> >> This could be something on the Azure side but I find I am being
> >> bottlenecked on the download speed and have noticed if I spin up multiple
> >> Python sessions (or in my case interactive windows) I can increase my
> >> throughput. Hence I can download each year of the taxinyc dataset in
> >> separate interactive windows and increase my bandwidth consumed. The
> >> tabular dataset documentation notes 'optionally parallel reading.' Do you
> >> know how I can control this? Or perhaps control the number of concurrent
> >> connections. Or has this got nothing to do with Arrow and sits purely
> >> on the Azure side? I have increased the io thread count from the default 8
> >> to 16 and saw no difference, but could still spin up more interactive
> >> windows to maximise bandwidth.
> >>
> >> Question 2: Reading Filtered Data from Azure Blob Storage
> >> Unfortunately I don't quite have a repeatable example here. However using
> >> the same data above, only this time I have each year as a feather file
> >> instead of a parquet file. I have uploaded this to my own Azure blob
> >> storage account.
> >> I am trying to read a subset of this data from the blob storage by
> >> selecting columns and filtering the data. The final result should be a
> >> dataframe that takes up around 240 MB of memory (I have tested this by
> >> working with the data locally). However when I run this by connecting to
> >> the Azure blob storage it takes over an hour to run and it's clear it's
> >> downloading a lot more data than I would have thought. Given the file
> >> format is feather, which supports random access, I would have thought I
> >> would only have to download the 240 MB?
> >>
> >> Is there more going on in the background? Perhaps I am using this
> >> incorrectly?
> >>
> >> import adlfs
> >> import pyarrow.dataset as ds
> >>
> >> connection_string = ''
> >> fs = adlfs.AzureBlobFileSystem(connection_string=connection_string,)
> >>
> >> ds_f = ds.dataset("taxinyc/green/feather/", filesystem=fs, format='feather')
> >>
> >> df = (
> >>     ds_f
> >>     .scanner(
> >>         columns={  # Selections and Projections
> >>             'passengerCount': ds.field(('passengerCount'))*1000,
> >>             'tripDistance': ds.field(('tripDistance'))
> >>         },
> >>         filter=(ds.field('vendorID') == 1)
> >>     )
> >>     .to_table()
> >>     .to_pandas()
> >> )
> >>
> >> df.info()
> >>
> >> Question 3: How is memory mapping being applied?
> >> Does the Dataset API make use of memory mapping? Do I have the correct
> >> understanding that memory mapping is only intended for dealing with large
> >> data stored on a local file system, whereas data stored on a cloud file
> >> system in the feather format effectively cannot be memory mapped?
> >>
> >> Question 4: Projections
> >> I noticed in the scanner function when projecting a column I am unable to
> >> use any compute functions (I get a TypeError: only other expressions
> >> allowed as arguments), yet I am able to multiply this using standard Python
> >> arithmetic.
> >>
> >> 'passengerCount': ds.field(('passengerCount'))*1000,
> >>
> >> 'passengerCount': pc.multiply(ds.field(('passengerCount')),1000),
> >>
> >> Is this correct, or am I to process this using an iterator via record
> >> batches to do this out of core? Is it actually even doing it out of core
> >> with "*1000"?
> >>
> >> Thanks for your help in advance. I have been following the Arrow project
> >> for the last two years but have only recently decided to dive into it in
> >> depth to explore it for various use cases. I am particularly interested in
> >> the out-of-core data processing and the interaction with cloud storage to
> >> retrieve only a selection of data from feather files. Hopefully at some
> >> point when I have enough knowledge I can contribute to this amazing
> >> project.
> >>
> >> Kind regards
> >> Nikhil Makan