Sorry for the slow reply.
> This could be something on the Azure side but I find I am being bottlenecked
> on the download speed and have noticed if I spin up multiple Python sessions
> (or in my case interactive windows) I can increase my throughput. Hence I can
> download each year of the taxinyc dataset in separate interactive windows and
> increase my bandwidth consumed.
I would say that this is always a problem: the datasets API aims to
maximize resource usage within a single process, so needing multiple
processes to increase throughput suggests something is off. Now, it may
be a known or expected problem :)
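
For what it's worth, here is roughly how I would experiment with the I/O
thread count, starting from your snippet (a sketch, not a tested fix;
whether it helps at all against adlfs is something you would need to
measure):

```
import adlfs
import pyarrow as pa
import pyarrow.dataset as ds

# Raise the size of Arrow's I/O thread pool (default is 8). This is the
# knob you already tried; since these reads are network-bound, values
# well above 16 may still be worth testing.
pa.set_io_thread_count(32)
print(pa.io_thread_count())

fs = adlfs.AzureBlobFileSystem(account_name='azureopendatastorage')
dataset = ds.dataset('nyctlc/green/', filesystem=fs, format='parquet')

# use_threads=True (the default) lets the scanner work on multiple
# fragments concurrently using the CPU and I/O thread pools.
scanner = dataset.scanner(use_threads=True)
ds.write_dataset(scanner, 'taxinyc/green/feather/', format='feather')
```

If the requests are being serialized on the adlfs/Azure side, though, no
Arrow-side setting will change the throughput of a single process.
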
> Does the Dataset API make use of memory mapping? Do I have the correct
> understanding that memory mapping is only intended for dealing with large
> data stored on a local file system, whereas data stored on a cloud file
> system in the feather format effectively cannot be memory mapped?
I think the benefits of memory mapping are rather subtle and often
misleading. Datasets can make use of memory mapping for local
filesystems. Doing so will, at best, give a slight performance
benefit (avoiding a memcpy), but it will most likely decrease performance
(by introducing I/O where it is not expected), and it will have no
effect whatsoever on the amount of RAM used.
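
As a purely local illustration of what memory mapping does (and does not)
change, this is the kind of comparison I have in mind (a sketch; the file
name is made up, and as noted above this only applies to a local
filesystem, not to adlfs):

```
import pyarrow.feather as feather

# memory_map=True: for an uncompressed file the Arrow buffers can point
# directly into the mapped region instead of being copied (the avoided
# memcpy mentioned above); pages are faulted in as they are touched.
t1 = feather.read_table('green_2016.feather', memory_map=True)

# memory_map=False: the file is read into Arrow-allocated buffers up
# front. For most workloads the end-to-end difference is small.
t2 = feather.read_table('green_2016.feather', memory_map=False)
```
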
> This works as well, as noted previously, so I assume the python operators are
> mapped across, similar to what happens when you use the operators against a
> numpy or pandas series: it just executes a np.multiply or pd.multiply in the
> background.
Yes. However the functions that get mapped can sometimes be
surprising. Specifically, logical operations map to the _kleene
variation and arithmetic maps to the _checked variation. You can find
the implementation at [1]. For multiplication this boils down to:
```
@staticmethod
cdef Expression _expr_or_scalar(object expr):
    if isinstance(expr, Expression):
        return (<Expression> expr)
    return (<Expression> Expression._scalar(expr))

...

def __mul__(Expression self, other):
    other = Expression._expr_or_scalar(other)
    return Expression._call("multiply_checked", [self, other])
```
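
To make that concrete, both spellings below should build the same deferred
expression (a sketch; whether the pc.* functions accept expressions
directly depends on your pyarrow version):

```
import pyarrow.dataset as ds
import pyarrow.compute as pc

# Operator form: __mul__ routes to the multiply_checked kernel, as above.
expr_a = ds.field('passengerCount') * 1000

# Explicit form: wrap the literal in pc.scalar() so that both arguments
# are expressions rather than a mix of expression and plain int.
expr_b = pc.multiply_checked(ds.field('passengerCount'), pc.scalar(1000))

print(expr_a)  # something like: multiply_checked(passengerCount, 1000)
```

Either one can be passed as a projection in `columns=` and is evaluated
batch by batch during the scan, so it does not require materializing the
whole column first.
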
On Mon, Sep 19, 2022 at 12:52 AM Jacek Pliszka <[email protected]> wrote:
>
> Re 2: In the Python Azure SDK there is logic for partial blob reads:
>
> https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-query-blob
>
> However I was unable to use it as it does not support parquet files
> with decimal columns and these are the ones I have.
>
> BR
>
> J
>
> On Fri, Sep 16, 2022 at 2:26 AM Aldrin <[email protected]> wrote:
> >
> > For Question 2:
> > At a glance, I don't see anything in adlfs or Azure that is able to do
> > partial reads of a blob. If you're using block blobs, then likely you would
> > want to store blocks of your file as separate blocks of a blob, and then
> > you can do partial data transfers that way. I could be misunderstanding the
> > SDKs or how Azure stores data, but my guess is that a whole blob is
> > retrieved and then the local file is able to support partial, block-based
> > reads as you expect from local filesystems. You may be able to double check
> > how much data is being retrieved by looking at where adlfs is mounting your
> > blob storage.
> >
> > For Question 3:
> > You can memory-map remote files; it's just that every page fault will be
> > even more expensive than for local files. I am not sure how to tell the
> > dataset API to do memory mapping, and I'm not sure how well that would work
> > over adlfs.
> >
> > For Question 4:
> > Can you try using `pc.scalar(1000)` as shown in the first code excerpt in
> > [1]:
> >
> > >> x, y = pa.scalar(7.8), pa.scalar(9.3)
> > >> pc.multiply(x, y)
> > <pyarrow.DoubleScalar: 72.54>
> >
> > [1]:
> > https://arrow.apache.org/docs/python/compute.html#standard-compute-functions
> >
> > Aldrin Montana
> > Computer Science PhD Student
> > UC Santa Cruz
> >
> >
> > On Thu, Sep 8, 2022 at 8:26 PM Nikhil Makan <[email protected]> wrote:
> >>
> >> Hi There,
> >>
> >> I have been experimenting with Tabular Datasets for data that can be
> >> larger than memory and had a few questions related to what's going on
> >> under the hood and how to work with it (I understand it is still
> >> experimental).
> >>
> >> Question 1: Reading Data from Azure Blob Storage
> >> Now I know the filesystems don't fully support this yet, but there is an
> >> fsspec-compatible library (adlfs), which is shown in the filesystem
> >> example and which I have used. Example below with the NYC taxi dataset,
> >> where I am pulling the whole dataset through and writing it to disk in
> >> the feather format.
> >>
> >> import adlfs
> >> import pyarrow.dataset as ds
> >>
> >> fs = adlfs.AzureBlobFileSystem(account_name='azureopendatastorage')
> >>
> >> dataset = ds.dataset('nyctlc/green/', filesystem=fs, format='parquet')
> >>
> >> scanner = dataset.scanner()
> >> ds.write_dataset(scanner, 'taxinyc/green/feather/', format='feather')
> >>
> >> This could be something on the Azure side but I find I am being
> >> bottlenecked on the download speed and have noticed if I spin up multiple
> >> Python sessions (or in my case interactive windows) I can increase my
> >> throughput. Hence I can download each year of the taxinyc dataset in
> >> separate interactive windows and increase my bandwidth consumed. The
> >> tabular dataset documentation notes 'optionally parallel reading.' Do you
> >> know how I can control this? Or perhaps control the number of concurrent
> >> connections. Or has this got nothing to do with Arrow and sits purely
> >> on the Azure side? I have increased the io thread count from the default 8
> >> to 16 and saw no difference, but could still spin up more interactive
> >> windows to maximise bandwidth.
> >>
> >> Question 2: Reading Filtered Data from Azure Blob Storage
> >> Unfortunately I don't quite have a repeatable example here. However using
> >> the same data above, only this time I have each year as a feather file
> >> instead of a parquet file. I have uploaded this to my own Azure blob
> >> storage account.
> >> I am trying to read a subset of this data from the blob storage by
> >> selecting columns and filtering the data. The final result should be a
> >> dataframe that takes up around 240 MB of memory (I have tested this by
> >> working with the data locally). However when I run this by connecting to
> >> the Azure blob storage it takes over an hour to run and it's clear it's
> >> downloading a lot more data than I would have thought. Given the file
> >> format is feather, which supports random access, I would have thought I
> >> would only have to download the 240 MB?
> >>
> >> Is there more going on in the background? Perhaps I am using this
> >> incorrectly?
> >>
> >> import adlfs
> >> import pyarrow.dataset as ds
> >>
> >> connection_string = ''
> >> fs = adlfs.AzureBlobFileSystem(connection_string=connection_string,)
> >>
> >> ds_f = ds.dataset("taxinyc/green/feather/", filesystem=fs, format='feather')
> >>
> >> df = (
> >>     ds_f
> >>     .scanner(
> >>         columns={  # Selections and Projections
> >>             'passengerCount': ds.field(('passengerCount'))*1000,
> >>             'tripDistance': ds.field(('tripDistance'))
> >>         },
> >>         filter=(ds.field('vendorID') == 1)
> >>     )
> >>     .to_table()
> >>     .to_pandas()
> >> )
> >>
> >> df.info()
> >>
> >> Question 3: How is memory mapping being applied?
> >> Does the Dataset API make use of memory mapping? Do I have the correct
> >> understanding that memory mapping is only intended for dealing with large
> >> data stored on a local file system, whereas data stored on a cloud file
> >> system in the feather format effectively cannot be memory mapped?
> >>
> >> Question 4: Projections
> >> I noticed in the scanner function when projecting a column I am unable to
> >> use any compute functions (I get a TypeError: only other expressions
> >> allowed as arguments), yet I am able to multiply this using standard Python
> >> arithmetic.
> >>
> >> 'passengerCount': ds.field(('passengerCount'))*1000,
> >>
> >> 'passengerCount': pc.multiply(ds.field(('passengerCount')),1000),
> >>
> >> Is this correct, or am I to process this using an iterator via record
> >> batches to do this out of core? Is it actually even doing it out of core
> >> with "*1000"?
> >>
> >> Thanks for your help in advance. I have been following the Arrow project
> >> for the last two years but have only recently decided to dive into it in
> >> depth to explore it for various use cases. I am particularly interested in
> >> the out-of-core data processing and the interaction with cloud storage to
> >> retrieve only a selection of data from feather files. Hopefully at some
> >> point when I have enough knowledge I can contribute to this amazing
> >> project.
> >>
> >> Kind regards
> >> Nikhil Makan