Re Question 2. In the Python Azure SDK there is logic for partial blob reads: https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-query-blob
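For reference, a minimal sketch of what a query_blob call might look like (the connection string, container/blob names, and the query text are placeholders):

    from azure.storage.blob import BlobClient

    # Placeholder connection details -- substitute real values.
    blob = BlobClient.from_connection_string(
        conn_str="<connection-string>",
        container_name="taxinyc",
        blob_name="green/2019.parquet",
    )

    # query_blob runs the query on the service side, so only the selected
    # columns/rows are transferred. The input serialization (CSV, JSON,
    # Parquet) is configured through keyword arguments such as blob_format.
    reader = blob.query_blob(
        "SELECT passengerCount, tripDistance FROM BlobStorage WHERE vendorID = 1"
    )
    data = reader.readall()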
However I was unable to use it, as it does not support Parquet files with decimal columns, and these are the ones I have.

BR
J

On Fri, Sep 16, 2022 at 02:26, Aldrin <[email protected]> wrote:
>
> For Question 2:
> At a glance, I don't see anything in adlfs or azure that is able to do
> partial reads of a blob. If you're using block blobs, then likely you would
> want to store blocks of your file as separate blocks of a blob, and then you
> can do partial data transfers that way. I could be misunderstanding the SDKs
> or how Azure stores data, but my guess is that a whole blob is retrieved and
> then the local file is able to support partial, block-based reads as you
> expect from local filesystems. You may be able to double check how much data
> is being retrieved by looking at where adlfs is mounting your blob storage.
>
> For Question 3:
> You can memory map remote files, it's just that every page fault will be even
> more expensive than for local files. I am not sure how to tell the dataset
> API to do memory mapping, and I'm not sure how well that would work over
> adlfs.
>
> For Question 4:
> Can you try using `pc.scalar(1000)` as shown in the first code excerpt in [1]:
>
> >> x, y = pa.scalar(7.8), pa.scalar(9.3)
> >> pc.multiply(x, y)
> <pyarrow.DoubleScalar: 72.54>
>
> [1]: https://arrow.apache.org/docs/python/compute.html#standard-compute-functions
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
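To make the partial-transfer idea from the Question 2 answer above concrete, a minimal sketch of a ranged read at the blob level (the connection details and the byte range are placeholders):

    from azure.storage.blob import BlobClient

    # Placeholder connection details -- substitute real values.
    blob = BlobClient.from_connection_string(
        conn_str="<connection-string>",
        container_name="taxinyc",
        blob_name="green/feather/2019.feather",
    )

    # download_blob takes a byte offset and length, so only that slice of the
    # blob is transferred; the 4 MiB range here is arbitrary.
    stream = blob.download_blob(offset=0, length=4 * 1024 * 1024)
    chunk = stream.readall()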
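To connect the Question 4 suggestion above back to the scanner projection in the message quoted below, a sketch of the projection with the literal wrapped in pc.scalar (whether this avoids the TypeError depends on the pyarrow version):

    import pyarrow.compute as pc
    import pyarrow.dataset as ds

    # With both arguments given as expressions, the multiplication is built
    # as a lazy dataset expression rather than evaluated eagerly.
    columns = {
        'passengerCount': pc.multiply(ds.field('passengerCount'), pc.scalar(1000)),
        'tripDistance': ds.field('tripDistance'),
    }

    # This dict would then be passed to Dataset.scanner(columns=columns, ...)
    # exactly as in the original snippet.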
> On Thu, Sep 8, 2022 at 8:26 PM Nikhil Makan <[email protected]> wrote:
>>
>> Hi There,
>>
>> I have been experimenting with Tabular Datasets for data that can be larger
>> than memory and had a few questions related to what's going on under the
>> hood and how to work with it (I understand it is still experimental).
>>
>> Question 1: Reading Data from Azure Blob Storage
>> Now I know the filesystems don't fully support this yet, but there is an
>> fsspec-compatible library (adlfs), which is shown in the filesystem example
>> and which I have used. Example below with the NYC taxi dataset, where I am
>> pulling the whole dataset through and writing it to disk in the Feather
>> format.
>>
>> import adlfs
>> import pyarrow.dataset as ds
>>
>> fs = adlfs.AzureBlobFileSystem(account_name='azureopendatastorage')
>>
>> dataset = ds.dataset('nyctlc/green/', filesystem=fs, format='parquet')
>>
>> scanner = dataset.scanner()
>> ds.write_dataset(scanner, 'taxinyc/green/feather/', format='feather')
>>
>> This could be something on the Azure side, but I find I am being bottlenecked
>> on the download speed and have noticed that if I spin up multiple Python
>> sessions (or in my case interactive windows) I can increase my throughput.
>> Hence I can download each year of the NYC taxi dataset in a separate
>> interactive window and increase the bandwidth consumed. The Tabular Datasets
>> documentation notes 'optionally parallel reading'. Do you know how I can
>> control this? Or perhaps control the number of concurrent connections? Or
>> does this have nothing to do with Arrow and sit purely on the Azure side?
>> I have increased the IO thread count from the default 8 to 16 and saw no
>> difference, but could still spin up more interactive windows to maximise
>> bandwidth.
>>
>> Question 2: Reading Filtered Data from Azure Blob Storage
>> Unfortunately I don't quite have a repeatable example here. However, I am
>> using the same data as above, only this time I have each year as a Feather
>> file instead of a Parquet file. I have uploaded this to my own Azure Blob
>> Storage account.
>>
>> I am trying to read a subset of this data from the blob storage by selecting
>> columns and filtering the data. The final result should be a dataframe that
>> takes up around 240 MB of memory (I have tested this by working with the
>> data locally). However, when I run this by connecting to the Azure blob
>> storage it takes over an hour to run, and it's clear it's downloading a lot
>> more data than I would have thought. Given the file format is Feather, which
>> supports random access, I would have thought I would only have to download
>> the 240 MB?
>>
>> Is there more going on in the background? Perhaps I am using this
>> incorrectly?
>>
>> import adlfs
>> import pyarrow.dataset as ds
>>
>> connection_string = ''
>> fs = adlfs.AzureBlobFileSystem(connection_string=connection_string)
>>
>> ds_f = ds.dataset("taxinyc/green/feather/", filesystem=fs, format='feather')
>>
>> df = (
>>     ds_f
>>     .scanner(
>>         columns={  # selections and projections
>>             'passengerCount': ds.field('passengerCount') * 1000,
>>             'tripDistance': ds.field('tripDistance')
>>         },
>>         filter=(ds.field('vendorID') == 1)
>>     )
>>     .to_table()
>>     .to_pandas()
>> )
>>
>> df.info()
>>
>> Question 3: How is memory mapping being applied?
>> Does the Dataset API make use of memory mapping? Do I have the correct
>> understanding that memory mapping is only intended for dealing with large
>> data stored on a local file system, whereas data stored on a cloud file
>> system in the Feather format effectively cannot be memory mapped?
>>
>> Question 4: Projections
>> I noticed that in the scanner function, when projecting a column, I am unable
>> to use any compute functions (I get a TypeError: only other expressions
>> allowed as arguments), yet I am able to multiply the column using standard
>> Python arithmetic.
>>
>> 'passengerCount': ds.field('passengerCount') * 1000,
>>
>> 'passengerCount': pc.multiply(ds.field('passengerCount'), 1000),
>>
>> Is this correct, or am I to process this using an iterator via record batch
>> to do this out of core? Is it actually even doing it out of core with
>> "*1000"?
>>
>> Thanks for your help in advance. I have been following the Arrow project for
>> the last two years but have only recently decided to dive into it in depth
>> to explore it for various use cases. I am particularly interested in
>> out-of-core data processing and the interaction with cloud storage to
>> retrieve only a selection of data from Feather files. Hopefully at some
>> point, when I have enough knowledge, I can contribute to this amazing
>> project.
>>
>> Kind regards
>> Nikhil Makan
