This is an automated email from the ASF dual-hosted git repository.

uwe pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git

The following commit(s) were added to refs/heads/master by this push:
     new 3e3f7c2  ARROW-2066: [Python] Document using pyarrow with Azure Blob Store
3e3f7c2 is described below

commit 3e3f7c2c583054ec226cb5909a9368f920eae06c
Author: rrussell <rruss...@adobe.com>
AuthorDate: Fri Feb 23 13:38:38 2018 +0100

    ARROW-2066: [Python] Document using pyarrow with Azure Blob Store

    Original question: https://github.com/apache/arrow/issues/1510
    Improvement story: https://issues.apache.org/jira/browse/ARROW-2066

    Author: rrussell <rruss...@adobe.com>

    Closes #1544 from rjrussell77/arrow-2066-docs-azure-parquet and squashes the following commits:

    0d3972c <rrussell> Add missing byte_stream declaration/assignment
    a5addb0 <rrussell> use more common 'df' instead of 'pd' for pandas dataframe variable, remove head() call and instead use comment to indicate generic fill-in code, add comment re: stream closure in finally block
    f056888 <rrussell> Clean up white space
    1fe9866 <rrussell> Add try/except/finally blocks to ensure closure of the byte stream
    36f7378 <rrussell> Replace usage of tempfile buffer with BytesIO stream
    654a6f9 <rrussell> Add back original Notes bullets
    5d450fc <rrussell> fix
    4770de1 <rrussell> fix
    4c75824 <rrussell> Try moving the bullet to remove italics
    803cbca <rrussell> Use asterisks for list
    051b91d <rrussell> Fix formatting
    a015deb <rrussell> Fix formatting
    1815816 <rrussell> fix formatting
    599e04f <rrussell> Fix formatting
    34c5a16 <rrussell> Fix formatting
    6fd9f70 <rrussell> remove inline edits
    83a38c4 <rrussell> Try to fix italics
    f130e04 <rrussell> Change wording a bit
    718bd94 <rrussell> Fix unintended italics
    7bab640 <rrussell> Refine indented bullet and fix title underline
    26a53e4 <rrussell> Fix formatting
    5fbea89 <rrussell> Add a note about keys and add polish
    5365a9c <rrussell> Add helpful notes about Azure properties
    6841116 <rrussell> Polish the formatting
    eb643e4 <rrussell> ARROW-2066 Add documentation for Arrow/Azure/Parquet solution
---
 python/doc/source/parquet.rst | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst
index ac56520..3d01e1d 100644
--- a/python/doc/source/parquet.rst
+++ b/python/doc/source/parquet.rst
@@ -246,3 +246,44 @@ throughput:
 
    pq.read_table(where, nthreads=4)
    pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+----------------------------------------------
+
+The code below shows how to use Azure's storage sdk along with pyarrow to read
+a parquet file into a Pandas dataframe.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies:
+
+* python 3.6.2
+* azure-storage 0.36.0
+* pyarrow 0.8.0
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   from io import BytesIO
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   byte_stream = BytesIO()
+   block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key)
+   try:
+       block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=byte_stream)
+       df = pq.read_table(source=byte_stream).to_pandas()
+       # Do work on df ...
+   finally:
+       # Add finally block to ensure closure of the stream
+       byte_stream.close()
+
+Notes:
+
+* The ``account_key`` can be found under ``Settings -> Access keys`` in the Microsoft Azure portal for a given container
+* The code above works for a container with private access, Lease State = Available, Lease Status = Unlocked
+* The parquet file was Blob Type = Block blob
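
The try/finally in the documented snippet exists only to guarantee that the in-memory stream is closed. Since `BytesIO` is a context manager, the same guarantee can be sketched with a `with` block. The version below is a minimal illustration, not part of the commit: `read_blob`, `download_to_stream`, `parse`, and `fake_download` are hypothetical names, and the fake downloader stands in for the Azure call so the sketch runs without credentials or pyarrow installed.

```python
from io import BytesIO


def read_blob(download_to_stream, parse):
    """Download into an in-memory stream, parse it, and always close the stream.

    download_to_stream: callable that writes bytes into a binary stream
        (hypothetical; with azure-storage 0.36.0 it could wrap
        BlockBlobService.get_blob_to_stream).
    parse: callable that reads a binary stream and returns a result
        (for example one that calls pq.read_table(source=stream).to_pandas()).
    """
    with BytesIO() as byte_stream:  # closed on exit, even if parse raises
        download_to_stream(byte_stream)
        byte_stream.seek(0)  # rewind so parse starts at the first byte
        return parse(byte_stream)


# Stand-in downloader (an assumption for illustration only).
def fake_download(stream):
    stream.write(b"PAR1 example bytes")


result = read_blob(fake_download, lambda stream: stream.read())
```

With the names from the diff, the two callables would presumably be `lambda s: block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=s)` and `lambda s: pq.read_table(source=s).to_pandas()`; the DataFrame is materialized before the stream closes, so it stays usable afterwards.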