[arrow] branch master updated: ARROW-2066: [Python] Document using pyarrow with Azure Blob Store

uwe Fri, 23 Feb 2018 04:39:07 -0800

This is an automated email from the ASF dual-hosted git repository.

uwe pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git



The following commit(s) were added to refs/heads/master by this push:
     new 3e3f7c2  ARROW-2066: [Python] Document using pyarrow with Azure Blob Store
3e3f7c2 is described below

commit 3e3f7c2c583054ec226cb5909a9368f920eae06c
Author: rrussell <rruss...@adobe.com>
AuthorDate: Fri Feb 23 13:38:38 2018 +0100

    ARROW-2066: [Python] Document using pyarrow with Azure Blob Store
    
    Original question:
    
    https://github.com/apache/arrow/issues/1510
    
    Improvement story:
    
    https://issues.apache.org/jira/browse/ARROW-2066
    
    Author: rrussell <rruss...@adobe.com>
    
    Closes #1544 from rjrussell77/arrow-2066-docs-azure-parquet and squashes the following commits:
    
    0d3972c <rrussell> Add missing byte_stream declaration/assignment
    a5addb0 <rrussell> use more common 'df' instead of 'pd' for pandas dataframe variable, remove head() call and instead use comment to indicate generic fill-in code, add comment re: stream closure in finally block
    f056888 <rrussell> Clean up white space
    1fe9866 <rrussell> Add try/except/finally blocks to ensure closure of the byte stream
    36f7378 <rrussell> Replace usage of tempfile buffer with BytesIO stream
    654a6f9 <rrussell> Add back original Notes bullets
    5d450fc <rrussell> fix
    4770de1 <rrussell> fix
    4c75824 <rrussell> Try moving the bullet to remove italics
    803cbca <rrussell> Use asterisks for list
    051b91d <rrussell> Fix formatting
    a015deb <rrussell> Fix formatting
    1815816 <rrussell> fix formatting
    599e04f <rrussell> Fix formatting
    34c5a16 <rrussell> Fix formatting
    6fd9f70 <rrussell> remove inline edits
    83a38c4 <rrussell> Try to fix italics
    f130e04 <rrussell> Change wording a bit
    718bd94 <rrussell> Fix unintended italics
    7bab640 <rrussell> Refine indented bullet and fix title underline
    26a53e4 <rrussell> Fix formatting
    5fbea89 <rrussell> Add a note about keys and add polish
    5365a9c <rrussell> Add helpful notes about Azure properties
    6841116 <rrussell> Polish the formatting
    eb643e4 <rrussell> ARROW-2066 Add documentation for Arrow/Azure/Parquet solution
---
 python/doc/source/parquet.rst | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst
index ac56520..3d01e1d 100644
--- a/python/doc/source/parquet.rst
+++ b/python/doc/source/parquet.rst
@@ -246,3 +246,44 @@ throughput:
 
    pq.read_table(where, nthreads=4)
    pq.ParquetDataset(where).read(nthreads=4)
+
+Reading a Parquet File from Azure Blob storage
+----------------------------------------------
+
+The code below shows how to use Azure's storage SDK along with pyarrow to read
+a Parquet file into a pandas DataFrame.
+This is suitable for executing inside a Jupyter notebook running on a Python 3
+kernel.
+
+Dependencies: 
+
+* python 3.6.2 
+* azure-storage 0.36.0 
+* pyarrow 0.8.0 
+
+.. code-block:: python
+
+   import pyarrow.parquet as pq
+   from io import BytesIO
+   from azure.storage.blob import BlockBlobService
+
+   account_name = '...'
+   account_key = '...'
+   container_name = '...'
+   parquet_file = 'mysample.parquet'
+
+   byte_stream = BytesIO()
+   block_blob_service = BlockBlobService(account_name=account_name, account_key=account_key)
+   try:
+      block_blob_service.get_blob_to_stream(container_name=container_name, blob_name=parquet_file, stream=byte_stream)
+      df = pq.read_table(source=byte_stream).to_pandas()
+      # Do work on df ...
+   finally:
+      # Add finally block to ensure closure of the stream
+      byte_stream.close()
+
+Notes:
+
+* The ``account_key`` can be found under ``Settings -> Access keys`` in the Microsoft Azure portal for a given storage account
+* The code above works for a container with private access, Lease State = Available, Lease Status = Unlocked
+* The parquet file was Blob Type = Block blob

-- 
To stop receiving notification emails like this one, please contact
u...@apache.org.
