Is Hierarchical Namespace [1] enabled on the storage account? When HNS is not
enabled, or when operations using ADLFS [2] fail, the Azure file system
implementation falls back to Azure Blob Storage operations.

I have a draft on my machine of a change that would add a configuration
option to *force* the use of ADLFS and fail instead of falling back to Azure
Blobs when ADLFS operations fail. Any specific reason for wanting Azure Blobs
to never be used?
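For illustration only, a rough sketch of what that could look like from the
caller's side. The field name below is hypothetical; it comes from the draft
described above and is not part of arrow::fs::AzureOptions today:

    arrow::fs::AzureOptions options;
    options.account_name = "mytest";  // placeholder
    // Hypothetical field (not in Arrow today): when true, a failed ADLFS
    // operation would surface an error instead of being retried against
    // the Azure Blobs API.
    options.force_data_lake_api = true;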
__
Felipe

[1] https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace
[2] ADLFS = Azure Data Lake File System Gen 2

On Wed, Jul 10, 2024 at 6:34 PM Sutou Kouhei <k...@clear-code.com> wrote:
> Hi,
>
> > azureOptions.blob_storage_authority = ".dfs.core.windows.net";
> >     // If I don't do this, then blob.core.windows.net is used;
> >     // I want dfs not blob, so... not certain why that happens either
>
> This is strange. In general, you should not do this.
> AzureFS uses both the blob storage API and the data lake
> storage API. If the data lake storage API is available,
> AzureFS uses it automatically. So you should not change
> blob_storage_authority.
>
> If you don't have this line, what happens?
>
>
> Thanks,
> --
> kou
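To put kou's advice in code form: a minimal configuration sketch (values are
placeholders; client-secret auth assumed) that leaves dfs_storage_authority
and blob_storage_authority at their defaults, so AzureFS can pick the data
lake API automatically:

    arrow::fs::AzureOptions options;
    options.account_name = "mytest";  // placeholder
    // Note: no dfs_storage_authority / blob_storage_authority overrides.
    // The defaults resolve to <account>.blob.core.windows.net and
    // <account>.dfs.core.windows.net, and AzureFS prefers the data lake
    // API when the account supports it.
    arrow::Status st = options.ConfigureClientSecretCredential(
        "tenant-id", "client-id", "client-secret");  // placeholders
    if (!st.ok()) { /* handle the error */ }
    auto fsResult = arrow::fs::AzureFileSystem::Make(options);
    if (!fsResult.ok()) { /* handle the error */ }
    auto fs = fsResult.ValueOrDie();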
> In
> <dm3pr05mb1054334eeaeae4a95805de322f3...@dm3pr05mb10543.namprd05.prod.outlook.com>
>   "Using the new Azure filesystem object (C++)" on Wed, 10 Jul 2024 16:58:52 +0000,
>   "Jerry Adair via user" <user@arrow.apache.org> wrote:
>
> > Hi-
> >
> > I am attempting to use the new Azure filesystem object in C++,
> > Arrow/Parquet version 16.0.0. I already have code that works for GCS
> > and AWS/S3, and I have been waiting for quite a while to see the new
> > Azure filesystem object released. Now that it has been released in
> > version 16.0.0, I have been trying to use it, without success. I
> > presumed that it would work in the same manner as the GCS and S3/AWS
> > filesystem objects: you create the object, then you use it just as you
> > would the other filesystem objects. Note that I am not using Arrow
> > methods to read/write the data but rather the Parquet methods. This
> > works for local, GCS and S3/AWS. However, I cannot open a file on
> > Azure. It seems like no matter which authentication method I try, it
> > doesn't work, and I get different results depending on which auth
> > approach I take (client secret versus account key, etc.). Here is a
> > code summary of what I am trying to do:
> >
> >     arrow::fs::AzureOptions azureOptions;
> >     arrow::Status configureStatus = arrow::Status::OK();
> >
> >     // exact values obfuscated
> >     azureOptions.account_name = "mytest";
> >     azureOptions.dfs_storage_authority = ".dfs.core.windows.net";
> >     // If I don't do this, then blob.core.windows.net is used;
> >     // I want dfs not blob, so... not certain why that happens either
> >     azureOptions.blob_storage_authority = ".dfs.core.windows.net";
> >
> >     std::string client_id = "3f061894-blah";
> >     std::string client_secret = "2c796e9eblah";
> >     std::string tenant_id = "b1c14d5c-blah";
> >     //std::string account_key = "flMhWgNts+i/blah==";
> >
> >     //configureStatus = azureOptions.ConfigureAccountKeyCredential( account_key );
> >     configureStatus = azureOptions.ConfigureClientSecretCredential( tenant_id, client_id, client_secret );
> >     //configureStatus = azureOptions.ConfigureManagedIdentityCredential( client_id );
> >     if( false == configureStatus.ok() )
> >     {
> >         // Uh-oh, throw
> >     }
> >
> >     std::shared_ptr<arrow::fs::AzureFileSystem> azureFileSystem;
> >     arrow::Result<std::shared_ptr<arrow::fs::AzureFileSystem>> azureFileSystemResult =
> >         arrow::fs::AzureFileSystem::Make( azureOptions );
> >     if( true == azureFileSystemResult.ok() )
> >     {
> >         azureFileSystem = azureFileSystemResult.ValueOrDie();
> >     }
> >     else
> >     {
> >         // Uh-oh, throw
> >     }
> >
> >     //const std::string path( "parquet/ParquetFiles/plain.parquet" );
> >     const std::string path( "parquet/ParquetFiles/plain.parquet" );
> >     std::shared_ptr<arrow::io::RandomAccessFile> arrowFile;
> >     std::cout << "1\n";
> >     arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> openResult =
> >         azureFileSystem->OpenInputFile( path );
> >     std::cout << "2\n";
> >
> > And that is where things run off the rails. At this point, all I want
> > to do is open the input file and create a Parquet file reader like so:
> >
> >     std::unique_ptr<parquet::ParquetFileReader> parquet_reader =
> >         parquet::ParquetFileReader::Open( arrowFile );
> >
> > Then go about my business of reading/writing Parquet data as per
> > normal; ergo, just as I do for the other filesystem objects. But the
> > OpenInputFile() method fails for the Azure use case. If I attempt the
> > account key configuration, then the error I see is:
> >
> >     adls_read
> >     Parquet file read commencing...
> >     1
> >     Parquet read error: map::at
> >
> > where the "1" is just a marker to show how far I got in the process of
> > reading a pre-existing Parquet file from the Azure server; ergo, a
> > low-brow means of debugging. The cout is shown above. I don't get to
> > "2", obviously.
> >
> > When attempting the client secret credential auth, I see the following
> > failure:
> >
> >     adls_read
> >     Parquet file read commencing...
> >     1
> >     Parquet read error: GetToken(): error response: 401 Unauthorized
> >
> > Then when attempting the Managed Identity auth configuration, I get
> > the following:
> >
> >     adls_read
> >     Parquet file read commencing...
> >     1
> >     ^C
> >
> > where the process just hangs and I have to interrupt it. Note that I
> > didn't have this level of difficulty when I implemented our support
> > for GCS and S3/AWS; those were relatively straightforward. Azure,
> > however, has been more difficult; this should just work. You create
> > the filesystem object, then you are supposed to be able to use it in
> > the same manner as any other Arrow filesystem object. However, I can't
> > open a file, and I suspect it is due to some type of handshaking issue
> > with Azure. Azure has all of these moving parts: tenant ID,
> > application/client ID, client secret, object ID (which we don't use in
> > this case), and the list goes on.
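One debugging note on the output above: "map::at" looks like the what() of a
C++ exception surfaced by a catch block, which hides the underlying
filesystem error. A small sketch, using only the APIs already shown in this
thread, that checks the arrow::Result before handing the file to Parquet:

    arrow::Result<std::shared_ptr<arrow::io::RandomAccessFile>> openResult =
        azureFileSystem->OpenInputFile( path );
    if( false == openResult.ok() )
    {
        // Status::ToString() includes the detail (endpoint, HTTP status,
        // auth error) that a generic exception message does not.
        std::cerr << "OpenInputFile failed: "
                  << openResult.status().ToString() << std::endl;
        // throw / return here instead of continuing
    }
    std::shared_ptr<arrow::io::RandomAccessFile> arrowFile = openResult.ValueOrDie();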
> > Finally, I saw this in the azurefs.h header at line 102:
> >
> >     // TODO(GH-38598): Add support for more auth methods.
> >     // std::string connection_string;
> >     // std::string sas_token;
> >
> > But it seemed clear to me that this was referring to other auth
> > methods than those that have been implemented thus far (ergo client
> > secret, account key, etc.). Am I correct?
> >
> > So my questions are:
> >
> > 1. Any ideas where I am going wrong here?
> > 2. Has anyone else used the Azure filesystem object?
> > 3. Has it worked for you?
> > 4. If so, what was your approach?
> >
> > Note that I did peruse azurefs_test.cc for examples. I did see various
> > approaches. One involved invoking the MakeDataLakeServiceClient()
> > method. It wasn't clear whether I needed to do that or not, but then I
> > saw that this is done in the private implementation of the
> > AzureFileSystem's Make() method, thus:
> >
> >     static Result<std::unique_ptr<AzureFileSystem::Impl>> Make(AzureOptions options,
> >                                                                io::IOContext io_context) {
> >       auto self = std::unique_ptr<AzureFileSystem::Impl>(
> >           new AzureFileSystem::Impl(std::move(options), std::move(io_context)));
> >       ARROW_ASSIGN_OR_RAISE(self->blob_service_client_,
> >                             self->options_.MakeBlobServiceClient());
> >       ARROW_ASSIGN_OR_RAISE(self->datalake_service_client_,
> >                             self->options_.MakeDataLakeServiceClient());
> >       return self;
> >     }
> >
> > So it seemed like I wouldn't need to do it separately.
> >
> > Anyway, I need to get this working ASAP, so I am open to feedback.
> > I'll continue plugging away.
> >
> > Thanks!
> > Jerry
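As the Impl::Make() excerpt above shows, AzureFileSystem::Make() already
constructs both the blob and the data lake service clients, so nothing needs
to be created separately. For completeness, a consolidated sketch of the
intended flow, built only from the calls already shown in this thread
(account, credentials, and path are placeholders):

    #include <iostream>
    #include <memory>

    #include <arrow/filesystem/azurefs.h>
    #include <arrow/io/interfaces.h>
    #include <parquet/file_reader.h>

    int main() {
      arrow::fs::AzureOptions options;
      options.account_name = "mytest";  // placeholder
      arrow::Status st = options.ConfigureClientSecretCredential(
          "tenant-id", "client-id", "client-secret");  // placeholders
      if (!st.ok()) { std::cerr << st.ToString() << "\n"; return 1; }

      auto fsResult = arrow::fs::AzureFileSystem::Make(options);
      if (!fsResult.ok()) { std::cerr << fsResult.status().ToString() << "\n"; return 1; }
      std::shared_ptr<arrow::fs::AzureFileSystem> fs = fsResult.ValueOrDie();

      // Paths are "container/path/to/object" relative to the account.
      auto openResult = fs->OpenInputFile("parquet/ParquetFiles/plain.parquet");
      if (!openResult.ok()) { std::cerr << openResult.status().ToString() << "\n"; return 1; }

      // ParquetFileReader::Open throws on failure, so keep it in a try block.
      try {
        std::unique_ptr<parquet::ParquetFileReader> reader =
            parquet::ParquetFileReader::Open(openResult.ValueOrDie());
        std::cout << "rows: " << reader->metadata()->num_rows() << "\n";
      } catch (const std::exception& e) {
        std::cerr << "Parquet read error: " << e.what() << "\n";
        return 1;
      }
      return 0;
    }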