For the record,

The way we ended up fixing this is to create a new XML file for each root-level container that we use (tentatively fs.xml). This fs.xml looks like the following:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>wasb://<containername>@<accountname>.blob.core.windows.net/</value>
    </property>
</configuration>

We then include core-site.xml, hdfs-site.xml and fs.xml in the 'Hadoop Configuration Resources' property, ensuring that fs.xml comes last. This overrides the fs.defaultFS value set in core-site.xml.
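
For reference, the 'Hadoop Configuration Resources' value ends up being a comma-separated list with fs.xml last, something along these lines (the paths here are just an example):

    /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml,/etc/nifi/conf/fs.xml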

Thanks everyone for the help,
Austin

On 03/28/2017 06:11 PM, Austin Heyne wrote:
Thanks Bryan,

We're only working with one account here but with multiple root-level containers, e.g.

wasb://<container1>@<accountname>.blob.core.windows.net/
wasb://<container2>@<accountname>.blob.core.windows.net/
wasb://<container3>@<accountname>.blob.core.windows.net/

The thing that stands out to me the most is why the defaultFS would need to be set at all if we're always providing complete wasb://... paths. It almost seems like a bug or oversight.

If anyone has any input on how we could work around this please let me know.

Thanks for your help,
Austin

On 03/28/2017 04:39 PM, Bryan Bende wrote:
Austin,

I think you are correct that it's <containername>@<accountname>, I
hadn't looked at this config in a long time and was reading too
quickly before :)

That would line up with the other property
fs.azure.account.key.<accountname>.blob.core.windows.net where you
specify the key for that account.

I have no idea if this will work, but let's say you had three different
WASB file systems, presumably each with their own account name and
key; you might be able to define these in core-site.xml:

  <property>
    <name>fs.azure.account.key.ACCOUNT1.blob.core.windows.net</name>
    <value>KEY1</value>
  </property>

  <property>
    <name>fs.azure.account.key.ACCOUNT2.blob.core.windows.net</name>
    <value>KEY2</value>
  </property>

  <property>
    <name>fs.azure.account.key.ACCOUNT3.blob.core.windows.net</name>
    <value>KEY3</value>
  </property>

Then in your HDFS processor in NiFi you point at this core-site.xml
and use a specific directory like
wasb://<container>@ACCOUNT3.blob.core.windows.net/<path>, and I'm hoping
it would know how to use the key for ACCOUNT3.

Not really sure if that helps your situation.

-Bryan


On Tue, Mar 28, 2017 at 4:14 PM, Austin Heyne <[email protected]> wrote:
Bryan,

So I initially didn't think much of it (assumed it was a typo, etc.), but
you've said that the access URL for WASB that you've been using is
wasb://YOUR_USER@YOUR_HOST/. However, this has never worked for us, and I'm
wondering if we have a different configuration somewhere. What we have to
use is wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>,
which seems to be in line with the Azure blob storage GUI and is what is
outlined here [1]. Is there some other way this connector is being set up?
It would make much more sense using your access pattern, as then each
container wouldn't need to have its own core-site.xml.

Thanks,
Austin

[1a] https://hadoop.apache.org/docs/current/hadoop-azure/index.html#Accessing_wasb_URLs
[1b] https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage




On 03/28/2017 03:55 PM, Bryan Bende wrote:
Austin,

I believe the default FS is only used when you write to a path that
doesn't specify the filesystem. Meaning, if you set the directory of
PutHDFS to /data then it will use the default FS, but if you specify
wasb://user@wasb2/data then it will go to /data in a different
filesystem.
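
For example (values made up), if core-site.xml contained:

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020/</value>
  </property>

then a PutHDFS Directory of /data would resolve to hdfs://namenode:8020/data,
while a Directory of wasb://user@wasb2/data would be sent to that WASB
filesystem regardless of the default.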

The problem here is that I don't see a way to specify different keys
for each WASB filesystem in the core-site.xml.

Admittedly I have never tried to set up something like this with many
different filesystems.

-Bryan


On Tue, Mar 28, 2017 at 3:50 PM, Austin Heyne <[email protected]> wrote:
Hi Andre,

Yes, I'm aware of that configuration property, it's what I have been using
to set the core-site.xml and hdfs-site.xml. For testing this I didn't modify
the core-site located in the HADOOP_CONF_DIR but rather copied and modified
it and then pointed the processor to the copy. The problem with this is that
we'll end up with a large number of core-site.xml copies that will all have
to be maintained separately. Ideally we'd be able to specify the defaultFS
in the processor config or have the processor behave like the HDFS command
line tools. The command line tools don't require the defaultFS to be set to
a wasb URL in order to use wasb URLs.
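(For example, something like
'hdfs dfs -ls wasb://<containername>@<accountname>.blob.core.windows.net/<path>'
works from the shell without fs.defaultFS pointing at wasb; the names there
are just placeholders.)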

The key idea here is long-term maintainability and using Ambari to maintain
the configuration. If we need to change any other setting in the
core-site.xml, we'd have to change it in a bunch of different files manually.

Thanks,
Austin


On 03/28/2017 03:34 PM, Andre wrote:

Austin,

Perhaps that wasn't explicit, but the settings don't need to be system-wide;
instead, the defaultFS may be changed just for a particular processor, while
the other processors may use their own configurations.

The *HDFS processor documentation mentions it allows you to set particular
Hadoop configurations:

" A file or comma separated list of files which contains the Hadoop file system configuration. Without this, Hadoop will search the classpath for
a
'core-site.xml' and 'hdfs-site.xml' file or will revert to a default
configuration"

Have you tried using this field to point to a file as described by Bryan?

Cheers

On 29 Mar 2017 05:21, "Austin Heyne" <[email protected]> wrote:

Thanks Bryan,

Working with the configuration you sent, what I needed to change was to set
the fs.defaultFS to the wasb URL that we're working from. Unfortunately this
is a less than ideal solution, since we'll be pulling files from multiple
wasb URLs and ingesting them into an Accumulo datastore. Changing the
defaultFS, I'm pretty certain, would mess with our local HDFS/Accumulo
install. In addition, we're trying to maintain all of this configuration
with Ambari, which from what I can tell only supports one core-site
configuration file.

Is the only solution here to maintain multiple core-site.xml files, or is
there another way we can configure this?

Thanks,

Austin



On 03/28/2017 01:41 PM, Bryan Bende wrote:
Austin,

Can you provide the full error message and stacktrace for the
IllegalArgumentException from nifi-app.log?

When you start the processor it creates a FileSystem instance based on
the config files provided to the processor, which in turn causes all
of the corresponding classes to load.

I'm not that familiar with Azure, but if "Azure blob store" is WASB,
then I have successfully done the following...

In core-site.xml:

<configuration>

  <property>
    <name>fs.defaultFS</name>
    <value>wasb://YOUR_USER@YOUR_HOST/</value>
  </property>

  <property>
    <name>fs.azure.account.key.nifi.blob.core.windows.net</name>
    <value>YOUR_KEY</value>
  </property>

  <property>
    <name>fs.AbstractFileSystem.wasb.impl</name>
    <value>org.apache.hadoop.fs.azure.Wasb</value>
  </property>

  <property>
    <name>fs.wasb.impl</name>
    <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
  </property>

  <property>
    <name>fs.azure.skip.metrics</name>
    <value>true</value>
  </property>

</configuration>

In the 'Additional Classpath Resources' property of an HDFS processor, point
to a directory with:

azure-storage-2.0.0.jar
commons-codec-1.6.jar
commons-lang3-3.3.2.jar
commons-logging-1.1.1.jar
guava-11.0.2.jar
hadoop-azure-2.7.3.jar
httpclient-4.2.5.jar
httpcore-4.2.4.jar
jackson-core-2.2.3.jar
jsr305-1.3.9.jar
slf4j-api-1.7.5.jar
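
(The value of that property is just the path to the directory holding these
jars, e.g. something like /opt/nifi/azure-libs; that path itself is only an
example.)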


Thanks,

Bryan


On Tue, Mar 28, 2017 at 1:15 PM, Austin Heyne <[email protected]> wrote:
Hi all,

Thanks for all the help you've given me so far. Today I'm trying to pull
files from an Azure blob store. I've done some reading on this, and from
previous tickets [1] and guides [2] it seems the recommended approach is to
place the required jars, to use the HDFS Azure protocol, in 'Additional
Classpath Resources' and the Hadoop core-site and hdfs-site configs into
the 'Hadoop Configuration Resources'. I have my local HDFS properly
configured to access wasb URLs; I'm able to ls, copy to and from, etc.
without problem. Using the same HDFS config files, and trying both all the
jars in my hadoop-client/lib directory (HDP) and the jars recommended in
[1], I'm still seeing the "java.lang.IllegalArgumentException: Wrong FS: "
error in my NiFi logs and am unable to pull files from Azure blob storage.

Interestingly, it seems the processor is spinning up way too fast; the
errors appear in the log as soon as I start the processor. I'm not sure how
it could be loading all of those jars that quickly.

Does anyone have any experience with this or recommendations to try?

Thanks,
Austin

[1] https://issues.apache.org/jira/browse/NIFI-1922
[2] https://community.hortonworks.com/articles/71916/connecting-to-azure-data-lake-from-a-nifi-dataflow.html




