[ https://issues.apache.org/jira/browse/ARROW-6389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920517#comment-16920517 ]
Ben Schreck commented on ARROW-6389:
------------------------------------

I fixed it by setting HADOOP_HOME=/usr in the worker environment. Pyarrow builds the path to the hadoop executable as $HADOOP_HOME/bin/hadoop (python/pyarrow/hdfs.py:L137). $HADOOP_HOME on my system was already /usr/bin/hadoop, so this resolved to /usr/bin/hadoop/bin/hadoop, which produced a wrong $CLASSPATH.

> java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
> ----------------------------------------------------------------
>
>                 Key: ARROW-6389
>                 URL: https://issues.apache.org/jira/browse/ARROW-6389
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Java, Python
>    Affects Versions: 0.14.1
>         Environment: Hadoop 2.8.5
> EMR 5.24.1
> python version: 3.7.4
> skein version: 0.8.0
>            Reporter: Ben Schreck
>            Priority: Blocker
>
> I can't access HDFS through pyarrow from inside a YARN container created by skein.
> This code works in a Jupyter notebook running on the master node, or in an ipython terminal on a worker when given the ARROW_LIBHDFS_DIR env var:
> {{import pyarrow; fs = pyarrow.hdfs.connect()}}
> However, when running on YARN by submitting the following skein application spec, I get a Java error:
> {{name: test_conn
> queue: default
> master:
>   env:
>     ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
>     JAVA_HOME: /etc/alternatives/jre
>   resources:
>     vcores: 1
>     memory: 10 GiB
>   files:
>     conda_env: /home/hadoop/environment.tar.gz
>   script: |
>     echo $HADOOP_HOME
>     echo $JAVA_HOME
>     echo $HADOOP_CLASSPATH
>     echo $ARROW_LIBHDFS_DIR
>     source conda_env/bin/activate
>     python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
>     echo "Hello World!"}}
> FYI I tried with/without all those extra env vars, to no effect.
> I also tried modifying the EMR cluster configuration with each of the following:
> {{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
> "fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
> "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}
> The {{fs.AbstractFileSystem.hdfs.impl}} one gave a slightly different error: it was able to find, by name, which class to use for the "hdfs://" prefix, namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but it was not able to load that class.
> Logs:
>
> {{=========================================================================================
> LogType:application.driver.log
> Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
> LogLength:2635
> Log Contents:
> /usr/lib/hadoop
> /usr/lib/jvm/java-openjdk
> :/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
> hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
> java.io.IOException: No FileSystem for scheme: hdfs
>     at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
>     at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
>     at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
>     at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
>     at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>     at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
>     extra_conf=extra_conf)
>   File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
>     self._connect(host, port, user, kerb_ticket, driver, extra_conf)
>   File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
>   File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS connection failed
> Hello World!
> End of LogType:application.driver.log
>
> LogType:application.master.log
> Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
> LogLength:1588
> Log Contents:
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Running as user hadoop
> 19/08/29 20:51:55 INFO skein.ApplicationMaster: Application specification successfully loaded
> 19/08/29 20:51:56 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8030
> 19/08/29 20:51:56 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
> 19/08/29 20:51:56 INFO skein.ApplicationMaster: gRPC server started at IP.ec2.internal:39361
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: WebUI server started at IP.ec2.internal:36511
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Registering application with resource manager
> 19/08/29 20:51:57 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8032
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Starting application driver
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
> 19/08/29 20:51:57 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
> 19/08/29 20:51:57 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://IP.ec2.internal:8020/user/hadoop/.skein/application_1567110830725_0001
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: WebUI server shut down
> 19/08/29 20:51:58 INFO skein.ApplicationMaster: gRPC server shut down
> End of LogType:application.master.log}}

-- This message was sent by Atlassian Jira (v8.3.2#803003)
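The HADOOP_HOME fix in the comment at the top comes down to how the hadoop executable path is derived. Here is a minimal sketch of that path logic (an illustration of the failure mode, not pyarrow's actual code; `hadoop_bin` is a hypothetical helper):

```python
import posixpath

def hadoop_bin(hadoop_home: str) -> str:
    # pyarrow 0.14.x locates the hadoop executable by joining HADOOP_HOME
    # with "bin/hadoop" (python/pyarrow/hdfs.py:L137), then uses it to
    # populate $CLASSPATH. HADOOP_HOME must therefore be the install
    # prefix, not the path to the binary itself.
    return posixpath.join(hadoop_home, "bin", "hadoop")

# HADOOP_HOME mistakenly pointing at the executable yields a bogus nested
# path, so the classpath lookup fails and "No FileSystem for scheme: hdfs"
# follows:
print(hadoop_bin("/usr/bin/hadoop"))  # -> /usr/bin/hadoop/bin/hadoop
# HADOOP_HOME as the prefix, matching the HADOOP_HOME=/usr fix:
print(hadoop_bin("/usr"))             # -> /usr/bin/hadoop
```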