Ben Schreck created ARROW-6389:
----------------------------------

             Summary: java.io.IOException: No FileSystem for scheme: hdfs [On AWS EMR]
                 Key: ARROW-6389
                 URL: https://issues.apache.org/jira/browse/ARROW-6389
             Project: Apache Arrow
          Issue Type: Bug
          Components: Java, Python
    Affects Versions: 0.14.1
         Environment: Hadoop 2.85
                      EMR 5.24.1
                      python version: 3.7.4
                      skein version: 0.8.0
            Reporter: Ben Schreck

I can't access HDFS through pyarrow from inside a YARN container created by skein.

This code works in a Jupyter notebook running on the master node, or in an IPython terminal on a worker when given the ARROW_LIBHDFS_DIR env var:

{{import pyarrow; pyarrow.hdfs.connect()}}

However, when running on YARN by submitting the following skein application, I get a Java error.

{{name: test_conn
queue: default
master:
  env:
    ARROW_LIBHDFS_DIR: /usr/lib/hadoop/lib/native
    JAVA_HOME: /etc/alternatives/jre
  resources:
    vcores: 1
    memory: 10 GiB
  files:
    conda_env: /home/hadoop/environment.tar.gz
  script: |
    echo $HADOOP_HOME
    echo $JAVA_HOME
    echo $HADOOP_CLASSPATH
    echo $ARROW_LIBHDFS_DIR
    source conda_env/bin/activate
    python -c "import pyarrow; fs = pyarrow.hdfs.connect(); print(fs.open('test.txt').read())"
    echo "Hello World!"}}

FYI I tried with and without all those extra env vars, to no effect. I also tried modifying the EMR cluster configuration with any of the following:

{{"fs.hdfs.impl": "org.apache.hadoop.fs.Hdfs"
"fs.AbstractFileSystem.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"
"fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem"}}

The {{fs.AbstractFileSystem.hdfs.impl}} setting gave a slightly different error: it was able to resolve which class name to use for the "hdfs://" prefix, namely {{org.apache.hadoop.hdfs.DistributedFileSystem}}, but was not able to find that class.

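One check that might narrow this down is whether the JVM started by libhdfs inside the container can see the hadoop-hdfs jar at all. The sketch below is a diagnostic under assumptions, not a confirmed fix: it assumes the {{hadoop}} CLI is on the container's PATH, and it assumes that wildcard entries such as {{/usr/lib/hadoop-lzo/lib/*}} in CLASSPATH are not expanded by libhdfs on this Hadoop version. The helper names ({{has_hdfs_jar}}, {{expanded_hadoop_classpath}}, {{connect_with_expanded_classpath}}) are hypothetical and exist only for illustration.

```python
import os
import re
import subprocess


def has_hdfs_jar(classpath):
    """Return True if some classpath entry is an explicit hadoop-hdfs jar path.

    Wildcard entries such as /usr/lib/hadoop-hdfs/* do not count, since this
    sketch assumes the JVM started through libhdfs does not expand them.
    """
    entries = classpath.split(os.pathsep)
    return any(re.search(r"hadoop-hdfs[^/]*\.jar$", entry) for entry in entries)


def expanded_hadoop_classpath(hadoop_bin="hadoop"):
    """Ask the Hadoop CLI for its classpath with wildcards expanded to jars."""
    output = subprocess.check_output([hadoop_bin, "classpath", "--glob"])
    return output.decode("utf-8").strip()


def connect_with_expanded_classpath():
    """Set CLASSPATH explicitly if needed, then try the pyarrow connection."""
    if not has_hdfs_jar(os.environ.get("CLASSPATH", "")):
        os.environ["CLASSPATH"] = expanded_hadoop_classpath()
    import pyarrow  # imported lazily so the helpers above work without pyarrow
    return pyarrow.hdfs.connect()
```

Calling {{connect_with_expanded_classpath()}} in place of the bare {{pyarrow.hdfs.connect()}} in the skein script would show whether an explicit, fully expanded CLASSPATH changes the outcome. If {{has_hdfs_jar}} already returns True for the container's CLASSPATH, the cause is likely elsewhere.
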
Logs:

{{=========================================================================================
LogType:application.driver.log
Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
LogLength:2635
Log Contents:
/usr/lib/hadoop
/usr/lib/jvm/java-openjdk
:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar:/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar:/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar:/usr/share/aws/emr/cloudwatch-sink/lib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
hdfsBuilderConnect(forceNewInstance=1, nn=default, port=0, kerbTicketCachePath=(NULL), userName=(NULL)) error:
java.io.IOException: No FileSystem for scheme: hdfs
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2846)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2857)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2896)
	at org.apache.hadoop.fs.FileSystem$Cache.getUnique(FileSystem.java:2884)
	at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:439)
	at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:414)
	at org.apache.hadoop.fs.FileSystem$2.run(FileSystem.java:411)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
	at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:411)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/mnt2/yarn/usercache/hadoop/appcache/application_1567110830725_0001/container_1567110830725_0001_01_000001/conda_env/lib/python3.7/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
Hello World!
End of LogType:application.driver.log

LogType:application.master.log
Log Upload Time:Thu Aug 29 20:51:59 +0000 2019
LogLength:1588
Log Contents:
19/08/29 20:51:55 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
19/08/29 20:51:55 INFO skein.ApplicationMaster: Running as user hadoop
19/08/29 20:51:55 INFO skein.ApplicationMaster: Application specification successfully loaded
19/08/29 20:51:56 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8030
19/08/29 20:51:56 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/08/29 20:51:56 INFO skein.ApplicationMaster: gRPC server started at IP.ec2.internal:39361
19/08/29 20:51:57 INFO skein.ApplicationMaster: WebUI server started at IP.ec2.internal:36511
19/08/29 20:51:57 INFO skein.ApplicationMaster: Registering application with resource manager
19/08/29 20:51:57 INFO client.RMProxy: Connecting to ResourceManager at IP.ec2.internal/IP:8032
19/08/29 20:51:57 INFO skein.ApplicationMaster: Starting application driver
19/08/29 20:51:57 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
19/08/29 20:51:57 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
19/08/29 20:51:57 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/08/29 20:51:58 INFO skein.ApplicationMaster: Deleted application directory hdfs://IP.ec2.internal:8020/user/hadoop/.skein/application_1567110830725_0001
19/08/29 20:51:58 INFO skein.ApplicationMaster: WebUI server shut down
19/08/29 20:51:58 INFO skein.ApplicationMaster: gRPC server shut down
End of LogType:application.master.log}}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)