Eric Henry created ARROW-8154:
---------------------------------

             Summary: HDFS Filesystem does not set environment variables in pyarrow 0.16.0 release
                 Key: ARROW-8154
                 URL: https://issues.apache.org/jira/browse/ARROW-8154
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Eric Henry
In pyarrow 0.15.x, the HDFS filesystem works as follows: if you set the HADOOP_HOME env var, it looks for libhdfs.so in $HADOOP_HOME/lib/native. In pyarrow 0.16.x, if you set HADOOP_HOME, it looks for libhdfs.so in $HADOOP_HOME itself, which is incorrect behaviour on all systems I am using. Also, CLASSPATH no longer gets set automatically, which was very convenient.

The issue here is that I need to set HADOOP_HOME correctly to be able to use other libraries, but have to change it to use Apache Arrow and then reset it afterwards, e.g.:

os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
# ...do stuff here...

# ...then connect to arrow...
os.environ["HADOOP_HOME"] = "/usr/lib/hadoop/lib/native"
hdfs = pyarrow.hdfs.connect(host, port)

# ...then reset my HADOOP_HOME...
os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"

etc.

Example:

>>> os.environ["HADOOP_HOME"] = "/usr/lib/hadoop"
>>> hdfs = pyarrow.hdfs.connect(host, port)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", line 215, in connect
    extra_conf=extra_conf)
  File "/home/user/.conda/envs/retroscoring/lib/python3.6/site-packages/pyarrow/hdfs.py", line 40, in __init__
    self._connect(host, port, user, kerb_ticket, driver, extra_conf)
  File "pyarrow/io-hdfs.pxi", line 89, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Unable to load libhdfs: /usr/lib/hadoop/libhdfs.so: cannot open shared object file: No such file or directory
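A workaround that avoids scattering the env-var juggling through the calling code is a context manager that repoints HADOOP_HOME only for the duration of the connect call. This is a minimal sketch, not a fix: it assumes the 0.16.x lookup behaviour described above, that libhdfs.so actually lives under /usr/lib/hadoop/lib/native, and that host and port are placeholders filled in by the caller.

import os
from contextlib import contextmanager

import pyarrow

@contextmanager
def hadoop_home(path):
    """Temporarily point HADOOP_HOME at `path`, restoring the previous value on exit."""
    old = os.environ.get("HADOOP_HOME")
    os.environ["HADOOP_HOME"] = path
    try:
        yield
    finally:
        if old is None:
            del os.environ["HADOOP_HOME"]
        else:
            os.environ["HADOOP_HOME"] = old

# pyarrow 0.16.x looks for libhdfs.so directly in $HADOOP_HOME,
# so point it at the native lib dir only while connecting.
with hadoop_home("/usr/lib/hadoop/lib/native"):
    hdfs = pyarrow.hdfs.connect(host, port)

Alternatively, since pyarrow also consults ARROW_LIBHDFS_DIR when locating libhdfs.so, setting that variable to the directory containing the library may sidestep HADOOP_HOME entirely; I have not verified that this avoids the regression on 0.16.0.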