[jira] [Commented] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API
[ https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16919936#comment-16919936 ]

Saurabh Bajaj commented on ARROW-5922:
--------------------------------------

Try setting the environment variable ARROW_LIBHDFS_DIR to the explicit location of libhdfs.so on the worker nodes. That's what worked for me.

> [Python] Unable to connect to HDFS from a worker/data node on a Kerberized
> cluster using pyarrow's hdfs API
> --------------------------------------------------------------------------
>
>                 Key: ARROW-5922
>                 URL: https://issues.apache.org/jira/browse/ARROW-5922
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.0
>         Environment: Unix
>            Reporter: Saurabh Bajaj
>           Priority: Major
>             Fix For: 0.14.0
>
> Here's what I'm trying:
> {code:python}
> import pyarrow as pa
> conf = {"hadoop.security.authentication": "kerberos"}
> fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> {code}
> However, when I submit this job to the cluster using Dask-YARN, I get the following error:
> {noformat}
> File "test/run.py", line 3
>   fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
> File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
> File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
> File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS connection failed
> {noformat}
> I also tried setting {{host}} (to a name node) and {{port}} (=8020), but I run into the same error. Since the error message is not descriptive, I'm not sure which setting needs to be altered. Any clues, anyone?
-- This message was sent by Atlassian Jira (v8.3.2#803003)
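The ARROW_LIBHDFS_DIR workaround above can be sketched as follows. This is a minimal sketch, not a verified fix: the libhdfs.so path is an assumption and varies by Hadoop distribution, and the connect call itself is left commented because it needs a live cluster.

```python
import os

# Assumption: where libhdfs.so lives on the worker nodes; adjust for your
# distribution (e.g. /usr/hdp/current/hadoop-client/lib/native on HDP).
os.environ["ARROW_LIBHDFS_DIR"] = "/usr/lib/hadoop/lib/native"

conf = {"hadoop.security.authentication": "kerberos"}

# With the variable set before the first connect, pyarrow can locate libhdfs:
# import pyarrow as pa
# fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
```

On a Dask-YARN deployment the variable has to reach the worker containers' environment (e.g. via the worker environment spec), not just the driver process, since the connect happens on the workers.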
[jira] [Closed] (ARROW-6150) [Python] Intermittent HDFS error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saurabh Bajaj closed ARROW-6150.
--------------------------------
    Resolution: Fixed
 Fix Version/s: 0.14.1

> [Python] Intermittent HDFS error
> --------------------------------
>
>                 Key: ARROW-6150
>                 URL: https://issues.apache.org/jira/browse/ARROW-6150
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>            Reporter: Saurabh Bajaj
>           Priority: Minor
>             Fix For: 0.14.1
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in the traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below, not on every run, only sometimes. I'm unable to determine the root cause of this issue.
>
> {noformat}
> File "/extractor.py", line 87, in __call__
>   json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
> File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
> File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
> {noformat}
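The failing write pattern from the traceback can be reconstructed as a sketch. `BytesIO` stands in for the pyarrow NativeFile a real `fs.open(...)` would return, and `results_dict` and the output path are placeholders, not the original job's values; the point is that the error surfaced in `NativeFile.close()`, i.e. when the `with` block exited.

```python
import io
import json

results_dict = {"status": "ok", "count": 3}  # placeholder for the job's results

# On a real cluster this buffer would be a pyarrow NativeFile:
#   with fs.open("/user/hdfsf6/results.json", "wb") as f:
#       f.write(json.dumps(results_dict, indent=4).encode("utf-8"))
# The intermittent ArrowIOError was raised from NativeFile.close(), which is
# why the traceback points at NativeFile.__exit__ rather than the write itself.
buf = io.BytesIO()
buf.write(json.dumps(results_dict, indent=4).encode("utf-8"))
payload = buf.getvalue()
```

As the reporter's later comment notes, the root cause was duplicated computation of the same Delayed tasks (so two tasks could race on the same output file), not the write pattern itself.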
[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903033#comment-16903033 ]

Saurabh Bajaj commented on ARROW-6150:
--------------------------------------

Turns out this was being caused by duplicated computation of "dask.get" tasks on Delayed objects, hence a user error. Closing this ticket.

> [Python] Intermittent HDFS error
> --------------------------------
>
>                 Key: ARROW-6150
>                 URL: https://issues.apache.org/jira/browse/ARROW-6150
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>            Reporter: Saurabh Bajaj
>           Priority: Minor
[jira] [Comment Edited] (ARROW-6150) [Python] Intermittent HDFS error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901397#comment-16901397 ]

Saurabh Bajaj edited comment on ARROW-6150 at 8/6/19 7:12 PM:
--------------------------------------------------------------

I tried setting port=8020 in pa.hdfs.connect(), but got the same intermittent errors.

was (Author: sbajaj):
I tried setting `port=8020` in `pa.hdfs.connect()`, but same intermittent errors.

> [Python] Intermittent HDFS error
> --------------------------------
>
>                 Key: ARROW-6150
>                 URL: https://issues.apache.org/jira/browse/ARROW-6150
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>            Reporter: Saurabh Bajaj
>           Priority: Minor
[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901397#comment-16901397 ]

Saurabh Bajaj commented on ARROW-6150:
--------------------------------------

I tried setting `port=8020` in `pa.hdfs.connect()`, but got the same intermittent errors.

> [Python] Intermittent HDFS error
> --------------------------------
>
>                 Key: ARROW-6150
>                 URL: https://issues.apache.org/jira/browse/ARROW-6150
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>            Reporter: Saurabh Bajaj
>           Priority: Minor
[jira] [Commented] (ARROW-6150) Intermittent Pyarrow HDFS IO error
[ https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901394#comment-16901394 ]

Saurabh Bajaj commented on ARROW-6150:
--------------------------------------

[~wesmckinn] Thanks for your response! I found https://issues.apache.org/jira/browse/ARROW-3957 and the PR that addresses it: [https://github.com/apache/arrow/commit/758bd557584107cb336cbc3422744dacd93978af]. It seems the cause of the issue is an incorrect port? The default in {{pa.hdfs.connect()}} is {{port=0}}. What would be the correct port to use?

> Intermittent Pyarrow HDFS IO error
> ----------------------------------
>
>                 Key: ARROW-6150
>                 URL: https://issues.apache.org/jira/browse/ARROW-6150
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.1
>            Reporter: Saurabh Bajaj
>           Priority: Minor
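On the port question above: in pyarrow 0.14.x, `pa.hdfs.connect()` defaults to `host='default'` and `port=0`, which tells libhdfs to use the cluster's configured default filesystem (`fs.defaultFS` from core-site.xml) rather than a literal port 0; an explicit port has to match the NameNode RPC port. A minimal sketch, where the `hdfs://namenode:8020` URI is an assumption and not the reporter's actual configuration:

```python
from urllib.parse import urlparse

# Assumed fs.defaultFS value from core-site.xml; 8020 is a conventional
# NameNode RPC port, but clusters vary (9000 is also common).
default_fs = "hdfs://namenode:8020"

parsed = urlparse(default_fs)
host, port = parsed.hostname, parsed.port

# fs = pa.hdfs.connect(host=host, port=port)  # explicit host and RPC port
# fs = pa.hdfs.connect()                      # host='default', port=0: defer to fs.defaultFS
```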
[jira] [Created] (ARROW-6150) Intermittent Pyarrow HDFS IO error
Saurabh Bajaj created ARROW-6150:
------------------------------------

             Summary: Intermittent Pyarrow HDFS IO error
                 Key: ARROW-6150
                 URL: https://issues.apache.org/jira/browse/ARROW-6150
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.1
            Reporter: Saurabh Bajaj


I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code shown in the traceback below) using PyArrow's HDFS IO library. However, the job intermittently runs into the error shown below, not on every run, only sometimes. I'm unable to determine the root cause of this issue.

{noformat}
File "/extractor.py", line 87, in __call__
  json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255) Please check that you are connecting to the correct HDFS RPC port
{noformat}
[jira] [Closed] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API
[ https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saurabh Bajaj closed ARROW-5922.
--------------------------------
    Resolution: Works for Me

> [Python] Unable to connect to HDFS from a worker/data node on a Kerberized
> cluster using pyarrow's hdfs API
> --------------------------------------------------------------------------
>
>                 Key: ARROW-5922
>                 URL: https://issues.apache.org/jira/browse/ARROW-5922
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.14.0
>         Environment: Unix
>            Reporter: Saurabh Bajaj
>           Priority: Major
>             Fix For: 0.14.0
[jira] [Created] (ARROW-5922) Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API
Saurabh Bajaj created ARROW-5922:
------------------------------------

             Summary: Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API
                 Key: ARROW-5922
                 URL: https://issues.apache.org/jira/browse/ARROW-5922
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.14.0
         Environment: Unix
            Reporter: Saurabh Bajaj
             Fix For: 0.14.0


Here's what I'm trying:

{code:python}
import pyarrow as pa
conf = {"hadoop.security.authentication": "kerberos"}
fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
{code}

However, when I submit this job to the cluster using Dask-YARN, I get the following error:

{noformat}
File "test/run.py", line 3
  fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
{noformat}

I also tried setting {{host}} (to a name node) and {{port}} (=8020), but I run into the same error. Since the error message is not descriptive, I'm not sure which setting needs to be altered. Any clues, anyone?
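For Kerberos connection failures like the one reported above, one quick sanity check on a worker is whether the ticket cache path passed to `connect()` actually exists in the container and matches what the Kerberos libraries will look for. A sketch of that check; pointing KRB5CCNAME at the cache follows the standard MIT Kerberos convention, but whether it resolves this particular failure is an assumption:

```python
import os

ticket = "/tmp/krb5cc_4"  # ticket cache path from the report

# KRB5CCNAME is the standard Kerberos variable naming the credential cache;
# keeping it in sync with the path given to pa.hdfs.connect(kerb_ticket=...)
# rules out one source of mismatch between driver and worker environments.
os.environ["KRB5CCNAME"] = ticket

ticket_present = os.path.exists(ticket)  # False if the worker has no ticket cache
```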