[jira] [Commented] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API

2019-08-30 Thread Saurabh Bajaj (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919936#comment-16919936
 ] 

Saurabh Bajaj commented on ARROW-5922:
--

Try setting the environment variable ARROW_LIBHDFS_DIR on the worker nodes to 
the directory containing libhdfs.so. That's what worked for me. 
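
For context: pyarrow consults ARROW_LIBHDFS_DIR when locating libhdfs.so. A stdlib-only sketch of that kind of lookup (the exact search order here is an illustration, not pyarrow's real logic, which also checks platform defaults):

```python
import os

def find_libhdfs(env=os.environ):
    """Locate libhdfs.so, preferring an explicit ARROW_LIBHDFS_DIR.

    Illustrative sketch only: pyarrow's actual search also consults
    HADOOP_HOME and platform-specific locations.
    """
    candidates = []
    libhdfs_dir = env.get("ARROW_LIBHDFS_DIR")
    if libhdfs_dir:
        # An explicit directory wins over any fallback.
        candidates.append(os.path.join(libhdfs_dir, "libhdfs.so"))
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        candidates.append(os.path.join(hadoop_home, "lib", "native", "libhdfs.so"))
    for path in candidates:
        if os.path.exists(path):
            return path
    return None
```

On a Kerberized YARN cluster, exporting ARROW_LIBHDFS_DIR in the worker environment before the task starts makes the first candidate win, which is why setting it fixed the connection failure above.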

> [Python] Unable to connect to HDFS from a worker/data node on a Kerberized 
> cluster using pyarrow' hdfs API
> --
>
> Key: ARROW-5922
> URL: https://issues.apache.org/jira/browse/ARROW-5922
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Unix
>Reporter: Saurabh Bajaj
>Priority: Major
> Fix For: 0.14.0
>
>
> Here's what I'm trying:
> ```
> import pyarrow as pa
> conf = {"hadoop.security.authentication": "kerberos"}
> fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> ```
> However, when I submit this job to the cluster using {{Dask-YARN}}, I get the 
> following error:
> ```
> File "test/run.py", line 3
>   fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
> File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
> File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
> File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS connection failed
> ```
> I also tried setting {{host}} (to a name node) and {{port}} (8020), but I run 
> into the same error. Since the error message is not descriptive, I'm not sure 
> which setting needs to be altered. Any clues, anyone?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Closed] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-08 Thread Saurabh Bajaj (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Bajaj closed ARROW-6150.

   Resolution: Fixed
Fix Version/s: 0.14.1

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
> Fix For: 0.14.1
>
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in the traceback below) using PyArrow's HDFS IO library. However, the 
> job intermittently runs into the error shown below: some runs fail while 
> others succeed, and I'm unable to determine the root cause.
>  
> ```
> File "/extractor.py", line 87, in __call__
>   json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
> File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
> File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255)
> Please check that you are connecting to the correct HDFS RPC port
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-08 Thread Saurabh Bajaj (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16903033#comment-16903033
 ] 

Saurabh Bajaj commented on ARROW-6150:
--

Turns out this was being caused by duplicated computation of "dask.get" tasks 
on shared Delayed objects, i.e. a user error. Closing this ticket. 
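
The pitfall described above, materializing shared Delayed objects separately so their common upstream tasks run more than once, can be sketched without dask using a call counter (all names here are illustrative, not from the ticket):

```python
# Counter tracking how often the expensive shared task runs.
calls = {"load": 0}

def load():
    # Stands in for an expensive shared upstream task, e.g. an HDFS read.
    calls["load"] += 1
    return [1, 2, 3]

def total(xs):
    return sum(xs)

def count(xs):
    return len(xs)

# Anti-pattern: each downstream result triggers its own evaluation of the
# shared input (analogous to calling .compute() on each Delayed separately).
t = total(load())
c = count(load())
assert calls["load"] == 2  # the shared work ran twice

# Fix: evaluate the shared input once and fan out from it (analogous to a
# single dask.compute(total_d, count_d) call over one shared Delayed graph).
calls["load"] = 0
xs = load()
t, c = total(xs), count(xs)
assert calls["load"] == 1  # the shared work ran once
```

With real dask Delayed objects the same principle applies: pass all outputs to one `dask.compute(...)` call so the scheduler can share the common subgraph.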

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in the traceback below) using PyArrow's HDFS IO library. However, the 
> job intermittently runs into the error shown below: some runs fail while 
> others succeed, and I'm unable to determine the root cause.
>  
> ```
> File "/extractor.py", line 87, in __call__
>   json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
> File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
> File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255)
> Please check that you are connecting to the correct HDFS RPC port
> ```





[jira] [Comment Edited] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-06 Thread Saurabh Bajaj (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901397#comment-16901397
 ] 

Saurabh Bajaj edited comment on ARROW-6150 at 8/6/19 7:12 PM:
--

I tried setting port=8020 in pa.hdfs.connect(), but I see the same intermittent errors. 


was (Author: sbajaj):
I tried setting `port=8020` in `pa.hdfs.connect()`, but same intermittent 
errors. 

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in the traceback below) using PyArrow's HDFS IO library. However, the 
> job intermittently runs into the error shown below: some runs fail while 
> others succeed, and I'm unable to determine the root cause.
>  
> ```
> File "/extractor.py", line 87, in __call__
>   json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
> File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
> File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255)
> Please check that you are connecting to the correct HDFS RPC port
> ```





[jira] [Commented] (ARROW-6150) [Python] Intermittent HDFS error

2019-08-06 Thread Saurabh Bajaj (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901397#comment-16901397
 ] 

Saurabh Bajaj commented on ARROW-6150:
--

I tried setting `port=8020` in `pa.hdfs.connect()`, but I see the same 
intermittent errors. 

> [Python] Intermittent HDFS error
> 
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in the traceback below) using PyArrow's HDFS IO library. However, the 
> job intermittently runs into the error shown below: some runs fail while 
> others succeed, and I'm unable to determine the root cause.
>  
> ```
> File "/extractor.py", line 87, in __call__
>   json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
> File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
> File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255)
> Please check that you are connecting to the correct HDFS RPC port
> ```





[jira] [Commented] (ARROW-6150) Intermittent Pyarrow HDFS IO error

2019-08-06 Thread Saurabh Bajaj (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901394#comment-16901394
 ] 

Saurabh Bajaj commented on ARROW-6150:
--

[~wesmckinn] Thanks for your response! 

I found https://issues.apache.org/jira/browse/ARROW-3957 and the PR that 
addresses it: 
[https://github.com/apache/arrow/commit/758bd557584107cb336cbc3422744dacd93978af].

It seems the cause of the issue may be an incorrect port? The default for 
{{pa.hdfs.connect()}} is {{port=0}}. What would be the correct port to use?
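
For background (not from this thread): with the defaults of host="default" and port=0, libhdfs resolves the NameNode from the cluster's Hadoop configuration, i.e. the fs.defaultFS entry in core-site.xml, which usually carries the RPC port (8020 in many distributions). An illustrative fragment, with hypothetical values:

```
<!-- core-site.xml (illustrative values only) -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>
```

So an explicit port should normally match whatever fs.defaultFS advertises for the NameNode RPC endpoint.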

> Intermittent Pyarrow HDFS IO error
> --
>
> Key: ARROW-6150
> URL: https://issues.apache.org/jira/browse/ARROW-6150
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.1
>Reporter: Saurabh Bajaj
>Priority: Minor
>
> I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
> shown in the traceback below) using PyArrow's HDFS IO library. However, the 
> job intermittently runs into the error shown below: some runs fail while 
> others succeed, and I'm unable to determine the root cause.
>  
> ```
> File "/extractor.py", line 87, in __call__
>   json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
> File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
> File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
> File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255)
> Please check that you are connecting to the correct HDFS RPC port
> ```





[jira] [Created] (ARROW-6150) Intermittent Pyarrow HDFS IO error

2019-08-06 Thread Saurabh Bajaj (JIRA)
Saurabh Bajaj created ARROW-6150:


 Summary: Intermittent Pyarrow HDFS IO error
 Key: ARROW-6150
 URL: https://issues.apache.org/jira/browse/ARROW-6150
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: Saurabh Bajaj


I'm running a Dask-YARN job that dumps a results dictionary into HDFS (code 
shown in the traceback below) using PyArrow's HDFS IO library. However, the job 
intermittently runs into the error shown below: some runs fail while others 
succeed, and I'm unable to determine the root cause.

 

```
File "/extractor.py", line 87, in __call__
  json.dump(results_dict, fp=_UTF8Encoder(f), indent=4)
File "pyarrow/io.pxi", line 72, in pyarrow.lib.NativeFile.__exit__
File "pyarrow/io.pxi", line 130, in pyarrow.lib.NativeFile.close
File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS CloseFile failed, errno: 255 (Unknown error 255)
Please check that you are connecting to the correct HDFS RPC port
```
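
The failing pattern in the traceback is json.dump into a pyarrow NativeFile through a bytes-encoding wrapper. A minimal local sketch of that wrapper (the _UTF8Encoder name comes from the traceback; its implementation here is an assumption, and io.BytesIO stands in for the real HDFS file so the sketch runs anywhere):

```python
import io
import json

class UTF8Encoder:
    """Hypothetical stand-in for the _UTF8Encoder in the traceback: wraps
    a binary file object so json.dump's str output is encoded to UTF-8
    bytes before being written."""

    def __init__(self, binary_file):
        self._f = binary_file

    def write(self, text):
        self._f.write(text.encode("utf-8"))

# In the real job the sink would be fs.open("/path/on/hdfs", "wb"); a
# BytesIO is substituted here so the example is self-contained.
buf = io.BytesIO()
json.dump({"status": "ok", "count": 3}, fp=UTF8Encoder(buf), indent=4)
payload = buf.getvalue().decode("utf-8")
```

With a real pyarrow HDFS file the `CloseFile` error above would surface when the `with fs.open(...)` block exits and the buffered bytes are flushed, which is why the traceback points at `NativeFile.__exit__` rather than at `json.dump` itself.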





[jira] [Closed] (ARROW-5922) [Python] Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API

2019-08-06 Thread Saurabh Bajaj (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saurabh Bajaj closed ARROW-5922.

Resolution: Works for Me

> [Python] Unable to connect to HDFS from a worker/data node on a Kerberized 
> cluster using pyarrow' hdfs API
> --
>
> Key: ARROW-5922
> URL: https://issues.apache.org/jira/browse/ARROW-5922
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.14.0
> Environment: Unix
>Reporter: Saurabh Bajaj
>Priority: Major
> Fix For: 0.14.0
>
>
> Here's what I'm trying:
> ```
> import pyarrow as pa
> conf = {"hadoop.security.authentication": "kerberos"}
> fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> ```
> However, when I submit this job to the cluster using {{Dask-YARN}}, I get the 
> following error:
> ```
> File "test/run.py", line 3
>   fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
> File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
> File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
> File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
> File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: HDFS connection failed
> ```
> I also tried setting {{host}} (to a name node) and {{port}} (8020), but I run 
> into the same error. Since the error message is not descriptive, I'm not sure 
> which setting needs to be altered. Any clues, anyone?





[jira] [Created] (ARROW-5922) Unable to connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow' hdfs API

2019-07-12 Thread Saurabh Bajaj (JIRA)
Saurabh Bajaj created ARROW-5922:


 Summary: Unable to connect to HDFS from a worker/data node on a 
Kerberized cluster using pyarrow' hdfs API
 Key: ARROW-5922
 URL: https://issues.apache.org/jira/browse/ARROW-5922
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.0
 Environment: Unix
Reporter: Saurabh Bajaj
 Fix For: 0.14.0


Here's what I'm trying:

```
import pyarrow as pa
conf = {"hadoop.security.authentication": "kerberos"}
fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
```

However, when I submit this job to the cluster using {{Dask-YARN}}, I get the 
following error:

```
File "test/run.py", line 3
  fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_4", extra_conf=conf)
File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_03/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed
```

I also tried setting {{host}} (to a name node) and {{port}} (8020), but I run 
into the same error. Since the error message is not descriptive, I'm not sure 
which setting needs to be altered. Any clues, anyone?


