[jira] [Commented] (ARROW-2081) Hdfs client isn't fork-safe
[ https://issues.apache.org/jira/browse/ARROW-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382361#comment-16382361 ]

Antoine Pitrou commented on ARROW-2081:
---

For the record, if you want decent multiprocessing performance together with fork safety, I would suggest using the "forkserver" method, not "spawn". (Note the C libhdfs3 library isn't fork-safe, so no need to try it out IMHO :-))

> Hdfs client isn't fork-safe
> ---
>
> Key: ARROW-2081
> URL: https://issues.apache.org/jira/browse/ARROW-2081
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Jim Crist
> Priority: Major
>
> Given the following script:
>
> {code:python}
> import multiprocessing as mp
> import pyarrow as pa
>
> def ls(h):
>     print("calling ls")
>     return h.ls("/tmp")
>
> if __name__ == '__main__':
>     h = pa.hdfs.connect()
>
>     print("Using 'spawn'")
>     pool = mp.get_context('spawn').Pool(2)
>     results = pool.map(ls, [h, h])
>     sol = h.ls("/tmp")
>     for r in results:
>         assert r == sol
>     print("'spawn' succeeded\n")
>
>     print("Using 'fork'")
>     pool = mp.get_context('fork').Pool(2)
>     results = pool.map(ls, [h, h])
>     sol = h.ls("/tmp")
>     for r in results:
>         assert r == sol
>     print("'fork' succeeded")
> {code}
>
> Results in the following output:
>
> {code}
> $ python test.py
> Using 'spawn'
> calling ls
> calling ls
> 'spawn' succeeded
> Using 'fork'
> {code}
>
> The process then hangs, and I have to `kill -9` the forked worker processes.
>
> I'm unable to get the libhdfs3 driver to work, so I'm unsure if this is a
> problem with libhdfs or just arrow's use of it (a quick google search didn't
> turn up anything useful).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
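A minimal sketch of the "forkserver" approach suggested above: each worker is started by a clean server process, and each worker would open its own connection rather than inheriting a handle pickled from the parent. `ls_in_worker` and its canned listing are illustrative stand-ins, not Arrow API; in real code `h = pa.hdfs.connect()` would go inside the worker.

```python
import multiprocessing as mp

def ls_in_worker(path):
    # Hypothetical worker: in real code you would call
    # h = pa.hdfs.connect() HERE, inside the worker process,
    # instead of shipping a client created in the parent.
    return sorted(["file-a", "file-b"])  # stand-in for h.ls(path)

def run_pool():
    # "forkserver" starts workers from a clean helper process, so no
    # mid-flight JVM/libhdfs state crosses the fork boundary, while
    # staying cheaper than "spawn" for repeated pool creation.
    ctx = mp.get_context("forkserver")
    with ctx.Pool(2) as pool:
        return pool.map(ls_in_worker, ["/tmp", "/tmp"])

if __name__ == "__main__":
    print(run_pool())
```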
[jira] [Commented] (ARROW-2081) Hdfs client isn't fork-safe
[ https://issues.apache.org/jira/browse/ARROW-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364539#comment-16364539 ]

Wes McKinney commented on ARROW-2081:
---

Is there a way we can detect the fork in the child process(es) and at least avoid a hang or segfault?
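One way to detect the fork, as asked here, is the standard-library `os.register_at_fork` hook (POSIX, Python 3.7+): mark a flag in every forked child so the client fails fast with a clear error instead of hanging. A sketch under that assumption; `safe_ls` and its canned result are hypothetical stand-ins, not Arrow's API.

```python
import os

_forked_child = False

def _after_fork_in_child():
    # runs in the child immediately after fork()
    global _forked_child
    _forked_child = True

# os.register_at_fork is available on POSIX since Python 3.7
os.register_at_fork(after_in_child=_after_fork_in_child)

def safe_ls(path):
    # stand-in for an HDFS call: raise instead of touching a
    # connection whose JVM threads did not survive the fork
    if _forked_child:
        raise RuntimeError(
            "HDFS client used after fork; reconnect in the child")
    return ["part-00000"]  # canned listing for illustration
```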
[jira] [Commented] (ARROW-2081) Hdfs client isn't fork-safe
[ https://issues.apache.org/jira/browse/ARROW-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349774#comment-16349774 ]

Wes McKinney commented on ARROW-2081:
---

I think this has to do with the general policy around forking with an embedded JVM. It may not be supported, but I didn't turn up any immediate references.
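The JVM angle can be illustrated without a JVM at all: a "fork" child inherits the parent's live in-memory state mid-flight (including an embedded JVM's threads and sockets), while a "spawn" child re-imports the program from scratch. The sketch below uses a plain dict as a stand-in for JVM state; all names are illustrative.

```python
import multiprocessing as mp

STATE = {"jvm": "not started"}  # stand-in for embedded JVM state

def report(q):
    # runs in the child; reports what state it sees
    q.put(STATE["jvm"])

def child_sees(method):
    ctx = mp.get_context(method)
    q = ctx.Queue()
    p = ctx.Process(target=report, args=(q,))
    p.start()
    val = q.get()
    p.join()
    return val

if __name__ == "__main__":
    STATE["jvm"] = "running"  # mutate after import, before fork/spawn
    print("fork child sees: ", child_sees("fork"))   # inherited copy
    print("spawn child sees:", child_sees("spawn"))  # fresh import
```

The fork child reports "running" because it got a copy of the mutated parent memory; the spawn child reports "not started" because it re-ran the module top-level in a fresh interpreter. A JVM inherited the first way is in an inconsistent state, which matches the hang seen in the repro.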