Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
dmpetrov commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3045953884 `Python 3.13.5 (main, Jun 11 2025, 15:36:57) [Clang 17.0.0 (clang-1700.0.13.3)] on darwin` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
pitrou commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3043769609 Oh, thank you @dmpetrov. Which Python version is it?
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
dmpetrov commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3043723163 This is the lldb stack trace. I hope it helps.

```
$ lldb python
(lldb) run example.py
Process 64890 launched: '/.../python' (arm64)
Process 64890 stopped
* thread #2, stop reason = exec
    frame #0: 0x0001000147c0 dyld`_dyld_start
dyld`_dyld_start:
->  0x1000147c0 <+0>:  mov x0, sp
    0x1000147c4 <+4>:  and sp, x0, #0xfff0
    0x1000147c8 <+8>:  mov x29, #0x0 ; =0
    0x1000147cc <+12>: mov x30, #0x0 ; =0
Target 0: (Python) stopped.
(lldb) c
Process 64890 resuming
Schema from dataset:
URL: string
TEXT: string
WIDTH: double
HEIGHT: double
similarity: double
punsafe: double
pwatermark: double
AESTHETIC_SCORE: double
hash: int64
Data from dataset:
                                                 URL  ...                 hash
0  https://endscan.com/media/36373/fb0bf7b2abe7ac...  ...   929872200875109155
1  https://static0.colliderimages.com/wordpress/w...  ...  8338800302313723098
2  https://images.squarespace-cdn.com/content/v1/...  ...  7578604913656441916

[3 rows x 9 columns]
^C
Process 64890 stopped
* thread #2, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x0001864ec3cc libsystem_kernel.dylib`__psynch_cvwait + 8
libsystem_kernel.dylib`__psynch_cvwait:
->  0x1864ec3cc <+8>:  b.lo 0x1864ec3ec ; <+40>
    0x1864ec3d0 <+12>: pacibsp
    0x1864ec3d4 <+16>: stp x29, x30, [sp, #-0x10]!
    0x1864ec3d8 <+20>: mov x29, sp
Target 0: (Python) stopped.
(lldb) thread backtrace all
* thread #2, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0001864ec3cc libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x00018652b0e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
    frame #2: 0x00018645b298 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 32
    frame #3: 0x0001144e59d8 libarrow.2000.dylib`arrow::internal::ThreadPool::Shutdown(bool) + 176
    frame #4: 0x0001144e5874 libarrow.2000.dylib`arrow::internal::ThreadPool::~ThreadPool() + 56
    frame #5: 0x0001144e5b14 libarrow.2000.dylib`arrow::internal::ThreadPool::~ThreadPool() + 12
    frame #6: 0x000114270378 libarrow.2000.dylib`std::__1::shared_ptr::~shared_ptr[abi:ue170006]() + 56
    frame #7: 0x0001863e2944 libsystem_c.dylib`__cxa_finalize_ranges + 480
    frame #8: 0x0001863e2704 libsystem_c.dylib`exit + 44
    frame #9: 0x000186534dc8 libdyld.dylib`dyld4::LibSystemHelpers::exit(int) const + 20
    frame #10: 0x00018618acac dyld`dyld4::LibSystemHelpersWrapper::exit(int) const + 172
    frame #11: 0x00018618abc8 dyld`start + 6124
  thread #3
    frame #0: 0x0001864ea8b0 libsystem_kernel.dylib`__workq_kernreturn + 8
```
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
dmpetrov commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3029385404 @pitrou Unfortunately, I cannot run it in gdb in the near future.
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
pitrou commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3019044307 @dmpetrov Would you be able to run your script under gdb and get a backtrace of where it's hanging?
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
pitrou commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3019004300 And the other 2 threads are just waiting for thread pool shutdown:

```
Thread 3 (Thread 0x76985a3f3080 (LWP 94748)):
#0  0x76985a098d71 in __futex_abstimed_wait_common64 (private=-1, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x35b339a4) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=-1, abstime=0x0, clockid=0, expected=0, futex_word=0x35b339a4) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x35b339a4, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x76985a09b7ed in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x35b33920, cond=0x35b33978) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x35b33978, mutex=0x35b33920) at ./nptl/pthread_cond_wait.c:627
#5  0x7698584b9047 in std::condition_variable::wait(std::unique_lock&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x7698538949fb in arrow::internal::ThreadPool::Shutdown(bool) () from /home/antoine/t/venv-3.12/lib/python3.12/site-packages/pyarrow/libarrow.so.2100
#7  0x76985389ad6f in std::_Sp_counted_ptr::_M_dispose() () from /home/antoine/t/venv-3.12/lib/python3.12/site-packages/pyarrow/libarrow.so.2100
#8  0x769853182727 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /home/antoine/t/venv-3.12/lib/python3.12/site-packages/pyarrow/libarrow.so.2100
#9  0x76985a047a76 in __run_exit_handlers (status=0, listp=, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:108
#10 0x76985a047bbe in __GI_exit (status=) at ./stdlib/exit.c:138
#11 0x76985a02a1d1 in __libc_start_call_main (main=main@entry=0x518950, argc=argc@entry=2, argv=argv@entry=0x7fff97594798) at ../sysdeps/nptl/libc_start_call_main.h:74
#12 0x76985a02a28b in __libc_start_main_impl (main=0x518950, argc=2, argv=0x7fff97594798, init=, fini=, rtld_fini=, stack_end=0x7fff97594788) at ../csu/libc-start.c:360
#13 0x006575a5 in _start ()

Thread 2 (Thread 0x7698523ff6c0 (LWP 94749)):
#0  0x76985a098d71 in __futex_abstimed_wait_common64 (private=30360, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x769852a16630) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=30360, abstime=0x0, clockid=0, expected=0, futex_word=0x769852a16630) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x769852a16630, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x76985a09b7ed in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x769852a16678, cond=0x769852a16608) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x769852a16608, mutex=0x769852a16678) at ./nptl/pthread_cond_wait.c:627
#5  0x7698545f1793 in background_thread_entry () from /home/antoine/t/venv-3.12/lib/python3.12/site-packages/pyarrow/libarrow.so.2100
#6  0x76985a09caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
#7  0x76985a129c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
```
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
pitrou commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3018965995 I can also reproduce with a recent nightly build (`pyarrow-21.0.0.dev254-cp312-cp312-manylinux_2_28_x86_64.whl`).
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
pitrou commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3018933031 I do not exactly get a hang on PyArrow 20.0, though it crashes at the end:

```
$ python ~/arrow/dev/issue_43497.py
Schema from dataset:
URL: string
TEXT: string
WIDTH: double
HEIGHT: double
similarity: double
punsafe: double
pwatermark: double
AESTHETIC_SCORE: double
hash: int64
Data from dataset:
                                                 URL                                               TEXT   WIDTH  ...  pwatermark  AESTHETIC_SCORE                 hash
0  https://endscan.com/media/36373/fb0bf7b2abe7ac...  View 47 photos of this 3 bed, 4 bath, and 2,49...  2080.0  ...    0.100449         5.040063   929872200875109155
1  https://static0.colliderimages.com/wordpress/w...                         john barrowman - photo #12  2011.0  ...    0.732896         5.544570  8338800302313723098
2  https://images.squarespace-cdn.com/content/v1/...  A black and white limited edition portraits of...  1655.0  ...    0.081598         5.969655  7578604913656441916

[3 rows x 9 columns]
Fatal Python error: PyGILState_Release: auto-releasing thread-state, but no thread-state for this thread
Python runtime state: finalizing (tstate=0x00ba6ac8)

Abandon
```
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
dmpetrov commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-3006481902 We're hitting this issue again, and I think many other users are too. It would be great if this could be prioritized.
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
dberenbaum commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-2359476905 Hi, thanks for looking into it. I think you need to set up your gcloud credentials. You should be able to do that by installing the `gcloud` CLI and then running `gcloud auth application-default login`. I can confirm that the above code from @pitrou works fine for me:

```python
>>> import pyarrow.dataset as ds
>>> uri = "gs://datachain-demo/laion-aesthetics-csv/laion_aesthetics_1024_33M_1.csv?retry_limit_seconds=5"
>>> dataset = ds.dataset(uri, format="csv")
>>> print(dataset.head(5))
pyarrow.Table
URL: string
TEXT: string
WIDTH: double
HEIGHT: double
similarity: double
punsafe: double
pwatermark: double
AESTHETIC_SCORE: double
hash: int64
URL: [["https://endscan.com/media/36373/fb0bf7b2abe7acfbcf95e2a180832e43.jpg","https://static0.colliderimages.com/wordpress/wp-content/uploads/2015/03/arrow-paleyfest-john-barrowman.jpg","https://images.squarespace-cdn.com/content/v1/5bc717b929f2cc0b619dbff7/1554986901865-64XLGVFH3L1V6W0JXMBS/ke17ZwdGBToddI8pDm48kMU-brTfAKJOFGpJx6cnIMl7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z5QPOohDIaIeljMHgDF5CVlOqpeNLcJ80NK65_fV7S1UQqOSwhaf_yYOau3t15EsUQdsqpPwT44aNDfKbHVgxfV5-NaFAjT2bqbYOqLXVJa5A/Profile+of+a+Harris+Hawk.jpg","https://media.kohlsimg.com/is/image/kohls/2656691_ALT2?wid=1024&hei=1024&op_sharpen=1","https://4.bp.blogspot.com/-o67J2rH8_eM/UWSpsLkz3eI/D9U/5g2xqqTt8-w/s1600/Dwarf+Galaxy+Chart+2+small.jpg"]]
TEXT: [["View 47 photos of this 3 bed, 4 bath, and 2,490 sqft. condo home located at 341 Mill St, Saint Paul, Minnesota 55102 is Active for $825,000.","john barrowman - photo #12","A black and white limited edition portraits of a Harris Hawk. Bird portraits in the Raptor series by fine art photographer Paul Coghlin.","Women's Converse Chuck Taylor All Star Madison Floral Lined Sneakers","Physicists of the Caribbean: Infographic : Dwarf Galaxy ..."]]
WIDTH: [[2080,2011,1655,1024,1600]]
HEIGHT: [[1388,3000,1055,1024,1600]]
similarity: [[0.2859024107456207,0.31411105394363403,0.3719012141227722,0.2890952229499817,0.3397912383079529]]
punsafe: [[0.00013357401,0.0062506497,0.014455641,0.00017145276,0.00019073486]]
pwatermark: [[0.10044884,0.73289585,0.081598304,0.18584651,0.6887595]]
AESTHETIC_SCORE: [[5.0400634,5.54457,5.9696546,5.014857,5.413278]]
hash: [[929872200875109155,8338800302313723098,7578604913656441916,3451195012265564296,-4028870017594316595]]
```

In fact, as you can see from this code block, I am even able to successfully run `dataset.head()` in the REPL. However, running the same code in a script causes a segfault. I have observed the same behavior on both Mac and Linux.
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
coryan commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358780357

> > Note the specific error: `UNAVAILABLE ... Couldn't resolve host name`. This error is retryable, whether on authentication or later. One typically wants to try again if the hostname could not be resolved.
>
> I would (perhaps naively) expect DNS to be extremely reliable nowadays, so it would perhaps make sense to have a shorter retry time for DNS requests than for regular requests.

Good point. @ddelgrosso1 maybe consider this as a feature request for the retry loops.
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
pitrou commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358760739

> Note the specific error: `UNAVAILABLE ... Couldn't resolve host name`. This error is retryable, whether on authentication or later. One typically wants to try again if the hostname could not be resolved.

I would (perhaps naively) expect DNS to be extremely reliable nowadays, so it would perhaps make sense to have a shorter retry time for DNS requests than for regular requests.
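The idea of a shorter retry time for DNS failures than for other retryable errors could look roughly like the following. This is only a sketch under stated assumptions: the exception classes, the budget values, and `call_with_retry` are all hypothetical names made up for illustration, and none of this reflects how Arrow or the Google Cloud SDK actually implement their retry loops.

```python
import time


class DnsResolutionError(Exception):
    """Hypothetical stand-in for a "couldn't resolve host name" failure."""


class ServiceUnavailableError(Exception):
    """Hypothetical stand-in for any other retryable failure."""


# Hypothetical time budgets in seconds; neither value comes from Arrow or
# the Google Cloud SDK.
RETRY_BUDGETS = {DnsResolutionError: 2.0, ServiceUnavailableError: 30.0}


def call_with_retry(operation, delay=0.5, clock=time.monotonic, sleep=time.sleep):
    """Call operation() until it succeeds or the error's time budget runs out.

    The budget is looked up by the exact exception type, so DNS failures
    give up much sooner than other retryable errors.
    """
    start = clock()
    while True:
        try:
            return operation()
        except tuple(RETRY_BUDGETS) as exc:
            budget = RETRY_BUDGETS[type(exc)]
            # Stop if waiting once more would exceed this error's budget.
            if clock() - start + delay > budget:
                raise
            sleep(delay)
```

With these made-up budgets, an operation that keeps failing with `DnsResolutionError` is abandoned after about two seconds, while one failing with `ServiceUnavailableError` is retried for up to thirty.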
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
pitrou commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358755621

> "some long", did you mean "so long"? I believe you are suggesting that authentication errors should not be retried, or retried less aggressively. Please forgive me if I misinterpreted your comment.

Yes, sorry.
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
coryan commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358695839 I no longer work on the Google Cloud SDKs. @ddelgrosso1 has taken over for me.

> It's a pity that by default this would retry for **some long** on an authentication failure, though. Perhaps there's a way to avoid that?

"some long", did you mean "so long"? I believe you are suggesting that authentication errors should not be retried, or retried less aggressively. Please forgive me if I misinterpreted your comment.

Note the specific error: `UNAVAILABLE ... Couldn't resolve host name`. This error is retryable, whether on authentication or later. One typically wants to try again if the hostname could not be resolved. The one exception would be trying to resolve `metadata.google.internal.` **if** there is good reason to believe the application is not on a GCE machine.

@ddelgrosso1 recently implemented code around this area. They may have better ideas on how to resolve it.
Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]
pitrou commented on issue #43497: URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358649730 Ok, it seems the request simply fails authenticating and then retries a number of times. You can see this by putting a limit on retry duration (5 seconds in the example below):

```python
>>> import pyarrow.dataset as ds
>>> uri = "gs://datachain-demo/laion-aesthetics-csv/laion_aesthetics_1024_33M_1.csv?retry_limit_seconds=5"
>>> dataset = ds.dataset(uri, format="csv")
Traceback (most recent call last):
  ...
OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted GetObjectMetadata: Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: PerformWork() - CURL error [6]=Couldn't resolve host name)
```

It's a pity that by default this would retry for some long on an authentication failure, though. Perhaps there's a way to avoid that? cc @coryan
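For anyone wanting to apply the same retry limit to other `gs://` URIs, the `retry_limit_seconds` query parameter from the example above can be attached with the standard library alone. This is a minimal sketch; `with_retry_limit` is a hypothetical helper name, not a PyArrow API, and it only edits the URI string without contacting any server.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_retry_limit(uri, seconds):
    """Return uri with retry_limit_seconds set to the given number of seconds.

    Hypothetical helper (not part of PyArrow): it only rewrites the query string.
    """
    parts = urlsplit(uri)
    # Drop any existing retry_limit_seconds so the new value wins.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "retry_limit_seconds"]
    query.append(("retry_limit_seconds", str(seconds)))
    return urlunsplit(parts._replace(query=urlencode(query)))


uri = with_retry_limit(
    "gs://datachain-demo/laion-aesthetics-csv/laion_aesthetics_1024_33M_1.csv", 5
)
print(uri)
```

The resulting string can then be passed to `ds.dataset(uri, format="csv")` exactly as in the comment above.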
