Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-07-07 Thread via GitHub


dmpetrov commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3045953884

   `Python 3.13.5 (main, Jun 11 2025, 15:36:57) [Clang 17.0.0 (clang-1700.0.13.3)] on darwin`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-07-07 Thread via GitHub


pitrou commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3043769609

   Oh, thank you @dmpetrov. Which Python version is it?





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-07-07 Thread via GitHub


dmpetrov commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3043723163

   This is the lldb stack trace. I hope it helps.
   ```
   $ lldb python
   (lldb) run example.py
   Process 64890 launched: '/.../python' (arm64)
   Process 64890 stopped
   * thread #2, stop reason = exec
   frame #0: 0x0001000147c0 dyld`_dyld_start
   dyld`_dyld_start:
   ->  0x1000147c0 <+0>:  mov    x0, sp
   0x1000147c4 <+4>:  and    sp, x0, #0xfff0
   0x1000147c8 <+8>:  mov    x29, #0x0 ; =0
   0x1000147cc <+12>: mov    x30, #0x0 ; =0
   Target 0: (Python) stopped.
   (lldb) c
   Process 64890 resuming
   Schema from dataset:
   URL: string
   TEXT: string
   WIDTH: double
   HEIGHT: double
   similarity: double
   punsafe: double
   pwatermark: double
   AESTHETIC_SCORE: double
   hash: int64
   
   Data from dataset:
                                                     URL  ...                 hash
   0  https://endscan.com/media/36373/fb0bf7b2abe7ac...  ...   929872200875109155
   1  https://static0.colliderimages.com/wordpress/w...  ...  8338800302313723098
   2  https://images.squarespace-cdn.com/content/v1/...  ...  7578604913656441916
   
   [3 rows x 9 columns]
   
   ^C
   
   Process 64890 stopped
   * thread #2, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
   frame #0: 0x0001864ec3cc libsystem_kernel.dylib`__psynch_cvwait + 8
   libsystem_kernel.dylib`__psynch_cvwait:
   ->  0x1864ec3cc <+8>:  b.lo   0x1864ec3ec; <+40>
   0x1864ec3d0 <+12>: pacibsp
   0x1864ec3d4 <+16>: stp    x29, x30, [sp, #-0x10]!
   0x1864ec3d8 <+20>: mov    x29, sp
   Target 0: (Python) stopped.
   (lldb) thread backtrace all
   * thread #2, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
 * frame #0: 0x0001864ec3cc libsystem_kernel.dylib`__psynch_cvwait + 8
   frame #1: 0x00018652b0e0 libsystem_pthread.dylib`_pthread_cond_wait + 984
   frame #2: 0x00018645b298 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock&) + 32
   frame #3: 0x0001144e59d8 libarrow.2000.dylib`arrow::internal::ThreadPool::Shutdown(bool) + 176
   frame #4: 0x0001144e5874 libarrow.2000.dylib`arrow::internal::ThreadPool::~ThreadPool() + 56
   frame #5: 0x0001144e5b14 libarrow.2000.dylib`arrow::internal::ThreadPool::~ThreadPool() + 12
   frame #6: 0x000114270378 libarrow.2000.dylib`std::__1::shared_ptr::~shared_ptr[abi:ue170006]() + 56
   frame #7: 0x0001863e2944 libsystem_c.dylib`__cxa_finalize_ranges + 480
   frame #8: 0x0001863e2704 libsystem_c.dylib`exit + 44
   frame #9: 0x000186534dc8 libdyld.dylib`dyld4::LibSystemHelpers::exit(int) const + 20
   frame #10: 0x00018618acac dyld`dyld4::LibSystemHelpersWrapper::exit(int) const + 172
   frame #11: 0x00018618abc8 dyld`start + 6124
 thread #3
   frame #0: 0x0001864ea8b0 libsystem_kernel.dylib`__workq_kernreturn + 8
   ```





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-07-02 Thread via GitHub


dmpetrov commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3029385404

   @pitrou Unfortunately, I cannot run it under gdb in the near future.





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-06-30 Thread via GitHub


pitrou commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3019044307

   @dmpetrov Are you able to run your script under gdb and get a backtrace of where it hangs?





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-06-30 Thread via GitHub


pitrou commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3019004300

   And the other 2 threads are just waiting for thread pool shutdown:
   ```
   Thread 3 (Thread 0x76985a3f3080 (LWP 94748)):
   #0  0x76985a098d71 in __futex_abstimed_wait_common64 (private=-1, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x35b339a4) at ./nptl/futex-internal.c:57
   #1  __futex_abstimed_wait_common (cancel=true, private=-1, abstime=0x0, clockid=0, expected=0, futex_word=0x35b339a4) at ./nptl/futex-internal.c:87
   #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x35b339a4, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
   #3  0x76985a09b7ed in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x35b33920, cond=0x35b33978) at ./nptl/pthread_cond_wait.c:503
   #4  ___pthread_cond_wait (cond=0x35b33978, mutex=0x35b33920) at ./nptl/pthread_cond_wait.c:627
   #5  0x7698584b9047 in std::condition_variable::wait(std::unique_lock&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
   #6  0x7698538949fb in arrow::internal::ThreadPool::Shutdown(bool) () from /home/antoine/t/venv-3.12/lib/python3.12/site-packages/pyarrow/libarrow.so.2100
   #7  0x76985389ad6f in std::_Sp_counted_ptr::_M_dispose() () from /home/antoine/t/venv-3.12/lib/python3.12/site-packages/pyarrow/libarrow.so.2100
   #8  0x769853182727 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /home/antoine/t/venv-3.12/lib/python3.12/site-packages/pyarrow/libarrow.so.2100
   #9  0x76985a047a76 in __run_exit_handlers (status=0, listp=, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:108
   #10 0x76985a047bbe in __GI_exit (status=) at ./stdlib/exit.c:138
   #11 0x76985a02a1d1 in __libc_start_call_main (main=main@entry=0x518950, argc=argc@entry=2, argv=argv@entry=0x7fff97594798) at ../sysdeps/nptl/libc_start_call_main.h:74
   #12 0x76985a02a28b in __libc_start_main_impl (main=0x518950, argc=2, argv=0x7fff97594798, init=, fini=, rtld_fini=, stack_end=0x7fff97594788) at ../csu/libc-start.c:360
   #13 0x006575a5 in _start ()
   
   Thread 2 (Thread 0x7698523ff6c0 (LWP 94749)):
   #0  0x76985a098d71 in __futex_abstimed_wait_common64 (private=30360, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x769852a16630) at ./nptl/futex-internal.c:57
   #1  __futex_abstimed_wait_common (cancel=true, private=30360, abstime=0x0, clockid=0, expected=0, futex_word=0x769852a16630) at ./nptl/futex-internal.c:87
   #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x769852a16630, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
   #3  0x76985a09b7ed in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x769852a16678, cond=0x769852a16608) at ./nptl/pthread_cond_wait.c:503
   #4  ___pthread_cond_wait (cond=0x769852a16608, mutex=0x769852a16678) at ./nptl/pthread_cond_wait.c:627
   #5  0x7698545f1793 in background_thread_entry () from /home/antoine/t/venv-3.12/lib/python3.12/site-packages/pyarrow/libarrow.so.2100
   #6  0x76985a09caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
   #7  0x76985a129c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
   ```





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-06-30 Thread via GitHub


pitrou commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3018965995

   I can also reproduce with a recent nightly build (`pyarrow-21.0.0.dev254-cp312-cp312-manylinux_2_28_x86_64.whl`).





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-06-30 Thread via GitHub


pitrou commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3018933031

   I do not exactly get a hang on PyArrow 20.0, though it crashes at the end:
   ```
   $ python ~/arrow/dev/issue_43497.py 
   
   
   Schema from dataset:
   URL: string
   TEXT: string
   WIDTH: double
   HEIGHT: double
   similarity: double
   punsafe: double
   pwatermark: double
   AESTHETIC_SCORE: double
   hash: int64
   
   Data from dataset:
                                                     URL                                               TEXT   WIDTH  ...  pwatermark  AESTHETIC_SCORE                 hash
   0  https://endscan.com/media/36373/fb0bf7b2abe7ac...  View 47 photos of this 3 bed, 4 bath, and 2,49...  2080.0  ...    0.100449         5.040063   929872200875109155
   1  https://static0.colliderimages.com/wordpress/w...                         john barrowman - photo #12  2011.0  ...    0.732896         5.544570  8338800302313723098
   2  https://images.squarespace-cdn.com/content/v1/...  A black and white limited edition portraits of...  1655.0  ...    0.081598         5.969655  7578604913656441916
   
   [3 rows x 9 columns]
   Fatal Python error: PyGILState_Release: auto-releasing thread-state, but no thread-state for this thread
   Python runtime state: finalizing (tstate=0x00ba6ac8)
   
   Abandon
   ```





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2025-06-25 Thread via GitHub


dmpetrov commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-3006481902

   We're hitting this issue again, and I think many other users are too. 
It would be great if this could be prioritized.





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2024-09-18 Thread via GitHub


dberenbaum commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-2359476905

   Hi, thanks for looking into it. I think you need to set up your gcloud 
credentials. You should be able to do that by installing the `gcloud` CLI and 
then running `gcloud auth application-default login`.
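
Before retrying the reproduction, it can help to check whether Application Default Credentials are visible at all. The sketch below is a minimal, hypothetical helper (`find_adc_file` is not part of any library). It assumes the usual lookup order: the `GOOGLE_APPLICATION_CREDENTIALS` environment variable first, then the file that `gcloud auth application-default login` writes under `~/.config/gcloud` on Linux and macOS. It only checks that a file exists, not that the credentials are valid.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first Application Default Credentials file found, or None.

    Hypothetical helper: it only checks that a file exists, not that the
    credentials inside it are valid.
    """
    # 1. An explicit path set through the environment takes precedence.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return Path(explicit)

    # 2. Assumed default location used by the gcloud CLI on Linux and macOS.
    home = home if home is not None else Path.home()
    default = home / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None
```

If `find_adc_file()` returns `None`, the authentication failure reported earlier in the thread is a likely outcome.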
   
   I can confirm that the above code from @pitrou works fine for me:
   
   ```python
   >>> import pyarrow.dataset as ds
   >>> uri = "gs://datachain-demo/laion-aesthetics-csv/laion_aesthetics_1024_33M_1.csv?retry_limit_seconds=5"
   >>> dataset = ds.dataset(uri, format="csv")
   >>> print(dataset.head(5))
   pyarrow.Table
   URL: string
   TEXT: string
   WIDTH: double
   HEIGHT: double
   similarity: double
   punsafe: double
   pwatermark: double
   AESTHETIC_SCORE: double
   hash: int64
   
   URL: [["https://endscan.com/media/36373/fb0bf7b2abe7acfbcf95e2a180832e43.jpg","https://static0.colliderimages.com/wordpress/wp-content/uploads/2015/03/arrow-paleyfest-john-barrowman.jpg","https://images.squarespace-cdn.com/content/v1/5bc717b929f2cc0b619dbff7/1554986901865-64XLGVFH3L1V6W0JXMBS/ke17ZwdGBToddI8pDm48kMU-brTfAKJOFGpJx6cnIMl7gQa3H78H3Y0txjaiv_0fDoOvxcdMmMKkDsyUqMSsMWxHk725yiiHCCLfrh8O1z5QPOohDIaIeljMHgDF5CVlOqpeNLcJ80NK65_fV7S1UQqOSwhaf_yYOau3t15EsUQdsqpPwT44aNDfKbHVgxfV5-NaFAjT2bqbYOqLXVJa5A/Profile+of+a+Harris+Hawk.jpg","https://media.kohlsimg.com/is/image/kohls/2656691_ALT2?wid=1024&hei=1024&op_sharpen=1","https://4.bp.blogspot.com/-o67J2rH8_eM/UWSpsLkz3eI/D9U/5g2xqqTt8-w/s1600/Dwarf+Galaxy+Chart+2+small.jpg"]]
   TEXT: [["View 47 photos of this 3 bed, 4 bath, and 2,490 sqft. condo home located at 341 Mill St, Saint Paul, Minnesota 55102 is Active for $825,000.","john barrowman - photo #12","A black and white limited edition portraits of a Harris Hawk. Bird portraits in the Raptor series by fine art photographer Paul Coghlin.","Women's Converse Chuck Taylor All Star Madison Floral Lined Sneakers","Physicists of the Caribbean: Infographic : Dwarf Galaxy ..."]]
   WIDTH: [[2080,2011,1655,1024,1600]]
   HEIGHT: [[1388,3000,1055,1024,1600]]
   similarity: [[0.2859024107456207,0.31411105394363403,0.3719012141227722,0.2890952229499817,0.3397912383079529]]
   punsafe: [[0.00013357401,0.0062506497,0.014455641,0.00017145276,0.00019073486]]
   pwatermark: [[0.10044884,0.73289585,0.081598304,0.18584651,0.6887595]]
   AESTHETIC_SCORE: [[5.0400634,5.54457,5.9696546,5.014857,5.413278]]
   hash: [[929872200875109155,8338800302313723098,7578604913656441916,3451195012265564296,-4028870017594316595]]
   ```
   
   In fact, as you can see from this code block, I am even able to successfully 
run `dataset.head()` in the REPL. However, running the same code in a script 
causes a segfault. I have observed the same behavior on both macOS and Linux.
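
Since the failure only shows up when the code runs as a script, one way to make reports comparable is to run the script in a child process and classify how it ends. The following is a rough sketch, not part of PyArrow: `classify_exit` is a hypothetical helper, the 60-second default timeout is an arbitrary assumption, and the negative return code convention for signals applies to POSIX systems only.

```python
import subprocess
import sys
from typing import Sequence


def classify_exit(argv: Sequence[str], timeout: float = 60.0) -> str:
    """Run a command and report whether it exited, crashed, or hung.

    Hypothetical helper for bug reports; the timeout is an arbitrary choice.
    """
    try:
        # subprocess.run kills the child and raises if the timeout expires.
        proc = subprocess.run(list(argv), capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hung"
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the fatal signal.
        return f"killed by signal {-proc.returncode}"
    if proc.returncode == 0:
        return "ok"
    return f"exit code {proc.returncode}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python classify.py example.py
    print(classify_exit([sys.executable, *sys.argv[1:]]))
```

A result of `hung` after the script has printed all of its output would be consistent with a process that blocks during interpreter shutdown, while `killed by signal 11` or `killed by signal 6` would point to a segfault or an abort.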





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2024-09-18 Thread via GitHub


coryan commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358780357

   > > Note the specific error: `UNAVAILABLE ... Couldn't resolve host name`. This error is retryable, whether on authentication or later. One typically wants to try again if the hostname could not be resolved.
   > 
   > I would (perhaps naively) expect DNS to be extremely reliable nowadays, so it would perhaps make sense to have a shorter retry time for DNS requests than for regular requests.
   
   Good point.  @ddelgrosso1 maybe consider this as a feature request for the 
retry loops.





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2024-09-18 Thread via GitHub


pitrou commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358760739

   > Note the specific error: `UNAVAILABLE ... Couldn't resolve host name`. This error is retryable, whether on authentication or later. One typically wants to try again if the hostname could not be resolved.
   
   I would (perhaps naively) expect DNS to be extremely reliable nowadays, so 
it would perhaps make sense to have a shorter retry time for DNS requests than 
for regular requests.
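
To make the suggestion concrete, here is a minimal sketch of a retry loop that gives each error class its own time budget. It is not how the Google Cloud C++ client or Arrow implement retries; `DnsResolutionError`, `retry_with_budgets`, and the budget values are all hypothetical, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")


class DnsResolutionError(OSError):
    """Hypothetical stand-in for a "Couldn't resolve host name" failure."""


def retry_with_budgets(
    op: Callable[[], T],
    budgets: Sequence[Tuple[Type[BaseException], float]],
    default_budget: float,
    delay: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `op` until it succeeds or the budget for its error class runs out."""
    start = clock()
    while True:
        try:
            return op()
        except OSError as exc:
            # Pick the first matching per-class budget, else the default one.
            budget = next(
                (b for cls, b in budgets if isinstance(exc, cls)),
                default_budget,
            )
            # Give up if another wait would exceed the budget for this error.
            if clock() - start + delay > budget:
                raise
            sleep(delay)
```

With `budgets=[(DnsResolutionError, 2.0)]` and `default_budget=60.0`, a persistent DNS failure would surface after about two seconds while other transient errors keep the longer budget.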





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2024-09-18 Thread via GitHub


pitrou commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358755621

   > "some long", did you mean "so long"? I believe you are suggesting that authentication errors should not be retried, or retried less aggressively. Please forgive me if I misinterpreted your comment.
   
   Yes, sorry.
   





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2024-09-18 Thread via GitHub


coryan commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358695839

   I no longer work on the Google Cloud SDKs.  @ddelgrosso1 has taken over for 
me.
   
   > It's a pity that by default this would retry for **some long** on an authentication failure, though. Perhaps there's a way to avoid that?
   
   "some long", did you mean "so long"? I believe you are suggesting that 
authentication errors should not be retried, or retried less aggressively. 
Please forgive me if I misinterpreted your comment.
   
   Note the specific error: `UNAVAILABLE ...  Couldn't resolve host name`.  
This error is retryable, whether on authentication or later.  One typically 
wants to try again if the hostname could not be resolved.  The one exception 
would be trying to resolve `metadata.google.internal.` **if** there is good 
reason to believe the application is not on a GCE machine.  @ddelgrosso1 
recently implemented code around this area. They may have better ideas on how 
to resolve it.
   





Re: [I] [Python] Reading partial data/first block hangs on some cloud filesystems [arrow]

2024-09-18 Thread via GitHub


pitrou commented on issue #43497:
URL: https://github.com/apache/arrow/issues/43497#issuecomment-2358649730

   OK, it seems the request simply fails to authenticate and then retries a 
number of times. You can see this by putting a limit on the retry duration (5 
seconds in the example below):
   ```python
   >>> import pyarrow.dataset as ds
   >>> uri = "gs://datachain-demo/laion-aesthetics-csv/laion_aesthetics_1024_33M_1.csv?retry_limit_seconds=5"
   >>> dataset = ds.dataset(uri, format="csv")
   Traceback (most recent call last):
 ...
   OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted GetObjectMetadata: Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: PerformWork() - CURL error [6]=Couldn't resolve host name)
   ```
   
   It's a pity that by default this would retry for some long on an 
authentication failure, though. Perhaps there's a way to avoid that?
   
   cc @coryan 
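
For scripts that build the URI programmatically, the `retry_limit_seconds` query parameter used in the example above can be appended with the standard library alone. This is only a sketch: `with_retry_limit` is a hypothetical helper name, and it assumes nothing about PyArrow beyond the URI form shown in that example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_retry_limit(uri: str, seconds: int) -> str:
    """Return `uri` with `retry_limit_seconds` set to `seconds`.

    Hypothetical helper: any existing `retry_limit_seconds` value is
    replaced, and other query parameters are kept in their original order.
    """
    parts = urlsplit(uri)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "retry_limit_seconds"
    ]
    query.append(("retry_limit_seconds", str(seconds)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The resulting string can then be passed to `ds.dataset(uri, format="csv")` exactly as in the session above.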

