Hi, I have an arrow file produced by HF datasets, and I am trying to load this dataset/arrow file with `datasets.load_from_disk(the_dataset_folder)`. I noticed that the first time I load it, it is significantly slower than on subsequent loads. If I retry a couple of days later, the first load is slow again...
After digging a little, the gap happens in the `_memory_mapped_arrow_table_from_file` function, and in particular in the call to `RecordBatchStreamReader.read_all`: https://github.com/huggingface/datasets/blob/158917e24128afbbe0f03ce36ea8cd9f850ea853/src/datasets/table.py#L51

`read_all` is slow the first time (presumably because of some work that only happens once and stays cached for a few hours?), but not on subsequent calls:

```python
>>> import time
>>> import pyarrow as pa
>>> def _memory_mapped_arrow_table_from_file(filename):
...     memory_mapped_stream = pa.memory_map(filename)
...     opened_stream = pa.ipc.open_stream(memory_mapped_stream)
...     start_time = time.time()
...     _ = opened_stream.read_all()
...     print(f"{time.time() - start_time}")
...
>>> filename_slow = "train/00248-00249/cache-3d25861de64b93b5.arrow"
>>> _memory_mapped_arrow_table_from_file(filename_slow)  # first time
0.24040865898132324
>>> _memory_mapped_arrow_table_from_file(filename_slow)  # subsequent times
0.0006551742553710938
>>> _memory_mapped_arrow_table_from_file(filename_slow)
0.0006804466247558594
>>> _memory_mapped_arrow_table_from_file(filename_slow)
0.0009818077087402344
```

Is there anything I can do to remove that discrepancy?

My setup:
- Platform: Linux-4.18.0-305.57.1.el8_4.x86_64-x86_64-with-glibc2.17
- Python version: 3.8.13
- PyArrow version: 9.0.0

Thanks in advance!
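In case it helps narrow this down, here is a sketch of the experiment I would run to test the "cached for a few hours" hypothesis, assuming the one-time cost is the OS page cache being cold rather than anything pyarrow does: if sequentially reading the raw bytes of the file once makes the very next memory-mapped `read_all` fast, the slowness comes from pulling pages off disk into the kernel's page cache (which the kernel later evicts, hence the slowdown returning days later). `warm_page_cache` below is just an illustrative helper, not part of the `datasets` or `pyarrow` API.

```python
import time
import pyarrow as pa

def warm_page_cache(filename, chunk_size=64 * 1024 * 1024):
    # Hypothetical helper: read the file sequentially so the kernel
    # pulls its pages into the page cache before we memory-map it.
    with open(filename, "rb") as f:
        while f.read(chunk_size):
            pass

def timed_read_all(filename):
    # Same steps as _memory_mapped_arrow_table_from_file, with timing.
    memory_mapped_stream = pa.memory_map(filename)
    opened_stream = pa.ipc.open_stream(memory_mapped_stream)
    start_time = time.time()
    _ = opened_stream.read_all()
    print(f"{time.time() - start_time:.6f}s")

filename_slow = "train/00248-00249/cache-3d25861de64b93b5.arrow"
warm_page_cache(filename_slow)
# If the page cache is the culprit, this first read_all should already
# be fast, matching the "subsequent times" numbers above.
timed_read_all(filename_slow)
```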
