Hi!

I’m working on the huggingface/datasets library and we are using Arrow for 
natural language processing datasets.
Our datasets are usually text datasets, often bigger than memory.

We use memory mapping of arrow files to load the datasets.

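For context, here is a minimal sketch of the kind of loading code we use, assuming the Arrow IPC stream format (the file name is just a placeholder, not our actual code):

```python
import pyarrow as pa

# Open the file with memory mapping so reads don't bring everything into RAM.
# The path is a placeholder; we assume the Arrow IPC stream format here.
source = pa.memory_map("dataset.arrow", "r")
table = pa.ipc.open_stream(source).read_all()  # buffers stay backed by the memory map
```
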
> Unfortunately it looks like accessing examples is not O(1)

More specifically, the time it takes to access an example of a ChunkedArray 
depends on the index of the chunk that contains it: an example in the last 
chunk takes longer to access than an example in the first chunk.

For example, take a Table with a single "text" column defined by:
- 1024 chunks
- 1024 rows per chunk
- each row being a text of 1024 characters

Then the average times to access one example are:

```
Time to access example at i=0%  : 6.7μs
Time to access example at i=10% : 7.2μs
Time to access example at i=20% : 9.1μs
Time to access example at i=30% : 11.4μs
Time to access example at i=40% : 13.8μs
Time to access example at i=50% : 16.2μs
Time to access example at i=60% : 18.7μs
Time to access example at i=70% : 21.1μs
Time to access example at i=80% : 26.8μs
Time to access example at i=90% : 25.2μs
```

The times measured are the average times to do `table["text"][j]` depending on 
the index we want to fetch (from the first example at 0% to the example at 90% 
of the table length).

You can take a look at the code that produces this benchmark here 
<https://pastebin.com/pSkYHQn9>.
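
In case it helps, here is a rough, self-contained sketch of that kind of benchmark (variable names and the repeat count are mine, not the exact pastebin code; it also builds the table in memory rather than memory-mapping it from disk, but the chunk layout is the same):

```python
import time
import pyarrow as pa

# Build a table with one "text" column: 1024 chunks of 1024 rows,
# each row being a 1024-character string (matching the setup above).
n_chunks, rows_per_chunk, chars_per_row = 1024, 1024, 1024
chunks = [pa.array(["x" * chars_per_row] * rows_per_chunk) for _ in range(n_chunks)]
table = pa.Table.from_arrays([pa.chunked_array(chunks)], names=["text"])

# Time random access at 0%, 10%, ..., 90% of the table length.
n_repeats = 1_000
for fraction in range(0, 100, 10):
    j = table.num_rows * fraction // 100
    start = time.perf_counter()
    for _ in range(n_repeats):
        table["text"][j]
    elapsed = (time.perf_counter() - start) / n_repeats
    print(f"Time to access example at i={fraction}% : {elapsed * 1e6:.1f}µs")
```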

> Why do I observe this behavior? Is there a way to get O(1) access without 
> bringing the whole table in memory?


Thanks in advance for your help!

Quentin
