In case it's helpful at all: on the Julia side, we include an index <https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14> in the ChainedVector type (the Julia equivalent of ChunkedArray), where each element of the `inds` field is the last index of the corresponding array chunk. We then use that index when indexing <https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L32> with `searchsortedfirst`, which does a binary search over the indices. We also overload iteration <https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L91> so that sequential access just moves from chunk to chunk without having to pay the binary-search indexing cost on every element.
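
If it helps translate, here is a minimal Python sketch of that same cumulative-index idea applied to a pyarrow ChunkedArray. The `IndexedChunkedArray` wrapper is a name made up for illustration (it is not part of pyarrow): it precomputes each chunk's end offset, the analogue of ChainedVector's `inds`, and uses `bisect` where ChainedVector uses `searchsortedfirst`:

```
import bisect

import pyarrow as pa


class IndexedChunkedArray:
    """Wrap a pyarrow ChunkedArray with a precomputed index of
    cumulative chunk end offsets (the analogue of ChainedVector's
    `inds` field)."""

    def __init__(self, chunked: pa.ChunkedArray):
        self.chunked = chunked
        # ends[k] is one past the last global index stored in chunk k.
        self.ends = []
        total = 0
        for chunk in chunked.iterchunks():
            total += len(chunk)
            self.ends.append(total)

    def __getitem__(self, i: int):
        # Binary search for the chunk containing i, the analogue of
        # searchsortedfirst over `inds` in ChainedVector.
        k = bisect.bisect_right(self.ends, i)
        start = self.ends[k - 1] if k > 0 else 0
        return self.chunked.chunk(k)[i - start]
```

With that, `IndexedChunkedArray(table["text"])[j]` should cost one O(log n_chunks) search plus one O(1) array access, wherever `j` falls in the table.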
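
And the sequential-access side of the same idea, sketched in Python: instead of calling `table["text"][j]` for consecutive `j` (which repeats the chunk lookup every time), walk the chunks directly. `iterchunks()` already exists on ChunkedArray; the generator around it is just illustrative:

```
import pyarrow as pa


def iter_values(chunked: pa.ChunkedArray):
    """Yield every element in order, one chunk at a time, so sequential
    access never pays a per-element chunk lookup."""
    for chunk in chunked.iterchunks():
        # Each chunk is a plain pyarrow Array with O(1) element access.
        for value in chunk:
            yield value
```
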
Anyway, maybe that at least sparks some ideas.

-Jacob

On Mon, Mar 15, 2021 at 11:11 AM Quentin Lhoest <[email protected]> wrote:

> Thanks for your response!
>
> This is accurate; the O(1) access applies to a single Array. Chunked
> arrays are stored as a C++ vector of Arrays. There is currently no
> indexing structure on top of the vector to allow for anything better
> than O(chunk) to an arbitrary element.
>
> That makes sense, thanks.
>
> I am a little surprised that the difference is as pronounced in your
> benchmark for only 1024 chunks though; I'm not familiar enough with the
> memory-mapped code path to be able to answer why that would be the case.
>
> For information, I'm getting approximately the same results without
> memory mapping though:
>
> ```
> Time to access example at i=0%  : 9.2μs
> Time to access example at i=10% : 7.6μs
> Time to access example at i=20% : 8.7μs
> Time to access example at i=30% : 10.8μs
> Time to access example at i=40% : 13.0μs
> Time to access example at i=50% : 15.4μs
> Time to access example at i=60% : 17.6μs
> Time to access example at i=70% : 20.5μs
> Time to access example at i=80% : 22.9μs
> Time to access example at i=90% : 24.1μs
> ```
>
> From the previous code I just replaced `pa.memory_map(…)` with
> `open(…, "rb")`.
>
> Also, regarding the time difference, it must come from an iteration
> over the chunks until the requested example is reached.
>
> In terms of timing, 20μs is a reasonable time for a list of 1000
> elements in Python, since for loops are pretty slow.
> I would expect it to be much faster if it were done on the C++ side
> though.
>
> Do you know if the iteration is done on the Python side, or on the
> C++ side?
>
> This would probably be a good addition to the library. In the short
> term, as a work-around, you could build the index externally and use
> that (chunk methods are exposed on ChunkedArray).
>
> May I create a post on JIRA about adding an indexing structure for
> ChunkedArray?
>
> Best,
> Quentin
>
> On Mar 15, 2021, at 5:40 PM, Micah Kornfield <[email protected]> wrote:
>
> I am a little surprised that the difference is as pronounced in your
> benchmark for only 1024 chunks though; I'm not familiar enough with the
> memory-mapped code path to be able to answer why that would be the case.
>
> On Mon, Mar 15, 2021 at 9:37 AM Micah Kornfield <[email protected]> wrote:
>
> Hi Quentin,
>
> More specifically, it looks like the time it takes to access an example
> of a ChunkedArray is a function of the index of the chunk in which the
> example is located.
>
> This is accurate; the O(1) access applies to a single Array. Chunked
> arrays are stored as a C++ vector of Arrays. There is currently no
> indexing structure on top of the vector to allow for anything better
> than O(chunk) to an arbitrary element.
>
> This would probably be a good addition to the library. In the short
> term, as a work-around, you could build the index externally and use
> that (chunk methods are exposed on ChunkedArray).
>
> -Micah
>
> On Mon, Mar 15, 2021 at 9:12 AM Quentin Lhoest <[email protected]> wrote:
>
> Hi!
>
> I'm working on the huggingface/datasets library and we are using Arrow
> for natural language processing datasets.
> Our datasets are usually text datasets, often bigger than memory.
>
> We use memory mapping of Arrow files to load the datasets.
>
> Unfortunately, it looks like accessing examples is not O(1).
>
> More specifically, it looks like the time it takes to access an example
> of a ChunkedArray is a function of the index of the chunk in which the
> example is located.
> If the example is in the last chunk, then it takes more time to access
> it than accessing an example in the first chunk.
>
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
>
> Then the average times it takes to access one example are:
>
> ```
> Time to access example at i=0%  : 6.7μs
> Time to access example at i=10% : 7.2μs
> Time to access example at i=20% : 9.1μs
> Time to access example at i=30% : 11.4μs
> Time to access example at i=40% : 13.8μs
> Time to access example at i=50% : 16.2μs
> Time to access example at i=60% : 18.7μs
> Time to access example at i=70% : 21.1μs
> Time to access example at i=80% : 26.8μs
> Time to access example at i=90% : 25.2μs
> ```
>
> The times measured are the average times to do `table["text"][j]`,
> depending on the index we want to fetch (from the first example at 0%
> to the example at 90% of the length of the table).
>
> You can take a look at the code that produces this benchmark here.
>
> Why do I observe this behavior? Is there a way to get O(1) access
> without bringing the whole table into memory?
>
> Thanks in advance for your help!
>
> Quentin
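
For anyone who wants to poke at this without the original script, here is a rough sketch of a benchmark with the shape described above. This is my own reconstruction, not the code linked in the thread, and it builds the table in memory rather than writing and memory-mapping an Arrow file:

```
import time

import pyarrow as pa

# 1024 chunks, each 1024 rows of 1024-character strings, matching the
# shape described in the thread. Table.from_batches keeps each batch as
# its own chunk.
batch = pa.record_batch([pa.array(["x" * 1024] * 1024)], names=["text"])
table = pa.Table.from_batches([batch] * 1024)

n = table.num_rows
for pct in range(0, 100, 10):
    j = n * pct // 100
    start = time.perf_counter()
    for _ in range(100):
        table["text"][j]  # average over 100 accesses at the same index
    elapsed_us = (time.perf_counter() - start) / 100 * 1e6
    print(f"Time to access example at i={pct}% : {elapsed_us:.1f}μs")
```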
