Hi everyone, Today was the 3rd time I came across a situation where it was needed to retrieve all the positions of the line endings (or beginnings) in a very long python string as efficiently as possible. First time, it was needed in prompt_toolkit, where I spent a crazy amount of time looking for the most performant solution. Second time was in a commercial project where performance was very critical too. Third time is for the Rich/Textual project from Will McGugan. (See: https://twitter.com/willmcgugan/status/1537782771137011715 )
The problem is that the `str` type doesn't expose any API to efficiently find all \n positions. Every Python implementation is either calling `.index()` in a loop and collecting the results or running a regex over the string and collecting all positions. For long strings, depending on the implementation, this results in a lot of overhead due to either: - calling Python functions (or any other Python instruction) for every \n character in the input. The amount of executed Python instructions is O(n) here. - Copying string data into new strings. The fastest solution I've been using for some time, does this (simplified): `accumulate(chain([0], map(len, text.splitlines(True))))`. The performance is great here, because the amount of Python instructions is O(1). Everything is chained in C-code thanks to itertools. Because of that, it can outperform the regex solution with a factor of ~2.5. (Regex isn't slow, but iterating over the results is.) The bad things about this solution is however: - Very cumbersome syntax. - We call `splitlines()` which internally allocates a huge amount of strings, only to use their lengths. That is still much more overhead then a simple for-loop in C would be. Performance matters here, because for these kind of problems, the list of integers that gets produced is typically used as an index to quickly find character offsets in the original string, depending on which line is displayed/processed. The bisect library helps too to quickly convert any index position of that string into a line number. The point is, that for big inputs, the amount of Python instructions executed is not O(n), but O(1). Of course, some of the C code remains O(n). So, my ask here. Would it make sense to add a `line_offsets()` method to `str`? Or even `character_offsets(character)` if we want to do that for any character? Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary lengths? Thanks, Jonathan
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/6WAMKYXOYA3SKL5HIRZP4WARMYYKXI3Q/ Code of Conduct: http://python.org/psf/codeofconduct/