Hi Marco, Thanks a lot for this feedback. I am one of the contribs building out the new commons-numbers release which will replace much of commons-math including the statistics libraries.
I took a look at the commons-math code for Median. Based on my reading of it I am not surprised by what you report. Median just invokes Percentile. Percentile does seem to have some overhead, for example lots of range and null checking. As as one off Median is probably not as efficient as a simple sort. However it can store data and recalculate percentiles, and there it might be efficient to use, as well as having flexibility in how it is implemented. I quite agree that most users simply want to call median() on their array and expect at least as efficient an algorithm as they could hand code themselves. For medians as we all learned in CS 101 class, for a guaranteed result one cannot improve on O(n log n) time, just sorting like you mentioned, however a method can deliver O(log n) time on average, but with a potential worst case of O(n^2), and the user could be given this choice of implementation. The fact that the commons-math contributors did not seem aware of this finding shows some of the occasional weaknesses I am finding as we design a new, fleeter library in commons-numbers. We are redesigning the statistics libraries over this summer with support from Google Summer of Code, so I will see what I can do. Without feedback like yours we would not find these points of improvement, so thanks. On Wed, May 29, 2019 at 3:24 AM Marco Neumann <marco.neum...@gmail.com> wrote: > I am evaluating the use of Apache Math Commons Median for the querying of > large data sets in another Apache project called Apache Jena. > > In my preliminary performance tests I was surprised to find that a simple > implementation of a median function with Arrays.sort() and a programmatic > selection of the median value yields much faster results > than Median().evaluate() or DescriptiveStatistics.getPercentile(50). > > Since we only use this function for Arrays of confirmed numbers is there a > particular benefit in using Apache Commons Math for this task or are we > better advised to use our own implementation here? > > Thank You >