Hi Marco,

Thanks a lot for this feedback. I am one of the contribs building out the
new commons-numbers release which will replace much of commons-math
including the statistics libraries.

I took a look at the commons-math code for Median. Based on my reading of
it I am not surprised by what you report. Median just invokes Percentile.
Percentile does seem to have some overhead, for example lots of range and
null checking. As as one off Median is probably not as efficient as a
simple sort. However it can store data and recalculate percentiles, and
there it might be efficient to use, as well as having flexibility in how it
is implemented.

I quite agree that most users simply want to call median() on their array
and expect at least as efficient an algorithm as they could hand code
themselves. For medians as we all learned in CS 101 class, for a guaranteed
result one cannot improve on O(n log n) time, just sorting like you
mentioned, however a method can deliver O(log n) time on average, but with
a potential worst case of O(n^2), and the user could be given this choice
of implementation.

The fact that the commons-math contributors did not seem aware of this
finding shows some of the occasional weaknesses I am finding as we design a
new, fleeter library in commons-numbers. We are redesigning the statistics
libraries over this summer with support from Google Summer of Code, so I
will see what I can do. Without feedback like yours we would not find these
points of improvement, so thanks.

On Wed, May 29, 2019 at 3:24 AM Marco Neumann <marco.neum...@gmail.com>
wrote:

> I am evaluating the use of Apache Math Commons Median for the querying of
> large data sets in another Apache project called Apache Jena.
>
> In my preliminary performance tests I was surprised to find that a simple
> implementation of a median function with Arrays.sort() and a programmatic
> selection of the median value yields much faster results
> than Median().evaluate() or DescriptiveStatistics.getPercentile(50).
>
> Since we only use this function for  Arrays of confirmed numbers is there a
> particular benefit in using Apache Commons Math for this task or are we
> better advised to use our own implementation here?
>
> Thank You
>

Reply via email to