Re: [Matplotlib-users] Bug in boxplot/mlab.prctile
Hi!

> I noticed that the boxplot function incorrectly calculates the
> location of the median line in each box. As a simple example,
> plotting the dataset [1, 2, 3, 4] incorrectly plots the median
> line at 3.

I can confirm this.

> [..] I would suggest that mlab.prctile be fixed to conform to some
> one or other of these methods, rather than adding to the
> proliferation of approaches to quantile-calculation. Is there any
> motivation for always truncating to integer (other than it's
> quicker to type :-)?

And I agree here. I also recently (before I noticed this thread) posted bug report #3151034 [1]. It also documents that mlab.prctile does not yield the same results as scipy.stats.scoreatpercentile. To make the confusion complete, MATLAB reports yet another result ... But I think at least matplotlib and the scipy.stats package should agree.

Jochen

[1] http://sourceforge.net/tracker/?func=detail&aid=3151034&group_id=80706&atid=560720

--
Jochen Deibele, PhD candidate, Dipl.-Ing.
Department of Circulation and Medical Imaging
Norwegian University of Science and Technology (NTNU)
Phone: +47 728 28028
E-Mail: jochen.deib...@ntnu.no

___
Matplotlib-users mailing list
Matplotlib-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/matplotlib-users
Re: [Matplotlib-users] Bug in boxplot/mlab.prctile
2011/1/1 OKB (not okblacke) <brenb...@brenbarn.net>:
> I noticed that the boxplot function incorrectly calculates the
> location of the median line in each box. As a simple example,
> plotting the dataset [1, 2, 3, 4] incorrectly plots the median
> line at 3.

It seems to work fine in matplotlib 1.0.0:

u...@host:~$ python
Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import matplotlib as mpl
>>> mpl.__version__
'1.0.0'
>>> import matplotlib.pyplot as plt
>>> import matplotlib.mlab as mlab
>>> plt.ion()
>>> plt.boxplot([1, 2, 3, 4])
{'medians': [<matplotlib.lines.Line2D object at 0x3ad6250>], 'fliers': [<matplotlib.lines.Line2D object at 0x3ad6610>, <matplotlib.lines.Line2D object at 0x3ad69d0>], 'whiskers': [<matplotlib.lines.Line2D object at 0x3acff50>, <matplotlib.lines.Line2D object at 0x3ad4310>], 'boxes': [<matplotlib.lines.Line2D object at 0x3ad4e50>], 'caps': [<matplotlib.lines.Line2D object at 0x3ad46d0>, <matplotlib.lines.Line2D object at 0x3ad4a90>]}
>>> plt.grid()
>>> plt.boxplot([1, 2, 3, 4])
{'medians': [<matplotlib.lines.Line2D object at 0x3dfbad0>], 'fliers': [<matplotlib.lines.Line2D object at 0x3dfbe90>, <matplotlib.lines.Line2D object at 0x3dff290>], 'whiskers': [<matplotlib.lines.Line2D object at 0x3df8810>, <matplotlib.lines.Line2D object at 0x3df8b90>], 'boxes': [<matplotlib.lines.Line2D object at 0x3dfb710>], 'caps': [<matplotlib.lines.Line2D object at 0x3df8f50>, <matplotlib.lines.Line2D object at 0x3dfb350>]}
>>> plt.grid()
>>> # See attached image.
...
>>> mlab.prctile([1, 2, 3, 4])
array([ 1.  ,  1.75,  2.5 ,  3.25,  4.  ])

Goyo

> It also seems that the quartile calculations for the box are a
> little peculiar. I have seen some discussion in old mailing list
> postings about mlab.prctile and its ways of calculating percentiles,
> which are different than those of some other software. I'm aware
> that there is legitimate disagreement about the best way to
> calculate the quartiles.
>
> However, it seems to me that mlab's way is still not any of these
> possibly-correct ways, because it uses int() or nparray.astype(int)
> to coerce the percentile result to an integer index. This TRUNCATES
> the floating-point result. No accepted quantile-calculating method
> that I'm aware of does this; they all ROUND instead of truncating
> (if they want to coerce to an integer index at all, in order to
> produce a quantile value that is an element of the data set), or in
> some cases they round uniformly up for the lower quartile and down
> for the upper.
>
> You can see a summary of different methods at
> http://www.amstat.org/publications/jse/v14n3/langford.html ; the
> method used by mlab does not appear to agree with any of these. I
> would suggest that mlab.prctile be fixed to conform to some one or
> other of these methods, rather than adding to the proliferation of
> approaches to quantile-calculation. Is there any motivation for
> always truncating to integer (other than it's quicker to type :-)?
>
> Also, regardless of these quartile issues, there is, as far as I'm
> aware, no one who denies that the median of a (sorted) data set with
> an even number of values is the mean of the middle two values. Since
> numpy is already a dependency for matplotlib, boxplot shouldn't use
> mlab.prctile at all to decide where to plot the median line -- just
> use numpy.median.
>
> Thanks,
> --
> --OKB (not okblacke)
> Brendan Barnwell
> Do not follow where the path may lead. Go, instead, where there is
> no path, and leave a trail.
> --author unknown

attachment: boxplot_sample.png
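For readers without matplotlib 1.0.0 at hand, the five values reported above for mlab.prctile([1, 2, 3, 4]) can be compared with plain linear interpolation between data points. The sketch below uses numpy.percentile as a stand-in. That choice is an assumption made for illustration; it is not the mlab code, and it only checks that the default interpolation in NumPy gives the same numbers for this data set.

```python
import numpy as np

data = [1, 2, 3, 4]

# numpy.percentile interpolates linearly between the two nearest data
# points by default. It is used here only as a stand-in for comparison;
# it is not the mlab.prctile implementation.
quartiles = np.percentile(data, [0, 25, 50, 75, 100])

print(quartiles)
```

If the printed values match the array in the session above, the 50th percentile for this data set is 2.5 rather than 3, which is consistent with the median line being drawn correctly in 1.0.0.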
[Matplotlib-users] Bug in boxplot/mlab.prctile
I noticed that the boxplot function incorrectly calculates the location of the median line in each box. As a simple example, plotting the dataset [1, 2, 3, 4] incorrectly plots the median line at 3.

It also seems that the quartile calculations for the box are a little peculiar. I have seen some discussion in old mailing list postings about mlab.prctile and its ways of calculating percentiles, which are different than those of some other software. I'm aware that there is legitimate disagreement about the best way to calculate the quartiles.

However, it seems to me that mlab's way is still not any of these possibly-correct ways, because it uses int() or nparray.astype(int) to coerce the percentile result to an integer index. This TRUNCATES the floating-point result. No accepted quantile-calculating method that I'm aware of does this; they all ROUND instead of truncating (if they want to coerce to an integer index at all, in order to produce a quantile value that is an element of the data set), or in some cases they round uniformly up for the lower quartile and down for the upper.

You can see a summary of different methods at http://www.amstat.org/publications/jse/v14n3/langford.html ; the method used by mlab does not appear to agree with any of these. I would suggest that mlab.prctile be fixed to conform to some one or other of these methods, rather than adding to the proliferation of approaches to quantile-calculation. Is there any motivation for always truncating to integer (other than it's quicker to type :-)?

Also, regardless of these quartile issues, there is, as far as I'm aware, no one who denies that the median of a (sorted) data set with an even number of values is the mean of the middle two values. Since numpy is already a dependency for matplotlib, boxplot shouldn't use mlab.prctile at all to decide where to plot the median line -- just use numpy.median.

Thanks,
--
--OKB (not okblacke)
Brendan Barnwell
Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail.
--author unknown
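The difference between truncating to an integer index and averaging the two middle values can be sketched as follows. The helper truncated_index_percentile is hypothetical: it shows one way that int() truncation could yield the reported median of 3 for [1, 2, 3, 4], and it is not a copy of the mlab.prctile source. numpy.median is the function proposed in the message above.

```python
import numpy as np


def truncated_index_percentile(values, p):
    """Hypothetical illustration: return the element at int(p/100 * n)."""
    ordered = sorted(values)
    n = len(ordered)
    # int() truncates toward zero; clamp so that p = 100 stays in range.
    index = min(int(p / 100.0 * n), n - 1)
    return ordered[index]


data = [1, 2, 3, 4]

# Truncation selects a single element of the data set (index 2 here).
truncated_median = truncated_index_percentile(data, 50)

# numpy.median averages the two middle values for an even-length input.
true_median = np.median(data)

print(truncated_median, true_median)
```

For an even number of values the truncated index always lands on the upper of the two middle elements, so under this assumed scheme the median line would sit too high whenever those two elements differ.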