[matplotlib-devel] is R wrong? (boxplot)

2014-02-15 Thread Yaroslav Halchenko
Dear Matplotlib gurus,

Following the code demonstrating the recent(ish) fix for whiskers in boxplots
(https://github.com/matplotlib/matplotlib/pull/1855), I have compared it against
R's boxplot.  The descriptions seem to correspond, and all the percentiles are
the same in numpy and R (3.0.1), but R's boxplot seems to have an extended IQR
box and still has an upper whisker (corresponding to 9000, which is not within
75%+1.5*IQR) when it shouldn't:
http://nbviewer.ipython.org/url/www.onerussian.com/tmp/boxplot-Python-vs-R.ipynb

Is R's plot incorrect, or am I missing something (e.g. a documented feature
of R's boxplot) warranting such a difference?
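
For reference, the whisker rule in question can be sketched in a few lines of
numpy. The sample below is made up for illustration (it is not the data from
the notebook), and `tukey_whiskers` is a hypothetical helper name, not a
matplotlib or R function. It only shows the literal reading of the rule: the
whiskers end at the most extreme data points within 1.5*IQR of the quartiles.

```python
import numpy as np


def tukey_whiskers(data, whis=1.5):
    """Hypothetical helper: whisker ends under the literal whis*IQR rule."""
    data = np.asarray(data, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lo_fence = q1 - whis * iqr
    hi_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme data points still inside the fences.
    inside = data[(data >= lo_fence) & (data <= hi_fence)]
    fliers = data[(data < lo_fence) | (data > hi_fence)]
    return {
        "q1": q1,
        "q3": q3,
        "hi_fence": hi_fence,
        "whislo": inside.min(),
        "whishi": inside.max(),
        "fliers": fliers,
    }


# Made-up sample with one extreme value (assumption, for illustration only).
sample = [10, 20, 30, 40, 50, 60, 70, 80, 9000]
result = tukey_whiskers(sample)
print(result)
```

With this sample the upper fence is 130, so the upper whisker ends at 80 and
9000 is reported as a flier rather than as a whisker end.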

Thanks in advance
-- 
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834   Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik

--
Android apps run on BlackBerry 10
Introducing the new BlackBerry 10.2.1 Runtime for Android apps.
Now with support for Jelly Bean, Bluetooth, Mapview and more.
Get your Android app in front of a whole new audience.  Start now.
http://pubads.g.doubleclick.net/gampad/clk?id=124407151&iu=/4140/ostg.clktrk
___
Matplotlib-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/matplotlib-devel


Re: [matplotlib-devel] is R wrong? (boxplot)

2014-02-15 Thread Paul Hobson
Hey Yaroslav,

As the author of the fix and the recent overhaul to boxplots, I can say
with certainty that R is wrong! ;-)

More seriously, the main thing that I take away from Tukey's paper about
boxplots is that there are many valid ways to draw them. I personally set
up the new boxplot functionality to take the most basic boxplot definition
very literally. My guess is that R is fudging those rules a bit for the
purpose of completeness, or aesthetics, or ...(?)

Perhaps one can look at the purpose of boxplots in two different ways:
1) Matplotlib: show some of the data and some basic stats
2) R (I'm guessing): show how the data are /probably/ distributed.

Obviously, I prefer #1. But I'm not going to say that #2 is wrong just yet.






Re: [matplotlib-devel] is R wrong? (boxplot)

2014-02-15 Thread Yaroslav Halchenko
Hi Paul,

On Sat, 15 Feb 2014, Paul Hobson wrote:
>As the author of the fix and the recent overhaul to boxplots

Thanks for that!

>  I can say with certainty that R is wrong! ;-)

phew -- thanks ;)

>More seriously, the main thing that I take away from Tukey's paper about
>boxplots, is that there are many valid ways to draw them. I personally set
>up the new boxplot functionality to take the most basic boxplot definition
>very literally. My guess is that R is fudging those rules a bit for the
>purpose of completeness, or aesthetics, or ...(?)

Well -- I was trying to figure out the reason for the divergence from R's
boxplot help, but so far that help seems to match the description/definition
of the boxplot as in matplotlib.  I guess the next step would be to look
"inside" (running apt-get source r-base now ;-) )

>Perhaps one can look at the purpose of boxplots in two different fashions:
>1) Matplotlib: show some of the data and some basic stats
>2) R (I'm guessing): show how the data are /probably/ distributed.
>Obviously, I prefer #1. But I'm not going to say that #2 is wrong just
>yet.

Would you maybe be interested in adopting (or just implementing independently)
an option to, e.g., plot the data points?  I once shared this one
http://nbviewer.ipython.org/url/www.onerussian.com/tmp/run_plots.ipynb
and the actual code https://gist.github.com/yarikoptic/9023331

I just never got around to formalizing it into an mpl pull request :-/
-- 
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834   Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik



Re: [matplotlib-devel] is R wrong? (boxplot)

2014-02-15 Thread Paul Hobson
Yaroslav,

Those figures look great. Seaborn has some similar functionality (scroll
down a bit):
http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb#Comparing-distributions:-boxplot-and-violinplot

The main point of the most recent overhaul of boxplots was to allow users
to do just what you describe. The methods plt.boxplot and ax.boxplot now do
very little on their own. Input data are passed to
matplotlib.cbook.boxplot_stats, which returns a list of
dictionaries of statistics, and then ax.bxp actually does the drawing. All
of this is to say that you can write your own function to modify
boxplot_stats' output or independently generate the list of dictionaries
expected by ax.bxp.

The keys of those dictionaries can include:
 - label  -> tick label for the boxplot
 - mean -> mean value (can plot as a line or point)
 - med -> median (50th percentile)
 - q1 -> first quartile (25th pctl)
 - q3 -> third quartile (75th pctl)
 - cilo -> lower notch around the median
 - cihi -> upper notch around the median
 - whislo -> end of the lower whisker
 - whishi -> end of the upper whisker
 - fliers -> outliers

Basically, you can set the appropriate values to whatever you want to draw
boxplots however you wish (like open/close diagrams for pandas).
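
As a rough sketch of the "generate the list of dictionaries yourself" route,
the snippet below builds the statistics by hand and passes them to ax.bxp.
The helper name `percentile_stats`, the 5th/95th percentile whiskers, and the
random samples are all assumptions made for illustration; the dictionary keys
are the ones ax.bxp reads (note `med` for the median).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def percentile_stats(values, label, lo=5, hi=95):
    """Hypothetical helper: one stats dict with percentile-based whiskers."""
    values = np.asarray(values, dtype=float)
    p_lo, q1, med, q3, p_hi = np.percentile(values, [lo, 25, 50, 75, hi])
    return {
        "label": label,          # tick label
        "mean": values.mean(),   # drawn only when showmeans=True
        "med": med,
        "q1": q1,
        "q3": q3,
        "whislo": p_lo,          # whisker ends placed at the percentiles
        "whishi": p_hi,
        "fliers": values[(values < p_lo) | (values > p_hi)],
    }


# Made-up samples (assumption, for illustration only).
rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(0, 1, 200),
    "lognormal": rng.lognormal(0, 1, 200),
}
stats = [percentile_stats(v, name) for name, v in samples.items()]

fig, ax = plt.subplots()
artists = ax.bxp(stats, showmeans=True)  # bxp only draws; it computes nothing
```

Because ax.bxp draws whatever it is given, any alternative whisker or notch
definition can be swapped in by changing the values in the dictionaries.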

Also, the `whis` kwarg accepted by boxplot and cbook.boxplot_stats can
be either a float (1.5 by default), a list of two integer percentiles (like
[5, 95]), or one of the strings 'range', 'limits', or 'min/max', each of
which will extend the whiskers over all of the data.
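
A small sketch of how `whis` changes the computed statistics, using
cbook.boxplot_stats directly. The sample is made up for illustration. The
string options are left out here on purpose, since their availability may
depend on the matplotlib version; a (0, 100) percentile pair is used instead
to span all of the data.

```python
import numpy as np
from matplotlib import cbook

# Made-up sample with one extreme value (assumption, for illustration only).
sample = np.array([10, 20, 30, 40, 50, 60, 70, 80, 9000], dtype=float)

tukey = cbook.boxplot_stats(sample)[0]                  # default whis=1.5
pct = cbook.boxplot_stats(sample, whis=(5, 95))[0]      # percentile whiskers
full = cbook.boxplot_stats(sample, whis=(0, 100))[0]    # span all of the data

for name, s in [("1.5*IQR", tukey), ("5/95", pct), ("0/100", full)]:
    print(name, s["whislo"], s["whishi"], s["fliers"])
```

With the default setting the extreme value ends up in `fliers`; with the
(0, 100) pair the upper whisker reaches it and no fliers remain.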

Since you're running off of master, you should have access to this new
functionality.

Here's a link to the PR that overhauled ax.boxplot and created ax.bxp:
https://github.com/matplotlib/matplotlib/pull/2643

Looking at it now -- it looks like cbook.boxplot_stats' docstring got
cut off. I'll pull together a PR to fix that soon.

Feel free to hit me up with any other questions!

-paul




Re: [matplotlib-devel] is R wrong? (boxplot)

2014-02-15 Thread Yaroslav Halchenko

On Sat, 15 Feb 2014, Paul Hobson wrote:
>Those figures look great. Seaborn has some similar functionality (scroll
>down a bit):
>
>http://nbviewer.ipython.org/github/mwaskom/seaborn/blob/master/examples/plotting_distributions.ipynb#Comparing-distributions:-boxplot-and-violinplot

Right -- seaborn looks really nice, and I have yet to take advantage of it.

BUT that is why we are talking here, on the matplotlib list:  seaborn (and
a few others), while aiming to provide high-level convenience specific to,
e.g., using pandas as the core data structures, adds improvements which
could easily go into stock matplotlib and thus benefit all of the users.
That is why I thought that improving boxplot itself could be of
more general benefit, while allowing all the dependent projects to take
advantage of it without requiring unnecessary fragmentation (e.g. "use
seaborn for paired plots", which could easily go straight into stock
boxplot operating on arrays).

Even violin plots could probably be done in matplotlib with
some basic density estimator (with a parameter for a custom one) as an
option within the boxplot function itself.

>The main point of the most recent overhaul of boxplots was to allow users
>to do just what you describe. The methods plt.boxplot and ax.boxplot now do
>very little on their own. Input data are passed to
>matplotlib.cbook.boxplot_stats, which returns a list of
>dictionaries of statistics, and then ax.bxp actually does the drawing. All
>of this is to say that you can write your own function to modify
>boxplot_stats' output or independently generate the list of dictionaries
>expected by ax.bxp.
>
>The keys of those dictionaries can include:
> - label  -> tick label for the boxplot
> - mean -> mean value (can plot as a line or point)
> - med -> median (50th percentile)
> - q1 -> first quartile (25th pctl)
> - q3 -> third quartile (75th pctl)
> - cilo -> lower notch around the median
> - cihi -> upper notch around the median
> - whislo -> end of the lower whisker
> - whishi -> end of the upper whisker
> - fliers -> outliers
>
>Basically, you can set the appropriate values to whatever you want to draw
>boxplots however you wish (like open/close diagrams for pandas).
>
>Also, the `whis` kwarg accepted by boxplot and cbook.boxplot_stats can
>be either a float (1.5 by default), a list of two integer percentiles (like
>[5, 95]), or one of the strings 'range', 'limits', or 'min/max', each of
>which will extend the whiskers over all of the data.
>
>Since you're running off of master, you should have access to this new
>functionality.

;-) Usually I run off the releases, and even more often from releases in
Debian stable.  But yes -- I have master, and this new functionality
looks neat -- thanks again.  But a few enhancements, such as

- plotting the actual data points with jitter
- plotting pairing lines across boxplots

seem not to be there, and I would consider them worthwhile enhancements.
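
Both items can be approximated in user code on top of the stock boxplot. The
sketch below is not the code from the gist above; the paired "before"/"after"
samples, the jitter width, and the styling are assumptions made for
illustration. The jitter comes from a seeded generator so that the figure is
reproducible.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up paired measurements (assumption, for illustration only).
rng = np.random.default_rng(42)
before = rng.normal(10, 2, 20)
after = before + rng.normal(1, 1, 20)

fig, ax = plt.subplots()
ax.boxplot([before, after], positions=[1, 2], widths=0.4, showfliers=False)

# Same horizontal jitter for both members of a pair keeps the lines tidy.
jitter = rng.uniform(-0.08, 0.08, size=before.size)
xs = np.vstack([1 + jitter, 2 + jitter])   # shape (2, n_pairs)
ys = np.vstack([before, after])            # shape (2, n_pairs)

# Each column of xs/ys becomes one line connecting a pair across the boxes.
pair_lines = ax.plot(xs, ys, color="0.7", linewidth=0.8, zorder=1)
points = ax.scatter(xs.ravel(), ys.ravel(), s=12, color="k", zorder=2)

ax.set_xticks([1, 2])
ax.set_xticklabels(["before", "after"])
```

Drawing the points at the same jittered x positions that the lines use means
every pairing line starts and ends exactly on its two data points.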

>Feel free to hit me up with any other questions!

Sorry that I hit you with something that is not really a question above ;-)
-- 
Yaroslav O. Halchenko, Ph.D.
http://neuro.debian.net http://www.pymvpa.org http://www.fail2ban.org
Senior Research Associate, Psychological and Brain Sciences Dept.
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834   Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik



Re: [matplotlib-devel] is R wrong? (boxplot)

2014-02-15 Thread Thomas A Caswell
As a side note, adding jitter has been discussed before
(https://github.com/matplotlib/matplotlib/issues/2750) in a slightly
different context and the consensus was to _not_ add it to mpl (as it
is a non-deterministic data transformation).

Tom




-- 
Thomas A Caswell
PhD Candidate University of Chicago
Nagel and Gardel labs
[email protected]
jfi.uchicago.edu/~tcaswell
o: 773.702.7204
