Re: [Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram

2018-03-16 Thread Nathaniel Smith
Oh sure, I'm not suggesting it be impossible to calculate for a single data
set. If nothing else, if we had a version that accepted a list of data
sets, then you could always pass in a single-element list :-).
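
A minimal sketch of what such a list-accepting version could look like. `bin_edges_for_datasets` is a hypothetical helper, not a NumPy API, and it assumes the function from the PR is available as `np.histogram_bin_edges`. It simply concatenates the inputs, which is the most naive strategy; the point of taking a list is that a smarter rule could later use the per-dataset sizes.

```python
import numpy as np


def bin_edges_for_datasets(datasets, bins="auto"):
    """Hypothetical helper (not part of NumPy): one set of edges for many datasets."""
    arrays = [np.ravel(np.asarray(d)) for d in datasets]
    if not arrays:
        raise ValueError("need at least one dataset")
    # Naive strategy: treat all observations as one combined sample.
    return np.histogram_bin_edges(np.concatenate(arrays), bins=bins)


x = np.array([1.0, 2.0, 2.5, 4.0, 7.0])

# A single dataset is just the one-element-list case.
single = bin_edges_for_datasets([x], bins=4)
```

With a one-element list the helper gives the same edges as calling `np.histogram_bin_edges(x, bins=4)` directly.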

On Mar 15, 2018 22:10, "Eric Wieser"  wrote:

> That sounds like a reasonable extension - but I think there still exist
> cases where you want to treat the data as one uniform set when computing
> bins (toggling between orthogonal subsets of data), so it isn't really a
> useful replacement.
>
> I suppose this becomes relevant when `density` is passed to the individual
> histogram invocations. Does matplotlib handle that correctly for stacked
> histograms?
>
> On Thu, Mar 15, 2018, 20:14 Nathaniel Smith  wrote:
>
>> Instead of an nobs argument, maybe we should have a version that accepts
>> multiple data sets, so that we have the full information and can improve
>> the algorithm over time.
>>
>> On Mar 15, 2018 7:57 PM, "Thomas Caswell"  wrote:
>>
>>> Yes I like the name.
>>>
>>> The primary use-case for Matplotlib is that our `hist` method can take
>>> in a list of arrays and produce N histograms in one shot. Currently with
>>> 'auto' we only use the first data set to sort out what the bins should be
>>> and then re-use those for the rest of the data sets.  This will let us get
>>> the bins on the merged input, but I take Josef's point that this is not
>>> actually what we want.
>>>
>>> Tom
>>>
>>> On Mon, Mar 12, 2018 at 11:35 PM  wrote:
>>>
>>>> On Mon, Mar 12, 2018 at 11:20 PM, Eric Wieser
>>>>  wrote:
>>>> >> Given that the bin selection is data driven, transferring them
>>>> >> across datasets might not be so useful.
>>>> >
>>>> > The main application would be to compute bins across the union of all
>>>> > datasets. This is already possible by using `np.histogram` and
>>>> > discarding the first result, but that's super wasteful.
>>>>
>>>> assuming "union" means a combined dataset.
>>>>
>>>> If you stack datasets, then the number of observations will not be
>>>> correct for individual datasets.
>>>>
>>>> In that case an additional keyword like nobs, or whatever name would
>>>> be appropriate for numpy, would be useful, e.g. use the average number
>>>> of observations across datasets.
>>>> Auxiliary statistics like std could then be computed on the total
>>>> dataset (if that makes sense, which would not be the case if the
>>>> variance across datasets is larger than the variance within datasets).
>>>>
>>>> Josef

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram

2018-03-15 Thread Eric Wieser
That sounds like a reasonable extension - but I think there still exist
cases where you want to treat the data as one uniform set when computing
bins (toggling between orthogonal subsets of data), so it isn't really a
useful replacement.

I suppose this becomes relevant when `density` is passed to the individual
histogram invocations. Does matplotlib handle that correctly for stacked
histograms?
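
To show why `density` matters here, a small sketch with plain NumPy. It says nothing about what matplotlib actually does; the sample data are made up, and `np.histogram_bin_edges` is assumed to be the function from the PR. Normalising each dataset separately gives every histogram unit area, so stacking them loses the relative sizes of the datasets, whereas normalising the raw counts by the total number of observations keeps them.

```python
import numpy as np

rng = np.random.default_rng(1)
small = rng.normal(size=100)   # made-up data: 100 observations
large = rng.normal(size=900)   # made-up data: 900 observations

# Shared edges computed on the combined data.
edges = np.histogram_bin_edges(np.concatenate([small, large]), bins=20)
widths = np.diff(edges)

# density=True normalises each dataset on its own: each integrates to 1.
d_small, _ = np.histogram(small, bins=edges, density=True)
d_large, _ = np.histogram(large, bins=edges, density=True)
area_small = np.sum(d_small * widths)
area_large = np.sum(d_large * widths)

# Stacking the per-dataset densities therefore integrates to 2, not 1.
area_stacked_naive = np.sum((d_small + d_large) * widths)

# Alternative: stack raw counts and normalise once by the total count.
c_small, _ = np.histogram(small, bins=edges)
c_large, _ = np.histogram(large, bins=edges)
total = c_small.sum() + c_large.sum()
stacked_density = (c_small + c_large) / (total * widths)
area_stacked = np.sum(stacked_density * widths)
```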

On Thu, Mar 15, 2018, 20:14 Nathaniel Smith  wrote:

> Instead of an nobs argument, maybe we should have a version that accepts
> multiple data sets, so that we have the full information and can improve
> the algorithm over time.
>
> On Mar 15, 2018 7:57 PM, "Thomas Caswell"  wrote:
>
>> Yes I like the name.
>>
>> The primary use-case for Matplotlib is that our `hist` method can take in
>> a list of arrays and produce N histograms in one shot. Currently with
>> 'auto' we only use the first data set to sort out what the bins should be
>> and then re-use those for the rest of the data sets.  This will let us get
>> the bins on the merged input, but I take Josef's point that this is not
>> actually what we want.
>>
>> Tom
>>
>> On Mon, Mar 12, 2018 at 11:35 PM  wrote:
>>
>>> On Mon, Mar 12, 2018 at 11:20 PM, Eric Wieser
>>>  wrote:
>>> >> Given that the bin selection is data driven, transferring them
>>> >> across datasets might not be so useful.
>>> >
>>> > The main application would be to compute bins across the union of all
>>> > datasets. This is already possible by using `np.histogram` and
>>> > discarding the first result, but that's super wasteful.
>>>
>>> assuming "union" means a combined dataset.
>>>
>>> If you stack datasets, then the number of observations will not be
>>> correct for individual datasets.
>>>
>>> In that case an additional keyword like nobs, or whatever name would
>>> be appropriate for numpy, would be useful, e.g. use the average number
>>> of observations across datasets.
>>> Auxiliary statistics like std could then be computed on the total
>>> dataset (if that makes sense, which would not be the case if the
>>> variance across datasets is larger than the variance within datasets).
>>>
>>> Josef
>>>


Re: [Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram

2018-03-15 Thread Thomas Caswell
Yes I like the name.

The primary use-case for Matplotlib is that our `hist` method can take in a
list of arrays and produce N histograms in one shot. Currently with 'auto'
we only use the first data set to sort out what the bins should be and then
re-use those for the rest of the data sets.  This will let us get the bins
on the merged input, but I take Josef's point that this is not actually
what we want.
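
A rough sketch of the options discussed above, using plain NumPy rather than matplotlib's `hist`. The datasets are made up, `np.histogram_bin_edges` is assumed to be the function from the PR, and the 'sturges' estimator is used only because its bin count depends on nothing but the number of observations, which makes Josef's point easy to see. The last variant hand-computes the Sturges bin count from the average dataset size, as one reading of the `nobs` suggestion; it is not an existing NumPy keyword.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical datasets of different sizes (200, 1000 and 1800 samples).
datasets = [rng.normal(loc, 1.0, size=n)
            for loc, n in [(0.0, 200), (1.0, 1000), (2.0, 1800)]]
merged = np.concatenate(datasets)

# Option 1: bins chosen from the first dataset only.
edges_first = np.histogram_bin_edges(datasets[0], bins="sturges")

# Option 2: bins chosen from the merged input; the estimator now sees
# 3000 observations, more than any single dataset has.
edges_merged = np.histogram_bin_edges(merged, bins="sturges")

# Option 3 (assumption): range from the merged data, bin count from the
# average number of observations, via Sturges' rule ceil(log2(n) + 1).
n_avg = np.mean([len(d) for d in datasets])
n_bins_avg = int(np.ceil(np.log2(n_avg) + 1))
edges_avg = np.histogram_bin_edges(merged, bins=n_bins_avg)

print(len(edges_first) - 1, len(edges_merged) - 1, len(edges_avg) - 1)
```

Note that `edges_first` only spans the range of the first dataset, so values from the other datasets that fall outside it would be ignored by `np.histogram(..., bins=edges_first)`.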

Tom

On Mon, Mar 12, 2018 at 11:35 PM  wrote:

> On Mon, Mar 12, 2018 at 11:20 PM, Eric Wieser
>  wrote:
> >> Given that the bin selection is data driven, transferring them across
> >> datasets might not be so useful.
> >
> > The main application would be to compute bins across the union of all
> > datasets. This is already possible by using `np.histogram` and
> > discarding the first result, but that's super wasteful.
>
> assuming "union" means a combined dataset.
>
> If you stack datasets, then the number of observations will not be
> correct for individual datasets.
>
> In that case an additional keyword like nobs, or whatever name would
> be appropriate for numpy, would be useful, e.g. use the average number
> of observations across datasets.
> Auxiliary statistics like std could then be computed on the total
> dataset (if that makes sense, which would not be the case if the
> variance across datasets is larger than the variance within datasets).
>
> Josef
>


[Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram

2018-03-09 Thread Kirit Thadaka
Hi!

I've created a PR to add a function called "histogram_bin_edges" which will
allow a user to calculate the bins used by the histogram for some data
without requiring the entire histogram to be calculated.

https://github.com/numpy/numpy/pull/10591#issuecomment-371863472

This function allows one set of bins to be computed and reused across
multiple histograms, which gives more easily comparable results than using
separate bins for each histogram.
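
A minimal sketch of that reuse, assuming the function is exposed as `np.histogram_bin_edges` and accepts the same `bins` values as `np.histogram`. The two arrays are made-up sample data.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=500)   # made-up sample data
b = rng.normal(0.5, 1.5, size=300)   # made-up sample data

# Compute one set of edges from the combined data, without building a histogram.
edges = np.histogram_bin_edges(np.concatenate([a, b]), bins="auto")

# Reuse the same edges for each dataset so the counts are directly comparable.
counts_a, edges_a = np.histogram(a, bins=edges)
counts_b, edges_b = np.histogram(b, bins=edges)
```

Because both histograms share `edges`, `counts_a[i]` and `counts_b[i]` always refer to the same interval.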

Please let me know if you have any suggestions on how to improve this PR.

Thanks!

-
Kirit