Re: [Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram
Oh sure, I'm not suggesting it be impossible to calculate for a single data set. If nothing else, if we had a version that accepted a list of data sets, then you could always pass in a single-element list :-).

On Mar 15, 2018 22:10, "Eric Wieser" wrote:

> That sounds like a reasonable extension - but I think there still exist
> cases where you want to treat the data as one uniform set when computing
> bins (toggling between orthogonal subsets of data), so it isn't really a
> useful replacement.
>
> I suppose this becomes relevant when `density` is passed to the individual
> histogram invocations. Does matplotlib handle that correctly for stacked
> histograms?
>
> On Thu, Mar 15, 2018, 20:14 Nathaniel Smith wrote:
>
>> Instead of an nobs argument, maybe we should have a version that accepts
>> multiple data sets, so that we have the full information and can improve
>> the algorithm over time.
>>
>> On Mar 15, 2018 7:57 PM, "Thomas Caswell" wrote:
>>
>>> Yes I like the name.
>>>
>>> The primary use-case for Matplotlib is that our `hist` method can take
>>> in a list of arrays and produce N histograms in one shot. Currently with
>>> 'auto' we only use the first data set to sort out what the bins should be
>>> and then re-use those for the rest of the data sets. This will let us get
>>> the bins on the merged input, but I take Josef's point that this is not
>>> actually what we want.
>>>
>>> Tom
>>>
>>> On Mon, Mar 12, 2018 at 11:35 PM wrote:
>>>
>>>> On Mon, Mar 12, 2018 at 11:20 PM, Eric Wieser wrote:
>>>>
>>>> >> Given that the bin selections are data driven, transferring them
>>>> >> across datasets might not be so useful.
>>>> >
>>>> > The main application would be to compute bins across the union of all
>>>> > datasets. This is already possible by using `np.histogram` and
>>>> > discarding the first result, but that's super wasteful.
>>>>
>>>> assuming "union" means a combined dataset.
>>>>
>>>> If you stack datasets, then the number of observations will not be
>>>> correct for individual datasets.
>>>>
>>>> In that case an additional keyword like nobs, or whatever name would
>>>> be appropriate for numpy, would be useful, e.g. use the average number
>>>> of observations across datasets.
>>>> Auxiliary statistics like std could then be computed on the total
>>>> dataset (if that makes sense, which would not be the case if the
>>>> variance across datasets is larger than the variance within datasets).
>>>>
>>>> Josef

___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
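The two ideas in this exchange (a version that accepts a list of data sets, and Josef's suggestion of using the average number of observations) can be sketched together. This is only an illustration under stated assumptions: `shared_bin_edges` is a hypothetical helper, not part of NumPy or of the PR, and it swaps in Scott's rule for the bin width rather than reproducing what `'auto'` does.

```python
import numpy as np


def shared_bin_edges(datasets):
    """Hypothetical helper (not a NumPy API): one set of edges for many datasets.

    The spread (std) and range come from the pooled data, but the sample
    size plugged into Scott's rule is the average size per dataset,
    following the "nobs" suggestion in the thread.
    """
    arrays = [np.asarray(d, dtype=float).ravel() for d in datasets]
    pooled = np.concatenate(arrays)
    lo, hi = pooled.min(), pooled.max()
    sigma = pooled.std()
    if sigma == 0.0 or hi == lo:
        # Degenerate input: fall back to a single bin around the value.
        return np.array([lo - 0.5, hi + 0.5])

    nobs = np.mean([a.size for a in arrays])  # average observations per dataset
    # Scott's rule: h = (24 * sqrt(pi) / n) ** (1/3) * sigma
    width = (24.0 * np.pi ** 0.5 / nobs) ** (1.0 / 3.0) * sigma
    n_bins = max(1, int(np.ceil((hi - lo) / width)))
    return np.linspace(lo, hi, n_bins + 1)


rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=400)
b = rng.normal(1.0, 1.0, size=400)

edges = shared_bin_edges([a, b])          # several datasets
edges_single = shared_bin_edges([a])      # the single-element-list case
```

Because the average size is smaller than the pooled size, the bins come out wider than they would if the pooled data were treated as one sample. As Josef notes, pooling the std only makes sense when the variance across datasets is not larger than the variance within them.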
Re: [Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram
That sounds like a reasonable extension - but I think there still exist cases where you want to treat the data as one uniform set when computing bins (toggling between orthogonal subsets of data), so it isn't really a useful replacement.

I suppose this becomes relevant when `density` is passed to the individual histogram invocations. Does matplotlib handle that correctly for stacked histograms?

On Thu, Mar 15, 2018, 20:14 Nathaniel Smith wrote:

> Instead of an nobs argument, maybe we should have a version that accepts
> multiple data sets, so that we have the full information and can improve
> the algorithm over time.
>
> On Mar 15, 2018 7:57 PM, "Thomas Caswell" wrote:
>
>> Yes I like the name.
>>
>> The primary use-case for Matplotlib is that our `hist` method can take in
>> a list of arrays and produce N histograms in one shot. Currently with
>> 'auto' we only use the first data set to sort out what the bins should be
>> and then re-use those for the rest of the data sets. This will let us get
>> the bins on the merged input, but I take Josef's point that this is not
>> actually what we want.
>>
>> Tom
>>
>> On Mon, Mar 12, 2018 at 11:35 PM wrote:
>>
>>> On Mon, Mar 12, 2018 at 11:20 PM, Eric Wieser wrote:
>>>
>>> >> Given that the bin selections are data driven, transferring them
>>> >> across datasets might not be so useful.
>>> >
>>> > The main application would be to compute bins across the union of all
>>> > datasets. This is already possible by using `np.histogram` and
>>> > discarding the first result, but that's super wasteful.
>>>
>>> assuming "union" means a combined dataset.
>>>
>>> If you stack datasets, then the number of observations will not be
>>> correct for individual datasets.
>>>
>>> In that case an additional keyword like nobs, or whatever name would
>>> be appropriate for numpy, would be useful, e.g. use the average number
>>> of observations across datasets.
>>> Auxiliary statistics like std could then be computed on the total
>>> dataset (if that makes sense, which would not be the case if the
>>> variance across datasets is larger than the variance within datasets).
>>>
>>> Josef
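On the NumPy side, the interaction between shared bins and `density` can be checked directly. The sketch below uses made-up sample data and only shows that `np.histogram(..., density=True)` normalizes each data set on its own, so every histogram integrates to 1 over the shared edges regardless of its size. It says nothing about how matplotlib treats stacked histograms.

```python
import numpy as np

rng = np.random.default_rng(1)
small = rng.normal(size=100)       # illustrative data sets of very different size
large = rng.normal(size=10_000)

# One set of edges computed on the combined data, reused for both subsets.
edges = np.histogram_bin_edges(np.concatenate([small, large]), bins="auto")
widths = np.diff(edges)

dens_small, _ = np.histogram(small, bins=edges, density=True)
dens_large, _ = np.histogram(large, bins=edges, density=True)

# Each density integrates to 1 on its own; the 100:1 size ratio is lost.
area_small = float(np.sum(dens_small * widths))
area_large = float(np.sum(dens_large * widths))
```

If the relative sizes of the subsets matter, the raw counts (or counts divided by the combined total) keep that information, whereas per-dataset densities do not.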
Re: [Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram
Yes I like the name.

The primary use-case for Matplotlib is that our `hist` method can take in a list of arrays and produce N histograms in one shot. Currently with 'auto' we only use the first data set to sort out what the bins should be and then re-use those for the rest of the data sets. This will let us get the bins on the merged input, but I take Josef's point that this is not actually what we want.

Tom

On Mon, Mar 12, 2018 at 11:35 PM wrote:

> On Mon, Mar 12, 2018 at 11:20 PM, Eric Wieser wrote:
>
> >> Given that the bin selections are data driven, transferring them across
> >> datasets might not be so useful.
> >
> > The main application would be to compute bins across the union of all
> > datasets. This is already possible by using `np.histogram` and
> > discarding the first result, but that's super wasteful.
>
> assuming "union" means a combined dataset.
>
> If you stack datasets, then the number of observations will not be
> correct for individual datasets.
>
> In that case an additional keyword like nobs, or whatever name would
> be appropriate for numpy, would be useful, e.g. use the average number
> of observations across datasets.
> Auxiliary statistics like std could then be computed on the total
> dataset (if that makes sense, which would not be the case if the
> variance across datasets is larger than the variance within datasets).
>
> Josef
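The difference between taking the bins from the first data set and taking them from the merged input can be seen with plain NumPy. This is a rough sketch with invented sample data, assuming a NumPy release that ships `np.histogram_bin_edges`; it does not call matplotlib and is not a copy of what `hist` does internally.

```python
import numpy as np

rng = np.random.default_rng(2)
datasets = [
    rng.normal(0.0, 1.0, size=200),
    rng.normal(3.0, 0.5, size=50),
    rng.normal(-2.0, 2.0, size=1000),
]

# Bins from the first data set only (the behaviour described above for 'auto').
edges_first = np.histogram_bin_edges(datasets[0], bins="auto")

# Bins from the merged input.
edges_merged = np.histogram_bin_edges(np.concatenate(datasets), bins="auto")

counts_first = [np.histogram(d, bins=edges_first)[0] for d in datasets]
counts_merged = [np.histogram(d, bins=edges_merged)[0] for d in datasets]

# Samples outside the first data set's range are silently dropped.
dropped = [int(d.size - c.sum()) for d, c in zip(datasets, counts_first)]
```

Edges computed this way can be handed to `hist` through its `bins` argument, which accepts an explicit sequence of edges. Josef's caveat still applies: the merged input has more observations than any single data set, so a data-driven rule will tend to pick narrower bins than it would for each data set alone.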
[Numpy-discussion] PR to add a function to calculate histogram edges without calculating the histogram
Hi!

I've created a PR to add a function called "histogram_bin_edges", which will allow a user to calculate the bins used by the histogram for some data without requiring the entire histogram to be calculated.

https://github.com/numpy/numpy/pull/10591#issuecomment-371863472

This function allows one set of bins to be computed and reused across multiple histograms, which gives more easily comparable results than using separate bins for each histogram.

Please let me know if you have any suggestions on how to improve this PR.

Thanks!

- Kirit
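A minimal usage sketch, assuming a NumPy release in which the function is available as `np.histogram_bin_edges` (the sample data and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1000)

# The older route: compute the whole histogram and throw the counts away.
_, edges_via_histogram = np.histogram(data, bins="auto")

# The proposed route: ask for the edges only.
edges = np.histogram_bin_edges(data, bins="auto")

# The edges can then be passed back to np.histogram as explicit bins.
counts, _ = np.histogram(data, bins=edges)
```

Both routes should yield the same edges for the same input and the same `bins` setting; the second simply skips the counting step.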