[matplotlib-devel] Sample data: a proposal

2010-09-12 Thread Jouni K. Seppänen
A while ago there was a discussion [1] about how using the
get_sample_data function in building the documentation is a problem for
Debian packagers. Let me see if I understand the goals of
get_sample_data correctly:

* we want to enable users to run examples they find in the gallery
  without downloading extra files;

* we don't want to package all the sample data with matplotlib, either
  because it is too large, or because it changes more often than we
  release new versions.

The current sample data takes about 2.5 megabytes uncompressed, so the
size doesn't look like a real problem, but of course it is desirable
that new examples are usable with old versions unless they need new
features.

The problem that the Debian packagers have with the current system is 
(I suppose) that building the documentation requires network access and 
is not guaranteed to be repeatable.

Here's what I suggest:

1. Package the sample data in a separate zip file that users can
   download and expand in e.g. ~/.matplotlib/sample_data if they like.
   This file could be released more often than matplotlib, if needed.
   Debian can use this as one source file and package it as a separate
   deb file.

2. Make get_sample_data look first in the place where the zip file could
   have been expanded, and only if the required file is not found, try
   to obtain it from the web. Add an option to disable the network
   access. This is different from what we do now, because now
   get_sample_data always tries to check if there is a newer version
   available, which apparently doesn't work reliably on unconnected
   computers.

3. To make this work, agree that sample data files are immutable: if a
   new version is needed, it needs to have a new name (and thus the
   examples using it need to be updated). The files have not been
   changed a lot [2], so I don't think this is very much of a burden.
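
A minimal sketch of the lookup order in points 2 and 3, not matplotlib's
actual implementation. The function name lookup_sample_data, the
allow_network flag, the urlopen parameter, and the base URL are all
hypothetical and used only for illustration:

```python
import os
import urllib.request

# Hypothetical locations: the directory comes from point 1, the URL is a
# placeholder and not a real matplotlib endpoint.
SAMPLE_DIR = os.path.expanduser("~/.matplotlib/sample_data")
BASE_URL = "http://example.com/sample-data/"


def lookup_sample_data(fname, sample_dir=SAMPLE_DIR, allow_network=True,
                       urlopen=urllib.request.urlopen):
    """Return the local path of a sample file, preferring an existing copy."""
    path = os.path.join(sample_dir, fname)
    if os.path.exists(path):
        # Under point 3 files are immutable, so a local copy is never stale
        # and no "is there a newer version?" request is needed.
        return path
    if not allow_network:
        raise IOError("%s not found in %s and network access is disabled"
                      % (fname, sample_dir))
    # Fall back to the web and keep the file for the next call.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with urlopen(BASE_URL + fname) as response:
        data = response.read()
    with open(path, "wb") as f:
        f.write(data)
    return path
```

With allow_network=False a documentation build either finds every file
in the unpacked archive or fails immediately, which is the repeatability
the packagers are asking for.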

What do you think?

Jouni


[1] http://thread.gmane.org/gmane.comp.python.matplotlib.devel/8865
[2] Here is a summary of the changes to each file in sample_data:

=== ./aapl.csv ===

r7379 | jdh2358 | 2009-08-05 18:57:31 +0300 (Wed, 05 Aug 2009)

r6202 | jdh2358 | 2008-10-15 15:43:41 +0300 (Wed, 15 Oct 2008)

r4975 | jdh2358 | 2008-02-16 22:58:37 +0200 (Sat, 16 Feb 2008)

=== ./AAPL.dat ===

r7388 | jdh2358 | 2009-08-05 20:16:50 +0300 (Wed, 05 Aug 2009)

=== ./aapl.npy ===

r7377 | jdh2358 | 2009-08-05 18:52:29 +0300 (Wed, 05 Aug 2009)

r6203 | jdh2358 | 2008-10-15 18:39:44 +0300 (Wed, 15 Oct 2008)

=== ./axes_grid/bivariate_normal.npy ===

r7436 | leejjoon | 2009-08-09 07:34:08 +0300 (Sun, 09 Aug 2009)

=== ./ct.raw ===

r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)

r177 | jdh2358 | 2004-03-13 01:00:12 +0200 (Sat, 13 Mar 2004)

=== ./data_x_x2_x3.csv ===

r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)

r7078 | efiring | 2009-05-03 03:09:06 +0300 (Sun, 03 May 2009)

=== ./demodata.csv ===

r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)

r5100 | jdh2358 | 2008-04-30 22:53:10 +0300 (Wed, 30 Apr 2008)

=== ./eeg.dat ===

r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)

r52 | jdh2358 | 2003-11-02 23:23:21 +0200 (Sun, 02 Nov 2003)

=== ./embedding_in_wx3.xrc ===

r7382 | jdh2358 | 

Re: [matplotlib-devel] Sample data: a proposal

2010-09-12 Thread Andrew Straw
On 09/12/2010 07:10 AM, Jouni K. Seppänen wrote:
> A while ago there was a discussion [1] about how using the
> get_sample_data function in building the documentation is a problem for
> Debian packagers. Let me see if I understand the goals of
> get_sample_data correctly:
>
> * we want to enable users to run examples they find in the gallery
>   without downloading extra files;
>
> * we don't want to package all the sample data with matplotlib, either
>   because it is too large, or because it changes more often than we
>   release new versions.


* Also, we want the sample data to live outside the version control
repository of MPL proper, so that downloading the MPL source code does
not also pull in the sample data. (This is one of the sticking points
for a move to git.)

> Here's what I suggest:
>
> 1. Package the sample data in a separate zip file that users can
>    download and expand in e.g. ~/.matplotlib/sample_data if they like.
>    This file could be released more often than matplotlib, if needed.
>    Debian can use this as one source file and package it as a separate
>    deb file.
>
> 2. Make get_sample_data look first in the place where the zip file could
>    have been expanded, and only if the required file is not found, try
>    to obtain it from the web. Add an option to disable the network
>    access. This is different from what we do now, because now
>    get_sample_data always tries to check if there is a newer version
>    available, which apparently doesn't work reliably on unconnected
>    computers.
>
> 3. To make this work, agree that sample data files are immutable: if a
>    new version is needed, it needs to have a new name (and thus the
>    examples using it need to be updated). The files have not been
>    changed a lot [2], so I don't think this is very much of a burden.
>
> What do you think?



#1 and #2 seem reasonable to me.

I don't like #3 -- for the same reasons as we want to separate the rest
of the sample data (smaller download, smaller repository, and separation
of code and non-essential data), I think the test comparison images
should be with the sample data. Having to deal with renames in the tests
would be annoying. Two alternative ideas to handle the versioning
issue: A) Add a .py file in the main source repository which is a list of
sample data filenames and checksums. If a sample data file doesn't
exist, or its checksum is wrong, it can be downloaded. B) The source
file could simply state the data version number it requires, and the
sample data itself could be versioned.
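
Idea A could look roughly like the following sketch. The helper names
(file_checksum, needs_download), the choice of SHA-1, and the manifest
being a plain dict of filename to hex digest are assumptions made for
illustration; the thread does not specify any of them:

```python
import hashlib
import os


def file_checksum(path):
    """Return the SHA-1 hex digest of the file at path."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large sample files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_download(fname, sample_dir, checksums):
    """True if fname is missing locally or differs from the manifest.

    checksums is the hypothetical manifest: a dict mapping each sample
    data filename to its expected hex digest, as it might be stored in a
    .py file in the main source repository.
    """
    path = os.path.join(sample_dir, fname)
    if not os.path.exists(path):
        return True
    return file_checksum(path) != checksums[fname]
```

Only files for which needs_download returns True would trigger network
access, so an up-to-date local copy never touches the web.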

--
Start uncovering the many advantages of virtual appliances
and start using them to simplify application deployment and
accelerate your shift to cloud computing
http://p.sf.net/sfu/novell-sfdev2dev
___
Matplotlib-devel mailing list
Matplotlib-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/matplotlib-devel


Re: [matplotlib-devel] Sample data: a proposal

2010-09-12 Thread John Hunter
On Sun, Sep 12, 2010 at 10:30 AM, Andrew Straw straw...@astraw.com wrote:
> #1 and #2 seem reasonable to me.
>
> I don't like #3 -- for the same reasons as we want to separate the rest

I agree with Andrew here -- we don't want to hamstring our ability to
change the data just because some people would rather take a version
in place of the latest version.  If we have an rc option

  sampledata.fetch : False

then the sampledata function would only look in the sample data dir,
get the file if available, raise otherwise.  If fetch is True, it
would always go to the web first and check for the latest, get it and
cache it.  Then the packagers could download the tarball, unpack it,
and not worry about mpl trying to check for a more recent version.
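
A rough sketch of the two modes described above. The function signature
and the download callable are hypothetical stand-ins, and for brevity the
fetch=True branch always downloads rather than first checking whether
the cached copy is already the latest:

```python
import os


def sampledata(fname, sample_dir, fetch, download):
    """Illustrate the behaviour of a `sampledata.fetch` style option.

    download(fname, dest) is a hypothetical callable that retrieves the
    latest version of fname from the web and writes it to dest.
    """
    path = os.path.join(sample_dir, fname)
    if fetch:
        # fetch : True -- go to the web first, then cache the result.
        download(fname, path)
        return path
    # fetch : False -- only look in the sample data dir.
    if os.path.exists(path):
        return path
    raise IOError("sample data file %r not found in %s" % (fname, sample_dir))
```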

JDH



Re: [matplotlib-devel] Sample data: a proposal

2010-09-12 Thread Jouni K. Seppänen
Andrew Straw straw...@astraw.com writes:

>> 3. To make this work, agree that sample data files are immutable: if a
>>    new version is needed, it needs to have a new name (and thus the
>>    examples using it need to be updated). The files have not been
>>    changed a lot [2], so I don't think this is very much of a burden.
>
> I don't like #3 -- for the same reasons as we want to separate the rest
> of the sample data (smaller download, smaller repository, and separation
> of code and non-essential data), I think the test comparison images
> should be with the sample data. Having to deal with renames in the tests
> would be annoying.

If the test data is moved there, I agree that renaming won't work. 

But it seems to me that test data is different from sample data used by
examples: when running the tests for a given revision of matplotlib, you
don't want the absolute latest comparison images but the images that
correspond to that particular code revision. You also typically want to
get all of the comparison images for that revision at the same time,
since you're likely to be running the whole test suite. Also, if you are
running the test suite, I think we can assume you can get a checkout of
the test-data repository.

(A git submodule would seem to be a good fit: the main repository would
have a pointer to the appropriate revision of the test-images
repository, and people interested in running the test suite would have
to run git submodule update to check it out.)

> Two alternative ideas to handle the versioning
> issue: A) Add a .py file in the main source repository which is a list of
> sample data filenames and checksums. If a sample data file doesn't
> exist, or its checksum is wrong, it can be downloaded.

Sounds complicated, and makes older versions unable to run newer
examples.

> B) The source file could simply have the same data version number
> required and the sample data itself could be versioned.

That might work. If I understand this correctly, the example code would
call get_sample_data("foo.dat") to get the latest revision or
get_sample_data("foo.dat", 1234) to get a specific one. These would
retrieve URLs like

http://example.com/sample-data/raw/master/foo.dat
http://example.com/sample-data/raw/1234/foo.dat
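
The mapping from arguments to URLs could be sketched as below. The helper
name sample_data_url is hypothetical, and the base URL is the same
placeholder used in the two example URLs:

```python
# Placeholder base URL, taken from the example URLs in this message.
BASE_URL = "http://example.com/sample-data/raw"


def sample_data_url(fname, revision=None):
    """Build the URL for fname at a given revision (latest if None)."""
    # "master" stands for the latest revision, as in the first example URL.
    rev = "master" if revision is None else str(revision)
    return "%s/%s/%s" % (BASE_URL, rev, fname)
```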

-- 
Jouni K. Seppänen
http://www.iki.fi/jks

