[matplotlib-devel] Sample data: a proposal
A while ago there was a discussion [1] about how using the get_sample_data function in building the documentation is a problem for Debian packagers. Let me see if I understand the goals of get_sample_data correctly:

* we want to enable users to run examples they find in the gallery without downloading extra files;
* we don't want to package all the sample data with matplotlib, either because it is too large, or because it changes more often than we release new versions.

The current sample data takes about 2.5 megabytes uncompressed, so the size doesn't look like a real problem, but of course it is desirable that new examples are usable with old versions unless they need new features.

The problem that the Debian packagers have with the current system is (I suppose) that building the documentation requires network access and is not guaranteed to be repeatable.

Here's what I suggest:

1. Package the sample data in a separate zip file that users can download and expand in e.g. ~/.matplotlib/sample_data if they like. This file could be released more often than matplotlib, if needed. Debian can use this as one source file and package it as a separate deb file.

2. Make get_sample_data look first in the place where the zip file could have been expanded, and only if the required file is not found, try to obtain it from the web. Add an option to disable the network access. This is different from what we do now, because now get_sample_data always tries to check if there is a newer version available, which apparently doesn't work reliably on unconnected computers.

3. To make this work, agree that sample data files are immutable: if a new version is needed, it needs to have a new name (and thus the examples using it need to be updated). The files have not been changed a lot [2], so I don't think this is very much of a burden.

What do you think?
Jouni

[1] http://thread.gmane.org/gmane.comp.python.matplotlib.devel/8865

[2] Here is a summary of the changes to each file in sample_data:

=== ./aapl.csv ===
r7379 | jdh2358 | 2009-08-05 18:57:31 +0300 (Wed, 05 Aug 2009)
r6202 | jdh2358 | 2008-10-15 15:43:41 +0300 (Wed, 15 Oct 2008)
r4975 | jdh2358 | 2008-02-16 22:58:37 +0200 (Sat, 16 Feb 2008)
=== ./AAPL.dat ===
r7388 | jdh2358 | 2009-08-05 20:16:50 +0300 (Wed, 05 Aug 2009)
=== ./aapl.npy ===
r7377 | jdh2358 | 2009-08-05 18:52:29 +0300 (Wed, 05 Aug 2009)
r6203 | jdh2358 | 2008-10-15 18:39:44 +0300 (Wed, 15 Oct 2008)
=== ./axes_grid/bivariate_normal.npy ===
r7436 | leejjoon | 2009-08-09 07:34:08 +0300 (Sun, 09 Aug 2009)
=== ./ct.raw ===
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
r177 | jdh2358 | 2004-03-13 01:00:12 +0200 (Sat, 13 Mar 2004)
=== ./data_x_x2_x3.csv ===
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
r7078 | efiring | 2009-05-03 03:09:06 +0300 (Sun, 03 May 2009)
=== ./demodata.csv ===
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
r5100 | jdh2358 | 2008-04-30 22:53:10 +0300 (Wed, 30 Apr 2008)
=== ./eeg.dat ===
r7382 | jdh2358 | 2009-08-05 19:21:23 +0300 (Wed, 05 Aug 2009)
r52 | jdh2358 | 2003-11-02 23:23:21 +0200 (Sun, 02 Nov 2003)
=== ./embedding_in_wx3.xrc ===
r7382 | jdh2358 |
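
The lookup order from point 2 of the proposal (unpacked copy first, network only as a fallback, with a switch to disable the network) could look roughly like the sketch below. This is not matplotlib's actual get_sample_data: the function name find_sample_data, the allow_network flag, and the caller-supplied download callable are all assumptions made for illustration, and the default directory is just the "e.g." path from the message.

```python
import os
from typing import Callable, Optional

# Hypothetical default location; the proposal only says "e.g. ~/.matplotlib/sample_data".
SAMPLE_DATA_DIR = os.path.expanduser(os.path.join("~", ".matplotlib", "sample_data"))


def find_sample_data(
    fname: str,
    data_dir: str = SAMPLE_DATA_DIR,
    allow_network: bool = True,
    download: Optional[Callable[[str, str], None]] = None,
) -> str:
    """Return a local path for fname, preferring an already unpacked copy.

    `download(fname, dest_path)` is a caller-supplied function (hypothetical)
    that fetches the file and writes it to dest_path.
    """
    path = os.path.join(data_dir, fname)
    if os.path.exists(path):
        # Found where the zip file was expanded: no network access at all.
        return path
    if not allow_network or download is None:
        raise IOError(f"sample data file {fname!r} not found in {data_dir!r}")
    # Fall back to the web and keep the result for next time.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    download(fname, path)
    return path
```

With allow_network=False a documentation build either finds every file in the unpacked directory or fails immediately, which is the repeatable behaviour the packagers are asking for.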
Re: [matplotlib-devel] Sample data: a proposal
On 09/12/2010 07:10 AM, Jouni K. Seppänen wrote:
> A while ago there was a discussion [1] about how using the get_sample_data function in building the documentation is a problem for Debian packagers. Let me see if I understand the goals of get_sample_data correctly:
>
> * we want to enable users to run examples they find in the gallery without downloading extra files;
> * we don't want to package all the sample data with matplotlib, either because it is too large, or because it changes more often than we release new versions.

* Also, we want to have the sample data not to be in the same version control repository as MPL proper so that when we download the MPL source code itself, we don't get the sample data. (This is one of the sticking points for a move to git.)

> Here's what I suggest:
>
> 1. Package the sample data in a separate zip file that users can download and expand in e.g. ~/.matplotlib/sample_data if they like. This file could be released more often than matplotlib, if needed. Debian can use this as one source file and package it as a separate deb file.
>
> 2. Make get_sample_data look first in the place where the zip file could have been expanded, and only if the required file is not found, try to obtain it from the web. Add an option to disable the network access. This is different from what we do now, because now get_sample_data always tries to check if there is a newer version available, which apparently doesn't work reliably on unconnected computers.
>
> 3. To make this work, agree that sample data files are immutable: if a new version is needed, it needs to have a new name (and thus the examples using it need to be updated). The files have not been changed a lot [2], so I don't think this is very much of a burden.
>
> What do you think?

#1 and #2 seem reasonable to me.
I don't like #3 -- for the same reasons as we want to separate the rest of the sample data (smaller download, smaller repository, and separation of code and non-essential data), I think the test comparison images should be with the sample data. Having to deal with renames in the tests would be annoying.

Two alternative ideas to handle the versioning issue:

A) Add a .py file in the main source repository which is a list of sample data filenames and checksums. If a sample data file doesn't exist, or its checksum is wrong, it can be downloaded.

B) The source file could simply state the data version number it requires, and the sample data itself could be versioned.

___
Matplotlib-devel mailing list
Matplotlib-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/matplotlib-devel
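
Alternative A above (a .py file listing filenames and checksums) might be sketched as follows. Everything here is an assumption for illustration: the thread names neither a hash algorithm nor a manifest layout, so SHA-256, the SAMPLE_DATA_CHECKSUMS dict, and the placeholder file name are hypothetical.

```python
import hashlib
import os

# Hypothetical manifest: file name -> expected SHA-256 hex digest.
# The single entry is a placeholder; its digest is that of an empty file.
SAMPLE_DATA_CHECKSUMS = {
    "placeholder.dat": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}


def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(fname, data_dir, manifest=SAMPLE_DATA_CHECKSUMS):
    """True if the file is missing or its checksum differs from the manifest."""
    path = os.path.join(data_dir, fname)
    if not os.path.exists(path):
        return True
    return file_sha256(path) != manifest[fname]
```

Because the manifest lives in the main source repository, each matplotlib revision pins the exact data it expects. A file that is not listed in the manifest raises KeyError in this sketch.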
Re: [matplotlib-devel] Sample data: a proposal
On Sun, Sep 12, 2010 at 10:30 AM, Andrew Straw straw...@astraw.com wrote:
> #1 and #2 seem reasonable to me.
>
> I don't like #3 -- for the same reasons as we want to separate the rest

I agree with Andrew here -- we don't want to hamstring our ability to change the data just because some people would rather take a version in place of the latest version. If we have an rc option

    sampledata.fetch : False

then the sampledata function would only look in the sample data dir, return the file if available, and raise otherwise. If fetch is True, it would always go to the web first and check for the latest, get it and cache it. Then the packagers could download the tarball, unpack it, and not worry about mpl trying to check for a more recent version.

JDH
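
A rough sketch of the two behaviours described for sampledata.fetch follows. The plain dict standing in for the rc settings, the function name get_sample_path, and the caller-supplied download callable are assumptions. The sketch also simplifies "check for the latest" to an unconditional re-download, since the message does not say how freshness would be checked.

```python
import os

# Hypothetical stand-in for the rc settings; "sampledata.fetch" is the option
# name proposed in the message, not necessarily an existing rc parameter.
RC_DEFAULTS = {"sampledata.fetch": False}


def get_sample_path(fname, data_dir, download, rc_params=RC_DEFAULTS):
    """Resolve fname according to the proposed sampledata.fetch semantics.

    `download(fname, dest_path)` is a caller-supplied function (hypothetical)
    that retrieves the latest copy and writes it to dest_path.
    """
    path = os.path.join(data_dir, fname)
    if rc_params["sampledata.fetch"]:
        # fetch is True: go to the web first, then cache the result locally.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        download(fname, path)
        return path
    # fetch is False: only look in the sample data dir, never touch the network.
    if os.path.exists(path):
        return path
    raise IOError(f"{fname!r} not found in {data_dir!r} and sampledata.fetch is False")
```

A packager would unpack the tarball into data_dir and leave the option at False. What should happen when fetch is True and the download fails is not specified in the thread, so this sketch simply lets the error propagate.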
Re: [matplotlib-devel] Sample data: a proposal
Andrew Straw straw...@astraw.com writes:

>> 3. To make this work, agree that sample data files are immutable: if a new version is needed, it needs to have a new name (and thus the examples using it need to be updated). The files have not been changed a lot [2], so I don't think this is very much of a burden.
>
> I don't like #3 -- for the same reasons as we want to separate the rest of the sample data (smaller download, smaller repository, and separation of code and non-essential data), I think the test comparison images should be with the sample data. Having to deal with renames in the tests would be annoying.

If the test data is moved there, I agree that renaming won't work. But it seems to me that test data is different from sample data used by examples: when running the tests for a given revision of matplotlib, you don't want the absolute latest comparison images but the images that correspond to that particular code revision. You also typically want to get all of the comparison images for that revision at the same time, since you're likely to be running the whole test suite.

Also, if you are running the test suite, I think we can assume you can get a checkout of the test-data repository. (A git submodule would seem to be a good fit: the main repository would have a pointer to the appropriate revision of the test-images repository, and people interested in running the test suite would have to run "git submodule update" to check it out.)

> Two alternative ideas to handle the versioning issue:
>
> A) Add a .py file in the main source repository which is a list of sample data filenames and checksums. If a sample data file doesn't exist, or its checksum is wrong, it can be downloaded.

Sounds complicated, and makes older versions unable to run newer examples.

> B) The source file could simply state the data version number it requires, and the sample data itself could be versioned.

That might work.
If I understand this correctly, the example code would call get_sample_data("foo.dat") to get the latest revision or get_sample_data("foo.dat", 1234) to get a specific one. These would retrieve URLs like

http://example.com/sample-data/raw/master/foo.dat
http://example.com/sample-data/raw/1234/foo.dat

--
Jouni K. Seppänen
http://www.iki.fi/jks
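
The URL scheme in the message above maps onto a very small helper. The function name sample_data_url and the BASE_URL constant are hypothetical. The base is just the example.com placeholder from the message, and "master" stands for "latest revision" as in the first example URL.

```python
from urllib.parse import quote

# Placeholder base URL copied from the example URLs in the message.
BASE_URL = "http://example.com/sample-data/raw"


def sample_data_url(fname, revision=None, base_url=BASE_URL):
    """Build the URL for fname at a given revision, or at the latest one."""
    rev = "master" if revision is None else str(revision)
    # quote() leaves "/" alone, so names like "axes_grid/file.npy" still work.
    return f"{base_url}/{rev}/{quote(fname)}"


print(sample_data_url("foo.dat"))        # latest revision
print(sample_data_url("foo.dat", 1234))  # pinned revision
```

An example that pins a revision keeps working against the same bytes even after the data file changes upstream, which addresses the immutability concern without renaming files.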