[issue18321] Multivolume support in tarfile module

2014-04-14 Thread Lars Gustäbel

Lars Gustäbel added the comment:

Okay, let me tell you why I reject your contribution at this point.

The patch you submitted may be well-suited for your purposes but it does not 
meet the requirements of a standard library implementation because it is not 
generic and comprehensive enough.

It contains duplicate code, spelling mistakes and needless code changes e.g.  
in test_tarfile.py.

It does not expose one set of volumes as one tar archive to the user. It is not 
possible to iterate over all members of all volumes in one go. It does not 
allow random-access.

Actually, it does not implement complete multivolume support but only the 
easy parts.  For example, it fails to read GNU tar archives that are split in 
the middle of a pax header block sequence. The other way around, when writing 
it makes a split only when it is inside the data part of a member. Hence, it is 
possible that a volume turns out smaller than max_volume_size which is not only 
inaccurate but also bad on a tape device.

If you decide that you still want multivolume support in tarfile, feel free to 
reopen this issue with a new and significantly better patch. I gave you a 
number of clues on what I think is required.

--
assignee:  - lars.gustaebel
resolution:  - rejected
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-04-13 Thread Eduardo Robles Elvira

Eduardo Robles Elvira added the comment:

 The example I gave is based on the idea that there is a TarVolumeSet class in 
 the tarfile module that implements all the required file-object methods (e.g. 
  read(), write(), seek(), etc.) and acts as if the sequence of volumes is 
 actually one big file. It is passed to tarfile.open() as the fileobj 
 argument. This TarVolumeSet class is supposed to be subclassable to let the 
 user implement her/his own mode of operation. This way the open_volume() 
 method can do whatever the user wants it to do. The TarVolumeSet class might 
 as well have a new_volume() method for writing multivol archives, the example 
 only covered the case of reading a multivol archive.

This is not so easy, because for example if you want to move the logic in 
addfile() that manages volumes to the write() function, you have some issues:
 * write() will need to take into account blocks (BLOCKSIZE), just to be able 
to split the volumes correctly. This is something that usually shouldn't belong 
in a write() function as it has to do to tarfile.. and it can be messy that 
both layers deal with it (write() splitting volumes, and tarfile adding NUL at 
the end of a BLOCK..) This can be done I guess, but remember, we split a volume 
only in the middle of a big file, not in any other case (AFAIK). Hopefully you 
don't get huge pax headers or anything strange. Usually all other records are 
either in the begining or just occupy one block, but can we rule this problem 
out for good?

 * multivolume logic in write() needs read/write access to the current tarinfo 
being written (look for tarfile in addfile() funcion in my patch to see why). 
How do you propose this object should be accessed from write()? 

 BTW, my version of GNU tar refuses to create compressed multiple-volume 
 archives which is why I doubt the usefulness of this feature overall.

But it has multivolume support right? Which is what I am proposing here. Also, 
you can gzip (or encrypt or anything) the volumes after creating the volumes..

 [...] because a multivol tarfile is not exactly the same as a normal tarfile 
 chopped up.
 No, I think it is exactly that. The only purpose of the GNUTYPE_MULTIVOL 
 header that is at the start of each subsequent volume is to give GNU tar the 
 ability to detect if it is reading the correct volume. It is not essential 
 and could as well be left out.

I'm not going to discuss this, because I think that we can implement it in the 
way you propose to implement it anyway. But my patch supports it and I think 
it's an *useful* feature, so I want it in :-P

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-04-13 Thread Lars Gustäbel

Lars Gustäbel added the comment:

 [...] but remember, we split a volume only in the middle of a big file, not 
 in any other case (AFAIK). Hopefully you don't get huge pax headers or 
 anything strange. [...] 

Hopefully? Sorry, but have you tested this? I did. I let GNU tar create a two 
volume archive that is split exactly between the two blocks of an XHDTYPE pax 
header.

The result is terrifying. At the beginning of the second volume GNU tar creates 
an XGLTYPE header as the pax replacement for a GNUTYPE_MULTIVOL header, 
followed by an XHDTYPE header (GNUFileParts) that somehow decorates the 
following REGTYPE(!) tar header that contains the continuation of the split 
XHDTYPE header data from the previous volume. After that comes the REGTYPE file 
that the split XHDTYPE header was actually meant for as decoration.

I attached the archive to this issue.

What happens if a GNUTYPE_LONGNAME header is split in two? I don't wanna know...


 write() will need to take into account blocks (BLOCKSIZE), just to be able to 
 split the volumes correctly.

It is mandatory to do the split on a block boundary (a multiple of 512).


 * multivolume logic in write() needs read/write access to the current tarinfo 
 being written [...]. How do you propose this object should be accessed from 
 write()?

I don't know and this problem seems to be quite hard to address with my 
approach. That's too bad.


  BTW, my version of GNU tar refuses to create compressed multiple-volume 
  archives which is why I doubt the usefulness of this feature overall.
 But it has multivolume support right? Which is what I am proposing here. 
 Also, you can gzip (or encrypt or anything) the volumes after creating the 
 volumes..

Yeah, it has multivolume support, but a very limited one that is not only weird 
but isn't even usable together with compression. And sure, I can compress and 
encrypt the volumes afterwards, but I can also create a compressed archive and 
pipe it through split(1) to split it into parts. Both ways create tar archives 
that are not readable by GNU tar because they're non-standard. So what?

Please tell me, what is your actual personal use-case for this feature?

--
Added file: http://bugs.python.org/file34798/split-xhdtype.tar.gz

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-04-13 Thread Eduardo Robles Elvira

Eduardo Robles Elvira added the comment:

 [...] but remember, we split a volume only in the middle of a big file, not 
 in any other case (AFAIK). Hopefully you don't get huge pax headers or 
 anything strange. [...] 
 Hopefully? Sorry, but have you tested this? I did. I let GNU tar create a two 
 volume archive that is split exactly between the two blocks of an XHDTYPE pax 
 header.

 The result is terrifying. At the beginning of the second volume GNU tar 
 creates an XGLTYPE header as the pax replacement for a GNUTYPE_MULTIVOL 
 header, followed by an XHDTYPE header (GNUFileParts) that somehow decorates 
 the following REGTYPE(!) tar header that contains the continuation of the 
 split XHDTYPE header data from the previous volume. After that comes the 
 REGTYPE file that the split XHDTYPE header was actually meant for as 
 decoration.

 I attached the archive to this issue.

 What happens if a GNUTYPE_LONGNAME header is split in two? I don't wanna 
 know...

 write() will need to take into account blocks (BLOCKSIZE), just to be able 
 to split the volumes correctly.

 It is mandatory to do the split on a block boundary (a multiple of 512).
 BTW, my version of GNU tar refuses to create compressed multiple-volume 
 archives which is why I doubt the usefulness of this feature overall.
 But it has multivolume support right? Which is what I am proposing here. 
 Also, you can gzip (or encrypt or anything) the volumes after creating the 
 volumes..

 Yeah, it has multivolume support, but a very limited one that is not only 
 weird but isn't even usable together with compression. And sure, I can 
 compress and encrypt the volumes afterward, but I can also create a 
 compressed archive and pipe it through split(1) to split it into parts. Both 
 ways create tar archives that are not readable by GNU tar because they're 
 non-standard. So what?

 Please tell me, what is your actual personal use-case for this feature?

I'm willing modify the patch to remove the weirdness you refer to. I differ 
on that it's not usable: it might not be useful to you, but it's certainly a 
feature that covers part of the functionality of GNU tar. Actually, some of the 
unit tests are like this: use GNU Tar to compress, then extract with tarfile - 
and viceversa.

Of course you can use split. And I could use Ruby or Perl, but I'm using python 
and tarfile, and this is a GNU tar feature that is just not supported in python 
tarfile upstream, and I'm just trying to contribute this feature, if possible 
:-).

BTW, If I create a multivol tar file and then compress the volumes, that does 
not make it non-standard, in the same way that if I create a PNG file and 
then compress it and then store it in EXTFS, it doesn't make it non-standard. 
I'm just using multiple layers of standards.

I'm a contractor, and I have been asked by a client to develop a python-based 
backup tool. The client is technical and had already an idea of what he wanted 
to do: use python-tarfile and add support to multivolume and some other 
goodies, and the client also wanted to try to push the changes upstream as we 
believe it is the correct thing to do.

BTW, when we designed the backup tool, we ruled out the possibility of using 
split because split wouldn't allow to correctly list all the files in each 
file-slice separately. We wanted to be able to recover all the files of each 
volume so that if we lose other volumes, we can still recover all the data 
from the volumes we have. 

Anyway, if you are the maintainer of tarfile and you think it's not possible to 
push tar-multivolume support upstream in python tarfile for whatever reason, 
please tell me.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-03-12 Thread Lars Gustäbel

Lars Gustäbel added the comment:

 It's also consistent with how the tar command works afaik, just listing the 
 contents of the current volume.

No, GNU tar operates on the entirety of the archive and asks for the filename 
of the subsequent volume every time it hits eof in the current volume.

 You don't want to directly do a plain open in there, because you want to be 
 able to deal with read/write modes, with gzip/bzip/Stream class.

The example I gave is based on the idea that there is a TarVolumeSet class in 
the tarfile module that implements all the required file-object methods (e.g.  
read(), write(), seek(), etc.) and acts as if the sequence of volumes is 
actually one big file. It is passed to tarfile.open() as the fileobj argument. 
This TarVolumeSet class is supposed to be subclassable to let the user 
implement her/his own mode of operation. This way the open_volume() method can 
do whatever the user wants it to do. The TarVolumeSet class might as well have 
a new_volume() method for writing multivol archives, the example only covered 
the case of reading a multivol archive.

BTW, my version of GNU tar refuses to create compressed multiple-volume 
archives which is why I doubt the usefulness of this feature overall.

 [...] because a multivol tarfile is not exactly the same as a normal tarfile 
 chopped up.

No, I think it is exactly that. The only purpose of the GNUTYPE_MULTIVOL header 
that is at the start of each subsequent volume is to give GNU tar the ability 
to detect if it is reading the correct volume. It is not essential and could as 
well be left out.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-03-11 Thread Eduardo Robles Elvira

Eduardo Robles Elvira added the comment:

I guess I got it wrong, it's not part of the POSIX standard, just part of the 
GNU tar documentation. About the getmembers and getnames not reflecting the 
entirety of the archive, it's an optimization I needed and I think ccan be 
quite handy. It's also consistent with how the tar command works afaik, just 
listing the contents of the current volume. But it's true that could be done.

Regarding the TarVolumeSet idea, it could work but it's not as easy as that. 
You don't want to directly do a plain open in there, because you want to be 
able to deal with read/write modes, with gzip/bzip/Stream class. You also want 
to give the user maximum flexibility. That's why in my implementation 
new_volume_handler and open_volume functions are separated. Similarly, we could 
do something like this:

class MyTarVolumeSet(tarfile.TarVolumeSet):

def __init__(self, template, max_vol_size):
self.template = template
self.max_vol_size = max_vol_size

def new_volume_handler(self, tarfile, volume_number):
self.open_volume(self.template % volume_number, tarfile)

volumes = MyTarVolumesSet(test.tar.%03d)
with tarfile.open(fileobj=volumes, mode=w:) as tar:
for t in tar:
print(t.name)

Note that the new_volume_handler in this example receives more or less the same 
params as in my patch (not the base name, because it's already stored in the 
template), and that the open_volume really abstracts which way it will be 
opened. It could also have, as in my patch, an optional fileobj parameter, for 
a new indirection/abstraction.

In any case, this kind of abstraction would still really need some kind of 
hooking with tarfile, because a multivol tarfile is not exactly the same as a 
normal tarfile chopped up. This might be doable unilateraly by a smart 
TarVolumeSet getting the information from the tarfile and modifying/patching 
anything needed. This is one of the reasons why one would probably would still 
need access to the tarfile inside the open_volume function.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-02-02 Thread Lars Gustäbel

Lars Gustäbel added the comment:

I had the following idea: What about a separate class, let's call it 
TarVolumeSet for now, that maps a set of (virtual) volumes onto one big 
file-like object. This TarVolumeSet will be passed to a TarFile constructor as 
the fileobj argument. It is subclassable for implementing custom behavior.


class MyTarVolumeSet(tarfile.TarVolumeSet):

def __init__(self, template):
self.template = template

def open_volume(self, volume_number):
return open(self.template % volume_number, rb)

volumes = MyTarVolumesSet(test.tar.%03d)
with tarfile.open(fileobj=volumes, mode=r:) as tar:
for t in tar:
print(t.name)


In my opinion, this approach has a number of advantages: Most importantly, it 
separates the multi-volume code from the TarFile class, which reduces the 
invasiveness, complexity and maintenance burden of the original approach. The 
TarFile class would be totally agnostic about the concept of multiple volumes, 
TarVolumeSet looks just like another file-object to TarFile. Looks like the 
cleanest solution to me so far.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-01-30 Thread Lars Gustäbel

Lars Gustäbel added the comment:

At first, I'd like to take back my comment on this patch being too complex for 
too little benefit. That is no real argument.

Okay, I gave it a shot and I have a few more remarks:

The patch does not support iterating over a multi-volume tar archive, e.g. for 
TarFile.list(). You must implement this.

In my opinion, a tar archive is one logical unit even if it spans across 
multiple volumes. Thus, it is vital to have .getmembers() and .getnames() 
reflect the entirety of the archive, e.g. to support if filename in 
.getnames(). I think it could be a good idea to store the volume number along 
each TarInfo object for random-access.

By the way, which standard are you referring to? The only one I know of is 
POSIX pax which doesn't say anything about multiple volumes.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-01-29 Thread Lars Gustäbel

Lars Gustäbel added the comment:

I cannot yet go into the details, because I have not tested the patch.

The comments, docstrings and quoting are not very consistent with the rest of 
the module. There are a few spelling mistakes. The open_volume() method is more 
or less a copy of the open() method which is not optimal.

The patch adds a lot of complexity to the tarfile module for a use case that 
only a few connoisseurs benefit from. It seems to alter some inherent TarFile 
mechanics that people might rely on, e.g. the members attribute contains only 
the members stored in the current volume not the overall entirety of members. 
Does this patch reliably allow random-access?

Would it be possible/easier to add the same functionality using a separate 
class MultiVolumeTarFile instead?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-01-29 Thread Eduardo Robles Elvira

Eduardo Robles Elvira added the comment:

 I cannot yet go into the details, because I have not tested the patch.

 The comments, docstrings and quoting are not very consistent with the rest of 
 the module. There are a few spelling mistakes. 

I can try to take care of this, though it'd be best if you can point me out 
concrete examples.

 The open_volume() method is more or less a copy of the open() method which is 
 not optimal.

I know, but trying to do something else might be even worse..

 The patch adds a lot of complexity to the tarfile module for a use case that 
 only a few connoisseurs benefit from. It seems to alter some inherent TarFile 
 mechanics that people might rely on, e.g. the members attribute contains only 
 the members stored in the current volume not the overall entirety of members. 

Well, that doesn't make much sense to me. You say that people rely on something 
like  members attribute contains only the members stored in the current volume 
not the overall entirety of members, but as you know, python tarfile lib 
didn't support volumes before this patch, so I guess people couldn't be relying 
on that anyway.

Also please bear in mind that tarfile volumes support is part of the tar 
standard, and altough not everyone uses this, it has its niche uses (sliced 
backups etc).

 Does this patch reliably allow random-access?

Yes and no. 

No: when you open a volume for reading, the list of members is cleared as you 
pointed so you cannot access to them.

Yes: you can open any volume at the begining of a file, and it read the tarfile 
from there. I do that in my use-case.

 Would it be possible/easier to add the same functionality using a separate 
 class MultiVolumeTarFile instead?

If you find open_volume similar to open() and don't like, I'm quite sure you 
would like all the needed copy-paste having this a separate class would entail. 
Also as I said before, might not make much sense as this is part of the 
standard.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-01-27 Thread Eduardo Robles Elvira

Eduardo Robles Elvira added the comment:

Do we have any news on this patch?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-01-27 Thread Martin v . Löwis

Martin v. Löwis added the comment:

Lars, would you like to take a look?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2014-01-27 Thread Berker Peksag

Changes by Berker Peksag berker.pek...@gmail.com:


--
versions: +Python 3.5 -Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2013-10-22 Thread Eduardo Robles Elvira

Eduardo Robles Elvira added the comment:

could you please check if my contributor form is already processed?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2013-10-22 Thread Christian Heimes

Christian Heimes added the comment:

You have an asterisk next to your user name. So yes, your form was received and 
processed.

--
nosy: +christian.heimes

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2013-07-28 Thread Eduardo Robles Elvira

Eduardo Robles Elvira added the comment:

Sure, I will fill it out. But is it required?

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2013-06-28 Thread Eduardo Robles Elvira

New submission from Eduardo Robles Elvira:

The patch attached provides implementation for multivolume support for tarfile 
module. It contains both the changes in the module and a battery of unit tests. 
It contains support for multivolume for both GNU and PAX formats.

The main idea behind this proposal is that for dealing with new volumes, the 
user will provide a callback function. In this callback function the user 
typically calls to TarFile.open_volume(filename) with the appropiate filename. 
This is quite flexible in the sense that the callback function could even ask 
the user for the filename of the next volume (as tar command does), or do 
anything they need before returning or before calling to open_volume. 

Please feel free to comment on how to improve the implementation or the API. 
Documentation for the new feature will be provided at a later stage of the 
review process if the patch gets a good review.

--
components: Library (Lib)
files: cpython-tarfile-multivolume.patch
keywords: patch
messages: 191979
nosy: edulix
priority: normal
severity: normal
status: open
title: Multivolume support in tarfile module
versions: Python 3.4
Added file: http://bugs.python.org/file30720/cpython-tarfile-multivolume.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2013-06-28 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +lars.gustaebel
stage:  - patch review
type:  - enhancement

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2013-06-28 Thread Martin v . Löwis

Martin v. Löwis added the comment:

Eduardo, can you please fill out the contributor form?

--
nosy: +loewis

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18321] Multivolume support in tarfile module

2013-06-28 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
nosy: +serhiy.storchaka

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18321
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com