[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Massimo Sala added the comment:

I chose to use the internal variable *concat* because
- if I recollect correctly, it is calculated before the later routines run;
- I didn't see your solution (!); there was a very nice computed variable right in front of my eyes.

Mmh

1) Reliability
We cannot be sure this loop always runs with malformed files:
    for zinfo in zf.infolist():
We could wrap it in try / except, but then we lose the computation.
If *concat* is already computed (except for completely damaged files), IMHO my solution is better.

2) Performance
What is the performance for big files?
Are there file seeks due to traversing zf.infolist()?

> Daniel wrote:
> the advantage is that it already works in python 2.7 so there is no need to patch Python

Yes, indeed.
If I am right about the pros of my patch, I stand by it.

Many thanks for your attention.

On Sat, 18 Apr 2020 at 15:45, Daniel Hillier wrote:
>
> Daniel Hillier added the comment:
>
> Hi Massimo,
>
> Unless I'm missing something about your requirements, the advantage is that
> it already works in python 2.7 so there is no need to patch Python. Just
> bundle the above function with your analysis tool and you're good to go.
>
> Cheers,
> Dan

--
___
Python tracker
<https://bugs.python.org/issue40301>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Massimo Sala added the comment:

Sorry, I forgot to mention one specific case.

We have valid archives that start with a "blob": digitally signed zip files, whose filename extension is ".zip.p7m".

I agree your tip can be useful to other readers.

Best regards, Sala

On Sat, 18 Apr 2020 at 15:45, Serhiy Storchaka wrote:
>
> Serhiy Storchaka added the comment:
>
> Just check the first 4 bytes of the file. In a "normal" ZIP archive they are
> b'PK\3\4' (or b'PK\5\6' if it is empty). It is as reliable as checking the
> offset, and more efficient. It is even more reliable, because malware can
> have a zero ZIP archive offset, but it cannot start with b'PK\3\4'.
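The signature check suggested in the quoted message can be sketched as follows. This is only an illustration: the helper name `starts_like_plain_zip` and the constant names are made up here, and, as the reply notes, signed ".zip.p7m" files would fail this test even though they are legitimate.

```python
import io
import zipfile

# Signatures named in the message above (b'PK\3\4' and b'PK\5\6').
LOCAL_FILE_HEADER = b"PK\x03\x04"   # first record of a non-empty archive
END_OF_CENTRAL_DIR = b"PK\x05\x06"  # an empty archive starts with this record


def starts_like_plain_zip(fileobj):
    """Return True if the stream begins with a ZIP signature (hypothetical helper)."""
    fileobj.seek(0)
    return fileobj.read(4) in (LOCAL_FILE_HEADER, END_OF_CENTRAL_DIR)
```

The function accepts any seekable binary stream, so it works on an open file as well as on an `io.BytesIO` buffer.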
[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Massimo Sala added the comment:

Hi Daniel

Could you please elaborate on the advantages of your loop versus my two lines of code?
I don't grasp...

Thanks, Massimo

On Sat, 18 Apr 2020 at 03:26, Daniel Hillier wrote:
>
> Daniel Hillier added the comment:
>
> Could something similar be achieved by looking for the earliest file
> header offset?
>
> def find_earliest_header_offset(zf):
>     earliest_offset = None
>     for zinfo in zf.infolist():
>         if earliest_offset is None:
>             earliest_offset = zinfo.header_offset
>         else:
>             earliest_offset = min(zinfo.header_offset, earliest_offset)
>     return earliest_offset
>
>
> You could also adapt this using
>
>     zinfo.compress_size + len(zinfo.FileHeader())
>
> to see if there were any sections inside the archive which were not
> referenced from the central directory. Not sure if zip files with arbitrary
> bytes inside the archive would be valid everywhere, but I think they are
> using zipfile.
>
> You can also have zipped content inside an archive which has a valid
> fileheader but no reference from the central directory. Those entries are
> discoverable by implementations which process content serially from the
> start of the file but not implementations which rely on the central
> directory.
>
> --
> nosy: +dhillier
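For comparison, the quoted loop can be tried on an in-memory archive that has bytes prepended to it. This is a sketch for Python 3 only (it uses `min(..., default=None)`), and `make_prefixed_archive` is a helper invented for the demonstration. It relies on the behaviour discussed in this thread, namely that zipfile adjusts each `header_offset` for leading data when it reads the central directory.

```python
import io
import zipfile


def find_earliest_header_offset(zf):
    """Smallest local-header offset among the listed entries, or None if empty."""
    return min((zinfo.header_offset for zinfo in zf.infolist()), default=None)


def make_prefixed_archive(prefix):
    """Build an in-memory file: `prefix` bytes followed by a one-entry ZIP (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hello")
    return io.BytesIO(prefix + buf.getvalue())
```

With an empty prefix the earliest offset is 0; with a 100-byte prefix it is 100, which is the same information the proposed property would expose for archives of this shape.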
[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Massimo Sala added the comment:

Hi Serhiy

Thanks for the suggestion, but I don't need to analyse different self-extraction payloads (and I think that is always unreliable, there are too many self-extractors in the wild).

Let me say a few words about my work.
I analyze ZIP archives because they are also the "incarnation" of Microsoft OOXML and OpenOffice OASIS ODF documents.
I always find that files of this kind with a non-zero offset aren't strictly compliant documents (per their respective file format specifications).
Sometimes there is a self-extractor, sometimes there are pieces of malware blobs (outside the ZIP structure or inside it, in the compressed files), sometimes other errors.

For us, checking the offset is very effective: we discard "bad" documents at maximum speed before any other checks, and it is more reliable than antivirus (which checks against specific blob signatures that change all the time).
With just a single test we have a 100% go/no-go result.
Every colleague grasps this check; there are no hard-to-read, hard-to-maintain routines.

Massimo

On Sat, 18 Apr 2020 at 09:36, Serhiy Storchaka wrote:
>
> Serhiy Storchaka added the comment:
>
> I am not sure it would help you. There are legitimate files which contain
> a payload followed by the ZIP archive (self-extracting archives, programs
> with embedded ZIP archives). And the malware can make the offset of the ZIP
> archive be zero.
>
> If you want to check whether the file looks like an executable, analyze
> first few bytes of the file. All executable files should start by one of
> well recognized signatures, otherwise the OS would not know how to execute
> them and they would not be malware.
[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Massimo Sala added the comment:

On Sat, 18 Apr 2020 at 04:37, Steven D'Aprano wrote:
> If we made an exception for you, then people using Python 2.7 still
> couldn't use this feature: `myzipfile.offset` would fail on code using
> Python 2.7, 2.7.1, 2.7.2, 2.7.3, ... 2.7.17 and only work with 2.7.18.
> Nobody could use it unless their application required 2.7.18.

Yes, it seems obvious to me that it will work only with Python 2.7.18, and I see no problem.
If you need new features, you always have to update (to a new MINOR version or, like you said, a MAJOR version).

I am used to other software where some features are backported to older versions, and IMHO it is very useful.
Sometimes you just need a specific feature and it isn't possible to update to a MAJOR version.
You have to consider that there is a lot of legacy software, also in business, and a version leap means a lot of work and tests.

Speaking in general, not only about python: if the maintainers backport that specific feature, bingo! You only have to update to a new MINOR version of the same MAJOR version.
And this is good for the user base; there isn't "one size fits all".

I shot my bullet, but I cannot change the python.org way of life.

Steven, many thanks for your answers and your patience in explaining.

BTW yes, I will patch the python 2.7 sources and compile them... also on legacy, intranet, centos 5 servers we cannot update :-)
[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Massimo Sala added the comment:

Hi Steven

Every software "ecosystem" has its guidelines and I am a newbie to python development.

Mmh, I see your concerns.
I agree with your removal of all py 3 versions before the latest, 3.9.

About Py 2, I would point out these facts:
- there are a lot of forensics tools still written for py 2;
- python 2.7.18 will forever be the last python 2, and I think it is fine for end users to have zipfile with this feature both in py 2.7 and py 3.9;
- in the code there isn't any new routine to test: the change just exposes one internal variable.

I agree my request is an exception, but I think you have to agree this situation is exceptional.
IMHO rules must exist to help us, and I think this request doesn't carry any burden.

I ask you please
- to reconsider my request;
- in any case, to put me in contact with the zipfile maintainers. I don't know how to reach them, but I want to hear their view on this.

Many thanks, Massimo
[issue40271] Allow shell like paths in
Massimo Sala added the comment:

Gavin, zipfile works on all the operating systems where python runs.

Your request is OS dependent... BSD? linux?
The tilde isn't part of the ZIP file specification.

I have to agree with Serhiy: the correct solution is
os.path.expanduser("~/stuff")

--
nosy: +massimosala
___
Python tracker
<https://bugs.python.org/issue40271>
___
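A minimal sketch of the suggested approach: the caller expands "~" and then passes an ordinary path to zipfile. The wrapper name `open_zip_with_tilde` is hypothetical and not part of the zipfile module.

```python
import os
import zipfile


def open_zip_with_tilde(path, mode="r"):
    """Expand a leading "~" before handing the path to zipfile (hypothetical wrapper)."""
    return zipfile.ZipFile(os.path.expanduser(path), mode)
```

Keeping the expansion in the caller means zipfile itself stays free of shell conventions, which is the point made above.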
[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Change by Massimo Sala:

--
title: zipfile module: new feature (two lines of code) -> zipfile module: new feature (two lines of code), useful for test, security and forensics
[issue40301] zipfile module: new feature (two lines of code)
New submission from Massimo Sala:

module zipfile

Tag "Components": I am not sure "Library (Lib)" is the correct one. If it isn't, please fix.

I use python to check zip files against malware.
In these files there are binary blobs outside the ZIP archive.
The malware payload isn't inside the ZIP file structure.
Example: a file "openme.zip" with this content:
[blob from offset 0 to offset 5678]
[ZIP archive from offset 5679 to end of file]

zipfile already handles this, finding the ZIP structure inside the file.
My change just adds a new public property that exposes an internal variable: the file offset of the ZIP structure.

I know I am past the code freeze of Python 2.7.18.
But the change is really trivial, see the diff.
I hope you can approve this patch for all the Python versions, including 2.7, for consistency.
For 2.7 this is the last call.

--
components: Library (Lib)
files: py27_zipfile.patch
keywords: patch
messages: 366597
nosy: massimosala
priority: normal
severity: normal
status: open
title: zipfile module: new feature (two lines of code)
type: enhancement
versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8, Python 3.9
Added file: https://bugs.python.org/file49067/py27_zipfile.patch
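The "openme.zip" layout described in the submission can be reproduced in memory to see that zipfile reads such a file without complaint. This is an illustrative sketch, not the attached patch: the zero-filled blob, the member name, and both helper functions are invented for the demonstration.

```python
import io
import zipfile

BLOB_SIZE = 5679  # the blob occupies offsets 0..5678, as in the example above


def build_openme(blob_size=BLOB_SIZE):
    """Return bytes laid out as [blob][ZIP archive] (illustrative data only)."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("document.txt", "harmless content")
    return b"\x00" * blob_size + archive.getvalue()


def list_members(data):
    """Open the combined bytes with zipfile and return the member names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()
```

Here the ZIP signature sits at offset 5679 rather than 0, yet the member list is returned as for a plain archive, which is why the position of the ZIP structure is not visible to callers without extra work.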