Paniagua, Eric wrote:
Hi everyone,
I am experiencing a problem similar to Leandro's. I have defined a new
datatype that derives from Data and has the extension foobar. The simplest and
first thing I'd like to do with it is create a dataset of this new type
Foobar by uploading from my local machine to my Galaxy development server.
When I do the upload, I specify the file format as foobar rather than
Auto-detect. The file itself is a zip archive containing a folder
containing other files, but the extension is .foobar and not .zip. I
encounter at least 2 (I believe separate) problems while trying to upload.
1. I can see that on upload the first non-directory entry in the .foobar
file is being extracted and replacing the original. It seems this is how
*unknown* zip archives are supposed to be handled by Galaxy, but as to why
that is, I haven't a clue.
2. (This is what's similar to Leandro's case.) The dataset_XXX.dat file
produced by the upload is not actually the same as the file it is copied
from in the uploaded archive. The checksums are different, and the sizes are
different (the dataset is 1 byte longer).
I did a diff of the hexdumps of the dataset file and the corresponding file
from the uploaded archive, and discovered the following:
1. Every occurrence of '^M' (aka '\r' aka 0x0D) has been replaced with '\n'
(aka 0x0A).
2. A newline ('\n' aka 0x0A) is added to the end of the file.
This file from the archive is not a text file, it is binary, so any code in
Galaxy that tries to fix line endings shouldn't be doing this. (Where) Are
there such places?
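
To make the two observed changes concrete, here is a minimal sketch of what a text-oriented newline fixer would do to binary content. The function name convert_newlines_like is hypothetical and this is not Galaxy's code; it only mimics the two effects seen in the hexdump diff (every 0x0D becomes 0x0A, and one 0x0A is appended), and the unconditional append is an assumption.

```python
import hashlib


def convert_newlines_like(data: bytes) -> bytes:
    """Hypothetical stand-in that mimics the observed changes; not Galaxy's code."""
    # Observation 1: every '\r' (0x0D) is replaced with '\n' (0x0A).
    converted = data.replace(b"\r", b"\n")
    # Observation 2: a '\n' is added to the end of the file
    # (appended unconditionally here, which is an assumption).
    return converted + b"\n"


# A small made-up binary payload that happens to contain a 0x0D byte.
original = b"\x00\x0d\x01\xff"
mangled = convert_newlines_like(original)

size_difference = len(mangled) - len(original)
checksums_match = hashlib.md5(original).hexdigest() == hashlib.md5(mangled).hexdigest()
```

With a payload like this the result is 1 byte longer and the checksums no longer match, which is consistent with the symptoms described above.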
Leandro, have you solved your problem? If not, what do you see when you do
this kind of comparison?
I am unable to reproduce those changes by stepping through the code in
upload.py which handles zip files (and replaces them by their first file
member) using the same python installation used for running Galaxy. This
suggests the problem is elsewhere.
Does anyone know why this '\r' to '\n' mapping is affecting this file?
Does anyone know why the default behaviour for uploading zip archives is to
keep one file arbitrarily and throw out the rest? Even with an argument in
favor of this behaviour, why is there not a unzippable_file_formats list
for exceptions to be made like there is for sniffing?
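
For reference, the behaviour described here (keep the first non-directory entry, discard the rest) can be sketched with the standard zipfile module. This is only an illustration of the observed behaviour, not the code in upload.py; first_file_member and build_demo_archive are hypothetical names, and the archive contents are made up.

```python
import io
import zipfile


def first_file_member(archive_bytes: bytes):
    """Return (name, content) of the first non-directory entry, or None."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if not info.is_dir():
                # Every later member is ignored, as in the behaviour described.
                return info.filename, zf.read(info)
    return None


def build_demo_archive() -> bytes:
    """Build a small in-memory archive: one folder containing two files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("folder/", b"")  # directory entry
        zf.writestr("folder/a.bin", b"\x00\r\x01")
        zf.writestr("folder/b.bin", b"second")
    return buf.getvalue()


name, payload = first_file_member(build_demo_archive())
```

Note that reading the member this way returns the bytes unchanged, which fits the observation that the zip handling alone does not reproduce the '\r' replacement.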
Any enlightenment on these matters would be greatly appreciated.
Hi Eric,
It's happening in tools/data_source/upload.py, line 286:
    line_count, converted_path = sniff.convert_newlines( dataset.path, in_place=in_place )
This should really be checking for any datatype subclassed from Binary,
not just Binary itself. As a quick workaround, add a check for your datatype
in the if/elif blocks around line 130 to avoid the default processing.
The upload tool needs to be rewritten to fix this, but it will be a
while before this is done.
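
A minimal sketch of the difference between checking for Binary itself and checking for anything subclassed from it. The classes below are hypothetical stand-ins, not Galaxy's actual datatype classes, and the two helper functions are illustrative names only.

```python
class Data:
    """Hypothetical stand-in for a base datatype class."""


class Binary(Data):
    """Hypothetical stand-in for the binary base datatype."""


class Foobar(Binary):
    """Hypothetical user-defined datatype subclassing Binary."""


class Text(Data):
    """Hypothetical text datatype, for contrast."""


def is_exactly_binary(datatype) -> bool:
    # Matches only Binary itself, so subclasses fall through to text handling.
    return type(datatype) is Binary


def is_binary_or_subclass(datatype) -> bool:
    # Matches Binary and every datatype derived from it.
    return isinstance(datatype, Binary)


def should_convert_newlines(datatype) -> bool:
    # Newline conversion only makes sense for non-binary datatypes.
    return not is_binary_or_subclass(datatype)
```

Under these assumptions, a Foobar instance is skipped by the exact-type check but caught by the isinstance check, so only the latter would protect it from newline conversion.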
--nate
Best,
Eric
From: Leandro Hermida [soft...@leandrohermida.com]
Sent: Friday, September 16, 2011 10:03 AM
To: Paniagua, Eric
Subject: Re: [galaxy-dev] uploading binary files checksum changes, Galaxy
doing something to file?
Hi Eric,
On Fri, Sep 16, 2011 at 3:58 PM, Paniagua, Eric epani...@cshl.edu wrote:
Hi Leandro,
Is there an entry in your history for the upload? What file format does it
show? Is there any chance your original file was zipped? If Galaxy
detected it as a zip file on upload, it may have unzipped it and taken the
first file in it as the dataset.
Yes, there is a history entry for the upload. The format it shows is
the new datatype I created (in datatypes_conf.xml, subclassing Binary)
which I selected in the drop-down menu before uploading the file in
the Get Data form. It is not a zip file.
That's at least the version of your problem that I've run into before.
Specifying the file format manually (rather than choosing Auto-detect) may
help if it's a similar problem. I suspect the correct solution is to write
a sniffer for your datatype to help ensure it is identified correctly by
Galaxy, but I haven't tried this yet.
Essentially the basic question is, how do you tell Galaxy not to do or
touch absolutely *anything* with an uploaded binary file??? The
checksums should always match.
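
One way to run the comparison in question is to hash both files in binary mode and compare the digests. This is a minimal sketch using only the standard library; the function names are hypothetical and the paths you pass in would be your original file and the dataset_XXX.dat file on the server.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in binary mode, reading it in chunks."""
    digest = hashlib.new(algorithm)
    # "rb" matters: text mode could itself translate line endings.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """Return True when both files have the same checksum."""
    return file_checksum(path_a) == file_checksum(path_b)
```

If files_identical returns False for an upload of a binary datatype, the file was modified somewhere between the upload and the stored dataset.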
Best of luck,
Eric
From: galaxy-dev-boun...@lists.bx.psu.edu
[galaxy-dev-boun...@lists.bx.psu.edu] on behalf of Leandro Hermida
[soft...@leandrohermida.com]
Sent: Friday, September 16, 2011 9:42 AM
To: Galaxy Dev
Subject: [galaxy-dev] uploading binary files checksum changes, Galaxy
doing something to file?
Hi all,
We tried to find something in the docs and mailing list, but had no luck. We
created a new datatype that is a straight subclass of Binary and then
when we upload such a file in the Galaxy UI and check the checksums
between the original file and the file located