Re: [galaxy-dev] uploading binary files checksum changes, Galaxy doing something to file?

2011-09-28 Thread Nate Coraor
Paniagua, Eric wrote:
 Hi everyone,
 
 I am experiencing a problem similar to Leandro's.  I have defined a new 
 datatype that derives from Data and has the extension foobar.  The simplest 
 and first thing I'd like to do with it is create a dataset of this new type 
 Foobar by uploading from my local machine to my Galaxy development server.  
 When I do the upload, I specify the file format as foobar rather than 
 Auto-detect.  The file itself is a zip archive containing a folder 
 containing other files, but the extension is .foobar and not .zip.  I 
 encounter at least 2 (I believe separate) problems while trying to upload.
 
 1. I can see that on upload the first non-directory entry in the .foobar 
 file is being extracted and replacing the original.  It seems this is how 
 *unknown* zip archives are supposed to be handled by Galaxy, but as to why 
 that is, I haven't a clue.
 
 2. (This is what's similar to Leandro's case.)  The dataset_XXX.dat file 
 produced by the upload is not actually the same as the file it is copied 
 from in the uploaded archive.  The checksums are different, and the sizes are 
 different (the dataset is 1 byte longer).
 
 I did a diff of the hexdumps of the dataset file and the corresponding file 
 from the uploaded archive, and discovered the following:
 
 1. Every occurrence of '^M' (aka '\r' aka 0x0D) has been replaced with '\n' 
 (aka 0x0A).
 
 2. A newline ('\n' aka 0x0A) is added to the end of the file.
 
 This file from the archive is binary, not text, so any code in Galaxy that 
 tries to fix line endings should not be touching it.  Are there such places, 
 and if so, where?
 
 Leandro, have you solved your problem?  If not, what do you see when you do 
 this kind of comparison?
 
 I am unable to reproduce those changes by stepping through the code in 
 upload.py which handles zip files (and replaces them by their first file 
 member) using the same python installation used for running Galaxy.  This 
 suggests the problem is elsewhere.
 
 Does anyone know why this '\r' -> '\n' mapping is affecting this file?
 
 Does anyone know why the default behaviour for uploading zip archives is to 
 keep one file arbitrarily and throw out the rest?  Even with an argument in 
 favor of this behaviour, why is there not a unzippable_file_formats list 
 for exceptions to be made like there is for sniffing?
 
 Any enlightenment on these matters would be greatly appreciated.

Hi Eric,

It's happening in tools/data_source/upload.py, line 286:

line_count, converted_path = sniff.convert_newlines( dataset.path, in_place=in_place )

This should really be checking for any datatype subclassed from binary,
not binary itself.  As a quick workaround, add a check for your datatype
in the if/elif blocks around line 130 to avoid the default processing.
The upload tool needs to be rewritten to fix this, but it will be a
while before this is done.
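
To make the symptom and the suggested check concrete, here is a minimal
sketch.  It is not Galaxy's code: Data, Binary, Text and Foobar are stand-in
classes, and convert_newlines_bytes, needs_newline_conversion and store are
hypothetical helpers.  The conversion only mimics what was reported in this
thread (every '\r' becomes '\n' and a trailing '\n' is appended), and the
guard uses issubclass so that anything derived from Binary is left alone.

```python
class Data:
    """Stand-in for a generic datatype base class (hypothetical)."""


class Binary(Data):
    """Stand-in for a binary datatype base class (hypothetical)."""


class Foobar(Binary):
    """Custom datatype that must be stored byte-for-byte (hypothetical)."""


class Text(Data):
    """Stand-in for a line-oriented text datatype (hypothetical)."""


def convert_newlines_bytes(data: bytes) -> bytes:
    """Mimic a universal-newline rewrite: split on CR, LF or CRLF, rejoin with LF."""
    if not data:
        return data
    # bytes.splitlines() breaks on b"\r", b"\n" and b"\r\n" only.
    return b"\n".join(data.splitlines()) + b"\n"


def needs_newline_conversion(datatype_cls: type) -> bool:
    """Skip conversion for Binary itself and for anything derived from it."""
    return not issubclass(datatype_cls, Binary)


def store(data: bytes, datatype_cls: type) -> bytes:
    """Return the bytes that would be written for this datatype."""
    if needs_newline_conversion(datatype_cls):
        return convert_newlines_bytes(data)
    return data


payload = b"PK\x03\x04\rbinary\rdata"
print(store(payload, Text))    # line endings rewritten, one byte longer
print(store(payload, Foobar))  # returned unchanged
```

With this payload the Text result is one byte longer than the input, which
matches the size difference Eric describes, while the Foobar result is
identical to the input.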

--nate

 
 Best,
 Eric
 
 
 From: Leandro Hermida [soft...@leandrohermida.com]
 Sent: Friday, September 16, 2011 10:03 AM
 To: Paniagua, Eric
 Subject: Re: [galaxy-dev] uploading binary files checksum changes, Galaxy 
 doing something to file?
 
 Hi Eric,
 
 On Fri, Sep 16, 2011 at 3:58 PM, Paniagua, Eric epani...@cshl.edu wrote:
  Hi Leandro,
 
  Is there an entry in your history for the upload?  What file format does it 
  show?  Is there any chance your original file was zipped?  If Galaxy 
  detected it as a zip file on upload, it may have unzipped it and taken the 
  first file in it as the dataset.
 
 Yes, there is a history entry for the upload.  The format it shows is
 the new datatype I created (in datatypes_conf.xml, subclassing Binary),
 which I selected in the drop-down menu before uploading the file in
 the Get Data form.  It is not a zip file.
 
  That's at least the version of your problem that I've run into before.  
  Specifying the file format manually (rather than choosing Auto-detect) may 
  help if it's a similar problem.  I suspect the correct solution is to write 
  a sniffer for your datatype to help ensure it is identified correctly by 
  Galaxy, but I haven't tried this yet.
 
 
 Essentially, the basic question is: how do you tell Galaxy not to touch
 an uploaded binary file in *any* way?  The checksums should always
 match.
 
  Best of luck,
  Eric
  
  From: galaxy-dev-boun...@lists.bx.psu.edu 
  [galaxy-dev-boun...@lists.bx.psu.edu] on behalf of Leandro Hermida 
  [soft...@leandrohermida.com]
  Sent: Friday, September 16, 2011 9:42 AM
  To: Galaxy Dev
  Subject: [galaxy-dev] uploading binary files checksum changes,  Galaxy 
  doing something to file?
 
  Hi all,
 
  We tried to find something in the docs and mailing list, with no luck.  We
  created a new datatype that is a straight subclass of Binary, and
  when we upload such a file in the Galaxy UI and compare the checksums
  of the original file and the file located in the Galaxy
  database/files/... directory, they are different!

[galaxy-dev] uploading binary files checksum changes, Galaxy doing something to file?

2011-09-16 Thread Leandro Hermida
Hi all,

We tried to find something in the docs and mailing list, with no luck.  We
created a new datatype that is a straight subclass of Binary, and
when we upload such a file in the Galaxy UI and compare the checksums
of the original file and the file located in the Galaxy
database/files/... directory, they are different!

What are we doing wrong? We simply want Galaxy to upload the file and not
touch it at all.
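
For anyone wanting to repeat the comparison, a minimal sketch of the
checksum check in Python follows.  The helper names file_sha256 and
same_content are made up for illustration, and SHA-256 is an arbitrary
choice; any digest would show the same mismatch.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in binary mode."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large datasets do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(original_path, stored_path):
    """True when both files hold identical bytes according to SHA-256."""
    return file_sha256(original_path) == file_sha256(stored_path)
```

Call same_content with the path of the original file and the path of the
copy under database/files/...; opening both in binary mode matters, since a
text-mode read could itself translate line endings.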

regards,
Leandro
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] uploading binary files checksum changes, Galaxy doing something to file?

2011-09-16 Thread Paniagua, Eric
Hi Leandro,

Is there an entry in your history for the upload?  What file format does it 
show?  Is there any chance your original file was zipped?  If Galaxy detected 
it as a zip file on upload, it may have unzipped it and taken the first file in 
it as the dataset.

That's at least the version of your problem that I've run into before.  
Specifying the file format manually (rather than choosing Auto-detect) may help 
if it's a similar problem.  I suspect the correct solution is to write a 
sniffer for your datatype to help ensure it is identified correctly by Galaxy, 
but I haven't tried this yet.
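
One quick way to test the zip theory outside Galaxy is sketched below.
This uses only Python's standard zipfile module and is not Galaxy's upload
code; first_file_member is a hypothetical helper.  Note that
zipfile.is_zipfile inspects the file contents rather than the extension, so
a renamed archive is still recognised.

```python
import zipfile


def first_file_member(path):
    """Return the name of the first non-directory zip member.

    Returns None when the file is not a zip archive or holds no files.
    """
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            if not info.is_dir():
                return info.filename
    return None
```

If this returns a member name for your upload, the file is a zip archive
as far as Python is concerned, whatever its extension says.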

Best of luck,
Eric
