Re: [galaxy-dev] Why does check_html only read first 100 lines?

2013-11-25 Thread John Chilton
Hello Dr. Davidson,

  Responses inline.

On Wed, Nov 20, 2013 at 11:13 AM, Robert Davidson
bobbledavid...@yahoo.co.uk wrote:
 Hi all,

 I recently tried uploading a couple of XML files to my local Galaxy
 installation using the standard ‘Upload File (version 1.1.3)’ tool. For some
 files this produced the error: The uploaded file contains inappropriate HTML
 content.

 Given the files had been created by the same automated code and contained
 the same tags etc, I couldn't understand why one would produce this error
 and the other not.

 Finally tracking down the function check_html() in
 galaxy-dist/lib/galaxy/datatypes/checkers.py, I discovered that my use of
 the tag <metabolite> had been flagged as a likely <META> tag and produced the
 warning/failed upload.

 The reason this did not happen in every case is that check_html only reads
 the first 100 lines of the file and, depending upon how many samples were in
 my dataset, my <metabolite> tag could appear before or after this cutoff.

 I've solved the problem by changing my xml tag names but my question is:

 a) why does check_html only read up to line 100?

My guess is that this results from a desire to not process every line
of multi-gigabyte FASTQ files on each upload - this would be very
costly.

https://bitbucket.org/galaxy/galaxy-central/commits/35cc1687eb7b348811bf9e4f6e9d4374b00b6f09

I have reworked the code so that there is now an HTML_CHECK_LINES
variable at the top of this file - you can simply set that to None to
process all lines in your Galaxy distribution. My hope is that this
check is largely a bonus: the content is actually locked down when the
data is served back out. Indeed, if you try to upload a file with HTML
content after line 100, Galaxy still serves it out as plain text. If
you (or anyone) do discover a way to circumvent this and produce, say,
a cross-site scripting attack, please contact galaxy-b...@bx.psu.edu
or me right away and the problem will hopefully be addressed
promptly.
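
For anyone curious what a line-limited check looks like, here is a rough sketch of the idea. It is not the code from the commit above: the function name looks_like_html, the SUSPICIOUS pattern and the default of 100 are assumptions made for illustration, and only the HTML_CHECK_LINES name and the "None means all lines" behaviour come from this thread.

```python
import io
import re
from itertools import islice

# Assumed default for illustration; setting it to None scans every line.
HTML_CHECK_LINES = 100

# Illustrative pattern only, not the expression Galaxy actually uses.
SUSPICIOUS = re.compile(r"<\s*(script|iframe|frameset)\b", re.IGNORECASE)


def looks_like_html(stream, max_lines=HTML_CHECK_LINES):
    """Return True if one of the first max_lines lines looks like HTML.

    Passing max_lines=None scans the whole stream.
    """
    # islice(stream, None) simply iterates until the stream is exhausted,
    # so the same loop covers both the limited and the unlimited case.
    for line in islice(stream, max_lines):
        if SUSPICIOUS.search(line):
            return True
    return False


# Hypothetical upload whose only HTML-like content sits after line 100.
sample = io.StringIO("plain text\n" * 150 + "<script>alert(1)</script>\n")
limited = looks_like_html(sample)
sample.seek(0)
unlimited = looks_like_html(sample, max_lines=None)
```

With the default limit the late tag goes unnoticed, while max_lines=None catches it at the cost of reading the whole file, which is the trade-off for very large uploads.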

 b) would it be possible to change the regular expressions in check_html so
 that e.g. <meta ...> would be found but e.g. <metaxxx ...> would not?

I have pushed out an update that does exactly this. Sorry for the inconvenience.

https://bitbucket.org/galaxy/galaxy-central/commits/583c64d963d41c7e38f76e2be23531b160ec972f
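
To illustrate the kind of change involved, here is a small sketch. Both patterns below are my own illustrative assumptions (LOOSE_META and STRICT_META are hypothetical names), not the expressions in checkers.py before or after the commit. The point is only that requiring whitespace, "/" or ">" right after the tag name stops longer names such as metabolite from matching.

```python
import re

# Assumed "loose" form: matches any tag whose name merely starts with "meta".
LOOSE_META = re.compile(r"<\s*meta", re.IGNORECASE)

# Stricter form: the tag name must end immediately after "meta", i.e. the
# next character is whitespace, "/" or ">", or the line ends there.
STRICT_META = re.compile(r"<\s*meta(?=[\s/>]|$)", re.IGNORECASE)

lines = [
    '<META http-equiv="refresh" content="0">',  # a real meta tag
    '<metabolite id="1">',                      # an ordinary XML element
    '<metaxxx foo="bar">',                      # another longer tag name
]

loose_hits = [bool(LOOSE_META.search(line)) for line in lines]
strict_hits = [bool(STRICT_META.search(line)) for line in lines]
```

Here loose_hits is [True, True, True], while strict_hits is [True, False, False]. A lookahead is used rather than \b because a hyphen also counts as a word boundary, so \b would still flag a name like meta-data.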

Both of these updates should be included in the next galaxy-dist
(probably still over a month away).

Thanks for the e-mail and thanks for using Galaxy.

-John


 Thanks for reading.

 Rob

 Dr Robert L Davidson
 NERC Metabolomics Facility
 School of Biosciences
 University of Birmingham
 Edgbaston, UK


 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
   http://lists.bx.psu.edu/

 To search Galaxy mailing lists use the unified search at:
   http://galaxyproject.org/search/mailinglists/



[galaxy-dev] Why does check_html only read first 100 lines?

2013-11-20 Thread Robert Davidson
Hi all,
 
I recently tried uploading a couple of XML files to my
local Galaxy installation using the standard ‘Upload File (version 1.1.3)’
tool. For some files this produced the error: The uploaded file contains
inappropriate HTML content. 
 
Given the files had been created by the same automated
code and contained the same tags etc, I couldn't understand why one would
produce this error and the other not. 
 
Finally tracking down the function check_html() in
galaxy-dist/lib/galaxy/datatypes/checkers.py, I discovered that my use of the
tag <metabolite> had been flagged as a likely <META> tag and
produced the warning/failed upload.

The reason this did not happen in every case is that
check_html only reads the first 100 lines of the file and, depending upon how
many samples were in my dataset, my <metabolite> tag could appear before
or after this cutoff.
 
I've solved the problem by changing my xml tag names but
my question is: 
 
a) why does check_html only read up to line 100? 
b) would it be possible to change the regular expressions
in check_html so that e.g. <meta ...> would be found but e.g. <metaxxx
...> would not?
 
Thanks for reading.
 
Rob 
 
Dr Robert L Davidson
NERC Metabolomics Facility
School of Biosciences
University of Birmingham
Edgbaston, UK