Re: [galaxy-dev] Filter data on any column and missing values

2011-05-12 Thread Peter Cock
On Thu, Apr 7, 2011 at 7:00 PM, Peter Cock wrote:
> On Thu, Apr 7, 2011 at 4:28 PM, Peter Cock wrote:
>> Hi all,
>>
>> I have just found a problem using the "Filter data on any column using
>> simple expressions" tool, i.e. files tools/stats/filters.xml and
>> tools/stats/filters.py
>>
>> I have a six-column tabular file like this, where I have used \t for a
>> tab, and \n for the new lines:
>>
>> #ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n
>> gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n
>> gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n
>> gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n
>>
>> Breakdown of my data:
>>
>> Column 1 - ID, mandatory string
>> Column 2 - HMM_Sprob_score, mandatory float
>> Column 3 - SP_len, mandatory integer
>> Column 4 - RXLR_start, optional integer
>> Column 5 - EER_start, optional integer
>> Column 6 - RXLR?, mandatory string (Y or N)
>>
>> Notice that in my output columns 4 and 5 can be empty or an integer.
>>
>> I'm trying to filter this file using c6=='Y', i.e. column six is a
>> yes. This works (one row output) but Galaxy tells me:
>>
>> Info: Filtering with c6=='Y',
>> kept 100.00% of 4 lines.
>> Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
>> SP_len RXLR_start EER_start RXLR?"
>>
>> Then if I try to filter using c6=='N', i.e. column six is a no, it
>> fails to work (zero rows of output instead of two) and tells me:
>>
>> kept 0.00% of 4 lines.
>> Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
>> SP_len RXLR_start EER_start RXLR?"
>>
>> Digging into the code, tools/stats/filters.py gets given the list of
>> column types from Galaxy and (regardless of which columns are to be
>> used) attempts to cast them to integers, floats, etc.
>>
>> It looks like Galaxy has decided that my columns 4 and 5 are integers
>> (based on the first row), and therefore filters.py blindly tries to
>> use int(...) on all these entries, which fails on the empty cells.
>>
>> I see several issues:
>>
>> (a) The filters.py tool only really needs to cast those columns being
>> used for the filter (fairly easy to fix)
>> (b) The galaxy column type detection seems a bit fragile (hard to
>> really fix without looking at all the data).
>> (c) Are there other tools that would break in a similar way to filters.py?
>
> Also:
> (d) This probably also explains why the filter tool doesn't like my header
> row (which starts with a #) since the captions are not numeric. Skipping
> these is probably a different bug fix though.
>
> Peter

To address these issues with the filters.py tool I've filed the
following bugs with fixes:

https://bitbucket.org/galaxy/galaxy-central/issue/535/
https://bitbucket.org/galaxy/galaxy-central/issue/536/
https://bitbucket.org/galaxy/galaxy-central/issue/537/

Peter
___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-dev] Filter data on any column and missing values

2011-04-07 Thread Peter Cock
On Thu, Apr 7, 2011 at 4:28 PM, Peter Cock wrote:
> Hi all,
>
> I have just found a problem using the "Filter data on any column using
> simple expressions" tool, i.e. files tools/stats/filters.xml and
> tools/stats/filters.py
>
> I have a six-column tabular file like this, where I have used \t for a
> tab, and \n for the new lines:
>
> #ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n
> gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n
> gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n
> gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n
>
> Breakdown of my data:
>
> Column 1 - ID, mandatory string
> Column 2 - HMM_Sprob_score, mandatory float
> Column 3 - SP_len, mandatory integer
> Column 4 - RXLR_start, optional integer
> Column 5 - EER_start, optional integer
> Column 6 - RXLR?, mandatory string (Y or N)
>
> Notice that in my output columns 4 and 5 can be empty or an integer.
>
> I'm trying to filter this file using c6=='Y', i.e. column six is a
> yes. This works (one row output) but Galaxy tells me:
>
> Info: Filtering with c6=='Y',
> kept 100.00% of 4 lines.
> Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
> SP_len RXLR_start EER_start RXLR?"
>
> Then if I try to filter using c6=='N', i.e. column six is a no, it
> fails to work (zero rows of output instead of two) and tells me:
>
> kept 0.00% of 4 lines.
> Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
> SP_len RXLR_start EER_start RXLR?"
>
> Digging into the code, tools/stats/filters.py gets given the list of
> column types from Galaxy and (regardless of which columns are to be
> used) attempts to cast them to integers, floats, etc.
>
> It looks like Galaxy has decided that my columns 4 and 5 are integers
> (based on the first row), and therefore filters.py blindly tries to
> use int(...) on all these entries, which fails on the empty cells.
>
> I see several issues:
>
> (a) The filters.py tool only really needs to cast those columns being
> used for the filter (fairly easy to fix)
> (b) The galaxy column type detection seems a bit fragile (hard to
> really fix without looking at all the data).
> (c) Are there other tools that would break in a similar way to filters.py?

Also:
(d) This probably also explains why the filter tool doesn't like my header
row (which starts with a #) since the captions are not numeric. Skipping
these is probably a different bug fix though.
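
For illustration, point (d) could be handled by setting aside leading
comment lines before any type casting happens. This is only a sketch under
assumptions: the function name split_header and its comment_char parameter
are hypothetical and are not taken from filters.py.

```python
def split_header(lines, comment_char="#"):
    """Separate leading comment lines from the data lines that follow.

    Only lines at the very top of the file count as header lines; a later
    line that happens to start with the comment character is kept as data.
    """
    header, data = [], []
    for line in lines:
        if not data and line.startswith(comment_char):
            header.append(line)
        else:
            data.append(line)
    return header, data


example = [
    "#ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n",
    "gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n",
]
header_lines, data_lines = split_header(example)
```

With the header separated this way, the caption row would never reach the
numeric casts, so it would not be counted as an invalid line.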

Peter


[galaxy-dev] Filter data on any column and missing values

2011-04-07 Thread Peter Cock
Hi all,

I have just found a problem using the "Filter data on any column using
simple expressions" tool, i.e. files tools/stats/filters.xml and
tools/stats/filters.py

I have a six-column tabular file like this, where I have used \t for a
tab, and \n for the new lines:

#ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n
gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n
gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n
gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n

Breakdown of my data:

Column 1 - ID, mandatory string
Column 2 - HMM_Sprob_score, mandatory float
Column 3 - SP_len, mandatory integer
Column 4 - RXLR_start, optional integer
Column 5 - EER_start, optional integer
Column 6 - RXLR?, mandatory string (Y or N)

Notice that in my output columns 4 and 5 can be empty or an integer.

I'm trying to filter this file using c6=='Y', i.e. column six is a
yes. This works (one row output) but Galaxy tells me:

Info: Filtering with c6=='Y',
kept 100.00% of 4 lines.
Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
SP_len RXLR_start EER_start RXLR?"

Then if I try to filter using c6=='N', i.e. column six is a no, it
fails to work (zero rows of output instead of two) and tells me:

kept 0.00% of 4 lines.
Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
SP_len RXLR_start EER_start RXLR?"

Digging into the code, tools/stats/filters.py gets given the list of
column types from Galaxy and (regardless of which columns are to be
used) attempts to cast them to integers, floats, etc.

It looks like Galaxy has decided that my columns 4 and 5 are integers
(based on the first row), and therefore filters.py blindly tries to
use int(...) on all these entries, which fails on the empty cells.
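
The failure described above can be reproduced in a few lines. This is a
minimal sketch, not the actual filters.py code: the name cast_row and the
hard-coded COLUMN_TYPES list are assumptions standing in for the column
types Galaxy passes to the tool.

```python
# Hypothetical illustration; not the real tools/stats/filters.py code.
# Column types as Galaxy might report them for the six-column example.
COLUMN_TYPES = [str, float, int, int, int, str]


def cast_row(line, column_types=COLUMN_TYPES):
    """Cast every field of a tab-separated line, whether or not it is used."""
    fields = line.rstrip("\r\n").split("\t")
    return [cast(value) for cast, value in zip(column_types, fields)]


rows = [
    "#ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n",
    "gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n",
    "gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n",
    "gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n",
]

results = []
for row in rows:
    try:
        cast_row(row)
        results.append("ok")
    except ValueError:
        # float("HMM_Sprob_score") and int("") both raise ValueError,
        # so the whole line is treated as invalid.
        results.append("invalid")
```

Here only the second line casts cleanly; the header and the two rows with
empty cells are rejected, which matches the "Skipped 3 invalid lines"
message.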

I see several issues:

(a) The filters.py tool only really needs to cast those columns being
used for the filter (fairly easy to fix)
(b) The galaxy column type detection seems a bit fragile (hard to
really fix without looking at all the data).
(c) Are there other tools that would break in a similar way to filters.py?
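
As a rough sketch of what (a) might look like, the code below casts only
the columns that the filter expression mentions. It is not the fix
submitted to Galaxy: the helper names columns_used, cast_used_columns and
keep_line are hypothetical, the type list is hard-coded, and the regular
expression is a simplification that would also match a c<N> token inside a
quoted string.

```python
import re

# Hypothetical helper names; not the real filters.py code.
COLUMN_TYPES = [str, float, int, int, int, str]


def columns_used(expression):
    """Return the sorted 1-based column numbers referenced as c1, c2, ..."""
    return sorted({int(n) for n in re.findall(r"\bc(\d+)\b", expression)})


def cast_used_columns(line, expression, column_types=COLUMN_TYPES):
    """Cast only the fields that the expression refers to."""
    fields = line.rstrip("\r\n").split("\t")
    return {
        "c%d" % n: column_types[n - 1](fields[n - 1])
        for n in columns_used(expression)
    }


def keep_line(line, expression):
    """Evaluate the filter expression against the cast columns of one line."""
    names = cast_used_columns(line, expression)
    # eval is used only to keep the sketch short; real code must validate
    # the expression before evaluating it.
    return bool(eval(expression, {"__builtins__": {}}, names))


row_n = "gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n"
kept = keep_line(row_n, "c6=='N'")
```

Because columns 4 and 5 are never cast for c6=='N', the empty cells no
longer cause the row to be rejected. A filter that does reference column 4
or 5 would still fail on the empty cells, so this addresses (a) but not (b).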

Peter