Re: [galaxy-user] primer contamination, miranalyzer

Jennifer Jackson Mon, 12 Nov 2012 09:55:00 -0800

Hello Rosie,

Please see my replies inline below.


On 11/12/12 4:00 AM, Rosie Griffiths wrote:

Hi Galaxy,

I've got two problems for you:

1) I've got microRNA Illumina NGS data that I want to analyse. I put it through 
FastQC on Galaxy, and it showed that 71% of the reads are one overrepresented 
sequence:

Sequence                                             Count     Percentage          Possible Source
GAATTCCACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAG  16896622  71.06413061961005   RNA PCR Primer, Index 1 (100% over 29bp)
CCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCTTGTAATCTC  525614    2.2106372475809497  RNA PCR Primer, Index 12 (100% over 44bp)
CCACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACC  416041    1.7497930632000402  RNA PCR Primer, Index 2 (100% over 34bp)

What would be the best way to remove this contamination? Also, is it still OK to 
use the data despite such high contamination?

You can try.

I've currently been trying to remove the sequence with the clip adapter 
tool, using the following options:

Library to clip: 2: FASTQ Groomer on data H1
Minimum sequence length (after clipping, sequences shorter than this length will be discarded): 15
Enter custom clipping sequence: GAATTCCACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAG
Enter non-zero value to keep the adapter sequence and x bases that follow it: 0
Discard sequences with unknown (N) bases: No
Output options: Output only non-clipped sequences (i.e. sequences which did not contained the adapter)

Did you really intend to discard the sequences that were clipped? Or perhaps the option "Output both clipped and non-clipped sequences" is what you intended? This would invoke the additional filters set, such as minimum length after clipping (15). Currently, with the option used, any sequence that is clipped at all is discarded as a first step.


About 75% of the reads will have some clipping.

At most 25% will be in the output, not counting other factors (sequences already under 15 bp in length, etc.).

This is a very hard hit and explains the current output of roughly 13%.

See next ->


Clipped reads - discarded.

Here ^^ you can see that any clipped sequences are discarded immediately. A re-run with the other option is recommended. It could be a negligible difference, but it seems worth a check if the goal is to recover what is usable.

Input: 23776583 reads.
Output: 3091831 reads.
discarded 1287140 too-short reads.
discarded 18984774 adapter-only reads.
discarded 412838 clipp

but then I'm only left with 13% of the reads.
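
For reference, the breakdown of the clipper report can be checked with a few lines of Python. This is only a sketch: the three counts are copied from the report quoted above, and the "other discarded" figure is simply derived as whatever those three categories do not account for (the last line of the report is cut off in this message, so its label is not assumed here).

```python
# Read counts copied from the clipper report quoted above.
total_input = 23776583
kept = 3091831
too_short = 1287140
adapter_only = 18984774

# Reads not accounted for by the three categories above.
remaining = total_input - kept - too_short - adapter_only


def percent(n, total=total_input):
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100.0 * n / total, 1)


rows = [
    ("kept", kept),
    ("too short", too_short),
    ("adapter only", adapter_only),
    ("other discarded", remaining),
]
for label, n in rows:
    print(f"{label:>16}: {n:>10} ({percent(n)}%)")
```

The "kept" row is the 13% mentioned above; most of the loss is in the "adapter only" category.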

2) After I've filtered and clipped the adapter I want to analyse the frequency 
of each miR. I've been using miranalyzer to do this, I use the following 
workflow

data => groomer => clip adapter => filter FASTQ (min quality 20) => FASTQ to FASTA => collapse


See below


the collapse file is like this;

1-17285268

GAATTCCACCACGTTCCCGTGG

2-522760

CCACCACGTTCCCGTGG

3-101198

TATTGCACTTGTCCCGGCCTGT

4-88745

I then upload the collapse file to miRanalyzer. However, the total number of 
reads in the miRanalyzer output is the same as the total number of sequences in 
the collapse file, so it doesn't seem to recognise the count number.

The miRanalyzer documentation says the following:
        
        2.1 Input formats
        
        miRanalyzer requires a single file containing the unique reads and 
their counts. The application accepts two different input formats:

        2.1.1 A tab or space separated file as in the following example 
(read-count format):

        GAGGTAGTAGGTTGTA        49862
        ACCCGTAGAACCGACC        15490
        ...     ...
        GGAGCATCTCTCGGTC        13762
        2.1.2 A multifasta file:

        >ID1 49862
        GAGGTAGTAGGTTGTA
        >ID2 15490
        ACCCGTAGAACCGACC
        ....
        >ID  13762
        GGAGCATCTCTCGGTC
        The description field must hold the read count. If not set, it is 
supposed to be 1. The file must have extension ’fa’, ’fasta’ or ’mfa’.

Do you know how I could change my format so it can recognise the read count 
e.g. maybe change the '-' to a space?

You have this correct: Convert the fasta -> tabular, convert the dash to a tab, then convert tab -> fasta (setting the new column as the description field).
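
Outside of Galaxy, the same header rewrite could be sketched in a few lines of Python. This is a minimal illustration, not part of any Galaxy tool: the function name `convert_collapsed_fasta` is made up for this example, and it assumes the collapse file is ordinary FASTA whose header lines look like `>1-17285268` (rank, dash, count). The leading `>` is an assumption, since the headers quoted in this message are shown without it.

```python
def convert_collapsed_fasta(lines):
    """Yield FASTA lines with '>rank-count' headers rewritten as '>rank count'.

    Hypothetical helper for illustration; assumes headers start with '>'.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Split on the last dash: everything after it must be the count.
            rank, sep, count = line[1:].rpartition("-")
            if not sep or not count.isdigit():
                raise ValueError(f"unexpected header: {line!r}")
            yield f">{rank} {count}"
        else:
            yield line  # sequence lines pass through unchanged


# Small example using the first two entries quoted above.
example = [
    ">1-17285268",
    "GAATTCCACCACGTTCCCGTGG",
    ">2-522760",
    "CCACCACGTTCCCGTGG",
]
converted = list(convert_collapsed_fasta(example))
print("\n".join(converted))
```

For a real file, the lines of an open file handle can be passed in place of `example`, and the result written to a file with one of the extensions miRanalyzer asks for ('fa', 'fasta' or 'mfa').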


3) I've recently got the local install of Galaxy but encounter the following 
error when I try to add a file to my data library

Are you set up as an admin? This is the default if you are running Galaxy straight as-is without any changes. You may also be running as for a 'production environment'. The links below have setup info for both. If you are configured and having problems, this would be a good question to send to the galaxy-...@bx.psu.edu mailing list as a brand new thread, and as a distinct question, to reach the developers. (No need to continue this thread or cc galaxy-user.) Include as much information about your local environment as possible (but nothing personal, like a password). I can't tell from this info what is going on, but it is very likely these gurus can!


http://getgalaxy.org
http://wiki.galaxyproject.org/Admin/Config/Performance/Production%20Server

http://wiki.galaxyproject.org/Admin/Data%20Libraries
http://wiki.galaxyproject.org/Admin/Data%20Libraries/Libraries

Best wishes for your project!

Jen
Galaxy team
http://wiki.galaxyproject.org/Support


Error attempting to display contents of library (New data library): 
(OperationalError) no such column: True u'SELECT dataset_permissions.id AS 
dataset_permissions_id, dataset_permissions.create_time AS 
dataset_permissions_create_time, dataset_permissions.update_time AS 
dataset_permissions_update_time, dataset_permissions.action AS 
dataset_permissions_action, dataset_permissions.dataset_id AS 
dataset_permissions_dataset_id, dataset_permissions.role_id AS 
dataset_permissions_role_id \nFROM dataset_permissions \nWHERE True AND 
dataset_permissions.action = ?' ['access'].

I've got the latest version of Galaxy and am using Chrome on OS X Mountain Lion.

changeset:   7986:12fcd068b12e
tag:         tip
user:        Daniel Blankenberg <d...@bx.psu.edu>
date:        Thu Oct 18 11:22:12 2012 -0400
summary:     Do not hide failed datasets with HideDatasetAction post job action.



Any help will be greatly appreciated

Thank you
Rosie Griffiths


___________________________________________________________
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

   http://lists.bx.psu.edu/


--
Jennifer Jackson
http://galaxyproject.org
