Re: [galaxy-user] Text Editing
Thanks! This worked perfectly.

luce

On 9 Dec, 2011, at 2:31 PM, Dave Clements wrote:

> Hi Luce,
>
> I'm forwarding this question to the Galaxy-User mailing list, as I think this
> is a pretty common situation.
>
> Here's how I replace text in a column. It's a two step process for each
> dataset.
>
> First go to Text Manipulation -> Compute.
>
> In the Add expression text box enter
>
>     columnNum.replace("oldVal", "newVal")
>
> In your case I think this is
>
>     c4.replace("MACS_peak_", "treatment1_peak_", 1)
>
> "replace" is a Python character string operation, and c4 is the character
> string column we are working on. I added the 1 out of paranoia. This tells
> galaxy to only replace the first occurrence of the old string, in each line.
> Care must be taken to avoid more replacement than you want.
>
> Executing this will create a dataset with a new column at the end.
>
> Now, use the Text Manipulation -> Cut operation to substitute the new column
> in place of the old column.
>
> Does that do the trick?
>
> Thanks,
>
> Dave C.
>
> On Thu, Dec 8, 2011 at 4:24 PM, las2017 wrote:
> I have two ChIPSeq datasets, and I am trying to find the common and distinct
> peaks between them and visualize them. I end up with a MACS bed file for each
> (listing a bunch of MACS_peaks). I then use the Intersect and Subtract tools
> from the Genomic Intervals tab and end up with the peaks I want. However,
> because of the way that MACS names its peaks, there can end up being some
> peaks named the same way in both files (because, for example, peak 20 in
> file1 is from position 300,000-300,500 but peak 20 in file 2 is from position
> 320,000-320,500). So, I can end up with multiple peaks with the same name.
> Because all the peak names have the same form, it can also be difficult to
> tell them apart when visualizing them in the UCSC Genome Browser.
>
> What I would like to do is to be able to edit the bed file to change the text
> MACS_peak_ to, say, treatment1_peak_ so that peak 20 would
> now still be numbered 20 in both files, but would have a different label.
> This would be pretty easy to do using regular expressions and sed.
>
> I know there have been a few posts about text manipulation, and I know that
> there is a text manipulation tab, but I can't seem to find an easy way to do
> what I want to do.
>
> Any advice?
>
> Thanks, luce
>
> --
> http://galaxyproject.org/
> http://getgalaxy.org/
> http://usegalaxy.org/
> http://galaxyproject.org/wiki/

___
The Galaxy User list should be used for the discussion of Galaxy analysis and
other features on the public server at usegalaxy.org. Please keep all replies
on the list by using "reply all" in your mail client. For discussion of local
Galaxy instances and the Galaxy source code, please use the Galaxy Development
list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists, please use the
interface at:

  http://lists.bx.psu.edu/
Re: [galaxy-user] Text Editing
Hi Luce,

I'm forwarding this question to the Galaxy-User mailing list, as I think this
is a pretty common situation.

Here's how I replace text in a column. It's a two-step process for each
dataset.

First go to Text Manipulation -> Compute.

In the Add expression text box enter

    columnNum.replace("oldVal", "newVal")

In your case I think this is

    c4.replace("MACS_peak_", "treatment1_peak_", 1)

"replace" is a Python character string operation, and c4 is the character
string column we are working on. I added the 1 out of paranoia. It tells
Galaxy to replace only the first occurrence of the old string in each line.
Care must be taken to avoid more replacement than you want.

Executing this will create a dataset with a new column at the end.

Now, use the Text Manipulation -> Cut operation to substitute the new column
in place of the old column.

Does that do the trick?

Thanks,

Dave C.

On Thu, Dec 8, 2011 at 4:24 PM, las2017 wrote:

> I have two ChIPSeq datasets, and I am trying to find the common and
> distinct peaks between them and visualize them. I end up with a MACS bed
> file for each (listing a bunch of MACS_peaks). I then use the Intersect and
> Subtract tools from the Genomic Intervals tab and end up with the peaks I
> want. However, because of the way that MACS names its peaks, there can end
> up being some peaks named the same way in both files (because, for example,
> peak 20 in file1 is from position 300,000-300,500 but peak 20 in file 2 is
> from position 320,000-320,500). So, I can end up with multiple peaks with
> the same name. Because all the peak names have the same form, it can also
> be difficult to tell them apart when visualizing them in the UCSC Genome
> Browser.
>
> What I would like to do is to be able to edit the bed file to change the
> text MACS_peak_ to, say, treatment1_peak_ so that peak 20
> would now still be numbered 20 in both files, but would have a different
> label. This would be pretty easy to do using regular expressions and sed.
>
> I know there have been a few posts about text manipulation, and I know
> that there is a text manipulation tab, but I can't seem to find an easy way
> to do what I want to do.
>
> Any advice?
>
> Thanks, luce

--
http://galaxyproject.org/
http://getgalaxy.org/
http://usegalaxy.org/
http://galaxyproject.org/wiki/
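For readers who want to see what the Compute expression in Dave's message does outside Galaxy, here is a minimal Python sketch. It relies only on the built-in str.replace with its optional count argument. The function name relabel and the sample BED line are made up for illustration, and modelling Compute and Cut as list operations is an assumption about their effect, not how Galaxy implements them.

```python
def relabel(name: str, old: str = "MACS_peak_", new: str = "treatment1_peak_") -> str:
    """Replace only the first occurrence of `old` in `name` (count=1)."""
    return name.replace(old, new, 1)


# Hypothetical 5-column BED line: chrom, start, end, name, score.
line = "chr1\t300000\t300500\tMACS_peak_20\t55.1"
fields = line.split("\t")

# "Compute" step: the result is appended as a new last column (c6).
computed = fields + [relabel(fields[3])]

# "Cut" step: keep c1,c2,c3,c6,c5 so the new name takes the place of c4.
final = [computed[0], computed[1], computed[2], computed[5], computed[4]]

print("\t".join(final))
```

The count of 1 only matters when the old string occurs more than once in the name; without it, every occurrence would be replaced.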
[galaxy-user] Text Editing
Hello Luce,

I can explain the use of the "Text Manipulation" tools. For each file
independently, the following steps will rename the "name" identifier in
column 4. I don't believe that there is a more direct method, but you may
discover one. This type of customization is why the tools are distinct: they
can be used in sequence to do many of the same text manipulations as on the
Unix command line. There is a biosed command as part of EMBOSS, but that tool
works on sequence text, not text files in general.

To save time in the future, these steps can be put into a workflow, with an
edit of step 2 to customize the new ID text as needed when run.

Starting with a 5 column MACS BED file:

1 - Save the track header line with the tool "Select first lines from a
    dataset" with the option to save line 1.

2 - Create the working dataset that does not include the first line with the
    tool "Remove beginning of a file" with the option "Remove first: 1" lines.

3 - Split up the existing ID with the tool "Convert delimiters to TAB" using
    the "Underscores" option. This will split the fourth "name" column into
    three distinct columns; the last new column will be used to create the
    new ID.

4 - Create a column in your file containing the text "treatment1_peak_" with
    the tool "Add column to an existing dataset". This will create an extra
    column at the end of the BED file, to be used in the new ID.

    The file should now be:

    c1 - chrom
    c2 - start
    c3 - end
    c4 - the text "MACS"
    c5 - the text "peak"
    c6 - a number, the second part of the new ID
    c7 - score
    c8 - the text "treatment1_peak_" (or "treatment2_peak_" if the second file)

5 - Merge the two ID portions with the tool "Merge Columns together" using
    the option of merging column c8 with c6. This will create a new field,
    c9, with the text "treatment1_peak_N" (or "treatment2_peak_N" for the
    second file), where "N" is whatever the number in c6 was, per row.

6 - Create the new BED file, putting the new "name" column in the correct
    order and omitting the columns not needed, using the tool "Cut columns
    from a table" and pasting into the "Cut columns:" box this text (no
    quotes): c1,c2,c3,c9,c7

7 - Add back the track line (saved in step 1) with the tool "Concatenate
    datasets tail-to-head" with the options set to concatenate the output of
    step 1 as the first file and the output of step 6 as the second file.

8 - Use the Edit Attributes form to change the file type back to BED and
    assign all five columns to the proper attribute (click on the pencil icon
    to reach the form).

Hopefully this will work (it did for my test) or is enough information for
you to work out the exact steps for your particular datasets.

Next time, please send data/tool questions directly "to" the
galaxy-u...@bx.psu.edu mailing list. Replies should be sent "reply-all". The
outreach account is for other purposes.

las2...@med.cornell.edu wrote:

> I have two ChIPSeq datasets, and I am trying to find the common and
> distinct peaks between them and visualize them. I end up with a MACS bed
> file for each (listing a bunch of MACS_peaks). I then use the Intersect and
> Subtract tools from the Genomic Intervals tab and end up with the peaks I
> want. However, because of the way that MACS names its peaks, there can end
> up being some peaks named the same way in both files (because, for example,
> peak 20 in file1 is from position 300,000-300,500 but peak 20 in file 2 is
> from position 320,000-320,500). So, I can end up with multiple peaks with
> the same name. Because all the peak names have the same form, it can also
> be difficult to tell them apart when visualizing them in the UCSC Genome
> Browser.
>
> What I would like to do is to be able to edit the bed file to change the
> text MACS_peak_ to, say, treatment1_peak_ so that peak 20
> would now still be numbered 20 in both files, but would have a different
> label. This would be pretty easy to do using regular expressions and sed.
>
> I know there have been a few posts about text manipulation, and I know
> that there is a text manipulation tab, but I can't seem to find an easy way
> to do what I want to do.
>
> Any advice?
>
> Thanks, luce

--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support
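As a cross-check of the column bookkeeping in Jennifer's eight steps, the sequence can be sketched in plain Python. This is only an approximation under stated assumptions: the function name relabel_bed and the sample data are hypothetical, the input is assumed to be a 5-column BED file whose first line is a track header, and step 8 (Edit Attributes) is Galaxy metadata and is not modelled.

```python
def relabel_bed(text: str, prefix: str = "treatment1_peak_") -> str:
    """Mimic steps 1-7: split the name on underscores, rebuild it with `prefix`."""
    lines = text.splitlines()
    header, body = lines[0], lines[1:]          # steps 1 and 2
    out = []
    for row in body:
        # Step 3: every underscore in the line becomes a TAB, so this
        # assumes underscores appear only in the name column.
        cols = row.replace("_", "\t").split("\t")
        cols.append(prefix)                     # step 4: new column c8
        cols.append(cols[7] + cols[5])          # step 5: c9 = c8 + c6
        # Step 6: cut c1,c2,c3,c9,c7
        out.append("\t".join([cols[0], cols[1], cols[2], cols[8], cols[6]]))
    return "\n".join([header] + out)            # step 7: header first


# Hypothetical input: a track line followed by two MACS peaks.
sample = (
    'track name="MACS peaks"\n'
    "chr1\t300000\t300500\tMACS_peak_20\t55.1\n"
    "chr1\t320000\t320500\tMACS_peak_21\t40.2\n"
)

print(relabel_bed(sample))
```

Because step 3 splits on every underscore in the row, a chromosome name that itself contains an underscore would shift the column numbers; the sketch shares that limitation with the tool sequence it mirrors.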