Re: [galaxy-user] operate on genomic intervals

2012-05-08 Thread Jennifer Jackson

Hi Jose,

That's great news! A tool tip in the wiki or UI would probably be
helpful - your question was a good one. Meanwhile, I'll post your
results back to the list, as they may help others who are also working
to optimize.


Glad that it worked out so well,

Jen
Galaxy team

On 5/8/12 2:57 PM, Xianrong Wong wrote:

Thank you Jennifer!  I swapped the order of the input datasets which
reduced the running time from 2 hrs to 10 min!
Jose

On Mon, May 7, 2012 at 12:03 PM, Jennifer Jackson <j...@bx.psu.edu> wrote:

Hi Jose,

Very glad to know that you have this working.

Your question is difficult to address with specificity, as these are
completely different algorithms. But in general, any alignment
algorithm (Bowtie included) has some sort of indexing strategy (some
are better than others) to minimize what is held in memory while
bulk data is processed through it. See the Bowtie documentation for
how this is achieved.

The interval operations tools also have an indexing strategy:
specifically, the second input file is the portion loaded into memory
and the first file is processed against it. So, if you want to use an
extremely large dataset (or just want the job to run quicker), try
to use it as the first input file if possible.

These tools are designed to be used together and with other tools to
create workflows, so there should pretty much always be some way to
break jobs up (as you did) to get them through the tools, even on
modest systems:
http://wiki.g2.bx.psu.edu/Learn/Interval%20Operations


Take care,

Jen
Galaxy team



On 5/7/12 8:30 AM, Xianrong Wong wrote:

Hello Jennifer,  thanks for the advice.  It worked when I did it a
chromosome at a time.  Is there a reason why this is so much more
computationally heavy compared to bowtie?  (Mapping 190 million reads
took only 3-4 hours for me.)
Jose

On Fri, May 4, 2012 at 7:55 PM, Jennifer Jackson <j...@bx.psu.edu> wrote:

Hello Jose,

It sounds as if the job is running out of memory. Since you are
already working on a cloud, I am going to make the assumption that
you have explored the server options with high-capacity memory
there. But if not, that is one place to start, in particular your
EC2 Instance type, as described on this wiki:
http://wiki.g2.bx.psu.edu/Admin/Cloud/CapacityPlanning

However, even if that is an option, you may want to consider
running your data through in another way - by running smaller
jobs, then merging results, to avoid the large jobs. For example, in
the last step where you join to the "full BamHI delimited bin file",
instead join to groups of bins in that file (perhaps grouped by
chromosome), then combine the results to produce the full output.

Hopefully this helps provide some options,

Jen
Galaxy team


On 5/4/12 2:18 PM, Xianrong Wong wrote:

Hello,
 I have binned the mouse genome into fragments based on
restriction enzyme cut sites.  So each bin is a fragment flanked by say
BamHI.  The output file is in the interval format: chr# start and end
coordinates of each bin.  I want to count how many times each bin has
reads that align to it.  I mapped my reads using bowtie and generated a
dataset (interval format) for the aligned reads.  I then used join in
"operate on genomic intervals" and asked it to return intervals that
innerjoin the "bin file".  The subsequent steps involve grouping and
counting and then joining back to the 1st dataset (BamHI delimited
bins).  I have tried this workflow on small datasets and it worked.
However when I subject my full alignment file and full BamHI delimited
bin file, the tool fails.  I am doing this on cloud.  Any advice would
be appreciated!

Jose




Re: [galaxy-user] operate on genomic intervals

2012-05-07 Thread Jennifer Jackson

Hi Jose,

Very glad to know that you have this working.

Your question is difficult to address with specificity, as these are
completely different algorithms. But in general, any alignment algorithm
(Bowtie included) has some sort of indexing strategy (some are better
than others) to minimize what is held in memory while bulk data is
processed through it. See the Bowtie documentation for how this is achieved.


The interval operations tools also have an indexing strategy:
specifically, the second input file is the portion loaded into memory and
the first file is processed against it. So, if you want to use an extremely
large dataset (or just want the job to run quicker), try to use it as
the first input file if possible.
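
To make the "second file in memory, first file streamed" idea concrete, here is a minimal Python sketch. It is not the code the Galaxy interval tools actually run; the function names (build_index, stream_join), the half-open coordinate convention, and the simple sorted-list lookup are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(second):
    """Load the second dataset into memory, grouped and sorted by chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in second:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, items in by_chrom.items():
        items.sort()
        # Keep the sorted start positions alongside the intervals for bisecting.
        index[chrom] = ([s for s, _ in items], items)
    return index


def stream_join(first, index):
    """Yield (row_from_first, row_from_second) for every overlapping pair.

    `first` may be any iterator, e.g. a generator over a very large file,
    so only one of its rows is held in memory at a time.
    Coordinates are treated as half-open: [start, end).
    """
    for chrom, start, end in first:
        starts, items = index.get(chrom, ([], []))
        # Only in-memory intervals that start before `end` can overlap.
        stop = bisect_left(starts, end)
        for other_start, other_end in items[:stop]:
            if other_end > start:
                yield (chrom, start, end), (chrom, other_start, other_end)


# Hypothetical data: the small dataset is indexed, the large one is streamed.
bins = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 50)]
reads = iter([("chr1", 90, 110), ("chr1", 150, 160), ("chr3", 1, 2)])
pairs = list(stream_join(reads, build_index(bins)))
```

Memory use in this sketch grows with the size of the second dataset only, which is why putting the very large dataset first helps. The inner scan is linear for simplicity; a production tool would more likely use an interval tree or similar structure.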


These tools are designed to be used together and with other tools to 
create workflows, so there should pretty much always be some way to 
break jobs up (as you did) to get them through the tools, on even modest 
systems: http://wiki.g2.bx.psu.edu/Learn/Interval%20Operations


Take care,

Jen
Galaxy team


On 5/7/12 8:30 AM, Xianrong Wong wrote:

Hello Jennifer,  thanks for the advice.  It worked when I did it a
chromosome at a time.  Is there a reason why this is so much more
computationally heavy compared to bowtie?  (Mapping 190 million reads
took only 3-4 hours for me.)
Jose

On Fri, May 4, 2012 at 7:55 PM, Jennifer Jackson <j...@bx.psu.edu> wrote:

Hello Jose,

It sounds as if the job is running out of memory. Since you are
already working on a cloud, I am going to make the assumption that
you have explored the server options with high-capacity memory
there. But if not, that is one place to start, in particular your
EC2 Instance type, as described on this wiki:
http://wiki.g2.bx.psu.edu/Admin/Cloud/CapacityPlanning


However, even if that is an option, you may want to consider
running your data through in another way - by running smaller
jobs, then merging results, to avoid the large jobs. For example, in
the last step where you join to the "full BamHI delimited bin file",
instead join to groups of bins in that file (perhaps grouped by
chromosome), then combine the results to produce the full output.

Hopefully this helps provide some options,

Jen
Galaxy team


On 5/4/12 2:18 PM, Xianrong Wong wrote:

Hello,
 I have binned the mouse genome into fragments based on
restriction enzyme cut sites.  So each bin is a fragment flanked by say
BamHI.  The output file is in the interval format: chr# start and end
coordinates of each bin.  I want to count how many times each bin has
reads that align to it.  I mapped my reads using bowtie and generated a
dataset (interval format) for the aligned reads.  I then used join in
"operate on genomic intervals" and asked it to return intervals that
innerjoin the "bin file".  The subsequent steps involve grouping and
counting and then joining back to the 1st dataset (BamHI delimited
bins).  I have tried this workflow on small datasets and it worked.
However when I subject my full alignment file and full BamHI delimited
bin file, the tool fails.  I am doing this on cloud.  Any advice would
be appreciated!

Jose


___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/


--
Jennifer Jackson
http://galaxyproject.org




--
Jennifer Jackson
http://galaxyproject.org


Re: [galaxy-user] operate on genomic intervals

2012-05-04 Thread Jennifer Jackson

Hello Jose,

It sounds as if the job is running out of memory. Since you are already 
working on a cloud, I am going to make the assumption that you have 
explored the server options with high-capacity memory there. But if not, 
that is one place to start, in particular your EC2 Instance type, as 
described on this wiki: 
http://wiki.g2.bx.psu.edu/Admin/Cloud/CapacityPlanning


However, even if that is an option, you may want to consider running
your data through in another way - by running smaller jobs, then
merging results, to avoid the large jobs. For example, in the last step
where you join to the "full BamHI delimited bin file", instead join to 
groups of bins in that file (perhaps grouped by chromosome), then 
combine the results to produce the full output.
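
As a rough illustration of the "smaller jobs, then merge" approach, the sketch below splits both datasets by chromosome, runs a join on each chunk, and concatenates the outputs. The helper names (split_by_chrom, join_in_chunks, naive_overlap_join) are hypothetical, and the naive join is only a stand-in for whatever join step the workflow really uses.

```python
from collections import defaultdict


def split_by_chrom(rows):
    """Group rows (chrom, start, end) by their chromosome column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups


def naive_overlap_join(a, b):
    """Stand-in join: all pairs on the same chromosome that overlap."""
    return [
        (x, y)
        for x in a
        for y in b
        if x[0] == y[0] and x[1] < y[2] and y[1] < x[2]
    ]


def join_in_chunks(reads, bins, join=naive_overlap_join):
    """Run `join` one chromosome at a time and combine the results."""
    reads_by_chrom = split_by_chrom(reads)
    bins_by_chrom = split_by_chrom(bins)
    combined = []
    for chrom in sorted(bins_by_chrom):
        chunk = join(reads_by_chrom.get(chrom, []), bins_by_chrom[chrom])
        combined.extend(chunk)
    return combined


# Hypothetical data in (chrom, start, end) form.
reads = [("chr1", 5, 15), ("chr2", 0, 10)]
bins = [("chr1", 0, 10), ("chr1", 10, 20), ("chr2", 5, 25)]
chunked = join_in_chunks(reads, bins)
```

Because intervals on different chromosomes can never overlap, the combined per-chromosome output contains the same pairs as a single full join, while each individual job only touches a fraction of the data.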


Hopefully this helps provide some options,

Jen
Galaxy team

On 5/4/12 2:18 PM, Xianrong Wong wrote:

Hello,
 I have binned the mouse genome into fragments based on
restriction enzyme cut sites.  So each bin is a fragment flanked by say
BamHI.  The output file is in the interval format: chr# start and end
coordinates of each bin.  I want to count how many times each bin has
reads that align to it.  I mapped my reads using bowtie and generated a
dataset (interval format) for the aligned reads.  I then used join in
"operate on genomic intervals" and asked it to return intervals that
innerjoin the "bin file".  The subsequent steps involve grouping and
counting and then joining back to the 1st dataset (BamHI delimited
bins).  I have tried this workflow on small datasets and it worked.
However when I subject my full alignment file and full BamHI delimited
bin file, the tool fails.  I am doing this on cloud.  Any advice would
be appreciated!

Jose




--
Jennifer Jackson
http://galaxyproject.org


[galaxy-user] operate on genomic intervals

2012-05-04 Thread Xianrong Wong
Hello,
I have binned the mouse genome into fragments based on restriction
enzyme cut sites.  So each bin is a fragment flanked by say BamHI.  The
output file is in the interval format: chr# start and end coordinates of
each bin.  I want to count how many times each bin has reads that align to
it.  I mapped my reads using bowtie and generated a dataset (interval
format) for the aligned reads.  I then used join in "operate on genomic
intervals" and asked it to return intervals that innerjoin the "bin file".
The subsequent steps involve grouping and counting and then joining back to
the 1st dataset (BamHI delimited bins).  I have tried this workflow on
small datasets and it worked.  However when I subject my full alignment
file and full BamHI delimited bin file, the tool fails.  I am doing this on
cloud.  Any advice would be appreciated!

Jose

Re: [galaxy-user] Operate on genomic intervals: Join

2012-01-18 Thread Jennifer Jackson

Hello Steve,

Yes, this is true: many of the Interval Operations tools do not
interpret strand. This means that you would need to first filter the data
by strand, then perform the operation per strand to obtain an inner join
result.
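
The filter-then-operate idea can be sketched in Python as follows. This is only an illustration under assumptions: rows are (chrom, start, end, strand) tuples with the strand in column index 3, and naive_overlap_join is a hypothetical stand-in for the strand-unaware join tool.

```python
def naive_overlap_join(a, b):
    """Stand-in for a strand-unaware inner join on overlapping intervals."""
    return [
        (x, y)
        for x in a
        for y in b
        if x[0] == y[0] and x[1] < y[2] and y[1] < x[2]
    ]


def split_by_strand(rows, strand_col=3):
    """Return (plus_rows, minus_rows); rows with any other strand are dropped."""
    plus = [r for r in rows if r[strand_col] == "+"]
    minus = [r for r in rows if r[strand_col] == "-"]
    return plus, minus


def stranded_join(a, b, join=naive_overlap_join, strand_col=3):
    """Join plus with plus and minus with minus, then combine the results."""
    a_plus, a_minus = split_by_strand(a, strand_col)
    b_plus, b_minus = split_by_strand(b, strand_col)
    return join(a_plus, b_plus) + join(a_minus, b_minus)


# Hypothetical data: the two rows in `a` differ only by strand.
a = [("chr1", 0, 10, "+"), ("chr1", 0, 10, "-")]
b = [("chr1", 5, 15, "+")]
result = stranded_join(a, b)
```

In Galaxy terms this corresponds to filtering each dataset on the strand column, running the join once per strand, and concatenating the two outputs.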


Help, including links to screencasts:
http://wiki.g2.bx.psu.edu/Learn/Interval%20Operations

Best,

Jen
Galaxy team

On 11/3/11 1:13 PM, Stephen Eacker wrote:

Hello,
I am using operate on genomic intervals on some data and it always seems to ignore 
the strand information.  Am I missing something or does "operate on genomic 
intervals" disregard strand information?  Is there a tool that does the inner join 
function and takes into account strand information?

thanks,

Steve

Stephen Eacker, Ph.D.
Postdoctoral Fellow
Dawson Lab
Institute for Cell Engineering
Johns Hopkins Medical Institute
(443) 287-5605
seack...@jhmi.edu






--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support


[galaxy-user] Operate on genomic intervals: Join

2011-11-03 Thread Stephen Eacker
Hello,
I am using operate on genomic intervals on some data and it always seems 
to ignore the strand information.  Am I missing something or does "operate on 
genomic intervals" disregard strand information?  Is there a tool that does the 
inner join function and takes into account strand information?

thanks,

Steve

Stephen Eacker, Ph.D.
Postdoctoral Fellow
Dawson Lab
Institute for Cell Engineering
Johns Hopkins Medical Institute
(443) 287-5605
seack...@jhmi.edu






Re: [galaxy-user] Operate on genomic intervals

2011-10-28 Thread Jennifer Jackson

Hi Sarah,

If you just want to compare rows between two files without interpreting 
genome positional information (similar to the unix "comm" command), then 
please see the tools under "Join, Subtract and Group". In particular, 
"Subtract Whole Dataset", but perhaps also "Compare two Datasets".
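
For readers unfamiliar with this kind of whole-row comparison, here is a small Python sketch of the idea (roughly what "comm -23" does on sorted files). It is not the implementation of the Galaxy tools; the function name subtract_whole_rows and the sample scaffold rows are made up for illustration.

```python
def subtract_whole_rows(first_lines, second_lines):
    """Return rows of the first dataset that do not appear in the second.

    Rows are compared as whole strings, so chromosome or scaffold names
    and coordinates are never interpreted as genomic positions.
    """
    seen = {line.rstrip("\n") for line in second_lines}
    kept = []
    for line in first_lines:
        row = line.rstrip("\n")
        if row not in seen:
            kept.append(row)
    return kept


# Hypothetical tab-separated interval rows with arbitrary scaffold names.
first = ["scaffold_1\t0\t100\n", "scaffold_2\t50\t80\n"]
second = ["scaffold_2\t50\t80\n"]
remaining = subtract_whole_rows(first, second)
```

Note that a row is removed only on an exact match; partially overlapping intervals are left untouched, which is the key difference from the interval-aware Subtract tool.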


You will likely need to convert the datatype to "tabular" before
using the "Compare" function. Do this with the "Edit Attributes ->
Change data type" form, reached by clicking on the pencil icon in the top
right corner of a dataset's box in the history panel. The instructions
in the "Compare" tool form say to use "Text Manipulation->Convert",
but if your file is already in interval format, a specific type of
tabular data, then this is not an appropriate (or needed) step.


Hopefully this helps!

Best,

Jen
Galaxy team

On 10/28/11 8:19 AM, Sarah wrote:

Hi All,

I have two files containing multiple scaffolds and their genomic intervals. Now 
I want to subtract the intervals of one file from the other, but apparently 
this tool is only available if all the data in the chromosome column have the 
same name.

Does anybody know how to handle files with different names in the 
chromosome column?

Thanks a lot in advance,

Sarah




--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support


[galaxy-user] Operate on genomic intervals

2011-10-28 Thread Sarah
Hi All,

I have two files containing multiple scaffolds and their genomic intervals. Now 
I want to subtract the intervals of one file from the other, but apparently 
this tool is only available if all the data in the chromosome column have the 
same name.

Does anybody know how to handle files with different names in the 
chromosome column?

Thanks a lot in advance,

Sarah

