Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-27 Thread Lilach Friedman
Hi Jennifer,
Is there a way to upload my files directly from the public Galaxy to my
cloud Galaxy instance (in AWS)? Or should I download them to my computer
first, and then upload them? (That takes a lot of time because of the
low upload speed.)

Thanks,
   Lilach

2012/6/26 Jennifer Jackson j...@bx.psu.edu

  Hello Lilach,

 Currently, the human reference genome indexed for the GATK-beta tools is
 'hg_g1k_v37'. The GATK-beta tools are under active revision by our team, so
 we expect there to be little to no change to the beta version on the main
 public instance until this is completed.

 Attempting to convert data between different builds is not recommended.
 These tools are very sensitive to exact inputs, which extends to naming
 conventions, etc. The best practice path is to start and continue an
 analysis project with the same exact genome build throughout.

 If you want to use the hg19 indexes provided by the GATK project, a cloud
 instance is the current option (using a hg19 genome as a 'custom genome'
 will exceed the processing limits available on the public Galaxy instance).
 Following the links on the GATK tools can provide more information about
 sources, including links on the GATK web site which will note the exact
 contents of both of these genome versions, downloads, and other
 resources.

 Hopefully this helps to clear up any confusion,

 Best,

 Jen
 Galaxy team


 On 6/21/12 7:50 AM, Lilach Friedman wrote:

 Hi Jennifer,
 Thank you for this reply.

 I made a new BWA file, this time using the hg19(full) genome.
 However, when I try to use DepthOfCoverage, the reference genome is
 stuck on hg_g1k_v37 (this is the only option to select), and I cannot
 change it to hg19(full). Most probably this is because I selected
 hg_g1k_v37 the previous time I tried to use DepthOfCoverage.
 Is this a bug? How can I change it?

 Thanks,
   Lilach


 2012/6/18 Jennifer Jackson j...@bx.psu.edu

  Hi Lilach,

 The problem with this analysis probably has to do with a mismatch between
 the genomes: the intervals obtained from UCSC (hg19) and the BAM from your
 BWA (hg_g1k_v37) run.

 UCSC does not contain the genome 'hg_g1k_v37' - the genome available from
 UCSC is 'hg19'.

 Even though these are technically the same human release, on a practical
 level, they have a different arrangement for some of the chromosomes. You
 can compare NCBI GRCh37 (http://www.ncbi.nlm.nih.gov/genome/assembly/2758/)
 with UCSC hg19 (http://genome.ucsc.edu) for an explanation. Reference
 genomes must be *exact* in order to be used with tools - base for base.
 When they are exact, the identifier will be exact between Galaxy and the
 source (UCSC, Ensembl) or the full Build name will provide enough
 information to make a connection to NCBI or other.
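
 One way to catch this kind of mismatch before running a GATK tool is to
 compare the contig names in the intervals file against the @SQ lines of
 the BAM header. The sketch below assumes the header has been saved as
 text (for example with `samtools view -H reads.bam > header.sam`) and
 that the intervals file is tab-delimited with the contig name in
 column 1. The function name `missing_contigs` and the file names are
 made up for illustration.

```shell
# Hypothetical helper: print each contig name used in a tab-delimited
# intervals file ($1) that has no matching @SQ SN: entry in a SAM header
# saved as text ($2). Assumes the header file is not empty.
missing_contigs() {
    awk -F'\t' '
        # First file on the awk command line: the SAM header.
        NR == FNR {
            if ($1 == "@SQ")
                for (i = 2; i <= NF; i++)
                    if ($i ~ /^SN:/) known[substr($i, 4)] = 1
            next
        }
        # Second file: the intervals. Report each unknown contig once.
        NF && !($1 in known) && !seen[$1]++ { print $1 }
    ' "$2" "$1"
}

# Example call (file names are placeholders):
# missing_contigs intervals.bed header.sam
```

 Any name it prints, such as chr13 against a header that only lists 13,
 points to a naming mismatch. Matching names are necessary but not
 sufficient: the underlying sequences still have to be the same build.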

 Sometimes genomes are similar enough that a dataset sourced from one can
 be used with another, if the database attribute is changed and the data
 from the regions that differ is removed. This may be possible in your case;
 only trying will let you know how difficult it actually is with your
 analysis. The GATK pipeline is very sensitive to exact inputs. You will
 need to be careful with genome database assignments, etc. Following the
 links on the tool forms to the GATK help pages can provide some more detail
 about expected inputs, if this is something that you are going to try.

 Good luck with the re-run!

 Jen
 Galaxy team


 On 6/18/12 4:42 AM, Lilach Friedman wrote:

   Hi,
 I am trying to use Depth of Coverage to see the coverage in specific
 intervals.
 The intervals were taken from UCSC (exons of 2 genes), loaded to Galaxy
 and the file type was changed to intervals.

 I gave Depth of Coverage two BAM files (produced by BWA, then filtered
 to keep only rows matching the pattern XT:A:U, and then SAM-to-BAM)
 and the intervals file (in advanced GATK options).
 The reference genome is hg_g1k_v37.

 I got the following error message:

  An error occurred running this job: Picked up _JAVA_OPTIONS:
 -Djava.io.tmpdir=/space/g2main
 # ERROR
 --
 # ERROR A USER ERROR has occurred (version 1.4-18-g80a4ce0):
 # ERROR The invalid argume


 Is it a bug, or did I do anything wrong?

 I will be grateful for any help.

 Thanks!
   Lilach


  ___
 The Galaxy User list should be used for the discussion of
 Galaxy analysis and other features on the public server
 at usegalaxy.org.  Please keep all replies on the list by
 using reply all in your mail client.  For discussion of
 local Galaxy instances and the Galaxy source code, please
 use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

 To manage your subscriptions to this and other Galaxy lists,
 please use the interface at:

   http://lists.bx.psu.edu/


 --
 Jennifer Jackson
 http://galaxyproject.org

Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-27 Thread Lilach Friedman
May I join Carlos's question? What exactly is hg_g1k_v37, and how
can I get the intervals of specific genes in this format?

Thanks,
  Lilach



Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-27 Thread Jennifer Jackson

Hello Lilach,

The genome build 'hg_g1k_v37' is build b37 in the GATK documentation. 
Hg19 is also included (as a distinct build). I encourage you to examine 
these if you are interested in crossing over between genomes or 
identifying other projects that have data based on the same genome build.


http://www.broadinstitute.org/gsa/wiki/index.php/Introduction_to_the_GATK -
http://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle

 GATK resource bundle: A collection of standard files for working with 
human resequencing data with the GATK.


The standard reference sequence we use in the GATK is the b37 
edition from the Human Genome Reference Consortium. All of the key GATK 
data files are available against this reference sequence. Additionally, 
we used to use UCSC-style (chr1, not 1) for build hg18, and provide 
lifted-over files from b37 to hg18 for those still using those files.


b37 resources: the standard data set
* Reference sequence (standard 1000 Genomes fasta) along with fai and 
dict files

more, please follow link for details ...

hg19 resources: lifted over from b37
* Includes the UCSC-style hg19 reference along with all lifted over VCF 
files.


Hopefully this helps,

Jen
Galaxy team


Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-27 Thread Jennifer Jackson

Hi Lilach,

Regarding the cloud instance, you can load data from the public main
instance of Galaxy just like any other URL. On the Get Data - Upload
Data form on your cloud instance, paste in the URLs of the datasets
from main. A URL can be captured by right-clicking on a dataset's disk
icon and then choosing Copy link location (on a Mac; do the equivalent
if using a PC).


It is generally better to transfer one URL per job if the data is
large, since each job has a limited amount of time to complete. If you
lump several large file URLs together into one job, it could time out.
It is fine to execute several jobs concurrently.


Best,

Jen
Galaxy team


Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-26 Thread Carlos Borroto
Hi Lilach,

Sorry for the late response. Jen just confirmed the disadvantages of
my approach. I don't know how difficult it would be for you to double
check that the coordinates in your interval file are correct for
hg_g1k_v37. If you feel confident they will work and want to proceed,
you could do something like this outside of Galaxy (I'm sure you could
also find a way to do it inside Galaxy):

sed 's/^chr//' interval_file.csv > interval_file_g1k.csv

If you have coordinates for the mitochondrial chromosome you might
also have to rename it in the output of the previous command:
sed 's/^M/MT/' interval_file_g1k.csv > interval_file_g1k_mt.csv

If I remember correctly, UCSC uses chrM and the b37 reference expects
MT. Please double check this as I'm not sure.
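
A slightly more careful variant of the renaming above is sketched here.
It only touches the first tab-delimited column, so a "chr" that appears
inside a feature name is left alone. It assumes UCSC-style files name
the mitochondrial contig chrM and that the b37-style reference names it
MT (the GATK contig ordering "..., X, Y, MT" quoted elsewhere in this
thread suggests the latter); check both against your own files. The
function name `ucsc_to_b37` is made up for illustration.

```shell
# Hypothetical helper: rename UCSC-style contig names (chr1, chrX, chrM)
# in column 1 of a tab-delimited file to b37-style names (1, X, MT).
# It renames only; it does not check or convert any coordinates.
ucsc_to_b37() {
    awk 'BEGIN { FS = OFS = "\t" }
         $1 == "chrM" { $1 = "MT" }          # mitochondrial contig
         { sub(/^chr/, "", $1); print }      # strip the chr prefix
    ' "$@"
}

# Example call (file names are placeholders):
# ucsc_to_b37 interval_file.csv > interval_file_g1k.csv
```

Renaming does not make the two builds identical. Contigs outside the
main chromosomes are named differently in the two references and are
not handled here, and as far as I know the mitochondrial sequences
themselves differ, so intervals on those contigs are best rebuilt
against the target reference.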

It would also be nice if there were confirmation of what exactly
hg_g1k_v37 is, and where you could find annotations for it.
Would annotations from Ensembl do?

Regards,
Carlos


Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-25 Thread Jennifer Jackson

Hello Lilach,

Currently, the human reference genome indexed for the GATK-beta tools is 
'hg_g1k_v37'. The GATK-beta tools are under active revision by our team, 
so we expect there to be little to no change to the beta version on the 
main public instance until this is completed.


Attempting to convert data between different builds is not recommended. 
These tools are very sensitive to exact inputs, which extends to naming 
conventions, etc. The best practice path is to start and continue an 
analysis project with the same exact genome build throughout.


If you want to use the hg19 indexes provided by the GATK project, a 
cloud instance is the current option (using a hg19 genome as a 'custom 
genome' will exceed the processing limits available on the public Galaxy 
instance). Following the links on the GATK tools can provide more 
information about sources, including links on the GATK web site which 
will note the exact contents of both of these genome versions, 
downloads, and other resources.


Hopefully this helps to clear up any confusion,

Best,

Jen
Galaxy team


Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-24 Thread Lilach Friedman
Hi Carlos,
Thank you very much for this explanation.

The format of my intervals file is:

chr13  32890597  32890664  NM_59_cds_1_0_chr13_32890598_f  0  +
chr13  32893213  32893462  NM_59_cds_2_0_chr13_32893214_f  0  +
chr13  32899212  32899321  NM_59_cds_3_0_chr13_32899213_f  0  +
chr13  32900237  32900287  NM_59_cds_4_0_chr13_32900238_f  0  +
etc...

Can you please explain to me how to change this format so that I can
give it as input to DepthOfCoverage?
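
One possible conversion is sketched below. It assumes the file is
tab-delimited BED (0-based start, end not included), that the target
reference uses names without the "chr" prefix, and that the tool
accepts a plain list of contig:start-stop lines with 1-based inclusive
coordinates. Whether the Galaxy wrapper for DepthOfCoverage accepts
such a list is not confirmed in this thread. The function name
`bed_to_gatk_intervals` is made up for illustration.

```shell
# Hypothetical helper: turn tab-delimited BED rows into one
# "contig:start-stop" line each. BED starts are 0-based, so 1 is added
# to the start; the BED end already equals the 1-based inclusive end.
bed_to_gatk_intervals() {
    awk -F'\t' 'NF >= 3 {
        contig = $1
        sub(/^chr/, "", contig)    # assumes a reference without "chr"
        printf "%s:%d-%d\n", contig, $2 + 1, $3
    }' "$@"
}

# Example call (file names are placeholders):
# bed_to_gatk_intervals exons.bed > exons.intervals
```

For the first row above this gives 13:32890598-32890664, which agrees
with the 1-based start embedded in the feature name.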

Thanks,
   Lilach

2012/6/21 Carlos Borroto carlos.borr...@gmail.com


 Hi Lilach,

 I have been dealing with these issues for some time now.

 The only genome you can use with Picard and GATK tools in Galaxy is
 hg_g1k_v37. I think this is why.

 From GATK Wiki[1]:
 If you are using human data, your reads must be aligned to one of the
 official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The
 contig ordering in the reference you used must exactly match that of
 one of the official references' canonical orderings. These are defined
 by historical karyotyping of largest to smallest chromosomes, followed
 by the X, Y, and MT. The order is thus 1, 2, 3, ..., 10, 11, 12, ...
 20, 21, 22, X, Y, MT. The GATK will detect misordered contigs (for
 example, lexicographically sorted) and throw an error. This draconian
 approach, though unnecessary technically, ensures that all
 supplementary data provided with the GATK works correctly. You can use
 ReorderSam to fix a BAM file aligned to a missorted reference
 sequence.

 [1]
 http://www.broadinstitute.org/gsa/wiki/index.php/Input_files_for_the_GATK
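
 The ordering rule quoted above can be checked with a small script. This
 is only a sketch: it handles b37-style names (1..22, X, Y, MT) as given
 in the quote, ranks any other contig after them, and the function names
 `canonical_rank` and `is_canonical_order` are made up for illustration.

```shell
# Hypothetical helper: map a b37-style contig name to its position in
# the canonical order 1, 2, ..., 22, X, Y, MT. Unknown names get 99.
canonical_rank() {
    case "$1" in
        X)  echo 23 ;;
        Y)  echo 24 ;;
        MT) echo 25 ;;
        [1-9]|1[0-9]|2[0-2]) echo "$1" ;;
        *)  echo 99 ;;
    esac
}

# Hypothetical helper: succeed if the contig names given as arguments
# never step backwards in canonical order, fail otherwise.
is_canonical_order() {
    prev=0
    for name in "$@"; do
        rank=$(canonical_rank "$name")
        [ "$rank" -ge "$prev" ] || return 1
        prev=$rank
    done
    return 0
}

# Example: a lexicographically sorted list fails the check.
# is_canonical_order 1 10 11 2 || echo "not in canonical order"
```

 The contig names could be taken, in order, from the @SQ lines of a BAM
 header or from the first column of a reference index.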

 So far, what I have done when presented with a BAM file produced with
 a reference that has lexicographical chromosome ordering is to use
 Picard's ReorderSam tool, also in Galaxy, selecting hg_g1k_v37 as the
 reference. You might not be able to do this because, if I recall
 correctly, hg19 also uses chr1, chr2, ... instead of 1, 2, ... In that
 case more work needs to be done, and at that point it is almost easier
 to just remap with the correct reference for use with GATK. In your
 case it seems you already have that. What you might need to do is
 re-sort your intervals file and probably change the chromosome
 identifiers; this, I think, can be done inside Galaxy.

 I would love to hear comments about this approach, as I sometimes
 worry, as Hiram's comment hints, that hg19 and hg_g1k_v37 might not be
 completely identical apart from the chromosome ordering. In that case
 my re-sorted BAM or intervals files might be incorrect.

 Hope it helps,
 Carlos

  Thanks,
Lilach
 
 
 
  2012/6/18 Jennifer Jackson j...@bx.psu.edu
 
  Hi Lilach,
 
  The problem with this analysis probably has to do with a mismatch
  between the genomes: the intervals obtained from UCSC (hg19) and the
  BAM from your BWA (hg_g1k_v37) run.
 
  UCSC does not contain the genome 'hg_g1k_v37' - the genome available
  from UCSC is 'hg19'.
 
  Even though these are technically the same human release, on a
  practical level, they have a different arrangement for some of the
  chromosomes. You can compare NCBI GRCh37 with UCSC hg19 for an
  explanation. Reference genomes must be exact in order to be used with
  tools - base for base. When they are exact, the identifier will be
  exact between Galaxy and the source (UCSC, Ensembl) or the full Build
  name will provide enough information to make a connection to NCBI or
  other.
 
  Sometimes genomes are similar enough that a dataset sourced from one
  can be used with another, if the database attribute is changed and the
  data from the regions that differ is removed. This may be possible in
  your case, only trying will let you know how difficult it actually is
  with your analysis. The GATK pipeline is very sensitive to exact
  inputs. You will need to be careful with genome database assignments,
  etc. Following the links on the tool forms to the GATK help pages can
  provide some more detail about expected inputs, if this is something
  that you are going to try.
 
  Good luck with the re-run!
 
  Jen
  Galaxy team
 
 
  On 6/18/12 4:42 AM, Lilach Friedman wrote:
 
  Hi,
  I am trying to use Depth of Coverage to see the coverage in specific
  intervals.
  The intervals were taken from UCSC (exons of 2 genes), loaded to Galaxy
  and the file type was changed to intervals.
 
  I gave Depth of Coverage two BAM files (resulting from BWA, selection
  of only rows with the Matching pattern: XT:A:U, and then SAM-to-BAM)
  and the intervals file (in advanced GATK 

Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-21 Thread Lilach Friedman
Hi Jennifer,
Thank you for this reply.

I made a new BWA file, this time using the hg19(full) genome.
However, when I try to use DepthOfCoverage, the reference genome is
stuck on hg_g1k_v37 (this is the only option to select), and I cannot
change it to hg19(full), most probably because I selected hg_g1k_v37
the previous time I tried to use DepthOfCoverage.
Is this a bug? How can I change it?

Thanks,
  Lilach


2012/6/18 Jennifer Jackson j...@bx.psu.edu

  Hi Lilach,

 The problem with this analysis probably has to do with a mismatch between
 the genomes: the intervals obtained from UCSC (hg19) and the BAM from your
 BWA (hg_g1k_v37) run.

 UCSC does not contain the genome 'hg_g1k_v37' - the genome available from
 UCSC is 'hg19'.

 Even though these are technically the same human release, on a practical
 level, they have a different arrangement for some of the chromosomes. You
 can compare NCBI GRCh37 http://www.ncbi.nlm.nih.gov/genome/assembly/2758/
 with UCSC hg19 http://genome.ucsc.edu for an explanation. Reference
 genomes must be *exact* in order to be used with tools - base for base.
 When they are exact, the identifier will be exact between Galaxy and the
 source (UCSC, Ensembl) or the full Build name will provide enough
 information to make a connection to NCBI or other.

 Sometimes genomes are similar enough that a dataset sourced from one can
 be used with another, if the database attribute is changed and the data
 from the regions that differ is removed. This may be possible in your case,
 only trying will let you know how difficult it actually is with your
 analysis. The GATK pipeline is very sensitive to exact inputs. You will
 need to be careful with genome database assignments, etc. Following the
 links on the tool forms to the GATK help pages can provide some more detail
 about expected inputs, if this is something that you are going to try.

 Good luck with the re-run!

 Jen
 Galaxy team


 On 6/18/12 4:42 AM, Lilach Friedman wrote:

  Hi,
 I am trying to use Depth of Coverage to see the coverage in specific
 intervals.
 The intervals were taken from UCSC (exons of 2 genes), loaded to Galaxy
 and the file type was changed to intervals.

 I gave Depth of Coverage two BAM files (resulting from BWA, selection of
 only rows with the Matching pattern: XT:A:U, and then SAM-to-BAM)
 and the intervals file (in advanced GATK options).
 The consensus genome is hg_g1k_v37.

 I got the following error message:

  An error occurred running this job: Picked up _JAVA_OPTIONS:
 -Djava.io.tmpdir=/space/g2main
 # ERROR
 --
 # ERROR A USER ERROR has occurred (version 1.4-18-g80a4ce0):
 # ERROR The invalid argume


 Is it a bug, or did I do anything wrong?

 I will be grateful for any help.

 Thanks!
 Lilach


 ___
 The Galaxy User list should be used for the discussion of
 Galaxy analysis and other features on the public server
 at usegalaxy.org.  Please keep all replies on the list by
 using reply all in your mail client.  For discussion of
 local Galaxy instances and the Galaxy source code, please
 use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

 To manage your subscriptions to this and other Galaxy lists,
 please use the interface at:

   http://lists.bx.psu.edu/


 --
 Jennifer Jackson
 http://galaxyproject.org



Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-18 Thread Jennifer Jackson

Hi Lilach,

The problem with this analysis probably has to do with a mismatch 
between the genomes: the intervals obtained from UCSC (hg19) and the BAM 
from your BWA (hg_g1k_v37) run.


UCSC does not contain the genome 'hg_g1k_v37' - the genome available 
from UCSC is 'hg19'.


Even though these are technically the same human release, on a practical 
level, they have a different arrangement for some of the chromosomes. 
You can compare NCBI GRCh37 
http://www.ncbi.nlm.nih.gov/genome/assembly/2758/  with UCSC hg19 
http://genome.ucsc.edu for an explanation. Reference genomes must be 
/exact/ in order to be used with tools - base for base. When they are 
exact, the identifier will be exact between Galaxy and the source (UCSC, 
Ensembl) or the full Build name will provide enough information to make 
a connection to NCBI or other.


Sometimes genomes are similar enough that a dataset sourced from one can 
be used with another, if the database attribute is changed and the data 
from the regions that differ is removed. This may be possible in your 
case, only trying will let you know how difficult it actually is with 
your analysis. The GATK pipeline is very sensitive to exact inputs. You 
will need to be careful with genome database assignments, etc. Following 
the links on the tool forms to the GATK help pages can provide some more 
detail about expected inputs, if this is something that you are going to 
try.
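
A quick way to see this kind of mismatch before running a tool is to
compare the contig names in the BAM header with those in the intervals
file. The Python sketch below is only an illustration: it works on the
SAM header text (the @SQ lines with their SN: fields), the function
names are made up, and it assumes a tab-separated intervals file with
the chromosome in the first column.

```python
# Hypothetical helpers for illustration only.
def contigs_from_sam_header(header_text):
    """Return the SN: names from the @SQ lines of a SAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names


def missing_contigs(interval_lines, header_text):
    """Contigs named in the intervals but absent from the header."""
    known = set(contigs_from_sam_header(header_text))
    missing = []
    for line in interval_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom = line.split("\t")[0].strip()
        if chrom not in known and chrom not in missing:
            missing.append(chrom)
    return missing


header = ("@HD\tVN:1.0\n"
          "@SQ\tSN:1\tLN:249250621\n"
          "@SQ\tSN:2\tLN:243199373\n")
intervals = ["chr1\t100\t200\n", "1\t5\t9\n"]
problems = missing_contigs(intervals, header)
```

An empty result only means the names line up. It says nothing about
whether the underlying sequences are the same.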


Good luck with the re-run!

Jen
Galaxy team

On 6/18/12 4:42 AM, Lilach Friedman wrote:

Hi,
I am trying to use Depth of Coverage to see the coverage in specific 
intervals.
The intervals were taken from UCSC (exons of 2 genes), loaded to 
Galaxy and the file type was changed to intervals.


I gave Depth of Coverage two BAM files (resulting from BWA, 
selection of only rows with the Matching pattern: XT:A:U, and then 
SAM-to-BAM)

and the intervals file (in advanced GATK options).
The consensus genome is hg_g1k_v37.

I got the following error message:

An error occurred running this job: Picked up _JAVA_OPTIONS: 
-Djava.io.tmpdir=/space/g2main
# ERROR 
--

# ERROR A USER ERROR has occurred (version 1.4-18-g80a4ce0):
# ERROR The invalid argume


Is it a bug, or did I do anything wrong?

I will be grateful for any help.

Thanks!
   Lilach




--
Jennifer Jackson
http://galaxyproject.org


Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-18 Thread Hiram Clawson

I'm curious what is this genome called 'hg_g1k_v37'
and how does it correspond to NCBI GRCh37 which is
identical to UCSC hg19 ?

--Hiram


Jennifer Jackson wrote:
UCSC does not contain the genome 'hg_g1k_v37' - the genome available 
from UCSC is 'hg19'.


Even though these are technically the same human release, on a practical 
level, they have a different arrangement for some of the chromosomes. 
You can compare NCBI GRCh37 
http://www.ncbi.nlm.nih.gov/genome/assembly/2758/  with UCSC hg19 
http://genome.ucsc.edu for an explanation. Reference genomes must be 
/exact/ in order to be used with tools - base for base. When they are 
exact, the identifier will be exact between Galaxy and the source (UCSC, 
Ensembl) or the full Build name will provide enough information to make 
a connection to NCBI or other.


Sometimes genomes are similar enough that a dataset sourced from one can 
be used with another, if the database attribute is changed and the data 
from the regions that differ is removed. This may be possible in your 
case, only trying will let you know how difficult it actually is with 
your analysis. The GATK pipeline is very sensitive to exact inputs. You 
will need to be careful with genome database assignments, etc. Following 
the links on the tool forms to the GATK help pages can provide some more 
detail about expected inputs, if this is something that you are going to 
try.



Re: [galaxy-user] Problem with Depth of Coverage on BAM files (GATK tools)

2012-06-18 Thread Church, Deanna (NIH/NLM/NCBI) [E]
If hg_g1K_v37 == 1000 Genomes version of GRCh37 then it is the GRCh37
Primary assembly + a decoy sequence to try to soak up off target reads.
The chromosome coordinates are the same but the sequences included in the
packages are different.
Here is the description from the 1000 Genomes site:
http://www.1000genomes.org/category/assembly
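
One way to check the "same coordinates, different set of sequences"
situation is to compare contig names and lengths between two
references. The Python sketch below is only an illustration: the
function names are made up, the inputs are plain name-to-length
dictionaries (for example, read from a .fai index or a sequence
dictionary), and stripping a leading "chr" is an assumed naming
convention. Equal lengths suggest, but do not prove, identical
sequence.

```python
# Hypothetical helpers; the inputs are {contig_name: length} dicts.
def normalise(name):
    """Assumed convention: drop a leading 'chr' so chr1 matches 1."""
    return name[3:] if name.startswith("chr") else name


def compare_references(lengths_a, lengths_b):
    """Group contigs by whether they are shared and equally long."""
    a = {normalise(k): v for k, v in lengths_a.items()}
    b = {normalise(k): v for k, v in lengths_b.items()}
    shared = sorted(set(a) & set(b))
    return {
        "same_length": [n for n in shared if a[n] == b[n]],
        "different_length": [n for n in shared if a[n] != b[n]],
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
    }


# Toy input: 'decoy_example' and its length are invented placeholders.
ucsc_style = {"chr1": 249250621, "chrM": 16571}
g1k_style = {"1": 249250621, "MT": 16569, "decoy_example": 1000}
report = compare_references(ucsc_style, g1k_style)
```

In this toy input, chromosome 1 lines up, while the mitochondrial
entries and the placeholder decoy end up in the "only in" lists.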

Deanna

On 6/18/12 3:30 PM, Hiram Clawson hi...@soe.ucsc.edu wrote:

I'm curious what is this genome called 'hg_g1k_v37'
and how does it correspond to NCBI GRCh37 which is
identical to UCSC hg19 ?

--Hiram


Jennifer Jackson wrote:
 UCSC does not contain the genome 'hg_g1k_v37' - the genome available
 from UCSC is 'hg19'.
 
 Even though these are technically the same human release, on a
 practical level, they have a different arrangement for some of the
 chromosomes. You can compare NCBI GRCh37
 http://www.ncbi.nlm.nih.gov/genome/assembly/2758/  with UCSC hg19
 http://genome.ucsc.edu for an explanation. Reference genomes must be
 /exact/ in order to be used with tools - base for base. When they are
 exact, the identifier will be exact between Galaxy and the source
 (UCSC, Ensembl) or the full Build name will provide enough information
 to make a connection to NCBI or other.
 
 Sometimes genomes are similar enough that a dataset sourced from one
 can be used with another, if the database attribute is changed and the
 data from the regions that differ is removed. This may be possible in
 your case, only trying will let you know how difficult it actually is
 with your analysis. The GATK pipeline is very sensitive to exact
 inputs. You will need to be careful with genome database assignments,
 etc. Following the links on the tool forms to the GATK help pages can
 provide some more detail about expected inputs, if this is something
 that you are going to try.

