Re: [Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

2016-05-02 Thread Marcin Kosiński
Thanks a lot! So I'll contact you after mid-May :)

[[alternative HTML version deleted]]

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

2016-05-01 Thread Obenchain, Valerie
Hi Marcin,

I can help you add these to ExperimentHub after the release. There are a
few other things I need to tidy so timing will be about mid-May.

Note that the new data format should not be data.frames but instead
follow what we discussed here:

https://tracker.bioconductor.org/issue1335

Valerie


On 04/26/2016 11:35 AM, Marcin Kosiński wrote:
> I have read from vignette that
>
> 2 Adding resources
>
> Resources are contributed to ExperimentHub in the form of a package. The
> package contains the resource metadata, man pages, vignette and any
> supporting R functions the author wants to provide. This is a similar
> design to the existing Bioconductor experimental data packages except the
> data are uploaded to AWS S3 buckets instead of stored in a data/ directory
> as part of the package.
>
> New packages should be submitted to the Bioconductor tracker and will have
> a full review. Contact packa...@bioconductor.org for more information.
>
>
> So If I'd like to provide newer datasets from the newest TCGA release of
> data snapshot then I should upload new packages via bioconductor tracker
> but in a little different package design than in Experimental Data package.
>
> You said that
>
> *ExperimentHub will be back in active development, including addition of
> new resources, immediately after our next release, May 4, so the timing is
> fairly good.*
>
> Does it mean I should upload these data packages before May 4th or after?
>

Re: [Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

2016-04-26 Thread Marcin Kosiński
I have read in the vignette that:

2 Adding resources

Resources are contributed to ExperimentHub in the form of a package. The
package contains the resource metadata, man pages, vignette and any
supporting R functions the author wants to provide. This is a similar
design to the existing Bioconductor experimental data packages except the
data are uploaded to AWS S3 buckets instead of stored in a data/ directory
as part of the package.

New packages should be submitted to the Bioconductor tracker and will have
a full review. Contact packa...@bioconductor.org for more information.


So if I'd like to provide datasets from the newest TCGA snapshot release,
then I should upload new packages via the Bioconductor tracker, but with a
slightly different package design than an experiment data package.

You said that

*ExperimentHub will be back in active development, including addition of
new resources, immediately after our next release, May 4, so the timing is
fairly good.*

Does it mean I should upload these data packages before May 4th or after?


Re: [Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

2016-04-18 Thread Marcin Kosiński
2016-04-16 22:55 GMT+02:00 Martin Morgan :

>
>
> On 04/16/2016 01:09 PM, Marcin Kosiński wrote:
>
>> I think that having datasets from the newest snapshot date is vital for
>> data analysis, but I wouldn't like to create situations in which 2 separate
>> analysts use RTCGA.clinical and got different results because they used
>> different data versions. That's why I have started versioning data packages
>> with the number that corresponds to the release date.
>>
>
> This isn't very helpful. There is only ever one version of
> 'RTCGA.clinical' available per Bioc version, so whether its version is
> 20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.
>
> Probably you want to include the TCGA release in the package _name_,
> 'RTCGA.clinical.20151101'. Probably you want to have multiple versions
> available at any one time.
>

Thanks for comments. I haven't considered making separate packages for
separate data releases.


>
> I don't think the experiment data archive is the best solution for
> distributing large collections of curated data. It places a burden on our
> mirrors to sync the repository and on  the svn repository to store it. The
> packages are built twice weekly even though the data is very static and in
> your case based on unchanging base R data structures. The data are not very
> 'granular', even though you've done a good job of making the individual
> data sets accessible, so a user interested in ovarian cancers, say, would
> need to download all data anyway.

Re: [Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

2016-04-16 Thread Martin Morgan



On 04/16/2016 01:09 PM, Marcin Kosiński wrote:

> I think that having datasets from the newest snapshot date is vital for
> data analysis, but I wouldn't like to create situations in which 2 separate
> analysts use RTCGA.clinical and got different results because they used
> different data versions. That's why I have started versioning data packages
> with the number that corresponds to the release date.


This isn't very helpful. There is only ever one version of 
'RTCGA.clinical' available per Bioc version, so whether its version is 
20151101.1.0 or 1.1.0 wouldn't make a difference to the end user.


Probably you want to include the TCGA release in the package _name_, 
'RTCGA.clinical.20151101'. Probably you want to have multiple versions 
available at any one time.
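
A minimal sketch of that naming scheme, in base R only. The helper
snapshot_package_name() is a hypothetical name introduced here for
illustration; it is not part of RTCGA or Bioconductor. It only shows how
several snapshot-specific package names could be derived from release dates
so that more than one release can coexist.

```r
# Hypothetical helper: embed the TCGA release date in the package name,
# following the 'RTCGA.clinical.20151101' pattern suggested above.
snapshot_package_name <- function(base, release) {
  # as.Date() validates the date; format() drops the dashes.
  paste(base, format(as.Date(release), "%Y%m%d"), sep = ".")
}

# Two TCGA releases mentioned in this thread.
releases <- c("2015-11-01", "2016-01-28")

# One package name per release, so both could be available at the same time.
pkgs <- snapshot_package_name("RTCGA.clinical", releases)
print(pkgs)
```

With names like these, an analysis script states the snapshot it depends on
in its library() call, which a version number alone does not do.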


I don't think the experiment data archive is the best solution for 
distributing large collections of curated data. It places a burden on 
our mirrors to sync the repository and on  the svn repository to store 
it. The packages are built twice weekly even though the data is very 
static and in your case based on unchanging base R data structures. The 
data are not very 'granular', even though you've done a good job of 
making the individual data sets accessible, so a user interested in 
ovarian cancers, say, would need to download all data anyway.


Instead I think that these should be ExperimentHub resources. How to add 
resources is described in the vignette to the companion package 
ExperimentHubData


   http://bioconductor.org/packages/devel/bioc/html/ExperimentHubData.html

The data would be stored in Amazon S3 so globally accessible; it would 
not be under version control. The ExperimentHub / AnnotationHub cache 
would manage local versions, rather than R's package system.
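
To illustrate the idea of a local cache managing versions, here is a toy
sketch in base R. It is NOT how ExperimentHub or AnnotationHub is
implemented; cached_resource(), fake_fetch() and the file layout are
assumptions made up for this example. It only shows the general pattern:
fetch a given snapshot once, store it locally, and serve later requests from
the local copy.

```r
# Toy stand-in for a hub-style cache; not the ExperimentHub implementation.
# `fetch` is a hypothetical function that retrieves one snapshot of a resource.
cached_resource <- function(name, snapshot, fetch, cache_dir) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(name, "_", snapshot, ".rds"))
  if (!file.exists(path)) {
    saveRDS(fetch(name, snapshot), path)  # first request: fetch and store
  }
  readRDS(path)                           # every request: read the local copy
}

# Fake fetcher that counts its calls, standing in for a remote download.
n_fetches <- 0
fake_fetch <- function(name, snapshot) {
  n_fetches <<- n_fetches + 1
  list(name = name, snapshot = snapshot)
}

# Fresh, empty cache directory for this demonstration.
cache <- tempfile("toy_hub_cache")

a  <- cached_resource("clinical", "20151101", fake_fetch, cache)
b  <- cached_resource("clinical", "20151101", fake_fetch, cache)  # from cache
c2 <- cached_resource("clinical", "20160128", fake_fetch, cache)  # new snapshot
print(n_fetches)
```

Each snapshot is fetched only once, and different snapshots sit side by side
in the cache instead of replacing each other.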


[Bioc-devel] Update of data packages in RTCGA Family/Factory of R Packages

2016-04-16 Thread Marcin Kosiński
Hello,

I would like to ask you all for advice on the following issue.

Last year I started working with data from The Cancer Genome Atlas (TCGA).
During that work our team (https://github.com/orgs/RTCGA/people) prepared
some tools for downloading and integrating datasets from the TCGA study and
provided them in the R package called RTCGA, which is available on
Bioconductor.

Later on we worked on tools for visualizing and analyzing the most popular
datasets from TCGA, so we prepared data packages with those datasets and
submitted them to Bioconductor as 8 separate packages. You can read more
about them here: http://rtcga.github.io/RTCGA/

*I have a question about updating those data packages.* TCGA releases
dataset snapshots over time. The RTCGA family of R packages provides
datasets from the release date 2015-11-01, but one can check that there is
now a newer release, 2016-01-28:

> tail(RTCGA::checkTCGA('Dates'))
[1] "2015-02-04" "2015-04-02" "2015-06-01" "2015-08-21" "2015-11-01"
"2016-01-28"

I am wondering whether we should upload newer datasets to those data
packages. We have found that the results of data analysis differ greatly
depending on which release date the datasets were taken from. More about
this issue can be found here:
http://rtcga.github.io/RTCGA/Usecases.html#tcga-and-the-curse-of-bigdata
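
A minimal sketch of the release check, in base R only. The `available`
vector is copied from the checkTCGA('Dates') output shown earlier in this
message; newer_snapshots() is a hypothetical helper introduced for
illustration and is not part of RTCGA.

```r
# Release dates as printed by tail(RTCGA::checkTCGA('Dates')) in this message.
available <- c("2015-02-04", "2015-04-02", "2015-06-01",
               "2015-08-21", "2015-11-01", "2016-01-28")

# Hypothetical helper: which releases are newer than the packaged snapshot?
newer_snapshots <- function(available, packaged) {
  dates <- as.Date(available)
  available[dates > as.Date(packaged)]
}

# Compare against the snapshot currently shipped in the BiocDevel packages.
newer <- newer_snapshots(available, packaged = "2015-11-01")
print(newer)
```

An empty result would mean the packaged snapshot is already the newest one.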

The current state of RTCGA family of R packages is listed below

RTCGA.clinical

  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01  || package ver 20151101.1.0

RTCGA.rnaseq

  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0

RTCGA.mutations

  - BiocRelease: snapshot from 2015-08-21 || package ver 1.0.0
  - BiocDevel: snapshot from 2015-11-01 || package ver 20151101.0.0

---

RTCGA.methylation

  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.1


RTCGA.CNV

  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.5


RTCGA.RPPA

  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.6


RTCGA.mRNA

  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.3


RTCGA.miRNASeq

  - BiocRelease: NOT YET AVAILABLE
  - BiocDevel: snapshot from 2015-11-01 || package ver 0.99.4


I think that having datasets from the newest snapshot date is vital for
data analysis, but I wouldn't like to create situations in which two separate
analysts use RTCGA.clinical and get different results because they used
different data versions. That's why I have started versioning the data
packages with a number that corresponds to the release date.

What do you think about this issue? You can post advice here or on our
issue list: https://github.com/RTCGA/RTCGA/issues

Thanks for comments,
Marcin
