[Bioc-devel] Merging or renaming a fork, and appropriate journal for package updates
Hi folks,

I'm wrapping up my dissertation, and one of the chapters touches on a summer of patching a Bioconductor package that currently lives as a separate GitHub fork (the list of changes is here [1]). Two of the questions I've been asked by a member of my committee are whether to: (1) associate a publication with the work, and (2) republish the code with the original or as a separate package.

For (1), while I appreciate that the traditional unit of research output is the publication, I'm struggling to think of a suitable journal for what is essentially a discussion of enhancements and some bugfixes. I've seen R package updates published in the Journal of Statistical Software (JSS), which might be appropriate? I suppose any place that gets indexed on PubMed would work (yes, JSS is part of the NLM catalog [2]). What would you suggest? Perhaps the Bioconductor project collects metrics on publication activity about its packages to get more funding, and has some preference?

For (2), I would prefer to merge back with the original Bioconductor package. I tried upstreaming an early changeset [3], but besides my issue being open, there are currently 2 other open GitHub issues with no response, which makes me wonder if upstream is dead. If that's the case, would someone from the Bioconductor core team be willing to work with me to proxy commit to git.bioconductor.org? I've made some API-breaking changes, so I expect I would need to create at least 2 branches: one that can be committed with a deprecation warning for the upcoming API-breaking changes, and a second branch with the API-breaking changes to be committed at the subsequent Bioconductor release. Or maybe I would need to create a branch for each feature change; honestly, I don't know if that would be more or less work, but it would certainly make the git history easier to read.

Pariksheet

[1] https://github.com/coregenomics/groHMM/blob/1.99.x/NEWS
[2] https://www.ncbi.nlm.nih.gov/nlmcatalog/101307056
[3] https://github.com/Kraus-Lab/groHMM/issues/2

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [R-pkg-devel] How does one install a libtool generated libfoo.so.1 file into ./libs/?
Hi Simon and Vladimir,

>> On Oct 19, 2021, at 4:13 PM, Pariksheet Nanda wrote:
>> The trouble is, R's installation process will only copy compiled files from ./libs/ that have exactly the extension ".so" and files ending with ".so.1" are ignored.
--snip--
>> library(tsshmm)
>> ...
>> Error: package or namespace load failed for 'tsshmm' in dyn.load(file, DLLpath = DLLpath, ...):
>> unable to load shared object '/home/omsai/R/x86_64-pc-linux-gnu-library/4.1/tsshmm/libs/tsshmm.so':
>> libghmm.so.1: cannot open shared object file: No such file or directory
> Pariksheet

On 10/19/21 5:00 AM, Simon Urbanek wrote:
> dynamic linking won't work, compile a static version with PIC enabled. If the subproject is autoconf-compatible this means using --disable-shared --with-pic. Then you only need to add libfoo.a to your PKG_LIBS.
>
> Simon

On 10/19/21 6:39 AM, Vladimir Dergachev wrote:
>
> The simplest thing to try is to compile the library statically and link it
> into your package. No extra files - no trouble.
>
> You can also try renaming the file from *.so.1 to *.so.
>
> Vladimir Dergachev

Thank you both for your suggestions! I will link the code statically with PIC per your consensus.

I found that when linking the R package library, one also has to link the dependencies of the static library; in this case libghmm depends on libxml-2.0 >= 2.6, so I have to link libxml2 to my R package library after finding libxml2 with pkg-config.

Thanks for the quick replies,

Pariksheet

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel

Re: [R-pkg-devel] How does one install a libtool generated libfoo.so.1 file into ./libs/?
Hi folks,

On 10/18/21 11:13 PM, Pariksheet Nanda wrote:
> The trouble is, R's installation process will only copy compiled files from ./libs/ that have exactly the extension ".so" and files ending with ".so.1" are ignored.
--snip--
> So is there some mechanism to copy arbitrary files or symlinks to the final install location? I prefer not to patch upstream's Makefile.am to remove their -version-info, but currently that's the only option I can think of.

It turns out that removing -version-info or setting it to 0.0.0 will still try to link against libghmm.so.0, which is still problematic. I don't see how to disable libtool's versioning.

So after playing around, the only way I can think of doing this is eliminating the dependency file by compiling it statically and linking it with the dynamic library, because when I try merging the 2 dynamic libraries with libtool it gives the same error of not finding "libghmm.so.1".

I have a patch that works on my Debian machines, but not yet on the Ubuntu CI image:
https://gitlab.com/coregenomics/tsshmm/-/commit/e9608f01deb7baa13684d2bd65fe11e93f6c2e08

Also pasting the short diff below for searchability.

Pariksheet

$ GIT_PAGER=cat git log -1 --patch
commit e9608f01deb7baa13684d2bd65fe11e93f6c2e08 (HEAD -> master, origin/master, origin/HEAD)
Author: Pariksheet Nanda
Date:   Tue Oct 19 01:43:09 2021 -0400

    BLD: Link bundled dependency statically to workaround load errors

diff --git a/configure.ac b/configure.ac
index 87b4d31..4d0be6e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -135,8 +135,6 @@ AS_IF([test x$with_ghmm_strategy = x],
 ) # AS_IF
 ) # AS_IF

-AC_SUBST(GHMM_LIBS, -lghmm)
-
 # If GHMM_ROOT was provided, set the header and library paths.
 #
 # Check for the existance of include/ and lib/ sub-directories and if both are
@@ -180,7 +178,10 @@ AS_IF(test x$found_ghmm_system != xyes &&
 AM_CONDITIONAL(BUNDLED_GHMM, true)
 [AX_SUBDIRS_CONFIGURE([src/ghmm-0.9-rc3],
 [[CFLAGS=$CFLAGS],
- [--enable-gsl=no],
+ [--enable-static],
+ [--disable-shared],
+ [--with-pic],
+ [--enable-gsl=no],
 [--disable-gsltest],
 [--with-rng=mt],
 [--with-python=no],
@@ -191,8 +192,14 @@ AS_IF(test x$found_ghmm_system != xyes &&
 [AS_IF([test -d $GHMM_ROOT], [],
 AC_MSG_FAILURE(Directory of bundled GHMM not found.))]
 [AC_SUBST(GHMM_CPPFLAGS, ["-I$GHMM_ROOT/.."])]
- # Using -rpath=. prefers the bundled over any system installation.
- [AC_SUBST(GHMM_LDFLAGS, ["-Wl,-rpath=. -L$GHMM_ROOT/.libs"])]
+ # We don't need GMM_LIBS or GHMM_LDFLAGS because we can directly merge
+ # libraries using tsshmm_la_LIBADD per
+ # https://stackoverflow.com/a/13978856 and
+ # https://www.gnu.org/software/automake/manual/html_node/Libtool-Convenience-Libraries.html
+ #
+ # However we now need to link against libghmm's libxml2 dependency
+ # because we're merging libraries.
+ [PKG_CHECK_MODULES([LIBXML2], [libxml-2.0 >= 2.6])]
 AC_MSG_NOTICE(Applying patches to GHMM to fix errors and warnings from "R CMD check")
 # Patch bug in upstream's configure bug:
 #
@@ -239,7 +246,9 @@ AS_IF(test x$found_ghmm_system != xyes &&
 #include ' src/ghmm-0.9-rc3/tests/mcmc.c
 [touch -r src/ghmm-0.9-rc3/tests/mcmc.c{.bak,}]
 [diff -u src/ghmm-0.9-rc3/tests/mcmc.c{.bak,}]
- AC_MSG_NOTICE(Finished patching GHMM)
+ AC_MSG_NOTICE(Finished patching GHMM),
+ # Only link if we're not using the static bundled dependency.
+ [AC_SUBST(GHMM_LIBS, -lghmm)]
 ) # AS_IF

 # Variables for Doxygen.
diff --git a/src/Makefile.am b/src/Makefile.am
index 617a4e7..0e38b4a 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -9,18 +9,19 @@ endif
 lib_LTLIBRARIES= tsshmm.la
 tsshmm_la_CFLAGS = $(PKG_CFLAGS)
 tsshmm_la_CPPFLAGS = $(PKG_CPPFLAGS)
+if BUNDLED_GHMM
+tsshmm_la_LIBADD = @GHMM_ROOT@/libghmm.la
+tsshmm_la_LDFLAGS = -module $(PKG_LIBS) @LIBXML2_LIBS@
+else
 tsshmm_la_LDFLAGS = -module $(PKG_LIBS)
+endif
 tsshmm_la_SOURCES = R_init_tsshmm.c R_wrap_tsshmm.c models.c \
 simulate.c train.c tss.c viterbi.c

 ACLOCAL_AMFLAGS = -I tools

 # Hook that runs after the default "all" rule.
-if BUNDLED_GHMM
-all-local : tsshmm.so libghmm.so
-else
 all-local : tsshmm.so
-endif

 # One of the limitations with POSIX-compliant `make` is not being able to
 # specify multiple outputs from a single rule. Therefore, even though libtool
@@ -30,14 +31,8 @@ tsshmm.so : tsshmm.la
 cp -av .libs/tsshmm.so.0.0.0 $@
 chmod -x $@

-if BUNDLED_GHMM
-libghmm.so : @GHMM_ROOT@/libghmm.la
- cp -av @GHMM
[R-pkg-devel] How does one install a libtool generated libfoo.so.1 file into ./libs/?
Hi folks,

My package [1] depends on a C library, libghmm-dev, that's available in many GNU/Linux package managers. However, it's not available on all platforms, and if this dependency is not installed, my autoconf-generated configure script falls back to compiling and installing the dependency from my bundled copy of upstream's pristine source tarball [2].

Now, because upstream uses automake, which in turn uses libtool, I also use automake and libtool in my build process to hook into their build artifacts using SUBDIRS and *-local automake rules [3]. As you may know, libtool appends `-version-info` to its generated shared libraries in the form "libfoo.so.1.2.3". I'm linking against the bundled library, which only sets the first value, namely libghmm.so.1.

The trouble is, R's installation process will only copy compiled files from ./libs/ that have exactly the extension ".so" and files ending with ".so.1" are ignored.

My current workaround is to set -Wl,-rpath to the location of the generated ".so.1" file. This allows the installation process to complete and sneakily pass the 2 canonical tests:

** testing if installed package can be loaded from temporary location
---snip---
** testing if installed package can be loaded from final location

However, not surprisingly, when I try to load the library from the final location after the temporary directory has been deleted, it fails with:

library(tsshmm)
...
Error: package or namespace load failed for 'tsshmm' in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/omsai/R/x86_64-pc-linux-gnu-library/4.1/tsshmm/libs/tsshmm.so':
 libghmm.so.1: cannot open shared object file: No such file or directory

I can rename the dependency from ".so.1" to ".so" to also get the dependent library to the final location. But it still fails with the above error because the library links against the ".so.1" file and I would need an accompanying symlink.

I tried creating a symlink but can't think of how to get the symlink to the final location. If my Makefile writes the symlink into ./inst/libs/libghmm.so.1 during compile time, it is not actually installed; perhaps because the ./inst/ sub-directories are only copied earlier on when staging and are ignored later? If I were to create that dangling symlink inside ./inst/libs/ instead of generating it later during compile time, devtools::install() complains about the broken symlink with:

cp: cannot stat 'tsshmm/inst/libs/libghmm.so.1': No such file or directory

So is there some mechanism to copy arbitrary files or symlinks to the final install location? I prefer not to patch upstream's Makefile.am to remove their -version-info, but currently that's the only option I can think of.

I can't find helpful discussion surrounding this in the mailing list archives.

Last week when I posted for help with my package on another issue on the Bioconductor mailing list, one adventurous soul tried installing the package using `remotes::install_gitlab("coregenomics/tsshmm")`. This won't work because I haven't committed the generated autotools files; if anyone wants to play with it, you'll have to follow the 2 additional steps run by the continuous integration script, namely, unpacking ./src/ghmm-0.9-rc3.tar.gz into ./src/ and running `autoreconf -ivf` in the package's top-level directory where configure.ac is located.

Any help appreciated,

Pariksheet

[1] https://gitlab.com/coregenomics/tsshmm
[2] The only patches I apply to the dependency are to fix 2 bugs for compiling, and to remedy a warning severe enough to be flagged by `R CMD check`.
[3] You can see my Makefile.am here: https://gitlab.com/coregenomics/tsshmm/-/blob/master/src/Makefile.am

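The extension filtering described in this message can be imitated in a few lines of base R. This is only a sketch and not R's installer code: it assumes the selection behaves like a glob on the bare ".so" extension, and the file names are empty stand-ins created in a temporary directory.

```r
# Sketch only: imitate an installer step that selects files matching "*.so".
# The directory and file names below are hypothetical stand-ins.
build_dir <- tempfile("libs-demo-")
dir.create(build_dir)
invisible(file.create(file.path(
  build_dir, c("tsshmm.so", "libghmm.so.1", "libghmm.so.1.0.0"))))

# A glob on the bare extension skips the versioned names that libtool emits.
copied <- basename(Sys.glob(file.path(build_dir, "*.so")))
skipped <- setdiff(list.files(build_dir), copied)

print(copied)
print(skipped)
```

Under that assumption only the unversioned file is selected, which is consistent with the package library arriving in the final location while the versioned dependency it links against does not.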
Re: [Bioc-devel] Strange "internal logical NA value has been modified" error
Hi Hervé,

On 10/13/21 12:43 PM, Hervé Pagès wrote:
> On 12/10/2021 15:43, Pariksheet Nanda wrote:
>> The function in question is:
>>
>> replace_unstranded <- function (gr) {
>>     idx <- strand(gr) == "*"
>>     if (length(idx) == 0L)
>         ^
> Not related to the "internal logical NA value has been modified" error but shouldn't you be doing '!any(idx)' instead of 'length(idx) == 0L' here?

Indeed. Although in a roundabout way the result somehow satisfied the unit tests, idx is a poor choice of name because it's really a mask, and your suggestion of OR-ing the mask values with any() is more intuitive. The name is_unstranded might be less cryptic than mask.

Applying your suggestion of the correct condition uncovered a bug where return(gr) was returning the unsorted value, which is inconsistent with the behavior of the final statement, which returns a sorted value. So I changed it to return(sort(gr)) for a consistent contract.

Fixed in f6892ea

> Best,
> H.
>
>>         return(gr)
>>     sort(c(
>>         gr[! idx],
>>         `strand<-`(gr[idx], value = "+"),
>>         `strand<-`(gr[idx], value = "-")))
>> }

Pariksheet

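The difference between the two guards discussed in this message can be seen with plain logical vectors standing in for the result of strand(gr) == "*". This is a minimal sketch; the helper names guard_length and guard_any are made up for illustration and are not part of the package.

```r
# Hypothetical helpers wrapping the two conditions under discussion.
guard_length <- function(mask) length(mask) == 0L  # original condition
guard_any    <- function(mask) !any(mask)          # suggested condition

# Plain logical vectors stand in for strand(gr) == "*".
none_unstranded <- c(FALSE, FALSE)  # e.g. strands "+" and "-"
some_unstranded <- c(TRUE, FALSE)   # e.g. strands "*" and "+"

# The length test is only TRUE for empty input, so it never takes the
# early-return branch for a non-empty input with nothing to replace.
guard_length(none_unstranded)
# The any() test takes the early return whenever no element is unstranded.
guard_any(none_unstranded)
guard_any(some_unstranded)
# Both conditions agree on empty input.
guard_length(logical(0))
guard_any(logical(0))
```

With the length test the early return is skipped for every non-empty input, so the function always falls through to the sorting branch, which may be why the unit tests still passed.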
Re: [Bioc-devel] [External] Re: Strange "internal logical NA value has been modified" error
On 10/13/21 8:27 AM, luke-tier...@uiowa.edu wrote:

*Message sent from a system outside of UConn.*

The most likely culprit is C code that is modifying a logical vector without checking whether this is legitimate for R semantics (i.e. making sure MAYBE_REFERENCED or at least MAYBE_SHARED is FALSE). If that is the case, then this is legitimate for C code to do in principle, so UBSAN and valgrind won't help. You need to set a gdb watchpoint on the location, catch where it is modified, and look up the call stack from there.

The error signaled in the GC is a sanity check for catching that this sort of misbehavior has happened in C code. But it is a check after the fact; it can't tell you more than that the problem happened sometime before it was detected.

Best,

luke

On Wed, 13 Oct 2021, Martin Morgan wrote:

The problem with using gdb is you'd find yourself in the garbage collector, but perhaps quite removed from where the corruption occurred, e.g., gc() might / will likely be triggered after you've returned to the top-level evaluation loop, and the part of your code that did the corruption might be off the stack.

The problem with devtools::check() (and R CMD check) is that running the unit tests occurs in a separate process, so things like setting a global option (and even system variable from within R) may not be visible in the process doing the check.

Conversely, for the same reasons, it seems like the problem can be tickled by running the tests alone. So

R -f /tests/testthat.R

would seem to be a good enough starting point.

Actually, I liked Henrik's UBSAN suggestion, which requires the least amount of work.

I think I'd then try

R -d valgrind -f /tests/testthat.R

and then further into the weeds... actually from the section of R-exts you mention

R_C_BOUNDS_CHECK=yes R -f /tests/testthat.R

might also be promising.

Martin

On 10/12/21, 10:30 PM, "Bioc-devel on behalf of Pariksheet Nanda" <pariksheet.na...@uconn.edu> wrote:

Hi all,

On 10/12/21 6:43 PM, Pariksheet Nanda wrote:
>
> Error in `...`: internal logical NA value has been modified

In the R source code, this error is in src/main/memory.c so I was thinking one way of investigating might be to run `R --debugger gdb`, then running R to load the symbols and either:

1) set a breakpoint for when it reaches that particular line in memory.c:R_gc_internal and then walk up the stack,

2) or set a watch point on memory.c:R_gc_internal:R_LogicalNAValue (somehow; having trouble getting gdb to reach that context).

3) Then I thought, maybe this is getting far into the weeds and instead I could check the most common C related error by enabling bounds checking of my C arrays per section 4.4 of the R-exts manual:

$ R -q
> options(CBoundsCheck = TRUE)
> Sys.setenv(R_C_BOUNDS_CHECK = "yes") # Try both ways *shrug*
> devtools::test()
... # All tests still pass.
> devtools::check()
... # No change :(

Maybe I'm not using that option correctly? Or the option is ignored in devtools::check(). Or indeed, the error is not from overrunning C array boundaries.

It turns out that using the precompiled debug symbols [1] isn't all that useful here because I don't get line numbers in gdb without the source files and many symbols are optimized out, so it looks like I would need to compile R from source with -ggdb first instead of using the Debian packages. Hopefully this is still the right approach?

Pariksheet

[1] After installing r-base-core-dbg on Debian for the debug symbols.

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu

Re: [Bioc-devel] Strange "internal logical NA value has been modified" error
Thanks, Martin and Henrik!

My previous confusing reply from a few minutes ago was due to my university GMail hiding your replies in All Mail.

I'll consider both your suggestions carefully and thank you again for the quick and thoughtful replies.

Pariksheet

On 10/12/21 8:03 PM, Henrik Bengtsson wrote:

*Message sent from a system outside of UConn.*

In addition to checking with Valgrind, the ASan/UBsan and rchk platforms on R-Hub (https://builder.r-hub.io/) can probably also be useful;

rhub::check(platform = "linux-x86_64-rocker-gcc-san")
rhub::check(platform = "ubuntu-rchk")

/Henrik

On Tue, Oct 12, 2021 at 4:54 PM Martin Morgan wrote:

It is from base R

https://github.com/wch/r-source/blob/a984cc29b9b8d8821f8eb2a1081d9e0d1d4df56e/src/main/memory.c#L3214

and likely indicates memory corruption, not necessarily in the code that triggers the error (this is when the garbage collector is triggered...). Probably in *your* C code :) since it's the least tested. Probably writing out of bounds.

This could be quite tricky to debug. I'd try to get something close to a minimal reproducible example.

I'd try to take devtools out of the picture, maybe running the tests/testthat.R script from the command line using Rscript, or worst case creating a shell package that adds minimal code and can be checked with R CMD build --no-build-vignettes / R CMD check.

You could try inserting gc() before / after the unit test; it might make it clear that the unit test isn't the problem. You could also try gctorture(TRUE); this will make your code run extremely painfully slowly, which puts a big premium on having a minimal reproducible example; you could put this near the code chunks that are causing problems.

You might have success running under valgrind, something like R -d valgrind -f minimal_script.R.

Hope those suggestions help!

Martin

On 10/12/21, 6:43 PM, "Bioc-devel on behalf of Pariksheet Nanda" wrote:

Hi folks,

I've been told to ask some of my more fun questions on this mailing list instead of Slack. I'm climbing the ladder of submitting my first Bioconductor package (https://gitlab.com/coregenomics/tsshmm) and feel like there are gremlins that keep adding rungs to the top of the ladder. The latest head scratcher from running devtools::check() is a unit test for a trivial 2-line function failing with this gem of an error:

> test_check("tsshmm")
══ Failed tests ══
── Error (test-tss.R:11:5): replace_unstranded splits unstranded into + and - ──
Error in `tryCatchOne(expr, names, parentenv, handlers[[1L]])`: internal logical NA value has been modified
Backtrace:
     █
  1. ├─testthat::expect_equal(...) test-tss.R:11:4
  2. │ └─testthat::quasi_label(enquo(expected), expected.label, arg = "expected")
  3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
  4. └─GenomicRanges::GRanges(c("chr:100:+", "chr:100:-"))
  5.   └─methods::as(seqnames, "GRanges")
  6.     └─GenomicRanges:::asMethod(object)
  7.       └─GenomicRanges::GRanges(ans_seqnames, ans_ranges, ans_strand)
  8.         └─GenomicRanges:::new_GRanges(...)
  9.           └─S4Vectors:::normarg_mcols(mcols, Class, ans_len)
 10.             └─S4Vectors::make_zero_col_DFrame(x_len)
 11.               └─S4Vectors::new2("DFrame", nrows = nrow, check = FALSE)
 12.                 └─methods::new(...)
 13.                   ├─methods::initialize(value, ...)
 14.                   └─methods::initialize(value, ...)
 15.                     └─methods::validObject(.Object)
 16.                       └─base::try(...)
 17.                         └─base::tryCatch(...)
 18.                           └─base:::tryCatchList(expr, classes, parentenv, handlers)
 19.                             └─base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 109 ]

The full continuous integration log is here: https://gitlab.com/coregenomics/tsshmm/-/jobs/1673603868

The function in question is:

replace_unstranded <- function (gr) {
    idx <- strand(gr) == "*"
    if (length(idx) == 0L)
        return(gr)
    sort(c(
        gr[! idx],
        `strand<-`(gr[idx], value = "+"),
        `strand<-`(gr[idx], value = "-")))
}

Also online here: https://gitlab.com/coregenomics/tsshmm/-/blob/ef5e19a0e2f68fca93665bc417afbcfb6d437189/R/hmm.R#L170-178

... and the unit test is:

test_that("replace_unstranded splits unstranded into + and -", {
    expect_equal(replace_unstranded(GRanges(
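Martin's suggestion of calling gc() before and after the suspect code, and of turning on gctorture(TRUE) around it, could be wrapped as follows. This is a rough sketch using base R only; run_under_gc_stress is a hypothetical helper name, and the trivial function passed to it stands in for the real code under test.

```r
# Hypothetical helper: call a zero-argument function `f` with a forced
# collection before and after, and with gctorture enabled during the call.
run_under_gc_stress <- function(f) {
  gc()                                   # collect first: flags earlier corruption
  gctorture(TRUE)                        # collect on (nearly) every allocation
  on.exit(gctorture(FALSE), add = TRUE)  # always switch torture mode off again
  result <- f()
  gctorture(FALSE)
  gc()                                   # collect again once the call returns
  result
}

# Stand-in workload; in practice this would call the code under suspicion.
res <- run_under_gc_stress(function() sum(seq_len(10)))
print(res)
```

If the error fires at the first gc() call, the corruption presumably happened before the wrapped code ran, which helps narrow down which earlier call to look at. Keeping the wrapped code small matters because gctorture makes everything very slow.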
Re: [Bioc-devel] Strange "internal logical NA value has been modified" error
Hi all,

On 10/12/21 6:43 PM, Pariksheet Nanda wrote:
> Error in `...`: internal logical NA value has been modified

In the R source code, this error is in src/main/memory.c so I was thinking one way of investigating might be to run `R --debugger gdb`, then running R to load the symbols and either:

1) set a breakpoint for when it reaches that particular line in memory.c:R_gc_internal and then walk up the stack,

2) or set a watch point on memory.c:R_gc_internal:R_LogicalNAValue (somehow; having trouble getting gdb to reach that context).

3) Then I thought, maybe this is getting far into the weeds and instead I could check the most common C related error by enabling bounds checking of my C arrays per section 4.4 of the R-exts manual:

$ R -q
> options(CBoundsCheck = TRUE)
> Sys.setenv(R_C_BOUNDS_CHECK = "yes") # Try both ways *shrug*
> devtools::test()
... # All tests still pass.
> devtools::check()
... # No change :(

Maybe I'm not using that option correctly? Or the option is ignored in devtools::check(). Or indeed, the error is not from overrunning C array boundaries.

It turns out that using the precompiled debug symbols [1] isn't all that useful here because I don't get line numbers in gdb without the source files and many symbols are optimized out, so it looks like I would need to compile R from source with -ggdb first instead of using the Debian packages. Hopefully this is still the right approach?

Pariksheet

[1] After installing r-base-core-dbg on Debian for the debug symbols.

[Bioc-devel] Strange "internal logical NA value has been modified" error
Hi folks,

I've been told to ask some of my more fun questions on this mailing list instead of Slack. I'm climbing the ladder of submitting my first Bioconductor package (https://gitlab.com/coregenomics/tsshmm) and feel like there are gremlins that keep adding rungs to the top of the ladder. The latest head scratcher from running devtools::check() is a unit test for a trivial 2-line function failing with this gem of an error:

> test_check("tsshmm")
══ Failed tests ══
── Error (test-tss.R:11:5): replace_unstranded splits unstranded into + and - ──
Error in `tryCatchOne(expr, names, parentenv, handlers[[1L]])`: internal logical NA value has been modified
Backtrace:
     █
  1. ├─testthat::expect_equal(...) test-tss.R:11:4
  2. │ └─testthat::quasi_label(enquo(expected), expected.label, arg = "expected")
  3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
  4. └─GenomicRanges::GRanges(c("chr:100:+", "chr:100:-"))
  5.   └─methods::as(seqnames, "GRanges")
  6.     └─GenomicRanges:::asMethod(object)
  7.       └─GenomicRanges::GRanges(ans_seqnames, ans_ranges, ans_strand)
  8.         └─GenomicRanges:::new_GRanges(...)
  9.           └─S4Vectors:::normarg_mcols(mcols, Class, ans_len)
 10.             └─S4Vectors::make_zero_col_DFrame(x_len)
 11.               └─S4Vectors::new2("DFrame", nrows = nrow, check = FALSE)
 12.                 └─methods::new(...)
 13.                   ├─methods::initialize(value, ...)
 14.                   └─methods::initialize(value, ...)
 15.                     └─methods::validObject(.Object)
 16.                       └─base::try(...)
 17.                         └─base::tryCatch(...)
 18.                           └─base:::tryCatchList(expr, classes, parentenv, handlers)
 19.                             └─base:::tryCatchOne(expr, names, parentenv, handlers[[1L]])

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 109 ]

The full continuous integration log is here: https://gitlab.com/coregenomics/tsshmm/-/jobs/1673603868

The function in question is:

replace_unstranded <- function (gr) {
    idx <- strand(gr) == "*"
    if (length(idx) == 0L)
        return(gr)
    sort(c(
        gr[! idx],
        `strand<-`(gr[idx], value = "+"),
        `strand<-`(gr[idx], value = "-")))
}

Also online here: https://gitlab.com/coregenomics/tsshmm/-/blob/ef5e19a0e2f68fca93665bc417afbcfb6d437189/R/hmm.R#L170-178

... and the unit test is:

test_that("replace_unstranded splits unstranded into + and -", {
    expect_equal(replace_unstranded(GRanges("chr:100")),
                 GRanges(c("chr:100:+", "chr:100:-")))
    expect_equal(replace_unstranded(GRanges(c("chr:100", "chr:200:+"))),
                 sort(GRanges(c("chr:100:+", "chr:100:-", "chr:200:+"))))
})

Also online here: https://gitlab.com/coregenomics/tsshmm/-/blob/ef5e19a0e2f68fca93665bc417afbcfb6d437189/tests/testthat/test-tss.R#L11-L12

What's interesting is this is *not* reproducible by running devtools::test() but only devtools::check(), so as far as I know there isn't a way to interactively debug this while devtools::check() is going on?

Every few days I've seen that "internal ... value has been modified" error, which prevents me from running nearly any R commands. Originally I would restart R, but then I found I could clear that error by running gc(). No idea what causes it. Maybe some S4 magic?

Yes, I have downloaded the mailing lists for bioc-devel, r-devel, r-help, and r-package-devel and see no mention of "value has been modified" [1].

Any help appreciated.

Pariksheet

[1] Mailing lists downloader:

#!/bin/bash -x
for url in https://stat.ethz.ch/pipermail/{bioc-devel,r-{devel,help,package-devel}}/
do
    dir=$(basename $url)
    wget \
        --timestamping \
        --no-remove-listing \
        --recursive \
        --level 1 \
        --no-directories \
        --no-host-directories \
        --cut-dirs 2 \
        --directory-prefix "$dir" \
        --accept '*.txt.gz' \
        --relative \
        --no-parent \
        $url
done

Re: [Bioc-devel] GLAD quotation
Hi Patricia,

Whoops, my mistake! GLAD is indeed a Bioconductor package:

$ R -q
> BiocManager::available("glad")
[1] "GLAD"      "GladiaTOX"
>

You don't need to purchase any software license. You can install the package freely inside R. See:
https://bioconductor.org/packages/release/bioc/html/GLAD.html

If you're not familiar with using R, a good place to start is the Software Carpentry lesson:
https://swcarpentry.github.io/r-novice-gapminder/

Pariksheet

On Thu, Aug 22, 2019 at 6:33 AM P q wrote:
> Dear Support assistant,
>
> I am a doctoral student of Dr. Nicolas Carrels at FIOCRUZ and I am in
> charge to ask for softwares quotations for the lab. I would like a
> quotation for 4 years GLAD software license and It is for non-profit
> research. The quotation document must contain these informations
> below:
>
> 1)Head Researcher/ Scientist: Nicolas Carrels CPF 84166770500
> 2)Institution: Fundacao Oswaldo Cruz -FIOCRUZ-Centro de Desenvolv.
> Tecn. em Saude Publica-CDTS
> 3)Address: Av.Brasil, 4036 - predio da expansão - 8˚ andar - sala 814
> cep 21040-361 - Rio de Janeiro-RJ - Brasil
>
>
> If you do have any questions, please, contact me by e-mail or by
> telephone: +55 21 965515609
>
>
> Best Regards,
> Patricia Queiroz Monteiro

Re: [Bioc-devel] GLAD quotation
Hi Patricia,

You've got the wrong e-mail address. Bioconductor doesn't sell proprietary software licenses. This mailing list is for developers to discuss technical matters with the R or Bioconductor software packages.

I don't know what website or e-mail address you need for purchasing the GLAD software; my NCBI and web searches are failing me. You might need to ask Prof. Carrels?

Good luck,

Pariksheet

On Thu, Aug 22, 2019 at 6:33 AM P q wrote:
> Dear Support assistant,
>
> I am a doctoral student of Dr. Nicolas Carrels at FIOCRUZ and I am in
> charge to ask for softwares quotations for the lab. I would like a
> quotation for 4 years GLAD software license and It is for non-profit
> research. The quotation document must contain these informations
> below:
>
> 1)Head Researcher/ Scientist: Nicolas Carrels CPF 84166770500
> 2)Institution: Fundacao Oswaldo Cruz -FIOCRUZ-Centro de Desenvolv.
> Tecn. em Saude Publica-CDTS
> 3)Address: Av.Brasil, 4036 - predio da expansão - 8˚ andar - sala 814
> cep 21040-361 - Rio de Janeiro-RJ - Brasil
>
>
> If you do have any questions, please, contact me by e-mail or by
> telephone: +55 21 965515609
>
>
> Best Regards,
> Patricia Queiroz Monteiro

Re: [Bioc-devel] IRanges should support long vectors
Hi Hervé,

Indeed, an IRanges with 2^31 elements is 17.1 GB.

The reason I was interested in IRanges is that a GRanges is needed to create a BSgenome::BSgenomeViews. More broadly, my use case is chopping up a large genome into k-mers of a fixed size so that repetitive "unmappable" regions can be removed.
https://github.com/coregenomics/kmap

My interest in long vectors is to address issue #8:
https://github.com/coregenomics/kmap/issues/8

The workaround I've imagined so far is to have my kmap::kmerize function return an iterator that creates GRanges of length less than 2^31. Using iterators doesn't even need any additional packages: they're implemented in the BiocParallel bpiterate unit tests as a function that keeps returning objects until it returns NULL.

But seeing how much more efficient your GPos etc. objects are, perhaps BSgenomeViews requiring a GRanges is not so reasonable?

I don't even know of a sane way to mock a BSgenome object for writing tests. It's irritating to have to use actual small genomes for tests.

Pariksheet

On Tue, May 28, 2019 at 3:35 AM Pages, Herve wrote:
> Hi Pariksheet,
>
> On 5/25/19 12:49, Pariksheet Nanda wrote:
>> Hello,
>>
>> R 3.0 added support for long vectors, but it's not yet possible to use them
>> with IRanges. Without long vector support it's not possible to construct
>> an IRanges object with 2^31 or more elements:
>>
>> > ir <- IRanges(start = 1:(2^31 - 1), width = 1)
>> > ir <- IRanges(start = 1:2^31, width = 1)
>> Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
>>   long vectors not supported yet: memory.c:3715
>> In addition: Warning message:
>> In .normargSEW0(start, "start") :
>>   NAs introduced by coercion to integer range
>
> Right. This is a known limitation of IRanges objects and Vector
> derivatives in general.
>
> I wonder what's your use case?
>
> FWIW supporting long Vector derivatives (including long IRanges) has been
> on the TODO list for a while. Unfortunately it seems that we keep getting
> distracted by other things.
>
> Note that even when long IRanges objects are supported, computing on them
> will not be very efficient because the memory footprint of these objects
> will be very big (> 16 GB). It is much more interesting (and fun) to use
> long Vector derivatives that have a **small** memory footprint like long
> Rle's or long StitchedIPos/StitchedGPos objects:
>
>   library(S4Vectors)
>
>   x <- Rle(1:15, 1e9)
>   x
>   # integer-Rle of length 15000000000 with 15 runs
>   #   Lengths: 1000000000 1000000000 1000000000 ... 1000000000 1000000000
>   #   Values :          1          2          3 ...         14         15
>
>   object.size(x)
>   # 1288 bytes
>
>   library(IRanges)
>
>   ipos <- IPos(IRanges(1, 2e9))
>   ipos
>   # StitchedIPos object with 2000000000 positions and 0 metadata columns:
>   #                       pos
>   #                 <integer>
>   #            [1]          1
>   #            [2]          2
>   #            [3]          3
>   #            [4]          4
>   #            [5]          5
>   #            ...        ...
>   #   [1999999996] 1999999996
>   #   [1999999997] 1999999997
>   #   [1999999998] 1999999998
>   #   [1999999999] 1999999999
>   #   [2000000000] 2000000000
>
>   object.size(ipos)
>   # 2736 bytes
>
>   library(GenomicRanges)
>
>   gpos <- GPos("chr1:1-5e8")  # not a real organism ;-)
>   gpos
>   # StitchedGPos object with 500000000 positions and 0 metadata columns:
>   #               seqnames       pos strand
>   #                  <Rle> <integer>  <Rle>
>   #           [1]     chr1         1      *
>   #           [2]     chr1         2      *
>   #           [3]     chr1         3      *
>   #           [4]     chr1         4      *
>   #           [5]     chr1         5      *
>   #           ...      ...       ...    ...
>   #   [499999996]     chr1 499999996      *
>   #   [499999997]     chr1 499999997      *
>   #   [499999998]     chr1 499999998      *
>   #   [499999999]     chr1 499999999      *
>   #   [500000000]     chr1 500000000      *
>   #   ---
>   #   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>   object.size(gpos)
>   # 10552 bytes
>
> We're not there yet but the goal would be to have light-weight objects that
> can represent all the genomic positions in the Human genome.
>
> H.
>
>> This is true when using the latest version from GitHub
>>
>> > BiocManager::install("Bioconductor/IRanges")
>> > sessionInfo()
>> R version 3.6.0 (2019-04-26)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)
>>
>> M
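A minimal sketch of the iterator workaround described in the reply above, using base R only. The helper name chunk_iterator and its arguments are hypothetical and are not taken from kmap or BiocParallel; the only idea borrowed from the message is "a function that keeps returning objects until it returns NULL". Each returned chunk is a start/end pair that a caller could turn into a GRanges shorter than 2^31.

```r
# Hypothetical helper: returns a closure that yields successive
# (start, end) chunks covering 1..n, then NULL once exhausted.
# Doubles are used so that n may exceed the 32-bit integer range.
chunk_iterator <- function(n, chunk_size) {
    next_start <- 1
    function() {
        if (next_start > n)
            return(NULL)
        start <- next_start
        end <- min(start + chunk_size - 1, n)
        next_start <<- end + 1   # remember where the next chunk begins
        c(start = start, end = end)
    }
}

# Drain the iterator, collecting every chunk.
iter <- chunk_iterator(n = 10, chunk_size = 4)
chunks <- list()
while (!is.null(chunk <- iter()))
    chunks[[length(chunks) + 1]] <- chunk
```

In real use chunk_size would be something just under 2^31 and the loop body would build and process one GRanges per chunk instead of storing the chunks.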
[Bioc-devel] IRanges should support long vectors
Hello,

R 3.0 added support for long vectors, but it's not yet possible to use them with IRanges. Without long vector support it's not possible to construct an IRanges object with 2^31 or more elements:

> ir <- IRanges(start = 1:(2^31 - 1), width = 1)
> ir <- IRanges(start = 1:2^31, width = 1)
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
  long vectors not supported yet: memory.c:3715
In addition: Warning message:
In .normargSEW0(start, "start") :
  NAs introduced by coercion to integer range
>

This is true when using the latest version from GitHub:

> BiocManager::install("Bioconductor/IRanges")
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.7 (Santiago)

Matrix products: default
BLAS: /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRblas.so
LAPACK: /home/pan14001/spack/opt/spack/linux-rhel6-x86_64/gcc-7.4.0/r-3.6.0-r7m53dthhqtxyrrdghjuiw2otasowvbl/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IRanges_2.19.5      S4Vectors_0.22.0    BiocGenerics_0.30.0

loaded via a namespace (and not attached):
 [1] ps_1.3.0           prettyunits_1.0.2  withr_2.1.2        crayon_1.3.4
 [5] rprojroot_1.3-2    assertthat_0.2.1   R6_2.4.0           backports_1.1.4
 [9] magrittr_1.5       cli_1.1.0          curl_3.3           remotes_2.0.4
[13] callr_3.2.0        tools_3.6.0        compiler_3.6.0     processx_3.3.1
[17] pkgbuild_1.0.3     BiocManager_1.30.4
>

Pariksheet
Re: [Bioc-devel] BiocManager to install Depends/Imports/Suggests
Hi Levi,

Why not use devtools, which already does this? Setting `dependencies = TRUE` installs the packages listed in Depends, Imports, LinkingTo and Suggests, and BiocManager::repositories(), like BiocInstaller::biocinstallRepos(), returns a character vector of repository URLs. See inline below:

On Mon, Jul 9, 2018 at 4:51 AM, Levi Waldron wrote:
> It would be useful to be able to use BiocManager to install
> the Depends/Imports/Suggests of a source package not on Bioconductor, e.g.:
>
> BiocManager::install("Bioconductor/BiocWorkshops") #works but only if all
> Depends/Imports are already installed

devtools::install_github("Bioconductor/BiocWorkshops", repos = BiocManager::repositories(), dependencies = TRUE)

> Also from a local package, e.g.:
>
> BiocManager::install("mypackage_0.1.tar.gz") # or,
> BiocManager::install("mypackage")

devtools::install_local("mypackage_0.1.tar.gz", repos = BiocManager::repositories(), dependencies = TRUE)
devtools::install("mypackage", repos = BiocManager::repositories(), dependencies = TRUE)
devtools::install(".", repos = BiocManager::repositories(), dependencies = TRUE)

Pariksheet
Re: [Bioc-devel] BiocInstaller: next generation
Hi Henrik,

On Thu, May 10, 2018 at 1:21 AM, Henrik Bengtsson <henrik.bengts...@gmail.com> wrote:
>
> May I suggest the package name:
>
> * Bioconductor
>
> The potential downside would be possible confusions between the version of
> this package versus the actual Bioconductor repository. Could the
> Bioconductor *package* have a version x.y.z that reflects the *repository*
> x.y version?

This is a nice suggestion that also crossed my mind, but users new to both R and Bioconductor might think "but I have 'Bioconductor' installed, why can't I run this script?", and it might complicate web namespace / presence by entrapping searches for the Bioconductor system to the single package.

> /Henrik

Pariksheet
Re: [Bioc-devel] Including data for @examples to run
Hi Adam,

On Wed, Apr 25, 2018 at 2:35 PM, Adam Price wrote:
>
> There are a few reasons why I'm using \dontrun{} for my examples and want
> to know if there is any way to actually run my examples.
>
> My package incorporates some automated data management and requires in
> practice that certain directories exist.

You might consider decoupling the code logic that creates or requires such a directory structure, and you could have a wrapper function that takes the input and output paths as function parameters with some defaults. An R-ish way of setting package-wide defaults that might be a good fit for your use case is options(), so your input and output paths could fall back to getOption(...) if they are not provided. That can be a little fragile, though, if a user changes options() between related function calls. Another possibility might be to instantiate a class to keep a single instance of your path structure.

I imagine there must be existing Bioconductor packages that do this sort of thing, especially those that lightly wrap around other programs that create directories, like the Rbowtie2 package (Rbowtie2 checks for files and directory structures, but doesn't make use of classes). Maybe others on the list know of packages that come to mind or can comment on these ideas.

Are you by any chance writing unit tests for your package? One of the really nice benefits of separating out the directory requirement is making your code more testable. I know that Bioconductor doesn't formally require tests for its packages, but even so they are very useful, and often you can settle architectural decisions about how best to structure your code by how nicely it satisfies tests.

> I am storing some package environmental variables in my package like this:
> myPackage_env <- new.env(parent=emptyenv())

I feel like in general using environmental variables for solving problems in computing is like reaching for the sledgehammer in your toolbox. Certainly, they have many legitimate uses, especially in cluster computing where you have to set up the environment for packages to find each other.

Would it be possible to see a link to some of your package code? Then we can comment more about the paths and environmental variables, and be more specific about alternatives and suggestions.

These are very good questions. A lot of workflows don't lend themselves to being done entirely in memory, can rely on lots of existing files, and require better integration with software outside of the Bioconductor system, and it's fun to learn about how to tackle them.

> -Adam

Pariksheet
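A minimal sketch of the options() fallback suggested in the reply above. The option name myPackage.output_dir and the helper output_dir() are hypothetical names made up for illustration; they are not from Adam's package. The path argument wins when given, otherwise the option is consulted, and tempdir() is the last resort so that examples and tests can run anywhere.

```r
# Hypothetical helper: resolve an output directory from an explicit
# argument, then from a package-wide option, then from tempdir().
output_dir <- function(path = NULL) {
    if (is.null(path))
        path <- getOption("myPackage.output_dir", default = tempdir())
    if (!dir.exists(path))
        dir.create(path, recursive = TRUE)
    path
}

# An explicit path takes precedence over any option.
explicit <- output_dir(file.path(tempdir(), "explicit"))

# With no argument, the option (if set) is used.
old <- options(myPackage.output_dir = file.path(tempdir(), "from-option"))
from_option <- output_dir()
options(old)  # restore the previous (unset) state
```

Because the directory is a parameter, examples no longer need \dontrun{}: they can simply pass a path under tempdir().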
Re: [Bioc-devel] BiocCheck - warning: files are over 5MB
> disk efficient compression algorithm

Whoops, meant to say compression format.

Pariksheet
Re: [Bioc-devel] BiocCheck - warning: files are over 5MB
Hi Claris,

On Sat, Mar 10, 2018 at 2:49 AM, Claris Baby via Bioc-devel <bioc-devel@r-project.org> wrote:
>
> [1] "The following files are over 5MB in size:
> 'dataset/Caenorhabditis_elegans.WBcel235.dna.chromosome.I.fa'."
> This as well as other data like .gff files, that are being used
> for the reference based assembly are all much more than 5mb.
> But the total package size is less than 500mb.

Assuming that's not a typo, 500 MB is very large and inappropriate for a package. It's generally good practice to separate code and data where possible, not least because data bloats code version control. If your package size is close to 500 MB, you should think about stashing the data and accessing it using something like AnnotationHub or BiocFileCache (others on the mailing list might have better and more specific suggestions, as I've not yet had to deal with this particular problem, if you confirm that the package is indeed that big).

> Is it essential that each file within the package is less than
> 5mb. If so, it would be very kind if anyone could suggest how
> to reduce the size of the genomic data files.

Can you gzip compress those data files? Text based files usually compress quite well, and many functions like import() from rtracklayer will automagically decompress them, so you might not even need to change much in your code.

.gz isn't the most disk efficient compression algorithm out there; .bz2 usually compresses better and .xz typically yields even better lossless compression (save() supports all three through its compress argument, with gzip as the default), but for the cross-platform compatibility that Bioconductor strives for, .gz might be best to try first.

> Claris Baby

Pariksheet
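A minimal sketch of the gzip suggestion above, in base R only. The file here is a small made-up FASTA-like file written to tempdir(), not the C. elegans chromosome from the question. It shows that a gzip-compressed text file is much smaller when the content is repetitive and can be read back without a separate decompression step.

```r
# Write a small, highly repetitive FASTA-like file (illustrative data only).
fasta <- tempfile(fileext = ".fa")
writeLines(c(">chrI", strrep("ACGT", 5000)), fasta)

# Write a gzip-compressed copy through a gzfile() connection.
gz <- paste0(fasta, ".gz")
con <- gzfile(gz, "w")
writeLines(readLines(fasta), con)
close(con)

# Compare sizes on disk and read the compressed copy back.
sizes <- file.size(c(fasta, gz))
roundtrip <- readLines(gzfile(gz))
```

Real genomic sequence compresses less dramatically than this toy file, so whether a given file ends up under the 5 MB limit has to be checked case by case.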
Re: [Bioc-devel] parallel processing in R?
Hi Bhakti,

On Mon, Dec 11, 2017 at 12:19 PM, Dwivedi, Bhakti wrote:
> Is there a way to parallelize ConsensusClusterPlus package?
> https://bioconductor.org/packages/release/bioc/html/ConsensusClusterPlus.html
> We are developing a R shiny tool that performs consensus clustering in
> addition to other genome-wide analyses. The consensus clustering step is
> taking the longest.
> Can I do parallel processing in R/R shiny? What parallel package (if any)
> I can implement in the R code to do parallel computing?

The canonical package for parallel computing in Bioconductor is BiocParallel:
https://bioconductor.org/packages/release/bioc/html/BiocParallel.html

One can choose which backend to use for parallelism. You can switch from SerialParam, for cheap and cheerful lapply-like functionality, to MulticoreParam (the default on Linux and macOS) or SnowParam, which can also log useful things like memory usage.

It does not look like ConsensusClusterPlus imports any parallel package of its own that you need to fight against, so the best case scenario is that you find the function of interest you want to run many times and run that function with BiocParallel's bplapply. Or, if there are multiple levels of parallelism, like internal and external looping, then you might have to dive into ConsensusClusterPlus and inject bplapply statements, ideally allowing some bpparam() argument passing for the inner and outer loops.

> Bhakti

Pariksheet
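A rough sketch of the "run that function with bplapply" idea above. cluster_once() is a made-up stand-in for one expensive clustering run (it uses stats::kmeans, not ConsensusClusterPlus), and run_many() is a hypothetical wrapper. The sketch falls back to lapply() when BiocParallel is not installed so that it runs anywhere; that fallback is a convenience for this example and not something the package requires.

```r
# Hypothetical stand-in for one expensive clustering run.
cluster_once <- function(seed, n = 100) {
    set.seed(seed)
    kmeans(matrix(rnorm(n * 2), ncol = 2), centers = 3)$tot.withinss
}

# Hypothetical wrapper: the backend is a parameter, so callers can swap
# SerialParam for MulticoreParam or SnowParam without touching this code.
run_many <- function(seeds, BPPARAM = NULL) {
    if (requireNamespace("BiocParallel", quietly = TRUE)) {
        if (is.null(BPPARAM))
            BPPARAM <- BiocParallel::SerialParam()
        BiocParallel::bplapply(seeds, cluster_once, BPPARAM = BPPARAM)
    } else {
        lapply(seeds, cluster_once)  # plain serial fallback
    }
}

results <- run_many(1:4)
# On Linux or macOS, something like this would use two worker processes:
# results <- run_many(1:4, BPPARAM = BiocParallel::MulticoreParam(workers = 2))
```

Keeping BPPARAM as an argument is what makes it possible to pass different backends to inner and outer loops later on.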
Re: [Bioc-devel] Help with R CMD check NOTEs
Hi Anusha,

On Wed, Oct 18, 2017 at 2:30 PM, Anusha Nagari <anusha.nag...@utsouthwestern.edu> wrote:
>
> Can you please let me know how to go about the following NOTE. Or if this is something that should be really taken care of for a successful package build and install:
>
> * checking re-building of vignette outputs ... NOTE
> Warnings in re-building vignettes:
> Warning: file stem ‘/fig2’ is not portable
> Warning: file stem ‘/fig3’ is not portable
>
> @Pariksheet: I am working on the groHMM package. https://github.com/Kraus-Lab/groHMM

My best guess is that you might want to rename your figure labels to drop the number. LaTeX macro names cannot contain digits, so you could try replacing instances of fig2 and fig3 in the vignette with something like figTwo and figThree.

I couldn't reproduce the build NOTE to confirm the fault / fix. The bioconductor.org 1.10.0 tarball doesn't seem to produce the error, BiocCheck fails, and I had trouble building the vignette from the GitHub master branch.

Hope that helps.

> Anusha

Pariksheet
Re: [Bioc-devel] Help with R CMD check NOTEs
Hi Anusha,

On Wed, Oct 18, 2017 at 12:04 PM, Anusha Nagari <anusha.nag...@utsouthwestern.edu> wrote:
>
> Depends: includes the non-default packages:
> ‘MASS’ ‘parallel’ ‘S4Vectors’ ‘IRanges’ ‘GenomeInfoDb’
> ‘GenomicRanges’ ‘GenomicAlignments’ ‘rtracklayer’
> Adding so many packages to the search path is excessive and importing
> selectively is preferable.

Move those to the "Imports" section in your package DESCRIPTION file.

> * checking re-building of vignette outputs ... NOTE
> Warnings in re-building vignettes:
> Warning: file stem ‘/fig2’ is not portable
> Warning: file stem ‘/fig3’ is not portable

Hmm... I think we'll have to look at the exact vignette to see what's going on. Presumably that's a LaTeX vignette. Can you tell us the name of the package you are working on and/or link to the source code?

> Anusha

Pariksheet

---
Pariksheet Nanda
PhD Candidate in Genetics and Genomics
System Administrator, Storrs HPC Cluster
University of Connecticut
[Bioc-devel] Cannot install Bioconductor packages with biocLite() after loading QuasR
Hi folks,

It looks like loading QuasR breaks biocLite() because it magically wants to use biocLite() in qAlign():

$ find -not -name '*.Rnw' -exec grep -E '(BiocInstaller|biocLite)' {} + 2>/dev/null
./DESCRIPTION: S4Vectors (>= 0.9.25), IRanges, BiocInstaller, Biobase,
./R/qAlign.R: biocLite(genome, suppressUpdates=TRUE, lib=lib.loc)
./NAMESPACE:importFrom(BiocInstaller, biocLite)
$

Here's the error in a fresh R session:

> suppressPackageStartupMessages(library(QuasR))
> BiocInstaller::biocLite("BSgenome.Hsapiens.UCSC.hg38")
Error: failed to update BiocInstaller:
  namespace ‘BiocInstaller’ is imported by ‘QuasR’ so cannot be unloaded
>

What would be a good way to fix this? I think trying to use biocLite() from inside a package is a bit naughty, and installing packages should be left up to the user instead?

Reproducible in R 3.4.1 and a daily build:

> sessionInfo()
R Under development (unstable) (2017-08-01 r73012)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-08-01-jyjbn6hodegfxzvg6aojsdu7fmrdzi3y/rlib/R/lib/libRblas.so
LAPACK: /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-08-01-jyjbn6hodegfxzvg6aojsdu7fmrdzi3y/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] QuasR_1.17.0          Rbowtie_1.17.0        GenomicRanges_1.29.12
[4] GenomeInfoDb_1.13.4   IRanges_2.11.12       S4Vectors_0.15.6
[7] BiocGenerics_0.23.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12               RColorBrewer_1.1-2
 [3] BiocInstaller_1.27.3       compiler_3.5.0
 [5] XVector_0.17.1             prettyunits_1.0.2
 [7] progress_1.1.2             GenomicFeatures_1.29.8
 [9] bitops_1.0-6               GenomicFiles_1.13.10
[11] tools_3.5.0                zlibbioc_1.23.0
[13] biomaRt_2.33.4             digest_0.6.12
[15] bit_1.1-12                 BSgenome_1.45.1
[17] RSQLite_2.0                memoise_1.1.0
[19] tibble_1.3.4               lattice_0.20-35
[21] rlang_0.1.2                Matrix_1.2-11
[23] DelayedArray_0.3.19        DBI_0.7
[25] GenomeInfoDbData_0.99.1    hwriter_1.3.2
[27] stringr_1.2.0              rtracklayer_1.37.3
[29] Biostrings_2.45.4          bit64_0.9-7
[31] grid_3.5.0                 Biobase_2.37.2
[33] R6_2.2.2                   AnnotationDbi_1.39.2
[35] XML_3.98-1.9               BiocParallel_1.11.6
[37] latticeExtra_0.6-28        magrittr_1.5
[39] blob_1.1.0                 Rsamtools_1.29.1
[41] matrixStats_0.52.2         GenomicAlignments_1.13.5
[43] ShortRead_1.35.1           assertthat_0.2.0
[45] SummarizedExperiment_1.7.5 stringi_1.1.5
[47] RCurl_1.95-4.8             VariantAnnotation_1.23.8
>

Pariksheet

---
Pariksheet Nanda
PhD Candidate in Genetics and Genomics
System Administrator, Storrs HPC Cluster
University of Connecticut
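One possible shape for "leave installing up to the user", sketched in base R. The helper require_genome() is hypothetical and is not QuasR code: instead of calling an installer from inside a package function, it checks whether the needed package is available and, if not, stops with a message telling the user what to install.

```r
# Hypothetical helper: fail with an actionable message rather than
# installing anything on the user's behalf.
require_genome <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE))
        stop("Package '", pkg, "' is required but not installed; ",
             "please install it yourself and retry.", call. = FALSE)
    invisible(TRUE)
}

# A package that does not exist triggers the error message...
msg <- tryCatch(require_genome("notARealGenomePackage"),
                error = conditionMessage)

# ...while an installed package passes silently.
ok <- require_genome("stats")
```

With this pattern the installer package would only need to be listed under Suggests, if at all, so loading the package would no longer pin the installer's namespace.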
Re: [Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews
On Fri, Apr 7, 2017 at 1:13 AM, Hervé Pagès wrote:
>
> This is the expected behavior.
>
> Some background: BSgenomeViews are list-like objects where the *list
> elements* (i.e. the elements one extracts with [[) are the DNA
> sequences from the views
--snip--
> The important difference is that with [[ I get a DNAString object
> (the content of the view) and with [ I get a BSgenomeViews object
> of length 1.

Thank you, Hervé! I was failing to make the connection with the `[[` accessor.

On Fri, Apr 7, 2017 at 1:16 AM, Michael Lawrence wrote:
>
> I'm curious as to why you are looping over the views in the first
> place. Maybe we could arrive at a vectorized solution, which is often
> but not always simpler and faster.

Hi Michael! The broad background is that I'm acculturating an undergraduate student to writing a Bioconductor package and applying software engineering practices: version control, unit testing, documenting, dependency setup, validation in a different environment on our university HPC cluster, etc. The student also came along to LibrePlanet to better understand the culture of software freedom :o)

The package goal is to use Biostrings to look for repeating DNA sequences of a fixed kmer size and subset to portions of the genome without repeats (an aligner can do this ofc, but the goal is to teach R and engineering practices).

I appreciate your thoughtfulness in vectorizing the code to best use BSgenomeViews, but please don't spend more than 10 minutes, as I have to balance changes to the code with the student's learning and coding "voice" and may not be able to do justice to more of your effort.

My slowness to reply was from getting the project further along to be more understandable. Here is the line, which I've updated as Hervé suggested to use seq_along():
https://github.com/coregenomics/kmap/blob/4adaed6b8007e8ea39f39ff57a42a821445d3d46/R/BiostringsProjectNEW.R#L185
(I'm having a hard time thinking of how to summarize a small example out of context.) Although in that line ranges_hits() is only operating on single indices, ranges_hits() was written to process groups of indices to reduce multi-processor communication. Generating such sets of indices would involve applying width() to the views inside mappable() to break it into chunks of, say, a million bases for matchPDict(). Again, I'm linking to the code in case anything stands out to you, but I will feel bad if you spend a lot of time on it.

> H.
> Michael

Pariksheet
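The `[[` versus `[` distinction discussed above can be seen with an ordinary named list, used here only as a stand-in for a BSgenomeViews object (the sequences are made-up strings, and a plain list carries names rather than ranges). lapply() over the object walks its elements with `[[`, while looping over seq_along() and subsetting with `[` keeps each piece wrapped in the container, along with its names.

```r
# Plain list standing in for a views object; names play the role of metadata.
views <- list(chr1 = "GCTTCAGCCT", chr2 = "GACCCTCCTG")

# lapply() extracts with [[ : each element arrives as a bare character vector.
by_element <- lapply(views, class)

# Indexing with [ inside seq_along() keeps length-1 lists that retain names.
by_index <- lapply(seq_along(views), function(i) views[i])
```

The same seq_along() pattern applied to a BSgenomeViews gives length-1 BSgenomeViews objects, as Hervé describes, so the view ranges stay attached to each sequence.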
[Bioc-devel] Iterating over BSgenomeViews returns DNAString instead of BSgenomeViews
Hi Bioconductor devs,

The BSgenomeViews class has been very useful in efficiently propagating metadata for running Biostrings operations. I noticed something unexpected when iterating over views: it seems to return the Biostrings object instead of a single length Views object, and thus loses the associated view metadata. Is this intentional? Below is some example code, the output and sessionInfo(). Yes, I also confirmed this happens in the development version of R / Bioconductor 3.5.

On a side note, for unit testing it's been difficult to mock a BSgenome object due to the link to physical files, and as a workaround I use a small, arbitrary BSgenome package. Can one construct a BSgenome from its package bundled extdata? The man page examples use packaged genomes.

Code to reproduce the issue:
--
library(BSgenome)
genome <- getBSgenome("BSgenome.Hsapiens.UCSC.hg19")
gr <- GRanges(c("chr1:25001-28000", "chr2:30001-31000"))
views <- Views(genome, gr)
views
lapply(views, class)
--

Result:
--
> views
BSgenomeViews object with 2 views and 0 metadata columns:
      seqnames         ranges strand                       dna
         <Rle>      <IRanges>  <Rle>            <DNAStringSet>
  [1]     chr1 [25001, 28000]      * [GCTTCAGCCT...TTATTTATTG]
  [2]     chr2 [30001, 31000]      * [GACCCTCCTG...AGCAGGTGGT]
  ---
  seqinfo: 93 sequences (1 circular) from hg19 genome
> lapply(views, class)
[[1]]
[1] "DNAString"
attr(,"package")
[1] "Biostrings"

[[2]]
[1] "DNAString"
attr(,"package")
[1] "Biostrings"

>
--

Tested against these configurations:
1) R 3.3.2 + BSgenome 1.42.0 (stable Bioconductor 3.4)
2) R 2017-04-05 installed via llnl/spack + BSgenome 1.43.7 (devel Bioconductor 3.5)

sessionInfo for configuration #2 above:
--
> sessionInfo()
R Under development (unstable) (2017-04-05 r72488)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRblas.so
LAPACK: /share/apps/spack/opt/spack/linux-ubuntu16-x86_64/gcc-5.4.0/r-2017-04-05-4tkzhsu6sdpwmlvnv275jf6x766gwnpy/rlib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.43.7
 [3] rtracklayer_1.35.10               Biostrings_2.43.7
 [5] XVector_0.15.2                    GenomicRanges_1.27.23
 [7] GenomeInfoDb_1.11.10              IRanges_2.9.19
 [9] S4Vectors_0.13.15                 BiocGenerics_0.21.3

loaded via a namespace (and not attached):
 [1] zlibbioc_1.21.0            GenomicAlignments_1.11.12
 [3] BiocParallel_1.9.5         lattice_0.20-35
 [5] tools_3.5.0                SummarizedExperiment_1.5.7
 [7] grid_3.5.0                 Biobase_2.35.1
 [9] matrixStats_0.52.1         Matrix_1.2-9
[11] GenomeInfoDbData_0.99.0    bitops_1.0-6
[13] RCurl_1.95-4.8             DelayedArray_0.1.7
[15] compiler_3.5.0             Rsamtools_1.27.15
[17] XML_3.98-1.6
> BiocInstaller::biocValid()
[1] TRUE
>

---
Pariksheet Nanda
PhD Candidate in Genetics and Genomics
System Administrator, Storrs HPC Cluster
University of Connecticut
Re: [Bioc-devel] OrganismDb package for Drosophila.melanogaster
On Tue, Nov 15, 2016 at 7:34 PM, Martin Morgan wrote:
> On 11/15/2016 09:52 AM, Obenchain, Valerie wrote:
>> On 11/15/2016 03:32 AM, Pariksheet Nanda wrote:
>>>
>>> It would be great to have an OrganismDb package for
>>> Drosophila.melanogaster, similar to Homo.sapiens, Mus.musculus and
>>> Rattus.norvegicus.
--snip--
>>> In other words, like Rattus.norvegicus, it might be good do add a UCSC
>>> "refGene" TxDb package for dm6 as "ensGene" doesn't appear to be as good of
>>> a candidate (at least without some ugliness)? I was looking at creating a
>>> dm6 UCSC "refGene" TxDb.
>>
>> You can use GenomicFeatures::makeTxDbFromUCSC() to create the TxDb. The
>> man page, ?makeTxDbFromUCSC, also has helper functions that display
>> available genomes, tables and tracks.
>
> I'm not completely sure of the result, but
>
> library(OrganismDb)
> odb = makeOrganismDbFromUCSC("dm6", tableName="refGene")
>
> might be most of the way there?

Thanks Valerie and Martin for pointing out the make*() functions!

As my lab uses the same UCSC tables frequently, I used the make*Package() functions (namely, GenomicFeatures::makeTxDbPackageFromUCSC and OrganismDbi::makeOrganismPackage). For others who run OrganismDbi::makeOrganismPackage, don't forget to edit the generated DESCRIPTION file and add your new TxDb package to "Depends".

>> Valerie
> Martin

Pariksheet
[Bioc-devel] OrganismDb package for Drosophila.melanogaster
Hi folks,

It would be great to have an OrganismDb package for Drosophila.melanogaster, similar to Homo.sapiens, Mus.musculus and Rattus.norvegicus.

While trying to do this on my own using the Homo.sapiens package as a starting point, I found the most similar looking keys to relate org.Dm.eg.db and TxDb.Dmelanogaster.UCSC.dm6.ensGene to be "ENSEMBL" and "GENEID", though there's a ".1" tacked to the end of "GENEID", which makes it harder to supply the graphInfo object to OrganismDbi:::.loadOrganismDbiPkg:

> key_ <- function(db, key) sort(as.character(
+     select(db, keys(db, key), key, key)[[key]]))
> key_head <- function(db, key) head(key_(db, key))
> key_head(TxDb.Dmelanogaster.UCSC.dm6.ensGene, "GENEID")
'select()' returned 1:1 mapping between keys and columns
[1] "FBgn0000003.1" "FBgn0000008.1" "FBgn0000014.1" "FBgn0000015.1"
[5] "FBgn0000017.1" "FBgn0000018.1"
> key_head(org.Dm.eg.db, "ENSEMBL")
[1] "FBgn0000008" "FBgn0000014" "FBgn0000015" "FBgn0000017" "FBgn0000018"
[6] "FBgn0000022"
>

In other words, like Rattus.norvegicus, it might be good to add a UCSC "refGene" TxDb package for dm6, as "ensGene" doesn't appear to be as good a candidate (at least without some ugliness)?

I was looking at creating a dm6 UCSC "refGene" TxDb. I imagine one would query the UCSC public MySQL server and then do the SQLite conversion. The conversion to SQLite seems a bit finicky, though, as the datatypes differ between MySQL and SQLite and I'm having a hard time finding a well supported tool to do it; I don't want to introduce errors or harm reproducibility. What do you use for MySQL to SQLite conversion? Or would it be more sensible for you benevolent dictators to generate the package(s)?

Pariksheet

---
Pariksheet Nanda
PhD Candidate in Genetics and Genomics
System Administrator, Storrs HPC Cluster
University of Connecticut

---
> sessionInfo()
R Under development (unstable) (2016-11-13 r71655)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] Rattus.norvegicus_1.3.1
 [2] TxDb.Rnorvegicus.UCSC.rn5.refGene_3.4.0
 [3] org.Rn.eg.db_3.4.0
 [4] Mus.musculus_1.3.1
 [5] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
 [6] org.Mm.eg.db_3.4.0
 [7] Homo.sapiens_1.3.1
 [8] GO.db_3.4.0
 [9] OrganismDbi_1.17.1
[10] org.Hs.eg.db_3.4.0
[11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[12] org.Dm.eg.db_3.4.0
[13] TxDb.Dmelanogaster.UCSC.dm6.ensGene_3.3.0
[14] GenomicFeatures_1.27.2
[15] AnnotationDbi_1.37.0
[16] Biobase_2.35.0
[17] GenomicRanges_1.27.6
[18] GenomeInfoDb_1.11.4
[19] IRanges_2.9.8
[20] S4Vectors_0.13.2
[21] BiocGenerics_0.21.0
[22] BiocInstaller_1.25.2

loaded via a namespace (and not attached):
 [1] compiler_3.4.0             XVector_0.15.0
 [3] bitops_1.0-6               tools_3.4.0
 [5] zlibbioc_1.21.0            biomaRt_2.31.1
 [7] RSQLite_1.0.0              lattice_0.20-34
 [9] Matrix_1.2-7.1             graph_1.53.0
[11] DBI_0.5-1                  rtracklayer_1.35.1
[13] Biostrings_2.43.0          grid_3.4.0
[15] XML_3.98-1.5               RBGL_1.51.0
[17] BiocParallel_1.9.1         Rsamtools_1.27.2
[19] GenomicAlignments_1.11.1   SummarizedExperiment_1.5.3
[21] RCurl_1.95-4.8
>
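For the ".1" suffix problem described above, one piece of the "ugliness" would be stripping the trailing version before comparing keys. The sketch below uses made-up identifiers in FlyBase style rather than real query results, and it only shows the string handling in base R; it does not touch the TxDb or OrganismDbi objects themselves.

```r
# Illustrative identifiers only (FlyBase-style, not real query output).
geneid  <- c("FBgn1234567.1", "FBgn1234568.1", "FBgn1234569.2")
ensembl <- c("FBgn1234568", "FBgn1234569", "FBgn1234570")

# Drop a trailing ".<digits>" version suffix, if present.
stripped <- sub("\\.[0-9]+$", "", geneid)

# Keys that can be related once the suffix is removed.
shared <- intersect(stripped, ensembl)
```

Whether dropping the version is safe for a given annotation release is a separate question that this sketch does not answer.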