Re: [Bioc-devel] From Biostring matching to short read mapping

2019-11-08 Thread Pages, Herve
Hi Aditya,

It should not be too hard to parallelize, with one gotcha: using one 
worker per chromosome (which is the easy way to go) wouldn't be optimal 
because of the size differences between the chromosomes. A better 
approach is to give each worker about the same amount of work by 
splitting the set of chromosomes into groups of more or less equal size.
The split can either preserve full chromosomes or break them into smaller 
pieces. The latter allows using a lot more workers than the former.
I'll try to come up with some code that I'll share here.
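
[Illustrative sketch, not code posted in this thread.] The two splitting strategies described above can be sketched roughly as follows. This is plain Python with no Bioconductor dependency, so it only shows the balancing logic, not the matching itself. The function names `group_by_size` and `tile_with_overlap` and the toy chromosome lengths are hypothetical. In R, the lengths would presumably come from `seqlengths()` on the genome object, and each group would be handed to a worker with something like `BiocParallel::bplapply()`.

```python
import heapq


def group_by_size(seqlengths, n_workers):
    """Assign whole chromosomes to n_workers groups of roughly equal total length."""
    # Min-heap of (total length assigned so far, group index).
    heap = [(0, i) for i in range(n_workers)]
    groups = [[] for _ in range(n_workers)]
    # Longest chromosome first, each one into the currently lightest group.
    for name, length in sorted(seqlengths.items(), key=lambda kv: -kv[1]):
        total, i = heapq.heappop(heap)
        groups[i].append(name)
        heapq.heappush(heap, (total + length, i))
    return groups


def tile_with_overlap(seqlengths, tile_width, pattern_width):
    """Break chromosomes into (name, start, end) tiles, 1-based and inclusive.

    Consecutive tiles overlap by pattern_width - 1 bases, so every window of
    pattern_width bases lies entirely inside exactly one tile.
    """
    if tile_width < pattern_width:
        raise ValueError("tile_width must be at least pattern_width")
    step = tile_width - pattern_width + 1
    tiles = []
    for name, length in seqlengths.items():
        # Stop once a tile could no longer hold a full-width pattern.
        for start in range(1, length - pattern_width + 2, step):
            tiles.append((name, start, min(start + tile_width - 1, length)))
    return tiles


# Toy lengths (made up, not real chromosome sizes).
toy_lengths = {"chr1": 248, "chr2": 242, "chr3": 198, "chr4": 190,
               "chrX": 156, "chr21": 46, "chrM": 1}
print(group_by_size(toy_lengths, n_workers=3))
print(tile_with_overlap({"chrA": 100}, tile_width=50, pattern_width=23))
```

The overlap in the second function matters when chromosomes are broken into pieces: without it, a 23-bp match that straddles a tile boundary would be missed. Match positions found within a tile would then need to be shifted by the tile start minus 1 to get chromosome coordinates.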

BTW the *PDict() family in Biostrings is for finding the matches of a 
collection of patterns. You say you want to find "all genomic 
(mis)matches of a 23-bp candidate Cas9 sequence". Any reason you're not 
using vmatchPattern() (or vcountPattern()) for that?

Cheers,
H.


On 11/7/19 02:11, Bhagwat, Aditya wrote:
> Dear bioc-devel,
> 
> multicrispr provides 
> functions for Crispr/Cas9 gRNA design (and is being prepared for BioC). 
> One task involves finding all genomic (mis)matches of a 23-bp candidate 
> Cas9 sequence. Currently this is done with `Biostrings::vcountPDict`, an 
> approach that is successful, though not fast. An alternative would be to 
> switch to short read mapping rather than (Bio)string matching, which 
> involves a one-time indexing effort, but subsequent fast alignment.
> 
> `Rsubread::align` seems to be limited to max. 16 `nBestLocations`, 
> whereas I know from vcountPDict that some Cas9 candidates have hundreds 
> of genomic matches.
> 
> `QuasR::qAlign` (connecting to Bowtie) does not mention an upper limit 
> on `maxHits`.
> 
> Feedback request…
> 
> Michael, would QuasR/(R)bowtie be a good approach to do this?
> 
> Wei, did I overlook a way to do this with Rsubread?
> 
> Herve, is there an elegant way to speed up vcountPDict (parallelize?)
> 
> Thank you :-)
> 
> Aditya
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


[Bioc-devel] Devel annotations

2019-11-08 Thread Shepherd, Lori
As part of our post-release tasks, we are branching the annotation repository. 
The devel annotations will be unavailable for the next 1-2 hours.
Please do not be alarmed if you see things like:

cannot open the connection to 
'https://bioconductor.org/packages/3.11/data/annotation/src/contrib/PACKAGES'

It should be back online shortly.

Cheers,


Lori Shepherd
Bioconductor Core Team
Roswell Park Comprehensive Cancer Center
Department of Biostatistics & Bioinformatics
Elm & Carlton Streets
Buffalo, New York 14263


