Re: [Bioc-devel] From Biostring matching to short read mapping

2019-11-11 Thread Bhagwat, Aditya
True :-)

From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Éric Fournier 
[fournier.eri...@crchudequebec.ulaval.ca]
Sent: Saturday, November 09, 2019 5:12 PM
To: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] From Biostring matching to short read mapping

Hi,

it might be worthwhile to note that the concern about different chromosome 
sizes only applies if you have more workers than chromosomes. If you're running 
on 2-8 threads, the longer chromosome might hold up a thread while another 
processes two short ones.

Cheers,
-Eric

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] From Biostring matching to short read mapping

2019-11-10 Thread Bhagwat, Aditya
Thank you Wei,

I actually love Rsubread, and use it with much appreciation on RNAseq projects, 
thank you for its creation :-)

Cheers,

Aditya



From: Wei Shi [s...@wehi.edu.au]
Sent: Saturday, November 09, 2019 12:02 PM
To: Bhagwat, Aditya; Pages, Herve; bioc-devel@r-project.org
Cc: Michael Stadler (michael.stad...@fmi.ch)
Subject: Re: From Biostring matching to short read mapping

Hi Aditya,

Yes, you are correct that Subread reports no more than 16 alignments per read. 
One reason for this limitation is that Subread detects indels in the read 
(Bowtie does not detect indels), so it has to limit the number of 
candidate locations it considers to keep the computational cost manageable.

Thanks for considering Subread and good luck with your project.

Wei


Re: [Bioc-devel] From Biostring matching to short read mapping

2019-11-09 Thread Pages, Herve
On 11/9/19 08:12, Éric Fournier wrote:
> Hi,
> 
> it might be worthwhile to note that the concern about different chromosome 
> sizes only applies if you have more workers than chromosomes. If you're 
> running on 2-8 threads, the longer chromosome might hold up a thread while 
> another processes two short ones.

Exactly. This is why a naive 'bplapply(seq_along(chromosomes), ...)' 
strategy might be ok when a small number of workers is used but won't 
scale well if we want to use dozens of workers. A good parallelization 
strategy should be able to break down big chromosomes into smaller pieces.
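
To make the "smaller pieces" idea concrete, here is a minimal sketch in plain Python (not Biostrings or BiocParallel code). The function names, the tile width and the toy sequence are all hypothetical. The one point it tries to show: when a chromosome is cut into windows that are searched independently, each window should be extended by (pattern width - 1) bases, otherwise matches that straddle a cut are lost. With that extension every match start belongs to exactly one window, so no de-duplication is needed. The same overlap should be enough for fixed-width matching with mismatches, as long as no indels are allowed.

```python
def tile_sequence(seq_length, tile_width, pattern_width):
    """Split a sequence of seq_length bases into windows for independent searching.

    Each window "owns" the match start positions [start, start + tile_width)
    and is extended by pattern_width - 1 bases, so that a match starting near
    the end of the window is still fully contained in it.
    Returns (start, end) pairs, 0-based, end exclusive.
    """
    if tile_width <= 0 or pattern_width <= 0:
        raise ValueError("tile_width and pattern_width must be positive")
    tiles = []
    for start in range(0, seq_length, tile_width):
        end = min(start + tile_width + pattern_width - 1, seq_length)
        tiles.append((start, end))
    return tiles


def find_starts(seq, pattern):
    """Naive exact matching: all 0-based start positions of pattern in seq."""
    w = len(pattern)
    return [i for i in range(len(seq) - w + 1) if seq[i:i + w] == pattern]


def find_starts_tiled(seq, pattern, tile_width):
    """Search each window separately (each could go to its own worker) and merge."""
    hits = []
    for start, end in tile_sequence(len(seq), tile_width, len(pattern)):
        # Shift window-relative positions back to whole-sequence coordinates.
        hits.extend(start + i for i in find_starts(seq[start:end], pattern))
    return hits


toy_seq = "ACGTACGTTACGT"  # hypothetical toy "chromosome"
tiled_hits = find_starts_tiled(toy_seq, "ACGT", tile_width=3)
whole_hits = find_starts(toy_seq, "ACGT")
```

On the toy sequence, the tiled search and the whole-sequence search should return the same start positions, whatever tile width is chosen.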

H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


Re: [Bioc-devel] From Biostring matching to short read mapping

2019-11-09 Thread Éric Fournier
Hi,

it might be worthwhile to note that the concern about different chromosome 
sizes only applies if you have more workers than chromosomes. If you're running 
on 2-8 threads, the longer chromosome might hold up a thread while another 
processes two short ones.
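
A toy simulation of this point, as a sketch only: it assumes tasks are handed out one at a time to whichever worker becomes free first, which may or may not match how a given parallel backend schedules work. The chromosome sizes are made-up numbers in arbitrary units, and `makespan` is a hypothetical helper name.

```python
import heapq


def makespan(task_sizes, n_workers):
    """Total run time when tasks are dispatched, in the given order,
    to whichever worker becomes free first."""
    workers = [0] * n_workers  # time at which each worker becomes free
    heapq.heapify(workers)
    for size in task_sizes:
        free_at = heapq.heappop(workers)          # earliest available worker
        heapq.heappush(workers, free_at + size)   # it is busy until free_at + size
    return max(workers)


toy_sizes = [100, 80, 60, 40, 20, 20]  # hypothetical chromosome sizes, largest first
total = sum(toy_sizes)

for n in (2, 3, 6, 12):
    # Compare the simulated run time with the ideal of total / n.
    print(n, makespan(toy_sizes, n), round(total / n, 1))
```

With few workers the run time stays close to the ideal total / n, because short chromosomes fill the gaps. Once there are as many workers as chromosomes, the run time cannot drop below the largest chromosome, however many workers are added.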

Cheers,
-Eric


Re: [Bioc-devel] From Biostring matching to short read mapping

2019-11-09 Thread Wei Shi
Hi Aditya,

Yes, you are correct that Subread reports no more than 16 alignments per read. 
One reason for this limitation is that Subread detects indels in the read 
(Bowtie does not detect indels), so it has to limit the number of 
candidate locations it considers to keep the computational cost manageable.

Thanks for considering Subread and good luck with your project.

Wei


Re: [Bioc-devel] From Biostring matching to short read mapping

2019-11-09 Thread Bhagwat, Aditya
Thank you Michael, I got Rbowtie working and am now wrapping it in functions for 
use within multicrispr. I noticed that in QuasR you actually create a package 
with Bowtie indices, which you then reuse later. Interesting workflow; I think I 
will make use of that functionality.

Thank you Herve. Yes, parallelizing would speed things up. I use `vcountPDict` 
because I want to do the off-target analysis for a set of 23-bp Cas9 sites. 
I thought vcountPDict must be more efficient than looping, but maybe only 
marginally so: I noticed there's an sapply underlying vcountPDict. Is there a 
BSgenome way to parallelize, like a parallel bsapply or so?
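
To spell out what is being computed here, a plain-Python stand-in (not Biostrings code, and far slower than anything one would run on a genome): for a set of patterns and a set of sequences, count the windows that are within a given number of substitutions of each pattern. As I understand it, vcountPDict returns the same shape of result, one row per pattern and one column per subject sequence. The function names and toy sequences below are hypothetical.

```python
def count_with_mismatches(pattern, subject, max_mismatch=0):
    """Count the windows of subject that differ from pattern
    by at most max_mismatch substitutions (no indels)."""
    w = len(pattern)
    count = 0
    for i in range(len(subject) - w + 1):
        mismatches = sum(1 for a, b in zip(pattern, subject[i:i + w]) if a != b)
        if mismatches <= max_mismatch:
            count += 1
    return count


def count_matrix(patterns, subjects, max_mismatch=0):
    """One row per pattern, one column per subject sequence."""
    return [[count_with_mismatches(p, s, max_mismatch) for s in subjects]
            for p in patterns]


toy_patterns = ["ACG", "TTT"]        # hypothetical stand-ins for 23-bp Cas9 sites
toy_subjects = ["ACGACG", "ATGTTT"]  # hypothetical stand-ins for chromosomes

exact = count_matrix(toy_patterns, toy_subjects)                       # [[2, 0], [0, 1]]
one_mismatch = count_matrix(toy_patterns, toy_subjects, max_mismatch=1)  # [[2, 1], [0, 3]]
```

Each cell of the matrix is independent of the others, so the columns (or pieces of columns) are natural units to hand to separate workers.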

And I concluded that Rsubread is really limited to a small number of 
co-alignments, and so is not suited for off-target analysis.

Cheers,

Aditya


Re: [Bioc-devel] From Biostring matching to short read mapping

2019-11-08 Thread Pages, Herve
Hi Aditya,

Should not be too hard to parallelize, with some gotchas: using one 
worker per chromosome (which is the easy way to go) wouldn't be optimal 
because of the size differences between the chromosomes. So a better 
approach is to try to give each worker the same amount of work by 
splitting the set of chromosomes into groups of more or less equal size.
The split can either preserve full chromosomes or break them into smaller 
pieces. The latter will allow using a lot more workers than the former.
I'll try to come up with some code that I'll share here.
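
As a rough illustration of the grouping idea only (a plain-Python sketch under assumptions, not the Biostrings-based code promised above): one simple way to form groups of roughly equal total size while keeping chromosomes whole is a greedy pass that takes the longest sequences first and always adds to the currently lightest group. The function name and the chromosome lengths are hypothetical, and a greedy pass is not guaranteed to find the best possible split.

```python
import heapq


def group_by_size(lengths, n_groups):
    """Greedily assign sequences to n_groups groups of roughly equal total length.

    lengths: dict mapping sequence name -> length.
    Returns a list of (total_length, [names]) tuples, one per group.
    """
    # Heap of (current total, group index, members); the lightest group is on top.
    # The unique group index breaks ties, so the member lists are never compared.
    heap = [(0, i, []) for i in range(n_groups)]
    heapq.heapify(heap)
    # Longest sequences first; each goes into the currently lightest group.
    for name, length in sorted(lengths.items(), key=lambda kv: kv[1], reverse=True):
        total, idx, members = heapq.heappop(heap)
        members.append(name)
        heapq.heappush(heap, (total + length, idx, members))
    return [(total, members) for total, _, members in sorted(heap, key=lambda g: g[1])]


# Hypothetical lengths in arbitrary units, not real chromosome sizes.
toy_lengths = {"chrA": 100, "chrB": 80, "chrC": 60, "chrD": 40, "chrE": 20, "chrF": 20}
groups = group_by_size(toy_lengths, n_groups=3)
```

Each group would then be handed to one worker. Since chromosomes are kept whole, the number of useful workers is capped by the number of chromosomes and no group can be lighter than the largest chromosome; breaking chromosomes into pieces lifts both limits.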

BTW the *PDict() family in Biostrings is for finding the matches of a 
collection of patterns. You say you want to find "all genomic 
(mis)matches of a 23-bp candidate Cas9 sequence". Any reason you're not 
using vmatchPattern() (or vcountPattern()) for that?

Cheers,
H.


On 11/7/19 02:11, Bhagwat, Aditya wrote:
> Dear bioc-devel,
> 
> multicrispr 
> <https://gitlab.gwdg.de/loosolab/software/multicrispr> provides 
> functions for CRISPR/Cas9 gRNA design (and is being prepared for BioC). 
> One task involves finding all genomic (mis)matches of a 23-bp candidate 
> Cas9 sequence. Currently this is done with `Biostrings::vcountPDict`, an 
> approach that is successful, though not fast. An alternative would be to 
> switch to short read mapping rather than (Bio)string matching, which 
> involves a one-time indexing effort, but subsequent fast alignment.
> 
> `Rsubread::align` seems to be limited to max. 16 `nBestLocations`, 
> whereas I know from vcountPDict that some Cas9 candidates have hundreds 
> of genomic matches.
> 
> `QuasR::qAlign` (connecting to Bowtie) does not mention an upper limit 
> on `maxHits`.
> 
> Feedback request…
> 
> Michael, would QuasR/(R)bowtie be a good approach to do this?
> 
> Wei, did I overlook a way to do this with Rsubread?
> 
> Herve, is there an elegant way to speed up vcountPDict (parallelize?)
> 
> Thank you :-)
> 
> Aditya
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
