Re: [R] RSQLite slowness
Dear Martin Morgan,

Thanks for all those links! Yes, I think my question can be characterized like that: the traditional way of writing a temporary table into the database and LEFT JOINing the others, versus a parameterized query. A relevant example would be to first create the database from the compressed tsv files:

	for i in gene_info gene2refseq; do
		wget -c https://ftp.ncbi.nlm.nih.gov/gene/DATA/$i.gz
		gzip -d $i.gz
		sqlite3 gene_info.sqlite ".mode tabs" ".import $i $i"
		rm $i
	done

then run this R code:

	H <- data.frame(
	  Group = c(1, 1, 2, 2),
	  NM = c("NM_001267695", "NM_001007636",
	         "NM_001003706", "NM_001353612"))
	conn <- DBI::dbConnect(RSQLite::SQLite(), "gene_info.sqlite")
	DBI::dbWriteTable(conn, "H", H, overwrite=TRUE)

	statement.1 <- "SELECT * FROM gene2refseq R
	  LEFT JOIN gene_info I ON I.GeneID = R.GeneID
	  WHERE R.`RNA_nucleotide_accession.version`
	  LIKE '%' || ? || '%'"
	time.1 <- proc.time()
	x1 <- DBI::dbGetQuery(conn=conn,
	  statement=statement.1,
	  param=list(H$NM))
	time.1 <- proc.time() - time.1

	statement.2 <- "SELECT * FROM H
	  LEFT JOIN gene2refseq R
	  ON R.`RNA_nucleotide_accession.version`
	  LIKE '%' || H.NM || '%'
	  LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
	time.2 <- proc.time()
	x2 <- DBI::dbGetQuery(conn=conn, statement=statement.2)
	time.2 <- proc.time() - time.2

	DBI::dbDisconnect(conn)
	saveRDS(object=x1, file="ex1_x1.rds", compress="xz")
	saveRDS(object=x2, file="ex1_x2.rds", compress="xz")
	saveRDS(object=list("Time x1"=list(time.1),
	                    "Time x2"=list(time.2)),
	        file="ex1_t.rds", compress="xz")

I got these timings in the ex1_t.rds file:

	$`Time x1`
	   user  system elapsed
	571.731 182.006 772.199

	$`Time x2`
	   user  system elapsed
	200.068  90.529 295.086

As you can see, statement.1 takes a lot longer to process than statement.2 ... When I add the rest of the 31 search terms, the difference gets much bigger, as I pointed out initially: more than a full hour versus only a few minutes.
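The two strategies timed above can be reproduced self-contained on a toy database. Here is a minimal sketch using Python's standard-library sqlite3 module (a stand-in for the R/DBI code in this thread, since SQLite behaves the same through any driver; the table contents are made-up toy values, not the real NCBI data):

```python
# Minimal sketch of the two query strategies on a toy database, using
# Python's built-in sqlite3 module (a stand-in for the R/DBI code above;
# table contents are made-up toy values, not the real NCBI schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene2refseq (GeneID INTEGER, acc TEXT)")
conn.executemany("INSERT INTO gene2refseq VALUES (?, ?)",
                 [(i, "NM_%09d.1" % i) for i in range(1000)])
terms = ["NM_000000007", "NM_000000042"]

# Strategy 1 ("from the outside"): bind each search term separately,
# which issues one LIKE query per term.
hits1 = []
for t in terms:
    hits1 += conn.execute(
        "SELECT GeneID FROM gene2refseq WHERE acc LIKE '%' || ? || '%'",
        (t,)).fetchall()

# Strategy 2 ("from the inside"): load the terms into a table once and
# let SQLite resolve everything in a single JOIN.
conn.execute("CREATE TEMP TABLE H (NM TEXT)")
conn.executemany("INSERT INTO H VALUES (?)", [(t,) for t in terms])
hits2 = conn.execute(
    "SELECT R.GeneID FROM H "
    "LEFT JOIN gene2refseq R ON R.acc LIKE '%' || H.NM || '%'").fetchall()

print(sorted(h[0] for h in hits1))  # [7, 42]
print(sorted(h[0] for h in hits1) == sorted(h[0] for h in hits2))  # True
```

Both strategies return the same rows; the difference is only in how many statements are executed, which is what the timings above measure at scale.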
Best,
Rasmus

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] RSQLite slowness
On 2021-10-06 12:11 -0700, Jeff Newmiller wrote:

> FWIW all SQL implementations work
> better with indexes

An index seems to be a good way to improve SQL performance; I'll look into it.

Best,
Rasmus
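One caveat about indexes for this particular thread: an ordinary index helps exact and prefix lookups, but a LIKE pattern with a leading '%', as used in the queries above, cannot use one, so indexing alone may not speed these queries up. A minimal Python sqlite3 sketch (a made-up toy table, not the real NCBI data) showing the two query plans:

```python
# Sketch: an index speeds up exact lookups, but SQLite's planner cannot
# use it for a leading-wildcard LIKE. Toy table with made-up values.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene2refseq (GeneID INTEGER, acc TEXT)")
conn.executemany("INSERT INTO gene2refseq VALUES (?, ?)",
                 [(i, "NM_%09d.1" % i) for i in range(1000)])
conn.execute("CREATE INDEX idx_acc ON gene2refseq(acc)")

# Exact match: the query plan mentions the index.
plan_exact = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM gene2refseq "
    "WHERE acc = 'NM_000000007.1'").fetchall()

# Leading-wildcard LIKE: the planner falls back to a full table scan.
plan_like = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM gene2refseq "
    "WHERE acc LIKE '%NM_000000007%'").fetchall()

# The last column of each plan row is a human-readable description.
print(any("idx_acc" in row[-1] for row in plan_exact))
print(any("idx_acc" in row[-1] for row in plan_like))
```

So for the substring searches in this thread, the temporary-table JOIN still scans the big table, but it does so once instead of once per term.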
Re: [R] RSQLite slowness
Dear Ivan,

Thanks for that explanation! I think it explains the slowness clearly. It is possible to use carray in Rust [1], so it might become available in R in the future(?) I'll look into rusqlite some time, at least.

SQLite is supposed to be one of the fastest SQL implementations. Realm [2] might be faster, but I don't know how to use it from R ... I thought someone might suggest Redis [3], but I think it is way slower. SQLite with indexes and the traditional way of writing the search terms into a temporary table might just be it ...

Best,
Rasmus

[1] https://docs.rs/rusqlite/0.24.2/rusqlite/vtab/array/index.html
[2] https://en.wikipedia.org/wiki/Realm_(database)
[3] https://en.wikipedia.org/wiki/Redis
Re: [R] RSQLite slowness
On Wed, 6 Oct 2021 16:23:15, Rasmus Liland wrote:

> "SELECT * FROM gene2refseq
>  LEFT JOIN gene_info ON
>  gene_info.GeneID = gene2refseq.GeneID
>  WHERE gene2refseq.`RNA_nucleotide_accession.version`
>  LIKE ?"
<...>
> x1 <- DBI::dbGetQuery(conn=conn,
>   statement=statement,
>   param=list(Håkan20210914$RNANucleotideAccession))

I think that the problem here is that you pass a vector as a bound parameter to LIKE, when parameter placeholders usually expect a scalar. DBI transparently handles this:

>> The elements of the `params` argument do not need to be scalars,
>> vectors of arbitrary length (including length 0) are supported. For
>> queries, calling dbFetch() binding such parameters returns
>> concatenated results, equivalent to binding and fetching for each
>> set of values and connecting via rbind().

I think this means that DBI runs a SELECT for each value in Håkan20210914$RNANucleotideAccession, which is understandably slower than a single query.

Unfortunately, it's hard to pass vectors of values to queries with bound parameters; the SQL engines I know don't have a syntax for "WHERE param IN (:multi_placeholder:)". SQLite comes with carray [1], but I don't know whether it's exposed by RSQLite (could be hard to do in a pointer-safe way), and you're already aware of the traditional way of doing that: create a temporary table, populate it and JOIN with the rest of the query.

--
Best regards,
Ivan

[1] https://www.sqlite.org/carray.html
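For exact matches (unlike the substring LIKE patterns used elsewhere in this thread), one common workaround is to splice the right number of placeholders into an IN (...) list while keeping the values themselves as bound parameters. A minimal Python sqlite3 sketch, with a made-up toy table rather than the real NCBI data:

```python
# Sketch: since SQL has no vector placeholder, generate one "?" per
# value and build the IN (...) list; only the placeholder string is
# spliced into the SQL text, the values stay safely bound.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene2refseq (GeneID INTEGER, acc TEXT)")
conn.executemany("INSERT INTO gene2refseq VALUES (?, ?)",
                 [(1, "NM_001267695.2"),
                  (2, "NM_001007636.1"),
                  (3, "NM_001003706.3")])

terms = ["NM_001267695.2", "NM_001007636.1"]
placeholders = ",".join(["?"] * len(terms))  # "?,?"
rows = conn.execute(
    "SELECT GeneID FROM gene2refseq WHERE acc IN (%s)" % placeholders,
    terms).fetchall()
print(sorted(r[0] for r in rows))  # [1, 2]
```

Note that IN does exact matching, so this needs the full versioned accession; it does not replicate the '%term%' substring behaviour, for which the temporary-table JOIN remains the practical option.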
Re: [R] RSQLite slowness
Since the sqlite package is contributed, it is NOT related to "core R", and is in fact technically off-topic on this list.

FWIW all SQL implementations work better with indexes, but AFAIK the R data frame support does nothing with indexes. This may be related to your question, or not. I am not a regular sqlite user.

As for fast reading of tsv files, I think the arrow, readr, and data.table packages all offer high-performance import functions that could be relevant.

On October 6, 2021 11:49:55 AM PDT, Rasmus Liland wrote:

> Thank you Bert, I set up a new thread on
> BioStars [1]. So far, I'm a bit
> unfamiliar with Bioconductor (but will
> hopefully attend a course about it in
> November, which I'm kinda hyped about),
> other than installing and updating R
> packages using BiocManager. Did you
> think of something else than
> BioStars.org when saying «Bioconductor»?
>
> The question could be viewed as gene
> related, but I think it is really about
> how one can handle large tsv files more
> easily than with sqlite, and why that
> parser thing is so slow ... I think this
> is more a core R thing than a
> gene-related question ...
>
> [1] https://www.biostars.org/p/9492486/

--
Sent from my phone. Please excuse my brevity.
Re: [R] RSQLite slowness
https://support.bioconductor.org and the community slack (sign up at https://bioc-community.herokuapp.com/) as well as the general site https://bioconductor.org .

Actually, your question sounds like a SQLite question: JOIN a table versus parameterized query. One could perhaps construct the relevant example at the sqlite command line?

Martin Morgan

On 10/6/21, 2:50 PM, "R-help" wrote:

    Thank you Bert, I set up a new thread on BioStars [1]. So far, I'm
    a bit unfamiliar with Bioconductor (but will hopefully attend a
    course about it in November, which I'm kinda hyped about), other
    than installing and updating R packages using BiocManager. Did you
    think of something else than BioStars.org when saying
    «Bioconductor»?

    The question could be viewed as gene related, but I think it is
    really about how one can handle large tsv files more easily than
    with sqlite, and why that parser thing is so slow ... I think this
    is more a core R thing than a gene-related question ...

    [1] https://www.biostars.org/p/9492486/
Re: [R] RSQLite slowness
Thank you Bert, I set up a new thread on BioStars [1]. So far, I'm a bit unfamiliar with Bioconductor (but will hopefully attend a course about it in November, which I'm kinda hyped about), other than installing and updating R packages using BiocManager. Did you think of something else than BioStars.org when saying «Bioconductor»?

The question could be viewed as gene related, but I think it is really about how one can handle large tsv files more easily than with sqlite, and why that parser thing is so slow ... I think this is more a core R thing than a gene-related question ...

[1] https://www.biostars.org/p/9492486/
Re: [R] RSQLite slowness
Generally, such gene-related questions are better asked on Bioconductor than here. They also might know of more efficient, purpose-built tools for your efforts there. No guarantees, of course, and you might get a helpful response here. But if not ...

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Wed, Oct 6, 2021 at 9:23 AM Rasmus Liland wrote:

> Dear r-help readers,
>
> why is it so much slower to query an
> sqlite database using RSQLite «from the
> outside» using param, like
>
>	statement <-
>	  "SELECT * FROM gene2refseq
>	   LEFT JOIN gene_info ON
>	   gene_info.GeneID = gene2refseq.GeneID
>	   WHERE gene2refseq.`RNA_nucleotide_accession.version`
>	   LIKE ?"
>	db <- "gene_info.sqlite"
>	conn <- DBI::dbConnect(RSQLite::SQLite(), db)
>	x1 <- DBI::dbGetQuery(conn=conn,
>	  statement=statement,
>	  param=list(Håkan20210914$RNANucleotideAccession))
>
> compared to querying «from the inside»
> of sqlite, by writing your search terms
> as a table first, and then calling it:
>
>	statement <-
>	  "SELECT * FROM H
>	   LEFT JOIN gene2refseq R ON
>	   R.`RNA_nucleotide_accession.version`
>	   LIKE '%' || H.RNANucleotideAccession || '%'
>	   LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
>	DBI::dbWriteTable(conn, "H", Håkan20210914)
>	x2 <- DBI::dbGetQuery(conn=conn, statement=statement)
>	DBI::dbDisconnect(conn)
>
> On my system (E5-2603 v4), the first
> query took more than an hour, while the
> second took only a few minutes ...
>
> Do you guys know of any faster (but also
> nice) way to dig around in very large
> tsv files like
> https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
> and
> https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz ?
>
> Best,
> Rasmus
[R] RSQLite slowness
Dear r-help readers,

why is it so much slower to query an sqlite database using RSQLite «from the outside» using param, like

	statement <-
	  "SELECT * FROM gene2refseq
	   LEFT JOIN gene_info ON
	   gene_info.GeneID = gene2refseq.GeneID
	   WHERE gene2refseq.`RNA_nucleotide_accession.version`
	   LIKE ?"
	db <- "gene_info.sqlite"
	conn <- DBI::dbConnect(RSQLite::SQLite(), db)
	x1 <- DBI::dbGetQuery(conn=conn,
	  statement=statement,
	  param=list(Håkan20210914$RNANucleotideAccession))

compared to querying «from the inside» of sqlite, by writing your search terms as a table first and then calling it:

	statement <-
	  "SELECT * FROM H
	   LEFT JOIN gene2refseq R ON
	   R.`RNA_nucleotide_accession.version`
	   LIKE '%' || H.RNANucleotideAccession || '%'
	   LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
	DBI::dbWriteTable(conn, "H", Håkan20210914)
	x2 <- DBI::dbGetQuery(conn=conn, statement=statement)
	DBI::dbDisconnect(conn)

On my system (E5-2603 v4), the first query took more than an hour, while the second took only a few minutes ...

Do you guys know of any faster (but also nice) way to dig around in very large tsv files like https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz and https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz ?

Best,
Rasmus