Re: [R] RSQLite slowness

2021-10-07 Thread Rasmus Liland
Dear Martin Morgan,

Thanks for all those links!  Yes, I 
think my question can be characterized 
like that: the traditional way of 
writing a temporary table into the 
database and LEFT JOINing the others, 
versus a parameterized query. 

A relevant example would be to first 
create the database from the compressed 
tsv files:

for i in gene_info gene2refseq; do
  wget -c https://ftp.ncbi.nlm.nih.gov/gene/DATA/$i.gz
  gzip -d $i.gz
  sqlite3 gene_info.sqlite ".mode tabs" ".import $i $i"
  rm $i; done

then run this R code:

H <- data.frame(Group = c(1, 1, 2, 2),
  NM = c("NM_001267695", "NM_001007636",
 "NM_001003706", "NM_001353612"))
conn <- DBI::dbConnect(RSQLite::SQLite(), "gene_info.sqlite")
DBI::dbWriteTable(conn, "H", H, overwrite=TRUE)

statement.1 <-
 "SELECT * FROM gene2refseq R
  LEFT JOIN gene_info I ON I.GeneID = R.GeneID
  WHERE R.`RNA_nucleotide_accession.version`
LIKE '%' || ? || '%'"
time.1 <- proc.time()
x1 <- DBI::dbGetQuery(
  conn=conn,
  statement=statement.1,
  param=list(H$NM))
time.1 <- proc.time() - time.1

statement.2 <-
 "SELECT * FROM H
  LEFT JOIN gene2refseq R ON
R.`RNA_nucleotide_accession.version`
LIKE '%' || H.NM || '%'
  LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
time.2 <- proc.time()
x2 <- DBI::dbGetQuery(
  conn=conn,
  statement=statement.2)
time.2 <- proc.time() - time.2

DBI::dbDisconnect(conn)

saveRDS(object=x1, file="ex1_x1.rds", compress="xz")
saveRDS(object=x2, file="ex1_x2.rds", compress="xz")
saveRDS(object=list("Time x1"=list(time.1),
"Time x2"=list(time.2)),
file="ex1_t.rds", compress="xz")

I got these timings in the ex1_t.rds 
file:

$`Time x1`
   user  system elapsed
571.731 182.006 772.199

$`Time x2`
   user  system elapsed
200.068  90.529 295.086

As you can see, statement.1 takes a lot 
longer to process than statement.2 ...  
When I add the rest of the 31 search 
terms, the difference grows much larger, 
as I pointed out initially: more than a 
full hour versus only a few minutes. 

Best,
Rasmus




Re: [R] RSQLite slowness

2021-10-07 Thread Rasmus Liland
On 2021-10-06 12:11 -0700, Jeff Newmiller wrote:
> FWIW all SQL implementations work 
> better with indexes

An index seems to be a good way to 
improve SQL performance; I'll look into 
it.
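
A minimal sketch of what I have in mind 
(untested, and assuming the 
gene_info.sqlite database from my 
earlier example): indexing the GeneID 
columns should at least turn the LEFT 
JOIN lookups into index lookups, even 
though the LIKE '%...%' filter itself 
cannot use an index.

conn <- DBI::dbConnect(RSQLite::SQLite(), "gene_info.sqlite")
# Index the join keys used by the LEFT JOINs
DBI::dbExecute(conn,
  "CREATE INDEX IF NOT EXISTS idx_gene2refseq_geneid ON gene2refseq (GeneID)")
DBI::dbExecute(conn,
  "CREATE INDEX IF NOT EXISTS idx_gene_info_geneid ON gene_info (GeneID)")
DBI::dbDisconnect(conn)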

Best,
Rasmus




Re: [R] RSQLite slowness

2021-10-07 Thread Rasmus Liland
Dear Ivan,

Thanks for that explanation!  I think 
it explains the slowness clearly.

It is possible to use carray in Rust 
via rusqlite [1], so it might become 
available in R in the future(?)  I'll 
look into rusqlite some time, at least.

SQLite is supposed to be one of the 
fastest SQL implementations.  Realm [2] 
might be faster, but I don't know how to 
use it from R, so ...  

I thought someone might suggest Redis 
[3], but I think it is way slower.  

SQLite with indexes and the traditional 
approach of writing the search terms 
into a temporary table might just be 
it ... 

Best,
Rasmus

[1] https://docs.rs/rusqlite/0.24.2/rusqlite/vtab/array/index.html
[2] https://en.wikipedia.org/wiki/Realm_(database)
[3] https://en.wikipedia.org/wiki/Redis




Re: [R] RSQLite slowness

2021-10-07 Thread Ivan Krylov
On Wed,  6 Oct 2021 16:23:15 +
Rasmus Liland  wrote:

>"SELECT * FROM gene2refseq
> LEFT JOIN gene_info ON
>   gene_info.GeneID = gene2refseq.GeneID
> WHERE gene2refseq.`RNA_nucleotide_accession.version` 
>   LIKE ?"

<...>

>   x1 <- DBI::dbGetQuery(conn=conn, 
> statement=statement, 
> param=list(Håkan20210914$RNANucleotideAccession))

I think that the problem here is that you pass a vector as a bound
parameter to LIKE, when parameter placeholders usually expect a scalar.
DBI transparently handles this:

>> The elements of the `params` argument do not need to be scalars,
>> vectors of arbitrary length (including length 0) are supported.  For
>> queries, calling dbFetch() binding such parameters returns
>> concatenated results, equivalent to binding and fetching for each
>> set of values and connecting via rbind().

I think this means that DBI runs a SELECT for each value in
Håkan20210914$RNANucleotideAccession, which is understandably slower
than a single query. Unfortunately, it's hard to pass vectors of values
to queries with bound parameters; the SQL engines I know don't have a
syntax for "WHERE param IN (:multi_placeholder:)". SQLite comes with
carray [1], but I don't know whether it's exposed by RSQLite (could be
hard to do in a pointer-safe way), and you're already aware of the
traditional way of doing that: create a temporary table, populate it
and JOIN with the rest of the query.
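
For exact matches (as opposed to the 
substring LIKE used here) one workaround 
is to paste the right number of '?' 
placeholders into an IN (...) list and 
bind one value per placeholder. A 
minimal, untested sketch; note this 
would not match the versioned accessions 
the way LIKE '%...%' does:

ids <- c("NM_001267695", "NM_001007636")  # example accessions
placeholders <- paste(rep("?", length(ids)), collapse = ", ")
statement <- sprintf(
  "SELECT * FROM gene2refseq
   WHERE `RNA_nucleotide_accession.version` IN (%s)", placeholders)
x <- DBI::dbGetQuery(conn, statement, params = as.list(ids))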

-- 
Best regards,
Ivan

[1] https://www.sqlite.org/carray.html



Re: [R] RSQLite slowness

2021-10-06 Thread Jeff Newmiller
Since the sqlite package is contributed, it is NOT related to "core R", and is 
in fact technically off-topic on this list.

FWIW all SQL implementations work better with indexes, but AFAIK the R data 
frame support does nothing with indexes. This may be related to your question, 
or not. I am not a regular sqlite user.

As for fast reading of tsv files, I think the arrow, readr, and data.table 
packages all offer high-performance import functions that could be relevant.
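
For instance, something along these lines (a rough, untested sketch, assuming 
the gene_info tsv has already been decompressed as in the earlier post) should 
read the file much faster than base read.delim:

# data.table
gene_info <- data.table::fread("gene_info", sep = "\t", quote = "")
# readr alternative
# gene_info <- readr::read_tsv("gene_info")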

On October 6, 2021 11:49:55 AM PDT, Rasmus Liland  wrote:
>Thank you Bert, I set up a new thread on 
>BioStars [1].  So far, I'm a bit 
>unfamiliar with Bioconductor (but will 
>hopefully attend a course about it in 
>November, which I'm kinda hyped about), 
>other than installing and updating R 
>packages using BiocManager.  Did you 
>think of something other than 
>BioStars.org when saying «Bioconductor»?
>
>The question could be viewed as gene 
>related, but I think it is really about 
>how one can handle large tsv files more 
>easily than with sqlite, and why that 
>parsing step is so slow ...  I think 
>this is more of a core R question than 
>a gene-related one ...  
>
>[1] https://www.biostars.org/p/9492486/

-- 
Sent from my phone. Please excuse my brevity.



Re: [R] RSQLite slowness

2021-10-06 Thread Martin Morgan
https://support.bioconductor.org and the community slack (sign up at 
https://bioc-community.herokuapp.com/ ) as well as the general site 
https://bioconductor.org . Actually your question sounds like a SQLite question: 
JOIN a table, versus a parameterized query. One could perhaps construct the 
relevant example at the sqlite command line?

Martin Morgan

On 10/6/21, 2:50 PM, "R-help"  wrote:
Thank you Bert, I set up a new thread on
BioStars [1].  So far, I'm a bit
unfamiliar with Bioconductor (but will
hopefully attend a course about it in
November, which I'm kinda hyped about),
other than installing and updating R
packages using BiocManager.  Did you
think of something other than
BioStars.org when saying «Bioconductor»?

The question could be viewed as gene
related, but I think it is really about
how one can handle large tsv files more
easily than with sqlite, and why that
parsing step is so slow ...  I think
this is more of a core R question than
a gene-related one ...

[1] https://www.biostars.org/p/9492486/





Re: [R] RSQLite slowness

2021-10-06 Thread Rasmus Liland
Thank you Bert, I set up a new thread on 
BioStars [1].  So far, I'm a bit 
unfamiliar with Bioconductor (but will 
hopefully attend a course about it in 
November, which I'm kinda hyped about), 
other than installing and updating R 
packages using BiocManager.  Did you 
think of something other than 
BioStars.org when saying «Bioconductor»?

The question could be viewed as gene 
related, but I think it is really about 
how one can handle large tsv files more 
easily than with sqlite, and why that 
parsing step is so slow ...  I think 
this is more of a core R question than 
a gene-related one ...  

[1] https://www.biostars.org/p/9492486/




Re: [R] RSQLite slowness

2021-10-06 Thread Bert Gunter
Generally, such gene-related questions are better asked on Bioconductor
than here. They might also know of more efficient, purpose-built tools for
your efforts there. No guarantees, of course, and you might get a helpful
response here. But if not ...

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Oct 6, 2021 at 9:23 AM Rasmus Liland  wrote:

> Dear r-help readers,
>
> why is it so much slower to query an
> SQLite database using RSQLite «from the
> outside» using param, like
>
> statement <-
>  "SELECT * FROM gene2refseq
>   LEFT JOIN gene_info ON
> gene_info.GeneID = gene2refseq.GeneID
>   WHERE gene2refseq.`RNA_nucleotide_accession.version`
> LIKE ?"
> db <- "gene_info.sqlite"
> conn <- DBI::dbConnect(RSQLite::SQLite(), db)
> x1 <- DBI::dbGetQuery(conn=conn,
>   statement=statement,
>   param=list(Håkan20210914$RNANucleotideAccession))
>
> compared to querying «from the inside»
> of sqlite, by writing your search terms
> as a table first, and then calling it
>
> statement <-
>  "SELECT * FROM H
>   LEFT JOIN gene2refseq R ON
> R.`RNA_nucleotide_accession.version`
> LIKE '%' || H.RNANucleotideAccession || '%'
>   LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
> DBI::dbWriteTable(conn, "H", Håkan20210914)
> x2 <- DBI::dbGetQuery(conn=conn, statement=statement)
> DBI::dbDisconnect(conn)
>
> On my system (E5-2603 v4), the first
> query took more than an hour, while the
> second took only a few minutes ...
>
> Do you guys know of any faster (but also
> nice) way to dig around in very large
> tsv files like
> https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
> and
> https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
> ?
>
> Best,
> Rasmus



[R] RSQLite slowness

2021-10-06 Thread Rasmus Liland
Dear r-help readers,

why is it so much slower to query an 
SQLite database using RSQLite «from the 
outside» using param, like 

statement <-
 "SELECT * FROM gene2refseq
  LEFT JOIN gene_info ON
gene_info.GeneID = gene2refseq.GeneID
  WHERE gene2refseq.`RNA_nucleotide_accession.version` 
LIKE ?"
db <- "gene_info.sqlite"
conn <- DBI::dbConnect(RSQLite::SQLite(), db)
x1 <- DBI::dbGetQuery(conn=conn, 
  statement=statement, 
  param=list(Håkan20210914$RNANucleotideAccession))

compared to querying «from the inside» 
of sqlite, by writing your search terms 
as a table first, and then calling it 

statement <- 
 "SELECT * FROM H
  LEFT JOIN gene2refseq R ON
R.`RNA_nucleotide_accession.version` 
LIKE '%' || H.RNANucleotideAccession || '%'
  LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
DBI::dbWriteTable(conn, "H", Håkan20210914)
x2 <- DBI::dbGetQuery(conn=conn, statement=statement)
DBI::dbDisconnect(conn)

On my system (E5-2603 v4), the first 
query took more than an hour, while the 
second took only a few minutes ...  

Do you guys know of any faster (but also 
nice) way to dig around in very large 
tsv files like 
https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz 
and 
https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz 
?

Best,
Rasmus

