Hi,

I am creating a TDB dataset from the UniProt files available on their public
FTP servers. The goal is to map between two different IDs. The TDB store has
been created, but queries against it are slow, and I was wondering whether
I'm missing anything that could improve performance.



Here is a sample of the data, dumped from the command line:

tdbdump --loc=`pwd`

<http://example.com/genedata/23> <http://example.com/idmapping> "http://www.ncbi.nlm.nih.gov/gene/2947737" <proteinModel0> .
<http://example.com/genedata/24> <http://example.com/idmapping> "http://www.uniprot.org/uniprot/Q6GZV7" <proteinModel0> .
<http://example.com/genedata/24> <http://example.com/idmapping> "http://www.ncbi.nlm.nih.gov/gene/2947738" <proteinModel0> .
<http://example.com/genedata/25> <http://example.com/idmapping> "http://www.uniprot.org/uniprot/Q6GZV6" <proteinModel0> .
<http://example.com/genedata/25> <http://example.com/idmapping> "http://www.ncbi.nlm.nih.gov/gene/2947739" <proteinModel0> .
<http://example.com/genedata/26> <http://example.com/idmapping> "http://www.uniprot.org/uniprot/Q6GZV5" <proteinModel0> .
<http://example.com/genedata/26> <http://example.com/idmapping> "http://www.ncbi.nlm.nih.gov/gene/2947740" <proteinModel0> .
<http://example.com/genedata/27> <http://example.com/idmapping> "http://www.uniprot.org/uniprot/Q6GZV4" <proteinModel0> .

Here is an example of the query being run with the explain plan.

tdbquery --loc=`pwd` --time --explain 'PREFIX els: <http://example.com/>
SELECT DISTINCT ?id FROM <urn:x-arq:UnionGraph> WHERE { ?node els:idmapping ?id .
?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956" }'

  { ?node  els:idmapping  ?id ;
           ?p             "http://www.ncbi.nlm.nih.gov/gene/1956"
  }
22:18:21 INFO  exec                 :: ALGEBRA
(distinct
  (project (?id)
    (bgp
      (triple ?node <http://example.com/idmapping> ?id)
      (triple ?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956")
    )))
22:18:21 INFO  exec                 :: TDB
(distinct
  (project (?id)
    (bgp
      (triple ?node <http://example.com/idmapping> ?id)
      (triple ?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956")
    )))
22:18:21 INFO  exec                 :: BGP
?node <http://example.com/idmapping> ?id
?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956"
22:18:21 INFO  exec                 :: Reorder/generic
?node <http://example.com/idmapping> ?id
?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956"

-------------------------------------------
| id                                      |
===========================================
| "http://www.uniprot.org/uniprot/P00533" |
| "http://www.ncbi.nlm.nih.gov/gene/1956" |
| "http://www.uniprot.org/uniprot/Q504U8" |
| "http://www.uniprot.org/uniprot/E9PFD7" |
| "http://www.uniprot.org/uniprot/F2YGG7" |
| "http://www.uniprot.org/uniprot/E7BSV0" |
| "http://www.uniprot.org/uniprot/B7Z2I3" |
-------------------------------------------
Time: 283.782 sec
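
To make the intent of the query concrete, here is a minimal Python sketch of the lookup I'm after. It is only an illustration and not part of the TDB setup: the `SAMPLE` list is a few (node, id) pairs copied from the dump above, and `map_ids` is a made-up helper name. Given one known ID, it finds the nodes that carry that ID and returns every ID attached to those nodes, which is why the known ID itself shows up in the result.

```python
# A few (node, id) pairs copied from the tdbdump sample; in the store
# the predicate for all of them is <http://example.com/idmapping>.
SAMPLE = [
    ("http://example.com/genedata/23", "http://www.ncbi.nlm.nih.gov/gene/2947737"),
    ("http://example.com/genedata/24", "http://www.uniprot.org/uniprot/Q6GZV7"),
    ("http://example.com/genedata/24", "http://www.ncbi.nlm.nih.gov/gene/2947738"),
    ("http://example.com/genedata/25", "http://www.uniprot.org/uniprot/Q6GZV6"),
    ("http://example.com/genedata/25", "http://www.ncbi.nlm.nih.gov/gene/2947739"),
]


def map_ids(pairs, known_id):
    """Hypothetical helper: return every ID that shares a node with known_id."""
    # Step 1: nodes that have known_id as a value (the second triple pattern).
    nodes = {node for node, value in pairs if value == known_id}
    # Step 2: all distinct IDs on those nodes (the first pattern plus DISTINCT).
    return sorted({value for node, value in pairs if node in nodes})


if __name__ == "__main__":
    for mapped in map_ids(SAMPLE, "http://www.ncbi.nlm.nih.gov/gene/2947738"):
        print(mapped)
```

On the sample pairs this prints the gene ID 2947738 itself and the UniProt ID Q6GZV7, both attached to node 24.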

The data sets are 106m for UniProt and 11m for the gene data, resulting in the
following file sizes:

ls -al
total 16443212
drwxrwxr-x  2 ec2-user ec2-user       4096 Jun 14 17:59 .
drwx------ 17 ec2-user ec2-user       4096 Jun 14 17:57 ..
-rw-rw-r--  1 ec2-user ec2-user 2113929216 Jun 12 11:39 GOSP.dat
-rw-rw-r--  1 ec2-user ec2-user   33554432 Jun 12 11:37 GOSP.idn
-rw-rw-r--  1 ec2-user ec2-user 2113929216 Jun 12 11:39 GPOS.dat
-rw-rw-r--  1 ec2-user ec2-user   33554432 Jun 12 11:37 GPOS.idn
-rw-rw-r--  1 ec2-user ec2-user 2147483648 Jun 12 11:39 GSPO.dat
-rw-rw-r--  1 ec2-user ec2-user   25165824 Jun 12 11:37 GSPO.idn
-rw-rw-r--  1 ec2-user ec2-user          0 Jun 12 11:39 journal.jrnl
-rw-rw-r--  1 ec2-user ec2-user 1879048192 Jun 10 10:40 node2id.dat
-rw-rw-r--  1 ec2-user ec2-user   16777216 Jun 10 10:40 node2id.idn
-rw-rw-r--  1 ec2-user ec2-user 2340085153 Jun 10 10:40 nodes.dat
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 OSP.dat
-rw-rw-r--  1 ec2-user ec2-user 2013265920 Jun 12 11:39 OSPG.dat
-rw-rw-r--  1 ec2-user ec2-user   25165824 Jun 10 10:40 OSPG.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 OSP.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 POS.dat
-rw-rw-r--  1 ec2-user ec2-user 2030043136 Jun 12 11:39 POSG.dat
-rw-rw-r--  1 ec2-user ec2-user   25165824 Jun 12 11:37 POSG.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 POS.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 prefix2id.dat
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 prefix2id.idn
-rw-rw-r--  1 ec2-user ec2-user          0 Jun  5 21:02 prefixes.dat
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 prefixIdx.dat
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 prefixIdx.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 SPO.dat
-rw-rw-r--  1 ec2-user ec2-user 2139095040 Jun 12 11:39 SPOG.dat
-rw-rw-r--  1 ec2-user ec2-user   25165824 Jun 10 10:40 SPOG.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 SPO.idn
-rw-rw-r--  1 ec2-user ec2-user    2909764 Jun 12 22:28 stats.opt

This is running on a machine with an SSD, in AWS.

tdbdump --version
Jena:       VERSION: 3.7.0
Jena:       BUILD_DATE: 2018-04-05T11:04:59+0000
ARQ:        VERSION: 3.7.0
ARQ:        BUILD_DATE: 2018-04-05T11:04:59+0000
RIOT:       VERSION: 3.7.0
RIOT:       BUILD_DATE: 2018-04-05T11:04:59+0000
TDB:        VERSION: ${project.version}
TDB:        BUILD_DATE: ${build.time.xsd}

I'd appreciate any help that you could offer.
Thanks,
Adam
