Hi,

I am creating a TDB dataset from the UniProt files available on their public FTP servers. The goal is to map between two different IDs. The TDB store has been created, but queries against it are slow, and I was wondering if there's anything I'm missing that could improve the performance.

Here is an example of the schema, dumped from the command line:

tdbdump --loc=`pwd`
<http://example.com/genedata/23> <http://example.com/idmapping> "http://www.ncbi.nlm.nih.gov/gene/2947737" <proteinModel0> .
<http://example.com/genedata/24> <http://example.com/idmapping> "http://www.uniprot.org/uniprot/Q6GZV7" <proteinModel0> .
<http://example.com/genedata/24> <http://example.com/idmapping> "http://www.ncbi.nlm.nih.gov/gene/2947738" <proteinModel0> .
<http://example.com/genedata/25> <http://example.com/idmapping> "http://www.uniprot.org/uniprot/Q6GZV6" <proteinModel0> .
<http://example.com/genedata/25> <http://example.com/idmapping> "http://www.ncbi.nlm.nih.gov/gene/2947739" <proteinModel0> .
<http://example.com/genedata/26> <http://example.com/idmapping> "http://www.uniprot.org/uniprot/Q6GZV5" <proteinModel0> .
<http://example.com/genedata/26> <http://example.com/idmapping> "http://www.ncbi.nlm.nih.gov/gene/2947740" <proteinModel0> .
<http://example.com/genedata/27> <http://example.com/idmapping> "http://www.uniprot.org/uniprot/Q6GZV4" <proteinModel0> .

Here is an example of the query being run with the explain plan:

tdbquery --loc=`pwd` --time --explain 'PREFIX els: <http://example.com/>
SELECT DISTINCT ?id
FROM <urn:x-arq:UnionGraph>
WHERE {
  ?node els:idmapping ?id .
  ?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956"
}'

{ ?node els:idmapping ?id ;
        ?p "http://www.ncbi.nlm.nih.gov/gene/1956"
}
22:18:21 INFO exec :: ALGEBRA
  (distinct
    (project (?id)
      (bgp
        (triple ?node <http://example.com/idmapping> ?id)
        (triple ?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956")
      )))
22:18:21 INFO exec :: TDB
  (distinct
    (project (?id)
      (bgp
        (triple ?node <http://example.com/idmapping> ?id)
        (triple ?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956")
      )))
22:18:21 INFO exec :: BGP
  ?node <http://example.com/idmapping> ?id
  ?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956"
22:18:21 INFO exec :: Reorder/generic
  ?node <http://example.com/idmapping> ?id
  ?node ?p "http://www.ncbi.nlm.nih.gov/gene/1956"
-------------------------------------------
| id                                      |
===========================================
| "http://www.uniprot.org/uniprot/P00533" |
| "http://www.ncbi.nlm.nih.gov/gene/1956" |
| "http://www.uniprot.org/uniprot/Q504U8" |
| "http://www.uniprot.org/uniprot/E9PFD7" |
| "http://www.uniprot.org/uniprot/F2YGG7" |
| "http://www.uniprot.org/uniprot/E7BSV0" |
| "http://www.uniprot.org/uniprot/B7Z2I3" |
-------------------------------------------
Time: 283.782 sec

The data sets are 106m for uniprot and 11m for the gene data, resulting in the following file sizes:

ls -al
total 16443212
drwxrwxr-x  2 ec2-user ec2-user       4096 Jun 14 17:59 .
drwx------ 17 ec2-user ec2-user       4096 Jun 14 17:57 ..
-rw-rw-r--  1 ec2-user ec2-user 2113929216 Jun 12 11:39 GOSP.dat
-rw-rw-r--  1 ec2-user ec2-user   33554432 Jun 12 11:37 GOSP.idn
-rw-rw-r--  1 ec2-user ec2-user 2113929216 Jun 12 11:39 GPOS.dat
-rw-rw-r--  1 ec2-user ec2-user   33554432 Jun 12 11:37 GPOS.idn
-rw-rw-r--  1 ec2-user ec2-user 2147483648 Jun 12 11:39 GSPO.dat
-rw-rw-r--  1 ec2-user ec2-user   25165824 Jun 12 11:37 GSPO.idn
-rw-rw-r--  1 ec2-user ec2-user          0 Jun 12 11:39 journal.jrnl
-rw-rw-r--  1 ec2-user ec2-user 1879048192 Jun 10 10:40 node2id.dat
-rw-rw-r--  1 ec2-user ec2-user   16777216 Jun 10 10:40 node2id.idn
-rw-rw-r--  1 ec2-user ec2-user 2340085153 Jun 10 10:40 nodes.dat
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 OSP.dat
-rw-rw-r--  1 ec2-user ec2-user 2013265920 Jun 12 11:39 OSPG.dat
-rw-rw-r--  1 ec2-user ec2-user   25165824 Jun 10 10:40 OSPG.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 OSP.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 POS.dat
-rw-rw-r--  1 ec2-user ec2-user 2030043136 Jun 12 11:39 POSG.dat
-rw-rw-r--  1 ec2-user ec2-user   25165824 Jun 12 11:37 POSG.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 POS.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 prefix2id.dat
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 prefix2id.idn
-rw-rw-r--  1 ec2-user ec2-user          0 Jun  5 21:02 prefixes.dat
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 prefixIdx.dat
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 prefixIdx.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 SPO.dat
-rw-rw-r--  1 ec2-user ec2-user 2139095040 Jun 12 11:39 SPOG.dat
-rw-rw-r--  1 ec2-user ec2-user   25165824 Jun 10 10:40 SPOG.idn
-rw-rw-r--  1 ec2-user ec2-user    8388608 Jun  5 21:02 SPO.idn
-rw-rw-r--  1 ec2-user ec2-user    2909764 Jun 12 22:28 stats.opt

This is running on a machine with an SSD, in AWS.

tdbdump --version
Jena: VERSION: 3.7.0
Jena: BUILD_DATE: 2018-04-05T11:04:59+0000
ARQ: VERSION: 3.7.0
ARQ: BUILD_DATE: 2018-04-05T11:04:59+0000
RIOT: VERSION: 3.7.0
RIOT: BUILD_DATE: 2018-04-05T11:04:59+0000
TDB: VERSION: ${project.version}
TDB: BUILD_DATE: ${build.time.xsd}

I'd appreciate any help that you could offer.

Thanks,
Adam