I have a folder with about 250 small RDF/XML files. It seems to make a huge difference whether I load all the files with a single call to tdbloader, like this: "tdbloader --graph=... --loc=./db files/*", or call tdbloader once per file.
This is my database folder in the first case (calling tdbloader once on files/*):

├── [8.0M] GOSP.dat
├── [8.0M] GOSP.idn
├── [8.0M] GPOS.dat
├── [8.0M] GPOS.idn
├── [8.0M] GSPO.dat
├── [8.0M] GSPO.idn
├── [ 0] journal.jrnl
├── [8.0M] node2id.dat
├── [8.0M] node2id.idn
├── [384K] nodes.dat
├── [8.0M] OSP.dat
├── [8.0M] OSPG.dat
├── [8.0M] OSPG.idn
├── [8.0M] OSP.idn
├── [8.0M] POS.dat
├── [8.0M] POSG.dat
├── [8.0M] POSG.idn
├── [8.0M] POS.idn
├── [8.0M] prefix2id.dat
├── [8.0M] prefix2id.idn
├── [ 576] prefixes.dat
├── [8.0M] prefixIdx.dat
├── [8.0M] prefixIdx.idn
├── [8.0M] SPO.dat
├── [8.0M] SPOG.dat
├── [8.0M] SPOG.idn
├── [8.0M] SPO.idn
└── [4.3K] stats.opt

$ du -chs .
5.3M total
$ du -chs --apparent-size .
193M total

And this is the second case (calling tdbloader on each file in a loop):

├── [648M] GOSP.dat
├── [8.0M] GOSP.idn
├── [496M] GPOS.dat
├── [8.0M] GPOS.idn
├── [784M] GSPO.dat
├── [8.0M] GSPO.idn
├── [ 0] journal.jrnl
├── [216M] node2id.dat
├── [8.0M] node2id.idn
├── [385K] nodes.dat
├── [8.0M] OSP.dat
├── [648M] OSPG.dat
├── [8.0M] OSPG.idn
├── [8.0M] OSP.idn
├── [8.0M] POS.dat
├── [496M] POSG.dat
├── [8.0M] POSG.idn
├── [8.0M] POS.idn
├── [8.0M] prefix2id.dat
├── [8.0M] prefix2id.idn
├── [ 576] prefixes.dat
├── [8.0M] prefixIdx.dat
├── [8.0M] prefixIdx.idn
├── [8.0M] SPO.dat
├── [784M] SPOG.dat
├── [8.0M] SPOG.idn
├── [8.0M] SPO.idn
└── [1.3K] stats.opt

$ du -chs .
5.4M total
$ du -chs --apparent-size .
4.2G total

The difference in apparent size is pretty remarkable: about 200MB vs 4.2GB, even though the actual disk usage is almost identical (5.3M vs 5.4M). What's happening here? I didn't expect such a difference; I thought the output would be the same. Is tdbloader creating some *very* sparse index each time I call it?
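To compare the two numbers per file rather than only per folder, a minimal sketch along these lines could be used. It assumes a POSIX system where `os.stat` exposes `st_blocks` in 512-byte units (this is not available on Windows). The names `file_sizes` and `folder_sizes` are made up for illustration and are not part of TDB or Jena.

```python
import os
import sys


def file_sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a single file."""
    st = os.stat(path)
    # st_size is the logical file length (what `du --apparent-size` sums up);
    # st_blocks counts the 512-byte blocks actually allocated on disk (POSIX only).
    return st.st_size, st.st_blocks * 512


def folder_sizes(root):
    """Map each file under root to its (apparent, allocated) sizes in bytes."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            result[os.path.relpath(full, root)] = file_sizes(full)
    return result


if __name__ == "__main__":
    # The database folder is taken from the command line; defaults to ./db.
    root = sys.argv[1] if len(sys.argv) > 1 else "./db"
    for name, (apparent, allocated) in sorted(folder_sizes(root).items()):
        print(f"{name:20s} apparent={apparent:>12d} allocated={allocated:>12d}")
```

A file whose allocated size is far below its apparent size is sparse, which would be consistent with the two `du` totals above.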
