I have a folder with about 250 small RDF/XML files. It seems to make a 
huge difference whether I load all the files with a single call to 
tdbloader, like this: "tdbloader --graph=... --loc=./db files/*", or call 
tdbloader once for each file.
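
For concreteness, here is a rough sketch of the two ways of loading. The 
graph name is elided as above, so GRAPH is just a placeholder variable, and 
the function names are made up for this message:

```shell
# GRAPH is a placeholder for the (elided) graph name; set it before calling.

# Case 1: a single tdbloader call that receives every file at once.
load_all_at_once() {
    tdbloader --graph="$GRAPH" --loc=./db files/*
}

# Case 2: one tdbloader call per file, all writing to the same database.
load_one_by_one() {
    for f in files/*; do
        tdbloader --graph="$GRAPH" --loc=./db "$f"
    done
}
```

With about 250 files, the second function invokes tdbloader about 250 times 
against the same ./db location.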

This is my database folder in the first case (a single call with files/*):

├── [8.0M]  GOSP.dat
├── [8.0M]  GOSP.idn
├── [8.0M]  GPOS.dat
├── [8.0M]  GPOS.idn
├── [8.0M]  GSPO.dat
├── [8.0M]  GSPO.idn
├── [   0]  journal.jrnl
├── [8.0M]  node2id.dat
├── [8.0M]  node2id.idn
├── [384K]  nodes.dat
├── [8.0M]  OSP.dat
├── [8.0M]  OSPG.dat
├── [8.0M]  OSPG.idn
├── [8.0M]  OSP.idn
├── [8.0M]  POS.dat
├── [8.0M]  POSG.dat
├── [8.0M]  POSG.idn
├── [8.0M]  POS.idn
├── [8.0M]  prefix2id.dat
├── [8.0M]  prefix2id.idn
├── [ 576]  prefixes.dat
├── [8.0M]  prefixIdx.dat
├── [8.0M]  prefixIdx.idn
├── [8.0M]  SPO.dat
├── [8.0M]  SPOG.dat
├── [8.0M]  SPOG.idn
├── [8.0M]  SPO.idn
└── [4.3K]  stats.opt

$ du -chs .
5.3M    total

$ du -chs --apparent-size . 
193M    total

And this is the second case (calling tdbloader on each file in a loop):

├── [648M]  GOSP.dat
├── [8.0M]  GOSP.idn
├── [496M]  GPOS.dat
├── [8.0M]  GPOS.idn
├── [784M]  GSPO.dat
├── [8.0M]  GSPO.idn
├── [   0]  journal.jrnl
├── [216M]  node2id.dat
├── [8.0M]  node2id.idn
├── [385K]  nodes.dat
├── [8.0M]  OSP.dat
├── [648M]  OSPG.dat
├── [8.0M]  OSPG.idn
├── [8.0M]  OSP.idn
├── [8.0M]  POS.dat
├── [496M]  POSG.dat
├── [8.0M]  POSG.idn
├── [8.0M]  POS.idn
├── [8.0M]  prefix2id.dat
├── [8.0M]  prefix2id.idn
├── [ 576]  prefixes.dat
├── [8.0M]  prefixIdx.dat
├── [8.0M]  prefixIdx.idn
├── [8.0M]  SPO.dat
├── [784M]  SPOG.dat
├── [8.0M]  SPOG.idn
├── [8.0M]  SPO.idn
└── [1.3K]  stats.opt

$ du -chs .                
5.4M    total

$ du -chs --apparent-size .
4.2G    total

The difference in apparent size is pretty remarkable: about 200MB vs 4.2GB, 
while the actual disk usage is nearly identical (5.3M vs 5.4M). What's 
happening here? I didn't expect such a difference; I thought the output 
would be the same. Is tdbloader creating some *very* sparse index each time 
I call it?
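
In case it helps to narrow this down, below is a small sketch for comparing 
apparent size with allocated size per file, which is how I would check which 
files are sparse. It assumes GNU stat (%s is the size in bytes, %b the number 
of allocated blocks, %B the bytes per block); the function name is made up:

```shell
# Print "apparent_bytes <TAB> allocated_bytes <TAB> path" for each regular
# file in a directory (default: the current directory).
# Assumes GNU coreutils stat.
sparse_report() {
    dir=${1:-.}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        apparent=$(stat -c %s "$f")   # logical file length
        blocks=$(stat -c %b "$f")     # blocks actually allocated
        bsize=$(stat -c %B "$f")      # size of each such block
        allocated=$((blocks * bsize))
        printf '%s\t%s\t%s\n' "$apparent" "$allocated" "$f"
    done
}
```

Running "sparse_report ./db" lists every file; a file whose second column is 
much smaller than its first is mostly holes.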
