On 25/09/13 14:18, Zhiyun Qian wrote:
Thanks very much for your reply!
On Wed, Sep 25, 2013 at 6:01 AM, Andy Seaborne <[email protected]> wrote:
On 24/09/13 19:36, Zhiyun Qian wrote:
Hi all,
Currently when I want to update an existing TDB, I simply open it using
"memory-mapped file" mode (I'm using 64-bit) and then call
"model.createResource()" repeatedly which will get reflected onto the TDB
as the program runs.
The bulk loaders are faster. They work directly on index files and order
things efficiently for bulk loads. They only work on an empty dataset;
otherwise the Java-based one falls back to incremental update.
Thanks for the info. Unfortunately I am working with a case where the data
is dynamically generated upon parsing some raw logs. I don't have a file
that is already in bulkloader-parsable format.
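(One way to still use the bulk loader in this situation, sketched below with only the JDK: stream each parsed log record out as an N-Triples line, then load the resulting file into an empty dataset with tdbloader. The http://example/ URI scheme and the record shape are invented for illustration.)

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: turn dynamically generated data into a bulkloader-parsable file.
// Each parsed log record becomes one N-Triples line.
public class NtWriter {

    // Serialise one record as an N-Triples line.
    // The http://example/ namespace is a placeholder.
    public static String toNTriple(String subj, String pred, String obj) {
        return "<http://example/" + subj + "> <http://example/" + pred
                + "> \"" + obj + "\" .";
    }

    // Write all records to a temp .nt file, ready for bulk loading
    // into an empty dataset (e.g. tdbloader --loc=DB triples.nt).
    public static Path writeAll(Iterable<String[]> records) throws IOException {
        Path out = Files.createTempFile("triples", ".nt");
        try (Writer w = Files.newBufferedWriter(out)) {
            for (String[] r : records) {
                w.write(toNTriple(r[0], r[1], r[2]));
                w.write('\n');
            }
        }
        return out;
    }
}
```

The escaping above is deliberately naive (plain literals only); real log data would need proper N-Triples escaping.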
I'm quite curious about the details behind the scenes.
1. According to my understanding: when I open the existing TDB, it does
not
load any data from disk just yet. It only loads on demand whenever an
existing node needs to be referenced (for instance, let's say the existing
TDB has the triple "A p B" and I'm trying to add "A p C". This requires A
to be loaded in memory first). In this case, if I'm not referencing any
existing nodes, there's no need to load anything from the existing TDB at
all.
It has to have A, not "A p B" - there is a separate node table - and you
are accessing existing nodes (two, in fact: A and p) on "A p C". The node table has
a big cache in front of it.
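The node-table-plus-cache arrangement can be sketched roughly like this, using only the JDK; the HashMap standing in for the on-disk table, the ids, and the cache size are all illustrative, not TDB internals:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a node table with an LRU cache in front of it.
// The HashMap stands in for the on-disk node table; TDB's real structures
// are more involved.
public class CachedNodeTable {
    private static final int CAPACITY = 100_000;   // made-up cache size

    private final Map<String, Long> disk = new HashMap<>(); // stand-in for disk
    private long nextId = 0;

    // Access-ordered LinkedHashMap: evicts the least recently used entry.
    private final Map<String, Long> cache =
        new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > CAPACITY;
            }
        };

    // Adding "A p C" needs ids for the existing nodes A and p, plus new node C.
    public long getOrAllocateId(String node) {
        Long id = cache.get(node);
        if (id == null) {
            id = disk.get(node);     // cache miss: go to the table
            if (id == null) {
                id = nextId++;       // unseen node: allocate a fresh id
                disk.put(node, id);
            }
            cache.put(node, id);
        }
        return id;
    }
}
```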
2. Even though the TDB is loaded in "memory-mapped file", does the program
really have to periodically write to disk (assuming there's still enough
physical memory)? Can somehow the program write only when it runs out of
physical memory? Additionally, after writing the disk, can the
corresponding data in memory be freed (or maybe keep a cache of much
smaller set)?
The file is written to disk in parts and it's under OS control, not the
program. Memory mapped files are like swap. The OS manages what is
in-memory and what is not.
A memory mapped file appears as a very large virtual memory area, and it
is accessed as a very large area of bytes (ByteBuffer). The OS controls what
is really in-memory and what is left on disk. Writing to a mmap file does
not cause the OS to write it out immediately. The OS writes dirty pages
only when it wants to free up real memory for some other use, like another
part of the file that is now accessed.
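The mechanism described above can be seen directly with the JDK's own memory-mapped files (MappedByteBuffer); the file name, sizes, and offsets below are arbitrary:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal illustration: a file mapped into virtual memory and accessed as a
// ByteBuffer. Writes land in the page cache; the OS decides when dirty pages
// reach disk (force() makes it explicit, like a sync).
public class MmapDemo {
    public static byte readBack(Path file, int size, int offset, byte value)
            throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.put(offset, value);   // write into memory; no immediate disk I/O
            buf.force();              // explicitly flush dirty pages to disk
            return buf.get(offset);
        }
    }
}
```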
This makes sense. What about using the "direct" mode? I tried to do the
same thing (update an existing TDB) with "direct" mode but it appears to be
much much slower. I thought it would be faster because I don't have to
incrementally update the TDB on disk (which is what "memory-mapped" mode is
doing). Instead, all of the updates are done in memory first (e.g., via
model.createResource()), and finally I can do a TDB.sync(m) to batch write
to disk. However, I'm observing that even updating in memory seems
significantly slower. I wonder what's going on here.
Direct mode is running a "traditional" disk cache with an in-JVM LRU
cache of blocks. Blocks get written when the LRU cache decides it needs
to evict a (dirty) block.
But the cache size is tuned for small JVMs (32-bit ones - max 1.5GB) and
that's going to affect performance. You can tweak the size if you modify
the code.
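A rough sketch of the write-on-evict behaviour described above, using a LinkedHashMap as the LRU; the block ids and the "written" log are illustrative, not TDB's actual block manager:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of direct mode: an in-JVM LRU cache of blocks. A dirty block is
// written out only when the cache evicts it to make room.
public class BlockCache {
    public final List<Integer> written = new ArrayList<>(); // ids flushed to "disk"

    static final class Block {
        byte[] data;
        boolean dirty;
        Block(byte[] d) { data = d; }
    }

    private final Map<Integer, Block> lru;

    public BlockCache(int capacity) {
        // Access-ordered map: the eldest entry is the least recently used.
        lru = new LinkedHashMap<Integer, Block>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, Block> eldest) {
                boolean evict = size() > capacity;
                if (evict && eldest.getValue().dirty) {
                    written.add(eldest.getKey()); // write-on-evict, as described
                }
                return evict;
            }
        };
    }

    public void write(int blockId, byte[] data) {
        Block b = new Block(data);
        b.dirty = true;          // block modified in memory, not yet on disk
        lru.put(blockId, b);
    }
}
```

With a small capacity, writing a third block evicts (and "flushes") the first; a bigger cache simply delays those flushes, which is why the cache size tuned for 32-bit JVMs matters for performance.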
Andy
Any comments are welcome. Thanks!
-Zhiyun