Thanks very much for your reply!
On Wed, Sep 25, 2013 at 6:01 AM, Andy Seaborne <[email protected]> wrote:

> On 24/09/13 19:36, Zhiyun Qian wrote:
>
>> Hi all,
>>
>> Currently, when I want to update an existing TDB, I simply open it in
>> "memory-mapped file" mode (I'm using 64-bit) and then call
>> "model.createResource()" repeatedly, which gets reflected in the TDB
>> as the program runs.
>
> The bulk loaders are faster. They work directly on the index files and
> order things efficiently for bulk loads. They only work on an empty
> dataset; otherwise the Java-based one falls back to incremental update.

Thanks for the info. Unfortunately, I'm working with a case where the data is generated dynamically while parsing raw logs, so I don't have a file that is already in a bulk-loader-parsable format.

>> I'm quite curious about the details behind the scenes.
>>
>> 1. According to my understanding, when I open an existing TDB, it does
>> not load any data from disk right away; it loads on demand whenever an
>> existing node needs to be referenced. For instance, say the existing
>> TDB has the triple "A p B" and I'm trying to add "A p C"; this requires
>> A to be loaded into memory first. In that case, if I'm not referencing
>> any existing nodes, nothing needs to be loaded from the existing TDB at
>> all.
>
> It has to have A, not "A p B" - there is a separate node table - and you
> are accessing existing nodes (two, in fact) for "A p C". The node table
> has a big cache in front of it.
>
>> 2. Even though the TDB is opened in "memory-mapped file" mode, does the
>> program really have to write to disk periodically (assuming there is
>> still enough physical memory)? Could the program write only when it
>> runs out of physical memory? Additionally, after writing to disk, can
>> the corresponding data in memory be freed (or replaced by a much
>> smaller cache)?
>
> The file is written to disk in parts, and it's under OS control, not the
> program's.
> Memory-mapped files are like swap. The OS manages what is in memory and
> what is not.
>
> A memory-mapped file appears as a very large virtual-memory area, and it
> is accessed as a very large area of bytes (a ByteBuffer). The OS controls
> what is really in memory and what is left on disk. Writing to a mmap'd
> file does not cause the OS to write it out immediately. The OS writes
> dirty pages only when it wants to free up real memory for some other use,
> such as another part of the file that is now being accessed.

This makes sense. What about using the "direct" mode? I tried to do the same thing (update an existing TDB) in "direct" mode, but it appears to be much, much slower. I expected it to be faster, because I wouldn't have to incrementally update the TDB on disk (which is what "memory-mapped" mode does). Instead, all of the updates would be done in memory first (e.g., via model.createResource()), and at the end I could call TDB.sync(m) to batch-write to disk. However, I'm observing that even updating in memory is significantly slower. I wonder what's going on here.

> Andy

>> Any comments are welcome. Thanks!
>> -Zhiyun
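For anyone else following along: the mmap behaviour Andy describes can be seen with plain java.nio, independent of Jena/TDB. This is just a self-contained sketch (the file name and region size are arbitrary, and it is not TDB code): writes to a MappedByteBuffer dirty pages in memory, the OS flushes them when it chooses, and force() requests an explicit flush, roughly analogous to TDB.sync().

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    // Write one byte through a memory-mapped region, then read it back
    // through ordinary file I/O to show the data reached the file.
    static byte writeAndReadBack() throws IOException {
        Path p = Files.createTempFile("mmap-demo", ".bin");
        try (FileChannel ch = FileChannel.open(p,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE beyond the current size grows the file.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put(0, (byte) 42);  // dirties a page in memory only
            // The OS flushes dirty pages when it chooses; force() asks
            // for an explicit flush (loosely analogous to TDB.sync()).
            buf.force();
        }
        byte first = Files.readAllBytes(p)[0];
        Files.delete(p);
        return first;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeAndReadBack());
    }
}
```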
