Thanks very much for your reply!
On Wed, Sep 25, 2013 at 6:01 AM, Andy Seaborne <[email protected]> wrote:

> On 24/09/13 19:36, Zhiyun Qian wrote:
>
>> Hi all,
>>
>> Currently, when I want to update an existing TDB, I simply open it in
>> "memory-mapped file" mode (I'm using 64-bit) and then call
>> "model.createResource()" repeatedly, which gets reflected in the TDB
>> as the program runs.
>
> The bulk loaders are faster. They work directly on the index files and
> order things efficiently for bulk loads. They only work on an empty
> dataset; otherwise the Java-based one falls back to incremental update.

Thanks for the info. Unfortunately, I'm working with a case where the data is generated dynamically while parsing raw logs, so I don't have a file that is already in a bulk-loader-parsable format.

>> I'm quite curious about the details behind the scenes.
>>
>> 1. According to my understanding, when I open an existing TDB, it does
>> not load any data from disk right away; it loads on demand whenever an
>> existing node needs to be referenced. For instance, say the existing
>> TDB has the triple "A p B" and I'm trying to add "A p C"; this requires
>> A to be loaded into memory first. In that case, if I'm not referencing
>> any existing nodes, nothing needs to be loaded from the existing TDB at
>> all.
>
> It has to have A, not "A p B" - there is a separate node table - and you
> are accessing existing nodes (two, in fact) for "A p C". The node table
> has a big cache in front of it.
>
>> 2. Even though the TDB is opened in "memory-mapped file" mode, does the
>> program really have to write to disk periodically (assuming there is
>> still enough physical memory)? Could the program write only when it
>> runs out of physical memory? Additionally, after writing to disk, can
>> the corresponding data in memory be freed (or replaced by a much
>> smaller cache)?
>
> The file is written to disk in parts, and it's under OS control, not the
> program's.
> Memory-mapped files are like swap. The OS manages what is in memory and
> what is not.
>
> A memory-mapped file appears as a very large virtual-memory area, and it
> is accessed as a very large area of bytes (a ByteBuffer). The OS controls
> what is really in memory and what is left on disk. Writing to a mmap'd
> file does not cause the OS to write it out immediately. The OS writes
> dirty pages only when it wants to free up real memory for some other use,
> such as another part of the file that is now being accessed.

This makes sense. What about using the "direct" mode? I tried to do the same thing (update an existing TDB) in "direct" mode, but it appears to be much, much slower. I expected it to be faster, because I wouldn't have to incrementally update the TDB on disk (which is what "memory-mapped" mode does). Instead, all of the updates would be done in memory first (e.g., via model.createResource()), and at the end I could call TDB.sync(m) to batch-write to disk. However, I'm observing that even updating in memory is significantly slower. I wonder what's going on here.

> Andy

>> Any comments are welcome. Thanks!
>> -Zhiyun
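For anyone else following along: the mmap behaviour Andy describes can be seen with plain java.nio, independent of Jena/TDB. This is just a self-contained sketch (the file name and region size are arbitrary, and it is not TDB code): writes to a MappedByteBuffer dirty pages in memory, the OS flushes them when it chooses, and force() requests an explicit flush, roughly analogous to TDB.sync().

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    // Write one byte through a memory-mapped region, then read it back
    // through ordinary file I/O to show the data reached the file.
    static byte writeAndReadBack() throws IOException {
        Path p = Files.createTempFile("mmap-demo", ".bin");
        try (FileChannel ch = FileChannel.open(p,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE beyond the current size grows the file.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.put(0, (byte) 42);  // dirties a page in memory only
            // The OS flushes dirty pages when it chooses; force() asks
            // for an explicit flush (loosely analogous to TDB.sync()).
            buf.force();
        }
        byte first = Files.readAllBytes(p)[0];
        Files.delete(p);
        return first;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeAndReadBack());
    }
}
```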
