Thanks Sebastian, I was also curious whether anyone had run perf tests for
this. Good to know! I can see how it would speed up bulk importing if the
importer (1) assumes the ids of existing nodes will not change during
execution, (2) keeps a synced mapping of datatype and predicate ids in
memory, and (3) inserts nodes/triples that reference the ids of nodes it
inserted in the previous query, since this would eliminate the overhead of
the db returning the auto-incremented ids from insertion.
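A minimal sketch of that idea, assuming a snowflake-style layout of a 41-bit
millisecond timestamp, a 10-bit worker id and a 12-bit sequence. The class
name, the epoch and the build_rows helper are hypothetical and are not taken
from the importer code; the point is only that node and triple rows can be
built in one pass without asking the db for ids:

```python
import threading
import time

EPOCH_MS = 1_500_000_000_000  # arbitrary custom epoch (assumption)


class SnowflakeLikeIds:
    """Hypothetical client-side generator of 64-bit, time-ordered ids."""

    def __init__(self, worker_id):
        if not 0 <= worker_id < 1024:  # worker id must fit in 10 bits
            raise ValueError("worker_id must be in [0, 1024)")
        self._worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    @staticmethod
    def _now_ms():
        return int(time.time() * 1000)

    def next_id(self):
        with self._lock:
            now = max(self._now_ms(), self._last_ms)  # never step backwards
            if now == self._last_ms:
                self._seq = (self._seq + 1) & 0xFFF  # 12-bit sequence
                if self._seq == 0:
                    # Sequence exhausted: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._seq = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self._worker_id << 12) | self._seq


def build_rows(triples, ids, node_ids):
    """Build node rows and triple rows in one pass, minting ids locally."""
    node_rows, triple_rows = [], []
    for triple in triples:
        refs = []
        for value in triple:
            if value not in node_ids:  # unseen node: no db round trip needed
                node_ids[value] = ids.next_id()
                node_rows.append((node_ids[value], value))
            refs.append(node_ids[value])
        triple_rows.append((ids.next_id(), *refs))
    return node_rows, triple_rows


ids = SnowflakeLikeIds(worker_id=1)
node_ids = {}  # in-memory mapping from node value to id
node_rows, triple_rows = build_rows(
    [("ex:a", "ex:knows", "ex:b"), ("ex:b", "ex:knows", "ex:a")],
    ids,
    node_ids,
)
```

Both lists can then be sent as plain batch inserts, because every triple
row already carries the ids of its subject, predicate and object.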
I haven't read deeply enough into the importer code to see whether this is
a strategy the importer is already using. Unfortunately (on 584 at least,
where geometry is another ntype) the importer really struggles beyond a few
million nodes; the bottleneck seems to be the process of checking whether a
node already exists. I haven't run perf tests yet, but I've added this
unique index to my (postgres) nodes table for now so that I can 'DO
NOTHING' on insert conflict while I am trying out a multi-threaded importer:
CREATE UNIQUE INDEX idx_node_essence ON nodes(ntype, svalue, ltype, lang);
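A minimal sketch of the insert pattern this index enables, using an
in-memory SQLite database as a stand-in for postgres so that it runs
without a server. The simplified nodes table, its column types and the
insert_nodes helper are assumptions for illustration, not the actual
schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the nodes table (column types are assumptions).
conn.execute(
    "CREATE TABLE nodes ("
    " id INTEGER PRIMARY KEY, ntype TEXT NOT NULL, svalue TEXT NOT NULL,"
    " ltype INTEGER, lang TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX idx_node_essence ON nodes(ntype, svalue, ltype, lang)"
)


def count_nodes(conn):
    return conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]


def insert_nodes(conn, rows):
    """Insert rows, skipping any that hit a unique constraint.

    Returns the number of rows that were actually added.
    """
    before = count_nodes(conn)
    conn.executemany(
        "INSERT INTO nodes (id, ntype, svalue, ltype, lang) "
        "VALUES (?, ?, ?, ?, ?) ON CONFLICT DO NOTHING",
        rows,
    )
    return count_nodes(conn) - before


# Two rows with the same essence and all four indexed columns set:
# the second one is skipped instead of raising an error.
literal = ("string", "hello", 10, "en")
inserted_literals = insert_nodes(conn, [(1, *literal), (2, *literal)])

# Two rows with the same essence but NULL ltype and lang.
uri = ("uri", "http://example.org/a", None, None)
inserted_uris = insert_nodes(conn, [(3, *uri), (4, *uri)])
```

One caveat worth checking: SQLite and postgres (by default) both treat
NULLs as distinct in a unique index, so in this sketch the second row with
NULL ltype and lang is not skipped. If the real table leaves those columns
NULL for some node types, the index alone would not deduplicate them.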
On Tue, Feb 13, 2018 at 10:56 PM, Sebastian Schaffert <
> Hi Blake,
> I did performance tests back then, it actually makes a significant
> difference on most databases, especially for batch imports. Even more if
> the database is not running on localhost. Not sure about the actual numbers
> though. You can always switch to the database sequence generator for IDs if
> you want to try it out yourself, I think it's still available and it's a
> simple configuration option.
> Blake Regalia <blake.rega...@gmail.com> wrote on Wed, Feb 14, 2018,
>> I can see how this makes sense for future compatibility with distributed
>> systems across a variety of RDBMS, although I'm not convinced it's more
>> efficient for single nodes (e.g., auto-incrementing fields do not require
>> round trips). Thanks for the reply! Just wanted to know while porting a
>> bulk importer for 584.
>> - Blake
>> On Tue, Feb 13, 2018 at 12:15 PM, Sebastian Schaffert <
>> sebastian.schaff...@gmail.com> wrote:
>>> Hi Blake,
>>> Auto-increment requires querying the database for the next sequence
>>> number (or the last given ID, depending on the database you use), and
>>> that's adding another database roundtrip. Snowflake is purely in code, very
>>> fast to compute, and safe even in distributed setups.
>>> Is it causing problems?
>>> Blake Regalia <blake.rega...@gmail.com> wrote on Tue, Feb 13, 2018,
>>>> What was the justification for using the 'snowflake' bigint type for
>>>> the id fields on nodes, triples and namespaces?
>>>> - Blake