On Wed, Apr 8, 2015 at 12:32 PM, Dan Kennedy <danielk1977 at gmail.com> wrote:
> On 04/08/2015 04:49 AM, Scott Hess wrote:
>> Something that bugged me a lot was that I had used deletion markers to
>> cancel out hits, but did not provide a way for deletion markers to
>> cancel out.  The main problem with this was that a large delete would
>> stay in the system until it reached the final segment, even if it had
>> already overtaken all of the original inserts.  I wished that I had
>> either maintained a separate structure tracking _document_ deletion
>> (which would make merges somewhat more complicated because they
>> wouldn't be term-centric), or code updates as "delete+insert".  In the
>> latter case deletes could drop out at the point where they reached the
>> original insert.
>
> Thanks for this. The "delete+insert" idea sounds like quite an interesting
> one.
>
> So instead of just "delete" and "insert" keys, the merge tree now also
> contains "delete+insert" keys (call them "update" keys). Then maintain the
> tree so that
>
>   (a) for each "insert", the next youngest duplicate key must either not
> exist or be a "delete",
>   (b) for each "update", the next youngest duplicate key must exist and must
> be an "insert" or "update", and
>   (c) for each "delete", the next youngest duplicate key must exist and must
> be an "insert" or "update".
>
> And as a result, when a "delete" catches up with an "insert" while merging
> they can both be discarded. Instead of the current situation, where we
> retain the "delete" unless the output segment is the oldest in the database.
> Cool.

Yes, I think that's it.  I have notes somewhere (no chance I'll find
them in this decade) noodling on how to handle it :-).  I think you
can model it as a stream of alternating (oldest to newest) INSERT,
DELETE, INSERT, DELETE, ...  When a DELETE catches an INSERT they
cancel, but when an INSERT catches a DELETE it "pushes" the DELETE
forward until it finds the original INSERT and cancels, leaving the
new INSERT alone.  Serializing the update as an actual DELETE/INSERT
pair made merges work naturally (the DELETE lines up with the earlier
INSERT, both are skipped, and the new INSERT is in the right place to
carry forward).  The space cost was two bytes per term updated, one
for the delete's POS_END varint(0), one for the insert's docid delta
varint(0).  That could be dropped to one byte by adding a POS_UPDATE
value in the encoding, with a bit more merge complexity.

The main alternative I could see would be to add additional docid
arrays per segment to track documents present and documents deleted,
which you could intersect to create an additional filter to apply
during segment merge.

-scott

Reply via email to