Re: [systemd-devel] Fragile journal interleaving

Uoti Urpala Wed, 13 Dec 2017 18:02:56 -0800

On Wed, 2017-12-13 at 18:40 +0100, Lennart Poettering wrote:
> On Mi, 13.12.17 00:43, Uoti Urpala (uoti.urp...@pp1.inet.fi) wrote:
> > correct position, so I don't see why the monotonicity discard would be
> > needed for that case either?
> 
> I figure you are right.
> 
> Any chance you can prep a patch for this change?


A patch with a fairly long commit message explaining the rationale and
related issues attached.

BTW current sd_journal_next() documentation claims "The journal is
strictly ordered by reception time, and hence advancing to the next
entry guarantees that the entry then pointing to is later in time than
then previous one, or has the same timestamp." Isn't this OBVIOUSLY
false when talking about realtime timestamps if you don't include
caveats about changing the machine time?

From 7070d7a3158cb0e0d14ba6167f7e0548a0022f55 Mon Sep 17 00:00:00 2001
From: Uoti Urpala <uau@glyph.nonexistent.invalid>
Date: Thu, 14 Dec 2017 01:41:36 +0200
Subject: [PATCH] journal: fix lost entries when timestamps are inconsistent

When journalctl or the sd_journal_* API interleaved entries from
multiple journal files into one sequence, inconsistent ordering
information in the files could lead to entries being silently
discarded. Fix this so that all entries are shown at least in some
order, even if no guarantees are made about the order.

The code determining the interleaved order of entries is based on
pairwise comparisons, where a pair of entries is compared using the
first applicable of three progressively less reliable ways to compare:
first by sequence number if seqnum id matches, then by monotonic
timestamp if boot id matches, then by realtime timestamp (entries
within one journal file are always implicitly ordered by their order
in the file, which should match seqnum order unless the file is
corrupt). Unfortunately, while this method could be said to give the
"right" answer more often than just comparing by the always-available
realtime timestamp, it does not give a globally consistent order - if
the ways to compare do not agree, it's possible to have three entries
A, B and C such that A<B<C<A. For example as follows:

A: seqnum_id=0, seqnum=1, time=0
B: seqnum_id=1, seqnum=0, time=1
C: seqnum_id=0, seqnum=0, time=2

A<B by time, B<C by time, C<A by seqnum.

The interleaving code goes through all files and checks whether the
next entry from that file is earlier than the current candidate entry,
and so finds the earliest available one. Given entries A, B, C like
above, it could pick any of them depending on the order in which the
files happen to be iterated over. In this situation, entries could be
lost because of a check in next_beyond_location() that discarded
entries which compare <= the previously chosen entry. But whichever of
the three entries we choose, there's at least one that still compares
smaller and so would be silently discarded. The rationale for this
check given in comments is to discard duplicate entries that somehow
ended up in multiple journal files. But checking for full equality
should be enough for this purpose. Early code seems to have checked
for equality, then it was changed in the restructuring in commit
de190aef08bb267 ("journal: implement multiple field matches") for some
reason.

Change the code to only discard an entry if it's verified to be fully
equal to one in another file. There doesn't seem to be any obvious
reason why the monotonicity check would be needed (not anything
remembered by its author either). This should guarantee that entries
are at least not silently discarded.

However, some ordering-related issues remain that should probably be
at least documented if not fixed:
- When ordering information is inconsistent, the order of entries is
arbitrary, and can change depending on factors should as directory list
order used by the filesystem containing the journal files, what match
terms if any were given, and iterating direction.
- As a consequence, sd_journal_previous() is not the inverse of
sd_journal_next(), and can take you to different entries than where
you came from using the other one.

If the inconsistency is limited, for example realtime clock goes back
a couple of minutes at boot, then this should normally only affect
ordering within the overlapping period. However, if you have at least
one journal file where the clock is seriously wrong while the other
information indicates another order (or other causes for a major
discrepancy, such as outright corrupt files) then very little is
guaranteed about the resulting order.

Guaranteeing a fully stable order (that wouldn't change depending on
match terms etc) would likely be hard without a global view of all
entries, but for example guaranteeing that all entries that share a
seqnum id are ordered by their seqnum with respect to each other
should be doable.
---
 src/journal/sd-journal.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/src/journal/sd-journal.c b/src/journal/sd-journal.c
index 8a1b161d8..5dedd9ea2 100644
--- a/src/journal/sd-journal.c
+++ b/src/journal/sd-journal.c
@@ -798,13 +798,10 @@ static int next_beyond_location(sd_journal *j, JournalFile *f, direction_t direc
         for (;;) {
                 bool found;

-                if (j->current_location.type == LOCATION_DISCRETE) {
-                        int k;
-
-                        k = compare_with_location(f, &j->current_location);
-
-                        found = direction == DIRECTION_DOWN ? k > 0 : k < 0;
-                } else
+                if (j->current_location.type == LOCATION_DISCRETE)
+                        // only == 0 or not matters
+                        found = compare_with_location(f, &j->current_location);
+                else
                         found = true;

                 if (found)
-- 
2.15.1

_______________________________________________
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel

Re: [systemd-devel] Fragile journal interleaving

Reply via email to