https://bugzilla.wikimedia.org/show_bug.cgi?id=22624

           Summary: Corruption of archive text due to deletion in late
                    2004
           Product: MediaWiki
           Version: 1.4-cvs
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: Normal
         Component: History/Diffs
        AssignedTo: wikibugs-l@lists.wikimedia.org
        ReportedBy: tstarl...@wikimedia.org


This is a bug I'm tracking down and fixing. I'm putting it here so I have a
place for notes and something to refer to.

CGZ compression was first committed in October 2004 (r5940). In December 2004
(r6640), this bug was discovered and a temporary fix was put in place.
Apparently nobody submitted it to Bugzilla at the time.

The issue was that the deletion UI was blind to the compression scheme, and was
causing CGZ blobs and pointers to be moved into the archive table. Undeletion
would move them back. Pointers to deleted rows cannot work and will give you an
error message, so the text behind these pointers is unreadable. If the whole
article was undeleted, the CGZ blob would get a different old_id, which means
that the pointers still don't work.

If the article was partially undeleted, then you could have pointers which
point to deleted rows.
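
The failure mode can be illustrated with a toy model. This is only a sketch:
the TextTable, Blob and Pointer classes, the MD5 keying and the "first text is
the default text" rule are all assumptions made for illustration, not the
actual MediaWiki storage code.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Union


def content_hash(text: str) -> str:
    # Assumption: texts inside a blob are keyed by an MD5 hex digest.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


@dataclass
class Blob:
    # Hypothetical stand-in for a CGZ blob: several texts keyed by hash.
    texts: Dict[str, str]


@dataclass
class Pointer:
    # Hypothetical stand-in for a CGZ pointer: names a blob row by old_id.
    blob_old_id: int
    wanted_hash: str


Row = Union[Blob, Pointer]


class TextTable:
    def __init__(self) -> None:
        self.rows: Dict[int, Row] = {}
        self.next_id = 1

    def insert(self, row: Row) -> int:
        old_id = self.next_id
        self.next_id += 1
        self.rows[old_id] = row
        return old_id

    def read(self, old_id: int) -> str:
        row = self.rows[old_id]
        if isinstance(row, Pointer):
            target = self.rows.get(row.blob_old_id)
            if not isinstance(target, Blob):
                raise LookupError(
                    f"pointer {old_id} refers to missing blob {row.blob_old_id}"
                )
            return target.texts[row.wanted_hash]
        # Assumption: a blob read directly yields its first text ("default text").
        return next(iter(row.texts.values()))


def delete_and_undelete(table: TextTable, old_ids: List[int]) -> List[int]:
    # Compression-blind deletion: rows are moved out verbatim, then
    # reinserted on undeletion under fresh old_id values.
    archived = [table.rows.pop(old_id) for old_id in old_ids]
    return [table.insert(row) for row in archived]
```

In this model, a pointer reads fine before deletion. After a full
delete/undelete cycle the blob still returns its default text, but the pointer
still names the blob's old old_id and raises LookupError.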

However, undeleted CGZ rows would still give you their default text. That left
them open to subsequent irreversible corruption by recompressTracked.php, which
may have deleted some of these CGZ blobs and replaced each of them with a
pointer to the primary text only.

The subsequent fixes (r6640, r8983) only fixed the text corruption at the
source (i.e. deletion). Apparently no script was run to fix corrupted archive
rows or undeleted text rows.

Some archive rows even have pointers to external storage, apparently moved in
from old/text via the same bug.

The reason this is coming up now is that there are a fair few revisions which
are either accessible (CGZ default text) or inaccessible but recoverable (CGZ
pointers), and all of them are now at risk of being lost permanently due to
recompressTracked.php.

The basic plan of action is to compile a list of content hashes in affected CGZ
blobs, and to match them up with broken pointers by comparing those content
hashes.
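
The matching step could look roughly like the following sketch. It assumes
that each broken pointer records a hash of the text it refers to, which the
plan implies but does not spell out, and that the hash is MD5. The function
names and the "table:id" location strings are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

# (location of the blob, recovered text)
Hit = Tuple[str, str]


def md5_hex(text: str) -> str:
    # Assumption: content hashes are MD5 hex digests of the text.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def index_blobs(blobs: Dict[str, Iterable[str]]) -> Dict[str, List[Hit]]:
    """Build content hash -> [(blob location, text)] over all affected blobs."""
    index: Dict[str, List[Hit]] = {}
    for location, texts in blobs.items():
        for text in texts:
            index.setdefault(md5_hex(text), []).append((location, text))
    return index


def match_pointers(
    broken: Dict[str, str], index: Dict[str, List[Hit]]
) -> Tuple[Dict[str, Hit], List[str]]:
    """Match broken pointers (location -> wanted hash) against the index.

    Returns the recovered pointers and the locations that had no match.
    """
    recovered: Dict[str, Hit] = {}
    unmatched: List[str] = []
    for pointer_location, wanted_hash in broken.items():
        hits = index.get(wanted_hash)
        if hits:
            # Any blob holding the same content will do; take the first.
            recovered[pointer_location] = hits[0]
        else:
            unmatched.append(pointer_location)
    return recovered, unmatched
```

Because the match is on content rather than on old_id, it does not matter
which table the blob ended up in or what ID it was given on undeletion.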

I may be able to take this opportunity to normalise the entire archive table,
by converting archive rows to the MW 1.5+ format, with a non-null ar_text_id,
and blank ar_text and ar_flags. This will free up core database space and allow
the deleted text to be recompressed.
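
As a rough sketch of that conversion, assuming in-memory dictionaries in place
of the real tables: ar_text, ar_flags and ar_text_id are the columns named
above, while the old_text/old_flags layout of the text rows and the
normalise_archive helper are assumptions for illustration. It also assumes any
corrupted rows have already been repaired.

```python
from typing import Dict, List, Optional, TypedDict


class ArchiveRow(TypedDict):
    ar_text: str
    ar_flags: str
    ar_text_id: Optional[int]


class TextRow(TypedDict):
    old_text: str
    old_flags: str


def normalise_archive(
    archive: List[ArchiveRow], text_table: Dict[int, TextRow]
) -> int:
    """Move inline archive text into text_table; return rows converted."""
    next_id = max(text_table, default=0) + 1
    converted = 0
    for row in archive:
        if row["ar_text_id"] is not None:
            continue  # already in the newer format
        text_table[next_id] = {
            "old_text": row["ar_text"],
            "old_flags": row["ar_flags"],
        }
        row["ar_text_id"] = next_id
        row["ar_text"] = ""
        row["ar_flags"] = ""
        next_id += 1
        converted += 1
    return converted
```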

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
You are on the CC list for the bug.

