Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Subramanya Sastry


TL;DR: You get to a spec by paying down the technical debt that keeps 
wikitext parsing intricately tied to the internals of the MediaWiki 
implementation and its state.


In these discussions, there is far too much focus on the fact that you cannot 
write a BNF / yacc / lex / bison grammar for wikitext, or that quote parsing 
is context-sensitive (see the toy illustration just below). I don't think 
that is as big a deal: you could switch to Markdown, for example, and it 
would not change much of the picture outlined in the rest of this mail. All 
of that is less of an issue compared to what I describe next.
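As a toy illustration of the quote issue (this is emphatically not 
MediaWiki's actual algorithm, just a demonstration of the ambiguity): a run 
of apostrophes can be tokenized many ways, and the parser has to pick one 
based on the surrounding line.

def splits(n, parts=(1, 2, 3)):
    """All ways to split a run of n apostrophes into tokens of length
    1 (literal apostrophe), 2 (italic marker '') or 3 (bold marker ''')."""
    if n == 0:
        return [[]]
    out = []
    for p in parts:
        if p <= n:
            out.extend([p] + rest for rest in splits(n - p, parts))
    return out

print(len(splits(5)))  # a run of five apostrophes has 13 candidate tokenizations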


Right now, MediaWiki HTML output depends on the following:
* input wikitext
* wiki config (including installed extensions)
* installed templates
* media resources (images, audio, video)
* PHP parser hooks that expose parsing internals and implementation 
details (not replicable in other parsers)

* wiki messages (ex: cite output)
* state of the corpus and other db state (ex: red links, bad images)
* user state (prefs, etc.)
* Tidy

So, one reason implementing a wikitext parser is complex is that the output 
HTML is not simply a straightforward transformation of the input wikitext 
(and some config). There is far too much other state that gets in the way.
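To make that concrete, here is a rough sketch of the contrast (hypothetical 
function signatures, not actual MediaWiki code):

def parse_spec_friendly(wikitext: str, site_config: dict) -> str:
    """The goal state: HTML as a (mostly) pure function of wikitext plus
    static site configuration."""
    ...

def parse_as_implemented_today(wikitext: str, site_config: dict,
                               templates: dict, media: dict,
                               parser_hook_callbacks: list,
                               site_messages: dict, db_state: object,
                               user_state: object, run_tidy: bool = True) -> str:
    """What the output effectively depends on today: template and extension
    internals, red-link / bad-image lookups from the db, user prefs, Tidy, ..."""
    ...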


The second source of complexity is that markup errors aren't confined to 
narrow contexts but can leak out and affect the output of the entire page. 
Some user pages even seem to exploit this as a feature (unclosed div tags).


The third source of complexity is that some parser hooks expose internals of 
the implementation (Before/After Strip/Tidy and other such hooks). An 
implementation without Tidy, or one that handles wikitext differently, might 
not have the same pipeline.


However, we can still get to a spec that is much more replicable if we start 
cleaning up some of this incrementally and paying down technical debt. Here 
are some things going on right now toward that goal.


* We are close to getting rid of Tidy, which removes it from the equation.
* There are RFCs that propose defining DOM scopes and propose that the 
output of templates (and extensions) be a DOM (vs. a string), with some 
caveats (that I will ignore here). If we can get to implementing these, we 
immediately isolate the parsing of a top-level page from the details of how 
extensions and transclusions are processed.
* There are RFCs that propose that things like red links, bad images, user 
state, and site messages not be inputs into the core wikitext parse. From a 
spec point of view, they should be viewed as post-processing transformations 
(see the toy sketch after this list). For efficiency reasons, an 
implementation might choose to integrate these into the parse, but that is 
not a requirement.
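To illustrate the red-link case with a toy sketch (hypothetical markup and 
helper names; the RFCs spell this out differently and in much more detail):

import re

def parse(wikitext: str) -> str:
    """Core parse (toy): turn [[Title]] into a link with no database access.
    Whether the target exists is deliberately NOT decided here."""
    return re.sub(r'\[\[([^]|]+)\]\]',
                  lambda m: f'<a data-wiki-title="{m.group(1)}">{m.group(1)}</a>',
                  wikitext)

def mark_red_links(html: str, page_exists) -> str:
    """Post-processing transformation: consult wiki state (here, a callback)
    and annotate links whose target does not exist."""
    def annotate(m):
        title = m.group(1)
        cls = '' if page_exists(title) else ' class="new"'
        return f'<a data-wiki-title="{title}"{cls}>'
    return re.sub(r'<a data-wiki-title="([^"]+)">', annotate, html)

html = parse("See [[Existing page]] and [[Missing page]].")
html = mark_red_links(html, page_exists=lambda t: t == "Existing page")

The point being: the core parse stays reproducible from wikitext and config 
alone, and corpus / user state only enters through separable, clearly 
specified transformations.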


Separately, here is one other thing we can consider:
* Deprecate and replace tag hooks that expose parser internals.

When all of these are done, it becomes far more feasible to think of 
defining a spec for wikitext parsing that is not tied to the internals of 
MediaWiki or its extensions. At that point, you could implement templating 
via Lua or via JS or via Ruby ... the specifics are immaterial. What matters 
is that those templating implementations and extensions produce output with 
certain properties. You can then specify that MediaWiki HTML is a series of 
transformations applied to the output of the wikitext parser ... and there 
can be multiple spec-compliant implementations of that parser.


I think it is feasible to get there. But whether we want a spec for 
wikitext, and should work towards one, is a different question.


Subbu.

On 08/01/2016 08:34 PM, Gergo Tisza wrote:

On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier  wrote:


Do you believe that declaring "the implementation is the spec" is a
sustainable way of encouraging contribution to our projects?


Reimplementing Wikipedia's parser (complete with template inclusions,
Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is
practically impossible. What we do or do not declare won't change that.

There are many other, more realistic ways to encourage contribution by
users who are interested in wikis, but not in Wikimedia projects.
(Supporting Markdown would certainly be one of them.) But historically the
WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no
other actor has been both willing and able to step up in its place.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread John Mark Vandenberg
On Tue, Aug 2, 2016 at 8:34 AM, Gergo Tisza  wrote:
> On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier  wrote:
>
>> Do you believe that declaring "the implementation is the spec" is a
>> sustainable way of encouraging contribution to our projects?
>
>
> Reimplementing Wikipedia's parser (complete with template inclusions,
> Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is
> practically impossible. What we do or do not declare won't change that.

Correct, re-implementing the MediaWiki parser is a mission from hell.
And yet, WMF is doing that with Parsoid ... ;-)
And WMF will no doubt do it again in the future.
Changing infrastructure is normal for systems that last many generations.

But the real problem with not having a versioned spec is that nobody can
reliably do anything at all with the content.

Even basic tokenizing of wikitext has many undocumented gotchas, and
even with the correct voodoo today there is no guarantee that WMF
engineers won't break it tomorrow without informing anyone that the
spec has changed.

> There are many other, more realistic ways to encourage contribution by
> users who are interested in wikis, but not in Wikimedia projects.
> (Supporting Markdown would certainly be one of them.) But historically the
> WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no
> other actor has been both willing and able to step up in its place.

The main reason for a spec should be the sanity of the Wikimedia
technical user base, including donor-funded WMF engineers, who build
parsers in other languages for various reasons. Those parsers support
tools that account for a very large percentage of the total edits to
Wikimedia projects and that are critical for preventing abuse and for
assisting admins with the tasks that keep the sites from falling apart.

--
John Vandenberg

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Gergo Tisza
On Mon, Aug 1, 2016 at 5:27 PM, Rob Lanphier  wrote:

> Do you believe that declaring "the implementation is the spec" is a
> sustainable way of encouraging contribution to our projects?


Reimplementing Wikipedia's parser (complete with template inclusions,
Wikidata fetches, Lua scripts, LaTeX snippets and whatever else) is
practically impossible. What we do or do not declare won't change that.

There are many other, more realistic ways to encourage contribution by
users who are interested in wikis, but not in Wikimedia projects.
(Supporting Markdown would certainly be one of them.) But historically the
WMF has shown zero interest in the wiki-but-not-Wikimedia userbase, and no
other actor has been both willing and able to step up in its place.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread John Mark Vandenberg
There is a slow-moving discussion about this at
https://www.mediawiki.org/wiki/Talk:Requests_for_comment/Markdown

The bigger risk is that the rest of the world settles on using
CommonMark Markdown once it is properly specified.  In the short term
that will mean MediaWiki needs to support Markdown; eventually it would
need to adopt Markdown as the primary text format, and ultimately we
would lose our own ability to render old revisions, because the parser
would bit-rot.

One practical way to add more discipline around this problem is to
introduce a "mediawiki-wikitext-announce" list, similar to the
mediawiki-api-announce list, and require that *every* breaking change
to the wikitext parser is announced there.

Wikitext is a file format, and there are alternative parsers, which need
to be updated any time the PHP parser changes.

https://www.mediawiki.org/wiki/Alternative_parsers

It should be managed just like the MediaWiki API, with appropriate
notices sent out, so that other tools can be kept up to date, and so
there is an accurate record of when breaking changes occurred.

--
John Vandenberg

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Reload SQLight from MYSQL dump?

2016-08-01 Thread John
 For mass imports, use importDump.php - see <
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps> for details.

On Mon, Aug 1, 2016 at 8:37 PM, Jefsey  wrote:

> I am not familiar with databases. I have old MySQL-based wiki sites I
> cannot access anymore due to a change in PHP and MySQL versions. I have old
> XML dumps. Is it possible to reload them into SQLite wikis? These were
> working-group wikis: we are only interested in restoring the texts. We have
> the images. We are not interested in the access rights: we will have to
> rebuild them anyway.
>
> Thank you for the help!
> jefsey
>
> PS. We are dedicated to light wikis, which are fine under SQLite; would
> there be a dedicated list for SQLite management (and, further on, development)?
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Reload SQLight from MYSQL dump?

2016-08-01 Thread Jefsey
I am not familiar with databases. I have old MySQL-based wiki sites 
I cannot access anymore due to a change in PHP and MySQL versions. I 
have old XML dumps. Is it possible to reload them into SQLite 
wikis? These were working-group wikis: we are only interested in 
restoring the texts. We have the images. We are not interested in the 
access rights: we will have to rebuild them anyway.


Thank you for the help!
jefsey

PS. We are dedicated to light wikis, which are fine under SQLite; would 
there be a dedicated list for SQLite management (and, further on, development)?



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Rob Lanphier
On Mon, Aug 1, 2016 at 1:56 PM, Gergo Tisza  wrote:
> On Mon, Aug 1, 2016 at 1:01 PM, Rob Lanphier  wrote:
>> On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza  wrote:
>> > Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of
>> > project (ie. wouldn't expect it to happen in this decade), and even then
>> it
>> > would not fully solve the problem[...]
>>
>> You seem to be suggesting that
>> 1.  Specifying wikitext-html conversion is really hard
>> 2.  It's not a silver bullet (i.e. it doesn't "fully solve the problem")
>> 3.  HTML storage looks more like a silver bullet, and is cheaper
>> 4.  Therefore, a specification is not really worth doing, or if it is,
>> it's really low priority
>>
>> Is that an accurate way of paraphrasing your email?
>
> Yes. The main problem with specifying wikitext-to-html is that extensions
> get to extend it in arbitrary ways; e.g. the specification for Scribunto
> would have to include the whole Lua compiler semantics.


Do you believe that declaring "the implementation is the spec" is a
sustainable way of encouraging contribution to our projects?

Rob

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Gabriel Wicke
> One possibility is considering storing rendered HTML for old revisions. It
> lets wikitext (and hence parser) evolve without breaking old revisions. Plus
> rendered HTML will use the template revision at the time it was rendered vs.
> the latest revision (this is the problem Memento tries to solve).

Long-term HTML archival is something we have been gradually working
towards with RESTBase.

Since HTML is about 10x larger than wikitext, a major concern is storage
cost. Old estimates put the total storage needed to store one HTML copy of
each revision at roughly 120 TB. To reduce this cost, we have since
implemented several improvements:


   - Brotli compression, once deployed, is expected to reduce the total
   storage needs to about 1/4-1/5 of the gzip-compressed size (see the
   back-of-envelope sketch after this list).
   - The ability to split the latest revisions from old revisions lets us use
   cheaper and slower storage for old revisions.
   - Retention policies let us specify how many renders per revision we
   want to archive. We currently archive only one (the latest) render per
   revision, but have the option to store one render per $time_unit. This is
   especially important for pages like [[Main Page]], which are rarely edited
   but constantly change their content in meaningful ways via templates. It is
   currently not possible to reliably cite such pages without resorting to
   external services like archive.org.
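To put rough numbers on the above (back-of-envelope only, and assuming the
120 TB estimate already reflects gzip compression):

html_per_revision_tb = 120.0          # old estimate: one HTML render per revision
brotli_vs_gzip = (1 / 4 + 1 / 5) / 2  # midpoint of the expected 1/4-1/5 ratio

print(f"with Brotli: ~{html_per_revision_tb * brotli_vs_gzip:.0f} TB")  # ~27 TB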


Another important requirement for making HTML a useful long-term archival
medium is to establish a clear standard for the HTML structures used. The
versioned Parsoid HTML spec, along with format migration logic for old
content, is designed to make the stored HTML as future-proof as possible.

While we currently only have space for a few months' worth of HTML
revisions, we do expect the changes above to make it possible to push this
to years in the foreseeable future without unreasonable hardware needs.
This means that we can start building up an archive of our content in a
format that is not tied to the software.

Faithfully re-rendering old revisions after the fact is harder. We will
likely have to make some trade-offs between fidelity and effort.

Gabriel


On Mon, Aug 1, 2016 at 2:01 PM, David Gerard  wrote:

> On 1 August 2016 at 17:37, Marc-Andre  wrote:
>
> > We need to find a long-term view to a solution.  I don't mean just
> keeping
> > old versions of the software around - that would be of limited help.
> It'd
> > be an interesting nightmare to try to run early versions of phase3
> nowadays,
> > and probably require managing to make a very very old distro work and
> > finding the right versions of an ancient apache and PHP.  Even *building*
> > those might end up being a challenge... when is the last time you saw a
> > working egcs install? I shudder to think how nigh-impossible the task might be 100
> > years from now.
>
>
> oh god yes. I'm having this now, trying to revive an old Slash
> installation. I'm not sure I could even reconstruct a box to run it
> without compiling half of CPAN circa 2002 from source.
>
> Suggestion: set up a copy of WMF's setup on a VM (or two or three),
> save that VM and bundle it off to the Internet Archive as a dated
> archive resource. Do this regularly.
>
>
> - d.
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



-- 
Gabriel Wicke
Principal Engineer, Wikimedia Foundation
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread David Gerard
On 1 August 2016 at 17:37, Marc-Andre  wrote:

> We need to find a long-term view to a solution.  I don't mean just keeping
> old versions of the software around - that would be of limited help.  It'd
> be an interesting nightmare to try to run early versions of phase3 nowadays,
> and probably require managing to make a very very old distro work and
> finding the right versions of an ancient apache and PHP.  Even *building*
> those might end up being a challenge... when is the last time you saw a
> working egcs install? I shudder to think how nigh-impossible the task might be 100
> years from now.


oh god yes. I'm having this now, trying to revive an old Slash
installation. I'm not sure I could even reconstruct a box to run it
without compiling half of CPAN circa 2002 from source.

Suggestion: set up a copy of WMF's setup on a VM (or two or three),
save that VM and bundle it off to the Internet Archive as a dated
archive resource. Do this regularly.


- d.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Gergo Tisza
On Mon, Aug 1, 2016 at 1:01 PM, Rob Lanphier  wrote:

> On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza  wrote:
> > Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of
> > project (ie. wouldn't expect it to happen in this decade), and even then
> it
> > would not fully solve the problem[...]
>
> You seem to be suggesting that
> 1.  Specifying wikitext-html conversion is really hard
> 2.  It's not a silver bullet (i.e. it doesn't "fully solve the problem")
> 3.  HTML storage looks more like a silver bullet, and is cheaper
> 4.  Therefore, a specification is not really worth doing, or if it is,
> it's really low priority
>
> Is that an accurate way of paraphrasing your email?
>

Yes. The main problem with specifying wikitext-to-html is that extensions
get to extend it in arbitrary ways; e.g. the specification for Scribunto
would have to include the whole Lua compiler semantics.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Rob Lanphier
On Mon, Aug 1, 2016 at 12:19 PM, Gergo Tisza  wrote:
> Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of
> project (ie. wouldn't expect it to happen in this decade), and even then it
> would not fully solve the problem[...]

You seem to be suggesting that
1.  Specifying wikitext-html conversion is really hard
2.  It's not a silver bullet (i.e. it doesn't "fully solve the problem")
3.  HTML storage looks more like a silver bullet, and is cheaper
4.  Therefore, a specification is not really worth doing, or if it is,
it's really low priority

Is that an accurate way of paraphrasing your email?

Rob

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Acquiring list of templates including external links

2016-08-01 Thread Legoktm
Hi,

On 07/31/2016 07:53 AM, Takashi OTA wrote:
> Such links are stored in externallinks.sql.gz, in an expanded form.
> 
> When you want to check increase/decrease of linked domains in chronological
> order through edit history, you have to check pages-meta-history1.xml etc.
> In such a case, traditional links and links generated by templates are mixed;
> therefore, the latter (links generated by templates) should be expanded into
> traditional link form.

If you have the revision ID, you can make an API query against the parse API.
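Something along these lines (assuming the standard action=parse module with
prop=externallinks; adjust the endpoint for the wiki in question):

import requests

API = "https://en.wikipedia.org/w/api.php"

def external_links_for_revision(rev_id):
    """Parse an old revision server-side and return its external links."""
    params = {
        "action": "parse",
        "oldid": rev_id,
        "prop": "externallinks",
        "format": "json",
    }
    return requests.get(API, params=params).json()["parse"]["externallinks"]

(Note that templates are expanded with their *current* revisions, which is
the caveat Marc-Andre raises elsewhere in this thread.)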

This will expand all templates and give you the same set of
externallinks that would have ended up in the dump.

-- Legoktm

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Gergo Tisza
On Mon, Aug 1, 2016 at 11:47 AM, Rob Lanphier  wrote:

> > HTML storage comes with its own can of worms, but it seems like a
> solution
> > worth thinking about in some form.
> >
> > 1. storage costs (fully rendered HTML would be 5-10 times bigger than
> > wikitext for that same page, and much larger if stored as wikitext diffs)
> > 2. evolution of HTML spec and its effect on old content (this affects the
> > entire web, so, whatever solution works there will work for us as well)
> > 3. newly discovered security holes and retroactively fixing them in
> stored
> > html and released dumps (not sure).
> > ... and maybe others.
>
> I think these are all reasons why I chose the word "seductive" as
> opposed to more unambiguous praise  :-)  Beyond these reasons, the
> bigger issue is that it's an invitation to be sloppy about our
> formats.  We should endeavor to make our wikitext to html conversion
> more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel
> Kinzler taught me).  Holding a large data store of snapshots seems
> like a crutch to avoid the hard work of specifying how this conversion
> ought to work.  Let's actually nail down the spec for this[2][3]
> rather than kidding ourselves into believing we can just store enough
> HTML snapshots to make the problem moot.
>

Specifying wikitext-html conversion sounds like a MediaWiki 2.0 type of
project (i.e. I wouldn't expect it to happen in this decade), and even then it
would not fully solve the problem - e.g. very old versions relied on the
default CSS of a different MediaWiki skin; you need site scripts for some
things, such as infobox show/hide functionality, to work, but the standard
library those scripts rely on has changed; the same goes for Scribunto scripts.

HTML storage is actually not that bad - browsers are very good at backwards
compatibility with older HTML specs, and there is very little security
footprint in serving static HTML from a separate domain. Storage is a
problem, but there is no need to store every page revision - monthly or
yearly snapshots would be fine IMO. (cf. T17017 - again, Kiwix seems to do
this already, so maybe it's just a matter of coordination.) The only other
practical problem I can think of is that it would preserve
deleted/oversighted information - that problem already exists with the
dumps, but those are not kept for very long (on WMF servers at least).
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Rob Lanphier
On Mon, Aug 1, 2016 at 9:51 AM, Subramanya Sastry  wrote:
> On 08/01/2016 11:37 AM, Marc-Andre wrote:
>> Is there something we can do to make the passage of years hurt less?
>> Should we be laying groundwork now to prevent issues decades away?
>
>
> One possibility is considering storing rendered HTML for old revisions. It
> lets wikitext (and hence parser) evolve without breaking old revisions. Plus
> rendered HTML will use the template revision at the time it was rendered vs.
> the latest revision (this is the problem Memento tries to solve).


This is a seductive path to choose.  Maintaining backwards
compatibility for poorly conceived (in retrospect) engineering
decisions is really hard work.  A lot of the cruft and awfulness of
enterprise-focused software comes from dealing with the seemingly
endless torrent of edge cases which are often backwards-compatibility
issues in the systems/formats/databases/protocols that the software
depends on.  The [Y2K problem][1] was a global lesson in the
importance of intelligently paying down technical debt.

You outline the problems with this approach in the remainder of your email:

> HTML storage comes with its own can of worms, but it seems like a solution
> worth thinking about in some form.
>
> 1. storage costs (fully rendered HTML would be 5-10 times bigger than
> wikitext for that same page, and much larger if stored as wikitext diffs)
> 2. evolution of HTML spec and its effect on old content (this affects the
> entire web, so, whatever solution works there will work for us as well)
> 3. newly discovered security holes and retroactively fixing them in stored
> html and released dumps (not sure).
> ... and maybe others.

I think these are all reasons why I chose the word "seductive" as
opposed to more unambiguous praise  :-)  Beyond these reasons, the
bigger issue is that it's an invitation to be sloppy about our
formats.  We should endeavor to make our wikitext to html conversion
more scientifically reproducible (i.e. "Nachvollziehbarkeit" as Daniel
Kinzler taught me).  Holding a large data store of snapshots seems
like a crutch to avoid the hard work of specifying how this conversion
ought to work.  Let's actually nail down the spec for this[2][3]
rather than kidding ourselves into believing we can just store enough
HTML snapshots to make the problem moot.

Rob

[1]: https://en.wikipedia.org/wiki/Year_2000_problem
[2]: https://www.mediawiki.org/wiki/Markup_spec
[3]: https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Conference submissions now open for WikiConference North America

2016-08-01 Thread Pine W
WikiConference North America will take place October 7 through 10 in San
Diego.

The session tracks are:

1. Community
2. Advocacy & Outreach
3. Technology & Infrastructure
4. Health care and science
5. GLAM
6. Education and Academic Engagement

Please submit proposals here: https://wikiconference.org/wiki/Submissions

The submission deadline is August 31st.

Pine
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Pine W
"Should we be laying groundwork now to prevent issues decades away?" I'll
answer that with "Yes". I could provide some interesting stories about
technological and budgetary headaches that result from repeatedly delaying
efforts to make legacy software be forwards-compatible. The technical
details of the tools mentioned here are beyond me, but I saw what happened
in another org that was dealing with legacy software and it wasn't pretty.

Pine

On Mon, Aug 1, 2016 at 9:37 AM, Marc-Andre  wrote:

> On 2016-08-01 12:21 PM, Gergo Tisza wrote:
>
> the parser has changed over time and old templates
>> might not work anymore
>>
>
> Aaah.  Good point.  Also, the changes in extensions (or, indeed, what
> extensions are installed at all) might break attempts to parse the past, as
> it were.
>
> You know, this is actually quite troublesome: as the platform evolves the
> older data becomes increasingly hard to use at all - making it effectively
> lost even if we kept the bits around.  This is a rather widespread issue in
> computing as a rule; but I now find myself distressed at its unavoidable
> effect on what we've always intended to be a permanent contribution to
> humanity.
>
> We need to find a long-term view to a solution.  I don't mean just keeping
> old versions of the software around - that would be of limited help.  It'd
> be an interesting nightmare to try to run early versions of phase3
> nowadays, and probably require managing to make a very very old distro work
> and finding the right versions of an ancient apache and PHP.  Even
> *building* those might end up being a challenge... when is the last time
> you saw a working egcs install? I shudder to think how nigh-impossible the task
> might be 100 years from now.
>
> Is there something we can do to make the passage of years hurt less?
> Should we be laying groundwork now to prevent issues decades away?
>
> At the very least, I think those questions are worth asking.
>
> -- Coren / Marc
>
>
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Subramanya Sastry

On 08/01/2016 11:37 AM, Marc-Andre wrote:

...
Is there something we can do to make the passage of years hurt less?  
Should we be laying groundwork now to prevent issues decades away?


One possibility is considering storing rendered HTML for old revisions. 
It lets wikitext (and hence parser) evolve without breaking old 
revisions. Plus rendered HTML will use the template revision at the time 
it was rendered vs. the latest revision (this is the problem Memento 
tries to solve).


HTML storage comes with its own can of worms, but it seems like a 
solution worth thinking about in some form.


1. storage costs (fully rendered HTML would be 5-10 times bigger than 
wikitext for that same page, and much larger if stored as wikitext diffs)
2. evolution of HTML spec and its effect on old content (this affects 
the entire web, so, whatever solution works there will work for us as well)
3. newly discovered security holes and retroactively fixing them in 
stored html and released dumps (not sure).

... and maybe others.

Subbu.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Loosing the history of our projects to bitrot. Was: Acquiring list of templates including external links

2016-08-01 Thread Marc-Andre

On 2016-08-01 12:21 PM, Gergo Tisza wrote:


the parser has changed over time and old templates
might not work anymore


Aaah.  Good point.  Also, the changes in extensions (or, indeed, what 
extensions are installed at all) might break attempts to parse the past, 
as it were.


You know, this is actually quite troublesome: as the platform evolves 
the older data becomes increasingly hard to use at all - making it 
effectively lost even if we kept the bits around.  This is a rather 
widespread issue in computing as a rule; but I now find myself 
distressed at its unavoidable effect on what we've always intended to be 
a permanent contribution to humanity.


We need to find a long-term view to a solution.  I don't mean just 
keeping old versions of the software around - that would be of limited 
help.  It'd be an interesting nightmare to try to run early versions of 
phase3 nowadays, and it would probably require managing to make a very, 
very old distro work and finding the right versions of an ancient Apache 
and PHP.  Even *building* those might end up being a challenge... when is 
the last time you saw a working egcs install? I shudder to think how 
nigh-impossible the task might be 100 years from now.


Is there something we can do to make the passage of years hurt less?  
Should we be laying groundwork now to prevent issues decades away?


At the very least, I think those questions are worth asking.

-- Coren / Marc


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Acquiring list of templates including external links

2016-08-01 Thread Gergo Tisza
On Mon, Aug 1, 2016 at 7:46 AM, Marc-Andre  wrote:

> Clearly, all the data to do so is there in the database - and I seem to
> recall that there exists an extension that will allow you to use the parser
> in that way - but the Foundation projects do not have such an extension
> installed and cannot be convinced to render a page for you that would
> accurately show what ELs it might have had at a given date.
>

That would be the Memento [1] extension. I'm not sure this is even
theoretically possible - the parser has changed over time and old templates
might not work anymore.

Your best bet is probably to find some old dumps. (Kiwix [2] maybe? I don't
know if they preserve templates.)


[1] https://www.mediawiki.org/wiki/Extension:Memento
[2] https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Acquiring list of templates including external links

2016-08-01 Thread Marc-Andre

On 2016-07-31 10:53 AM, Takashi OTA wrote:


When you want to check increase/decrease of linked domains in chronological
order through edit history


This is actually a harder problem than it seems, even at first glance: 
if you want to examine the links over time, then, when you are looking at 
an old revision of an article, you have to contrive to expand the 
templates /as they existed at that time/ and not those that exist /now/, 
as the MediaWiki engine would do.


Clearly, all the data to do so is there in the database - and I seem to 
recall that there exists an extension that will allow you to use the 
parser in that way - but the Foundation projects do not have such an 
extension installed and cannot be convinced to render a page for you 
that would accurately show what ELs it might have had at a given date.
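For what it's worth, fetching a template's source /as it existed at a given
time/ is easy enough with the standard revisions API (rough sketch below);
it's rendering the page with those old template revisions that the projects
cannot be convinced to do:

import requests

API = "https://en.wikipedia.org/w/api.php"

def template_source_as_of(title, timestamp):
    """Wikitext of e.g. 'Template:Citation needed' as of an ISO timestamp."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvlimit": 1,
        "rvdir": "older",         # newest revision at or before rvstart
        "rvstart": timestamp,     # e.g. "2010-01-01T00:00:00Z"
        "rvprop": "content",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]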


-- Coren / Marc


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l