Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-15 Thread Chris Adams
On Wed, Jan 13, 2016 at 12:47 PM, Legoktm wrote:

> > When that work completes, we'll have somewhere around half a million links
> > which differ only in the URL scheme. What would be the best way to rewrite
> > all of those URLs? I'd like to reduce the window during which users transit
> > from HTTPS -> HTTP -> HTTPS.
>
> You can use Pywikibot's replace.py[1], which lets you provide regex
> find/replace and can get a list of pages from the API equivalent of
> Special:LinkSearch.
>

Thanks – I gave this a test using our simplest site (
https://gist.github.com/acdha/77354c76bf503b6f455f) to produce a minor edit
like this:

https://en.wikipedia.org/w/index.php?title=World_Digital_Library=78071=699554478

I had a question about etiquette: is a one-time operation like this
considered a bot for the purposes of needing to go through the approval
process? I anticipate running this once for each application as it is
migrated, but each run would be a one-time process: since there will be
permanent redirects, there won't be a need for this to run automatically in
the future, because users won't be seeing http: URLs any more.

Chris
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-15 Thread Oliver Keyes
I imagine you would need to go through the process, yep, since it's
kind of a lot of edits that'd need clearing up if something went
wrong.




-- 
Oliver Keyes
Count Logula
Wikimedia Foundation


Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-13 Thread Max Semenik
Fix them with a bot, for example AWB.





-- 
Best regards,
Max Semenik ([[User:MaxSem]])

Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-13 Thread Chris Adams
On Wed, Jan 13, 2016 at 1:49 PM, Risker wrote:

> Before properly answering this question, it's important to know how many
> links we're talking about.  If it's 5000, the fallout is probably
> manageable; but if it's in the hundreds of thousands on any project (most
> likely enwiki) there will be rending of garments and gnashing of teeth.
> All those changes show up on people's watchlists, after all.
>

Yes, that's exactly what I'd like to avoid. The first batch of URLs which
is ready to go is small (~4K) but the full list is significantly larger and
many of those are used on multiple pages so the edit churn would be
non-trivial.


> Please also ensure that if you're changing the URL, it's not just a http
> --> https swap, but that the new URL is tested to verify it lands on a real
> page.  There are no doubt plenty of bad links in amongst all those URLs -
> even government websites rearrange themselves periodically -  and replacing
> a bad link with a more secure bad link is not really helpful.
>

Yes – part of this project on our side is setting permanent redirects not
just for the protocol but also for pages which have moved into a different
application. This is the other side of what Oliver Keyes was asking about:
there is a mix of legacy applications which are non-trivial to rewrite, but
there are also many thousands of URLs where a simple regex could handle both
the protocol change and the switch to the canonical item page in the modern
unified app, instead of continuing to use a long-deprecated legacy view.
Internally we've been working to chunk that list of URLs into patterns by
application / project so they can be reviewed and tested in a reasonable
amount of time.
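
As a rough illustration of the chunking described above, the sketch below
groups URLs by host and first path segment. It assumes, purely for the
example, that the first path segment identifies the application; the
function name and the sample paths in the usage note are hypothetical and
not taken from the actual loc.gov URL list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def group_by_application(urls: Iterable[str]) -> Dict[Tuple[str, str], List[str]]:
    """Group URLs by (host, first path segment).

    Assumption for this sketch: the first path segment is a good enough
    proxy for "which application serves this URL".
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        # urlsplit().hostname is already lower-cased; fall back to "" if absent.
        host = parts.hostname or ""
        segments = [segment for segment in parts.path.split("/") if segment]
        first_segment = segments[0] if segments else ""
        groups[(host, first_segment)].append(url)
    return dict(groups)
```

Each resulting group, for example everything under a hypothetical
("www.loc.gov", "pictures") key, could then be reviewed and tested as one
batch with its own rewrite pattern.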

Chris

Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-13 Thread Risker
Before properly answering this question, it's important to know how many
links we're talking about.  If it's 5000, the fallout is probably
manageable; but if it's in the hundreds of thousands on any project (most
likely enwiki) there will be rending of garments and gnashing of teeth.
All those changes show up on people's watchlists, after all.

Please also ensure that if you're changing the URL, it's not just a http
--> https swap, but that the new URL is tested to verify it lands on a real
page.  There are no doubt plenty of bad links in amongst all those URLs -
even government websites rearrange themselves periodically -  and replacing
a bad link with a more secure bad link is not really helpful.
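
One way to act on this is to request the HTTPS form of each URL and
confirm that it resolves before touching the wiki page. The following is a
minimal sketch using only Python's standard library, not a tested
production checker; the function names are made up for illustration, and
the `fetch` parameter is a hypothetical hook so the logic can be exercised
without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple


def fetch_status(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Send a HEAD request and return (status code, final URL after redirects)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; report their status code.
        return err.code, url
    except urllib.error.URLError:
        # DNS failures, refused connections, etc. are reported as status 0.
        return 0, url


def https_target_is_live(
    http_url: str,
    fetch: Callable[[str], Tuple[int, str]] = fetch_status,
) -> bool:
    """Return True if the https:// form of http_url answers with a 2xx status
    and the final URL (after any redirects) is still served over HTTPS."""
    if not http_url.startswith("http://"):
        raise ValueError("expected an http:// URL")
    https_url = "https://" + http_url[len("http://"):]
    status, final_url = fetch(https_url)
    return 200 <= status < 300 and final_url.startswith("https://")
```

Links that fail this check would be candidates for manual review rather
than an automated scheme swap.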

Risker/Anne


Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-13 Thread Oliver Keyes
Question: are LOC links handled in a standardised way using a
template? Because if so, this could be one change, not hundreds of
thousands.

(If they're not, I'd really suggest using the same set of edits as an
opportunity to restructure them that way, if LOC links are consistent
enough for it to be done. That way you'll both avoid this problem in
the future if something goes kooky - only have to make one edit! - and
have a much easier way of identifying how many there are and where
they live.)




-- 
Oliver Keyes
Count Logula
Wikimedia Foundation


Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-13 Thread Chris Adams
On Wed, Jan 13, 2016 at 12:47 PM, Legoktm wrote:

> You can use Pywikibot's replace.py[1], which lets you provide regex
> find/replace and can get a list of pages from the API equivalent of
> Special:LinkSearch.
>

Thanks - I'll look into that as we get various batches of URLs ready for
testing.


> You should also consider setting up HSTS[2] so regardless if users click
> on an HTTP link, they'll be sent to the HTTPS version of the site.
>

Yes – that's on the plan as soon as we finish remediating the older
legacy content. I've been using lists from Wikipedia, a sampling of web
access logs, etc. to feed a script[1] to find cases where someone used an
absolute URL in a 

[Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-13 Thread Chris Adams
I've been working with a number of colleagues getting ready to turn HTTPS
on by default for various loc.gov domains. This has been fairly successful
and we're working through the old legacy apps now.

When that work completes, we'll have somewhere around half a million links
which differ only in the URL scheme. What would be the best way to rewrite
all of those URLs? I'd like to reduce the window during which users transit
from HTTPS -> HTTP -> HTTPS.

If anyone's curious, I've been collecting the links for a few dozen wikis
in a somewhat oversized Git repo:

https://github.com/acdha/lc-wikipedia-links

The first site which has completely migrated is the much smaller World
Digital Library which has just under four thousand links:
https://gist.github.com/acdha/f785b22b356a9842439e

Thanks,
Chris

Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-13 Thread Legoktm
On 01/13/2016 09:09 AM, Chris Adams wrote:
> I've been working with a number of colleagues getting ready to turn HTTPS
> on by default for various loc.gov domains. This has been fairly successful
> and we're working through the old legacy apps now.

Awesome!

> When that work completes, we'll have somewhere around half a million links
> which differ only in the URL scheme. What would be the best way to rewrite
> all of those URLs? I'd like to reduce the window during which users transit
> from HTTPS -> HTTP -> HTTPS.

You can use Pywikibot's replace.py[1], which lets you provide regex
find/replace and can get a list of pages from the API equivalent of
Special:LinkSearch.
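
To illustrate the kind of regex find/replace meant here, the sketch below
uses only Python's standard `re` module; it does not call Pywikibot, and
it is not the pattern used in any actual run. The host list and the
function name `upgrade_scheme` are assumptions for the example, and a real
run would need the actual set of migrated domains.

```python
import re

# Hypothetical list of hosts that have finished migrating to HTTPS.
MIGRATED_HOSTS = ["loc.gov", "wdl.org"]

_HOST_ALTERNATION = "|".join(re.escape(host) for host in MIGRATED_HOSTS)

# Match "http://" only when it is followed by a migrated host (or one of its
# subdomains) and the host ends at a URL delimiter, whitespace, or end of
# text, so that look-alikes such as "loc.gov.example.com" are left alone.
HTTP_LINK = re.compile(
    r"\bhttp://((?:[A-Za-z0-9-]+\.)*(?:" + _HOST_ALTERNATION + r"))"
    r"(?=[/:?#\s\]|}<]|$)",
    re.IGNORECASE,
)


def upgrade_scheme(wikitext: str) -> str:
    """Rewrite http:// to https:// for migrated hosts; leave other links as-is."""
    return HTTP_LINK.sub(r"https://\1", wikitext)
```

Links that already use https://, and links to hosts outside the list, pass
through unchanged, which keeps the resulting diffs limited to the scheme.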

You should also consider setting up HSTS[2] so regardless if users click
on an HTTP link, they'll be sent to the HTTPS version of the site.
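
For reference, HSTS is delivered as a `Strict-Transport-Security` header on
HTTPS responses, and a browser only starts enforcing it after it has seen
that header over HTTPS once, so server-side redirects are still needed for
first visits. A small sketch of building the header value follows; the
function name and the one-year default are assumptions for illustration,
not settings proposed in this thread.

```python
def hsts_header_value(max_age_seconds: int = 31536000,
                      include_subdomains: bool = False) -> str:
    """Build the value of a Strict-Transport-Security response header."""
    if max_age_seconds < 0:
        raise ValueError("max-age must be non-negative")
    directives = [f"max-age={max_age_seconds}"]
    if include_subdomains:
        # Only appropriate once every subdomain is reachable over HTTPS.
        directives.append("includeSubDomains")
    return "; ".join(directives)
```

A short max-age is a common way to start, so that a mistake expires
quickly while legacy applications are still being migrated.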

[1] https://www.mediawiki.org/wiki/Manual:Pywikibot/replace.py
[2] https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security

-- Legoktm


Re: [Wikitech-l] Bulk link rewrites for HTTP -> HTTPS migration?

2016-01-13 Thread P. Josepherum
If you use Apache, a rewrite rule is the simplest approach and instructions
can be found by searching for "rewrite http to https Apache". A similar
process will work with nginx.


-- 

Love and waffles,
PJosepherum