TL;DR: Look for protocol relative links.
> For those wikis I have tested, the output looks very similar to
> what I got from looking at the replicated database, except that
> all external links to WMF sites seem to have been removed from
> the SQL dumps.
That's interesting. There are such links in the dumps:
gzip -d <
/mnt/user-store/dumps/frwiki-group-writable/frwiki-20111123-externallinks.sql.gz
| grep -oP ".{17}commons.wikimedia.org[^']*" | more
> (572952,'http://commons.wikimedia.org/wiki/Image:Georges_Seurat_043.jpg
> (572952,'http://commons.wikimedia.org/wiki/Image:Tour_Eiffel_exposition_universelle_1889.jpg
> (603411,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Image:Nokia_N-Gage.jpg
> (150418,'http://commons.wikimedia.org/wiki/Category:The_Yorck_Project
> (564205,'http://commons.wikimedia.org/wiki/Category:Icons
> (446364,'http://commons.wikimedia.org/wiki/Image:Viaduc_Millau_France.PNG
> (446364,'http://commons.wikimedia.org/wiki/Image:La2-demis-france.png
> (339872,'http://commons.wikimedia.org/wiki/Category:Flags_of_municipalities_in_the_canton_of_Vaud
> (339872,'http://commons.wikimedia.org/wiki/Category:Flags_of_municipalities_in_the_canton_of_Geneva
> (339872,'http://commons.wikimedia.org/wiki/Category:Maps_of_municipalities_in_the_canton_of_Geneva
> (339872,'http://commons.wikimedia.org/wiki/Accueil
> (518924,'http://commons.wikimedia.org/wiki/Special:Search?search=tintagel
> (488187,'http://commons.wikimedia.org/skins-1.5/common/metadata.js
> (342102,'http://commons.wikimedia.org/wiki/Category:Photography
> (547524,'http://commons.wikimedia.org/wiki/Special:Contributions/Pereubu
> (601588,'http://commons.wikimedia.org/wiki/Main_Page
> (601588,'http://commons.wikimedia.org/w/index.php?title=Image:Uz%C3%A8s_7.jpg&action=edit
> (366318,'http://commons.wikimedia.org/wiki/Category:Chinese_stroke_order
> (208246,'http://commons.wikimedia.org/wiki/Category:SVG_flags
> (208246,'http://commons.wikimedia.org/wiki/Category:Rolleiflex
> (350993,'http://commons.wikimedia.org/wiki/Paris
> (232033,'http://commons.wikimedia.org/wiki/Category:Bolivia
> (488180,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Image:IMac_G3_slot_loading.jpg
> (488180,'http://commons.wikimedia.org/wiki/Template:Deletion_requests#Image:Nokia_N-Gage.jpg
> (209441,'http://commons.wikimedia.org/wiki/Accueil
etc.
Although few of them seem to come from NS_MAIN:
sql frwiki_p "select page_namespace from page where page_id IN (0$(gzip
-d < frwiki-20111123-externallinks.sql.gz | grep --line-buffered -oP
"([0-9]+)(?=,'http://commons.wikimedia.org[^']*)" | uniq | head -500 |
sed s/^/,/g ))"
The trick seems to be that most of them they will be present through
//commons.wikimedia.org, not http://commons.wikimedia.org
Talk pages will have more url-copying, while articles will use templates
like {{Autres projets}}
> sql frwiki_p "select page_namespace from page where page_id IN (0$(gzip -d <
> frwiki-20111123-externallinks.sql.gz | grep --line-buffered -oP
> "([0-9]+)(?=,'//commons.wikimedia.org[^']*)" | uniq | head -500 | sed s/^/,/g
> ))"
does show many NS_MAIN entries.
I suspect you are only taking into accoutn http: and https: links with
your perl script, while all of them with your sql query.
> select count(*) from externallinks join page on (el_from = page_id) where
> el_to like '//commons.wikimedia.org/%' and page_namespace = 0;
> +----------+
> | count(*) |
> +----------+
> | 101439 |
> +----------+
> select count(*) from externallinks join page on (el_from = page_id) where
> el_to like 'http://commons.wikimedia.org/%' and page_namespace = 0;
> +----------+
> | count(*) |
> +----------+
> | 399 |
> +----------+
101439 + 399 = 101838 ~= 101619 which is your db result.
_______________________________________________
Toolserver-l mailing list ([email protected])
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list:
https://wiki.toolserver.org/view/Mailing_list_etiquette