https://bugzilla.wikimedia.org/show_bug.cgi?id=67849

            Bug ID: 67849
           Summary: If-Modified-Since handling is broken
           Product: MediaWiki
           Version: unspecified
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: major
          Priority: Unprioritized
         Component: General/Unknown
          Assignee: wikibugs-l@lists.wikimedia.org
          Reporter: bianji...@google.com
       Web browser: ---
   Mobile Platform: ---

We (the crawling team at Google) found that Wikipedia's If-Modified-Since
handling is broken, at least when it comes to (Wikipedia-style)
redirects/symlinks.

As a short-term workaround (in order to serve up-to-date content), we have
temporarily stopped sending the If-Modified-Since header in our crawl requests.

We hope you can take a look at this issue and let us know when it is resolved.



Below is what we did to reproduce the issue. (We have in fact observed many
times before that we could not fetch the latest content of an article.)

1. Pick a Wikipedia page, vandalize it (apologies for being a vandal; we
promise it is for the greater good...), and wait for it to be reverted. We used
http://en.wikipedia.org/wiki/SSh (note the capitalization). Its history
(http://en.wikipedia.org/w/index.php?title=SSh&action=history) contains
the vandalism at 01:47:42 (GMT) and its rollback at 01:49:06 (GMT).

2. Fetch the URL (at 01:47:54 GMT, right after the vandalized revision was
submitted) using telnet.

The "Last-Modified" value appears to come from the latest revision of the
final redirect destination: 08 Jul 2014 11:49:08 GMT
(http://en.wikipedia.org/w/index.php?title=SSL&action=history)

$ telnet en.wikipedia.org 80
Trying 2620:0:861:ed1a::1...
Connected to text-lb.eqiad.wikimedia.org.
Escape character is '^]'.
GET /wiki/SSh HTTP/1.1
Host: en.wikipedia.org

HTTP/1.1 200 OK
Server: Apache
X-Content-Type-Options: nosniff
Content-language: en
X-UA-Compatible: IE=Edge
Vary: Accept-Encoding,Cookie
Last-Modified: Tue, 08 Jul 2014 11:49:08 GMT
Content-Type: text/html; charset=UTF-8
X-Varnish: 4282068392, 1036646122
Via: 1.1 varnish, 1.1 varnish
Transfer-Encoding: chunked
Date: Thu, 10 Jul 2014 01:47:54 GMT ← This one's accurate; this is when the
crawl happened.
Age: 0
Connection: keep-alive
X-Cache: cp1065 miss (0), cp1053 frontend miss (0)
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Set-Cookie: GeoIP=::::v6; Path=/; Domain=.wikipedia.org

…Page for SSL follows…
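
For anyone who would rather script this than type it into telnet, the fetch
could be sketched roughly as below. This is only an illustration, not the
crawler's code: the helper names conditional_headers and fetch are made up
here, the host and path are the ones from the transcript, and it has not been
run against the live site (which may now redirect plain HTTP to HTTPS or
require a User-Agent header).

```python
import http.client


def conditional_headers(last_modified=None):
    """Return the extra request headers for a conditional GET.

    last_modified is the Last-Modified string from an earlier response,
    or None for an unconditional fetch. (Hypothetical helper.)
    """
    if last_modified is None:
        return {}
    return {"If-Modified-Since": last_modified}


def fetch(host, path, last_modified=None):
    """Issue a GET like the telnet session and return (status, Last-Modified)."""
    conn = http.client.HTTPConnection(host, 80, timeout=10)
    try:
        conn.request("GET", path, headers=conditional_headers(last_modified))
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, resp.getheader("Last-Modified")
    finally:
        conn.close()


# Example (needs network access, so it is left commented out):
# status, last_modified = fetch("en.wikipedia.org", "/wiki/SSh")
# status2, _ = fetch("en.wikipedia.org", "/wiki/SSh", last_modified)
```

The Last-Modified string returned by the first call is what gets replayed as
If-Modified-Since on the next crawl.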


3. After the revert, re-crawl the article (at 01:50:57 GMT) with an
If-Modified-Since header set to the Last-Modified value from step 2 (this
simulates how production crawling works).
Strangely, we got a 304 response this time, even though the page had been
reverted at 01:49:06 GMT, well after the If-Modified-Since timestamp. The
"Last-Modified" value is odd too: neither SSh nor SSH was updated around
08 Jul 2014 11:49:08 GMT.

$ telnet en.wikipedia.org 80
Trying 2620:0:861:ed1a::1...
Connected to text-lb.eqiad.wikimedia.org.
Escape character is '^]'.
GET /wiki/SSh HTTP/1.1
Host: en.wikipedia.org
If-Modified-Since: Tue, 08 Jul 2014 11:49:08 GMT

HTTP/1.1 304 Not Modified
Server: Apache
X-Content-Type-Options: nosniff
Content-language: en
X-UA-Compatible: IE=Edge
Vary: Accept-Encoding,Cookie
Last-Modified: Thu, 26 Jun 2014 11:22:21 GMT ← A mysterious timestamp.
Content-Type: text/html; charset=UTF-8
X-Varnish: 4282315466 4282211494, 2309645493
Via: 1.1 varnish, 1.1 varnish
Date: Thu, 10 Jul 2014 01:50:57 GMT
Age: 74
Connection: keep-alive
X-Cache: cp1065 hit (1), cp1066 frontend miss (0)
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Set-Cookie: GeoIP=::::v6; Path=/; Domain=.wikipedia.org

Connection closed by foreign host.
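
To spell out why the 304 looks wrong, here is a small sketch of the check we
have in mind. It assumes the usual If-Modified-Since rule (answer 200 if the
resource changed after the given date, 304 otherwise) and uses the rollback
time from the page history in step 1. The function name expected_status is
ours, purely for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def expected_status(if_modified_since, last_changed):
    """Status a server should return for a conditional GET.

    if_modified_since: HTTP-date string sent by the client.
    last_changed: timezone-aware datetime of the page's latest change.
    """
    ims = parsedate_to_datetime(if_modified_since)
    return 200 if last_changed > ims else 304


# Rollback of SSh per its history: 10 Jul 2014 01:49:06 GMT.
REVERT = datetime(2014, 7, 10, 1, 49, 6, tzinfo=timezone.utc)

# The header we sent in step 3 predates the rollback, so we expected 200.
print(expected_status("Tue, 08 Jul 2014 11:49:08 GMT", REVERT))
```

By this rule the request in step 3 should have produced a 200 with the
reverted content, not a 304.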
