On Mon, 08 Dec 2014 21:50:56 +0100, Simon Pieters <sim...@opera.com> wrote:

SELECT COUNT(*) AS num,
  CASE
    WHEN REGEXP_MATCH(LOWER(body), r'<menuitem[^>]*>(\s*[^<]+)+\s*</menuitem>') THEN "has content"
    ELSE "no content"
  END AS stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS "html"
  AND REGEXP_MATCH(LOWER(body), r'<menuitem')
GROUP BY stat
ORDER BY num DESC

Row     num     stat
1       10101   no content

Hixie pointed out that this doesn't catch menuitem elements whose content is child elements rather than text. So flipping the test to look for empty elements instead gives:

SELECT COUNT(*) AS num,
  CASE
    WHEN REGEXP_MATCH(LOWER(body), r'<menuitem[^>]*>\s*</menuitem>') THEN "no content"
    ELSE "has content"
  END AS stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS "html"
  AND REGEXP_MATCH(LOWER(body), r'<menuitem')
GROUP BY stat
ORDER BY num DESC

Row     num     stat
1       10085   no content
2       16      has content

15 of the 16 "has content" hits omit the end tag, as seen by the other query. So what is the remaining one doing?

SELECT url,body
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS "html"
  AND LOWER(body) CONTAINS '<menuitem'
  AND LOWER(body) CONTAINS '</menuitem'
  AND NOT REGEXP_MATCH(LOWER(body), r'<menuitem[^>]*>\s*</menuitem>')

Row     url     body    
1 http://www.dod.gr/lib/menuData_v483.php <menus> <!-- BOTTOM NAVIGATION MENU ---> <menu id="BottomNavigationMenu" type="main" x="30" y="30"> <menuitem x="120" y="120"> <image>community.swf</image> <label>community</label> ...

Yep, mislabeled XML.

For completeness, the 15 pages with no end tags fall into two categories:

* for(i=0;i<menuitems.length;i++)
* <xml id=""SolpartMenuDI"" onreadystatechange=""if (this.readyState == 'complete') spm_initMyMenu(this, spm_getById('dnn_dnnMENU_ctldnnMENU'))""><root><menuitem id=""2533"" title=""صفحه اصلی"" url=""/Default.aspx?tabid=2533"" lefthtml=""&lt;img alt=&quot;*&quot; BORDER=&quot;0&quot; src=&quot;/images/breadcrumb.gif&quot;&gt;"" css="" "" />
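
For anyone who wants to poke at the patterns without BigQuery, here is a rough Python sketch of the same classification. It assumes REGEXP_MATCH behaves like an unanchored search (which the r'<menuitem' filter in the queries suggests), so it uses re.search. The classify() helper and the shortened sample strings are made up for illustration, modelled on the results quoted above; they are not the actual HTTP Archive bodies.

```python
import re

# Patterns copied from the two queries in this thread.
HAS_TEXT = re.compile(r'<menuitem[^>]*>(\s*[^<]+)+\s*</menuitem>')
EMPTY = re.compile(r'<menuitem[^>]*>\s*</menuitem>')


def classify(body):
    """Hypothetical helper: label how a body uses the string '<menuitem'."""
    text = body.lower()
    if '<menuitem' not in text:
        return 'no menuitem'
    if '</menuitem' not in text:
        return 'no end tag'
    if EMPTY.search(text):
        return 'no content'
    return 'has content'


# Shortened, made-up samples modelled on the hits described in this mail.
samples = {
    'script': 'for(i=0;i<menuitems.length;i++)',
    'self_closing_xml': '<root><menuitem id="2533" url="/Default.aspx?tabid=2533" /></root>',
    'element_children': ('<menu><menuitem x="120"> <image>community.swf</image> '
                         '<label>community</label> </menuitem></menu>'),
    'empty': '<menu><menuitem label="Reload"></menuitem></menu>',
}

for name, body in samples.items():
    # Third column: whether the first query's text-only pattern would match.
    print(name, classify(body), bool(HAS_TEXT.search(body.lower())))
```

The 'element_children' sample is the interesting one: the text-only pattern from the first query does not match it, because the content starts with a child element, while the flipped "empty element" test reports it as having content.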


Previous conclusion stands. :-)

--
Simon Pieters
Opera Software
