On Mon, 08 Dec 2014 21:50:56 +0100, Simon Pieters <sim...@opera.com> wrote:
SELECT COUNT(*) as num,
CASE
WHEN REGEXP_MATCH(LOWER(body),
r'<menuitem[^>]*>(\s*[^<]+)+\s*</menuitem>') THEN "has content"
ELSE "no content"
END as stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS "html"
AND REGEXP_MATCH(LOWER(body), r'<menuitem')
GROUP BY stat
ORDER BY num desc
Row num stat
1 10101 no content
Hixie pointed out that this doesn't catch menuitem elements whose content
is child elements rather than text. Flipping the test so that it looks for
empty elements instead gives:
SELECT COUNT(*) as num,
CASE
WHEN REGEXP_MATCH(LOWER(body), r'<menuitem[^>]*>\s*</menuitem>')
THEN "no content"
ELSE "has content"
END as stat
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS "html"
AND REGEXP_MATCH(LOWER(body), r'<menuitem')
GROUP BY stat
ORDER BY num desc
Row num stat
1 10085 no content
2 16 has content
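To see why flipping the test matters, here is a minimal Python sketch that runs both patterns over a few short snippets. It assumes that re.search behaves like BigQuery's REGEXP_MATCH for these patterns (an unanchored search). The function names and sample bodies are made up for illustration and are not taken from the HTTP Archive data.

```python
import re

# Patterns copied from the two queries; re.search is used as the closest
# Python equivalent of an unanchored REGEXP_MATCH (an assumption).
HAS_TEXT = re.compile(r'<menuitem[^>]*>(\s*[^<]+)+\s*</menuitem>')
EMPTY = re.compile(r'<menuitem[^>]*>\s*</menuitem>')


def first_query_stat(body: str) -> str:
    """Label from the first query: text content means "has content"."""
    return "has content" if HAS_TEXT.search(body.lower()) else "no content"


def flipped_query_stat(body: str) -> str:
    """Label from the flipped query: only an empty element is "no content"."""
    return "no content" if EMPTY.search(body.lower()) else "has content"


# Hypothetical sample bodies, kept short on purpose.
samples = {
    "empty": '<menuitem label="a"></menuitem>',
    "text": '<menuitem>Save</menuitem>',
    "element child": '<menuitem><label>community</label></menuitem>',
    "no end tag": '<menu><menuitem label="a"></menu>',
}

for name, body in samples.items():
    print(f"{name}: {first_query_stat(body)} / {flipped_query_stat(body)}")
```

The element-child snippet comes out as "no content" under the first query and "has content" under the flipped one, and a menuitem with no end tag also lands in "has content" after the flip. One caveat: the nested quantifier in the first pattern is harmless in BigQuery's RE2-based engine but can backtrack heavily in Python on long inputs, so this is only meant for short strings.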
Of the 16 "has content" pages, 15 omit the end tag, as seen by the other
query. So what is the last one doing?
SELECT url,body
FROM [httparchive:runs.2014_08_15_requests_body]
WHERE mimeType CONTAINS "html"
AND LOWER(body) CONTAINS '<menuitem'
AND LOWER(body) CONTAINS '</menuitem'
AND NOT REGEXP_MATCH(LOWER(body), r'<menuitem[^>]*>\s*</menuitem>')
Row url body
1 http://www.dod.gr/lib/menuData_v483.php <menus> <!-- BOTTOM NAVIGATION
MENU ---> <menu id="BottomNavigationMenu" type="main" x="30" y="30">
<menuitem x="120" y="120"> <image>community.swf</image>
<label>community</label> ...
Yep, it's XML mislabeled with an HTML MIME type.
For completeness, the 15 pages with no end tags fall into two categories:
* for(i=0;i<menuitems.length;i++)
* <xml id=""SolpartMenuDI"" onreadystatechange=""if (this.readyState ==
'complete') spm_initMyMenu(this,
spm_getById('dnn_dnnMENU_ctldnnMENU'))""><root><menuitem id=""2533""
title=""صفحه اصلی"" url=""/Default.aspx?tabid=2533"" lefthtml=""<img
alt="*" BORDER="0"
src="/images/breadcrumb.gif">"" css="" "" />
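Both categories pass the '<menuitem' filter without containing a real HTML menuitem element. A small Python sketch of that, under the same assumption that substring and re.search checks stand in for CONTAINS and REGEXP_MATCH; the helper names are hypothetical, the first sample is the script fragment quoted above, and the second is a shortened, made-up stand-in for the XML data island:

```python
import re

EMPTY = re.compile(r'<menuitem[^>]*>\s*</menuitem>')


def mentions_menuitem(body: str) -> bool:
    """Filter shared by the queries: case-insensitive '<menuitem'."""
    return '<menuitem' in body.lower()


def has_end_tag(body: str) -> bool:
    """Extra condition in the last query: case-insensitive '</menuitem'."""
    return '</menuitem' in body.lower()


def is_empty_element(body: str) -> bool:
    """True when the flipped query would label the body "no content"."""
    return EMPTY.search(body.lower()) is not None


# Script fragment from the first category, as quoted in the message.
script_loop = 'for(i=0;i<menuitems.length;i++)'
# Hypothetical, shortened stand-in for the second category (self-closing XML).
self_closing = '<root><menuitem id="1" title="home" url="/default.aspx" /></root>'

for body in (script_loop, self_closing):
    print(mentions_menuitem(body), has_end_tag(body), is_empty_element(body))
```

For both samples the filter matches, there is no end tag, and the empty-element pattern does not match, which is how such pages end up counted as "has content" even though neither is an HTML menuitem with content.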
Previous conclusion stands. :-)
--
Simon Pieters
Opera Software