I've created a ticket (https://issues.apache.org/jira/browse/CONNECTORS-1264) and attached a patch.
Karl

On Sun, Dec 6, 2015 at 9:22 AM, Karl Wright <[email protected]> wrote:

> Hi Issei,
>
> MCF's HTML parser handles unquoted attribute values, but HTML4 limits
> which characters you can put in an unquoted attribute value. It's not
> clear that "/" is in fact an allowed character, but if you believe that
> it is, then please open a ticket and I will fix the problem.
>
> Thanks,
> Karl
>
>
> On Sun, Dec 6, 2015 at 9:11 AM, Issei Nishigata <[email protected]>
> wrote:
>
>> I'm using MCF 2.2.
>> When I crawl links whose href attribute values look like the one below,
>> MCF can't extract the links properly.
>>
>> <a href=/sample/Mainservlet?sample=000 >sample</a>
>> # The attribute value is not enclosed in double quotes.
>> # I got "/sample".
>>
>> HTML4 does not always require quotes around attribute values.
>> XHTML does require quotes around attribute values.
>> Is MCF compliant with HTML4?
>>
>>
>> Thanks,
>> Issei
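
For reference, the two readings discussed in this thread can be sketched in a few lines of Python. This is only an illustration, not MCF's parser and not the attached patch: the function names and regular expressions are hypothetical. The strict check uses the character set that HTML 4.01 (section 3.2.2) permits in unquoted attribute values (letters, digits, hyphens, periods, underscores, and colons), while the lenient extractor assumes, as browsers commonly do, that an unquoted value simply runs until whitespace or ">".

```python
import re

# Characters HTML 4.01 (section 3.2.2) permits in an unquoted attribute value:
# letters, digits, hyphens, periods, underscores, and colons.
HTML4_UNQUOTED = re.compile(r"[A-Za-z0-9\-._:]+")

# Lenient rule (an assumption, modelled on common browser behaviour):
# an unquoted href value starts with a non-quote character and runs
# until the next whitespace character or '>'.
HREF_UNQUOTED = re.compile(r"""\bhref\s*=\s*([^\s>"'][^\s>]*)""", re.IGNORECASE)


def is_valid_html4_unquoted(value):
    """Return True if value uses only characters HTML 4.01 allows unquoted."""
    return HTML4_UNQUOTED.fullmatch(value) is not None


def extract_unquoted_href(tag):
    """Return the unquoted href value of a tag under the lenient rule, or None."""
    match = HREF_UNQUOTED.search(tag)
    return match.group(1) if match else None


tag = "<a href=/sample/Mainservlet?sample=000 >"
href = extract_unquoted_href(tag)
print(href)
print(is_valid_html4_unquoted(href))
```

Under the strict rule the value from the original report is not valid without quotes, because "/", "?", and "=" fall outside the permitted set. The lenient rule still recovers the whole link, which is the result the original report expected.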
