So I think the problem is with malformed img tags.  The parser fails
if the tag is of this form:

   <img src="/library/homepage/images/curve.gif" alt="" border="0" />

Note that the tag is closed with "/>" instead of just ">" as in
the spec.  When the parser reaches the "/", it sets attr_name_begin
to the "/", and because "/" is not a name character, attr_name_end
gets set to the same position.  The two pointers are equal, so the
parser backs out of the tag.
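
To illustrate what I think is happening, here is a minimal sketch.
It is not the parser's code: name_char_p() is my assumption of what
NAME_CHAR_P accepts, and attr_name_len() and skip_trailing_slash()
are hypothetical helpers.  The first shows the scan not advancing on
"/", the second shows one possible workaround.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for the parser's NAME_CHAR_P macro (assumed definition):
   characters that may appear in an attribute name.  '/' is not one. */
static int name_char_p(int c)
{
    return isalnum((unsigned char)c)
           || c == '-' || c == '_' || c == ':' || c == '.';
}

/* Length of the attribute name starting at p.  A result of 0 means the
   scan did not advance, i.e. attr_name_begin == attr_name_end. */
static size_t attr_name_len(const char *p)
{
    const char *begin = p;
    assert(p != NULL);
    while (name_char_p(*p))
        p++;
    return (size_t)(p - begin);
}

/* Hypothetical workaround: step over a '/' that directly precedes '>',
   so the caller sees '>' and ends the tag normally. */
static const char *skip_trailing_slash(const char *p)
{
    assert(p != NULL);
    if (p[0] == '/' && p[1] == '>')
        p++;
    return p;
}
```

With these definitions, attr_name_len("/>") is 0, which is the
condition that leads to backout_tag, while skip_trailing_slash("/>")
returns a pointer to the ">".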

If I edit the html file to change the tag to:

   <img src="/library/homepage/images/curve.gif" alt="" border="0">

it is recognized correctly.

Unfortunately in this case the parser also seg faults in the call to
strlen() in the array_allowed() function.  I haven't looked closely at
this yet but it only shows up when a follow_tags list is passed to
map_html_tags.  That is, if you use NULL pointers for follow_tags and
follow_attrs, there is no seg fault.  I was asking the parser to just
tell me about img tags.
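
One thing worth ruling out, purely as a guess since I have not traced
it: if array_allowed() walks the list until it reaches a NULL entry,
then a follow_tags array without a terminating NULL would send
strlen() off the end of the array.  The sketch below assumes that
convention; allowed_in_list() is a hypothetical stand-in, not the
real function, and it compares case-sensitively for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for array_allowed(): returns 1 if the name of
   length len starting at beg appears in list, or if list is NULL
   (meaning no filtering).  The loop stops only at a NULL entry, so
   list MUST end with NULL. */
static int allowed_in_list(const char **list, const char *beg, size_t len)
{
    assert(beg != NULL);
    if (list == NULL)
        return 1;
    for (; *list != NULL; list++)
        if (strlen(*list) == len && strncmp(*list, beg, len) == 0)
            return 1;
    return 0;
}
```

Under that assumption, the list would have to be declared as
{ "img", NULL } rather than just { "img" }.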

This form of img tag seems to be quite common (redhat.com,
ibm.com, microsoft.com), maybe due to some authoring tools.

Thanks.


-- Anees



> > For some reason, <img src=... > tags are recognized but then skipped
> > almost every time they are encountered.  When using the full program
> > and recursive retrieve, the images are in fact retrieved so it seems
> > that the parser does work correctly when not in standalone mode.
> > 
> > It seems that the following condition is met when parsing img
> > tag attributes
> > 
> >     /* Establish bounds of attribute name. */
> >     attr_name_begin = p;    /* <foo bar ...> */
> >                             /*      ^        */
> >     while (NAME_CHAR_P (*p))
> >       ADVANCE (p);
> >     attr_name_end = p;      /* <foo bar ...> */
> >                             /*         ^     */
> >     if (attr_name_begin == attr_name_end)
> >       goto backout_tag;
> > 
> > Can someone shed some light on this?
> 
> For some reason, the parser does not advance past the attribute name.
> Try going into the debugger and printing the value of P.  You should
> find out why the parser refuses to advance beyond attr_name_begin.
> 
> Perhaps it thinks it has reached the end of file?  (Are you calling it
> with the proper text length?)  Perhaps the text is corrupted due to
> another bug in your program and the attribute name is invalid?  A
> number of things could be wrong.
> 
> When I wrote the parser, I primarily tested it in "standalone" mode,
> so it should work thus.
> 

