After using wget with the -r switch I noticed some files were not being
pulled down. I am mirroring a professor's website so I don't have to keep
checking it; I just automate everything and read from my laptop.
Anyway, in her index file three URLs have an embedded newline in the
file name, which causes the server to fail to send the file back,
because wget sends the string unfiltered.
For example:

<a href="HW
02_Sol.html">Solution to HW02
        </a>

Strange formatting, but retrieval fails because the filename, newline
and all, is passed literally to the website.


My quick fix is:

handle_link (struct collect_urls_closure *closure, const char *link_uri,
             struct taginfo *tag, int attrid)
{
  ...
  char *ptr, *cr_ptr;
  char *src, *dst;
  size_t len;
  ...
  char *fragment = ...

  /* A warning is issued on the following line when doing "make". */
  cr_ptr = strstr (link_uri, "\n");
  if (cr_ptr)
    {
      /* Strip the embedded newline(s) from link_uri; this compensates
         for some weirdness I have encountered. I don't know why, but
         for some reason some people put a CR/LF in their html.
         This code should probably go elsewhere, but I used the -d flag
         to pick this location and used the following "if (fragment)"
         code as a template.
         Copy strlen+1 bytes so the terminating NUL comes along, then
         compact the copy in place, skipping '\r' and '\n'.  */
      len = strlen (link_uri);
      ptr = alloca (len + 1);
      memcpy (ptr, link_uri, len + 1);
      for (src = dst = ptr; *src; src++)
        if (*src != '\n' && *src != '\r')
          *dst++ = *src;
      *dst = '\0';
      link_uri = ptr;
    }

  if (fragment)
  ...

}

However, I think I may be breaking some output. Is \n allowed in
link_uri? The code does what it needs to do, but I haven't looked
everywhere, so I don't know whether this is the most appropriate place
for it. I used the debugging switch and worked it out from there.

Thanks for a great upgrade.
Joseph