After running wget with the -r switch I noticed some files were not
being downloaded. I am mirroring a professor's website so I don't have
to keep checking it by hand; I just automate everything and read from
my laptop.
Anyway, in her index file three URLs have a literal CR/LF embedded in
the file name, which causes the server to fail to send the file back
to wget, because wget passes the string through unfiltered.
For example:
<a href="HW
02_Sol.html">Solution to HW02
</a>
Strange formatting, but retrieval fails because the filename, line
break and all, is passed literally to the server.
My quick fix is:
handle_link (struct collect_urls_closure *closure, const char *link_uri,
             struct taginfo *tag, int attrid)
{
...
  size_t len, i, j;
  char *ptr;
  const char *cr_ptr;
...
  char *fragment = ...
  /* "make" warned on my original strstr() call here; strchr() is
     enough to find a single character, and it is declared in
     <string.h>.  */
  cr_ptr = strchr (link_uri, '\n');
  if (cr_ptr)
    {
      /* Some pages embed a literal CR/LF inside an href (see the
         example above).  Copy link_uri into a scratch buffer,
         dropping every '\n' and '\r', so the request is not sent
         with the line break in it.  This code should probably go
         elsewhere, but I used the -d flag to pick this location and
         used the "if (fragment)" code below as a template.  */
      len = strlen (link_uri);
      ptr = alloca (len + 1);             /* +1 for the terminating NUL */
      for (i = 0, j = 0; i <= len; i++)   /* <= copies the NUL as well */
        if (link_uri[i] != '\n' && link_uri[i] != '\r')
          ptr[j++] = link_uri[i];
      link_uri = ptr;
    }
  if (fragment)
    ...
}
However, I think I may be breaking some output. Is \n even allowed in
link_uri? The code does what it needs to do, but I haven't looked
everywhere, so I don't know whether this is the most appropriate place
for it. I used the debugging switch and worked it out from there.
Thanks for a great upgrade.
Joseph