Re: [Bug-wget] Segmentation fault with current development version of wget
Hi Giuseppe,

Dropping the bit that sanitizes opt.method is probably a good idea. (Perhaps I shouldn't have replied to your patch directly.) Still, even if the sanitization is removed: I think it would be better if RESTORE_POST_DATA restored the previous value of opt.method, instead of overwriting it with a hardcoded POST. Wouldn't it?

A related question: how is a redirect response to a PUT request handled? How should it be handled? I haven't tried it, but it looks like in that case the SUSPEND_POST_DATA macro is called (by retrieve_url in retr.c). If that's true, then later on opt.method would be 'restored' to POST by RESTORE_POST_DATA.

Regards, Gijs

On 01-05-13 22:16, Giuseppe Scrivano wrote:
> hi Gijs,
>
> Gijs van Tulder <gvtul...@gmail.com> writes:
>> Giuseppe Scrivano wrote:
>>> what about this patch?  Any comment?
>> Another suggestion: why not save the original opt.method, set
>> opt.method to NULL and put the original opt.method back later?
>
> thanks for your suggestion but I think we should drop the code that
> modifies opt.method, since we have to sanitize it only when it is
> specified as argument.  Objections?
[Bug-wget] Remaining reference to opt.post_data (WARC in src/http.c)
Hi,

For the new --body-data option, most of the code that used to reference opt.post_data has been changed to use opt.body_data. I found one remaining reference, hidden in one of the WARC-writing sections of src/http.c. Wget would crash if you combined --body-data with --warc-file. It's a simple fix; see the attached patch.

Regards, Gijs

From d2e6e16b3062cc0e6b3c13fd04e3654ed2dbdb6e Mon Sep 17 00:00:00 2001
From: Gijs van Tulder <gvtul...@gmail.com>
Date: Sun, 21 Apr 2013 22:36:50 +0200
Subject: [PATCH] Remove old reference to opt.post_data.

---
 src/ChangeLog |    5 +
 src/http.c    |    2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 8a60e5d..64fc634 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,8 @@
+2013-04-21  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* http.c: Copy opt.body_data to the WARC file, instead of
+	opt.post_data (the old option).
+
 2013-04-12  Gijs van Tulder  <gvtul...@gmail.com>
 
 	* warc.c: Generate unique UUIDs for the manifest and the record

diff --git a/src/http.c b/src/http.c
index 3e4d7cc..88f7a96 100644
--- a/src/http.c
+++ b/src/http.c
@@ -2150,7 +2150,7 @@ gethttp (struct url *u, struct http_stat *hs, int *dt, struct url *proxy,
           warc_payload_offset = ftello (warc_tmp);
 
           /* Write a copy of the data to the WARC record. */
-          int warc_tmp_written = fwrite (opt.post_data, 1, body_data_size, warc_tmp);
+          int warc_tmp_written = fwrite (opt.body_data, 1, body_data_size, warc_tmp);
           if (warc_tmp_written != body_data_size)
             write_error = -2;
         }
-- 
1.7.9.5
[Bug-wget] Standards fix for metadata records in WARC files
This patch repairs two minor problems in the WARC metadata records.

1. Each record should have its own unique WARC-Record-ID, but currently the ID of the record holding the manifest is reused for the record holding the arguments. The patch generates a new ID for the arguments record (and refers to the manifest in a WARC-Concurrent-To header).

2. According to the WARC implementation guidelines [1], the manifest should be written to a metadata record, but Wget stores it as a resource record. The patch corrects this.

Regards, Gijs

[1] Section 2.4.4 of http://www.netpreserve.org/resources/warc-implementation-guidelines-v1

commit b54fb8feb9dfb2a111d15f1b759de61217d5251e
Author: Gijs van Tulder <gvtul...@gmail.com>
Date:   Fri Apr 12 23:37:45 2013 +0200

    warc: Follow the guidelines for metadata records

    Do not use the same UUID for the manifest and arguments records.
    Write the manifest as a metadata record, not as a resource.

diff --git a/src/ChangeLog b/src/ChangeLog
index 65d636d..e609f2d 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,11 @@
+2013-04-12  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* warc.c: Generate unique UUIDs for the manifest and the record
+	holding the command-line arguments.
+	Write the manifest to a metadata record to follow the WARC
+	implementation guidelines.
+	* warc.h: Declare new function warc_write_metadata_record.
+
 2013-03-31  Gijs van Tulder  <gvtul...@gmail.com>
 
 	* warc.c: Correctly write the field length in the skip length field

diff --git a/src/warc.c b/src/warc.c
index 9b10610..916b53d 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -1083,7 +1083,7 @@ warc_write_metadata (void)
   warc_uuid_str (manifest_uuid);
 
   fflush (warc_manifest_fp);
-  warc_write_resource_record (manifest_uuid,
+  warc_write_metadata_record (manifest_uuid,
       "metadata://gnu.org/software/wget/warc/MANIFEST.txt",
       NULL, NULL, NULL, "text/plain",
       warc_manifest_fp, -1);
@@ -1098,9 +1098,9 @@ warc_write_metadata (void)
   fflush (warc_tmp_fp);
   fprintf (warc_tmp_fp, "%s\n", program_argstring);
 
-  warc_write_resource_record (manifest_uuid,
+  warc_write_resource_record (NULL,
       "metadata://gnu.org/software/wget/warc/wget_arguments.txt",
-      NULL, NULL, NULL, "text/plain",
+      NULL, manifest_uuid, NULL, "text/plain",
       warc_tmp_fp, -1);
 
   /* warc_write_resource_record has closed warc_tmp_fp. */
@@ -1395,20 +1395,22 @@ warc_write_response_record (char *url, char *timestamp_str,
   return warc_write_ok;
 }
 
-/* Writes a resource record to the WARC file.
+/* Writes a resource or metadata record to the WARC file.
+   warc_type is either "resource" or "metadata",
    resource_uuid is the uuid of the resource (or NULL),
    url is the target uri of the resource,
    timestamp_str is the timestamp (generated with warc_timestamp),
-   concurrent_to_uuid is the uuid of the request for that generated this
+   concurrent_to_uuid is the uuid of the record that generated this
    resource (generated with warc_uuid_str) or NULL,
    ip is the ip address of the server (or NULL),
    content_type is the mime type of the body (or NULL),
    body is a pointer to a file containing the resource data.
    Calling this function will close body.
    Returns true on success, false on error. */
-bool
-warc_write_resource_record (char *resource_uuid, const char *url,
-    const char *timestamp_str, const char *concurrent_to_uuid,
+static bool
+warc_write_record (const char *record_type, char *resource_uuid,
+    const char *url, const char *timestamp_str,
+    const char *concurrent_to_uuid,
     ip_address *ip, const char *content_type, FILE *body,
     off_t payload_offset)
 {
@@ -1422,7 +1424,7 @@ warc_write_resource_record (char *resource_uuid, const char *url,
     content_type = "application/octet-stream";
 
   warc_write_start_record ();
-  warc_write_header ("WARC-Type", "resource");
+  warc_write_header ("WARC-Type", record_type);
   warc_write_header ("WARC-Record-ID", resource_uuid);
   warc_write_header ("WARC-Warcinfo-ID", warc_current_warcinfo_uuid_str);
   warc_write_header ("WARC-Concurrent-To", concurrent_to_uuid);
@@ -1438,3 +1440,47 @@
   return warc_write_ok;
 }
+
+/* Writes a resource record to the WARC file.
+   resource_uuid is the uuid of the resource (or NULL),
+   url is the target uri of the resource,
+   timestamp_str is the timestamp (generated with warc_timestamp),
+   concurrent_to_uuid is the uuid of the record that generated this
+   resource (generated with warc_uuid_str) or NULL,
+   ip is the ip address of the server (or NULL),
+   content_type is the mime type of the body (or NULL),
+   body is a pointer to a file containing the resource data
Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files
Hi,

> It appears wget may be creating slightly malformed GZIP skip-length fields

I think that's correct: Wget doesn't write the subfield length in the extra field section of the header. After the subfield ID 'sl' it should write the length LEN (see RFC 1952 [1]), but it doesn't.

Luckily, it does write the correct length of all extra fields (XLEN in RFC 1952), so gzip implementations that simply ignore the extra field can skip it without problems. This is the case for the GNU gzip utility. But it should be fixed; I've attached a patch.

> It's likely that we'll need to make the warc.gz parsers a bit more
> robust, but I thought I'd mention it here in case this is actually a
> bug in wget.

When I wrote the code for the extra field I used the old Hanzo warc-tools [2] as an example. That implementation has the same problem: it doesn't write the field length [3]. This means there's at least one other tool that writes these off-spec warc.gz files, so it's probably useful to make the parser a bit more robust.

Thanks, Gijs

[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] https://code.google.com/p/warc-tools/
[3] https://code.google.com/p/warc-tools/source/browse/trunk/lib/private/wgzip.c#314

diff --git a/src/ChangeLog b/src/ChangeLog
index 8e1213f..65d636d 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,8 @@
+2013-03-31  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* warc.c: Correctly write the field length in the skip length field
+	of .warc.gz files.  (Following the GZIP spec in RFC 1952.)
+
 2013-03-12  Darshit Shah  <dar...@gmail.com>
 
 	* http.c (gethttp): Make wget return FILEBADFILE error and abort if

diff --git a/src/warc.c b/src/warc.c
index fb506a7..9b10610 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -165,7 +165,7 @@ warc_write_string (const char *str)
 }
 
-#define EXTRA_GZIP_HEADER_SIZE 12
+#define EXTRA_GZIP_HEADER_SIZE 14
 #define GZIP_STATIC_HEADER_SIZE  10
 #define FLG_FEXTRA 0x04
 #define OFF_FLG 3
@@ -200,7 +200,7 @@ warc_write_start_record (void)
          In warc_write_end_record we will fill this space
          with information about the uncompressed and
          compressed size of the record. */
-      fprintf (warc_current_file, );
+      fseek (warc_current_file, EXTRA_GZIP_HEADER_SIZE, SEEK_CUR);
       fflush (warc_current_file);
 
       /* Start a new GZIP stream. */
@@ -342,16 +342,19 @@ warc_write_end_record (void)
       /* The extra header field identifier for the WARC skip length. */
       extra_header[2]  = 's';
       extra_header[3]  = 'l';
+      /* The size of the field value (8 bytes).  */
+      extra_header[4]  = (8 & 255);
+      extra_header[5]  = ((8 >> 8) & 255);
       /* The size of the uncompressed record.  */
-      extra_header[4]  = (uncompressed_size & 255);
-      extra_header[5]  = (uncompressed_size >> 8) & 255;
-      extra_header[6]  = (uncompressed_size >> 16) & 255;
-      extra_header[7]  = (uncompressed_size >> 24) & 255;
+      extra_header[6]  = (uncompressed_size & 255);
+      extra_header[7]  = (uncompressed_size >> 8) & 255;
+      extra_header[8]  = (uncompressed_size >> 16) & 255;
+      extra_header[9]  = (uncompressed_size >> 24) & 255;
       /* The size of the compressed record.  */
-      extra_header[8]  = (compressed_size & 255);
-      extra_header[9]  = (compressed_size >> 8) & 255;
-      extra_header[10] = (compressed_size >> 16) & 255;
-      extra_header[11] = (compressed_size >> 24) & 255;
+      extra_header[10] = (compressed_size & 255);
+      extra_header[11] = (compressed_size >> 8) & 255;
+      extra_header[12] = (compressed_size >> 16) & 255;
+      extra_header[13] = (compressed_size >> 24) & 255;
 
       /* Write the extra header after the static header. */
       fseeko (warc_current_file, warc_current_gzfile_offset
Re: [Bug-wget] [PATCH] Invalid Content-Length header in WARC files, on some platforms
Giuseppe Scrivano writes:
> From 1e229375aa89cdc0bba07335fbe10d4f66180f68 Mon Sep 17 00:00:00 2001
> Subject: [PATCH] warc: fix format string for off_t

Good to see that that's fixed. However, there's another instance of this problem, in the warc_write_cdx_record function in warc.c. (I saw that Tim Ruehsen fixed this in his version of the patch.) The attached patch uses number_to_string to fix the printf in warc_write_cdx_record.

Regards, Gijs

From 21fc9f0dd9c71e2dc3aea29be4e16f14620d12a5 Mon Sep 17 00:00:00 2001
From: Gijs van Tulder <gvtul...@gmail.com>
Date: Sat, 24 Nov 2012 12:44:14 +0100
Subject: [PATCH] warc: fix format string for off_t in CDX function.

---
 src/ChangeLog |    5 +
 src/warc.c    |    8 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 07152a5..45b2a70 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,8 @@
+2012-11-24  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* warc.c (warc_write_cdx_record): Use `number_to_string' to
+	convert the offset to a string.
+
 2012-11-24  Giuseppe Scrivano  <gscriv...@gnu.org>
 
 	* warc.c (warc_write_block_from_file): Use `number_to_string' to

diff --git a/src/warc.c b/src/warc.c
index 99e7016..25a8517 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -1225,10 +1225,14 @@ warc_write_cdx_record (const char *url, const char *timestamp_str,
   if (redirect_location == NULL || strlen(redirect_location) == 0)
     redirect_location = "-";
 
+  char offset_string[22];
+  number_to_string (offset_string, offset);
+
   /* Print the CDX line. */
-  fprintf (warc_current_cdx_file, "%s %s %s %s %d %s %s - %ld %s %s\n", url,
+  fprintf (warc_current_cdx_file, "%s %s %s %s %d %s %s - %s %s %s\n", url,
            timestamp_str_cdx, url, mime_type, response_code, checksum,
-           redirect_location, offset, warc_current_filename, response_uuid);
+           redirect_location, offset_string, warc_current_filename,
+           response_uuid);
 
   fflush (warc_current_cdx_file);
 
   return true;
-- 
1.7.9.5
[Bug-wget] Invalid Content-Length header in WARC files, on some platforms
Hi,

There's a somewhat serious issue in the WARC-generating code: on some platforms (presumably the ones where off_t is not a 64-bit number), the Content-Length header at the top of each WARC record has an incorrect length. On these platforms it is sometimes 0, sometimes 1, but never the correct length. This makes the whole WARC file unreadable. The code works fine on many platforms, but it is apparently a problem on some PowerPC and ARM systems, and maybe other systems as well.

Existing WARC files with this problem can be repaired by replacing the value of the Content-Length header with the correct value, for each WARC record in the file. The content of the WARC records is there; it's just the Content-Length header that is wrong.

The attached patch fixes the problem in warc.c. It replaces off_t by wgint and uses the number_to_static_string function from util.c.

Regards, Gijs

commit 66c0595f5440b36afb7307d4cab3d6430254183b
Author: Gijs van Tulder <gvtul...@gmail.com>
Date:   Mon Nov 12 22:03:30 2012 +0100

    Fix for invalid WARC Content-Length header on some platforms.

diff --git a/src/ChangeLog b/src/ChangeLog
index ec78fe8..3901d94 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,10 @@
+2012-11-12  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* warc.c: Fix for invalid Content-Length WARC header on platforms
+	where off_t is less than 64 bits wide.
+	* warc.h: Likewise: Use wgint instead of off_t.
+	* http.c: Likewise.
+
 2012-08-29  Rohit Mathulla  <rohit_mathu...@yahoo.com>  (tiny change)
 
 	* html-url.c (get_urls_file): Convert shorthand URLs.

diff --git a/src/http.c b/src/http.c
index 5888474..52cbe87 100644
--- a/src/http.c
+++ b/src/http.c
@@ -1712,7 +1712,7 @@ gethttp (struct url *u, struct http_stat *hs, int *dt, struct url *proxy,
   char warc_timestamp_str [21];
   char warc_request_uuid [48];
   ip_address *warc_ip = NULL;
-  off_t warc_payload_offset = -1;
+  wgint warc_payload_offset = -1;
 
   /* Whether this connection will be kept alive after the HTTP request
      is done. */

diff --git a/src/warc.c b/src/warc.c
index de99bf7..894b802 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -78,10 +78,10 @@ static FILE *warc_current_file;
 static gzFile warc_current_gzfile;
 
 /* The offset of the current gzip record in the WARC file. */
-static off_t warc_current_gzfile_offset;
+static wgint warc_current_gzfile_offset;
 
 /* The uncompressed size (so far) of the current record. */
-static off_t warc_current_gzfile_uncompressed_size;
+static wgint warc_current_gzfile_uncompressed_size;
 # endif
 
 /* This is true until a warc_write_* method fails. */
@@ -247,7 +247,9 @@ warc_write_block_from_file (FILE *data_in)
   /* Add the Content-Length header. */
   char *content_length;
   fseeko (data_in, 0L, SEEK_END);
-  if (! asprintf (&content_length, "%ld", ftello (data_in)))
+  wgint bytes = ftello (data_in);
+  int ret = asprintf (&content_length, "%s", number_to_static_string (bytes));
+  if (ret < 0)
     {
       warc_write_ok = false;
       return false;
     }
@@ -313,9 +315,9 @@ warc_write_end_record (void)
   */
 
   /* Calculate the uncompressed and compressed sizes. */
-  off_t current_offset = ftello (warc_current_file);
-  off_t uncompressed_size = current_offset - warc_current_gzfile_offset;
-  off_t compressed_size = warc_current_gzfile_uncompressed_size;
+  wgint current_offset = ftello (warc_current_file);
+  wgint uncompressed_size = current_offset - warc_current_gzfile_offset;
+  wgint compressed_size = warc_current_gzfile_uncompressed_size;
 
   /* Go back to the static GZIP header. */
   fseeko (warc_current_file, warc_current_gzfile_offset
@@ -414,14 +416,14 @@ warc_write_ip_header (ip_address *ip)
    16 bytes beginning at RES_PAYLOAD. */
 static int
 warc_sha1_stream_with_payload (FILE *stream, void *res_block, void *res_payload,
-                               off_t payload_offset)
+                               wgint payload_offset)
 {
 #define BLOCKSIZE 32768
 
   struct sha1_ctx ctx_block;
   struct sha1_ctx ctx_payload;
-  off_t pos;
-  off_t sum;
+  wgint pos;
+  wgint sum;
 
   char *buffer = malloc (BLOCKSIZE + 72);
   if (!buffer)
@@ -440,7 +442,7 @@ warc_sha1_stream_with_payload (FILE *stream, void *res_block, void *res_payload,
       /* We read the file in blocks of BLOCKSIZE bytes.  One call of the
          computation function processes the whole buffer so that with the
         next round of the loop another block can be read. */
-      off_t n;
+      wgint n;
       sum = 0;
 
       /* Read block.  Take care for partial reads. */
@@ -481,7 +483,7 @@ warc_sha1_stream_with_payload (FILE *stream, void *res_block, void *res_payload,
       if (payload_offset >= 0 && payload_offset < pos)
         {
           /* At least part of the buffer contains data from payload. */
-          off_t start_of_payload = payload_offset - (pos - BLOCKSIZE);
+          wgint start_of_payload = payload_offset - (pos - BLOCKSIZE
[Bug-wget] Segfault with WARC + CDX
Hi,

There's a bug in the warc_find_duplicate_cdx_record function. If you provide a file with CDX records, Wget can segfault when a record is not found in the CDX file. In fact, the deduplication currently only works if *every* new record can be found in the CDX index.

The segmentation fault is generated on these lines in src/warc.c:

  hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload,
                       &key, &rec_existing);
  if (rec_existing != NULL && strcmp (rec_existing->url, url) == 0)

Contrary to what the code expects, hash_table_get_pair does not set rec_existing to NULL if no record is found. So instead of checking for NULL, the function should check whether the return value of hash_table_get_pair is non-zero:

  int found = hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload,
                                   &key, &rec_existing);
  if (found && strcmp (rec_existing->url, url) == 0)

The attached patch makes this change. With it, the deduplication works as intended.

Regards, Gijs

From 807b98d7d9289765c9f210336d2dbf294d663f99 Mon Sep 17 00:00:00 2001
From: Gijs van Tulder <gvtul...@gmail.com>
Date: Wed, 30 May 2012 23:00:04 +0200
Subject: [PATCH] warc: Fix segfault if CDX record is not found.

---
 src/ChangeLog |    4 ++++
 src/warc.c    |    6 +++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 7e16b17..9e74e47 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,7 @@
+2012-05-30  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* warc.c: Fix segfault if CDX record is not found.
+
 2011-05-26  Steven Schweda  <s...@antinode.info>
 
 	* connect.c [HAVE_SYS_SOCKET_H]: Include sys/socket.h.
	[HAVE_SYS_SELECT_H]: Include sys/select.h.

diff --git a/src/warc.c b/src/warc.c
index 24751db..92a49ef 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -1001,10 +1001,10 @@ warc_find_duplicate_cdx_record (char *url, char *sha1_digest_payload)
   char *key;
   struct warc_cdx_record *rec_existing;
-  hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload, &key,
-                       &rec_existing);
+  int found = hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload,
+                                   &key, &rec_existing);
 
-  if (rec_existing != NULL && strcmp (rec_existing->url, url) == 0)
+  if (found && strcmp (rec_existing->url, url) == 0)
     return rec_existing;
   else
     return NULL;
-- 
1.7.4.1
[Bug-wget] Combining --output-document with --recursive
Hi,

There's a problem if you combine --output-document with --recursive or --page-requisites: --output-document breaks the recursion.

First you get a warning:

  WARNING: combining -O with -r or -p will mean that all downloaded
  content will be placed in the single file you specified.

That is what you'd expect, no problem there. However, there is a problem with the recursion. Because Wget *appends* all downloaded content to the same file, the HTML and CSS parsers get confused: the same content is parsed over and over again, each time with a different URL context.

Example:

1. You run wget -O out.tmp -r http://example.com/index.html
2. http://example.com/index.html is written to out.tmp. URLs are extracted from out.tmp relative to http://example.com/index.html. Suppose there is a link to a subdirectory test/index.html; it is added to the download queue as http://example.com/test/index.html (correct).
3. http://example.com/test/index.html is appended to out.tmp. Now, again, Wget extracts URLs from out.tmp. It parses the whole file, so it first finds the contents of /index.html, with the link to test/index.html. Because Wget thinks it is now parsing http://example.com/test/index.html, it will enqueue this as http://example.com/test/test/index.html (wrong).

One obvious solution, attached to this email, is to truncate the output document before downloading the next file. This changes the current behaviour, so maybe it's not a good idea. Is there a better solution?

Regards, Gijs

--
index 8d4edba..502b68f 100644
--- a/src/http.c
+++ b/src/http.c
@@ -2888,7 +2888,18 @@ read_header:
         }
     }
   else
-    fp = output_stream;
+    {
+      fp = output_stream;
+      rewind (fp);
+      if (ftruncate (fileno (fp), 0) == -1)
+        {
+          logprintf (LOG_NOTQUIET, "Could not truncate output file: %s\n",
+                     strerror (errno));
+          CLOSE_INVALIDATE (sock);
+          xfree (head);
+          xfree_null (type);
+          return FOPENERR;
+        }
+    }
 
   /* Print fetch message, if opt.verbose. */
   if (opt.verbose)
Re: [Bug-wget] Regular expression matching
Hi,

Here is a new version of the regular expressions patch. The new version combines POSIX (always, from gnulib) and PCRE (if available). The patch adds these options:

  --accept-regex=...
  --reject-regex=...
  --regex-type=posix   for POSIX extended regexes (the default)
  --regex-type=pcre    for PCRE regexes (if PCRE is available)

In reference to the --match-query-string patch: since the regexes look at the complete URL, you can also use them to match the query string.

Regards, Gijs

=== modified file 'ChangeLog'
--- ChangeLog	2012-03-25 11:47:53 +0000
+++ ChangeLog	2012-04-10 22:28:11 +0000
@@ -1,3 +1,8 @@
+2012-04-11  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* bootstrap.conf (gnulib_modules): Include module `regex'.
+	* configure.ac: Check for PCRE library.
+
 2012-03-25  Ray Satiro  <raysat...@yahoo.com>
 
 	* configure.ac: Fix build under mingw when OpenSSL is used.

=== modified file 'bootstrap.conf'
--- bootstrap.conf	2012-03-20 19:41:14 +0000
+++ bootstrap.conf	2012-04-04 15:09:08 +0000
@@ -58,6 +58,7 @@
 quote
 quotearg
 recv
+regex
 select
 send
 setsockopt

=== modified file 'configure.ac'
--- configure.ac	2012-03-25 11:47:53 +0000
+++ configure.ac	2012-04-10 21:59:48 +0000
@@ -532,6 +532,18 @@
 ])
 )
 
+dnl
+dnl Check for PCRE
+dnl
+
+AC_CHECK_HEADER(pcre.h,
+AC_CHECK_LIB(pcre, pcre_compile,
+  [LIBS="${LIBS} -lpcre"
+   AC_DEFINE([HAVE_LIBPCRE], 1,
+   [Define if libpcre is available.])
+  ])
+)
+
 dnl Needed by src/Makefile.am
 AM_CONDITIONAL([IRI_IS_ENABLED], [test "X$iri" != "Xno"])

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-04-01 14:30:59 +0000
+++ src/ChangeLog	2012-04-10 22:30:28 +0000
@@ -1,3 +1,12 @@
+2012-04-11  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* init.c: Add --accept-regex, --reject-regex and --regex-type.
+	* main.c: Likewise.
+	* options.c: Likewise.
+	* recur.c: Likewise.
+	* utils.c: Add regex-related functions.
+	* utils.h: Add regex-related functions.
+
 2012-04-01  Giuseppe Scrivano  <gscriv...@gnu.org>
 
 	* gnutls.c (wgnutls_read_timeout): Ensure timer is freed.
=== modified file 'src/init.c'
--- src/init.c	2012-03-08 09:00:51 +0000
+++ src/init.c	2012-04-10 22:10:10 +0000
@@ -46,6 +46,10 @@
 # endif
 #endif
 
+#include <regex.h>
+#ifdef HAVE_LIBPCRE
+# include <pcre.h>
+#endif
 
 #ifdef HAVE_PWD_H
 # include <pwd.h>
@@ -94,6 +98,7 @@
 CMD_DECLARE (cmd_spec_prefer_family);
 CMD_DECLARE (cmd_spec_progress);
 CMD_DECLARE (cmd_spec_recursive);
+CMD_DECLARE (cmd_spec_regex_type);
 CMD_DECLARE (cmd_spec_restrict_file_names);
 #ifdef HAVE_SSL
 CMD_DECLARE (cmd_spec_secure_protocol);
@@ -116,6 +121,7 @@
 } commands[] = {
   /* KEEP THIS LIST ALPHABETICALLY SORTED */
   { "accept",           &opt.accepts,           cmd_vector },
+  { "acceptregex",      &opt.acceptregex_s,     cmd_string },
   { "addhostdir",       &opt.add_hostdir,       cmd_boolean },
   { "adjustextension",  &opt.adjust_extension,  cmd_boolean },
   { "alwaysrest",       &opt.always_rest,       cmd_boolean }, /* deprecated */
@@ -236,7 +242,9 @@
   { "reclevel",         &opt.reclevel,          cmd_number_inf },
   { "recursive",        NULL,                   cmd_spec_recursive },
   { "referer",          &opt.referer,           cmd_string },
+  { "regextype",        &opt.regex_type,        cmd_spec_regex_type },
   { "reject",           &opt.rejects,           cmd_vector },
+  { "rejectregex",      &opt.rejectregex_s,     cmd_string },
   { "relativeonly",     &opt.relative_only,     cmd_boolean },
   { "remoteencoding",   &opt.encoding_remote,   cmd_string },
   { "removelisting",    &opt.remove_listing,    cmd_boolean },
@@ -361,6 +369,8 @@
   opt.restrict_files_nonascii = false;
   opt.restrict_files_case = restrict_no_case_restriction;
 
+  opt.regex_type = regex_type_posix;
+
   opt.max_redirect = 20;
 
   opt.waitretry = 10;
@@ -1368,6 +1378,25 @@
   return true;
 }
 
+/* Validate --regex-type and set the choice. */
+
+static bool
+cmd_spec_regex_type (const char *com, const char *val, void *place_ignored)
+{
+  static const struct decode_item choices[] = {
+    { "posix", regex_type_posix },
+#ifdef HAVE_LIBPCRE
+    { "pcre", regex_type_pcre },
+#endif
+  };
+  int regex_type = regex_type_posix;
+  int ok = decode_string (val, choices, countof (choices), &regex_type);
+  if (!ok)
+    fprintf (stderr, _("%s: %s: Invalid value %s.\n"), exec_name, com, quote (val));
+  opt.regex_type = regex_type;
+  return ok;
+}
+
 static bool
 cmd_spec_restrict_file_names (const char *com, const char *val, void *place_ignored)
 {

=== modified file 'src/main.c'
--- src/main.c	2012-03-05 21:23:06 +0000
+++ src/main.c	2012-04-10 22:25:56 +0000
@@ -158,6 +158,7 @@
 static struct cmdline_option option_data[] =
   {
     { "accept", 'A', OPT_VALUE, "accept", -1 },
+    { "accept-regex", 0, OPT_VALUE, "acceptregex", -1 },
     { "adjust-extension", 'E', OPT_BOOLEAN, "adjustextension", -1 },
    { "append-output", 'a', OPT__APPEND_OUTPUT, NULL, required_argument
[Bug-wget] Regular expression matching
Hi,

Here is a patch that adds the --acceptregex and --rejectregex options. With these options it is possible to do two things:

1. You can match complete URLs, instead of just the directory prefix or the file name suffix (which you can do with --accept and --include-directories).
2. You can use regular expressions to do the matching, which is sometimes easier than using a list of wildcard patterns.

Now, this isn't a new idea (there are long discussions in the archive, see [1]). But somehow the previous attempts didn't make it, so I thought I'd send my own version. It's a small patch; I've been using it for a while and found it really useful.

I've made two versions of the patch: one uses PCRE, the other uses the gnulib regex library, which is probably easier to integrate.

Regards, Gijs

[1] https://lists.gnu.org/archive/html/bug-wget/2009-09/msg00035.html

=== modified file 'bootstrap.conf'
--- bootstrap.conf	2012-03-20 19:41:14 +0000
+++ bootstrap.conf	2012-04-04 15:09:08 +0000
@@ -58,6 +58,7 @@
 quote
 quotearg
 recv
+regex
 select
 send
 setsockopt

=== modified file 'src/init.c'
--- src/init.c	2012-03-08 09:00:51 +0000
+++ src/init.c	2012-04-04 17:46:59 +0000
@@ -80,6 +80,7 @@
 CMD_DECLARE (cmd_directory_vector);
 CMD_DECLARE (cmd_number);
 CMD_DECLARE (cmd_number_inf);
+CMD_DECLARE (cmd_regex);
 CMD_DECLARE (cmd_string);
 CMD_DECLARE (cmd_file);
 CMD_DECLARE (cmd_directory);
@@ -116,6 +117,7 @@
 } commands[] = {
   /* KEEP THIS LIST ALPHABETICALLY SORTED */
   { "accept",           &opt.accepts,           cmd_vector },
+  { "acceptregex",      &opt.acceptregex,       cmd_regex },
   { "addhostdir",       &opt.add_hostdir,       cmd_boolean },
   { "adjustextension",  &opt.adjust_extension,  cmd_boolean },
   { "alwaysrest",       &opt.always_rest,       cmd_boolean }, /* deprecated */
@@ -237,6 +239,7 @@
   { "recursive",        NULL,                   cmd_spec_recursive },
   { "referer",          &opt.referer,           cmd_string },
   { "reject",           &opt.rejects,           cmd_vector },
+  { "rejectregex",      &opt.rejectregex,       cmd_regex },
   { "relativeonly",     &opt.relative_only,     cmd_boolean },
   { "remoteencoding",   &opt.encoding_remote,   cmd_string },
   { "removelisting",    &opt.remove_listing,    cmd_boolean },
@@ -943,6 +946,30 @@
   return true;
 }
 
+/* Compile the regular expression and place a
+   pointer to *PLACE. */
+static bool
+cmd_regex (const char *com, const char *val, void *place)
+{
+  regex_t **regex = (regex_t **) place;
+  *regex = malloc (sizeof (regex_t));
+
+  int errcode = regcomp (*regex, val, REG_EXTENDED | REG_NOSUB);
+
+  if (errcode != 0)
+    {
+      int errbuf_size = regerror (errcode, *regex, NULL, 0);
+      char *errbuf = malloc (errbuf_size);
+      errbuf_size = regerror (errcode, *regex, errbuf, errbuf_size);
+      fprintf (stderr, _("%s: %s: Invalid regular expression %s, %s\n"),
+               exec_name, com, quote (val), errbuf);
+      xfree (errbuf);
+      return false;
+    }
+
+  return true;
+}
+
 /* Like the above, but handles tilde-expansion when reading a user's
    `.wgetrc'.  In that case, and if VAL begins with `~', the tilde

=== modified file 'src/main.c'
--- src/main.c	2012-03-05 21:23:06 +0000
+++ src/main.c	2012-04-04 15:15:50 +0000
@@ -158,6 +158,7 @@
 static struct cmdline_option option_data[] =
   {
     { "accept", 'A', OPT_VALUE, "accept", -1 },
+    { "acceptregex", 0, OPT_VALUE, "acceptregex", -1 },
     { "adjust-extension", 'E', OPT_BOOLEAN, "adjustextension", -1 },
    { "append-output", 'a', OPT__APPEND_OUTPUT, NULL, required_argument },
    { "ask-password", 0, OPT_BOOLEAN, "askpassword", -1 },
@@ -263,6 +264,7 @@
     { "recursive", 'r', OPT_BOOLEAN, "recursive", -1 },
     { "referer", 0, OPT_VALUE, "referer", -1 },
     { "reject", 'R', OPT_VALUE, "reject", -1 },
+    { "rejectregex", 0, OPT_VALUE, "rejectregex", -1 },
     { "relative", 'L', OPT_BOOLEAN, "relativeonly", -1 },
     { "remote-encoding", 0, OPT_VALUE, "remoteencoding", -1 },
     { "remove-listing", 0, OPT_BOOLEAN, "removelisting", -1 },
@@ -723,6 +725,10 @@
     N_("\
   -R,  --reject=LIST          comma-separated list of rejected extensions.\n"),
     N_("\
+       --acceptregex=REGEX    extended regex matching accepted URLs.\n"),
+    N_("\
+       --rejectregex=REGEX    extended regex matching rejected URLs.\n"),
+    N_("\
   -D,  --domains=LIST         comma-separated list of accepted domains.\n"),
     N_("\
        --exclude-domains=LIST comma-separated list of rejected domains.\n"),

=== modified file 'src/options.h'
--- src/options.h	2012-03-05 21:23:06 +0000
+++ src/options.h	2012-04-04 17:43:42 +0000
@@ -29,6 +29,8 @@
    shall include the source code for the parts of OpenSSL used as well as
    that of the covered work.  */
 
+#include <regex.h>
+
 struct options
 {
   int verbose;			/* Are we verbose?  (First set to -1,
@@ -74,6 +76,9 @@
   bool ignore_case;		/* Whether to ignore case when
				   matching dirs and files */
 
+  regex_t *acceptregex;		/* Patterns
Re: [Bug-wget] Regular expression matching
Ángel González wrote:
> I really like PCRE, but I think the default should be POSIX regex

Certainly. (I'm not sure it's even worth adding the PCRE option. Matching URLs can't be that hard, can it?)

> How are the interactions between --{accept,reject}regex and
> --{accept,reject}?

The regex options are just another group of options in the list of accept/reject checks: if a URL doesn't pass one of the tests, it's rejected.

Regards, Gijs
[Bug-wget] Fix for crash on invalid STYLE tag
Hi,

Here's a tiny fix for a problem in the HTML parsing in html-url.c. Wget crashes on HTML files that contain an incomplete STYLE tag, e.g.:

  <style /style>

If it finds one of those, it calls get_urls_css with an invalid buffer (the buffer has a negative length), which leads to this crash:

  bad buffer in yy_scan_bytes()
  ERROR (2)

The attached patch checks the buffer before calling get_urls_css. The content of the incomplete tag still won't be parsed, but at least it will no longer lead to a crash.

Regards, Gijs

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-03-29 18:13:27 +0000
+++ src/ChangeLog	2012-04-01 20:35:28 +0000
@@ -1,3 +1,7 @@
+2012-04-01  Gijs van Tulder  <gvtul...@gmail.com>  (tiny change)
+
+	* html-url.c: Prevent crash on incomplete STYLE tag.
+
 2012-03-29  From: Tim Ruehsen  <tim.rueh...@gmx.de>  (tiny change)
 
 	* utils.c (library): Include sys/time.h.

=== modified file 'src/html-url.c'
--- src/html-url.c	2011-04-24 11:03:48 +0000
+++ src/html-url.c	2012-04-01 16:08:18 +0000
@@ -676,7 +676,8 @@
     check_style_attr (tag, ctx);
 
   if (tag->end_tag_p && (0 == strcasecmp (tag->name, "style"))
-      && tag->contents_begin && tag->contents_end)
+      && tag->contents_begin && tag->contents_end
+      && tag->contents_begin <= tag->contents_end)
     {
       /* parse contents */
       get_urls_css (ctx, tag->contents_begin - ctx->text,
[Bug-wget] Fix: Large files in WARC
Hi,

Another small problem in the WARC section: wget crashes with a segmentation fault if you have WARC output enabled and try to download a file larger than 2GB.

I think this is because of the use of size_t, ftell and fseek in warc.c. The attached patch changes the references from size_t to off_t, ftell to ftello and fseek to fseeko. On my 64-bit system this seemed to fix the problem (but I'm not an expert in these matters, so maybe this doesn't hold for 32-bit systems).

Regards,

Gijs

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-01-28 13:09:29 +0000
+++ src/ChangeLog	2012-01-31 23:16:33 +0000
@@ -1,3 +1,9 @@
+2012-02-01  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* warc.c: Fix large file support with ftello, fseeko.
+	* warc.h: Fix large file support.
+	* http.c: Fix large file support.
+
 2012-01-27  Gijs van Tulder  <gvtul...@gmail.com>
 
 	* retr.c (fd_read_body): If the response is chunked, the chunk

=== modified file 'src/http.c'
--- src/http.c	2012-01-28 13:08:52 +0000
+++ src/http.c	2012-01-31 22:34:45 +0000
@@ -1712,7 +1712,7 @@
   char warc_timestamp_str [21];
   char warc_request_uuid [48];
   ip_address *warc_ip = NULL;
-  long int warc_payload_offset = -1;
+  off_t warc_payload_offset = -1;
 
   /* Whether this connection will be kept alive after the HTTP request
      is done. */
@@ -2127,7 +2127,7 @@
       if (write_error >= 0 && warc_tmp != NULL)
         {
           /* Remember end of headers / start of payload. */
-          warc_payload_offset = ftell (warc_tmp);
+          warc_payload_offset = ftello (warc_tmp);
 
           /* Write a copy of the data to the WARC record. */
           int warc_tmp_written = fwrite (opt.post_data, 1, post_data_size, warc_tmp);
@@ -2139,7 +2139,7 @@
     {
       if (warc_tmp != NULL)
         /* Remember end of headers / start of payload. */
-        warc_payload_offset = ftell (warc_tmp);
+        warc_payload_offset = ftello (warc_tmp);
 
       write_error = post_file (sock, opt.post_file_name, post_data_size, warc_tmp);
     }

=== modified file 'src/warc.c'
--- src/warc.c	2012-01-11 14:27:06 +0000
+++ src/warc.c	2012-01-31 22:35:00 +0000
@@ -50,10 +50,10 @@
 static gzFile *warc_current_gzfile;
 
 /* The offset of the current gzip record in the WARC file. */
-static size_t warc_current_gzfile_offset;
+static off_t warc_current_gzfile_offset;
 
 /* The uncompressed size (so far) of the current record. */
-static size_t warc_current_gzfile_uncompressed_size;
+static off_t warc_current_gzfile_uncompressed_size;
 # endif
 
 /* This is true until a warc_write_* method fails. */
@@ -158,7 +158,7 @@
     return false;
 
   fflush (warc_current_file);
-  if (opt.warc_maxsize > 0 && ftell (warc_current_file) >= opt.warc_maxsize)
+  if (opt.warc_maxsize > 0 && ftello (warc_current_file) >= opt.warc_maxsize)
     warc_start_new_file (false);
 
 #ifdef HAVE_LIBZ
@@ -166,7 +166,7 @@
   if (opt.warc_compression_enabled)
     {
       /* Record the starting offset of the new record. */
-      warc_current_gzfile_offset = ftell (warc_current_file);
+      warc_current_gzfile_offset = ftello (warc_current_file);
 
       /* Reserve space for the extra GZIP header field.
          In warc_write_end_record we will fill this space
@@ -217,8 +217,8 @@
     {
       /* Add the Content-Length header. */
       char *content_length;
-      fseek (data_in, 0L, SEEK_END);
-      if (! asprintf (&content_length, "%ld", ftell (data_in)))
+      fseeko (data_in, 0L, SEEK_END);
+      if (! asprintf (&content_length, "%ld", ftello (data_in)))
         {
           warc_write_ok = false;
           return false;
         }
@@ -229,7 +229,7 @@
   /* End of the WARC header section. */
   warc_write_string ("\r\n");
 
-  if (fseek (data_in, 0L, SEEK_SET) != 0)
+  if (fseeko (data_in, 0L, SEEK_SET) != 0)
     warc_write_ok = false;
 
   /* Copy the data in the file to the WARC record. */
@@ -266,7 +266,7 @@
     }
 
   fflush (warc_current_file);
-  fseek (warc_current_file, 0, SEEK_END);
+  fseeko (warc_current_file, 0, SEEK_END);
 
   /* The WARC standard suggests that we add 'skip length' data in the
      extra header field of the GZIP stream.
@@ -284,12 +284,12 @@
   */
 
   /* Calculate the uncompressed and compressed sizes. */
-  size_t current_offset = ftell (warc_current_file);
-  size_t uncompressed_size = current_offset - warc_current_gzfile_offset;
-  size_t compressed_size = warc_current_gzfile_uncompressed_size;
+  off_t current_offset = ftello (warc_current_file);
+  off_t uncompressed_size = current_offset - warc_current_gzfile_offset;
+  off_t compressed_size = warc_current_gzfile_uncompressed_size;
 
   /* Go back to the static GZIP header. */
-  fseek (warc_current_file, warc_current_gzfile_offset + EXTRA_GZIP_HEADER_SIZE, SEEK_SET);
+  fseeko (warc_current_file, warc_current_gzfile_offset + EXTRA_GZIP_HEADER_SIZE, SEEK_SET);
 
   /* Read the header. */
   char static_header
[Bug-wget] Two fixes: Memory leak with chunked responses / Chunked responses and WARC files
Hi,

Here are two small patches. I hope they will be useful.

First, a patch that fixes a memory leak in fd_read_body (src/retr.c) and skip_short_body (src/http.c) when it retrieves a response with Transfer-Encoding: chunked. Both functions make calls to fd_read_line but never free the result.

Second, a patch to the fd_read_body function that changes the way chunked responses are saved in the WARC file. Until now, wget would write a de-chunked response to the WARC file, which is wrong: the WARC file is supposed to have an exact copy of the HTTP response, so it should also include the chunk headers.

The first patch fixes the memory leaks. The second patch changes fd_read_body to save the full, chunked response in the WARC file.

Regards,

Gijs

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-01-11 14:27:06 +0000
+++ src/ChangeLog	2012-01-26 21:30:19 +0000
@@ -1,3 +1,8 @@
+2012-01-27  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* retr.c (fd_read_body): Fix a memory leak with chunked responses.
+	* http.c (skip_short_body): Fix the same memory leak.
+
 2012-01-09  Gijs van Tulder  <gvtul...@gmail.com>
 
 	* init.c: Disable WARC compression if zlib is disabled.

=== modified file 'src/http.c'
--- src/http.c	2012-01-08 23:03:23 +0000
+++ src/http.c	2012-01-26 21:30:19 +0000
@@ -951,9 +951,12 @@
             break;
 
           remaining_chunk_size = strtol (line, &endl, 16);
+          xfree (line);
+
           if (remaining_chunk_size == 0)
             {
-              fd_read_line (fd);
+              line = fd_read_line (fd);
+              xfree_null (line);
               break;
             }
         }
@@ -978,8 +981,13 @@
         {
           remaining_chunk_size -= ret;
           if (remaining_chunk_size == 0)
-            if (fd_read_line (fd) == NULL)
-              return false;
+            {
+              char *line = fd_read_line (fd);
+              if (line == NULL)
+                return false;
+              else
+                xfree (line);
+            }
         }
 
   /* Safe even if %.*s bogusly expects terminating \0 because

=== modified file 'src/retr.c'
--- src/retr.c	2011-11-04 21:25:00 +0000
+++ src/retr.c	2012-01-26 21:30:19 +0000
@@ -307,11 +307,16 @@
             }
 
           remaining_chunk_size = strtol (line, &endl, 16);
+          xfree (line);
+
           if (remaining_chunk_size == 0)
             {
               ret = 0;
-              if (fd_read_line (fd) == NULL)
+              line = fd_read_line (fd);
+              if (line == NULL)
                 ret = -1;
+              else
+                xfree (line);
               break;
             }
         }
@@ -371,11 +376,16 @@
         {
           remaining_chunk_size -= ret;
           if (remaining_chunk_size == 0)
-            if (fd_read_line (fd) == NULL)
-              {
-                ret = -1;
-                break;
-              }
+            {
+              char *line = fd_read_line (fd);
+              if (line == NULL)
+                {
+                  ret = -1;
+                  break;
+                }
+              else
+                xfree (line);
+            }
         }
     }

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-01-26 21:30:19 +0000
+++ src/ChangeLog	2012-01-26 21:56:27 +0000
@@ -1,3 +1,9 @@
+2012-01-27  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* retr.c (fd_read_body): If the response is chunked, the chunk
+	headers are now written to the WARC file, making the WARC file
+	an exact copy of the HTTP response.
+
 2012-01-27  Gijs van Tulder  <gvtul...@gmail.com>
 
 	* retr.c (fd_read_body): Fix a memory leak with chunked responses.
 	* http.c (skip_short_body): Fix the same memory leak.

=== modified file 'src/retr.c'
--- src/retr.c	2012-01-26 21:30:19 +0000
+++ src/retr.c	2012-01-26 21:56:27 +0000
@@ -213,6 +213,9 @@
    the data is stored to ELAPSED.  If OUT2 is non-NULL, the contents
    is also written to OUT2.
 
+   OUT2 will get an exact copy of the response: if this is a chunked
+   response, everything -- including the chunk headers -- is written
+   to OUT2.  (OUT will only get the unchunked response.)
 
    The function exits and returns the amount of data read.  In case of
    error while reading data, -1 is returned.  In case of error while
@@ -305,6 +308,8 @@
                 ret = -1;
                 break;
               }
+            else if (out2 != NULL)
+              fwrite (line, 1, strlen (line), out2);
 
             remaining_chunk_size = strtol (line, &endl, 16);
             xfree (line);
@@ -316,7 +321,11 @@
               if (line == NULL)
                 ret = -1;
               else
-                xfree (line);
+                {
+                  if (out2 != NULL
Re: [Bug-wget] Cannot compile current bzr trunk: undefined reference to `gzwrite' / `gzclose' / `gzdopen'
Hi all,

The attached patch should hopefully fix Evgenii's problem. The patch changes the configure script to always use libz, unless it is explicitly disabled. In that case, the patch makes sure that the WARC functions do not use gzip but write to uncompressed files instead.

The funny thing is that libz was already included with the SSL support. Unless you compiled wget with --without-ssl, libz was always compiled in (even if you configured with --without-zlib).

Regards,

Gijs

On 09-01-12 02:15, Evgenii Philippov wrote:
Actually I currently close my work on wget. So these messages are just bug reports for wget collaborators. Some additional info:

 export PS1="Ok "
 Ok uname -asm
 Linux host_name 2.6.38-13-generic #53-Ubuntu SMP Mon Nov 28 19:33:45 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
 Ok lsb_release -dr
 Description: Ubuntu 11.04
 Release: 11.04
 Ok

With best regards, Thank you for a wonderful utility,
--
Evgeniy

=== modified file 'ChangeLog'
--- ChangeLog	2011-12-12 20:30:39 +0000
+++ ChangeLog	2012-01-09 13:40:01 +0000
@@ -1,3 +1,7 @@
+2012-01-09  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* configure.ac: Always try to use libz, even without SSL.
+
 2011-12-12  Giuseppe Scrivano  <gscriv...@gnu.org>
 
 	* Makefile.am (EXTRA_DIST): Add build-aux/bzr-version-gen.

=== modified file 'configure.ac'
--- configure.ac	2011-11-04 21:25:00 +0000
+++ configure.ac	2012-01-09 13:40:01 +0000
@@ -65,6 +65,9 @@
 [[  --without-ssl           disable SSL autodetection
   --with-ssl={gnutls,openssl}   specify the SSL backend.  GNU TLS is the default.]])
 
+AC_ARG_WITH(zlib,
+[[  --without-zlib          disable zlib ]])
+
 AC_ARG_ENABLE(opie,
 [  --disable-opie          disable support for opie or s/key FTP login],
 ENABLE_OPIE=$enableval, ENABLE_OPIE=yes)
@@ -234,6 +237,10 @@
 dnl Checks for libraries.
 dnl
 
+AS_IF([test x$with_zlib != xno], [
+  AC_CHECK_LIB(z, compress)
+])
+
 AS_IF([test x$with_ssl = xopenssl], [
   dnl some versions of openssl use zlib compression
   AC_CHECK_LIB(z, compress)

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-01-08 23:03:23 +0000
+++ src/ChangeLog	2012-01-09 13:40:01 +0000
@@ -1,3 +1,10 @@
+2012-01-09  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* init.c: Disable WARC compression if zlib is disabled.
+	* main.c: Do not show the 'no-warc-compression' option if zlib is
+	disabled.
+	* warc.c: Do not compress WARC files if zlib is disabled.
+
 2012-01-09  Sasikantha Babu  <sasikanth@gmail.com>  (tiny change)
 
 	* connect.c (connect_to_ip): properly formatted ipv6 address display.
 	(socket_family): New function - returns socket family type.

=== modified file 'src/init.c'
--- src/init.c	2011-11-04 21:25:00 +0000
+++ src/init.c	2012-01-09 13:40:01 +0000
@@ -267,7 +267,9 @@
   { "waitretry",        &opt.waitretry,         cmd_time },
   { "warccdx",          &opt.warc_cdx_enabled,  cmd_boolean },
   { "warccdxdedup",     &opt.warc_cdx_dedup_filename, cmd_file },
+#ifdef HAVE_LIBZ
   { "warccompression",  &opt.warc_compression_enabled, cmd_boolean },
+#endif
   { "warcdigests",      &opt.warc_digests_enabled, cmd_boolean },
   { "warcfile",         &opt.warc_filename,     cmd_file },
   { "warcheader",       NULL,                   cmd_spec_warc_header },
@@ -374,7 +376,11 @@
   opt.show_all_dns_entries = false;
 
   opt.warc_maxsize = 0; /* 1024 * 1024 * 1024; */
+#ifdef HAVE_LIBZ
   opt.warc_compression_enabled = true;
+#else
+  opt.warc_compression_enabled = false;
+#endif
   opt.warc_digests_enabled = true;
   opt.warc_cdx_enabled = false;
   opt.warc_cdx_dedup_filename = NULL;

=== modified file 'src/main.c'
--- src/main.c	2011-11-04 21:25:00 +0000
+++ src/main.c	2012-01-09 13:40:01 +0000
@@ -289,7 +289,9 @@
     { "wait", 'w', OPT_VALUE, "wait", -1 },
     { "waitretry", 0, OPT_VALUE, "waitretry", -1 },
     { "warc-cdx", 0, OPT_BOOLEAN, "warccdx", -1 },
+#ifdef HAVE_LIBZ
     { "warc-compression", 0, OPT_BOOLEAN, "warccompression", -1 },
+#endif
     { "warc-dedup", 0, OPT_VALUE, "warccdxdedup", -1 },
     { "warc-digests", 0, OPT_BOOLEAN, "warcdigests", -1 },
     { "warc-file", 0, OPT_VALUE, "warcfile", -1 },
@@ -674,8 +676,10 @@
        --warc-cdx                write CDX index files.\n"),
     N_("\
        --warc-dedup=FILENAME     do not store records listed in this CDX file.\n"),
+#ifdef HAVE_LIBZ
     N_("\
        --no-warc-compression     do not compress WARC files with GZIP.\n"),
+#endif
     N_("\
        --no-warc-digests         do not calculate SHA1 digests.\n"),
     N_("\

=== modified file 'src/warc.c'
--- src/warc.c	2011-11-20 17:28:19 +0000
+++ src/warc.c	2012-01-09 13:40:01 +0000
@@ -14,7 +14,9 @@
 #include "sha1.h"
 #include "base32.h"
 #include <unistd.h>
+#ifdef HAVE_LIBZ
 #include <zlib.h>
+#endif
 #ifdef HAVE_LIBUUID
 #include <uuid/uuid.h>
 #endif
@@ -42,6 +44,7 @@
 /* The current WARC file (or NULL, if WARC is disabled). */
 static FILE *warc_current_file;
 
+#ifdef HAVE_LIBZ
 /* The gzip stream for the current WARC file
    (or NULL, if WARC or gzip is disabled). */
 static gzFile
Re: [Bug-wget] WARC, new version
lovely. I am going to push it soon with some small adjustments.

That's good to hear.

There's one other small adjustment that you may want to make, see the attached patch. One of the WARC functions uses the basename function, which causes problems on OS X. Including libgen.h and strdup-ing the output of basename seems to solve this problem.

Thanks,

Gijs

On 04-11-11 22:27, Giuseppe Scrivano wrote:
Gijs van Tulder <gvtul...@gmail.com> writes:

Hi Giuseppe,
* I've changed the configure.ac and src/Makefile.am.
* I've added a ChangeLog entry.

lovely. I am going to push it soon with some small adjustments. Thanks for the great work. Whenever it happens to be in the same place, I'll buy you a beer :-)

Cheers,
Giuseppe

--- a/src/warc.c	2011-11-04 17:41:11.383704054 +0100
+++ b/src/warc.c	2011-11-04 23:06:28.693712714 +0100
@@ -19,6 +19,10 @@
 #include <uuid/uuid.h>
 #endif
 
+#ifndef WINDOWS
+#include <libgen.h>
+#endif
+
 #include "warc.h"
 
 extern char *version_string;
@@ -605,7 +609,7 @@
   char *filename_copy, *filename_basename;
   filename_copy = strdup (filename);
-  filename_basename = basename (filename_copy);
+  filename_basename = strdup (basename (filename_copy));
 
   warc_write_start_record ();
   warc_write_header ("WARC-Type", "warcinfo");
@@ -619,6 +623,7 @@
   if (warc_tmp == NULL)
     {
       free (filename_copy);
+      free (filename_basename);
       return false;
     }
@@ -646,6 +651,7 @@
     }
 
   free (filename_copy);
+  free (filename_basename);
   fclose (warc_tmp);
   return warc_write_ok;
 }
[Bug-wget] Memory leak when using GnuTLS
Hi,

I think there is a memory leak in the GnuTLS part of wget. When downloading multiple files from a HTTPS server, wget with GnuTLS uses a lot of memory.

Perhaps an explanation for this can be found in src/http.c. The gethttp function calls ssl_init for each download:

 /* Initialize the SSL context.  After this has once been done,
    it becomes a no-op.  */
 if (!ssl_init ())

The OpenSSL version of ssl_init, in src/openssl.c, checks if SSL has already been initialized and doesn't repeat the work. But the GnuTLS version doesn't:

 bool
 ssl_init ()
 {
   const char *ca_directory;
   DIR *dir;

   gnutls_global_init ();
   gnutls_certificate_allocate_credentials (&credentials);

GnuTLS is initialized again and again, but there is never a call to gnutls_global_deinit.

I've attached a small patch to add a check to ssl_init in src/gnutls.c, similar to the check already in src/openssl.c. With it, wget can still download over HTTPS and the memory usage stays within reasonable limits.

Thanks,

Gijs

=== modified file 'src/gnutls.c'
--- src/gnutls.c	2011-09-04 11:30:01 +0000
+++ src/gnutls.c	2011-10-31 22:58:38 +0000
@@ -59,10 +59,17 @@
    confused with actual gnutls functions -- such as the gnutls_read
    preprocessor macro.  */
 
+/* Becomes true if GnuTLS is initialized. */
+static bool ssl_initialized = false;
+
 static gnutls_certificate_credentials credentials;
 bool
 ssl_init ()
 {
+  /* GnuTLS should be initialized only once. */
+  if (ssl_initialized)
+    return true;
+
   const char *ca_directory;
   DIR *dir;
 
@@ -104,6 +111,9 @@
   if (opt.ca_cert)
     gnutls_certificate_set_x509_trust_file (credentials, opt.ca_cert,
                                             GNUTLS_X509_FMT_PEM);
+
+  ssl_initialized = true;
+
   return true;
 }
Re: [Bug-wget] WARC, new version
Hi David,

David H. Lipman wrote: I have seen WARC mentioned but have not seen a definition.

WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web resources. It is used for making archives of web sites. The Internet Archive, for example, uses it as the file format for their Wayback Machine and Heritrix crawler.

The nice thing about WARC is that it lets you store all information about your web crawl: the files you download, of course, but also things like the HTTP request and response headers, information about redirects and error pages. WARC also provides a place to keep the related metadata. It is, in short, a way to store everything, in a standardized file format.

Adding WARC to wget means that you'll be able to do things like

 wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu

which will produce (next to the normal wget download) a file named 'gnu.warc.gz' that contains every HTTP request and every HTTP response that wget made. This is an 'archival grade' copy of the mirrored site. Once you have the WARC file, you could store it in your archive, extract files, or run your own local Wayback Machine [2, 3].

wget is already a very useful tool to make a quick copy of a website; adding WARC support helps to make the copy as complete as possible.

Maybe that answers some of your questions?

Regards,

Gijs

[1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[2] http://archive-access.sourceforge.net/projects/wayback/
[3] http://netpreserve.org/software/downloads.php
Re: [Bug-wget] WARC output
can you please send a complete diff against the current development tree version? Here's the diff of the WARC additions (1.9MB zipped) to revision 2565: http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2 Thanks, Gijs
Re: [Bug-wget] WARC output
Hi. It's been a while since we've discussed the WARC addition to Wget. Is there anything I can help with? Gijs
Re: [Bug-wget] WARC output
Giuseppe Scrivano writes:

The implementation makes use of the open source WARC Tools library (Apache License 2.0): http://code.google.com/p/warc-tools/

how much code is really needed from that library? I wonder if we can avoid this dependency at all.

The library comes with some utilities, an HTTrack plugin, a Java module etc. These extra things are not needed for Wget. But of the C library, I used pretty much everything. The library handles all the WARC writing stuff. It can also read WARCs, but that's not needed here. Rough estimate: 12,000 lines of code (excluding comments).

It's probably important to note that I have changed a few small things in the warc-tools library. (I have records in Git.)

As for the other dependencies:
- I used an MIT-licensed base32 encoder (there seems to be no such module in Gnulib), but that's quite small so could be replaced;
- it links to the UUID library.

Can you please track all contributors? Any contribution to GNU wget requires copyright assignments to the FSF.

Yes, it's all in the Git history, so it's easy to make a list. (There's only one other contributor of code, others helped with testing.)

In the meanwhile, can you check if you are following the GNU Coding Standards for the new code?

I tried to do that. So except for the warc-tools library, which uses a different standard, all new code follows the GNU standards (I hope).

Thanks,

Gijs
[Bug-wget] WARC output
Hi,

I'd like to propose a new feature that allows Wget to make WARC files. Perhaps you're already familiar with it, but in short: WARC is a file format for web archives. In a single WARC file, you can store every file of the website, plus the HTTP request and response headers and other metadata. This makes it a very useful format for web archivists: you keep everything together, in the most detailed and original form.

The WARC format (an ISO standard, ISO 28500) has been developed by the International Internet Preservation Consortium, which includes the Internet Archive and many national libraries. It is supposed to become *the* standard file format for web archives. For example, it is used in the Internet Archive's Wayback Machine and its Heritrix crawler. There are several projects building tools to work with WARC files.

It would be cool if Wget could become one of these tools. Wget is already the Swiss army knife for mirroring websites; the one thing it is missing is a good way to store these mirrors. The current output of --mirror is not sufficient for archival purposes:
- it throws away the HTTP headers (of the request and response);
- it doesn't keep 404 pages and redirects;
- it doesn't store the original URLs but mangles the filenames;
- and, if you're not careful, it even rewrites the links inside the documents that it has downloaded.

The WARC format supports these things. With some help from others, I've added WARC functions to Wget. With the --warc-file option you can specify that the mirror should also be written to a WARC archive. Wget will then keep everything, including the HTTP request and response headers, redirects and 404 pages.

Do you think this is something that could be included in the main Wget version? If that's the case, what should be the next step?
Description, links to more information about WARC: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output Code: https://github.com/alard/wget-warc/ https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2 The implementation makes use of the open source WARC Tools library (Apache License 2.0): http://code.google.com/p/warc-tools/ I look forward to your response. Kind regards, Gijs van Tulder