Re: [Bug-wget] Segmentation fault with current development version of wget

2013-05-01 Thread Gijs van Tulder

Hi Giuseppe,

Dropping the bit that sanitizes the opt.method is probably a good idea. 
(Perhaps I shouldn't have replied to your patch directly.)


Still, even if the sanitization is removed: I think it would be better 
if RESTORE_POST_DATA restored the previous value of opt.method, instead 
of overwriting it with a hardcoded POST. Wouldn't that be safer?


A related question: how is a redirect response to a PUT request handled? 
How should it be handled?


I haven't tried it, but it looks like in that case the SUSPEND_POST_DATA 
macro is called (by retrieve_url in retr.c). If that's true, then later 
on the opt.method would be 'restored' to POST by RESTORE_POST_DATA.


Regards,

Gijs


On 01-05-13 22:16, Giuseppe Scrivano wrote:

hi Gijs,

Gijs van Tulder gvtul...@gmail.com writes:


Giuseppe Scrivano wrote:

what about this patch?  Any comment?


Another suggestion: why not save the original opt.method, set
opt.method to NULL and put the original opt.method back later?


thanks for your suggestion but I think we should drop the code that
modifies opt.method, since we have to sanitize it only when it is
specified as an argument.  Objections?






[Bug-wget] Remaining reference to opt.post_data (WARC in src/http.c)

2013-04-21 Thread Gijs van Tulder

Hi,

For the new --body-data option most of the code that used to reference 
opt.post_data has been changed to use opt.body_data. I found one 
remaining reference, hidden in one of the WARC-writing sections of 
src/http.c. Wget would crash if you combine --body-data with --warc-file.


It's a simple fix. See the attached patch.

Regards,

Gijs
From d2e6e16b3062cc0e6b3c13fd04e3654ed2dbdb6e Mon Sep 17 00:00:00 2001
From: Gijs van Tulder gvtul...@gmail.com
Date: Sun, 21 Apr 2013 22:36:50 +0200
Subject: [PATCH] Remove old reference to opt.post_data.

---
 src/ChangeLog |5 +
 src/http.c|2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 8a60e5d..64fc634 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,8 @@
+2013-04-21  Gijs van Tulder  gvtul...@gmail.com
+
+	* http.c: Copy opt.body_data to the WARC file, instead of
+	opt.post_data (the old option).
+
 2013-04-12  Gijs van Tulder  gvtul...@gmail.com
 
 	* warc.c: Generate unique UUIDs for the manifest and the record
diff --git a/src/http.c b/src/http.c
index 3e4d7cc..88f7a96 100644
--- a/src/http.c
+++ b/src/http.c
@@ -2150,7 +2150,7 @@ gethttp (struct url *u, struct http_stat *hs, int *dt, struct url *proxy,
   warc_payload_offset = ftello (warc_tmp);
 
   /* Write a copy of the data to the WARC record. */
-  int warc_tmp_written = fwrite (opt.post_data, 1, body_data_size, warc_tmp);
+  int warc_tmp_written = fwrite (opt.body_data, 1, body_data_size, warc_tmp);
   if (warc_tmp_written != body_data_size)
 write_error = -2;
 }
-- 
1.7.9.5



[Bug-wget] Standards fix for metadata records in WARC files

2013-04-12 Thread Gijs van Tulder

This patch repairs two minor problems in the WARC metadata records.

1. Each record should have its own unique WARC-Record-ID, but currently 
the ID for the record holding the manifest is reused for the record 
holding the arguments. The patch generates a new ID for the arguments 
(and refers to the manifest in a WARC-Concurrent-To header).


2. According to the WARC implementation guidelines [1], the manifest 
should be written to a metadata record, but Wget stores it as a 
resource record. The patch corrects this.
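For reference, the record pair the patched code would emit looks roughly like this (UUIDs abbreviated and lengths elided; this is an illustration of the layout, not output copied from Wget):

```
WARC/1.0
WARC-Type: metadata
WARC-Record-ID: <urn:uuid:aaaa...>
WARC-Warcinfo-ID: <urn:uuid:cccc...>
Content-Type: text/plain
Content-Length: ...

(manifest contents)

WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:bbbb...>
WARC-Concurrent-To: <urn:uuid:aaaa...>
Content-Type: text/plain
Content-Length: ...

(command-line arguments)
```

The manifest is now a metadata record, the arguments record has its own WARC-Record-ID, and its WARC-Concurrent-To header points back at the manifest.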


Regards,

Gijs


[1] Section 2.4.4 of 
http://www.netpreserve.org/resources/warc-implementation-guidelines-v1
commit b54fb8feb9dfb2a111d15f1b759de61217d5251e
Author: Gijs van Tulder gvtul...@gmail.com
Date:   Fri Apr 12 23:37:45 2013 +0200

warc: Follow the guidelines for metadata records

Do not use the same UUID for the manifest and arguments records.
Write the manifest as a metadata record, not as a resource.

diff --git a/src/ChangeLog b/src/ChangeLog
index 65d636d..e609f2d 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,11 @@
+2013-04-12  Gijs van Tulder  gvtul...@gmail.com
+
+	* warc.c: Generate unique UUIDs for the manifest and the record
+	holding the command-line arguments.
+	Write the manifest to a metadata record to follow the WARC
+	implementation guidelines.
+	* warc.h: Declare new function warc_write_metadata_record.
+
 2013-03-31  Gijs van Tulder  gvtul...@gmail.com
 
 	* warc.c: Correctly write the field length in the skip length field
diff --git a/src/warc.c b/src/warc.c
index 9b10610..916b53d 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -1083,7 +1083,7 @@ warc_write_metadata (void)
   warc_uuid_str (manifest_uuid);
 
   fflush (warc_manifest_fp);
-  warc_write_resource_record (manifest_uuid,
+  warc_write_metadata_record (manifest_uuid,
  "metadata://gnu.org/software/wget/warc/MANIFEST.txt",
  NULL, NULL, NULL, "text/plain",
  warc_manifest_fp, -1);
@@ -1098,9 +1098,9 @@ warc_write_metadata (void)
   fflush (warc_tmp_fp);
  fprintf (warc_tmp_fp, "%s\n", program_argstring);
 
-  warc_write_resource_record (manifest_uuid,
+  warc_write_resource_record (NULL,
"metadata://gnu.org/software/wget/warc/wget_arguments.txt",
-  NULL, NULL, NULL, "text/plain",
+  NULL, manifest_uuid, NULL, "text/plain",
   warc_tmp_fp, -1);
   /* warc_write_resource_record has closed warc_tmp_fp. */
 
@@ -1395,20 +1395,22 @@ warc_write_response_record (char *url, char *timestamp_str,
   return warc_write_ok;
 }
 
-/* Writes a resource record to the WARC file.
+/* Writes a resource or metadata record to the WARC file.
+   warc_type  is either "resource" or "metadata",
resource_uuid  is the uuid of the resource (or NULL),
url  is the target uri of the resource,
timestamp_str  is the timestamp (generated with warc_timestamp),
-   concurrent_to_uuid  is the uuid of the request for that generated this
+   concurrent_to_uuid  is the uuid of the record that generated this,
resource (generated with warc_uuid_str) or NULL,
ip  is the ip address of the server (or NULL),
content_type  is the mime type of the body (or NULL),
body  is a pointer to a file containing the resource data.
Calling this function will close body.
Returns true on success, false on error. */
-bool
-warc_write_resource_record (char *resource_uuid, const char *url,
- const char *timestamp_str, const char *concurrent_to_uuid,
+static bool
+warc_write_record (const char *record_type, char *resource_uuid,
+ const char *url, const char *timestamp_str,
+ const char *concurrent_to_uuid,
  ip_address *ip, const char *content_type, FILE *body,
  off_t payload_offset)
 {
@@ -1422,7 +1424,7 @@ warc_write_resource_record (char *resource_uuid, const char *url,
    content_type = "application/octet-stream";
 
   warc_write_start_record ();
-  warc_write_header ("WARC-Type", "resource");
+  warc_write_header ("WARC-Type", record_type);
   warc_write_header ("WARC-Record-ID", resource_uuid);
   warc_write_header ("WARC-Warcinfo-ID", warc_current_warcinfo_uuid_str);
   warc_write_header ("WARC-Concurrent-To", concurrent_to_uuid);
@@ -1438,3 +1440,47 @@ warc_write_resource_record (char *resource_uuid, const char *url,
 
   return warc_write_ok;
 }
+
+/* Writes a resource record to the WARC file.
+   resource_uuid  is the uuid of the resource (or NULL),
+   url  is the target uri of the resource,
+   timestamp_str  is the timestamp (generated with warc_timestamp),
+   concurrent_to_uuid  is the uuid of the record that generated this,
+   resource (generated with warc_uuid_str) or NULL,
+   ip  is the ip address of the server (or NULL),
+   content_type  is the mime type of the body (or NULL),
+   body  is a pointer to a file containing the resource data

Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files

2013-03-30 Thread Gijs van Tulder

Hi,

> It appears wget may be creating slightly malformed GZIP skip-length
> fields

I think that's correct: Wget doesn't write the subfield length in the 
extra field section of the header. After the subfield ID 'sl' it 
should write the length LEN (see RFC 1952 [1]), but it doesn't.


Luckily, it does write the correct total length of the extra fields 
(XLEN in RFC 1952), so Gzip implementations that simply ignore the 
extra field can skip it without problems. This is the case for the 
GNU Gzip utility.
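Spelled out as code, the FEXTRA layout that RFC 1952 section 2.3.1.1 requires, and the byte arrangement the attached patch produces if I read it back correctly, is roughly this (a sketch, not the wget function itself):

```c
#include <assert.h>
#include <stdint.h>

/* Build the 14-byte extra field carrying the WARC 'sl' subfield:
   XLEN (2 bytes, little-endian), then subfield ID SI1 SI2, then the
   subfield data length LEN (2 bytes) that was previously missing,
   then the two 32-bit record sizes, all little-endian.  */
static void
build_warc_extra_field (unsigned char out[14],
                        uint32_t uncompressed_size,
                        uint32_t compressed_size)
{
  out[0] = 12 & 255;            /* XLEN: total subfield bytes (2+2+8) */
  out[1] = (12 >> 8) & 255;
  out[2] = 's';                 /* SI1 */
  out[3] = 'l';                 /* SI2 */
  out[4] = 8 & 255;             /* LEN: 8 bytes of data follow */
  out[5] = (8 >> 8) & 255;
  for (int i = 0; i < 4; i++)
    {
      out[6 + i]  = (uncompressed_size >> (8 * i)) & 255;
      out[10 + i] = (compressed_size   >> (8 * i)) & 255;
    }
}
```

A parser that honours XLEN can skip the whole field; one that walks subfields needs the LEN bytes to find the end of the 'sl' data.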


But it should be fixed. I've attached a patch.

> It's likely that we'll need to make the warc.gz parsers a bit more
> robust, but I thought I'd mention it here in case this is
> actually a bug in wget.

When I wrote the code for the extra field I used the old Hanzo 
warc-tools [2] as an example. That implementation has the same problem: 
it doesn't write the field length [3]. This means there's at least one 
other tool that writes these off-spec warc.gz files, so it's probably 
useful to make the parser a bit more robust.


Thanks,

Gijs

[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] https://code.google.com/p/warc-tools/
[3] 
https://code.google.com/p/warc-tools/source/browse/trunk/lib/private/wgzip.c#314
diff --git a/src/ChangeLog b/src/ChangeLog
index 8e1213f..65d636d 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,8 @@
+2013-03-31  Gijs van Tulder  gvtul...@gmail.com
+
+	* warc.c: Correctly write the field length in the skip length field
+	of .warc.gz files. (Following the GZIP spec in RFC 1952.)
+
 2013-03-12  Darshit Shah dar...@gmail.com
 
 	* http.c (gethttp): Make wget return FILEBADFILE error and abort if
diff --git a/src/warc.c b/src/warc.c
index fb506a7..9b10610 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -165,7 +165,7 @@ warc_write_string (const char *str)
 }
 
 
-#define EXTRA_GZIP_HEADER_SIZE 12
+#define EXTRA_GZIP_HEADER_SIZE 14
 #define GZIP_STATIC_HEADER_SIZE  10
 #define FLG_FEXTRA  0x04
 #define OFF_FLG 3
@@ -200,7 +200,7 @@ warc_write_start_record (void)
  In warc_write_end_record we will fill this space
  with information about the uncompressed and
  compressed size of the record. */
-  fprintf (warc_current_file, );
+  fseek (warc_current_file, EXTRA_GZIP_HEADER_SIZE, SEEK_CUR);
   fflush (warc_current_file);
 
   /* Start a new GZIP stream. */
@@ -342,16 +342,19 @@ warc_write_end_record (void)
   /* The extra header field identifier for the WARC skip length. */
   extra_header[2]  = 's';
   extra_header[3]  = 'l';
+  /* The size of the field value (8 bytes).  */
+  extra_header[4]  = (8 & 255);
+  extra_header[5]  = ((8 >> 8) & 255);
   /* The size of the uncompressed record.  */
-  extra_header[4]  = (uncompressed_size & 255);
-  extra_header[5]  = (uncompressed_size >> 8) & 255;
-  extra_header[6]  = (uncompressed_size >> 16) & 255;
-  extra_header[7]  = (uncompressed_size >> 24) & 255;
+  extra_header[6]  = (uncompressed_size & 255);
+  extra_header[7]  = (uncompressed_size >> 8) & 255;
+  extra_header[8]  = (uncompressed_size >> 16) & 255;
+  extra_header[9]  = (uncompressed_size >> 24) & 255;
   /* The size of the compressed record.  */
-  extra_header[8]  = (compressed_size & 255);
-  extra_header[9]  = (compressed_size >> 8) & 255;
-  extra_header[10] = (compressed_size >> 16) & 255;
-  extra_header[11] = (compressed_size >> 24) & 255;
+  extra_header[10] = (compressed_size & 255);
+  extra_header[11] = (compressed_size >> 8) & 255;
+  extra_header[12] = (compressed_size >> 16) & 255;
+  extra_header[13] = (compressed_size >> 24) & 255;
 
   /* Write the extra header after the static header. */
   fseeko (warc_current_file, warc_current_gzfile_offset


Re: [Bug-wget] [PATCH] Invalid Content-Length header in WARC files, on some platforms

2012-11-24 Thread Gijs van Tulder

Giuseppe Scrivano writes:

From 1e229375aa89cdc0bba07335fbe10d4f66180f68 Mon Sep 17 00:00:00 2001
Subject: [PATCH] warc: fix format string for off_t


Good to see that that's fixed. However, there's another instance of this 
problem in the warc_write_cdx_record function in warc.c. (I saw that Tim 
Ruehsen fixed this in his version of the patch.)


The attached patch uses number_to_string to fix the printf in 
warc_write_cdx_record.


Regards,

Gijs
From 21fc9f0dd9c71e2dc3aea29be4e16f14620d12a5 Mon Sep 17 00:00:00 2001
From: Gijs van Tulder gvtul...@gmail.com
Date: Sat, 24 Nov 2012 12:44:14 +0100
Subject: [PATCH] warc: fix format string for off_t in CDX function.

---
 src/ChangeLog |5 +
 src/warc.c|8 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 07152a5..45b2a70 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,8 @@
+2012-11-24  Gijs van Tulder  gvtul...@gmail.com
+
+	* warc.c (warc_write_cdx_record): Use `number_to_string' to
+	convert the offset to a string.
+
 2012-11-24  Giuseppe Scrivano  gscriv...@gnu.org
 
 	* warc.c (warc_write_block_from_file): Use `number_to_string' to
diff --git a/src/warc.c b/src/warc.c
index 99e7016..25a8517 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -1225,10 +1225,14 @@ warc_write_cdx_record (const char *url, const char *timestamp_str,
   if (redirect_location == NULL || strlen(redirect_location) == 0)
 redirect_location = "-";
 
+  char offset_string[22];
+  number_to_string (offset_string, offset);
+
   /* Print the CDX line. */
-  fprintf (warc_current_cdx_file, "%s %s %s %s %d %s %s - %ld %s %s\n", url,
+  fprintf (warc_current_cdx_file, "%s %s %s %s %d %s %s - %s %s %s\n", url,
timestamp_str_cdx, url, mime_type, response_code, checksum,
-   redirect_location, offset, warc_current_filename, response_uuid);
+   redirect_location, offset_string, warc_current_filename,
+   response_uuid);
   fflush (warc_current_cdx_file);
 
   return true;
-- 
1.7.9.5



[Bug-wget] Invalid Content-Length header in WARC files, on some platforms

2012-11-12 Thread Gijs van Tulder

Hi,

There's a somewhat serious issue in the WARC-generating code: on some 
platforms (presumably the ones where off_t is not a 64-bit number) the 
Content-Length header at the top of each WARC record has an incorrect 
length. On these platforms it is sometimes 0, sometimes 1, but never the 
correct length. This makes the whole WARC file unreadable.


The code works fine on many platforms, but it is apparently a problem on 
some PowerPC and ARM systems, and maybe other systems as well.


Existing WARC files with this problem can be repaired by replacing the 
value of the Content-Length header with the correct value, for each WARC 
record in the file. The content of the WARC records is there, it's just 
the Content-Length header that is wrong.


The attached patch fixes the problem in warc.c. It replaces off_t by 
wgint and uses the number_to_static_string function from util.c.
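The underlying pitfall, sketched outside of wget (the helper name here is hypothetical): passing an off_t to a fixed "%ld" conversion is undefined whenever the two types differ in width, so the safe pattern is to render the number through a type you control, which is essentially what number_to_static_string does.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper in the spirit of number_to_static_string:
   widen the value to long long (at least 64 bits since C99) and let
   snprintf format that, instead of hoping "%ld" matches off_t.  */
static const char *
offset_to_string (char *buf, size_t len, off_t value)
{
  snprintf (buf, len, "%lld", (long long) value);
  return buf;
}
```

On a platform with 32-bit long and 64-bit off_t, the old "%ld" call reads only half of the argument (or the wrong half), which is exactly how the 0 and 1 lengths above came about.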


Regards,

Gijs
commit 66c0595f5440b36afb7307d4cab3d6430254183b
Author: Gijs van Tulder gvtul...@gmail.com
Date:   Mon Nov 12 22:03:30 2012 +0100

Fix for invalid WARC Content-Length header on some platforms.

diff --git a/src/ChangeLog b/src/ChangeLog
index ec78fe8..3901d94 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,10 @@
+2012-11-12  Gijs van Tulder  gvtul...@gmail.com
+
+	* warc.c: Fix for invalid Content-Length WARC header on platforms
+	where off_t is less than 64 bits wide.
+	* warc.h: Likewise: Use wgint instead of off_t.
+	* http.c: Likewise.
+
 2012-08-29  Rohit Mathulla rohit_mathu...@yahoo.com (tiny change)
 
 	* html-url.c (get_urls_file): Convert shorthand URLs.
diff --git a/src/http.c b/src/http.c
index 5888474..52cbe87 100644
--- a/src/http.c
+++ b/src/http.c
@@ -1712,7 +1712,7 @@ gethttp (struct url *u, struct http_stat *hs, int *dt, struct url *proxy,
   char warc_timestamp_str [21];
   char warc_request_uuid [48];
   ip_address *warc_ip = NULL;
-  off_t warc_payload_offset = -1;
+  wgint warc_payload_offset = -1;
 
   /* Whether this connection will be kept alive after the HTTP request
  is done. */
diff --git a/src/warc.c b/src/warc.c
index de99bf7..894b802 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -78,10 +78,10 @@ static FILE *warc_current_file;
 static gzFile warc_current_gzfile;
 
 /* The offset of the current gzip record in the WARC file. */
-static off_t warc_current_gzfile_offset;
+static wgint warc_current_gzfile_offset;
 
 /* The uncompressed size (so far) of the current record. */
-static off_t warc_current_gzfile_uncompressed_size;
+static wgint warc_current_gzfile_uncompressed_size;
 # endif
 
 /* This is true until a warc_write_* method fails. */
@@ -247,7 +247,9 @@ warc_write_block_from_file (FILE *data_in)
   /* Add the Content-Length header. */
   char *content_length;
   fseeko (data_in, 0L, SEEK_END);
-  if (! asprintf (&content_length, "%ld", ftello (data_in)))
+  wgint bytes = ftello (data_in);
+  int ret = asprintf (&content_length, "%s", number_to_static_string (bytes));
+  if (ret < 0)
 {
   warc_write_ok = false;
   return false;
@@ -313,9 +315,9 @@ warc_write_end_record (void)
   */
 
   /* Calculate the uncompressed and compressed sizes. */
-  off_t current_offset = ftello (warc_current_file);
-  off_t uncompressed_size = current_offset - warc_current_gzfile_offset;
-  off_t compressed_size = warc_current_gzfile_uncompressed_size;
+  wgint current_offset = ftello (warc_current_file);
+  wgint uncompressed_size = current_offset - warc_current_gzfile_offset;
+  wgint compressed_size = warc_current_gzfile_uncompressed_size;
 
   /* Go back to the static GZIP header. */
   fseeko (warc_current_file, warc_current_gzfile_offset
@@ -414,14 +416,14 @@ warc_write_ip_header (ip_address *ip)
 16 bytes beginning at RES_PAYLOAD.  */
 static int
 warc_sha1_stream_with_payload (FILE *stream, void *res_block, void *res_payload,
-   off_t payload_offset)
+   wgint payload_offset)
 {
 #define BLOCKSIZE 32768
 
   struct sha1_ctx ctx_block;
   struct sha1_ctx ctx_payload;
-  off_t pos;
-  off_t sum;
+  wgint pos;
+  wgint sum;
 
   char *buffer = malloc (BLOCKSIZE + 72);
   if (!buffer)
@@ -440,7 +442,7 @@ warc_sha1_stream_with_payload (FILE *stream, void *res_block, void *res_payload,
   /* We read the file in blocks of BLOCKSIZE bytes.  One call of the
  computation function processes the whole buffer so that with the
  next round of the loop another block can be read.  */
-  off_t n;
+  wgint n;
   sum = 0;
 
   /* Read block.  Take care for partial reads.  */
@@ -481,7 +483,7 @@ warc_sha1_stream_with_payload (FILE *stream, void *res_block, void *res_payload,
   if (payload_offset >= 0 && payload_offset < pos)
 {
   /* At least part of the buffer contains data from payload. */
-  off_t start_of_payload = payload_offset - (pos - BLOCKSIZE);
+  wgint start_of_payload = payload_offset - (pos - BLOCKSIZE

[Bug-wget] Segfault with WARC + CDX

2012-05-30 Thread Gijs van Tulder

Hi,

There's a bug in the warc_find_duplicate_cdx_record function. If you 
provide a file with CDX records, Wget can segfault if a record is not 
found in the CDX file. In fact, the deduplication now only works if 
*every* new record can be found in the CDX index.


The segmentation fault is generated on these lines in src/warc.c:

  hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload, &key,
   &rec_existing);
  if (rec_existing != NULL && strcmp (rec_existing->url, url) == 0)

Contrary to what the code expects, hash_table_get_pair does not set 
rec_existing to NULL if no record is found. So instead of checking for 
NULL, the function should check whether the return value of 
hash_table_get_pair is non-zero:


  int found = hash_table_get_pair (warc_cdx_dedup_table,
   sha1_digest_payload,
   &key, &rec_existing);
  if (found && strcmp (rec_existing->url, url) == 0)

The attached patch makes this change; with it, the deduplication works as intended.

Regards,

Gijs
From 807b98d7d9289765c9f210336d2dbf294d663f99 Mon Sep 17 00:00:00 2001
From: Gijs van Tulder gvtul...@gmail.com
Date: Wed, 30 May 2012 23:00:04 +0200
Subject: [PATCH] warc: Fix segfault if CDX record is not found.

---
 src/ChangeLog |4 
 src/warc.c|6 +++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 7e16b17..9e74e47 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,7 @@
+2012-05-30  Gijs van Tulder  gvtul...@gmail.com
+
+	* warc.c: Fix segfault if CDX record is not found.
+
 2011-05-26  Steven Schweda  s...@antinode.info
 	* connect.c [HAVE_SYS_SOCKET_H]: Include sys/socket.h.
 	[HAVE_SYS_SELECT_H]: Include sys/select.h.
diff --git a/src/warc.c b/src/warc.c
index 24751db..92a49ef 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -1001,10 +1001,10 @@ warc_find_duplicate_cdx_record (char *url, char *sha1_digest_payload)
 
   char *key;
   struct warc_cdx_record *rec_existing;
-  hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload, &key,
-   &rec_existing);
+  int found = hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload,
+   &key, &rec_existing);
 
-  if (rec_existing != NULL && strcmp (rec_existing->url, url) == 0)
+  if (found && strcmp (rec_existing->url, url) == 0)
 return rec_existing;
   else
 return NULL;
-- 
1.7.4.1



[Bug-wget] Combining --output-document with --recursive

2012-05-24 Thread Gijs van Tulder

Hi,

There's a problem if you combine --output-document with --recursive or 
--page-requisites. --output-document breaks the recursion.


First you get a warning:

  WARNING: combining -O with -r or -p will mean that all downloaded
  content will be placed in the single file you specified.

That is what you'd expect, no problem there.

However, there is a problem with the recursion. Because Wget *appends* 
all downloaded content to the same file, the HTML and CSS parsers get 
confused: the same content is parsed over and over again, each time with 
a different URL context.


Example:
1. You run wget -O out.tmp -r http://example.com/index.html
2. http://example.com/index.html is written to out.tmp.
   URLs are extracted from out.tmp relative to
   http://example.com/index.html. Suppose that there is a link to a
   subdirectory test/index.html, which is added to the download queue
   as http://example.com/test/index.html (correct).
3. http://example.com/test/index.html is appended to out.tmp.
   Now, again, Wget extracts URLs from out.tmp. It parses the whole
   file, so it first finds the contents of /index.html, with the link
   to test/index.html. Because Wget thinks it is now parsing
   http://example.com/test/index.html, it will enqueue this as
   http://example.com/test/test/index.html (wrong).

One obvious solution, which I've attached to this email, is to clear the 
output document before downloading the next file. This breaks the 
current behaviour, though, so maybe it's not a good idea. Is there a better 
solution?


Regards,

Gijs

--

index 8d4edba..502b68f 100644
--- a/src/http.c
+++ b/src/http.c
@@ -2888,7 +2888,18 @@ read_header:
 }
 }
   else
-fp = output_stream;
+{
+  fp = output_stream;
+  rewind (fp);
+  if (ftruncate (fileno (fp), 0) == -1)
+{
+  logprintf (LOG_NOTQUIET, "Could not truncate output file: %s\n",
+ strerror (errno));

+  CLOSE_INVALIDATE (sock);
+  xfree (head);
+  xfree_null (type);
+  return FOPENERR;
+}
+}

   /* Print fetch message, if opt.verbose.  */
   if (opt.verbose)



Re: [Bug-wget] Regular expression matching

2012-04-10 Thread Gijs van Tulder

Hi,

Here is a new version of the regular expressions patch. The new version 
combines POSIX (always, from gnulib) and PCRE (if available).


The patch adds these options:

 --accept-regex=...
 --reject-regex=...

 --regex-type=posix   for POSIX extended regexes (the default)
 --regex-type=pcrefor PCRE regexes (if PCRE is available)

In reference to the --match-query-string patch: since the regexes look 
at the complete URL, you can also use them to match the query string.


Regards,

Gijs
=== modified file 'ChangeLog'
--- ChangeLog	2012-03-25 11:47:53 +
+++ ChangeLog	2012-04-10 22:28:11 +
@@ -1,3 +1,8 @@
+2012-04-11  Gijs van Tulder  gvtul...@gmail.com
+
+	* bootstrap.conf (gnulib_modules): Include module `regex'.
+	* configure.ac: Check for PCRE library.
+
 2012-03-25 Ray Satiro raysat...@yahoo.com
 
 	* configure.ac: Fix build under mingw when OpenSSL is used.

=== modified file 'bootstrap.conf'
--- bootstrap.conf	2012-03-20 19:41:14 +
+++ bootstrap.conf	2012-04-04 15:09:08 +
@@ -58,6 +58,7 @@
 quote
 quotearg
 recv
+regex
 select
 send
 setsockopt

=== modified file 'configure.ac'
--- configure.ac	2012-03-25 11:47:53 +
+++ configure.ac	2012-04-10 21:59:48 +
@@ -532,6 +532,18 @@
   ])
 )
 
+dnl
+dnl Check for PCRE
+dnl
+
+AC_CHECK_HEADER(pcre.h,
+AC_CHECK_LIB(pcre, pcre_compile,
+  [LIBS="${LIBS} -lpcre"
+   AC_DEFINE([HAVE_LIBPCRE], 1,
+ [Define if libpcre is available.])
+  ])
+)
+
  
 dnl Needed by src/Makefile.am
AM_CONDITIONAL([IRI_IS_ENABLED], [test "X$iri" != "Xno"])

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-04-01 14:30:59 +
+++ src/ChangeLog	2012-04-10 22:30:28 +
@@ -1,3 +1,12 @@
+2012-04-11  Gijs van Tulder  gvtul...@gmail.com
+
+	* init.c: Add --accept-regex, --reject-regex and --regex-type.
+	* main.c: Likewise.
+	* options.c: Likewise.
+	* recur.c: Likewise.
+	* utils.c: Add regex-related functions.
+	* utils.h: Add regex-related functions.
+
 2012-04-01  Giuseppe Scrivano  gscriv...@gnu.org
 
 	* gnutls.c (wgnutls_read_timeout): Ensure timer is freed.

=== modified file 'src/init.c'
--- src/init.c	2012-03-08 09:00:51 +
+++ src/init.c	2012-04-10 22:10:10 +
@@ -46,6 +46,10 @@
 # endif
 #endif
 
#include <regex.h>
#ifdef HAVE_LIBPCRE
# include <pcre.h>
+#endif
 
 #ifdef HAVE_PWD_H
# include <pwd.h>
@@ -94,6 +98,7 @@
 CMD_DECLARE (cmd_spec_prefer_family);
 CMD_DECLARE (cmd_spec_progress);
 CMD_DECLARE (cmd_spec_recursive);
+CMD_DECLARE (cmd_spec_regex_type);
 CMD_DECLARE (cmd_spec_restrict_file_names);
 #ifdef HAVE_SSL
 CMD_DECLARE (cmd_spec_secure_protocol);
@@ -116,6 +121,7 @@
 } commands[] = {
   /* KEEP THIS LIST ALPHABETICALLY SORTED */
  { "accept",   &opt.accepts,   cmd_vector },
+  { "acceptregex",  &opt.acceptregex_s, cmd_string },
  { "addhostdir",   &opt.add_hostdir,   cmd_boolean },
  { "adjustextension",  &opt.adjust_extension,  cmd_boolean },
  { "alwaysrest",   &opt.always_rest,   cmd_boolean }, /* deprecated */
@@ -236,7 +242,9 @@
  { "reclevel", &opt.reclevel,  cmd_number_inf },
  { "recursive",NULL,   cmd_spec_recursive },
  { "referer",  &opt.referer,   cmd_string },
+  { "regextype",&opt.regex_type,cmd_spec_regex_type },
  { "reject",   &opt.rejects,   cmd_vector },
+  { "rejectregex",  &opt.rejectregex_s, cmd_string },
  { "relativeonly", &opt.relative_only, cmd_boolean },
  { "remoteencoding",   &opt.encoding_remote,   cmd_string },
  { "removelisting",&opt.remove_listing,cmd_boolean },
@@ -361,6 +369,8 @@
   opt.restrict_files_nonascii = false;
   opt.restrict_files_case = restrict_no_case_restriction;
 
+  opt.regex_type = regex_type_posix;
+
   opt.max_redirect = 20;
 
   opt.waitretry = 10;
@@ -1368,6 +1378,25 @@
   return true;
 }
 
+/* Validate --regex-type and set the choice.  */
+
+static bool
+cmd_spec_regex_type (const char *com, const char *val, void *place_ignored)
+{
+  static const struct decode_item choices[] = {
+{ "posix", regex_type_posix },
+#ifdef HAVE_LIBPCRE
+{ "pcre",  regex_type_pcre },
+#endif
+  };
+  int regex_type = regex_type_posix;
+  int ok = decode_string (val, choices, countof (choices), &regex_type);
+  if (!ok)
+fprintf (stderr, _("%s: %s: Invalid value %s.\n"), exec_name, com, quote (val));
+  opt.regex_type = regex_type;
+  return ok;
+}
+
 static bool
 cmd_spec_restrict_file_names (const char *com, const char *val, void *place_ignored)
 {

=== modified file 'src/main.c'
--- src/main.c	2012-03-05 21:23:06 +
+++ src/main.c	2012-04-10 22:25:56 +
@@ -158,6 +158,7 @@
 static struct cmdline_option option_data[] =
   {
{ "accept", 'A', OPT_VALUE, "accept", -1 },
+{ "accept-regex", 0, OPT_VALUE, "acceptregex", -1 },
{ "adjust-extension", 'E', OPT_BOOLEAN, "adjustextension", -1 },
{ "append-output", 'a', OPT__APPEND_OUTPUT, NULL, required_argument

[Bug-wget] Regular expression matching

2012-04-04 Thread Gijs van Tulder

Hi,

Here is a patch that adds the --acceptregex and --rejectregex options.

With these options it would be possible to do two things:

1. You can match complete URLs, instead of just the directory prefix or 
the file name suffix (which you can do with --accept and 
--include-directories).
2. You can use regular expressions to do the matching, which is 
sometimes easier than using a list of wildcard patterns.


Now this isn't a new idea (there are long discussions in the archive, 
see [1]). But somehow the previous attempts didn't make it, so I thought 
I'd send my own version. It's a small patch, I've been using it for a 
while and found it really useful.


I've made two versions of the patch: one uses PCRE, the other uses the 
gnulib regex library, which is probably easier to integrate.


Regards,

Gijs

[1] https://lists.gnu.org/archive/html/bug-wget/2009-09/msg00035.html
=== modified file 'bootstrap.conf'
--- bootstrap.conf	2012-03-20 19:41:14 +
+++ bootstrap.conf	2012-04-04 15:09:08 +
@@ -58,6 +58,7 @@
 quote
 quotearg
 recv
+regex
 select
 send
 setsockopt

=== modified file 'src/init.c'
--- src/init.c	2012-03-08 09:00:51 +
+++ src/init.c	2012-04-04 17:46:59 +
@@ -80,6 +80,7 @@
 CMD_DECLARE (cmd_directory_vector);
 CMD_DECLARE (cmd_number);
 CMD_DECLARE (cmd_number_inf);
+CMD_DECLARE (cmd_regex);
 CMD_DECLARE (cmd_string);
 CMD_DECLARE (cmd_file);
 CMD_DECLARE (cmd_directory);
@@ -116,6 +117,7 @@
 } commands[] = {
   /* KEEP THIS LIST ALPHABETICALLY SORTED */
  { "accept",   &opt.accepts,   cmd_vector },
+  { "acceptregex",  &opt.acceptregex,   cmd_regex },
  { "addhostdir",   &opt.add_hostdir,   cmd_boolean },
  { "adjustextension",  &opt.adjust_extension,  cmd_boolean },
  { "alwaysrest",   &opt.always_rest,   cmd_boolean }, /* deprecated */
@@ -237,6 +239,7 @@
  { "recursive",NULL,   cmd_spec_recursive },
  { "referer",  &opt.referer,   cmd_string },
  { "reject",   &opt.rejects,   cmd_vector },
+  { "rejectregex",  &opt.rejectregex,   cmd_regex },
  { "relativeonly", &opt.relative_only, cmd_boolean },
  { "remoteencoding",   &opt.encoding_remote,   cmd_string },
  { "removelisting",&opt.remove_listing,cmd_boolean },
@@ -943,6 +946,30 @@
   return true;
 }
 
+/* Compile the regular expression and store a
+   pointer to it in *PLACE.  */
+static bool
+cmd_regex (const char *com, const char *val, void *place)
+{
+  regex_t **regex = (regex_t **)place;
+  *regex = malloc (sizeof (regex_t));
+
+  int errcode = regcomp (*regex, val, REG_EXTENDED | REG_NOSUB);
+
+  if (errcode != 0)
+{
+  int errbuf_size = regerror (errcode, *regex, NULL, 0);
+  char *errbuf = malloc (errbuf_size);
+  errbuf_size = regerror (errcode, *regex, errbuf, errbuf_size);
+  fprintf (stderr, _("%s: %s: Invalid regular expression %s, %s\n"),
+   exec_name, com, quote (val), errbuf);
+  xfree (errbuf);
+  return false;
+}
+
+  return true;
+}
+
 
 /* Like the above, but handles tilde-expansion when reading a user's
`.wgetrc'.  In that case, and if VAL begins with `~', the tilde

=== modified file 'src/main.c'
--- src/main.c	2012-03-05 21:23:06 +
+++ src/main.c	2012-04-04 15:15:50 +
@@ -158,6 +158,7 @@
 static struct cmdline_option option_data[] =
   {
{ "accept", 'A', OPT_VALUE, "accept", -1 },
+{ "acceptregex", 0, OPT_VALUE, "acceptregex", -1 },
{ "adjust-extension", 'E', OPT_BOOLEAN, "adjustextension", -1 },
{ "append-output", 'a', OPT__APPEND_OUTPUT, NULL, required_argument },
{ "ask-password", 0, OPT_BOOLEAN, "askpassword", -1 },
@@ -263,6 +264,7 @@
{ "recursive", 'r', OPT_BOOLEAN, "recursive", -1 },
{ "referer", 0, OPT_VALUE, "referer", -1 },
{ "reject", 'R', OPT_VALUE, "reject", -1 },
+{ "rejectregex", 0, OPT_VALUE, "rejectregex", -1 },
{ "relative", 'L', OPT_BOOLEAN, "relativeonly", -1 },
{ "remote-encoding", 0, OPT_VALUE, "remoteencoding", -1 },
{ "remove-listing", 0, OPT_BOOLEAN, "removelisting", -1 },
@@ -723,6 +725,10 @@
N_("\
  -R,  --reject=LIST   comma-separated list of rejected extensions.\n"),
N_("\
+   --acceptregex=REGEX extended regex matching accepted URLs.\n"),
+N_("\
+   --rejectregex=REGEX extended regex matching rejected URLs.\n"),
+N_("\
  -D,  --domains=LIST  comma-separated list of accepted domains.\n"),
N_("\
--exclude-domains=LIST  comma-separated list of rejected domains.\n"),

=== modified file 'src/options.h'
--- src/options.h	2012-03-05 21:23:06 +
+++ src/options.h	2012-04-04 17:43:42 +
@@ -29,6 +29,8 @@
 shall include the source code for the parts of OpenSSL used as well
 as that of the covered work.  */
 
+#include <regex.h>
+
 struct options
 {
   int verbose;			/* Are we verbose?  (First set to -1,
@@ -74,6 +76,9 @@
   bool ignore_case;		/* Whether to ignore case when
    matching dirs and files */
 
+  regex_t *acceptregex;		/* Patterns 

Re: [Bug-wget] Regular expression matching

2012-04-04 Thread Gijs van Tulder

Ángel González wrote:
> I really like PCRE, but I think the default should be POSIX regex

Certainly. (I'm not sure if it's even worth adding the PCRE option. 
Matching URLs can't be that hard, can it?)


> How are the interactions between --{accept,reject}regex and
> --{accept,reject}?

The regex options are just another group of options in the list of 
accept/reject checks: if a URL doesn't pass one of the tests, it's rejected.


Regards,

Gijs



[Bug-wget] Fix for crash on invalid STYLE tag

2012-04-01 Thread Gijs van Tulder

Hi,

Here's a tiny fix for a problem in the HTML parsing in html-url.c.

Wget crashes on HTML files that contain an incomplete STYLE tag, e.g.:

  <style /style>

If it finds one of those, it calls get_urls_css with an invalid buffer 
(the buffer has a negative length), which leads to this crash:


  bad buffer in yy_scan_bytes()
  ERROR (2)

The attached patch checks the buffer before calling get_urls_css. The 
content of the incomplete tag still won't be parsed, but at least it 
will no longer lead to a crash.


Regards,

Gijs
=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-03-29 18:13:27 +
+++ src/ChangeLog	2012-04-01 20:35:28 +
@@ -1,3 +1,7 @@
+2012-04-01  Gijs van Tulder  <gvtul...@gmail.com> (tiny change)
+
+	* html-url.c: Prevent crash on incomplete STYLE tag.
+
 2012-03-29  From: Tim Ruehsen <tim.rueh...@gmx.de> (tiny change)
 
 	* utils.c (library): Include sys/time.h.

=== modified file 'src/html-url.c'
--- src/html-url.c	2011-04-24 11:03:48 +
+++ src/html-url.c	2012-04-01 16:08:18 +
@@ -676,7 +676,8 @@
   check_style_attr (tag, ctx);
 
   if (tag->end_tag_p && (0 == strcasecmp (tag->name, "style"))
-      && tag->contents_begin && tag->contents_end)
+      && tag->contents_begin && tag->contents_end
+      && tag->contents_begin <= tag->contents_end)
   {
 /* parse contents */
 get_urls_css (ctx, tag->contents_begin - ctx->text,



[Bug-wget] Fix: Large files in WARC

2012-01-31 Thread Gijs van Tulder

Hi,

Another small problem in the WARC section: wget crashes with a 
segmentation fault if you have WARC output enabled and try to download a 
file larger than 2GB. I think this is because of the size_t, ftell and 
fseek in warc.c.


The attached patch changes the references from size_t to off_t, ftell to 
ftello, fseek to fseeko. On my 64-bit system this seemed to fix the 
problem (but I'm not an expert in these matters, so maybe this doesn't 
hold for 32-bit systems).


Regards,

Gijs
=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-01-28 13:09:29 +
+++ src/ChangeLog	2012-01-31 23:16:33 +
@@ -1,3 +1,9 @@
+2012-02-01  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* warc.c: Fix large file support with ftello, fseeko.
+	* warc.h: Fix large file support.
+	* http.c: Fix large file support.
+
 2012-01-27  Gijs van Tulder  <gvtul...@gmail.com>
 
 	* retr.c (fd_read_body): If the response is chunked, the chunk

=== modified file 'src/http.c'
--- src/http.c	2012-01-28 13:08:52 +
+++ src/http.c	2012-01-31 22:34:45 +
@@ -1712,7 +1712,7 @@
   char warc_timestamp_str [21];
   char warc_request_uuid [48];
   ip_address *warc_ip = NULL;
-  long int warc_payload_offset = -1;
+  off_t warc_payload_offset = -1;
 
   /* Whether this connection will be kept alive after the HTTP request
  is done. */
@@ -2127,7 +2127,7 @@
   if (write_error >= 0 && warc_tmp != NULL)
 {
   /* Remember end of headers / start of payload. */
-  warc_payload_offset = ftell (warc_tmp);
+  warc_payload_offset = ftello (warc_tmp);
 
   /* Write a copy of the data to the WARC record. */
   int warc_tmp_written = fwrite (opt.post_data, 1, post_data_size, warc_tmp);
@@ -2139,7 +2139,7 @@
 {
   if (warc_tmp != NULL)
 /* Remember end of headers / start of payload. */
-warc_payload_offset = ftell (warc_tmp);
+warc_payload_offset = ftello (warc_tmp);
 
   write_error = post_file (sock, opt.post_file_name, post_data_size, warc_tmp);
 }

=== modified file 'src/warc.c'
--- src/warc.c	2012-01-11 14:27:06 +
+++ src/warc.c	2012-01-31 22:35:00 +
@@ -50,10 +50,10 @@
 static gzFile *warc_current_gzfile;
 
 /* The offset of the current gzip record in the WARC file. */
-static size_t warc_current_gzfile_offset;
+static off_t warc_current_gzfile_offset;
 
 /* The uncompressed size (so far) of the current record. */
-static size_t warc_current_gzfile_uncompressed_size;
+static off_t warc_current_gzfile_uncompressed_size;
 # endif
 
 /* This is true until a warc_write_* method fails. */
@@ -158,7 +158,7 @@
 return false;
 
   fflush (warc_current_file);
-  if (opt.warc_maxsize > 0 && ftell (warc_current_file) >= opt.warc_maxsize)
+  if (opt.warc_maxsize > 0 && ftello (warc_current_file) >= opt.warc_maxsize)
 warc_start_new_file (false);
 
 #ifdef HAVE_LIBZ
@@ -166,7 +166,7 @@
   if (opt.warc_compression_enabled)
 {
   /* Record the starting offset of the new record. */
-  warc_current_gzfile_offset = ftell (warc_current_file);
+  warc_current_gzfile_offset = ftello (warc_current_file);
 
   /* Reserve space for the extra GZIP header field.
  In warc_write_end_record we will fill this space
@@ -217,8 +217,8 @@
 {
   /* Add the Content-Length header. */
   char *content_length;
-  fseek (data_in, 0L, SEEK_END);
-  if (! asprintf (&content_length, "%ld", ftell (data_in)))
+  fseeko (data_in, 0L, SEEK_END);
+  if (! asprintf (&content_length, "%ld", ftello (data_in)))
 {
   warc_write_ok = false;
   return false;
@@ -229,7 +229,7 @@
   /* End of the WARC header section. */
   warc_write_string ("\r\n");
 
-  if (fseek (data_in, 0L, SEEK_SET) != 0)
+  if (fseeko (data_in, 0L, SEEK_SET) != 0)
 warc_write_ok = false;
 
   /* Copy the data in the file to the WARC record. */
@@ -266,7 +266,7 @@
 }
 
   fflush (warc_current_file);
-  fseek (warc_current_file, 0, SEEK_END);
+  fseeko (warc_current_file, 0, SEEK_END);
 
   /* The WARC standard suggests that we add 'skip length' data in the
  extra header field of the GZIP stream.
@@ -284,12 +284,12 @@
   */
 
   /* Calculate the uncompressed and compressed sizes. */
-  size_t current_offset = ftell (warc_current_file);
-  size_t uncompressed_size = current_offset - warc_current_gzfile_offset;
-  size_t compressed_size = warc_current_gzfile_uncompressed_size;
+  off_t current_offset = ftello (warc_current_file);
+  off_t uncompressed_size = current_offset - warc_current_gzfile_offset;
+  off_t compressed_size = warc_current_gzfile_uncompressed_size;
 
   /* Go back to the static GZIP header. */
-  fseek (warc_current_file, warc_current_gzfile_offset + EXTRA_GZIP_HEADER_SIZE, SEEK_SET);
+  fseeko (warc_current_file, warc_current_gzfile_offset + EXTRA_GZIP_HEADER_SIZE, SEEK_SET);
 
   /* Read the header. */
   char static_header

[Bug-wget] Two fixes: Memory leak with chunked responses / Chunked responses and WARC files

2012-01-27 Thread Gijs van Tulder

Hi,

Here are two small patches. I hope they will be useful.

First, a patch that fixes a memory leak in fd_read_body (src/retr.c) and 
skip_short_body (src/http.c) when it retrieves a response with 
Transfer-Encoding: chunked. Both functions make calls to fd_read_line 
but never free the result.


Second, a patch to the fd_read_body function that changes the way 
chunked responses are saved in the WARC file. Until now, wget would 
write a de-chunked response to the WARC file, which is wrong: the WARC 
file is supposed to have an exact copy of the HTTP response, so it 
should also include the chunk headers.


The first patch fixes the memory leaks. The second patch changes 
fd_read_body to save the full, chunked response in the WARC file.


Regards,

Gijs

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-01-11 14:27:06 +
+++ src/ChangeLog	2012-01-26 21:30:19 +
@@ -1,3 +1,8 @@
+2012-01-27  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* retr.c (fd_read_body): Fix a memory leak with chunked responses.
+	* http.c (skip_short_body): Fix the same memory leak.
+
 2012-01-09  Gijs van Tulder  <gvtul...@gmail.com>
 
 	* init.c: Disable WARC compression if zlib is disabled.

=== modified file 'src/http.c'
--- src/http.c	2012-01-08 23:03:23 +
+++ src/http.c	2012-01-26 21:30:19 +
@@ -951,9 +951,12 @@
 break;
 
   remaining_chunk_size = strtol (line, &endl, 16);
+  xfree (line);
+
   if (remaining_chunk_size == 0)
 {
-  fd_read_line (fd);
+  line = fd_read_line (fd);
+  xfree_null (line);
   break;
 }
 }
@@ -978,8 +981,13 @@
 {
   remaining_chunk_size -= ret;
   if (remaining_chunk_size == 0)
-if (fd_read_line (fd) == NULL)
-  return false;
+{
+  char *line = fd_read_line (fd);
+  if (line == NULL)
+return false;
+  else
+xfree (line);
+}
 }
 
   /* Safe even if %.*s bogusly expects terminating \0 because

=== modified file 'src/retr.c'
--- src/retr.c	2011-11-04 21:25:00 +
+++ src/retr.c	2012-01-26 21:30:19 +
@@ -307,11 +307,16 @@
 }
 
   remaining_chunk_size = strtol (line, &endl, 16);
+  xfree (line);
+
   if (remaining_chunk_size == 0)
 {
   ret = 0;
-  if (fd_read_line (fd) == NULL)
+  line = fd_read_line (fd);
+  if (line == NULL)
 ret = -1;
+  else
+xfree (line);
   break;
 }
 }
@@ -371,11 +376,16 @@
 {
   remaining_chunk_size -= ret;
   if (remaining_chunk_size == 0)
-if (fd_read_line (fd) == NULL)
-  {
-ret = -1;
-break;
-  }
+{
+  char *line = fd_read_line (fd);
+  if (line == NULL)
+{
+  ret = -1;
+  break;
+}
+  else
+xfree (line);
+}
 }
 }
 


=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-01-26 21:30:19 +
+++ src/ChangeLog	2012-01-26 21:56:27 +
@@ -1,3 +1,9 @@
+2012-01-27  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* retr.c (fd_read_body): If the response is chunked, the chunk
+	headers are now written to the WARC file, making the WARC file
+	an exact copy of the HTTP response.
+
 2012-01-27  Gijs van Tulder  <gvtul...@gmail.com>
 
 	* retr.c (fd_read_body): Fix a memory leak with chunked responses.
 	* http.c (skip_short_body): Fix the same memory leak.

=== modified file 'src/retr.c'
--- src/retr.c	2012-01-26 21:30:19 +
+++ src/retr.c	2012-01-26 21:56:27 +
@@ -213,6 +213,9 @@
the data is stored to ELAPSED.
 
If OUT2 is non-NULL, the contents is also written to OUT2.
+   OUT2 will get an exact copy of the response: if this is a chunked
+   response, everything -- including the chunk headers -- is written
+   to OUT2.  (OUT will only get the unchunked response.)
 
The function exits and returns the amount of data read.  In case of
error while reading data, -1 is returned.  In case of error while
@@ -305,6 +308,8 @@
   ret = -1;
   break;
 }
+  else if (out2 != NULL)
+fwrite (line, 1, strlen (line), out2);
 
   remaining_chunk_size = strtol (line, &endl, 16);
   xfree (line);
@@ -316,7 +321,11 @@
   if (line == NULL)
 ret = -1;
   else
-xfree (line);
+{
+  if (out2 != NULL

Re: [Bug-wget] Cannot compile current bzr trunk: undefined reference to `gzwrite' / `gzclose' / `gzdopen'

2012-01-09 Thread Gijs van Tulder

Hi all,

The attached patch should hopefully fix Evgenii's problem.

The patch changes the configure script to always use libz, unless it is 
explicitly disabled. In that case, the patch makes sure that the WARC 
functions do not use gzip but write to uncompressed files instead.


The funny thing is that libz was already included with the SSL support. 
Unless you compiled wget with --without-ssl, libz was always compiled in 
(even if you configured with --without-zlib).


Regards,

Gijs


On 09-01-12 02:15, Evgenii Philippov wrote:
> Actually I currently close my work on wget.
>
> So these messages are just bug reports for wget collaborators.
>
> Some additional info:
>
> export PS1=Ok\
> Ok
> uname -asm
> Linux host_name 2.6.38-13-generic #53-Ubuntu SMP Mon Nov 28 19:33:45
> UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
> Ok
> lsb_release -dr
> Description: Ubuntu 11.04
> Release: 11.04
> Ok
>
> With best regards,
> Thank you for a wonderful utility,
> --
> Evgeniy


=== modified file 'ChangeLog'
--- ChangeLog	2011-12-12 20:30:39 +
+++ ChangeLog	2012-01-09 13:40:01 +
@@ -1,3 +1,7 @@
+2012-01-09  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* configure.ac: Always try to use libz, even without SSL.
+
 2011-12-12  Giuseppe Scrivano  <gscriv...@gnu.org>
 
 	* Makefile.am (EXTRA_DIST): Add build-aux/bzr-version-gen.

=== modified file 'configure.ac'
--- configure.ac	2011-11-04 21:25:00 +
+++ configure.ac	2012-01-09 13:40:01 +
@@ -65,6 +65,9 @@
 [[  --without-ssl   disable SSL autodetection
   --with-ssl={gnutls,openssl} specify the SSL backend.  GNU TLS is the default.]])
 
+AC_ARG_WITH(zlib,
+[[  --without-zlib  disable zlib ]])
+
 AC_ARG_ENABLE(opie,
 [  --disable-opie  disable support for opie or s/key FTP login],
 ENABLE_OPIE=$enableval, ENABLE_OPIE=yes)
@@ -234,6 +237,10 @@
 dnl Checks for libraries.
 dnl
 
+AS_IF([test x$with_zlib != xno], [
+  AC_CHECK_LIB(z, compress)
+])
+
 AS_IF([test x$with_ssl = xopenssl], [
 dnl some versions of openssl use zlib compression
 AC_CHECK_LIB(z, compress)

=== modified file 'src/ChangeLog'
--- src/ChangeLog	2012-01-08 23:03:23 +
+++ src/ChangeLog	2012-01-09 13:40:01 +
@@ -1,3 +1,10 @@
+2012-01-09  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* init.c: Disable WARC compression if zlib is disabled.
+	* main.c: Do not show the 'no-warc-compression' option if zlib is
+	disabled.
+	* warc.c: Do not compress WARC files if zlib is disabled.
+
 2012-01-09  Sasikantha Babu  <sasikanth@gmail.com> (tiny change)
 	* connect.c (connect_to_ip): properly formatted ipv6 address display.
 	(socket_family): New function - returns socket family type.

=== modified file 'src/init.c'
--- src/init.c	2011-11-04 21:25:00 +
+++ src/init.c	2012-01-09 13:40:01 +
@@ -267,7 +267,9 @@
   { "waitretry",        &opt.waitretry,         cmd_time },
   { "warccdx",          &opt.warc_cdx_enabled,  cmd_boolean },
   { "warccdxdedup",     &opt.warc_cdx_dedup_filename,  cmd_file },
+#ifdef HAVE_LIBZ
   { "warccompression",  &opt.warc_compression_enabled, cmd_boolean },
+#endif
   { "warcdigests",      &opt.warc_digests_enabled, cmd_boolean },
   { "warcfile",         &opt.warc_filename,     cmd_file },
   { "warcheader",       NULL,                   cmd_spec_warc_header },
@@ -374,7 +376,11 @@
   opt.show_all_dns_entries = false;
 
   opt.warc_maxsize = 0; /* 1024 * 1024 * 1024; */
+#ifdef HAVE_LIBZ
   opt.warc_compression_enabled = true;
+#else
+  opt.warc_compression_enabled = false;
+#endif
   opt.warc_digests_enabled = true;
   opt.warc_cdx_enabled = false;
   opt.warc_cdx_dedup_filename = NULL;

=== modified file 'src/main.c'
--- src/main.c	2011-11-04 21:25:00 +
+++ src/main.c	2012-01-09 13:40:01 +
@@ -289,7 +289,9 @@
 { "wait", 'w', OPT_VALUE, "wait", -1 },
 { "waitretry", 0, OPT_VALUE, "waitretry", -1 },
 { "warc-cdx", 0, OPT_BOOLEAN, "warccdx", -1 },
+#ifdef HAVE_LIBZ
 { "warc-compression", 0, OPT_BOOLEAN, "warccompression", -1 },
+#endif
 { "warc-dedup", 0, OPT_VALUE, "warccdxdedup", -1 },
 { "warc-digests", 0, OPT_BOOLEAN, "warcdigests", -1 },
 { "warc-file", 0, OPT_VALUE, "warcfile", -1 },
@@ -674,8 +676,10 @@
--warc-cdx            write CDX index files.\n"),
 N_("\
--warc-dedup=FILENAME do not store records listed in this CDX file.\n"),
+#ifdef HAVE_LIBZ
 N_("\
--no-warc-compression do not compress WARC files with GZIP.\n"),
+#endif
 N_("\
--no-warc-digests do not calculate SHA1 digests.\n"),
 N_("\

=== modified file 'src/warc.c'
--- src/warc.c	2011-11-20 17:28:19 +
+++ src/warc.c	2012-01-09 13:40:01 +
@@ -14,7 +14,9 @@
 #include "sha1.h"
 #include "base32.h"
 #include <unistd.h>
+#ifdef HAVE_LIBZ
 #include <zlib.h>
+#endif
 #ifdef HAVE_LIBUUID
 #include <uuid/uuid.h>
 #endif
@@ -42,6 +44,7 @@
 /* The current WARC file (or NULL, if WARC is disabled). */
 static FILE *warc_current_file;
 
+#ifdef HAVE_LIBZ
 /* The gzip stream for the current WARC file
(or NULL, if WARC or gzip is disabled). */
 static gzFile

Re: [Bug-wget] WARC, new version

2011-11-04 Thread Gijs van Tulder

> lovely.  I am going to push it soon with some small adjustments.

That's good to hear.

There's one other small adjustment that you may want to make, see the 
attached patch. One of the WARC functions uses the basename function, 
which causes problems on OS X. Including libgen.h and strdup-ing the 
output of basename seems to solve this problem.


Thanks,

Gijs


On 04-11-11 22:27, Giuseppe Scrivano wrote:

> Gijs van Tulder <gvtul...@gmail.com> writes:
>
>> Hi Giuseppe,
>>
>> * I've changed the configure.ac and src/Makefile.am.
>> * I've added a ChangeLog entry.
>
> lovely.  I am going to push it soon with some small adjustments.
>
> Thanks for the great work.  Whenever it happens to be in the same place,
> I'll buy you a beer :-)
>
> Cheers,
> Giuseppe


--- a/src/warc.c	2011-11-04 17:41:11.383704054 +0100
+++ b/src/warc.c	2011-11-04 23:06:28.693712714 +0100
@@ -19,6 +19,10 @@
 #include <uuid/uuid.h>
 #endif
 
+#ifndef WINDOWS
+#include <libgen.h>
+#endif
+
 #include "warc.h"
 
 extern char *version_string;
@@ -605,7 +609,7 @@
 
   char *filename_copy, *filename_basename;
   filename_copy = strdup (filename);
-  filename_basename = basename (filename_copy);
+  filename_basename = strdup (basename (filename_copy));
 
   warc_write_start_record ();
   warc_write_header ("WARC-Type", "warcinfo");
@@ -619,6 +623,7 @@
   if (warc_tmp == NULL)
 {
   free (filename_copy);
+  free (filename_basename);
   return false;
 }
 
@@ -646,6 +651,7 @@
 }
 
   free (filename_copy);
+  free (filename_basename);
   fclose (warc_tmp);
   return warc_write_ok;
 }


[Bug-wget] Memory leak when using GnuTLS

2011-10-31 Thread Gijs van Tulder

Hi,

I think there is a memory leak in the GnuTLS part of wget. When 
downloading multiple files from a HTTPS server, wget with GnuTLS uses a 
lot of memory.


Perhaps an explanation for this can be found in src/http.c: the gethttp 
function calls ssl_init for each download:


 /* Initialize the SSL context.  After this has once been done,
it becomes a no-op.  */
 if (!ssl_init ())

The OpenSSL version of ssl_init, in src/openssl.c, checks if SSL has 
already been initialized and doesn't repeat the work.


But the GnuTLS version doesn't:

 bool
 ssl_init ()
 {
   const char *ca_directory;
   DIR *dir;

   gnutls_global_init ();
   gnutls_certificate_allocate_credentials (credentials);

GnuTLS is initialized again and again, but there is never a call to 
gnutls_global_deinit.


I've attached a small patch to add a check to ssl_init in src/gnutls.c, 
similar to the check already in src/openssl.c. With it, wget can still 
download over HTTPS and the memory usage stays within reasonable limits.


Thanks,

Gijs
=== modified file 'src/gnutls.c'
--- src/gnutls.c	2011-09-04 11:30:01 +
+++ src/gnutls.c	2011-10-31 22:58:38 +
@@ -59,10 +59,17 @@
confused with actual gnutls functions -- such as the gnutls_read
preprocessor macro.  */
 
+/* Becomes true if GnuTLS is initialized. */
+static bool ssl_initialized = false;
+
 static gnutls_certificate_credentials credentials;
 bool
 ssl_init ()
 {
+  /* GnuTLS should be initialized only once. */
+  if (ssl_initialized)
+return true;
+
   const char *ca_directory;
   DIR *dir;
 
@@ -104,6 +111,9 @@
   if (opt.ca_cert)
 gnutls_certificate_set_x509_trust_file (credentials, opt.ca_cert,
 GNUTLS_X509_FMT_PEM);
+
+  ssl_initialized = true;
+
   return true;
 }
 



Re: [Bug-wget] WARC, new version

2011-10-30 Thread Gijs van Tulder

Hi David,

David H. Lipman wrote:

> I have seen WARC mentioned but have not seen a definition.


WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web 
resources. It is used for making archives of web sites. The Internet 
Archive, for example, uses it as the file format for their Wayback 
Machine and Heritrix crawler.


The nice thing about WARC is that it lets you store all information 
about your web crawl: the files you download, of course, but also things 
like the HTTP request and response headers, information about redirects 
and error pages. WARC also provides a place to keep the related 
metadata. It is, in short, a way to store everything, in a standardized 
file format.


Adding WARC to wget means that you'll be able to do things like

  wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu

which will produce (next to the normal wget download) a file named 
'gnu.warc.gz' that contains every HTTP request and every HTTP response 
that wget made. This is an 'archival grade' copy of the mirrored site.


Once you have the WARC file, you could store it in your archive, extract 
files, run your own local Wayback Machine [2, 3].


wget is already a very useful tool for making a quick copy of a website; 
adding WARC support helps to make that copy as complete as possible.


Maybe that answers some of your questions?

Regards,

Gijs


[1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[2] http://archive-access.sourceforge.net/projects/wayback/
[3] http://netpreserve.org/software/downloads.php



Re: [Bug-wget] WARC output

2011-09-26 Thread Gijs van Tulder

> can you please send a complete diff against the current development
> tree version?

Here's the diff of the WARC additions (1.9MB zipped) to revision 2565:

 http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2

Thanks,

Gijs



Re: [Bug-wget] WARC output

2011-09-25 Thread Gijs van Tulder

Hi.

It's been a while since we've discussed the WARC addition to Wget. Is 
there anything I can help with?


Gijs



Re: [Bug-wget] WARC output

2011-08-10 Thread Gijs van Tulder

Giuseppe Scrivano writes:

>> The implementation makes use of the open source WARC Tools library
>> (Apache License 2.0):
>>   http://code.google.com/p/warc-tools/
>
> how much code is really needed from that library?  I wonder if we can
> avoid this dependency at all.

The library comes with some utilities, an HTTrack plugin, a Java module 
etc. These extra things are not needed for Wget. But of the C library, I 
used pretty much everything. The library handles all the WARC writing 
stuff. It can also read WARCs, but that's not needed here.


Rough estimate: 12,000 lines of code (excluding comments).

It's probably important to note that I have changed a few small things 
in the warc-tools library. (I have records in Git.)



As for the other dependencies:
- I used an MIT-licenced base32 encoder (there seems to be no such
  module in Gnulib), but that's quite small so could be replaced;
- it links to the UUID library.


> Can you please track all contributors?  Any contribution to GNU wget
> requires copyright assignments to the FSF.

Yes, it's all in the Git history, so it's easy to make a list. (There's 
only one other contributor of code, others helped with testing.)


> In the meanwhile, can you check if you are following the GNU Coding
> Standards for the new code?

I tried to do that. So except for the warc-tools library, which uses a 
different standard, all new code follows the GNU standards (I hope).


Thanks,

Gijs



[Bug-wget] WARC output

2011-08-09 Thread Gijs van Tulder

Hi,

I'd like to propose a new feature that allows Wget to make WARC files.

Perhaps you're already familiar with it, but in short: WARC is a file 
format for web archives. In a single WARC file, you can store every file 
of the website, plus the HTTP request and response headers and other 
metadata. This makes it a very useful format for web archivists: you 
keep everything together, in the most detailed and original form.


The WARC format (an ISO standard, ISO 28500) has been developed by the 
International Internet Preservation Consortium, which includes the 
Internet Archive and many national libraries. It is supposed to become 
*the* standard file format for web archives. For example, it is used in 
the Internet Archive's Wayback Machine and its Heritrix crawler. There 
are several projects building tools to work with WARC files.



It would be cool if Wget could become one of these tools. Wget is already 
the Swiss army knife for mirroring websites; the one thing it is missing 
is a good way to store these mirrors. The current output of --mirror is 
not sufficient for archival purposes:


 - it throws away the HTTP headers (of the request and response);
 - it doesn't keep 404 pages and redirects;
 - it doesn't store the original urls but mangles the filenames;
 - and, if you're not careful, it even rewrites the links inside
   the documents that it has downloaded.

The WARC format supports these things.


With some help from others, I've added WARC functions to Wget. With the 
--warc-file option you can specify that the mirror should also be 
written to a WARC archive. Wget will then keep everything, including the 
HTTP request and response headers, redirects and 404 pages.


Do you think this is something that could be included in the main Wget 
version? If that's the case, what should be the next step?


Description, links to more information about WARC:
 http://www.archiveteam.org/index.php?title=Wget_with_WARC_output

Code:
 https://github.com/alard/wget-warc/
 https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2

The implementation makes use of the open source WARC Tools library
(Apache License 2.0):
 http://code.google.com/p/warc-tools/


I look forward to your response.

Kind regards,

Gijs van Tulder