Re: [Bug-wget] WARC output

2011-10-08 Thread Giuseppe Scrivano

Hi Gijs,


Gijs van Tulder gvtul...@gmail.com writes:

 can you please send a complete diff against the current development
 tree version?

 Here's the diff of the WARC additions (1.9MB zipped) to revision 2565:

  http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2

the patch is huge and I think we don't want to add so many files into
the wget tree.  Can't we assume the user will install the warc tools by
herself and let configure check if they are installed or not?  This will
require some more work but the result will be much less intrusive.  What
do you think?

Thanks,
Giuseppe

Re: [Bug-wget] WARC output

2011-10-08 Thread Ángel González


Giuseppe Scrivano wrote:

 the patch is huge and I think we don't want to add so many files into
 the wget tree.  Can't we assume the user will install the warc tools by
 herself and let configure check if they are installed or not?  This will
 require some more work but the result will be much less intrusive.  What
 do you think?

 Thanks,
 Giuseppe

I don't think all those files are even remotely needed.
I see, for instance, Python files for creating WARCs by interacting
with curl.

Why would that be useful in the wget repository?
I (optimistically) think we could make WARC files with a simpler
implementation.

Also, the patch seems to duplicate code (compare lines 337731-337810 with
337944-338013 in the patch file). Surely that could be refactored?

Re: [Bug-wget] WARC output

2011-09-26 Thread Giuseppe Scrivano

Gijs van Tulder gvtul...@gmail.com writes:

 Hi.

 It's been a while since we've discussed the WARC addition to Wget. Is
 there anything I can help with?

can you please send a complete diff against the current development tree
version?

I'll take a look at it ASAP.

Thanks,
Giuseppe

Re: [Bug-wget] WARC output

2011-09-26 Thread Gijs van Tulder


 can you please send a complete diff against the current development
 tree version?

Here's the diff of the WARC additions (1.9MB zipped) to revision 2565:

 http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2

Thanks,

Gijs

Re: [Bug-wget] WARC output

2011-09-25 Thread Gijs van Tulder


Hi.

It's been a while since we've discussed the WARC addition to Wget. Is 
there anything I can help with?


Gijs

Re: [Bug-wget] WARC output

2011-08-10 Thread Giuseppe Scrivano

Gijs van Tulder gvtul...@gmail.com writes:

 It would be cool if Wget could become one of these tools. Already the
 Swiss army knife for mirroring websites, the one thing that Wget is
 missing is a good way to store these mirrors. The current output of
 --mirror is not sufficient for archival purposes:

Sure we do!



 With some help from others, I've added WARC functions to Wget. With
 the --warc-file option you can specify that the mirror should also be
 written to a WARC archive. Wget will then keep everything, including

Can you please track all contributors?  Any contribution to GNU wget
requires copyright assignments to the FSF.



 Do you think this is something that could be included in the main Wget
 version? If that's the case, what should be the next step?

Sure, I will take a look at the code in the next few days.  In the
meantime, can you check if you are following the GNU Coding Standards
for the new code[1]?



 The implementation makes use of the open source WARC Tools library
 (Apache License 2.0):
  http://code.google.com/p/warc-tools/

how much code is really needed from that library?  I wonder if we can
avoid this dependency at all.

Cheers,
Giuseppe



[1] http://www.gnu.org/prep/standards/

Re: [Bug-wget] WARC output

2011-08-10 Thread Gijs van Tulder


Giuseppe Scrivano writes:

 The implementation makes use of the open source WARC Tools library
 (Apache License 2.0):
   http://code.google.com/p/warc-tools/

 how much code is really needed from that library?  I wonder if we can
 avoid this dependency at all.

The library comes with some utilities, an HTTrack plugin, a Java module 
etc. These extra things are not needed for Wget. But of the C library, I 
used pretty much everything. The library handles all the WARC writing 
stuff. It can also read WARCs, but that's not needed here.


Rough estimate: 12,000 lines of code (excluding comments).

It's probably important to note that I have changed a few small things 
in the warc-tools library. (I have records in Git.)



As for the other dependencies:
- I used an MIT-licenced base32 encoder (there seems to be no such
  module in Gnulib), but that's quite small so could be replaced;
- it links to the UUID library.
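
To give an idea of how small the base32 part is, here is a minimal sketch of an RFC 4648 encoder in C. This is not the MIT-licensed encoder used in the patch; the name `base32_encode` and the buffer-size contract are assumptions made for illustration. (WARC digest fields are conventionally written as base32-encoded SHA-1 values, which is presumably why an encoder is needed at all.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* RFC 4648 base32 alphabet.  */
static const char base32_alphabet[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

/* Hypothetical helper: encode LEN bytes from IN as padded base32 into OUT.
   OUT must have room for 8 * ((LEN + 4) / 5) + 1 bytes.
   Returns the number of characters written, excluding the final NUL.  */
size_t
base32_encode (const unsigned char *in, size_t len, char *out)
{
  /* Significant output characters for a group of 0..5 input bytes.  */
  static const int out_chars[] = { 0, 2, 4, 5, 7, 8 };
  size_t i, o = 0;

  assert (in != NULL || len == 0);
  assert (out != NULL);

  for (i = 0; i < len; i += 5)
    {
      size_t chunk = len - i < 5 ? len - i : 5;
      unsigned char buf[5] = { 0, 0, 0, 0, 0 };
      unsigned long long bits = 0;
      int j;

      /* Pack up to five bytes into one zero-padded 40-bit group.  */
      memcpy (buf, in + i, chunk);
      for (j = 0; j < 5; j++)
        bits = (bits << 8) | buf[j];

      /* Emit eight 5-bit symbols, replacing unused ones with '='.  */
      for (j = 0; j < 8; j++)
        {
          if (j < out_chars[chunk])
            out[o++] = base32_alphabet[(bits >> (35 - 5 * j)) & 0x1f];
          else
            out[o++] = '=';
        }
    }

  out[o] = '\0';
  return o;
}
```

A 20-byte SHA-1 digest encodes to exactly 32 characters with no padding, so a 33-byte output buffer is enough for that case.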


 Can you please track all contributors?  Any contribution to GNU wget
 requires copyright assignments to the FSF.

Yes, it's all in the Git history, so it's easy to make a list. (There's 
only one other contributor of code, others helped with testing.)


 In the meantime, can you check if you are following the GNU Coding
 Standards for the new code?

I tried to do that. So except for the warc-tools library, which uses a 
different standard, all new code follows the GNU standards (I hope).


Thanks,

Gijs

Re: [Bug-wget] WARC output

2011-08-09 Thread Patrick Steil

That sounds awesome! You have my vote... :)

On Tue, Aug 9, 2011 at 4:49 AM, Gijs van Tulder gvtul...@gmail.com wrote:

Hi,

I'd like to propose a new feature that allows Wget to make WARC files.

Perhaps you're already familiar with it, but in short: WARC is a file
format for web archives. In a single WARC file, you can store every file of
the website, plus the HTTP request and response headers and other metadata.
This makes it a very useful format for web archivists: you keep everything
together, in the most detailed and original form.
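
To make the layout concrete: a WARC record is a short block of named header fields, a blank line, the captured content, and two trailing CRLFs. The C sketch below formats one `response` record under that layout. It is not taken from the patch or from the WARC Tools library; the function name `warc_format_record` and its parameters are hypothetical, and it assumes the HTTP message is a NUL-terminated string, which real (binary) payloads are not.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one WARC "response" record into BUF.
   RECORD_ID is a URI such as "urn:uuid:...", DATE is a UTC timestamp,
   HTTP_MESSAGE is the raw HTTP response (status line, headers, body).
   Returns the record length, or -1 if it does not fit in BUFSIZE.  */
int
warc_format_record (char *buf, size_t bufsize, const char *record_id,
                    const char *date, const char *target_uri,
                    const char *http_message)
{
  int n = snprintf (buf, bufsize,
                    "WARC/1.0\r\n"
                    "WARC-Type: response\r\n"
                    "WARC-Record-ID: <%s>\r\n"
                    "WARC-Date: %s\r\n"
                    "WARC-Target-URI: %s\r\n"
                    "Content-Type: application/http; msgtype=response\r\n"
                    "Content-Length: %lu\r\n"
                    "\r\n"        /* end of the WARC header fields */
                    "%s"          /* content block: the HTTP message */
                    "\r\n\r\n",   /* record terminator */
                    record_id, date, target_uri,
                    (unsigned long) strlen (http_message), http_message);

  if (n < 0 || (size_t) n >= bufsize)
    return -1;
  return n;
}
```

Because the original URL travels in `WARC-Target-URI` and the HTTP status line and headers sit inside the content block, a 404 page or a redirect is stored like any other response.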

The WARC format (an ISO standard, ISO 28500) has been developed by the
International Internet Preservation Consortium, which includes the Internet
Archive and many national libraries. It is supposed to become *the* standard
file format for web archives. For example, it is used in the Internet
Archive's Wayback Machine and its Heritrix crawler. There are several
projects building tools to work with WARC files.

It would be cool if Wget could become one of these tools. Already the Swiss
army knife for mirroring websites, the one thing that Wget is missing is a
good way to store these mirrors. The current output of --mirror is not
sufficient for archival purposes:

- it throws away the HTTP headers (of the request and response);
- it doesn't keep 404 pages and redirects;
- it doesn't store the original urls but mangles the filenames;
- and, if you're not careful, it even rewrites the links inside
the documents that it has downloaded.

The WARC format supports these things.

With some help from others, I've added WARC functions to Wget. With the
--warc-file option you can specify that the mirror should also be written to
a WARC archive. Wget will then keep everything, including the HTTP request
and response headers, redirects and 404 pages.

Do you think this is something that could be included in the main Wget
version? If that's the case, what should be the next step?

Description, links to more information about WARC:

http://www.archiveteam.org/index.php?title=Wget_with_WARC_output

Code:
https://github.com/alard/wget-warc/
https://github.com/downloads/alard/wget-warc/wget-warc-20110809.tar.bz2

The implementation makes use of the open source WARC Tools library
(Apache License 2.0):
http://code.google.com/p/warc-tools/

I look forward to your response.

Kind regards,

Gijs van Tulder

*Patrick Steil | ChurchBuzz.org*

Church Website Optimization http://www.churchbuzz.org/
Like us on Facebook http://facebook.com/churchbuzz!

Mobile: 940-391-9250
