Re: [Bug-wget] Async webcrawling

2018-07-31 Thread Tim Rühsen
On 31.07.2018 20:17, James Read wrote:
> Thanks,
> 
> As I understand it, though, there is only so much you can do with
> threading. For more scalable solutions you need to go with async
> programming techniques. See http://www.kegel.com/c10k.html for a summary
> of the problem. I want to do large-scale web crawling and am not sure if
> wget2 is up to the job.

Well, you'll be surprised how fast wget2 is. Especially with HTTP/2
spreading more and more, you can easily fill larger bandwidths with even
a few threads. Of course it also heavily depends on the server's
capabilities and the ping/RTT values you have.

Since you can control host spanning, you could also split your workload
onto several processes (or even hosts).
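
Something like this would do the splitting (an untested sketch with
invented chunk-file names; it assumes wget2 is in PATH and accepts the
wget-style -i/--input-file option):

    /* Hypothetical sketch: run one wget2 worker per pre-split URL list.
     * File names are invented; add options to control recursion/host
     * spanning as needed. */
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        const char *chunks[] = { "urls.0.txt", "urls.1.txt", "urls.2.txt", "urls.3.txt" };
        int nchunks = 4;

        for (int i = 0; i < nchunks; i++) {
            if (fork() == 0) {
                /* each child crawls its own slice of the URL list */
                execlp("wget2", "wget2", "-i", chunks[i], (char *) NULL);
                _exit(127); /* exec failed */
            }
        }

        while (wait(NULL) > 0) /* wait for all workers to finish */
            ;

        return 0;
    }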

Are you going to crawl complete web sites or just a few files per site?
The speed heavily depends on those (and more) details.

If it turns out that you really need a highly specialized crawler, it
might be best to use libwget's API. I did so for scanning the top 1M
Alexa sites a while ago and it worked out pretty well (took ~2h on a
500/50 Mbps cable connection). The source is in the examples/ directory.
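Roughly, a single fetch with that API looks like the sketch below. It is
written from memory and unverified, so the exact type, option and field
names (wget_http_get, WGET_HTTP_URL, resp->body, ...) may differ between
libwget versions; check wget.h and the examples/ directory for the real
thing.

    /* Unverified libwget sketch, modeled on the examples/ directory:
     * fetch one URL. A real crawler would loop over a work queue and
     * feed discovered links back in. */
    #include <stdio.h>
    #include <wget.h>

    int main(void)
    {
        wget_http_response_t *resp;

        /* single blocking GET via the high-level helper */
        resp = wget_http_get(
            WGET_HTTP_URL, "https://example.com",
            WGET_HTTP_MAX_REDIRECTIONS, 5,
            0);

        if (resp && resp->body)
            printf("HTTP %d, %zu body bytes\n",
                   resp->code, (size_t) resp->body->length);

        if (resp)
            wget_http_free_response(&resp);

        return 0;
    }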

Maybe just start with a test.

I am personally pretty interested in tuning bottlenecks (CPU, memory,
bandwidth, ...), so let me know if there is something and I'll go for it.

You can also PM me with more details if you'd rather not post them in
public.

Regards, Tim

> 
> On Tue, Jul 31, 2018 at 6:22 PM, Tim Rühsen wrote:
> 
> On 31.07.2018 18:39, James Read wrote:
> > Hi,
> >
> > how much work would it take to convert wget into a fully fledged
> > asynchronous webcrawler?
> >
> > I was thinking something like using select. Ideally, I want to be
> able to
> > supply wget with a list of starting point URLs and then for wget
> to crawl
> > the web from those starting points in an asynchronous fashion.
> >
> > James
> >
> 
> Just use wget2. It is already packaged in Debian sid.
> To build from git source, see https://gitlab.com/gnuwget/wget2.
> 
> To build from tarball (much easier), download from
> https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.
> 
> Regards, Tim
> 
> 





Re: [Bug-wget] Async webcrawling

2018-07-31 Thread James Read
Thanks,

As I understand it, though, there is only so much you can do with threading.
For more scalable solutions you need to go with async programming
techniques. See http://www.kegel.com/c10k.html for a summary of the
problem. I want to do large-scale web crawling and am not sure if wget2 is
up to the job.

On Tue, Jul 31, 2018 at 6:22 PM, Tim Rühsen wrote:

> On 31.07.2018 18:39, James Read wrote:
> > Hi,
> >
> > how much work would it take to convert wget into a fully fledged
> > asynchronous webcrawler?
> >
> > I was thinking something like using select. Ideally, I want to be able to
> > supply wget with a list of starting point URLs and then for wget to crawl
> > the web from those starting points in an asynchronous fashion.
> >
> > James
> >
>
> Just use wget2. It is already packaged in Debian sid.
> To build from git source, see https://gitlab.com/gnuwget/wget2.
>
> To build from tarball (much easier), download from
> https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.
>
> Regards, Tim
>
>


Re: [Bug-wget] Async webcrawling

2018-07-31 Thread Tim Rühsen
On 31.07.2018 18:39, James Read wrote:
> Hi,
> 
> how much work would it take to convert wget into a fully fledged
> asynchronous webcrawler?
> 
> I was thinking something like using select. Ideally, I want to be able to
> supply wget with a list of starting point URLs and then for wget to crawl
> the web from those starting points in an asynchronous fashion.
> 
> James
> 

Just use wget2. It is already packaged in Debian sid.
To build from git source, see https://gitlab.com/gnuwget/wget2.

To build from tarball (much easier), download from
https://alpha.gnu.org/gnu/wget/wget2-1.99.1.tar.gz.

Regards, Tim





[Bug-wget] Async webcrawling

2018-07-31 Thread James Read
Hi,

how much work would it take to convert wget into a fully fledged
asynchronous webcrawler?

I was thinking something like using select. Ideally, I want to be able to
supply wget with a list of starting point URLs and then for wget to crawl
the web from those starting points in an asynchronous fashion.
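
A very rough illustration of that select()-based idea is below. It is
only a sketch with made-up host names, plain HTTP/1.0, and no error
handling or link extraction; it is not wget code.

    /* Sketch: non-blocking connects to a few hosts, multiplexed with
     * select(). Hosts are placeholders; a real crawler would read a
     * URL list, speak HTTPS, parse links and manage a frontier queue. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <netdb.h>
    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    #define NHOSTS 3

    int main(void)
    {
        const char *hosts[NHOSTS] = { "example.com", "example.org", "example.net" };
        int fds[NHOSTS];
        int sent[NHOSTS] = { 0 };
        int open_conns = 0;

        for (int i = 0; i < NHOSTS; i++) {
            struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
            fds[i] = -1;
            if (getaddrinfo(hosts[i], "80", &hints, &res) != 0)
                continue;
            int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
            if (fd < 0) { freeaddrinfo(res); continue; }
            fcntl(fd, F_SETFL, O_NONBLOCK);             /* non-blocking connect */
            connect(fd, res->ai_addr, res->ai_addrlen); /* typically EINPROGRESS */
            freeaddrinfo(res);
            fds[i] = fd;
            open_conns++;
        }

        while (open_conns > 0) {
            fd_set rset, wset;
            int maxfd = -1;

            FD_ZERO(&rset);
            FD_ZERO(&wset);
            for (int i = 0; i < NHOSTS; i++) {
                if (fds[i] < 0)
                    continue;
                if (!sent[i])
                    FD_SET(fds[i], &wset);  /* still waiting for connect */
                else
                    FD_SET(fds[i], &rset);  /* waiting for response bytes */
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }
            if (select(maxfd + 1, &rset, &wset, NULL, NULL) < 0)
                break;

            for (int i = 0; i < NHOSTS; i++) {
                if (fds[i] < 0)
                    continue;
                if (!sent[i] && FD_ISSET(fds[i], &wset)) {
                    char req[256];
                    snprintf(req, sizeof req,
                             "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", hosts[i]);
                    send(fds[i], req, strlen(req), 0);
                    sent[i] = 1;
                } else if (sent[i] && FD_ISSET(fds[i], &rset)) {
                    char buf[4096];
                    ssize_t n = recv(fds[i], buf, sizeof buf, 0);
                    if (n <= 0) {     /* done (or failed): close this connection */
                        close(fds[i]);
                        fds[i] = -1;
                        open_conns--;
                    }
                    /* a real crawler would parse buf for links here */
                }
            }
        }
        return 0;
    }

A real crawler additionally needs HTTPS, HTML parsing and a frontier
queue, which is exactly the machinery wget2/libwget already provides.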

James


Re: [Bug-wget] Any explanation for the '-nc' returned value?

2018-07-31 Thread Yuxi Hao
I forgot to say that I use this in scripts to update some software, so the
'old' file already exists.
If the first try fails, I certainly should do as you say.
But returning failure for "preventing anything from happening when the local
file already exists" is weird, because that is exactly what we asked for.
I mean it should return success: it worked as we specified, neither
overwriting the file nor downloading into a new one. (RCE?)
I am not trying to change the behavior of '-nc'; I am just confused by the
return value.

Thanks for your reply and patience, Tim! I can just change it in my own
build :p

Best Regards,
YX Hao

-----Original Message-----
From: Tim Rühsen
To: Yuxi Hao; 'Dale R. Worley'
Cc: bug-wget@gnu.org
Subject: Re: [Bug-wget] Any explanation for the '-nc' returned value?

On 30.07.2018 16:44, Yuxi Hao wrote:
> Let's take an example in practice.
> When there is a bad network connection, I try wget with '-nc' directly first;
> if it fails, then I'll try it with a proxy. If it says "File * already there;
> not retrieving." and returns 1 as described (error occurred, failed), that
> is so weird!

After the first try fails, you should explicitly move/remove the file out of
the way. That is not weird, it's a safety feature. It might save you when you
have a typo or when a URL you can't fully trust is retrieved. You could easily
overwrite files in your home directory, e.g. .profile or .bashrc. That is
easily turned into Remote Code Execution (RCE).

So no way we "fix" this ;-)