Hi

On Wed, Apr 9, 2014 at 8:43 AM, <[email protected]> wrote:

>
> user Digest 9 Apr 2014 14:43:51 -0000 Issue 2188
>
> I might not be thinking in the right direction so need some help. Is there
> a way to find an approximate web content size of a particular website in
> Nutch 2.2.1?
>

You can obtain the WebPage content by looking into the FetcherReducer, at the
following line:

    if (content != null && content.getContent() != null)
        length = content.getContent().length;

content.getContent() returns a byte[] containing the binary content
retrieved for that resource, so its length gives the size in bytes.

You would then need to think about how to sum these lengths to obtain an
approximate total for a given domain.
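A minimal sketch of that summation step (the class and method names here are
hypothetical, not part of Nutch; it just shows accumulating the byte[] lengths
per domain, as you would do with the values taken from the FetcherReducer):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical helper: tallies fetched-content sizes per domain.
    public class DomainSizeTally {
        private final Map<String, Long> totals = new HashMap<>();

        // 'content' would be the byte[] from content.getContent()
        public void record(String domain, byte[] content) {
            if (content != null) {
                totals.merge(domain, (long) content.length, Long::sum);
            }
        }

        public long totalFor(String domain) {
            return totals.getOrDefault(domain, 0L);
        }

        public static void main(String[] args) {
            DomainSizeTally tally = new DomainSizeTally();
            tally.record("example.org", new byte[2048]); // e.g. an HTML page
            tally.record("example.org", new byte[4096]); // e.g. a PDF
            System.out.println(tally.totalFor("example.org")); // prints 6144
        }
    }

Note this counts the raw fetched bytes, which may differ from on-disk sizes
once parsing and storage come into play.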


> I have crawled a research website which has lot of images, pdfs, etc. and I
> am interested to know the content size of all the files in that website.
> Please advise.
>
I haven't needed to do this as of yet, so I don't have a concrete answer as
to how you could implement it all.
hth
Lewis
