Thanks, Lewis! I will look into it.
On Wed, Apr 9, 2014 at 3:46 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi
>
> On Wed, Apr 9, 2014 at 8:43 AM, <[email protected]> wrote:
>
> > user Digest 9 Apr 2014 14:43:51 -0000 Issue 2188
> >
> > I might not be thinking in the right direction, so I need some help. Is
> > there a way to find the approximate web content size of a particular
> > website in Nutch 2.2.1?
>
> You can obtain the WebPage content size by looking at the following line
> in FetcherReducer:
>
>     if (content != null && content.getContent() != null)
>         length = content.getContent().length;
>
> content.getContent() returns a byte[] containing the binary content
> retrieved for this resource, so its length is the size of that page in
> bytes.
>
> You would then need to think about how you could sum up these lengths to
> obtain an approximate total for a given domain.
>
> > I have crawled a research website which has a lot of images, PDFs, etc.,
> > and I am interested to know the content size of all the files on that
> > website. Please advise.
>
> I haven't needed to do this yet, so I don't have a concrete answer as to
> how you could implement it all.
>
> hth
> Lewis
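
Following up for the archives: below is a minimal, self-contained sketch of
how those per-page byte[] lengths could be summed per host. This is plain
Java, not the Nutch API; the ContentSizeTally class and its record() method
are hypothetical stand-ins for wherever you hook into the fetch loop (e.g.
FetcherReducer), and the URLs and sizes in main() are made-up sample data.

    import java.net.URL;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    /**
     * Minimal sketch: accumulate fetched-content sizes per host.
     * The (url, content) pairs stand in for what the fetcher sees
     * when content.getContent() is non-null.
     */
    public class ContentSizeTally {

        // host -> total bytes fetched so far
        private final Map<String, AtomicLong> bytesPerHost =
                new ConcurrentHashMap<>();

        /** Record one fetched page, mirroring the null checks Lewis quoted. */
        public void record(String url, byte[] content) throws Exception {
            if (content == null) {
                return;                              // nothing was fetched
            }
            String host = new URL(url).getHost();    // group pages by host
            bytesPerHost.computeIfAbsent(host, h -> new AtomicLong())
                        .addAndGet(content.length);  // sum the byte[] lengths
        }

        public static void main(String[] args) throws Exception {
            ContentSizeTally tally = new ContentSizeTally();
            // Stand-in data; in practice these come from the fetch loop.
            tally.record("http://example.edu/index.html", new byte[12345]);
            tally.record("http://example.edu/paper.pdf",  new byte[2000000]);
            tally.bytesPerHost.forEach((host, bytes) ->
                System.out.printf("%s ~ %d bytes%n", host, bytes.get()));
        }
    }

A ConcurrentHashMap of AtomicLongs keeps the tally thread-safe, since the
fetcher runs multiple threads; in a real MapReduce job you would more likely
emit (host, length) pairs and sum them in a reducer instead.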

