Thanks, Lewis! I will look into it.
On Wed, Apr 9, 2014 at 3:46 PM, Lewis John Mcgibbney <[email protected]> wrote:
> Hi
>
> On Wed, Apr 9, 2014 at 8:43 AM, <[email protected]> wrote:
>
> > user Digest 9 Apr 2014 14:43:51 -0000 Issue 2188
> >
> > I might not be thinking in the right direction, so I need some help. Is
> > there a way to find the approximate web content size of a particular
> > website in Nutch 2.2.1?
>
> You can obtain the WebPage content size by looking at the following line
> in FetcherReducer:
>
>     if (content != null && content.getContent() != null)
>         length = content.getContent().length;
>
> content.getContent() returns a byte[] containing the binary content
> retrieved for this resource, so its length is the size of that page in
> bytes.
>
> You would then need to think about how you could sum up these lengths to
> obtain an approximate total for a given domain.
>
> > I have crawled a research website which has a lot of images, PDFs, etc.,
> > and I am interested to know the content size of all the files on that
> > website. Please advise.
>
> I haven't needed to do this yet, so I don't have a concrete answer as to
> how you could implement it all.
>
> hth
> Lewis
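
Following up for the archives: below is a minimal, self-contained sketch of
how those per-page byte[] lengths could be summed per host. This is plain
Java, not the Nutch API; the ContentSizeTally class and its record() method
are hypothetical stand-ins for wherever you hook into the fetch loop (e.g.
FetcherReducer), and the URLs and sizes in main() are made-up sample data.

    import java.net.URL;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    /**
     * Minimal sketch: accumulate fetched-content sizes per host.
     * The (url, content) pairs stand in for what the fetcher sees
     * when content.getContent() is non-null.
     */
    public class ContentSizeTally {

        // host -> total bytes fetched so far
        private final Map<String, AtomicLong> bytesPerHost =
                new ConcurrentHashMap<>();

        /** Record one fetched page, mirroring the null checks Lewis quoted. */
        public void record(String url, byte[] content) throws Exception {
            if (content == null) {
                return;                              // nothing was fetched
            }
            String host = new URL(url).getHost();    // group pages by host
            bytesPerHost.computeIfAbsent(host, h -> new AtomicLong())
                        .addAndGet(content.length);  // sum the byte[] lengths
        }

        public static void main(String[] args) throws Exception {
            ContentSizeTally tally = new ContentSizeTally();
            // Stand-in data; in practice these come from the fetch loop.
            tally.record("http://example.edu/index.html", new byte[12345]);
            tally.record("http://example.edu/paper.pdf",  new byte[2000000]);
            tally.bytesPerHost.forEach((host, bytes) ->
                System.out.printf("%s ~ %d bytes%n", host, bytes.get()));
        }
    }

A ConcurrentHashMap of AtomicLongs keeps the tally thread-safe, since the
fetcher runs multiple threads; in a real MapReduce job you would more likely
emit (host, length) pairs and sum them in a reducer instead.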

