Hi Lewis,

I figured out a much simpler way of getting the aggregate value of contentLength. Since every URL in Nutch/HBase is indexed in Solr, I set the following on the Solr Admin page:

1. fl = contentLength, url
2. wt = csv

and of course q was left at the default *:*.

That was it! It generated a CSV with all the information I need in the Solr Admin window. I copied that into Excel and calculated the aggregate of contentLength. :)

However, about 1000 of the 200,000 URLs have an empty contentLength. One example of such a URL is below. Any comments on why Nutch didn't capture contentLength for it?

http://andrewsforest.oregonstate.edu/pubs/mtgnotes/monthmtg/minmo.cfm?minmo=9408&topnav=42

Thanks.

On Fri, Apr 25, 2014 at 2:27 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi,
>
> On Fri, Apr 25, 2014 at 11:15 AM, <[email protected]> wrote:
>
> > From what you said earlier,
> > Isn't that the same as contentLength in index-more plugin which is
> > determined according
> > to the type of download page?
>
> Pretty much ;)
> It would be interesting to see if you could use Gora to Query by Field for
> all domains with a certain key e.g. same site domain. This would
> aggregate results and you could sum all contentLength's. Alternatively of
> course you could use HBase shell or Cassandra CQL. Only problem with
> Cassandra is that *everything* is in Bytes in CQL as we write in Bytes so
> it looks really messy. You would be better to use Gora for the queries in
> Cassandra.
> Please let me know how you get on.
> Thanks
> Lewis
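
For reference, the spreadsheet step described above (summing contentLength from the CSV export and spotting the URLs where it is empty) could also be scripted. The following is only a minimal Python sketch, not something from the thread: the function name `summarize_content_length` and the sample rows are made up for illustration, and it assumes the export has a header line with the columns contentLength and url, as requested through fl, and that enough rows were requested to cover every document.

```python
import csv
import io


def summarize_content_length(csv_text):
    """Return (total_bytes, urls_with_empty_length) for a CSV export.

    Assumes a header row containing 'contentLength' and 'url' columns.
    """
    total = 0
    missing = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get("contentLength") or "").strip()
        if value:
            total += int(value)
        else:
            # Keep the URL so the empty cases can be investigated later.
            missing.append(row.get("url", ""))
    return total, missing


# Made-up sample rows in the assumed export shape (not real crawl data).
sample = """contentLength,url
1200,http://example.org/a
,http://example.org/b
300,http://example.org/c
"""

total, missing = summarize_content_length(sample)
print("total bytes:", total)
print("urls with empty contentLength:", missing)
```

Collecting the URLs with an empty value, rather than just counting them, makes it easier to check those pages individually afterwards.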

