Hi Zhanibek,

I would like to point you to the thread Markus started a short time ago [1], which closely resembles your own questions. I think the main question to answer now is: how do we extract tf-idf from a crawled website? Now that we refer to Nutch as an independent software project focussed solely on crawling, an answer to this question would add significant value to our understanding of its inner workings.
Markus mentioned that there are many aspects we need to consider before trying to compile a tf-idf score, e.g. link score, norms, boosts, functions etc. This makes it relatively hard for me (and, I suspect, others) to comment accurately on which components we need to consider and understand in this specific context before we can address the fundamental question at hand... I think a good deal of lateral thinking is required here ;0)

In the meantime, have you had any chance to delve into this?

[1] http://www.mail-archive.com/user%40nutch.apache.org/msg03517.html

On Wed, Aug 3, 2011 at 5:28 AM, Zhanibek Datbayev <[email protected]> wrote:

> Hello Nutch Users,
>
> I've googled for a while and still can not find answers to the following:
>
> 1. After I crawl a web site, how can I extract tf-idf for it?
> 2. How can I access original web pages crawled?
> 3. Is it possible to get for each word id it corresponds to?
>
> Thanks in advance!
>
> -Zhanibek

--
*Lewis*

