Hmm, it's not quite the same topic, as I was talking more about relevance in 
the scope of web crawling. 

Anyway, if you index to Solr, it provides neat components such as Terms and 
TermVectors, which you can use to retrieve tf*idf information per term. 

See the Solr wiki for documentation on these components.
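Independent of Solr's components, the tf*idf value itself is simple to compute once you have term and document frequencies. A minimal sketch using the textbook definition tf * log(N/df) — note this is an assumption for illustration; Lucene/Solr's actual scoring applies additional normalization (sqrt of tf, norms, boosts):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Textbook tf*idf: raw term frequency times log(N / document frequency).

    This is a simplified illustration, not the exact Lucene formula.
    """
    n_docs = len(corpus)
    tf = doc_tokens.count(term)
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(n_docs / df)

# Hypothetical toy corpus standing in for crawled pages:
corpus = [
    ["nutch", "crawls", "the", "web"],
    ["solr", "indexes", "the", "web"],
    ["nutch", "feeds", "solr", "web"],
]

# "nutch" appears in 2 of 3 docs, once in doc 0 -> 1 * log(3/2)
print(tf_idf("nutch", corpus[0], corpus))
```

A term occurring in every document (like "web" above) gets a score of 0, which is the idf component doing its job of discounting ubiquitous terms.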

> Hi Zhanibek,
> 
> I would like to refer specifically to Markus' thread which he initiated a
> short time ago [1] sharing close similarity to your own questions. I think
> the main question to be answered now is how do we extract tf-idf from a
> crawled website? And as we now refer to Nutch as an independent software
> project focussed solely on crawling this is a question which would provide
> significant value to understanding more about the inner workings.
> 
> Markus mentioned that there are many aspects we need to consider before
> trying to compile a tf-idf score, e.g. link score, norms, boosts, functions, etc.
> This is making it relatively hard for me (and I suspect others) to
> accurately comment on the actual components we are required to consider and
> understand in this specific context before we can address the fundamental
> question at hand...
> 
> I think there is a good deal of lateral thinking required here ;0)
> 
> In the meantime, have you had any chance to delve into this?
> 
> 
> [1] http://www.mail-archive.com/user%40nutch.apache.org/msg03517.html
> 
> On Wed, Aug 3, 2011 at 5:28 AM, Zhanibek Datbayev <[email protected]> wrote:
> > Hello Nutch Users,
> > I've googled for a while and still can not find answers to the following:
> > 1. After I crawl a web site, how can I extract tf-idf for it?
> > 2. How can I access original web pages crawled?
> > 3. Is it possible to get, for each word, the id it corresponds to?
> > 
> > Thanks in advance!
> > 
> > -Zhanibek