Hi,

I am new to Apache Nutch and to web crawlers in general. I am trying to
build a vertical search engine for real estate.

How should I implement the crawler? My current plan is to use Nutch for the
crawling and modify it so that it only extracts links from a page if the
page's contents are relevant to real estate. For that I would need to write
a relevancy scoring function that combines keywords, an ontology, and some
kind of similarity detection based on sites I already know to be relevant.
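
To make the idea concrete, here is a rough Python sketch of the kind of
scoring function I have in mind. It is only an illustration, not anything
Nutch provides: the keyword list, the function names, and the 0.5 weight
are all made up, the similarity part is plain term-frequency cosine
similarity against a few known-relevant pages, and the ontology part is
left out entirely.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list; a real one would come from a domain ontology.
KEYWORDS = {"apartment", "mortgage", "realtor", "listing", "property", "rent"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(tokens):
    """Fraction of tokens that are known real-estate keywords."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in KEYWORDS) / len(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevancy(page_text, seed_texts, keyword_weight=0.5):
    """Blend keyword density with the best similarity to any seed page.

    seed_texts holds the text of pages already known to be relevant.
    keyword_weight is an arbitrary assumption and would need tuning.
    """
    tokens = tokenize(page_text)
    page_tf = Counter(tokens)
    sims = [cosine_similarity(page_tf, Counter(tokenize(s))) for s in seed_texts]
    best_sim = max(sims, default=0.0)
    return keyword_weight * keyword_score(tokens) + (1 - keyword_weight) * best_sim
```

The crawler would then follow a page's outlinks only when relevancy()
returns a value above some threshold that I would have to pick by
experiment.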

Is there a way to configure Nutch to use my relevancy scoring function, or
do I need to change the source code? Also, I would prefer working in Python
over Java, as I am much more familiar with it. Is there a Python library
for Nutch?

Apart from this, I would really appreciate any other pointers regarding
Nutch in general.

Thanks
Vishal
