Hi, I am new to Apache Nutch and to web crawlers in general, and I am trying to build a vertical search engine for real estate.
How should I implement the crawler? My plan is to use Nutch for the crawling and modify it so that it only extracts links from a page if the page contents are relevant to real estate. For that I would need to write a relevancy scoring function that combines keywords, an ontology, and some kind of similarity detection based on sites I already know to be relevant.

Is there a way to configure Nutch to use my own relevancy scoring function, or do I need to change the source code? I would also prefer to work in Python rather than Java, as I am much more familiar with it, so is there a Python library for Nutch?

Apart from this, I would really appreciate any other pointers regarding Nutch in general.

Thanks,
Vishal
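
P.S. To make the scoring idea more concrete, here is a rough Python sketch of what I have in mind. It is not tied to Nutch in any way. The keyword list, the weights, the 0.3 threshold and all function names are placeholders I made up for illustration, and the ontology part is left out entirely. The similarity part is a plain cosine similarity over term counts, which is only one of several ways it could be done.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list; a real one would probably come from an ontology.
KEYWORDS = {"apartment", "house", "rent", "mortgage",
            "realtor", "listing", "property"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(tokens):
    """Fraction of the known keywords that occur at least once in the page."""
    return len(KEYWORDS & set(tokens)) / len(KEYWORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevancy_score(page_text, known_relevant_texts, keyword_weight=0.5):
    """Weighted mix of keyword coverage and best similarity to known pages."""
    tokens = tokenize(page_text)
    page_counts = Counter(tokens)
    best_similarity = max(
        (cosine_similarity(page_counts, Counter(tokenize(t)))
         for t in known_relevant_texts),
        default=0.0,
    )
    return (keyword_weight * keyword_score(tokens)
            + (1.0 - keyword_weight) * best_similarity)


def should_follow_links(page_text, known_relevant_texts, threshold=0.3):
    """Only extract links from pages scoring at or above the (assumed) threshold."""
    return relevancy_score(page_text, known_relevant_texts) >= threshold
```

The idea would be that the crawler calls something like should_follow_links on each fetched page and skips link extraction when it returns False. How to hook that into Nutch is exactly the part I am unsure about.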

