Re: Crawling some specific url & avoiding other urls

Edward Drapkin Fri, 05 Nov 2010 21:19:47 -0700

On 11/5/2010 10:37 PM, nitin hardeniya wrote:

Dear all,


I am using Nutch to crawl all the user reviews for a title on IMDb. The
URLs look like
http://www.imdb.com/title/tt1375666/usercomments
http://www.imdb.com/title/tt1375666/usercomments?start=50
I want to crawl all of these, keeping only the user review text.

Each of these URLs has a link to the profile of each reviewer. Clicking
it redirects you to a URL like

http://www.imdb.com/user/ur10583368/comments

which has all the movie reviews written by one user, in this case ur10583368.
This user could have written multiple pages of reviews, and the pattern for
those URLs will be


http://www.imdb.com/user/ur10583368/comments?order=date&start=10
where the highlighted area will change for each page.

Now I need all these reviews as well.

Please help.
I want to crawl only these URLs.

1) Set up URL filtering to crawl the messages and the pages that index them

2) Set up an indexing filter to parse the pages you want into the Lucene fields you want, and otherwise add empty documents

3) remove empty documents from your index
4) ?!
5) Profit!
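
For step 1, a rough sketch of what the URL patterns might look like, written as a small Python check so they can be tried against sample URLs before going anywhere near a crawler config. The patterns are guesses based only on the four example URLs in this thread, and the names ALLOW_PATTERNS and should_crawl are made up for illustration; they are not part of Nutch.

```python
import re

# Hypothetical allow-list, derived only from the example URLs in this thread.
# Real IMDb URLs may carry extra parameters that these patterns do not cover.
ALLOW_PATTERNS = [
    # Review listing for one title, optionally paginated with ?start=N
    re.compile(r"^http://www\.imdb\.com/title/tt\d+/usercomments(\?start=\d+)?$"),
    # All reviews by one user, optionally paginated with ?order=date&start=N
    re.compile(r"^http://www\.imdb\.com/user/ur\d+/comments(\?order=date&start=\d+)?$"),
]


def should_crawl(url: str) -> bool:
    """Return True only if the URL matches one of the allowed patterns."""
    return any(pattern.match(url) for pattern in ALLOW_PATTERNS)


if __name__ == "__main__":
    samples = [
        "http://www.imdb.com/title/tt1375666/usercomments",
        "http://www.imdb.com/title/tt1375666/usercomments?start=50",
        "http://www.imdb.com/user/ur10583368/comments",
        "http://www.imdb.com/user/ur10583368/comments?order=date&start=10",
        "http://www.imdb.com/title/tt1375666/",
    ]
    for url in samples:
        print(should_crawl(url), url)
```

Once the patterns behave as expected, the same regular expressions should carry over to Nutch's regex URL filter file (conf/regex-urlfilter.txt in a typical install), one per line with a leading + to accept, followed by a final catch-all reject rule. Check the comments in that file for the exact syntax in your version.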

Thanks,
Eddie
