Hi,

I am trying to use Nutch to crawl Craigslist. The seed URLs I use are:

http://losangeles.craigslist.org/wst/ctd/
http://losangeles.craigslist.org/sfv/ctd/
http://losangeles.craigslist.org/lac/ctd/
http://losangeles.craigslist.org/sgv/ctd/
http://losangeles.craigslist.org/lgb/ctd/
http://losangeles.craigslist.org/ant/ctd/

http://losangeles.craigslist.org/wst/cto/
http://losangeles.craigslist.org/sfv/cto/
http://losangeles.craigslist.org/lac/cto/
http://losangeles.craigslist.org/sgv/cto/
http://losangeles.craigslist.org/lgb/cto/
http://losangeles.craigslist.org/ant/cto/


What I want to get are result pages like this one, for example,
http://losangeles.craigslist.org/lac/ctd/2501038362.html , which is a
specific car listing page.
What I DON'T want to get are result pages like this one, for example,
http://losangeles.craigslist.org/cta/
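
To make the distinction concrete, here is a rough sketch in Python (not Nutch code) of the URL pattern I have in mind. The function name is_listing_page and the regular expression are my own assumptions, inferred only from the example URLs in this message; real listing URLs may not all follow this shape.

```python
import re

# Assumed pattern, inferred only from the example URLs in this message:
# a three-letter area code, then "ctd" or "cto", then a numeric id
# followed by ".html".
LISTING_URL = re.compile(
    r"^http://losangeles\.craigslist\.org/[a-z]{3}/ct[do]/\d+\.html$"
)


def is_listing_page(url: str) -> bool:
    """Return True if the URL looks like an individual car listing page."""
    return LISTING_URL.match(url) is not None


if __name__ == "__main__":
    # The page I want to keep.
    print(is_listing_page(
        "http://losangeles.craigslist.org/lac/ctd/2501038362.html"))
    # The page I do not want in my results.
    print(is_listing_page("http://losangeles.craigslist.org/cta/"))
```

With this pattern the seed pages themselves (for example http://losangeles.craigslist.org/wst/ctd/) would also be rejected, so they would still need to be fetched for their links even if they are kept out of the results.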

However, my query results always include pages like
http://losangeles.craigslist.org/cta/

Actually, I do get the kind of pages I want from Craigslist, but only some
of them, not all. I tried adjusting the crawl command-line parameters, but
it made little difference.

So I plan to modify the crawl code in the Nutch source. Where should I
start? What can I change in the source code to optimize the crawl process?

-- 
Cheng Li
