Hi,
I tried to use Nutch to crawl craigslist. The seed URLs I use are:
http://losangeles.craigslist.org/wst/ctd/
http://losangeles.craigslist.org/sfv/ctd/
http://losangeles.craigslist.org/lac/ctd/
http://losangeles.craigslist.org/sgv/ctd/
http://losangeles.craigslist.org/lgb/ctd/
http://losangeles.craigslist.org/ant/ctd/
http://losangeles.craigslist.org/wst/cto/
http://losangeles.craigslist.org/sfv/cto/
http://losangeles.craigslist.org/lac/cto/
http://losangeles.craigslist.org/sgv/cto/
http://losangeles.craigslist.org/lgb/cto/
http://losangeles.craigslist.org/ant/cto/
What I want to get are result pages like this one, for example,
http://losangeles.craigslist.org/lac/ctd/2501038362.html, which is a
specific car-for-sale listing page.
What I DON'T want to get are index pages like this one, for example,
http://losangeles.craigslist.org/cta/.
However, my query results always include pages like
http://losangeles.craigslist.org/cta/.
I do get some of the pages I want from craigslist, but only part of
them, not all of them. I tried adjusting the crawl command-line
parameters, but that made little difference.
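Before touching the source, I also wonder whether this can be expressed as URL filter rules in conf/regex-urlfilter.txt. Something along these lines is what I have in mind (the patterns are my guess at the URL shapes; first matching rule wins):

```
# accept individual posting pages such as /lac/ctd/2501038362.html
+^https?://losangeles\.craigslist\.org/[a-z]{3}/ct[do]/\d+\.html$
# the seed index pages (/wst/ctd/ etc.) still need to be fetched
# so that the posting links on them are discovered
+^https?://losangeles\.craigslist\.org/[a-z]{3}/ct[do]/$
# reject everything else
-.
```

Is this the right mechanism for what I am trying to do, or is source modification still needed?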
So what I plan to do is modify the crawling code in the Nutch source.
Where should I start? What kind of changes could I make in the source
to optimize the crawl process?
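To be concrete, the matching rule I have in mind could be sketched as a plain Java predicate (the regex is my guess at the URL shapes; I imagine a custom Nutch URLFilter plugin would wrap logic like this, returning the URL to keep it or null to drop it):

```java
import java.util.regex.Pattern;

public class CraigslistUrlFilter {
    // Accept only individual posting pages, e.g.
    // http://losangeles.craigslist.org/lac/ctd/2501038362.html
    private static final Pattern LISTING = Pattern.compile(
        "^https?://losangeles\\.craigslist\\.org/[a-z]{3}/ct[do]/\\d+\\.html$");

    // Nutch filter convention: return the URL to keep it, null to discard it.
    public static String filter(String url) {
        return LISTING.matcher(url).matches() ? url : null;
    }
}
```

For example, filter("http://losangeles.craigslist.org/lac/ctd/2501038362.html") would keep the URL, while filter("http://losangeles.craigslist.org/cta/") would return null and drop it.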
--
Cheng Li