In that case, you can generate from one database and do the db update to a different crawl db.
On Jan 15, 2011, at 10:06 AM, "Marseld Dedgjonaj" <[email protected]> wrote:

> Hi,
> Thanks for your response.
> If I set -depth 1, this will work only for the first crawl.
> But since the initial URLs are very dynamic web pages and their content
> changes every hour, I need to crawl the initial URLs continuously (only
> the initial URLs).
>
> Best regards,
> Marseldi
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Saturday, January 15, 2011 6:58 PM
> To: [email protected]
> Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains
>
> Can't you do this by specifying -depth 1 in the crawl command?
>
> -----Original Message-----
> From: Marseld Dedgjonaj <[email protected]>
> To: user <[email protected]>; markus.jelsma <[email protected]>; 'Asier Martínez' <[email protected]>
> Sent: Sat, Jan 15, 2011 3:44 am
> Subject: RE: How store only home page of domains but crawl all the pages to detect all different domains
>
> > Hi Markus,
> > I am also interested in using a different regex-urlfilter for the generate
> > step, because I need to crawl only the home pages of 10 websites
> > continuously and index all the links found on those home pages, without
> > crawling recursively.
> > I think it can be done by putting only these 10 websites in the
> > regex-urlfilter file used for generate, and using the default
> > regex-urlfilter in the other steps (fetch, updatedb, invertlinks, index).
> > As you said, it is possible to have different nutch-site configs for each
> > step. How can I configure one nutch-site config file for the generate step
> > and another for the other steps?
> > Do I need to change code for this, or is it just a configuration trick?
> >
> > Please help me with this. I really need it.
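The generate-only filter Marseldi describes could be a regex-urlfilter file that admits only the target home pages. A minimal sketch (the site names are placeholders, not from this thread):

```
# regex-urlfilter-generate.txt (hypothetical): only the home pages of the
# target sites are eligible for the generate step.
+^http://www\.example1\.com/$
+^http://www\.example2\.com/$
+^http://www\.example3\.com/$
# everything else is excluded
-.
```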
> >
> > Best regards,
> > Marseld
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: Thursday, January 13, 2011 1:51 PM
> > To: Asier Martínez
> > Cc: [email protected]
> > Subject: Re: How store only home page of domains but crawl all the pages to detect all different domains
> >
> > Hi,
> >
> > You will need to create different versions of regex-urlfilter.txt for the
> > different jobs. You can have different nutch-site configs where each has a
> > different setting for urlfilter.regex.file, pointing to the relevant
> > regex-urlfilter file. Or you can just copy regex-urlfilter-<JOB>.txt to
> > regex-urlfilter.txt before executing that job.
> >
> > Cheers,
> >
> > On Thursday 13 January 2011 02:06:15 Asier Martínez wrote:
> >> Oh, thank you Markus for your input. I had "solved" the home-page problem
> >> in my crawler in Python, but I found that Nutch works much faster than my
> >> original crawler based on the Twisted library.
> >> And I want to learn more :-).
> >>
> >> I didn't know about different URL filters for fetching, updating, etc.
> >> Where can I change those filters?
> >>
> >> Thank you,
> >>
> >> 2011/1/12 Markus Jelsma <[email protected]>:
> >>> Hi,
> >>>
> >>> This is rather tricky. You can crawl a lot but index a little if you use
> >>> different URL filters for fetching, updating the db, and indexing, so
> >>> that part is rather easy.
> >>>
> >>> The question is how to define a home page in the URL filters. For this
> >>> website it is /, for another it is /home.html, another redirects to
> >>> subdomain.domain.extension, and yet another will redirect to a
> >>> language-based URL.
> >>>
> >>> Cheers,
> >>>
> >>>> Hi to all,
> >>>> here is my problem.
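The "copy before each job" approach Markus mentions can be sketched as a small wrapper script. This is a sketch under assumptions: the conf layout, the `regex-urlfilter-<JOB>.txt` naming, and the `regex-urlfilter-default.txt` fallback file are all illustrative, not part of the thread.

```shell
#!/bin/sh
# Sketch: before running a Nutch job, copy the job-specific URL filter
# over the active regex-urlfilter.txt. Paths and file names are illustrative.
NUTCH_CONF="${NUTCH_CONF:-conf}"

run_job() {
  job="$1"; shift
  if [ -f "$NUTCH_CONF/regex-urlfilter-$job.txt" ]; then
    # A job-specific filter exists (e.g. regex-urlfilter-generate.txt).
    cp "$NUTCH_CONF/regex-urlfilter-$job.txt" "$NUTCH_CONF/regex-urlfilter.txt"
  else
    # Fall back to the default filter for the remaining steps.
    cp "$NUTCH_CONF/regex-urlfilter-default.txt" "$NUTCH_CONF/regex-urlfilter.txt"
  fi
  bin/nutch "$job" "$@"
}

# Usage, e.g.:
#   run_job generate crawl/crawldb crawl/segments
#   run_job fetch crawl/segments/...
```

The per-job nutch-site config with a different `urlfilter.regex.file` value avoids the copy, but the copy variant needs no extra config directories.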
> >>>> I want to crawl "all" (to a certain depth limit,
> >>>> you know) the pages of certain domains/subdomains to detect them, but
> >>>> only store the home pages of the domains. (I don't have the list of
> >>>> the domains.)
> >>>> Is there an easy way to do this, or do I have to change the source code
> >>>> of some plugin? Where can I start looking?
> >>>>
> >>>> Thanks in advance,
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
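For the index-time filter, one possible reading of "home page" is the bare host with path "/". A hedged sketch of such a filter (which, as Markus notes, will miss sites whose home page is e.g. /home.html or a redirect target):

```
# regex-urlfilter-index.txt (hypothetical): accept only root URLs at
# index time; everything deeper in the site is crawled but not stored.
+^https?://[^/]+/?$
-.
```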

