RE: Depth option
I would recommend using the domain URL filter (the `urlfilter-domain` plugin); it is the most straightforward way of controlling the list of hosts in the crawldb.

M

-----Original message-----
From: Shadi Saleh propat...@gmail.com
Sent: Sunday 4th January 2015 16:23
To: user@nutch.apache.org
Subject: Depth option
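For reference, a minimal sketch of the domain filter setup (the file path and the allowed domain below are illustrative, not from the thread): the `urlfilter-domain` plugin reads allowed hosts or domains, one per line, from `conf/domains.txt`, and the plugin itself must be enabled via the `plugin.includes` property in `conf/nutch-site.xml`.

```shell
# Sketch, assuming a standard Nutch 1.x layout (paths illustrative).
# The urlfilter-domain plugin must also be added to the plugin.includes
# property in conf/nutch-site.xml, e.g. "...|urlfilter-domain|...".
mkdir -p conf
cat > conf/domains.txt <<'EOF'
example.com
EOF
```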
Depth option
Hello,

I want to check one point, please. I am using the crawl command to crawl www.example.com with the depth=1 option. If that website contains a URL to another website, e.g. www.example2.com, Nutch should not crawl it. Is it enough to use the depth option, or should I also use a URL filter?

Best
--
Shadi Saleh
Ph.D. Student
Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
Charles University in Prague
16017 Prague 6, Czech Republic
Mob +420773515578
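For context, this is roughly the invocation being discussed (a sketch only; the seed directory, output directory, and topN value are illustrative, and the exact script depends on the Nutch 1.x release in use):

```shell
# Classic one-shot crawl command (deprecated in later 1.x releases);
# urls/ holds the seed list, -depth 1 fetches only the seed pages
# plus one round of generated links.
bin/nutch crawl urls -dir crawl -depth 1 -topN 1000

# Newer 1.x releases replace it with the bin/crawl wrapper script,
# where the final argument is the number of fetch rounds
# (analogous to depth).
bin/crawl urls crawl 1
```

Note that a round/depth of 1 limits how far the crawl expands from the seeds, but it is not by itself a guarantee about which hosts get fetched; that is what the URL filters and the external-links setting control.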
Re: Depth option
Yes, you are correct; there is no need to use the URL filter. But this will only work if your crawldb is empty to begin with.

Regards
Adil I. Abbasi
Re: Depth option
Thanks Adil. The crawldb is not empty; it currently contains the old and current folders. Should I clean it before I start a new crawl? What is the proper way?

Best
Shadi
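To start completely fresh, one common approach (a sketch; the directory names below are the standard Nutch 1.x crawl layout, and the paths are illustrative) is simply to remove the old crawl data before the next run:

```shell
# Illustrative Nutch 1.x crawl layout, created here only so the cleanup
# below is self-contained; in a real setup these directories already exist.
mkdir -p crawl/crawldb/current crawl/crawldb/old crawl/segments crawl/linkdb

# To start a completely fresh crawl, remove the crawldb (and usually the
# segments and linkdb as well) before re-running the crawl command:
rm -r crawl/crawldb crawl/segments crawl/linkdb
```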
Re: Depth option
Shadi, I am not sure what happens if example.com itself has external links; I think it will fetch those even with depth 1. But if you want to disable the fetching of external links, just set the db.ignore.external.links property to true; you don't need any URL filter set up if you do so.
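For reference, a sketch of that property as it would go into `conf/nutch-site.xml` (its default in `nutch-default.xml` is false):

```xml
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading to a different host are ignored,
  so the crawl stays on the seed hosts without any URL filter.</description>
</property>
```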
Re: Depth option
I believe you need to clean it.

Regards
Adil I. Abbasi