Re: JobTracker gets stuck with DFS problems
"Again, this procedure does NOT work when using HDFS - you won't even see the partial output (without some serious hacking)" Got it ! "You can simply set the fetcher.parsing config option to false." Found it ! Thanks for the help 2010/5/3 Andrzej Bialecki > On 2010-05-03 22:58, Emmanuel de Castro Santana wrote: > > "The first method of recovering that is mentioned there works only > > (sometimes) for crawls performed using LocalJobTracker and local file > > system. It does not work in any other case." > > > > if I stop the crawling process, take the crawled content from the dfs > into > > my local disk, do the fix and then put it back into hdfs, would it work ? > Or > > would there be a problem about dfs replication of the new files ? > > Again, this procedure does NOT work when using HDFS - you won't even see > the partial output (without some serious hacking). > > > > > "Yes, but if the parsing fails you still have the downloaded content, > > which you can re-parse again after you fixed the config or the code..." > > > > Interesting ... I did not see any option like a -noParsing in the > "bin/nutch > > crawl" command, that means I will have to code my own .sh for crawling, > one > > that uses the -noparsing option of the fetcher right ? > > You can simply set the fetcher.parsing config option to false. > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > -- Emmanuel de Castro Santana
Re: JobTracker gets stuck with DFS problems
On 2010-05-03 22:58, Emmanuel de Castro Santana wrote:
> "The first method of recovering that is mentioned there works only
> (sometimes) for crawls performed using LocalJobTracker and local file
> system. It does not work in any other case."
>
> if I stop the crawling process, take the crawled content from the dfs into
> my local disk, do the fix and then put it back into hdfs, would it work? Or
> would there be a problem about dfs replication of the new files?

Again, this procedure does NOT work when using HDFS - you won't even see
the partial output (without some serious hacking).

> "Yes, but if the parsing fails you still have the downloaded content,
> which you can re-parse again after you fixed the config or the code..."
>
> Interesting ... I did not see any option like a -noParsing in the
> "bin/nutch crawl" command, that means I will have to code my own .sh for
> crawling, one that uses the -noParsing option of the fetcher, right?

You can simply set the fetcher.parsing config option to false.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
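A minimal sketch of what that override might look like, written as a shell snippet that creates conf/nutch-site.xml. The property key is taken from the message above (fetcher.parsing); some Nutch releases expose it as fetcher.parse instead, so verify the exact name against conf/nutch-default.xml. The snippet also assumes nutch-site.xml holds no other overrides yet, since it rewrites the whole file.

# Sketch: fetch without parsing by overriding the property in conf/nutch-site.xml.
# The key is quoted as fetcher.parsing above; check conf/nutch-default.xml for
# the exact name in your release (some name it fetcher.parse). Assumes no other
# overrides exist yet, because this rewrites the whole file.
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.parsing</name>
    <value>false</value>
  </property>
</configuration>
EOF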
Re: JobTracker gets stuck with DFS problems
"The first method of recovering that is mentioned there works only (sometimes) for crawls performed using LocalJobTracker and local file system. It does not work in any other case." if I stop the crawling process, take the crawled content from the dfs into my local disk, do the fix and then put it back into hdfs, would it work ? Or would there be a problem about dfs replication of the new files ? "Yes, but if the parsing fails you still have the downloaded content, which you can re-parse again after you fixed the config or the code..." Interesting ... I did not see any option like a -noParsing in the "bin/nutch crawl" command, that means I will have to code my own .sh for crawling, one that uses the -noparsing option of the fetcher right ? 2010/5/3 Andrzej Bialecki > On 2010-05-03 19:59, Emmanuel de Castro Santana wrote: > > "Unfortunately, no. You should at least crawl without parsing, so that > > when you download the content you can run the parsing separately, and > > repeat it if it fails." > > > > I've just found this in the FAQ, can it be done ? > > > > > http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F > > The first method of recovering that is mentioned there works only > (sometimes) for crawls performed using LocalJobTracker and local file > system. It does not work in any other case. > > > > > By the way, about not parsing, isn't necessary to parse the content > anyway > > in order to generate links for the next segment ? If this is true, one > would > > have to run parse separatedly, which would result the same. > > Yes, but if the parsing fails you still have the downloaded content, > which you can re-parse again after you fixed the config or the code... > > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > -- Emmanuel de Castro Santana
Re: JobTracker gets stuck with DFS problems
On 2010-05-03 19:59, Emmanuel de Castro Santana wrote:
> "Unfortunately, no. You should at least crawl without parsing, so that
> when you download the content you can run the parsing separately, and
> repeat it if it fails."
>
> I've just found this in the FAQ, can it be done?
>
> http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F

The first method of recovering that is mentioned there works only
(sometimes) for crawls performed using LocalJobTracker and local file
system. It does not work in any other case.

> By the way, about not parsing, isn't it necessary to parse the content
> anyway in order to generate links for the next segment? If this is true,
> one would have to run parse separately, which would come to the same thing.

Yes, but if the parsing fails you still have the downloaded content,
which you can re-parse again after you fixed the config or the code...

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
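As a small illustration of that workflow, a sketch of re-parsing an already-fetched segment after a config or code fix; the segment path is only a placeholder, and the cleanup step assumes a failed parse attempt may have left partial output behind.

# Sketch: re-run parsing over a segment that was already fetched.
SEGMENT=crawl/segments/20100503120000   # placeholder segment name

# Remove partial output from the failed parse attempt, if any, so the
# parse job can write its output directories again.
hadoop fs -rmr $SEGMENT/crawl_parse $SEGMENT/parse_data $SEGMENT/parse_text

bin/nutch parse $SEGMENT                    # parse the stored raw content again
bin/nutch updatedb crawl/crawldb $SEGMENT   # then fold the results into the crawldb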
Re: JobTracker gets stuck with DFS problems
Thank you for answering, these tips are going to be valuable.

"too small number of file handles on your machines (run ulimit -n, this
should be set to 16k or more, the default is 1k)."

Set it to 16k. The process has been running for a day now and it has not got
stuck yet. I still often see errors related to the dfs, though. Anyway, are
there other properties I should pay attention to?

"do you use a SAN or other type of NAS as your storage?"

No, this is not the case.

"network equipment, such as router or switch, or network card: quite often
low-end equipment cannot handle high volume of traffic, even though it's
equipped with all the right ports ;)"

This should be one of the reasons.

"It's much much better to crawl all sites at the same time. This allows you
to benefit from parallel crawling - otherwise your fetcher will be always
stuck in the politeness crawl delay."

Good, that means we are on the right track.

"Unfortunately, no. You should at least crawl without parsing, so that
when you download the content you can run the parsing separately, and
repeat it if it fails."

I've just found this in the FAQ, can it be done?

http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F

By the way, about not parsing, isn't it necessary to parse the content anyway
in order to generate links for the next segment? If this is true, one would
have to run parse separately, which would come to the same thing.

--
Emmanuel de Castro Santana
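For reference, a short sketch of checking and raising the open-file limit persistently on CentOS-style machines; the 16k figure comes from the advice quoted above, and the "hadoop" user name is a placeholder for whatever account runs the DataNode/TaskTracker daemons.

# Check the current per-process open-file limit for the account running the
# Hadoop daemons (the default is often 1024).
ulimit -n

# Sketch: raise it persistently via /etc/security/limits.conf; "hadoop" is a
# placeholder user name, and the daemons must be restarted to pick this up.
sudo tee -a /etc/security/limits.conf >/dev/null <<'EOF'
hadoop  soft  nofile  16384
hadoop  hard  nofile  16384
EOF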
Re: JobTracker gets stuck with DFS problems
On 2010-04-30 20:09, Emmanuel de Castro Santana wrote:
> Hi All
>
> We are using Nutch to crawl ~500K pages with a 3 node cluster; each node
> features a dual core processor running with 4Gb RAM and circa 100Gb storage.
> All nodes run on CentOS.
>
> These 500K pages are scattered across several sites, each of them having
> from 5k up to 200k pages. For each site we start a different crawl process
> (using bin/nutch crawl), but they are all started almost simultaneously.
>
> We are trying to tune Hadoop's configuration in order to have a reliable
> daily crawling process. After a while of crawling we see some problems
> occurring, mainly on the TaskTracker nodes; most of them are related to
> access to the HDFS. We often see "Bad response 1 for block" and "Filesystem
> closed", among others. When these errors start to get more frequent, the
> JobTracker gets stuck and we have to run stop-all. If we lower the maximum
> number of map and reduce tasks, the process takes longer to get stuck, but
> we haven't found an adequate configuration yet.
>
> Given that setup, there are some questions we have been struggling to find
> an answer to:
>
> 1. What could be the most probable reason for the hdfs problems?

I suspect the following issues, in this order:

* too small a number of file handles on your machines (run ulimit -n, this
should be set to 16k or more, the default is 1k).

* do you use a SAN or other type of NAS as your storage?

* network equipment, such as a router, switch, or network card: quite often
low-end equipment cannot handle a high volume of traffic, even though it's
equipped with all the right ports ;)

> 2. Is it better to start a unique crawl with all sites inside or to just
> keep it the way we are doing (i.e. start a different crawl process for
> each site)?

It's much much better to crawl all sites at the same time. This allows
you to benefit from parallel crawling - otherwise your fetcher will always
be stuck in the politeness crawl delay.

> 3. When it all goes down, is there a way to restart crawling from where
> the process stopped?

Unfortunately, no. You should at least crawl without parsing, so that
when you download the content you can run the parsing separately, and
repeat it if it fails.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web | Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
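A minimal sketch of what a single combined crawl could look like, assuming the Nutch 1.x bin/nutch crawl command; the seed file, output directory and the -depth/-topN/-threads values are placeholders to tune, not values from this thread.

# Sketch: combine the seed URLs of all sites into one list and start a single
# crawl, instead of one crawl process per site.
hadoop fs -mkdir urls
hadoop fs -put all-sites-seeds.txt urls/    # placeholder seed list covering every site

# Depth, topN and thread count below are illustrative only.
bin/nutch crawl urls -dir crawl-all -depth 5 -topN 100000 -threads 20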