Re: need your support
Hi Sahar,

Can you post your:

1. crawl-urlfilter
2. nutch-site.xml

Also, how are you running this program below? I'm CC'ing nutch-user@ so the community can benefit from this thread.

Cheers,
Chris

On 1/20/10 1:42 PM, "sahar elkazaz" wrote:

Dear sir,

I have followed all the steps in your article to run Nutch, and I use this Java program to access the segments:

package nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.Summary;
import org.apache.nutch.util.NutchConfiguration;

public class nutch {
  /** For debugging. */
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse("animal", conf);
    Hits hits = bean.search(query, 10);
    System.out.println("Total hits: " + hits.getTotal());
    int length = (int) Math.min(hits.getTotal(), 10);
    Hit[] show = hits.getHits(0, length);
    HitDetails[] details = bean.getDetails(show);
    Summary[] summaries = bean.getSummary(details, query);
    for (int i = 0; i < length; i++) {
      // The archived mail was cut off inside this loop; the body below
      // follows the usual NutchBean debugging pattern of printing each
      // hit's details and summary.
      System.out.println(" " + i + " " + details[i] + "\n" + summaries[i]);
    }
  }
}

When I run it, the program produces the following log:

10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
        at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
        at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:89)
        at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
        at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
        at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
        at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:50)
        at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
        at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
        at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 INFO searcher.SearchBean: opening indexes in crawl/indexes
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
        at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
        at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
        at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
        at org.apache.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:59)
        at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:77)
        at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:51)
        at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
        at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
        at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 INFO plugin.PluginRepository: Plugins: looking in: D:\nutch-1.0\plugins
10/01/20 22:29:28 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
10/01/20 22:29:28 INFO plugin.PluginRepository: Registered Plugins:
10/01/20 22:29:28 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
10/01/20 22:29:28 INFO plugin.PluginRepository:         Basic Query Filter (query-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
10/01/20 22:29:28 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository:         Site Query Filter (query-site)
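An aside for readers hitting the same trace: this LoginException is a common symptom of running Hadoop on Windows (note the D:\nutch-1.0 path) without a Unix userland on the PATH, since Hadoop's UnixUserGroupInformation obtains the user and groups by shelling out. That is my reading, not something confirmed in this thread; a quick sanity check from a Cygwin or similar shell:

  # Hadoop's login code runs these commands; if either fails here,
  # FileSystem.get() inside NutchBean will throw the LoginException above.
  whoami
  bash -c groups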
Re: Configuring nutch-site.xml
Well, it does not really work that way. If you want to use HDFS, first make Nutch run on one node (pseudo-distributed mode) and then deploy to more machines. And if you have it running in pseudo-distributed mode it won't use the local filesystem, which is why I don't understand your remarks in the initial mail.

The Nutch logs are in NUTCH_HOME/logs; look for the hadoop log file, it will tell you more or less what is happening.

2010/1/20 Santiago Pérez:
>
> I launch the hdfs because I want to make it work on one computer and, when
> it works, launch it on several as a distributed version.
>
> Which logs do you need to check?

--
-MilleBii-
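To watch that log while a crawl runs, something like the following works (a sketch; the NUTCH_HOME value is a hypothetical install path, and hadoop.log is the default log file name configured in Nutch 1.0's log4j.properties):

  export NUTCH_HOME=/opt/nutch-1.0   # hypothetical install location
  tail -f "$NUTCH_HOME/logs/hadoop.log"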
Redundancy issue in crawling
Hello,

I am trying to save as much memory, CPU, and bandwidth as possible, in case I have millions of urls to crawl. Here is crawl-urlfilter.txt in the conf/ directory:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.org/

# skip everything else
-.

Assume my urls/seed file contains millions of urls to fetch, crawl, generate, etc., and I don't want to go through millions of lines to review each one. Let's say I have these lines in my urls/seed file (a file with millions of urls):

http://apache.org
http://subdomain1.apache.org
http://subdomain2.apache.org
http://subdomain3.apache.org
http://subdomain4.apache.org
http://subdomain5.apache.org

Correct me if I am wrong, but if 'http://apache.org' already reaches the subdomains through '+^http://([a-z0-9]*\.)*apache.org/' in crawl-urlfilter.txt, then wouldn't those hosts be crawled more than once because of the extra subdomain lines in my urls/seed file? I don't mind one or two, but it would waste a lot of CPU, memory, and bandwidth as my seed list continues to grow.

If that is indeed the issue, has anyone thought of a way to filter out all the subdomains in his/her urls/seed file? I'm trying to find a way (maybe there is a better method) to search each line for more than one dot ".". If there is no other way, does anyone know how to find which lines have more than one "." using unix tools (vi, awk, sed)?

Thank you very much
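A sketch of the unix-tools approach asked about above, using awk (urls/seed is the path from the mail; the "more than one dot in the hostname" rule is the poster's own heuristic and misfires on multi-part TLDs such as .co.uk):

# Split a seed list into apex hosts (<= 1 dot) and likely subdomains (> 1 dot).
# With -F/ a line like http://sub.apache.org/ puts the hostname in $3.
awk -F/ '{
  host = $3
  if (!host) next                 # skip blank or malformed lines
  dots = gsub(/\./, ".", host)    # gsub returns the number of matches
  if (dots > 1) print > "seed.subdomains"
  else          print > "seed.apex"
}' urls/seed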
Re: Configuring nutch-site.xml
I launch the hdfs because I want to make it work on one computer and, when it works, launch it on several as a distributed version.

Which logs do you need to check?

MilleBii wrote:
>
> Why do you launch hdfs if you don't want to use it?
>
> What are the logs saying? All fetched urls are usually logged, but
> nothing is displayed.

--
View this message in context:
http://old.nabble.com/Configurin-nutch-site.xml-tp27245750p27248860.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Configuring nutch-site.xml
Why do you launch hdfs if you don't want to use it?

What are the logs saying? All fetched urls are usually logged, but nothing is displayed.

2010/1/20, Santiago Pérez:
>
> Hej,
>
> I am configuring Nutch just for crawling websites on several machines
> (currently I want to test with only one).
> ...
> I guess it is because nutch-site is empty. What should its content be?

--
-MilleBii-
Configuring nutch-site.xml
Hej,

I am configuring Nutch just for crawling websites on several machines (currently I want to test with only one). Building Nutch with ant was successful.

  bin/hadoop namenode -format
  bin/start-all.sh

They show correct logs.

  bin/hadoop dfs -put urls urls
  bin/hadoop dfs -ls

They show the urls directory correctly.

But when I launch the crawl, the fetcher starts but does not show any message of parsing, and it stops at the second depth. The crawl-urlfilter and nutch-default are well configured, because they work great using the local filesystem (instead of hdfs). I guess it is because nutch-site is empty.

What should its content be?

core-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000/</value>
    <description>The name of the default file system. Either the literal
    string "local" or a host:port for NDFS.</description>
  </property>
</configuration>

---

hdfs-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/root/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/root/filesystem/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

---

mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://localhost:9001/</value>
    <description>The host and port that the MapReduce job tracker runs at.
    If "local", then jobs are run in-process as a single map and reduce
    task.</description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>define mapred.map tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
    <description>define mapred.reduce tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/root/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/root/filesystem/mapreduce/local</value>
  </property>
</configuration>

--
View this message in context:
http://old.nabble.com/Configurin-nutch-site.xml-tp27245750p27245750.html
Sent from the Nutch - User mailing list archive at Nabble.com.
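For anyone debugging the same symptom, a first check is whether the crawl data actually landed on HDFS. A sketch, assuming the crawl was started with an output directory named crawl:

  bin/hadoop dfs -ls crawl/segments   # one subdirectory per generate/fetch round
  bin/hadoop dfs -ls crawl/crawldb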
Re: Nutch 1.0 slow crawls
Well, politeness can still be the problem... If, for instance, you are crawling blogs like wordpress or blogspot, they are all different urls but resolve to the same IP, so the fetcher will wait between requests.

2010/1/20, axi:
>
> There are a lot of spin-waiting threads, only 10-20 of them working, and I
> have injected 1M different hosts, so politeness is not the problem.
> ...

--
-MilleBii-
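For context, the waiting MilleBii describes is controlled by properties in nutch-default.xml that can be overridden in nutch-site.xml. A sketch of the relevant ones (the property names exist in Nutch 1.0; the values and descriptions here are illustrative, not the shipped defaults):

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between successive requests to the same
  server.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum number of threads fetching from one host at a
  time.</description>
</property>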
Re: Nutch 1.0 slow crawls
Hi,

See https://issues.apache.org/jira/browse/NUTCH-721 for an explanation. This has been fixed in the SVN version.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2010/1/20 axi:
>
> Hi to all,
> I'm a novice user of Nutch... I have tried the latest release of Nutch 1.0
> with very slow results in crawling...
> Are these issues solved in the dev version of Nutch, or why does this
> happen?
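For readers who want the fix before the next release, the SVN version means building from trunk. A sketch; the repository URL below is an assumption reflecting Nutch's location as a Lucene subproject at the time, so verify it before use:

  svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk/ nutch-trunk
  cd nutch-trunk
  ant    # builds the runtime under build/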
Nutch 1.0 slow crawls
Hi to all,

I'm a novice user of Nutch. I have it on a debian machine, and I have tried the latest release, Nutch 1.0, with very slow crawling results: I have a 10 Megabytes/s connection and it only crawls at 300 Kb/s with peaks of 1 Mb/s. I tweaked everything (dns, linux tcp settings, thread numbers, java conf, etc.) but nothing has any effect. There are a lot of spin-waiting threads, only 10-20 of them working, and I have injected 1M different hosts, so politeness is not the problem. I switched back to Nutch 0.9, and then it works like a charm at good speeds of 5-6 Mb/s with the bottleneck on machine cpu.

Are these issues solved in the dev version of Nutch, or why does this happen?

Thanks in advance,

--
View this message in context:
http://old.nabble.com/Nutch-1.0-slow-crawls-tp27243302p27243302.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: How to change url score?
Hi,

The SVN version of Nutch has new functionality in the Injector which allows you to specify the score of a URL (see https://issues.apache.org/jira/browse/NUTCH-655).

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com

2010/1/20 xiao yang:
> I'm crawling a group of web sites for some time. Now I want to add a new
> site: http://xxx.com
> ...
> How can I change the score manually so this site will be included in the
> next crawl round?
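As a concrete sketch of the NUTCH-655 feature, based on the JIRA description (check the exact metadata keys against your checkout), a seed entry can carry tab-separated name=value metadata after the URL, which the Injector applies to the new CrawlDb entry:

  # urls/seed: tab-separated metadata after the URL (nutch.score is the
  # key named in NUTCH-655; the value 10.0 is illustrative)
  http://xxx.com/	nutch.score=10.0

  # then re-inject so the CrawlDb entry picks up the score
  bin/nutch inject crawl/crawldb urls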
How to change url score?
I have been crawling a group of web sites for some time. Now I want to add a new site: http://xxx.com

Here is the process:

1. put xxx.com into a file, urls, and put it on Hadoop
2. run bin/nutch crawl urls -dir crawl -depth 5 -threads 1 -topN 1000

However, the newly added site is not crawled because its score is too low:

URL: http://xxx.com/
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sun Jan 17 14:59:08 CST 2010
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null

How can I change the score manually so this site will be included in the next crawl round?

Thanks!
Xiao
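For reference, a CrawlDb entry like the dump above can be inspected with the CrawlDb reader, which is handy for re-checking the score after changing it (a sketch; crawl/crawldb matches the -dir crawl used in step 2):

  bin/nutch readdb crawl/crawldb -url http://xxx.com/   # dump one entry
  bin/nutch readdb crawl/crawldb -stats                 # whole-db statistics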