File not found error
Using Nutch 1.7. Out of the blue, all of my crawl jobs started failing a few days ago. I checked the user logs and nobody logged into the server, and there were no reboots or any other obvious issues. There is plenty of disk space. Here is the error I'm getting; any help is appreciated:

Injector: starting at 2014-06-24 07:26:54
Injector: crawlDb: di/crawl/crawldb
Injector: urlDir: di/urls
Injector: Converting injected urls to crawl db entries.
Injector: ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
    at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Injector.run(Injector.java:318)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:308)
reg crawled pages with status=2
Hi, our requirement is that Nutch should not recrawl pages that have already been crawled, i.e., crawling should not happen for the web pages with status '2' in the webpage table. It should not recrawl them, and it should not add their outlinks either. Can you please let me know whether this is possible by changing some configuration parameters in nutch-site.xml? Thanks and Regards, Deepa
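For what it's worth, two properties in nutch-site.xml are commonly combined for this kind of requirement. This is only a sketch, not a confirmed answer for this setup: the values are illustrative, and the behavior should be verified against your Nutch 2.x version (status 2 corresponds to DB_FETCHED in the webpage table).

```xml
<!-- Sketch of nutch-site.xml settings often suggested for this use case. -->

<!-- Push the refetch interval far into the future so pages already
     fetched (status 2, DB_FETCHED) are not selected by the generator again. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>315360000</value> <!-- seconds; roughly ten years -->
</property>

<!-- Stop updatedb from adding newly discovered outlinks to the table. -->
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
</property>
```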
Re: updatedb deletes all metadata except _csh_
Hi Alex, I am really sorry for not making the connection here. On Tue, Jun 24, 2014 at 12:31 AM, user-digest-h...@nutch.apache.org wrote: So far, this looks like a bug in updatedb when filtering with batchId. I could only find one solution: check whether new pages are already in the datastore and, if they are, skip them. Otherwise, updatedb with the -all option will also work. https://issues.apache.org/jira/browse/NUTCH-1679 If you can run with this patch, then please post your results here.
Re: updatedb deletes all metadata except _csh_
Hi, I had already come up with similar changes to the code as in this patch. My only suggestion on the patch's code is to move the check for whether the url exists in the datastore under if (!additionsAllowed) { return; }, and to close the datastore. Thanks. Alex.
Re: File not found error
You might want to check whether the urlDir (di/urls) still exists in your HDFS.

-- Kaveh Minooie
Re: File not found error
Well, I'm just using Nutch in local mode, no HDFS (as far as I know). My latest thought is that there may be a filesystem issue; it's not really clear what file is not found. I have about 10 different configs, this is just one of them, and they all have the urls folder. The script worked for quite a while before this just started happening on its own. That's why I'm suspecting a filesystem error.

On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie ka...@plutoz.com wrote: You might want to check whether the urlDir (di/urls) still exists in your HDFS.
Re: File not found error
Okay, I got it working again. Not sure exactly what happened, but fsck didn't help. I noticed the last line of the trace showed a native method, so I moved the native binaries out of the /lib folder. Lo and behold, the next time I ran it, it used the Java libs and displayed the filename it was having a problem with: /tmp/hadoop-root/mapred/staging/root850517656/.staging. Given that, I just moved the /tmp/hadoop-root directory aside, and then it started working again. Permissions looked fine, so it might have just been corrupt. Thanks for the help!
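For anyone else who hits this ENOENT from the local job staging directory: the workaround described above (moving the stale staging directory aside rather than deleting it, so Hadoop recreates it on the next job) can be sketched as a small shell function. The /tmp/hadoop-<user> default path is an assumption based on local-mode Hadoop defaults; adjust it if hadoop.tmp.dir is overridden in your config.

```shell
#!/bin/sh
# move_aside_staging: move a possibly corrupt local job staging dir out of
# the way instead of deleting it. Hadoop recreates it on the next job run.
# Prints the backup path on success, or "absent" if there was nothing to move.
move_aside_staging() {
    staging_root="$1"
    if [ -d "$staging_root" ]; then
        backup="${staging_root}.bak.$(date +%s)"
        mv "$staging_root" "$backup" && echo "$backup"
    else
        echo "absent"
    fi
}

# Assumed default local-mode location: /tmp/hadoop-<user> (from hadoop.tmp.dir).
move_aside_staging "/tmp/hadoop-$(whoami)"
```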
Re: Please share your experience of using Nutch in production
On 23 June 2014 01:44, Meraj A. Khan mera...@gmail.com wrote: Gora, thanks for sharing your admin perspective. Rest assured, I am not trying to circumvent any politeness requirements in any way; as I mentioned earlier, I am within the crawl-delay limits set by the webmasters, if any. However, you have confirmed my hunch that I might have to reach out to individual webmasters to try and convince them not to block my IP address. [...]

If you are taking the reasonable precautions that you mentioned earlier, there is no reason that you should be getting banned by webmasters. Unless a crawler is actually causing issues for the site's performance, it might not even come to the attention of the webmaster at all.

By being at a disadvantage, I meant at a disadvantage compared to major players like the Google, Bing, and Yahoo bots, which the webmasters probably would not block; and by Nutch variant, I meant an instance of a customized crawler based on Nutch.

People are unlikely to ban Google et al., as there are clear benefits to having them search one's site. If you would like special privileges, such as being able to hit the site hard, you will have to convince the webmaster that your crawler also brings some such benefit to them. Regards, Gora