File not found error

2014-06-24 Thread John Lafitte
Using Nutch 1.7

Out of the blue all of my crawl jobs started failing a few days ago.  I
checked the user logs and nobody logged into the server and there were no
reboots or any other obvious issues.  There is plenty of disk space.  Here
is the error I'm getting, any help is appreciated:

Injector: starting at 2014-06-24 07:26:54
Injector: crawlDb: di/crawl/crawldb
Injector: urlDir: di/urls
Injector: Converting injected urls to crawl db entries.
Injector: ENOENT: No such file or directory
    at org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method)
    at org.apache.hadoop.fs.FileUtil.execSetPermission(FileUtil.java:701)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:656)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:514)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:349)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:193)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:942)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
    at org.apache.nutch.crawl.Injector.run(Injector.java:318)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:308)


Regarding crawled pages with status=2

2014-06-24 Thread Deepa Jayaveer
Hi,
Our requirement is that Nutch should not recrawl pages that have already
been crawled, i.e., crawling should not happen for web pages whose status
in the webpage table is '2' (fetched). It should neither refetch those
pages nor add their outlinks.

Can you please let me know whether this is possible by changing some
configuration parameters in nutch-site.xml?

Thanks and Regards
Deepa
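
A possible approach, not confirmed in this thread: Nutch ships two stock
properties that come close to this requirement. db.fetch.interval.default
controls how long after a successful fetch a page becomes due again, and
db.update.additions.allowed controls whether updatedb adds newly discovered
outlinks. A sketch for nutch-site.xml, with an illustrative ten-year
interval:

  <property>
    <name>db.fetch.interval.default</name>
    <value>315360000</value>
    <description>Seconds before a fetched page (status 2) is due for
    refetch; a very large value effectively stops recrawling.</description>
  </property>
  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
    <description>If false, updatedb only updates rows that already exist
    and does not add newly discovered outlinks.</description>
  </property>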




Re: updatedb deletes all metadata except _csh_

2014-06-24 Thread Lewis John Mcgibbney
Hi Alex,

I am really sorry for not making the connection here.

On Tue, Jun 24, 2014 at 12:31 AM, user-digest-h...@nutch.apache.org wrote:


 So far, this looks like a bug in updatedb when filtering with batchId.

 I could only find one solution: check whether the new pages are already in
 the datastore and, if they are, skip them.
 Otherwise updatedb with option -all will also work.


https://issues.apache.org/jira/browse/NUTCH-1679

If you can run with this patch, then please post your results here.
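
If it helps, applying a JIRA patch to a source checkout usually looks
roughly like this (the attachment filename below is a guess; check the
issue page for the real name):

  cd apache-nutch-2.x              # your Nutch source checkout
  patch -p0 < NUTCH-1679.patch     # or: git apply NUTCH-1679.patch
  ant runtime                      # rebuild runtime/ with the patch applied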


Re: updatedb deletes all metadata except _csh_

2014-06-24 Thread alxsss
Hi,


I had already come up with changes similar to this patch's code. My only
suggestion for the patch is to move the check for whether the URL already
exists in the datastore below the

  if (!additionsAllowed) {
    return;
  }

guard, and to close the datastore afterwards.


Thanks.
Alex.
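
For reference, a minimal sketch of the ordering being suggested here; the
names (STORE, pageFromBatch, additionsAllowed) are stand-ins, not the
actual NUTCH-1679 patch or the Gora API:

  import java.util.HashMap;
  import java.util.Map;

  /** Sketch: do the cheap additionsAllowed guard first, then the datastore
   *  lookup, and close the datastore only once, when the job is done. */
  public class UpdateOrderSketch {

    // Minimal stand-in for a datastore keyed by URL.
    static final Map<String, String> STORE = new HashMap<>();

    static void update(String url, String pageFromBatch,
                       boolean additionsAllowed) {
      if (pageFromBatch == null) {      // URL was not part of this batch
        if (!additionsAllowed) {
          return;                       // cheap guard first: no lookup needed
        }
        if (STORE.containsKey(url)) {   // datastore lookup after the guard
          return;                       // already known: keep its metadata
        }
      }
      STORE.put(url, pageFromBatch);    // create or update the row
    }

    public static void main(String[] args) {
      update("http://example.org/", null, false);  // skipped by the guard
      // In the real reducer the datastore would be closed in cleanup().
    }
  }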


Re: File not found error

2014-06-24 Thread kaveh minooie

you might want to check to see if

 Injector: urlDir: di/urls

still exists in your HDFS.
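
A quick way to check, depending on how you run Nutch (both commands assume
the di/urls path from the log):

  hadoop fs -ls di/urls    # deployed mode: is the directory still in HDFS?
  ls -ld di/urls           # local mode: the same check on the local filesystem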



On 06/24/2014 12:30 AM, John Lafitte wrote:

[quoted original message and stack trace snipped]



--
Kaveh Minooie


Re: File not found error

2014-06-24 Thread John Lafitte
Well, I'm just using Nutch in local mode, no HDFS (as far as I know)...  My
latest thought is that there may be a filesystem issue.  It's not really
clear which file is not found.  I have about 10 different configs; this is
just one of them, and they all have the urls folder.  The script worked for
quite a while before this just started happening on its own.  That's why
I'm suspecting a filesystem error.
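
One way to surface the exact path behind an ENOENT like this, assuming a
Linux box with strace available (the inject arguments are just the paths
from the log above):

  strace -f -e trace=file bin/nutch inject di/crawl/crawldb di/urls 2>&1 | grep ENOENT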


On Tue, Jun 24, 2014 at 6:53 PM, kaveh minooie ka...@plutoz.com wrote:

 you might want to check to see if

  Injector: urlDir: di/urls

 still exists in your HDFS.




 [quoted original message and stack trace snipped]


 --
 Kaveh Minooie



Re: File not found error

2014-06-24 Thread John Lafitte
Okay, I got it working again.  Not sure exactly what happened, but fsck
didn't help.  I noticed the last line of the trace showed a native method,
so I moved the native binaries out of the /lib folder.  Lo and behold, the
next time I ran it, it used the Java libs and displayed the filename it was
having a problem with: /tmp/hadoop-root/mapred/staging/root850517656/.staging.
Given that, I just moved the /tmp/hadoop-root directory aside, and then it
started working again.  Permissions looked fine, so it might have just been
corrupt.

Thanks for the help!
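
In command form, the fix described above amounts to something like this
(paths as reported in this thread; hadoop.tmp.dir may point elsewhere on
other installs):

  mv /tmp/hadoop-root /tmp/hadoop-root.bak   # move the suspect staging tree aside
  # the next Nutch run recreates /tmp/hadoop-root/mapred/staging from scratch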


On Tue, Jun 24, 2014 at 9:03 PM, John Lafitte jlafi...@brandextract.com
wrote:

 [quoted earlier messages and stack trace snipped]





Re: Please share your experience of using Nutch in production

2014-06-24 Thread Gora Mohanty
On 23 June 2014 01:44, Meraj A. Khan mera...@gmail.com wrote:
 Gora,

 Thanks for sharing your admin perspective.  Rest assured, I am not trying
 to circumvent any politeness requirements in any way.  As I mentioned
 earlier, I stay within the crawl-delay limits set by the webmasters, if
 any.  However, you have confirmed my hunch that I might have to reach out
 to individual webmasters to try and convince them not to block my IP
 address.
[...]

If you are taking the reasonable precautions that you mentioned earlier,
there is no reason that you should be getting banned by webmasters.  Unless
a crawler is actually causing performance problems for a site, it may not
even come to the webmaster's attention at all.

 By being at a disadvantage, I meant at a disadvantage compared to major
 players like the Google, Bing, and Yahoo bots, which webmasters probably
 would not block, and by Nutch variant, I meant an instance of a
 customized crawler based on Nutch.

People are unlikely to ban Google et al., as there are clear benefits to
having them index one's site.  If you would like special privileges, such
as being able to hit a site hard, you will have to convince the webmaster
that your crawler also brings them some such benefit.

Regards,
Gora