I am trying to reproduce the Common Crawl infrastructure for crawling just
a *few sites* on a *weekly/nightly* basis using AWS EMR and S3. I am using
the Common Crawl fork of Nutch at https://github.com/Aloisius/nutch (cc
branch).

I use the Crawl job and pass S3 bucket paths for all inputs and outputs.
The inject and fetch steps work perfectly, but it fails at the ParseSegment
step (see the stack trace below). I have tried the s3, s3n and s3a schemes.

org.apache.nutch.crawl.Crawl s3a://some-bucket/urls -dir
s3a://some-bucket/crawl -depth 2 -topN 5
  |
  V
org.apache.nutch.parse.ParseSegment
s3a://some-bucket/crawl/segments/20151022105922

Exception in thread "main" java.lang.IllegalArgumentException: Wrong
FS: s3a://some-bucket/crawl/segments/20151022105922/crawl_parse,
expected: hdfs://ip-152-71-19-40.eu-west-1.compute.internal:8020
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:193)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:105)
        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1118)
        at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1404)
        at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:88)
        at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:564)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:224)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:258)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:231)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)


Is working only with S3 buckets fully supported by all the steps of the
crawler? Any clue as to what the problem is?
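From the stack trace, my guess is that ParseOutputFormat.checkOutputSpecs
resolves the cluster's default filesystem (HDFS) and then validates the
s3a:// output path against it, instead of resolving the filesystem from the
path itself. A simplified, self-contained sketch of that scheme check (my
reconstruction, not the actual Hadoop source):

```java
import java.net.URI;

// Simplified sketch of the scheme check that FileSystem.checkPath performs:
// a path's scheme must match the filesystem it is validated against,
// otherwise "Wrong FS" is thrown.
public class WrongFsSketch {

    static void checkPath(URI fsUri, URI path) {
        String expected = fsUri.getScheme();
        String actual = path.getScheme();
        // A path with no scheme is resolved relative to the filesystem, so
        // only an explicit, mismatching scheme fails the check.
        if (actual != null && !actual.equalsIgnoreCase(expected)) {
            throw new IllegalArgumentException(
                "Wrong FS: " + path + ", expected: " + fsUri);
        }
    }

    public static void main(String[] args) {
        // The cluster's default filesystem (what the parse step seems to use)
        URI defaultFs = URI.create(
            "hdfs://ip-152-71-19-40.eu-west-1.compute.internal:8020");
        // The segment path passed on the command line
        URI segment = URI.create(
            "s3a://some-bucket/crawl/segments/20151022105922/crawl_parse");
        try {
            checkPath(defaultFs, segment);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If that reading is right, resolving the filesystem from the path (i.e.
Path.getFileSystem(conf) in Hadoop) rather than from the default
configuration would avoid the mismatch, but I may be missing something.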

I guess I could use HDFS for the segments, but I would like to avoid that,
as I want to terminate the cluster as soon as the crawling finishes and
keep the data in S3 for future crawls.
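For what it's worth, the fallback I'm trying to avoid would look roughly
like this (the HDFS path is a placeholder; distcp is the standard tool for
copying between Hadoop filesystems):

```shell
# Sketch of the HDFS-based workaround (paths are placeholders): crawl
# against the cluster's HDFS, then move the data to S3 before terminating.
CRAWL_DIR=hdfs:///user/hadoop/crawl   # crawl dir on the EMR cluster's HDFS
S3_DIR=s3a://some-bucket/crawl        # long-term storage between crawls

if command -v hadoop >/dev/null; then
  # push finished crawl data out to S3 at the end of a run ...
  hadoop distcp "$CRAWL_DIR" "$S3_DIR"
  # ... and pull it back in before the next run:
  # hadoop distcp "$S3_DIR" "$CRAWL_DIR"
fi
```

That adds an extra copy step in both directions, which is exactly the
overhead I was hoping S3-only output would remove.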

Thank you so much,
Christian Perez-Llamas
