On 8/10/2010 12:55 PM, webdev1977 wrote:
Wow.. this is very frustrating! I just downloaded and configured the 1.2
tagged version from SVN and I STILL cannot complete a file system crawl
using the nutch crawl command.
Has anyone been able to complete a crawl with the nutch crawl command
using the file: protocol? I am crawling a very, very large shared drive
(300,000+ files).
I have very little memory to work with, about 2 GB total. I am running this
as a prototype on my Win XP box.
Any ideas, based on the stack trace, what might be causing this?
-------- hadoop.log snippet --------------------------------------------------
2010-08-10 13:16:03,438 WARN mapred.LocalJobRunner - job_local_0025
java.lang.OutOfMemoryError
    at java.io.FileInputStream.readBytes(Native Method)
    at java.io.FileInputStream.read(Unknown Source)
    at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.read(RawLocalFileSystem.java:83)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:136)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at java.io.DataInputStream.read(Unknown Source)
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:149)
    at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
    at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
    at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
    at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
    at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
    at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:973)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:932)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:42)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2010-08-10 13:16:03,672 INFO mapred.JobClient - Job complete: job_local_0025
2010-08-10 13:16:03,672 INFO mapred.JobClient - Counters: 17
2010-08-10 13:16:03,672 INFO mapred.JobClient -   ParserStatus
2010-08-10 13:16:03,672 INFO mapred.JobClient -     failed=59
2010-08-10 13:16:03,672 INFO mapred.JobClient -     success=905
2010-08-10 13:16:03,672 INFO mapred.JobClient -   FileSystemCounters
2010-08-10 13:16:03,672 INFO mapred.JobClient -     FILE_BYTES_READ=19515258622
2010-08-10 13:16:03,672 INFO mapred.JobClient -     FILE_BYTES_WRITTEN=25431386296
2010-08-10 13:16:03,672 INFO mapred.JobClient -   FetcherStatus
2010-08-10 13:16:03,672 INFO mapred.JobClient -     exception=34
2010-08-10 13:16:03,672 INFO mapred.JobClient -     success=964
2010-08-10 13:16:03,672 INFO mapred.JobClient -   Map-Reduce Framework
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Reduce input groups=260
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Combine output records=0
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Map input records=1000
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Reduce shuffle bytes=0
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Reduce output records=741
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Spilled Records=5856
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Map output bytes=309514931
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Map input bytes=145708
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Combine input records=0
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Map output records=2928
2010-08-10 13:16:03,672 INFO mapred.JobClient -     Reduce input records=742
You ran out of memory; give the Java process more heap space. How much is it
getting now? Try giving it as much more as you can spare.
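For what it's worth, here is a minimal sketch of the kind of change I mean,
assuming the stock bin/nutch launcher script and a local (LocalJobRunner) run;
the heap size and crawl arguments below are placeholders, not your exact
command:

    # bin/nutch honors NUTCH_HEAPSIZE (in MB) and turns it into -Xmx for the JVM.
    # With only 2 GB of physical RAM, roughly 1200-1500 MB is likely the practical
    # ceiling before the OS starts swapping.
    export NUTCH_HEAPSIZE=1500

    # Then rerun the crawl as before (example arguments only):
    bin/nutch crawl urls -dir crawl -depth 3 -topN 50000

Note that in local mode all map and reduce tasks run inside that one JVM, so
raising mapred.child.java.opts in the Hadoop config won't help you here; that
property only matters once you move to a real (pseudo-)distributed cluster.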