Disk size issue? Access rights?
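
A quick sketch to rule both out (a suggestion only; paths assume the
runtime/local layout from the tutorial):

df -h .                      # is the partition full?
ls -ld crawl crawl/crawldb   # writable by the user running bin/nutch?
free -m                      # enough free memory/swap left to fork processes?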

On Thu, Feb 23, 2012 at 12:39 PM, Daniel Bourrion
<daniel.bourr...@univ-angers.fr> wrote:

> Hi Markus,
> Thanks for the help.
>
> (I hope I'm not boring everybody.)
>
> I've erased everything in crawl/.
>
> Launching Nutch again, I now get:
>
> -----
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
>
> Exception in thread "main" java.io.IOException: Job failed!
>    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
>    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
>    at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
>    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>
> -----
>
>
> In the logs, I got:
>
> ____
>
>
> 2012-02-23 11:25:48,803 INFO  crawl.CrawlDb - CrawlDb update: 404 purging: false
> 2012-02-23 11:25:48,804 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2012-02-23 11:25:49,353 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> 2012-02-23 11:25:49,560 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> 2012-02-23 11:25:49,985 WARN  mapred.LocalJobRunner - job_local_0007
> java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
>    at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
>    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>    at org.apache.hadoop.util.Shell.run(Shell.java:134)
>    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>    at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
>    at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
>    at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
>    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
>    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
>    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
>    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
>    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
>    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
>    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
>    at org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:111)
>    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:173)
> Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
>    at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
>    at java.lang.ProcessImpl.start(ProcessImpl.java:81)
>    at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
>    ... 15 more
> _____
>
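> For reference: error=12 when spawning "chmod" usually means the kernel
> refused the memory needed to fork a child of the already-large JVM. A
> sketch of things to check, assuming a Linux box (the heap value below
> is illustrative):
>
> free -m                       # how much memory/swap is actually free?
> sysctl vm.overcommit_memory   # strict overcommit makes big-JVM forks fail
> export NUTCH_HEAPSIZE=500     # smaller JVM heap in MB, read by bin/nutch
> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>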
>
>
> On 23/02/2012 10:01, Markus Jelsma wrote:
>
>> Unfetched, unparsed, or just a bad, corrupt segment. Remove that
>> segment and try again; a sketch follows.
>>
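>> (A sketch only: 20120222160234 is one of the segments named in the
>> error; repeat for each segment missing parse_data.)
>>
>> # drop the broken segment, then rerun the crawl
>> rm -r crawl/segments/20120222160234
>>
>> # or, if the fetch succeeded and only parsing was skipped, reparsing
>> # the segment may be enough
>> bin/nutch parse crawl/segments/20120222160234
>>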
>>  Many thanks Remi.
>>>
>>> Finally, after a reboot of the computer (I sent my question just before
>>> leaving my desk), Nutch started to crawl (amazing :)))
>>>
>>> But now, during the crawl process, I get this:
>>>
>>> -----
>>>
>>> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
>>> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
>>> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
>>> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
>>> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459/parse_data
>>>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>
>>> -----
>>>
>>> and nothing special in the logs; the last lines are:
>>>
>>>
>>> 2012-02-23 09:46:42,524 INFO  crawl.CrawlDb - CrawlDb update: finished
>>> at 2012-02-23 09:46:42, elapsed: 00:00:01
>>> 2012-02-23 09:46:42,590 INFO  crawl.LinkDb - LinkDb: starting at
>>> 2012-02-23 09:46:42
>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL normalize: true
>>> 2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL filter: true
>>> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093220
>>> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234
>>> 2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093302
>>> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609
>>> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805
>>> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532
>>> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094427
>>> 2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093618
>>> 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094552
>>> 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094500
>>> 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132
>>> 2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093649
>>> 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093210
>>> 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
>>> 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
>>> 2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
>>> 2012-02-23 09:46:42,597 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
>>>
>>> On 22/02/2012 16:36, remi tassing wrote:
>>>
>>>> Hey Daniel,
>>>>
>>>> You can find more output in the logs/hadoop.log file.
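>>>>
>>>> For instance (a sketch, assuming you run from runtime/local as in
>>>> the tutorial):
>>>>
>>>> tail -n 50 logs/hadoop.log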
>>>>
>>>> Remi
>>>>
>>>> On Wednesday, February 22, 2012, Daniel Bourrion
>>>> <daniel.bourr...@univ-angers.fr> wrote:
>>>>
>>>>> Hi.
>>>>> I'm a French librarian (that explains the bad English coming now... :))
>>>>>
>>>>> I'm new to Nutch, which looks like exactly what I'm searching for: an
>>>>> open-source solution that can crawl our specific domain and push its
>>>>> crawl results into Solr.
>>>>>
>>>>> I've installed a test Nutch following
>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>
>>>>> I got an error, but I don't really understand it, nor where to start
>>>>> correcting whatever causes it.
>>>>>
>>>>> Here's a copy of the error messages - any help welcome.
>>>>> Best
>>>>>
>>>>> --------------------------------------------------
>>>>> daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$
>>>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>>> solrUrl is not set, indexing will be skipped...
>>>>> crawl started in: crawl
>>>>> rootUrlDir = urls
>>>>> threads = 10
>>>>> depth = 3
>>>>> solrUrl=null
>>>>> topN = 5
>>>>> Injector: starting at 2012-02-22 16:06:04
>>>>> Injector: crawlDb: crawl/crawldb
>>>>> Injector: urlDir: urls
>>>>> Injector: Converting injected urls to crawl db entries.
>>>>> Injector: Merging injected urls into crawl db.
>>>>> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
>>>>> Generator: starting at 2012-02-22 16:06:06
>>>>> Generator: Selecting best-scoring urls due for fetch.
>>>>> Generator: filtering: true
>>>>> Generator: normalizing: true
>>>>> Generator: topN: 5
>>>>> Generator: jobtracker is 'local', generating exactly one partition.
>>>>> Generator: Partitioning selected urls for politeness.
>>>>> Generator: segment: crawl/segments/20120222160609
>>>>> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
>>>>> Fetcher: Your 'http.agent.name' value should be listed first in
>>>>> 'http.robots.agents' property.
>>>>> Fetcher: starting at 2012-02-22 16:06:10
>>>>> Fetcher: segment: crawl/segments/20120222160609
>>>>> Using queue mode : byHost
>>>>> Fetcher: threads: 10
>>>>> Fetcher: time-out divisor: 2
>>>>> QueueFeeder finished: total 2 records + hit by time limit :0
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> fetching http://bu.univ-angers.fr/
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> fetching http://www.face-ecran.fr/
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Fetcher: throughput threshold: -1
>>>>> Fetcher: throughput threshold retries: 5
>>>>> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
>>>>> -finishing thread FetcherThread, activeThreads=1
>>>>> -finishing thread FetcherThread, activeThreads=0
>>>>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>>>>> -activeThreads=0
>>>>> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
>>>>> ParseSegment: starting at 2012-02-22 16:06:13
>>>>> ParseSegment: segment: crawl/segments/20120222160609
>>>>> Parsing: http://bu.univ-angers.fr/
>>>>> Parsing: http://www.face-ecran.fr/
>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>
>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>>
>>>>> --------------------------------------------------
>>>>>
>>>>
> --
> With my most cordial greetings.
> __
>
> Daniel Bourrion, library curator
> Head of the digital library
> Direct line: 02.44.68.80.50
> SCD Université d'Angers - http://bu.univ-angers.fr
> Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
>
> *************************************
> "And by the power of a word
> I begin my life again"
>                       Paul Eluard
> *************************************
> personal blog: http://archives.face-ecran.fr/
>
