disk size issue? access rights?
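A quick way to rule both out, run from runtime/local where the crawl directory lives (paths as used later in this thread):

-----
df -h .                                     # free space on the partition holding crawl/
ls -ld crawl crawl/crawldb crawl/segments   # owner and write permission on the crawl data
-----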
On Thu, Feb 23, 2012 at 12:39 PM, Daniel Bourrion
<daniel.bourr...@univ-angers.fr> wrote:

> Hi Markus,
> Thanks for the help.
>
> (Hope I'm not boring everybody.)
>
> I've erased everything in crawl/
>
> Launching my Nutch again, I now get:
>
> -----
> CrawlDb update: 404 purging: false
> CrawlDb update: Merging segment data into db.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
>         at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
> -----
>
> In the logs, I got:
>
> -----
> 2012-02-23 11:25:48,803 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
> 2012-02-23 11:25:48,804 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
> 2012-02-23 11:25:49,353 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> 2012-02-23 11:25:49,560 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
> 2012-02-23 11:25:49,985 WARN mapred.LocalJobRunner - job_local_0007
> java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
>         at org.apache.hadoop.util.Shell.run(Shell.java:134)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
>         at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
>         at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
>         at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
>         at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
>         at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
>         at org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:111)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:173)
> Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
>         at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
>         at java.lang.ProcessImpl.start(ProcessImpl.java:81)
>         at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
>         ... 15 more
> -----
>
> On 23/02/2012 10:01, Markus Jelsma wrote:
>> Unfetched, unparsed or just a bad corrupt segment. Remove that segment
>> and try again.
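For reference, the "error=12, Cannot allocate memory" on "chmod" above means the JVM could not fork a child process: Hadoop's LocalJobRunner shells out to chmod, and on Linux a large JVM can fail to fork when free memory plus swap runs low. Two common workarounds, sketched for a single-machine run; the heap value is illustrative, and this assumes the 1.4 bin/nutch script honours NUTCH_HEAPSIZE (in MB):

-----
free -m                               # check available memory and swap first
sudo sysctl vm.overcommit_memory=1    # let fork() succeed without a full memory reservation
export NUTCH_HEAPSIZE=500             # start bin/nutch with a smaller JVM heap
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
-----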
>>> Many thanks, Remi.
>>>
>>> Finally, after a reboot of the computer (I sent my question just before
>>> leaving my desk), Nutch started to crawl (amazing :))) )
>>>
>>> But now, during the crawl process, I get this:
>>>
>>> -----
>>> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
>>> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
>>> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
>>> LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
>>> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642/parse_data
>>> Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459/parse_data
>>>         at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>>         at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>>         at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>>         at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>>         at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
>>>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>> -----
>>>
>>> and nothing special in the logs; the last lines are:
>>>
>>> -----
>>> 2012-02-23 09:46:42,524 INFO crawl.CrawlDb - CrawlDb update: finished at 2012-02-23 09:46:42, elapsed: 00:00:01
>>> 2012-02-23 09:46:42,590 INFO crawl.LinkDb - LinkDb: starting at 2012-02-23 09:46:42
>>> 2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
>>> 2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: URL normalize: true
>>> 2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: URL filter: true
>>> 2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093220
>>> 2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234
>>> 2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093302
>>> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609
>>> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805
>>> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532
>>> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094427
>>> 2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093618
>>> 2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094552
>>> 2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094500
>>> 2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132
>>> 2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093649
>>> 2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093210
>>> 2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
>>> 2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
>>> 2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
>>> 2012-02-23 09:46:42,597 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
>>> -----
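Those segments were fetched but never parsed, so they have no parse_data directory - the "bad corrupt segment" case Markus describes above. Assuming the unparsed segments are disposable, a minimal cleanup sketch from runtime/local is to find them, then either parse or delete them:

-----
# list segments that lack parse_data
for s in crawl/segments/*; do
  [ -d "$s/parse_data" ] || echo "$s"
done

# either parse one of them explicitly...
bin/nutch parse crawl/segments/20120222160234
# ...or remove the broken ones before re-running (their fetched data is lost)
# rm -r crawl/segments/20120222160234
-----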
>>>
>>> On 22/02/2012 16:36, remi tassing wrote:
>>>> Hey Daniel,
>>>>
>>>> You can find more output in the Hadoop log files under logs/
>>>>
>>>> Remi
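In the 1.4 binary release those files normally sit under the directory bin/nutch is run from; assuming the default log4j configuration, the main one is logs/hadoop.log, which can be watched in a second terminal while the crawl runs:

-----
tail -f logs/hadoop.log    # detailed job output, including full stack traces
-----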
>>>>
>>>> On Wednesday, February 22, 2012, Daniel Bourrion
>>>> <daniel.bourr...@univ-angers.fr> wrote:
>>>>> Hi.
>>>>> I'm a French librarian (that explains the bad English coming now... :) )
>>>>>
>>>>> I'm a newbie on Nutch, which looks like exactly what I'm searching for
>>>>> (an open-source solution that can crawl our specific domain and have its
>>>>> crawl results pushed into Solr).
>>>>>
>>>>> I've installed a test Nutch using
>>>>> http://wiki.apache.org/nutch/NutchTutorial
>>>>>
>>>>> I got an error, but I don't really understand it, nor where to look to
>>>>> correct what causes it.
>>>>>
>>>>> Here's a copy of the error messages - any help welcome.
>>>>> Best
>>>>>
>>>>> --------------------------------------------------
>>>>> daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>>>>> solrUrl is not set, indexing will be skipped...
>>>>> crawl started in: crawl
>>>>> rootUrlDir = urls
>>>>> threads = 10
>>>>> depth = 3
>>>>> solrUrl=null
>>>>> topN = 5
>>>>> Injector: starting at 2012-02-22 16:06:04
>>>>> Injector: crawlDb: crawl/crawldb
>>>>> Injector: urlDir: urls
>>>>> Injector: Converting injected urls to crawl db entries.
>>>>> Injector: Merging injected urls into crawl db.
>>>>> Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
>>>>> Generator: starting at 2012-02-22 16:06:06
>>>>> Generator: Selecting best-scoring urls due for fetch.
>>>>> Generator: filtering: true
>>>>> Generator: normalizing: true
>>>>> Generator: topN: 5
>>>>> Generator: jobtracker is 'local', generating exactly one partition.
>>>>> Generator: Partitioning selected urls for politeness.
>>>>> Generator: segment: crawl/segments/20120222160609
>>>>> Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
>>>>> Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
>>>>> Fetcher: starting at 2012-02-22 16:06:10
>>>>> Fetcher: segment: crawl/segments/20120222160609
>>>>> Using queue mode : byHost
>>>>> Fetcher: threads: 10
>>>>> Fetcher: time-out divisor: 2
>>>>> QueueFeeder finished: total 2 records + hit by time limit :0
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> fetching http://bu.univ-angers.fr/
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> fetching http://www.face-ecran.fr/
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Using queue mode : byHost
>>>>> Using queue mode : byHost
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> -finishing thread FetcherThread, activeThreads=2
>>>>> Fetcher: throughput threshold: -1
>>>>> Fetcher: throughput threshold retries: 5
>>>>> -activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
>>>>> -finishing thread FetcherThread, activeThreads=1
>>>>> -finishing thread FetcherThread, activeThreads=0
>>>>> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
>>>>> -activeThreads=0
>>>>> Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
>>>>> ParseSegment: starting at 2012-02-22 16:06:13
>>>>> ParseSegment: segment: crawl/segments/20120222160609
>>>>> Parsing: http://bu.univ-angers.fr/
>>>>> Parsing: http://www.face-ecran.fr/
>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>>>>>         at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
>>>>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>> --------------------------------------------------
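One thing worth fixing regardless: the Fetcher warning above means the value of http.agent.name should also be listed first in http.robots.agents. A sketch of the two conf/nutch-site.xml properties, with a hypothetical agent name:

-----
<property>
  <name>http.agent.name</name>
  <value>angersTestCrawler</value>  <!-- hypothetical name; must not be empty -->
</property>
<property>
  <name>http.robots.agents</name>
  <value>angersTestCrawler,*</value>  <!-- same name first, then the wildcard -->
</property>
-----

The root cause of the parse failure itself should appear in logs/hadoop.log, as Remi notes above.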
> --
> With kindest regards.
> __
>
> Daniel Bourrion, library curator
> Head of the digital library
> Direct line: 02.44.68.80.50
> SCD Université d'Angers - http://bu.univ-angers.fr
> BU Saint-Serge - 57 Quai Félix Faure - 49100 Angers cedex
>
> *************************************
> "And by the power of a word
> I begin my life again"
>
> Paul Eluard
> *************************************
> personal blog: http://archives.face-ecran.fr/
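When the all-in-one crawl command fails like this, running the same steps one at a time (as the NutchTutorial also describes) makes it much easier to see which job actually dies. A sketch using the values from the session above:

-----
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments -topN 5
s=`ls -d crawl/segments/* | tail -1`    # the segment just generated
bin/nutch fetch $s
bin/nutch parse $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
-----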