Hi Markus,
Thanks for your help.
(I hope I'm not boring everybody.)
I've erased everything in crawl/.
Launching Nutch again, I now get:
-----
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
-----
In the logs, I get:
____
2012-02-23 11:25:48,803 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
2012-02-23 11:25:48,804 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2012-02-23 11:25:49,353 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2012-02-23 11:25:49,560 INFO regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2012-02-23 11:25:49,985 WARN mapred.LocalJobRunner - job_local_0007
java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
at org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:111)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:173)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
at java.lang.ProcessImpl.start(ProcessImpl.java:81)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
... 15 more
_____
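The telling part is the "Caused by" line: error=12 (ENOMEM) means the JVM could not even fork the small "chmod" helper process, i.e. the machine ran out of memory during the local job, rather than anything being wrong in the crawl data itself. Two things worth trying, sketched below for a local single-machine run (NUTCH_HEAPSIZE is read by the bin/nutch script and is given in MB; the overcommit sysctl is a general Linux workaround, not a Nutch setting, and 512 is only an example value):
-----
# Check how much memory is free; fork() of "chmod" fails with error=12
# when there is nothing left for the child process.
free -m

# Sketch 1: run the crawl with a smaller heap (value in MB is illustrative).
NUTCH_HEAPSIZE=512 bin/nutch crawl urls -dir crawl -depth 3 -topN 5

# Sketch 2 (system-wide, needs root): allow memory overcommit so fork()
# can succeed even with a large JVM heap.
sudo sysctl vm.overcommit_memory=1
-----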
On 23/02/2012 10:01, Markus Jelsma wrote:
Unfetched, unparsed or just a bad corrupt segment. Remove that segment and try
again.
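In this case "that segment" means any segment directory under crawl/segments/ that has no parse_data (the ones the LinkDb step lists below as "Input path does not exist"). A rough sketch of finding and removing them by hand; the segment timestamp in the last command is only an example:
-----
cd ~/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments

# List segments that were fetched but never parsed (no parse_data directory).
for seg in */; do
  [ -d "$seg/parse_data" ] || echo "incomplete segment: $seg"
done

# Remove an incomplete segment before re-running linkdb/indexing
# (timestamp is illustrative).
rm -r 20120222160234
-----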
Many thanks Remi.
Finally, after a reboot of the computer (I sent my question just before leaving my desk), Nutch started to crawl (amazing :))) ).
But now, during the crawl process, I get this:
-----
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
-----
and nothing special in the logs; the last lines are:
2012-02-23 09:46:42,524 INFO crawl.CrawlDb - CrawlDb update: finished at 2012-02-23 09:46:42, elapsed: 00:00:01
2012-02-23 09:46:42,590 INFO crawl.LinkDb - LinkDb: starting at 2012-02-23 09:46:42
2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: URL normalize: true
2012-02-23 09:46:42,591 INFO crawl.LinkDb - LinkDb: URL filter: true
2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093220
2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234
2012-02-23 09:46:42,593 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093302
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094427
2012-02-23 09:46:42,594 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093618
2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094552
2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094500
2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132
2012-02-23 09:46:42,595 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093649
2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093210
2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
2012-02-23 09:46:42,596 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
2012-02-23 09:46:42,597 INFO crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
On 22/02/2012 16:36, remi tassing wrote:
Hey Daniel,
You can find more output in the Hadoop log files under logs/.
Remi
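With the 1.4 binary distribution run in local mode, the detailed output (including the full stack traces the console swallows) should end up in logs/hadoop.log under runtime/local, assuming the default log4j configuration, e.g.:
-----
cd ~/Bureau/apache-nutch-1.4-bin/runtime/local
tail -f logs/hadoop.log
-----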
On Wednesday, February 22, 2012, Daniel Bourrion<
[email protected]> wrote:
Hi.
I'm a French librarian (which explains the bad English coming now... :) )
I'm a newbie with Nutch, which looks like exactly what I'm searching for (an open-source solution that can crawl our specific domain and push its crawl results into Solr).
I've installed a test Nutch following
http://wiki.apache.org/nutch/NutchTutorial
I got an error, but I don't really understand it, nor where to look to correct what causes it.
Here's a copy of the error messages - any help welcome.
Best
--------------------------------------------------
daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-02-22 16:06:04
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
Generator: starting at 2012-02-22 16:06:06
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120222160609
Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in
'http.robots.agents' property.
Fetcher: starting at 2012-02-22 16:06:10
Fetcher: segment: crawl/segments/20120222160609
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://bu.univ-angers.fr/
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.face-ecran.fr/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
ParseSegment: starting at 2012-02-22 16:06:13
ParseSegment: segment: crawl/segments/20120222160609
Parsing: http://bu.univ-angers.fr/
Parsing: http://www.face-ecran.fr/
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
--------------------------------------------------
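Two notes on that output, independent of the crash: "solrUrl is not set" means nothing will be sent to Solr even when the crawl succeeds, and the Fetcher warning about 'http.agent.name' relates to the agent properties configured in conf/nutch-site.xml. When a Solr instance is available, the same crawl command can point at it; a sketch, with the URL being only the usual local default:
-----
# -solr tells the 1.4 crawl command where to push the index;
# http://localhost:8983/solr/ is just an example local Solr URL.
bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3 -topN 5
-----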
--
With my best regards.
__
Daniel Bourrion, library curator
Head of the digital library
Direct line: 02.44.68.80.50
SCD Université d'Angers - http://bu.univ-angers.fr
Bu Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex
***********************************
" And by the power of a word
I begin my life again "
Paul Eluard
***********************************
personal blog: http://archives.face-ecran.fr/