Wow, crawling works much better now that I've replaced OpenJDK with the Sun Java 6 JDK (I'm on Ubuntu).
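For reference, a sketch of how one might make that switch on Ubuntu; the sun-java6-jdk package name and its availability in your repositories are assumptions, so adapt to your release:

-----
# Hypothetical recipe for swapping OpenJDK for the Sun JDK on Ubuntu.
sudo apt-get install sun-java6-jdk

# Make the Sun JVM the system default that Nutch will pick up via `java`.
sudo update-alternatives --config java
sudo update-alternatives --config javac

# Verify which JVM is now active before re-running the crawl.
java -version
-----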

Thanks
D

On 23/02/2012 11:47, remi tassing wrote:
Disk space issue?
Access rights?
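For completeness, both of those are quick to rule out from the Nutch runtime directory; a rough sketch (paths assumed from Daniel's setup):

-----
# Enough free space on the partition holding crawl/?
df -h .

# Are crawl/ and its segments owned and writable by the user running bin/nutch?
ls -ld crawl crawl/segments
-----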

On Thu, Feb 23, 2012 at 12:39 PM, Daniel Bourrion <daniel.bourr...@univ-angers.fr> wrote:

Hi Markus,
Thanks for the help.

(I hope I'm not boring everybody.)

I've erased everything in crawl/.

Launching Nutch again, I now get:

-----
CrawlDb update: 404 purging: false
CrawlDb update: Merging segment data into db.

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
-----


In the logs, I get:

-----
2012-02-23 11:25:48,803 INFO  crawl.CrawlDb - CrawlDb update: 404 purging: false
2012-02-23 11:25:48,804 INFO  crawl.CrawlDb - CrawlDb update: Merging segment data into db.
2012-02-23 11:25:49,353 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2012-02-23 11:25:49,560 INFO  regex.RegexURLNormalizer - can't find rules for scope 'crawldb', using default
2012-02-23 11:25:49,985 WARN  mapred.LocalJobRunner - job_local_0007
java.io.IOException: Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:354)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:337)
    at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:481)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:473)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:280)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:372)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:364)
    at org.apache.hadoop.mapred.MapTask.localizeConfiguration(MapTask.java:111)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:173)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
    at java.lang.ProcessImpl.start(ProcessImpl.java:81)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
    ... 15 more
-----
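For what it's worth, error=12 (ENOMEM) here usually means the JVM could not fork the external "chmod" command: the fork briefly needs as much address space as the whole JVM, and the kernel refuses to allocate it. A hedged sketch of the usual workarounds, assuming the stock 1.x bin/nutch script, which reads NUTCH_HEAPSIZE in megabytes (verify against your copy):

-----
# How much memory and swap are actually free?
free -m

# Option 1: shrink the local Nutch/Hadoop JVM heap so fork() has headroom.
export NUTCH_HEAPSIZE=512
bin/nutch crawl urls -dir crawl -depth 3 -topN 5

# Option 2 (system-wide, use with care): allow the kernel to overcommit memory,
# so the momentary duplication during fork() is permitted.
sudo sysctl vm.overcommit_memory=1
-----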



On 23/02/2012 10:01, Markus Jelsma wrote:

Unfetched, unparsed, or just a bad/corrupt segment. Remove that segment and try again.
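One way to act on that advice is a small loop that drops every segment missing its parse_data directory before re-running the crawl/linkdb step; a minimal sketch, with the local layout assumed from the error output quoted below:

-----
cd ~/Bureau/apache-nutch-1.4-bin/runtime/local
for seg in crawl/segments/*; do
  if [ ! -d "$seg/parse_data" ]; then
    echo "removing incomplete segment $seg"
    rm -r "$seg"
  fi
done
-----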

Many thanks, Remi.
Finally, after a reboot of the computer (I sent my question just before leaving my desk), Nutch started to crawl (amazing :))) )

But now, during the crawl process, I get this:

-----

LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642/parse_data
Input path does not exist: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

-----

and nothing special in the logs; the last lines are:


2012-02-23 09:46:42,524 INFO  crawl.CrawlDb - CrawlDb update: finished at 2012-02-23 09:46:42, elapsed: 00:00:01
2012-02-23 09:46:42,590 INFO  crawl.LinkDb - LinkDb: starting at 2012-02-23 09:46:42
2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: linkdb: crawl/linkdb
2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2012-02-23 09:46:42,591 INFO  crawl.LinkDb - LinkDb: URL filter: true
2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093220
2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160234
2012-02-23 09:46:42,593 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093302
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160609
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153805
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222155532
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094427
2012-02-23 09:46:42,594 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093618
2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094552
2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223094500
2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222160132
2012-02-23 09:46:42,595 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093649
2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093210
2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222161934
2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120223093525
2012-02-23 09:46:42,596 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222153642
2012-02-23 09:46:42,597 INFO  crawl.LinkDb - LinkDb: adding segment: file:/home/daniel/Bureau/apache-nutch-1.4-bin/runtime/local/crawl/segments/20120222154459

On 22/02/2012 16:36, remi tassing wrote:

Hey Daniel,

You can find more log output in the logs/hadoop files.
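In other words, something like this (path assumed from the 1.4 binary layout used here):

-----
tail -f ~/Bureau/apache-nutch-1.4-bin/runtime/local/logs/hadoop.log
-----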

Remi

On Wednesday, February 22, 2012, Daniel Bourrion <daniel.bourr...@univ-angers.fr> wrote:

Hi.
I'm a French librarian (which explains the bad English coming now... :) )

I'm a newbie on Nutch, which looks like exactly what I'm searching for (an open-source solution that can crawl our specific domain and push its crawl results into Solr).

I've installed a test Nutch using http://wiki.apache.org/nutch/NutchTutorial


I got an error, but I don't really understand it or know where to start correcting what causes it.

Here's a copy of the error messages; any help is welcome.
Best

--------------------------------------------------
daniel@daniel-linux:~/Bureau/apache-nutch-1.4-bin/runtime/local$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5

  solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-02-22 16:06:04
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2012-02-22 16:06:06, elapsed: 00:00:02
Generator: starting at 2012-02-22 16:06:06
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20120222160609
Generator: finished at 2012-02-22 16:06:10, elapsed: 00:00:03
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

  Fetcher: starting at 2012-02-22 16:06:10
Fetcher: segment: crawl/segments/20120222160609
Using queue mode : byHost
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://bu.univ-angers.fr/
Using queue mode : byHost
Using queue mode : byHost
fetching http://www.face-ecran.fr/
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Using queue mode : byHost
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2012-02-22 16:06:13, elapsed: 00:00:03
ParseSegment: starting at 2012-02-22 16:06:13
ParseSegment: segment: crawl/segments/20120222160609
Parsing: http://bu.univ-angers.fr/
Parsing: http://www.face-ecran.fr/
Exception in thread "main" java.io.IOException: Job failed!

     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:138)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

--------------------------------------------------
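Two of the messages in that output are worth fixing regardless of the crash: "solrUrl is not set, indexing will be skipped" and the http.agent.name warning. A hedged sketch of both fixes for the 1.4 crawl command; the agent value and the local Solr URL below are placeholders:

-----
# 1) In conf/nutch-site.xml, inside <configuration>, declare an agent name
#    (and, per the warning, list the same name first in http.robots.agents):
#      <property>
#        <name>http.agent.name</name>
#        <value>MyTestCrawler</value>
#      </property>
#
# 2) Pass a Solr URL so the crawl results get indexed instead of skipped.
bin/nutch crawl urls -dir crawl -depth 3 -topN 5 -solr http://localhost:8983/solr/
-----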

--
Best regards.
__

Daniel Bourrion, library curator
Head of the digital library
Direct line: 02.44.68.80.50
SCD Université d'Angers - http://bu.univ-angers.fr
BU Saint Serge - 57 Quai Félix Faure - 49100 Angers cedex

*************************************
"And by the power of a word
I begin my life again"

                        Paul Eluard

*************************************
Personal blog: http://archives.face-ecran.fr/
