Re: How does generate work ?

2009-12-03 Thread Andrzej Bialecki

MilleBii wrote:

Oops continuing previous mail.

So I wonder if there could be a better 'generate' algorithm that would
maintain a constant ratio of hosts per 100 URLs... Below a certain
threshold it would stop, or better, start including URLs with lower scores.


That's exactly how the max.urls.per.host limit works.
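
For reference, a minimal sketch of setting this limit via the
generate.max.per.host property in conf/nutch-site.xml (600 is the value
MilleBii mentions below, not a recommendation; -1 means unlimited):

<property>
  <name>generate.max.per.host</name>
  <value>600</value>
  <description>Maximum number of URLs per host in a single fetchlist;
  -1 means no limit.</description>
</property>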



Using scores de-optimizes the fetching process... Having said that,
I should first read the code and try to understand it.


That wouldn't hurt in any case ;)

There is also a method in ScoringFilter-s (e.g. the default 
scoring-opic) that determines the priority of a URL during 
generation. See ScoringFilter.generatorSortValue(..); you can modify 
this method in scoring-opic (or in your own scoring filter) to 
prioritize certain URLs over others.
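
As a rough sketch only (this is not code from the thread; the class name and
the host-boost rule are made up, and the exact generatorSortValue signature
should be checked against your Nutch version), a custom filter could extend
the OPIC one and override just that method:

// Hypothetical example: boost the generator sort value for URLs from a
// preferred host so they are emitted earlier in the fetchlist.
package org.example.scoring;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.opic.OPICScoringFilter;

public class HostBoostScoringFilter extends OPICScoringFilter {

  @Override
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    // Start from the default OPIC behaviour.
    float sort = super.generatorSortValue(url, datum, initSort);
    // Made-up rule: push URLs from one host to the front of the fetchlist.
    if (url.toString().contains("example.com")) {
      sort *= 10.0f;
    }
    return sort;
  }
}

Such a filter would still need to be packaged as a plugin and enabled via
plugin.includes.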


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file crawl/linkdb/.locked already exists

2009-12-03 Thread BELLINI ADAM


Hi,
I'm performing a recrawl using the recrawl.sh script, and I got this
error when inverting the links:

FATAL crawl.LinkDb - LinkDb: java.io.IOException: lock file 
crawl/linkdb/.locked already exists
echo - Invert Links (Step 4 of $steps) -
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*

I understand that the linkdb already exists (because of the last crawl). My
question is: should I delete or back up the old linkdb (at every recrawl)
before inverting the links?
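
One way to do what you describe, sketched only (it assumes no other Nutch job
is still using the linkdb, and the backup naming is arbitrary):

# Run before the invertlinks step of recrawl.sh.
if [ -e $crawl/linkdb/.locked ]; then
  rm $crawl/linkdb/.locked   # stale lock left by an interrupted run
fi
# Optionally keep the previous linkdb as a backup instead of updating it in place:
# mv $crawl/linkdb $crawl/linkdb.bak.$(date +%Y%m%d%H%M%S)
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*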


Re: How does generate work ?

2009-12-03 Thread MilleBii
Hum... I use the max-URLs-per-host limit and set it to 600... Because in the
worst case there are 6 s (measured in the logs) between URLs of the same
host, that gives 6 x 600 = 3600 s = 1 hour. So in the worst case the long
tail shouldn't last longer than 1 hour... Unfortunately that is not what I see.

I also tried the by.ip option, because some blog sites allocate a
different domain name to each user... I saw no improvement.

I look at the time limit feature as a workaround for this number-of-hosts
issue, and was thinking that there could be a more structural way to solve it.

2009/12/3, Andrzej Bialecki a...@getopt.org:
 MilleBii wrote:
 Oops continuing previous mail.

 So I wonder if there could be a better 'generate' algorithm that would
 maintain a constant ratio of hosts per 100 URLs... Below a certain
 threshold it would stop, or better, start including URLs with lower scores.

 That's exactly how the max.urls.per.host limit works.


 Using scores de-optimizes the fetching process... Having said that,
 I should first read the code and try to understand it.

 That wouldn't hurt in any case ;)

 There is also a method in ScoringFilter-s (e.g. the default
 scoring-opic) that determines the priority of a URL during
 generation. See ScoringFilter.generatorSortValue(..); you can modify
 this method in scoring-opic (or in your own scoring filter) to
 prioritize certain URLs over others.

 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
-MilleBii-


Re: How does generate work ?

2009-12-03 Thread Julien Nioche
 Hum... I use the max-URLs-per-host limit and set it to 600... Because in the
 worst case there are 6 s (measured in the logs) between URLs of the same
 host, that gives 6 x 600 = 3600 s = 1 hour. So in the worst case the long
 tail shouldn't last longer than 1 hour... Unfortunately that is not what I see.


That's assuming that all input URLs are read at once, put into their
corresponding queues, and ready to be fetched. In reality there is a
cap on the number of URLs stored in the queues (see
fetchQueues.totalSize in the logs), which is equal to 50 * number of
threads.

The value of 50 is fixed, but we could add a parameter to modify it. A
workaround is simply to use more threads, which increases the number
of URLs that can be stored in the queues.
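
For example, a sketch of raising the thread count in conf/nutch-site.xml
(fetcher.threads.fetch; with the fixed 50-per-thread cap, 50 threads would
allow up to 2500 queued URLs). The same value can also be passed per run with
the fetcher's -threads option, as in the recrawl scripts posted on this list:

<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
  <description>Number of fetcher threads; with the fixed cap of 50 URLs per
  thread this allows up to 2500 entries in the fetch queues.</description>
</property>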

If you look at the logs you'll see that there are often situations
where fetchQueues.totalSize is at the maximum value allowed but not
all fetcher threads are active, which means that one or more large
queues are filling up fetchQueues.totalSize and preventing new URLs
from being added.

We can't read ahead in the URL entries given to the mapper without
having to store them somewhere, so the easiest option is probably to
allow a custom multiplication factor for the fetchQueues.totalSize cap
so that it can be more than 50. That would increase memory consumption
a bit but would definitely make the fetching rate more constant. You
can also simply use more threads, but there would be a risk of getting
timeouts if you specify too large a value.
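
To make the idea concrete, a hypothetical sketch of the configurable cap; the
property name fetcher.queues.size.factor is invented for illustration and does
not exist in Nutch 1.0:

import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch of the proposal: make the per-thread queue multiplier
// configurable instead of hard-coding 50 in the fetcher.
public class QueueCapSketch {
  public static int maxQueueTotal(Configuration conf) {
    int threads = conf.getInt("fetcher.threads.fetch", 10);
    int factor = conf.getInt("fetcher.queues.size.factor", 50); // invented name
    return factor * threads; // today this is effectively 50 * threads
  }
}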

makes sense?



 I also tried the by.ip option, because some blog sites allocate a
 different domain name to each user... I saw no improvement.

IP resolution is quite slow because it is not multithreaded, so that
would not help anyway.

Julien


 I look at the time limit feature as a workaround for this number-of-hosts
 issue, and was thinking that there could be a more structural way to solve it.

 2009/12/3, Andrzej Bialecki a...@getopt.org:
  MilleBii wrote:
  Oops continuing previous mail.
 
  So I wonder if there could be a better 'generate' algorithm that would
  maintain a constant ratio of hosts per 100 URLs... Below a certain
  threshold it would stop, or better, start including URLs with lower scores.
 
  That's exactly how the max.urls.per.host limit works.
 
 
  Using scores de-optimizes the fetching process... Having said that,
  I should first read the code and try to understand it.
 
  That wouldn't hurt in any case ;)
 
  There is also a method in ScoringFilter-s (e.g. the default
  scoring-opic) that determines the priority of a URL during
  generation. See ScoringFilter.generatorSortValue(..); you can modify
  this method in scoring-opic (or in your own scoring filter) to
  prioritize certain URLs over others.
 
  --
  Best regards,
  Andrzej Bialecki     
    ___. ___ ___ ___ _ _   __
  [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
  ___|||__||  \|  ||  |  Embedded Unix, System Integration
  http://www.sigram.com  Contact: info at sigram dot com
 
 


 --
 -MilleBii-




-- 
DigitalPebble Ltd
http://www.digitalpebble.com


db.fetch.interval.default

2009-12-03 Thread BELLINI ADAM

Hi,

I'm crawling my intranet, and I have set db.fetch.interval.default to
5 hours, but it seems that it doesn't work correctly:

<property>
  <name>db.fetch.interval.default</name>
  <value>18000</value>
  <description>The number of seconds between re-fetches of a page (5 hours).
  </description>
</property>


The first crawl, when the crawl directory $crawl doesn't exist yet,
took just 2 hours (with depth=10).


But when performing the recrawl with the recrawl.sh script (with the
crawldb full), it takes about 2 hours for each depth!

And when I checked the log file I found that one URL is fetched
several times! So is my 5-hour db.fetch.interval.default working correctly?

Why is it refetching the same URLs several times at each depth
(depth=10)? I understood that the timestamp of the pages would not
allow a refetch, since the interval (5 hours) has not elapsed yet.

Please can you explain how db.fetch.interval.default works? Should I
use only one depth, since I have all the URLs in the crawldb?
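
One way to check whether the interval is being applied is to look a URL up in
the crawldb and compare its fetch time and retry interval with what you
expect; a sketch using the readdb tool (the URL is just a placeholder):

# Inspect a single crawldb entry: "Fetch time" is when the URL is next due,
# and "Retry interval" should show the configured 18000 seconds.
$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -url http://intranet.example.com/some/page.html

# Or dump the whole crawldb to text files for grepping:
$NUTCH_HOME/bin/nutch readdb $crawl/crawldb -dump $crawl/crawldb_dump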




I'm using this recrawl script:

depth=10

echo - Inject (Step 1 of $steps) -
$NUTCH_HOME/bin/nutch inject $crawl/crawldb urls

echo - Generate, Fetch, Parse, Update (Step 2 of $steps) -
for ((i=0; i < $depth; i++))

do

  echo --- Beginning crawl at depth `expr $i + 1` of $depth ---

$NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN

  if [ $? -ne 0 ]
  then
echo runbot: Stopping at depth $depth. No more URLs to fetch.
break
  fi
  segment=`ls -d $crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads

  if [ $? -ne 0 ]
  then
echo runbot: fetch $segment at depth `expr $i + 1` failed.
echo runbot: Deleting segment $segment.
rm $RMARGS $segment
continue
  fi

echo - Updating Database ( $steps) -


  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment

done

echo - Merge Segments (Step 3 of $steps) -
$NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*


rm -rf $crawl/segments
mv  $crawl/MERGEDsegments $crawl/segments

echo - Invert Links (Step 4 of $steps) -
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*




RE: db.fetch.interval.default

2009-12-03 Thread BELLINI ADAM

Hi,
I dumped the database, and this is what I found:


Status: 1 (db_unfetched)
Fetch time: Thu Dec 03 15:53:24 EST 2009
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 18000 seconds (0 days)
Score: 2.0549393
Signature: null
Metadata:




So if this URL is met several times within 2 hours, does that mean
that because of the "0 days" it's going to be fetched several times?
Will it not look at the 18000 seconds?


thx




 Date: Thu, 3 Dec 2009 22:39:29 +0100
 From: reinhard.sch...@aon.at
 To: nutch-user@lucene.apache.org
 Subject: Re: db.fetch.interval.default
 
 hi,
 
 I have identified one source of such a problem and opened an issue in JIRA.
 You can apply this patch and check whether it solves your problem.
 
 https://issues.apache.org/jira/browse/NUTCH-774
 
 BTW, you can also check your crawldb for such items - entries where the
 retry interval is set to 0.
 Just dump the crawldb and search for them.
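 
 For example, something along these lines (a sketch; the grep pattern assumes
 the dump format shown in the entry above):
 
 $NUTCH_HOME/bin/nutch readdb $crawl/crawldb -dump crawldb_dump
 # the grep context (-B 8) shows the URL line a few lines above the match
 grep -B 8 "Retry interval: 0 seconds" crawldb_dump/part-*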
 
 regards
 reinhard
 
 BELLINI ADAM schrieb:
  hi,
 
  I'm crawling my intranet, and I have set db.fetch.interval.default to
  5 hours, but it seems that it doesn't work correctly:
 
  <property>
    <name>db.fetch.interval.default</name>
    <value>18000</value>
    <description>The number of seconds between re-fetches of a page (5 hours).
    </description>
  </property>
 
 
  The first crawl, when the crawl directory $crawl doesn't exist yet,
  took just 2 hours (with depth=10).
 
  But when performing the recrawl with the recrawl.sh script (with the
  crawldb full), it takes about 2 hours for each depth!
 
  And when I checked the log file I found that one URL is fetched
  several times! So is my 5-hour db.fetch.interval.default working correctly?
 
  Why is it refetching the same URLs several times at each depth
  (depth=10)? I understood that the timestamp of the pages would not
  allow a refetch, since the interval (5 hours) has not elapsed yet.
 
  Please can you explain how db.fetch.interval.default works? Should I
  use only one depth, since I have all the URLs in the crawldb?
 
 
 
 
  i'm using this recrawl script :
 
  depth=10
 
  echo - Inject (Step 1 of $steps) -
  $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
 
  echo - Generate, Fetch, Parse, Update (Step 2 of $steps) -
  for ((i=0; i < $depth; i++))
 
  do
 
echo --- Beginning crawl at depth `expr $i + 1` of $depth ---
 
  $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
 
if [ $? -ne 0 ]
then
  echo runbot: Stopping at depth $depth. No more URLs to fetch.
  break
fi
segment=`ls -d $crawl/segments/* | tail -1`
 
$NUTCH_HOME/bin/nutch fetch $segment -threads $threads
 
if [ $? -ne 0 ]
then
  echo runbot: fetch $segment at depth `expr $i + 1` failed.
  echo runbot: Deleting segment $segment.
  rm $RMARGS $segment
  continue
fi
 
  echo - Updating Database ( $steps) -
 
 
$NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
 
  done
 
  echo - Merge Segments (Step 3 of $steps) -
  $NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
 
 
  rm -rf $crawl/segments
  mv  $crawl/MERGEDsegments $crawl/segments
 
  echo - Invert Links (Step 4 of $steps) -
  $NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
 
 
 


Why does a url with a fetch status of 'fetch_gone' show up as 'db_unfetched'?

2009-12-03 Thread J.G.Konrad
Why does a URL with a fetch status of 'fetch_gone' show up as
'db_unfetched'? Shouldn't the crawldb entry have a status of
'db_gone'? This is happening in Nutch 1.0.

Here is one example of what I'm talking about
=
[jkon...@rampage search]$ ./bin/nutch readseg -get
testParseSegment/20091202111849
"http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s"
Crawl Generate::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Fri Nov 27 16:28:09 PST 2009
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 7776000 seconds (90 days)
Score: 7.535359E-10
Signature: null
Metadata: _ngt_: 1259781530311

Crawl Fetch::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Wed Dec 02 12:25:21 PST 2009
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 2.47059988E10
Signature: null
Metadata: _ngt_: 1259781530311 _pst_: notfound(14), lastModified=0:
http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s

[jkon...@rampage search]$ ./bin/nutch readdb testParseSegment/c -url
"http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s"
URL: http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Apr 03 01:25:21 PDT 2010
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 6998400 seconds (81 days)
Score: 2.47059988E10
Signature: null
Metadata: _pst_: notfound(14), lastModified=0:
http://answers.yahoo.com/question/index?qid=20080802122654AA7qj6s
=


Thanks,
  Jason