Hi,

 I tested a fetch with WebScarab on a segment with few parsed pages.

 In fact, 95% of all HTTP requests get a 404 response because the pages no longer
exist.

 Such pages have the status 'db_gone' in the crawldb, but they are still 
generated and fetched.

 I noticed that Nutch 1.4 has an option to remove the db_gone URLs from the
crawldb, which would solve my problem.

 However, I currently use a script to monitor the Nutch processes, and it is not
compatible with Nutch 1.4; adapting it would take some development. Is there
something I can do to remove these db_gone URLs in Nutch 1.2?

 Thanks.

----- Original message -----
From: Danicela nutch
Sent: 17.02.12 17:32
To: [email protected], [email protected]
Subject: Re: Re: Too few parsed pages

 Hi,

 I now have 242 parsed pages for 18662 fetched pages. The performance of my
crawl has been significantly reduced due to this poor efficiency. Is there
anything I can do to prevent this? Shouldn't the generate choose newer URLs
instead of already fetched ones?

 If I understand correctly, the pages which aren't parsed are pages that did
not change since the last fetch. Does it mean that the HTML content of each
page is sent in the segment to the fetchlist during the generate? I mean, if
the parser makes the comparison between the current and the older content, it
should have the old content in the segment, as it doesn't read the crawldb.

 If this is true, does it also mean that the crawldb contains all HTML content
from all pages (as the generate gives it to the segments)?

 Thanks for helping.

----- Original message -----
From: Markus Jelsma
Sent: 06.02.12 17:06
To: [email protected]
Subject: Re: Too few parsed pages

Nothing, this is good. If a page is not modified you don't need to parse it
again as it was already parsed in an older segment.

On Monday 06 February 2012 17:03:52 Danicela nutch wrote:
> I don't understand, what should I do ?
>
> ----- Original message -----
> From: Markus Jelsma
> Sent: 06.02.12 16:45
> To: [email protected]
> Subject: Re: Too few parsed pages
>
> Likely db_not_modified records, they are not parsed.
>
> On Monday 06 February 2012 16:44:25 Danicela nutch wrote:
> Hi,
>
> When I make a readseg -list on a segment, I have 60.000 'FETCHED' pages,
> but only 10.000 'PARSED' pages. One month ago, I had something like 40.000
> 'PARSED' pages in my segments, and this number reduced a little every day.
>
> If I look in the logs of the segments, I can find approximately these
> numbers if I count the number of treated pages. But I find nothing strange
> in the parse that could explain the fact I have so few pages in the end.
>
> What can explain the fact I have so few pages which are parsed ?
>
> Thanks.
>
> --
> Markus Jelsma - CTO - Openindex

--
Markus Jelsma - CTO - Openindex
