A bit more info:

The addLink documentation: "Links are only permitted in the webdb if they
have a valid source MD5 for a Page that is also in the webdb". Yet I can
insert a link with the MD5 of a page that is not in the webdb.

Also, I can now filter out the offending links by reading both the pages and
the links in MD5 order, after adding the following (seemingly missing) method
to the WebDBReader class:

    /**
     * Iterate through all the Links, sorted by MD5
     */
    public Enumeration linksByMD5() throws IOException {
        return new MapEnumerator(linksByMD5);
    }
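
For reference, the filtering step can be sketched as a merge join over the two
MD5-sorted streams. This is only a minimal, self-contained illustration and
not the actual purger code: the Link class, the LinkPurgeSketch class and the
keepLinksWithKnownSource method are hypothetical stand-ins, MD5s are plain hex
strings rather than Nutch hash objects, and it assumes both streams are sorted
with the same ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class LinkPurgeSketch {

    /** Hypothetical stand-in for a webdb link: source page MD5 (hex) and target URL. */
    static final class Link {
        final String fromMd5;
        final String toUrl;

        Link(String fromMd5, String toUrl) {
            this.fromMd5 = fromMd5;
            this.toUrl = toUrl;
        }

        @Override
        public String toString() {
            return fromMd5 + " -> " + toUrl;
        }
    }

    /**
     * Merge join over two streams that are both sorted ascending by MD5.
     * Returns only the links whose source MD5 matches a page MD5.
     */
    static List<Link> keepLinksWithKnownSource(Iterator<String> pageMd5s,
                                               Iterator<Link> linksByMd5) {
        List<Link> kept = new ArrayList<>();
        String page = pageMd5s.hasNext() ? pageMd5s.next() : null;
        while (linksByMd5.hasNext()) {
            Link link = linksByMd5.next();
            // Advance the page cursor until it reaches or passes the link's source MD5.
            while (page != null && page.compareTo(link.fromMd5) < 0) {
                page = pageMd5s.hasNext() ? pageMd5s.next() : null;
            }
            // Keep the link only if its source page survived the purge.
            if (page != null && page.equals(link.fromMd5)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample data: three surviving pages, five links.
        List<String> pages = Arrays.asList("1a", "3c", "5e");
        List<Link> links = Arrays.asList(
                new Link("1a", "http://example.com/a"),
                new Link("1a", "http://example.com/b"),
                new Link("2b", "http://example.com/c"),
                new Link("5e", "http://example.com/d"),
                new Link("9f", "http://example.com/e"));
        for (Link link : keepLinksWithKnownSource(pages.iterator(), links.iterator())) {
            System.out.println(link);
        }
    }
}
```

Since both inputs are read once and in order, nothing has to be held in memory
beyond the current page and the current link, which is why reading the links
by MD5 rather than by URL matters here.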

> -----Original Message-----
> From: Handl, Jorge [mailto:[EMAIL PROTECTED]
> Sent: Monday, September 05, 2005 16:51
> To: nutch-dev@lucene.apache.org
> Subject: linksByMD5
> 
> 
> Hi!
> 
> I'm writing a webdb purger, and I have an issue with writing to the new db
> the links of the pages that haven't been purged.
> 
> The docs seem to imply that adding a link having a source page that is not
> present in the webdb should fail, but apparently it doesn't.
> 
> So I try to filter out the links that shouldn't be inserted, but I can't
> access the links by MD5, even though I find both linksByURL and linksByMD5
> directories in the webdb... Why is that so?
> 
> Thanks!
> 
