A bit more info: the addLink documentation says: "Links are only permitted in the webdb if they have a valid source MD5 for a Page that is also in the webdb". Yet I can insert a link whose source MD5 belongs to a page that is not in the webdb.
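To make the documented rule concrete, here is a minimal, self-contained sketch of the check it implies: a link survives only if its source MD5 matches the MD5 of some page. It assumes both inputs are already sorted by MD5. The class and method names are hypothetical, and plain hex strings stand in for the real page and link objects; none of this is the Nutch API, and Nutch's own MD5 ordering may differ from plain string order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/**
 * Sketch only: plain hex strings stand in for page and link MD5s.
 * None of these names are part of the Nutch API.
 */
class LinkFilterSketch {

    /**
     * Returns the link source MD5s that match a page MD5.
     * Both lists must be sorted ascending (assumption). pageMd5s holds
     * one entry per page; linkSourceMd5s may repeat a value, because
     * several links can share the same source page.
     */
    static List<String> keepLinksWithKnownSource(List<String> pageMd5s,
                                                 List<String> linkSourceMd5s) {
        List<String> kept = new ArrayList<>();
        Iterator<String> pages = pageMd5s.iterator();
        String page = pages.hasNext() ? pages.next() : null;
        for (String source : linkSourceMd5s) {
            // Advance the page cursor until it reaches or passes this source.
            while (page != null && page.compareTo(source) < 0) {
                page = pages.hasNext() ? pages.next() : null;
            }
            // Keep the link only when a page with the same MD5 exists.
            if (page != null && page.equals(source)) {
                kept.add(source);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("0a", "3f", "9c");
        List<String> links = Arrays.asList("0a", "0a", "1b", "9c", "ff");
        System.out.println(keepLinksWithKnownSource(pages, links));
    }
}
```

With the sample data in main, the links whose sources are "1b" and "ff" are dropped and the program prints [0a, 0a, 9c]. Because both inputs are sorted, a single pass over each is enough.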
Also, I can now filter out the offending links by reading both the pages and the links by MD5, after adding the following (seemingly missing) method to the WebDBReader class:

    /**
     * Iterate through all the Links, sorted by MD5
     */
    public Enumeration linksByMD5() throws IOException {
        return new MapEnumerator(linksByMD5);
    }

> -----Original Message-----
> From: Handl, Jorge [mailto:[EMAIL PROTECTED]
> Sent: Monday, 5 September 2005 16:51
> To: nutch-dev@lucene.apache.org
> Subject: linksByMD5
>
> Hi!
>
> I'm writing a webdb purger, and I have an issue with writing to the
> new db the links of the pages that haven't been purged.
>
> The docs seem to imply that adding a link having a source page that
> is not present in the webdb should fail, but apparently it doesn't.
>
> So I try to filter out the links that shouldn't be inserted, but I
> can't access the links by MD5, even though I find both linksByURL
> and linksByMD5 directories in the webdb... Why is that so?
>
> Thanks!