Mmm yes.. What will actually happen if we use a regex normalizer to produce a common form by setting a static value X for the first, second and fourth URI segments?
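To make the idea concrete, here is a minimal sketch in plain Python, not Nutch code. The pattern, the `normalize` helper and the four-segment assumption are all hypothetical and only illustrate the rewrite; in Nutch a comparable pattern/substitution pair would presumably go into the regex normalizer's rule file. The id is written as 123 without the emphasis asterisks.

```python
import re

# Hypothetical rule: keep the third path segment (the id) and replace the
# first, second and fourth segments with the static value "X".
SEGMENT_RULE = re.compile(r"^(https?://[^/]+)/[^/]+/[^/]+/([^/]+)/[^/]+$")


def normalize(url: str) -> str:
    """Return the common form of a four-segment URL; other URLs pass through unchanged."""
    return SEGMENT_RULE.sub(r"\1/X/X/\2/X", url)


urls = [
    "http://example.com/pageBla/John/123/blabla",
    "http://example.com/pageBla/Doe/123/albalb",
]

# Both inputs map to the same string, so keying on the normalized URL
# leaves a single entry.
normalized = [normalize(u) for u in urls]
unique = set(normalized)
```

Whether the crawl actually ends up with one entry depends on the normalizer being applied at every step that emits URLs, which is an assumption here and not something this sketch verifies.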
This would produce http://example.com/X/X/123/X for both URLs. Now we have a duplicate URL; will the reducer deduplicate it and write out one URL? If that's the case, you can normalize throughout the whole crawl cycle and also add new items to the crawldb.

Cheers

> Can't you normalise these URLs into a common form? The normalisation will
> be done as part of the injection and in the subsequent steps so you will
> have only one URL to fetch
>
> On 18 August 2011 14:53, Dinçer Kavraal <[email protected]> wrote:
> > Hi Markus,
> >
> > Thanks but I need to prevent download, because I have more CPU resources
> > than bandwidth :) Therefore, it is more important to deal with the beast
> > before born.
> >
> > Dincer
> >
> > 2011/8/18 Markus Jelsma <[email protected]>
> >
> > > Hi,
> > >
> > > At the moment you cannot do this out-of-the-box. It's a very, very,
> > > nasty problem that needs a lot of thinking if you want to prevent
> > > downloading such URL's.
> > > What you can do is just download them and mark them as duplicate by
> > > either using the simple hashing algorithm or a more advanced text
> > > profile signature.
> > >
> > > Cheers,
> > >
> > > On Thursday 18 August 2011 15:35:26 Dinçer Kavraal wrote:
> > > > Hi,
> > > >
> > > > I have two URLs such as:
> > > > http://example.com/pageBla/John/*123*/blabla
> > > > http://example.com/pageBla/Doe/*123*/albalb
> > > > The thing is these two URLs are same because of the id part of the
> > > > URL (which is *123* in this sample). How could I manage to prevent
> > > > download same thing twice because of that?
> > > >
> > > > I think I can customize injection classes but how could I check if
> > > > another form of the URL is already fetched?
> > > >
> > > > Any ideas? Thanks
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350

