> Markus and Julien,
>
> Thanks for the advice. The point is that there are other URLs that don't fit
> my sample. Say this is a subset of the URLs injected at runtime:
>
> - http://example.com/pageBla/John/*123*/blabla
> - http://example.com/pageBla/Doe/*123*/albalb
> - http://example.com/pageBla/Doe/*456*/albalb
> - http://example.com/pageBla/Doe/*789*/abcdef
>
> In this case I cannot use URL normalization, I guess, because it is not
> certain that there is more than one URL with id 456. Long story short, I
> need to know that this URL has already been added to the list in either
> form, so that I can eliminate it by some method, e.g. normalization.
If you use regex normalization (the regex-normalizer plugin) during injection
and when processing outlinks, you bring the URLs back to a common format. If
/john/123/asdf has been indexed but normalized as X/123/X, then /doe/123/blaat
will be normalized to X/123/X as well. Then you've successfully deduplicated
these URLs. Of course, provided that the crawldb collapses duplicates to a
single URL, which I believe it does. I'm not sure though.

> What do you think?
>
> Dincer
>
> 2011/8/18 Markus Jelsma <[email protected]>
>
> > Mmm yes..
> >
> > What will actually happen if we use a regex normalizer to produce a
> > common form by setting a static value X for the first, second and fourth
> > URI segments?
> >
> > This would produce http://example.com/X/X/123/X for both URLs. Now we
> > have a duplicate URL; will the reducer deduplicate and write out one URL?
> >
> > If that's the case you can normalize throughout the whole crawl cycle
> > and also add new items to the crawldb.
> >
> > Cheers
> >
> > > Can't you normalise these URLs into a common form? The normalisation
> > > will be done as part of the injection and in the subsequent steps, so
> > > you will have only one URL to fetch.
> > >
> > > On 18 August 2011 14:53, Dinçer Kavraal <[email protected]> wrote:
> > > > Hi Markus,
> > > >
> > > > Thanks, but I need to prevent the download, because I have more CPU
> > > > resources than bandwidth :) Therefore, it is more important to deal
> > > > with the beast before it is born.
> > > >
> > > > Dincer
> > > >
> > > > 2011/8/18 Markus Jelsma <[email protected]>
> > > >
> > > > > Hi,
> > > > >
> > > > > At the moment you cannot do this out-of-the-box. It's a very, very
> > > > > nasty problem that needs a lot of thinking if you want to prevent
> > > > > downloading such URLs.
> > > > > What you can do is just download them and mark them as duplicates,
> > > > > either by using the simple hashing algorithm or a more advanced
> > > > > text profile signature.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > On Thursday 18 August 2011 15:35:26 Dinçer Kavraal wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have two URLs such as:
> > > > > > http://example.com/pageBla/John/*123*/blabla
> > > > > > http://example.com/pageBla/Doe/*123*/albalb
> > > > > > The thing is, these two URLs are the same because of the id part
> > > > > > of the URL (which is *123* in this sample). How could I manage
> > > > > > to prevent downloading the same thing twice because of that?
> > > > > >
> > > > > > I think I can customize the injection classes, but how could I
> > > > > > check if another form of the URL has already been fetched?
> > > > > >
> > > > > > Any ideas? Thanks
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
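The regex-normalizer approach discussed above could be sketched as a rule in
Nutch's conf/regex-normalize.xml. The pattern below is an assumption based on
the sample URLs in this thread (a fixed host, a name segment, a numeric id,
then a slug); it would need to be adjusted to the real URL scheme:

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- Hypothetical rule: replace the variable name and slug segments with
       a static X, keeping only the id that decides page identity, so that
       /pageBla/John/123/blabla and /pageBla/Doe/123/albalb both normalize
       to /pageBla/X/123/X. -->
  <regex>
    <pattern>(http://example\.com/pageBla/)[^/]+/(\d+)/.*</pattern>
    <substitution>$1X/$2/X</substitution>
  </regex>
</regex-normalize>
```

Enabling urlnormalizer-regex in plugin.includes then applies the rule during
injection and outlink processing, so duplicate forms never reach the fetcher.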

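The normalize-then-deduplicate idea from the thread can also be checked outside
Nutch. A minimal Python sketch (the regex and the plain ids, without the
asterisk emphasis from the emails, are assumptions based on the sample URLs):

```python
import re

# Assumed pattern mirroring the sample URLs: the name segment and trailing
# slug vary, while the numeric id decides which page the URL refers to.
PATTERN = re.compile(r"(http://example\.com/pageBla/)[^/]+/(\d+)/.*")

def normalize(url: str) -> str:
    """Collapse the variable segments to a static X, keeping the id."""
    m = PATTERN.match(url)
    if m is None:
        return url  # leave non-matching URLs untouched
    return f"{m.group(1)}X/{m.group(2)}/X"

urls = [
    "http://example.com/pageBla/John/123/blabla",
    "http://example.com/pageBla/Doe/123/albalb",
    "http://example.com/pageBla/Doe/456/albalb",
    "http://example.com/pageBla/Doe/789/abcdef",
]

# Keep only the first URL seen for each normalized form, as a crawldb
# collapsing duplicates would.
seen = set()
unique = []
for url in urls:
    key = normalize(url)
    if key not in seen:
        seen.add(key)
        unique.append(url)
```

With the four sample URLs, the two id-123 forms share one normalized key, so
only three URLs survive deduplication.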
