Julien,

It might not have worked because the modified URL is not actually a published link. However, as far as I can tell, there is a leak in the dynamic page: it only considers the part of the URL (the numbers) up to and including the id. I have achieved my desired result, though that doesn't mean it will always work this way.
I still believe there is a need for a customized injection class to observe these kinds of URLs for a general solution. For now, I am going with this; if I later arrive at a general solution, I will share it here. Thanks for the suggestions.

Best,
Dincer

2011/8/19 Julien Nioche <[email protected]>

> Markus,
>
> The filtering and normalisation of URLs happens in the map stage of the
> first step of injection (i.e. we get CrawlDatum-s which are then merged
> with the existing ones) before they get deduplicated.
>
> J
>
> On 18 August 2011 20:57, Markus Jelsma <[email protected]> wrote:
>
> > > Markus and Julien,
> > >
> > > Thanks for the advice. The point is there are other URLs that don't
> > > fit my sample. Let this be a subset of the injected URLs at runtime,
> > > say:
> > >
> > > - http://example.com/pageBla/John/*123*/blabla
> > > - http://example.com/pageBla/Doe/*123*/albalb
> > > - http://example.com/pageBla/Doe/*456*/albalb
> > > - http://example.com/pageBla/Doe/*789*/abcdef
> > >
> > > In this case, I cannot use URL normalization, I guess, because it is
> > > not certain that there is more than one URL with the 456 id. Long
> > > story short, I need to be aware that this URL has already been added
> > > to the list in either form, in order to eliminate it by some method,
> > > e.g. normalization.
> >
> > If you use regex normalization (regex-normalizer plugin) in injection
> > and when processing outlinks, you bring them back to the common format.
> > If /john/123/asdf has been indexed but normalized as X/123/X, then
> > /doe/123/blaat will be normalized to X/123/X as well. Then you've
> > successfully deduplicated these URLs.
> > Of course, provided that the crawldb collapses duplicates to a single
> > URL, which I believe it does. I'm not sure though.
> >
> > > What do you think?
> > >
> > > Dincer
> > >
> > > 2011/8/18 Markus Jelsma <[email protected]>
> > >
> > > > Mmm yes..
> > > > What will actually happen if we use a regex normalizer to produce
> > > > a common form by setting a static value X for the first, second and
> > > > fourth URI segments?
> > > >
> > > > This would produce http://example.com/X/X/123/X for both URLs. Now
> > > > we have a duplicate URL; will the reducer deduplicate and write out
> > > > one URL?
> > > >
> > > > If that's the case, you can normalize throughout the whole crawl
> > > > cycle and also add new items to the crawldb.
> > > >
> > > > Cheers
> > > >
> > > > > Can't you normalise these URLs into a common form? The
> > > > > normalisation will be done as part of the injection and in the
> > > > > subsequent steps, so you will have only one URL to fetch.
> > > > >
> > > > > On 18 August 2011 14:53, Dinçer Kavraal <[email protected]> wrote:
> > > > > > Hi Markus,
> > > > > >
> > > > > > Thanks, but I need to prevent the download, because I have more
> > > > > > CPU resources than bandwidth :) Therefore, it is more important
> > > > > > to deal with the beast before it is born.
> > > > > >
> > > > > > Dincer
> > > > > >
> > > > > > 2011/8/18 Markus Jelsma <[email protected]>
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > At the moment you cannot do this out-of-the-box. It's a very,
> > > > > > > very nasty problem that needs a lot of thinking if you want to
> > > > > > > prevent downloading such URLs.
> > > > > > > What you can do is just download them and mark them as
> > > > > > > duplicates, either by using the simple hashing algorithm or a
> > > > > > > more advanced text profile signature.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > On Thursday 18 August 2011 15:35:26 Dinçer Kavraal wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I have two URLs such as:
> > > > > > > > http://example.com/pageBla/John/*123*/blabla
> > > > > > > > http://example.com/pageBla/Doe/*123*/albalb
> > > > > > > > The thing is, these two URLs are the same because of the id
> > > > > > > > part of the URL (which is *123* in this sample). How could I
> > > > > > > > manage to prevent downloading the same thing twice because
> > > > > > > > of that?
> > > > > > > >
> > > > > > > > I think I can customize the injection classes, but how could
> > > > > > > > I check if another form of the URL has already been fetched?
> > > > > > > >
> > > > > > > > Any ideas? Thanks
> > > > > > >
> > > > > > > --
> > > > > > > Markus Jelsma - CTO - Openindex
> > > > > > > http://www.linkedin.com/in/markus17
> > > > > > > 050-8536620 / 06-50258350
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
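[Editor's note: the regex rule Markus and Julien discuss could be sketched as a rule in Nutch's `conf/regex-normalize.xml` (read by the urlnormalizer-regex plugin). This is an illustrative rule written for the example URLs in the thread, not something taken from the original messages; the static `X` segments follow Markus's suggestion.]

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- Illustrative rule: replace the two name segments around the numeric
       id with a static X, so /pageBla/John/123/blabla and
       /pageBla/Doe/123/albalb both normalize to /X/X/123/X and collapse
       to a single crawldb entry. -->
  <regex>
    <pattern>^(https?://[^/]+)/[^/]+/[^/]+/([0-9]+)/[^/]+$</pattern>
    <substitution>$1/X/X/$2/X</substitution>
  </regex>
</regex-normalize>
```

For this to take effect throughout the crawl cycle (inject, parse outlinks, update), the urlnormalizer-regex plugin must be enabled in `plugin.includes`.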

