> Markus and Julien,
>
> Thanks for the advice. The point is that there are other URLs that don't fit
> my sample. Say this is a subset of the URLs injected at runtime:
>
> - http://example.com/pageBla/John/*123*/blabla
> - http://example.com/pageBla/Doe/*123*/albalb
> - http://example.com/pageBla/Doe/*456*/albalb
> - http://example.com/pageBla/Doe/*789*/abcdef
>
> In this case I cannot use URL normalization, I guess, because it is not
> certain that there is more than one URL with id 456. Long story short, I
> need to know that this URL has already been added to the list in either
> form, so that I can eliminate it by some method, e.g. normalization.
If you use regex normalization (the regex-normalizer plugin) during injection
and when processing outlinks, you bring the URLs back to a common format. If
/john/123/asdf has been indexed but normalized as X/123/X, then /doe/123/blaat
will be normalized to X/123/X as well. Then you've successfully deduplicated
these URLs. Of course, provided that the crawldb collapses duplicates to a
single URL, which I believe it does. I'm not sure though.

> What do you think?
>
> Dincer
>
> 2011/8/18 Markus Jelsma <[email protected]>
>
> > Mmm yes..
> >
> > What will actually happen if we use a regex normalizer to produce a
> > common form by setting a static value X for the first, second and fourth
> > URI segments?
> >
> > This would produce http://example.com/X/X/123/X for both URLs. Now we
> > have a duplicate URL; will the reducer deduplicate and write out one URL?
> >
> > If that's the case you can normalize throughout the whole crawl cycle
> > and also add new items to the crawldb.
> >
> > Cheers
> >
> > > Can't you normalise these URLs into a common form? The normalisation
> > > will be done as part of the injection and in the subsequent steps, so
> > > you will have only one URL to fetch.
> > >
> > > On 18 August 2011 14:53, Dinçer Kavraal <[email protected]> wrote:
> > > > Hi Markus,
> > > >
> > > > Thanks, but I need to prevent the download, because I have more CPU
> > > > resources than bandwidth :) Therefore, it is more important to deal
> > > > with the beast before it is born.
> > > >
> > > > Dincer
> > > >
> > > > 2011/8/18 Markus Jelsma <[email protected]>
> > > >
> > > > > Hi,
> > > > >
> > > > > At the moment you cannot do this out-of-the-box. It's a very, very
> > > > > nasty problem that needs a lot of thinking if you want to prevent
> > > > > downloading such URLs.
> > > > > What you can do is just download them and mark them as duplicates,
> > > > > either by using the simple hashing algorithm or a more advanced
> > > > > text profile signature.
> > > > >
> > > > > Cheers,
> > > > >
> > > > > On Thursday 18 August 2011 15:35:26 Dinçer Kavraal wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I have two URLs such as:
> > > > > > http://example.com/pageBla/John/*123*/blabla
> > > > > > http://example.com/pageBla/Doe/*123*/albalb
> > > > > > The thing is, these two URLs are the same because of the id part
> > > > > > of the URL (which is *123* in this sample). How could I manage
> > > > > > to prevent downloading the same thing twice because of that?
> > > > > >
> > > > > > I think I can customize the injection classes, but how could I
> > > > > > check if another form of the URL has already been fetched?
> > > > > >
> > > > > > Any ideas? Thanks
> > > > >
> > > > > --
> > > > > Markus Jelsma - CTO - Openindex
> > > > > http://www.linkedin.com/in/markus17
> > > > > 050-8536620 / 06-50258350
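The regex-normalizer approach discussed above could be sketched as a rule in
Nutch's conf/regex-normalize.xml. The pattern below is an assumption based on
the sample URLs in this thread (a fixed host, a name segment, a numeric id,
then a slug); it would need to be adjusted to the real URL scheme:

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- Hypothetical rule: replace the variable name and slug segments with
       a static X, keeping only the id that decides page identity, so that
       /pageBla/John/123/blabla and /pageBla/Doe/123/albalb both normalize
       to /pageBla/X/123/X. -->
  <regex>
    <pattern>(http://example\.com/pageBla/)[^/]+/(\d+)/.*</pattern>
    <substitution>$1X/$2/X</substitution>
  </regex>
</regex-normalize>
```

Enabling urlnormalizer-regex in plugin.includes then applies the rule during
injection and outlink processing, so duplicate forms never reach the fetcher.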

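The normalize-then-deduplicate idea from the thread can also be checked outside
Nutch. A minimal Python sketch (the regex and the plain ids, without the
asterisk emphasis from the emails, are assumptions based on the sample URLs):

```python
import re

# Assumed pattern mirroring the sample URLs: the name segment and trailing
# slug vary, while the numeric id decides which page the URL refers to.
PATTERN = re.compile(r"(http://example\.com/pageBla/)[^/]+/(\d+)/.*")

def normalize(url: str) -> str:
    """Collapse the variable segments to a static X, keeping the id."""
    m = PATTERN.match(url)
    if m is None:
        return url  # leave non-matching URLs untouched
    return f"{m.group(1)}X/{m.group(2)}/X"

urls = [
    "http://example.com/pageBla/John/123/blabla",
    "http://example.com/pageBla/Doe/123/albalb",
    "http://example.com/pageBla/Doe/456/albalb",
    "http://example.com/pageBla/Doe/789/abcdef",
]

# Keep only the first URL seen for each normalized form, as a crawldb
# collapsing duplicates would.
seen = set()
unique = []
for url in urls:
    key = normalize(url)
    if key not in seen:
        seen.add(key)
        unique.append(url)
```

With the four sample URLs, the two id-123 forms share one normalized key, so
only three URLs survive deduplication.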
