Can't you normalise these URLs into a common form? Normalisation is applied
at injection and again in the subsequent steps, so you will end up with only
one URL to fetch.
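
A rough sketch of the idea in Python, not Nutch code: in Nutch itself a rule
like this would normally live in the regex URL normaliser's configuration.
The path layout /pageBla/<name>/<id>/<slug>, the "-" placeholders and the
helper names normalise() and urls_to_fetch() are assumptions made for the
example, and it only works if the site serves the page whatever the name
and slug segments contain.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed layout for the example site: /pageBla/<name>/<id>/<slug>
_PAGE = re.compile(r"^/pageBla/[^/]+/(\d+)/[^/]*$")


def normalise(url: str) -> str:
    """Map every variant of a page to one canonical URL keyed on its id."""
    parts = urlsplit(url)
    match = _PAGE.match(parts.path)
    if match is None:
        return url  # not a URL of this shape, leave it untouched
    # Replace the name and slug segments with a fixed placeholder.
    canonical_path = f"/pageBla/-/{match.group(1)}/-"
    return urlunsplit((parts.scheme, parts.netloc.lower(), canonical_path, "", ""))


def urls_to_fetch(urls):
    """Yield each canonical URL once, in first-seen order."""
    seen = set()
    for url in urls:
        canonical = normalise(url)
        if canonical not in seen:
            seen.add(canonical)
            yield canonical
```

With the two URLs from the original question, both normalise to the same
string, so only one of them would ever be queued for fetching.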

On 18 August 2011 14:53, Dinçer Kavraal <[email protected]> wrote:

> Hi Markus,
>
> Thanks, but I need to prevent the download itself, because I have more CPU
> resources than bandwidth :) So it is more important to deal with the beast
> before it is born.
>
> Dincer
>
>
> 2011/8/18 Markus Jelsma <[email protected]>
>
> > Hi,
> >
> > At the moment you cannot do this out of the box. It's a very, very nasty
> > problem that needs a lot of thinking if you want to prevent downloading
> > such URLs.
> > What you can do is just download them and mark them as duplicates by
> > using either the simple hashing algorithm or a more advanced text profile
> > signature.
> >
> > Cheers,
> >
> > On Thursday 18 August 2011 15:35:26 Dinçer Kavraal wrote:
> > > Hi,
> > >
> > > I have two URLs such as:
> > > http://example.com/pageBla/John/*123*/blabla
> > > http://example.com/pageBla/Doe/*123*/albalb
> > > The thing is these two URLs are the same because of the id part of the
> > > URL (which is *123* in this sample). How can I avoid downloading the
> > > same thing twice because of that?
> > >
> > > I think I can customize the injection classes, but how could I check
> > > whether another form of the URL has already been fetched?
> > >
> > > Any ideas? Thanks
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >
>
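
For the after-the-fact route Markus mentions in the quoted message (download
first, then mark duplicates), here is a minimal Python sketch of the two
flavours of signature. It is not Nutch's implementation: exact_signature()
and profile_signature() are hypothetical names, and the "profile" is a much
simplified stand-in (lower-cased word counts) for a real text profile.

```python
import hashlib
import re
from collections import Counter


def exact_signature(text: str) -> str:
    """MD5 of the raw text: any single-character change alters the digest."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def profile_signature(text: str, min_token_len: int = 2) -> str:
    """MD5 of a simplified text profile (sorted lower-cased word counts).

    Case, punctuation, whitespace and word order do not affect the result.
    """
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    profile = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    return hashlib.md5(profile.encode("utf-8")).hexdigest()


def is_duplicate(text: str, seen: set, signature=profile_signature) -> bool:
    """Record the page's signature and report whether it was seen before."""
    sig = signature(text)
    if sig in seen:
        return True
    seen.add(sig)
    return False
```

The exact hash only catches byte-identical pages, while the profile variant
tolerates cosmetic differences, at the price of occasionally merging pages
that merely use the same words. Either way the bandwidth is already spent
by the time the signature is computed.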



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
