Markus and Julien,

Thanks for the advice. The point is that there are other URLs that don't fit
my sample. Suppose this is a subset of the URLs injected at runtime, say:

   - http://example.com/pageBla/John/*123*/blabla
   - http://example.com/pageBla/Doe/*123*/albalb
   - http://example.com/pageBla/Doe/*456*/albalb
   - http://example.com/pageBla/Doe/*789*/abcdef

In this case, I guess I cannot use URL normalization, because it is not
certain that there will be more than one URL with the id 456. Long story
short, I need to know whether this URL has already been added to the list in
either form, so that I can eliminate it by some method, e.g. normalization.
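For instance, here is a minimal sketch (in Python) of the kind of check I
mean, assuming the id is always the third path segment and that the asterisks
in the samples above are just emphasis around the id:

```python
import re

# Assumed pattern for the sample URLs: the third path segment is the id.
ID_PATTERN = re.compile(r"^https?://example\.com/pageBla/[^/]+/(\d+)/")

def url_key(url):
    """Canonical key for a URL: the id segment if it matches, else the URL itself."""
    m = ID_PATTERN.match(url)
    return m.group(1) if m else url

def dedupe(urls):
    """Keep only the first URL seen for each id, in input order."""
    seen = set()
    unique = []
    for url in urls:
        key = url_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

urls = [
    "http://example.com/pageBla/John/123/blabla",
    "http://example.com/pageBla/Doe/123/albalb",
    "http://example.com/pageBla/Doe/456/albalb",
    "http://example.com/pageBla/Doe/789/abcdef",
]
print(dedupe(urls))
```

This drops the second URL (same id 123) while keeping the single 456 and 789
entries, which is what plain normalization alone would not let me do without
rewriting every URL into a form I then cannot fetch.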

What do you think?

Dincer


2011/8/18 Markus Jelsma <[email protected]>

> Mmm yes..
>
> What will actually happen if we use a regex normalizer to produce a common
> form by setting a static value X for the first, second and fourth URI-
> segments?
>
> This would produce http://example.com/X/X/123/X for both URLs. Now we have
> a duplicate URL; will the reducer deduplicate and write to one URL?
>
> If that's the case you can normalize throughout the whole crawl cycle and
> also
> add new items to the crawldb.
>
> Cheers
>
> > Can't you normalise these URLs into a common form? The normalisation will
> > be done as part of the injection and in the subsequent steps, so you will
> > have only one URL to fetch.
> >
> > On 18 August 2011 14:53, Dinçer Kavraal <[email protected]> wrote:
> > > Hi Markus,
> > >
> > > Thanks, but I need to prevent the download, because I have more CPU
> > > resources than bandwidth :) Therefore, it is more important to deal
> > > with the beast before it is born.
> > >
> > > Dincer
> > >
> > >
> > > 2011/8/18 Markus Jelsma <[email protected]>
> > >
> > > > Hi,
> > > >
> > > > At the moment you cannot do this out-of-the-box. It's a very, very
> > > > nasty problem that needs a lot of thinking if you want to prevent
> > > > downloading such URLs.
> > > > What you can do is just download them and mark them as duplicates by
> > > > either using the simple hashing algorithm or a more advanced text
> > > > profile signature.
> > > >
> > > > Cheers,
> > > >
> > > > On Thursday 18 August 2011 15:35:26 Dinçer Kavraal wrote:
> > > > > Hi,
> > > > >
> > > > > I have two URLs such as:
> > > > > http://example.com/pageBla/John/*123*/blabla
> > > > > http://example.com/pageBla/Doe/*123*/albalb
> > > > > The thing is, these two URLs are the same because of the id part
> > > > > of the URL (which is *123* in this sample). How could I avoid
> > > > > downloading the same thing twice because of that?
> > > > >
> > > > > I think I can customize the injection classes, but how could I
> > > > > check if another form of the URL has already been fetched?
> > > > >
> > > > > Any ideas? Thanks
> > > >
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
>
