Julien,

It might not have worked because the modified URL is not actually a published
link. However, as far as I can tell, there is a quirk in the dynamic page: it
only considers the part of the path up to and including the id (the numbers).
I have achieved my desired result, though that doesn't mean it will always
work this way.

I still believe a customized injection class is needed to watch for these
kinds of URLs as a general solution.

For now I am going ahead with this; if I come up with a general solution later
on, I will share it here. Thanks for the suggestions.
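For reference, the pre-injection dedup idea could be sketched roughly like
this: collapse each URL onto a canonical key built from its numeric id
segment, so that two URLs differing only in the name segment map to the same
key. This is only a sketch; the regex pattern is an assumption based on the
sample URLs in the thread below, not a general rule.

```python
import re

# Hypothetical pattern matching the sample URLs discussed in this thread:
# http://example.com/pageBla/<name>/<id>/<slug>
ID_PATTERN = re.compile(r"^(https?://[^/]+)/pageBla/[^/]+/(\d+)/.*$")

def canonical_key(url):
    """Return a normalized key for the URL, or the URL itself if it
    does not match the expected shape."""
    m = ID_PATTERN.match(url)
    if m:
        host, page_id = m.groups()
        return "%s/pageBla/X/%s/X" % (host, page_id)
    return url

def deduplicate(urls):
    """Keep only the first URL seen for each canonical key."""
    seen = set()
    kept = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

Fed the four sample URLs from the thread below, this keeps John/123, Doe/456
and Doe/789, dropping Doe/123 as a duplicate of John/123.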

Best,
Dincer


2011/8/19 Julien Nioche <[email protected]>

> Markus,
>
> The filtering and normalisation of URLs happens in the map stage of the
> first step of injection (i.e. we get CrawlDatum-s which are then merged
> with
> the existing ones) before they get deduplicated
>
> J
>
> On 18 August 2011 20:57, Markus Jelsma <[email protected]> wrote:
>
> >
> > > Markus and Julien,
> > >
> > > Thanks for the advice. The point is that there are other URLs that don't
> > > fit my sample. Let this be a subset of the injected URLs at runtime, say:
> > >
> > >    - http://example.com/pageBla/John/*123*/blabla
> > >    - http://example.com/pageBla/Doe/*123*/albalb
> > >    - http://example.com/pageBla/Doe/*456*/albalb
> > >    - http://example.com/pageBla/Doe/*789*/abcdef
> > >
> > > In this case, I guess I cannot use URL normalization, because it is not
> > > certain that there is more than one URL with the id 456. Long story
> > > short, I need to know whether this URL has already been added to the list
> > > in either form, so I can eliminate it by some method, e.g. normalization.
> >
> > If you use regex normalization (the regex-normalizer plugin) during
> > injection and when processing outlinks, you bring the URLs back to a
> > common format. If /john/123/asdf has been indexed but normalized as
> > X/123/X, then /doe/123/blaat will be normalized to X/123/X as well. Then
> > you've successfully deduplicated these URLs.
> > Of course, this assumes the crawldb collapses duplicates to a single URL,
> > which I believe it does. I'm not sure, though.
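In Nutch, the regex-normalizer rule Markus describes would live in
conf/regex-normalize.xml. A sketch along these lines, where the pattern is an
untested assumption based on the sample URLs:

```xml
<regex-normalize>
  <!-- Hypothetical rule: collapse the name segment and trailing slug of
       /pageBla/<name>/<id>/<slug> URLs to a static X, so that URLs
       differing only in those segments normalize to the same form. -->
  <regex>
    <pattern>(/pageBla/)[^/]+(/[0-9]+/).*</pattern>
    <substitution>$1X$2X</substitution>
  </regex>
</regex-normalize>
```

Provided the crawldb merges identical URLs, the duplicates should then
collapse as Markus describes.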
> > >
> > > What do you think?
> > >
> > > Dincer
> > >
> > >
> > > 2011/8/18 Markus Jelsma <[email protected]>
> > >
> > > > Mmm, yes..
> > > >
> > > > What would actually happen if we used a regex normalizer to produce a
> > > > common form by setting a static value X for the first, second and
> > > > fourth URI segments?
> > > >
> > > > This would produce http://example.com/X/X/123/X for both URLs. Now we
> > > > have a duplicate URL; will the reducer deduplicate and write out one
> > > > URL?
> > > >
> > > > If that's the case you can normalize throughout the whole crawl cycle
> > > > and also add new items to the crawldb.
> > > >
> > > > Cheers
> > > >
> > > > > Can't you normalise these URLs into a common form? The normalisation
> > > > > will be done as part of the injection and in the subsequent steps, so
> > > > > you will have only one URL to fetch.
> > > > >
> > > > > On 18 August 2011 14:53, Dinçer Kavraal <[email protected]> wrote:
> > > > > > Hi Markus,
> > > > > >
> > > > > > Thanks, but I need to prevent the download, because I have more
> > > > > > CPU resources than bandwidth :) Therefore, it is more important to
> > > > > > deal with the beast before it is born.
> > > > > >
> > > > > > Dincer
> > > > > >
> > > > > >
> > > > > > 2011/8/18 Markus Jelsma <[email protected]>
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > At the moment you cannot do this out of the box. It's a very,
> > > > > > > very nasty problem that needs a lot of thinking if you want to
> > > > > > > prevent downloading such URLs.
> > > > > > > What you can do is simply download them and mark them as
> > > > > > > duplicates, either using the simple hashing algorithm or a more
> > > > > > > advanced text profile signature.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > On Thursday 18 August 2011 15:35:26 Dinçer Kavraal wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I have two URLs such as:
> > > > > > > > http://example.com/pageBla/John/*123*/blabla
> > > > > > > > http://example.com/pageBla/Doe/*123*/albalb
> > > > > > > > The thing is, these two URLs are the same because of the id
> > > > > > > > part of the URL (which is *123* in this sample). How could I
> > > > > > > > prevent downloading the same thing twice because of that?
> > > > > > > >
> > > > > > > > I think I can customize the injection classes, but how could I
> > > > > > > > check whether another form of the URL has already been fetched?
> > > > > > > >
> > > > > > > > Any ideas? Thanks
> > > > > > >
> > > > > > > --
> > > > > > > Markus Jelsma - CTO - Openindex
> > > > > > > http://www.linkedin.com/in/markus17
> > > > > > > 050-8536620 / 06-50258350
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
