Markus,

The filtering and normalisation of URLs happen in the map stage of the
first step of injection (i.e. we get CrawlDatum-s, which are then merged
with the existing ones), before they get deduplicated.
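
For the URLs in Dinçer's example, a rule along these lines in
conf/regex-normalize.xml would collapse the varying segments. This is a
sketch only: it assumes the numeric id is always the third path segment
under pageBla and that the id alone identifies the page.

<?xml version="1.0"?>
<regex-normalize>
  <!-- Sketch: rewrite /pageBla/<name>/<id>/<slug> to /pageBla/X/<id>/X
       so that all variants carrying the same id normalize to one URL. -->
  <regex>
    <pattern>(example\.com/pageBla/)[^/]+/(\d+)/[^/]+</pattern>
    <substitution>$1X/$2/X</substitution>
  </regex>
</regex-normalize>

If your checkout ships org.apache.nutch.net.URLNormalizerChecker, you can
pipe a few sample URLs through it to verify the rule before injecting.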

J

On 18 August 2011 20:57, Markus Jelsma <[email protected]> wrote:

>
> > Markus and Julien,
> >
> > Thanks for the advice. The point is there are other URLs that don't fit
> > my sample. Let this be a subset of the URLs injected at runtime, say:
> >
> >    - http://example.com/pageBla/John/*123*/blabla
> >    - http://example.com/pageBla/Doe/*123*/albalb
> >    - http://example.com/pageBla/Doe/*456*/albalb
> >    - http://example.com/pageBla/Doe/*789*/abcdef
> >
> > In this case I guess I cannot use URL normalization, because it is not
> > certain that there is more than one URL with the id 456. Long story
> > short, I need to know that a URL has already been added to the list in
> > either form, so that I can eliminate it by some method, e.g.
> > normalization.
>
> If you use regex normalization (regex-normalizer plugin) at injection and
> when processing outlinks, you bring them back to a common format. If
> /john/123/asdf has been indexed but normalized as X/123/X, then
> /doe/123/blaat will be normalized to X/123/X as well. Then you've
> successfully deduplicated these URLs.
> Of course, provided that the crawldb collapses duplicates to a single
> URL, which I believe it does. I'm not sure though.
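
(That's indeed what happens: the CrawlDb is keyed by URL, so once two URLs
normalize to the same string, the merge sees them under a single key and
keeps one entry. Conceptually, and this is a sketch rather than the actual
Injector/CrawlDb reducer, the reduce side behaves like this:)

// Sketch only, NOT the real Nutch code: illustrates why at most one
// entry per normalized URL survives the inject/merge step. The URL is
// the MapReduce key, so all CrawlDatum-s for the same normalized URL
// meet in a single reduce() call.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class OneEntryPerUrlReducer extends MapReduceBase
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

  @Override
  public void reduce(Text url, Iterator<CrawlDatum> values,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    // Duplicates arrive together under the same normalized URL key;
    // emit a single CrawlDatum for it (the real code merges statuses).
    output.collect(url, values.next());
  }
}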
> >
> > What do you think?
> >
> > Dincer
> >
> >
> > 2011/8/18 Markus Jelsma <[email protected]>
> >
> > > Mmm yes..
> > >
> > > What will actually happen if we use a regex normalizer to produce a
> > > common form by setting a static value X for the first, second and
> > > fourth URI segments?
> > >
> > > This would produce http://example.com/X/X/123/X for both URLs. Now
> > > we have a duplicate URL; will the reducer deduplicate and write out
> > > one URL?
> > >
> > > If that's the case, you can normalize throughout the whole crawl
> > > cycle and also add new items to the crawldb.
> > >
> > > Cheers
> > >
> > > > Can't you normalise these URLs into a common form? The
> > > > normalisation will be done as part of the injection and in the
> > > > subsequent steps, so you will have only one URL to fetch.
> > > >
> > > > On 18 August 2011 14:53, Dinçer Kavraal <[email protected]> wrote:
> > > > > Hi Markus,
> > > > >
> > > > > Thanks, but I need to prevent the download, because I have more
> > > > > CPU resources than bandwidth :) Therefore, it is more important
> > > > > to deal with the beast before it is born.
> > > > >
> > > > > Dincer
> > > > >
> > > > >
> > > > > 2011/8/18 Markus Jelsma <[email protected]>
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > At the moment you cannot do this out-of-the-box. It's a very,
> > > > > > very nasty problem that needs a lot of thinking if you want to
> > > > > > prevent downloading such URLs.
> > > > > > What you can do is just download them and mark them as
> > > > > > duplicates by either using the simple hashing algorithm or a
> > > > > > more advanced text profile signature.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > On Thursday 18 August 2011 15:35:26 Dinçer Kavraal wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have two URLs such as:
> > > > > > > http://example.com/pageBla/John/*123*/blabla
> > > > > > > http://example.com/pageBla/Doe/*123*/albalb
> > > > > > > The thing is, these two URLs are the same because of the id
> > > > > > > part of the URL (which is *123* in this sample). How can I
> > > > > > > prevent downloading the same thing twice because of that?
> > > > > > >
> > > > > > > I think I can customize the injection classes, but how could
> > > > > > > I check whether another form of the URL has already been
> > > > > > > fetched?
> > > > > > >
> > > > > > > Any ideas? Thanks
> > > > > >
> > > > > > --
> > > > > > Markus Jelsma - CTO - Openindex
> > > > > > http://www.linkedin.com/in/markus17
> > > > > > 050-8536620 / 06-50258350
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
