Hey!

We added the attribute "toUrlOrig" to Outlink.java and now store the
original URL for later use. To set the attribute toUrlOrig we extended the
constructor as follows:

  public Outlink(String toUrl, String anchor, Configuration conf)
      throws MalformedURLException {
    this.toUrlOrig = toUrl;   // we added this: the URL as found, before normalization
    this.toUrl = new URLNormalizers(conf, URLNormalizers.SCOPE_OUTLINK)
        .normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
    if (anchor == null) anchor = "";
    this.anchor = anchor;
  }

And yes, we also changed some of the behavior of ParseOutputFormat.
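
With the original URL preserved in toUrlOrig, the n parameter can be read
from it before any normalizer has touched it. The following is only a rough,
standalone sketch of that step: the class RawParamExtractor and the method
rawQueryParam are hypothetical names made up for illustration and are not part
of Nutch. Writing the value into the CrawlDatum metadata is a separate step
that is not shown here.

```java
// Hypothetical helper, not part of Nutch: reads a query parameter from the
// un-normalized URL string and returns it exactly as it appears
// (still percent-encoded).
class RawParamExtractor {

    /**
     * Returns the raw value of the query parameter {@code name} in
     * {@code url}, or null if the URL has no such parameter.
     */
    static String rawQueryParam(String url, String name) {
        // Drop the fragment first so a '?' inside it is not mistaken
        // for the start of the query string.
        int hash = url.indexOf('#');
        if (hash >= 0) {
            url = url.substring(0, hash);
        }
        int q = url.indexOf('?');
        if (q < 0) {
            return null; // no query string at all
        }
        String query = url.substring(q + 1);
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            if (key.equals(name)) {
                // A parameter without '=' is treated as an empty value.
                return eq < 0 ? "" : pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // First sample URL from the original question.
        String toUrlOrig = "https://www.example.com/customers/cards/index.php"
            + "?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF";
        System.out.println(rawQueryParam(toUrlOrig, "n"));
    }
}
```

The value is deliberately not URL-decoded, so it matches the form shown in the
XNPARAMETER example in the original question.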

Thank you,

BB

On Fri, Sep 23, 2011 at 2:37 PM, Markus Jelsma
<[email protected]> wrote:

> Incoming URLs are first normalized and then filtered in ParseOutputFormat.
> Parse filters are executed in the parser implementation, which runs in the
> mapper.
>
> On Tuesday 20 September 2011 20:07:58 Alexander Fahlke wrote:
> > Hi!
> >
> > I am trying to write additional metadata to my CrawlDB. This data has to
> > be extracted from the URLs BEFORE they get normalized (via regex
> > urlnormalizer).
> >
> > Here are some sample URLs:
> >
> > 1. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@002@@ZF
> > 2. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@101@@ZF
> > 3. https://www.example.com/customers/cards/index.php?n=%2Fcustomers%2Fcards%2F&SERVERID=ZF@@024@@ZF
> >
> > You can see three slightly different URLs. The only difference is the
> > SERVERID. The n-param and SERVERID-param are getting normalized to
> > prevent duplicates.
> > I want to save the n-param before normalization as metadata for later
> > use (to create canonical URLs; the n-param controls the page menu).
> >
> >
> > There are some problems I have with this.
> >
> > First: At which point in the crawl are the URLs getting normalized? I was
> > not able to figure that out since there are so many places where URLs get
> > normalized.
> > Second: How do I correctly add custom metadata to my CrawlDB, and WHEN
> > should I do this?
> >     e.g. save Metadata: _pst_: success(1), lastModified=0,XNPARAMETER=%2Fcustomers%2Fcards%2F
> >
> >
> > Any ideas where it would be possible to extract this BEFORE normalization
> > and also get access to the specific CrawlDatum to add metadata?
> >
> > BR
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Alexander Fahlke
Software Development
www.informera.de
