Hi Yossi,
Thank you. I finally got it to work using this configuration:
<property>
<name>index.replace.regexp</name>
<value>
url:site=/https?:..([a-zA-Z0-9]+).mydomain.ca.*/$1/
</value>
</property>
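For anyone finding this later, here is a minimal Java sketch of what that pattern does. It assumes, as Yossi describes below, that index-replace hands the pattern and the replacement to Matcher.replaceAll. The class and method names (SiteFieldDemo, extractSite) are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SiteFieldDemo {
    // Same pattern as in the configuration above. The dots are unescaped,
    // so each one matches any single character (including "/" and ".").
    static final Pattern SITE =
            Pattern.compile("https?:..([a-zA-Z0-9]+).mydomain.ca.*");

    // Replaces every match with capture group 1 (the "$1" replacement).
    // If the pattern does not match, replaceAll returns the input unchanged.
    static String extractSite(String url) {
        Matcher m = SITE.matcher(url);
        return m.replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(extractSite("https://www.mydomain.ca/test/path"));
        System.out.println(extractSite("http://foo.mydomain.ca/some/other/path"));
        System.out.println(extractSite("https://bar.mydomain.ca/another/example"));
    }
}
```

Because the pattern covers the whole URL, the entire value is replaced by the captured host label, which is why one rule can stand in for the per-host rules.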
cheers,
Ryan
On Sat, 2018-10-13 at 03:13 +0300, Yossi Tamari wrote:
> Hi Ryan,
>
> From looking at the code of index-replace, it uses Java's
> Matcher.replaceAll
> <https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#replaceAll-java.lang.String->,
> so $1 (for example) should work.
>
>
>
> Yossi.
>
>
>
> > -----Original Message-----
> > From: Ryan Suarez <[email protected]>
> > Sent: 13 October 2018 01:38
> > To: [email protected]
> > Subject: index-replace: variable substitution?
> >
> > Greetings,
> >
> > I'm using binaries of nutch v1.15 with solr v7.3.1, and index-replace
> > to copy a substring of the 'url' field to a new 'site' field. Here is
> > the definition in my nutch-site.xml:
> >
> > <property>
> > <name>index.replace.regexp</name>
> > <value>
> > urlmatch=.*www.mydomain.ca.*
> > url:site=/.*www.mydomain.ca.*/www/
> >
> > urlmatch=.*foo.mydomain.ca.*
> > url:site=/.*foo.mydomain.ca.*/foo/
> >
> > urlmatch=.*bar.mydomain.ca.*
> > url:site=/.*bar.mydomain.ca.*/bar/
> > </value>
> > </property>
> >
> > This works as expected. I am given the following site values for
> > the given url
> > values:
> >
> > url: https://www.mydomain.ca/test/path -> site: www
> > url: http://foo.mydomain.ca/some/other/path -> site: foo
> > url: https://bar.mydomain.ca/another/example -> site: bar
> >
> > However, it means I have to have a definition for every host or
> > subdomain I am crawling (i.e. www, foo, bar). Can I use variable
> > substitution in index-replace, or is there another way for me to do
> > this automatically?
> >
> > regards,
> > Ryan