Hi Yossi, Thank you. I finally got it to work using this configuration:
<property>
  <name>index.replace.regexp</name>
  <value>
    url:site=/https?:..([a-zA-Z0-9]+).mydomain.ca.*/$1/
  </value>
</property>

cheers,
Ryan

On Sat, 2018-10-13 at 03:13 +0300, Yossi Tamari wrote:
> Hi Ryan,
>
> From looking at the code of index-replace, it uses Java's
> Matcher.replaceAll
> <https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#replaceAll-java.lang.String->,
> so $1 (for example) should work.
>
> Yossi.
>
> > -----Original Message-----
> > From: Ryan Suarez <ryan.sua...@sheridancollege.ca>
> > Sent: 13 October 2018 01:38
> > To: user@nutch.apache.org
> > Subject: index-replace: variable substitution?
> >
> > Greetings,
> >
> > I'm using binaries of nutch v1.15 with solr v7.3.1, and index-replace
> > to copy a substring of the 'url' field to a new 'site' field. Here is
> > the definition in my nutch-site.xml:
> >
> > <property>
> >   <name>index.replace.regexp</name>
> >   <value>
> >     urlmatch=.*www.mydomain.ca.*
> >     url:site=/.*www.mydomain.ca.*/www/
> >
> >     urlmatch=.*foo.mydomain.ca.*
> >     url:site=/.*foo.mydomain.ca.*/foo/
> >
> >     urlmatch=.*bar.mydomain.ca.*
> >     url:site=/.*bar.mydomain.ca.*/bar/
> >   </value>
> > </property>
> >
> > This works as expected. I am given the following site values for the
> > given url values:
> >
> > url: https://www.mydomain.ca/test/path -> site: www
> > url: http://foo.mydomain.ca/some/other/path -> site: foo
> > url: https://bar.mydomain.ca/another/example -> site: bar
> >
> > However, it means I have to have a definition for every host or
> > subdomain I am crawling (i.e. www, foo, bar). Can I use variable
> > substitution in index-replace, or is there another way for me to do
> > this automatically?
> >
> > regards,
> > Ryan
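
For anyone finding this thread in the archives: the capture-group pattern from the working configuration can be tried outside Nutch with plain Java, since Yossi notes that index-replace relies on Matcher.replaceAll. The following is only a minimal sketch under that assumption. The class name SiteFieldDemo and the method extractSite are hypothetical names made up for illustration and are not part of Nutch; only the pattern and the $1 replacement are taken from the configuration above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration; it is not part of Nutch.
class SiteFieldDemo {

    // Same pattern as in the index.replace.regexp value above.
    // The unescaped dots match any character, exactly as in the original.
    private static final Pattern SITE_PATTERN =
            Pattern.compile("https?:..([a-zA-Z0-9]+).mydomain.ca.*");

    // Replaces the whole matched URL with capture group 1 (the subdomain).
    // A URL that does not match the pattern is returned unchanged.
    static String extractSite(String url) {
        Matcher matcher = SITE_PATTERN.matcher(url);
        return matcher.replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.mydomain.ca/test/path",
            "http://foo.mydomain.ca/some/other/path",
            "https://bar.mydomain.ca/another/example"
        };
        for (String url : urls) {
            System.out.println("url: " + url + " -> site: " + extractSite(url));
        }
    }
}
```

Two things to keep in mind when adapting this: a subdomain containing a hyphen will not match [a-zA-Z0-9]+, so such a URL would pass through unchanged, and escaping the dots (\.) would make the pattern stricter about what it accepts.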