Greetings,
I'm using the binaries of Nutch v1.15 with Solr v7.3.1, and the
index-replace plugin to copy a substring of the 'url' field into a new
'site' field. Here is the definition in my nutch-site.xml:
<property>
<name>index.replace.regexp</name>
<value>
urlmatch=.*www.mydomain.ca.*
url:site=/.*www.mydomain.ca.*/www/
urlmatch=.*foo.mydomain.ca.*
url:site=/.*foo.mydomain.ca.*/foo/
urlmatch=.*bar.mydomain.ca.*
url:site=/.*bar.mydomain.ca.*/bar/
</value>
</property>
This works as expected. I get the following site values for the
given url values:
url: https://www.mydomain.ca/test/path -> site: www
url: http://foo.mydomain.ca/some/other/path -> site: foo
url: https://bar.mydomain.ca/another/example -> site: bar
However, this means I need a separate definition for every host or
subdomain I am crawling (e.g. www, foo, bar). Can I use variable
substitution in index-replace, or is there another way for me to do this
automatically?
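To illustrate what I am hoping for, here is an untested sketch of a
single rule that would cover every subdomain. It assumes that
index-replace hands the replacement string to Java's regex engine, so
that $1 refers to the first capture group; I have not confirmed that the
plugin behaves this way. The pattern avoids '/' inside the regex because
'/' is the delimiter in the rule:

```xml
<property>
<!-- Assumption: $1 is expanded to the first capture group (untested) -->
<name>index.replace.regexp</name>
<value>
urlmatch=.*\.mydomain\.ca.*
url:site=/^https?:..([^.]+)\.mydomain\.ca.*$/$1/
</value>
</property>
```

If that assumption holds, https://www.mydomain.ca/test/path would give
site: www, and any new subdomain would be handled without a new rule.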
regards,
Ryan