Greetings,

I'm using the binary distribution of Nutch v1.15 with Solr v7.3.1, and
the index-replace plugin to copy a substring of the 'url' field to a new
'site' field.  Here is the definition in my nutch-site.xml:

<property>
    <name>index.replace.regexp</name>
    <value>
       urlmatch=.*www.mydomain.ca.*
       url:site=/.*www.mydomain.ca.*/www/

       urlmatch=.*foo.mydomain.ca.*
       url:site=/.*foo.mydomain.ca.*/foo/

       urlmatch=.*bar.mydomain.ca.*
       url:site=/.*bar.mydomain.ca.*/bar/
    </value>
</property>

This works as expected.  I get the following site values for the
given url values:

url: https://www.mydomain.ca/test/path -> site: www
url: http://foo.mydomain.ca/some/other/path -> site: foo
url: https://bar.mydomain.ca/another/example -> site: bar

However, it means I need a separate definition for every host or
subdomain I am crawling (i.e. www, foo, bar).  Can I use variable
substitution in index-replace, or is there another way for me to do this
automatically?
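
To illustrate what I mean by variable substitution, here is a minimal
standalone Java sketch of the behaviour I am after: a single pattern
with a capture group, and a "$1" replacement that keeps only the first
host label.  The class name SiteFromUrl and the pattern are hypothetical
and only for illustration.  I have not confirmed whether index-replace
accepts "$1" backreferences in its replacement string, so this shows
plain java.util.regex behaviour, not the plugin's.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: derives a 'site' value from a URL.
class SiteFromUrl {
    // Capture the first host label after the scheme, e.g. "www" or "foo".
    private static final Pattern FIRST_LABEL =
        Pattern.compile("^https?://([^./]+)\\..*$");

    // Replaces the whole URL with the captured label.
    // If the pattern does not match, the input is returned unchanged.
    static String site(String url) {
        return FIRST_LABEL.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(site("https://www.mydomain.ca/test/path"));
        System.out.println(site("http://foo.mydomain.ca/some/other/path"));
        System.out.println(site("https://bar.mydomain.ca/another/example"));
    }
}
```

If index-replace hands its replacement string to the Java regex engine
in the same way, one definition along these lines might cover every
subdomain, but that is an assumption on my part.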

regards,
Ryan
