Hi Yossi,

Thank you.  I finally got it to work using this configuration:

<property>
    <name>index.replace.regexp</name>
    <value>
       url:site=/https?:..([a-zA-Z0-9]+).mydomain.ca.*/$1/
    </value>
</property>
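
For anyone who wants to sanity-check the pattern outside Nutch, here is a minimal Java sketch. It assumes, per Yossi's note below, that index-replace hands the pattern and the replacement to Matcher.replaceAll. The class and method names (SiteFromUrl, site) are made up for illustration and are not part of Nutch.

```java
import java.util.regex.Pattern;

public class SiteFromUrl {
    // Same pattern as in the config value above. The dots are left unescaped,
    // exactly as in the config, so each "." matches any single character.
    private static final Pattern HOST =
        Pattern.compile("https?:..([a-zA-Z0-9]+).mydomain.ca.*");

    // Hypothetical helper: replaces the whole matched URL with capture group 1.
    static String site(String url) {
        return HOST.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.mydomain.ca/test/path",
            "http://foo.mydomain.ca/some/other/path",
            "https://bar.mydomain.ca/another/example"
        };
        for (String url : urls) {
            System.out.println(url + " -> site: " + site(url));
        }
    }
}
```

In plain Java, a URL that does not match the pattern comes back from replaceAll unchanged. I have not checked how index-replace treats that case.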

cheers,
Ryan

On Sat, 2018-10-13 at 03:13 +0300, Yossi Tamari wrote:
> Hi Ryan,
> 
> From looking at the code of index-replace, it uses Java's
> Matcher.replaceAll
> <https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#replaceAll-java.lang.String->,
> so $1 (for example) should work.
> 
> Yossi.
> 
> > -----Original Message-----
> > From: Ryan Suarez <ryan.sua...@sheridancollege.ca>
> > Sent: 13 October 2018 01:38
> > To: user@nutch.apache.org
> > Subject: index-replace: variable substitution?
> > 
> > Greetings,
> > 
> > I'm using binaries of nutch v1.15 with solr v7.3.1, and index-replace
> > to copy a substring of the 'url' field to a new 'site' field.  Here is
> > the definition in my nutch-site.xml:
> > 
> > <property>
> >     <name>index.replace.regexp</name>
> >     <value>
> >        urlmatch=.*www.mydomain.ca.*
> >         url:site=/.*www.mydomain.ca.*/www/
> > 
> >        urlmatch=.*foo.mydomain.ca.*
> >         url:site=/.*foo.mydomain.ca.*/foo/
> > 
> >        urlmatch=.*bar.mydomain.ca.*
> >         url:site=/.*bar.mydomain.ca.*/bar/
> >     </value>
> > </property>
> > 
> > This works as expected.  I am given the following site values for
> > the given url
> > values:
> > 
> > url: https://www.mydomain.ca/test/path -> site: www
> > url: http://foo.mydomain.ca/some/other/path -> site: foo
> > url: https://bar.mydomain.ca/another/example -> site: bar
> > 
> > However, it means I need a separate definition for every host or
> > subdomain I am crawling (i.e. www, foo, bar).  Can I use variable
> > substitution in index-replace, or is there another way for me to do
> > this automatically?
> > 
> > regards,
> > Ryan
> 
> 
