I'm not sure this is going to work, as a lowercase flag is used on the regular expressions.
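
A likely explanation for the stray characters, sketched below under the assumption that the normalizer applies each rule with java.util.regex (Matcher.replaceAll): Java replacement strings have no Perl-style \L...\E case conversion. A backslash there only escapes the next character, so \L and \E are emitted as a literal "L" and "E". The class name, method names, and the example URL are made up for illustration; only the pattern and substitution come from the message below.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class LowercaseUrlSketch {

    // Pattern and substitution copied from the regex-normalize.xml rule in the quoted message.
    static final Pattern RULE = Pattern.compile("(.*)");
    static final String SUBSTITUTION = "\\L$1\\E";

    // Applies the rule the way java.util.regex does. The backslashes in the
    // replacement only escape the following character, so no lowercasing happens.
    static String applyRule(String url) {
        return RULE.matcher(url).replaceAll(SUBSTITUTION);
    }

    // Plain-Java alternative (an assumption, not Nutch's own code):
    // lowercase the whole string without any regex.
    static String lowercase(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String url = "http://Some.Page.org/?Page=1"; // hypothetical example URL
        System.out.println(applyRule(url));
        System.out.println(lowercase(url));
    }
}
```

The first line printed is Lhttp://Some.Page.org/?Page=1ELE: (.*) matches the whole URL once and then matches the empty string at the end, so the substitution is applied twice and the case is left untouched. If this assumption about the normalizer holds, the lowercasing would have to be done in Java code rather than in the substitution string.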

On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen <[email protected]> wrote:
Hi all,


I'm trying to lowercase all URLs via Nutch's regex-normalize.xml.

The regex looks like:

<regex>
<pattern>(.*)</pattern>
<substitution>\L$1\E</substitution>
</regex>

This appears to be correct, yet we're seeing this when we dump the DB:


"Lhttp://some.page.org/?page=2633&pid=1042ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"

"Lhttp://some.page.org/?page=2633&pid=1043ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"


Notice that each URL now starts with an L (and so no longer matches
http/https in another config). Is this a problem with the regex above?

Regards,

Dean Pullen

--
Markus Jelsma - CTO - Openindex
