And...

Perl constructs not supported by this class:
    The conditional constructs (?{X}) and (?(condition)X|Y),
    The embedded code constructs (?{code}) and (??{code}),
    The embedded comment syntax (?#comment), and
    The preprocessing operations \l \u, \L, and \U.

You should make a custom URL Normalizer to get this to work.
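
For illustration, here is a minimal sketch in plain Java of why the substitution leaves a stray L and E, and what the lowercasing step of a custom normalizer could look like. It assumes the regex normalizer hands the substitution string to java.util.regex's Matcher.replaceAll, where a backslash only escapes the next character. The class and method names are hypothetical, and a real Nutch plugin would still have to wrap the lowercasing in a URLNormalizer implementation.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class LowercaseUrlSketch {

    // Applies the pattern and substitution from the thread with java.util.regex.
    // In a Java replacement string, "\L" and "\E" are just the literal
    // characters L and E; there is no Perl-style case conversion.
    static String perlStyleSubstitution(String url) {
        return Pattern.compile("(.*)").matcher(url).replaceAll("\\L$1\\E");
    }

    // What a custom normalizer could do instead: lowercase the string in code.
    // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
    static String lowercase(String url) {
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String url = "http://Some.Page.org/?Page=1"; // made-up example URL
        System.out.println(perlStyleSubstitution(url)); // begins with a literal L
        System.out.println(lowercase(url));
    }
}
```

Note that this lowercases the whole URL, including path and query string, which may change what some case-sensitive servers return.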

But why? It doesn't seem alright.

On Tue, 08 May 2012 14:46:14 +0200, Markus Jelsma wrote:
I'm not sure this is going to work as a lowercase flag is used on the
regular expressions.

On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen wrote:
Hi all,


I'm trying to lowercase all URLs via Nutch's regex-normalize.xml.

The regex looks like:

<regex>
<pattern>(.*)</pattern>
<substitution>\L$1\E</substitution>
</regex>

This appears to be correct, yet we're seeing this when we dump the DB:


"Lhttp://some.page.org/?page=2633&pid=1042ELE&site=191";1;"db_unfetched";Tue
May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT
1970;0;2592000.0;30.0;500.0;"null"

"Lhttp://some.page.org/?page=2633&pid=1043ELE&site=191";1;"db_unfetched";Tue
May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT
1970;0;2592000.0;30.0;500.0;"null"


Notice the URL starts with an L? (Thus not matching http/https in
another config). Is this some problem with the regex above?

Regards,

Dean Pullen

--
Markus Jelsma - CTO - Openindex
