And, from the java.util.regex.Pattern documentation:

Perl constructs not supported by this class:
  - The conditional constructs (?{X}) and (?(condition)X|Y),
  - The embedded code constructs (?{code}) and (??{code}),
  - The embedded comment syntax (?#comment), and
  - The preprocessing operations \l \u, \L, and \U.
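
To illustrate the last item, here is a minimal sketch of what plain java.util.regex does with the pattern and substitution from the rule quoted further down. The class and method names are made up for the example, and it assumes the normalizer hands the substitution string to Matcher.replaceAll unchanged, which I have not checked against the plugin source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
final class ReplacementDemo {

    // Applies the pattern "(.*)" with the substitution \L$1\E,
    // the same strings as in the regex-normalize.xml rule.
    static String applyRule(String url) {
        Matcher m = Pattern.compile("(.*)").matcher(url);
        // In a Java replacement string a backslash only escapes the
        // next character, so \L is a literal 'L' and \E a literal 'E'.
        return m.replaceAll("\\L$1\\E");
    }

    public static void main(String[] args) {
        // The result starts with a literal 'L' and the case of the
        // URL is left untouched.
        System.out.println(applyRule("http://Some.Page.org/?Page=1"));
    }
}
```

So the substitution never lowercases anything; it only adds the literal letters around the captured group.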
You should make a custom URL Normalizer to get this to work.
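
Roughly, the core of such a normalizer could look like the sketch below. This is only the lowercasing logic under my own assumptions: the class name is hypothetical, and the wiring into Nutch's URL normalizer plugin mechanism (interface, plugin.xml, configuration) is left out on purpose.

```java
import java.util.Locale;

// Hypothetical helper, not an existing Nutch class.
final class LowerCaseUrlSketch {

    // Lowercases the whole URL string, including path and query,
    // which is what the original question asks for.
    static String normalize(String urlString) {
        if (urlString == null) {
            return null;
        }
        // Locale.ROOT avoids locale-specific case mappings
        // (for example the Turkish dotless i).
        return urlString.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(
            normalize("http://Some.Page.org/?Page=2633&PID=1042&Site=191"));
    }
}
```

Keep in mind that paths and query strings can be case-sensitive on the server side, so lowercasing everything may change which page a URL points to.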
But why? That doesn't seem right.
On Tue, 08 May 2012 14:46:14 +0200, Markus Jelsma wrote:
I'm not sure this is going to work as a lowercase flag is used on the
regular expressions.
On Tue, 08 May 2012 13:37:47 +0100, Dean Pullen wrote:
Hi all,
I'm trying to lowercase all URLs via Nutch's regex-normalize.xml.
The regex looks like:
<regex>
  <pattern>(.*)</pattern>
  <substitution>\L$1\E</substitution>
</regex>
This appears to be correct, yet we're seeing this when we dump the
DB:
"Lhttp://some.page.org/?page=2633&pid=1042ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"
"Lhttp://some.page.org/?page=2633&pid=1043ELE&site=191";1;"db_unfetched";Tue May 01 17:37:56 BST 2012;Thu Jan 01 01:00:00 GMT 1970;0;2592000.0;30.0;500.0;"null"
Notice that each URL now starts with an L, so it no longer matches the
http/https rules in another config. Is there a problem with the regex above?
Regards,
Dean Pullen
--
Markus Jelsma - CTO - Openindex