If you look at my first email in this thread, it says filtering: false and 
normalising: false. Even then, it didn’t generate anything.

Here’s my regex-urlfilter.txt file:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jp$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

And here’s my regex-normalize.xml:

<?xml version="1.0"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
     This is intended so that users can specify substitutions to be
     done on URLs. The regex engine that is used is Perl5 compatible.
     The rules are applied to URLs in the order they occur in this file.  -->

<!-- WATCH OUT: an xml parser reads this file an ampersands must be
     expanded to &amp; -->

<!-- The following rules show how to strip out session IDs, default pages, 
     interpage anchors, etc. Order does matter!  -->
<regex-normalize>

<!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
<regex>
  
<pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
  <substitution>$4</substitution>
</regex>

<!-- changes default pages into standard for /index.html, etc. into /
<regex>
  
<pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
  <substitution>/$3</substitution>
</regex> -->

<!-- removes interpage href anchors such as site.com#location -->
<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>
<!-- The following rules show how to strip out session IDs, default pages, 
     interpage anchors, etc. Order does matter!  -->
<regex-normalize>

<!-- cleans ?&amp;var=value into ?var=value -->
<regex>
  <pattern>\?&amp;</pattern>
  <substitution>\?</substitution>
</regex>

<!-- cleans multiple sequential ampersands into a single ampersand -->
<regex>
  <pattern>&amp;{2,}</pattern>
  <substitution>&amp;</substitution>
</regex>

<!-- removes trailing ? -->
<regex>
  <pattern>[\?&amp;\.]$</pattern>
  <substitution></substitution>
</regex>

<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
</regex-normalize>
-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople

On 6 June 2014 at 1:54:02 am, Lewis John Mcgibbney ([email protected]) 
wrote:

I suspect that your generator normalization/filtering prevents this URL from 
getting through

On Thu, Jun 5, 2014 at 1:09 PM, Manikandan Saravanan 
<[email protected]> wrote:

14/06/05 15:59:06 INFO crawl.GeneratorJob: GeneratorJob: filtering: true
14/06/05 15:59:06 INFO crawl.GeneratorJob: GeneratorJob: normalizing: true

Map input records in Generator phase is 0... this is incorrect.

Lewis

Reply via email to