After doing this, 3 fewer URLs were rejected.

Thanks and Regards,
Shubham Gupta

On Monday 03 October 2016 10:28 AM, Sachin Shaju wrote:
You could check by commenting out all the regex filters in the url-filter
file and trying just +. to see whether it gives the same output.

Regards,
Sachin Shaju

On Mon, Oct 3, 2016 at 10:05 AM, shubham.gupta <[email protected]>
wrote:

Hey

When the inject job runs, 90% of my seed URLs get rejected. As a result,
very few URLs get crawled and the crawl does not give proper output.

My regex-urlfilter properties are as follows:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
-^(http://up.anv.bz)
+.

# skip URLs longer than 512 characters
-^.{513,}$
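
For anyone following the thread, a rough way to see which rule rejects a given seed is to replay the rules outside Nutch. The sketch below is only an approximation in Python, not Nutch's own code: it assumes the filter applies rules top to bottom, that the first rule whose regex is found anywhere in the URL decides, and that a URL matching no rule is rejected. The rules are the ones listed above, with each rule on a single line. The helper names (parse_rules, decide) and the example.com seeds are made up for illustration.

```python
import re

# Rules from the message above, one rule per line.
# '+' means accept, '-' means reject; '#' lines are comments.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|pdf|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
-^(http://up.anv.bz)
+.
-^.{513,}$
"""


def parse_rules(text):
    """Turn the rule text into a list of (accept, rule_line, compiled_regex)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", line, re.compile(pattern)))
    return rules


def decide(url, rules):
    """Assumed first-match-wins: return (accepted, deciding_rule_line).

    If no rule matches, the URL is treated as rejected and the rule is None.
    """
    for accept, line, regex in rules:
        if regex.search(url):
            return accept, line
    return False, None


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    # Hypothetical seeds; replace with lines from the real seed file.
    seeds = [
        "http://example.com/page.html",
        "http://example.com/logo.png",
        "http://up.anv.bz/foo",
        "http://example.com/" + "a" * 600,
    ]
    for url in seeds:
        accepted, rule = decide(url, RULES)
        print("ACCEPT" if accepted else "REJECT", rule, url[:60])
```

Running the real seed list through decide() and counting rejections per rule should show which line is responsible. Under the first-match assumption, the length rule placed after +. is never reached, so the 600-character URL in the sketch is accepted by +. rather than rejected.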

--

Shubham Gupta


