Hi Sebastian,

Thanks for the response. I resolved the issue; the cause was the following
rule in regex-urlfilter.txt, which rejected my URLs because they contain
? and =:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
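
A minimal sketch of why this rule wins, assuming the first-match-wins
evaluation described in the comments of regex-urlfilter.txt (the first
matching rule decides, and a URL matching no rule is rejected). It uses
Python's re module as a stand-in for Nutch's Java regex filter, and the
function name first_match is hypothetical, not part of Nutch:

```python
import re


def first_match(rules, url):
    """Return '+' or '-' from the first rule whose regex is found in url.

    Each rule is a string whose first character is the sign and whose
    remainder is the regex. A URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign
    return "-"


URL = "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G"
ALLOW = r"+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])"
SKIP_QUERIES = "-[?*!@=]"

# Skip rule listed first: the '?' and '=' in the URL trigger the rejection.
print(first_match([SKIP_QUERIES, ALLOW], URL))  # prints "-"
# Allow rule listed first: the URL is accepted before the skip rule is tried.
print(first_match([ALLOW, SKIP_QUERIES], URL))  # prints "+"
```

So either removing the skip rule or placing the allow pattern above it
lets the URL through.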

Thanks & Regards
Venkata MR
+91 98455 77125

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID> 
Sent: 04 December 2018 01:16
To: user@nutch.apache.org
Subject: Re: URL filter rejecting the URLs

Hi,

the pattern should work. Of course, you need to make sure that
- there are no other patterns coming before it in regex-urlfilter.txt
  which cause the URL to be rejected
- no other active URL filters reject the URL
- the folder of the regex-urlfilter.txt you're editing is first on the
  class path. Usually, $NUTCH_HOME/conf/regex-urlfilter.txt is used
- (optionally) you may simplify the regex: the characters /_= have no
  special meaning and do not need to be escaped with \
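
The last point can be checked quickly. A small sketch in Python, used here
only as a stand-in: for these simple patterns its re module treats a
backslash before / _ = the same way Java's regex engine does (as a literal
character), but that is an assumption worth re-checking for more exotic
patterns:

```python
import re

URL = "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L"

# Pattern as originally posted, with / _ = escaped.
escaped = r"^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])"
# Simplified pattern: only '.' and '?' keep their backslash.
simplified = r"^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])"

for pattern in (escaped, simplified):
    match = re.search(pattern, URL)
    # Both patterns match and capture the same category letter.
    print(bool(match), match.group(1) if match else None)
```

Both lines of output should be identical, which shows the extra
backslashes are harmless but unnecessary.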

The easiest way to test it (Nutch 1.15):
% cat $NUTCH_HOME/conf/regex-urlfilter.txt
+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])
-.
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G" \
   | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
+https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G


And with another "forbidden" URL:
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X" \
   | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
-https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X


Best,
Sebastian

On 12/1/18 2:45 PM, Venkata MR wrote:
> Hi Nutch Users,
> 
> I was trying to crawl the site
> (https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G,
> https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L)
> with the filter pattern
> "+^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])",
> but it is rejecting the URLs.
> 
> I tried multiple options, but in all cases the URLs are rejected.
> 
> Any help here is appreciated, Thanks!
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
