Re: Regex to block some patterns

2018-10-05 Thread Amarnatha Reddy
Hi Sebastian,

Thanks for the update. After spending a long time on it, here is the regex
pattern that blocks my use case:

-.*(modal[-_a-zA-Z0-9]*[\.]html|exit.html[\/]?\??.*|model[-_a-zA-Z0-9]*[\.]html|exitpage.*|exitPage.*)

There was another pattern that caused the whole block; I have rectified it.
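As a quick check, this rule can be exercised against the sample URLs from the thread. This is an illustrative Python sketch, not Nutch code: the `*` characters surrounding the rule in the mail look like bold-formatting residue rather than regex syntax, so the pattern is used without them, and Python's re.match stands in for the Matcher.lookingAt() semantics that Nutch's regex URL filter uses.

```python
import re

# The rule body (everything after the '-' deny marker), with the
# apparent bold-formatting '*' residue stripped off both ends.
PATTERN = re.compile(
    r".*(modal[-_a-zA-Z0-9]*[\.]html"
    r"|exit.html[\/]?\??.*"
    r"|model[-_a-zA-Z0-9]*[\.]html"
    r"|exitpage.*|exitPage.*)"
)

blocked = [
    "https://www.abc.com/abc-editions/2018/test-ask/altitude/feature-pillar/abc/acb-1/modal.html",
    "https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html",
    "https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp",
]
allowed = [
    "https://www.abc.com/",
    "https://www.abc.com/2018/overview.html",  # hypothetical page that should pass
]

# re.match anchors at the start of the string only, like Java's lookingAt().
for url in blocked:
    assert PATTERN.match(url) is not None, url   # deny rule fires
for url in allowed:
    assert PATTERN.match(url) is None, url       # deny rule does not fire
```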


Thanks,
Amarnath Polu



Re: Regex to block some patterns

2018-10-05 Thread govind nitk
Also, check the last regex line:

# accept anything else
+.

If you have made it negative (-.) by mistake, everything will be discarded.
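To see why that last line matters: Nutch's regex-urlfilter applies rules top to bottom and the first matching rule wins, with '+' accepting and '-' rejecting the URL. A minimal Python sketch of that evaluation (illustrative only; Nutch itself is Java):

```python
import re

# Rules as (sign, pattern) pairs, in file order. First match decides.
RULES = [
    ("-", re.compile(r"^.+(?:modal|exit).*\.html")),  # block modal/exit pages
    ("+", re.compile(r".")),                          # accept anything else
]

def filter_url(url):
    for sign, pattern in RULES:
        if pattern.match(url):                # lookingAt-style match
            return url if sign == "+" else None
    return None  # no rule matched: rejected by default

assert filter_url("https://www.abc.com/") == "https://www.abc.com/"
assert filter_url("https://www.abc.com/a/modal.html") is None
# Flip the last rule's '+' to '-' and every URL, seeds included, is discarded.
```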

Best,
Govind


Re: Regex to block some patterns

2018-10-05 Thread Sebastian Nagel
Hi Amarnath,

the only possibility is that https://www.abc.com/ is skipped
- by another rule in regex-urlfilter.txt
- or another URL filter plugin

Please check your configuration carefully. You may also use the tool
  bin/nutch filterchecker
to test the filters beforehand: every active filter individually
and all in combination.
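Conceptually, that check can be sketched like this (illustrative Python, not Nutch code; the plugin names and toy filter bodies below are assumptions made up for the example):

```python
# Toy stand-ins for two URL filter plugins. A Nutch URL filter returns
# the URL to keep it, or None to reject it.
filters = {
    "urlfilter-regex":  lambda url: None if "modal" in url or "exit" in url else url,
    "urlfilter-suffix": lambda url: None if url.endswith(".gif") else url,
}

def check(url):
    # Every active filter individually...
    individual = {name: f(url) for name, f in filters.items()}
    # ...and all in combination (a single None rejects the URL).
    combined = url
    for f in filters.values():
        if combined is None:
            break
        combined = f(combined)
    return individual, combined

individual, combined = check("https://www.abc.com/")
assert combined == "https://www.abc.com/"                     # seed passes every filter
assert check("https://www.abc.com/a/modal.html")[1] is None   # rejected by the regex stand-in
```

Running each filter on its own, as filterchecker can, shows exactly which plugin is responsible when a URL disappears.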

Best,
Sebastian




Re: Regex to block some patterns

2018-10-03 Thread Amarnatha Reddy
Hi Markus,

Thanks a lot for the quick update, but I applied the same rule and the seed
itself is completely rejected, leaving no URLs to inject.

I have applied the same regex: -^.+(?:modal|exit).*\.html
seed.txt: https://www.abc.com/
The regex seems fine, but the blocking isn't working with Nutch 1.15.
Any thoughts, please?

Here is sample output:
[Nutch]$ bin/crawl -i -D abccollection -s urls/ crawl/ -1
Injecting seed URLs
/test/Nutch/TEST/test2_Nutch/bin/nutch inject crawl//crawldb urls/
Injector: starting at 2018-10-04 04:43:14
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injecting seed URL file file:/test/Nutch/TEST/test2_Nutch/urls/seed.txt
Injector: overwrite: false
Injector: update: false
Injector: Total urls rejected by filters: 1
Injector: Total urls injected after normalization and filtering: 0
Injector: Total urls injected but already in CrawlDb: 0
Injector: Total new urls injected: 0
Injector: Total urls with status gone removed from CrawlDb
(db.update.purge.404): 0
Injector: finished at 2018-10-04 04:43:16, elapsed: 00:00:02
Thu Oct 4 04:43:16 UTC 2018 : Iteration 1
Generating a new segment
/test/Nutch/TEST/test2_Nutch/bin/nutch generate -D mapreduce.job.reduces=2
-D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false
-D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true
crawl//crawldb crawl//segments -topN 5 -numFetchers 1 -noFilter
Generator: starting at 2018-10-04 04:43:17
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 5
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
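The log above shows the seed itself being rejected at inject time. As a quick sanity check (a Python sketch, not Nutch code): this rule alone does not match the bare seed URL under lookingAt semantics (mirrored by Python's re.match), so the seed's rejection must come from some other rule or filter plugin.

```python
import re

# The rule applied above, without the leading '-' deny marker.
rule = re.compile(r"^.+(?:modal|exit).*\.html")

# Java's Matcher.lookingAt() anchors at the start of the input only;
# Python's re.match has the same semantics.
assert rule.match("https://www.abc.com/") is None               # seed is NOT matched
assert rule.match("https://www.abc.com/x/modal_1.html") is not None  # sample page IS matched
```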

Thanks,
Amarnath Polu



--
Thanks and Regards,
Amarnath Polu


RE: Regex to block some patterns

2018-10-03 Thread Markus Jelsma
Hi Amarnatha,

-^.+(?:modal|exit).*\.html

It will work for all examples given.

You can test regexes online [1]. If lookingAt returns true for each input,
Nutch's regex filter will filter out those URLs.

Regards,
Markus

[1] https://www.regexplanet.com/advanced/java/index.html
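The lookingAt behaviour referred to above is the java.util.regex.Matcher method: it requires a match at the beginning of the input but not over its whole length, unlike matches(). A small sketch of the distinction, using Python stand-ins for the Java calls (re.match mirrors lookingAt, re.fullmatch mirrors matches):

```python
import re

p = re.compile(r"^.+(?:modal|exit).*\.html")
url = "https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp"

assert p.match(url) is not None    # lookingAt-style: the URL is filtered
assert p.fullmatch(url) is None    # a matches-style test would miss it
```

This is why the rule still catches exit.html URLs with trailing query strings.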
 
 
-Original message-
> From:Amarnatha Reddy 
> Sent: Wednesday 3rd October 2018 15:23
> To: user@nutch.apache.org
> Subject: Regex to block some patterns
> 
> Hi Team,
> 
> 
> 
> I need some assistance to block patterns in my current setup.
> 
> 
> 
> My seed URL is always https://www.abc.com/ and I
> need to crawl all pages except the patterns below in Nutch 1.15.
> 
> 
> Blocking patterns: modal(.*).html, exit.html?, and exit.html/?
> 
> Sample pages: modal.html, modal_1123Abc.html, modalaa_12.html (these could
> appear at the end of the URL path)
> 
> 
> 
> Below are a few example URLs:
> 
> 
> https://www.abc.com/abc-editions/2018/test-ask/altitude/feature-pillar/abc/acb-1/modal.html
> 
> https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html
> 
> https://www.abc.com/2017/image/exterior/abc/feature_overlay/modalcontainer5_Ab_c.html
> 
> 
> 
> exit.html (here, anything like exit.html? or exit.html/?)
> 
> 
> The ask here: after the domain (https://www.abc.com/), anything starting with
> exit.html, exit.html?, or exit.html/? needs to be blocked/excluded from the crawl.
> 
>  https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp
> 
> https://www.abc.com/exit.html/?tname=abc_facebook=http://www.facebook.com/abc=true
> 
> 
> Note: yes, we could directly add -^(complete url) rules, but we don't know
> how many there are, so we need a generic regex rule.
> 
> 
> I tried the pattern below, but it is not working:
> 
> ## Blocking pattern ends with 
> 
> -^(?i)\*(modal*|exit*).html
> 
> 
> 
> Kindly help me set up a regex to block my use case.
> 
> 
> 
> Thanks,
> 
> Amarnath
> 
> 
> 
> 
> --
> 
> Thanks and Regards,
> 
> Amarnath Polu
>