Enable selenium Plugin

2018-12-03 Thread Venkata MR
Hi,

As part of crawling dynamic JavaScript content, I am trying to enable the 
Selenium plugin in Apache Nutch 1.15.
I am following 
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium to 
enable the plugin.

I get the error below and cannot proceed further. Any pointers?

[ivy:resolve] commons-lang#commons-lang;2.6 in default: excluding commons-lang#commons-lang;2.6!commons-lang.jar
[ivy:resolve] com.google.protobuf#protobuf-java;2.5.0 in default: excluding com.google.protobuf#protobuf-java;2.5.0!protobuf-java.jar(bundle)
[ivy:resolve] org.apache.httpcomponents#httpcore;4.4.7 in default: excluding org.apache.httpcomponents#httpcore;4.4.7!httpcore.jar
[ivy:resolve] ERROR: impossible to get artifacts when data has not been loaded. IvyNode = javax.measure#unit-api;1.0
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve]  WARNINGS
[ivy:resolve]   module not found: javax.measure#unit-api;working@LP-5CD7311YGR
[ivy:resolve]    local: tried
[ivy:resolve] C:\Users\venkata.mr\.ivy2/local/javax.measure/unit-api/working@LP-5CD7311YGR/ivys/ivy.xml
[ivy:resolve] -- artifact javax.measure#unit-api;working@LP-5CD7311YGR!unit-api.jar:
[ivy:resolve] C:\Users\venkata.mr\.ivy2/local/javax.measure/unit-api/working@LP-5CD7311YGR/jars/unit-api.jar
[ivy:resolve]    maven2: tried
[ivy:resolve] http://repo1.maven.org/maven2/javax/measure/unit-api/working@LP-5CD7311YGR/unit-api-work...@lp-5cd7311ygr.pom
[ivy:resolve] -- artifact javax.measure#unit-api;working@LP-5CD7311YGR!unit-api.jar:

Thanks & Regards
Venkata MR
+91 98455 77125

::DISCLAIMER::
--
The contents of this e-mail and any attachment(s) are confidential and intended 
for the named recipient(s) only. E-mail transmission is not guaranteed to be 
secure or error-free as information could be intercepted, corrupted, lost, 
destroyed, arrive late or incomplete, or may contain viruses in transmission. 
The e mail and its contents (with or without referred errors) shall therefore 
not attach any liability on the originator or HCL or its affiliates. Views or 
opinions, if any, presented in this email are solely those of the author and 
may not necessarily reflect the views or opinions of HCL or its affiliates. Any 
form of reproduction, dissemination, copying, disclosure, modification, 
distribution and / or publication of this message without the prior written 
consent of authorized representative of HCL is strictly prohibited. If you have 
received this email in error please delete it and notify the sender 
immediately. Before opening any email and/or attachments, please check them for 
viruses and other defects.
--


Re: [ask] Crawl Forum Site

2018-12-03 Thread lewis john mcgibbney
Hi Tukang,
In short yes. It would help if you could provide an example of what you've
tried and what you encountered/what your results were.
Lewis

On Mon, Dec 3, 2018 at 6:42 PM  wrote:

>
> From: tkg_cangkul 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 04 Dec 2018 09:40:47 +0700
> Subject: [ask] Crawl Forum Site
> Hi,
>
> Is it possible to crawl a web forum with Apache Nutch?
> If so, is there any configuration that I must add?
> I've tried it but got no results.
>
> Please help. I need advice.
>
> Thanks
>
> Best Regards,
> Tukang Cangkul
>
>

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


RE: URL filter rejecting the URLs

2018-12-03 Thread Venkata MR
Hi Sebastian,

Thanks for the response. I resolved the issue; the cause was the following 
rule in regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
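
For readers hitting the same problem: the rules in regex-urlfilter.txt are
tried top to bottom and the first one that matches decides, so a skip rule
placed above an accept rule wins for any URL containing ? or =. Below is a
minimal Python sketch of that behaviour. It is only a simulation, not Nutch's
code: the helper names parse_rules and accepts are made up for illustration,
the rule order is a hypothetical file layout, and Python's re module stands in
for the Java regex engine that Nutch uses.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter.txt-style lines into (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first rule that matches anywhere in the URL."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # no rule matched: the URL is ignored

URL = ("https://nseindia.com/live_market/dynaContent/live_analysis/"
       "top_gainers_losers.htm?cat=G")
ACCEPT = (r"+^https?://nseindia\.com/live_market/dynaContent/live_analysis/"
          r"top_gainers_losers\.htm\?cat=([GL])")

# Hypothetical layout: the skip rule sits above the accept rule.
blocked = parse_rules("-[?*!@=]\n" + ACCEPT + "\n-.")
# Same accept rule with the skip rule removed.
allowed = parse_rules(ACCEPT + "\n-.")

print(accepts(blocked, URL))  # False: the skip rule matches first
print(accepts(allowed, URL))  # True: the accept rule is reached
```

Removing the skip rule, commenting it out, or moving the accept rule above it
all lead to the second result.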

Thanks & Regards
Venkata MR
+91 98455 77125

-Original Message-
From: Sebastian Nagel  
Sent: 04 December 2018 01:16
To: user@nutch.apache.org
Subject: Re: URL filter rejecting the URLs

Hi,

the pattern should work. Of course, you need to make sure that
- no other pattern earlier in regex-urlfilter.txt causes the URL to be
  rejected
- no other active URL filter rejects the URL
- the folder of the regex-urlfilter.txt you're editing is first on the
  class path. Usually, $NUTCH_HOME/conf/regex-urlfilter.txt is used
- (optionally) you may simplify the regex: the characters /_= have no
  special meaning and do not need to be escaped by \

The easiest way to test it (Nutch 1.15):
% cat $NUTCH_HOME/conf/regex-urlfilter.txt
+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])
-.
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G)" \
   | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
+https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G)


And with another "forbidden" URL:
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X)" \
  | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
-https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X)


Best,
Sebastian

On 12/1/18 2:45 PM, Venkata MR wrote:
> Hi Nutch Users,
> 
> I was trying to crawl the site 
> (https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G,
>  
> https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L),
>  with the filter pattern 
> "+^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])",
>  but it is rejecting the URLs.
> 
> I tried multiple options, but in all cases the URLs are rejected.
> 
> Any help here is appreciated, Thanks!
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 

[ask] Crawl Forum Site

2018-12-03 Thread tkg_cangkul

Hi,

Is it possible to crawl a web forum with Apache Nutch?
If so, is there any configuration that I must add?
I've tried it but got no results.

Please help. I need advice.

Thanks

Best Regards,
Tukang Cangkul


Re: URL filter rejecting the URLs

2018-12-03 Thread Sebastian Nagel
Hi,

the pattern should work. Of course, you need to make sure that
- no other pattern earlier in regex-urlfilter.txt causes the URL to be
  rejected
- no other active URL filter rejects the URL
- the folder of the regex-urlfilter.txt you're editing is first on the
  class path. Usually, $NUTCH_HOME/conf/regex-urlfilter.txt is used
- (optionally) you may simplify the regex: the characters /_= have no
  special meaning and do not need to be escaped by \
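
The optional simplification in the last point can be checked quickly. The
sketch below compares the escaped pattern from the original question with the
simplified one on three URLs. It uses Python's re module as a stand-in for
Java's regex engine (an assumption; both accept a backslash before /, _ and =
as a plain literal), and the helper name verdicts is hypothetical.

```python
import re

# Pattern as posted in the question, with the redundant backslashes.
ESCAPED = (r"^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis"
           r"\/top\_gainers\_losers\.htm\?cat\=([GL])")
# Simplified pattern: only the metacharacters . and ? stay escaped.
SIMPLE = (r"^https?://nseindia\.com/live_market/dynaContent/live_analysis"
          r"/top_gainers_losers\.htm\?cat=([GL])")
BASE = ("https://nseindia.com/live_market/dynaContent/live_analysis/"
        "top_gainers_losers.htm?cat=")

def verdicts(pattern):
    """Map each cat value to whether the pattern is found in the URL."""
    return {cat: re.search(pattern, BASE + cat) is not None for cat in "GLX"}

print(verdicts(SIMPLE))                       # {'G': True, 'L': True, 'X': False}
print(verdicts(ESCAPED) == verdicts(SIMPLE))  # True
```

Both forms give the same verdicts, so a rejection of the cat=G and cat=L URLs
has to come from one of the other causes listed above.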

The easiest way to test it (Nutch 1.15):
% cat $NUTCH_HOME/conf/regex-urlfilter.txt
+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])
-.
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G)" \
   | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
+https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G)


And with another "forbidden" URL:
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X)" \
  | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
-https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X)


Best,
Sebastian

On 12/1/18 2:45 PM, Venkata MR wrote:
> Hi Nutch Users,
> 
> I was trying to crawl the site 
> (https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G,
>  
> https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L),
>  with the filter pattern 
> "+^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])",
>  but it is rejecting the URLs.
> 
> I tried multiple options, but in all cases the URLs are rejected.
> 
> Any help here is appreciated, Thanks!
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
>