Enable selenium Plugin
Hi,

As part of crawling JavaScript dynamic content, I was trying to enable the Selenium plugin for Apache Nutch 1.15, following https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium. I am getting the error below and couldn't move further. Any pointers?

[ivy:resolve] commons-lang#commons-lang;2.6 in default: excluding commons-lang#commons-lang;2.6!commons-lang.jar
[ivy:resolve] com.google.protobuf#protobuf-java;2.5.0 in default: excluding com.google.protobuf#protobuf-java;2.5.0!protobuf-java.jar(bundle)
[ivy:resolve] org.apache.httpcomponents#httpcore;4.4.7 in default: excluding org.apache.httpcomponents#httpcore;4.4.7!httpcore.jar
[ivy:resolve] ERROR: impossible to get artifacts when data has not been loaded. IvyNode = javax.measure#unit-api;1.0
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve] WARNINGS
[ivy:resolve] module not found: javax.measure#unit-api;working@LP-5CD7311YGR
[ivy:resolve] local: tried
[ivy:resolve] C:\Users\venkata.mr\.ivy2/local/javax.measure/unit-api/working@LP-5CD7311YGR/ivys/ivy.xml
[ivy:resolve] -- artifact javax.measure#unit-api;working@LP-5CD7311YGR!unit-api.jar:
[ivy:resolve] C:\Users\venkata.mr\.ivy2/local/javax.measure/unit-api/working@LP-5CD7311YGR/jars/unit-api.jar
[ivy:resolve] maven2: tried
[ivy:resolve] http://repo1.maven.org/maven2/javax/measure/unit-api/working@LP-5CD7311YGR/unit-api-work...@lp-5cd7311ygr.pom
[ivy:resolve] -- artifact javax.measure#unit-api;working@LP-5CD7311YGR!unit-api.jar:

Thanks & Regards
Venkata MR
+91 98455 77125

::DISCLAIMER::
The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only. E-mail transmission is not guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or may contain viruses in transmission. The e-mail and its contents (with or without referred errors) shall therefore not attach any liability on the originator or HCL or its affiliates. Views or opinions, if any, presented in this email are solely those of the author and may not necessarily reflect the views or opinions of HCL or its affiliates. Any form of reproduction, dissemination, copying, disclosure, modification, distribution and/or publication of this message without the prior written consent of an authorized representative of HCL is strictly prohibited. If you have received this email in error please delete it and notify the sender immediately. Before opening any email and/or attachments, please check them for viruses and other defects.
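[Editor's note] Independent of the Ivy error, the README linked above activates the plugin through plugin.includes in conf/nutch-site.xml. A minimal sketch (the plugin list here is illustrative; merge protocol-selenium into whatever plugin set you already use):

```xml
<!-- conf/nutch-site.xml: add protocol-selenium to the active plugins -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
  <description>Use Selenium to fetch pages so JavaScript content is rendered.</description>
</property>
```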
Re: [ask] Crawl Forum Site
Hi Tukang,

In short, yes. It would help if you could provide an example of what you've tried and what you encountered / what your results were.

Lewis

On Mon, Dec 3, 2018 at 6:42 PM wrote:
>
> From: tkg_cangkul
> To: user@nutch.apache.org
> Date: Tue, 04 Dec 2018 09:40:47 +0700
> Subject: [ask] Crawl Forum Site
>
> Hi,
>
> Is it possible to crawl a web forum with Apache Nutch?
> If possible, is there any configuration that I must add?
> I've tried it but got nothing.
>
> Please help, I need advice.
>
> Thanks
>
> Best Regards,
> Tukang Cangkul

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
RE: URL filter rejecting the URLs
Hi Sebastian,

Thanks for the response. I resolved the issue; the reason was the configuration below in regex-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Thanks & Regards
Venkata MR
+91 98455 77125

-----Original Message-----
From: Sebastian Nagel
Sent: 04 December 2018 01:16
To: user@nutch.apache.org
Subject: Re: URL filter rejecting the URLs

Hi,

the pattern should work. Of course, you need to make sure that
- there are no other patterns coming before it in regex-urlfilter.txt which cause the URL to be rejected
- there are no other active URL filters which reject the URL
- the folder of the regex-urlfilter.txt you're editing is first on the class path; usually $NUTCH_HOME/conf/regex-urlfilter.txt is used
- (optionally) you may simplify the regex: the characters / _ = have no special semantics and do not need to be escaped by \

The easiest way to test it (Nutch 1.15):

% cat $NUTCH_HOME/conf/regex-urlfilter.txt
+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])
-.
% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G" \
  | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
+https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G

And with another "forbidden" URL:

% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X" \
  | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
-https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X

Best,
Sebastian

On 12/1/18 2:45 PM, Venkata MR wrote:
> Hi Nutch Users,
>
> I was trying to crawl the site
> (https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G,
> https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L),
> with the filter pattern as
> "+^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])",
> it is rejecting the URLs.
>
> Tried multiple options but in all cases it is rejecting.
>
> Any help here is appreciated, thanks!
>
> Thanks & Regards
> Venkata MR
> +91 98455 77125
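[Editor's note] The fix above follows from rule order: urlfilter-regex applies the first matching rule, so an accept pattern placed above the generic skip rule lets these URLs through while still rejecting other query URLs. A sketch of the relevant part of regex-urlfilter.txt (the accept rule is the one from this thread):

```
# accept the target pages first: the first matching rule wins
+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
```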
[ask] Crawl Forum Site
Hi,

Is it possible to crawl a web forum with Apache Nutch? If possible, is there any configuration that I must add? I've tried it but got nothing.

Please help, I need advice.

Thanks

Best Regards,
Tukang Cangkul
Re: URL filter rejecting the URLs
Hi,

the pattern should work. Of course, you need to make sure that
- there are no other patterns coming before it in regex-urlfilter.txt which cause the URL to be rejected
- there are no other active URL filters which reject the URL
- the folder of the regex-urlfilter.txt you're editing is first on the class path; usually $NUTCH_HOME/conf/regex-urlfilter.txt is used
- (optionally) you may simplify the regex: the characters / _ = have no special semantics and do not need to be escaped by \

The easiest way to test it (Nutch 1.15):

% cat $NUTCH_HOME/conf/regex-urlfilter.txt
+^https?://nseindia\.com/live_market/dynaContent/live_analysis/top_gainers_losers\.htm\?cat=([GL])
-.

% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G" \
  | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
+https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G

And with another "forbidden" URL:

% echo "https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X" \
  | nutch filterchecker -filterName urlfilter-regex -stdin
Checking combination of these URLFilters: RegexURLFilter
-https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=X

Best,
Sebastian

On 12/1/18 2:45 PM, Venkata MR wrote:
> Hi Nutch Users,
>
> I was trying to crawl the site
> (https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=G,
> https://nseindia.com/live_market/dynaContent/live_analysis/top_gainers_losers.htm?cat=L),
> with the filter pattern as
> "+^https?://nseindia\.com\/live\_market\/dynaContent\/live\_analysis\/top\_gainers\_losers\.htm\?cat\=([GL])",
> it is rejecting the URLs.
>
> Tried multiple options but in all cases it is rejecting.
>
> Any help here is appreciated, thanks!
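[Editor's note] The interaction between the two rules can be checked outside Nutch. Below is a small sketch using Python's re module as a stand-in for Java's regex engine (which urlfilter-regex uses; the two behave identically for these patterns): the simplified accept rule matches cat=G and cat=L but not cat=X, while the default `-[?*!@=]` skip rule matches every URL containing `?`, which is why rule order decides the outcome.

```python
import re

# Simplified accept pattern from this thread (no need to escape / _ =)
ACCEPT = re.compile(
    r"^https?://nseindia\.com/live_market/dynaContent/"
    r"live_analysis/top_gainers_losers\.htm\?cat=([GL])"
)
# Generic "probable query" skip rule from the default regex-urlfilter.txt
SKIP = re.compile(r"[?*!@=]")

BASE = ("https://nseindia.com/live_market/dynaContent/"
        "live_analysis/top_gainers_losers.htm?cat=")

for cat in ("G", "L", "X"):
    url = BASE + cat
    print(cat,
          "accept" if ACCEPT.search(url) else "no-accept",
          "skip" if SKIP.search(url) else "no-skip")
# cat=G and cat=L match the accept rule; all three match the skip rule,
# so whichever rule appears first in regex-urlfilter.txt wins.
```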