Re: Connect Solr and Nutch in Ubuntu 18

2018-10-05 Thread govind nitk
The information given is not sufficient to pinpoint the problem.

1. You need to add indexer-solr to the plugin list (plugin.includes).
2. Check the Solr index properties in nutch-default.xml (there are quite a
few of them).

See https://wiki.apache.org/nutch/NutchTutorial for a detailed
explanation.
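
A small Python sketch for checking point 1, i.e. whether indexer-solr is
actually enabled. It assumes that the plugin list lives in the plugin.includes
property and that the value is treated as a regular expression matched against
whole plugin ids. SAMPLE_CONFIG and the helper names plugin_includes and
has_plugin are made up for illustration; they are not part of Nutch.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative configuration only; a real nutch-site.xml will differ.
SAMPLE_CONFIG = """
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr</value>
  </property>
</configuration>
"""

def plugin_includes(config_xml):
    """Return the plugin.includes value from a Hadoop-style config, or ''."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "plugin.includes":
            return (prop.findtext("value") or "").strip()
    return ""

def has_plugin(includes, plugin_id):
    """Assumption: the whole plugin id must match the plugin.includes regex."""
    return bool(includes) and re.fullmatch(includes, plugin_id) is not None

if __name__ == "__main__":
    value = plugin_includes(SAMPLE_CONFIG)
    print("indexer-solr enabled:", has_plugin(value, "indexer-solr"))
```

To check a real installation, read conf/nutch-site.xml from disk and pass its
contents to plugin_includes instead of SAMPLE_CONFIG.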



On Fri, Oct 5, 2018 at 3:41 AM Timeka Cobb  wrote:

> Hello there! Does anyone know how to connect the 2 for the core..Ive looked
> high and low also checking the Wiki which doesn't help me at all..can
> anyone give some help in regards to this pretty please..I have all the
> components but don't know how to connect it all
>
> I'm using Ubuntu 18  Bionic Beaver
>


Re: Regex to block some patterns

2018-10-05 Thread govind nitk
Also, check the last rule in regex-urlfilter.txt:

# accept anything else
+.

If you have changed it to a negative rule (-.) by mistake, every URL will be
discarded.
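
To see why the last rule matters, here is a rough Python model of how a list of
+/- regex rules is evaluated. It assumes that the first matching rule decides
and that a URL matching no rule is rejected; parse_rules and accepts are
hypothetical helper names, and Python's re.search stands in for Java's regex
matching, which behaves the same for these simple patterns.

```python
import re

def parse_rules(text):
    """Parse lines like '+regex' / '-regex'; skip blank lines and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: first matching rule wins; no match means rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

GOOD_RULES = r"""
-^.+(?:modal|exit).*\.html
# accept anything else
+.
"""

# Same file, but with the final rule turned negative by mistake.
BAD_RULES = r"""
-^.+(?:modal|exit).*\.html
-.
"""

if __name__ == "__main__":
    seed = "https://www.abc.com/"
    print("good rules accept seed:", accepts(parse_rules(GOOD_RULES), seed))
    print("bad rules accept seed:", accepts(parse_rules(BAD_RULES), seed))
```

With the rules from this thread the seed URL survives the blocking pattern, so
a rejected seed points at some other rule or filter, as Sebastian says below.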

Best,
Govind

On Fri, Oct 5, 2018 at 1:02 PM Sebastian Nagel
 wrote:

> Hi Amarnath,
>
> the only possibility is that https://www.abc.com/ is skipped
> - by another rule in regex-urlfilter.txt
> - or another URL filter plugin
>
> Please check your configuration carefully. You may also use the tool
>   bin/nutch filterchecker
> to test the filters beforehand: every active filter individually
> and all in combination.
>
> Best,
> Sebastian
>
> On 10/04/2018 06:52 AM, Amarnatha Reddy wrote:
> > Hi Markus,
> >
> > Thanks a lot for the quick update, but i applied the same rule and it's
> > completely rejected and no more urls to inject.
> >
> > I have applied the same regex: -^.+(?:modal|exit).*\.html
> > seed.txt: https://www.abc.com/
> > Seems regex is fine, but it's not working with Nutch1.15 regex
> block...any
> > thoughts please?
> >
> > Here is sample output:
> > [Nutch]$ bin/crawl -i -D abccollection -s urls/ crawl/ -1
> > Injecting seed URLs
> > /test/Nutch/TEST/test2_Nutch/bin/nutch inject crawl//crawldb urls/
> > Injector: starting at 2018-10-04 04:43:14
> > Injector: crawlDb: crawl/crawldb
> > Injector: urlDir: urls
> > Injector: Converting injected urls to crawl db entries.
> > Injecting seed URL file file:/test/Nutch/TEST/test2_Nutch/urls/seed.txt
> > Injector: overwrite: false
> > Injector: update: false
> > Injector: Total urls rejected by filters: 1
> > Injector: Total urls injected after normalization and filtering: 0
> > Injector: Total urls injected but already in CrawlDb: 0
> > Injector: Total new urls injected: 0
> > Injector: Total urls with status gone removed from CrawlDb
> > (db.update.purge.404): 0
> > Injector: finished at 2018-10-04 04:43:16, elapsed: 00:00:02
> > Thu Oct 4 04:43:16 UTC 2018 : Iteration 1
> > Generating a new segment
> > /test/Nutch/TEST/test2_Nutch/bin/nutch generate -D
> mapreduce.job.reduces=2
> > -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false
> > -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true
> > crawl//crawldb crawl//segments -topN 5 -numFetchers 1 -noFilter
> > Generator: starting at 2018-10-04 04:43:17
> > Generator: Selecting best-scoring urls due for fetch.
> > Generator: filtering: false
> > Generator: normalizing: true
> > Generator: topN: 5
> > Generator: 0 records selected for fetching, exiting ...
> > Generate returned 1 (no new segments created)
> > Escaping loop: no more URLs to fetch now
> >
> > Thanks,
> > Amarnath Polu
> >
> > On Thu, Oct 4, 2018 at 12:53 AM Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> >> Hi Amarnatha,
> >>
> >> -^.+(?:modal|exit).*\.html
> >>
> >> Will work for all exampes given.
> >>
> >> You can test regexes really well online [1]. If each input has true for
> >> lookingAt, Nutch' regexfilter will filter the URL's.
> >>
> >> Regards,
> >> Markus
> >>
> >> [1] https://www.regexplanet.com/advanced/java/index.html
> >>
> >>
> >> -Original message-
> >>> From:Amarnatha Reddy 
> >>> Sent: Wednesday 3rd October 2018 15:23
> >>> To: user@nutch.apache.org
> >>> Subject: Regex to block some patterns
> >>>
> >>> Hi Team,
> >>>
> >>>
> >>>
> >>> I need some assistance to block patterns in my current setup.
> >>>
> >>>
> >>>
> >>> Always my seed url is *https://www.abc.com/ *
> and
> >>> need to crawl all pages except below patterns in Nutch1.15
> >>>
> >>>
> >>> Blocking pattern *modal(.*).html *and *exit.html? *and *exit.html/?*
> >>>
> >>> Sample pages *modal.html, modal_1123Abc.html, modalaa_12.html* (these
> >> could
> >>> be end of the domain)
> >>>
> >>>
> >>>
> >>> Below are the few use case urls'
> >>>
> >>>
> >>>
> >>
> https://www.abc.com/abc-editions/2018/test-ask/altitude/feature-pillar/abc/acb-1/modal.html
> >>>
> >>>
> >>
> https://www.abc.com/2017/ask/exterior/feature_overlay/modalcontainer5.html
> >>>
> >>>
> >>
> https://www.abc.com/2017/image/exterior/abc/feature_overlay/modalcontainer5_Ab_c.html
> >>>
> >>>
> >>>
> >>> exit.html (here anything like this exit.html? exit.html/?)
> >>>
> >>>
> >>> Ask here is after domain (https://www.abc.com/), starts with
> >>> exit.html/exit.html?/exit.html/?  then need to block/exclude crawl.
> >>>
> >>>
> https://www.abc.com/exit.html?url=https://www.gear.abc.com/welcome.asp
> >>>
> >>>
> >>
> https://www.abc.com/exit.html/?tname=abc_facebook=http://www.facebook.com/abc=true
> >>>
> >>>
> >>> *Note: Yes we can directly put - ^(complete url) ,but dont know how
> many
> >>> are there, so need generic regex rule to apply.*
> >>>
> >>>
> >>> i tried below pattern,but it is not working
> >>>
> >>> ## Blocking pattern ends with 
> >>>
> >>> -^(?i)\*(modal*|exit*).html
> >>>
> >>>
> >>>
> >>> Kindly help me to setup regex to block my use case.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Amarnath
> >>>
> >>>
> >>>
> >>>
> >>> 

Re: using any23 with nutch

2018-07-28 Thread govind nitk
I tried 2.3-SNAPSHOT instead of 2.3 as the revision.

The error persists.


On Sat, Jul 28, 2018 at 5:57 PM govind nitk  wrote:

>
> hi all,
>
> I want to use any23 2.3-snapshot version with nutch. This is what I have
> done:
> 1. have "mvn install" in any23 repo.
> so jars are released in local ~/.m2 dir.
>
> ex. 
> /home/govind/.m2/repository/org/apache/any23/apache-any23-core/2.3-SNAPSHOT/apache-any23-core-2.3-SNAPSHOT.jar
>
> 2. nutch repo, plugins/any23/ivy.xml
>
>  conf="*->default">
>
> 3. In nutch repo, have changed ivy setting as below:
> 
>  
> value="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]"
>   override="false" />
>
> 
>   
>  
>  
>   
> 
>
>  >
>
>
> So, expectation is any23 will start using my_local releases.
>
> But its failing with below error:
>
> resolve-default:
> [ivy:resolve] :: loading settings :: file =
> /home/govind/apache/nutch/ivy/ivysettings.xml
> [ivy:resolve]
> [ivy:resolve] :: problems summary ::
> [ivy:resolve]  WARNINGS
> [ivy:resolve] module not found: org.apache.any23#apache-any23;2.3
> [ivy:resolve]  local-maven2: tried
> [ivy:resolve]
> /home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.xml
> [ivy:resolve]   -- artifact
> org.apache.any23#apache-any23;2.3!apache-any23.jar:
> [ivy:resolve]
> /home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.jar
> [ivy:resolve] ::
> [ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
> [ivy:resolve] ::
> [ivy:resolve] :: org.apache.any23#apache-any23;2.3: not found
> [ivy:resolve] ::
> [ivy:resolve]
> [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
> Target 'resolve-default' failed with message 'impossible to resolve
> dependencies:
>
>
> Am I missing something in local resolver defined for any23 ?
> Is it the case, that we can not use the  locally released jars in nutch ?
> Is there any other hack I can use to this resolved ?
>
> Regards,
> Govind
>


using any23 with nutch

2018-07-28 Thread govind nitk
Hi all,

I want to use the any23 2.3-SNAPSHOT version with Nutch. This is what I have
done:
1. Ran "mvn install" in the any23 repo, so the jars are installed in the
local ~/.m2 directory, e.g.
/home/govind/.m2/repository/org/apache/any23/apache-any23-core/2.3-SNAPSHOT/apache-any23-core-2.3-SNAPSHOT.jar

2. nutch repo, plugins/any23/ivy.xml



3. In nutch repo, have changed ivy setting as below:



  
 
 
  





So my expectation is that any23 will start using my locally installed jars.

But it is failing with the error below:

resolve-default:
[ivy:resolve] :: loading settings :: file =
/home/govind/apache/nutch/ivy/ivysettings.xml
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve]  WARNINGS
[ivy:resolve] module not found: org.apache.any23#apache-any23;2.3
[ivy:resolve]  local-maven2: tried
[ivy:resolve]
/home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.xml
[ivy:resolve]   -- artifact
org.apache.any23#apache-any23;2.3!apache-any23.jar:
[ivy:resolve]
/home/govind/.m2/repository/org/apache/any23/apache-any23/2.3/apache-any23-2.3.jar
[ivy:resolve] ::
[ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::
[ivy:resolve] :: org.apache.any23#apache-any23;2.3: not found
[ivy:resolve] ::
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Target 'resolve-default' failed with message 'impossible to resolve
dependencies:


Am I missing something in the local resolver defined for any23?
Is it the case that we cannot use locally installed jars in Nutch?
Is there any other hack I can use to get this resolved?
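
For comparing what Ivy looked for with what "mvn install" produced, the
artifact pattern can be expanded by hand. The Python sketch below does only
that; it is not Ivy's implementation. It assumes Maven-style directories (dots
in the organisation become slashes, as in the paths in the log), and expand is
a hypothetical helper. The pattern is the part after ~/.m2/repository/ from
the local resolver setting.

```python
# Pattern relative to ~/.m2/repository/, taken from the resolver setting.
PATTERN = "[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]"

def expand(pattern, organisation, module, revision, ext, classifier=None):
    """Fill in an Ivy-style pattern; '(-[classifier])' is an optional part."""
    if classifier:
        pattern = pattern.replace("(-[classifier])", "-" + classifier)
    else:
        pattern = pattern.replace("(-[classifier])", "")
    tokens = {
        # Assumption: Maven layout, so 'org.apache.any23' -> 'org/apache/any23'.
        "organisation": organisation.replace(".", "/"),
        "module": module,
        "revision": revision,
        "ext": ext,
    }
    for name, value in tokens.items():
        pattern = pattern.replace("[" + name + "]", value)
    return pattern

if __name__ == "__main__":
    # Module and revision as reported in the "module not found" warning.
    print(expand(PATTERN, "org.apache.any23", "apache-any23", "2.3", "jar"))
    # Module and revision of the jar that "mvn install" put into ~/.m2.
    print(expand(PATTERN, "org.apache.any23", "apache-any23-core", "2.3-SNAPSHOT", "jar"))
```

The two printed paths differ in both module name and revision, which may be
worth checking against the dependency declared in the plugin's ivy.xml.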

Regards,
Govind


Re: [MASSMAIL][VOTE] Release Apache Nutch 1.15 RC#1

2018-07-28 Thread govind nitk
+1 for build
plugins test - success
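
For anyone else checking the release candidate, a minimal Python sketch for
comparing a downloaded archive against the SHA1 checksum given in the vote
mail quoted below. The helper names sha1_of and matches_expected are
illustrative only, and the file path in the usage comment is an assumption
about where the archive was saved.

```python
import hashlib

# SHA1 of apache-nutch-1.15-bin.tar.gz as stated in the vote mail.
EXPECTED_SHA1 = "555d00ddc0371b05c5958bde7abb2a9db8c38ee2"

def sha1_of(path, chunk_size=1 << 20):
    """Compute the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_SHA1):
    """True if the file's SHA1 equals the expected hex digest."""
    return sha1_of(path) == expected.strip().lower()

# Usage (path is an assumption):
#   matches_expected("apache-nutch-1.15-bin.tar.gz")
```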






On Thu, Jul 26, 2018 at 10:25 PM Roannel Fernández Hernández 
wrote:

> +1 Great work, folks
>
> - Original message -
> > From: "Sebastian Nagel" 
> > To: user@nutch.apache.org
> > CC: d...@nutch.apache.org
> > Sent: Thursday, 26 July 2018 11:05:06
> > Subject: [MASSMAIL][VOTE] Release Apache Nutch 1.15 RC#1
> >
> > Hi Folks,
> >
> > A first candidate for the Nutch 1.15 release is available at:
> >
> >   https://dist.apache.org/repos/dist/dev/nutch/1.15/
> >
> > The release candidate is a zip and tar.gz archive of the binary and
> sources
> > in:
> >   https://github.com/apache/nutch/tree/release-1.15
> >
> > The SHA1 checksum of the archive apache-nutch-1.15-bin.tar.gz is
> >555d00ddc0371b05c5958bde7abb2a9db8c38ee2
> >
> > In addition, a staged maven repository is available here:
> >
> https://repository.apache.org/content/repositories/orgapachenutch-1015/
> >
> > We addressed 119 Issues:
> >https://s.apache.org/nczS
> >
> > Please vote on releasing this package as Apache Nutch 1.15.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Nutch PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Nutch 1.15.
> > [ ] -1 Do not release this package because…
> >
> > Cheers,
> > Sebastian
> > (On behalf of the Nutch PMC)
> >
> > P.S. Here is my +1.
> >


any23 2.2 upgrading in NUTCH gives errors

2018-04-02 Thread govind nitk
Hi,

Tried to upgrade any23 2.1 to 2.2 in nutch code base.

Changes:
1. src/plugin/any23/ivy.xml:


2. src/plugin/any23/plugin.xml








After "ant runtime", the following jar files are present in
runtime/local/plugins/any23:

any23.jar
apache-any23-api-2.2.jar
apache-any23-core-2.2.jar
apache-any23-csvutils-2.2.jar
apache-any23-encoding-2.2.jar
apache-any23-mime-2.2.jar




Ran a simple parse check on a test HTML file and got these errors:
1.  java.util.concurrent.ExecutionException:
java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
 
Caused by: java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry

2. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
...
Caused by: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
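
A quick way to narrow down a NoClassDefFoundError like these is to look for
the class inside the jars that ended up in the plugin directory. The Python
sketch below does that; jars_containing is a hypothetical helper, and it only
checks whether a jar ships the class, not whether plugin.xml lists that jar or
whether the class failed during initialization.

```python
import zipfile
from pathlib import Path

def jars_containing(plugin_dir, class_name):
    """Return names of jars in plugin_dir that contain class_name.

    class_name may use dots or slashes, e.g.
    'org/eclipse/rdf4j/common/lang/service/ServiceRegistry'.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(plugin_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                hits.append(jar.name)
    return hits

if __name__ == "__main__":
    plugin_dir = "runtime/local/plugins/any23"
    for name in (
        "org/eclipse/rdf4j/common/lang/service/ServiceRegistry",
        "org/apache/any23/extractor/ExtractorRegistryImpl",
    ):
        if Path(plugin_dir).is_dir():
            print(name, "->", jars_containing(plugin_dir, name) or "not found")
```

An empty result for a class would suggest that the jar providing it never made
it into the plugin directory.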




The entire log file is attached as debug.txt.


Regards,
Govind
2018-04-02 17:09:49,999 INFO  parse.ParserChecker (ParserChecker.java:run(122)) 
- fetching: file:/tmp/exact_code.html
2018-04-02 17:09:50,205 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No 
object cache found for conf=Configuration: core-default.xml, core-site.xml, 
nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,328 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No 
object cache found for conf=Configuration: core-default.xml, core-site.xml, 
nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,366 TRACE file.File (FileResponse.java:(117)) - 
fetching file:/tmp/exact_code.html
2018-04-02 17:09:50,450 INFO  parse.ParseSegment 
(ParseSegment.java:isTruncated(207)) - file:/tmp/exact_code.html skipped. 
Content of size 79433 was truncated to 65536
2018-04-02 17:09:50,450 WARN  parse.ParserChecker (ParserChecker.java:run(187)) 
- Content is truncated, parse may fail!
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: extractor, 
extension-id: ir.co.bayan.simorq.zal.extractor.nutch.ExtractorParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-tika, 
extension-id: org.apache.nutch.parse.tika.TikaParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-ext, 
extension-id: ExtParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-html, 
extension-id: org.apache.nutch.parse.html.HtmlParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-js, 
extension-id: JSParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: feed, 
extension-id: org.apache.nutch.parse.feed.FeedParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-swf, 
extension-id: org.apache.nutch.parse.swf.SWFParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader 
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-zip, 
extension-id: org.apache.nutch.parse.zip.ZipParser
2018-04-02 17:09:50,461 INFO  parse.ParserFactory 
(ParserFactory.java:matchExtensions(374)) - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser - 
org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes 
system property, and all claim to support the content type text/html, but they 
are not mapped to it  in the parse-plugins.xml file
2018-04-02 17:09:50,871 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) - 
Parsing [file:/tmp/exact_code.html] with 
[org.apache.nutch.parse.tika.TikaParser@693fe6c9]
2018-04-02 17:09:50,878 DEBUG tika.TikaParser (TikaParser.java:getParse(101)) - 
Using Tika parser org.apache.tika.parser.html.HtmlParser for mime-type text/html
2018-04-02 17:09:51,205 TRACE tika.TikaParser (TikaParser.java:getParse(152)) - 
Meta tags for file:/tmp/exact_code.html: base=null, noCache=false, 
noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
   - viewport   =   width=device-width, initial-scale=1
   - dc:title   =   I.F. on Kharms – Just a Beginning
   - content-encoding   =   UTF-8
   - generator  =   WordPress 4.9.4
   - content-type   =   text/html; charset=UTF-8
   - robots =   index,follow
 * http-equiv tags:

2018-04-02 17:09:51,206 TRACE tika.TikaParser (TikaParser.java:getParse(159)) - 
Getting text...
2018-04-02 17:09:51,222 TRACE tika.TikaParser (TikaParser.java:getParse(165)) - 
Getting title...
2018-04-02 17:09:51,224 TRACE tika.TikaParser (TikaParser.java:getParse(183)) - 
Getting links (base URL = file:/tmp/exact_code.html) ...
2018-04-02 17:09:51,227 TRACE 

Re: Getting Error

2018-01-17 Thread govind nitk
Hi Sebastian and Lewis,

I did a build on another machine and diffed the runtime logs, which made the
issue pretty clear: yes, the build was not proper. It is resolved now.
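
One visible symptom of a mixed-up build in the log quoted below is two
versions of the same jar in runtime/local/lib (slf4j-log4j12 1.7.10 and
1.7.5). A small Python sketch for spotting such duplicates follows. The
name-version.jar file name convention is an assumption and
duplicate_artifacts is a hypothetical helper, so treat the output as a hint
only.

```python
import re
from collections import defaultdict

# Assumes jars are named <artifact>-<version>.jar, version starting with a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")

def duplicate_artifacts(file_names):
    """Map artifact name -> sorted versions, for artifacts present more than once."""
    versions = defaultdict(set)
    for name in file_names:
        match = JAR_NAME.match(name)
        if match:
            versions[match.group("artifact")].add(match.group("version"))
    return {a: sorted(v) for a, v in versions.items() if len(v) > 1}

if __name__ == "__main__":
    import os
    lib_dir = "runtime/local/lib"
    if os.path.isdir(lib_dir):
        for artifact, found in duplicate_artifacts(os.listdir(lib_dir)).items():
            print(artifact, found)
```

A clean rebuild into an empty runtime directory is the usual way to get rid of
such leftovers.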

Happy crawling.

Regards,
GoViNd


On Mon, Jan 15, 2018 at 2:04 AM, Sebastian Nagel <wastl.na...@googlemail.com
> wrote:

> Hi Govind,
>
> thanks. At least, although it's caught later it seems a little bit clearer
> what is happening:
>
> > Exception in thread "main" java.lang.NoSuchMethodError:
> >
> org.apache.nutch.util.NutchJob.getInstance(Lorg/apache/hadoop/conf/
> Configuration;Ljava/lang/String;)Lorg/apache/nutch/util/NutchJob;
>
> However, there is a method in NutchJob.java:
>
>public static NutchJob getInstance(Configuration conf, String jobName)
>throws IOException {
>
> As said, I'm able to run Injector of the current 2.x branch with the
> changes described (use
> MongoDB), so this is really weird. Looks more like a build or class path
> issue...
>
> Best,
> Sebastian
>
>
> On 01/13/2018 08:19 AM, govind nitk wrote:
> >
> > Hi Sebastian,
> >
> > Thanks for clarification.
> >
> > $cat /tmp/urls/seeds.txt
> > http://nutch.apache.org/
> >
> > $export 'NUTCH_OPTS=-Xverify:none'
> > $./bin/nutch inject /tmp/urls/
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in
> > [jar:file:/home/govind/apache/nutch/runtime/local/lib/slf4j-
> log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in
> > [jar:file:/home/govind/apache/nutch/runtime/local/lib/slf4j-
> log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> > InjectorJob: starting at 2018-01-13 12:09:33
> > InjectorJob: Injecting urlDir: /tmp/urls
> > Exception in thread "main" java.lang.NoSuchMethodError:
> > org.apache.nutch.util.NutchJob.getInstance(Lorg/apache/hadoop/conf/
> Configuration;Ljava/lang/String;)Lorg/apache/nutch/util/NutchJob;
> > at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:229)
> > at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:270)
> > at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:293)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> > at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:302)
> >
> >
> > Attached is the java setting & properties for crosscheck.
> >
> > Regards,
> > Govind
> >
> > On Fri, Jan 12, 2018 at 5:12 PM, Sebastian Nagel <
> wastl.na...@googlemail.com
> > <mailto:wastl.na...@googlemail.com>> wrote:
> >
> > No. Please use Java 8. Nutch requires Java 8, see default.properties.
> >
> > The Dockerfile is outdated. If possible please open a Jira issue to
> update it.
> >
> > The error is really weird:
> > NutchJob extends org.apache.hadoop.mapreduce.Job, so there should
> be no
> > verification [1] error. I'm not able to reproduce it.
> >
> > Could you explain more in which environment Nutch is executed and how
> > you launch it?  Ev. try with "java -Xverify:none" (for bin/nutch set
> > the environment variable NUTCH_OPTS=-Xverify:none) to see what
> happens.
> >
> > Thanks,
> > Sebastian
> >
> >
> > [1] https://static.rainfocus.com/oracle/oow16/sess/
> 1461563392709001ttyE/ppt/bcv_J1SF_2016.pdf
> > <https://static.rainfocus.com/oracle/oow16/sess/
> 1461563392709001ttyE/ppt/bcv_J1SF_2016.pdf>
> >
> >
> > On 01/12/2018 09:09 AM, govind nitk wrote:
> > > Hi Lewis,
> > >
> > > Tried with oracle java8, but issue persists and the error is same.
> > >
> > > Nutch might be compatiable with java8,
> > > but in docker file for hbase(nutch/docker/hbase/Dockerfile),
> java7 is used.
> > > So do we need to use java7 only ?
> > >
> > >
> > > Regards,
> > > GoViNd
> > >
> > >
> > >
> > > On Thu, Jan 11, 2018 at 8:22 PM, lewis john mcgibbney <
> lewi...@apache.org
> > <mailto:lewi...@apache.org>>
> > > wrote:
> > >
> > >> I unfortunately do not use the OpenJDK so i don't know if this is
> where
> > >> your issue stems from.
> > >> All of your config looks absolutely fine.
> > >> Lewis

Re: Getting Error

2018-01-12 Thread govind nitk
Hi Sebastian,

Thanks for clarification.

$cat /tmp/urls/seeds.txt
http://nutch.apache.org/

$export 'NUTCH_OPTS=-Xverify:none'
$./bin/nutch inject /tmp/urls/

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/home/govind/apache/nutch/runtime/local/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/home/govind/apache/nutch/runtime/local/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
InjectorJob: starting at 2018-01-13 12:09:33
InjectorJob: Injecting urlDir: /tmp/urls
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.nutch.util.NutchJob.getInstance(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/String;)Lorg/apache/nutch/util/NutchJob;
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:229)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:270)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:293)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:302)


Attached are the Java settings and properties for cross-checking.

Regards,
Govind

On Fri, Jan 12, 2018 at 5:12 PM, Sebastian Nagel <wastl.na...@googlemail.com
> wrote:

> No. Please use Java 8. Nutch requires Java 8, see default.properties.
>
> The Dockerfile is outdated. If possible please open a Jira issue to update
> it.
>
> The error is really weird:
> NutchJob extends org.apache.hadoop.mapreduce.Job, so there should be no
> verification [1] error. I'm not able to reproduce it.
>
> Could you explain more in which environment Nutch is executed and how
> you launch it?  Ev. try with "java -Xverify:none" (for bin/nutch set
> the environment variable NUTCH_OPTS=-Xverify:none) to see what happens.
>
> Thanks,
> Sebastian
>
>
> [1] https://static.rainfocus.com/oracle/oow16/sess/
> 1461563392709001ttyE/ppt/bcv_J1SF_2016.pdf
>
>
> On 01/12/2018 09:09 AM, govind nitk wrote:
> > Hi Lewis,
> >
> > Tried with oracle java8, but issue persists and the error is same.
> >
> > Nutch might be compatiable with java8,
> > but in docker file for hbase(nutch/docker/hbase/Dockerfile), java7 is
> used.
> > So do we need to use java7 only ?
> >
> >
> > Regards,
> > GoViNd
> >
> >
> >
> > On Thu, Jan 11, 2018 at 8:22 PM, lewis john mcgibbney <
> lewi...@apache.org>
> > wrote:
> >
> >> I unfortunately do not use the OpenJDK so i don't know if this is where
> >> your issue stems from.
> >> All of your config looks absolutely fine.
> >> Lewis
> >>
> >> On Thu, Jan 11, 2018 at 8:26 AM, <user-digest-h...@nutch.apache.org>
> >> wrote:
> >>
> >>>
> >>> From: govind nitk <govind.n...@gmail.com>
> >>> To: user@nutch.apache.org
> >>> Cc:
> >>> Bcc:
> >>> Date: Wed, 10 Jan 2018 14:06:53 +0530
> >>> Subject: Re: Getting Error
> >>> $java -version
> >>> openjdk version "1.8.0_141"
> >>> OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-3~14.04-b15)
> >>> OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)
> >>>
> >>>
> >>> config edits:
> >>>
> >>> Gora properties:
> >>> gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
> >>> gora.mongodb.override_hadoop_configuration=false
> >>> gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
> >>> gora.mongodb.servers=localhost:27017
> >>> gora.mongodb.db=crawler
> >>> #gora.mongodb.login=login
> >>> #gora.mongodb.secret=secret
> >>>
> >>>
> >>> nutch-site.xml:
> >>>   
> >>> storage.data.store.class
> >>> org.apache.gora.mongodb.store.MongoStore
> >>> Default class for storing data
> >>>   
> >>>
> >>>
> >>> mongod running on default port: 27017.
> >>>
> >>>
> >>> And before generating snapshot , uncommented the goa backend to use
> >> mongo.
> >>> as:
> >>>  >>> conf="*->default" />
> >>>
> >>>
> >>> Am I missing anything else?
> >>>
> >>>
> >>> regards,
> >>> govind
> >>>
> >>>
> >>>
> >>> On Wed, Jan 10, 2018 

Re: Getting Error

2018-01-12 Thread govind nitk
Hi Lewis,

I tried with Oracle Java 8, but the issue persists and the error is the same.

Nutch might be compatible with Java 8, but the Dockerfile for HBase
(nutch/docker/hbase/Dockerfile) uses Java 7.
So do we need to use Java 7 only?


Regards,
GoViNd



On Thu, Jan 11, 2018 at 8:22 PM, lewis john mcgibbney <lewi...@apache.org>
wrote:

> I unfortunately do not use the OpenJDK so i don't know if this is where
> your issue stems from.
> All of your config looks absolutely fine.
> Lewis
>
> On Thu, Jan 11, 2018 at 8:26 AM, <user-digest-h...@nutch.apache.org>
> wrote:
>
> >
> > From: govind nitk <govind.n...@gmail.com>
> > To: user@nutch.apache.org
> > Cc:
> > Bcc:
> > Date: Wed, 10 Jan 2018 14:06:53 +0530
> > Subject: Re: Getting Error
> > $java -version
> > openjdk version "1.8.0_141"
> > OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-3~14.04-b15)
> > OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)
> >
> >
> > config edits:
> >
> > Gora properties:
> > gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
> > gora.mongodb.override_hadoop_configuration=false
> > gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
> > gora.mongodb.servers=localhost:27017
> > gora.mongodb.db=crawler
> > #gora.mongodb.login=login
> > #gora.mongodb.secret=secret
> >
> >
> > nutch-site.xml:
> >   
> > storage.data.store.class
> > org.apache.gora.mongodb.store.MongoStore
> > Default class for storing data
> >   
> >
> >
> > mongod running on default port: 27017.
> >
> >
> > And before generating snapshot , uncommented the goa backend to use
> mongo.
> > as:
> >  > conf="*->default" />
> >
> >
> > Am I missing anything else?
> >
> >
> > regards,
> > govind
> >
> >
> >
> > On Wed, Jan 10, 2018 at 12:31 PM, govind nitk <govind.n...@gmail.com>
> > wrote:
> >
> > >
> > > hi Lewis,
> > >
> > > uname -a: Linux data 4.4.0-108-generic #131~14.04.1-Ubuntu SMP Sun Jan
> 7
> > > 15:54:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > On Tue, Jan 9, 2018 at 7:56 PM, lewis john mcgibbney <
> lewi...@apache.org
> > >
> > > wrote:
> > >
> > >> Hi govind,
> > >> Very strange. Which operating system are you using?
> > >> Lewis
> > >>
> > >> On Tue, Jan 9, 2018 at 5:15 AM, <user-digest-h...@nutch.apache.org>
> > >> wrote:
> > >>
> > >> > From: govind nitk <govind.n...@gmail.com>
> > >> > To: user@nutch.apache.org
> > >> > Cc:
> > >> > Bcc:
> > >> > Date: Tue, 9 Jan 2018 15:45:08 +0530
> > >> > Subject: Getting Error
> > >> > Hi,
> > >> >
> > >> > 1. running nutch compiled from branch 2.x. Build succeed.
> > >> > 2. using mongo as db storage. changed the storage.data.store.class
> to
> > >> point
> > >> > to mongo class.
> > >> >
> > >> >
> > >> > Getting this error while running nutch inject /tmp/urls/seeds.txt ?
> > >> >
> > >> >
> > >> > Error: A JNI error has occurred, please check your installation and
> > try
> > >> > again
> > >> > Exception in thread "main" java.lang.VerifyError: Bad type on
> operand
> > >> stack
> > >> > Exception Details:
> > >> >   Location:
> > >> > org/apache/nutch/crawl/InjectorJob.run(Ljava/util/Map;)
> > >> Ljava/util/Map;
> > >> > @85: putfield
> > >> >   Reason:
> > >> > Type 'org/apache/nutch/util/NutchJob' (current frame, stack[1])
> > is
> > >> not
> > >> > assignable to 'org/apache/hadoop/mapreduce/Job'
> > >> >   Current Frame:
> > >> > bci: @85
> > >> > flags: { }
> > >> > locals: { 'org/apache/nutch/crawl/InjectorJob',
> 'java/util/Map',
> > >> > 'org/apache/hadoop/fs/Path', 'java/lang/Object' }
> > >> > stack: { 'org/apache/nutch/crawl/InjectorJob',
> > >> > 'org/apache/nutch/util/NutchJob' }
> > >> >   Bytecode:
> > >> > 0x000: 2ab6 0004 1205 b800 06b6 0007 2b12 09b9
> > >> > 0x010: 000a 0200 4e2d c100 0b99 000b 2dc0 000b
> >

Re: Getting Error

2018-01-10 Thread govind nitk
$java -version
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-3~14.04-b15)
OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)


config edits:

Gora properties:
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=crawler
#gora.mongodb.login=login
#gora.mongodb.secret=secret


nutch-site.xml:
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>


mongod running on default port: 27017.


And before generating the snapshot, I uncommented the Gora backend to use
MongoDB, as:



Am I missing anything else?


regards,
govind



On Wed, Jan 10, 2018 at 12:31 PM, govind nitk <govind.n...@gmail.com> wrote:

>
> hi Lewis,
>
> uname -a: Linux data 4.4.0-108-generic #131~14.04.1-Ubuntu SMP Sun Jan 7
> 15:54:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> On Tue, Jan 9, 2018 at 7:56 PM, lewis john mcgibbney <lewi...@apache.org>
> wrote:
>
>> Hi govind,
>> Very strange. Which operating system are you using?
>> Lewis
>>
>> On Tue, Jan 9, 2018 at 5:15 AM, <user-digest-h...@nutch.apache.org>
>> wrote:
>>
>> > From: govind nitk <govind.n...@gmail.com>
>> > To: user@nutch.apache.org
>> > Cc:
>> > Bcc:
>> > Date: Tue, 9 Jan 2018 15:45:08 +0530
>> > Subject: Getting Error
>> > Hi,
>> >
>> > 1. running nutch compiled from branch 2.x. Build succeed.
>> > 2. using mongo as db storage. changed the storage.data.store.class to
>> point
>> > to mongo class.
>> >
>> >
>> > Getting this error while running nutch inject /tmp/urls/seeds.txt ?
>> >
>> >
>> > Error: A JNI error has occurred, please check your installation and try
>> > again
>> > Exception in thread "main" java.lang.VerifyError: Bad type on operand
>> stack
>> > Exception Details:
>> >   Location:
>> > org/apache/nutch/crawl/InjectorJob.run(Ljava/util/Map;)
>> Ljava/util/Map;
>> > @85: putfield
>> >   Reason:
>> > Type 'org/apache/nutch/util/NutchJob' (current frame, stack[1]) is
>> not
>> > assignable to 'org/apache/hadoop/mapreduce/Job'
>> >   Current Frame:
>> > bci: @85
>> > flags: { }
>> > locals: { 'org/apache/nutch/crawl/InjectorJob', 'java/util/Map',
>> > 'org/apache/hadoop/fs/Path', 'java/lang/Object' }
>> > stack: { 'org/apache/nutch/crawl/InjectorJob',
>> > 'org/apache/nutch/util/NutchJob' }
>> >   Bytecode:
>> > 0x000: 2ab6 0004 1205 b800 06b6 0007 2b12 09b9
>> > 0x010: 000a 0200 4e2d c100 0b99 000b 2dc0 000b
>> > 0x020: 4da7 000f bb00 0b59 2db6 000c b700 0d4d
>> > 0x030: 2a04 b500 0e2a 03b5 000f 2a2a b600 04bb
>> > 0x040: 0010 59b7 0011 1212 b600 132c b600 14b6
>> > 0x050: 0015 b800 16b5 0017 2ab4 0017 2cb8 0018
>> > 0x060: 2ab4 0017 1219 b600 1a2a b400 1712 1bb6
>> > 0x070: 001c 2ab4 0017 121d b600 1e2a b400 1712
>> > 0x080: 1fb6 0020 2ab4 0017 b600 2112 1b12 1db8
>> > 0x090: 0022 3a04 2ab4 0017 1904 04b8 0023 2ab4
>> > 0x0a0: 0017 b600 21b8 0024 3a05 b200 25bb 0010
>> > 0x0b0: 59b7 0011 1226 b600 1319 05b6 0014 1227
>> > 0x0c0: b600 13b6 0015 b900 2802 002a b400 1712
>> > 0x0d0: 29b6 002a 2ab4 0017 03b6 002b 2ab4 0017
>> > 0x0e0: 04b6 002c 5701 2ab4 0017 2ab4 002d b800
>> > 0x0f0: 2e2a b400 17b6 002f 1230 1231 b600 32b9
>> > 0x100: 0033 0100 3706 2ab4 0017 b600 2f12 3012
>> > 0x110: 34b6 0032 b900 3301 0037 08b2 0025 bb00
>> > 0x120: 1059 b700 1112 35b6 0013 1608 b600 36b6
>> > 0x130: 0015 b900 2802 00b2 0025 bb00 1059 b700
>> > 0x140: 1112 37b6 0013 1606 b600 36b6 0015 b900
>> > 0x150: 2802 002a b400 2db0
>> >   Stackmap Table:
>> > append_frame(@36,Top,Object[#148])
>> > full_frame(@48,{Object[#149],Object[#150],Object[#151],
>> > Object[#148]},{})
>> >
>> > at java.lang.Class.getDeclaredMethods0(Native Method)
>> > at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
>> > at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
>> > at java.lang.Class.getMethod0(Class.java:3018)
>> > at java.lang.Class.getMethod(Class.java:1784)
>> > at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
>> > at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
>> >
>> >
>> >
>> > Regards,
>> > govind
>> >
>> >
>>
>>
>> --
>> http://home.apache.org/~lewismc/
>> @hectorMcSpector
>> http://www.linkedin.com/in/lmcgibbney
>>
>
>


Re: Getting Error

2018-01-09 Thread govind nitk
Hi Lewis,

uname -a:
Linux data 4.4.0-108-generic #131~14.04.1-Ubuntu SMP Sun Jan 7 15:54:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

On Tue, Jan 9, 2018 at 7:56 PM, lewis john mcgibbney <lewi...@apache.org>
wrote:

> Hi govind,
> Very strange. Which operating system are you using?
> Lewis
>
> On Tue, Jan 9, 2018 at 5:15 AM, <user-digest-h...@nutch.apache.org> wrote:
>
> > From: govind nitk <govind.n...@gmail.com>
> > To: user@nutch.apache.org
> > Cc:
> > Bcc:
> > Date: Tue, 9 Jan 2018 15:45:08 +0530
> > Subject: Getting Error
> > Hi,
> >
> > 1. Running Nutch compiled from branch 2.x. The build succeeded.
> > 2. Using Mongo as DB storage. Changed storage.data.store.class to point
> > to the Mongo class.
> >
> >
> > I get this error when running nutch inject /tmp/urls/seeds.txt:
> >
> >
> > Error: A JNI error has occurred, please check your installation and try again
> > Exception in thread "main" java.lang.VerifyError: Bad type on operand stack
> > Exception Details:
> >   Location:
> > org/apache/nutch/crawl/InjectorJob.run(Ljava/util/Map;)Ljava/util/Map;
> > @85: putfield
> >   Reason:
> > Type 'org/apache/nutch/util/NutchJob' (current frame, stack[1]) is not
> > assignable to 'org/apache/hadoop/mapreduce/Job'
> >   Current Frame:
> > bci: @85
> > flags: { }
> > locals: { 'org/apache/nutch/crawl/InjectorJob', 'java/util/Map',
> > 'org/apache/hadoop/fs/Path', 'java/lang/Object' }
> > stack: { 'org/apache/nutch/crawl/InjectorJob',
> > 'org/apache/nutch/util/NutchJob' }
> >   Bytecode:
> > 0x000: 2ab6 0004 1205 b800 06b6 0007 2b12 09b9
> > 0x010: 000a 0200 4e2d c100 0b99 000b 2dc0 000b
> > 0x020: 4da7 000f bb00 0b59 2db6 000c b700 0d4d
> > 0x030: 2a04 b500 0e2a 03b5 000f 2a2a b600 04bb
> > 0x040: 0010 59b7 0011 1212 b600 132c b600 14b6
> > 0x050: 0015 b800 16b5 0017 2ab4 0017 2cb8 0018
> > 0x060: 2ab4 0017 1219 b600 1a2a b400 1712 1bb6
> > 0x070: 001c 2ab4 0017 121d b600 1e2a b400 1712
> > 0x080: 1fb6 0020 2ab4 0017 b600 2112 1b12 1db8
> > 0x090: 0022 3a04 2ab4 0017 1904 04b8 0023 2ab4
> > 0x0a0: 0017 b600 21b8 0024 3a05 b200 25bb 0010
> > 0x0b0: 59b7 0011 1226 b600 1319 05b6 0014 1227
> > 0x0c0: b600 13b6 0015 b900 2802 002a b400 1712
> > 0x0d0: 29b6 002a 2ab4 0017 03b6 002b 2ab4 0017
> > 0x0e0: 04b6 002c 5701 2ab4 0017 2ab4 002d b800
> > 0x0f0: 2e2a b400 17b6 002f 1230 1231 b600 32b9
> > 0x100: 0033 0100 3706 2ab4 0017 b600 2f12 3012
> > 0x110: 34b6 0032 b900 3301 0037 08b2 0025 bb00
> > 0x120: 1059 b700 1112 35b6 0013 1608 b600 36b6
> > 0x130: 0015 b900 2802 00b2 0025 bb00 1059 b700
> > 0x140: 1112 37b6 0013 1606 b600 36b6 0015 b900
> > 0x150: 2802 002a b400 2db0
> >   Stackmap Table:
> > append_frame(@36,Top,Object[#148])
> > full_frame(@48,{Object[#149],Object[#150],Object[#151],Object[#148]},{})
> >
> > at java.lang.Class.getDeclaredMethods0(Native Method)
> > at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> > at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
> > at java.lang.Class.getMethod0(Class.java:3018)
> > at java.lang.Class.getMethod(Class.java:1784)
> > at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> > at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
> >
> >
> >
> > Regards,
> > govind
> >
> >
>
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>


Getting Error

2018-01-09 Thread govind nitk
Hi,

1. Running Nutch compiled from branch 2.x. The build succeeded.
2. Using Mongo as DB storage. Changed storage.data.store.class to point
to the Mongo class.


I get this error when running nutch inject /tmp/urls/seeds.txt:


Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
org/apache/nutch/crawl/InjectorJob.run(Ljava/util/Map;)Ljava/util/Map;
@85: putfield
  Reason:
Type 'org/apache/nutch/util/NutchJob' (current frame, stack[1]) is not
assignable to 'org/apache/hadoop/mapreduce/Job'
  Current Frame:
bci: @85
flags: { }
locals: { 'org/apache/nutch/crawl/InjectorJob', 'java/util/Map',
'org/apache/hadoop/fs/Path', 'java/lang/Object' }
stack: { 'org/apache/nutch/crawl/InjectorJob',
'org/apache/nutch/util/NutchJob' }
  Bytecode:
0x000: 2ab6 0004 1205 b800 06b6 0007 2b12 09b9
0x010: 000a 0200 4e2d c100 0b99 000b 2dc0 000b
0x020: 4da7 000f bb00 0b59 2db6 000c b700 0d4d
0x030: 2a04 b500 0e2a 03b5 000f 2a2a b600 04bb
0x040: 0010 59b7 0011 1212 b600 132c b600 14b6
0x050: 0015 b800 16b5 0017 2ab4 0017 2cb8 0018
0x060: 2ab4 0017 1219 b600 1a2a b400 1712 1bb6
0x070: 001c 2ab4 0017 121d b600 1e2a b400 1712
0x080: 1fb6 0020 2ab4 0017 b600 2112 1b12 1db8
0x090: 0022 3a04 2ab4 0017 1904 04b8 0023 2ab4
0x0a0: 0017 b600 21b8 0024 3a05 b200 25bb 0010
0x0b0: 59b7 0011 1226 b600 1319 05b6 0014 1227
0x0c0: b600 13b6 0015 b900 2802 002a b400 1712
0x0d0: 29b6 002a 2ab4 0017 03b6 002b 2ab4 0017
0x0e0: 04b6 002c 5701 2ab4 0017 2ab4 002d b800
0x0f0: 2e2a b400 17b6 002f 1230 1231 b600 32b9
0x100: 0033 0100 3706 2ab4 0017 b600 2f12 3012
0x110: 34b6 0032 b900 3301 0037 08b2 0025 bb00
0x120: 1059 b700 1112 35b6 0013 1608 b600 36b6
0x130: 0015 b900 2802 00b2 0025 bb00 1059 b700
0x140: 1112 37b6 0013 1606 b600 36b6 0015 b900
0x150: 2802 002a b400 2db0
  Stackmap Table:
append_frame(@36,Top,Object[#148])
full_frame(@48,{Object[#149],Object[#150],Object[#151],Object[#148]},{})

at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)



Regards,
govind
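
A note for anyone reading this trace later: the "Reason" line says the
verifier saw an org.apache.nutch.util.NutchJob that is not a subclass of
org.apache.hadoop.mapreduce.Job. One possible cause (an assumption, not
confirmed anywhere in this thread) is that the NutchJob class loaded at run
time differs from the one InjectorJob was compiled against, for example
because a stale or mismatched jar is on the classpath. The sketch below is a
hypothetical helper, not part of Nutch; the class name ClassOrigin and its
methods are made up for illustration. It only uses standard JDK reflection to
print where a class was loaded from and its superclass chain.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper (not part of Nutch or Hadoop).
class ClassOrigin {

    // Builds "A -> B -> java.lang.Object" by walking getSuperclass().
    static String superclassChain(Class<?> c) {
        StringBuilder sb = new StringBuilder(c.getName());
        for (Class<?> s = c.getSuperclass(); s != null; s = s.getSuperclass()) {
            sb.append(" -> ").append(s.getName());
        }
        return sb.toString();
    }

    // Returns the jar or directory a class was loaded from.
    // Classes loaded by the bootstrap loader have no CodeSource.
    static String location(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(JDK / bootstrap)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Default to the two classes named in the VerifyError above.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.nutch.util.NutchJob",
            "org.apache.hadoop.mapreduce.Job"
        };
        for (String name : names) {
            // initialize=false: load the class without running static initializers.
            Class<?> c = Class.forName(name, false, ClassOrigin.class.getClassLoader());
            System.out.println(name);
            System.out.println("  loaded from: " + location(c));
            System.out.println("  hierarchy:   " + superclassChain(c));
        }
    }
}
```

Run with the same classpath that the failing nutch command uses, this should
show which jar supplies NutchJob and whether org.apache.hadoop.mapreduce.Job
appears in its hierarchy. If it does not, the classpath most likely holds a
NutchJob from a different build than the one InjectorJob expects.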