Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Yash Thenuan Thenuan
You can try checking robots.txt for these websites.
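To see which agents a site allows, fetch its robots.txt directly and look at
the User-agent sections. A minimal sketch, assuming Java 11+ and using one of
the hosts from this thread (a plain HTTP check, not Nutch's own robots
handling):

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class RobotsCheck {
      public static void main(String[] args) throws Exception {
          // Fetch the robots.txt of the failing site and check whether
          // unknown crawlers (such as Nutch's agent) are disallowed.
          HttpClient client = HttpClient.newHttpClient();
          HttpRequest request = HttpRequest.newBuilder(
                  URI.create("https://whatdavidread.ca/robots.txt")).build();
          HttpResponse<String> response =
                  client.send(request, HttpResponse.BodyHandlers.ofString());
          System.out.println(response.statusCode());
          System.out.println(response.body());
      }
  }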

On Wed, 14 Nov 2018, 16:00 Yash Thenuan Thenuan wrote:

> Most probably the problem is these websites allow only some specific
> crawlers in their robots.txt file.
>
> On Wed, 14 Nov 2018, 15:56 Semyon Semyonov  wrote:
>
>> Hi Nicholas,
>>
>> I have the same problem with https://www.graydon.nl/
>> And it doesn't look like a WordPress website.
>>
>> Semyon
>>
>>
>> Sent: Wednesday, November 14, 2018 at 7:49 AM
>> From: "Nicholas Roberts" 
>> To: user@nutch.apache.org
>> Subject: Wordpress.com hosted sites fail
>> org.apache.commons.httpclient.NoHttpResponseException
>> hi
>>
>> I am setting up a new crawler with Nutch 1.15 and am having problems only
>> with WordPress.com-hosted sites.
>>
>> I can crawl other https sites with no problems.
>>
>> WordPress sites can be crawled on other hosts, but I think there is a
>> problem with the SSL certs at WordPress.com.
>>
>> I get this error
>>
>> FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
>> org.apache.commons.httpclient.NoHttpResponseException: The server
>> whatdavidread.ca failed to respond
>> FetcherThread 43 has no more work available
>>
>> there seem to be two layers of SSL certs
>>
>> first there is a Let's Encrypt cert, with many domains, including the one
>> above, and the tls.automattic.com domain
>>
>> then, underlying the Let's Encrypt cert, there is a *.wordpress.com cert
>> from Comodo
>>
>> Certificate chain
>> 0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
>>   i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO
>>   RSA Domain Validation Secure Server CA
>>
>> I can crawl other https sites with no problems
>>
>> I have tried NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
>> -Djsse.enableSNIExtension=false) with no joy
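The two certificate layers can be observed outside Nutch by completing a TLS
handshake with and without SNI. A minimal probe, assuming standard JSSE
(Java 8+) and the hostname from the error above; the handshake without SNI
may fail outright if the server insists on it:

  import java.security.cert.X509Certificate;
  import java.util.Collections;
  import java.util.List;
  import javax.net.ssl.*;

  public class SniProbe {
      static void probe(String host, boolean sendSni) throws Exception {
          SSLSocketFactory f = (SSLSocketFactory) SSLSocketFactory.getDefault();
          try (SSLSocket s = (SSLSocket) f.createSocket(host, 443)) {
              SSLParameters p = s.getSSLParameters();
              // With SNI the server can present the per-domain (Let's
              // Encrypt) cert; without it, a default cert is served.
              List<SNIServerName> names = sendSni
                      ? Collections.singletonList(new SNIHostName(host))
                      : Collections.emptyList();
              p.setServerNames(names);
              s.setSSLParameters(p);
              s.startHandshake();
              X509Certificate cert = (X509Certificate)
                      s.getSession().getPeerCertificates()[0];
              System.out.println("SNI=" + sendSni + " -> "
                      + cert.getSubjectX500Principal());
          }
      }

      public static void main(String[] args) throws Exception {
          probe("whatdavidread.ca", true);
          probe("whatdavidread.ca", false);
      }
  }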
>>
>> my nutch-site.xml
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
>> </property>
>>
>>
>> thanks for the consideration
>> --
>> Nicholas Roberts
>> www.niccolox.org
>>
>


Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Yash Thenuan Thenuan
Most probably the problem is these websites allow only some specific
crawlers in their robots.txt file.

On Wed, 14 Nov 2018, 15:56 Semyon Semyonov wrote:

> Hi Nicholas,
>
> I have the same problem with https://www.graydon.nl/
> And it doesn't look like a WordPress website.
>
> Semyon
>
>
> Sent: Wednesday, November 14, 2018 at 7:49 AM
> From: "Nicholas Roberts" 
> To: user@nutch.apache.org
> Subject: Wordpress.com hosted sites fail
> org.apache.commons.httpclient.NoHttpResponseException
> hi
>
> I am setting up a new crawler with Nutch 1.15 and am having problems only
> with WordPress.com-hosted sites.
>
> I can crawl other https sites with no problems.
>
> WordPress sites can be crawled on other hosts, but I think there is a
> problem with the SSL certs at WordPress.com.
>
> I get this error
>
> FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
> org.apache.commons.httpclient.NoHttpResponseException: The server
> whatdavidread.ca failed to respond
> FetcherThread 43 has no more work available
>
> there seem to be two layers of SSL certs
>
> first there is a Let's Encrypt cert, with many domains, including the one
> above, and the tls.automattic.com domain
>
> then, underlying the Let's Encrypt cert, there is a *.wordpress.com cert
> from Comodo
>
> Certificate chain
> 0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
>   i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO
>   RSA Domain Validation Secure Server CA
>
> I can crawl other https sites with no problems
>
> I have tried NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
> -Djsse.enableSNIExtension=false) with no joy
>
> my nutch-site.xml
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
> </property>
>
>
> thanks for the consideration
> --
> Nicholas Roberts
> www.niccolox.org
>


Re: Alternatives to Solr

2018-10-05 Thread Yash Thenuan Thenuan
You can use Elasticsearch.
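If you go that route, the Nutch 1.x side is mostly configuration. A hedged
nutch-site.xml sketch; the elastic.* property names appear in the indexer
logs elsewhere in this digest, while the values below are placeholders for
your own setup:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.port</name>
    <value>9300</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>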

On Sat, 6 Oct 2018, 00:58 Timeka Cobb,  wrote:

> Hello folks! Does anyone know of a good alternative to Solr? I'm asking this
> because I've been trying to connect the two and it's been so frustrating.
> The Nutch Wiki is extremely unreliable when it comes to Solr, and every site
> I go to for info leads me nowhere. Does anyone know of something else I can
> use? I want to create a simple search engine crawling a few whole websites,
> that's all. Thank you
>
> Timeka Cobb
>


Re: Having plugin as a separate project

2018-05-07 Thread Yash Thenuan Thenuan
Hey,
Thanks for the answer, but my question was: can't we write a plugin by
downloading the Nutch jar and using it as a dependency, rather than adding
the code to the Nutch source tree?

On Fri, May 4, 2018 at 8:08 PM, Jorge Betancourt <betancourt.jo...@gmail.com
> wrote:

> Usually we tend to develop everything inside the Nutch file structure;
> this is especially useful if you need to deploy to a Hadoop cluster later
> on (because you need to bundle everything in a job file).
>
> But, if you really want to develop the plugin in isolation you only need to
> create a new project in your preferred IDE/maven/ant/gradle and add the
> dependencies that you need from the lib/ directory (or the global
> dependencies with the same version).
>
> Then just compile everything to a jar and place it in the proper plugin
> structure in the Nutch installation. Although this should work, it is not
> really a smooth development experience.
> You need to be careful and not bundle all libs inside your jar, etc.
>
> The path suggested by Sebastian is much better: in the end, while developing
> you want to have everything in place, perhaps just compile/test your plugin,
> and later on you can copy the final jar of your plugin to the desired Nutch
> installation.
>
> Best Regards,
> Jorge
>
> On Fri, May 4, 2018 at 4:02 PM narendra singh arya <nsary...@gmail.com>
> wrote:
>
> > Can we have nutch plugin as a separate project?
> >
> > On Fri, 4 May 2018, 19:26 Sebastian Nagel, <wastl.na...@googlemail.com>
> > wrote:
> >
> > > That's trivial. Just run ant in the plugin's source folder:
> > >
> > >   cd src/plugin/urlnormalizer-basic/
> > >   ant
> > >
> > > or to run also the tests
> > >
> > >   cd src/plugin/urlnormalizer-basic/
> > >   ant test
> > >
> > > Note: you have to compile the core test classes first by running
> > >
> > >   ant compile-core-test
> > >
> > > in the Nutch "root" folder.
> > >
> > > A little bit slower but guarantees that everything is compiled:
> > >
> > >   ant -Dplugin=urlnormalizer-basic test-plugin
> > >
> > > Or sometimes it's enough to skip some of the long running tests:
> > >
> > >   ant -Dtest.exclude='TestSegmentMerger*' clean runtime test
> > >
> > >
> > > Best,
> > > Sebastian
> > >
> > > On 05/04/2018 01:13 PM, Yash Thenuan Thenuan wrote:
> > > > Hi all,
> > > > I want to compile my plugins separately so that I need not compile
> > > > the whole project again when I make a change in some plugin. How can
> > > > I achieve that?
> > > > Thanks
> > > >
> > >
> > >
> >
>


Having plugin as a separate project

2018-05-04 Thread Yash Thenuan Thenuan
Hi all,
I want to compile my plugins separately so that I need not compile
the whole project again when I make a change in some plugin. How can I
achieve that?
Thanks


RE: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
Yes, I am using the HTML parser, and yes, the document is getting parsed, but
the document fragment is printing null.

On 15 Mar 2018 13:52, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

> Is your parser the HTML parser? I can say from experience that the
> document is passed.
> I really recommend debugging in local mode rather than using sysout.
>
> > -----Original Message-----
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 15 March 2018 10:13
> > To: user@nutch.apache.org
> > Subject: RE: RE: Dependency between plugins
> >
> > I tried printing the contents of the document fragment in
> > parsefilter-regex by writing System.out.println(doc) but it's printing
> > null!! And the document is getting parsed!!
> >
> > On 15 Mar 2018 13:15, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >
> > > Parse filters receive a DocumentFragment as their fourth parameter.
> > >
> > > > -Original Message-
> > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > Sent: 15 March 2018 08:50
> > > > To: user@nutch.apache.org
> > > > Subject: Re: RE: Dependency between plugins
> > > >
> > > > Hi Jorge and Yossi,
> > > > The reason why I am trying to do it is exactly what Yossi said,
> > > > "removing the Nutch overhead". I didn't think that it would be that
> > > > complicated. All I am trying is to call the existing parsers from my
> > > > own parser, but I am not able to do it correctly. Maybe the chain
> > > > approach is a better idea to do that, but *does a parse filter
> > > > receive any DOM object* as a parameter, so that by accessing it I
> > > > can extract the data I want?
> > > >
> > > >
> > > > On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari
> > > > <yossi.tam...@pipl.com>
> > > > wrote:
> > > >
> > > > > There is no built-in mechanism for this. However, are you sure you
> > > > > really want a parser for each website, rather than a parse-filter
> > > > > for each website (which will take the results of the HTML parser
> > > > > and apply some domain specific customizations)?
> > > > > In both cases you can use a dispatcher approach, which your custom
> > > > > parser is, or a chain approach (every parser that is not intended
> > > > > for this domain returns null, or each parse-filter that is not
> > > > > intended for this domain returns the ParseResult that it received).
> > > > > The advantage of the chain approach is that each new website
> > > > > parser is a first-class, reusable Nutch object. The advantage of
> > > > > the dispatcher approach is that you don't need to deal with a lot
> > > > > of the Nutch overhead, but it is more monolithic (You can end up
> > > > > with one huge plugin that needs to be constantly modified whenever
> > > > > one of the websites is modified).
> > > > >
> > > > > > -Original Message-
> > > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > > Sent: 14 March 2018 15:28
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: Re: RE: Dependency between plugins
> > > > > >
> > > > > > Is there a way in nutch by which we can use different parsers
> > > > > > for different websites?
> > > > > > I am trying to do this by writing a custom parser which will
> > > > > > call different parsers for different websites.
> > > > > >
> > > > > > On 14 Mar 2018 14:19, "Semyon Semyonov" <semyon.semyo...@mail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > As a side note,
> > > > > > >
> > > > > > > I had to implement my own parser with extra functionality; a
> > > > > > > simple copy/paste of the code of HtmlParser did the job.
> > > > > > >
> > > > > > > If you want to inherit instead of copy/paste it can be a bad
> > > > > > > idea at all. HTML parser is a concrete, non-abstract class,
> > > > > > > therefore the inheritance will not be so smooth as in the case
> > > > > > > of contract implementations (the plugins are contracts, i.e.
> > > > > > > interfaces) and can easily break some OOP rules.

RE: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
I tried printing the contents of the document fragment in parsefilter-regex
by writing System.out.println(doc) but it's printing null!! And the document
is getting parsed!!

On 15 Mar 2018 13:15, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

> Parse filters receive a DocumentFragment as their fourth parameter.
>
> > -Original Message-
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 15 March 2018 08:50
> > To: user@nutch.apache.org
> > Subject: Re: RE: Dependency between plugins
> >
> > Hi Jorge and Yossi,
> > The reason why I am trying to do it is exactly what Yossi said, "removing
> > the Nutch overhead". I didn't think that it would be that complicated. All
> > I am trying is to call the existing parsers from my own parser, but I am
> > not able to do it correctly. Maybe the chain approach is a better idea to
> > do that, but *does a parse filter receive any DOM object* as a parameter,
> > so that by accessing it I can extract the data I want?
> >
> >
> > On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari <yossi.tam...@pipl.com>
> > wrote:
> >
> > > There is no built-in mechanism for this. However, are you sure you
> > > really want a parser for each website, rather than a parse-filter for
> > > each website (which will take the results of the HTML parser and apply
> > > some domain specific customizations)?
> > > In both cases you can use a dispatcher approach, which your custom
> > > parser is, or a chain approach (every parser that is not intended for
> > > this domain returns null, or each parse-filter that is not intended
> > > for this domain returns the ParseResult that it received).
> > > The advantage of the chain approach is that each new website parser is
> > > a first-class, reusable Nutch object. The advantage of the dispatcher
> > > approach is that you don't need to deal with a lot of the Nutch
> > > overhead, but it is more monolithic (You can end up with one huge
> > > plugin that needs to be constantly modified whenever one of the
> > > websites is modified).
> > >
> > > > -Original Message-
> > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > Sent: 14 March 2018 15:28
> > > > To: user@nutch.apache.org
> > > > Subject: Re: RE: Dependency between plugins
> > > >
> > > > Is there a way in nutch by which we can use different parsers for
> > > > different websites?
> > > > I am trying to do this by writing a custom parser which will call
> > > > different parsers for different websites.
> > > >
> > > > On 14 Mar 2018 14:19, "Semyon Semyonov" <semyon.semyo...@mail.com>
> > > > wrote:
> > > >
> > > > > As a side note,
> > > > >
> > > > > I had to implement my own parser with extra functionality; a simple
> > > > > copy/paste of the code of HtmlParser did the job.
> > > > >
> > > > > If you want to inherit instead of copy/paste it can be a bad idea at
> > > > > all. HTML parser is a concrete, non-abstract class, therefore the
> > > > > inheritance will not be so smooth as in the case of contract
> > > > > implementations (the plugins are contracts, i.e. interfaces) and can
> > > > > easily break some OOP rules.
> > > > >
> > > > >
> > > > > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > > > > From: "Yossi Tamari" <yossi.tam...@pipl.com>
> > > > > To: user@nutch.apache.org
> > > > > Subject: RE: Dependency between plugins
> > > > >
> > > > > One suggestion I can make is to ensure that the html-parse plugin
> > > > > is built before your plugin (since you are including the jars that
> > > > > are generated in its build).
> > > > >
> > > > > > -Original Message-
> > > > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > > > Sent: 14 March 2018 09:55
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: Re: Dependency between plugins
> > > > > >
> > > > > > Hi,
> > > > > > It didn't work in ant runtime.
>

Re: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
Hi Jorge and Yossi,
The reason why I am trying to do it is exactly what Yossi said, "removing the
Nutch overhead". I didn't think that it would be that complicated. All I am
trying is to call the existing parsers from my own parser, but I am not able
to do it correctly. Maybe the chain approach is a better idea to do that, but
*does a parse filter receive any DOM object* as a parameter, so that by
accessing it I can extract the data I want?
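For reference, the HtmlParseFilter extension point does receive the parsed
DOM: a DocumentFragment is its fourth parameter, as noted elsewhere in this
thread. A minimal sketch against the Nutch 1.x API (the class name is
hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  public class DomInspectingFilter implements HtmlParseFilter {
      private Configuration conf;

      @Override
      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
          // 'doc' is the DOM produced by parse-html/parse-tika; traverse it
          // to extract whatever page sections are needed.
          if (doc != null && doc.hasChildNodes()) {
              // ... DOM traversal / XPath evaluation over 'doc' ...
          }
          // A filter not interested in this page should return the
          // ParseResult it received, unchanged.
          return parseResult;
      }

      @Override
      public void setConf(Configuration conf) { this.conf = conf; }

      @Override
      public Configuration getConf() { return conf; }
  }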


On Wed, Mar 14, 2018 at 7:36 PM, Yossi Tamari <yossi.tam...@pipl.com> wrote:

> There is no built-in mechanism for this. However, are you sure you really
> want a parser for each website, rather than a parse-filter for each website
> (which will take the results of the HTML parser and apply some domain
> specific customizations)?
> In both cases you can use a dispatcher approach, which your custom parser
> is, or a chain approach (every parser that is not intended for this domain
> returns null, or each parse-filter that is not intended for this domain
> returns the ParseResult that it received).
> The advantage of the chain approach is that each new website parser is a
> first-class, reusable Nutch object. The advantage of the dispatcher
> approach is that you don't need to deal with a lot of the Nutch overhead,
> but it is more monolithic (You can end up with one huge plugin that needs
> to be constantly modified whenever one of the websites is modified).
>
> > -Original Message-
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 14 March 2018 15:28
> > To: user@nutch.apache.org
> > Subject: Re: RE: Dependency between plugins
> >
> > Is there a way in nutch by which we can use different parsers for
> > different websites?
> > I am trying to do this by writing a custom parser which will call
> > different parsers for different websites.
> >
> > On 14 Mar 2018 14:19, "Semyon Semyonov" <semyon.semyo...@mail.com>
> > wrote:
> >
> > > As a side note,
> > >
> > > I had to implement my own parser with extra functionality; a simple
> > > copy/paste of the code of HtmlParser did the job.
> > >
> > > If you want to inherit instead of copy/paste it can be a bad idea at
> > > all. HTML parser is a concrete, non-abstract class, therefore the
> > > inheritance will not be so smooth as in the case of contract
> > > implementations (the plugins are contracts, i.e. interfaces) and can
> > > easily break some OOP rules.
> > >
> > >
> > > Sent: Wednesday, March 14, 2018 at 9:18 AM
> > > From: "Yossi Tamari" <yossi.tam...@pipl.com>
> > > To: user@nutch.apache.org
> > > Subject: RE: Dependency between plugins
> > >
> > > One suggestion I can make is to ensure that the html-parse plugin is
> > > built before your plugin (since you are including the jars that are
> > > generated in its build).
> > >
> > > > -Original Message-
> > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > Sent: 14 March 2018 09:55
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Dependency between plugins
> > > >
> > > > Hi,
> > > > It didn't work in ant runtime.
> > > > I included "import org.apache.nutch.parse.html;" in my custom parser
> > > > code, but it is throwing an error while I am doing ant runtime.
> > > >
> > > > [javac]
> > > > /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-
> > > > custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
> > > > error: cannot find symbol
> > > >
> > > > [javac] import org.apache.nutch.parse.html;
> > > >
> > > > [javac] ^
> > > >
> > > > [javac] symbol: class html
> > > >
> > > > [javac] location: package org.apache.nutch.parse
> > > >
> > > >
> > > > below are the xml files of my parser
> > > >
> > > >
> > > > My ivy.xml
> > > >
> > > > [ivy.xml markup stripped by the list archive]
> > > 

Re: RE: Dependency between plugins

2018-03-14 Thread Yash Thenuan Thenuan
Is there a way in nutch by which we can use different parsers for different
websites?
I am trying to do this by writing a custom parser which will call different
parsers for different websites.

On 14 Mar 2018 14:19, "Semyon Semyonov" <semyon.semyo...@mail.com> wrote:

> As a side note,
>
> I had to implement my own parser with extra functionality; a simple
> copy/paste of the code of HtmlParser did the job.
>
> If you want to inherit instead of copy/paste it can be a bad idea at all.
> HTML parser is a concrete, non-abstract class, therefore the inheritance
> will not be so smooth as in the case of contract implementations (the
> plugins are contracts, i.e. interfaces) and can easily break some OOP rules.
>
>
> Sent: Wednesday, March 14, 2018 at 9:18 AM
> From: "Yossi Tamari" <yossi.tam...@pipl.com>
> To: user@nutch.apache.org
> Subject: RE: Dependency between plugins
> One suggestion I can make is to ensure that the html-parse plugin is built
> before your plugin (since you are including the jars that are generated in
> its build).
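For the record, the bundled plugins express such a dependency in two places;
a sketch following that pattern (parse-html itself depends on lib-nekohtml
this way; the parse-custom folder name is taken from the error messages in
this thread, and the exact ids are illustrative):

  <!-- src/plugin/parse-custom/build.xml: build parse-html first -->
  <project name="parse-custom" default="jar-core">
    <import file="../build-plugin.xml"/>
    <target name="deps-jar">
      <ant target="jar" inheritall="false" dir="../parse-html"/>
    </target>
  </project>

  <!-- in src/plugin/parse-custom/plugin.xml: make parse-html's classes
       visible to this plugin's classloader at runtime -->
  <requires>
    <import plugin="nutch-extensionpoints"/>
    <import plugin="parse-html"/>
  </requires>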
>
> > -Original Message-
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 14 March 2018 09:55
> > To: user@nutch.apache.org
> > Subject: Re: Dependency between plugins
> >
> > Hi,
> > It didn't work in ant runtime.
> > I included "import org.apache.nutch.parse.html;" in my custom parser
> > code, but it is throwing an error while I am doing ant runtime.
> >
> > [javac]
> > /Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-
> > custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
> > error: cannot find symbol
> >
> > [javac] import org.apache.nutch.parse.html;
> >
> > [javac] ^
> >
> > [javac] symbol: class html
> >
> > [javac] location: package org.apache.nutch.parse
> >
> >
> > below are the xml files of my parser
> >
> >
> > My ivy.xml
> >
> > [ivy.xml markup stripped by the list archive]
> >
> > build.xml
> >
> > [build.xml markup stripped by the list archive]
> >
> > plugin.xml
> >
> > <plugin
> >    id="parse-custom"
> >    name="Custom Parse Plug-in"
> >    version="1.0.0"
> >    provider-name="nutch.org">
> >
> >    <runtime>
> >       <!-- library declarations stripped by the list archive -->
> >    </runtime>
> >
> >    <extension id="..."
> >               name="CustomParse"
> >               point="org.apache.nutch.parse.Parser">
> >       <implementation id="..."
> >                       class="org.apache.nutch.parse.custom.CustomParser">
> >          <parameter name="contentType"
> >                     value="text/html|application/xhtml+xml"/>
> >       </implementation>
> >    </extension>
> > </plugin>
> >
> >
> >
> >
> > On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari <yossi.tam...@pipl.com>
> > wrote:
> >
> > > Hi Yash,
> > >
> > > I don't know how to do it, I never tried, but if I had to it would be
> > > a trial and error thing
> > >
> > > If you want to increase the chances that someone will answer your
> > > question, I suggest you provide as much information as possible:
> > > Where did it not work? In "ant runtime", or when running in Hadoop?
> > > What was the error message?
> > > What is the content of your build.xml, plugin.xml, and ivy.xml?
> > > Is parse-html configured in your plugin-includes?
> > >
> > > If it's a problem during execution, I would suggest looking at or
> > > debugging the code of PluginClassLoader.
> > >
> > >
> > > > -Original Message-
> > > > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > > > Sent: 14 March 2018 08:34
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Dependency between plugins
> > > >
> > > > Anybody please help me out regarding this.
> > > >
> > > > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
> > > > rit2014...@iiita.ac.in> wrote:
> > > >
> > > > > I am trying to import Htmlparser in my custom parser.
> > > > > I did it in the same way by which Htmlparser imports lib-nekohtml
> > > > > but it didn't worked.
> > > > > Can anybody please tell me how to do it?
> > > > >
> > >
> > >
>
>


Re: Dependency between plugins

2018-03-14 Thread Yash Thenuan Thenuan
Hi,
It didn't work in ant runtime.
I included "import org.apache.nutch.parse.html;" in my custom parser code,
but it is throwing an error while I am doing ant runtime.

[javac]
/Users/yasht/Downloads/apache-nutch-1.14/src/plugin/parse-custom/src/java/org/apache/nutch/parse/custom/CustomParser.java:41:
error: cannot find symbol

[javac] import org.apache.nutch.parse.html;

[javac]  ^

[javac]   symbol:   class html

[javac]   location: package org.apache.nutch.parse


below are the xml files of my parser


My ivy.xml

[ivy.xml markup stripped by the list archive]

build.xml

[build.xml markup stripped by the list archive]

plugin.xml

<plugin
   id="parse-custom"
   name="Custom Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <!-- library declarations stripped by the list archive -->
   </runtime>

   <extension id="..."
              name="CustomParse"
              point="org.apache.nutch.parse.Parser">
      <implementation id="..."
                      class="org.apache.nutch.parse.custom.CustomParser">
         <parameter name="contentType"
                    value="text/html|application/xhtml+xml"/>
      </implementation>
   </extension>
</plugin>

On Wed, Mar 14, 2018 at 1:02 PM, Yossi Tamari <yossi.tam...@pipl.com> wrote:

> Hi Yash,
>
> I don't know how to do it, I never tried, but if I had to it would be a
> trial and error thing
>
> If you want to increase the chances that someone will answer your
> question, I suggest you provide as much information as possible:
> Where did it not work? In "ant runtime", or when running in Hadoop? What
> was the error message?
> What is the content of your build.xml, plugin.xml, and ivy.xml?
> Is parse-html configured in your plugin-includes?
>
> If it's a problem during execution, I would suggest looking at or
> debugging the code of PluginClassLoader.
>
>
> > -Original Message-
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 14 March 2018 08:34
> > To: user@nutch.apache.org
> > Subject: Re: Dependency between plugins
> >
> > Anybody please help me out regarding this.
> >
> > On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
> > rit2014...@iiita.ac.in> wrote:
> >
> > > I am trying to import HtmlParser in my custom parser.
> > > I did it in the same way by which HtmlParser imports lib-nekohtml but
> > > it didn't work.
> > > Can anybody please tell me how to do it?
> > >
>
>


Re: Dependency between plugins

2018-03-14 Thread Yash Thenuan Thenuan
Anybody please help me out regarding this.

On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <
rit2014...@iiita.ac.in> wrote:

> I am trying to import HtmlParser in my custom parser.
> I did it in the same way by which HtmlParser imports lib-nekohtml but it
> didn't work.
> Can anybody please tell me how to do it?
>


Dependency between plugins

2018-03-13 Thread Yash Thenuan Thenuan
I am trying to import HtmlParser in my custom parser.
I did it in the same way by which HtmlParser imports lib-nekohtml but it
didn't work.
Can anybody please tell me how to do it?


RE: Regarding Internal Links

2018-03-07 Thread Yash Thenuan Thenuan
Yossi, I tried with both the original URL and the newer one, but it didn't
work!!
However, for now I disabled scoring-opic as suggested by Sebastian and it
worked.
And I will open a Jira issue, but I am new to the open source world, so can
you please help me regarding this?
Thanks a lot, Yossi and Sebastian.

On 7 Mar 2018 16:11, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

Yash, just to be sure, you are using the original URL (the one that was in
the ParseResult passed as parameter to the filter) in the ParseResult
constructor, right?

> -Original Message-
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 07 March 2018 12:36
> To: user@nutch.apache.org
> Subject: Re: Regarding Internal Links
>
> Hi,
>
> that needs to be fixed. It's because there is no CrawlDb entry for the
> partial documents. It may also happen after NUTCH-2456. Could you open a
> Jira issue to address the problem? Thanks!
>
> As a quick work-around:
> - either disable scoring-opic while indexing
> - or check dbDatum for null in scoring-opic indexerScore(...)
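The second work-around amounts to a null guard. A sketch of how it could look
inside OPICScoringFilter.indexerScore (the method signature is the Nutch 1.x
ScoringFilter API; falling back to initScore is an assumption, the proper fix
being left to the Jira issue):

  @Override
  public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
      throws ScoringFilterException {
    if (dbDatum == null) {
      // Partial documents added by a parse filter have no CrawlDb entry,
      // which is what triggers the NullPointerException below.
      return initScore;
    }
    return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
  }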
>
> Thanks,
> Sebastian
>
> On 03/07/2018 11:13 AM, Yash Thenuan Thenuan wrote:
> > Thanks Yossi, I am now able to parse the data successfully but I am
> > getting an error at the time of indexing.
> > Below are the hadoop logs for indexing.
> >
> > ElasticRestIndexWriter
> > elastic.rest.host : hostname
> > elastic.rest.port : port
> > elastic.rest.index : elastic index command elastic.rest.max.bulk.docs
> > : elastic bulk index doc counts. (default 250)
> > elastic.rest.max.bulk.size : elastic bulk index length. (default
> > 2500500
> > ~2.5MB)
> >
> >
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > crawldb: crawl/crawldb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduce:
> > linkdb: crawl/linkdb
> > 2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20180307130959
> > 2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
> > deduplication is: off
> > 2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
> > 2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting
> > server pool to a list of 1 servers: [http://localhost:9200]
> > 2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
> > thread/connection supporting pooling connection manager
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default
> > GSON instance
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node
> > Discovery disabled...
> > 2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle
> > connection reaping disabled...
> > 2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
> > Processing remaining requests [docs = 1, length = 210402, total docs =
> > 1]
> > 2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
> > Processing to finalize last execute
> > 2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter -
> > Previous took in ms 175, including wait 97
> > 2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
> > job_local1561152089_0001
> > java.lang.Exception: java.lang.NullPointerException at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.ja
> > va:462) at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:52
> > 9) Caused by: java.lang.NullPointerException at
> > org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScori
> > ngFilter.java:171)
> > at
> > org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.ja
> > va:120)
> > at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java
> > :296)
> > at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java
> > :57) at
> > org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
> > at
> >
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(Loc
> > alJobRunner.java:319) at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511
> > ) at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
> > ava:1149)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
> > java:624) at j

Re: Regarding Internal Links

2018-03-07 Thread Yash Thenuan Thenuan
Thanks Yossi, I am now able to parse the data successfully but I am getting
an error at the time of indexing.
Below are the hadoop logs for indexing.

ElasticRestIndexWriter
elastic.rest.host : hostname
elastic.rest.port : port
elastic.rest.index : elastic index command
elastic.rest.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.rest.max.bulk.size : elastic bulk index length. (default 2500500
~2.5MB)


2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
crawldb: crawl/crawldb
2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduce:
linkdb: crawl/linkdb
2018-03-07 15:41:52,327 INFO  indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20180307130959
2018-03-07 15:41:53,677 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-07 15:41:54,861 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter
2018-03-07 15:41:55,168 INFO  client.AbstractJestClient - Setting server
pool to a list of 1 servers: [http://localhost:9200]
2018-03-07 15:41:55,170 INFO  client.JestClientFactory - Using multi
thread/connection supporting pooling connection manager
2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Using default GSON
instance
2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Node Discovery
disabled...
2018-03-07 15:41:55,238 INFO  client.JestClientFactory - Idle connection
reaping disabled...
2018-03-07 15:41:55,282 INFO  elasticrest.ElasticRestIndexWriter -
Processing remaining requests [docs = 1, length = 210402, total docs = 1]
2018-03-07 15:41:55,361 INFO  elasticrest.ElasticRestIndexWriter -
Processing to finalize last execute
2018-03-07 15:41:55,458 INFO  elasticrest.ElasticRestIndexWriter - Previous
took in ms 175, including wait 97
2018-03-07 15:41:55,468 WARN  mapred.LocalJobRunner -
job_local1561152089_0001
java.lang.Exception: java.lang.NullPointerException
at
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.NullPointerException
at
org.apache.nutch.scoring.opic.OPICScoringFilter.indexerScore(OPICScoringFilter.java:171)
at
org.apache.nutch.scoring.ScoringFilters.indexerScore(ScoringFilters.java:120)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:296)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:57)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at
org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2018-03-07 15:41:55,510 ERROR indexer.IndexingJob - Indexer:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)


On Wed, Mar 7, 2018 at 12:30 AM, Yossi Tamari <yossi.tam...@pipl.com> wrote:

> Regarding the configuration parameter, your Parse Filter should expose a
> setConf method that receives a conf parameter. Keep that as a member
> variable and pass it where necessary.
> Regarding parsestatus, contentmeta and parsemeta, you're going to have to
> look at them yourself (probably in a debugger), but as a baseline, you can
> probably just use the values in the inbound ParseResult (of the whole
> document).
> More specifically, parsestatus is an indication of whether parsing was
> successful. Unless your parsing may fail even when the whole document
> parsing was successful, you don't need to change it. contentmeta is all the
> information that was gathered about this page before parsing, so again, you
> probably just want to keep it, and finally parsemeta is the metadata that
> was gathered during parsing and may be useful for indexing, so passing the
> metadata from the original ParseResult makes sense, or just using the
> constructor that does not require it if you don't care about the metadata.
> This should all be easier to understand if you look at what the HTML
> Parser does with each of these fields.
>
> > -Original Message-
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 06 March 2018 20:17
> > To: user@nutch.apache.org
> > Subject: RE: Regarding Internal Links
> >

Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
If you want simple crawling, then not at all.
But having experience with Java will help you to fulfil your personal
requirements.

On 7 Mar 2018 01:42, "Eric Valencia" <ericlvalen...@gmail.com> wrote:

> Does this require knowing Java proficiently?
>
> On Tue, Mar 6, 2018 at 10:51 AM Semyon Semyonov <semyon.semyo...@mail.com>
> wrote:
>
> > Here is an unpleasant truth - there is no up-to-date tutorial for Nutch.
> > To make it even more interesting, sometimes the tutorial can contradict
> > the real behavior of Nutch, because of recently introduced features/bugs.
> > If you find such cases, please try to fix and contribute to the project.
> >
> > Welcome to the open source world.
> >
> > Though, my recommendations as a person who started with Nutch less than a
> > year ago:
> > 1) If you just need a simple crawl, you are in luck. Simply run the crawl
> > script or several steps according to the Nutch crawl tutorial.
> > 2) If it is a bit more complex you start to face problems either with
> > configuration or with bugs. Therefore, first have a look at the Nutch list
> > archive http://nutch.apache.org/mailing_lists.html ; if that doesn't work,
> > try to figure it out yourself; if that doesn't work, ask here or on the
> > developer list.
> > 3) In most cases, you HAVE to open the code and fix/discover something.
> > Nutch is a really complicated system, and to understand it properly you
> > can easily spend 2-3 months trying to get a full basic understanding of
> > the system. It gets even worse if you don't know Hadoop. If you don't, I
> > do recommend reading "Hadoop: The Definitive Guide", because, well, Nutch
> > is Hadoop.
> >
> > Here we are, no pain, no gain.
> >
> >
> >
> > Sent: Tuesday, March 06, 2018 at 7:42 PM
> > From: "Eric Valencia" <ericlvalen...@gmail.com>
> > To: user@nutch.apache.org
> > Subject: Re: Need Tutorial on Nutch
> > Thank you kindly Yash. Yes, I did try some of the tutorials actually, but
> > they seem to be missing the complete set of steps required to
> > successfully scrape in Nutch.
> >
> > On Tue, Mar 6, 2018 at 10:37 AM Yash Thenuan Thenuan <
> > rit2014...@iiita.ac.in>
> > wrote:
> >
> > > I would suggest starting with the documentation on Nutch's website.
> > > You can get an idea about how to start crawling and all.
> > > Apart from that, there are no proper tutorials as such.
> > > Just start crawling; if you get stuck somewhere, try to find something
> > > related to it on Google and the Nutch mailing list archives.
> > > Ask questions if nothing helps.
> > >
> > > On 7 Mar 2018 00:01, "Eric Valencia" <ericlvalen...@gmail.com> wrote:
> > >
> > > I'm a beginner in Nutch and need the best tutorials to get started. Can
> > > you guys let me know how you would advise yourselves if starting today
> > > (like me)?
> > >
> > > Eric
> > >
> >
>


Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
Start with Nutch 1.x if you are running into trouble. It's easier to
configure, and by following the Nutch 1.x tutorial you will be able to crawl
your first website easily.

On 7 Mar 2018 00:13, "Eric Valencia" <ericlvalen...@gmail.com> wrote:

> Thank you kindly Yash. Yes, I did try some of the tutorials actually, but
> they seem to be missing the complete set of steps required to
> successfully scrape in Nutch.
>
> On Tue, Mar 6, 2018 at 10:37 AM Yash Thenuan Thenuan <
> rit2014...@iiita.ac.in>
> wrote:
>
> > I would suggest starting with the documentation on Nutch's website.
> > You can get an idea about how to start crawling and all.
> > Apart from that, there are no proper tutorials as such.
> > Just start crawling; if you get stuck somewhere, try to find something
> > related to it on Google and the Nutch mailing list archives.
> > Ask questions if nothing helps.
> >
> > On 7 Mar 2018 00:01, "Eric Valencia" <ericlvalen...@gmail.com> wrote:
> >
> > I'm a beginner in Nutch and need the best tutorials to get started.  Can
> > you guys let me know how you would advise yourselves if starting today
> > (like me)?
> >
> > Eric
> >
>


Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
I would suggest starting with the documentation on Nutch's website.
You can get an idea about how to start crawling and all.
Apart from that, there are no proper tutorials as such.
Just start crawling; if you get stuck somewhere, try to find something
related to it on Google and the Nutch mailing list archives.
Ask questions if nothing helps.

On 7 Mar 2018 00:01, "Eric Valencia"  wrote:

I'm a beginner in Nutch and need the best tutorials to get started.  Can
you guys let me know how you would advise yourselves if starting today
(like me)?

Eric


RE: Regarding Internal Links

2018-03-06 Thread Yash Thenuan Thenuan
> I am able to get the content corresponding to each internal link by
> writing a parse filter plugin. Now I am not sure how to proceed
> further. How can I parse them as separate documents, and what should
> my ParseResult filter return?
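Given Sebastian's pointer below (a ParseResult can hold several entries, each
keyed by its own URL) and Yossi's earlier advice to reuse the inbound
ParseData, the filter body could look roughly like this. Section and
extractSections are hypothetical stand-ins for your own DOM splitting;
imports are as in the HtmlParseFilter sketch earlier in this digest:

  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
      String url = content.getUrl();
      Parse whole = parseResult.get(url);      // parse of the full page
      for (Section s : extractSections(doc)) { // your DOM-based splitting
          // Reuse the whole document's ParseData, as suggested above, and
          // key each sub-document by the original URL plus its anchor.
          parseResult.put(new Text(url + "#" + s.anchor),
                          new ParseText(s.text),
                          whole.getData());
      }
      return parseResult;
  }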


Re: Regarding Internal Links

2018-03-05 Thread Yash Thenuan Thenuan
Please help me out regarding this.
It's urgent.

On 5 Mar 2018 15:41, "Yash Thenuan Thenuan" <rit2014...@iiita.ac.in> wrote:

> How can I achieve this in nutch 1.x?
>
> On 1 Mar 2018 22:30, "Sebastian Nagel" <wastl.na...@googlemail.com> wrote:
>
>> Hi,
>>
>> Yes, that's possible but only for Nutch 1.x:
>> a ParseResult [1] may contain multiple ParseData objects
>> each accessible by a separate URL.
>> This feature is not available for 2.x [2].
>>
>> It's used by the feed parser plugin to add a single
>> entry for every feed item.  Afaik, that's not supported
>> out of the box for sections of a page (e.g., split by
>> anchors or h1/h2/h3). You would need to write a
>> parse-filter plugin to achieve this.
>>
>> I've once used it to index parts of a page identified
>> by XPath expressions.
>>
>> Best,
>> Sebastian
>>
>> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/parse/ParseResult.html
>> [2] https://nutch.apache.org/apidocs/apidocs-2.3.1/org/apache/nutch/parse/Parse.html
>>
>>
>> On 03/01/2018 08:02 AM, Yash Thenuan Thenuan wrote:
>> > Hi there,
>> > For example we have a url
>> > https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents
>> > here #Table_of_Contents is an internal link.
>> > I want to separate the contents of the page on the basis of internal
>> > links.
>> > Is this possible in nutch??
>> > I want to index the contents of each internal link separately.
>> >
>>
>>


Re: Regarding Internal Links

2018-03-05 Thread Yash Thenuan Thenuan
How can I achieve this in nutch 1.x?

On 1 Mar 2018 22:30, "Sebastian Nagel" <wastl.na...@googlemail.com> wrote:

> Hi,
>
> Yes, that's possible but only for Nutch 1.x:
> a ParseResult [1] may contain multiple ParseData objects
> each accessible by a separate URL.
> This feature is not available for 2.x [2].
>
> It's used by the feed parser plugin to add a single
> entry for every feed item.  Afaik, that's not supported
> out of the box for sections of a page (e.g., split by
> anchors or h1/h2/h3). You would need to write a
> parse-filter plugin to achieve this.
>
> I've once used it to index parts of a page identified
> by XPath expressions.
>
> Best,
> Sebastian
>
> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/parse/ParseResult.html
> [2] https://nutch.apache.org/apidocs/apidocs-2.3.1/org/apache/nutch/parse/Parse.html
>
>
> On 03/01/2018 08:02 AM, Yash Thenuan Thenuan wrote:
> > Hi there,
> > For example we have a url
> > https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents
> > here #Table_of_Contents is an internal link.
> > I want to separate the contents of the page on the basis of internal
> > links.
> > Is this possible in nutch??
> > I want to index the contents of each internal link separately.
> >
>
>


Re: Crawling of AJAX populated content.

2018-03-05 Thread Yash Thenuan Thenuan
Is there a way to fetch https websites using selenium?

On 5 Mar 2018 14:10, "Sebastian Nagel"  wrote:

> > What will happen if I try to crawl a https website.
>
> I didn't try it, but I would expect that
> - if except protocol-selenium no other protocol plugins are active:
>   fetching fails (as reported in NUTCH-2310)
> - if another protocol plugin is active which supports https:
>   Fetcher will use it to fetch https content
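So one way to keep https crawlable while experimenting is to leave a second
protocol plugin enabled alongside protocol-selenium. A hedged nutch-site.xml
sketch; whether the two protocol plugins coexist cleanly depends on your
version and setup:

  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>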
>
>
> On 03/05/2018 09:35 AM, narendra singh arya wrote:
> > I am using the one you told.
> > Now my question is: after specifying protocol-selenium as the initial
> > fetcher, what will happen if I try to crawl an https website?
> > And what will happen if I don't set up Selenium and try to crawl a
> > website? Because it's not throwing any error.
> >
> > On Mon, 5 Mar 2018, 13:59 Sebastian Nagel, 
> > wrote:
> >
> >> Hi,
> >>
> >> it is not used as Fetcher but Fetcher will use it if it fetches content
> >> via http.
> >> If not used at all, it's likely a configuration issue (plugin.includes)
> or
> >> an unsupported protocol (that's true for https, see NUTCH-2310).
> >>
> >> Just to confirm: are you really using
> >>   https://github.com/momer/nutch-selenium-grid-plugin
> >> instead of protocol-selenium which is part of Nutch?
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 03/05/2018 09:00 AM, narendra singh arya wrote:
> >>> How can I know that protocol-selenium is used as the Fetcher? Because I
> >>> don't think, after going through all the steps, that it is being used
> >>> at all.
> >>>
> >>> On Fri, 2 Mar 2018, 18:28 narendra singh arya,  wrote:
> >>>
>  I want to crawl AJAX-populated content using nutch.
>  I tried this with the selenium-grid plugin on nutch 1.14.
>  After following all the steps from the github page
>  nutch-selenium-grid-plugin, I am not able to fetch the AJAX-loaded
>  content.
>  I have a docker-selenium hub and node running on my mac.
>  But I am still not able to fetch the AJAX-loaded content.
>  Help regarding any version of nutch will be appreciated.
>  Thanks
> 
> >>>
> >>
> >>
> >
>
>


Re: Regarding Indexing to elasticsearch

2018-03-02 Thread Yash Thenuan Thenuan
  mapreduce.Job - Job job_local1792747860_0001
completed successfully
2018-03-02 17:29:53,849 INFO  mapreduce.Job - Counters: 15
File System Counters
FILE: Number of bytes read=610359
FILE: Number of bytes written=891634
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=79
Map output records=0
Input split bytes=995
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=103
Total committed heap usage (bytes)=225443840
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
2018-03-02 17:29:53,866 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:53,866 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port  (default 9200)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


2018-03-02 17:29:53,925 INFO  indexer.IndexingJob - IndexingJob: done.


On Fri, Mar 2, 2018 at 3:08 PM, Sebastian Nagel <wastl.na...@googlemail.com>
wrote:

> Hi,
>
> It looks more like there is nothing to index.
>
> Unfortunately, in 2.x there are no log messages
> on by default which indicate how many documents
> are sent to the index back-ends.
>
> The easiest way is to enable Job counters in
> conf/log4j.properties by adding the line:
>
>  log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>
> or setting the level to INFO for
>
>  log4j.logger.org.apache.hadoop=WARN
>
> Make sure the log4j.properties is correctly deployed
> (if in doubt, run "ant runtime"). Then check the hadoop.log
> again: there should be a counter DocumentCount with non-zero
> value.
>
> Best,
> Sebastian
>
>
> On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
> > Following are the logs from hadoop.log
> >
> > 2018-03-02 11:18:45,220 INFO  indexer.IndexingJob - IndexingJob: starting
> > 2018-03-02 11:18:45,791 WARN  util.NativeCodeLoader - Unable to load
> > native-hadoop library for your platform... using builtin-java classes
> where
> > applicable
> > 2018-03-02 11:18:46,138 INFO  basic.BasicIndexingFilter - Maximum title
> > length for indexing set to: -1
> > 2018-03-02 11:18:46,138 INFO  indexer.IndexingFilters - Adding
> > org.apache.nutch.indexer.basic.BasicIndexingFilter
> > 2018-03-02 11:18:46,140 INFO  anchor.AnchorIndexingFilter - Anchor
> > deduplication is: off
> > 2018-03-02 11:18:46,140 INFO  indexer.IndexingFilters - Adding
> > org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2018-03-02 11:18:46,157 INFO  indexer.IndexingFilters - Adding
> > org.apache.nutch.indexer.metadata.MetadataIndexer
> > 2018-03-02 11:18:46,535 INFO  indexer.IndexingFilters - Adding
> > org.apache.nutch.indexer.more.MoreIndexingFilter
> > 2018-03-02 11:18:48,663 WARN  conf.Configuration -
> > file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.
> staging/job_local1100834069_0001/job.xml:an
> > attempt to override final parameter:
> > mapreduce.job.end-notification.max.retry.interval;  Ignoring.
> > 2018-03-02 11:18:48,666 WARN  conf.Configuration -
> > file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.
> staging/job_local1100834069_0001/job.xml:an
> > attempt to override final parameter:
> > mapreduce.job.end-notification.max.attempts;  Ignoring.
> > 2018-03-02 11:18:48,792 WARN  conf.Configuration -
> > file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_
> local1100834069_0001/job_local1100834069_0001.xml:an
> > attempt to override final parameter:
> > mapreduce.job.end-notification.max.retry.interval;  Ignoring.
> > 2018-03-02 11:18:48,798 WARN  conf.Configuration -
> > file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_
> local1100834069_0001/job_local1100834069_0001.xml:an
> > attempt to override final parameter:
> > mapreduce.job.end-notification.max.attempts;  Ignoring.
> > 2018-03-02 11:18:49,093 INFO  indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> > 2018-03-02 11:18:54,737 INFO  basic.BasicIndexingFilter - Maximum title
> > length for indexing set to: -1
> > 2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
> > org.apache.nutch.indexer.basic.BasicIndexingFilter
> > 2018-03-02 11:18:54,737 INFO  anchor.AnchorIndexingFilter - Anchor
> > deduplication is: off
> > 2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
> > org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2018-0

Re: Regarding Indexing to elasticsearch

2018-03-01 Thread Yash Thenuan Thenuan
Following are the logs from hadoop.log

2018-03-02 11:18:45,220 INFO  indexer.IndexingJob - IndexingJob: starting
2018-03-02 11:18:45,791 WARN  util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2018-03-02 11:18:46,138 INFO  basic.BasicIndexingFilter - Maximum title
length for indexing set to: -1
2018-03-02 11:18:46,138 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 11:18:46,140 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-02 11:18:46,140 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 11:18:46,157 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 11:18:46,535 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 11:18:48,663 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2018-03-02 11:18:48,666 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2018-03-02 11:18:48,792 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2018-03-02 11:18:48,798 WARN  conf.Configuration -
file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an
attempt to override final parameter:
mapreduce.job.end-notification.max.attempts;  Ignoring.
2018-03-02 11:18:49,093 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 11:18:54,737 INFO  basic.BasicIndexingFilter - Maximum title
length for indexing set to: -1
2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 11:18:54,737 INFO  anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 11:18:54,738 INFO  indexer.IndexingFilters - Adding
org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 11:18:56,883 INFO  indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 11:18:56,884 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
elastic.cluster : elastic prefix cluster
elastic.host : hostname
elastic.port : port  (default 9200)
elastic.index : elastic index command
elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


2018-03-02 11:18:56,939 INFO  indexer.IndexingJob - IndexingJob: done.


On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <wastl.na...@googlemail.com
> wrote:

> It's impossible to find the reason from console output.
> Please check the hadoop.log, it should contain more logs
> including those from ElasticIndexWriter.
>
> Sebastian
>
> On 03/01/2018 06:38 AM, Yash Thenuan Thenuan wrote:
> > Hi Sebastian, all of this is coming, but the problem is, the content is
> > not sent. Nothing is indexed to ES.
> > This is the output on debug level.
> >
> > ElasticIndexWriter
> >
> > elastic.cluster : elastic prefix cluster
> >
> > elastic.host : hostname
> >
> > elastic.port : port  (default 9200)
> >
> > elastic.index : elastic index command
> >
> > elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >
> > elastic.max.bulk.size : elastic bulk index length. (default 2500500
> ~2.5MB)
> >
> >
> > no modules loaded
> >
> > loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
> >
> > loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
> >
> > loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
> >
> > loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
> >
> > loaded plugin [org.elasticsearch.transport.Netty4Plugin]
> >
> > created thread pool: name [force_merge], size [1], queue size [unbounded]
> >
> > created thread pool: name [fetch_shard_started], core [1], max [8], keep
> > alive [5m]
> >
> > created thread pool: name [listener], size [2], queue size [unbounded]
> >
> > crea

Regarding Internal Links

2018-02-28 Thread Yash Thenuan Thenuan
Hi there,
For example, we have a URL
https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents
here #Table_of_Contents is an internal link.
I want to separate the contents of the page on the basis of internal links.
Is this possible in nutch??
I want to index the contents of each internal link separately.


Re: Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
Hi Sebastian, all of this is coming, but the problem is, the content is not
sent. Nothing is indexed to ES.
This is the output on debug level.

ElasticIndexWriter

elastic.cluster : elastic prefix cluster

elastic.host : hostname

elastic.port : port  (default 9200)

elastic.index : elastic index command

elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)

elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)


no modules loaded

loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]

loaded plugin [org.elasticsearch.join.ParentJoinPlugin]

loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]

loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]

loaded plugin [org.elasticsearch.transport.Netty4Plugin]

created thread pool: name [force_merge], size [1], queue size [unbounded]

created thread pool: name [fetch_shard_started], core [1], max [8], keep
alive [5m]

created thread pool: name [listener], size [2], queue size [unbounded]

created thread pool: name [index], size [4], queue size [200]

created thread pool: name [refresh], core [1], max [2], keep alive [5m]

created thread pool: name [generic], core [4], max [128], keep alive [30s]

created thread pool: name [warmer], core [1], max [2], keep alive [5m]

thread pool [search] will adjust queue by [50] when determining automatic
queue size

created thread pool: name [search], size [7], queue size [1k]

created thread pool: name [flush], core [1], max [2], keep alive [5m]

created thread pool: name [fetch_shard_store], core [1], max [8], keep
alive [5m]

created thread pool: name [management], core [1], max [5], keep alive [5m]

created thread pool: name [get], size [4], queue size [1k]

created thread pool: name [bulk], size [4], queue size [200]

created thread pool: name [snapshot], core [1], max [2], keep alive [5m]

node_sampler_interval[5s]

adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{
127.0.0.1:9300}]

connected to node
[{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{
127.0.0.1:9300}]

IndexingJob: done
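Note that the output above only shows the transport client connecting to the
node, which says nothing about documents being written. A quick sanity check
against ES itself (index name assumed to be 'nutch', i.e. whatever
elastic.index is set to):

curl 'http://localhost:9200/_cat/indices?v'
curl 'http://localhost:9200/nutch/_count'

If the index is missing or the count stays at 0, the writer was never handed
any documents.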


On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <
wastl.na...@googlemail.com> wrote:

> I never tried ES with Nutch 2.3, but the setup should be similar to 1.x:
>
> - enable the plugin "indexer-elastic" in plugin.includes
>   (upgrade and rename to "indexer-elastic2" in 2.4)
>
> - expects ES 1.4.1
>
> - available/required options are found in the log file (hadoop.log):
>ElasticIndexWriter
> elastic.cluster : elastic prefix cluster
> elastic.host : hostname
> elastic.port : port  (default 9300)
> elastic.index : elastic index command
> elastic.max.bulk.docs : elastic bulk index doc counts. (default
> 250)
> elastic.max.bulk.size : elastic bulk index length. (default
> 2500500 ~2.5MB)
>
> Sebastian
>
> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
> > Yeah,
> > I was also thinking that.
> > Can somebody help me with Nutch 2.3?
> >
> > On 28 Feb 2018 17:53, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >
> >> Sorry, I just realized that you're using Nutch 2.x and I'm answering for
> >> Nutch 1.x. I'm afraid I can't help you.
> >>
> >>> -Original Message-
> >>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> >>> Sent: 28 February 2018 14:20
> >>> To: user@nutch.apache.org
> >>> Subject: RE: Regarding Indexing to elasticsearch
> >>>
> >>> IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
> >>> This is the output of nutch index; I have already configured the
> >>> nutch-site.xml.
> >>>
> >>> On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>
> >>>> I suggest you run "nutch index", take a look at the returned help
> >>>> message, and continue from there.
> >>>> Broadly, first of all you need to configure your elasticsearch
> >>>> environment in nutch-site.xml, and then you need to run nutch index
> >>>> with the location of your CrawlDB and either the segment you want to
> >>>> index or the directory that contains all the segments you want to
> >> index.
> >>>>
> >>>>> -Original Message-
> >>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> >>>>> Sent: 28 February 2018 14:06
> >>>>> To: user@nutch.apache.org
> >>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>
> >>>>> All I want is to index my parsed data to Elasticsearch.
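To make Sebastian's checklist concrete, the 1.x setup amounts to something
like this in nutch-site.xml (a sketch only: the value below is the stock
plugin list with indexer-elastic swapped in for indexer-solr; in 2.4 the
plugin is renamed indexer-elastic2, as he notes):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

The elastic.* options from the log go alongside it, as shown earlier in the
thread.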

RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
Yeah,
I was also thinking that.
Can somebody help me with Nutch 2.3?

On 28 Feb 2018 17:53, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

> Sorry, I just realized that you're using Nutch 2.x and I'm answering for
> Nutch 1.x. I'm afraid I can't help you.
>
> > -Original Message-
> > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > Sent: 28 February 2018 14:20
> > To: user@nutch.apache.org
> > Subject: RE: Regarding Indexing to elasticsearch
> >
> > IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
> > This is the output of nutch index; I have already configured the
> > nutch-site.xml.
> >
> > On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >
> > > I suggest you run "nutch index", take a look at the returned help
> > > message, and continue from there.
> > > Broadly, first of all you need to configure your elasticsearch
> > > environment in nutch-site.xml, and then you need to run nutch index
> > > with the location of your CrawlDB and either the segment you want to
> > > index or the directory that contains all the segments you want to
> index.
> > >
> > > > -Original Message-
> > > > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > > > Sent: 28 February 2018 14:06
> > > > To: user@nutch.apache.org
> > > > Subject: RE: Regarding Indexing to elasticsearch
> > > >
> > > > All I want is to index my parsed data to Elasticsearch.
> > > >
> > > >
> > > > On 28 Feb 2018 17:34, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> > > >
> > > > Hi Yash,
> > > >
> > > > The nutch index command does not have a -all flag, so I'm not sure
> > > > what you're trying to achieve here.
> > > >
> > > > Yossi.
> > > >
> > > > > -Original Message-
> > > > > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > > > > Sent: 28 February 2018 13:55
> > > > > To: user@nutch.apache.org
> > > > > Subject: Regarding Indexing to elasticsearch
> > > > >
> > > > > Can somebody please tell me what happens when we hit the bin/nutch
> > > > > index -all command.
> > > > > Because I can't figure out why the write function inside the
> > > > > elastic-indexer is not getting executed.
> > >
> > >
>
>


RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
This is the output of nutch index; I have already configured the
nutch-site.xml.

On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

> I suggest you run "nutch index", take a look at the returned help message,
> and continue from there.
> Broadly, first of all you need to configure your elasticsearch environment
> in nutch-site.xml, and then you need to run nutch index with the location
> of your CrawlDB and either the segment you want to index or the directory
> that contains all the segments you want to index.
>
> > -Original Message-
> > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > Sent: 28 February 2018 14:06
> > To: user@nutch.apache.org
> > Subject: RE: Regarding Indexing to elasticsearch
> >
> > All I want is to index my parsed data to Elasticsearch.
> >
> >
> > On 28 Feb 2018 17:34, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >
> > Hi Yash,
> >
> > The nutch index command does not have a -all flag, so I'm not sure what
> > you're trying to achieve here.
> >
> > Yossi.
> >
> > > -Original Message-
> > > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > > Sent: 28 February 2018 13:55
> > > To: user@nutch.apache.org
> > > Subject: Regarding Indexing to elasticsearch
> > >
> > > Can somebody please tell me what happens when we hit the bin/nutch
> > > index -all command.
> > > Because I can't figure out why the write function inside the
> > > elastic-indexer is not getting executed.
>
>
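For the 1.x workflow Yossi describes, the invocation looks roughly like this
(paths are placeholders for a local crawl directory, and exact flags vary a
little between 1.x releases):

bin/nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/20180228123456
# or index every segment under the directory:
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments

Under 2.x the job is keyed by batch id instead, i.e.
bin/nutch index (<batchId> | -all | -reindex) [-crawlId <id>], which matches
the usage line quoted above.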


RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
All I want is to index my parsed data to Elasticsearch.


On 28 Feb 2018 17:34, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:

Hi Yash,

The nutch index command does not have a -all flag, so I'm not sure what
you're trying to achieve here.

Yossi.

> -Original Message-
> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> Sent: 28 February 2018 13:55
> To: user@nutch.apache.org
> Subject: Regarding Indexing to elasticsearch
>
> Can somebody please tell me what happens when we hit the bin/nutch index
> -all command.
> Because I can't figure out why the write function inside the
> elastic-indexer is not getting executed.


Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
Can somebody please tell me what happens when we hit the bin/nutch index
-all command.
Because I can't figure out why the write function inside the
elastic-indexer is not getting executed.
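One way to confirm whether write() is ever reached is to turn up logging for
the indexer classes in conf/log4j.properties and re-run, then read
logs/hadoop.log. A sketch (logger names assumed from the indexer plugin
packages; adjust to your version):

log4j.logger.org.apache.nutch.indexer=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.indexwriter.elastic=DEBUG,cmdstdout

If nothing is logged from the writer, a common cause is that no documents
survive parsing or the indexing filters, in which case write() is never
called at all.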