Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Yash Thenuan Thenuan
You can try checking robots.txt for these websites.

On Wed, 14 Nov 2018, 16:00 Yash Thenuan Thenuan wrote:
> Most probably the problem is that these websites allow only some specific
> crawlers in their robots.txt file.
>
> On Wed, 14 Nov 2018, 15:56 Semyon Semyonov wrote:
>
>> Hi Nich
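The robots.txt suggestion above can be checked offline. The following is a minimal sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `example.com` URL, and the agent name `MyNutchCrawler` are hypothetical placeholders, not taken from the affected sites. It only shows how a file that admits one named crawler rejects every other agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed everywhere,
# every other user agent is blocked from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/page.html"  # placeholder URL
    for agent in ("Googlebot", "MyNutchCrawler"):  # second name is made up
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

To test a real site, download its robots.txt and pass the text to `is_allowed` together with the agent name your crawler actually sends.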

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Yash Thenuan Thenuan
Most probably the problem is that these websites allow only some specific crawlers in their robots.txt file.

On Wed, 14 Nov 2018, 15:56 Semyon Semyonov wrote:
> Hi Nicholas,
>
> I have the same problem with https://www.graydon.nl/
> And it doesn't look like a wordpress website.
>
> Semyon
>
>
> Sent:

Re: Alternatives to Solr

2018-10-05 Thread Yash Thenuan Thenuan
You can use Elasticsearch.

On Sat, 6 Oct 2018, 00:58 Timeka Cobb wrote:
> Hello folks! Does anyone know of a good alternative to Solr? I'm asking this
> because I've been trying to connect the 2 and it's been so frustrating.
> The Nutch Wiki is extremely unreliable when it comes to Solr and

Re: Having plugin as a separate project

2018-05-07 Thread Yash Thenuan Thenuan
> > > in the Nutch "root" folder.
> > >
> > > A little bit slower but guarantees that everything is compiled:
> > >
> > > ant -Dplugin=urlnormalizer-basic test-plugin
> > >
> > > Or sometimes it's enough to skip some of the long runn

Having plugin as a separate project

2018-05-04 Thread Yash Thenuan Thenuan
Hi all, I want to compile my plugins separately so that I need not compile the whole project again when I make a change in some plugin. How can I achieve that? Thanks

RE: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
> eally recommend debugging in local mode rather than using sysout.
>
> > -----Original Message-----
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 15 March 2018 10:13
> > To: user@nutch.apache.org
> > Subject: RE: RE: Dependency between pl

RE: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
> ent as their fourth parameter.
>
> > -----Original Message-----
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 15 March 2018 08:50
> > To: user@nutch.apache.org
> > Subject: Re: RE: Dependency between plugins
> >
> > Hi Jorge and Yos

Re: RE: Dependency between plugins

2018-03-15 Thread Yash Thenuan Thenuan
> ct. The advantage of the dispatcher
> approach is that you don't need to deal with a lot of the Nutch overhead,
> but it is more monolithic (You can end up with one huge plugin that needs
> to be constantly modified whenever one of the websites is modified).
>
> > -Original Mes

Re: RE: Dependency between plugins

2018-03-14 Thread Yash Thenuan Thenuan
> .org
> Subject: RE: Dependency between plugins
> One suggestion I can make is to ensure that the html-parse plugin is built
> before your plugin (since you are including the jars that are generated in
> its build).
>
> > -----Original Message-----
> > From: Yash Thenuan The

Re: Dependency between plugins

2018-03-14 Thread Yash Thenuan Thenuan
> ur plugin-includes?
>
> If it's a problem during execution, I would suggest looking at or
> debugging the code of PluginClassLoader.
>
> > -----Original Message-----
> > From: Yash Thenuan Thenuan <rit2014...@iiita.ac.in>
> > Sent: 14 March 2018 08:34

Re: Dependency between plugins

2018-03-14 Thread Yash Thenuan Thenuan
Could anybody please help me out with this?

On Tue, Mar 13, 2018 at 6:51 PM, Yash Thenuan Thenuan <rit2014...@iiita.ac.in> wrote:
> I am trying to import Htmlparser in my custom parser.
> I did it in the same way by which Htmlparser imports lib-nekohtml but it
> didn't work

Dependency between plugins

2018-03-13 Thread Yash Thenuan Thenuan
I am trying to import Htmlparser in my custom parser. I did it the same way that Htmlparser imports lib-nekohtml, but it didn't work. Can anybody please tell me how to do it?

RE: Regarding Internal Links

2018-03-07 Thread Yash Thenuan Thenuan
> happen after NUTCH-2456. Could you open a Jira issue
> to address the problem? Thanks!
>
> As a quick work-around:
> - either disable scoring-opic while indexing
> - or check dbDatum for null in scoring-opic indexerScore(...)
>
> Thanks,
> Sebastian
>
> On 03/07/2018 11:13

Re: Regarding Internal Links

2018-03-07 Thread Yash Thenuan Thenuan
> om the original ParseResult makes sense, or just using the
> constructor that does not require it if you don't care about the metadata.
> This should all be easier to understand if you look at what the HTML
> Parser does with each of these fields.
>
> > -----Original Message-----

Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
> > is
> > Hadoop.
> >
> > Here we are, no pain, no gain.
> >
> > Sent: Tuesday, March 06, 2018 at 7:42 PM
> > From: "Eric Valencia" <ericlvalen...@gmail.com>
> > To: user@nutch.apache.org
> > Subject: Re: Need Tutorial on Nutch

Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
> Yes, I did try some of the tutorials actually but
> they seem to be missing the complete amount of steps required to
> successfully scrape in nutch.
>
> On Tue, Mar 6, 2018 at 10:37 AM Yash Thenuan Thenuan <rit2014...@iiita.ac.in> wrote:
>
> > I would suggest to start with t

Re: Need Tutorial on Nutch

2018-03-06 Thread Yash Thenuan Thenuan
I would suggest starting with the documentation on Nutch's website. It gives you an idea of how to start crawling. Apart from that, there are no proper tutorials as such. Just start crawling; if you get stuck somewhere, try to find something related to the problem on Google and the Nutch mailing list

RE: Regarding Internal Links

2018-03-06 Thread Yash Thenuan Thenuan
> I am able to get the content corresponding to each internal link by
> writing a parse filter plugin. Now I am not sure how to proceed
> further. How can I parse them as separate documents, and what should
> my ParseResult filter return?

Re: Regarding Internal Links

2018-03-05 Thread Yash Thenuan Thenuan
Please help me out regarding this. It's urgent.

On 5 Mar 2018 15:41, "Yash Thenuan Thenuan" <rit2014...@iiita.ac.in> wrote:
> How can I achieve this in Nutch 1.x?
>
> On 1 Mar 2018 22:30, "Sebastian Nagel" <wastl.na...@googlemail.com> wrote:
>
>>

Re: Regarding Internal Links

2018-03-05 Thread Yash Thenuan Thenuan
> se-filter plugin to achieve this.
>
> I've once used it to index parts of a page identified
> by XPath expressions.
>
> Best,
> Sebastian
>
> [1] https://nutch.apache.org/apidocs/apidocs-1.14/org/apache/nutch/parse/ParseResult.html
> [2] https://nutch.apache.org/apidocs/api

Re: Crawling of AJAX populated content.

2018-03-05 Thread Yash Thenuan Thenuan
Is there a way to fetch HTTPS websites using Selenium?

On 5 Mar 2018 14:10, "Sebastian Nagel" wrote:
> > What will happen if I try to crawl a https website.
>
> I didn't try it, but I would expect that
> - if except protocol-selenium no other protocol plugins are

Re: Regarding Indexing to elasticsearch

2018-03-02 Thread Yash Thenuan Thenuan
> ould be a counter DocumentCount with non-zero
> value.
>
> Best,
> Sebastian
>
>
> On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
> > Following are the logs from hadoop.log
> >
> > 2018-03-02 11:18:45,220 INFO indexer.IndexingJob - IndexingJob: starting

Re: Regarding Indexing to elasticsearch

2018-03-01 Thread Yash Thenuan Thenuan
Nagel <wastl.na...@googlemail.com> wrote:
> It's impossible to find the reason from console output.
> Please check the hadoop.log, it should contain more logs
> including those from ElasticIndexWriter.
>
> Sebastian
>
> On 03/01/2018 06:38 AM, Yash Thenuan Thenuan
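As a rough aid for the advice above to check hadoop.log, here is a small Python sketch that keeps only the log lines whose logger name contains a given substring. The line layout (timestamp, level, logger, " - ", message) is assumed from the single `indexer.IndexingJob` line quoted in this thread; the function name and the file path in the usage note are illustrative, and other Nutch log layouts may need a different pattern.

```python
import re

# Assumed layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger - message"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s+(?P<msg>.*)$"
)


def filter_log(lines, logger_substring):
    """Return (timestamp, level, logger, message) for matching loggers."""
    needle = logger_substring.lower()
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        # Lines that do not fit the assumed layout (e.g. stack traces) are skipped.
        if match and needle in match.group("logger").lower():
            hits.append(match.group("ts", "level", "logger", "msg"))
    return hits


if __name__ == "__main__":
    # The only sample line is the one quoted in this thread.
    sample = ["2018-03-02 11:18:45,220 INFO indexer.IndexingJob - IndexingJob: starting"]
    for hit in filter_log(sample, "indexer"):
        print(hit)
```

Against a real log, something like `filter_log(open("logs/hadoop.log"), "elastic")` would narrow the output to loggers with "elastic" in their name; the path is an assumption about a default local setup.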

Regarding Internal Links

2018-02-28 Thread Yash Thenuan Thenuan
Hi there,

For example, we have the URL https://wiki.apache.org/nutch/NutchTutorial#Table_of_Contents, where #Table_of_Contents is an internal link. I want to separate the contents of the page on the basis of internal links. Is this possible in Nutch? I want to index the contents of each internal link
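To make the question concrete, the sketch below splits one HTML page into sections keyed by internal anchor, producing one entry per `URL#anchor`. It is plain Python using the standard `html.parser` module, not Nutch code. It assumes anchors appear as `id` attributes on any element or as `name` attributes on `<a>` tags, and the class and function names are made up for illustration. Inside Nutch, equivalent logic would have to live in a parse filter plugin, as the replies in this thread suggest.

```python
from html.parser import HTMLParser


class AnchorSectionSplitter(HTMLParser):
    """Collect page text into buckets keyed by the most recent anchor."""

    def __init__(self):
        super().__init__()
        self.sections = {"": []}  # "" holds text seen before the first anchor
        self.current = ""

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Assumption: anchors are id="..." on any tag or name="..." on <a>.
        anchor = attributes.get("id") or (attributes.get("name") if tag == "a" else None)
        if anchor:
            self.current = anchor
            self.sections.setdefault(anchor, [])

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.sections[self.current].append(text)


def split_by_anchor(html: str, base_url: str) -> dict:
    """Map base_url (or base_url#anchor) to the text of that section."""
    splitter = AnchorSectionSplitter()
    splitter.feed(html)
    splitter.close()
    return {
        (f"{base_url}#{anchor}" if anchor else base_url): " ".join(parts)
        for anchor, parts in splitter.sections.items()
        if parts
    }


if __name__ == "__main__":
    # Hypothetical page fragment, not the real wiki markup.
    page = (
        "<h1>Tutorial</h1><p>Overview text.</p>"
        '<h2 id="Table_of_Contents">Table of Contents</h2><p>Listed steps.</p>'
    )
    for url, text in split_by_anchor(page, "https://wiki.apache.org/nutch/NutchTutorial").items():
        print(url, "->", text)
```

Each key of the returned dictionary could then be treated as a separate document to index, which is the behaviour the question asks about.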

Re: Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
> t (default 9300)
> elastic.index : elastic index command
> elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>
> Sebastian
>
> On 02/28/2018 01:26 PM

RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
> > -----Original Message-----
> > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > Sent: 28 February 2018 14:20
> > To: user@nutch.apache.org
> > Subject: RE: Regarding Indexing to elasticsearch
> >
> > IndexingJob ( | -all |-reindex) [-crawlId ] This is

RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
> ments you want to index.
>
> > -----Original Message-----
> > From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> > Sent: 28 February 2018 14:06
> > To: user@nutch.apache.org
> > Subject: RE: Regarding Indexing to elasticsearch
> >
> > All I wa

RE: Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
> Message-
> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> Sent: 28 February 2018 13:55
> To: user@nutch.apache.org
> Subject: Regarding Indexing to elasticsearch
>
> Can somebody please tell me what happens when we hit the bin/nutch index -all
> command.
> Becaus

Regarding Indexing to elasticsearch

2018-02-28 Thread Yash Thenuan Thenuan
Can somebody please tell me what happens when we run the bin/nutch index -all command? I can't figure out why the write function inside the elastic-indexer is not getting executed.