Re: Re[2]: Siet is not crawling

2023-08-13 Thread Markus Jelsma
Hello Raj,

I see. Unfortunately, enabling a JavaScript-capable protocol plugin such as
protocol-htmlunit or protocol-selenium does not always solve the problem.

Maybe you can ask at the Selenium project about this. They are the experts
on that particular problem.

Regards,
Markus

On Tue, 1 Aug 2023 at 19:38, Raj Chidara wrote:

> Hello Markus,
> I have now removed all other protocol-* plugins and kept only
> protocol-selenium. It now crawls a few pages, but no content is read from
> them: every page shows only the text *Home*.
>
> Thanks and Regards
> Raj Chidara


Re: Re[2]: Siet is not crawling

2023-08-01 Thread Raj Chidara
Hello Markus

I have now removed all other protocol-* plugins and kept only protocol-selenium.
It now crawls a few pages, but no content is read from them: every page shows
only the text Home.

Thanks and Regards

Raj Chidara

On Mon, 30 Jan 2023 18:35:06 +0530, Markus Jelsma wrote:

> Yes, remove the other protocol-* plugins from the configuration. With all
> three active it is not always determined which one is going to do the work.

Re: Re[2]: Siet is not crawling

2023-01-30 Thread Steven Zhu
Already unsubscribed. Why do I still get this email?
Thanks

Steven

On Mon, Jan 30, 2023 at 7:06 AM Markus Jelsma wrote:

> Yes, remove the other protocol-* plugins from the configuration. With all
> three active it is not always determined which one is going to do the work.


Re: Re[2]: Siet is not crawling

2023-01-30 Thread Markus Jelsma
Yes, remove the other protocol-* plugins from the configuration. With all
three active it is not always determined which one is going to do the work.
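
The advice above can be sketched in a few lines of Python. This is only an illustration, not part of Nutch: the helper name `keep_single_protocol` is hypothetical, and the input string is the plugin.includes value quoted in this thread. It splits the value on top-level `|` characters only, so grouped alternatives such as `urlfilter-(regex|validator)` stay intact, and then drops every `protocol-*` entry except the one to keep.

```python
def keep_single_protocol(plugin_includes: str, keep: str = "protocol-selenium") -> str:
    """Return the plugin.includes value with only one protocol-* entry left.

    Hypothetical helper for illustration; it does not add `keep` if it is
    missing from the input, it only removes the other protocol-* entries.
    """
    parts, current, depth = [], [], 0
    for ch in plugin_includes:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        # Split only on '|' outside parentheses.
        if ch == "|" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    kept = [p for p in parts if not p.startswith("protocol-") or p == keep]
    return "|".join(kept)


# Value as posted in this thread.
original = (
    "protocol-http|protocol-httpclient|protocol-selenium|"
    "urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|"
    "indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)"
)
print(keep_single_protocol(original))
```

The printed value starts with `protocol-selenium` and no longer lists protocol-http or protocol-httpclient, so only one protocol plugin is left to handle the fetches.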

On Mon, 30 Jan 2023 at 12:50, Raj Chidara wrote:

>
> Hello Markus,
> Sorry for the duplicate question. I added the selenium plugin in
> conf/nutch-default.xml and included the following:
>
> plugin.includes
>
> protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> The site is still not being crawled. Are there any additional steps required
> to install Selenium? Please advise.
>
>
> Thanks and Regards
>
> Raj Chidara


Re[2]: Siet is not crawling

2023-01-30 Thread Raj Chidara


Hello Markus
Sorry for the duplicate question. I added the selenium plugin in
conf/nutch-default.xml and included the following:

plugin.includes

protocol-http|protocol-httpclient|protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)

The site is still not being crawled. Are there any additional steps required
to install Selenium? Please advise.


Thanks and Regards

Raj Chidara

- Original Message -
From: Markus Jelsma (markus.jel...@openindex.io)
Date: 30-01-2023 16:26
To: user@nutch.apache.org
Subject: Re: Siet is not crawling

Hello Raj,

I think the same question about the same site was asked here some time ago.
Anyway, this site loads its content via Javascript. You will need a
protocol plugin that supports it, either protocol-htmlunit, or
protocol-selenium, instead of protocol-http or any other.

Change the configuration for plugin.includes, and it should work.

Markus

On Mon, 30 Jan 2023 at 10:39, Raj Chidara wrote:

>
> Hello,
>
> Nutch is not able to crawl this site. Are there any Nutch configuration
> changes required for this site?
>
> https://www.ich.org/
>
>
> Thanks and Regards
>
> Raj Chidara
>
>
>
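
For reference, the plugin.includes change discussed in this thread could be written out as a Hadoop-style property entry. The sketch below is an illustration under assumptions: the helper name `plugin_includes_property` is hypothetical, and it assumes the override goes into conf/nutch-site.xml (the usual place for local overrides) rather than conf/nutch-default.xml as mentioned in the thread. It does not cover installing or configuring Selenium itself.

```python
import xml.etree.ElementTree as ET


def plugin_includes_property(value: str) -> str:
    """Build a <property> element for plugin.includes as an XML string.

    Hypothetical helper for illustration; the output is meant to be pasted
    inside the <configuration> element of the Nutch site configuration.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "plugin.includes"
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Only one protocol plugin is listed, as advised in the thread.
value = (
    "protocol-selenium|urlfilter-(regex|validator)|parse-(html|tika)|"
    "index-(basic|anchor)|indexer-solr|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)
print(plugin_includes_property(value))
```

Building the element with ElementTree instead of string concatenation keeps the output well-formed even if the value later contains characters that need escaping.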