RE: [ANNOUNCE] Web Crawler

2013-07-15 Thread Ramakrishna
So, is there no way to crawl a web site if its owner has blocked crawling? I have one idea, though it seems a little foolish (it may not work, or I would have to modify the whole architecture), but I'll mention it anyway: what if I use an HTML parser (Jsoup) instead of the fetcher? An HTML parser easily takes all the contents of a web page. Can I

RE: [ANNOUNCE] Web Crawler

2013-07-15 Thread karl.wright
To: java-user@lucene.apache.org Subject: Re: [ANNOUNCE] Web Crawler Lucene does not provide any capabilities for crawling websites. You would have to contact the Nutch project, the ManifoldCF project, or other web crawling projects. As far as bypassing robots.txt, that is a very unethical thing

Re: [ANNOUNCE] Web Crawler

2013-07-15 Thread Jack Krupansky
anybody on this mailing list would engage in such an unethical or unprofessional activity. -- Jack Krupansky -----Original Message----- From: Ramakrishna Sent: Monday, July 15, 2013 9:13 AM To: java-user@lucene.apache.org Subject: Re: [ANNOUNCE] Web Crawler Hi.. I'm trying nutch to

Re: [ANNOUNCE] Web Crawler

2013-07-15 Thread Ramakrishna
Hi, I'm trying to use Nutch to crawl some web sites. Unfortunately, their owners have restricted crawling through robots.txt. By using Crawl-anywhere, can I crawl any web site irrespective of that site's robots.txt? If yes, please send me materials/links for studying Crawl-anywhere, or else p

Re: [ANNOUNCE] Web Crawler

2011-05-27 Thread Dominique Bejean
Hi, Sorry for the delay, but I haven't checked the mailing list for a long time. Crawl-anywhere includes three pieces of software: a crawler, a pipeline, and a Solr indexer. There is a default Solr schema used by Crawl-anywhere, tested with Solr 1.4.1 and Solr 3.1.0. But, you can config

Re: [ANNOUNCE] Web Crawler

2011-05-16 Thread abhayd
Hi Julien, I'm not sure what you mean by "SOLR is now used by default for indexing in Nutch." Does that mean Solr has integrated Nutch for crawling web resources? I checked the Solr wiki but didn't see anything like that. Could you please provide some details?

Re: [ANNOUNCE] Web Crawler

2011-05-16 Thread Julien Nioche
> I don't see any activity on the Nutch wiki, so I'm wondering if it's not being
> developed anymore. But most forums say Nutch is standard for Solr.
Looking at the mail archives is a good way to tell whether a project is still alive or not. In the case of Nutch, the project is active as you can see on the l

RE: [ANNOUNCE] Web Crawler

2011-05-15 Thread karl.wright
You might want to look at ManifoldCF also. Karl -----Original Message----- From: ext abhayd [mailto:ajdabhol...@hotmail.com] Sent: Saturday, May 14, 2011 9:29 AM To: java-user@lucene.apache.org Subject: Re: [ANNOUNCE] Web Crawler hi Dominique, I am looking for a crawler to feed solr index

Re: [ANNOUNCE] Web Crawler

2011-05-15 Thread abhayd
Hi Dominique, I am looking for a crawler to feed a Solr index. After looking at various posts, I have settled on two: Nutch and Crawl Anywhere. I don't see any activity on the Nutch wiki, so I'm wondering if it's no longer being developed. But most forums say Nutch is the standard for Solr. Crawl Anywhere