Which was exactly what I suggested.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jun 1, 2017, at 3:31 PM, David Choi <choi.davi...@gmail.com> wrote:
>
> In the meantime, I have found that a better solution for now is to test on
> a site that allows users to crawl it.
>
> On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.davi...@gmail.com> wrote:
>
>> I think you misunderstand; the argument was about stealing content. Sorry,
>> but I think you need to read what people write before making bold
>> statements.
>>
>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>>
>>> Let's not get snarky right away, especially when you are wrong.
>>>
>>> Corporations do not generally ignore robots.txt. I worked on a commercial
>>> web spider for ten years. Occasionally, our customers did need to bypass
>>> portions of robots.txt. That was usually because of a poorly-maintained web
>>> server, or because our spider could safely crawl some content that would
>>> cause problems for other crawlers.
>>>
>>> If you want to learn crawling, don't start by breaking the conventions of
>>> good web citizenship. Instead, start with sitemap.xml and crawl the
>>> preferred portions of a site.
>>>
>>> https://www.sitemaps.org/index.html
>>>
>>> If the site blocks you, find a different site to learn on.
>>>
>>> I like the looks of "Scrapy", written in Python. I haven't used it for
>>> anything big, but I'd start with that for learning.
>>>
>>> https://scrapy.org/
>>>
>>> If you want to learn on a site with a lot of content, try ours, chegg.com.
>>> But if your crawler gets out of hand, crawling too fast, we'll block it.
>>> Any other site will do the same.
>>>
>>> I would not base the crawler directly on Solr. A crawler needs a
>>> dedicated database to record the URLs visited, errors, duplicates, etc. The
>>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>>> (before Solr existed).
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>
>>>> On Jun 1, 2017, at 3:01 PM, David Choi <choi.davi...@gmail.com> wrote:
>>>>
>>>> Oh well, I guess it's OK if a corporation does it, but not someone wanting
>>>> to learn more about the field. I have actually written a crawler before,
>>>> as well as, you know, the inverted index that underlies how Solr works,
>>>> but I just thought its architecture was better suited for scaling.
>>>>
>>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com> wrote:
>>>>
>>>>> And I mean that in the context of stealing content from sites that
>>>>> explicitly declare they don't want to be crawled. Robots.txt is to be
>>>>> followed.
>>>>>
>>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I was wondering if anyone could guide me on how to crawl the web and
>>>>>> ignore robots.txt, since I can not index some big sites. Or if someone
>>>>>> could point out how to get around it. I read somewhere about a
>>>>>> protocol.plugin.check.robots
>>>>>> setting, but that was for Nutch.
>>>>>>
>>>>>> The way I index is
>>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>>>>>
>>>>>> but I can't index that site, I'm guessing because of its robots.txt.
>>>>>> I can index with
>>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>>>>>
>>>>>> which I am guessing allows it. I was also wondering how to find the
>>>>>> name of the crawler bin/post uses.
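
For anyone following along, here is a rough sketch of the "start from sitemap.xml and be a good citizen" approach using Scrapy's SitemapSpider. The spider name, the example.com sitemap URL, and the CSS selector are placeholders I made up for illustration, not anything from this thread.

# Minimal Scrapy sitemap crawler sketch (placeholder names and URLs).
from scrapy.spiders import SitemapSpider

class LearningSpider(SitemapSpider):
    name = "learning_spider"                            # placeholder spider name
    sitemap_urls = ["https://example.com/sitemap.xml"]  # use a site that permits crawling

    custom_settings = {
        "ROBOTSTXT_OBEY": True,   # honor robots.txt
        "DOWNLOAD_DELAY": 1.0,    # be polite; fast crawlers get blocked
        "USER_AGENT": "learning-crawler (contact: you@example.com)",
    }

    def parse(self, response):
        # Emit a minimal document per page; the fields are illustrative.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }

Something like "scrapy runspider learning_spider.py -o pages.json" will run it and write the extracted documents to a file, which you can then send to Solr separately.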
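And to illustrate the "dedicated database for crawl state, output goes to Solr" point: a bare-bones sketch of the bookkeeping side. The SQLite schema and helper names are my own invention; only the gettingstarted collection name and the default Solr port come from the examples earlier in this thread.

# Sketch: crawl state lives in SQLite; only extracted documents go to Solr.
import sqlite3
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/gettingstarted/update/json/docs?commit=true"

conn = sqlite3.connect("crawl_state.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS urls (
           url    TEXT PRIMARY KEY,
           status INTEGER,   -- HTTP status, NULL if not fetched yet
           error  TEXT
       )"""
)

def already_seen(url):
    # Skip URLs the crawler has already recorded (duplicates).
    return conn.execute("SELECT 1 FROM urls WHERE url = ?", (url,)).fetchone() is not None

def record(url, status, error=None):
    # Record visited URLs, errors, etc. in the crawler's own database.
    conn.execute("INSERT OR REPLACE INTO urls (url, status, error) VALUES (?, ?, ?)",
                 (url, status, error))
    conn.commit()

def index_document(doc):
    # Only the output of the crawl is sent to Solr.
    requests.post(SOLR_UPDATE_URL, json=doc, timeout=10).raise_for_status()

Keeping the frontier and error tracking out of Solr keeps the index clean and makes re-crawls and retries much easier to manage.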