Which was exactly what I suggested.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
> On Jun 1, 2017, at 3:31 PM, David Choi <choi.davi...@gmail.com> wrote:
>
> In the meantime, I have found that a better solution for now is to test on
> a site that allows users to crawl it.
>
> On Thu, Jun 1, 2017 at 5:26 PM David Choi <choi.davi...@gmail.com> wrote:
>
>> I think you misunderstand; the argument was about stealing content. Sorry,
>> but I think you need to read what people write before making bold
>> statements.
>>
>> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>>
>>> Let's not get snarky right away, especially when you are wrong.
>>>
>>> Corporations do not generally ignore robots.txt. I worked on a commercial
>>> web spider for ten years. Occasionally, our customers did need to bypass
>>> portions of robots.txt. That was usually because of a poorly-maintained web
>>> server, or because our spider could safely crawl some content that would
>>> cause problems for other crawlers.
>>>
>>> If you want to learn crawling, don't start by breaking the conventions of
>>> good web citizenship. Instead, start with sitemap.xml and crawl the
>>> preferred portions of a site.
>>>
>>> https://www.sitemaps.org/index.html
>>>
>>> If the site blocks you, find a different site to learn on.
>>>
>>> I like the looks of "Scrapy", written in Python. I haven't used it for
>>> anything big, but I'd start with that for learning.
>>>
>>> https://scrapy.org/
>>>
>>> If you want to learn on a site with a lot of content, try ours, chegg.com.
>>> But if your crawler gets out of hand, crawling too fast, we'll block it.
>>> Any other site will do the same.
>>>
>>> I would not base the crawler directly on Solr. A crawler needs a
>>> dedicated database to record the URLs visited, errors, duplicates, etc. The
>>> output of the crawl goes to Solr. That is how we did it with Ultraseek
>>> (before Solr existed).
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>
>>>> On Jun 1, 2017, at 3:01 PM, David Choi <choi.davi...@gmail.com> wrote:
>>>>
>>>> Oh well, I guess it's OK if a corporation does it, but not someone wanting
>>>> to learn more about the field. I have actually written a crawler before,
>>>> as well as, you know, the inverted index that underlies how Solr works,
>>>> but I just thought its architecture was better suited for scaling.
>>>>
>>>> On Thu, Jun 1, 2017 at 4:47 PM Dave <hastings.recurs...@gmail.com> wrote:
>>>>
>>>>> And I mean that in the context of stealing content from sites that
>>>>> explicitly declare they don't want to be crawled. Robots.txt is to be
>>>>> followed.
>>>>>
>>>>>> On Jun 1, 2017, at 5:31 PM, David Choi <choi.davi...@gmail.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I was wondering if anyone could guide me on how to crawl the web and
>>>>>> ignore robots.txt, since I can not index some big sites. Or if someone
>>>>>> could point out how to get around it. I read somewhere about a
>>>>>> protocol.plugin.check.robots
>>>>>> setting, but that was for Nutch.
>>>>>>
>>>>>> The way I index is
>>>>>> bin/post -c gettingstarted https://en.wikipedia.org/
>>>>>>
>>>>>> but I can't index that site, I'm guessing because of its robots.txt.
>>>>>> I can index with
>>>>>> bin/post -c gettingstarted http://lucene.apache.org/solr
>>>>>>
>>>>>> which I am guessing allows it. I was also wondering how to find the
>>>>>> name of the crawler bin/post uses.
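
For anyone following along, here is a rough sketch of the "start from sitemap.xml and be a good citizen" approach using Scrapy's SitemapSpider. The spider name, the example.com sitemap URL, and the CSS selector are placeholders I made up for illustration, not anything from this thread.

# Minimal Scrapy sitemap crawler sketch (placeholder names and URLs).
from scrapy.spiders import SitemapSpider

class LearningSpider(SitemapSpider):
    name = "learning_spider"                            # placeholder spider name
    sitemap_urls = ["https://example.com/sitemap.xml"]  # use a site that permits crawling

    custom_settings = {
        "ROBOTSTXT_OBEY": True,   # honor robots.txt
        "DOWNLOAD_DELAY": 1.0,    # be polite; fast crawlers get blocked
        "USER_AGENT": "learning-crawler (contact: you@example.com)",
    }

    def parse(self, response):
        # Emit a minimal document per page; the fields are illustrative.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }

Something like "scrapy runspider learning_spider.py -o pages.json" will run it and write the extracted documents to a file, which you can then send to Solr separately.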
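And to illustrate the "dedicated database for crawl state, output goes to Solr" point: a bare-bones sketch of the bookkeeping side. The SQLite schema and helper names are my own invention; only the gettingstarted collection name and the default Solr port come from the examples earlier in this thread.

# Sketch: crawl state lives in SQLite; only extracted documents go to Solr.
import sqlite3
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/gettingstarted/update/json/docs?commit=true"

conn = sqlite3.connect("crawl_state.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS urls (
           url    TEXT PRIMARY KEY,
           status INTEGER,   -- HTTP status, NULL if not fetched yet
           error  TEXT
       )"""
)

def already_seen(url):
    # Skip URLs the crawler has already recorded (duplicates).
    return conn.execute("SELECT 1 FROM urls WHERE url = ?", (url,)).fetchone() is not None

def record(url, status, error=None):
    # Record visited URLs, errors, etc. in the crawler's own database.
    conn.execute("INSERT OR REPLACE INTO urls (url, status, error) VALUES (?, ?, ?)",
                 (url, status, error))
    conn.commit()

def index_document(doc):
    # Only the output of the crawl is sent to Solr.
    requests.post(SOLR_UPDATE_URL, json=doc, timeout=10).raise_for_status()

Keeping the frontier and error tracking out of Solr keeps the index clean and makes re-crawls and retries much easier to manage.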