And, sorry for spamming, another question:

As far as I understood, another, even better option could be the
following:

in prefix-urlfilter.txt:

http://www.xyz.com/book/

And in the regex-urlfilter.txt

+.

but it doesn't work... it still crawls everything, including URLs (on the
same xyz domain) that don't have that prefix.
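
One possible cause worth checking (this is my own guess, not something already established in this thread): the prefix filter lives in a separate plugin, urlfilter-prefix, and prefix-urlfilter.txt is silently ignored unless that plugin appears in the plugin.includes property of nutch-site.xml (the default value typically lists urlfilter-regex only). A minimal sketch, assuming a fairly standard plugin set (the exact value is version-dependent, so treat it as an example):

```xml
<!-- nutch-site.xml: enable urlfilter-prefix alongside the usual plugins.
     Without it, prefix-urlfilter.txt is never consulted at all. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(prefix|regex)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```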

Best,
Andrea
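
P.S. A quick way to sanity-check a candidate regex-urlfilter pattern outside of Nutch (a sketch using plain Python re; Nutch itself applies Java regex, but these particular patterns behave the same in both). The pattern suggested earlier in the thread ends with `\.`, which requires a literal dot after the digits, so a URL like http://www.xyz.com/book/2 never matches — which would be consistent with the "0 records selected for fetching" output:

```python
import re

# Pattern from earlier in the thread: the trailing \. demands a literal dot
pattern = r"^http://www.xyz.com/book/([0-9]*\.)"
print(re.match(pattern, "http://www.xyz.com/book/2"))         # None: no dot after the digits

# A pattern anchored on digits only (my hypothetical fix, assuming the
# book ids are purely numeric)
fixed = r"^http://www\.xyz\.com/book/[0-9]+"
print(bool(re.match(fixed, "http://www.xyz.com/book/2")))     # True
print(bool(re.match(fixed, "http://www.xyz.com/search?q=x"))) # False
```

If the book ids really are purely numeric (an assumption on my part), anchoring on `[0-9]+` avoids the stray dot entirely.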


On Sat, Apr 16, 2016 at 7:58 AM, Andrea Gazzarini <[email protected]> wrote:

> Hi Furkan,
> I'm not able to get it working. Maybe I misunderstood your email.
>
> Simplifying, let's assume my website has the following structure
>
> http://www.xyz.com/book/1
>
> that contains a link towards
>
> http://www.xyz.com/book/2
> http://www.xyz.com/book/3
>
> The /2 and /3 pages also contain some outlinks, so the site map is as follows:
>
>
>    - http://www.xyz.com/book/1
>    - http://www.xyz.com/book/2
>       - http://www.xyz.com/book/5
>          - http://www.xyz.com/book/6
>          - http://www.xyz.com/book/3
>       - http://www.xyz.com/book/7
>          - http://www.xyz.com/book/8
>
> I put
>
> http://www.xyz.com/book/1
>
> in the seed file and the following line in the regex-urlfilter.txt (the
> only uncommented line)
>
> +^http://www.xyz.com/book/([0-9]*\.)
>
> Running
>
> bin/crawl -i -D solr.server.url=http://localhost:8983/solr/woozlee
> urls/few/captain-gazza.txt TestCrawl x
>
>
> Injector: Total number of urls rejected by filters: 0
>
> Injector: Total number of urls after normalization: 1
>
> ...
>
> Indexing 1 documents
>
> Indexer: number of documents indexed, deleted, or skipped:
>
> Indexer:      1  indexed (add/update)
>
> Indexer: finished at 2016-04-16 07:54:35, elapsed: 00:00:04
>
> Cleaning up index if possible
>
> /home/solr/apache-nutch-1.11/bin/nutch clean -Dsolr.server.url=
> http://localhost:8983/solr/woozlee TestCrawl/crawldb
>
> Sat Apr 16 07:54:39 CEST 2016 : Iteration 2 of 5
>
> Generating a new segment
>
> /home/solr/apache-nutch-1.11/bin/nutch generate -D mapreduce.job.reduces=2
> -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false
> -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true
> TestCrawl/crawldb TestCrawl/segments -topN 50000 -numFetchers 1 -noFilter
>
> Generator: starting at 2016-04-16 07:54:40
>
> Generator: Selecting best-scoring urls due for fetch.
>
> Generator: filtering: false
>
> Generator: normalizing: true
>
> Generator: topN: 50000
>
> Generator: 0 records selected for fetching, exiting ...
>
> Generate returned 1 (no new segments created)
>
> Escaping loop: no more URLs to fetch now
>
>
> Whatever *x* is, the cycle completes quickly and indexes only the URL in
> the seed list (i.e. I end up with one record indexed in Solr).
>
>
> Again, many thanks for your help
>
>
> Best,
>
> Andrea
>
> On Fri, Apr 15, 2016 at 9:23 PM, Andrea Gazzarini <[email protected]>
> wrote:
>
>> Hi Furkan,
>> many thanks, I'm going to try and I'll let you know.
>>
>> For the first question, I'm not sure about the overall size, but we're
>> talking about 2 million (and growing) pages; in general, nothing that
>> can be easily handled with a from-scratch, custom solution.
>>
>> I was wondering if, from a functional perspective, Nutch is a good fit
>> for automating the periodic indexing (into Solr, which is my ultimate
>> goal) of that website. If that works, the same mechanism will be used
>> for other websites as well.
>>
>> Best,
>> Andrea
>> On 15 Apr 2016 18:16, "Furkan KAMACI" <[email protected]> wrote:
>>
>>> Hi Andrea,
>>>
>>> The regex URL filter works like this:
>>>
>>> This rule accepts everything:
>>>
>>> +.
>>>
>>> Let's assume that you want to crawl Nutch's website. If you wished to
>>> limit the crawl to the nutch.apache.org domain, then the definition
>>> should be:
>>>
>>> +^http://([a-z0-9]*\.)*nutch.apache.org/
>>>
>>> So, if your "more like this" section has URLs with this pattern:
>>>
>>> http://www.xyz.com/book/{book_id}
>>>
>>> Then your definition should be:
>>>
>>> +^http://www.xyz.com/book/([0-9]*\.)*
>>>
>>> For your first question, you should tell us the approximate size of
>>> the data you will crawl, and whether you have any other needs.
>>>
>>> Kind Regards,
>>> Furkan KAMACI
>>>
>>>
>>> On Fri, Apr 15, 2016 at 4:17 PM, Andrea Gazzarini <[email protected]>
>>> wrote:
>>>
>>> > Hi guys,
>>> > just playing, as a Nutch newbie, with a simple (at least I think) use case:
>>> >
>>> > I have a website (e.g. http://www.xyz.com) that allows searching for
>>> > books. Here, as in any search website, I have two kinds of pages:
>>> >
>>> >  * a page that shows search results (depending on the user-entered
>>> >    search terms)
>>> >  * a details page for a given book. Each details page is a permalink
>>> >    which follows a naming convention (e.g.
>>> >    http://www.xyz.com/book/{book id})
>>> >
>>> > The details page has something like a "more like this" section that
>>> > contains permalinks to other (similar) books.
>>> > Now, my requirement is to index in Solr *all* details page of such
>>> website.
>>> >
>>> > If Nutch is a suitable tool for doing that (and this is actually the
>>> > first question), could you please give me some hints about how to
>>> > configure it?
>>> >
>>> > Specifically, I tried putting a seed file with just one entry
>>> >
>>> > http://www.xyz.com/book/1
>>> >
>>> > and then I configured my regex-urlfilter.txt
>>> >
>>> > +^http://www.xyz.com/book
>>> >
>>> > But it indexes only the /1 page. I imagined that the "more like this"
>>> > section of the /1 page would act as a set of outlinks for reaching
>>> > further details pages (which in turn contain further MLT sections,
>>> > and so on).
>>> >
>>> > Best,
>>> > Andrea
>>> >
>>> >
>>>
>>
>
