I just fixed the pattern with the following:
-^http://.*ccm_paging_p.*$

And put it before the allow rule.
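For the record, a sketch of the resulting regex-urlfilter.txt ordering (the allow pattern is the one from earlier in this thread; Nutch's RegexURLFilter tries rules top to bottom and the first match decides, so the exclusion has to precede the broad include):

```
# regex-urlfilter.txt (sketch): rules are applied top to bottom and
# the first matching rule wins (- = exclude, + = include), so the
# paging exclusion must come before the site-wide include.
-^http://.*ccm_paging_p.*$
+^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*
```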

Case closed.
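For anyone who hits this later: besides the ordering, the earlier pattern -^http://bappenas.go.id/index.php?ccm_paging_p* also had a regex problem, since '?' and '*' are metacharacters. A quick illustration (using Python's re here only for convenience; Nutch's RegexURLFilter uses Java regex, where these metacharacters behave the same way):

```python
import re

# The earlier attempt: the unescaped '?' makes the preceding 'p'
# optional instead of matching a literal '?', so the pattern never
# matches the real paging URLs.
broken = r"^http://bappenas.go.id/index.php?ccm_paging_p*"

# The fix from this thread: match 'ccm_paging_p' anywhere in the URL.
fixed = r"^http://.*ccm_paging_p.*$"

url = "http://bappenas.go.id/index.php?ccm_paging_p_b10721=1"

print(re.match(broken, url))        # None
print(bool(re.match(fixed, url)))   # True
```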

Thank you Tejas!


On Sat, Feb 15, 2014 at 8:53 PM, Bayu Widyasanyata
<[email protected]>wrote:

> Hi Tejas,
>
> You're right!
> It was my mistake: a regex-urlfilter.txt problem.
>
> It started when I changed db.ignore.external.links to true, assuming that
> setting *alone* would make Nutch ignore external links and fetch only the
> URLs in my seeds file.
>
> After that, I commented out the line in regex-urlfilter.txt that allows
> only hosts under *.bappenas.go.id:
> #+^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*
>
> Since then the problems began :(
> Now I've fixed the problem by commenting it out, and segments are created.
>
> ***
>
> One question about the regex-urlfilter.txt configuration:
> What is the correct rule order inside that file?
>
> I plan to drop URLs like the following:
>
> http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=4
>
> http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=5
>
> http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=6
>
> http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=60
>
> Then I added the configuration line below *before*
> +^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*
> to drop/ignore the kinds of URLs above.
>
> -^http://bappenas.go.id/index.php?ccm_paging_p*
>
> But Nutch still fetches those kinds of URLs.
> Is my regex incorrect, or should I put it after the allow rule?
>
> Thank you.
>
> On Sat, Feb 15, 2014 at 12:18 PM, Tejas Patil <[email protected]> wrote:
>
>> The logs say this:
>> >> Generator: 0 records selected for fetching, exiting ...
>> This is because there were no URLs that the generator could select to
>> form a segment.
>>
>> >> Injector: total number of urls injected after normalization and
>> filtering: 0
>> Inject did NOT add anything to the crawldb. Check whether you are
>> over-filtering the input URLs. Also verify that the URLs you are
>> injecting are valid. From the logs it looks like there were just 4 URLs
>> in the seeds file.
>>
>> Thanks,
>> Tejas
>>
>>
>> On Fri, Feb 14, 2014 at 4:43 PM, Bayu Widyasanyata
>> <[email protected]>wrote:
>>
>> > Hi,
>> >
>> > From what I know, "nutch generate" creates a new segment directory
>> > every round Nutch runs.
>> >
>> > I have a problem (it never happened before): Nutch won't create a new
>> > segment.
>> > It always fetches and parses only the latest segment.
>> > - from the logs:
>> > 2014-02-15 07:20:02,036 INFO  fetcher.Fetcher - Fetcher: segment:
>> > /opt/searchengine/nutch/BappenasCrawl/segments/20140205213835
>> >
>> > This happens even though I repeat the process (generate > fetch >
>> > parse > update) many times.
>> >
>> > What should I check in the Nutch configuration? Any hints to solve
>> > this problem would be appreciated.
>> >
>> > I use nutch 1.7.
>> >
>> > And here is the part of hadoop log file:
>> > http://pastebin.com/kpi48gK6
>> >
>> > Thank you.
>> >
>> > --
>> > wassalam,
>> > [bayu]
>> >
>>
>
>
>
> --
> wassalam,
> [bayu]
>



-- 
wassalam,
[bayu]
