Hi Tejas,

You're right, it was my mistake: the problem was in regex-urlfilter.txt.
It started when I changed db.ignore.external.links to true, assuming that setting *alone* would make Nutch ignore external links and fetch only the URLs in my seeds file. I then commented out the line in regex-urlfilter.txt that allows only hosts under *.bappenas.go.id:

#+^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*

That is when the problems began :( I have now fixed it by un-commenting that line again, and segments are created.

***

One question about the regex-urlfilter.txt configuration: what is the correct order of rules inside that file? I want to drop URLs like the following:

http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=4
http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=5
http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=6
http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=60

So I added the line below *before* the rule
+^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*
to drop/ignore those kinds of URLs:

-^http://bappenas.go.id/index.php?ccm_paging_p*

But Nutch still fetches those URLs. Is my regex incorrect, or should I put the rule after the allow rule?

Thank you.

On Sat, Feb 15, 2014 at 12:18 PM, Tejas Patil <[email protected]> wrote:

> The logs say this:
>
>> Generator: 0 records selected for fetching, exiting ...
>
> This is because there are no urls that generator could pass to form a
> segment.
>
>> Injector: total number of urls injected after normalization and
>> filtering: 0
>
> Inject did NOT add anything to the crawldb. Check if you are
> over-filtering the input urls. Also it would be nice to see that the urls
> you are injecting are valid. From the logs it looks like there were just 4
> urls in the seeds file.
> Thanks,
> Tejas
>
> On Fri, Feb 14, 2014 at 4:43 PM, Bayu Widyasanyata <[email protected]> wrote:
>
> > Hi,
> >
> > From what I know, "nutch generate" will create a new segment directory
> > every round Nutch runs.
> >
> > I have a problem (which never happened before): nutch won't create a new
> > segment. It always fetches and parses only the latest segment.
> >
> > From the logs:
> > 2014-02-15 07:20:02,036 INFO fetcher.Fetcher - Fetcher: segment:
> > /opt/searchengine/nutch/BappenasCrawl/segments/20140205213835
> >
> > Even though I repeat the processes (generate > fetch > parse > update)
> > many times.
> >
> > What should I check in the Nutch configuration? Any hints to solve this
> > problem?
> >
> > I use Nutch 1.7.
> >
> > Here is the relevant part of the hadoop log file:
> > http://pastebin.com/kpi48gK6
> >
> > Thank you.
> >
> > --
> > wassalam,
> > [bayu]

--
wassalam,
[bayu]
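P.S. Regarding the ordering question above: as far as I understand it, Nutch's RegexURLFilter evaluates the rules in regex-urlfilter.txt top-down and the first matching rule wins, so placing the "-" rule before the "+" rule should be the right order; one common pitfall is that "?" and "." are regex metacharacters and need escaping to match literally. A minimal Python sketch of that first-match-wins behavior (the rule list and url_filter helper are my own illustration, not Nutch code):

```python
import re

# Rules as they might appear in regex-urlfilter.txt. They are tried
# top-down and the FIRST matching rule decides, so a "-" (reject) rule
# must come before any "+" rule that would otherwise accept the URL.
# Note the escaped '?' and '.' -- both are regex metacharacters.
RULES = [
    ("-", r"^http://bappenas\.go\.id/index\.php\?ccm_paging_p"),
    ("+", r"^http://([a-z0-9]*\.)*bappenas\.go\.id/"),
]

def url_filter(url):
    """Return True if the URL is accepted, False if rejected."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: reject, as Nutch does

# The paging URLs hit the "-" rule first and are dropped...
print(url_filter("http://bappenas.go.id/index.php?ccm_paging_p_b10721=1"))  # False
# ...while ordinary pages on the host fall through to the "+" rule.
print(url_filter("http://bappenas.go.id/data-dan-informasi/"))  # True
```

With an unescaped "?" the pattern index.php?ccm_paging_p* never matches the real URLs (php? means "ph with optional p"), which would explain why those pages were still being fetched.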

