I just fixed the pattern with the following: -^http://.*ccm_paging_p.*$
And put it before. Case closed. Thank you Tejas!

On Sat, Feb 15, 2014 at 8:53 PM, Bayu Widyasanyata <[email protected]> wrote:

> Hi Tejas,
>
> You're right! It was my mistake: a regex-urlfilter.txt problem.
>
> It started when I changed db.ignore.external.links to true, assuming that
> this setting alone would make Nutch ignore external links, i.e. fetch only
> the URLs in my seeds file.
>
> After that, I commented out the line that allows only hosts under
> *.bappenas.go.id in the regex-urlfilter.txt file:
>
> #+^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*
>
> That is when the problems began :(
> Now I've fixed it by reverting that change, and segments are created again.
>
> ***
>
> One question about the regex-urlfilter.txt configuration.
> What is the correct order of rules inside that file?
>
> I plan to drop URLs like the following:
>
> http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=4
> http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=5
> http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=6
> http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=60
>
> So I added the line below *before*
> +^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*
> to drop/ignore those kinds of URLs:
>
> -^http://bappenas.go.id/index.php?ccm_paging_p*
>
> But Nutch still fetches those URLs.
> Is my regex incorrect, or should I put it after the allow rule?
>
> Thank you.
>
> On Sat, Feb 15, 2014 at 12:18 PM, Tejas Patil <[email protected]> wrote:
>
>> The logs say this:
>>
>> Generator: 0 records selected for fetching, exiting ...
>>
>> This is because there are no URLs that the generator could pass to form
>> a segment.
>>
>> Injector: total number of urls injected after normalization and
>> filtering: 0
>>
>> Inject did NOT add anything to the crawldb. Check whether you are
>> over-filtering the input URLs. It would also be good to verify that the
>> URLs you are injecting are valid. From the logs it looks like there were
>> just 4 URLs in the seeds file.
>>
>> Thanks,
>> Tejas
>>
>> On Fri, Feb 14, 2014 at 4:43 PM, Bayu Widyasanyata <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > From what I know, "nutch generate" creates a new segment directory
>> > every round Nutch runs.
>> >
>> > I have a problem (it never happened before): Nutch won't create a new
>> > segment. It only fetches and parses the latest segment.
>> > From the logs:
>> >
>> > 2014-02-15 07:20:02,036 INFO fetcher.Fetcher - Fetcher: segment:
>> > /opt/searchengine/nutch/BappenasCrawl/segments/20140205213835
>> >
>> > This happens even though I repeat the processes (generate > fetch >
>> > parse > update) many times.
>> >
>> > What should I check in the Nutch configuration? Any hints to solve
>> > this problem?
>> >
>> > I use Nutch 1.7.
>> >
>> > Here is the relevant part of the hadoop log file:
>> > http://pastebin.com/kpi48gK6
>> >
>> > Thank you.
>> >
>> > --
>> > wassalam,
>> > [bayu]
>
> --
> wassalam,
> [bayu]

--
wassalam,
[bayu]
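[Editor's note] The rule-ordering question in the thread can be illustrated with a small Python sketch. This is not Nutch's actual Java implementation, only a model of the documented regex-urlfilter.txt behavior: rules are tried in file order and the first matching rule decides. The rules and the paging URL are taken from the thread; the `/index.php/profil` URL is a hypothetical non-paging page for contrast. It also shows the likely reason the original deny rule never matched: the unescaped `?` in `index.php?` is a regex quantifier, not a literal question mark.

```python
import re

# Minimal sketch of Nutch-style regex-urlfilter.txt evaluation (assumption,
# not the real implementation): first matching rule wins, '+' accepts,
# '-' rejects, and a URL matching no rule is rejected.
RULES = [
    ("-", r"^http://.*ccm_paging_p.*$"),                # deny rule, placed first
    ("+", r"^http://([a-z0-9]*\.)*bappenas\.go\.id/"),  # allow rule from the thread
]

def url_filter(url, rules=RULES):
    """Return True if the first matching rule accepts the URL."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # matched nothing -> rejected

paging = "http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13"
normal = "http://bappenas.go.id/index.php/profil"  # hypothetical non-paging URL

print(url_filter(paging))   # False: the deny rule matches first
print(url_filter(normal))   # True: falls through to the allow rule

# Why the earlier attempt never matched: in
#     -^http://bappenas.go.id/index.php?ccm_paging_p*
# the unescaped '?' makes the preceding 'p' optional, so the pattern wants
# "index.ph(p)" immediately followed by "ccm_paging_p" and can never match
# a literal "?" in the query string.
print(re.search(r"^http://bappenas.go.id/index.php?ccm_paging_p", paging))  # None
# Escaping the metacharacters fixes it:
print(re.search(r"^http://bappenas\.go\.id/index\.php\?ccm_paging_p", paging) is not None)  # True
```

Because the first match wins, a deny rule must appear before any broader allow rule that would also match the same URLs, which is consistent with the fix described at the top of the thread.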

