Hi Tejas,

You're right, it was my mistake: the problem was in regex-urlfilter.txt.

It started when I changed db.ignore.external.links to true, assuming that
setting *alone* would make nutch ignore external links and fetch only the
URLs in my seeds file.
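For reference, this is the property as I have it in conf/nutch-site.xml:

```
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <!-- outlinks pointing to other hosts are dropped at update time -->
</property>
```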

After that, I commented out the line in regex-urlfilter.txt that allows only
hosts under *.bappenas.go.id:
#+^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*

That's where the problems began :(
I've now fixed it by restoring (un-commenting) that line, and segments are
created again.

***

One question about regex-urlfilter.txt configuration: what is the correct
order of rules inside that file?

I plan to drop URLs like the following:
http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=4
http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=5
http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=6
http://bappenas.go.id/index.php?ccm_paging_p_b10721=1&ccm_paging_p_b2495=13&ccm_paging_p_b982=60

So I added the rule below *before* +^http://([a-z0-9]*\.)*
bappenas.go.id/([a-z0-9\-A-Z]*\/)* to drop/ignore the kinds of URLs above:

-^http://bappenas.go.id/index.php?ccm_paging_p*

But nutch still fetches those kinds of URLs.
Is my regex incorrect, or should I put it after the allow rule?
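In case it helps: my current understanding (please correct me if I'm wrong)
is that the filter uses the first rule that matches, so the minus rule must
come before the plus rule, and that '?' and '.' are regex metacharacters
that need escaping. This is the variant I plan to try next:

```
# drop the ccm paging URLs (note the escaped '.' and '?')
-^http://bappenas\.go\.id/index\.php\?ccm_paging_p

# then allow everything else on the site
+^http://([a-z0-9]*\.)*bappenas.go.id/([a-z0-9\-A-Z]*\/)*
```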

Thank you.

On Sat, Feb 15, 2014 at 12:18 PM, Tejas Patil <[email protected]> wrote:

> The logs say this:
> >> Generator: 0 records selected for fetching, exiting ...
> This is because there are no urls that generator could pass to form a
> segment.
>
> >> Injector: total number of urls injected after normalization and
> filtering: 0
> Inject did NOT add anything to the crawldb. Check if you are over-filtering
> the input urls. It would also be good to verify that the urls you are
> injecting are valid. From the logs it looks like there were just 4 urls in
> the seeds file.
>
> Thanks,
> Tejas
>
>
> On Fri, Feb 14, 2014 at 4:43 PM, Bayu Widyasanyata
> <[email protected]> wrote:
>
> > Hi,
> >
> > From what I know, "nutch generate" should create a new segment directory
> > every round nutch runs.
> >
> > I have a problem (it never happened before): nutch won't create a new
> > segment.
> > It only fetches and parses the latest segment again.
> > - from the logs:
> > 2014-02-15 07:20:02,036 INFO  fetcher.Fetcher - Fetcher: segment:
> > /opt/searchengine/nutch/BappenasCrawl/segments/20140205213835
> >
> > Even though I repeat the processes (generate > fetch > parse > update)
> > many times.
> >
> > What should I check in my nutch configuration? Any hints to solve this
> > problem?
> >
> > I use nutch 1.7.
> >
> > And here is the part of hadoop log file:
> > http://pastebin.com/kpi48gK6
> >
> > Thank you.
> >
> > --
> > wassalam,
> > [bayu]
> >
>



-- 
wassalam,
[bayu]
