Thanks for your help.

I am copying the content here:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip ftp: and mailto: urls
-^(ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
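As an aside, the first-match semantics described in the file's comments above ("The first matching pattern in the file determines whether a URL is included or ignored; if no pattern matches, the URL is ignored") can be sketched in Python. The rule list below is a shortened, illustrative subset of the file above, not the full set:

```python
import re

# Each rule mirrors one line of the filter file: a '+'/'-' prefix plus a regex.
# This is a shortened, hypothetical subset of the rules above.
RULES = [
    ("-", re.compile(r"^(ftp|mailto):")),
    ("-", re.compile(r"\.(gif|jpg|png|zip|exe)$", re.IGNORECASE)),
    ("+", re.compile(r".")),
]

def accept(url: str) -> bool:
    """The first matching pattern decides; if no pattern matches, ignore the URL."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False

print(accept("http://en.wikipedia.org/"))     # True: falls through to the '+.' rule
print(accept("mailto:user@example.com"))      # False: first rule matches
print(accept("http://example.com/logo.png"))  # False: suffix rule matches
```

Note that with the `+.` catch-all as the last rule, any URL not rejected by an earlier `-` rule is accepted, which is why the filter above should not be the cause of a crawl that stops at the seed URLs.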


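For what it's worth, Lewis's suggestion below to "try each step with a crawlId" would look roughly like this for one generate/fetch/parse/updatedb round. The id `myCrawl` and the seed path are placeholders, and exact flags can vary between Nutch 2.x versions:

```shell
# One crawl round tagged with an explicit (hypothetical) crawlId.
bin/nutch inject ~/seed/ -crawlId myCrawl
bin/nutch generate -topN 10 -crawlId myCrawl
bin/nutch fetch -all -crawlId myCrawl
bin/nutch parse -all -crawlId myCrawl
bin/nutch updatedb -crawlId myCrawl
# Then check whether outlinks made it into the db:
bin/nutch readdb -stats -crawlId myCrawl
```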

On Sun, Jun 30, 2013 at 6:38 PM, h b <[email protected]> wrote:

> What does your conf/regex_urlfilters
> file contain?
> Did you change this file?
> On Jun 30, 2013 5:10 AM, "Sznajder ForMailingList" <
> [email protected]>
> wrote:
>
> > Thanks a lot for your help
> >
> > However, I still did not resolve this issue...
> >
> >
> > I attach here the logs after two rounds of
> > "generate/fetch/parse/updatedb".
> >
> > The DB still contains only the seed URL, nothing more...
> >
> >
> >
> >
> > On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Try each step with a crawlId and see if this provides you with better
> >> results.
> >>
> >> Unless you truncated all data between Nutch tasks then you should be
> >> seeing
> >> more data in HBase.
> >> As Tejas asked... what do the logs say?
> >>
> >>
> >> On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList <
> >> [email protected]> wrote:
> >>
> >> > Hi Lewis,
> >> >
> >> > Thanks for your reply
> >> >
> >> > I just set the values:
> >> >
> >> >  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
> >> >
> >> >
> >> > I already removed the Hbase table in the past. Can it be a cause?
> >> >
> >> > Benjamin
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney <
> >> > [email protected]> wrote:
> >> >
> >> > > Have you changed from the default MemStore gora storage to something
> >> > else?
> >> > >
> >> > > On Tuesday, June 25, 2013, Sznajder ForMailingList <
> >> > > [email protected]>
> >> > > wrote:
> >> > > > thanks Tejas
> >> > > >
> >> > > > Yes, I checked the logs and no error appears in them.
> >> > > >
> >> > > > I left http.content.limit and parser.html.impl at their default
> >> > > > values...
> >> > > >
> >> > > > Benjamin
> >> > > >
> >> > > >
> >> > > > On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil <
> >> [email protected]
> >> > > >wrote:
> >> > > >
> >> > > >> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any
> >> exception
> >> > or
> >> > > >> error messages ?
> >> > > >> Also you might have a look at these configs in nutch-site.xml
> >> (default
> >> > > >> values are in nutch-default.xml):
> >> > > >> http.content.limit and parser.html.impl
> >> > > >>
> >> > > >>
> >> > > >> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <
> >> > > >> [email protected]> wrote:
> >> > > >>
> >> > > >> > Hello
> >> > > >> >
> >> > > >> > I installed Nutch 2.2 on my Linux machine.
> >> > > >> >
> >> > > >> > I defined the seed directory with one file containing:
> >> > > >> > http://en.wikipedia.org/
> >> > > >> > http://edition.cnn.com/
> >> > > >> >
> >> > > >> >
> >> > > >> > I ran the following:
> >> > > >> > sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
> >> > > >> >
> >> > > >> > After this step:
> >> > > >> > the call
> >> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
> >> > > >> >
> >> > > >> > returns
> >> > > >> > TOTAL urls:     2
> >> > > >> > status 0 (null):        2
> >> > > >> > avg score:      1.0
> >> > > >> >
> >> > > >> >
> >> > > >> > Then, I ran the following:
> >> > > >> > bin/nutch generate -topN 10
> >> > > >> > bin/nutch fetch -all
> >> > > >> > bin/nutch parse -all
> >> > > >> > bin/nutch updatedb
> >> > > >> > bin/nutch generate -topN 1000
> >> > > >> > bin/nutch fetch -all
> >> > > >> > bin/nutch parse -all
> >> > > >> > bin/nutch updatedb
> >> > > >> >
> >> > > >> >
> >> > > >> > However, the stats call after these steps is still:
> >> > > >> > the call
> >> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
> >> > > >> > status 5 (status_redir_perm):   1
> >> > > >> > max score:      2.0
> >> > > >> > TOTAL urls:     3
> >> > > >> > avg score:      1.3333334
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> > Only 3 URLs?!
> >> > > >> > What am I missing?
> >> > > >> >
> >> > > >> > thanks
> >> > > >> >
> >> > > >> > Benjamin
> >> > > >> >
> >> > > >>
> >> > > >
> >> > >
> >> > > --
> >> > > *Lewis*
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> *Lewis*
> >>
> >
> >
>
