Sorry, it did not help...
After 2 iterations, there is still only one url in the DB...

Benjamin

On Sun, Jun 30, 2013 at 9:20 PM, kiran chitturi <[email protected]> wrote:

> Hi Sznajder,
>
> Please see an example in the 1.x tutorial here
> (https://wiki.apache.org/nutch/NutchTutorial#Steps). It is in the 3rd
> step, on how to configure regex for crawling websites.
>
> On Sun, Jun 30, 2013 at 10:15 AM, Sznajder ForMailingList
> <[email protected]> wrote:
>
>> Thanks for your help.
>>
>> I am copying here the content.
>>
>> # Licensed to the Apache Software Foundation (ASF) under one or more
>> # contributor license agreements.  See the NOTICE file distributed with
>> # this work for additional information regarding copyright ownership.
>> # The ASF licenses this file to You under the Apache License, Version 2.0
>> # (the "License"); you may not use this file except in compliance with
>> # the License.  You may obtain a copy of the License at
>> #
>> #     http://www.apache.org/licenses/LICENSE-2.0
>> #
>> # Unless required by applicable law or agreed to in writing, software
>> # distributed under the License is distributed on an "AS IS" BASIS,
>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> # See the License for the specific language governing permissions and
>> # limitations under the License.
>>
>> # The default url filter.
>> # Better for whole-internet crawling.
>>
>> # Each non-comment, non-blank line contains a regular expression
>> # prefixed by '+' or '-'.  The first matching pattern in the file
>> # determines whether a URL is included or ignored.  If no pattern
>> # matches, the URL is ignored.
>>
>> # skip file: ftp: and mailto: urls
>> -^(ftp|mailto):
>>
>> # skip image and other suffixes we can't yet parse
>> # for a more extensive coverage use the urlfilter-suffix plugin
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> #-[?*!@=]
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> # accept anything else
>> +.
>>
>> On Sun, Jun 30, 2013 at 6:38 PM, h b <[email protected]> wrote:
>>
>>> What does your conf/regex-urlfilter.txt file contain?
>>> Did you change this file?
>>>
>>> On Jun 30, 2013 5:10 AM, "Sznajder ForMailingList"
>>> <[email protected]> wrote:
>>>
>>>> Thanks a lot for your help.
>>>>
>>>> However, I still did not resolve this issue...
>>>>
>>>> I attach here the logs after 2 rounds of
>>>> "generate/fetch/parse/updatedb".
>>>>
>>>> The DB still contains only the seed url, not more...
>>>>
>>>> On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney
>>>> <[email protected]> wrote:
>>>>
>>>>> Try each step with a crawlId and see if this provides you with
>>>>> better results.
>>>>>
>>>>> Unless you truncated all data between Nutch tasks then you should
>>>>> be seeing more data in HBase.
>>>>> As Tejas asked... what do the logs say?
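For reference, a minimal sketch of the per-step crawlId usage Lewis suggests, with the seed directory urls/ and the crawl id "testcrawl" as illustrative values; the exact flags differ slightly between Nutch 2.x releases, so confirm them by running each bin/nutch command without arguments:

    bin/nutch inject urls/ -crawlId testcrawl
    bin/nutch generate -topN 10 -crawlId testcrawl
    bin/nutch fetch -all -crawlId testcrawl
    bin/nutch parse -all -crawlId testcrawl
    bin/nutch updatedb -crawlId testcrawl
    bin/nutch readdb -stats -crawlId testcrawl

With a crawlId, the jobs use the HBase table testcrawl_webpage instead of the default "webpage", which makes it easy to check from the HBase shell whether a given step actually wrote rows.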
>>>>> On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi Lewis,
>>>>>>
>>>>>> Thanks for your reply.
>>>>>>
>>>>>> I just set the values:
>>>>>>
>>>>>> gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
>>>>>>
>>>>>> I already removed the Hbase table in the past. Can it be a cause?
>>>>>>
>>>>>> Benjamin
>>>>>>
>>>>>> On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Have you changed from the default MemStore gora storage to
>>>>>>> something else?
>>>>>>>
>>>>>>> On Tuesday, June 25, 2013, Sznajder ForMailingList
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Tejas.
>>>>>>>>
>>>>>>>> Yes, I checked the logs and no error appears in them.
>>>>>>>>
>>>>>>>> I left http.content.limit and parser.html.impl at their default
>>>>>>>> values...
>>>>>>>>
>>>>>>>> Benjamin
>>>>>>>>
>>>>>>>> On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any
>>>>>>>>> exception or error messages?
>>>>>>>>> Also you might have a look at these configs in nutch-site.xml
>>>>>>>>> (default values are in nutch-default.xml):
>>>>>>>>> http.content.limit and parser.html.impl
>>>>>>>>>
>>>>>>>>> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello
>>>>>>>>>>
>>>>>>>>>> I installed Nutch 2.2 on my linux machine.
>>>>>>>>>>
>>>>>>>>>> I defined the seed directory with one file containing:
>>>>>>>>>> http://en.wikipedia.org/
>>>>>>>>>> http://edition.cnn.com/
>>>>>>>>>>
>>>>>>>>>> I ran the following:
>>>>>>>>>> sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
>>>>>>>>>>
>>>>>>>>>> After this step, the call
>>>>>>>>>> -bash-4.1$ sh bin/nutch readdb -stats
>>>>>>>>>> returns
>>>>>>>>>> TOTAL urls: 2
>>>>>>>>>> status 0 (null): 2
>>>>>>>>>> avg score: 1.0
>>>>>>>>>>
>>>>>>>>>> Then, I ran the following:
>>>>>>>>>> bin/nutch generate -topN 10
>>>>>>>>>> bin/nutch fetch -all
>>>>>>>>>> bin/nutch parse -all
>>>>>>>>>> bin/nutch updatedb
>>>>>>>>>> bin/nutch generate -topN 1000
>>>>>>>>>> bin/nutch fetch -all
>>>>>>>>>> bin/nutch parse -all
>>>>>>>>>> bin/nutch updatedb
>>>>>>>>>>
>>>>>>>>>> However, the stats call after these steps is still:
>>>>>>>>>> -bash-4.1$ sh bin/nutch readdb -stats
>>>>>>>>>> status 5 (status_redir_perm): 1
>>>>>>>>>> max score: 2.0
>>>>>>>>>> TOTAL urls: 3
>>>>>>>>>> avg score: 1.3333334
>>>>>>>>>>
>>>>>>>>>> Only 3 urls?!
>>>>>>>>>> What do I miss?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Benjamin
>>>>>>>
>>>>>>> --
>>>>>>> *Lewis*
>>>>>
>>>>> --
>>>>> *Lewis*
>
> --
> Kiran Chitturi
> <http://www.linkedin.com/in/kiranchitturi>
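A note on the filter file quoted above: because it ends with the catch-all "+." rule, every discovered outlink is accepted, so the regex filter is not what is limiting this crawl. If the goal were instead to restrict the crawl to the seed hosts, the tutorial step Kiran points to replaces "+." with per-host rules; a sketch, using the two seed hosts from this thread as illustrative values:

    # accept only URLs under the seed hosts; anything that matches no
    # pattern falls through and is ignored
    +^http://([a-z0-9]*\.)*wikipedia.org/
    +^http://([a-z0-9]*\.)*cnn.com/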

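On the storage question raised mid-thread: gora.datastore.default belongs in conf/gora.properties, while Nutch itself selects the store through storage.data.store.class in conf/nutch-site.xml, and the gora-hbase dependency has to be enabled in ivy/ivy.xml before building. A sketch of the HBase wiring described in the Nutch 2.x tutorial:

    # conf/gora.properties
    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

    <!-- conf/nutch-site.xml -->
    <property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.hbase.store.HBaseStore</value>
    </property>

If only gora.properties is changed, Nutch can fall back to the default in nutch-default.xml, the in-memory MemStore Lewis mentions, which loses all data between jobs and would match the symptom of updatedb never adding urls.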

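And on Tejas's config pointer: http.content.limit caps how many bytes of a page are downloaded (65536 by default in nutch-default.xml), so large pages such as the Wikipedia and CNN front pages can be truncated before all their links are parsed. An illustrative nutch-site.xml override that removes the cap (-1 means no limit):

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
    </property>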