Hi Renato,

The default content limit for the http protocol is 65536 bytes, while the
webpage is much bigger than that, so the relevant config needs to be updated.
Add this to conf/nutch-site.xml:

<property>
  <name>http.content.limit</name>
  <value>240000</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
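The description above boils down to: a nonnegative limit truncates, a negative one disables truncation. A small illustrative model in Python (not Nutch's actual code):

```python
# Illustrative model of the http.content.limit semantics described above
# (not Nutch's implementation): a nonnegative limit truncates content,
# while a negative value disables truncation entirely.

def apply_content_limit(content: bytes, limit: int) -> bytes:
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content

page = b"x" * 100_000                                     # bigger than the 65536 default
assert len(apply_content_limit(page, 65536)) == 65536     # truncated at default
assert len(apply_content_limit(page, 240000)) == 100_000  # fits the raised limit
assert len(apply_content_limit(page, -1)) == 100_000      # no truncation
```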

After this config change I got a connection timed out error (which makes
sense, since there is now more content to download).
So I added this to the conf/nutch-site.xml:

<property>
  <name>http.timeout</name>
  <value>1000000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

After running a fresh crawl, I could see the link to the next page in the
crawldb:

http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
key:
 net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
baseUrl:        null
status: 1 (status_unfetched)
fetchTime:      1368424541731
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:      0
modifiedTime:   0
prevModifiedTime:       0
protocolStatus: (null)
parseStatus:    (null)
title:  null
score:  0.0042918455
markers:        {dist=1}
reprUrl:        null
metadata _csh_ :        ;���

HTH


On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
[email protected]> wrote:

> Hi Tejas,
>
> So I started fresh. I deleted the webpage keyspace as I am using
> Cassandra as a backend. But I did get the same output. I mean I get a
> bunch of urls after I do a readdb -dump but not the ones I want. I get
> only one fetched site, and many links parsed (to be parsed in the next
> cycle?). Maybe it has something to do with the urls I am trying to
> get?
> I am trying to get this url and similar ones:
>
>
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>
> But I have noticed that the links pointing to the next ones are
> something like this:
>
> <a class="resultado_roda"
> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
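Relative hrefs like the one above get resolved against the page url when outlinks are extracted, which is how they end up absolute in the crawldb. A quick check with Python's standard library (illustrative; this is not Nutch's own outlink extraction):

```python
from urllib.parse import urljoin

# The page url being crawled and the relative href from the anchor above.
base = "http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1"
href = "/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2"

absolute = urljoin(base, href)
print(absolute)
# http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
```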
>
> So I decided to try commenting this url rule:
> # skip URLs with slash-delimited segment that repeats 3+ times, to break
> loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> But I got the same results: a single site fetched, some urls parsed,
> but not the ones I want using the regex-urlfilter.txt. Any ideas?
> Thanks a ton for your help Tejas!
>
>
> Renato M.
>
>
> 2013/5/12 Tejas Patil <[email protected]>:
> > Hi Renato,
> >
> > That's weird. I ran a crawl over similar urls having a query at the end (
> > http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
> > My guess is that something goes wrong during parsing, due to which
> > outlinks are not getting into the crawldb.
> >
> > Start fresh. Clear everything from previous attempts (including the
> > backend table named as the value of 'storage.schema.webpage').
> > Run these :
> > bin/nutch inject <urldir>
> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> > bin/nutch fetch <batchID> -threads 2
> > bin/nutch parse <batchID>
> > bin/nutch updatedb
> > bin/nutch readdb -dump <output dir>
> >
> > The readdb output will show whether the outlinks were extracted correctly.
> >
> > The commands for checking urlfilter rules accept one input url at a time
> > from console (you need to type/paste the url and hit enter).
> > It shows "+" if the url is accepted by the current rules. ("-" for
> > rejection).
> >
> > Thanks,
> > Tejas
> >
> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
> > [email protected]> wrote:
> >
> >> And I did try the commands you told me but I am not sure how they
> >> work. They do wait for a url to be input, but then they print the url
> >> with a '+' at the beginning; what does that mean?
> >>
> >> http://www.xyz.com/lanchon
> >> +http://www.xyz.com/lanchon
> >>
> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
> >> > Hi Tejas,
> >> >
> >> > Thanks for your help. I have tried the expression you suggested, and
> >> > now my url-filter file is like this:
> >> > +http://www.xyz.com/\?page=*
> >> >
> >> > # skip URLs containing certain characters as probable queries, etc.
> >> > #-[?*!@=]
> >> > +.
> >> >
> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> > +.
> >> >
> >> > # accept anything else
> >> > +.
> >> >
> >> > So after this, I run a generate command -topN 5 -depth 5, and then a
> >> > fetch all, but I keep on getting a single page fetched. What am I
> >> > doing wrong? Thanks again for your help.
> >> >
> >> >
> >> > Renato M.
> >> >
> >> > 2013/5/12 Tejas Patil <[email protected]>:
> >> >> FYI: You can use any one of these commands to run the regex-urlfilter
> >> >> rules against any given url:
> >> >>
> >> >> bin/nutch plugin urlfilter-regex
> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> >> OR
> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> >>
> >> >> Both of them accept input urls one at a time from stdin.
> >> >> The latter one has a param which enables you to test a given url
> >> >> against several url filters at once. See its usage for more details.
> >> >>
> >> >>
> >> >>
> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
> >> >>
> >> >>> If there is no restriction on the number at the end of the url, you
> >> >>> might just use this (note that the rule must be above the one which
> >> >>> filters urls with a "?" character):
> >> >>>
> >> >>> +http://www.xyz.com/\?page=*
> >> >>>
> >> >>> # skip URLs containing certain characters as probable queries, etc.
> >> >>> -[?*!@=]
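The ordering note above matters because regex-urlfilter applies rules top-down and the first matching rule wins. A rough Python model of that evaluation (the unanchored-search semantics are an assumption here; this is not Nutch's code):

```python
import re

# First-match-wins model of regex-urlfilter evaluation (assumed semantics,
# not Nutch's actual implementation). Patterns are searched unanchored.
rules = [
    ("+", r"http://www.xyz.com/\?page=*"),  # accept paginated urls first
    ("-", r"[?*!@=]"),                      # then reject other query urls
    ("+", r"."),                            # accept anything else
]

def check(url: str) -> str:
    """Return the url prefixed with '+' (accepted) or '-' (rejected)."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign + url
    return "-" + url

print(check("http://www.xyz.com/?page=2"))    # accepted by the first rule
print(check("http://www.xyz.com/login?x=1"))  # rejected by the "?" rule
```

Note that once a `+.` line appears, every url matches it, so any rules placed below it never take effect.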
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> >> >>> [email protected]> wrote:
> >> >>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> I have been trying to fetch a query similar to:
> >> >>>>
> >> >>>> http://www.xyz.com/?page=1
> >> >>>>
> >> >>>> But where the number can vary from 1 to 100. Inside the first page
> >> >>>> there are links to the next ones. So I updated the
> >> >>>> conf/regex-urlfilter file and added:
> >> >>>>
> >> >>>> ^[0-9]{1,45}$
> >> >>>>
> >> >>>> When I do this, the generate job fails saying that it is "Invalid
> >> >>>> first character". I have tried generating with topN 5 and depth 5
> >> >>>> and trying to fetch more urls, but that does not work.
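The "Invalid first character" error suggests every non-comment line in regex-urlfilter.txt must begin with '+' or '-', and the bare regex above starts with neither. A guessed sketch of that parsing rule (not Nutch's actual code):

```python
# Hypothetical sketch of how a urlfilter rule line is parsed: the first
# character selects accept ('+') or reject ('-'); anything else is an error.
def parse_rule(line: str):
    sign, regex = line[0], line[1:]
    if sign not in "+-":
        raise ValueError("Invalid first character: " + line)
    return sign, regex

print(parse_rule("-[?*!@=]"))  # a valid reject rule
# parse_rule("^[0-9]{1,45}$") would raise: Invalid first character
```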
> >> >>>>
> >> >>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
> >> >>>> Thanks in advance!
> >> >>>>
> >> >>>>
> >> >>>> Renato M.
> >> >>>>
> >> >>>
> >> >>>
> >>
>
