Hi Tejas,

Thank you very much for your help again. Unfortunately I am still not
able to get the next link into my crawldb. I suspect that my
conf/regex-urlfilter.txt file is not set up properly. I am pasting the
content of that file below; could you help me determine what is wrong
with it? Thanks a ton in advance!
Renato M.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

#+http://www.xyz.com/\?page=*
+http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
+.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.

# accept anything else
+.

2013/5/13 Tejas Patil <[email protected]>:
> Hi Renato,
>
> The default content limit for the http protocol is 65536 bytes, while
> the webpage is much bigger than that, so the relevant config needs to
> be updated. Add this to conf/nutch-site.xml:
>
> <property>
>   <name>http.content.limit</name>
>   <value>240000</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> After this config change I got a connection timed out error (which
> makes sense, as there is now more content to download).
> So I added this to conf/nutch-site.xml as well:
>
> <property>
>   <name>http.timeout</name>
>   <value>1000000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> After running a fresh crawl, I could see the link to the next page in
> the crawldb:
>
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> key: net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> baseUrl: null
> status: 1 (status_unfetched)
> fetchTime: 1368424541731
> prevFetchTime: 0
> fetchInterval: 2592000
> retriesSinceFetch: 0
> modifiedTime: 0
> prevModifiedTime: 0
> protocolStatus: (null)
> parseStatus: (null)
> title: null
> score: 0.0042918455
> markers: {dist=1}
> reprUrl: null
> metadata _csh_ : (binary)
>
> HTH
>
> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
>> Hi Tejas,
>>
>> So I started fresh. I deleted the webpage keyspace, as I am using
>> Cassandra as a backend, but I got the same output: after a readdb
>> -dump I see a bunch of urls, but not the ones I want. I get only one
>> fetched site and many links parsed (to be parsed in the next cycle?).
>> Maybe it has something to do with the urls I am trying to get?
>> I am trying to fetch this url and similar ones:
>>
>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>>
>> But I have noticed that the links pointing to the next pages look
>> like this:
>>
>> <a class="resultado_roda"
>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>>
>> So I tried commenting out this rule:
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> But I got the same results: a single site fetched, some urls parsed,
>> but not the ones I want via the regex-urlfilter.txt. Any ideas?
>> Thanks a ton for your help Tejas!
>>
>>
>> Renato M.
>>
>>
>> 2013/5/12 Tejas Patil <[email protected]>:
>> > Hi Renato,
>> >
>> > That's weird. I ran a crawl over similar urls having a query at the
>> > end (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me
>> > with 2.x. My guess is that something goes wrong while parsing, due
>> > to which outlinks are not getting into the crawldb.
>> >
>> > Start fresh. Clear everything from previous attempts (including the
>> > backend table named by the value of 'storage.schema.webpage').
>> > Run these:
>> >
>> > bin/nutch inject <urldir>
>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
>> > bin/nutch fetch <batchID> -threads 2
>> > bin/nutch parse <batchID>
>> > bin/nutch updatedb
>> > bin/nutch readdb -dump <output dir>
>> >
>> > The readdb output will show whether the outlinks were extracted
>> > correctly.
>> >
>> > The commands for checking urlfilter rules accept one input url at a
>> > time from the console (you need to type/paste the url and hit enter).
>> > They print "+" if the url is accepted by the current rules ("-" for
>> > rejection).
>> >
>> > Thanks,
>> > Tejas
>> >
>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
>> > [email protected]> wrote:
>> >
>> >> And I did try the commands you told me, but I am not sure how they
>> >> work. They do wait for a url to be input, but then they print the
>> >> url with a '+' at the beginning. What does that mean?
>> >>
>> >> http://www.xyz.com/lanchon
>> >> +http://www.xyz.com/lanchon
>> >>
>> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
>> >> > Hi Tejas,
>> >> >
>> >> > Thanks for your help. I have tried the expression you suggested,
>> >> > and now my url-filter file looks like this:
>> >> >
>> >> > +http://www.xyz.com/\?page=*
>> >> >
>> >> > # skip URLs containing certain characters as probable queries, etc.
>> >> > #-[?*!@=]
>> >> > +.
>> >> >
>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> >> > +.
>> >> >
>> >> > # accept anything else
>> >> > +.
>> >> >
>> >> > After this I ran a generate command with -topN 5 -depth 5, and
>> >> > then a fetch all, but I keep getting a single page fetched. What
>> >> > am I doing wrong? Thanks again for your help.
>> >> >
>> >> >
>> >> > Renato M.
>> >> >
>> >> > 2013/5/12 Tejas Patil <[email protected]>:
>> >> >> FYI: you can use either of these commands to run the
>> >> >> regex-urlfilter rules against any given url:
>> >> >>
>> >> >> bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> >> OR
>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> >>
>> >> >> Both of them accept one input url at a time from stdin.
>> >> >> The latter has a param which lets you test a given url against
>> >> >> several url filters at once. See its usage for more details.
>> >> >>
>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
>> >> >>
>> >> >>> If there is no restriction on the number at the end of the url,
>> >> >>> you might just use this (note that the rule must be above the
>> >> >>> one which filters urls with a "?" character):
>> >> >>>
>> >> >>> +http://www.xyz.com/\?page=*
>> >> >>>
>> >> >>> # skip URLs containing certain characters as probable queries, etc.
>> >> >>> -[?*!@=]
>> >> >>>
>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> >> >>> [email protected]> wrote:
>> >> >>>
>> >> >>>> Hi all,
>> >> >>>>
>> >> >>>> I have been trying to fetch a query similar to:
>> >> >>>>
>> >> >>>> http://www.xyz.com/?page=1
>> >> >>>>
>> >> >>>> But where the number can vary from 1 to 100.
Inside the first page
>> >>>> there are links to the next ones. So I updated the
>> >>>> conf/regex-urlfilter file and added:
>> >>>>
>> >>>> ^[0-9]{1,45}$
>> >>>>
>> >>>> When I do this, the generate job fails saying "Invalid first
>> >>>> character". I have also tried generating with topN 5 and depth 5
>> >>>> to fetch more urls, but that does not work either.
>> >>>>
>> >>>> Could anyone advise me on how to accomplish this? I am running
>> >>>> Nutch 2.x. Thanks in advance!
>> >>>>
>> >>>>
>> >>>> Renato M.
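To summarize the configuration side of this thread: the two overrides Tejas suggested can live together in conf/nutch-site.xml. The values below are the ones posted in the thread; tune them for your own crawl:

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Raise the per-page download cap; the default of 65536 bytes
       truncates this page, which is why the outlink to ?pagina=2
       was being lost. -->
  <property>
    <name>http.content.limit</name>
    <value>240000</value>
  </property>

  <!-- Larger pages take longer to download, so raise the network
       timeout (in milliseconds) as well. -->
  <property>
    <name>http.timeout</name>
    <value>1000000</value>
  </property>

</configuration>
```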
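On the filter file itself: the regex urlfilter applies rules top-down, and the first pattern found anywhere in the URL decides accept ("+") or reject ("-"). Note also that "+", "?" and "." are regex metacharacters, so the telelistas rule as written in the file never actually matches its own URL; the page is accepted only by the "+." catch-all. A rough Python approximation of this behavior (Nutch uses Java regexes, but these particular patterns behave the same in both engines, so treat this as a sketch, not the real filter):

```python
import re

# Rules from the conf/regex-urlfilter.txt at the top of the thread,
# abbreviated, in file order.  The first pattern *found* anywhere in
# the URL wins: "+" means accept, "-" means reject.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$"),  # suffix list shortened
    # The site rule exactly as written in the file.  The unescaped "+"
    # means "one or more of the preceding character" and the unescaped
    # "?" makes the preceding "/" optional, so this pattern does NOT
    # match the literal "+" signs in the real URL.
    ("+", r"http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*"),
    ("+", r"."),  # catch-all: accept anything else
]

def filter_url(url):
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign
    return "-"  # no rule matched: rejected

url = ("http://www.telelistas.net/rj/rio+de+janeiro/"
       "lanchonetes+restaurantes/?pagina=2")

# The site rule alone does not match ...
print(re.search(RULES[2][1], url))  # None
# ... so the URL only survives because of the "+." catch-all:
print(filter_url(url))              # +

# With the metacharacters escaped, the site rule matches as intended:
escaped = (r"http://www\.telelistas\.net/rj/rio\+de\+janeiro/"
           r"lanchonetes\+restaurantes/\?pagina=")
print(re.search(escaped, url) is not None)  # True
```

This also explains why commenting out the repeated-segment rule changed nothing: with "-[?*!@=]" already commented, every URL falls through to "+." regardless.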
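Finally, the "Invalid first character" error from the original question comes from the rule-file syntax: every non-comment line must begin with "+" or "-", so a bare regex like ^[0-9]{1,45}$ is rejected while the file is being read. A small Python sketch of that check, plus a hypothetical rule expressing the "?page=<number>" intent (www.xyz.com is the placeholder host from the thread, and the rule shown is an illustration, not from the source):

```python
import re

def parse_rule(line):
    """Rough mimic of how Nutch reads a regex-urlfilter rule line:
    '#' starts a comment; otherwise the first character must be
    '+' (accept) or '-' (reject), followed by the regex."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # comments and blank lines are ignored
    sign, pattern = line[0], line[1:]
    if sign not in "+-":
        raise ValueError("Invalid first character: " + line)
    return sign, re.compile(pattern)

# A bare regex is rejected, which is the error from the thread:
try:
    parse_rule("^[0-9]{1,45}$")
except ValueError as e:
    print(e)

# A hypothetical rule for "?page=<number>" urls on one site; even with
# a sign, ^[0-9]+$ alone would only match urls that are ALL digits,
# so the full url shape has to be spelled out:
sign, rx = parse_rule(r"+^http://www\.xyz\.com/\?page=[0-9]+$")
print(sign, bool(rx.search("http://www.xyz.com/?page=42")))  # + True
```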