Hi Tejas,

Thank you very much for your help again. Unfortunately I am still not
able to get the next link into my crawldb. I suspect that my
conf/regex-urlfilter.txt file is not set up properly. I am pasting the
content of that file below; could you help me determine what is wrong
with it? Thanks a ton in advance!
Renato M.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

#+http://www.xyz.com/\?page=*
+http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
+.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+.

# accept anything else
+.

2013/5/13 Tejas Patil <[email protected]>:
> Hi Renato,
>
> The default content limit for the http protocol is 65536 bytes, while
> the webpage is much bigger than that, so the relevant config needs to
> be updated. Add this to conf/nutch-site.xml:
>
> <property>
>   <name>http.content.limit</name>
>   <value>240000</value>
>   <description>The length limit for downloaded content using the http
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> After this config change I got a connection timed out error (which
> makes sense, as there is now more content to download).
> So I added this to conf/nutch-site.xml as well:
>
> <property>
>   <name>http.timeout</name>
>   <value>1000000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> After running a fresh crawl, I could see the link to the next page in
> the crawldb:
>
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> key: net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> baseUrl: null
> status: 1 (status_unfetched)
> fetchTime: 1368424541731
> prevFetchTime: 0
> fetchInterval: 2592000
> retriesSinceFetch: 0
> modifiedTime: 0
> prevModifiedTime: 0
> protocolStatus: (null)
> parseStatus: (null)
> title: null
> score: 0.0042918455
> markers: {dist=1}
> reprUrl: null
> metadata _csh_ : (binary)
>
> HTH
>
> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
>> Hi Tejas,
>>
>> So I started fresh. I deleted the webpage keyspace, as I am using
>> Cassandra as a backend, but I got the same output: after a readdb
>> -dump I see a bunch of urls, but not the ones I want. I get only one
>> fetched site and many links parsed (to be parsed in the next cycle?).
>> Maybe it has something to do with the urls I am trying to get?
>> I am trying to fetch this url and similar ones:
>>
>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>>
>> But I have noticed that the links pointing to the next pages look
>> like this:
>>
>> <a class="resultado_roda"
>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>>
>> So I tried commenting out this rule:
>>
>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>
>> But I got the same results: a single site fetched, some urls parsed,
>> but not the ones I want via the regex-urlfilter.txt. Any ideas?
>> Thanks a ton for your help Tejas!
>>
>>
>> Renato M.
>>
>>
>> 2013/5/12 Tejas Patil <[email protected]>:
>> > Hi Renato,
>> >
>> > That's weird. I ran a crawl over similar urls having a query at the
>> > end (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me
>> > with 2.x. My guess is that something goes wrong while parsing, due
>> > to which outlinks are not getting into the crawldb.
>> >
>> > Start fresh. Clear everything from previous attempts (including the
>> > backend table named by the value of 'storage.schema.webpage').
>> > Run these:
>> >
>> > bin/nutch inject <urldir>
>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
>> > bin/nutch fetch <batchID> -threads 2
>> > bin/nutch parse <batchID>
>> > bin/nutch updatedb
>> > bin/nutch readdb -dump <output dir>
>> >
>> > The readdb output will show whether the outlinks were extracted
>> > correctly.
>> >
>> > The commands for checking urlfilter rules accept one input url at a
>> > time from the console (you need to type/paste the url and hit enter).
>> > They print "+" if the url is accepted by the current rules ("-" for
>> > rejection).
>> >
>> > Thanks,
>> > Tejas
>> >
>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
>> > [email protected]> wrote:
>> >
>> >> And I did try the commands you told me, but I am not sure how they
>> >> work. They do wait for a url to be input, but then they print the
>> >> url with a '+' at the beginning. What does that mean?
>> >>
>> >> http://www.xyz.com/lanchon
>> >> +http://www.xyz.com/lanchon
>> >>
>> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
>> >> > Hi Tejas,
>> >> >
>> >> > Thanks for your help. I have tried the expression you suggested,
>> >> > and now my url-filter file looks like this:
>> >> >
>> >> > +http://www.xyz.com/\?page=*
>> >> >
>> >> > # skip URLs containing certain characters as probable queries, etc.
>> >> > #-[?*!@=]
>> >> > +.
>> >> >
>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> >> > +.
>> >> >
>> >> > # accept anything else
>> >> > +.
>> >> >
>> >> > After this I ran a generate command with -topN 5 -depth 5, and
>> >> > then a fetch all, but I keep getting a single page fetched. What
>> >> > am I doing wrong? Thanks again for your help.
>> >> >
>> >> >
>> >> > Renato M.
>> >> >
>> >> > 2013/5/12 Tejas Patil <[email protected]>:
>> >> >> FYI: you can use either of these commands to run the
>> >> >> regex-urlfilter rules against any given url:
>> >> >>
>> >> >> bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> >> OR
>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> >>
>> >> >> Both of them accept one input url at a time from stdin.
>> >> >> The latter has a param which lets you test a given url against
>> >> >> several url filters at once. See its usage for more details.
>> >> >>
>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
>> >> >>
>> >> >>> If there is no restriction on the number at the end of the url,
>> >> >>> you might just use this (note that the rule must be above the
>> >> >>> one which filters urls with a "?" character):
>> >> >>>
>> >> >>> +http://www.xyz.com/\?page=*
>> >> >>>
>> >> >>> # skip URLs containing certain characters as probable queries, etc.
>> >> >>> -[?*!@=]
>> >> >>>
>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> >> >>> [email protected]> wrote:
>> >> >>>
>> >> >>>> Hi all,
>> >> >>>>
>> >> >>>> I have been trying to fetch a query similar to:
>> >> >>>>
>> >> >>>> http://www.xyz.com/?page=1
>> >> >>>>
>> >> >>>> But where the number can vary from 1 to 100.
Inside the first page
>> >>>> there are links to the next ones. So I updated the
>> >>>> conf/regex-urlfilter file and added:
>> >>>>
>> >>>> ^[0-9]{1,45}$
>> >>>>
>> >>>> When I do this, the generate job fails saying "Invalid first
>> >>>> character". I have also tried generating with topN 5 and depth 5
>> >>>> to fetch more urls, but that does not work either.
>> >>>>
>> >>>> Could anyone advise me on how to accomplish this? I am running
>> >>>> Nutch 2.x. Thanks in advance!
>> >>>>
>> >>>>
>> >>>> Renato M.
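To summarize the configuration side of this thread: the two overrides Tejas suggested can live together in conf/nutch-site.xml. The values below are the ones posted in the thread; tune them for your own crawl:

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Raise the per-page download cap; the default of 65536 bytes
       truncates this page, which is why the outlink to ?pagina=2
       was being lost. -->
  <property>
    <name>http.content.limit</name>
    <value>240000</value>
  </property>

  <!-- Larger pages take longer to download, so raise the network
       timeout (in milliseconds) as well. -->
  <property>
    <name>http.timeout</name>
    <value>1000000</value>
  </property>

</configuration>
```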
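On the filter file itself: the regex urlfilter applies rules top-down, and the first pattern found anywhere in the URL decides accept ("+") or reject ("-"). Note also that "+", "?" and "." are regex metacharacters, so the telelistas rule as written in the file never actually matches its own URL; the page is accepted only by the "+." catch-all. A rough Python approximation of this behavior (Nutch uses Java regexes, but these particular patterns behave the same in both engines, so treat this as a sketch, not the real filter):

```python
import re

# Rules from the conf/regex-urlfilter.txt at the top of the thread,
# abbreviated, in file order.  The first pattern *found* anywhere in
# the URL wins: "+" means accept, "-" means reject.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$"),  # suffix list shortened
    # The site rule exactly as written in the file.  The unescaped "+"
    # means "one or more of the preceding character" and the unescaped
    # "?" makes the preceding "/" optional, so this pattern does NOT
    # match the literal "+" signs in the real URL.
    ("+", r"http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*"),
    ("+", r"."),  # catch-all: accept anything else
]

def filter_url(url):
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign
    return "-"  # no rule matched: rejected

url = ("http://www.telelistas.net/rj/rio+de+janeiro/"
       "lanchonetes+restaurantes/?pagina=2")

# The site rule alone does not match ...
print(re.search(RULES[2][1], url))  # None
# ... so the URL only survives because of the "+." catch-all:
print(filter_url(url))              # +

# With the metacharacters escaped, the site rule matches as intended:
escaped = (r"http://www\.telelistas\.net/rj/rio\+de\+janeiro/"
           r"lanchonetes\+restaurantes/\?pagina=")
print(re.search(escaped, url) is not None)  # True
```

This also explains why commenting out the repeated-segment rule changed nothing: with "-[?*!@=]" already commented, every URL falls through to "+." regardless.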
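Finally, the "Invalid first character" error from the original question comes from the rule-file syntax: every non-comment line must begin with "+" or "-", so a bare regex like ^[0-9]{1,45}$ is rejected while the file is being read. A small Python sketch of that check, plus a hypothetical rule expressing the "?page=<number>" intent (www.xyz.com is the placeholder host from the thread, and the rule shown is an illustration, not from the source):

```python
import re

def parse_rule(line):
    """Rough mimic of how Nutch reads a regex-urlfilter rule line:
    '#' starts a comment; otherwise the first character must be
    '+' (accept) or '-' (reject), followed by the regex."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # comments and blank lines are ignored
    sign, pattern = line[0], line[1:]
    if sign not in "+-":
        raise ValueError("Invalid first character: " + line)
    return sign, re.compile(pattern)

# A bare regex is rejected, which is the error from the thread:
try:
    parse_rule("^[0-9]{1,45}$")
except ValueError as e:
    print(e)

# A hypothetical rule for "?page=<number>" urls on one site; even with
# a sign, ^[0-9]+$ alone would only match urls that are ALL digits,
# so the full url shape has to be spelled out:
sign, rx = parse_rule(r"+^http://www\.xyz\.com/\?page=[0-9]+$")
print(sign, bool(rx.search("http://www.xyz.com/?page=42")))  # + True
```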