Hi
I have a problem with Nutch. My project is link analysis. I crawled www.mersin.edu.tr, analysed the linkdb, and I can see everything about the mersin.edu.tr links. But I also need to find links to other sites, for example www.tubitak.gov.tr, and I cannot find them. How do I find these links? Please help me.
Re: Hi
Did you check crawl-urlfilter.txt? All the domain names that you'd like to crawl have to be mentioned there, e.g.:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*mersin\.edu\.tr/
+^http://([a-z0-9]*\.)*tubitak\.gov\.tr/

Also check the property db.ignore.external.links in nutch-default.xml. It should be set to false.

2010/5/5 Zehra Göçer zgocer...@hotmail.com
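To override that setting, one would normally copy the property into nutch-site.xml rather than edit nutch-default.xml. A minimal sketch (the property name comes from the mail above; the description text is paraphrased, not the shipped wording):

```xml
<!-- nutch-site.xml: allow the crawler to follow links to other hosts -->
<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If false, outlinks pointing at external hosts (e.g. from
  mersin.edu.tr to tubitak.gov.tr) are kept in the crawldb/linkdb,
  provided the URL filters also accept them.</description>
</property>
```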
Re: Hi, and help with inject scoring...
Excellent, I'll have a look at the patch. Thanks, T On 23/03/2010 19:25, Julien Nioche wrote: Hi Toby, Have a look at https://issues.apache.org/jira/browse/NUTCH-655 The patch has been committed to the SVN repository and should allow you to do exactly what you described. HTH Julien
Hi, and help with inject scoring...
Hi Nutch list, We're using Nutch for what basically amounts to an intranet crawl (just a few domains). We have a HUGE inject list, as the site contains a lot of Ajax pages. What I'm wondering is: is there a simple way of getting the injected URLs to have a higher default score than URLs added during the normal crawl? I've tried upping the default score, but that also modifies the score URLs get when they're added from the crawl. Many thanks, Toby.
Re: Hi, and help with inject scoring...
Hi Toby,

Have a look at https://issues.apache.org/jira/browse/NUTCH-655 The patch has been committed to the SVN repository and should allow you to do exactly what you described.

HTH
Julien
--
DigitalPebble Ltd http://www.digitalpebble.com
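For what it's worth, the NUTCH-655 patch lets the Injector read per-URL metadata from the seed list itself. If memory serves, the score key is nutch.score, but treat the key name and the values below as assumptions to verify against the patch:

```
# seed list sketch: tab-separated custom metadata after each URL
# (hypothetical host and score values)
http://www.example.com/ajax/page1	nutch.score=10.0
http://www.example.com/ajax/page2	nutch.score=10.0
```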
Re: hi Kubes:the question about develop environment!
hi kubes: Thank you for your answers! I'm sorry that I didn't express my question well. I run Nutch on only one machine, and I can't debug Hadoop in Nutch because Hadoop only exists as a jar in lib. How can I debug the Hadoop source in Nutch? And to my surprise, the RunNutchInEclipse tutorial for 1.0 doesn't start and configure Hadoop (master, listen port, etc.). When I debug Nutch through a breakpoint, it displays: "there is no source file attached to the class file URLClassPath.class". Why? Can Hadoop run in a VMware machine? I also met other problems; they are in another message, "run nutch on eclipse problem". Thanks!!!

Dennis Kubes-2 wrote:
askNutch wrote: hi Kubes: You are the expert! Can you tell me what development environment you use to develop Nutch, such as the IDE, etc.? I want to debug Nutch.

Linux, Ubuntu (usually the most recent), Sun JDK, Core 2 laptop (although hoping to upgrade to a sagernotebook.com quad core soon :) ), Eclipse stable (3.4 I think).

Debugging MapReduce, and hence Nutch, jobs is difficult. The main reason is that Hadoop/Nutch spin up a new JVM for each map and reduce task, so it is difficult to connect to that JVM as it is created and launched automagically. Here are some options, depending on what you are trying to debug:

1) Run all Hadoop server processes (namenode, etc.) through Eclipse using the internal debugger. This isn't always the best way; it is usually only used when debugging some part of the Hadoop infrastructure, such as socket communication.

2) Run most of the Hadoop servers in separate processes, and run the tasktracker inside of Eclipse with the internal debugger. This is mainly used when debugging a specific MapRunner, MapTask, or ReduceTask interacting with Hadoop. You won't be able to debug the map or reduce task itself, just the communication with the Hadoop server, for instance reporting status.

3) Debugging the map/reduce task itself: logging. Judicious logging is most often what I use. Also use very small examples if you can help it, to give yourself short turnaround times. Unless your problem occurs only on a large dataset, don't debug on a large dataset.

Hope this helps.
Dennis

-- View this message in context: http://www.nabble.com/hi-%3Athe-question-about-develop-environment%21-tp23170026p23191120.html Sent from the Nutch - User mailing list archive at Nabble.com.
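One more option from that era, not mentioned above (so treat it as an assumption): have the child JVMs wait for a remote debugger by passing JDWP flags through Hadoop's mapred.child.java.opts. A sketch follows; the property existed in Hadoop of that vintage, but the exact flags and port are illustrative, and suspend=y will stall every task until a debugger attaches:

```xml
<!-- hadoop-site.xml sketch: make each map/reduce child JVM listen for a
     debugger on port 8000; attach from Eclipse as a Remote Java Application -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000</value>
</property>
```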
Re: hi Kubes:the question about develop environment!
askNutch wrote: I run Nutch on only one machine, and I can't debug Hadoop in Nutch because Hadoop only exists as a jar in lib. How can I debug the Hadoop source in Nutch?

Build Hadoop from scratch, or run it inside of Eclipse as a project. You will have to start up each Hadoop server manually through an Eclipse launcher, and you will have to have the project source as part of the debugger source.

askNutch wrote: And to my surprise, the RunNutchInEclipse tutorial for 1.0 doesn't start and configure Hadoop (master, listen port, etc.). When I debug Nutch through a breakpoint, it displays: "there is no source file attached to the class file URLClassPath.class". Why?

When running it through Eclipse you will also need to remove the lib hadoop jar from Nutch (or at least from the classpath in Eclipse) and put in the Hadoop project. This way it pulls from the Hadoop source code and will display the source file.

askNutch wrote: Can Hadoop run in a VMware machine?

Probably yes. Many people run it under Xen; I don't know if there is that much difference. I wouldn't see why there would be a problem as long as it can get socket access.

Dennis
Re: hi Kubes:the question about develop environment!
On Thu, Apr 23, 2009 at 12:09 PM, askNutch hehehah...@126.com wrote: can hadoop run in vmware machine? I am running a Hadoop cluster where each node is a VMware virtual machine. So, yes, it is possible. As long as you are able to connect to sockets from one virtual machine to another, I don't see why you can not run Hadoop in VMware virtual machines. Regards, Susam Pal
Re: hi Kubes:the question about develop environment!
Why not send such mails privately if you are addressing a single person? Or do you want to hear other opinions as well?

Best Regards
Alexander Aristov

2009/4/22 askNutch hehehah...@126.com: hi Kubes: You are the expert! Can you tell me what development environment you use to develop Nutch, such as the IDE, etc.? I want to debug Nutch. thank you !!!
Re: hi Kubes:the question about develop environment!
askNutch wrote: hi Kubes: You are the expert! Can you tell me what development environment you use to develop Nutch, such as the IDE, etc.? I want to debug Nutch.

Linux, Ubuntu (usually the most recent), Sun JDK, Core 2 laptop (although hoping to upgrade to a sagernotebook.com quad core soon :) ), Eclipse stable (3.4 I think).

Debugging MapReduce, and hence Nutch, jobs is difficult. The main reason is that Hadoop/Nutch spin up a new JVM for each map and reduce task, so it is difficult to connect to that JVM as it is created and launched automagically. Here are some options, depending on what you are trying to debug:

1) Run all Hadoop server processes (namenode, etc.) through Eclipse using the internal debugger. This isn't always the best way; it is usually only used when debugging some part of the Hadoop infrastructure, such as socket communication.

2) Run most of the Hadoop servers in separate processes, and run the tasktracker inside of Eclipse with the internal debugger. This is mainly used when debugging a specific MapRunner, MapTask, or ReduceTask interacting with Hadoop. You won't be able to debug the map or reduce task itself, just the communication with the Hadoop server, for instance reporting status.

3) Debugging the map/reduce task itself: logging. Judicious logging is most often what I use. Also use very small examples if you can help it, to give yourself short turnaround times. Unless your problem occurs only on a large dataset, don't debug on a large dataset.

Hope this helps.
Dennis
Re: hi Kubes:the question about develop environment!
Alexander Aristov wrote: Why not send such mails privately if you are addressing a single person? Or do you want to hear other opinions as well?

I would :)

Dennis
Re: hi Kubes:the question about develop environment!
My environment is Windows Vista, MyEclipse 7.0, Sun JDK 6. This is enough in most cases, when I don't deal with Hadoop debugging. For Hadoop I run virtual Fedora Linux and again use Eclipse. But I usually improve/develop plugins, so Vista is enough.

Best Regards
Alexander Aristov
hi Kubes:the question about develop environment!
hi Kubes: You are the expert! Can you tell me what development environment you use to develop Nutch, such as the IDE, etc.? I want to debug Nutch. thank you !!!
Re: Hi What is the use of refine-query-init.jsp,refine-query.jsp
inalasuresh wrote: Hi, I uncommented refine-query.jsp and refine-query-init.jsp in search.jsp. I searched for the keyword "bike" and it gave results. Before that, I tried running the application both with and without those comments, and it gave the same results. So please, can anyone suggest what the use of refine-query-init.jsp and refine-query.jsp is, and what the final result of uncommenting these JSPs is? Thanx, regards, suresh

These two JSP files are part of the ontology extension point. Basically, plugins extending this extension point (currently the ontology plugin) implement two functions, getSynonyms() and getSubclasses(). The ontology plugin thus provides synonyms (from WordNet) and subclasses (from the defined ontologies) for search query refinement. You should enable the ontology plugin and add some ontology URL to the configuration. You can check the ontology plugin's readme file.
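To make the flow concrete, here is a small self-contained sketch of how a search page might use those two functions to expand a query. Everything below is invented for illustration (the real extension point's signatures and the JSPs' behaviour should be checked against the plugin source); the WordNet and ontology lookups are stubbed with hard-coded data:

```java
import java.util.Arrays;
import java.util.List;

// Invented sketch of ontology-based query refinement; not the real Nutch API.
public class QueryRefinementSketch {

    // Stub for getSynonyms(): a real implementation would consult WordNet.
    static List<String> getSynonyms(String term) {
        return term.equals("bike") ? Arrays.asList("bicycle", "cycle")
                                   : Arrays.<String>asList();
    }

    // Stub for getSubclasses(): a real implementation would walk the
    // ontologies configured for the plugin.
    static List<String> getSubclasses(String term) {
        return term.equals("vehicle") ? Arrays.asList("bike", "car")
                                      : Arrays.<String>asList();
    }

    // What refine-query.jsp conceptually does: offer an expanded query.
    static String refine(String query) {
        StringBuilder sb = new StringBuilder(query);
        for (String s : getSynonyms(query)) {
            sb.append(" OR ").append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(refine("bike")); // prints "bike OR bicycle OR cycle"
    }
}
```

With no ontology plugin enabled (or with the JSPs commented out), there is nothing to supply refinements, which is why the search results themselves look identical either way.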
Hi What is the use of refine-query-init.jsp,refine-query.jsp
Hi, I uncommented refine-query.jsp and refine-query-init.jsp in search.jsp. I searched for the keyword "bike" and it gave results. Before that, I tried running the application both with and without those comments, and it gave the same results. So please, can anyone suggest what the use of refine-query-init.jsp and refine-query.jsp is, and what the final result of uncommenting these JSPs is? Thanx, regards, suresh
Hi what is the use of subcollections.xml
Hi, can anyone help me? I am new to Nutch. What is the use of subcollections.xml, and when is it called? Please respond to my query. Thanx, regards, suresh
Re: Hi what is the use of subcollections.xml
inalasuresh wrote: Hi, can anyone help me? I am new to Nutch. What is the use of subcollections.xml, and when is it called? Thanx, regards, suresh

Hi, Subcollections is a plugin for indexing the URLs that match a regular expression, and subcollections.xml is the configuration file it uses:

<subcollection>
  <name>nutch</name>
  <id>nutch</id>
  <whitelist>http://lucene.apache.org/nutch/</whitelist>
  <blacklist/>
</subcollection>

When this plugin is enabled, Nutch adds a field named "subcollection" to the index, with the value "nutch" for the URL http://lucene.apache.org/nutch/. Refer to the plugin's readme file.
Hi...How to set Nutch-0.8.1 to save logs into log files when running the crawl job?
Hi, how do I set Nutch-0.8.1 to save logs into log files when running the crawl job? Is it set in nutch-site.xml, or in another configuration file? Thanks for your help in advance!
--
kevin
Re: Hi...How to set Nutch-0.8.1 to save logs into log files when running the crawl job?
You can play around with these two, by setting them to true in your nutch-site.xml file. Hadoop logs just about everything to logs/hadoop.log. The file rolls over each day automatically, with .year-month-day appended to the name of the old file.

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, fetcher will log more verbosely.</description>
</property>

----- Original Message ----
From: kevin [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Thursday, December 21, 2006 10:55:38 PM
Subject: Hi...How to set Nutch-0.8.1 to save logs into log files when running the crawl job?
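The daily-rolling behaviour described above is driven by Nutch 0.8's Log4j configuration rather than nutch-site.xml, so the log destination itself can be changed in conf/log4j.properties. A sketch follows; the appender and logger names are from memory of Hadoop-era defaults, so verify them against the file shipped with your release:

```properties
# conf/log4j.properties sketch: send everything to logs/hadoop.log,
# rolling the file daily with a .yyyy-MM-dd suffix on the old file
log4j.rootLogger=INFO,DRFA
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=logs/hadoop.log
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n
```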
hi all
hi, I have a problem now. I want to crawl only the pages whose URL contains "item_detail", but I must start crawling from www..com, and if I set rules in crawl-urlfilter.txt I can't get the pages I want at all. So what do I need to do now? Should I do something with regex-urlfilter.txt, or something else?
--
www.babatu.com
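One common pattern, sketched here with example.com standing in for the real host (the original message elides the domain, so this is purely illustrative): accept the item_detail pages explicitly, but also accept the pages the crawler must pass through to reach them, because a filter that admits only item_detail URLs leaves the crawler no path from the start page:

```
# crawl-urlfilter.txt sketch (hypothetical host)
# keep the detail pages we actually care about
+item_detail
# also accept other pages on the host so the crawler can reach the
# detail pages from the start page; tighten this as needed
+^http://www\.example\.com/
# reject everything else
-.
```

Note that pages matched only by the broader rule will still be fetched and indexed, so a second filtering pass or a different strategy may be needed if they must not appear in search results.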
hi all
hi all: I have a big problem when crawling FTP. It seems that Nutch couldn't parse or index the files named in Chinese. The command looks like: bin/nutch crawl urls.txt -dir test.dir (I've modified crawl-urlfilter.txt):

# skip file:, ftp:, mailto: urls
#-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept hosts in MY.DOMAIN.NAME
+^ftp://*

When I search for something in Tomcat 5.0.28, the results are messy characters. So can anyone tell me anything helpful to solve this big problem? Any reply will be appreciated.
--
www.babatu.com
Re: hi all
thx for the advice! Now I know what's up. But my OS is WinXP (Chinese), and it supports Chinese very well. And I used Luke to see the index, and there are messy characters when I crawl the Chinese webs. So, how can I deal with it? Any reply will be appreciated.

On 4/2/06, Dan Morrill [EMAIL PROTECTED] wrote:
Good Morning Kauu, I have noticed that Nutch only knows about UTF-8 character codes, so the simplified Chinese character set in UTF-8 should come out OK. If the crawl sees Chinese in a non-UTF-8 encoding, the web site may be serving it under an older ISO standard, or you may not have the language pack installed to properly support Chinese. Personally, I would download the language pack for your operating system and see what happens. r/d
--
www.babatu.com
RE: hi all
Kauu, are you using the simplified Chinese localization package for Windows XP, or are you using the non-simplified UTF version? You might need an IME from here: http://www.microsoft.com/windows/ie/downloads/recommended/ime/default.mspx That may help out.

Since you are using Luke to see the index, Luke may not have character support built in for non-UTF-8 character sets (meaning gork when you look at it). I went to the Luke site, http://www.getopt.org/luke/, to see if they mention the character sets they support, but nothing there states which character sets are supported. When you run your search, do you see good characters, or do you see gork? Luke may not be able to understand the ISO character sets. (Hypothesis.)

r/d

-----Original Message-----
From: kauu [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 02, 2006 8:31 AM
To: nutch-user@lucene.apache.org
Subject: Re: hi all
Re: hi all
Dan Morrill wrote: Since you are using Luke to see the index, Luke may not have character support built in for non-UTF-8 character sets... Luke may not be able to understand the ISO character sets. (Hypothesis.)

Hi, (I'm the guy behind Luke.)

Luke uses UTF-8, because that's what Lucene stores in the index. You may experience problems with the default font that it uses, i.e. it may not support all Unicode characters. Please try to change the font (in Settings) and see if that helps.

Another frequent source of garbled characters is reading the original content using the wrong encoding, e.g. reading a UTF-8 file using your native platform encoding, like Latin1 or Big5, or the other way around. Then you get broken characters being encoded to UTF-8 when Lucene writes out the index, and restored from UTF-8 to their broken form when Luke reads the index.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
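The wrong-encoding round trip described above can be reproduced in a few lines. A self-contained sketch (ISO-8859-1 here just stands in for whatever single-byte platform encoding does the mis-decoding):

```java
import java.nio.charset.StandardCharsets;

// Reproduces the failure mode described above: UTF-8 bytes are mis-decoded
// with a single-byte platform charset (ISO-8859-1 here), then re-encoded to
// UTF-8 when the index is written. The damage survives a correct UTF-8 read,
// which is why an index viewer like Luke faithfully shows broken characters.
public class MojibakeDemo {

    static String indexThenRead(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        // Wrong step: decoding UTF-8 bytes with the platform's Latin-1
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        // The indexer then stores the misread string as UTF-8 ...
        byte[] stored = misread.getBytes(StandardCharsets.UTF_8);
        // ... and the index viewer reads it back, correctly, as UTF-8
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        String roundTripped = indexThenRead(original);
        System.out.println(original.equals(roundTripped)); // prints false
    }
}
```

Plain ASCII survives the same round trip unchanged (each byte below 0x80 decodes identically in both charsets), which is why such bugs often go unnoticed on English-only text.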
RE: hi all
Andrzej, Cheers! Good to know. Thanks! r/d

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Sunday, April 02, 2006 5:01 PM
To: nutch-user@lucene.apache.org
Subject: Re: hi all