Change of analyzer for specific language
Hi all, how can I change the analyzer used by the indexer for a specific language? Also, can I use all of the analyzers that I see in Luke? Thank you. -- View this message in context: http://www.nabble.com/Change-of-analyzer-for-specific-language-tp16065385p16065385.html Sent from the Nutch - User mailing list archive at Nabble.com.
FW: Problem in running Nutch where proxy authentication is required.
Hi Susam, I have mailed the list twice but the mails bounced back with the following message:

ezmlm-reject: fatal: Sorry, I don't accept messages larger than 10 bytes (#5.2.3)

Thanks Regards, Naveen Goswami

-----Original Message-----
From: Naveen Goswami (WT01 - E-ENABLING)
Sent: Saturday, March 15, 2008 5:01 PM
To: '[EMAIL PROTECTED]'
Cc: 'nutch-user@lucene.apache.org'
Subject: RE: Problem in running Nutch where proxy authentication is required.

Hi Susam, thanks for the help. Yes, I got your earlier mail and I have followed all the steps you gave. I am attaching the hadoop.log and crawl.log for your reference. I used the command below to run the crawl:

bin/nutch crawl urls -dir crawl -depth 1 -threads 1 > crawl.log

Please tell me what the problem is. Thanks Regards, Naveen Goswami 91 9899547886

-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 14, 2008 11:12 PM
To: [EMAIL PROTECTED]
Subject: Re: Problem in running Nutch where proxy authentication is required.

I still can't see any DEBUG logs in your log file. Did you go through my earlier mail? Regards, Susam Pal

On Wed, Mar 12, 2008 at 9:39 PM, [EMAIL PROTECTED] wrote: Hi All, I am facing a problem running Nutch where proxy authentication is required to crawl a site (e.g. google.com, yahoo.com). I am able to crawl sites that do not require proxy authentication from our domain (e.g. abc.com); for those it successfully creates a crawl folder and 5 subfolders. I have put all the values in conf/nutch-site.xml and conf/nutch-default.xml as given. Below are all the entries I modified to run Nutch (settings in urls/urls.txt, conf/crawl-urlfilter.txt, conf/nutch-site.xml, conf/nutch-default.xml). I have also included the crawl.log text for your reference.
While crawling through Cygwin it throws an exception. Please help me out: what do I have to do to run Nutch successfully, i.e. where do I have to put an entry to get through proxy authentication?

Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

===== crawl.log =====
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122052
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122052
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122052]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122101
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122101
Fetcher: threads: 10
fetching http://www.yahoo.com/
fetch of http://www.yahoo.com/ failed with: Http code=407, url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080109122101]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080109122110
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080109122110
Fetcher:
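[Editor's note, not from the thread: the repeated "Http code=407" means the proxy rejected unauthenticated requests. A sketch of the relevant nutch-site.xml settings; the property names reflect the protocol-httpclient proxy-authentication work of that era and the host/port values are placeholders, so verify everything against your own nutch-default.xml.]

```xml
<!-- nutch-site.xml: route fetches through an authenticating proxy.
     proxy.example.com and 8080 are placeholder values. -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<!-- Authentication requires the protocol-httpclient plugin (enable it in
     plugin.includes instead of protocol-http); it reads credentials from
     the file named here, conventionally conf/httpclient-auth.xml. -->
<property>
  <name>http.auth.file</name>
  <value>httpclient-auth.xml</value>
</property>
```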
Thread behaviour in Nutch Crawl
Hi All, could anyone please tell me how threads behave in Nutch? I ran the same test under similar conditions with different numbers of threads. Below is the output:

No. of threads   Time taken (ms)
 1               235407
 2               244569
 3               235594
 4               226555
 5               229323
 6               231400
 7               219391
 8               216384
 9               215756
10               221586

Note the behaviour: one thread takes 235407 ms to crawl, whereas two threads take more time on the same set under similar test conditions. How is that possible? With 3 and 4 threads the time decreases again, then increases with 5 and 6 threads, decreases with 7, 8 and 9 threads, and increases again with 10 threads. Could anyone please tell me why the threads behave like this? As far as I know, increasing the number of threads should decrease the response time.
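[Editor's note, not an answer given in the thread: one factor worth checking is that when the crawl hits only a few hosts, Nutch's per-host politeness settings serialize fetches regardless of thread count, so wall-clock time is dominated by the per-host delay rather than by the thread pool size, and run-to-run variation can easily swamp any difference. The relevant knobs, with the usual defaults; verify against your own nutch-default.xml:]

```xml
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value> <!-- total number of fetcher threads -->
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value> <!-- concurrent requests allowed to a single host -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value> <!-- seconds between successive requests to the same host -->
</property>
```

With `fetcher.threads.per.host` at 1, extra threads only help when the fetch list spans many distinct hosts.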
Re: Change of analyzer for specific language
Hi all, [Follow-up post] I found the method myself.

1. Write a plugin for your own language. You can refer to analysis-de and analysis-fr for how to wrap the Lucene analyzer in your plugin.

2. Add it to your plugin.includes list in nutch-site.xml. You also need to add the language-identifier plugin.

3. [For languages not supported by the language identifier, or if you think the language identifier is too slow:] There is a 50% chance this will fail if you are writing for a European language, and it will certainly fail for an East Asian language. The reason is that when language identification fails, your language is not detected and the default analyzer does the indexing for you. There are two workarounds:

A. Hack the language-identifier plugin.
i. Modify every class except LanguageIdentifier.java. I will not detail every step here because there are too many and I am writing in a rush, but the two principles are: (a) remove every reference to the LanguageIdentifier object, including its declaration and the calls made through that reference (this is much easier with an IDE like NetBeans or Eclipse); (b) remember to change the language variable in the inner class of HTMLLanguageParser, or change the default language returned when all cases fail.
ii. Change langmappings.properties to the actual encodings of your language, including all possible spellings, in lower case, e.g. za = za, zah, utf, utf8. For the full list you can refer to the encodings your iconv supports; most systems support many variants, and you will see your language's variations (utf-8 can appear as utf-8, utf_8, or utf8!). If the target encoding contains - or _, you may also need to include the first part on its own, like utf and utf8 in the example. Then build language-identifier again.
*For XML you need to create your own parser based on HTMLLanguageParser, but you will fall into the default case quite quickly if the XML is written badly enough to declare UTF-8 as its encoding yet contain no lang element.

B. Hack Indexer.java, as mentioned in this post: http://www.mail-archive.com/[EMAIL PROTECTED]/msg05952.html

*For CJK, the default CJKAnalyzer can handle most cases (especially if you convert documents to Unicode), so you can just let zh/ja/ko go through the default case.
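[Editor's note: step 2 above amounts to extending plugin.includes in nutch-site.xml. A sketch for a hypothetical analysis-zh plugin; the base plugin list shown is only an example of a typical default and may differ in your version, so start from the value in your own nutch-default.xml.]

```xml
<!-- nutch-site.xml: enable the language identifier and a hypothetical
     analysis-zh analyzer plugin alongside a typical default plugin set. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|language-identifier|analysis-zh</value>
</property>
```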
Re: Confusion of -depth parameter
Hi all, [This is a follow-up post] I found this was my fault, so I need to crawl one more level than I expected. Thank you.

Vinci wrote: Hi all, I am confused by the -depth parameter.

seed.txt
  url1
    -link1
    -link2
    -link3
    -link4
  url2
    -link5
  ...etc

However, I found that the second-level links (beginning with -link) are not crawled unless I set the depth to 3 rather than 2. Why? Does depth 1 correspond to the seed url file itself?
Missing zh.ngp for zh locale support in language identifier
Hi all, I found that zh.ngp is missing for the zh locale. I have seen this file in a screenshot, but googling the filename returns nothing for me. Can anyone provide this file? Thank you.
incorrect Query tokenization
Hi all, I have changed the NutchAnalyzer used in the indexing phase via a plugin (a plugin based on analysis-fr), but I found the query is still tokenized the old way: it looks like the query parser does not tokenize the query with the same tokenizer that indexed the documents. I checked the index; the documents are indexed as I want. I also checked the hadoop log; all plugins load, including the one that changes the Indexer. However, both from the NutchBean and from the webapp, the tokenization is not correct. How can I fix it? (The fastest solution looks like assigning a language to the query via the language-identifier plugin, but I don't know where to start...)
nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error
I am running nutch 0.9 with tomcat 6.0.14. When I use the NutchBean to search the index, it works fine: I get back results, no errors. I have used tomcat before and it has worked fine. Now I am getting an error searching through tomcat. This is the tomcat error I am seeing in the catalina.out log file:

2008-03-15 15:38:38,715 INFO NutchBean - query request from 192.168.245.58
2008-03-15 15:38:38,717 INFO NutchBean - query: penasquitos
2008-03-15 15:38:38,717 INFO NutchBean - lang: en
Mar 15, 2008 3:38:41 PM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet jsp threw exception
java.lang.NullPointerException
        at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159)
        at org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)

When I run a search using the NutchBean, I see debug log entries in the hadoop.log. When I run the search using Tomcat, I never see any hadoop.log entries. We have 1.4 million indexed pages, taking up 31gb for the nutch/crawl directory. The search term doesn't matter. My guess is it may be a memory error, but I am not seeing it anywhere. Is there a place where I can set the memory footprint for tomcat to use more memory? Or is there another place I should be looking? Thanks in advance for any pointers or assistance. JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services
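[Editor's note on the memory question, which the thread never answers directly: Tomcat's heap is set through the JVM options its startup script picks up. A minimal sketch; the file name follows Tomcat 6 convention, and the heap sizes are example values, not figures from the thread.]

```shell
# bin/setenv.sh -- catalina.sh sources this file if it exists.
# Give the JVM a larger heap; adjust sizes to your hardware.
CATALINA_OPTS="-Xms512m -Xmx1024m"
export CATALINA_OPTS
```

After creating the file, restart Tomcat so the options take effect.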
Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error
Hi, please check the path in the search.dir property in the properties file located in webapps/<nutch_deploy_directory>/WEB-INF/classes, and check whether it is accessible. If you use an absolute path, that can be another source of problems. Hope it helps.

John Mendenhall wrote: I am running nutch 0.9 with tomcat 6.0.14. When I use the NutchBean to search the index, it works fine ... Now I am getting an error searching through tomcat ... java.lang.NullPointerException at org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159) ...
Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error
please check the path of the search.dir in property file (nutch-site.xml) located in webapps/nutch_deploy_directory/WEB-INF/classes, check it is accessible or not. if you use absolute path then this will be another problem

Super! Thanks a bunch! That was it. The property is actually searcher.dir. We always use absolute paths, since it helps tremendously not having to worry about where one is when the process is started. We had moved it from one machine to another and had forgotten to make sure the tomcat process owner 'tomcat' was in the nutch group 'nutch'. Fixed that and it works like a charm. Thanks again! JohnM -- john mendenhall [EMAIL PROTECTED] surf utopia internet services
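[Editor's note: for anyone hitting the same thing, the property lives in the webapp's nutch-site.xml under WEB-INF/classes. A sketch with a placeholder path; the value must point at the crawl directory and be readable by the user Tomcat runs as.]

```xml
<!-- WEB-INF/classes/nutch-site.xml in the deployed search webapp. -->
<property>
  <name>searcher.dir</name>
  <value>/path/to/nutch/crawl</value> <!-- placeholder; must be readable by the tomcat user -->
</property>
```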
Re: nutch 0.9, tomcat 6.0.14, nutchbean okay, tomcat search error
Hi, congrats :) By the way, unless you set permissions other than 755, there is not much permission handling you need to worry about when you use tomcat. One question: did you change the plugin list? What plugins are you using? I wonder how you can get the language of your query...

John Mendenhall wrote: Super! Thanks a bunch! That was it. The property is actually searcher.dir. ... We had moved it from one machine to another and had forgotten to make sure the tomcat process owner 'tomcat' was in the nutch group 'nutch'. Fixed that and it works like a charm.