[jira] Updated: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Enis Soztutar updated NUTCH-289:

Attachment: ipInCrawlDatumDraftV5.1.patch

The version 5 patch does not run on the current build, so I have fixed it and resent the patch (I did not change any code). I think this patch should be included in the trunk.

CrawlDatum should store IP address
----------------------------------

Key: NUTCH-289
URL: http://issues.apache.org/jira/browse/NUTCH-289
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cutting
Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, ipInCrawlDatumDraftV5.patch

If the CrawlDatum stored the IP address of the host of its URL, then one could:
- partition fetch lists on the basis of IP address, for better politeness;
- truncate pages to fetch per IP address, rather than just hostname. This would be a good way to limit the impact of domain spammers.

The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
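The two uses named in the issue (partitioning fetch lists by IP address and capping the pages fetched per IP address) can be illustrated with a small standalone sketch. This is not code from the attached patches and does not use Nutch classes: the class name IpFetchListSketch, the methods groupByIp and partition, and the hostToIp lookup are all hypothetical, and the host-to-IP resolution is passed in as a function so that the sketch makes no assumption about when or how DNS resolution happens.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of Nutch or of the NUTCH-289 patches.
class IpFetchListSketch {

    /** Groups URLs by the IP address of their host, keeping at most maxPerIp URLs per IP. */
    static Map<String, List<String>> groupByIp(List<String> urls,
                                               Function<String, String> hostToIp,
                                               int maxPerIp) {
        Map<String, List<String>> byIp = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                continue; // URL has no host component
            }
            String ip = hostToIp.apply(host);
            if (ip == null) {
                continue; // unresolved host: simply skipped in this sketch
            }
            List<String> bucket = byIp.computeIfAbsent(ip, k -> new ArrayList<>());
            if (bucket.size() < maxPerIp) {
                bucket.add(url); // URLs beyond the per-IP cap are dropped
            }
        }
        return byIp;
    }

    /** Maps an IP address to one of numPartitions fetch lists, so one IP always lands in one list. */
    static int partition(String ip, int numPartitions) {
        return Math.floorMod(ip.hashCode(), numPartitions);
    }
}
```

With a cap per IP address, several hostnames that resolve to the same address share one budget, which is the property the issue describes as useful against domain spammers.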
[jira] Commented: (NUTCH-289) CrawlDatum should store IP address
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12450315 ]

Uros Gruber commented on NUTCH-289:

One question. Why does IP need to be in CrawlDatum and not in metadata?
Re: implement thai language indexing and search
OK. I was able to enable the language identifier plugin by adding its value to the plugin.includes property in nutch-site.xml, but I'm not sure that doing only that will get Thai text recognized and tokenized properly. What else do I have to do? Please help me.

Thanks and regards,
sanjeev.

sanjeev wrote:

hi all,

I've been trying unsuccessfully for the past week to implement the Thai language analyzer with Nutch. One thing I don't understand is why the Thai analyzer belongs to the lucene.analysis package instead of the nutch.analysis package.

I have the Thai NGP file + analyzer (albeit from Lucene) + Nutch 0.8 dev pack.

My question is how to integrate this into Nutch so that when I index and search, it will analyze and search the Thai language correctly. Someone please help, as I'm sure it can be done.

thanks and regards,
sanjeev

--
View this message in context: http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7375203
Sent from the Nutch - Dev mailing list archive at Nabble.com.
Re: implement thai language indexing and search
sanjeev wrote:

ok. I was able to enable the language identifier plugin by adding the value in plugin.includes attribute in nutch-site.xml - but i'm not sure just by doing that I can have thai text recognized and tokenized properly. What else do I have to do ? Please help me.

1. You must create a Thai NGP (Ngram Profile file) so that the language identifier can identify Thai!
2. You must create a Thai analyzer (see for instance the analysis-fr and analysis-de sample analyzers).

Best Regards
Jérôme
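To make the idea behind step 1 concrete, here is a minimal sketch of what an n-gram profile is: a frequency-ranked list of the character sequences found in sample text of one language. It is an illustration only. The class NGramProfileSketch and its methods are hypothetical, and the sketch neither uses the language identifier plugin's own classes nor produces its .ngp file format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of an n-gram profile; not the plugin's implementation or file format.
class NGramProfileSketch {

    /** Counts every character n-gram of length minLen..maxLen that occurs in the text. */
    static Map<String, Integer> countNGrams(String text, int minLen, int maxLen) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = minLen; n <= maxLen; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the n-grams from most to least frequent; ties are broken alphabetically. */
    static List<String> ranked(Map<String, Integer> counts) {
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return grams;
    }
}
```

A profile of this kind is typically built from a large sample of text in the target language; an unknown document is then assigned to the language whose ranked profile it resembles most closely.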
Re: implement thai language indexing and search
Thanks Jerome,

I used an existing ThaiAnalyzer from the Lucene package. I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled it, and placed all the class files in a jar, analysis-th.jar. (Do I need to bundle the NGP file in the jar as well?)

Take a look at the log file for a sample crawl. Somehow I feel the language-identifier is still not activated. I need your help urgently in resolving this issue.

Cheers and regards, and thanks for all your help.
sanjeev.

491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-default.xml
491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/crawl-tool.xml
491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-site.xml
491116 151804 No FS indicated, using default:local
491116 151804 crawl started in: crawlnewxx2
491116 151804 rootUrlFile = urls
491116 151804 threads = 10
491116 151804 depth = 10
491116 151804 Created webdb at LocalFS,C:\cygwin\home\robert\nutch-0.7.2\crawlnewxx2\db
491116 151804 Starting URL processing
491116 151804 Plugins: looking in: C:\cygwin\home\robert\nutch-0.7.2\plugins
491116 151804 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\clustering-carrot2
491116 151804 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\creativecommons
491116 151804 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\index-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\index-more
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\language-identifier
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\ontology
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-ext
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-html\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-js
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-msword
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-pdf
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-rss
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-text\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-file
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-ftp
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-http\plugin.xml
491116 151805 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-httpclient
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-more
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-site\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-url\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-prefix
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
491116 151805 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
491116 151805 found resource regex-urlfilter.txt at file:/C:/cygwin/home/robert/nutch-0.7.2/conf/regex-urlfilter.txt
491116 151805 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
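The log reports "not including" for the language-identifier plugin directory, which suggests that the plugin.includes value in effect for this run does not match that plugin. The sketch below shows how such a mismatch could arise, under the assumption that plugin.includes is treated as a regular expression that must match the whole plugin name (an assumption that is at least consistent with the log, where protocol-http is parsed while protocol-httpclient is skipped). The class PluginIncludesSketch is hypothetical, and the expressions in the usage note are illustrative values, not the poster's actual configuration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of whole-name regex matching; not Nutch's plugin loader.
class PluginIncludesSketch {

    /** Returns true when the whole plugin id matches the includes expression. */
    static boolean included(String includesRegex, String pluginId) {
        return Pattern.matches(includesRegex, pluginId);
    }
}
```

Under that assumption, an expression such as protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url) includes protocol-http but neither protocol-httpclient nor language-identifier, while appending |language-identifier to it would include the latter. It is also worth checking that the nutch-site.xml being edited is the one the crawl actually reads: the log shows the configuration being loaded from a nutch-0.7.2 directory.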
0.7.3 version
Hello committers,

Based on a recent discussion on the nutch user list (Strategic Direction of Nutch), I would like to prepare a 0.7.3 release. The idea is to let people who still use 0.7.2 get rid of the most important bugs and add some small features they need, since the claim is that 0.8.1 is not good for small crawls at the moment. This will also let us work on the 0.8 branch to make it friendlier to small installations.

My proposed approach: if no one objects, I will create a 0.7.3 release in JIRA and ask people to assign issues with patches to it. I do not have a lot of time personally, so I do not plan to do any development myself; I would just take care of high-quality patches and commit them. After some time, when we have gathered a reasonable number of bug fixes/issues, I would prepare the 0.7.3 release.

Any objections or comments?

Regards,
Piotr