[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-11-16 Thread Enis Soztutar (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Enis Soztutar updated NUTCH-289:


Attachment: ipInCrawlDatumDraftV5.1.patch

The version 5 patch does not work with the current build, so I have fixed it
and resent the patch (no code was changed). I think this patch should be
included in the trunk.

 CrawlDatum should store IP address
 --

 Key: NUTCH-289
 URL: http://issues.apache.org/jira/browse/NUTCH-289
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cutting
 Attachments: ipInCrawlDatumDraftV1.patch, 
 ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, 
 ipInCrawlDatumDraftV5.patch


 If the CrawlDatum stored the IP address of the host of its URL, then one
 could:
 - partition fetch lists on the basis of IP address, for better politeness;
 - truncate pages to fetch per IP address, rather than just hostname.  This 
 would be a good way to limit the impact of domain spammers.
 The IP addresses could be resolved when a CrawlDatum is first created for a 
 new outlink, or perhaps during CrawlDB update.
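
The two uses listed above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code in the attached patches: the class name, the partitionByIp method, and the hostToIp resolver are hypothetical, and a real resolver might wrap InetAddress.getByName(host).getHostAddress().

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class IpFetchListSketch {

    /**
     * Groups URLs by the IP address of their host and keeps at most
     * maxPerIp URLs for each address. hostToIp is a hypothetical resolver
     * (host name to IP string); hosts it cannot resolve (null) are skipped.
     */
    static Map<String, List<String>> partitionByIp(
            List<String> urls, Function<String, String> hostToIp, int maxPerIp) {
        Map<String, List<String>> byIp = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            String ip = (host == null) ? null : hostToIp.apply(host);
            if (ip == null) {
                continue; // unresolved host: leave it out of this sketch
            }
            List<String> bucket = byIp.computeIfAbsent(ip, k -> new ArrayList<>());
            if (bucket.size() < maxPerIp) {
                bucket.add(url); // truncate per IP rather than per host name
            }
        }
        return byIp;
    }
}
```

Because the cap is applied per address, many host names that point at one server share a single quota, which is the effect the description has in mind for domain spammers.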

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-11-16 Thread Uros Gruber (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12450315 ]

Uros Gruber commented on NUTCH-289:
---

One question: why does the IP need to be in CrawlDatum and not in the metadata?






Re: implement thai language indexing and search

2006-11-16 Thread sanjeev


OK. I was able to enable the language identifier plugin by adding it to the
value of the plugin.includes property in nutch-site.xml, but I'm not sure
that this alone will get Thai text recognized and tokenized properly.
What else do I have to do? Please help me.

Thanks and regards,
sanjeev.




sanjeev wrote:
 
 Hi all,
 
 I've been trying unsuccessfully for the past week to implement the Thai
 language analyzer with Nutch. One thing I don't understand is why the Thai
 analyzer belongs to the lucene.analysis package instead of the
 nutch.analysis package.
 
 I have the Thai NGP file, the analyzer (albeit from Lucene), and the Nutch
 0.8 dev pack.
 
 My question is how to integrate these into Nutch so that when I index and
 search, it analyzes and searches the Thai language correctly.
 
 Someone please help, as I'm sure it can be done.
 
 Thanks and regards,
 sanjeev
 

-- 
View this message in context: 
http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7375203
Sent from the Nutch - Dev mailing list archive at Nabble.com.



Re: implement thai language indexing and search

2006-11-16 Thread Jérôme Charron

sanjeev wrote:
 
 OK. I was able to enable the language identifier plugin by adding it to the
 value of the plugin.includes property in nutch-site.xml, but I'm not sure
 that this alone will get Thai text recognized and tokenized properly.
 What else do I have to do? Please help me.


1. You must create a Thai NGP (n-gram profile) file so that the language
identifier can identify Thai.
2. You must create a Thai analyzer (see, for instance, the analysis-fr and
analysis-de sample analyzers).
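
To make step 1 more concrete, here is a rough Java sketch of the idea behind an n-gram profile: count the character n-grams of a reference text, then check how much of a sample's n-grams the profile covers. It is an assumption-laden illustration only. The class and method names are hypothetical, and it is neither Nutch's language identifier code nor its NGP file format.

```java
import java.util.HashMap;
import java.util.Map;

public class NGramProfileSketch {

    /** Counts every character n-gram of length 1..maxN found in text. */
    static Map<String, Integer> profile(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    /**
     * Fraction of the sample's n-gram occurrences that also appear in the
     * reference profile: a crude similarity score between 0.0 and 1.0.
     */
    static double overlap(Map<String, Integer> reference, Map<String, Integer> sample) {
        int total = 0;
        int shared = 0;
        for (Map.Entry<String, Integer> e : sample.entrySet()) {
            total += e.getValue();
            if (reference.containsKey(e.getKey())) {
                shared += e.getValue();
            }
        }
        return total == 0 ? 0.0 : (double) shared / total;
    }
}
```

A profile built from a large body of Thai text should score Thai samples well above samples in other scripts, which is what lets an identifier pick a language before choosing an analyzer.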

Best Regards

Jérôme


Re: implement thai language indexing and search

2006-11-16 Thread sanjeev

Thanks Jerome,

I used an existing ThaiAnalyzer from the Lucene package.

OK, I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled it, and
placed all the class files in a jar, analysis-th.jar. (Do I need to bundle
the NGP file in the jar as well?)

Take a look at the log file for a sample crawl (below). Somehow I feel the
language-identifier is still not activated.

I need your help urgently in resolving this issue.

Cheers, regards, and thanks for all your help.

sanjeev.

491116 151804 parsing
file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-default.xml
491116 151804 parsing
file:/C:/cygwin/home/robert/nutch-0.7.2/conf/crawl-tool.xml
491116 151804 parsing
file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-site.xml
491116 151804 No FS indicated, using default:local
491116 151804 crawl started in: crawlnewxx2
491116 151804 rootUrlFile = urls
491116 151804 threads = 10
491116 151804 depth = 10
491116 151804 Created webdb at
LocalFS,C:\cygwin\home\robert\nutch-0.7.2\crawlnewxx2\db
491116 151804 Starting URL processing
491116 151804 Plugins: looking in: C:\cygwin\home\robert\nutch-0.7.2\plugins
491116 151804 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\clustering-carrot2
491116 151804 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\creativecommons
491116 151804 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\index-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\index-more
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\language-identifier
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\ontology
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-ext
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-html\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-js
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-msword
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-pdf
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-rss
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-text\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-file
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-ftp
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-http\plugin.xml
491116 151805 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.http.Http
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-httpclient
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\query-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\query-more
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\query-site\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\query-url\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-prefix
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
491116 151805 impl: point=org.apache.nutch.net.URLFilter
class=org.apache.nutch.net.RegexURLFilter
491116 151805 found resource regex-urlfilter.txt at
file:/C:/cygwin/home/robert/nutch-0.7.2/conf/regex-urlfilter.txt
491116 151805 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
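
The log above reports "not including" for the language-identifier directory, and every path in it points at nutch-0.7.2, so the plugin.includes value in effect for this run apparently did not match that plugin. The Java sketch below shows the kind of check that seems to be involved, assuming plugin.includes is treated as a regular expression that must match the whole plugin id. The class name and both regex values are hypothetical: the first is reconstructed from the plugins the log shows being parsed, and "analysis-th" is an assumed plugin id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesSketch {

    /**
     * Returns the plugin ids accepted by a plugin.includes-style regex,
     * assuming the expression has to match the entire id.
     */
    static List<String> included(String includesRegex, List<String> pluginIds) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> accepted = new ArrayList<>();
        for (String id : pluginIds) {
            if (includes.matcher(id).matches()) {
                accepted.add(id);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
                "index-basic", "language-identifier", "parse-html",
                "parse-text", "protocol-http", "analysis-th");
        // Hypothetical value reconstructed from the plugins the log parsed.
        String fromLog = "nutch-extensionpoints|protocol-http|urlfilter-regex"
                + "|parse-(text|html)|index-basic|query-(basic|site|url)";
        // Hypothetical extended value; "analysis-th" is an assumed plugin id.
        String extended = fromLog + "|language-identifier|analysis-th";
        System.out.println(included(fromLog, ids));
        System.out.println(included(extended, ids));
    }
}
```

If the second expression accepts the plugin here but the crawl still logs "not including", the edited nutch-site.xml is probably not the one this installation reads, or another configuration file overrides it.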











-- 
View this message in context: 
http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7375925
Sent from the Nutch - Dev mailing list archive at Nabble.com.



0.7.3 version

2006-11-16 Thread Piotr Kosiorowski

Hello committers,
Based on a recent discussion on the nutch-user list (Strategic Direction
of Nutch), I would like to prepare a 0.7.3 release. The idea is to let
people who still use 0.7.2 get rid of the most important bugs and to add
some small features they need, since the claim is that 0.8.1 is not good
for small crawls at the moment. That would let us work on the 0.8 branch
so that it becomes more friendly to small installations.
I would approach it this way: if no one objects, I will create a 0.7.3
release in JIRA and ask people to assign issues with patches to it. I do
not have a lot of time personally, so I do not plan to do any development
myself, just take care of high-quality patches and commit them. After
some time, when we have gathered a reasonable number of bug fixes and
issues, I would prepare the 0.7.3 release. Any objections or comments?
Regards
Piotr