[jira] [Commented] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-29 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171725#comment-15171725 ] Tien Nguyen Manh commented on NUTCH-2236: No problem, just to make it run on Hadoop 2.7.1 >

[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171264#comment-15171264 ] Tien Nguyen Manh commented on NUTCH-2234: elasticsearch 2.1.1 uses httpclient 4.3.6 > Upgrade to

[jira] [Updated] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2236: Attachment: NUTCH-2236.patch I run Nutch 1.11 on Hadoop 2.7.1 with this patch. We also need

[jira] [Created] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2236: --- Summary: Upgrade to Hadoop 2.7.1 Key: NUTCH-2236 URL: https://issues.apache.org/jira/browse/NUTCH-2236 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: NUTCH-2234.patch > Upgrade to elasticsearch 2.1.1 >

[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: NUTCH-1687-2.patch Here it is: I updated my initial patch for version 1.11. I

[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: (was: NUTCH-1687-2.patch) > Pick queue in Round Robin >

[jira] [Issue Comment Deleted] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Comment: was deleted (was: I update my initial patch for ver 1.11. I crawl large number of

[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: (was: NUTCH-2234.patch) > Upgrade to elasticsearch 2.1.1 >

[jira] [Updated] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2234: Attachment: NUTCH-2234.patch > Upgrade to elasticsearch 2.1.1 >

[jira] [Created] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-24 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2234: --- Summary: Upgrade to elasticsearch 2.1.1 Key: NUTCH-2234 URL: https://issues.apache.org/jira/browse/NUTCH-2234 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2016-02-24 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: NUTCH-1687-2.patch I updated my initial patch for version 1.11. I crawl a large number

[jira] [Updated] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2225: Attachment: NUTCH-2225.patch > Parsed time not include time to parse >

[jira] [Created] (NUTCH-2225) Parsed time not include time to parse

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2225: --- Summary: Parsed time not include time to parse Key: NUTCH-2225 URL: https://issues.apache.org/jira/browse/NUTCH-2225 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2224: Attachment: NUTCH-2224.patch > Wrong metric compute in Fetcher status report >

[jira] [Created] (NUTCH-2224) Wrong metric compute in Fetcher status report

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2224: --- Summary: Wrong metric compute in Fetcher status report Key: NUTCH-2224 URL: https://issues.apache.org/jira/browse/NUTCH-2224 Project: Nutch Issue

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2223: Attachment: NUTCH-2223.patch Patch for nutch 1.11 > Upgrade xercesImpl to 2.11.0 to fix

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2223: Fix Version/s: 1.13 > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype

[jira] [Created] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2223: --- Summary: Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection Key: NUTCH-2223 URL: https://issues.apache.org/jira/browse/NUTCH-2223

[jira] [Updated] (NUTCH-2223) Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

2016-02-17 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-2223: Fix Version/s: (was: 1.13) > Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-26 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117020#comment-15117020 ] Tien Nguyen Manh commented on NUTCH-961: Can NUTCH-1233: use tika to extract outlink solve that

[jira] [Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116772#comment-15116772 ] Tien Nguyen Manh edited comment on NUTCH-961 at 1/26/16 6:57 AM: Ah yes,

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-25 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116772#comment-15116772 ] Tien Nguyen Manh commented on NUTCH-961: Ah yes, could you explain why we need to parse it twice?

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-24 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114658#comment-15114658 ] Tien Nguyen Manh commented on NUTCH-961: One note on boilerpipe support: it is significantly slower

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2016-01-20 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110217#comment-15110217 ] Tien Nguyen Manh commented on NUTCH-961: I'm using the patch NUTCH-961-1.11-1.patch; it works fine

[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-08-23 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1679: Attachment: NUTCH-1679-2.patch I have another solution. With a new link in DbUpdaterReducer

[jira] [Created] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-1702: --- Summary: Port HostNormalizer to 2.x Key: NUTCH-1702 URL: https://issues.apache.org/jira/browse/NUTCH-1702 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: NUTCH-1702.patch Port HostNormalizer to 2.x --

[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Fix Version/s: 2.3 Port HostNormalizer to 2.x --

[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: NUTCH-1702.patch Port HostNormalizer to 2.x --

[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: (was: NUTCH-1702.patch) Port HostNormalizer to 2.x

[jira] [Updated] (NUTCH-1704) Port DomainBlacklist urlfilter to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1704: Attachment: NUTCH-1704.patch Port DomainBlacklist urlfilter to 2.x

[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

2014-01-15 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1478: Attachment: NUTCH-1478-parse-v2.patch I ported parse-metatags to 2.x; this patch supports

[jira] [Updated] (NUTCH-1705) Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page

2014-01-15 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1705: Attachment: NUTCH-1705.patch Make configuration option for HtmlParser TikaParser to

[jira] [Created] (NUTCH-1705) Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page

2014-01-15 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-1705: --- Summary: Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page Key: NUTCH-1705 URL:

[jira] [Created] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-1701: --- Summary: Make Solr Document Boost as an option Key: NUTCH-1701 URL: https://issues.apache.org/jira/browse/NUTCH-1701 Project: Nutch Issue Type:

[jira] [Updated] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1701: Fix Version/s: 1.8 2.3 Make Solr Document Boost as an option

[jira] [Updated] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-14 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1701: Attachment: NUTCH-1701-2x.patch Make Solr Document Boost as an option

[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2014-01-02 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861142#comment-13861142 ] Tien Nguyen Manh commented on NUTCH-1686: In this patch I also fixed a bug with

[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Issue Type: New Feature (was: Bug) TextMD5Signatue compute on textual content

[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1693: Fix Version/s: 2.3 TextMD5Signatue compute on textual content

[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-02 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861195#comment-13861195 ] Tien Nguyen Manh commented on NUTCH-1693: This patch only works with a minor

[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859364#comment-13859364 ] Tien Nguyen Manh commented on NUTCH-1687: It is nice! Pick queue in Round Robin

[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-29 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: NUTCH-1687.patch Add Apache header; fix lost tail pointer when deleting

[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-29 Thread Tien Nguyen Manh (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1687: Attachment: (was: NUTCH-1687.patch) Pick queue in Round Robin