[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861420#comment-13861420 ] Markus Jelsma commented on NUTCH-1693: -- +1, but this should also be ported to trunk.

[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1693: - Fix Version/s: 1.8 Assignee: Markus Jelsma TextMD5Signatue compute on textual content

[jira] [Updated] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1693: - Attachment: NUTCH-1693-trunk.patch Patch for trunk. This patch works identical to the original

[jira] [Commented] (NUTCH-1693) TextMD5Signatue compute on textual content

2014-01-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861434#comment-13861434 ] Markus Jelsma commented on NUTCH-1693: -- In any case, I think both 2x and 1x should

RE: Nutch Crawl a Specific List Of URLs (150K)

2014-01-03 Thread Markus Jelsma
Hi - Are they exact duplicates? If you inject http://nutch.apache.org/ a thousand times, it is added only once, and crawled only once, until it is scheduled to crawl again. -Original message- From: Bin Wang <binwang...@gmail.com> Sent: Thursday 2nd January 2014 23:13 To:

[jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861452#comment-13861452 ] Markus Jelsma commented on NUTCH-356: - Thanks, I have pushed it to our production

[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861468#comment-13861468 ] Markus Jelsma commented on NUTCH-1691: -- To test whether -D override works you have to

[jira] [Commented] (NUTCH-1647) protocol-http throws unzipBestEffort returned null for some pages

2014-01-03 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861471#comment-13861471 ] Markus Jelsma commented on NUTCH-1647: -- Hmm, http.redirect.max already works on the

[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-03 Thread lufeng (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861502#comment-13861502 ] lufeng commented on NUTCH-1691: --- Like the urlfilter-prefix plugin, we can move WARN code to

Independent Map Reduce to parse Nutch content (Cont.)

2014-01-03 Thread Bin Wang
Hi, I tried to modify the code here to parse the Nutch content data... http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup At the end of this email is a prototype that I have written to run MapReduce to calculate the HTML content length of

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-03 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862237#comment-13862237 ] Tejas Patil commented on NUTCH-1465: Hi [~wastl-nagel], Yes. I think that it should be