[jira] [Created] (NUTCH-1257) Support for the x-robots-tag HTTP Header

2012-01-25 Thread Mike (Created) (JIRA)
Support for the x-robots-tag HTTP Header Key: NUTCH-1257 URL: https://issues.apache.org/jira/browse/NUTCH-1257 Project: Nutch Issue Type: New Feature Components: fetcher

[jira] [Created] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Created) (JIRA)
MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata Key: NUTCH-1258 URL:

[jira] [Updated] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1258: Attachment: NUTCH-1258-1.5-1.patch Patch for 1.5. Adds configuration to read from contentmeta,

[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192992#comment-13192992 ] Markus Jelsma commented on NUTCH-1258: Comments? Tested and things work as expected,

[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Julien Nioche (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192997#comment-13192997 ] Julien Nioche commented on NUTCH-1258: What about using a similar mechanism for the

[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193001#comment-13193001 ] Markus Jelsma commented on NUTCH-1258: That may be a good idea indeed but we need to

[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient

2012-01-25 Thread Ferdy Galema (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193007#comment-13193007 ] Ferdy Galema commented on NUTCH-1086: Seems like a JVM bug, perhaps you could

[jira] [Commented] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193019#comment-13193019 ] Markus Jelsma commented on NUTCH-1258: Ah, the Content-Type detected by Tika is

[jira] [Commented] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-01-25 Thread Markus Jelsma (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193030#comment-13193030 ] Markus Jelsma commented on NUTCH-1259: A solution would be to prevent the type to be

[jira] [Commented] (NUTCH-1086) Rewrite protocol-httpclient

2012-01-25 Thread Oleg Kalnichevski (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193031#comment-13193031 ] Oleg Kalnichevski commented on NUTCH-1086: For what it is worth to you,

[jira] [Commented] (NUTCH-1253) Incompatible neko and xerces versions

2012-01-25 Thread Ferdy Galema (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193042#comment-13193042 ] Ferdy Galema commented on NUTCH-1253: Hi, Looking at the revision history it seems

[jira] [Updated] (NUTCH-1252) SegmentReader -get shows wrong data

2012-01-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1252: Fix Version/s: 1.5 Thanks. Marked for 1.5, keeping it on the radar.

[jira] [Updated] (NUTCH-1256) WebGraph to dump host + score

2012-01-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1256: Attachment: NUTCH-1256-1.5-1.patch Patch introduces new parameter with two mandatory arguments.

[jira] [Updated] (NUTCH-1252) SegmentReader -get shows wrong data

2012-01-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1252: Thanks. Marked for 1.5, keeping it on the radar. SegmentReader -get shows wrong

[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2012-01-25 Thread Sebastian Nagel (Commented) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193115#comment-13193115 ] Sebastian Nagel commented on NUTCH-1113: I had a look at the attached segment