[jira] [Updated] (NUTCH-1293) IndexingFiltersChecker to store detected content type in crawldatum metadata

2012-03-01 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1293: - Attachment: NUTCH-1293-1.5-1.patch Patch for 1.5. IndexingFiltersChecker to

[jira] [Updated] (NUTCH-1293) IndexingFiltersChecker to store detected content type in crawldatum metadata

2012-03-01 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1293: - Attachment: (was: NUTCH-1293-1.5-1.patch) IndexingFiltersChecker to store detected

[jira] [Updated] (NUTCH-1293) IndexingFiltersChecker to store detected content type in crawldatum metadata

2012-03-01 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1293: - Attachment: NUTCH-1293-1.5-1.patch Wrong patch indeed :)

[jira] [Updated] (NUTCH-1291) Fetcher to stringify exception on unexpected exception

2012-02-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1291: - Attachment: NUTCH-1291-1.5-1.patch Patch for 1.5. Fetcher to stringify

[jira] [Updated] (NUTCH-1215) UpdateDB should not require segment as input

2012-02-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1215: - Attachment: NUTCH-1215-1.5-1.patch Patch for 1.5. Couldn't be simpler.

[jira] [Updated] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-09 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1259: - Patch Info: Patch Available TikaParser should not add Content-Type from HTTP Headers to

[jira] [Updated] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-02-09 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1262: - Priority: Minor (was: Major) Patch Info: Patch Available Map `duplicating`

[jira] [Updated] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-02-09 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1258: - Patch Info: Patch Available MoreIndexingFilter should be able to read Content-Type from

[jira] [Updated] (NUTCH-1005) Parse headings plugin

2012-02-07 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1005: - Summary: Parse headings plugin (was: Index headings plugin) Parse headings plugin

[jira] [Updated] (NUTCH-1259) TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata

2012-02-07 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1259: - Attachment: NUTCH-1259-1.5-1.patch Here's a patch for 1.5. Comments? We have this running in

[jira] [Updated] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-02-07 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1258: - Priority: Minor (was: Major) MoreIndexingFilter should be able to read Content-Type from

[jira] [Updated] (NUTCH-1266) Subcollection to optionally write to configured fields

2012-02-06 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1266: - Attachment: NUTCH-1266-1.5-1.patch Patch adds an optional key element. If configured that value

[jira] [Updated] (NUTCH-1266) Subcollection to optionally write to configured fields

2012-02-06 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1266: - Patch Info: Patch Available Subcollection to optionally write to configured fields

[jira] [Updated] (NUTCH-1005) Index headings plugin

2012-02-06 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1005: - Attachment: NUTCH-1005-1.5-5.patch New patch without indexing capabilities. Use NUTCH-1264 for

[jira] [Updated] (NUTCH-1005) Index headings plugin

2012-02-01 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1005: - Attachment: NUTCH-1005-1.5-4.patch New patch as per Julien's comments. Index

[jira] [Updated] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-01-31 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1262: - Attachment: NUTCH-1262-1.5-1.patch Here's a patch for 1.5. It seems to work fine when tested

[jira] [Updated] (NUTCH-1242) Allow disabling of URL Filters in ParseSegment

2012-01-31 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1242: - Attachment: NUTCH-1242-1.5-1.patch Patch for latest trunk. Changed config options from

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-01-30 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1245: - Priority: Critical (was: Major) URL gone with 404 after db.fetch.interval.max stays

[jira] [Updated] (NUTCH-1260) Fetcher should log fetching of redirects

2012-01-27 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1260: - Fix Version/s: 1.5 Fetcher should log fetching of redirects

[jira] [Updated] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata

2012-01-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1258: - Attachment: NUTCH-1258-1.5-1.patch Patch for 1.5. Adds configuration to read from contentmeta,

[jira] [Updated] (NUTCH-1252) SegmentReader -get shows wrong data

2012-01-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1252: - Fix Version/s: 1.5 Thanks. Marked for 1.5, keeping it on the radar.

[jira] [Updated] (NUTCH-1256) WebGraph to dump host + score

2012-01-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1256: - Attachment: NUTCH-1256-1.5-1.patch Patch introduces new parameter with two mandatory arguments.

[jira] [Updated] (NUTCH-1252) SegmentReader -get shows wrong data

2012-01-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1252: - Thanks. Marked for 1.5, keeping it on the radar. SegmentReader -get shows wrong

[jira] [Updated] (NUTCH-1201) Allow for different FetcherThread impls

2012-01-24 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1201: - Attachment: CustomFetcher.java NUTCH-1201-1.5-wip.patch Here's a WIP that allows

[jira] [Updated] (NUTCH-1251) Deletion of duplicates fails with org.apache.solr.client.solrj.SolrServerException

2012-01-17 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1251: - Fix Version/s: 1.5 Deletion of duplicates fails with

[jira] [Updated] (NUTCH-1248) Generator to select on status

2012-01-13 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1248: - Attachment: NUTCH-1248-1.5-1.patch Any comments? Tests pass and it works as expected. I'll

[jira] [Updated] (NUTCH-1139) Indexer to delete documents

2012-01-09 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1139: - Attachment: NUTCH-1139-1.5-2.patch New patch for 1.5. Any final comments?

[jira] [Updated] (NUTCH-827) HTTP POST Authentication

2012-01-06 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-827: Fix Version/s: 1.5 HTTP POST Authentication Key:

[jira] [Updated] (NUTCH-1245) URL gone with 404 after db.fetch.interval.max stays db_unfetched in CrawlDb and is generated over and over again

2012-01-06 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1245: - Fix Version/s: 1.5 URL gone with 404 after db.fetch.interval.max stays db_unfetched in

[jira] [Updated] (NUTCH-1244) CrawlDBDumper to filter by regex

2012-01-05 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1244: - Attachment: NUTCH-1244-1.5-1.patch Patch for 1.5. It relies on an exact match of the whole

[jira] [Updated] (NUTCH-1244) CrawlDBDumper to filter by regex

2012-01-05 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1244: - Attachment: NUTCH-1244-1.5-2.patch Patch for 1.5 fixes small issue with arguments and adds

[jira] [Updated] (NUTCH-1210) DomainBlacklistFilter

2012-01-02 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1210: - Attachment: NUTCH-1210-1.5-1.patch Patch for 1.5. DomainBlacklistFilter

[jira] [Updated] (NUTCH-1210) DomainBlacklistFilter

2012-01-02 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1210: - Patch Info: Patch Available DomainBlacklistFilter -

[jira] [Updated] (NUTCH-1232) Remove host field from index-basic

2012-01-02 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1232: - Summary: Remove host field from index-basic (was: Remove host|site fields from index-basic)

[jira] [Updated] (NUTCH-1239) Webgraph should remove deleted pages from segment input

2011-12-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1239: - Attachment: NUTCH-1239-1.5-1.patch Patch for 1.5. Little review would be appreciated. I added a

[jira] [Updated] (NUTCH-1238) Fetcher throughput threshold must start before feeder finished

2011-12-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1238: - Priority: Trivial (was: Major) Fetcher throughput threshold must start before feeder

[jira] [Updated] (NUTCH-1238) Fetcher throughput threshold must start before feeder finished

2011-12-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1238: - Patch Info: Patch Available Fetcher throughput threshold must start before feeder finished

[jira] [Updated] (NUTCH-1238) Fetcher throughput threshold must start before feeder finished

2011-12-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1238: - Attachment: NUTCH-1238-1.5-1.patch Patch for 1.5. The exceeding check is replaced by the new

[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2011-12-27 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1104: - Description: Umbrella issue for tracking issues that should be ported from 1.x trunk to the

[jira] [Updated] (NUTCH-1230) MimeType API deprecated and breaks with Tika 1.0

2011-12-21 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1230: - Priority: Blocker (was: Major) Patch Info: Patch Available Summary: MimeType API

[jira] [Updated] (NUTCH-1230) MimeType API deprecated and breaks with Tika 1.0

2011-12-21 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1230: - Attachment: NUTCH-1230-1.5-2.patch Patches for MimeUtil and some other classes. Everything works

[jira] [Updated] (NUTCH-1230) MimeType API deprecated and breaks with Tika 1.0

2011-12-21 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1230: - Attachment: NUTCH-1230-1.5-3.patch I feel like a fool sometimes but it's sorted now! All tests

[jira] [Updated] (NUTCH-1233) Rely on Tika for outlink extraction

2011-12-21 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1233: - Attachment: NUTCH-1233-1.5-wip.patch The boilerpipe code relies on an unavailable BP version.

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-12-20 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Description: Fetcher improvements to parse and follow outlinks up to a specified depth. The

[jira] [Updated] (NUTCH-1222) Upgrade to new Hadoop 0.22.0

2011-12-19 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1222: - Assignee: Markus Jelsma Summary: Upgrade to new Hadoop 0.22.0 (was: Upgrade to newer Hadoop

[jira] [Updated] (NUTCH-1222) Upgrade to new Hadoop 0.22.0

2011-12-19 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1222: - Patch Info: Patch Available Upgrade to new Hadoop 0.22.0

[jira] [Updated] (NUTCH-1222) Upgrade to new Hadoop 0.22.0

2011-12-19 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1222: - Attachment: NUTCH-1222-1.5-1.patch Ivy patch. Everything is fine! Make sure to do ant clean or

[jira] [Updated] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1225: - Patch Info: Patch Available Assignee: Markus Jelsma Migrate CrawlDBScanner to

[jira] [Updated] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1225: - Attachment: NUTCH-1225-1.5-1.patch Patch for 1.5. This is only compatible with Hadoop 0.21 or

[jira] [Updated] (NUTCH-1225) Migrate CrawlDBScanner to MapReduce API

2011-12-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1225: - Attachment: NUTCH-1225-1.5-2.patch New patch uses proper value iteration in reducer. Old API:

[jira] [Updated] (NUTCH-1226) Migrate CrawlDbReader to MapReduce API

2011-12-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1226: - Attachment: NUTCH-1226-1.5-1.patch First crack! Had a lot of trouble with some deprecated stuff

[jira] [Updated] (NUTCH-1226) Migrate CrawlDbReader to MapReduce API

2011-12-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1226: - Description: Hadoop 0.21 only! Patch Info: Patch Available Migrate CrawlDbReader to

[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API

2011-12-14 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1219: - Description: We should upgrade to the new Hadoop API for Nutch trunk as already has been done

[jira] [Updated] (NUTCH-1221) Migrate DomainStatistics to MapReduce API

2011-12-14 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1221: - Patch Info: Patch Available Migrate DomainStatistics to MapReduce API

[jira] [Updated] (NUTCH-1219) Upgrade all jobs to new MapReduce API

2011-12-13 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1219: - Description: We should upgrade to the new Hadoop API for Nutch trunk as already has been done

[jira] [Updated] (NUTCH-1214) DomainStats tool should be named for what it's doing

2011-11-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1214: - Patch Info: Patch Available DomainStats tool should be named for what it's doing

[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2011-11-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1104: - Description: Umbrella issue for tracking issues that should be ported from 1.x trunk to the

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-25 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1184-1.5-9-ParseOutputFormat.patch Patch fixes issue described in NUTCH-1212.

[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2011-11-21 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1104: - Description: Umbrella issue for tracking issues that should be ported from 1.x trunk to the

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-16 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1185-1.5-9.patch New patch [9] solves an issue of NPE in filtering. It's now

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-16 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: (was: NUTCH-1185-1.5-9.patch) Fetcher to parse and follow Nth degree

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-16 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1185-1.5-9.patch Fetcher to parse and follow Nth degree outlinks

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Description: Fetcher improvements to parse and follow outlinks up to a specified depth. The

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1185-1.5-6.patch New patch includes all involved files: * ParseData *

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1185-1.5-7.patch This patch refactors filtering and parsing of outlinks to a

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-15 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Description: Fetcher improvements to parse and follow outlinks up to a specified depth. The

[jira] [Updated] (NUTCH-1203) ParseSegment to list ms per record

2011-11-11 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1203: - Attachment: NUTCH-1203-1.5-1.patch ParseSegment to list ms per record

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-11 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1184-1.5-5-ParseData.patch Patch for ParseData was missing. This now has a

[jira] [Updated] (NUTCH-1171) WebGraph to overwrite normalized input keys

2011-11-10 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1171: - Fix Version/s: (was: 1.4) WebGraph to overwrite normalized input keys

[jira] [Updated] (NUTCH-1153) LinkRank must not log all hyperlinks

2011-11-10 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1153: - Attachment: NUTCH-1153-1.5-2.patch Final patch also disabled writing of _SUCCESS files by recent

[jira] [Updated] (NUTCH-1155) Host/domain limit in generator is generate.max.count+1

2011-11-10 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1155: - Attachment: NUTCH-1155-1.5-1.patch simple patch Host/domain limit in generator

[jira] [Updated] (NUTCH-1173) DomainStats doesn't count db_not_modified

2011-11-10 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1173: - Attachment: NUTCH-1173-1.5-1.patch Simple patch. DomainStats doesn't count

[jira] [Updated] (NUTCH-1098) better url-normalizer basic

2011-11-04 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1098: - Attachment: patch-with-utf8-encoding.diff Restored original patch. better

[jira] [Updated] (NUTCH-1140) index-more plugin, resetTitle method creates multiple values in the Title field

2011-11-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1140: - Fix Version/s: 1.5 index-more plugin, resetTitle method creates multiple values in the

[jira] [Updated] (NUTCH-828) Fetch Filter

2011-11-02 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-828: Due Date: 9/Jun/10 (was: 9/Jun/10) Fix Version/s: 1.5 Fetch Filter

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-11-02 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1184-1.5-5.patch New patch adds fetcher.follow.outlinks.num.links setting that

[jira] [Updated] (NUTCH-1193) Incorrect url transform to lowercase: parameter solr

2011-11-02 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1193: - Priority: Trivial (was: Major) Fix Version/s: 1.5 Thank you for reporting. This is

[jira] [Updated] (NUTCH-1104) Port issues from trunk NutchGora branch

2011-11-01 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1104: - Description: Umbrella issue for tracking issues that should be ported from 1.x trunk to the

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-10-31 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1184-1.5-3.patch New patch fixes the todo's and incorporates NUTCH-1174.

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-10-31 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1184-1.5-4.patch New patch does not initialize maxOutlinkDepth in fetcher.

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-10-28 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1184-1.5-2.patch New patch uses HashSet to deduplicate the outlinks. Todo: *

[jira] [Updated] (NUTCH-1184) Fetcher to parse and follow Nth degree outlinks

2011-10-27 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1184: - Attachment: NUTCH-1184-1.5-1.patch Here's a first attempt, it introduces a new configuration

[jira] [Updated] (NUTCH-1178) Incorrect CSV header CrawlDatumCsvOutputFormat

2011-10-24 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1178: - Attachment: NUTCH-1178-1.5-1.patch Patch adding a new distinct retry interval field.

[jira] [Updated] (NUTCH-1180) UpdateDB to backup previous CrawlDB

2011-10-24 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1180: - Description: Nutch currently replaces an existing CrawlDB with the new CrawlDB. By optionally

[jira] [Updated] (NUTCH-1177) Generator to select on retry interval

2011-10-23 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1177: - Attachment: NUTCH-1177-1.5-1.patch Patch for trunk. Generator to select on

[jira] [Updated] (NUTCH-1142) Normalization and filtering in WebGraph

2011-10-12 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1142: - Attachment: NUTCH-1142-1.5-3.patch New patch with the ability to normalize and filter existing

[jira] [Updated] (NUTCH-1142) Normalization and filtering in WebGraph

2011-10-07 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1142: - Attachment: NUTCH-1142-1.5-2.patch New patch also filters collected outlinks instead of just map

[jira] [Updated] (NUTCH-1151) Index-anchor to add numInlinks count

2011-10-07 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1151: - Attachment: NUTCH-1151-1.5-1.patch Patch for trunk. Adds configuration directive to

[jira] [Updated] (NUTCH-1153) LinkRank must not log all hyperlinks

2011-10-07 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1153: - Attachment: NUTCH-1153-1.5-1.patch Patch for trunk. LinkRank must not log all

[jira] [Updated] (NUTCH-1139) Indexer to delete documents

2011-10-06 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1139: - Fix Version/s: (was: 1.4) 1.5 Needs proper testing, pass to 1.5

[jira] [Updated] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2011-10-06 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1024: - Fix Version/s: (was: 1.4) 1.5 Dynamically set fetchInterval by

[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

2011-10-06 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-965: Fix Version/s: (was: 1.4) 1.5 Skip parsing for truncated documents

[jira] [Updated] (NUTCH-1147) WebGraph nodeDumper uses only 1 reducer

2011-10-05 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1147: - Patch Info: Patch Available WebGraph nodeDumper uses only 1 reducer

[jira] [Updated] (NUTCH-1147) WebGraph nodeDumper uses only 1 reducer

2011-10-05 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1147: - Attachment: NUTCH-1147-1.5-1.patch Patch for trunk. WebGraph nodeDumper uses

[jira] [Updated] (NUTCH-1150) http.redirect.max can lead to multiple parses of the same url

2011-10-05 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1150: - Summary: http.redirect.max can lead to multiple parses of the same url (was: http.redirect.max

[jira] [Updated] (NUTCH-1144) Filtering optional in WebGraph

2011-10-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1144: - Fix Version/s: (was: 1.5) Filtering optional in WebGraph

[jira] [Updated] (NUTCH-1142) Normalization and filtering in WebGraph

2011-10-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1142: - Description: The WebGraph programs perform URL normalization. Since normalization of outlinks

[jira] [Updated] (NUTCH-717) Make Nutch Solr integration easier

2011-10-03 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-717: Fix Version/s: (was: 1.4) 1.5 Make Nutch Solr integration easier

[jira] [Updated] (NUTCH-1061) Migrate MoreIndexingFilter from Apache ORO to java.util.regex

2011-09-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1061: - Fix Version/s: (was: 1.4) (was: nutchgora) 1.5

[jira] [Updated] (NUTCH-1084) ReadDB url throws exception

2011-09-29 Thread Markus Jelsma (Updated) (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1084: - Affects Version/s: (was: 1.4) 1.3 Fix Version/s: (was:
