[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tejas Patil updated NUTCH-1284: --- Assignee: Tejas Patil Add site fetcher.max.crawl.delay as log output by default.

[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Tejas Patil (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551857#comment-13551857 ] Tejas Patil commented on NUTCH-1284: Can anyone kindly review the patch?

[jira] [Updated] (NUTCH-1274) Fix [cast] javac warnings

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1274: Fix Version/s: 2.2 Fix [cast] javac warnings

[jira] [Resolved] (NUTCH-1274) Fix [cast] javac warnings

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1274. - Resolution: Fixed Committed @revision 1432469 in trunk Committed @revision

[jira] [Updated] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1042: Fix Version/s: 2.2 1.7 Fetcher.max.crawl.delay property

[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1284: Fix Version/s: 2.2 Add site fetcher.max.crawl.delay as log output by default.

[jira] [Commented] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551959#comment-13551959 ] Lewis John McGibbney commented on NUTCH-1284: Hi Tejas. Nice catch btw as it

[jira] [Commented] (NUTCH-1274) Fix [cast] javac warnings

2013-01-12 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13551964#comment-13551964 ] Hudson commented on NUTCH-1274: Integrated in Nutch-trunk #2081 (See

[jira] [Updated] (NUTCH-1472) InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation)

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1472: Fix Version/s: 2.2 InvalidRequestException(why:(String didn't validate.)

[jira] [Resolved] (NUTCH-1436) bin/nutch absent in zip package

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1436. - Resolution: Won't Fix As we have released 1.6, which includes the bin/nutch

[jira] [Updated] (NUTCH-1472) InvalidRequestException(why:(String didn't validate.) [webpage][f][ts] failed validation)

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1472: Component/s: injector This issue occurs when injecting URLs into Cassandra using

[jira] [Updated] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1495: Fix Version/s: 2.2 -normalize and -filter for updatedb command in nutch 2.x

[jira] [Updated] (NUTCH-1190) MoreIndexingFilter refactor: move data formats used to parse lastModified to a config file.

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1190: Fix Version/s: 2.2 1.7 MoreIndexingFilter refactor: move

[jira] [Updated] (NUTCH-1015) MoreIndexingFilter: can't parse erroneous date: 2006-05-24T20:03:42

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1015: Fix Version/s: 2.2 1.7 MoreIndexingFilter: can't parse

Jenkins build is back to normal : Nutch-nutchgora #463

2013-01-12 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/463/

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1483: Patch Info: Patch Available Can't crawl filesystem with protocol-file plugin

[jira] [Updated] (NUTCH-1461) Problem with TableUtil

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1461: Patch Info: Patch Available Fix Version/s: 2.2 Problem with TableUtil

[Nutch Wiki] Trivial Update of FrontPage by LewisJohnMcgibbney

2013-01-12 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The FrontPage page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/FrontPage?action=diff&rev1=254&rev2=255 * ApacheConUs2009MeetUp - List of topics for !MeetUp at

[jira] [Resolved] (NUTCH-1094) create comprehensive documentation for Nutchgora branch

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1094. - Resolution: Fixed I would argue that this has been significantly addressed in

[jira] [Updated] (NUTCH-1447) Nutch 2.x with Cloudera CDH 4 get Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1447: Fix Version/s: 2.2 Nutch 2.x with Cloudera CDH 4 get Error: Found interface

[jira] [Updated] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1418: Fix Version/s: 1.7 error parsing robots rules- can't decode path:

[jira] [Updated] (NUTCH-1458) Support for raw HTML field added to Solr

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1458: Fix Version/s: 1.7 Support for raw HTML field added to Solr

[jira] [Updated] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1457: Fix Version/s: 2.2 Nutch2 Refactor the update process so that fetched items

[jira] [Updated] (NUTCH-1452) hadoop.job.history.user.location in nutch-default making job history useless

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1452: Fix Version/s: 2.2 1.7 hadoop.job.history.user.location in

[jira] [Updated] (NUTCH-806) Merge CrawlDBScanner with CrawlDBReader

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-806: --- Fix Version/s: 1.7 Merge CrawlDBScanner with CrawlDBReader

[jira] [Updated] (NUTCH-1410) impact of a map-reduce problem

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1410: Fix Version/s: 2.2 1.7 impact of a map-reduce problem

[jira] [Updated] (NUTCH-1502) Test for CrawlDatum state transitions

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1502: Fix Version/s: 2.2 1.7 Test for CrawlDatum state

[jira] [Updated] (NUTCH-1481) When using MySQL as storage unicode characters within URLS cause nutch to fail

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1481: Fix Version/s: 2.2 When using MySQL as storage unicode characters within URLS

[jira] [Updated] (NUTCH-1490) Data Truncation exceptions when using mysql

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1490: Fix Version/s: 2.2 Data Truncation exceptions when using mysql

[jira] [Updated] (NUTCH-1490) Data Truncation exceptions when using mysql

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1490: Patch Info: Patch Available Data Truncation exceptions when using mysql

[jira] [Updated] (NUTCH-1487) Nutch parse fails first time for PDF files and works on reparse

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1487: Fix Version/s: 2.2 Nutch parse fails first time for PDF files and works on

[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1297: Fix Version/s: 1.7 it is better for fetchItemQueues to select items from

[jira] [Updated] (NUTCH-1297) it is better for fetchItemQueues to select items from greater queues first

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1297: Patch Info: Patch Available it is better for fetchItemQueues to select items

[jira] [Updated] (NUTCH-1286) Refactoring/reimplementing crawling API (NutchApp)

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1286: Fix Version/s: 2.2 Refactoring/reimplementing crawling API (NutchApp)

[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1267: Fix Version/s: 1.7 urlmeta to delegate indexing to index-metadata

[jira] [Updated] (NUTCH-1268) parse-meta to delegate indexing to index-metadata

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1268: Fix Version/s: 1.7 parse-meta to delegate indexing to index-metadata

[jira] [Updated] (NUTCH-1303) Fetcher to skip queues for URLS getting repeated exceptions, based on percentage

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1303: Fix Version/s: 1.7 Fetcher to skip queues for URLS getting repeated

[jira] [Updated] (NUTCH-1303) Fetcher to skip queues for URLS getting repeated exceptions, based on percentage

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1303: Patch Info: Patch Available Fetcher to skip queues for URLS getting repeated

[jira] [Updated] (NUTCH-1270) some of Deflate encoded pages not fetched

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1270: Patch Info: Patch Available Fix Version/s: 1.7 some of Deflate encoded

[jira] [Updated] (NUTCH-1282) linkdb scalability

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1282: Fix Version/s: 1.7 linkdb scalability

[jira] [Updated] (NUTCH-1269) Generate main problems

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1269: Fix Version/s: 1.7 Generate main problems

[jira] [Updated] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1281: Fix Version/s: 2.2 1.7 tika parser not work properly with

[jira] [Updated] (NUTCH-1278) Fetch Improvement in threads per host

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1278: Patch Info: Patch Available Fetch Improvement in threads per host

[jira] [Updated] (NUTCH-926) Nutch follows wrong url in META http-equiv=refresh tag

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-926: --- Fix Version/s: 1.7 Nutch follows wrong url in META http-equiv=refresh tag

[jira] [Updated] (NUTCH-881) Good quality documentation for Nutch

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-881: --- Fix Version/s: 1.7 Good quality documentation for Nutch

[jira] [Updated] (NUTCH-1253) Incompatible neko and xerces versions

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1253: Fix Version/s: 2.2 1.7 Incompatible neko and xerces

[jira] [Updated] (NUTCH-1257) Support for the x-robots-tag HTTP Header

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1257: Fix Version/s: 2.2 1.7 Support for the x-robots-tag HTTP

[jira] [Updated] (NUTCH-1250) parse-html does not parse links with empty anchor

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1250: Fix Version/s: 2.2 1.7 parse-html does not parse links

[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1080: Fix Version/s: 2.2 Type safe members , arguments for better readability

[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1080: Fix Version/s: 1.7 Type safe members , arguments for better readability

[jira] [Updated] (NUTCH-1076) Solrindex has no documents following bin/nutch solrindex when using protocol-file

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1076: Fix Version/s: 1.7 Solrindex has no documents following bin/nutch solrindex

[jira] [Updated] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1371: Fix Version/s: 2.2 1.7 Replace Ivy with Maven Ant tasks

[jira] [Updated] (NUTCH-1382) Adding support for EmbeddedSolrServer to SolrIndexer

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1382: Fix Version/s: 1.7 Adding support for EmbeddedSolrServer to SolrIndexer

[jira] [Updated] (NUTCH-1387) All parsers should respond to cancellation / interrupts.

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1387: Fix Version/s: 2.2 1.7 All parsers should respond to

[jira] [Updated] (NUTCH-1375) extract main content of a html file

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1375: Patch Info: Patch Available Fix Version/s: 1.7 extract main content of

[jira] [Updated] (NUTCH-1334) NPE in FetcherOutputFormat

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1334: Fix Version/s: 1.7 NPE in FetcherOutputFormat

[jira] [Updated] (NUTCH-1334) NPE in FetcherOutputFormat

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1334: Patch Info: Patch Available NPE in FetcherOutputFormat

[jira] [Updated] (NUTCH-1329) parser not extract outlinks to external web sites

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1329: Fix Version/s: 2.2 1.7 parser not extract outlinks to

[jira] [Updated] (NUTCH-1321) IDNNormalizer

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1321: Fix Version/s: 1.7 IDNNormalizer - Key:

[jira] [Updated] (NUTCH-1315) reduce speculation on but ParseOutputFormat doesn't name output files correctly?

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1315: Fix Version/s: 1.7 reduce speculation on but ParseOutputFormat doesn't name

[jira] [Updated] (NUTCH-1309) fetch queue management

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1309: Fix Version/s: 1.7 fetch queue management

[jira] [Updated] (NUTCH-933) Fetcher does not save a pages Last-Modified value in CrawlDatum

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-933: --- Fix Version/s: 1.7 Fetcher does not save a pages Last-Modified value in

[jira] [Updated] (NUTCH-929) Create a REST-based admin UI for Nutch

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-929: --- Fix Version/s: 2.2 Create a REST-based admin UI for Nutch

[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-891: --- Fix Version/s: 2.2 Nutch build should not depend on unversioned local deps

[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-891: --- Patch Info: Patch Available Nutch build should not depend on unversioned local

[jira] [Updated] (NUTCH-952) fix outlink which started with '?' in html parser

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-952: --- Fix Version/s: 1.7 fix outlink which started with '?' in html parser

[jira] [Updated] (NUTCH-649) Log list of files found but not crawled.

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-649: --- Fix Version/s: 1.7 Log list of files found but not crawled.

[jira] [Resolved] (NUTCH-960) Language ID - confidence factor

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-960. Resolution: Won't Fix This is way too old and as Ken pointed out this should be

[jira] [Updated] (NUTCH-945) Indexing to multiple SOLR Servers

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-945: --- Fix Version/s: 2.2 Indexing to multiple SOLR Servers

[jira] [Updated] (NUTCH-945) Indexing to multiple SOLR Servers

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-945: --- Patch Info: Patch Available Indexing to multiple SOLR Servers

[jira] [Resolved] (NUTCH-734) option to filter a tag text

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-734. Resolution: Won't Fix This is simply not required and dated. Plus I assume by

[jira] [Resolved] (NUTCH-745) MyHtmlParser getParse return not null,so all Analyzer-(zh|fr) cannot run

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-745. Resolution: Invalid close of legacy issue MyHtmlParser getParse

[jira] [Updated] (NUTCH-685) Content-level redirect status lost in ParseSegment

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-685: --- Fix Version/s: 2.2 1.7 Content-level redirect status lost in

[jira] [Updated] (NUTCH-583) FeedParser empty links for items

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-583: --- Fix Version/s: 2.2 1.7 FeedParser empty links for items

[jira] [Updated] (NUTCH-356) Plugin repository cache can lead to memory leak

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-356: --- Patch Info: Patch Available Fix Version/s: 2.2 1.7

[jira] [Updated] (NUTCH-366) Merge URLFilters and URLNormalizers

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-366: --- Fix Version/s: 2.2 1.7 Merge URLFilters and URLNormalizers

[jira] [Updated] (NUTCH-475) Adaptive crawl delay

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-475: --- Fix Version/s: 1.7 Adaptive crawl delay

[jira] [Updated] (NUTCH-207) Bandwidth target for fetcher rather than a thread count

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-207: --- Patch Info: Patch Available Fix Version/s: 1.7 Bandwidth target for

[jira] [Updated] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1508: Fix Version/s: 2.2 Port limit crawler to defined depth to 2.x

[jira] [Resolved] (NUTCH-314) Multiple language identifier instances

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-314. Resolution: Won't Fix close of legacy issue Multiple language

[jira] [Updated] (NUTCH-1483) Can't crawl filesystem with protocol-file plugin

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1483: Fix Version/s: 2.2 1.7 Can't crawl filesystem with

[jira] [Updated] (NUTCH-802) Problems managing outlinks with large url length

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-802: --- Fix Version/s: 1.7 Problems managing outlinks with large url length

[jira] [Updated] (NUTCH-795) Add ability to maintain nofollow attribute in linkdb

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-795: --- Fix Version/s: 1.7 Add ability to maintain nofollow attribute in linkdb

[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1478: Fix Version/s: 2.2 Parse-metatags and index-metadata plugin for Nutch 2.x

[jira] [Updated] (NUTCH-1511) Metadata in MYSQL updated with 'garbage'

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1511: Fix Version/s: 2.2 Metadata in MYSQL updated with 'garbage'

[jira] [Updated] (NUTCH-1505) java.lang.IllegalArgumentException during updatedb

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1505: Fix Version/s: 2.2 java.lang.IllegalArgumentException during updatedb

[jira] [Updated] (NUTCH-804) CrawlDatum.statNames can be modified

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-804: --- Fix Version/s: 1.7 CrawlDatum.statNames can be modified

[jira] [Updated] (NUTCH-789) Improvements to Tika parser

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-789: --- Fix Version/s: 2.2 1.7 Improvements to Tika parser

[jira] [Updated] (NUTCH-813) Repetitive crawl 403 status page

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-813: --- Fix Version/s: 1.7 Repetitive crawl 403 status page

[jira] [Updated] (NUTCH-1464) index-static plugin doesn't allow the colon within the field value

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1464: Patch Info: Patch Available Fix Version/s: 1.7 index-static plugin

[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1497: Fix Version/s: 2.2 Better default gora-sql-mapping.xml with larger field

[jira] [Updated] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1499: Fix Version/s: 1.7 Usage of multiple ipv4 addresses and network cards on

[jira] [Updated] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1499: Patch Info: Patch Available Usage of multiple ipv4 addresses and network

[jira] [Updated] (NUTCH-1485) TableUtil reverseURL to keep userinfo part

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1485: Fix Version/s: 2.2 TableUtil reverseURL to keep userinfo part

[jira] [Updated] (NUTCH-1182) fetcher should track and shut down hung threads

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1182: Fix Version/s: 2.2 1.7 fetcher should track and shut down

[jira] [Resolved] (NUTCH-1018) Solr Document Size Limit

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1018. - Resolution: Won't Fix Looks like a plugin is the solution here. Closing as won't

[jira] [Resolved] (NUTCH-1007) Add readdb -host output

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1007. - Resolution: Won't Fix This is not a problem and as Markus mentioned the

[jira] [Commented] (NUTCH-1499) Usage of multiple ipv4 addresses and network cards on fetcher machines

2013-01-12 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552028#comment-13552028 ] Sebastian Nagel commented on NUTCH-1499: So, a vote for won't fix. Comments?

[jira] [Resolved] (NUTCH-1316) create EmbeddedNutchInstance testing utility class

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1316. - Resolution: Won't Fix We already have a testing class relating to Fetching

[jira] [Updated] (NUTCH-1313) Nutch trunk add response headers to datastore for the protocol-httpclient plugin

2013-01-12 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1313: Fix Version/s: 1.7 Nutch trunk add response headers to datastore for the
