[jira] [Created] (NUTCH-1678) Remove dependency on org.apache.oro
James Sullivan created NUTCH-1678:

Summary: Remove dependency on org.apache.oro
Key: NUTCH-1678
URL: https://issues.apache.org/jira/browse/NUTCH-1678
Project: Nutch
Issue Type: Improvement
Components: parser
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor

org.apache.oro has been archived for three years and it may be good to remove the dependency as Java has had a built in regexes for quite some time now. There don't seem to have been any specific Perl5 functionality needed in the regexes so unless there are specific threading or performance reasons for continuing to use oro it may be time to lose the dependency. Attached patch needs to be checked thoroughly as I am rusty with Java and the unit tests are sparse.

--
This message was sent by Atlassian JIRA (v6.1#6144)
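For readers unfamiliar with the built-in alternative, here is a minimal sketch of link extraction using only java.util.regex. It is not the code from the attached patch: the class name RegexOutlinkSketch, the method extractUrls, and the deliberately simple URL pattern are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the Nutch OutlinkExtractor.
class RegexOutlinkSketch {

  // Simplified URL pattern (an assumption); a production expression
  // would need to handle more schemes and trailing punctuation.
  private static final Pattern URL_PATTERN =
      Pattern.compile("\\bhttps?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

  // Returns every substring of plainText that matches URL_PATTERN, in order.
  static List<String> extractUrls(String plainText) {
    List<String> urls = new ArrayList<>();
    // A Matcher is created per call because Matcher instances are not thread-safe.
    Matcher matcher = URL_PATTERN.matcher(plainText);
    while (matcher.find()) {
      urls.add(matcher.group());
    }
    return urls;
  }

  public static void main(String[] args) {
    String text = "See http://nutch.apache.org/ and https://example.com/a?b=c for details.";
    for (String url : extractUrls(text)) {
      System.out.println(url);
    }
  }
}
```

On the threading question raised in the issue: a compiled Pattern is immutable and can be shared between threads, while each Matcher should stay confined to one thread, which is why the sketch keeps the Pattern static and creates the Matcher locally.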
[jira] [Updated] (NUTCH-1678) Remove dependency on org.apache.oro
[ https://issues.apache.org/jira/browse/NUTCH-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1678:

Attachment: 2.x.patch

parse/OutlinkExtractor index-more parse-js urlnormalizer-basic

Needs to be looked over and tested first.

Labels: newbie, patch
Attachments: 2.x.patch
[jira] [Updated] (NUTCH-1678) Remove dependency on org.apache.oro
[ https://issues.apache.org/jira/browse/NUTCH-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1678:

Description: org.apache.oro has been archived for three years and it may be good to remove the dependency as Java has had built in regexes for quite some time now. There don't seem to have been any specific Perl5 functionality needed in the regexes so unless there are specific threading or performance reasons for continuing to use oro it may be time to lose the dependency. Attached patch needs to be checked thoroughly as I am rusty with Java and the unit tests are sparse.

(was: org.apache.oro has been archived for three years and it may be good to remove the dependency as Java has had a built in regexes for quite some time now. There don't seem to have been any specific Perl5 functionality needed in the regexes so unless there are specific threading or performance reasons for continuing to use oro it may be time to lose the dependency. Attached patch needs to be checked thoroughly as I am rusty with Java and the unit tests are sparse.)

Labels: newbie, patch
Attachments: 2.x.patch
[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1475:

Attachment: index-more-2x.patch

This patch uses getModifiedTime

Nutch 2.1 Index-More Plugin -- A better fall back value for date field

Key: NUTCH-1475
URL: https://issues.apache.org/jira/browse/NUTCH-1475
Project: Nutch
Issue Type: Bug
Affects Versions: 2.1, 1.5.1
Environment: All
Reporter: James Sullivan
Priority: Minor
Labels: index-more, plugins
Fix For: 1.8
Attachments: index-more-1xand2x.patch, index-more-2x.patch, index-more-2x.patch
Original Estimate: 1h
Remaining Estimate: 1h

Among other fields, the more plugin for Nutch 2.x provides a last modified and date field for the Solr index. The last modified field is the last modified date from the http headers if available, if not available it is left empty. Currently, the date field is the same as the last modified field unless that field is empty in which case getFetchTime is used as a fall back. I think getFetchTime is not a good fall back as it is the next fetch time and often a month or more in the future which doesn't make sense for the date field. Users do not expect webpages/documents with future dates. A more sensible fallback would be current date at the time it is indexed. This is possible by simply changing line 97 of https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java from

    time = page.getFetchTime(); // use fetch time

to

    time = new Date().getTime();

Users interested in the getFetchTime value can still get it from the tstamp field.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
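The fallback rule described in this issue can be sketched in isolation as follows. This is not the MoreIndexingFilter code: the class DateFallbackSketch, the method resolveDate, and the convention that a value of 0 means "no Last-Modified header" are assumptions made for illustration.

```java
import java.util.Date;

// Hypothetical stand-alone illustration of the proposed fallback;
// it does not use any Nutch classes.
class DateFallbackSketch {

  // lastModifiedMillis: Last-Modified value in epoch milliseconds,
  //                     or 0 when the header was absent (assumed convention).
  // fallbackMillis:     the value to use instead, e.g. the time of indexing.
  static long resolveDate(long lastModifiedMillis, long fallbackMillis) {
    return lastModifiedMillis > 0 ? lastModifiedMillis : fallbackMillis;
  }

  public static void main(String[] args) {
    long now = new Date().getTime();
    // No Last-Modified header: the date field falls back to the indexing time,
    // so it can never lie in the future.
    System.out.println(new Date(resolveDate(0L, now)));
    // Header present: it wins over the fallback.
    System.out.println(new Date(resolveDate(1350000000000L, now)));
  }
}
```

Passing the fallback in as a parameter keeps the rule easy to test and makes it simple to swap the current time for another value, such as a modified time, without touching the selection logic.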
[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681899#comment-13681899 ]

James Sullivan commented on NUTCH-1475:

Some additional information: this problem with the date field only happens at sites that don't set the last modified in the headers properly. I have attached a new, very simple patch for 2.x using modifiedTime per Sebastien's recommendation. The same patch may work for 1.x, but I am not that familiar with the 1.x branches so I have not submitted a patch for that branch. The patch does not check whether the page was unmodified after the previous fetch, as that is not critical (although nice to have) and I don't know how to do it concisely.
[jira] [Commented] (NUTCH-1576) Need to keep hotStore.flush() exception catching
[ https://issues.apache.org/jira/browse/NUTCH-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671830#comment-13671830 ]

James Sullivan commented on NUTCH-1576:

Thanks for fixing this. Just compiled it with gora-core 0.2.1 and it worked fine.

Need to keep hotStore.flush() exception catching

Key: NUTCH-1576
URL: https://issues.apache.org/jira/browse/NUTCH-1576
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor
Fix For: 2.2
Attachments: patch.txt

Still need exception checking for hostStore.flush() for those who have to use gora-core 0.2.1, otherwise Nutch 2.x will not compile.

    <!-- Uncomment this to use SQL as Gora backend. It should be noted that the
         gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3.
         Users should downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->

Index: src/java/org/apache/nutch/host/HostDb.java
===================================================================
--- java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java (revision 1487824)
+++ java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java (working copy)
@@ -87,7 +87,11 @@
           CacheHost removeFromCacheHost = notification.getValue();
           if (removeFromCacheHost != NULL_HOST) {
             if (removeFromCacheHost.timestamp < lastFlush.get()) {
-              hostStore.flush();
+              try {
+                hostStore.flush();
+              } catch (IOException e) {
+                throw new RuntimeException(e);
+              }
               lastFlush.set(System.currentTimeMillis());
             }
           }
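The pattern in the patch, calling a method that declares a checked IOException from a place that cannot declare it, can be shown on its own. The sketch below uses java.io.Flushable as a stand-in for the Gora store; the class FlushGuardSketch and the method flushUnchecked are hypothetical names, and nothing here is Gora or Nutch API.

```java
import java.io.Flushable;
import java.io.IOException;

// Hypothetical stand-alone illustration; Flushable stands in for the data store.
class FlushGuardSketch {

  // Flushes the store and rethrows a checked IOException as an unchecked
  // RuntimeException, mirroring the try/catch added by the patch.
  static void flushUnchecked(Flushable store) {
    try {
      store.flush();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    // A store whose flush succeeds.
    flushUnchecked(() -> { });

    // A store whose flush fails: the caller sees a RuntimeException
    // whose cause is the original IOException.
    try {
      flushUnchecked(() -> { throw new IOException("disk full"); });
    } catch (RuntimeException e) {
      System.out.println("wrapped: " + e.getCause().getMessage());
    }
  }
}
```

Wrapping is needed in callbacks like the cache removal listener in the diff, where the enclosing method signature is fixed by an interface and cannot add a throws clause.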
[jira] [Created] (NUTCH-1576) Need to keep hotStore.flush() exception catching
James Sullivan created NUTCH-1576:

Summary: Need to keep hotStore.flush() exception catching
Key: NUTCH-1576
URL: https://issues.apache.org/jira/browse/NUTCH-1576
Project: Nutch
Issue Type: Bug
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor
[jira] [Updated] (NUTCH-1576) Need to keep hotStore.flush() exception catching
[ https://issues.apache.org/jira/browse/NUTCH-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1576:

Attachment: patch.txt

Attachments: patch.txt
[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1497:

Attachment: gora-mysql-mapping-patch

Better default gora-sql-mapping.xml with larger field sizes for MySQL

Key: NUTCH-1497
URL: https://issues.apache.org/jira/browse/NUTCH-1497
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: 2.2
Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
Labels: MySQL
Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, gora-mysql-mapping.xml

The current generic default gora-sql-mapping.xml has field sizes that are too small in almost all situations when used with MySQL. I have included a mapping which will work better for MySQL (takes slightly more space but will be able to handle larger fields necessary for real world use). Includes patch from Nutch-1490 and resolves the non-Unicode part of Nutch-1473. I believe it is not possible to use the same gora-sql-mapping for both hsqldb and MySQL without a significantly degraded lowest common denominator resulting. Should the user manually rename the attached file to gora-sql-mapping.xml or is there a way to have Nutch automatically use it when MySQL is selected in other configurations (Ivy.xml or gora.properties)?
[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1497:

Patch Info: Patch Available
[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496954#comment-13496954 ]

James Sullivan commented on NUTCH-1497:

I have attached it as a patch. MySQL users would still need to rename it to gora-sql-mapping.xml in order to use it.
[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496959#comment-13496959 ]

James Sullivan commented on NUTCH-1497:

I have attached it as a patch. Sorry it took so long. As it stands, MySQL users will still have to rename it to gora-sql-mapping.xml in order to use it.

On Mon, Nov 12, 2012 at 10:13 PM, Lewis John McGibbney (JIRA)
[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1497:

Patch Info: (was: Patch Available)

Attachments: gora-mysql-mapping.xml
[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Sullivan updated NUTCH-1497:

Attachment: gora-mysql-mapping.xml

Attachments: gora-mysql-mapping.xml, gora-mysql-mapping.xml
[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496122#comment-13496122 ]

James Sullivan commented on NUTCH-1497:

Nathan, I've made the changes to the lengths and uploaded. Could you check that it is correct? One note: I left the column as typ. Although I agree it is odd, I thought consistency was more important.
[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496131#comment-13496131 ]

James Sullivan commented on NUTCH-1497:

I agree that one standard file for SQL databases would be preferable. One example of why I couldn't stay with one file for both hsql and MySQL: at larger sizes the text column was being turned into a blob rather than text.
[jira] [Created] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
James Sullivan created NUTCH-1497:

Summary: Better default gora-sql-mapping.xml with larger field sizes for MySQL
Key: NUTCH-1497
URL: https://issues.apache.org/jira/browse/NUTCH-1497
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: 2.2
Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
[jira] [Commented] (NUTCH-1481) When using MySQL as storage unicode characters within URLS cause nutch to fail
[ https://issues.apache.org/jira/browse/NUTCH-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493768#comment-13493768 ]

James Sullivan commented on NUTCH-1481:

There is a way to raise the 190-character restriction to 767 or 768 characters, which should be good enough for most URLs. Use the following options with a recent version of MySQL:

    innodb_file_format=barracuda
    innodb_file_per_table=true
    innodb_large_prefix=true
    ROW_FORMAT=COMPRESSED

For step-by-step instructions, I've updated http://nlp.solutions.asia/?p=180. The hash is probably a better long-term solution (given that the URL is stored in other fields as well) but probably involves more work.

When using MySQL as storage unicode characters within URLS cause nutch to fail

Key: NUTCH-1481
URL: https://issues.apache.org/jira/browse/NUTCH-1481
Project: Nutch
Issue Type: Bug
Components: crawldb
Affects Versions: 2.1
Environment: mysql 5.5.28 on centos
Reporter: Arni Sumarlidason
Labels: database, sql, unicode, utf8

MySQL's (InnoDB) primary key / unique key is restricted to 767 bytes. Currently the URL of a web page is used as the primary key in Nutch storage.

When using the latin1 character set on the 'id' column at a length of 767 bytes/characters, Unicode characters in URLs cause JDBC to throw an exception:

    java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: '\xE2\x80\x8' for column 'id' at row 1

When using the utf8mb4 character set on the 'id' column at a length of 190 characters / 760 bytes to fully support Unicode characters, the field length becomes insufficient.

It may be better to use a hash of the URL as the primary key instead of the URL itself. This would allow URLs of any length and full UTF-8 support.
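To make the hash suggestion concrete, here is a minimal sketch that derives a fixed-length key from a URL. The choice of SHA-256, the hex encoding, and the names UrlKeySketch and sha256Hex are assumptions for illustration; the issue does not specify a hash function, and this is not how Nutch or Gora builds keys.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of a hash-based primary key; not Nutch code.
class UrlKeySketch {

  // Returns the SHA-256 digest of the URL's UTF-8 bytes as 64 lowercase hex characters.
  static String sha256Hex(String url) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(url.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // The key length is 64 regardless of URL length or character set.
    System.out.println(sha256Hex("http://example.com/").length());
    System.out.println(sha256Hex("http://example.com/caf\u00e9/\u65e5\u672c\u8a9e").length());
  }
}
```

Because the key is always 64 ASCII characters, it needs at most 256 bytes even in a utf8mb4 column, which fits within the 767-byte limit described in the issue. The trade-off is that the original URL has to be stored in a separate, non-key column.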
[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475757#comment-13475757 ]

James Sullivan commented on NUTCH-1475:

Agreed fetch time would be even better but this seems a simple interim solution until Nutch-1457 happens.

Fix For: 1.6, 2.2
Attachments: index-more-1xand2x.patch, index-more-2x.patch
[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Sullivan updated NUTCH-1475: -- Attachment: index-more-1xand2x.patch Attaching new patch that patches both 1.x and 2.x Nutch 2.1 Index-More Plugin -- A better fall back value for date field -- Key: NUTCH-1475 URL: https://issues.apache.org/jira/browse/NUTCH-1475 Project: Nutch Issue Type: Bug Affects Versions: 2.1, 1.5.1 Environment: All Reporter: James Sullivan Priority: Minor Labels: index-more, plugins Attachments: index-more-1xand2x.patch, index-more-2x.patch Original Estimate: 1h Remaining Estimate: 1h
[jira] [Created] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
James Sullivan created NUTCH-1475: - Summary: Nutch 2.1 Index-More Plugin -- A better fall back value for date field Key: NUTCH-1475 URL: https://issues.apache.org/jira/browse/NUTCH-1475 Project: Nutch Issue Type: Bug Affects Versions: nutchgora, 2.1 Environment: All Reporter: James Sullivan Priority: Minor Attachments: index-more-2x.patch Among other fields, the more plugin for Nutch 2.x provides last modified and date fields for the Solr index. The last modified field holds the last modified date from the HTTP headers if available; otherwise it is left empty. Currently, the date field is the same as the last modified field unless that field is empty, in which case getFetchTime is used as a fallback. I think getFetchTime is not a good fallback, as it is the next fetch time and is often a month or more in the future, which doesn't make sense for the date field. Users do not expect webpages/documents with future dates. A more sensible fallback would be the current date at the time the page is indexed. This is possible by simply changing line 97 of https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java from time = page.getFetchTime(); // use fetch time to time = new Date().getTime(); Users interested in the getFetchTime value can still get it from the tstamp field.
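The fallback change proposed in NUTCH-1475 can be illustrated with a small standalone sketch. It does not use the Nutch API: the class name DateFallbackSketch, both method names, and the use of -1 as the "no last modified header" marker are assumptions made for illustration only. The sketch compares the current behaviour (fall back to the next fetch time) with the proposed one (fall back to the time of indexing).

```java
import java.util.Date;

// Hypothetical standalone sketch; not the actual MoreIndexingFilter code.
class DateFallbackSketch {

    // Assumption: -1 marks a page without a usable last modified header.
    static final long NO_LAST_MODIFIED = -1L;

    // Current behaviour described in the issue: fall back to the next
    // scheduled fetch time, which may lie well in the future.
    static long dateWithFetchTimeFallback(long lastModified, long nextFetchTime) {
        return lastModified != NO_LAST_MODIFIED ? lastModified : nextFetchTime;
    }

    // Proposed behaviour: fall back to the moment the page is indexed.
    static long dateWithIndexTimeFallback(long lastModified, long indexTime) {
        return lastModified != NO_LAST_MODIFIED ? lastModified : indexTime;
    }

    public static void main(String[] args) {
        long now = new Date().getTime();
        long thirtyDays = 30L * 24 * 60 * 60 * 1000;
        long nextFetch = now + thirtyDays; // e.g. a 30-day refetch interval

        long oldDate = dateWithFetchTimeFallback(NO_LAST_MODIFIED, nextFetch);
        long newDate = dateWithIndexTimeFallback(NO_LAST_MODIFIED, now);

        System.out.println("old fallback is in the future: " + (oldDate > now));
        System.out.println("new fallback is in the future: " + (newDate > now));
    }
}
```

With a last modified value present, both methods return it unchanged; they differ only for pages that lack the header.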
[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field
[ https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Sullivan updated NUTCH-1475: -- Attachment: index-more-2x.patch Nutch 2.1 Index-More Plugin -- A better fall back value for date field -- Key: NUTCH-1475 URL: https://issues.apache.org/jira/browse/NUTCH-1475 Project: Nutch Issue Type: Bug Affects Versions: nutchgora, 2.1 Environment: All Reporter: James Sullivan Priority: Minor Labels: index-more, plugins Attachments: index-more-2x.patch Original Estimate: 1h Remaining Estimate: 1h