[jira] [Created] (NUTCH-1678) Remove dependency on org.apache.oro

2013-12-02 Thread James Sullivan (JIRA)
James Sullivan created NUTCH-1678:
-

 Summary: Remove dependency on org.apache.oro
 Key: NUTCH-1678
 URL: https://issues.apache.org/jira/browse/NUTCH-1678
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor


org.apache.oro has been archived for three years, and it may be good to remove 
the dependency as Java has had built-in regexes for quite some time now. 
There doesn't seem to be any specific Perl5 functionality needed in the 
regexes, so unless there are specific threading or performance reasons for 
continuing to use oro it may be time to lose the dependency. The attached patch 
needs to be checked thoroughly as I am rusty with Java and the unit tests are 
sparse.
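
To illustrate the kind of replacement proposed above, here is a minimal sketch of link extraction using only java.util.regex. It is not taken from the attached patch: the class name, method name and the URL pattern are assumptions made for illustration, and the pattern is much simpler than what a real outlink extractor would need.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the code from the patch.
final class RegexOutlinkSketch {

  // Assumed, deliberately simple URL pattern.
  // A compiled Pattern is immutable and thread-safe, so it can be shared.
  private static final Pattern URL_PATTERN =
      Pattern.compile("\\bhttps?://[^\\s<>\"']+", Pattern.CASE_INSENSITIVE);

  // Returns every substring of plainText that matches URL_PATTERN, in order.
  static List<String> extractUrls(String plainText) {
    List<String> urls = new ArrayList<String>();
    // A Matcher is not thread-safe, so a new one is created per call.
    Matcher matcher = URL_PATTERN.matcher(plainText);
    while (matcher.find()) {
      urls.add(matcher.group());
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(
        extractUrls("See http://nutch.apache.org/ and https://example.org/a?b=c here."));
  }
}
```

On the threading question, sharing the compiled Pattern and creating a Matcher per call is the usual way to keep such code safe across threads. Performance relative to oro would still need to be measured.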



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (NUTCH-1678) Remove dependency on org.apache.oro

2013-12-02 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1678:
--

Attachment: 2.x.patch

parse/OutlinkExtractor
index-more
parse-js
urlnormalizer-basic

Needs to be looked over and tested first.

 Remove dependency on org.apache.oro
 ---

 Key: NUTCH-1678
 URL: https://issues.apache.org/jira/browse/NUTCH-1678
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor
  Labels: newbie, patch
 Attachments: 2.x.patch







[jira] [Updated] (NUTCH-1678) Remove dependency on org.apache.oro

2013-12-02 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1678:
--

Description: org.apache.oro has been archived for three years and it may be 
good to remove the dependency as Java has had built in regexes for quite some 
time now. There don't seem to have been any specific Perl5 functionality needed 
in the regexes so unless there are specific threading or performance reasons 
for continuing to use oro it may be time to lose the dependency. Attached patch 
needs to be checked thoroughly as I am rusty with Java and the unit tests are 
sparse.   (was: org.apache.oro has been archived for three years and it may be 
good to remove the dependency as Java has had a built in regexes for quite some 
time now. There don't seem to have been any specific Perl5 functionality needed 
in the regexes so unless there are specific threading or performance reasons 
for continuing to use oro it may be time to lose the dependency. Attached patch 
needs to be checked thoroughly as I am rusty with Java and the unit tests are 
sparse. )

 Remove dependency on org.apache.oro
 ---

 Key: NUTCH-1678
 URL: https://issues.apache.org/jira/browse/NUTCH-1678
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor
  Labels: newbie, patch
 Attachments: 2.x.patch







[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2013-06-12 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1475:
--

Attachment: index-more-2x.patch

This patch uses getModifiedTime

 Nutch 2.1 Index-More Plugin -- A better fall back value for date field
 --

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1, 1.5.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Fix For: 1.8

 Attachments: index-more-1xand2x.patch, index-more-2x.patch, 
 index-more-2x.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Among other fields, the more plugin for Nutch 2.x provides a last modified 
 field and a date field for the Solr index. The last modified field is the last 
 modified date from the HTTP headers if available; if not available, it is left 
 empty. Currently, the date field is the same as the last modified field 
 unless that field is empty, in which case getFetchTime is used as a fallback. 
 I think getFetchTime is not a good fallback, as it is the next fetch time and 
 often a month or more in the future, which doesn't make sense for the date 
 field. Users do not expect webpages/documents with future dates. A more 
 sensible fallback would be the current date at the time it is indexed. 
 This is possible by simply changing line 97 of 
 https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
  from
 time = page.getFetchTime(); // use fetch time
 to
 time = new Date().getTime();
 Users interested in the getFetchTime value can still get it from the tstamp 
 field.
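
 The fallback order discussed in this issue can be sketched as a small standalone helper. This is only an illustration, not the attached patch: the class and method names are hypothetical, and it assumes that a value of zero or less means "not set". It prefers the Last-Modified header, then a stored modified time, and only then the current time, and it never uses the (future) next fetch time.

```java
// Hypothetical helper for illustration only; not the code from the patch.
final class DateFallbackSketch {

  // Picks the value for the "date" field, in milliseconds since the epoch.
  // Assumption: a value <= 0 means the corresponding time is not set.
  static long chooseIndexDate(long lastModified, long modifiedTime, long nowMillis) {
    if (lastModified > 0) {
      return lastModified;   // Last-Modified from the HTTP headers
    }
    if (modifiedTime > 0) {
      return modifiedTime;   // stored modified time, if there is one
    }
    return nowMillis;        // last resort: the time of indexing
  }

  public static void main(String[] args) {
    // Neither time is set here, so the current time is used.
    System.out.println(chooseIndexDate(0L, 0L, System.currentTimeMillis()));
  }
}
```

 Passing the current time in as a parameter, rather than calling new Date() inside the helper, keeps the logic easy to unit test.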

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2013-06-12 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13681899#comment-13681899
 ] 

James Sullivan commented on NUTCH-1475:
---

Some additional information--this problem with the date field only happens at 
sites that don't set the last modified in the headers properly. I have attached 
a new very simple patch for 2.x using modifiedTime per Sebastien's 
recommendation. This same patch may work for 1.x but I am not that familiar 
with the 1.x branches so have not submitted a patch for that branch. The patch 
does not check to see if it was unmodified after the previous fetch as it is 
not critical (although nice to have) and I don't know how to do that concisely.

 Nutch 2.1 Index-More Plugin -- A better fall back value for date field
 --

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1, 1.5.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Fix For: 1.8

 Attachments: index-more-1xand2x.patch, index-more-2x.patch, 
 index-more-2x.patch

   Original Estimate: 1h
  Remaining Estimate: 1h




[jira] [Commented] (NUTCH-1576) Need to keep hotStore.flush() exception catching

2013-05-31 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13671830#comment-13671830
 ] 

James Sullivan commented on NUTCH-1576:
---

Thanks for fixing this. Just compiled it with gora-core 0.2.1 and it worked 
fine.

 Need to keep hotStore.flush() exception catching
 

 Key: NUTCH-1576
 URL: https://issues.apache.org/jira/browse/NUTCH-1576
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor
 Fix For: 2.2

 Attachments: patch.txt


 Still need exception checking for hostStore.flush() for those who have to use 
 gora-core 0.2.1; otherwise Nutch 2.x will not compile.
 <!-- Uncomment this to use SQL as Gora backend. It should be noted that the 
 gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3. 
 Users should 
 downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->
 Index: src/java/org/apache/nutch/host/HostDb.java
 ===
 --- java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java (revision 1487824)
 +++ java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java (working copy)
 @@ -87,7 +87,11 @@
    CacheHost removeFromCacheHost = notification.getValue();
    if (removeFromCacheHost != NULL_HOST) {
      if (removeFromCacheHost.timestamp < lastFlush.get()) {
 -      hostStore.flush();
 +      try {
 +        hostStore.flush();
 +      } catch (IOException e) {
 +        throw new RuntimeException(e);
 +      }
        lastFlush.set(System.currentTimeMillis());
      }
    }



[jira] [Created] (NUTCH-1576) Need to keep hotStore.flush() exception catching

2013-05-30 Thread James Sullivan (JIRA)
James Sullivan created NUTCH-1576:
-

 Summary: Need to keep hotStore.flush() exception catching
 Key: NUTCH-1576
 URL: https://issues.apache.org/jira/browse/NUTCH-1576
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor


Still need exception checking for hostStore.flush() for those who have to use 
gora-core 0.2.1; otherwise Nutch 2.x will not compile.

<!-- Uncomment this to use SQL as Gora backend. It should be noted that the 
gora-sql 0.1.1-incubating artifact is NOT compatible with gora-core 0.3. 
Users should 
downgrade to gora-core 0.2.1 in order to use SQL as a backend. -->


Index: src/java/org/apache/nutch/host/HostDb.java
===
--- java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java (revision 1487824)
+++ java/workspace/2.x/src/java/org/apache/nutch/host/HostDb.java (working copy)
@@ -87,7 +87,11 @@
   CacheHost removeFromCacheHost = notification.getValue();
   if (removeFromCacheHost != NULL_HOST) {
     if (removeFromCacheHost.timestamp < lastFlush.get()) {
-      hostStore.flush();
+      try {
+        hostStore.flush();
+      } catch (IOException e) {
+        throw new RuntimeException(e);
+      }
       lastFlush.set(System.currentTimeMillis());
     }
   }
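
For anyone wondering why the try/catch is needed at all, here is a self-contained sketch of the same pattern. It assumes, as the compile failure described above suggests, that flush() declares a checked IOException in the older gora-core. The Store and RemovalCallback interfaces are hypothetical stand-ins, not Gora or Nutch types. A callback method with no throws clause cannot let a checked exception escape, so the exception has to be caught and rethrown unchecked.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration only; not Gora or Nutch classes.
final class FlushWrapSketch {

  // Assumed shape of a store whose flush() declares a checked exception.
  interface Store {
    void flush() throws IOException;
  }

  // A callback whose method has no throws clause.
  interface RemovalCallback {
    void onRemoval();
  }

  // Wraps the checked IOException so the callback still compiles.
  static RemovalCallback flushOnRemoval(final Store store) {
    return new RemovalCallback() {
      @Override
      public void onRemoval() {
        try {
          store.flush();
        } catch (IOException e) {
          // Keep the original exception as the cause for diagnosis.
          throw new RuntimeException(e);
        }
      }
    };
  }
}
```

Keeping the catch is harmless only while flush() still declares IOException. If a newer version no longer declares it, a catch of IOException for a call that cannot throw it will itself fail to compile, so a catch of a broader type may be the more portable choice.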



[jira] [Updated] (NUTCH-1576) Need to keep hotStore.flush() exception catching

2013-05-30 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1576:
--

Attachment: patch.txt

 Need to keep hotStore.flush() exception catching
 

 Key: NUTCH-1576
 URL: https://issues.apache.org/jira/browse/NUTCH-1576
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2
Reporter: James Sullivan
Priority: Minor
 Attachments: patch.txt





[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-14 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1497:
--

Attachment: gora-mysql-mapping-patch

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, 
 gora-mysql-mapping.xml


 The current generic default gora-sql-mapping.xml has field sizes that are too 
 small in almost all situations when used with MySQL. I have included a 
 mapping which will work better for MySQL (takes slightly more space but will 
 be able to handle larger fields necessary for real world use). Includes patch 
 from Nutch-1490 and resolves the non-Unicode part of Nutch-1473. I believe it 
 is not possible to use the same gora-sql-mapping for both hsqldb and MySQL 
 without a significantly degraded lowest common denominator resulting. Should 
 the user manually rename the attached file to gora-sql-mapping.xml or is 
 there a way to have Nutch automatically use it when MySQL is selected in 
 other configurations (Ivy.xml or gora.properties)?



[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-14 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1497:
--

Patch Info: Patch Available

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, 
 gora-mysql-mapping.xml





[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-14 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496954#comment-13496954
 ] 

James Sullivan commented on NUTCH-1497:
---

I have attached it as a patch. MySQL users would still need to rename it to 
gora-sql-mapping.xml in order to use it. 

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, 
 gora-mysql-mapping.xml





[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-14 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496959#comment-13496959
 ] 

James Sullivan commented on NUTCH-1497:
---

I have attached it as a patch. Sorry it took so long. As it stands, MySQL
users will still have to rename it to gora-sql-mapping.xml in order to use
it.




On Mon, Nov 12, 2012 at 10:13 PM, Lewis John McGibbney (JIRA) 



 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, 
 gora-mysql-mapping.xml





[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-13 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1497:
--

Patch Info:   (was: Patch Available)

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml





[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-13 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1497:
--

Attachment: gora-mysql-mapping.xml

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml, gora-mysql-mapping.xml





[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-13 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496122#comment-13496122
 ] 

James Sullivan commented on NUTCH-1497:
---

Nathan, I've made the changes to the lengths and uploaded the file. Could you 
check that it is correct? One note: I left the column as typ. Although I agree 
it is odd, I thought consistency was more important.

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml, gora-mysql-mapping.xml





[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-13 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496131#comment-13496131
 ] 

James Sullivan commented on NUTCH-1497:
---

I agree one standard file for SQL databases would be preferable. One example 
of why I couldn't stay with one file for both hsql and MySQL is that, at larger 
sizes, the text column was being turned into a blob rather than text.

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml, gora-mysql-mapping.xml





[jira] [Created] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-11 Thread James Sullivan (JIRA)
James Sullivan created NUTCH-1497:
-

 Summary: Better default gora-sql-mapping.xml with larger field 
sizes for MySQL
 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor


The current generic default gora-sql-mapping.xml has field sizes that are too 
small in almost all situations when used with MySQL. I have included a mapping 
which will work better for MySQL (takes slightly more space but will be able to 
handle larger fields necessary for real world use). Includes patch from 
Nutch-1490 and resolves the non-Unicode part of Nutch-1473. I believe it is not 
possible to use the same gora-sql-mapping for both hsqldb and MySQL without a 
significantly degraded lowest common denominator resulting. Should the user 
manually rename the attached file to gora-sql-mapping.xml or is there a way to 
have Nutch automatically use it when MySQL is selected in other configurations 
(Ivy.xml or gora.properties)?



[jira] [Commented] (NUTCH-1481) When using MySQL as storage unicode characters within URLS cause nutch to fail

2012-11-08 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13493768#comment-13493768
 ] 

James Sullivan commented on NUTCH-1481:
---

There is a way around the 190 character restriction, raising it to 767 or 768 
characters, which should be good enough for most URLs. Use the following options 
with a recent version of MySQL (the first three are server settings; 
ROW_FORMAT=COMPRESSED is set on the table).

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
ROW_FORMAT=COMPRESSED

For step by step instructions I've updated http://nlp.solutions.asia/?p=180.

The hash is probably a better long-term solution (given that the URL is stored 
in other fields as well) but probably involves more work.
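
The character counts above follow from simple arithmetic, sketched below. The 767-byte limit comes from the issue description. The 3072-byte limit with innodb_large_prefix is my reading of the MySQL documentation and is not stated in this thread, so treat it as an assumption. The class and constant names are hypothetical.

```java
// Hypothetical helper for illustration only.
final class KeyPrefixSketch {

  // utf8mb4 needs up to 4 bytes per character.
  static final int UTF8MB4_MAX_BYTES_PER_CHAR = 4;
  // Default InnoDB index key prefix limit, as given in the issue description.
  static final int DEFAULT_PREFIX_LIMIT_BYTES = 767;
  // Assumed limit when innodb_large_prefix is enabled (per the MySQL docs).
  static final int LARGE_PREFIX_LIMIT_BYTES = 3072;

  // Number of characters guaranteed to fit in an index key prefix.
  static int maxGuaranteedChars(int prefixLimitBytes, int maxBytesPerChar) {
    return prefixLimitBytes / maxBytesPerChar;
  }

  public static void main(String[] args) {
    System.out.println(
        maxGuaranteedChars(DEFAULT_PREFIX_LIMIT_BYTES, UTF8MB4_MAX_BYTES_PER_CHAR));
    System.out.println(
        maxGuaranteedChars(LARGE_PREFIX_LIMIT_BYTES, UTF8MB4_MAX_BYTES_PER_CHAR));
  }
}
```

With the default limit this gives 191 characters, close to the 190 used in the issue, and with the larger limit it gives 768.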


 When using MySQL as storage unicode characters within URLS cause nutch to fail
 --

 Key: NUTCH-1481
 URL: https://issues.apache.org/jira/browse/NUTCH-1481
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 2.1
 Environment: mysql 5.5.28 on centos
Reporter: Arni Sumarlidason
  Labels: database, sql, unicode, utf8

 MySQL's (InnoDB) primary key / unique key is restricted to 767 bytes. 
 Currently the URL of a web page is used as the primary key in Nutch storage.
 When using the latin1 character set on the 'id' column at length 767 
 bytes/characters, Unicode characters in URLs cause JDBC to throw an exception:
 java.io.IOException: java.sql.BatchUpdateException: Incorrect string value: 
 '\xE2\x80\x8' for column 'id' at row 1
 When using the utf8mb4 character set on the 'id' column at length 190 characters / 
 760 bytes to fully support Unicode characters, the field length becomes 
 insufficient.
 It may be better to use a hash of the URL as the primary key instead of the 
 URL itself. This would allow URLs of any length and full UTF-8 support.
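
 A minimal sketch of the hash idea, assuming SHA-256 and a lowercase hex encoding (neither is specified in this issue, and the class and method names are hypothetical). Whatever the URL's length or characters, the key is always 64 ASCII characters, which fits within the 767-byte limit under any character set.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of Nutch or Gora.
final class UrlHashKeySketch {

  // Returns the SHA-256 digest of the UTF-8 encoded URL as 64 hex characters.
  static String hashKey(String url) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      byte[] hash = digest.digest(url.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(hash.length * 2);
      for (byte b : hash) {
        // Mask to 0..255 so negative bytes format as two hex digits.
        hex.append(String.format("%02x", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // Every Java platform is required to provide SHA-256.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(hashKey("http://example.org/\u2028path"));
  }
}
```

 The trade-off is that a hashed key can no longer be scanned by URL prefix, so the URL itself would still need to be stored in another column.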



[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-13 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475757#comment-13475757
 ] 

James Sullivan commented on NUTCH-1475:
---

Agreed, fetch time would be even better, but this seems a simple interim 
solution until NUTCH-1457 happens.

 Nutch 2.1 Index-More Plugin -- A better fall back value for date field
 --

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1, 1.5.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Fix For: 1.6, 2.2

 Attachments: index-more-1xand2x.patch, index-more-2x.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

 Among other fields, the more plugin for Nutch 2.x provides a last modified 
 and a date field for the Solr index. The last modified field is the last 
 modified date from the HTTP headers if available; otherwise it is left 
 empty. Currently, the date field is the same as the last modified field 
 unless that field is empty, in which case getFetchTime is used as a fallback. 
 I think getFetchTime is not a good fallback as it is the next fetch time and 
 is often a month or more in the future, which doesn't make sense for the date 
 field. Users do not expect webpages/documents with future dates. A more 
 sensible fallback would be the current date at the time it is indexed. 
 This is possible by simply changing line 97 of 
 https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
  from
 time = page.getFetchTime(); // use fetch time
 to
 time = new Date().getTime();
 Users interested in the getFetchTime value can still get it from the tstamp 
 field.



[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-08 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1475:
--

Attachment: index-more-1xand2x.patch

Attaching new patch that patches both 1.x and 2.x

 Nutch 2.1 Index-More Plugin -- A better fall back value for date field
 --

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1, 1.5.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Attachments: index-more-1xand2x.patch, index-more-2x.patch

   Original Estimate: 1h
  Remaining Estimate: 1h




[jira] [Created] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-06 Thread James Sullivan (JIRA)
James Sullivan created NUTCH-1475:
-

 Summary: Nutch 2.1 Index-More Plugin -- A better fall back value 
for date field
 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora, 2.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
 Attachments: index-more-2x.patch

Among other fields, the more plugin for Nutch 2.x provides a last modified 
and a date field for the Solr index. The last modified field is the last 
modified date from the HTTP headers if available; otherwise it is left 
empty. Currently, the date field is the same as the last modified field 
unless that field is empty, in which case getFetchTime is used as a fallback. 
I think getFetchTime is not a good fallback as it is the next fetch time and 
is often a month or more in the future, which doesn't make sense for the date 
field. Users do not expect webpages/documents with future dates. A more 
sensible fallback would be the current date at the time it is indexed. 

This is possible by simply changing line 97 of 
https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
 from


time = page.getFetchTime(); // use fetch time

to

time = new Date().getTime();


Users interested in the getFetchTime value can still get it from the tstamp 
field.
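
The proposed fallback can be sketched in isolation as follows. This is a 
minimal illustration, not the MoreIndexingFilter code: the class DateFallback, 
the method resolveDate, and the convention that a value of 0 or less means 
"no Last-Modified header" are assumptions made for the example.

```java
// Hypothetical sketch of the fallback rule; not the actual plugin code.
final class DateFallback {
  private DateFallback() {}

  /**
   * Picks the value for the date field.
   *
   * @param lastModifiedMillis Last-Modified time in epoch millis; a value of 0
   *                           or less is assumed to mean "header not available"
   * @param nowMillis          the time of indexing in epoch millis
   */
  static long resolveDate(long lastModifiedMillis, long nowMillis) {
    // Prefer the header value; otherwise fall back to the indexing time
    // rather than the (future) next fetch time.
    return lastModifiedMillis > 0 ? lastModifiedMillis : nowMillis;
  }

  public static void main(String[] args) {
    // System.currentTimeMillis() returns the same value as new Date().getTime().
    long now = System.currentTimeMillis();
    System.out.println("with header:    " + resolveDate(1349481600000L, now));
    System.out.println("without header: " + resolveDate(0L, now));
  }
}
```

Passing the current time in as a parameter keeps the rule easy to test; the 
one-line change proposed above would read the clock inline instead.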






[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-06 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1475:
--

Attachment: index-more-2x.patch

 Nutch 2.1 Index-More Plugin -- A better fall back value for date field
 --

 Key: NUTCH-1475
 URL: https://issues.apache.org/jira/browse/NUTCH-1475
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora, 2.1
 Environment: All
Reporter: James Sullivan
Priority: Minor
  Labels: index-more, plugins
 Attachments: index-more-2x.patch

   Original Estimate: 1h
  Remaining Estimate: 1h

