[jira] [Resolved] (NUTCH-1496) ParserJob logs skipped urls with level info

2012-11-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1496.
-

   Resolution: Fixed
Fix Version/s: 2.2

Committed @revision 1408271 in 2.2-SNAPSHOT

Thank you very much Nathan for the contribution

 ParserJob logs skipped urls with level info
 ---

 Key: NUTCH-1496
 URL: https://issues.apache.org/jira/browse/NUTCH-1496
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Nathan Gass
Priority: Trivial
 Fix For: 2.2

 Attachments: patch-parserjob-log-level-2012.txt


 ParserJob is the only one which logs *all* skipped urls with level info. 
 Attached patch changes this to level debug, the same level already used by 
 FetcherJob, IndexerJob, and GeneratorJob.
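
 The effect of moving a per-URL message from info to debug can be sketched 
 as below. This is only an illustration, not the attached patch: it uses 
 java.util.logging as a stand-in (FINE playing the role of debug), and the 
 class, logger name, and message text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class SkipLoggingSketch {
    // Hypothetical logger name; not taken from the Nutch sources.
    static final Logger LOG = Logger.getLogger("sketch.ParserJob");

    /** Collects published records so the effect of the level can be observed. */
    static final class CollectingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();
        @Override public void publish(LogRecord r) { if (isLoggable(r)) records.add(r); }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Logs a skipped URL at FINE (the debug analogue) instead of INFO. */
    static void skip(String url) {
        if (LOG.isLoggable(Level.FINE)) {   // avoid building the message when disabled
            LOG.fine("Skipping " + url);
        }
    }

    /** Returns how many records are emitted for the given URLs at a logger level. */
    static int recordsAt(Level level, String... urls) {
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(Level.ALL);
        LOG.setUseParentHandlers(false);
        LOG.setLevel(level);
        LOG.addHandler(handler);
        try {
            for (String url : urls) {
                skip(url);
            }
        } finally {
            LOG.removeHandler(handler);
        }
        return handler.records.size();
    }
}
```

 With the logger at INFO the skipped URLs produce no output; they only 
 appear once the level is lowered to FINE.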

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-12 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495243#comment-13495243
 ] 

Lewis John McGibbney commented on NUTCH-1497:
-

Hi James, this is great. Is it possible for you to submit a patch? If a patch 
is applied, it is much easier for us to track the provenance through the viewvc 
system, as opposed to a complete file change.

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml


 The current generic default gora-sql-mapping.xml has field sizes that are too 
 small in almost all situations when used with MySQL. I have included a 
 mapping which will work better for MySQL (takes slightly more space but will 
 be able to handle larger fields necessary for real world use). Includes patch 
 from Nutch-1490 and resolves the non-Unicode part of Nutch-1473. I believe it 
 is not possible to use the same gora-sql-mapping for both hsqldb and MySQL 
 without a significantly degraded lowest common denominator resulting. Should 
 the user manually rename the attached file to gora-sql-mapping.xml or is 
 there a way to have Nutch automatically use it when MySQL is selected in 
 other configurations (Ivy.xml or gora.properties)?



[jira] [Commented] (NUTCH-1494) RSS feed plugin seems broken

2012-11-12 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495249#comment-13495249
 ] 

Lewis John McGibbney commented on NUTCH-1494:
-

Without tracking the bug in Rome we have little chance of sorting this out in 
Nutch. For the time being our Rome dependency should remain at 0.9. 

 RSS feed plugin seems broken
 

 Key: NUTCH-1494
 URL: https://issues.apache.org/jira/browse/NUTCH-1494
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5.1
 Environment: Ubuntu 12.04 + JDK 7
Reporter: Sourajit Basak
 Fix For: 1.6

 Attachments: sourajit.NUTCH-1494.2.patch


 The RSS feed plugin is broken.
 I had to change the plugin dependencies to point to the correct rome library 
 version.
 <!-- changed to the version that's bundled with v1.5.1, previously 0.9 -->
  <library name="rome-1.0.0.jar" />
  <!-- added this due to a CNFE from rome -->
  <library name="jdom-1.0.jar" />
 Still it fails due to some (known) problem in rome.
 Caused by: java.lang.NullPointerException
 at java.util.Properties$LineReader.readLine(Properties.java:434)
 at java.util.Properties.load0(Properties.java:353)
 at java.util.Properties.load(Properties.java:341)
 at com.sun.syndication.io.impl.PropertiesLoader.<init>(PropertiesLoader.java:74)
 at com.sun.syndication.io.impl.PropertiesLoader.getPropertiesLoader(PropertiesLoader.java:46)
 at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:54)
 at com.sun.syndication.io.impl.PluginManager.<init>(PluginManager.java:46)
 at com.sun.syndication.feed.synd.impl.Converters.<init>(Converters.java:40)
 at com.sun.syndication.feed.synd.SyndFeedImpl.<clinit>(SyndFeedImpl.java:59)



[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-12 Thread Nathan Gass (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495269#comment-13495269
 ] 

Nathan Gass commented on NUTCH-1497:


Some comments about the differences to NUTCH-1490:

The typ column was renamed because it is oddly named. Imho the rename should 
be done, but it is not actually specific to MySQL. The length increase for 
that column, on the other hand, is necessary: I got truncation exceptions with 
the typ column set to length 32. Of course, if this should not happen, I can 
try to find out which URL was responsible for the exception to get at the root 
cause.

Setting outlinks to the same length as inlinks makes it unnecessarily large (at 
least as soon as the maximum number of outlinks actually gets enforced in Nutch). 
With the patch in NUTCH-1490, Gora uses the column type mediumblob, whereas with 
this file it would use longblob. I have no idea whether this difference is 
significant.

Increasing the maximum length of URLs and titles only makes the truncation 
errors occur less frequently. A real fix is to enforce the given maximum length 
with appropriate checks in Nutch code.
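
A length check of the kind suggested above could look roughly like the 
following. It is a sketch under assumptions: the helper name and the idea of 
clamping by Java chars are hypothetical, and a real column limit may be 
counted in bytes or code points depending on the schema and character set.

```java
final class FieldLengthSketch {
    /**
     * Returns value cut down to at most maxChars Java chars, or value itself
     * when it already fits. Never cuts a surrogate pair in half.
     */
    static String clamp(String value, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        if (value == null || value.length() <= maxChars) {
            return value;
        }
        int end = maxChars;
        // If the cut would land between a high and a low surrogate, back off by one.
        if (end > 0 && Character.isHighSurrogate(value.charAt(end - 1))) {
            end--;
        }
        return value.substring(0, end);
    }
}
```

Whether to truncate, skip the record, or raise a descriptive error is a 
design choice this sketch does not settle; the point is only that the check 
happens before the value reaches the store.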





[jira] [Resolved] (NUTCH-1451) Upgrade automaton jar to 1.11-8

2012-11-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1451.
-

Resolution: Fixed

Committed @revision 1408282 in trunk
Committed @revision 1408289 in 2.2-SNAPSHOT

I didn't upload patches for these fixes, as the generated patches contained 
loads of non-UTF-8 characters which corrupted the file. 
The fixes remove the need to ship automaton.jar and its license files with 
Nutch. The automaton dependencies are now pulled in by Ivy. 

 Upgrade automaton jar to 1.11-8
 ---

 Key: NUTCH-1451
 URL: https://issues.apache.org/jira/browse/NUTCH-1451
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6, 2.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6, 2.2


 The latest version, 1.11-8, was released September 7, 2011.
 This library is significantly faster than the default regex parsing. I 
 haven't got a clue what version we currently use, but the license states 2005, 
 so I'm guessing it's been a long time since it was upgraded.
 I'll get a patch together and, for completeness, run independent tests to 
 compare results pre and post upgrade. It would be nice to see marginal 
 improvements :0)



[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-12 Thread Nathan Gass (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495272#comment-13495272
 ] 

Nathan Gass commented on NUTCH-1497:


It seems to me that Gora should hide as many database specifics as possible, 
and Gora actually seems to do this just fine in this case (using MySQL-specific 
column types when the length is large). What are the disadvantages of hsqldb 
using the new length values?





[jira] [Commented] (NUTCH-1451) Upgrade automaton jar to 1.11-8

2012-11-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13495277#comment-13495277
 ] 

Hudson commented on NUTCH-1451:
---

Integrated in nutch-trunk-maven #491 (See 
[https://builds.apache.org/job/nutch-trunk-maven/491/])
NUTCH-1451 Upgrade automaton jar to 1.11-8 (Revision 1408282)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/urlfilter-automaton/ivy.xml
* /nutch/trunk/src/plugin/urlfilter-automaton/lib
* /nutch/trunk/src/plugin/urlfilter-automaton/plugin.xml





[jira] [Resolved] (NUTCH-1484) TableUtil unreverseURL fails on file:// URLs

2012-11-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1484.


Resolution: Fixed

Committed to 2.x (rev. 1408465)

 TableUtil unreverseURL fails on file:// URLs
 

 Key: NUTCH-1484
 URL: https://issues.apache.org/jira/browse/NUTCH-1484
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Sebastian Nagel
Priority: Critical
 Fix For: 2.2

 Attachments: NUTCH-1484.patch


 (reported by Rogério Pereira Araújo, see NUTCH-1483)
 When crawling the local filesystem TableUtil.unreverseURL fails for URLs with 
 empty host part (file:///Documents/). StringUtils.split(String, char) does 
 not preserve empty parts which causes:
 {code}
 java.lang.ArrayIndexOutOfBoundsException: 1
 at org.apache.nutch.util.TableUtil.unreverseUrl(TableUtil.java:98)
 {code}
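
 The failure mode can be reproduced in isolation with the sketch below. It is 
 not the Nutch code or the committed patch: the reversed-key layout 
 (reversed host, then protocol, then optional port, separated by ':' and 
 followed by the path) is an assumption made for the example, and 
 splitDroppingEmpty only mimics a split that discards empty tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class UnreverseSketch {

    /** Splits on sep and keeps empty tokens, including a leading one. */
    static String[] splitPreservingEmpty(String s, char sep) {
        return s.split(Pattern.quote(String.valueOf(sep)), -1);
    }

    /** Mimics a split that silently drops empty tokens. */
    static String[] splitDroppingEmpty(String s, char sep) {
        List<String> kept = new ArrayList<>();
        for (String token : splitPreservingEmpty(s, sep)) {
            if (!token.isEmpty()) {
                kept.add(token);
            }
        }
        return kept.toArray(new String[0]);
    }

    /** Rebuilds a URL from the assumed "reversedHost:protocol[:port]/path" form. */
    static String unreverse(String reversed, boolean preserveEmpty) {
        int slash = reversed.indexOf('/');
        String head = slash < 0 ? reversed : reversed.substring(0, slash);
        String path = slash < 0 ? "" : reversed.substring(slash);

        String[] parts = preserveEmpty
                ? splitPreservingEmpty(head, ':')
                : splitDroppingEmpty(head, ':');

        StringBuilder url = new StringBuilder();
        // With an empty host and dropped tokens, parts has length 1 and this throws.
        url.append(parts[1]).append("://");

        // Un-reverse the host labels: "com.example.www" becomes "www.example.com".
        String[] labels = splitPreservingEmpty(parts[0], '.');
        for (int i = labels.length - 1; i >= 0; i--) {
            url.append(labels[i]);
            if (i > 0) {
                url.append('.');
            }
        }
        if (parts.length > 2) {
            url.append(':').append(parts[2]);
        }
        return url.append(path).toString();
    }
}
```

 Calling unreverse(":file/Documents/", false) throws 
 ArrayIndexOutOfBoundsException because the empty host token is lost, while 
 the same call with true keeps it and rebuilds file:///Documents/.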



[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1370:
---

Attachment: NUTCH-1370-1.x.patch

Ferdy is right: custom counters are more transparent.
Patch for 1.x


 Expose exact number of urls injected @runtime 
 --

 Key: NUTCH-1370
 URL: https://issues.apache.org/jira/browse/NUTCH-1370
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6, 2.2

 Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch


 Example: When using trunk, currently we see 
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
 urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 I would like to see
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 
 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: 
 crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to 
 crawl/crawldb
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected 
 urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 This would make debugging easier and would help those who end up getting 
 {code}
 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected 
 for fetching, exiting ...
 {code}
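
 The counter idea from the comment above can be sketched in plain Java as 
 follows. This is a stand-in for illustration only: it does not use the 
 Hadoop counter API, the counter names are hypothetical, and the rule for 
 rejecting seed lines (blank lines and lines starting with '#') is an 
 assumption. Only the shape of the log message is taken from the issue text.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class InjectCounterSketch {
    /** Hypothetical counter names. */
    enum Counter { URLS_INJECTED, URLS_REJECTED }

    private final Map<Counter, Long> counters = new EnumMap<>(Counter.class);

    void increment(Counter c) {
        counters.merge(c, 1L, Long::sum);
    }

    long get(Counter c) {
        return counters.getOrDefault(c, 0L);
    }

    /** Assumed acceptance rule: skip blank lines and '#' comment lines. */
    static boolean accept(String line) {
        String trimmed = line.trim();
        return !trimmed.isEmpty() && !trimmed.startsWith("#");
    }

    /** Counts every seed line and returns the summary line to be logged. */
    String inject(List<String> seedLines, String crawlDb) {
        for (String line : seedLines) {
            increment(accept(line) ? Counter.URLS_INJECTED : Counter.URLS_REJECTED);
        }
        return "Injector: Injected " + get(Counter.URLS_INJECTED) + " urls to " + crawlDb;
    }
}
```

 Keeping a separate counter for rejected lines makes a "0 records selected" 
 run easier to diagnose, since it shows whether the seed list was empty or 
 everything was filtered out.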
