[jira] Updated: (NUTCH-952) fix outlink which started with '?' in html parser

2011-01-07 Thread Stondet (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stondet updated NUTCH-952:
--

Affects Version/s: 2.0  (was: 1.3)

 fix outlink which started with '?' in html parser
 -

 Key: NUTCH-952
 URL: https://issues.apache.org/jira/browse/NUTCH-952
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.0
Reporter: Stondet
 Attachments: NUTCH-952-v2.patch


 <a href="?w=ruby%20on%20rails&ty=c&sd=0">ruby on rails</a> (a snippet from 
 http://bbs.soso.com/search?ty=c&sd=0&w=rails)
 Outlink parsed from the above link: 
 http://bbs.soso.com/?w=ruby%20on%20rails&ty=c&sd=0
 but the expected outlink is http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0
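
 The expected outlink matches RFC 3986 reference resolution, where a reference
 that consists only of a query keeps the base path and replaces the base query.
 A minimal Java sketch of that rule follows. It is not the attached
 NUTCH-952-v2.patch; the class and method names (QueryOnlyLinks, resolve) are
 hypothetical, and the sample URLs are written with the & separators that the
 archived snippet appears to have lost.

```java
import java.net.URI;

public class QueryOnlyLinks {

    // Hypothetical helper: resolve an outlink target against the URL of the
    // page it was found on.
    static String resolve(String baseUrl, String target) {
        URI base = URI.create(baseUrl);
        if (target.startsWith("?")) {
            // Query-only reference: keep the scheme, authority and path of
            // the base and swap in the new query string.
            return base.getScheme() + "://" + base.getRawAuthority()
                    + base.getRawPath() + target;
        }
        // Every other kind of reference is left to the standard resolver.
        return base.resolve(target).toString();
    }

    public static void main(String[] args) {
        String page = "http://bbs.soso.com/search?ty=c&sd=0&w=rails";
        System.out.println(resolve(page, "?w=ruby%20on%20rails&ty=c&sd=0"));
    }
}
```

 Run against the page above, the sketch prints
 http://bbs.soso.com/search?w=ruby%20on%20rails&ty=c&sd=0, i.e. the path
 /search is preserved.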

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (NUTCH-954) Bugfix for Content-Length limit in http protocols

2011-01-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-954.
-

Resolution: Fixed

1.3 : Committed revision 1056359
trunk : Committed revision 1056362

Thanks Alexis!

 Bugfix for Content-Length limit in http protocols
 -

 Key: NUTCH-954
 URL: https://issues.apache.org/jira/browse/NUTCH-954
 Project: Nutch
  Issue Type: Sub-task
  Components: fetcher
Affects Versions: 1.3, 2.0
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.3, 2.0


 3. Content-Length limit (nutch3.patch)
 This is related to NUTCH-899.
 The patch prevents the entire flush operation on the Gora datastore from 
 crashing because the MySQL blob limit was exceeded by a few bytes. Both the 
 protocol-http and protocol-httpclient plugins were affected.
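
 One way to apply such a limit strictly is to cap the number of bytes copied
 from the response stream, so the stored content can never exceed the
 configured maximum, not even by part of a read buffer. The sketch below is an
 illustration under stated assumptions and not the committed patch: the class
 ContentLimit and the helper readLimited are hypothetical, and a negative
 limit is assumed to mean "no limit".

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ContentLimit {

    // Hypothetical helper: copy at most maxContent bytes from the stream.
    // A negative maxContent is assumed to disable the limit.
    static byte[] readLimited(InputStream in, int maxContent) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                if (maxContent >= 0 && out.size() + n > maxContent) {
                    // Keep only the bytes that still fit, then stop reading.
                    out.write(buf, 0, maxContent - out.size());
                    break;
                }
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = new byte[10_000];
        System.out.println(readLimited(new ByteArrayInputStream(body), 5_000).length);
    }
}
```

 The comparison is made before each write, so the buffer size has no effect on
 how many bytes end up in the result.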




[jira] Commented: (NUTCH-950) Content-Length limit, URL filter and few minor issues

2011-01-07 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12978832#action_12978832
 ] 

Julien Nioche commented on NUTCH-950:
-

Have committed the first 3 sub-issues.

Regarding the last one, I haven't tested the first point (version changes) but 
here are a few comments about the other issues: 
* HBase + MySQL: these backends should not be provided by default, same for 
the MySQL connector. One option would be to add them to the ivy file but 
comment them out and give a bit of an explanation, e.g. "uncomment this if you 
want to use xxx as a GORA backend"
* the dependency com.jcraft/jsch should be placed in the ivy file of the 
corresponding plugin, not in the main one

Alexis, could you please create a new issue for this, then mark this issue as 
resolved? Having a single JIRA number for completely separate issues is a bad 
idea and does not help keep things in sync with the svn commits.

Thanks a lot for your contributions

Julien


 Content-Length limit, URL filter and few minor issues
 -

 Key: NUTCH-950
 URL: https://issues.apache.org/jira/browse/NUTCH-950
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
Reporter: Alexis
 Attachments: nutch1.patch, nutch2.patch, nutch3.patch, nutch4.patch


 1. crawl command (nutch1.patch)
 The class was renamed to Crawler but the references to it were not updated.
 2. URL filter (nutch2.patch)
 This avoids an NPE on bogus URLs whose host does not have a suffix.
 3. Content-Length limit (nutch3.patch)
 This is related to NUTCH-899.
 The patch prevents the entire flush operation on the Gora datastore from 
 crashing because the MySQL blob limit was exceeded by a few bytes. Both the 
 protocol-http and protocol-httpclient plugins were affected.
 4. Ivy configuration (nutch4.patch)
 - Change xercesImpl and restlet versions. These 2 version changes are 
 required. The first one currently makes a JUnit test crash, the second one is 
 missing from the default Maven repository.
 - Add gora-hbase and zookeeper, which is an HBase dependency. Add the MySQL 
 connector. These jars are necessary to run Gora with HBase or MySQL 
 datastores. (more a suggestion than a requirement here)
 - Add com.jcraft/jsch, which is a protocol-sftp plugin dependency. 
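
 The NPE in item 2 is the kind of failure that a null check on the suffix
 lookup prevents. The sketch below only illustrates that guard and is not
 nutch2.patch: the class SuffixGuard, its tiny suffix table and both methods
 are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SuffixGuard {

    // Hypothetical stand-in for a real table of known domain suffixes.
    static final Set<String> KNOWN_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "net"));

    // Returns the suffix of the host, or null when the host has no suffix
    // that the table recognises (for example "localhost").
    static String suffixOf(String host) {
        int dot = host.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String candidate = host.substring(dot + 1).toLowerCase();
        return KNOWN_SUFFIXES.contains(candidate) ? candidate : null;
    }

    // Accepts the host only if its suffix is in the allowed set.
    static boolean accept(String host, Set<String> allowedSuffixes) {
        String suffix = suffixOf(host);
        if (suffix == null) {
            // Bogus host without a suffix: reject it instead of letting a
            // later dereference of the null suffix throw.
            return false;
        }
        return allowedSuffixes.contains(suffix);
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("com"));
        System.out.println(accept("example.com", allowed));
        System.out.println(accept("localhost", allowed));
    }
}
```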




[jira] Resolved: (NUTCH-824) Crawling - File Error 404 when fetching file with an hexadecimal character in the file name.

2011-01-07 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-824.
-

Resolution: Fixed
  Assignee: Julien Nioche  (was: Markus Jelsma)

Have reactivated the tests for protocol-file in 1.3 and reorganised the test 
documents to follow other plugins, i.e. test docs in the sample dir.
Protocol-file now decodes the input URLs with UTF-8.
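
The file name in the report contains the percent-encoded sequences %28 and 
%29, which presumably were looked up literally on disk before this change. 
The sketch below shows UTF-8 decoding of such a path with 
java.net.URLDecoder. It is an assumption that this resembles the committed 
change; the class FileUrlDecoding and the method decodePath are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FileUrlDecoding {

    // Hypothetical helper: turn a percent-encoded URL path into the name
    // that should be looked up on the local file system.
    static String decodePath(String encodedPath) {
        try {
            // Caveat: URLDecoder also converts '+' to a space, which may not
            // be desirable for file paths.
            return URLDecoder.decode(encodedPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodePath(
            "/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html"));
    }
}
```

For the path from the report this prints 
/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._(album)_8a09.html.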

trunk : Committed revision 1056394
1.3 : Committed revision 1056401

Thanks Michela

 Crawling - File Error 404 when fetching file with an hexadecimal character in 
 the file name.
 

 Key: NUTCH-824
 URL: https://issues.apache.org/jira/browse/NUTCH-824
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.0.0, 1.2, 1.3, 2.0
 Environment: Linux nube 2.6.31-20-server #58-Ubuntu SMP x86_64 
 GNU/Linux
Reporter: Michela Becchi
Assignee: Julien Nioche
 Fix For: 1.3, 2.0, 1.0.0


 Hello,
 I am performing a local file system crawl.
 My problem is the following: all files that contain some hexadecimal 
 characters in the name do not get crawled.
 For example, I will see the following error:
 fetching file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html
 org.apache.nutch.protocol.file.FileError: File Error: 404
 at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:92)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)
 fetch of file:/nutch-1.0/wikidump/wiki-en/en/articles/a/2E/m/A.M._%28album%29_8a09.html failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
 I am using nutch-1.0.
 Among other standard settings, I configured nutch-site.conf as follows:
 <property>
   <name>plugin.includes</name>
   <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>
 <property>
   <name>file.content.limit</name>
   <value>-1</value>
 </property>
 Moreover, crawl-urlfilter.txt   looks like:
 # skip http:, ftp:,  mailto: urls
 -^(http|ftp|mailto):
 # skip image and other suffixes we can't yet parse
 -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
 # skip URLs containing certain characters as probable queries, etc.
 -[...@=]
 # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
 -.*(/[^/]+)/[^/]+\1/[^/]+\1/
 # accept hosts in MY.DOMAIN.NAME
 #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
 # accept everything else
 +.*
 ~
 ---
 Thanks,
 Michela




Build failed in Hudson: Nutch-trunk #1361

2011-01-07 Thread Apache Hudson Server
See https://hudson.apache.org/hudson/job/Nutch-trunk/1361/changes

Changes:

[jnioche] NUTCH-824 FileProtocol does not resolve encoded URLs

[jnioche] NUTCH-954 Strict application of Content-Length limit for http 
protocols (Alexis Detreglode via jnioche)

--
[...truncated 1007 lines...]
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A src/plugin/index-more/plugin.xml
A src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A src/plugin/parse-ext
A src/plugin/parse-ext/ivy.xml
A src/plugin/parse-ext/src
A src/plugin/parse-ext/src/test
A src/plugin/parse-ext/src/test/org
A src/plugin/parse-ext/src/test/org/apache
A src/plugin/parse-ext/src/test/org/apache/nutch
A src/plugin/parse-ext/src/test/org/apache/nutch/parse
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A src/plugin/parse-ext/src/java
A src/plugin/parse-ext/src/java/org
A src/plugin/parse-ext/src/java/org/apache
A src/plugin/parse-ext/src/java/org/apache/nutch
A src/plugin/parse-ext/src/java/org/apache/nutch/parse
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A src/plugin/parse-ext/plugin.xml
A src/plugin/parse-ext/build.xml
A src/plugin/parse-ext/command
A src/plugin/urlnormalizer-pass
A src/plugin/urlnormalizer-pass/ivy.xml
A src/plugin/urlnormalizer-pass/src
A src/plugin/urlnormalizer-pass/src/test
A src/plugin/urlnormalizer-pass/src/test/org
A src/plugin/urlnormalizer-pass/src/test/org/apache
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A src/plugin/urlnormalizer-pass/src/java
A src/plugin/urlnormalizer-pass/src/java/org
A src/plugin/urlnormalizer-pass/src/java/org/apache
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A src/plugin/parse-html
A src/plugin/parse-html/ivy.xml
A src/plugin/parse-html/lib
A src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A src/plugin/parse-html/src
A