[jira] [Updated] (NUTCH-966) Behavior of NOINDEX,FOLLOW is not intuitive

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-966:
---

Fix Version/s: 2.2
   1.7

 Behavior of NOINDEX,FOLLOW is not intuitive
 ---

 Key: NUTCH-966
 URL: https://issues.apache.org/jira/browse/NUTCH-966
 Project: Nutch
  Issue Type: Improvement
  Components: indexer, parser
Affects Versions: 1.2
Reporter: Josh Pavel
Priority: Minor
 Fix For: 1.7, 2.2


 If a page has NOINDEX,FOLLOW for the ROBOTS metatag, Nutch will still create 
 a document that can be found in the index via metatag or URL matching.  
 Instead, Nutch should rely on doc or parse metadata but nothing should be 
 stored by the html parser. (thanks to Julien Nioche for helping me to 
 understand the issue). 
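The intended semantics can be illustrated with a minimal Java sketch that splits a ROBOTS meta value into two independent decisions, one for indexing and one for link following. The class and method names (RobotsMetaSketch, parse) are hypothetical and this is not Nutch's actual parser code; it only shows that NOINDEX,FOLLOW should turn indexing off while leaving outlinks usable.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: interprets the content of a ROBOTS meta tag.
final class RobotsMetaSketch {
    final boolean index;   // may the page itself be indexed?
    final boolean follow;  // may its outlinks be followed?

    private RobotsMetaSketch(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    // Directives are comma-separated and case-insensitive; the default is index + follow.
    static RobotsMetaSketch parse(String content) {
        boolean index = true;
        boolean follow = true;
        for (String token : content.split(",")) {
            switch (token.trim().toLowerCase(Locale.ROOT)) {
                case "noindex":  index = false; break;
                case "nofollow": follow = false; break;
                case "none":     index = false; follow = false; break;
                case "all":      index = true; follow = true; break;
                default:         break; // ignore unknown directives
            }
        }
        return new RobotsMetaSketch(index, follow);
    }
}
```

An indexer built on this would skip document creation whenever index is false, while the link extractor would consult only follow.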

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-911) recrawls file protocol causes Errors/Exceptions when actually not modified or gone

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-911:
---

Fix Version/s: 1.7

 recrawls file protocol causes Errors/Exceptions when actually not modified or 
 gone
 --

 Key: NUTCH-911
 URL: https://issues.apache.org/jira/browse/NUTCH-911
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.1
Reporter: Peter Lundberg
Priority: Minor
 Fix For: 1.7


 When recrawling file systems, files are marked as errors and log entries such 
 as the following appear:
 java.net.MalformedURLException
   at java.net.URL.<init>(URL.java:601)
   at java.net.URL.<init>(URL.java:464)
   at java.net.URL.<init>(URL.java:413)
   at org.apache.nutch.protocol.file.File.getProtocolOutput(File.java:85)
   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:627)
 fetch of file:/Users/peter.lundberg/Documents/valtech/scan-test/Peter 
 Lundberg 20090929.pdf failed with: java.net.MalformedURLException
 This is due to FileResponse and File not working well together. The same is 
 true for files that disappear from the file system being crawled after a 
 while (i.e. they are reported as errors instead of GONE). I am too new to 
 Nutch to know the design rationale behind this or any side effects. Below is 
 a patch that I have used; it cleans up the segment data and removes false 
 errors from the log file.
 --- src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java (revision 997976)
 +++ src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/File.java (working copy)
 @@ -79,6 +79,10 @@
        if (code == 200) {  // got a good response
          return new ProtocolOutput(response.toContent());  // return it

 +      } else if (code == 404) {   // handle no such file
 +        return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_GONE);
 +      } else if (code == 304) {   // handle not modified
 +        return new ProtocolOutput(response.toContent(), ProtocolStatus.STATUS_NOTMODIFIED);
        } else if (code >= 300 && code < 400) { // handle redirect
          if (redirects == MAX_REDIRECTS)
            throw new FileException("Too many redirects: " + url);



[jira] [Resolved] (NUTCH-813) Repetitive crawl 403 status page

2013-01-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-813.
---

Resolution: Duplicate

The described problem is identical to that of NUTCH-578. The provided patch 
(call setPageGoneSchedule when retry counter hits db.fetch.retry.max) is 
included in all patches of NUTCH-578.

 Repetitive crawl 403 status page
 

 Key: NUTCH-813
 URL: https://issues.apache.org/jira/browse/NUTCH-813
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.1
Reporter: Nguyen Manh Tien
Priority: Minor
 Fix For: 1.7

 Attachments: ASF.LICENSE.NOT.GRANTED--Patch


 When we crawl a page that returns a 403 status, it is recrawled every day 
 under the default schedule, even when we limit retries with the parameter 
 db.fetch.retry.max.



[jira] [Resolved] (NUTCH-910) Cached.jsp has a bug with encoding

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-910.


Resolution: Won't Fix

this is a legacy issue so we won't be fixing it.

 Cached.jsp has a bug with encoding
 --

 Key: NUTCH-910
 URL: https://issues.apache.org/jira/browse/NUTCH-910
 Project: Nutch
  Issue Type: Bug
  Components: web gui
Affects Versions: 1.0.0
 Environment: Any environment
Reporter: Attila Pados
Priority: Minor
   Original Estimate: 2m
  Remaining Estimate: 2m

 cached.jsp
 For pages with a non-default (e.g. non-UTF-8) encoding, the cached content 
 is displayed garbled. This is quite annoying, but it does not critically 
 harm functionality.
 add:      Metadata parseData = bean.getParseData(details).getParseMeta();
 original: Metadata metaData = bean.getParseData(details).getContentMeta();
 replace:  String encoding = (String) 
 parseData.get("CharEncodingForConversion");
 In cached.jsp, the encoding variable is retrieved from the wrong metadata 
 source, contentMeta, which does not include this value. It resides in the 
 parse metadata instead.
 The first line above is not a replacement; it has to be added. The original 
 metaData is still needed there for other things.
 Further down, the line that assigns the encoding value has to be changed; 
 that one is a replacement.
 This fix is for Nutch 1.0. I didn't find an issue in the list that covers 
 this, just a mail found with Google on a mailing list that referred to it.



[jira] [Updated] (NUTCH-923) Multilingual support for Solr-index-mapping

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-923:
---

   Patch Info: Patch Available
Fix Version/s: 1.7

 Multilingual support for Solr-index-mapping
 ---

 Key: NUTCH-923
 URL: https://issues.apache.org/jira/browse/NUTCH-923
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.2
Reporter: Matthias Agethle
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.7

 Attachments: patch-923-nutch-release-1.2.txt


 It would be useful to extend the mapping possibilities when indexing to Solr.
 One useful feature would be to use the detected language of the HTML page 
 (for example via the language-identifier plugin) and send the content to 
 the corresponding language-aware Solr fields.
 The mapping file could be as follows:
 <field dest="lang" source="lang"/>
 <field dest="title_${lang}" source="title"/>
 so that the title field gets mapped to title_en for English pages and 
 title_fr for French pages.
 What do you think? Could this also be useful to others?
 Or are there already other solutions out there?
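As a rough illustration of the proposal, the following Java sketch expands a ${...} placeholder in a destination field name using values already present on the document. The class FieldNameTemplateSketch and its expand method are hypothetical names chosen for this example; they are not part of Nutch's Solr mapping code, and the attached patch may work differently.

```java
import java.util.Map;

// Hypothetical helper, not part of Nutch's Solr mapping code.
final class FieldNameTemplateSketch {
    // Replaces each ${name} placeholder in a destination field name with the
    // matching document value; unknown placeholders are left untouched.
    static String expand(String destTemplate, Map<String, String> docValues) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = destTemplate.indexOf("${", pos);
            int end = start < 0 ? -1 : destTemplate.indexOf('}', start);
            if (start < 0 || end < 0) {
                // No further complete placeholder: copy the rest verbatim.
                out.append(destTemplate.substring(pos));
                return out.toString();
            }
            String key = destTemplate.substring(start + 2, end);
            String value = docValues.get(key);
            out.append(destTemplate, pos, start);
            out.append(value != null ? value : destTemplate.substring(start, end + 1));
            pos = end + 1;
        }
    }
}
```

With a document whose lang value is "fr", the template title_${lang} would expand to title_fr, which matches the mapping described in the issue.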



[jira] [Updated] (NUTCH-829) duplicate hadoop temp files

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-829:
---

Fix Version/s: 2.2
   1.7

 duplicate hadoop temp files
 ---

 Key: NUTCH-829
 URL: https://issues.apache.org/jira/browse/NUTCH-829
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.0.0, 1.1
Reporter: Mike Baranczak
Priority: Minor
 Fix For: 1.7, 2.2


 When two crawls are started at exactly the same time, I see the following 
 error: 
 {quote}
 org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
 file:/tmp/hadoop-mike/mapred/temp/generate-temp-1276463469075 already exists
   at 
 org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:793)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
   at org.apache.nutch.crawl.Generator.generate(Generator.java:472)
   at org.apache.nutch.crawl.Generator.generate(Generator.java:409)
 [...]
 {quote}
 I traced it down to this code in Generator (I'm using Nutch 1.0, but this is 
 still in the trunk):
 {quote}
 Path tempDir =
   new Path(getConf().get("mapred.temp.dir", ".") +
            "/generate-temp-" + System.currentTimeMillis());
 {quote}
 I admit that this is an unlikely scenario for most users, but it just so 
 happens that I ran into it. To absolutely guarantee that the temp directory 
 doesn't already exist, I suggest changing System.currentTimeMillis() to 
 java.util.UUID.randomUUID().toString().
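The suggested change can be sketched in plain Java as follows. The class TempDirNameSketch and its method are hypothetical and use a String instead of Hadoop's Path so that the example stays self-contained. Note that a random UUID makes a collision extremely unlikely rather than strictly impossible.

```java
import java.util.UUID;

// Hypothetical helper illustrating the suggestion; not the actual Generator code.
final class TempDirNameSketch {
    // Builds a temp directory path whose suffix is a random UUID instead of a
    // millisecond timestamp, so two jobs started in the same millisecond differ.
    static String generateTempDir(String baseDir) {
        return baseDir + "/generate-temp-" + UUID.randomUUID();
    }
}
```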



[jira] [Resolved] (NUTCH-625) Non-ascii character broken in dumped content for mixed encoding (utf-8 and multi-byte)

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-625.


Resolution: Won't Fix

as per Dogacan's comments

 Non-ascii character broken in dumped content for mixed encoding (utf-8 and 
 multi-byte)
 --

 Key: NUTCH-625
 URL: https://issues.apache.org/jira/browse/NUTCH-625
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Vinci
Priority: Minor

 If the crawl db contains both UTF-8 non-ASCII characters and non-UTF-8 
 non-ASCII characters (i.e. multi-byte characters), the content dumped by the 
 readseg utility shows garbled characters in all of the non-UTF-8 non-ASCII 
 text, and that text cannot be repaired by reloading it with another encoding.
 At the same time, the UTF-8 text is fine; only the non-UTF-8 text is broken.
 Is there any way to repair the broken text?



[jira] [Updated] (NUTCH-609) Allow Plugins to be Loaded from Jar File(s)

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-609:
---

Fix Version/s: 2.2
   1.7

 Allow Plugins to be Loaded from Jar File(s)
 ---

 Key: NUTCH-609
 URL: https://issues.apache.org/jira/browse/NUTCH-609
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-609-1-20080212.patch


 Currently plugins cannot be loaded from a jar file.  Plugins must be unzipped 
 into one or more directories specified by the plugin.folders config.  I have 
 been thinking about an extension to PluginRepository or PluginManifestParser 
 (or both) that would allow plugins to be packaged into multiple independent 
 jar files and placed on the classpath.  The system would search the classpath 
 for resources with the correct folder name and would load any plugins in 
 those jars.
 This functionality would be very useful in making the Nutch core more 
 flexible in terms of packaging.  It would also help with web applications 
 where we don't want to have a plugins directory included in the webapp.
 My thoughts so far are to unzip those plugin jars into a common temp 
 directory before loading.  Another option is to use something like Commons 
 VFS to interact with the jar files.  VFS essentially uses a disk-based 
 temporary cache for jar files, so it is pretty much the same solution.  What 
 are everyone else's thoughts on this?



[jira] [Updated] (NUTCH-670) feed plugin does not parse RSS2 enclosures

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-670:
---

Fix Version/s: 2.2
   1.7

 feed plugin does not parse RSS2 enclosures
 --

 Key: NUTCH-670
 URL: https://issues.apache.org/jira/browse/NUTCH-670
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor
 Fix For: 1.7, 2.2

   Original Estimate: 1h
  Remaining Estimate: 1h

 The feed parser in plugins/feed does not count links found in RSS2 
 enclosure tags as Outlinks.
 It's a pretty simple patch - SyndEntry has a getEnclosures call. I'll submit 
 the patch tomorrow.



[jira] [Updated] (NUTCH-664) Possibility to update already stored documents.

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-664:
---

Fix Version/s: 2.2

 Possibility to update already stored documents.
 ---

 Key: NUTCH-664
 URL: https://issues.apache.org/jira/browse/NUTCH-664
 Project: Nutch
  Issue Type: Wish
Reporter: Sergey Khilkov
Priority: Minor
 Fix For: 2.2


 We have a huge index of stored documents. Fetching a page and merging 
 indexes every time we update some information about the page is a costly 
 procedure. The information can change 1-3 times per day. At the moment we 
 have to store the changed info in a database, but in this case we have lots 
 of problems with sorting, search restrictions and so on. Lucene itself allows 
 deleting a single document and adding a new one to an existing index. But 
 there is a problem with Hadoop... As I understand it, the Hadoop filesystem 
 cannot write at random positions. Still, it would be a great feature if Nutch 
 were able to update an already created index.



[jira] [Updated] (NUTCH-718) urlfilter-subnets plugin

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-718:
---

Fix Version/s: 2.2
   1.7

 urlfilter-subnets plugin
 

 Key: NUTCH-718
 URL: https://issues.apache.org/jira/browse/NUTCH-718
 Project: Nutch
  Issue Type: New Feature
Reporter: Dmitry Lihachev
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: NUTCH-718-nutchbase.patch, 
 NUTCH-718_urlfilter_subnets.patch, NUTCH-718_urlfilter_subnets_v2.patch


 This plugin filters URLs by netmasks in CIDR notation.
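For readers unfamiliar with CIDR matching, here is a minimal Java sketch of the address comparison such a filter needs. It is not taken from the attached patches: the class CidrMatchSketch is hypothetical, handles IPv4 only, and assumes the URL's host has already been resolved to a dotted-quad address.

```java
// Hypothetical IPv4-only helper; not the code from the attached patches.
final class CidrMatchSketch {
    // Returns true if the IPv4 address lies inside the CIDR block, e.g. "10.0.0.0/8".
    static boolean matches(String cidr, String ipv4) {
        String[] parts = cidr.split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected address/prefix: " + cidr);
        }
        int prefixLen = Integer.parseInt(parts[1]);
        if (prefixLen < 0 || prefixLen > 32) {
            throw new IllegalArgumentException("Bad prefix length: " + prefixLen);
        }
        // A shift by 32 is a no-op for int in Java, so /0 is handled explicitly.
        int mask = prefixLen == 0 ? 0 : -1 << (32 - prefixLen);
        return (toInt(parts[0]) & mask) == (toInt(ipv4) & mask);
    }

    // Packs a dotted-quad string such as "192.168.0.1" into a 32-bit integer.
    private static int toInt(String dotted) {
        String[] octets = dotted.split("\\.");
        if (octets.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + dotted);
        }
        int value = 0;
        for (String octet : octets) {
            int n = Integer.parseInt(octet);
            if (n < 0 || n > 255) {
                throw new IllegalArgumentException("Bad octet in: " + dotted);
            }
            value = (value << 8) | n;
        }
        return value;
    }
}
```

A URL filter would typically reject (or accept) a URL when its resolved host address matches any configured block.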



[jira] [Updated] (NUTCH-750) HtmlParser plugin - page title extraction

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-750:
---

Fix Version/s: 1.7

 HtmlParser plugin - page title extraction
 -

 Key: NUTCH-750
 URL: https://issues.apache.org/jira/browse/NUTCH-750
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.0.0
Reporter: Alexey Torochkov
Priority: Minor
 Fix For: 1.7

 Attachments: SkipBody.patch


 A small improvement: try to extract the title tag from the body if it doesn't 
 exist in the head.
 In the current version, DOMContentUtils just skips everything after body in 
 the getTitle() method.
 The attached patch allows changing this behavior (by default it doesn't 
 change anything) and can cope with webmasters' mistakes.



[jira] [Updated] (NUTCH-737) urlnormalizer-unalias plugin

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-737:
---

Fix Version/s: 1.7

 urlnormalizer-unalias plugin
 

 Key: NUTCH-737
 URL: https://issues.apache.org/jira/browse/NUTCH-737
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
Priority: Minor
 Fix For: 1.7

 Attachments: NUTCH-737_urlfilter_unalias.patch


 I searched for whole-site duplication detection tools without success. 
 This plugin performs domain name transformations (for example 
 www.google.com -> google.com). It is very simple, but can be useful when 
 fighting site aliases. To detect site aliases I use my own ugly class 
 (based on SolrDeleteDuplicates).
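The kind of transformation described can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class UnaliasSketch is hypothetical, it handles just the "www." prefix rather than a configurable alias table, and it ignores URLs that carry user-info before the host. The attached patch may be implemented quite differently.

```java
// Hypothetical normalizer sketch; not the code from the attached patch.
final class UnaliasSketch {
    // Drops a leading "www." from the host part of an absolute URL.
    static String stripWww(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // not an absolute URL, leave it alone
        }
        int hostStart = schemeEnd + 3;
        // Case-insensitive comparison, since host names are case-insensitive.
        if (url.regionMatches(true, hostStart, "www.", 0, 4)) {
            return url.substring(0, hostStart) + url.substring(hostStart + 4);
        }
        return url;
    }
}
```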



[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required

2013-01-12 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552041#comment-13552041
 ] 

Ben McCann commented on NUTCH-1345:
---

You've probably set $NUTCH_JAVA_HOME then. I don't see why that should be 
required, however, if Java is on your path. It's fine to allow for an override, 
but it's just one extra thing to do to get set up for most people.

 JAVA_HOME should not be required
 

 Key: NUTCH-1345
 URL: https://issues.apache.org/jira/browse/NUTCH-1345
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Ben McCann
Priority: Minor
 Attachments: nutch, nutch.patch


 Trying to run Nutch spits out the message "Error: JAVA_HOME is not set."  I 
 already have java on my path, so I really wish I didn't need to set 
 JAVA_HOME.  It's an extra step to get up and running and is not updated by 
 Ubuntu's update-alternatives, so it makes it a lot harder to switch between 
 versions of Java.



[jira] [Updated] (NUTCH-690) bug in DomContentUtils.shouldThrowAwayLink?

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-690:
---

Fix Version/s: 1.7

 bug in DomContentUtils.shouldThrowAwayLink?
 ---

 Key: NUTCH-690
 URL: https://issues.apache.org/jira/browse/NUTCH-690
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Peter Sparks
Priority: Minor
 Fix For: 1.7


 I found a potential bug in DomContentUtils.shouldThrowAwayLink. It returns 
 true for the 5 links at the top of the home page for www.aksteel.com. Here 
 are the links in the source:
 <a id="Search" href="/search/default.aspx"></a>
 <a id="Investor" style="height: 15px;" href="/investor_information/"></a>
 <a id="Markets" href="/markets_products/"></a>
 <a id="Production" href="/production_facilities/"></a>
 <a id="News" href="/news/"></a>
 Perhaps I am just ignorant of what this function is supposed to do, but 
 returning true for these 5 links makes that site impossible to crawl.



[jira] [Updated] (NUTCH-589) Hierarchical Classloaders

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-589:
---

Fix Version/s: 1.7

 Hierarchical Classloaders
 -

 Key: NUTCH-589
 URL: https://issues.apache.org/jira/browse/NUTCH-589
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Ryan Levering
Priority: Minor
 Fix For: 1.7


 Currently the Nutch plugin classloader flattens all the jars from a plugin's 
 dependencies and instantiates a new classloader for each plugin.  I think it 
 would be better to create a hierarchical classloader chain.  Currently 
 plugins can't pass objects from a common plugin to one another because the 
 objects are created using different classloaders.  Nutch currently avoids 
 this by only using interfaces from a common classloader to pass objects 
 between plugins, but I can't see the harm in improving the plugin 
 classloader.  It would require a change to PluginDescription and 
 PluginClassLoader in order to override ClassLoader to maintain the export 
 filter functionality that currently exists.



[jira] [Updated] (NUTCH-569) Protocol plugins should report progress to the fetcher

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-569:
---

Fix Version/s: 1.7

 Protocol plugins should report progress to the fetcher
 --

 Key: NUTCH-569
 URL: https://issues.apache.org/jira/browse/NUTCH-569
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.7


 When downloading very large files over slow connections, protocol plugins 
 spend a long time in Protocol.getProtocolOutput(...). This sometimes leads to 
 a timeout in Fetcher / Fetcher2, with the message "aborting with hung 
 threads". Protocol plugins should periodically notify their caller about 
 progress. When the call to getProtocolOutput takes a very long time to 
 return, this will help the caller determine whether the wait is justified.
 Preferably, the callback interface should allow monitoring not only binary 
 progress / no-progress, but also the download speed, so that the caller 
 could terminate slow connections. E.g.
 {noformat}
 interface ProtocolReporter {
   void progress(long bytesDownloaded);
 }
 {noformat}
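One possible way a caller could use such a callback to judge download speed is sketched below. Only the ProtocolReporter interface comes from the issue; SpeedTrackingReporter, its injected clock, and the assumption that bytesDownloaded is a cumulative total are illustrative choices, not an existing Nutch API.

```java
import java.util.function.LongSupplier;

// Mirrors the callback interface proposed in the issue.
interface ProtocolReporter {
    void progress(long bytesDownloaded);
}

// Hypothetical caller-side reporter that derives an average download speed.
// Assumes bytesDownloaded is the cumulative total for the current fetch.
final class SpeedTrackingReporter implements ProtocolReporter {
    private final LongSupplier nanoClock; // injected so the class is easy to test
    private final long startNanos;
    private long bytesSoFar;
    private long lastReportNanos;

    SpeedTrackingReporter(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.startNanos = nanoClock.getAsLong();
        this.lastReportNanos = startNanos;
    }

    @Override
    public void progress(long bytesDownloaded) {
        bytesSoFar = bytesDownloaded;
        lastReportNanos = nanoClock.getAsLong();
    }

    // Average speed between construction and the most recent progress report.
    double bytesPerSecond() {
        long elapsed = lastReportNanos - startNanos;
        return elapsed <= 0 ? 0.0 : bytesSoFar * 1e9 / elapsed;
    }

    // Lets the caller decide whether the connection is too slow to keep waiting.
    boolean isSlowerThan(double minBytesPerSecond) {
        return bytesPerSecond() < minBytesPerSecond;
    }
}
```

In production the clock would simply be System::nanoTime; a fetcher thread could poll isSlowerThan periodically and abort the request when it returns true.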



[jira] [Updated] (NUTCH-566) Sun's URL class has bug in creation of relative query URLs

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-566:
---

Fix Version/s: 2.2
   1.7

 Sun's URL class has bug in creation of relative query URLs
 --

 Key: NUTCH-566
 URL: https://issues.apache.org/jira/browse/NUTCH-566
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: MacOS X and Linux (CentOS 4.5) both
Reporter: Doug Cook
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: RelativeURL.java


 I'm using 0.8.1, but this will affect all other versions as well.
 Relative links of the form "?blah" are resolved incorrectly. For example, 
 with a base URL of "http://www.fleurie.org/entreprise.asp" and a relative 
 link of "?id_entrep=111", Nutch will resolve this pair to the link 
 "http://www.fleurie.org/?id_entrep=111". No such URL exists, and all browsers 
 I tried will resolve the pair to 
 "http://www.fleurie.org/entreprise.asp?id_entrep=111".
 I tracked this down to what could be called a bug in Sun's URL class. 
 According to Sun's spec, they parse the relative URL according to RFC 2396. 
 But the original RFC for relative links was RFC 1808, and the two RFCs differ 
 in how they handle relative links beginning with "?". Most browsers 
 (Netscape/Mozilla, IE, Safari) implemented RFC 1808, and stuck with it (for 
 compatibility and also because the behavior makes more sense). Apparently 
 even the people that wrote RFC 2396 recognized that this was a mistake, and 
 the specified behavior was changed in RFC 3986 to match what browsers do. 
 For a discussion of this, see  
 http://gbiv.com/protocols/uri/rev-2002/issues.html#003-relative-query
 Sun's URL implementation, however, still implements RFC2396, as far as I can 
 tell, and is out of step with the rest of the world.
 This breaks link extraction on a number of sites.
 I implemented a simple workaround, which I'm attaching. It is a static method 
 to create URLs which behaves exactly as new URL(URL base, String 
 relativePath), and I use it as a drop-in replacement for that in 
 DOMContentUtils, Javascript link extraction, etc. Obviously, it really only 
 matters wherever links are extracted. I haven't included the calling code 
 from DOMContentUtils, etc. because my local versions are largely rewritten, 
 but it should be pretty obvious.
 I put it in the org.apache.nutch.net directory, but obviously feel free to 
 move it to another place if you feel it belongs there!
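The workaround idea can be illustrated with a short Java sketch. This is not the attached RelativeURL.java: the class RelativeQuerySketch is hypothetical and uses java.net.URI rather than java.net.URL. It special-cases query-only references so that they keep the base path, as browsers and RFC 3986 do, and delegates everything else to the standard resolver.

```java
import java.net.URI;

// Hypothetical sketch of the workaround; not the attached RelativeURL.java.
final class RelativeQuerySketch {
    // Resolves target against base. A query-only reference ("?x=1") keeps the
    // base path and replaces only the query (and any fragment) of the base.
    static URI resolve(URI base, String target) {
        if (target.startsWith("?")) {
            String baseStr = base.toString();
            int cut = baseStr.length();
            int query = baseStr.indexOf('?');
            int fragment = baseStr.indexOf('#');
            if (query >= 0) cut = Math.min(cut, query);
            if (fragment >= 0) cut = Math.min(cut, fragment);
            return URI.create(baseStr.substring(0, cut) + target);
        }
        // All other references use the standard resolution.
        return base.resolve(target);
    }
}
```

Link extraction code would call this helper wherever it currently builds a URL from a base and a relative reference.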



[jira] [Updated] (NUTCH-431) Move plugin specific properties out of nutch-site.xml and into specific conf files for plugins

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-431:
---

Fix Version/s: 2.2
   1.7

 Move plugin specific properties out of nutch-site.xml and into specific conf 
 files for plugins
 --

 Key: NUTCH-431
 URL: https://issues.apache.org/jira/browse/NUTCH-431
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: MacBook Pro, Intel Core Duo 2.0 Ghz, 1.5 GB RAM, Mac OSX 
 10.4 although improvement is independent of environment
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.7, 2.2


 Currently, there are many plugin-specific properties that live in the global 
 nutch properties files, nutch-site.xml and nutch-default.xml. These would be 
 things like the protocol-ftp properties, even the protocol-http properties. 
 It would be nice to refactor these properties out, into plugin specific 
 configuration files, that ship with the plugins themselves. Thoughts? 
 Comments? Tomatoes? :-)



[jira] [Updated] (NUTCH-427) protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This protocol allows Nutch to crawl Microsoft Windows Shares remotely using the CIFS/SMB protocol implmen

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-427:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 protocol-smb: plugin protocol implementing the CIFS/SMB protocol. This 
 protocol allows Nutch to crawl Microsoft Windows Shares remotely using the 
 CIFS/SMB protocol implmentation.
 --

 Key: NUTCH-427
 URL: https://issues.apache.org/jira/browse/NUTCH-427
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 1.0.0
 Environment: JAVA - OS independent
Reporter: Armel Nene
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: protocol-smb-diff.txt, protocol-smb-dist.zip, 
 protocol-smb.zip, protocol-smb.zip


 Title:    protocol-smb - Nutch protocol plugin for crawling Microsoft Windows 
           shares
 Author:   Armel T. Nene
 Update:   Vadim Bauer
 Email:    armel.nene NOSPAM-AT-NOSPAM idna-solutions.com, V a d i m B a u e r 
           AT g m x . d e
 A.  Introduction
 The protocol-smb plugin allows you to crawl Microsoft Windows shares. It 
 implements the CIFS/SMB protocol, which is commonly used on Microsoft 
 operating systems. The plugin replicates the behaviour of protocol-file over 
 the CIFS/SMB protocol. It uses the JCifs library and also supports all the 
 properties from the JCifs library.
 You can find more information on the following site: 
 http://jcifs.samba.org/
 The smb protocol syntax for crawling is as follows: smb://x (i.e. 
 smb://server/share).
 
 B.  Installation
 1) Binaries only: The protocol-smb files can be found in the ../plugins 
    directory.
      Copy the protocol-smb folder to the NUTCHHOME/build/plugins directory.
      Put the smb.properties file in the NUTCHHOME/conf directory.
      Configure the properties in the smb.properties file.
      Enable the plugin by updating the nutch-site.xml file found in the 
      NUTCHHOME/conf directory, e.g.
        <property>
          <name>plugin.includes</name>
          <value>protocol-smb| other plugins...</value>
          <description>
          </description>
        </property>
 2) Source code: The protocol-smb sources can be found in the ../src 
    directory.
      Always refer to the Nutch wiki for detailed instructions on building 
      Nutch.  In short:
      Copy the 'protocol-smb' folder to NUTCHHOME/src/plugin.
      Update the build.xml in NUTCHHOME/src/plugin to include the plugin.
      Update the NUTCHHOME/default.properties file to include the plugin.
      Run ant to build.
      Copy the 'smb.properties' file to NUTCHHOME/conf, and configure the 
      properties.
      Enable the plugin by updating the nutch-site.xml file.
 C.  Known Issues
 1) MalformedURLException: unknown protocol: smb
    The SMB URL protocol handler is not being successfully installed. 
    In short, the jCIFS jar must be loaded by the System class loader.
    Workaround:
    a) A short-term solution is to install the jCIFS jar library found in 
       the protocol-smb folder in JDKHOME/jre/lib/ext and (or) 
       JREHOME/lib/ext.
    b) If the exception is still thrown after completing step a), set the 
       system property by passing the following argument to the JVM:
       -Djava.protocol.handler.pkgs=jcifs
    c) You can also set the property in your code. For example, if you start 
       crawling with org.apache.nutch.crawl.Crawl, add the following two 
       lines. This has the same effect as b).
       public static void main(String args[]) throws Exception {
         System.setProperty("java.protocol.handler.pkgs", "jcifs");
         new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write");
         //and so on
    You can also visit the FAQ page: 
    http://jcifs.samba.org/src/docs/faq.html
 2) FATAL smb.SMB - Could not read content of protocol: smb://xx
    This problem usually occurs if the following properties are not set 
    correctly in the smb.properties 

[jira] [Updated] (NUTCH-410) Faster RegexNormalize with more features

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-410:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 Faster RegexNormalize with more features
 

 Key: NUTCH-410
 URL: https://issues.apache.org/jira/browse/NUTCH-410
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
 Environment: Tested on MacOS X 10.4.7/10.4.8
Reporter: Doug Cook
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: betterRegexNorm.patch


 The patch associated with this is backwards-compatible and has several 
 improvements over the stock 0.8 RegexURLNormalizer:
 1) About a 34% performance improvement, from only executing the superclass 
 (BasicURLNormalizer) once in most cases, instead of twice as the stock 
 version did. 
 2) Support for expensive host-specific normalizations with good performance. 
 Each regex block optionally takes a list of hosts to which to apply the 
 associated regex. If supplied, the regex will only be applied to these hosts. 
 This should have scalable performance; the comparison is O(1) regardless of 
 the number of hosts. The format is:
 <regex>
   <host>www.host1.com</host>
   <host>host2.site2.com</host>
   <pattern> my pattern here </pattern>
   <substitution> my substitution here </substitution>
 </regex>
 3)  Support for decoding URLs with escaped character encodings (e.g. %20, 
 etc.). This is useful, for example, to decode jump redirects which have the 
 target URL encoded within the source, as on Yahoo. I tried to create an 
 extensible notion of options, the first of which is unescape. The 
 unescape function is applied *after* the substitution and *only* if the 
 substitution pattern matches. A simple pattern to unescape Yahoo directory 
 redirects would be something like:
 <regex>
   <pattern>^http://[a-z\.]*\.yahoo\.com/.*/\*+(http[^&amp;]+)</pattern>
   <substitution>$1</substitution>
   <options>unescape</options>
 </regex>
 4) Added the notion of iterating the pattern chain. This is useful when the 
 result of a normalization can itself be normalized. While some of this can be 
 handled in the stock version by repeating patterns, or by careful ordering of 
 patterns, the notion of iterating is cleaner and more powerful. The chain is 
 defined to iterate only when the previous iteration changes the input, up to 
 a configurable maximum number of iterations. The config parameter to change 
 is: urlnormalizer.regex.maxiterations, which defaults to 1 (previous 
 behavior). The change is performance-neutral when disabled, and has a 
 relatively small performance cost when enabled.
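
 The behaviour described in points 2) to 4) can be sketched roughly as follows. 
 This is not the code from betterRegexNorm.patch; the class, field and method 
 names are hypothetical, and the host check uses a plain hash set as one way to 
 get a lookup cost that does not depend on the number of hosts.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: all names here are hypothetical, not Nutch's.
final class RegexNormalizeSketch {

  /** One regex block: pattern, substitution, optional host list, unescape option. */
  static final class Rule {
    final Pattern pattern;
    final String substitution;
    final Set<String> hosts;   // null means "apply to every host"
    final boolean unescape;

    Rule(String regex, String substitution, Set<String> hosts, boolean unescape) {
      this.pattern = Pattern.compile(regex);
      this.substitution = substitution;
      this.hosts = hosts;
      this.unescape = unescape;
    }
  }

  /** Returns the host part of the URL, or null if it cannot be parsed. */
  static String hostOf(String url) {
    try {
      return URI.create(url).getHost();
    } catch (IllegalArgumentException e) {
      return null;
    }
  }

  /** Decodes %xx escapes; note that URLDecoder also turns '+' into a space. */
  static String decode(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (UnsupportedEncodingException | IllegalArgumentException e) {
      return s; // leave malformed escapes untouched
    }
  }

  /** Runs every rule once, in order. */
  static String applyOnce(String url, List<Rule> rules) {
    String current = url;
    for (Rule rule : rules) {
      if (rule.hosts != null) {
        String host = hostOf(current);
        if (host == null || !rule.hosts.contains(host)) {
          continue; // host-specific rule that does not apply to this URL
        }
      }
      Matcher m = rule.pattern.matcher(current);
      if (m.find()) {
        current = m.replaceAll(rule.substitution);
        if (rule.unescape) {
          current = decode(current); // only after a successful substitution
        }
      }
    }
    return current;
  }

  /** Repeats the chain while it keeps changing the URL, up to maxIterations. */
  static String normalize(String url, List<Rule> rules, int maxIterations) {
    String current = url;
    for (int i = 0; i < maxIterations; i++) {
      String next = applyOnce(current, rules);
      if (next.equals(current)) {
        break; // fixed point reached
      }
      current = next;
    }
    return current;
  }
}
```

 With maxIterations set to 1 the chain runs exactly once, which corresponds to 
 the previous behaviour mentioned above.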
 Pardon any potentially unconventional Java on my part. I've got lots of C/C++ 
 search engine experience, but Nutch is my first large Java app. I welcome any 
 feedback, and hope this is useful.
 Doug

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-409) Add short circuit notion to filters to speedup mixed site/subsite crawling

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-409:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 Add short circuit notion to filters to speedup mixed site/subsite crawling
 

 Key: NUTCH-409
 URL: https://issues.apache.org/jira/browse/NUTCH-409
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
Reporter: Doug Cook
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: shortcircuit.patch


 In the case where one is crawling a mixture of sites and sub-sites, the 
 prefix matcher can match the sites quite quickly, but either the regex or 
 automaton filters are considerably slower matching the sub-sites. In the 
 current model of AND-ing all the filters together, the pattern-matching 
 filter will be run on every site that matches the prefix matcher -- even if 
 that entire site is to be crawled and there are no sub-site patterns. If only 
 a small portion of the sites actually need sub-site pattern matching, this is 
 much slower than it should be.
 I propose (and attach) a simple modification allowing considerable speedup 
 for this usage pattern. I define the notion of a "short circuit" match that 
 means "accept this URL and don't run any of the remaining filters in the 
 filter chain." 
 Though with this change, any filter plugin can in theory return a 
 short-circuit match, I have only implemented the short-circuit match for the 
 PrefixURLFilter. The configuration file format is backwards-compatible; 
 shortcircuit matches just have SHORTCIRCUIT: in front of them.
 One minor gotcha:
 * Because the shortcircuit matches will avoid running any later filters, all 
 of the site-independent filters need to be BEFORE the PrefixURLFilter in the 
 chain.
 I get my best performance using the following filter chain:
 1) The SuffixURLFilter  to throw away anything with unwanted extensions
 2) The RegexURLFilter to do site-independent cleanup (ad removal, skipping 
 mailto:, bulletin-board pages, etc.)
 3) The PrefixURLFilter, with SHORTCIRCUIT: in front of every site name EXCEPT 
 the sites needing subsite matching
 4) The AutomatonURLFilter to match those sites needing subsite pattern 
 matching.
 I have tens of thousands of sites and an order of magnitude fewer subsites, 
 so skipping step #4 90% of the time speeds things up considerably (my reduce 
 time on a round of crawling is down from some 26 hours to less than 10).
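
 As a rough illustration of the chain semantics described above (not the code 
 in shortcircuit.patch), a minimal sketch might look like this. The interface 
 and class names are hypothetical; only the _PASS_ token and the idea of marking 
 prefixes as short-circuit come from this description. A real prefix filter 
 would use a faster lookup than the linear scan shown here.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: these names are hypothetical, not Nutch's classes.
final class ShortCircuitSketch {

  /** Token a filter returns in place of the URL to short-circuit the chain. */
  static final String PASS = "_PASS_";

  /** A filter returns the URL to accept it, null to reject it, or PASS. */
  interface UrlFilter {
    String filter(String url);
  }

  /** Runs the chain; stops early on a rejection or on a short-circuit match. */
  static String runChain(String url, List<UrlFilter> chain) {
    String current = url;
    for (UrlFilter f : chain) {
      String result = f.filter(current);
      if (result == null) {
        return null;      // rejected by this filter
      }
      if (PASS.equals(result)) {
        return current;   // accepted: none of the remaining filters run
      }
      current = result;
    }
    return current;
  }

  /** Prefix filter; a map value of true marks a SHORTCIRCUIT prefix. */
  static UrlFilter prefixFilter(final Map<String, Boolean> prefixes) {
    return new UrlFilter() {
      @Override
      public String filter(String url) {
        for (Map.Entry<String, Boolean> e : prefixes.entrySet()) {
          if (url.startsWith(e.getKey())) {
            return e.getValue() ? PASS : url;
          }
        }
        return null;      // no known prefix matched
      }
    };
  }
}
```

 In this sketch any filter placed after the prefix filter is never consulted 
 for short-circuited sites, which is why the site-independent filters have to 
 come first in the chain.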
 There are only two drawbacks to the implementation, and I think they're 
 pretty minor:
 1) Because I pass a special token (_PASS_) in the place of the URL to 
 implement the short circuit, if for some reason someone wanted to crawl a URL 
 named _PASS_, there would be problems. I find this highly unlikely, since 
 that's an invalid URL.
 2) The correct behavior of steps #3 and #4 above depends upon coordination of 
 the config files between the prefix and automaton filters, making an 
 opportunity for user screwup. I thought about creating a new kind of filter 
 which essentially combined the prefix and automaton filters' behaviors, took one config 
 file, and internally handled the short-circuiting. But I think the approach I 
 took is simpler, cleaner, more flexible, and avoids creating yet another kind 
 of filter. Coordinating the config files is pretty easy (I generate them 
 programmatically).
 As this is my first contribution to Nutch I'm sure that there are things I've 
 missed, whether in coding style or desired patch format. I welcome any 
 feedback, suggestions, etc.
 Doug



[jira] [Updated] (NUTCH-449) Format of junit output should be configurable

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-449:
---

Fix Version/s: 2.2
   1.7

 Format of junit output should be configurable
 -

 Key: NUTCH-449
 URL: https://issues.apache.org/jira/browse/NUTCH-449
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1
Reporter: Nigel Daley
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: hudson.patch


 Allow the junit output format to be set by a system property.



[jira] [Updated] (NUTCH-449) Format of junit output should be configurable

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-449:
---

Patch Info: Patch Available

 Format of junit output should be configurable
 -

 Key: NUTCH-449
 URL: https://issues.apache.org/jira/browse/NUTCH-449
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1
Reporter: Nigel Daley
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: hudson.patch


 Allow the junit output format to be set by a system property.



[jira] [Updated] (NUTCH-386) Plugin to index categories by url rules

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-386:
---

   Patch Info: Patch Available
Fix Version/s: 1.7

 Plugin to index categories by url rules
 ---

 Key: NUTCH-386
 URL: https://issues.apache.org/jira/browse/NUTCH-386
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Ernesto De Santis
Priority: Minor
 Fix For: 1.7

 Attachments: index-url-category-0.1.zip, index-url-category.jar


 The compressed zip has a install_notes.txt file with instructions.



[jira] [Updated] (NUTCH-351) Protocol forward proxy

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-351:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 Protocol forward proxy
 --

 Key: NUTCH-351
 URL: https://issues.apache.org/jira/browse/NUTCH-351
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Reporter: Sami Siren
Assignee: Sami Siren
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: protocol-http-proxy-adapter.txt


 The protocol proxy adapter takes advantage of the protocols known to an HTTP 
 forward proxy. Usually there's at least http, https and ftp.
 You must configure Nutch to use this plugin and to use an HTTP proxy before use.



[jira] [Updated] (NUTCH-346) Improve readability of logs/hadoop.log

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-346:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 Improve readability of logs/hadoop.log
 --

 Key: NUTCH-346
 URL: https://issues.apache.org/jira/browse/NUTCH-346
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
 Environment: ubuntu dapper
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: log4j_plugins.diff


 adding
 log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
 to conf/log4j.properties
 dramatically improves the readability of the logs in logs/hadoop.log (removes 
 all INFO)



[jira] [Updated] (NUTCH-477) Extend URLFilters to support different filtering chains

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-477:
---

Fix Version/s: 1.7

 Extend URLFilters to support different filtering chains
 ---

 Key: NUTCH-477
 URL: https://issues.apache.org/jira/browse/NUTCH-477
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.7

 Attachments: urlfilters.patch


 I propose to make the following changes to URLFilters:
 * extend URLFilters so that they support different filtering rules depending 
 on the context where they are executed. This functionality mirrors the one 
 that URLNormalizers already support.
 * change their return value to an int code, in order to support early 
 termination of long filtering chains.



[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-490:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 Extension point with filters for Neko HTML parser (with patch)
 --

 Key: NUTCH-490
 URL: https://issues.apache.org/jira/browse/NUTCH-490
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Any
Reporter: Marcin Okraszewski
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, 
 nutch-extensionpoins_plugin.xml.diff


 In my project I need to set filters for the Neko HTML parser. So instead of 
 hard-coding them, I made an extension point to define filters for Neko. I 
 was following the code for HtmlParser filters. In fact, I think the method to 
 get filters could be generalized to handle both cases, but I didn't want 
 to make too big a mess.
 The attached patch is for Nutch 0.9. This part of the code hasn't changed in 
 trunk, so it should apply easily.
 BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an 
 extension point itself. Now there are options for Neko and TagSoup. But if 
 someone would like to use something else or set different settings for 
 the parser, they would need to modify the HtmlParser class instead of 
 replacing a plugin.



[jira] [Resolved] (NUTCH-248) add support for internationalized domain names

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-248.


Resolution: Won't Fix

this is legacy

 add support for internationalized domain names
 --

 Key: NUTCH-248
 URL: https://issues.apache.org/jira/browse/NUTCH-248
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Reporter: Sami Siren
Priority: Minor

 Internationalized domain names are gaining ground, so Nutch should give a 
 little more support to this feature. At the least, we need punycode 
 encoding/decoding functionality so we can display/enter internationalized 
 domain names in the UI.
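
 For reference, the punycode conversion asked for here is available in the JDK 
 itself (Java 6 and later) through java.net.IDN. A minimal sketch, with a 
 hypothetical wrapper class and method names, could look like this:

```java
import java.net.IDN;

// Illustrative sketch only: the wrapper class and method names are hypothetical.
final class IdnSketch {

  /** Converts a Unicode host name to its ASCII (punycode) form for fetching. */
  static String toAsciiHost(String unicodeHost) {
    return IDN.toASCII(unicodeHost);
  }

  /** Converts an ASCII (punycode) host name back to Unicode for display. */
  static String toDisplayHost(String asciiHost) {
    return IDN.toUnicode(asciiHost);
  }
}
```

 IDN.toASCII throws an IllegalArgumentException for input that cannot be 
 converted, so a UI would need to handle that case.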



[jira] [Updated] (NUTCH-213) checkstyle

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-213:
---

   Patch Info: Patch Available
Fix Version/s: 2.2
   1.7

 checkstyle
 --

 Key: NUTCH-213
 URL: https://issues.apache.org/jira/browse/NUTCH-213
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assignee: Dennis Kubes
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: checkstyle-all-4.1.jar, checkstyle.patch


 Add a checkstyle target to the ant build file to help contributors detect 
 whitespace problems.



[jira] [Resolved] (NUTCH-215) Plugin execution order

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-215.


Resolution: Won't Fix

we can now explicitly specify the order of indexing, parsing etc. plugins.
This can be closed as a legacy issue. 

 Plugin execution order
 --

 Key: NUTCH-215
 URL: https://issues.apache.org/jira/browse/NUTCH-215
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8
Reporter: Enrico Triolo
Priority: Minor
 Attachments: plugin_order.patch


 This patch allows nutch to automatically guess the correct order of execution 
 of plugins, depending on their dependencies.
 This means that, for example, if plugin A depends on plugin B (as stated in 
 the plugins.xml file), then B will be executed before A.
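
 Ordering plugins by their dependencies amounts to a topological sort. The 
 following is only a sketch of that idea, not the code in plugin_order.patch; 
 the class name and the map-based input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: names and input format are hypothetical.
final class PluginOrderSketch {

  /**
   * deps maps each plugin id to the ids it depends on.
   * Returns an order in which every plugin comes after its dependencies.
   */
  static List<String> executionOrder(Map<String, List<String>> deps) {
    List<String> order = new ArrayList<String>();
    Set<String> done = new HashSet<String>();
    Set<String> inProgress = new HashSet<String>();
    for (String plugin : deps.keySet()) {
      visit(plugin, deps, done, inProgress, order);
    }
    return order;
  }

  // Depth-first visit: emit all dependencies before the plugin itself.
  private static void visit(String plugin, Map<String, List<String>> deps,
      Set<String> done, Set<String> inProgress, List<String> order) {
    if (done.contains(plugin)) {
      return;
    }
    if (!inProgress.add(plugin)) {
      throw new IllegalStateException("Dependency cycle involving " + plugin);
    }
    List<String> required = deps.get(plugin);
    if (required != null) {
      for (String dep : required) {
        visit(dep, deps, done, inProgress, order);
      }
    }
    inProgress.remove(plugin);
    done.add(plugin);
    order.add(plugin);
  }
}
```

 A cycle in the declared dependencies has no valid order, so the sketch reports 
 it with an exception rather than guessing.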



[jira] [Resolved] (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-49?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-49.
---

Resolution: Won't Fix

This is well and truly a legacy issue. The FetchListTool no longer even exists.

 Flag for generate to fetch only new pages to complement the -refetchonly flag
 -

 Key: NUTCH-49
 URL: https://issues.apache.org/jira/browse/NUTCH-49
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Luke Baker
Priority: Minor
 Attachments: fetchnewonly.patch


 It would be useful, especially for research/testing purposes, to have a flag 
 for the FetchListTool that makes sure to only include URLs in the fetchlist 
 that have not already been fetched (according to the information from the 
 webdb that you're generating the fetchlist from).



[jira] [Resolved] (NUTCH-737) urlnormalizer-unalias plugin

2013-01-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-737.
-

Resolution: Duplicate

 urlnormalizer-unalias plugin
 

 Key: NUTCH-737
 URL: https://issues.apache.org/jira/browse/NUTCH-737
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 1.0.0
Reporter: Dmitry Lihachev
Priority: Minor
 Fix For: 1.7

 Attachments: NUTCH-737_urlfilter_unalias.patch


 I searched for whole-site duplication detection tools without success. 
 This plugin performs domain name transformation (for example 
 www.google.com -> google.com). It is very stupid, but can be useful when 
 fighting site aliases. To detect site aliases I use my own ugly class 
 (based on SolrDeleteDuplicates).
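
 The kind of host rewriting described here can be sketched as below. This is 
 not the code from the attached patch; the class name and the alias map are 
 assumptions, and the map is expected to use lower-case alias host names as 
 keys.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: names and the alias map format are hypothetical.
final class UnaliasSketch {

  /** Replaces the host of the URL with its canonical name, if it is a known alias. */
  static String unalias(String url, Map<String, String> aliases) {
    URI uri;
    try {
      uri = URI.create(url);
    } catch (IllegalArgumentException e) {
      return url; // not parseable: leave it alone
    }
    String host = uri.getHost();
    String authority = uri.getRawAuthority();
    if (host == null || authority == null) {
      return url;
    }
    String canonical = aliases.get(host.toLowerCase(Locale.ROOT));
    if (canonical == null) {
      return url; // host is not a known alias
    }
    // The host follows the optional "userinfo@" part of the authority.
    int hostStart = authority.lastIndexOf('@') + 1;
    String newAuthority = authority.substring(0, hostStart) + canonical
        + authority.substring(hostStart + host.length());
    // The authority starts right after the first "//" in the URL.
    int authorityStart = url.indexOf("//") + 2;
    return url.substring(0, authorityStart) + newAuthority
        + url.substring(authorityStart + authority.length());
  }
}
```

 Only the host is touched; user info, port, path and query are copied through 
 unchanged.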



[jira] [Updated] (NUTCH-693) Add configurable option for treating nofollow behaviour.

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-693:
---

Fix Version/s: 2.2
   1.7

 Add configurable option for treating nofollow behaviour.
 

 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: nutch.nofollow.patch


 For my purposes I'd like to follow links even if they're marked nofollow. 
 Ideally I'd like to follow them, but not pass the link juice between them. 
 I've attached a patch that adds a configuration element 
 parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
 nofollow elements on a page. 



[jira] [Updated] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1513:


Fix Version/s: 2.2
   1.7

 Support Robots.txt for Ftp urls
 ---

 Key: NUTCH-1513
 URL: https://issues.apache.org/jira/browse/NUTCH-1513
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.7, 2.2
Reporter: Tejas Patil
Priority: Minor
  Labels: robots.txt
 Fix For: 1.7, 2.2


 As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, 
 the Ftp plugin does not parse the robots file and accepts all URLs.
 In _src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_
 {noformat}   public RobotRules getRobotRules(Text url, CrawlDatum datum) {
 return EmptyRobotRules.RULES;
   }{noformat} 
 It's not clear whether this was part of the design or whether it's a bug. 
 [0] : 
 https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
 [1] : ftp://example.com/robots.txt



[jira] [Updated] (NUTCH-1500) bin/crawl fails on step solrindex with wrong path to segment

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1500:


Fix Version/s: 1.7

 bin/crawl fails on step solrindex with wrong path to segment
 

 Key: NUTCH-1500
 URL: https://issues.apache.org/jira/browse/NUTCH-1500
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.6
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.7

 Attachments: NUTCH-1500.patch


 The bin/crawl script calls the command (bin/nutch) solrindex with the wrong 
 path to the segment which causes solrindex to fail.



[jira] [Closed] (NUTCH-1489) elasticindex should report the indexed documents like solrindex does

2013-01-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1489.
---

Resolution: Not A Problem

This functionality is addressed both when deployed in local mode and via the 
Hadoop Map output record counters.

 elasticindex should report the indexed documents like solrindex does
 

 Key: NUTCH-1489
 URL: https://issues.apache.org/jira/browse/NUTCH-1489
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.1
Reporter: Rogério Pereira Araújo
Priority: Trivial

 When I run:
 nutch elasticindex elasticsearch
 to index crawled documents in a standard elasticsearch setup, the process 
 takes some time and finishes, but doesn't report how many documents were 
 indexed. It would be nice to have the same feedback as solrindex.



[jira] [Commented] (NUTCH-49) Flag for generate to fetch only new pages to complement the -refetchonly flag

2013-01-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-49?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552049#comment-13552049
 ] 

Markus Jelsma commented on NUTCH-49:


This has been implemented in NUTCH-1248.

 Flag for generate to fetch only new pages to complement the -refetchonly flag
 -

 Key: NUTCH-49
 URL: https://issues.apache.org/jira/browse/NUTCH-49
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Reporter: Luke Baker
Priority: Minor
 Attachments: fetchnewonly.patch


 It would be useful, especially for research/testing purposes, to have a flag 
 for the FetchListTool that makes sure to only include URLs in the fetchlist 
 that have not already been fetched (according to the information from the 
 webdb that you're generating the fetchlist from).



[jira] [Commented] (NUTCH-693) Add configurable option for treating nofollow behaviour.

2013-01-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552050#comment-13552050
 ] 

Markus Jelsma commented on NUTCH-693:
-

Vote for `won't fix`. We also don't implement an ignore.robotstxt option for 
the above reasons.

 Add configurable option for treating nofollow behaviour.
 

 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: nutch.nofollow.patch


 For my purposes I'd like to follow links even if they're marked nofollow. 
 Ideally I'd like to follow them, but not pass the link juice between them. 
 I've attached a patch that adds a configuration element 
 parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
 nofollow elements on a page. 



[jira] [Commented] (NUTCH-693) Add configurable option for treating nofollow behaviour.

2013-01-12 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552053#comment-13552053
 ] 

Lewis John McGibbney commented on NUTCH-693:


+1 Markus. Please close off when you can.

 Add configurable option for treating nofollow behaviour.
 

 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Priority: Minor
 Fix For: 1.7, 2.2

 Attachments: nutch.nofollow.patch


 For my purposes I'd like to follow links even if they're marked nofollow. 
 Ideally I'd like to follow them, but not pass the link juice between them. 
 I've attached a patch that adds a configuration element 
 parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
 nofollow elements on a page. 



[jira] [Closed] (NUTCH-693) Add configurable option for treating nofollow behaviour.

2013-01-12 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma closed NUTCH-693.
---

   Resolution: Won't Fix
Fix Version/s: (was: 2.2)
   (was: 1.7)

 Add configurable option for treating nofollow behaviour.
 

 Key: NUTCH-693
 URL: https://issues.apache.org/jira/browse/NUTCH-693
 Project: Nutch
  Issue Type: New Feature
Reporter: Andrew McCall
Priority: Minor
 Attachments: nutch.nofollow.patch


 For my purposes I'd like to follow links even if they're marked nofollow. 
 Ideally I'd like to follow them, but not pass the link juice between them. 
 I've attached a patch that adds a configuration element 
 parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
 nofollow elements on a page. 



[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required

2013-01-12 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552082#comment-13552082
 ] 

Sebastian Nagel commented on NUTCH-1345:


JAVA_HOME (or NUTCH_JAVA_HOME) is currently used for two things:
# use $JAVA_HOME/bin/java as Java executable
# determining the location of lib/tools.jar which is part of JDK (not JRE). 
It's probably an unneeded artifact, cf. MAPREDUCE-3624 and HADOOP-7374.

If JAVA_HOME is not set, bin/nutch definitely refuses to work. I agree that 
setting an environment variable may be a small hurdle; however, there are 
arguments in favour of using JAVA_HOME:
- I had to install Nutch on many customers' machines where the default java 
executable on PATH was not the correct one (= 1.6): setting JAVA_HOME is more 
transparent than manipulating PATH. NUTCH_JAVA_HOME is even more explicit.
- backward compatibility: Nutch should be run by the same JVM as before, not 
accidentally by another one.
- staying parallel to Hadoop, which still uses JAVA_HOME

Btw., let JAVA_HOME point to /usr/lib/jvm/default-java for Ubuntu's 
update-alternatives.

 JAVA_HOME should not be required
 

 Key: NUTCH-1345
 URL: https://issues.apache.org/jira/browse/NUTCH-1345
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Ben McCann
Priority: Minor
 Attachments: nutch, nutch.patch


 Trying to run Nutch spits out the message "Error: JAVA_HOME is not set."  I 
 already have java on my path, so I really wish I didn't need to set 
 JAVA_HOME.  It's an extra step to get up and running and is not updated by 
 Ubuntu's update-alternatives, so it makes it a lot harder to switch between 
 versions of Java.



[jira] [Commented] (NUTCH-1345) JAVA_HOME should not be required

2013-01-12 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13552083#comment-13552083
 ] 

Ben McCann commented on NUTCH-1345:
---

I think it's fine to allow overriding the version of Java used with JAVA_HOME 
or NUTCH_JAVA_HOME, but it shouldn't be required. Convention over 
configuration. There's far too much configuration required for Nutch.
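
The lookup order discussed in this thread (an explicit NUTCH_JAVA_HOME first, 
then JAVA_HOME, then whatever java is on PATH) could be sketched roughly as 
below. This is only an illustration: the function name resolve_java and the 
error text are assumptions, not taken from bin/nutch or the attached patch, 
and the lib/tools.jar lookup mentioned above is left out.

```shell
#!/bin/sh
# Hypothetical helper; not the code from bin/nutch or nutch.patch.
# Prints the path of the java executable to use, or fails with status 1.
resolve_java() {
  # An explicit NUTCH_JAVA_HOME takes precedence over JAVA_HOME.
  _java_home="${NUTCH_JAVA_HOME:-${JAVA_HOME:-}}"
  if [ -n "$_java_home" ]; then
    echo "$_java_home/bin/java"
  elif command -v java >/dev/null 2>&1; then
    # Neither variable is set: fall back to the first java on PATH.
    command -v java
  else
    echo "Error: no java found; set JAVA_HOME or put java on PATH." >&2
    return 1
  fi
}
```

A caller would then run something like JAVA=$(resolve_java) || exit 1. 
Existing setups that export JAVA_HOME keep the same JVM as before, while a 
fresh install with java on PATH works without extra configuration.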




Build failed in Jenkins: Nutch-trunk #2082

2013-01-12 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2082/

--
[...truncated 3965 lines...]
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:89:
 warning: [deprecation] delete(Path) in FileSystem has been deprecated
[javac] fs.delete(testDir);
[javac]   ^
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:108:
 warning: [rawtypes] found raw type: Iterator
[javac] Iterator it = expected.keySet().iterator();
[javac] ^
[javac]   missing type arguments for generic class Iterator<E>
[javac]   where E is a type-variable:
[javac] E extends Object declared in interface Iterator
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:123:
 warning: [deprecation] delete(Path) in FileSystem has been deprecated
[javac] fs.delete(testDir);
[javac]   ^
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:126:
 warning: [rawtypes] found raw type: TreeSet
[javac]   private void createCrawlDb(Configuration config, FileSystem fs, 
Path crawldb, TreeSet init, CrawlDatum cd) throws Exception {
[javac] 
^
[javac]   missing type arguments for generic class TreeSet<E>
[javac]   where E is a type-variable:
[javac] E extends Object declared in class TreeSet
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestCrawlDbMerger.java:130:
 warning: [rawtypes] found raw type: Iterator
[javac] Iterator it = init.iterator();
[javac] ^
[javac]   missing type arguments for generic class Iterator<E>
[javac]   where E is a type-variable:
[javac] E extends Object declared in interface Iterator
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:71:
 warning: [rawtypes] found raw type: TreeMap
[javac]   TreeMap init1 = new TreeMap();
[javac]   ^
[javac]   missing type arguments for generic class TreeMap<K,V>
[javac]   where K,V are type-variables:
[javac] K extends Object declared in class TreeMap
[javac] V extends Object declared in class TreeMap
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:71:
 warning: [rawtypes] found raw type: TreeMap
[javac]   TreeMap init1 = new TreeMap();
[javac]   ^
[javac]   missing type arguments for generic class TreeMap<K,V>
[javac]   where K,V are type-variables:
[javac] K extends Object declared in class TreeMap
[javac] V extends Object declared in class TreeMap
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:72:
 warning: [rawtypes] found raw type: TreeMap
[javac]   TreeMap init2 = new TreeMap();
[javac]   ^
[javac]   missing type arguments for generic class TreeMap<K,V>
[javac]   where K,V are type-variables:
[javac] K extends Object declared in class TreeMap
[javac] V extends Object declared in class TreeMap
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:72:
 warning: [rawtypes] found raw type: TreeMap
[javac]   TreeMap init2 = new TreeMap();
[javac]   ^
[javac]   missing type arguments for generic class TreeMap<K,V>
[javac]   where K,V are type-variables:
[javac] K extends Object declared in class TreeMap
[javac] V extends Object declared in class TreeMap
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:73:
 warning: [rawtypes] found raw type: HashMap
[javac]   HashMap expected = new HashMap();
[javac]   ^
[javac]   missing type arguments for generic class HashMap<K,V>
[javac]   where K,V are type-variables:
[javac] K extends Object declared in class HashMap
[javac] V extends Object declared in class HashMap
[javac] 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/test/org/apache/nutch/crawl/TestLinkDbMerger.java:73:
 warning: [rawtypes] found raw type: HashMap
[javac]   HashMap expected = new HashMap();
[javac]  ^
[javac]   missing type arguments for generic class 
