[jira] Commented: (NUTCH-826) Mailing list is broken.

2010-05-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870980#action_12870980
 ] 

Hudson commented on NUTCH-826:
--

Integrated in Nutch-trunk #1163 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1163/])
NUTCH-826 : update mailing list and version control pages on website after 
move to TLP


> Mailing list is broken.
> ---
>
> Key: NUTCH-826
> URL: https://issues.apache.org/jira/browse/NUTCH-826
> Project: Nutch
>  Issue Type: Bug
>Reporter: John Sherwood
>Assignee: Julien Nioche
>Priority: Blocker
> Fix For: 1.1
>
>
> All of the following addresses are failing:
> nutch-u...@nutch.apache.org
> nutch-user-subscr...@nutch.apache.org
> nutch-user-subscr...@lucene.apache.org
> For the last one, the mailer daemon said 
> "This mailing list has moved to user at nutch.apache.org."
> Below is the message I tried to send:
> Hi people,
> I've been banging my head against this problem for two days now.
> Simply, I want to add a field with the value of a given meta tag.
> I've been trying the parse-xml plugin, but it seems that it doesn't
> work with version 1.0.  I've tried the code at
> http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
> and it hasn't worked.  I don't even know why.  I don't even know if my
> plugin is being used... or even looked for!  Nutch seems to have an
> infuriating "Fail silently" policy for plugins.  I put a
> System.exit(1) in my filters just to see if my code is even being
> encountered.  It has not in spite of my config telling it to.
> Here's my config:
> nutch-site.xml
> ...
> <property>
>  <name>plugin.includes</name>
>  <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|metadata</value>
> </property>
> ...
> parse-plugins.xml
> ...
> 
>
>
> 
> 
>   
>   
> 
> 
>   
>   
> 
> 
>  
>  
> 
> 
> 
> ...
>  extension-id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter"
> />
> ...
> I've also copied the plugin.xml and jar from my build/metadata to the
> plugins root dir.
> Nonetheless, Nutch runs and puts data in solr for me.  Afaik, Nutch is
> completely unaware of my plugin despite my config options.  Is there
> some other place I need to tell Nutch to use my plugin?  Is there some
> other approach to do this without having to write a plugin?  This does
> seem like a lot of work to simply get a meta tag into a field.  Any
> help would be appreciated.
> Sincerely,
> John Sherwood

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-278) Fetcher-status might need clarification: kbit/s instead of kb/s shown

2010-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882835#action_12882835
 ] 

Hudson commented on NUTCH-278:
--

Integrated in Nutch-trunk #1189 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1189/])
- fix for NUTCH-278 Fetcher-status might need clarification: kbit/s instead 
of kb/s shown


> Fetcher-status might need clarification: kbit/s instead of kb/s shown
> -
>
> Key: NUTCH-278
> URL: https://issues.apache.org/jira/browse/NUTCH-278
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Stefan Neufeind
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.2
>
> Attachments: PATCH.NUTCH-278
>
>
> In Fetcher.java, method reportStatus() there is
> + Math.round(((((float)bytes)*8)/1024)/elapsed)+" kb/s, ";
> Isn't that a bit misleading, since a user reading the status might take it 
> to mean "kilobytes" (kb), whereas "kbit/s" would be clearer in this case?
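To make the unit question concrete, here is a minimal sketch of the arithmetic in that expression. The class and method names are hypothetical and not taken from Fetcher.java; the sketch only shows that multiplying a byte count by 8 and dividing by 1024 yields kilobits, so "kbit/s" is the label that matches the number.

```java
// Hypothetical helper, not part of Nutch: mirrors the arithmetic quoted above.
final class ThroughputLabel {
    // bytes * 8 gives bits, / 1024 gives kilobits, / elapsed gives kilobits per second.
    static int kbitPerSecond(long bytes, long elapsedSeconds) {
        return Math.round(((((float) bytes) * 8) / 1024) / elapsedSeconds);
    }

    // Labels the value with the unit that the arithmetic actually produces.
    static String format(long bytes, long elapsedSeconds) {
        return kbitPerSecond(bytes, elapsedSeconds) + " kbit/s";
    }

    public static void main(String[] args) {
        // 1,280,000 bytes transferred in 10 seconds.
        System.out.println(format(1_280_000L, 10L));
    }
}
```

The same transfer expressed in kilobytes per second would be eight times smaller, which is why the bare "kb/s" label can mislead.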




[jira] Commented: (NUTCH-832) Website menu has lots of broken links - in particular the API docs

2010-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882837#action_12882837
 ] 

Hudson commented on NUTCH-832:
--

Integrated in Nutch-trunk #1189 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1189/])
- fix for NUTCH-832 Website menu has lots of broken links - in particular 
the API docs


> Website menu has lots of broken links - in particular the API docs
> --
>
> Key: NUTCH-832
> URL: https://issues.apache.org/jira/browse/NUTCH-832
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.1
> Environment: Web
>Reporter: Alex McLintock
>Assignee: Chris A. Mattmann
> Fix For: 1.2
>
> Attachments: PATCH.NUTCH-832
>
>
> The website seems to have lots of broken links. eg the menu on the left 
> points to various URLs of the form 
> http://nutch.apache.org/apidocs-1.0/index.html
> but these don't seem to exist on the server. 
> Also 
> http://nutch.apache.org/release/




[jira] Commented: (NUTCH-833) Website is still Lucene branded

2010-06-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882836#action_12882836
 ] 

Hudson commented on NUTCH-833:
--

Integrated in Nutch-trunk #1189 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1189/])
- changelog for NUTCH-833 Website is still Lucene branded
- progress towards NUTCH-833 Website is still Lucene branded
- progress towards NUTCH-833 Website is still Lucene branded
- progress towards NUTCH-833 Website is still Lucene branded
- progress towards NUTCH-833 Website is still Lucene branded
- progress towards NUTCH-833 Website is still Lucene branded
fix for NUTCH-833 Website is still Lucene branded.
fix for NUTCH-833 Website is still Lucene branded.


> Website is still Lucene branded
> ---
>
> Key: NUTCH-833
> URL: https://issues.apache.org/jira/browse/NUTCH-833
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.1
> Environment: Web
>Reporter: Alex McLintock
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.2
>
>
> The Nutch website still has a lot of Lucene branding and links which are 
> confusing. eg the breadcrumbs
> Apache > Lucene > Nutch  > 
> appear at the top of most pages, along with the lucene logo and link to their 
> home page. 




[jira] Commented: (NUTCH-834) Separate the Nutch web site from trunk

2010-06-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884177#action_12884177
 ] 

Hudson commented on NUTCH-834:
--

Integrated in Nutch-trunk #1194 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1194/])
(NUTCH-834) Separate the Nutch web site from trunk


> Separate the Nutch web site from trunk
> --
>
> Key: NUTCH-834
> URL: https://issues.apache.org/jira/browse/NUTCH-834
> Project: Nutch
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
>
> As discussed on dev@, it would be useful to move the Nutch web site 
> sources from .../asf/nutch/trunk to .../asf/nutch/site and to use the 
> svnpubsub mechanism for instant deployment of site changes.
> The related issue for infra is 
> https://issues.apache.org/jira/browse/INFRA-2822
> See also https://issues.apache.org/jira/browse/PDFBOX-623




[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature

2010-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884540#action_12884540
 ] 

Hudson commented on NUTCH-835:
--

Integrated in Nutch-trunk #1195 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1195/])
NUTCH-835 Document deduplication failed using MD5Signature (Sebastian Nagel 
via ab)


> document deduplication (exact duplicates) failed using MD5Signature
> ---
>
> Key: NUTCH-835
> URL: https://issues.apache.org/jira/browse/NUTCH-835
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.0.0, 1.1
> Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
>Reporter: Sebastian Nagel
>Assignee: Andrzej Bialecki 
> Fix For: 1.2, 2.0
>
>
> The MD5Signature class calculates different signatures for identical 
> documents.
> The reason is that
>   byte[] data = content.getContent();
>   ... StringBuilder().append(data) ...
> uses java.lang.Object.toString() to get a string representation of the 
> (binary) content
> which results in unique hash codes (e.g., [B@30dc9065) even for two byte 
> arrays
> with identical content.
> A solution would be to take the MD5 sum of the binary content as first part 
> of the
> final signature calculation (the parsed content is the second part):
>   ... 
> .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
> Of course, there are many other solutions...
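The behaviour described above can be reproduced without any Nutch classes. The following is a minimal sketch that uses the JDK's MessageDigest in place of the Hadoop MD5Hash and Nutch StringUtil helpers named in the report; the class and method names are hypothetical. It shows that appending a byte[] to a StringBuilder yields an identity-based string, while a digest of the bytes depends only on the content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: JDK classes stand in for the Nutch/Hadoop helpers.
final class ByteArraySignatureDemo {
    // append(Object) is chosen for a byte[], so Object.toString() is used:
    // the result is a type tag plus an identity hash, not the content.
    static String identityBased(byte[] data) {
        return new StringBuilder().append(data).toString();
    }

    // Hashing the bytes themselves gives the same result for equal content.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] a = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same content".getBytes(StandardCharsets.UTF_8);
        System.out.println(identityBased(a) + " vs " + identityBased(b));
        System.out.println("digests equal: " + md5Hex(a).equals(md5Hex(b)));
    }
}
```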




[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies

2010-07-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884996#action_12884996
 ] 

Hudson commented on NUTCH-837:
--

Integrated in Nutch-trunk #1197 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])


> Remove search servers and Lucene dependencies 
> --
>
> Key: NUTCH-837
> URL: https://issues.apache.org/jira/browse/NUTCH-837
> Project: Nutch
>  Issue Type: Task
>  Components: searcher, web gui
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-837.patch
>
>
> One of the main aspects of 2.0 is the delegation of the indexing and search 
> to external resources like SOLR. We can simplify the code a lot by getting 
> rid of the : 
> * search servers
> * indexing and analysis with Lucene
> * search side functionalities : ontologies / clustering etc...
> In the short term only SOLR / SOLRCloud will be supported but the plan would 
> be to add other systems as well. 




[jira] Commented: (NUTCH-838) Add timing information to all Tool classes

2010-07-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884997#action_12884997
 ] 

Hudson commented on NUTCH-838:
--

Integrated in Nutch-trunk #1197 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])
- fix for NUTCH-838 Add timing information to all Tool classes


> Add timing information to all Tool classes
> --
>
> Key: NUTCH-838
> URL: https://issues.apache.org/jira/browse/NUTCH-838
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator, indexer, linkdb, parser
>Affects Versions: 1.1
> Environment: JDK 1.6, Linux & Windows
>Reporter: Jeroen van Vianen
>Assignee: Chris A. Mattmann
> Fix For: 1.2, 2.0
>
> Attachments: timings.patch
>
>
> Am happily trying to crawl a few hundred URLs incrementally. Performance is 
> degrading suddenly after the index reaches approximately 25000 URLs.
> At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks, 
> solrindex, solrdedup batch takes approximately half an hour with topN 500, 
> but elapsed times now increase to 00h45m,  01h15m, 01h30m with every batch. 
> As I'm uncertain which of the phases takes so much time I decided to add 
> start and finish times to all classes that implement Tool so I at least have a 
> feeling and can review them in a log file.
> Am using pretty old hardware, but I am planning to recrawl these URLs on a 
> regular basis and if every iteration is going to take more and more time, 
> index updates will be few and far between :-(
> I added timing information to *all* Tool classes for consistency whereas 
> there are only 10 or so Tools that are really interesting.
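As a rough illustration of the kind of timing output described above, here is a minimal sketch. It is not the attached timings.patch; the class name, the "ExampleTool" label and the helper method are hypothetical, and the message layout simply imitates the "finished at ..., elapsed: ..." log line that appears in a later message in this archive.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

// Hypothetical sketch of start/finish logging around a long-running step.
final class ToolTiming {
    // Formats a duration in milliseconds as HH:mm:ss.
    static String formatElapsed(long startMillis, long endMillis) {
        long seconds = (endMillis - startMillis) / 1000;
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    public static void main(String[] args) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println("ExampleTool: starting at " + sdf.format(new Date(start)));

        // ... the actual work of the tool would run here ...

        long end = System.currentTimeMillis();
        System.out.println("ExampleTool: finished at " + sdf.format(new Date(end))
                + ", elapsed: " + formatElapsed(start, end));
    }
}
```

Logging both timestamps and the elapsed time per phase makes it possible to see from the log file alone which step of a batch is slowing down.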




[jira] Commented: (NUTCH-836) Remove deprecated parse plugins

2010-07-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884995#action_12884995
 ] 

Hudson commented on NUTCH-836:
--

Integrated in Nutch-trunk #1197 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])


> Remove deprecated parse plugins
> ---
>
> Key: NUTCH-836
> URL: https://issues.apache.org/jira/browse/NUTCH-836
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-836-2.patch
>
>
> Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These 
> plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely 
> on parse-tika almost exclusively. Some existing plugins might be kept when 
> there is no equivalent in Tika (to be discussed). The following plugins are 
> removed : 
> * parse-html
> * parse-msexcel
> * parse-mspowerpoint
> * parse-msword
> * parse-pdf
> * parse-oo
> * parse-text
> * lib-jakarta-poi
> * lib-parsems
> The patch does not (yet) remove :
> * parse-ext
> * parse-js
> * parse-rss
> * parse-swf
> * parse-zip
> * feed
> Please review the patch and vote for its inclusion in the trunk.




[jira] [Commented] (NUTCH-994) Fine tune Solr schema

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056983#comment-13056983
 ] 

Hudson commented on NUTCH-994:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Fine tune Solr schema
> -
>
> Key: NUTCH-994
> URL: https://issues.apache.org/jira/browse/NUTCH-994
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-994-all.patch
>
>
> The supplied schema is old and doesn't use more advanced fieldTypes such as 
> Trie-based fields (available since Solr 1.4), and may be missing other 
> improvements. We need to fine-tune the schema.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056986#comment-13056986
 ] 

Hudson commented on NUTCH-1012:
---

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])
NUTCH-1012 Cannot handle illegal charset

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1140696
Files : 
* /nutch/trunk/src/java/org/apache/nutch/util/EncodingDetector.java
* /nutch/trunk/CHANGES.txt


> Cannot handle illegal charset $charset
> --
>
> Key: NUTCH-1012
> URL: https://issues.apache.org/jira/browse/NUTCH-1012
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1012-1.4.patch
>
>
> Pages returning:
> {code}
> Content-Type: text/html; charset=$charset
> {code}
> cause:
> {code}
> Error parsing: http://host/: failed(2,200): 
> java.nio.charset.IllegalCharsetNameException: $charset
> Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: 
> Followed by 3999
> ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
> {code}
> Stack trace:
> {code}
> 2011-06-24 01:14:23,442 WARN  parse.html - 
> java.nio.charset.IllegalCharsetNameException: $charset
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.checkName(Charset.java:284)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.lookup2(Charset.java:458)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.lookup(Charset.java:437)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> java.nio.charset.Charset.isSupported(Charset.java:479)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
> 2011-06-24 01:14:23,442 WARN  parse.html - at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 2011-06-24 01:14:23,443 WARN  parse.html - at 
> java.lang.Thread.run(Thread.java:662)
> 2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: 
> http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> {code}
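The stack trace shows Charset.isSupported() throwing rather than returning false when the name itself is illegal. A minimal sketch of a defensive lookup follows. It is not the committed change to EncodingDetector; the class and method names are hypothetical, and it only demonstrates catching the exception so that a bad header value is treated as "no usable charset".

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical guard around Charset.isSupported(); not the actual Nutch patch.
final class SafeCharsetLookup {
    // Returns true only for a syntactically legal charset name that the JVM supports.
    static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Names such as "$charset" contain characters that are not allowed.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 usable: " + isUsable("UTF-8"));
        System.out.println("$charset usable: " + isUsable("$charset"));
    }
}
```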





[jira] [Commented] (NUTCH-986) Dedup fails due to date format (long)

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056984#comment-13056984
 ] 

Hudson commented on NUTCH-986:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Dedup fails due to date format (long)
> -
>
> Key: NUTCH-986
> URL: https://issues.apache.org/jira/browse/NUTCH-986
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-986-1.3-1.patch, NUTCH-986-1.3-2.patch, 
> NUTCH-986-trunk-1.patch, NUTCH-986-trunk-2.patch
>
>
> As already mentioned on the list, dedup also fails because of invalid date 
> formats.
> Apr 19, 2011 10:34:50 AM 
> org.apache.solr.request.BinaryResponseWriter$Resolver 
> getDoc
> WARNING: Error reading a field from document : 
> SolrDocument[{digest=7ff92a31c58e43a34fd45bc6d87cda03}]
> java.lang.NumberFormatException: For input string: "2011-04-19T08:16:31.675Z"
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> at java.lang.Long.parseLong(Long.java:419)
> at java.lang.Long.valueOf(Long.java:525)
> at org.apache.solr.schema.LongField.toObject(LongField.java:82)
> 
> Strangely enough, Solr seems to allow updates of long fields with a formatted 
> date. In Nutch 1.2 the tstamp field is actually a long but in 1.3 the field 
> is 
> a valid Solr date format. This exception is only triggered using the javabin 
> response writer so there's something weird in Solr too.
> We need to either change the tstamp field back to a long or update the Solr 
> example schema and fix SolrDeleteDuplicates to use the formatted date instead 
> of the long.
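The mismatch can be shown in isolation: a formatted date cannot be parsed as a long, but it can be converted to epoch milliseconds. The sketch below is only an illustration of that conversion, not the change made to SolrDeleteDuplicates; the class and method names are hypothetical, and java.time is used for brevity even though it postdates the Nutch versions discussed here.

```java
import java.time.Instant;

// Hypothetical helper: accepts either a numeric timestamp or an ISO-8601 date.
final class TstampParsing {
    static long toEpochMillis(String value) {
        try {
            // Nutch 1.2 style: the field holds a plain long.
            return Long.parseLong(value);
        } catch (NumberFormatException e) {
            // Nutch 1.3 style: the field holds a formatted UTC date.
            return Instant.parse(value).toEpochMilli();
        }
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("2011-04-19T08:16:31.675Z"));
        System.out.println(toEpochMillis("1303200991675"));
    }
}
```

Both calls in main yield the same instant, which is the property a deduplication step needs when comparing timestamps from mixed index versions.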





[jira] [Commented] (NUTCH-989) index-basic plugin doesn't use Solr date fieldType

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056982#comment-13056982
 ] 

Hudson commented on NUTCH-989:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> index-basic plugin doesn't use Solr date fieldType
> --
>
> Key: NUTCH-989
> URL: https://issues.apache.org/jira/browse/NUTCH-989
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
>
> The index-basic plugin actually sends over a properly formatted date with 
> millis but the schema isn't configured to use the dateField.





[jira] [Commented] (NUTCH-995) Generate POM file using the Ivy makepom task

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056985#comment-13056985
 ] 

Hudson commented on NUTCH-995:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Generate POM file using the Ivy makepom task 
> -
>
> Key: NUTCH-995
> URL: https://issues.apache.org/jira/browse/NUTCH-995
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Julien Nioche
>Assignee: Chris A. Mattmann
> Fix For: 1.3
>
> Attachments: NUTCH-955-1.3.patch, NUTCH-997.branch-1.3.v2.patch, 
> mvn-template-build.patch
>
>
> We currently have a pom.xml file in the SVN repository and use it for 
> publishing our artefacts. The trouble with this is that we need to keep its 
> content in sync with our ivy file. Instead we could use the makepom task 
> (http://ant.apache.org/ivy/history/2.2.0/use/makepom.html) to generate the 
> pom.xml automatically.
> The existing pom.xml for 1.3 needs fixing anyway as it declares dependencies 
> on GORA and has the wrong versions for some dependencies.





[jira] [Commented] (NUTCH-1010) ContentLength not trimmed

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056987#comment-13056987
 ] 

Hudson commented on NUTCH-1010:
---

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> ContentLength not trimmed
> -
>
> Key: NUTCH-1010
> URL: https://issues.apache.org/jira/browse/NUTCH-1010
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3, 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1010-1.4.patch, NUTCH-1010-2.0.patch
>
>
> Somewhere in some component the ContentLength field is not trimmed. This 
> allows a seemingly numeric field to be treated as a string by the indexer in 
> cases where leading or trailing whitespace has been added. The result is a 
> hard to debug exception with no way to identify the bad document (amongst 
> thousands) or the bad field.
> {code}
> Jun 22, 2011 1:03:42 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.NumberFormatException: For input string: "32717 "
> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
> at java.lang.Long.parseLong(Long.java:419)
> at java.lang.Long.parseLong(Long.java:468)
> {code}
> This can be quickly fixed in the index-more plugin by simply calling trim() 
> when adding the field.
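The failure and the suggested remedy are easy to reproduce with the value from the log above. This is a minimal sketch with hypothetical names, not the code from the attached patches.

```java
// Hypothetical demo: Long.parseLong rejects surrounding whitespace, trim() removes it.
final class ContentLengthParsing {
    static long parseTrimmed(String raw) {
        return Long.parseLong(raw.trim());
    }

    public static void main(String[] args) {
        String raw = "32717 "; // trailing space, as in the logged exception
        try {
            Long.parseLong(raw);
        } catch (NumberFormatException e) {
            System.out.println("untrimmed value rejected: " + e.getMessage());
        }
        System.out.println("trimmed value: " + parseTrimmed(raw));
    }
}
```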





[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056992#comment-13056992
 ] 

Hudson commented on NUTCH-967:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Upgrade to Tika 0.9
> ---
>
> Key: NUTCH-967
> URL: https://issues.apache.org/jira/browse/NUTCH-967
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Julien Nioche
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-967-1.3-2.patch, NUTCH-967-1.3-3.patch, 
> NUTCH-967-1.3.patch
>
>






[jira] [Commented] (NUTCH-1006) meta equiv with single quotes not accepted

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056990#comment-13056990
 ] 

Hudson commented on NUTCH-1006:
---

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> meta equiv with single quotes not accepted
> --
>
> Key: NUTCH-1006
> URL: https://issues.apache.org/jira/browse/NUTCH-1006
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2, 1.3, 1.4, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1006-104.patch, NUTCH-1006-2.0.patch
>
>
> As posted by Alex F:
> the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not
> suitable for sites using single quotes for 
>   Example: 
>   We experienced a couple of pages with that kind of quotes and Nutch-1.2
> was not able to handle it.
> Is there any fallback or would it be good to use the following
> regex: "]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>" (single
> or regular quotes are accepted)?
> See this thread:
> http://lucene.472066.n3.nabble.com/Character-encoding-on-Html-Pages-td3034850.html
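A small sketch of the proposed idea, accepting either quote style around content-type, is given here. The quoted pattern in this archive appears to have lost its opening characters, so the leading `<meta\s+(` part below is an assumption, as are the class name and the sample HTML strings; only the `http-equiv=(\"|')?content-type(\"|')?` portion comes from the message.

```java
import java.util.regex.Pattern;

// Hypothetical demo of a meta http-equiv pattern that tolerates single quotes.
final class MetaHttpEquivDemo {
    // The prefix "<meta\\s+(" is assumed; the quote-tolerant middle is from the report.
    static final Pattern META = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
            Pattern.CASE_INSENSITIVE);

    static boolean matches(String html) {
        return META.matcher(html).find();
    }

    public static void main(String[] args) {
        String single = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>";
        String regular = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println("single quotes: " + matches(single));
        System.out.println("double quotes: " + matches(regular));
    }
}
```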





[jira] [Commented] (NUTCH-999) Normalise String representation for Dates in IndexingFilters

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056988#comment-13056988
 ] 

Hudson commented on NUTCH-999:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Normalise String representation for Dates in IndexingFilters
> 
>
> Key: NUTCH-999
> URL: https://issues.apache.org/jira/browse/NUTCH-999
> Project: Nutch
>  Issue Type: Task
>  Components: indexer
>Affects Versions: 2.0
>Reporter: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-999.patch
>
>
> NUTCH-997 has been applied to Nutch-1.3 so that various indexing filters 
> store Date objects as value for fields. However in trunk NutchDocuments can 
> have only String values which means that we will have to convert the Dates to 
> Strings in each indexing filter.  
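One way such a conversion could look is sketched below. This is not the attached NUTCH-999.patch; the class and method names are hypothetical, and the output format is assumed to be the UTC form with milliseconds that appears in the NUTCH-986 report in this archive.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical shared helper so every indexing filter emits the same date string.
final class DateStrings {
    static String toIsoUtc(Date date) {
        // A new formatter per call, because SimpleDateFormat is not thread-safe.
        SimpleDateFormat fmt =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(date);
    }

    public static void main(String[] args) {
        System.out.println(toIsoUtc(new Date(0L)));
        System.out.println(toIsoUtc(new Date()));
    }
}
```

Keeping the conversion in one place avoids each filter choosing its own pattern or time zone.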





[jira] [Commented] (NUTCH-983) Upgrade SolrJ

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056989#comment-13056989
 ] 

Hudson commented on NUTCH-983:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Upgrade SolrJ
> -
>
> Key: NUTCH-983
> URL: https://issues.apache.org/jira/browse/NUTCH-983
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.3, 2.0
>
>
> Solr 3.1 was released a while ago. The Javabin format changed between 1.4.1 
> and 3.1, so our SolrJ 1.4.1 cannot send documents to 3.1. Since Nutch 2.0 
> won't be released any time soon, I believe it would be a good idea to 
> upgrade our SolrJ to 3.1. New Solr users are encouraged to use Solr 3.1 or 
> upgrade, so I expect more users will want to use 3.1 as well. Any 
> thoughts?





[jira] [Commented] (NUTCH-991) SolrDedup must issue a commit

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056991#comment-13056991
 ] 

Hudson commented on NUTCH-991:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> SolrDedup must issue a commit
> -
>
> Key: NUTCH-991
> URL: https://issues.apache.org/jira/browse/NUTCH-991
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-991-1.3-1.patch, NUTCH-991-trunk-1.patch
>
>
> Title says it all. SolrDedup job doesn't commit but it should.





[jira] [Commented] (NUTCH-888) Remove parse-rss

2011-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056993#comment-13056993
 ] 

Hudson commented on NUTCH-888:
--

Integrated in Nutch-trunk #1530 (See 
[https://builds.apache.org/job/Nutch-trunk/1530/])


> Remove parse-rss
> 
>
> Key: NUTCH-888
> URL: https://issues.apache.org/jira/browse/NUTCH-888
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.3, 2.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.3, 2.0
>
>
> See https://issues.apache.org/jira/browse/NUTCH-887
> {quote}
> CM : I wrote parse-rss back in 2005, and used commons-feedparser from Kevin 
> Burton and his crew. At the time it was well developed, and a little more 
> flexible and easier for me to pick up than Rome. Since then however, its 
> development has really become stagnant and it is no longer maintained.
> In terms of functionality, they are roughly equivalent, so there isn't much 
> difference.
> {quote}
> Already +1 from Andrzej and Chris. Will remove it tomorrow if there aren't 
> any objections in the meantime 





[jira] [Commented] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex

2011-07-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059677#comment-13059677
 ] 

Hudson commented on NUTCH-1013:
---

Integrated in Nutch-trunk #1536 (See 
[https://builds.apache.org/job/Nutch-trunk/1536/])
NUTCH-1013 Migrate RegexURLNormalizer from Apache ORO to java.util.regex

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1142687
Files : 
* 
/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java
* /nutch/trunk/CHANGES.txt


> Migrate RegexURLNormalizer from Apache ORO to java.util.regex
> -
>
> Key: NUTCH-1013
> URL: https://issues.apache.org/jira/browse/NUTCH-1013
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1013-1.4.patch
>
>
> Apache ORO uses old Perl 5-style regular expressions. Features such as the 
> powerful lookbehind are not available. The project has also been retired.
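
As a rough illustration of what lookbehind makes possible in java.util.regex, the sketch below removes a trailing "index.html" only when it directly follows a slash. The class name, method name and the rule itself are assumptions made up for this example; they are not taken from the NUTCH-1013 patch or from the shipped normalizer rules.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
final class LookbehindSketch {
    // (?<=/) is a lookbehind: it requires a '/' immediately before the
    // match without making that slash part of the match.
    private static final Pattern DEFAULT_PAGE =
        Pattern.compile("(?<=/)index\\.html$");

    // Strips a trailing "index.html" path segment, keeping the slash.
    static String stripDefaultPage(String url) {
        return DEFAULT_PAGE.matcher(url).replaceFirst("");
    }
}
```

Because the slash is only asserted and not consumed, it stays in the result, and a URL ending in "myindex.html" is left untouched.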





[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

2011-07-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061041#comment-13061041
 ] 

Hudson commented on NUTCH-1011:
---

Integrated in Nutch-trunk #1538 (See 
[https://builds.apache.org/job/Nutch-trunk/1538/])
NUTCH-1011 Remove duplicate slashes from URLs

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1143468
Files : 
* /nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
* /nutch/trunk/conf/regex-normalize.xml.template
* /nutch/trunk/CHANGES.txt


> Normalize duplicate slashes in URL's
> 
>
> Key: NUTCH-1011
> URL: https://issues.apache.org/jira/browse/NUTCH-1011
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4, 2.0
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URLs with multiple slashes, e.g. 
> http://cocoon.apache.org///1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many 
> URLs actually pointing to the same page and generating new (unique) URLs to 
> the same or other duplicate pages.
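
The effect described above can be sketched in plain Java. This is a minimal example under stated assumptions: the class name and the pattern are invented here and are not the rule that was committed to regex-normalize.xml.template. The sketch collapses any run of two or more slashes that does not directly follow a colon, so the "//" after the scheme is preserved.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
final class DuplicateSlashSketch {
    // Two or more consecutive slashes, not directly preceded by ':'.
    private static final Pattern DUPLICATE_SLASHES =
        Pattern.compile("(?<!:)/{2,}");

    // Replaces every matching run of slashes with a single slash.
    static String collapse(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(collapse("http://cocoon.apache.org///1.x/dynamic.html"));
    }
}
```

Note that this simple pattern also collapses double slashes inside a query string, which a production rule would probably want to avoid.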





[jira] [Commented] (NUTCH-1027) Degrade log level of `can't find rules for scope`

2011-07-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063704#comment-13063704
 ] 

Hudson commented on NUTCH-1027:
---

Integrated in Nutch-trunk #1543 (See 
[https://builds.apache.org/job/Nutch-trunk/1543/])
NUTCH-1027 Degrade log level of 'can't find rules for scope'

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1145131
Files : 
* 
/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java
* /nutch/trunk/CHANGES.txt


> Degrade log level of `can't find rules for scope`
> -
>
> Key: NUTCH-1027
> URL: https://issues.apache.org/jira/browse/NUTCH-1027
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1027-1.4-1.patch
>
>
> The warning for regex.RegexURLNormalizer - can't find rules for scope 
> '', using default should be degraded to info because:
> # new users are unaware of the normalizer
> # the scoping of normalizer is not really documented (meaning wiki/tutorial, 
> not just javadoc)
> # I don't consider it a warning (i.e. having no scope is not bad)
> Thoughts?





[jira] [Commented] (NUTCH-1043) Add pattern for filtering .js in default url filters

2011-07-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067481#comment-13067481
 ] 

Hudson commented on NUTCH-1043:
---

Integrated in Nutch-trunk #1550 (See 
[https://builds.apache.org/job/Nutch-trunk/1550/])
NUTCH-1043 Add pattern for filtering .js in default url filters

jnioche : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147798
Files : 
* /nutch/trunk/conf/automaton-urlfilter.txt.template
* /nutch/trunk/conf/regex-urlfilter.txt.template
* /nutch/trunk/CHANGES.txt


> Add pattern for filtering .js in default url filters
> 
>
> Key: NUTCH-1043
> URL: https://issues.apache.org/jira/browse/NUTCH-1043
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4, 2.0
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1043.patch
>
>
> The JavaScript parser is not used by default as it is extremely noisy; 
> however, the default URL filters do not filter out URLs ending in .js and the 
> default parser (Tika) can't parse them. In a nutshell, we are fetching URLs 
> that we know can't be parsed.
> I suggest that we add a regex to the default URL filters. If people are 
> interested in fetching and parsing .js files they can activate the plugin in 
> their conf and remove the regex in the URL filters.
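
A minimal sketch of the kind of check such a regex performs, written in plain Java. The class name and the pattern are assumptions for illustration; the expression actually added to the filter templates by NUTCH-1043 may differ.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
final class JsSuffixFilterSketch {
    // Matches URLs whose last characters are ".js", ignoring case.
    private static final Pattern JS_SUFFIX =
        Pattern.compile("\\.js$", Pattern.CASE_INSENSITIVE);

    // Returns true when the URL would be rejected by this rule.
    static boolean isRejected(String url) {
        return JS_SUFFIX.matcher(url).find();
    }
}
```

With this pattern a URL such as http://example.org/app.js?v=2 is not rejected, because the match is anchored at the very end of the string; whether that is desirable depends on the crawl.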





[jira] [Commented] (NUTCH-1055) upgrade package.html file in language identifier plugin

2011-07-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067482#comment-13067482
 ] 

Hudson commented on NUTCH-1055:
---

Integrated in Nutch-trunk #1550 (See 
[https://builds.apache.org/job/Nutch-trunk/1550/])
commit and close of NUTCH-1055 and changes.txt, this commit does not affect 
functionality it is merely a hyperlink reference to the document used as the 
basis for the language identifier plugin

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147817
Files : 
* 
/nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/package.html
* /nutch/trunk/CHANGES.txt


> upgrade package.html file in language identifier plugin
> ---
>
> Key: NUTCH-1055
> URL: https://issues.apache.org/jira/browse/NUTCH-1055
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.3
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: documentation
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1055-package-html.patch, 
> NUTCH-1055-trunk-package-html.patch, europarl.ps
>
>
> package.html within the language identifier plugin contains the following... 
> however the link is broken.
> 
> 
> Text document language identifier. Language profiles are based on 
> material from
> <a href="http://www.isi.edu/~koehn/europarl/">http://www.isi.edu/~koehn/europarl/</a>.
> 
> 
> The correct link should be
> http://www.homepages.inf.ed.ac.uk/pkoehn/publications/europarl.ps
> I will submit a patch.





[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing

2011-07-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068149#comment-13068149
 ] 

Hudson commented on NUTCH-1037:
---

Integrated in Nutch-trunk #1551 (See 
[https://builds.apache.org/job/Nutch-trunk/1551/])
NUTCH-1037 Option to deduplicate anchors prior to indexing

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1148308
Files : 
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java


> Deduplicate anchors before indexing
> ---
>
> Key: NUTCH-1037
> URL: https://issues.apache.org/jira/browse/NUTCH-1037
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1037-1.4-1.patch, NUTCH-1037-1.4-2.patch, 
> NUTCH-1037-1.4-3.patch, NUTCH-1037-2.0-1.patch, NUTCH-1037-2.0-2.patch
>
>
> Anchors are not deduplicated before indexing. This can result in a very high 
> number of similar and identical anchors being indexed. Before indexing, 
> anchors must be deduplicated at least on case.
> Use anchorIndexingFilter.deduplicate=true to deduplicate anchors 
> case-insensitively.
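
Case-insensitive deduplication of this kind can be sketched as follows. This is an illustration only, with an invented class and method name; it is not the code in AnchorIndexingFilter, and keeping the first spelling seen is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch.
final class AnchorDedupSketch {
    // Keeps the first occurrence of each anchor text, comparing
    // anchors by their lower-cased form.
    static List<String> deduplicate(List<String> anchors) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String anchor : anchors) {
            // Set.add returns false when the lower-cased text was seen before.
            if (seen.add(anchor.toLowerCase(Locale.ROOT))) {
                unique.add(anchor);
            }
        }
        return unique;
    }
}
```

The input order is preserved, so the indexed anchors stay in the order in which they were collected.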





[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070930#comment-13070930
 ] 

Hudson commented on NUTCH-1045:
---

Integrated in Nutch-trunk #1557 (See 
[https://builds.apache.org/job/Nutch-trunk/1557/])
NUTCH-1045 Mimeutil uses default Tika config unless overridden

jnioche : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1150670
Files : 
* /nutch/trunk/conf/tika-mimetypes.xml
* /nutch/trunk/src/java/org/apache/nutch/util/MimeUtil.java
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/CHANGES.txt


> MimeUtil to rely on default config provided by Tika
> ---
>
> Key: NUTCH-1045
> URL: https://issues.apache.org/jira/browse/NUTCH-1045
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4, 2.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml despite the fact that it is 
> identical to the one found in tika-core.jar.
> Having a mechanism for specifying a custom tika-mimetypes.xml is good, 
> but if the user hasn't specified one or if it can't be loaded then we should 
> rely on Tika's default. This way we won't need to provide 
> conf/tika-mimetypes.xml anymore or keep it in sync with the default Tika one 
> whenever we upgrade Tika.
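
The fallback described above could look roughly like the sketch below. It deliberately avoids the Tika API and only shows the decision logic; the class, method and parameter names are assumptions for this example and do not come from MimeUtil.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration only; not the actual MimeUtil code.
final class MimeConfigFallbackSketch {
    // Returns customName when it is set and can be opened from the
    // classpath; otherwise returns defaultName.
    static String chooseConfig(String customName, String defaultName,
                               ClassLoader loader) {
        if (customName != null && !customName.isEmpty()) {
            try (InputStream in = loader.getResourceAsStream(customName)) {
                if (in != null) {
                    return customName;
                }
            } catch (IOException e) {
                // Closing the stream failed; fall through to the default.
            }
        }
        return defaultName;
    }
}
```

A caller would pass the user-configured file name, if any, and load the built-in definitions whenever the default name comes back.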





[jira] [Commented] (NUTCH-1065) New mvn.template

2011-08-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079783#comment-13079783
 ] 

Hudson commented on NUTCH-1065:
---

Integrated in Nutch-trunk #1567 (See 
[https://builds.apache.org/job/Nutch-trunk/1567/])
commit to address NUTCH-1065 - New mvn.template and update of changes.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1153833
Files : 
* /nutch/trunk/conf/domain-urlfilter.txt
* /nutch/trunk/ivy/mvn.template
* /nutch/trunk/CHANGES.txt


> New mvn.template
> 
>
> Key: NUTCH-1065
> URL: https://issues.apache.org/jira/browse/NUTCH-1065
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.4, 2.0
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1065-mvn-template-new.patch, 
> NUTCH-1065-trunk-mvn-template-new.patch
>
>
> Removal of Otis from mvn.template file and addition of myself. This does not 
> alter functionality of any mvn or ivy tasks or files.





[jira] [Commented] (NUTCH-920) Project Metadata

2011-08-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082894#comment-13082894
 ] 

Hudson commented on NUTCH-920:
--

Integrated in Nutch-trunk #1573 (See 
[https://builds.apache.org/job/Nutch-trunk/1573/])
commit to address NUTCH-920 adding trunk 2.0 DOAP file to svn.

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156101
Files : 
* /nutch/trunk/doap.rdf


> Project Metadata
> 
>
> Key: NUTCH-920
> URL: https://issues.apache.org/jira/browse/NUTCH-920
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Julien Nioche
>Assignee: Lewis John McGibbney
> Attachments: doap_Apache_Nutch.rdf, doap_Nutch_trunk.rdf
>
>






[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-08-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083534#comment-13083534
 ] 

Hudson commented on NUTCH-623:
--

Integrated in Nutch-trunk-ant #5 (See 
[https://builds.apache.org/job/Nutch-trunk-ant/5/])
commit to revert changes by NUTCH-623 which broke tests.
commit to address NUTCH-623 and changes.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156712
Files : 
* /nutch/trunk/src/plugin/languageidentifier/plugin.xml
* /nutch/trunk/src/plugin/languageidentifier/build.xml
* /nutch/trunk/CHANGES.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156692
Files : 
* /nutch/trunk/src/plugin/languageidentifier/plugin.xml
* /nutch/trunk/src/plugin/languageidentifier/build.xml
* /nutch/trunk/CHANGES.txt


> Change plugin source directory "languageidentifier" to "language-identifier"
> 
>
> Key: NUTCH-623
> URL: https://issues.apache.org/jira/browse/NUTCH-623
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ignacio J. Ortega
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-623-branch-1.4-20110810.patch, 
> NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-trunk-2.0-20110810.patch
>
>
> When trying to develop and debug Nutch in Eclipse, following the 
> instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you can't 
> run it unless languageidentifier is renamed to language-identifier. When you 
> later issue an svn update, you end up with two languageidentifier src dirs, one 
> with the dash and another without it. It's only an annoyance, I know, but it 
> had me stuck for 2 weeks, so if it can be corrected... 





[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-08-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083909#comment-13083909
 ] 

Hudson commented on NUTCH-623:
--

Integrated in Nutch-trunk #1575 (See 
[https://builds.apache.org/job/Nutch-trunk/1575/])
commit to revert changes by NUTCH-623 which broke tests.
commit to address NUTCH-623 and changes.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156712
Files : 
* /nutch/trunk/src/plugin/languageidentifier/plugin.xml
* /nutch/trunk/src/plugin/languageidentifier/build.xml
* /nutch/trunk/CHANGES.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156692
Files : 
* /nutch/trunk/src/plugin/languageidentifier/plugin.xml
* /nutch/trunk/src/plugin/languageidentifier/build.xml
* /nutch/trunk/CHANGES.txt


> Change plugin source directory "languageidentifier" to "language-identifier"
> 
>
> Key: NUTCH-623
> URL: https://issues.apache.org/jira/browse/NUTCH-623
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ignacio J. Ortega
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-623-branch-1.4-20110810.patch, 
> NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-trunk-2.0-20110810.patch
>
>
> When trying to develop and debug Nutch in Eclipse, following the 
> instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you can't 
> run it unless languageidentifier is renamed to language-identifier. When you 
> later issue an svn update, you end up with two languageidentifier src dirs, one 
> with the dash and another without it. It's only an annoyance, I know, but it 
> had me stuck for 2 weeks, so if it can be corrected... 





[jira] [Commented] (NUTCH-1099) Add HBase and Cassandra storage properties to nutch-default.xml

2011-09-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102414#comment-13102414
 ] 

Hudson commented on NUTCH-1099:
---

Integrated in Nutch-trunk-ant #32 (See 
[https://builds.apache.org/job/Nutch-trunk-ant/32/])
commit to address NUTCH-1099 and update to changes.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1169475
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml


> Add HBase and Cassandra storage properties to nutch-default.xml
> ---
>
> Key: NUTCH-1099
> URL: https://issues.apache.org/jira/browse/NUTCH-1099
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.0
> Environment: Ubuntu 11.04 natty
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 2.0
>
> Attachments: NUTCH-1099-20110829.patch
>
>
> I was getting fed up with manually adding the properties for HBase and 
> Cassandra to nutch-site.xml and thought that if we could at least add them to 
> nutch-default.xml and comment them out, it would be a simple copy-paste 
> job rather than fetching the content from somewhere else I had it 
> stored. N.B. this changes no functionality, it just makes people's lives a bit 
> easier.





[jira] [Commented] (NUTCH-1099) Add HBase and Cassandra storage properties to nutch-default.xml

2011-09-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102413#comment-13102413
 ] 

Hudson commented on NUTCH-1099:
---

Integrated in Nutch-trunk #1601 (See 
[https://builds.apache.org/job/Nutch-trunk/1601/])
commit to address NUTCH-1099 and update to changes.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1169475
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml


> Add HBase and Cassandra storage properties to nutch-default.xml
> ---
>
> Key: NUTCH-1099
> URL: https://issues.apache.org/jira/browse/NUTCH-1099
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: 2.0
> Environment: Ubuntu 11.04 natty
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 2.0
>
> Attachments: NUTCH-1099-20110829.patch
>
>
> I was getting fed up with manually adding the properties for HBase and 
> Cassandra to nutch-site.xml and thought that if we could at least add them to 
> nutch-default.xml and comment them out, it would be a simple copy-paste 
> job rather than fetching the content from somewhere else I had it 
> stored. N.B. this changes no functionality, it just makes people's lives a bit 
> easier.





[jira] [Commented] (NUTCH-1114) Attr file missing in domain filter

2011-09-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108360#comment-13108360
 ] 

Hudson commented on NUTCH-1114:
---

Integrated in Nutch-branch-1.4 #11 (See 
[https://builds.apache.org/job/Nutch-branch-1.4/11/])
NUTCH-1114 Attr file missing in domain filter

markus : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1172637
Files : 
* /nutch/branches/branch-1.4/CHANGES.txt
* /nutch/branches/branch-1.4/src/plugin/urlfilter-domain/plugin.xml


> Attr file missing in domain filter
> --
>
> Key: NUTCH-1114
> URL: https://issues.apache.org/jira/browse/NUTCH-1114
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.4
>
>
> WARN org.apache.nutch.urlfilter.domain.DomainURLFilter: Attribute "file" is 
> not defined in plugin.xml for plugin urlfilter-domain
> File element in plugin.xml is commented out but should not be. Uncommenting 
> results in an INFO message.





[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher

2011-09-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108359#comment-13108359
 ] 

Hudson commented on NUTCH-1067:
---

Integrated in Nutch-branch-1.4 #11 (See 
[https://builds.apache.org/job/Nutch-branch-1.4/11/])
NUTCH-1067 Nutch-default configuration directives missing

markus : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1172585
Files : 
* /nutch/branches/branch-1.4/conf/nutch-default.xml


> Configure minimum throughput for fetcher
> 
>
> Key: NUTCH-1067
> URL: https://issues.apache.org/jira/browse/NUTCH-1067
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.4
>
> Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, 
> NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch
>
>
> Large fetches can contain a lot of URLs for the same domain. These can be 
> very slow to crawl due to politeness from robots.txt, e.g. 10s per URL. If 
> all other URLs have been fetched, these queues can stall the entire 
> fetcher; 60 URLs can then take 10 minutes or even more. This can usually be 
> dealt with using the time bomb, but the time bomb value is hard to determine.
> This patch adds a fetcher.throughput.threshold setting, meaning the minimum 
> number of pages per second before the fetcher gives up. It doesn't use the 
> global number of pages / running time but records the actual pages processed 
> in the previous second. This value is compared with the configured threshold.
> Besides the check, the fetcher's status is also updated with the actual number 
> of pages per second and bytes per second.
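
The per-second comparison described above can be sketched like this. It is a simplified illustration, not the Fetcher code: the class and method names are invented, and the sketch assumes the check is called exactly once per second with the cumulative page count.

```java
// Hypothetical helper for illustration only; not part of Nutch.
final class ThroughputCheckSketch {
    private final int thresholdPagesPerSecond;
    private long pagesAtLastCheck;

    ThroughputCheckSketch(int thresholdPagesPerSecond) {
        this.thresholdPagesPerSecond = thresholdPagesPerSecond;
    }

    // Call once per second with the total number of pages processed so far.
    // Returns true when fewer pages than the threshold were processed
    // since the previous call.
    boolean belowThreshold(long totalPages) {
        long pagesLastSecond = totalPages - pagesAtLastCheck;
        pagesAtLastCheck = totalPages;
        return pagesLastSecond < thresholdPagesPerSecond;
    }
}
```

Comparing against the previous second rather than the overall average means a fast start cannot hide a stalled tail of slow queues.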





[jira] [Commented] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils

2011-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113148#comment-13113148
 ] 

Hudson commented on NUTCH-1115:
---

Integrated in Nutch-branch-1.4 #14 (See 
[https://builds.apache.org/job/Nutch-branch-1.4/14/])
Recommitted CHANGELOG entry for NUTCH-1115. Was overwritten by NUTCH-1078 
commit
NUTCH-1115 Option to disable fixing of URL embedded parameters in 
DomContentUtils

markus : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174222
Files : 
* /nutch/branches/branch-1.4/CHANGES.txt

markus : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174147
Files : 
* /nutch/branches/branch-1.4/conf/nutch-default.xml
* 
/nutch/branches/branch-1.4/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* 
/nutch/branches/branch-1.4/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java


> Option to disable fixing of embedded params in DomContentUtils
> --
>
> Key: NUTCH-1115
> URL: https://issues.apache.org/jira/browse/NUTCH-1115
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4
>
> Attachments: NUTCH-1115-1.4-1.patch, NUTCH-1115-1.4-2.patch
>
>
> Add option to disable fixing of embedded params:
> http://lucene.472066.n3.nabble.com/Outlinks-with-embedded-params-td3332396.html
> When enabled, millions of junk URLs are emitted as outlinks. This results in 
> many 404s in the DB and many very long URLs that actually lead to the same 
> page.





[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)

2011-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113147#comment-13113147
 ] 

Hudson commented on NUTCH-1078:
---

Integrated in Nutch-branch-1.4 #14 (See 
[https://builds.apache.org/job/Nutch-branch-1.4/14/])
Recommitted CHANGELOG entry for NUTCH-1115. Was overwritten by NUTCH-1078 
commit
commit to address NUTCH-1078 and update of changes.txt

markus : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174222
Files : 
* /nutch/branches/branch-1.4/CHANGES.txt

lewismc : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174191
Files : 
* /nutch/branches/branch-1.4/CHANGES.txt
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Crawl.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDb.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDbFilter.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDbMerger.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDbReader.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/FetchScheduleFactory.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Generator.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Injector.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/LinkDb.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/LinkDbFilter.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/LinkDbMerger.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/LinkDbReader.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/MapWritable.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/SignatureFactory.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/URLPartitioner.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/Fetcher.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/OldFetcher.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/IndexingFilters.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrClean.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/net/URLNormalizers.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/OutlinkExtractor.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParseOutputFormat.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParsePluginsReader.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParseResult.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParseSegment.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParseUtil.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParserFactory.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/plugin/PluginDescriptor.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/plugin/PluginManifestParser.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/plugin/PluginRepository.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/protocol/ProtocolFactory.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/LinkRank.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/Loops.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/segment/SegmentMergeFilters.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/segment/SegmentMerger.java
* 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/segment/SegmentReader.java
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/tools/Cra

[jira] [Commented] (NUTCH-1074) topN is ignored with maxNumSegments

2011-09-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113890#comment-13113890
 ] 

Hudson commented on NUTCH-1074:
---

Integrated in Nutch-branch-1.4 #15 (See 
[https://builds.apache.org/job/Nutch-branch-1.4/15/])
NUTCH-1074 topN is ignored with maxNumSegments and generate.max.count

markus : 
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174689
Files : 
* /nutch/branches/branch-1.4/CHANGES.txt
* /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Generator.java


> topN is ignored with maxNumSegments
> ---
>
> Key: NUTCH-1074
> URL: https://issues.apache.org/jira/browse/NUTCH-1074
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.4
>
> Attachments: generator_fix.patch
>
>
> When generating segments with topN and maxNumSegments, topN is not respected. 
> It looks like the first generated segment contains topN * maxNumSegments 
> URLs; at least the number of map input records roughly matches.





[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-09-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114129#comment-13114129
 ] 

Hudson commented on NUTCH-623:
--

Integrated in Nutch-trunk #1611 (See 
[https://builds.apache.org/job/Nutch-trunk/1611/])
commit to address NUTCH-623 and update to changes.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175188
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/language-identifier
* /nutch/trunk/src/plugin/language-identifier/build.xml
* /nutch/trunk/src/plugin/language-identifier/ivy.xml
* /nutch/trunk/src/plugin/language-identifier/plugin.xml
* /nutch/trunk/src/plugin/language-identifier/src
* /nutch/trunk/src/plugin/languageidentifier


> Change plugin source directory "languageidentifier" to "language-identifier"
> 
>
> Key: NUTCH-623
> URL: https://issues.apache.org/jira/browse/NUTCH-623
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ignacio J. Ortega
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-623-branch-1.4-20110810.patch, 
> NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-branch-1.4-20110910-v2.patch, 
> NUTCH-623-trunk-1.4-20110924.patch, NUTCH-623-trunk-2.0-20110810.patch
>
>
> When trying to develop and debug Nutch in Eclipse, following the 
> instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you can't 
> run unless languageidentifier is renamed to language-identifier. When you later 
> issue an svn update, you end up with two languageidentifier src dirs, one with 
> the dash and another without it. It's only an annoyance, I know, but it 
> had me stuck for 2 weeks, so if it can be corrected... 





[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"

2011-09-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114785#comment-13114785
 ] 

Hudson commented on NUTCH-623:
--

Integrated in Nutch-trunk #1613 (See 
[https://builds.apache.org/job/Nutch-trunk/1613/])
NUTCH-623 fix source directory

siren : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175739
Files : 
* /nutch/trunk/build.xml


> Change plugin source directory "languageidentifier" to "language-identifier"
> 
>
> Key: NUTCH-623
> URL: https://issues.apache.org/jira/browse/NUTCH-623
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ignacio J. Ortega
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-623-branch-1.4-20110810.patch, 
> NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-branch-1.4-20110910-v2.patch, 
> NUTCH-623-trunk-1.4-20110924.patch, NUTCH-623-trunk-2.0-20110810.patch
>
>
> When trying to develop and debug Nutch in Eclipse, following the 
> instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you can't 
> run unless languageidentifier is renamed to language-identifier. When you later 
> issue an svn update, you end up with two languageidentifier src dirs, one with 
> the dash and another without it. It's only an annoyance, I know, but it 
> had me stuck for 2 weeks, so if it can be corrected... 





[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files

2012-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263362#comment-13263362
 ] 

Hudson commented on NUTCH-1189:
---

Integrated in Nutch-nutchgora #240 (See 
[https://builds.apache.org/job/Nutch-nutchgora/240/])
NUTCH-1189 (Update gora.properties for HBase to reflect Gora 0.2) (Revision 
1330744)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/conf/gora.properties


> add commented out default settings to gora.properties files 
> 
>
> Key: NUTCH-1189
> URL: https://issues.apache.org/jira/browse/NUTCH-1189
> Project: Nutch
>  Issue Type: Sub-task
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1189-v2.patch, NUTCH-1189-v3.patch, 
> NUTCH-1189-v4.patch, NUTCH-1189.patch
>
>
> This issue should have been dealt with as part of its parent issue; however, 
> I think that, as it is a fairly large task in itself, it needs to be done 
> independently. The gora.properties file should, amongst other settings, and 
> besides the extremely basic defaults for sqlstore, include defaults for opening 
> HBase, Cassandra, etc. servers on their default ports. Leaving this down 
> to individual interpretation puts a huge onus on the user, hence 
> constructing a barrier to entry for getting the configuration settings up and 
> running.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-882) Design a Host table in GORA

2012-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263363#comment-13263363
 ] 

Hudson commented on NUTCH-882:
--

Integrated in Nutch-nutchgora #240 (See 
[https://builds.apache.org/job/Nutch-nutchgora/240/])
NUTCH-882 Design a Host table in GORA (Revision 1330728)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/build.xml
* /nutch/branches/nutchgora/conf/gora-hbase-mapping.xml
* /nutch/branches/nutchgora/default.properties
* /nutch/branches/nutchgora/ivy/ivy.xml
* /nutch/branches/nutchgora/src/gora/host.avsc
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/host
* /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDb.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbReader.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbUpdateJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbUpdateReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostInjectorJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/Host.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/StorageUtils.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/WebTableCreator.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/util/Histogram.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/util/TableUtil.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/util/domain/DomainStatistics.java


> Design a Host table in GORA
> ---
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: nutchgora
>Reporter: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, 
> hostdb.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for: 
> * customising the behaviour of the fetching on a per-host basis, e.g. number of 
> threads, min time between threads etc.
> * storing stats
> * keeping metadata and possibly propagating them to the webpages 
> * keeping a copy of the robots.txt and possibly using that later to filter the 
> webtable
> * storing sitemap files and updating the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 





[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box

2012-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263365#comment-13263365
 ] 

Hudson commented on NUTCH-902:
--

Integrated in Nutch-nutchgora #240 (See 
[https://builds.apache.org/job/Nutch-nutchgora/240/])
NUTCH-902 (merge different "storage.data.store.class" entries into one) 
(Revision 1330807)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/conf/nutch-default.xml


> Add all necessary files and configuration so that nutch can be used with 
> different backends out-of-the-box
> --
>
> Key: NUTCH-902
> URL: https://issues.apache.org/jira/browse/NUTCH-902
> Project: Nutch
>  Issue Type: New Feature
>  Components: documentation, storage
>Affects Versions: nutchbase
>Reporter: Enis Soztutar
>Assignee: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch
>
>
> As per the discussion in the mailing list and 
> http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the 
> necessary files and configuration. I propose that we maintain configuration 
> for at least SQL, HBase and Cassandra. 
> The following changes are needed:
> conf/gora-sql-mapping.xml
> conf/gora-hbase-mapping.xml
> conf/gora-cassandra-mapping.xml
> comments on nutch-default and ivy.xml 
> Shall we also include jars from gora-hbase, gora-cassandra and their 
> dependencies ? 





[jira] [Commented] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer

2012-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263364#comment-13263364
 ] 

Hudson commented on NUTCH-1340:
---

Integrated in Nutch-nutchgora #240 (See 
[https://builds.apache.org/job/Nutch-nutchgora/240/])
NUTCH-1340 Increase scalability by only removing markers when they actually 
exist for DbUpdaterReducer (Revision 1330722)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/DbUpdateReducer.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/Mark.java


> Increase scalability by only removing markers when they actually exist for 
> DbUpdaterReducer
> ---
>
> Key: NUTCH-1340
> URL: https://issues.apache.org/jira/browse/NUTCH-1340
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: nutchgora
>
> Attachments: NUTCH-1340-v1.txt, NUTCH-1340-v2.txt
>
>
> After applying GORA-120 (this already is a huge performance boost by itself) 
> one of the major bottlenecks of the DbUpdaterReducer is the deletion of the 
> markers. The update reducer simply sets every row to delete its markers. A 
> lot of rows do not actually have the markers but the deletes are fired away 
> in any case. Because the markers are already always on the input, a simple 
> check to see if they exist greatly improves performance.
> In particular it is very expensive in HBase, because every single Delete 
> immediately triggers a connection to the regionservers. (They ignore the 
> "autoflush=false" directive). Although deletes can be done in batch, this is 
> currently not supported by Gora. For one it is very difficult to implement in 
> the current HBaseStore with regard to multithreading, and secondly I noticed 
> performance did not increase significantly.
> By performance debugging on a real life cluster this currently seems to be 
> the biggest bottleneck of the DbUpdaterReducer. (Remember only after applying 
> GORA-120)
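
The check described above can be sketched roughly as follows. This is not the committed patch: MarkerRow, its map-backed storage and the deletesIssued counter are hypothetical stand-ins for a Gora-backed row, and only the idea (test for the marker before issuing a delete) is taken from the description.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a stored row; not a Nutch or Gora class. */
class MarkerRow {
  private final Map<String, String> markers = new HashMap<>();

  /** Counts the deletes that would reach the backing store. */
  int deletesIssued = 0;

  void putMarker(String name, String batchId) {
    markers.put(name, batchId);
  }

  boolean hasMarker(String name) {
    return markers.containsKey(name);
  }

  /** Unconditional removal: always costs one store-level delete. */
  void removeMarker(String name) {
    deletesIssued++;
    markers.remove(name);
  }

  /** Guarded removal: skips the delete when the marker is absent. */
  boolean removeMarkerIfPresent(String name) {
    if (!hasMarker(name)) {
      return false;
    }
    removeMarker(name);
    return true;
  }

  public static void main(String[] args) {
    MarkerRow row = new MarkerRow();
    row.putMarker("fetch", "batch-1");
    row.removeMarkerIfPresent("fetch"); // marker exists, one delete issued
    row.removeMarkerIfPresent("parse"); // marker absent, no delete issued
    System.out.println("deletes issued: " + row.deletesIssued);
  }
}
```

Since the markers are already part of the reducer input, the presence test in this sketch costs a map lookup, whereas each avoided delete would otherwise be a round trip to the store.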





[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml

2012-05-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268112#comment-13268112
 ] 

Hudson commented on NUTCH-1205:
---

Integrated in Nutch-nutchgora #244 (See 
[https://builds.apache.org/job/Nutch-nutchgora/244/])
NUTCH-1205 Upgrade gora modules to 0.2 in ivy/ivy.xml (addition) (Revision 
1333551)
NUTCH-1205 Upgrade gora modules to 0.2 in ivy/ivy.xml (addition) (Revision 
1333547)
NUTCH-1205 Upgrade gora modules to 0.2 in ivy/ivy.xml (Revision 1333435)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/ivy/ivy.xml

ferdy : 
Files : 
* /nutch/branches/nutchgora/ivy/ivy.xml

ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/build.xml
* /nutch/branches/nutchgora/conf/gora.properties
* /nutch/branches/nutchgora/ivy/ivy.xml
* /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/StorageUtils.java
* /nutch/branches/nutchgora/src/test/gora.properties
* /nutch/branches/nutchgora/src/test/org/apache/nutch/storage/TestGoraStorage.java
* /nutch/branches/nutchgora/src/test/org/apache/nutch/util/AbstractNutchTest.java
* /nutch/branches/nutchgora/src/testprocess
* /nutch/branches/nutchgora/src/testprocess/gora.properties


> Upgrade gora modules to 0.2 in ivy/ivy.xml
> --
>
> Key: NUTCH-1205
> URL: https://issues.apache.org/jira/browse/NUTCH-1205
> Project: Nutch
>  Issue Type: Improvement
>  Components: storage
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: nutchgora
>
> Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v11-addition.patch, 
> NUTCH-1205-v11.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, 
> NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, 
> NUTCH-1205-v6.patch, NUTCH-1205.patch
>
>
> Although gora trunk is unstable, work is ongoing to get this fixed. For the 
> time being, I think Nutchgora should use gora trunk as this will identify 
> more vulnerabilities. I'll get the trivial patch submitted shortly.





[jira] [Commented] (NUTCH-1350) remove unused dependancy because of access restriction

2012-05-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268894#comment-13268894
 ] 

Hudson commented on NUTCH-1350:
---

Integrated in Nutch-nutchgora #245 (See 
[https://builds.apache.org/job/Nutch-nutchgora/245/])
NUTCH-1350 remove unused dependancy because of access restriction (Revision 
1333803)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/test/org/apache/nutch/util/CrawlTestUtil.java


> remove unused dependancy because of access restriction
> --
>
> Key: NUTCH-1350
> URL: https://issues.apache.org/jira/browse/NUTCH-1350
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
>Priority: Trivial
> Fix For: nutchgora
>
>
> CrawlTestUtil has an unused dependency, com.sun.net.httpserver.HttpContext, 
> that sometimes causes an "access restriction" error when used with certain 
> JDKs. I figured that since it isn't used anyway I can just remove it.





[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-05-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271069#comment-13271069
 ] 

Hudson commented on NUTCH-1352:
---

Integrated in Nutch-nutchgora #248 (See 
[https://builds.apache.org/job/Nutch-nutchgora/248/])
NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 
1335066)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
* /nutch/branches/nutchgora/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* /nutch/branches/nutchgora/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java
* /nutch/branches/nutchgora/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* /nutch/branches/nutchgora/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java


> Improve regex urlfilters/normalizers synchronization
> 
>
> Key: NUTCH-1352
> URL: https://issues.apache.org/jira/browse/NUTCH-1352
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch
>
>
> I noticed that during fetching, the fetcher threads spend a lot of time 
> blocking on a monitor because of outlink normalizing/filtering. The cause of 
> this: some of the regex plugins use single-lock synchronization.
> This patch improves throughput by removing synchronization locks and replacing 
> them with threadlocals where needed.
> It has been extensively tested in production. I will commit this later today 
> if there are no objections.
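
As a rough illustration of the technique (not the code in the attached patch), a regex rule can keep one Matcher per thread instead of guarding a shared one with a lock. ThreadLocalRegexRule is a hypothetical class name; the sketch relies only on the fact that java.util.regex.Pattern is safe to share between threads while Matcher is not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rule class; illustrates the technique, not Nutch's RegexRule. */
class ThreadLocalRegexRule {
  // Each thread lazily gets its own Matcher, so no lock is needed.
  private final ThreadLocal<Matcher> matcher;

  ThreadLocalRegexRule(String regex) {
    // The compiled Pattern is immutable and can be shared by all threads.
    Pattern compiled = Pattern.compile(regex);
    this.matcher = ThreadLocal.withInitial(() -> compiled.matcher(""));
  }

  /** Lock-free: the Matcher being reset is confined to the calling thread. */
  boolean matches(String url) {
    return matcher.get().reset(url).find();
  }

  public static void main(String[] args) {
    ThreadLocalRegexRule rule = new ThreadLocalRegexRule("\\.(gif|jpg)$");
    System.out.println(rule.matches("http://example.com/a.gif"));
    System.out.println(rule.matches("http://example.com/index.html"));
  }
}
```

The trade-off in this sketch is one extra Matcher per fetcher thread in exchange for threads no longer queueing on a single monitor.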





[jira] [Commented] (NUTCH-1349) Make batchId explcit within debug logging and improve CLI

2012-05-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271068#comment-13271068
 ] 

Hudson commented on NUTCH-1349:
---

Integrated in Nutch-nutchgora #248 (See 
[https://builds.apache.org/job/Nutch-nutchgora/248/])
Commit to address NUTCH-1349 and update to CHANGES.txt (Revision 1335436)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/log4j.properties
* /nutch/branches/nutchgora/src/bin/nutch
* /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/WebTableReader.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java


> Make batchId explcit within debug logging and improve CLI
> -
>
> Key: NUTCH-1349
> URL: https://issues.apache.org/jira/browse/NUTCH-1349
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1349-v2.patch, NUTCH-1349-v2.patch, 
> NUTCH-1349.patch
>
>
> I find this a pain when trying to locate the batchId of some urls which are 
> skipped when going to the Solr index. My DEBUG log output gives me
> {code}
> 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
> Skipping http://www.glasgowwheelers.com/; different batch id
> 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
> Skipping http://www.heraldscotland.com/; different batch id
> {code}
> when I would actually like
> {code}
> 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
> Skipping http://www.glasgowwheelers.com/; different batch id (ACTUAL BATCH ID)
> 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - 
> Skipping http://www.heraldscotland.com/; different batch id (ACTUAL BATCH ID)
> {code} 
> patch coming up soon





[jira] [Commented] (NUTCH-1353) nutchgora DomainStatistics support crawlId, counter bug and reformatting

2012-05-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271070#comment-13271070
 ] 

Hudson commented on NUTCH-1353:
---

Integrated in Nutch-nutchgora #248 (See 
[https://builds.apache.org/job/Nutch-nutchgora/248/])
NUTCH-1353 nutchgora DomainStatistics support crawlId, counter bug and 
reformatting (Revision 1334936)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/util/domain/DomainStatistics.java


> nutchgora DomainStatistics support crawlId, counter bug and reformatting
> 
>
> Key: NUTCH-1353
> URL: https://issues.apache.org/jira/browse/NUTCH-1353
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1353.patch
>
>
> This patch fixes three issues about nutchgora DomainStatistics:
> -crawlId support (note I closed NUTCH-1290 because I thought DomainStatistics 
> was already fixed. This was not the case.)
> -A counter bug (NOT_FETCHED should be increased instead of FETCHED)
> -reformatting (convert tabs to spaces and clear unused imports)





[jira] [Commented] (NUTCH-1354) nutchgora support fetcher.queue.depth.multiplier property

2012-05-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271071#comment-13271071
 ] 

Hudson commented on NUTCH-1354:
---

Integrated in Nutch-nutchgora #248 (See 
[https://builds.apache.org/job/Nutch-nutchgora/248/])
NUTCH-1354 nutchgora support fetcher.queue.depth.multiplier property 
(Revision 1334945)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/nutch-default.xml
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java


> nutchgora support fetcher.queue.depth.multiplier property
> -
>
> Key: NUTCH-1354
> URL: https://issues.apache.org/jira/browse/NUTCH-1354
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Ferdy Galema
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1354.patch
>
>
> Like trunk, nutchgora should support fetcher.queue.depth.multiplier property 
> too.





[jira] [Commented] (NUTCH-1355) nutchgora Configure minimum throughput for fetcher

2012-05-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271072#comment-13271072
 ] 

Hudson commented on NUTCH-1355:
---

Integrated in Nutch-nutchgora #248 (See 
[https://builds.apache.org/job/Nutch-nutchgora/248/])
NUTCH-1355 nutchgora Configure minimum throughput for fetcher (Revision 
1335063)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/nutch-default.xml
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java


> nutchgora Configure minimum throughput for fetcher
> --
>
> Key: NUTCH-1355
> URL: https://issues.apache.org/jira/browse/NUTCH-1355
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Ferdy Galema
> Fix For: nutchgora
>
> Attachments: NUTCH-1355.patch
>
>
> Like trunk, nutchgora should also have a feature to configure the fetcher 
> with a minimum throughput. (See NUTCH-1067 for the work done by Markus).
> It's implemented in almost the same way, except that the number of times 
> throughput falls below threshold is measured sequentially. (The counter is 
> reset when throughput is healthy again; this should work even better against 
> temporary dips).
> Defaults to disabled. Will commit later today if there is no objection.
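
The sequential counting described above might look roughly like this. ThroughputMonitor, its constructor parameters and the choice to stop once the limit is reached (rather than exceeded) are assumptions made for illustration; the description only states that the counter is reset whenever throughput is healthy again.

```java
/** Hypothetical monitor; names and parameters are illustrative only. */
class ThroughputMonitor {
  private final int minPagesPerSecond;
  private final int maxSequentialDips;
  private int sequentialDips = 0;

  ThroughputMonitor(int minPagesPerSecond, int maxSequentialDips) {
    this.minPagesPerSecond = minPagesPerSecond;
    this.maxSequentialDips = maxSequentialDips;
  }

  /**
   * Records one throughput measurement.
   * Returns true once the number of consecutive below-threshold
   * measurements reaches the configured limit.
   */
  boolean record(int pagesPerSecond) {
    if (pagesPerSecond < minPagesPerSecond) {
      sequentialDips++;
    } else {
      // A healthy measurement resets the counter, so temporary dips are forgiven.
      sequentialDips = 0;
    }
    return sequentialDips >= maxSequentialDips;
  }

  public static void main(String[] args) {
    ThroughputMonitor monitor = new ThroughputMonitor(10, 3);
    int[] samples = {5, 5, 20, 5, 5, 5};
    for (int sample : samples) {
      System.out.println(sample + " pages/s, stop = " + monitor.record(sample));
    }
  }
}
```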





[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-05-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271073#comment-13271073
 ] 

Hudson commented on NUTCH-1356:
---

Integrated in Nutch-nutchgora #248 (See 
[https://builds.apache.org/job/Nutch-nutchgora/248/])
NUTCH-1356 ParseUtil use ExecutorService instead of manually thread 
handling. (Revision 1335065)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParseUtil.java


> ParseUtil use ExecutorService instead of manually thread handling.
> --
>
> Key: NUTCH-1356
> URL: https://issues.apache.org/jira/browse/NUTCH-1356
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, 
> NUTCH-1356.patch
>
>
> Because ParseUtil manages its own parser threads by creating a thread for 
> every parse, specific parsers sometimes turn out to be very expensive. 
> For example, parsers that have threadlocal fields will initialize them for 
> every item to be parsed.
> By simply introducing a caching ExecutorService, the ParseUtil will be able to 
> cache threads, making parsing more efficient. See attached patch.
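
A minimal sketch of the idea, assuming a cached thread pool from java.util.concurrent and a per-task time limit. PooledParseRunner and runWithTimeout are hypothetical names, and the timeout handling is an assumption of this sketch rather than something stated in the description; the attached patch may differ.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; not the actual ParseUtil implementation. */
class PooledParseRunner {
  // Cached pool: idle worker threads are reused instead of creating one per parse,
  // so per-thread state (such as threadlocal fields) survives between tasks.
  private final ExecutorService executor = Executors.newCachedThreadPool();

  /** Runs the task with a time limit; returns null if it times out or fails. */
  <T> T runWithTimeout(Callable<T> task, long timeoutMillis) {
    Future<T> future = executor.submit(task);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException | ExecutionException e) {
      future.cancel(true);
      return null;
    } catch (InterruptedException e) {
      future.cancel(true);
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return null;
    }
  }

  void shutdown() {
    executor.shutdownNow();
  }

  public static void main(String[] args) {
    PooledParseRunner runner = new PooledParseRunner();
    System.out.println(runner.runWithTimeout(() -> "parsed", 1000));
    runner.shutdown();
  }
}
```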





[jira] [Commented] (NUTCH-1358) Do not accept bogus arguments

2012-05-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273026#comment-13273026
 ] 

Hudson commented on NUTCH-1358:
---

Integrated in Nutch-nutchgora #249 (See 
[https://builds.apache.org/job/Nutch-nutchgora/249/])
NUTCH-1358 Do not accept bogus arguments (Revision 1336204)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/DbUpdaterJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/InjectorJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java


> Do not accept bogus arguments
> -
>
> Key: NUTCH-1358
> URL: https://issues.apache.org/jira/browse/NUTCH-1358
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
>Priority: Minor
> Fix For: nutchgora
>
> Attachments: NUTCH-1358.patch
>
>
> Some of the tools do not explicitly check every passed argument for 
> validity. This can mask very frustrating issues because one passes wrong 
> arguments and the tool does not fail fast.
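
One way to fail fast is to reject anything the argument loop does not recognise. The sketch below is illustrative only: StrictArgs and the option names it accepts are made up for the example and are not the options of any particular Nutch tool.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical parser; the option names are invented for illustration. */
class StrictArgs {
  static Map<String, String> parse(String[] args) {
    Map<String, String> options = new HashMap<>();
    for (int i = 0; i < args.length; i++) {
      String name = args[i];
      switch (name) {
        case "-crawlId":
        case "-threads":
          // Options that take a value must be followed by one.
          if (i + 1 >= args.length) {
            throw new IllegalArgumentException("missing value for " + name);
          }
          options.put(name, args[++i]);
          break;
        case "-resume":
          options.put(name, "true");
          break;
        default:
          // Fail fast rather than silently ignoring a mistyped option.
          throw new IllegalArgumentException("unrecognized arg " + name);
      }
    }
    return options;
  }

  public static void main(String[] args) {
    System.out.println(parse(new String[] {"-threads", "10", "-resume"}));
  }
}
```

With this shape, a typo such as "-thread 10" raises an error immediately instead of running the job with default settings.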





[jira] [Commented] (NUTCH-1026) Strip UTF-8 non-character codepoints

2012-05-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273027#comment-13273027
 ] 

Hudson commented on NUTCH-1026:
---

Integrated in Nutch-nutchgora #249 (See 
[https://builds.apache.org/job/Nutch-nutchgora/249/])
NUTCH-1026 Strip UTF-8 non-character codepoints (Revision 1336643)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/log4j.properties
* /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/solr/SolrWriter.java


> Strip UTF-8 non-character codepoints
> 
>
> Key: NUTCH-1026
> URL: https://issues.apache.org/jira/browse/NUTCH-1026
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: nutchgora
>Reporter: Markus Jelsma
> Fix For: nutchgora
>
>
> During a very large crawl I found a few documents producing non-character 
> codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class 
> java.io.CharConversionException] Invalid UTF-8 character 0x at char 
> #1142033, byte #1155068)
> at 
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> at 
> com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the 
> content field to a method to strip away non-characters. I'm not too sure 
> about this implementation but the tests I've done locally with a huge dataset 
> now pass correctly. Here's a list of codepoints to strip away: 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!
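
A stripping method along these lines could be sketched as below. It is not the SolrWriter change itself: NonCharacterStripper is a hypothetical class, and the test it applies is the Unicode definition of noncharacters (U+FDD0 to U+FDEF, plus every code point whose low 16 bits are FFFE or FFFF).

```java
/** Hypothetical helper; not the method added to SolrWriter. */
class NonCharacterStripper {
  /** True for the 66 Unicode noncharacter code points. */
  static boolean isNonCharacter(int codePoint) {
    return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
        || (codePoint & 0xFFFE) == 0xFFFE;
  }

  /** Returns the input with all noncharacter code points removed. */
  static String strip(String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); ) {
      // Walk by code point so supplementary characters are handled as a unit.
      int cp = input.codePointAt(i);
      i += Character.charCount(cp);
      if (!isNonCharacter(cp)) {
        out.appendCodePoint(cp);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String dirty = "a" + new String(Character.toChars(0xFFFF)) + "b";
    System.out.println(strip(dirty).length());
  }
}
```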





[jira] [Commented] (NUTCH-1362) Fix error handling of urls with empty fields

2012-05-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273829#comment-13273829
 ] 

Hudson commented on NUTCH-1362:
---

Integrated in Nutch-nutchgora #250 (See 
[https://builds.apache.org/job/Nutch-nutchgora/250/])
NUTCH-1362 Fix error handling of urls with empty fields (Revision 1337091)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/util/TableUtil.java


> Fix error handling of urls with empty fields 
> -
>
> Key: NUTCH-1362
> URL: https://issues.apache.org/jira/browse/NUTCH-1362
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1362.patch
>
>
> Within o.a.n.util.TableUtil.reverseAppendSplits() a simple if (split.length > 
> 0) block enables us to address this issue.
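
The guard described above can be sketched as follows. This is a hypothetical
stand-alone version, not the code in TableUtil: the method name reverseHost
and the use of String.split are assumptions made for the example. It shows
why an empty split array needs to be checked before element 0 is read.

```java
class ReverseSplitsExample {

  /** Reverses the dot-separated parts of host, e.g. a.b.c becomes c.b.a. */
  static String reverseHost(String host) {
    String[] splits = host.split("\\.");
    StringBuilder buf = new StringBuilder();
    // Without this guard, an input such as "." (which splits into an
    // empty array) would fail on splits[0] with an
    // ArrayIndexOutOfBoundsException.
    if (splits.length > 0) {
      for (int i = splits.length - 1; i > 0; i--) {
        buf.append(splits[i]).append('.');
      }
      buf.append(splits[0]);
    }
    return buf.toString();
  }

  public static void main(String[] args) {
    System.out.println(reverseHost("www.example.org"));
    System.out.println("[" + reverseHost(".") + "]");
  }
}
```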





[jira] [Commented] (NUTCH-1366) speed up indexing by eliminating the indexreducer

2012-05-14 Thread Hudson (JIRA)

Hudson commented on NUTCH-1366:
---

Integrated in Nutch-nutchgora #253 (See https://builds.apache.org/job/Nutch-nutchgora/253/)
NUTCH-1366 speed up indexing by eliminating the indexreducer (Revision 1338217)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexUtil.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerReducer.java







[jira] [Commented] (NUTCH-1378) HostDb NullPointerException

2012-05-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282201#comment-13282201
 ] 

Hudson commented on NUTCH-1378:
---

Integrated in Nutch-nutchgora #262 (See 
[https://builds.apache.org/job/Nutch-nutchgora/262/])
NUTCH-1378 HostDb NullPointerException (Revision 1341879)

 Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDb.java


> HostDb NullPointerException
> ---
>
> Key: NUTCH-1378
> URL: https://issues.apache.org/jira/browse/NUTCH-1378
> Project: Nutch
>  Issue Type: Bug
>Reporter: Ferdy Galema
> Fix For: nutchgora
>
> Attachments: NUTCH-1378.patch
>
>
> This is a no-brainer fix for an NPE when using the HostDb functionality. Will 
> attach patch and commit right away.





[jira] [Commented] (NUTCH-1381) Allow to override default subcollection field name

2012-06-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291218#comment-13291218
 ] 

Hudson commented on NUTCH-1381:
---

Integrated in nutch-trunk-maven #299 (See 
[https://builds.apache.org/job/nutch-trunk-maven/299/])
NUTCH-1381 Allow to override default subcollection field name (Revision 
1347744)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* 
/nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java


> Allow to override default subcollection field name
> --
>
> Key: NUTCH-1381
> URL: https://issues.apache.org/jira/browse/NUTCH-1381
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1381-1.6-1.patch
>
>
> The subcollection filter uses the subcollection field name by default, but 
> since NUTCH-1266 it can be overridden per subcollection. This issue should 
> introduce a configuration directive to override the default field name 
> globally.





[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-06-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291219#comment-13291219
 ] 

Hudson commented on NUTCH-1320:
---

Integrated in nutch-trunk-maven #299 (See 
[https://builds.apache.org/job/nutch-trunk-maven/299/])
NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java


> IndexChecker and ParseChecker choke on IDN's
> 
>
> Key: NUTCH-1320
> URL: https://issues.apache.org/jira/browse/NUTCH-1320
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1320-1.5-1.patch
>
>
> These handy debug tools do not handle IDN's and throw an NPE
> bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
> {code}
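
As a rough illustration of the underlying problem, the sketch below converts
the host part of a URL to its ASCII (Punycode) form with the JDK's
java.net.IDN class before the URL is used further. It is not the committed
change to URLUtil; the class and method names here are hypothetical.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URL;

class IdnUrlExample {

  /** Returns url with its host converted to ASCII; path and query are kept. */
  static String toAsciiHost(String url) {
    URL u;
    try {
      u = new URL(url);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + url, e);
    }
    StringBuilder sb = new StringBuilder();
    sb.append(u.getProtocol()).append("://").append(IDN.toASCII(u.getHost()));
    if (u.getPort() != -1) {
      sb.append(':').append(u.getPort());
    }
    sb.append(u.getFile());            // path plus query string, if any
    if (u.getRef() != null) {
      sb.append('#').append(u.getRef());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Same URL as in the report, with the host written as Unicode escapes
    String idn = "http://\u4F8B\u5B50.\u6E2C\u8A66/%E9%A6%96%E9%A0%81";
    System.out.println(toAsciiHost(idn));
  }
}
```

Hosts that are already ASCII pass through IDN.toASCII unchanged, so the
conversion can be applied to every input URL.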





[jira] [Commented] (NUTCH-1351) DomainStatistics to aggregate by TLD

2012-06-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291220#comment-13291220
 ] 

Hudson commented on NUTCH-1351:
---

Integrated in nutch-trunk-maven #299 (See 
[https://builds.apache.org/job/nutch-trunk-maven/299/])
NUTCH-1351 DomainStatistics to aggregate by TLD (Revision 1347747)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
* /nutch/trunk/src/java/org/apache/nutch/util/domain/DomainStatistics.java


> DomainStatistics to aggregate by TLD
> 
>
> Key: NUTCH-1351
> URL: https://issues.apache.org/jira/browse/NUTCH-1351
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1351-1.6-1.patch
>
>
> The DomainStatistics tool aggregates counts by host, domain or suffix, but 
> aggregation by TLD is missing.





[jira] [Commented] (NUTCH-1346) Follow outlinks to ignore external

2012-06-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291596#comment-13291596
 ] 

Hudson commented on NUTCH-1346:
---

Integrated in nutch-trunk-maven #301 (See 
[https://builds.apache.org/job/nutch-trunk-maven/301/])
NUTCH-1346 Follow outlinks to ignore external (Revision 1347897)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java


> Follow outlinks to ignore external
> --
>
> Key: NUTCH-1346
> URL: https://issues.apache.org/jira/browse/NUTCH-1346
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1346-1.6-1.patch
>
>
> The follow outlinks feature already respects the db.ignore.external.links 
> setting. However, this means that outlinks of fetched pages that are external 
> are not saved in parse data. There should be a new setting to prevent the 
> outlink follower from going external but still storing external outlinks.





[jira] [Commented] (NUTCH-1336) Optionally not index db_notmodified pages

2012-06-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291628#comment-13291628
 ] 

Hudson commented on NUTCH-1336:
---

Integrated in nutch-trunk-maven #302 (See 
[https://builds.apache.org/job/nutch-trunk-maven/302/])
NUTCH-1336 Optionally not index db_notmodified pages (Revision 1347909)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java


> Optionally not index db_notmodified pages
> -
>
> Key: NUTCH-1336
> URL: https://issues.apache.org/jira/browse/NUTCH-1336
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1336-1.6-1.patch
>
>
> IndexerMapReduce already skips pages with fetch_notmodified as status. 
> However, despite the fetch status, we may still consider a page not modified 
> if status is db_notmodified.





[jira] [Commented] (NUTCH-1381) Allow to override default subcollection field name

2012-06-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291682#comment-13291682
 ] 

Hudson commented on NUTCH-1381:
---

Integrated in Nutch-trunk #1865 (See 
[https://builds.apache.org/job/Nutch-trunk/1865/])
NUTCH-1381 Allow to override default subcollection field name (Revision 
1347744)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347744
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* 
/nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java


> Allow to override default subcollection field name
> --
>
> Key: NUTCH-1381
> URL: https://issues.apache.org/jira/browse/NUTCH-1381
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1381-1.6-1.patch
>
>
> The subcollection filter uses the subcollection field name by default, but 
> since NUTCH-1266 it can be overridden per subcollection. This issue should 
> introduce a configuration directive to override the default field name 
> globally.





[jira] [Commented] (NUTCH-1336) Optionally not index db_notmodified pages

2012-06-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291686#comment-13291686
 ] 

Hudson commented on NUTCH-1336:
---

Integrated in Nutch-trunk #1865 (See 
[https://builds.apache.org/job/Nutch-trunk/1865/])
NUTCH-1336 Optionally not index db_notmodified pages (Revision 1347909)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347909
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java


> Optionally not index db_notmodified pages
> -
>
> Key: NUTCH-1336
> URL: https://issues.apache.org/jira/browse/NUTCH-1336
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1336-1.6-1.patch
>
>
> IndexerMapReduce already skips pages with fetch_notmodified as status. 
> However, despite the fetch status, we may still consider a page not modified 
> if status is db_notmodified.





[jira] [Commented] (NUTCH-1346) Follow outlinks to ignore external

2012-06-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291684#comment-13291684
 ] 

Hudson commented on NUTCH-1346:
---

Integrated in Nutch-trunk #1865 (See 
[https://builds.apache.org/job/Nutch-trunk/1865/])
NUTCH-1346 Follow outlinks to ignore external (Revision 1347897)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347897
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java


> Follow outlinks to ignore external
> --
>
> Key: NUTCH-1346
> URL: https://issues.apache.org/jira/browse/NUTCH-1346
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1346-1.6-1.patch
>
>
> The follow outlinks feature already respects the db.ignore.external.links 
> setting. However, this means that outlinks of fetched pages that are external 
> are not saved in parse data. There should be a new setting to prevent the 
> outlink follower from going external but still storing external outlinks.





[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's

2012-06-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291683#comment-13291683
 ] 

Hudson commented on NUTCH-1320:
---

Integrated in Nutch-trunk #1865 (See 
[https://builds.apache.org/job/Nutch-trunk/1865/])
NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347755
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java


> IndexChecker and ParseChecker choke on IDN's
> 
>
> Key: NUTCH-1320
> URL: https://issues.apache.org/jira/browse/NUTCH-1320
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1320-1.5-1.patch
>
>
> These handy debug tools do not handle IDN's and throw an NPE
> bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at 
> org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116)
> {code}





[jira] [Commented] (NUTCH-1351) DomainStatistics to aggregate by TLD

2012-06-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291685#comment-13291685
 ] 

Hudson commented on NUTCH-1351:
---

Integrated in Nutch-trunk #1865 (See 
[https://builds.apache.org/job/Nutch-trunk/1865/])
NUTCH-1351 DomainStatistics to aggregate by TLD (Revision 1347747)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347747
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java
* /nutch/trunk/src/java/org/apache/nutch/util/domain/DomainStatistics.java


> DomainStatistics to aggregate by TLD
> 
>
> Key: NUTCH-1351
> URL: https://issues.apache.org/jira/browse/NUTCH-1351
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1351-1.6-1.patch
>
>
> The DomainStatistics tool aggregates counts by host, domain or suffix, but 
> aggregation by TLD is missing.





[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293039#comment-13293039
 ] 

Hudson commented on NUTCH-1262:
---

Integrated in nutch-trunk-maven #306 (See 
[https://builds.apache.org/job/nutch-trunk-maven/306/])
NUTCH-1262 Map `duplicating` content-types to a single type (Revision 
1348785)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* 
/nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java


> Map `duplicating` content-types to a single type
> 
>
> Key: NUTCH-1262
> URL: https://issues.apache.org/jira/browse/NUTCH-1262
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1262-1.5-1.patch, NUTCH-1262-1.5-2.patch
>
>
> Similar or duplicating content-types can end-up differently in an index. 
> With, for example, both application/xhtml+xml and text/html it is impossible 
> to use a single filter to select `web pages`.
> See also: 
> http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html
> Content-Type mapping is disabled by default and is enabled via 
> moreIndexingFilter.mapMimeTypes. Example mapping file is provided in conf/.
> {code}
> # target MIME-type  type1 [ type2 ...]
> # Map XHTML to HTML
> text/html   application/xhtml+xml
> # Map XHTML and HTML to something else
> Web page    text/html   application/xhtml+xml
> # Map some office documents to each other
> Office document application/vnd.oasis.opendocument.text application/x-tika-msoffice
> {code}
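
To make the mapping format concrete, here is a small sketch of how such a
file could be read into a lookup table from source type to target type. It is
not the MoreIndexingFilter implementation. The class and method names are
hypothetical, and it assumes fields are separated by tabs, since targets such
as "Web page" contain spaces.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MimeMapExample {

  /** Parses mapping lines into a map of source MIME type to target type. */
  static Map<String, String> parse(List<String> lines) {
    Map<String, String> map = new HashMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;                       // skip blank lines and comments
      }
      String[] parts = trimmed.split("\t+");
      // parts[0] is the target; every following field maps to it
      for (int i = 1; i < parts.length; i++) {
        map.put(parts[i].trim(), parts[0].trim());
      }
    }
    return map;
  }

  public static void main(String[] args) {
    Map<String, String> map = parse(java.util.Arrays.asList(
        "# Map XHTML to HTML",
        "text/html\tapplication/xhtml+xml"));
    System.out.println(map);
  }
}
```

If the same source type appears on more than one line, the later line wins in
this sketch; the real filter may resolve such conflicts differently.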





[jira] [Commented] (NUTCH-1385) More robust plug-in order properties in "nutch-site.xml"

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293040#comment-13293040
 ] 

Hudson commented on NUTCH-1385:
---

Integrated in nutch-trunk-maven #306 (See 
[https://builds.apache.org/job/nutch-trunk-maven/306/])
NUTCH-1385 More robust plug-in order properties in nutch-site.xml (Revision 
1348764)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java
* /nutch/trunk/src/java/org/apache/nutch/parse/HtmlParseFilters.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java


> More robust plug-in order properties in "nutch-site.xml"
> 
>
> Key: NUTCH-1385
> URL: https://issues.apache.org/jira/browse/NUTCH-1385
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, parser
>Affects Versions: 1.5
>Reporter: Andy Xue
>Assignee: Markus Jelsma
>Priority: Minor
>  Labels: filter
> Fix For: 1.6
>
> Attachments: nutch-1385.txt
>
>
> When listing multiple scoring filters in certain properties (listed below) in 
> "nutch-site.xml", it is vital that no spaces/newlines/tabs are placed in 
> front of the value content.
> E.g.:
> This is fine:
> org.apache.nutch.scoring.opic.OPICScoringFilter myFilter
> Either of these will generate an exception:
>  org.apache.nutch.scoring.opic.OPICScoringFilter myFilter
> 
> org.apache.nutch.scoring.opic.OPICScoringFilter
> myFilter
> 
> Affects these properties in "nutch-site.xml":
> * indexingfilter.order
> * urlnormalizer.order
> * urlfilter.order
> * htmlparsefilter.order
> * scoring.filter.order
> Solution: replaced {order.split("\\s+")} with {order.trim().split("\\s+")}. 
> Patch provided.
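
The effect of the missing trim() can be reproduced in isolation. The sketch
below uses hypothetical class and method names and only the JDK's
String.split. With leading whitespace, the untrimmed split yields an empty
first element, which would then be treated as a (nonexistent) filter class
name.

```java
class OrderSplitExample {

  /** Splits as before the fix: leading whitespace produces an empty token. */
  static String[] splitUntrimmed(String order) {
    return order.split("\\s+");
  }

  /** Splits as after the fix: surrounding whitespace is removed first. */
  static String[] splitTrimmed(String order) {
    return order.trim().split("\\s+");
  }

  public static void main(String[] args) {
    String order = "\n org.apache.nutch.scoring.opic.OPICScoringFilter\n myFilter\n";
    System.out.println(splitUntrimmed(order).length);
    System.out.println(splitTrimmed(order).length);
  }
}
```

Trailing whitespace alone is harmless here, because String.split drops
trailing empty strings; only leading whitespace produces the extra token.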





[jira] [Commented] (NUTCH-1384) Typo in ParseSegment's run-method

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293041#comment-13293041
 ] 

Hudson commented on NUTCH-1384:
---

Integrated in nutch-trunk-maven #306 (See 
[https://builds.apache.org/job/nutch-trunk-maven/306/])
NUTCH-1384 Typo in ParseSegments's run-method (Revision 1348766)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java


> Typo in ParseSegment's run-method
> -
>
> Key: NUTCH-1384
> URL: https://issues.apache.org/jira/browse/NUTCH-1384
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Matthias Agethle
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.6
>
>
> In the class org.apache.nutch.parse.ParseSegment there's a typo in the 
> run-method: instead of checking whether "-noFilter" was specified on the 
> command-line, the code looks for "-noilter" (missing f, line 234).





[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293085#comment-13293085
 ] 

Hudson commented on NUTCH-1360:
---

Integrated in nutch-trunk-maven #307 (See 
[https://builds.apache.org/job/nutch-trunk-maven/307/])
commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* 
/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* 
/nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> Suport the storing of IP address connected to when web crawling
> ---
>
> Key: NUTCH-1360
> URL: https://issues.apache.org/jira/browse/NUTCH-1360
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1360-nutchgora-v2.patch, 
> NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host to 
> which we connect when fetching a page.





[jira] [Commented] (NUTCH-1364) Add a counter in Generator for malformed urls

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293245#comment-13293245
 ] 

Hudson commented on NUTCH-1364:
---

Integrated in nutch-trunk-maven #308 (See 
[https://builds.apache.org/job/nutch-trunk-maven/308/])
commit to address NUTCH-1364 and update to CHANGES.txt (Revision 1349076)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java


> Add a counter in Generator for malformed urls
> -
>
> Key: NUTCH-1364
> URL: https://issues.apache.org/jira/browse/NUTCH-1364
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1364-nutchgora.patch, NUTCH-1364-trunk.patch
>
>
> This is a simple mechanism for counting the number of malformed urls we 
> encounter within the Generator. 
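
The idea can be sketched without Hadoop as follows. This is not the Generator
code: the class and method names are hypothetical, and a plain long stands in
for whatever job counter the patch increments. A URL counts as malformed here
when java.net.URL rejects it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

class MalformedUrlCounter {

  /** Returns how many of the given strings cannot be parsed as a URL. */
  static long countMalformed(List<String> urls) {
    long malformed = 0;
    for (String s : urls) {
      try {
        new URL(s);                     // throws if no known protocol is found
      } catch (MalformedURLException e) {
        malformed++;                    // count it instead of failing the job
      }
    }
    return malformed;
  }

  public static void main(String[] args) {
    System.out.println(countMalformed(java.util.Arrays.asList(
        "http://example.org/", "no-scheme", "htp//x")));
  }
}
```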





[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1329#comment-1329
 ] 

Hudson commented on NUTCH-1360:
---

Integrated in Nutch-trunk #1868 (See 
[https://builds.apache.org/job/Nutch-trunk/1868/])
commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993)

 Result = SUCCESS
lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348993
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* 
/nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* 
/nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> Suport the storing of IP address connected to when web crawling
> ---
>
> Key: NUTCH-1360
> URL: https://issues.apache.org/jira/browse/NUTCH-1360
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1360-nutchgora-v2.patch, 
> NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch
>
>
> Simple issue enabling us to capture the specific IP address of the host to 
> which we connect when fetching a page.





[jira] [Commented] (NUTCH-1385) More robust plug-in order properties in "nutch-site.xml"

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293336#comment-13293336
 ] 

Hudson commented on NUTCH-1385:
---

Integrated in Nutch-trunk #1868 (See 
[https://builds.apache.org/job/Nutch-trunk/1868/])
NUTCH-1385 More robust plug-in order properties in nutch-site.xml (Revision 
1348764)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348764
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java
* /nutch/trunk/src/java/org/apache/nutch/parse/HtmlParseFilters.java
* /nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java


> More robust plug-in order properties in "nutch-site.xml"
> 
>
> Key: NUTCH-1385
> URL: https://issues.apache.org/jira/browse/NUTCH-1385
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, parser
>Affects Versions: 1.5
>Reporter: Andy Xue
>Assignee: Markus Jelsma
>Priority: Minor
>  Labels: filter
> Fix For: 1.6
>
> Attachments: nutch-1385.txt
>
>
> When listing multiple scoring filters in certain properties (listed below) in 
> "nutch-site.xml", it is vital that no spaces/newlines/tabs are placed in 
> front of the value content.
> E.g.:
> This is fine:
> org.apache.nutch.scoring.opic.OPICScoringFilter myFilter
> Either of these will generate an exception:
>  org.apache.nutch.scoring.opic.OPICScoringFilter myFilter
> 
> org.apache.nutch.scoring.opic.OPICScoringFilter
> myFilter
> 
> Affects these properties in "nutch-site.xml":
> * indexingfilter.order
> * urlnormalizer.order
> * urlfilter.order
> * htmlparsefilter.order
> * scoring.filter.order
> Solution: replaced {order.split("\\s+")} with {order.trim().split("\\s+")}. 
> Patch provided.





[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293334#comment-13293334
 ] 

Hudson commented on NUTCH-1262:
---

Integrated in Nutch-trunk #1868 (See 
[https://builds.apache.org/job/Nutch-trunk/1868/])
NUTCH-1262 Map `duplicating` content-types to a single type (Revision 
1348785)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348785
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/nutch-default.xml
* 
/nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java


> Map `duplicating` content-types to a single type
> 
>
> Key: NUTCH-1262
> URL: https://issues.apache.org/jira/browse/NUTCH-1262
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1262-1.5-1.patch, NUTCH-1262-1.5-2.patch
>
>
> Similar or duplicating content-types can end up differently in an index. 
> With, for example, both application/xhtml+xml and text/html it is impossible 
> to use a single filter to select `web pages`.
> See also: 
> http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html
> Content-Type mapping is disabled by default and is enabled via 
> moreIndexingFilter.mapMimeTypes. Example mapping file is provided in conf/.
> {code}
> # target MIME-type  type1 [ type2 ...]
> # Map XHTML to HTML
> text/html   application/xhtml+xml
> # Map XHTML and HTML to something else
> Web page   text/html   application/xhtml+xml
> # Map some office documents to each other
> Office document   application/vnd.oasis.opendocument.text   application/x-tika-msoffice
> {code}
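
As a rough illustration of how such a mapping file could be read, here is a minimal Java sketch. It is not the MoreIndexingFilter code. It assumes the fields are tab-separated (the target names above contain spaces, so a space cannot be the separator), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MimeMapSketch {
    // Builds a lookup from each source type to its target type.
    // Assumption: one rule per line, fields separated by tabs, '#' starts a comment.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> mapping = new HashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = line.split("\t");
            String target = fields[0].trim();
            for (int i = 1; i < fields.length; i++) {
                String source = fields[i].trim();
                if (!source.isEmpty()) {
                    mapping.put(source, target);
                }
            }
        }
        return mapping;
    }

    // Returns the mapped type, or the original type when no rule applies.
    static String map(Map<String, String> mapping, String type) {
        return mapping.getOrDefault(type, type);
    }

    public static void main(String[] args) {
        Map<String, String> mapping = parse(Arrays.asList(
            "# target MIME-type\ttype1 [ type2 ...]",
            "Web page\ttext/html\tapplication/xhtml+xml"));
        System.out.println(map(mapping, "application/xhtml+xml"));
        System.out.println(map(mapping, "image/png"));
    }
}
```

With a single lookup from source type to target type, a later rule for the same source type overwrites an earlier one; whether the real filter behaves that way is not stated in the issue.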





[jira] [Commented] (NUTCH-1384) Typo in ParseSegment's run-method

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293337#comment-13293337
 ] 

Hudson commented on NUTCH-1384:
---

Integrated in Nutch-trunk #1868 (See 
[https://builds.apache.org/job/Nutch-trunk/1868/])
NUTCH-1384 Typo in ParseSegments's run-method (Revision 1348766)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348766
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java


> Typo in ParseSegment's run-method
> -
>
> Key: NUTCH-1384
> URL: https://issues.apache.org/jira/browse/NUTCH-1384
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Matthias Agethle
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.6
>
>
> In the class org.apache.nutch.parse.ParseSegment there's a typo in the 
> run-method: instead of checking whether "-noFilter" was specified on the 
> command-line, the code looks for "-noilter" (missing f, line 234).
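
A small Java sketch of why such a typo goes unnoticed: an unrecognized flag is simply skipped, so filtering silently stays on. This is not the ParseSegment source; the class, the method, and the use of an exact string comparison are assumptions made for illustration.

```java
final class NoFilterFlagSketch {
    // Returns false only when the recognized flag appears among the arguments.
    static boolean shouldFilter(String[] args, String recognizedFlag) {
        boolean filter = true;
        for (String arg : args) {
            if (recognizedFlag.equals(arg)) {
                filter = false;
            }
        }
        return filter;
    }

    public static void main(String[] args) {
        String[] cli = {"crawl/segments/example", "-noFilter"};
        // With the misspelled flag name the user's -noFilter has no effect.
        System.out.println(shouldFilter(cli, "-noilter"));
        // With the corrected name the flag is honoured.
        System.out.println(shouldFilter(cli, "-noFilter"));
    }
}
```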





[jira] [Commented] (NUTCH-1364) Add a counter in Generator for malformed urls

2012-06-11 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293335#comment-13293335
 ] 

Hudson commented on NUTCH-1364:
---

Integrated in Nutch-trunk #1868 (See 
[https://builds.apache.org/job/Nutch-trunk/1868/])
commit to address NUTCH-1364 and update to CHANGES.txt (Revision 1349076)

 Result = SUCCESS
lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349076
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java


> Add a counter in Generator for malformed urls
> -
>
> Key: NUTCH-1364
> URL: https://issues.apache.org/jira/browse/NUTCH-1364
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1364-nutchgora.patch, NUTCH-1364-trunk.patch
>
>
> This is a simple mechanism for counting the number of malformed urls we 
> encounter within the Generator. 





[jira] [Commented] (NUTCH-1330) OutlinkDB to preserve back up

2012-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293545#comment-13293545
 ] 

Hudson commented on NUTCH-1330:
---

Integrated in nutch-trunk-maven #310 (See 
[https://builds.apache.org/job/nutch-trunk-maven/310/])
NUTCH-1330 WebGraph OutlinkDB to preserve back up (Revision 1349240)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java


> OutlinkDB to preserve back up
> -
>
> Key: NUTCH-1330
> URL: https://issues.apache.org/jira/browse/NUTCH-1330
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1330-1.6-1.patch, NUTCH-1330-1.6-2.patch
>
>
> The webgraph's outlinkDB is the single source for all scoring jobs and GBs 
> that eventually come out. In case of disaster, which has not happened yet, it 
> should be able to preserve a backup just like the other DBs. This means users 
> with an existing outlinkdb must move it from crawl/webgraphdb/outlinks/ to 
> crawl/webgraphdb/outlinks/current/.





[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293543#comment-13293543
 ] 

Hudson commented on NUTCH-1024:
---

Integrated in nutch-trunk-maven #310 (See 
[https://builds.apache.org/job/nutch-trunk-maven/310/])
NUTCH-1024 Dynamically set fetchInterval by MIME-type (Revision 1349226)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/adaptive-mimetypes.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java


> Dynamically set fetchInterval by MIME-type
> --
>
> Key: NUTCH-1024
> URL: https://issues.apache.org/jira/browse/NUTCH-1024
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: AdaptiveFetchSchedule.patch, 
> MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
> NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, 
> adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. 
> This is useful for conserving resources for files that are known to change 
> frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config
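
The three points above can be sketched in Java as follows. This is only an illustration under assumptions: the value column is taken to be an interval in seconds, and the class, method, and parameter names are hypothetical rather than those of MimeAdaptiveFetchSchedule.

```java
import java.util.HashMap;
import java.util.Map;

final class MimeIntervalSketch {
    // Parses "key\tvalue\n" text into MIME-type -> interval (assumed to be seconds).
    // A malformed number would throw NumberFormatException; the sketch does not handle that.
    static Map<String, Integer> parse(String text) {
        Map<String, Integer> intervals = new HashMap<>();
        for (String line : text.split("\n")) {
            String[] kv = line.split("\t");
            if (kv.length != 2 || line.startsWith("#")) {
                continue; // skip blank, malformed, and comment lines
            }
            intervals.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
        }
        return intervals;
    }

    // Only new documents get a MIME-specific interval, and it never exceeds the maximum.
    static int intervalFor(Map<String, Integer> intervals, String mimeType,
                           boolean isNew, int currentInterval, int maxInterval) {
        if (!isNew) {
            return currentInterval;
        }
        Integer configured = intervals.get(mimeType);
        int chosen = (configured == null) ? currentInterval : configured;
        return Math.min(chosen, maxInterval);
    }

    public static void main(String[] args) {
        Map<String, Integer> intervals = parse("text/html\t3600\napplication/pdf\t2592000\n");
        System.out.println(intervalFor(intervals, "text/html", true, 86400, 604800));
        System.out.println(intervalFor(intervals, "application/pdf", true, 86400, 604800));
    }
}
```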





[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293546#comment-13293546
 ] 

Hudson commented on NUTCH-1352:
---

Integrated in nutch-trunk-maven #310 (See 
[https://builds.apache.org/job/nutch-trunk-maven/310/])
NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 
1349227)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
* 
/nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* 
/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java
* 
/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* 
/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java


> Improve regex urlfilters/normalizers synchronization
> 
>
> Key: NUTCH-1352
> URL: https://issues.apache.org/jira/browse/NUTCH-1352
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch
>
>
> I noticed that during fetching the fetcher threads spend a lot of time 
> blocked on a monitor because of outlink normalizing/filtering. The cause: 
> some of the regex plugins use single-lock synchronization.
> This patch improves throughput by removing synchronization locks and 
> replacing them with threadlocals where needed.
> It has been extensively tested in production. I will commit this later today 
> if there are no objections.
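
The general technique can be sketched in Java like this. It is not the code from the patch; the class name is hypothetical. A compiled Pattern is safe to share between threads, while a Matcher is not, so giving each thread its own Matcher removes the need for a shared lock.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreadLocalRuleSketch {
    // One Matcher per thread, created lazily on first use in that thread.
    private final ThreadLocal<Matcher> matcher;

    ThreadLocalRuleSketch(String regex) {
        final Pattern pattern = Pattern.compile(regex);
        this.matcher = ThreadLocal.withInitial(() -> pattern.matcher(""));
    }

    // No synchronized block: each thread resets and reuses its own Matcher.
    boolean matches(String url) {
        return matcher.get().reset(url).find();
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadLocalRuleSketch rule = new ThreadLocalRuleSketch("\\.(gif|jpg|png)$");
        Runnable work = () -> System.out.println(
            Thread.currentThread().getName() + " " + rule.matches("http://example.org/logo.png"));
        Thread a = new Thread(work, "fetcher-a");
        Thread b = new Thread(work, "fetcher-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

The trade-off is memory: every thread that calls the rule holds its own Matcher for as long as the thread lives.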





[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

2012-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293544#comment-13293544
 ] 

Hudson commented on NUTCH-1300:
---

Integrated in nutch-trunk-maven #310 (See 
[https://builds.apache.org/job/nutch-trunk-maven/310/])
NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java


> Indexer to normalize URL's
> --
>
> Key: NUTCH-1300
> URL: https://issues.apache.org/jira/browse/NUTCH-1300
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1300-1.5-1.patch
>
>
> Indexers should be able to normalize URL's. This is useful when a new 
> normalizer is applied to the entire CrawlDB. Without it, some or all records 
> in a segment cannot be indexed at all.





[jira] [Commented] (NUTCH-1386) Headings filter not to add empty values

2012-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293547#comment-13293547
 ] 

Hudson commented on NUTCH-1386:
---

Integrated in nutch-trunk-maven #310 (See 
[https://builds.apache.org/job/nutch-trunk-maven/310/])
NUTCH-1386 Headings filter not to add empty values (Revision 1349233)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java


> Headings filter not to add empty values
> ---
>
> Key: NUTCH-1386
> URL: https://issues.apache.org/jira/browse/NUTCH-1386
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
>
> Headings filter can add empty values and doesn't trim the headings.





[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293548#comment-13293548
 ] 

Hudson commented on NUTCH-1356:
---

Integrated in nutch-trunk-maven #310 (See 
[https://builds.apache.org/job/nutch-trunk-maven/310/])
NUTCH-1356 ParseUtil use ExecutorService instead of manually thread 
handling (Revision 1349230)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseUtil.java


> ParseUtil use ExecutorService instead of manually thread handling.
> --
>
> Key: NUTCH-1356
> URL: https://issues.apache.org/jira/browse/NUTCH-1356
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, 
> NUTCH-1356.patch
>
>
> Because ParseUtil manages its own parser threads by creating a thread for 
> every parse, specific parsers can become very expensive. 
> For example, parsers that have threadlocal fields will initialize them for 
> every item to be parsed.
> By simply introducing a caching ExecutorService, ParseUtil will be able to 
> reuse threads, making parsing more efficient. See attached patch.
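
A minimal Java sketch of the idea, using the standard Executors.newCachedThreadPool. It is not the attached patch: the class name, the timeout parameter, and the choice to return null on failure are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class CachedParseSketch {
    // Idle threads are reused for later tasks instead of being created per parse.
    // Daemon threads are used so the pool does not keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parse-sketch");
        t.setDaemon(true);
        return t;
    });

    // Runs the task on a pooled thread; returns null on timeout or failure.
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        Future<String> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the task that ran too long
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            return null; // the task itself threw
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(() -> "parsed", 1000));
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(5000);
            return "too late";
        }, 50));
    }
}
```

Because pooled threads survive between tasks, any threadlocal state a parser sets up is initialized once per thread rather than once per parsed item.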





[jira] [Commented] (NUTCH-1319) HostNormalizer

2012-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293549#comment-13293549
 ] 

Hudson commented on NUTCH-1319:
---

Integrated in nutch-trunk-maven #310 (See 
[https://builds.apache.org/job/nutch-trunk-maven/310/])
NUTCH-1319 HostNormalizer plugin (Revision 1349236)

 Result = SUCCESS
markus : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/host-urlnormalizer.txt
* /nutch/trunk/src/plugin/urlnormalizer-host
* /nutch/trunk/src/plugin/urlnormalizer-host/build.xml
* /nutch/trunk/src/plugin/urlnormalizer-host/data
* /nutch/trunk/src/plugin/urlnormalizer-host/data/hosts.txt
* /nutch/trunk/src/plugin/urlnormalizer-host/ivy.xml
* /nutch/trunk/src/plugin/urlnormalizer-host/plugin.xml
* /nutch/trunk/src/plugin/urlnormalizer-host/src
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer/host
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer/host/TestHostURLNormalizer.java


> HostNormalizer
> --
>
> Key: NUTCH-1319
> URL: https://issues.apache.org/jira/browse/NUTCH-1319
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1319-1.5-1.patch
>
>
> Nutch would benefit from having a host normalizer. A host normalizer maps a 
> given host to the desired host. A basic example is to map www.apache.org to 
> apache.org. The Apache website is one of many on the internet that has a 
> duplicate website on the same domain just because it allows both www and 
> non-www to return HTTP 200 and proper content.
> It is also able to handle wildcards such as *.example.org to example.org if 
> there are multiple sub domains that actually point to the same website.
> Large internet crawls tend to get polluted very quickly due to these 
> problems. It also leads to skewed scores in the webgraph as different 
> websites link to different versions of the same duplicate website.
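
The mapping described above, including wildcards, might look roughly like this in Java. This is a sketch and not the HostURLNormalizer plugin: the class name, the rule API, and the lookup order (exact match first, then the most specific wildcard) are assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class HostMapSketch {
    private final Map<String, String> rules = new HashMap<>();

    // source is either an exact host or a wildcard such as "*.example.org".
    void addRule(String source, String target) {
        rules.put(source.toLowerCase(Locale.ROOT), target);
    }

    // Returns the desired host, or the input host when no rule applies.
    String normalizeHost(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        String exact = rules.get(h);
        if (exact != null) {
            return exact;
        }
        // Strip leading labels one at a time and look for a wildcard rule:
        // a.b.example.org -> *.b.example.org, then *.example.org, then *.org.
        int dot = h.indexOf('.');
        while (dot >= 0) {
            String target = rules.get("*" + h.substring(dot));
            if (target != null) {
                return target;
            }
            dot = h.indexOf('.', dot + 1);
        }
        return host;
    }

    public static void main(String[] args) {
        HostMapSketch normalizer = new HostMapSketch();
        normalizer.addRule("www.apache.org", "apache.org");
        normalizer.addRule("*.example.org", "example.org");
        System.out.println(normalizer.normalizeHost("www.apache.org"));
        System.out.println(normalizer.normalizeHost("shop.example.org"));
        System.out.println(normalizer.normalizeHost("example.org"));
    }
}
```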





[jira] [Commented] (NUTCH-1398) Upgrade to Hadoop 1.0.3

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295718#comment-13295718
 ] 

Hudson commented on NUTCH-1398:
---

Integrated in nutch-trunk-maven #314 (See 
[https://builds.apache.org/job/nutch-trunk-maven/314/])
NUTCH-1398 Upgrade to Hadoop 1.0.3 (Revision 1350630)

 Result = SUCCESS
jnioche : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml


> Upgrade to Hadoop 1.0.3
> ---
>
> Key: NUTCH-1398
> URL: https://issues.apache.org/jira/browse/NUTCH-1398
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora, 1.5
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
>






[jira] [Commented] (NUTCH-1396) Upgrade to Tika 1.1

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295766#comment-13295766
 ] 

Hudson commented on NUTCH-1396:
---

Integrated in Nutch-nutchgora #281 (See 
[https://builds.apache.org/job/Nutch-nutchgora/281/])
Upgrade to Tika 1.1 NUTCH-1396 (Revision 1350580)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/ivy/ivy.xml
* /nutch/branches/nutchgora/src/java/org/apache/nutch/util/MimeUtil.java
* 
/nutch/branches/nutchgora/src/plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
* 
/nutch/branches/nutchgora/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
* 
/nutch/branches/nutchgora/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* 
/nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestMSWordParser.java
* 
/nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestOOParser.java
* 
/nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestPdfParser.java
* 
/nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestRSSParser.java
* 
/nutch/branches/nutchgora/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java


> Upgrade to Tika 1.1
> ---
>
> Key: NUTCH-1396
> URL: https://issues.apache.org/jira/browse/NUTCH-1396
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-1396.patch
>
>
> Copied code from trunk for MimeUtil and upgraded dependency to Tika 1.1 





[jira] [Commented] (NUTCH-1392) -force and -resume arguments being ignored in ParserJob

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295767#comment-13295767
 ] 

Hudson commented on NUTCH-1392:
---

Integrated in Nutch-nutchgora #281 (See 
[https://builds.apache.org/job/Nutch-nutchgora/281/])
-force and -resume arguments being ignored in ParserJob NUTCH-1392 
(Revision 1350213)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java


> -force and -resume arguments being ignored in ParserJob
> ---
>
> Key: NUTCH-1392
> URL: https://issues.apache.org/jira/browse/NUTCH-1392
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
> Fix For: nutchgora
>
> Attachments: NUTCH-1392.patch
>
>
> From the log below there is obviously something not right here as both 
> -resume and -force are passed to the CLI but blatantly ignored within the log 
> output.
> lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch parse
> Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
>     <batchId>     - symbolic batch ID created by Generator
>     -crawlId <id> - the id to prefix the schemas to operate on, 
>                     (default: storage.crawl.id)
>     -all          - consider pages from all crawl jobs
>     -resume       - resume a previous incomplete job
>     -force        - force re-parsing even if a page is already parsed
> lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch parse -all -resume -force
> ParserJob: starting
> ParserJob: resuming:  false
> ParserJob: forced reparse:false
> ParserJob: parsing all
> Parsing http://www.trancearoundtheworld.com/
> ParserJob: success





[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295798#comment-13295798
 ] 

Hudson commented on NUTCH-1300:
---

Integrated in Nutch-trunk #1869 (See 
[https://builds.apache.org/job/Nutch-trunk/1869/])
NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349262
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java
* /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java


> Indexer to normalize URL's
> --
>
> Key: NUTCH-1300
> URL: https://issues.apache.org/jira/browse/NUTCH-1300
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1300-1.5-1.patch
>
>
> Indexers should be able to normalize URL's. This is useful when a new 
> normalizer is applied to the entire CrawlDB. Without it, some or all records 
> in a segment cannot be indexed at all.





[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295803#comment-13295803
 ] 

Hudson commented on NUTCH-1356:
---

Integrated in Nutch-trunk #1869 (See 
[https://builds.apache.org/job/Nutch-trunk/1869/])
NUTCH-1356 ParseUtil use ExecutorService instead of manually thread 
handling (Revision 1349230)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349230
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/java/org/apache/nutch/parse/ParseUtil.java


> ParseUtil use ExecutorService instead of manually thread handling.
> --
>
> Key: NUTCH-1356
> URL: https://issues.apache.org/jira/browse/NUTCH-1356
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, 
> NUTCH-1356.patch
>
>
> Because ParseUtil manages its own parser threads by creating a thread for 
> every parse, specific parsers can become very expensive. 
> For example, parsers that have threadlocal fields will initialize them for 
> every item to be parsed.
> By simply introducing a caching ExecutorService, ParseUtil will be able to 
> reuse threads, making parsing more efficient. See attached patch.





[jira] [Commented] (NUTCH-1319) HostNormalizer

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295804#comment-13295804
 ] 

Hudson commented on NUTCH-1319:
---

Integrated in Nutch-trunk #1869 (See 
[https://builds.apache.org/job/Nutch-trunk/1869/])
NUTCH-1319 HostNormalizer plugin (Revision 1349236)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349236
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/host-urlnormalizer.txt
* /nutch/trunk/src/plugin/urlnormalizer-host
* /nutch/trunk/src/plugin/urlnormalizer-host/build.xml
* /nutch/trunk/src/plugin/urlnormalizer-host/data
* /nutch/trunk/src/plugin/urlnormalizer-host/data/hosts.txt
* /nutch/trunk/src/plugin/urlnormalizer-host/ivy.xml
* /nutch/trunk/src/plugin/urlnormalizer-host/plugin.xml
* /nutch/trunk/src/plugin/urlnormalizer-host/src
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch
* /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer/host
* 
/nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer/host/TestHostURLNormalizer.java


> HostNormalizer
> --
>
> Key: NUTCH-1319
> URL: https://issues.apache.org/jira/browse/NUTCH-1319
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1319-1.5-1.patch
>
>
> Nutch would benefit from having a host normalizer. A host normalizer maps a 
> given host to the desired host. A basic example is to map www.apache.org to 
> apache.org. The Apache website is one of many on the internet that has a 
> duplicate website on the same domain just because it allows both www and 
> non-www to return HTTP 200 and proper content.
> It is also able to handle wildcards such as *.example.org to example.org if 
> there are multiple sub domains that actually point to the same website.
> Large internet crawls tend to get polluted very quickly due to these 
> problems. It also leads to skewed scores in the webgraph as different 
> websites link to different versions of the same duplicate website.





[jira] [Commented] (NUTCH-1398) Upgrade to Hadoop 1.0.3

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295800#comment-13295800
 ] 

Hudson commented on NUTCH-1398:
---

Integrated in Nutch-trunk #1869 (See 
[https://builds.apache.org/job/Nutch-trunk/1869/])
NUTCH-1398 Upgrade to Hadoop 1.0.3 (Revision 1350630)

 Result = SUCCESS
jnioche : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1350630
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/ivy/ivy.xml


> Upgrade to Hadoop 1.0.3
> ---
>
> Key: NUTCH-1398
> URL: https://issues.apache.org/jira/browse/NUTCH-1398
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora, 1.5
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
>






[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295801#comment-13295801
 ] 

Hudson commented on NUTCH-1352:
---

Integrated in Nutch-trunk #1869 (See 
[https://builds.apache.org/job/Nutch-trunk/1869/])
NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 
1349227)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349227
Files : 
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
* 
/nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* 
/nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java
* 
/nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java
* 
/nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java


> Improve regex urlfilters/normalizers synchronization
> 
>
> Key: NUTCH-1352
> URL: https://issues.apache.org/jira/browse/NUTCH-1352
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: nutchgora, 1.6
>
> Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch
>
>
> I noticed that during fetching the fetcher threads spend a lot of time 
> blocked on a monitor because of outlink normalizing/filtering. The cause: 
> some of the regex plugins synchronize on a single lock.
> This patch improves throughput by removing the synchronization locks and 
> replacing them with thread-locals where needed.
> It has been extensively tested in production. I will commit this later today 
> if there are no objections.
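
The idea in the description above (drop the shared lock, give each thread its own regex state) can be sketched roughly as follows. This is not the code from the NUTCH-1352 patch: the class name ThreadLocalRegexRule and the use of java.util.regex are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from the NUTCH-1352 patch.
final class ThreadLocalRegexRule {
  private final String replacement;

  // Matcher is not thread-safe, so every thread lazily gets its own instance.
  private final ThreadLocal<Matcher> matcher;

  ThreadLocalRegexRule(String regex, String replacement) {
    // A compiled Pattern is immutable and can be shared by all threads.
    final Pattern pattern = Pattern.compile(regex);
    this.replacement = replacement;
    this.matcher = ThreadLocal.withInitial(() -> pattern.matcher(""));
  }

  // No "synchronized" here: callers never contend for a single monitor.
  String normalize(String url) {
    return matcher.get().reset(url).replaceAll(replacement);
  }

  public static void main(String[] args) {
    // Collapse repeated slashes, except the pair right after the scheme colon.
    ThreadLocalRegexRule rule = new ThreadLocalRegexRule("(?<!:)/{2,}", "/");
    System.out.println(rule.normalize("http://example.org//a///b"));
    // prints http://example.org/a/b
  }
}
```

The trade-off is one Matcher per thread per rule instead of one shared Matcher behind a lock, which costs a little memory in exchange for fetcher threads no longer waiting on each other.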





[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295797#comment-13295797
 ] 

Hudson commented on NUTCH-1024:
---

Integrated in Nutch-trunk #1869 (See 
[https://builds.apache.org/job/Nutch-trunk/1869/])
NUTCH-1024 Dynamically set fetchInterval by MIME-type (Revision 1349226)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349226
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/adaptive-mimetypes.txt
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java
* /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java


> Dynamically set fetchInterval by MIME-type
> --
>
> Key: NUTCH-1024
> URL: https://issues.apache.org/jira/browse/NUTCH-1024
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: AdaptiveFetchSchedule.patch, 
> MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, 
> NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, 
> adaptive-mimetypes.txt
>
>
> Add facility to configure default or fixed fetchInterval values by MIME-type. 
> This is useful for conserving resources for files that are known to change 
> frequently or never and everything in between.
> * simple key\tvalue\n configuration file
> * only set fetchInterval for new documents
> * keep max fetchInterval fixed by current config
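
A minimal sketch of the three points listed above, assuming one "mime-type<TAB>interval" pair per line with the interval in seconds. The class and method names are hypothetical, the "#" comment handling is an assumption, and the file actually committed as conf/adaptive-mimetypes.txt may use a different layout.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the behaviour described in NUTCH-1024,
// not the committed MimeAdaptiveFetchSchedule.
final class MimeIntervalSketch {

  // Parse "key\tvalue\n" lines; blank lines and "#" lines are skipped (assumption).
  static Map<String, Integer> parse(String text) {
    Map<String, Integer> byMime = new HashMap<>();
    for (String line : text.split("\n")) {
      if (line.isEmpty() || line.startsWith("#")) continue;
      String[] parts = line.split("\t");
      if (parts.length != 2) continue; // ignore malformed lines
      byMime.put(parts[0].trim(), Integer.parseInt(parts[1].trim()));
    }
    return byMime;
  }

  // Only new documents get a MIME-specific interval; the configured maximum always wins.
  static int chooseInterval(Map<String, Integer> byMime, String mimeType,
                            int currentInterval, boolean isNew, int maxInterval) {
    if (!isNew) return currentInterval;
    int interval = byMime.getOrDefault(mimeType, currentInterval);
    return Math.min(interval, maxInterval);
  }

  public static void main(String[] args) {
    Map<String, Integer> byMime = parse("text/html\t86400\nimage/png\t2592000\n");
    System.out.println(chooseInterval(byMime, "text/html", 3600, true, 604800));
  }
}
```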





[jira] [Commented] (NUTCH-1386) Headings filter not to add empty values

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295802#comment-13295802
 ] 

Hudson commented on NUTCH-1386:
---

Integrated in Nutch-trunk #1869 (See 
[https://builds.apache.org/job/Nutch-trunk/1869/])
NUTCH-1386 Headings filter not to add empty values (Revision 1349233)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349233
Files : 
* /nutch/trunk/CHANGES.txt
* 
/nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java


> Headings filter not to add empty values
> ---
>
> Key: NUTCH-1386
> URL: https://issues.apache.org/jira/browse/NUTCH-1386
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
>
> Headings filter can add empty values and doesn't trim the headings.
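
The behaviour the fix aims for (trim each heading, skip the ones that end up empty) could look roughly like this. The helper below is a hypothetical stand-alone sketch, not the code in HeadingsParseFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating the fix described in NUTCH-1386.
final class HeadingCleanupSketch {

  // Trim every heading and keep only those that still contain text.
  static List<String> clean(List<String> rawHeadings) {
    List<String> cleaned = new ArrayList<>();
    for (String heading : rawHeadings) {
      if (heading == null) continue;
      String trimmed = heading.trim();
      if (!trimmed.isEmpty()) {
        cleaned.add(trimmed);
      }
    }
    return cleaned;
  }

  public static void main(String[] args) {
    System.out.println(clean(Arrays.asList("  Intro ", "", "   ", "Usage")));
    // prints [Intro, Usage]
  }
}
```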





[jira] [Commented] (NUTCH-1330) OutlinkDB to preserve back up

2012-06-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295799#comment-13295799
 ] 

Hudson commented on NUTCH-1330:
---

Integrated in Nutch-trunk #1869 (See 
[https://builds.apache.org/job/Nutch-trunk/1869/])
NUTCH-1330 WebGraph OutlinkDB to preserve back up (Revision 1349240)

 Result = SUCCESS
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349240
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java


> OutlinkDB to preserve back up
> -
>
> Key: NUTCH-1330
> URL: https://issues.apache.org/jira/browse/NUTCH-1330
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1330-1.6-1.patch, NUTCH-1330-1.6-2.patch
>
>
> The webgraph's outlinkDB is the single source for all scoring jobs and GB's 
> that eventually come out. In case of disaster, which has not happened yet, it 
> should be able to preserve a backup just like the other DBs. This means users 
> with an existing outlinkdb must move it from crawl/webgraphdb/outlinks/ to 
> crawl/webgraphdb/outlinks/current/.
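
The move described above puts a directory inside itself, so it takes an intermediate rename. A rough sketch, assuming the webgraph sits on a local filesystem (on HDFS the same steps would be done with the Hadoop filesystem tools instead); the class name and the ".migrating" suffix are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical one-off migration helper; local filesystem only.
final class OutlinkDbMigrationSketch {

  // Turns <outlinks>/* into <outlinks>/current/*.
  static void migrate(Path outlinks) throws IOException {
    Path current = outlinks.resolve("current");
    if (Files.isDirectory(current)) {
      return; // already in the new layout
    }
    // Step 1: rename the existing DB out of the way (same parent, same filesystem).
    Path tmp = outlinks.resolveSibling(outlinks.getFileName() + ".migrating");
    Files.move(outlinks, tmp);
    // Step 2: recreate the parent directory, then move the old DB in as "current".
    Files.createDirectories(outlinks);
    Files.move(tmp, current);
  }

  public static void main(String[] args) throws IOException {
    // Path taken from the issue text; adjust to the actual crawl directory.
    migrate(Paths.get("crawl/webgraphdb/outlinks"));
  }
}
```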





[jira] [Commented] (NUTCH-1404) Nutch script fails to find job file in deploy mode

2012-06-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396778#comment-13396778
 ] 

Hudson commented on NUTCH-1404:
---

Integrated in nutch-trunk-maven #319 (See 
[https://builds.apache.org/job/nutch-trunk-maven/319/])
NUTCH-1404 Nutch script fails to find job file in deploy mode (sidabatra, 
jnioche) (Revision 1351709)

 Result = SUCCESS
jnioche : 
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/bin/nutch


> Nutch script fails to find job file in deploy mode
> --
>
> Key: NUTCH-1404
> URL: https://issues.apache.org/jira/browse/NUTCH-1404
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.5
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: nutchgora, 1.5.1
>
>
> See 
> http://lucene.472066.n3.nabble.com/Nutch-1-5-Deploy-Mode-Doesn-t-Work-like-Nutch-1-4-Deploy-Mode-tp3990169.html





[jira] [Commented] (NUTCH-1399) TestProtocolHttpClient fails

2012-06-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397249#comment-13397249
 ] 

Hudson commented on NUTCH-1399:
---

Integrated in Nutch-nutchgora #286 (See 
[https://builds.apache.org/job/Nutch-nutchgora/286/])
TestProtocolHttpClient fails NUTCH-1399 (Revision 1351730)

 Result = SUCCESS
lewismc : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* 
/nutch/branches/nutchgora/src/plugin/protocol-httpclient/src/test/org/apache/nutch/protocol/httpclient/TestProtocolHttpClient.java


> TestProtocolHttpClient fails
> 
>
> Key: NUTCH-1399
> URL: https://issues.apache.org/jira/browse/NUTCH-1399
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: nutchgora
>
> Attachments: NUTCH-1399.patch
>
>
> The test fails because the HTTP servers are not closed between tests.





[jira] [Commented] (NUTCH-1401) Upgrade to Hadoop 1.0.3

2012-06-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397250#comment-13397250
 ] 

Hudson commented on NUTCH-1401:
---

Integrated in Nutch-nutchgora #286 (See 
[https://builds.apache.org/job/Nutch-nutchgora/286/])
NUTCH-1401 (Revision 1351705)

 Result = SUCCESS
jnioche : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/ivy/ivy.xml


> Upgrade to Hadoop 1.0.3
> ---
>
> Key: NUTCH-1401
> URL: https://issues.apache.org/jira/browse/NUTCH-1401
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: nutchgora
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: nutchgora
>
>






[jira] [Commented] (NUTCH-1404) Nutch script fails to find job file in deploy mode

2012-06-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397248#comment-13397248
 ] 

Hudson commented on NUTCH-1404:
---

Integrated in Nutch-nutchgora #286 (See 
[https://builds.apache.org/job/Nutch-nutchgora/286/])
NUTCH-1404 Nutch script fails to find job file in deploy mode (sidabatra, 
jnioche) (Revision 1351707)

 Result = SUCCESS
jnioche : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/bin/nutch


> Nutch script fails to find job file in deploy mode
> --
>
> Key: NUTCH-1404
> URL: https://issues.apache.org/jira/browse/NUTCH-1404
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.5
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: nutchgora, 1.5.1
>
>
> See 
> http://lucene.472066.n3.nabble.com/Nutch-1-5-Deploy-Mode-Doesn-t-Work-like-Nutch-1-4-Deploy-Mode-tp3990169.html




