[jira] Commented: (NUTCH-826) Mailing list is broken.
[ https://issues.apache.org/jira/browse/NUTCH-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870980#action_12870980 ]

Hudson commented on NUTCH-826:
------------------------------

Integrated in Nutch-trunk #1163 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1163/])
NUTCH-826 : update mailing list and version control pages on website after move to TLP

> Mailing list is broken.
> -----------------------
>
>                 Key: NUTCH-826
>                 URL: https://issues.apache.org/jira/browse/NUTCH-826
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: John Sherwood
>            Assignee: Julien Nioche
>            Priority: Blocker
>             Fix For: 1.1
>
>
> All of the following addresses are failing:
> nutch-u...@nutch.apache.org
> nutch-user-subscr...@nutch.apache.org
> nutch-user-subscr...@lucene.apache.org
> For the last one, the mailer daemon said
> "This mailing list has moved to user at nutch.apache.org."
> Below is the message I tried to send:
> Hi people,
> I've been banging my head against this problem for two days now.
> Simply, I want to add a field with the value of a given meta tag.
> I've been trying the parse-xml plugin, but it seems that it doesn't
> work with version 1.0. I've tried the code at
> http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html
> and it hasn't worked. I don't even know why. I don't even know if my
> plugin is being used... or even looked for! Nutch seems to have an
> infuriating "Fail silently" policy for plugins. I put a
> System.exit(1) in my filters just to see if my code is even being
> encountered. It has not, in spite of my config telling it to.
> Here's my config:
> nutch-site.xml
> ...
>
> plugin.includes
>
> protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|metadata
>
> ...
> parse-plugins.xml
> ...
>
> ...
> extension-id="com.example.website.nutch.parsing.MetaTagExtractorParseFilter"
> />
> ...
> I've also copied the plugin.xml and jar from my build/metadata to the
> plugins root dir.
> Nonetheless, Nutch runs and puts data in solr for me. Afaik, Nutch is
> completely unaware of my plugin despite my config options. Is there
> some other place I need to tell Nutch to use my plugin? Is there some
> other approach to do this without having to write a plugin? This does
> seem like a lot of work to simply get a meta tag into a field. Any
> help would be appreciated.
> Sincerely,
> John Sherwood

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-278) Fetcher-status might need clarification: kbit/s instead of kb/s shown
[ https://issues.apache.org/jira/browse/NUTCH-278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882835#action_12882835 ]

Hudson commented on NUTCH-278:
------------------------------

Integrated in Nutch-trunk #1189 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1189/])
- fix for NUTCH-278 Fetcher-status might need clarification: kbit/s instead of kb/s shown

> Fetcher-status might need clarification: kbit/s instead of kb/s shown
> ---------------------------------------------------------------------
>
>                 Key: NUTCH-278
>                 URL: https://issues.apache.org/jira/browse/NUTCH-278
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Stefan Neufeind
>            Assignee: Chris A. Mattmann
>            Priority: Trivial
>             Fix For: 1.2
>
>         Attachments: PATCH.NUTCH-278
>
>
> In Fetcher.java, method reportStatus() there is
> + Math.round(((((float)bytes)*8)/1024)/elapsed)+" kb/s, ";
> Is that a bit misleading, since the user reading the status might guess it's
> "kilobytes" (kb) whereas "kbit/s" would be more clear in this case?
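The unit question raised in NUTCH-278 can be illustrated with a small sketch. It is not the code from Fetcher.reportStatus(); the class and method names are hypothetical, and it assumes, like the expression quoted in the issue, that one kilobit is 1024 bits.

```java
// Hypothetical helper for illustration only; not part of Nutch.
final class ThroughputLabelSketch {

    /** Rate in kilobits per second: bytes * 8 gives bits, / 1024 gives kilobits. */
    static String kbitPerSecond(long bytes, long elapsedSeconds) {
        int rate = Math.round(((((float) bytes) * 8) / 1024) / elapsedSeconds);
        return rate + " kbit/s";
    }

    /** The same measurement in kilobytes per second, to show how far the two units differ. */
    static String kbytePerSecond(long bytes, long elapsedSeconds) {
        int rate = Math.round((((float) bytes) / 1024) / elapsedSeconds);
        return rate + " kB/s";
    }

    public static void main(String[] args) {
        System.out.println(kbitPerSecond(1_280_000L, 10L));  // prints "1000 kbit/s"
        System.out.println(kbytePerSecond(1_280_000L, 10L)); // prints "125 kB/s"
    }
}
```

The factor of eight between the two outputs is why an ambiguous "kb/s" label can mislead a reader of the fetcher status.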
[jira] Commented: (NUTCH-832) Website menu has lots of broken links - in particular the API docs
[ https://issues.apache.org/jira/browse/NUTCH-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882837#action_12882837 ] Hudson commented on NUTCH-832: -- Integrated in Nutch-trunk #1189 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1189/]) - fix for NUTCH-832 Website menu has lots of broken links - in particular the API docs > Website menu has lots of broken links - in particular the API docs > -- > > Key: NUTCH-832 > URL: https://issues.apache.org/jira/browse/NUTCH-832 > Project: Nutch > Issue Type: Bug > Components: documentation >Affects Versions: 1.1 > Environment: Web >Reporter: Alex McLintock >Assignee: Chris A. Mattmann > Fix For: 1.2 > > Attachments: PATCH.NUTCH-832 > > > The website seems to have lots of broken links. eg the menu on the left > points to various URLs of the form > http://nutch.apache.org/apidocs-1.0/index.html > but these don't seem to exist on the server. > Also > http://nutch.apache.org/release/ -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-833) Website is still Lucene branded
[ https://issues.apache.org/jira/browse/NUTCH-833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882836#action_12882836 ] Hudson commented on NUTCH-833: -- Integrated in Nutch-trunk #1189 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1189/]) - changelog for NUTCH-833 Website is still Lucene branded - progress towards NUTCH-833 Website is still Lucene branded - progress towards NUTCH-833 Website is still Lucene branded - progress towards NUTCH-833 Website is still Lucene branded - progress towards NUTCH-833 Website is still Lucene branded - progress towards NUTCH-833 Website is still Lucene branded fix for NUTCH-833 Website is still Lucene branded. fix for NUTCH-833 Website is still Lucene branded. > Website is still Lucene branded > --- > > Key: NUTCH-833 > URL: https://issues.apache.org/jira/browse/NUTCH-833 > Project: Nutch > Issue Type: Bug > Components: documentation >Affects Versions: 1.1 > Environment: Web >Reporter: Alex McLintock >Assignee: Chris A. Mattmann >Priority: Trivial > Fix For: 1.2 > > > The Nutch website still has a lot of Lucene branding and links which are > confusing. eg the breadcrumbs > Apache > Lucene > Nutch > > appear at the top of most pages, along with the lucene logo and link to their > home page. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-834) Separate the Nutch web site from trunk
[ https://issues.apache.org/jira/browse/NUTCH-834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884177#action_12884177 ] Hudson commented on NUTCH-834: -- Integrated in Nutch-trunk #1194 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1194/]) (NUTCH-834) Separate the Nutch web site from trunk > Separate the Nutch web site from trunk > -- > > Key: NUTCH-834 > URL: https://issues.apache.org/jira/browse/NUTCH-834 > Project: Nutch > Issue Type: Task > Components: documentation >Affects Versions: 1.1 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 2.0 > > > As discussed on dev@, it would be useful to move the -PDFBox- Nutch web site > sources from .../asf/nutch/trunk to .../asf/nutch/site and to use the > svnpubsub mechanism for instant deployment of site changes. > The related issue for infra is > https://issues.apache.org/jira/browse/INFRA-2822 > See also https://issues.apache.org/jira/browse/PDFBOX-623 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-835) document deduplication (exact duplicates) failed using MD5Signature
[ https://issues.apache.org/jira/browse/NUTCH-835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884540#action_12884540 ]

Hudson commented on NUTCH-835:
------------------------------

Integrated in Nutch-trunk #1195 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1195/])
NUTCH-835 Document deduplication failed using MD5Signature (Sebastian Nagel via ab)

> document deduplication (exact duplicates) failed using MD5Signature
> -------------------------------------------------------------------
>
>                 Key: NUTCH-835
>                 URL: https://issues.apache.org/jira/browse/NUTCH-835
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.1
>         Environment: Linux, Ubuntu 10.04, Java 1.6.0_20
>            Reporter: Sebastian Nagel
>            Assignee: Andrzej Bialecki
>             Fix For: 1.2, 2.0
>
>
> The MD5Signature class calculates different signatures for identical
> documents.
> The reason is that
>   byte[] data = content.getContent();
>   ... StringBuilder().append(data) ...
> uses java.lang.Object.toString() to get a string representation of the
> (binary) content,
> which results in unique hash codes (e.g., [...@30dc9065) even for two byte
> arrays
> with identical content.
> A solution would be to take the MD5 sum of the binary content as first part
> of the
> final signature calculation (the parsed content is the second part):
>   ...
>   .append(StringUtil.toHexString(MD5Hash.digest(data).getDigest())).append(parse.getText());
> Of course, there are many other solutions...
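The cause described in NUTCH-835 can be reproduced outside Nutch. The sketch below is an assumption-laden illustration: it uses the JDK's java.security.MessageDigest in place of the Hadoop MD5Hash and Nutch StringUtil calls quoted in the issue, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch; not the MD5Signature class from Nutch.
final class ContentSignatureSketch {

    /** Hex-encoded MD5 digest of the raw bytes; depends only on the byte values. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Signature input as proposed in the issue: content digest first, parsed text second. */
    static String signatureInput(byte[] rawContent, String parsedText) {
        return new StringBuilder().append(md5Hex(rawContent)).append(parsedText).toString();
    }

    public static void main(String[] args) {
        byte[] a = "same page".getBytes(StandardCharsets.UTF_8);
        byte[] b = "same page".getBytes(StandardCharsets.UTF_8);
        // append(byte[]) resolves to append(Object), so the text is the array's
        // identity-based Object.toString(), not its contents.
        System.out.println(new StringBuilder().append(a));
        System.out.println(new StringBuilder().append(b));
        // Digesting the content gives equal input for equal bytes.
        System.out.println(signatureInput(a, "text").equals(signatureInput(b, "text"))); // prints "true"
    }
}
```

The first two printed lines normally differ between the two arrays even though their contents are identical, which is the behaviour the reporter observed.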
[jira] Commented: (NUTCH-837) Remove search servers and Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884996#action_12884996 ] Hudson commented on NUTCH-837: -- Integrated in Nutch-trunk #1197 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/]) > Remove search servers and Lucene dependencies > -- > > Key: NUTCH-837 > URL: https://issues.apache.org/jira/browse/NUTCH-837 > Project: Nutch > Issue Type: Task > Components: searcher, web gui >Affects Versions: 1.1 >Reporter: Julien Nioche >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-837.patch > > > One of the main aspects of 2.0 is the delegation of the indexing and search > to external resources like SOLR. We can simplify the code a lot by getting > rid of the : > * search servers > * indexing and analysis with Lucene > * search side functionalities : ontologies / clustering etc... > In the short term only SOLR / SOLRCloud will be supported but the plan would > be to add other systems as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-838) Add timing information to all Tool classes
[ https://issues.apache.org/jira/browse/NUTCH-838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884997#action_12884997 ]

Hudson commented on NUTCH-838:
------------------------------

Integrated in Nutch-trunk #1197 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])
- fix for NUTCH-838 Add timing information to all Tool classes

> Add timing information to all Tool classes
> ------------------------------------------
>
>                 Key: NUTCH-838
>                 URL: https://issues.apache.org/jira/browse/NUTCH-838
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator, indexer, linkdb, parser
>    Affects Versions: 1.1
>         Environment: JDK 1.6, Linux & Windows
>            Reporter: Jeroen van Vianen
>            Assignee: Chris A. Mattmann
>             Fix For: 1.2, 2.0
>
>         Attachments: timings.patch
>
>
> Am happily trying to crawl a few hundred URLs incrementally. Performance is
> degrading suddenly after the index reaches approximately 25000 URLs.
> At first each inject (generate, fetch, parse, updatedb) * 3, invertlinks,
> solrindex, solrdedup batch takes approximately half an hour with topN 500,
> but elapsed times now increase to 00h45m, 01h15m, 01h30m with every batch.
> As I'm uncertain which of the phases takes so much time, I decided to add
> start and finish times to all classes that implement Tool so I at least have a
> feeling and can review them in a log file.
> Am using pretty old hardware, but I am planning to recrawl these URLs on a
> regular basis and if every iteration is going to take more and more time,
> index updates will be few and far between :-(
> I added timing information to *all* Tool classes for consistency whereas
> there are only 10 or so Tools that are really interesting.
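The kind of start/finish logging requested in NUTCH-838 might look roughly like the following. This is a minimal sketch, not the attached timings.patch: the class name, method name, and log wording are assumptions, and only the hh:mm:ss elapsed format is modelled.

```java
// Hypothetical sketch; not the code from timings.patch.
final class ElapsedSketch {

    /** Formats the time between two millisecond timestamps as hh:mm:ss. */
    static String elapsed(long startMillis, long endMillis) {
        long seconds = (endMillis - startMillis) / 1000;
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        System.out.println("ExampleTool: starting");
        Thread.sleep(50); // stands in for the real work of a Tool's run() method
        long end = System.currentTimeMillis();
        System.out.println("ExampleTool: finished, elapsed: " + elapsed(start, end));
    }
}
```

Logging both timestamps around each phase makes it possible to see from the log file alone which step of a batch is slowing down.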
[jira] Commented: (NUTCH-836) Remove deprecated parse plugins
[ https://issues.apache.org/jira/browse/NUTCH-836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884995#action_12884995 ]

Hudson commented on NUTCH-836:
------------------------------

Integrated in Nutch-trunk #1197 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1197/])

> Remove deprecated parse plugins
> -------------------------------
>
>                 Key: NUTCH-836
>                 URL: https://issues.apache.org/jira/browse/NUTCH-836
>             Project: Nutch
>          Issue Type: Task
>          Components: parser
>    Affects Versions: 1.1
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 2.0
>
>         Attachments: NUTCH-836-2.patch
>
>
> Some of the parser plugins in 1.1 are covered by the parse-tika plugin. These
> plugins have been kept in 1.1 but should be removed from 2.0 where we'll rely
> on parse-tika almost exclusively. Some existing plugins might be kept when
> there is no equivalent in Tika (to be discussed). The following plugins are
> removed :
> * parse-html
> * parse-msexcel
> * parse-mspowerpoint
> * parse-msword
> * parse-pdf
> * parse-oo
> * parse-text
> * lib-jakarta-poi
> * lib-parsems
> The patch does not (yet) remove :
> * parse-ext
> * parse-js
> * parse-rss
> * parse-swf
> * parse-zip
> * feed
> Please review the patch and vote for its inclusion in the trunk.
[jira] [Commented] (NUTCH-994) Fine tune Solr schema
[ https://issues.apache.org/jira/browse/NUTCH-994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056983#comment-13056983 ] Hudson commented on NUTCH-994: -- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > Fine tune Solr schema > - > > Key: NUTCH-994 > URL: https://issues.apache.org/jira/browse/NUTCH-994 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.3, 2.0 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.3, 2.0 > > Attachments: NUTCH-994-all.patch > > > The supplied schema is old and doesn't use more advanced fieldTypes such as > Trie based (since Solr 1.4) and perhaps other improvements. We need to fine > tune the schema. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056986#comment-13056986 ]

Hudson commented on NUTCH-1012:
-------------------------------

Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/])
NUTCH-1012 Cannot handle illegal charset

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1140696
Files :
* /nutch/trunk/src/java/org/apache/nutch/util/EncodingDetector.java
* /nutch/trunk/CHANGES.txt

> Cannot handle illegal charset $charset
> --------------------------------------
>
>                 Key: NUTCH-1012
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1012
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1012-1.4.patch
>
>
> Pages returning:
> {code}
> Content-Type: text/html; charset=$charset
> {code}
> cause:
> {code}
> Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
> ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
> {code}
> Stack trace:
> {code}
> 2011-06-24 01:14:23,442 WARN parse.html - java.nio.charset.IllegalCharsetNameException: $charset
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 2011-06-24 01:14:23,443 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.lang.Thread.run(Thread.java:662)
> 2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> {code}
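The stack trace in NUTCH-1012 shows Charset.isSupported() throwing rather than returning false when the name itself is illegal. One way to guard such a lookup is sketched below; this is an assumption about the general approach, not the content of NUTCH-1012-1.4.patch, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical sketch; not the EncodingDetector code from Nutch.
final class CharsetGuardSketch {

    /** Returns the label if the JVM supports it, or null if it is missing, illegal, or unsupported. */
    static String supportedOrNull(String label) {
        if (label == null) {
            return null;
        }
        try {
            return Charset.isSupported(label) ? label : null;
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException is a subclass of IllegalArgumentException;
            // a label such as "$charset" is not a legal charset name and lands here.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(supportedOrNull("UTF-8"));    // prints "UTF-8"
        System.out.println(supportedOrNull("$charset")); // prints "null"
    }
}
```

With a guard of this kind a bogus Content-Type header degrades to "no usable clue" instead of failing the parse of the whole page.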
[jira] [Commented] (NUTCH-986) Dedup fails due to date format (long)
[ https://issues.apache.org/jira/browse/NUTCH-986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056984#comment-13056984 ]

Hudson commented on NUTCH-986:
------------------------------

Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/])

> Dedup fails due to date format (long)
> -------------------------------------
>
>                 Key: NUTCH-986
>                 URL: https://issues.apache.org/jira/browse/NUTCH-986
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.3, 2.0
>
>         Attachments: NUTCH-986-1.3-1.patch, NUTCH-986-1.3-2.patch,
> NUTCH-986-trunk-1.patch, NUTCH-986-trunk-2.patch
>
>
> As already mentioned on the list, dedup also fails because of invalid date
> formats.
> Apr 19, 2011 10:34:50 AM org.apache.solr.request.BinaryResponseWriter$Resolver getDoc
> WARNING: Error reading a field from document : SolrDocument[{digest=7ff92a31c58e43a34fd45bc6d87cda03}]
> java.lang.NumberFormatException: For input string: "2011-04-19T08:16:31.675Z"
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>     at java.lang.Long.parseLong(Long.java:419)
>     at java.lang.Long.valueOf(Long.java:525)
>     at org.apache.solr.schema.LongField.toObject(LongField.java:82)
>
> Strangely enough, Solr seems to allow updates of long fields with a formatted
> date. In Nutch 1.2 the tstamp field is actually a long, but in 1.3 the field
> is a valid Solr date format. This exception is only triggered using the javabin
> response writer, so there's something weird in Solr too.
> We need to either change the tstamp field back to a long or update the Solr
> example schema and fix SolrDeleteDuplicates to use the formatted date instead
> of the long.
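The mismatch in NUTCH-986 is between a field declared as a long and a value written as a formatted date. The sketch below only illustrates that mismatch; it is not SolrDeleteDuplicates code, the names are hypothetical, and it assumes java.time (Java 8 or later), which postdates the Nutch versions discussed here.

```java
import java.time.Instant;

// Hypothetical sketch; not taken from Nutch or Solr.
final class TstampSketch {

    /** Accepts either an epoch-millisecond long or an ISO-8601 instant and returns epoch milliseconds. */
    static long toEpochMillis(String value) {
        String trimmed = value.trim();
        try {
            return Long.parseLong(trimmed);
        } catch (NumberFormatException e) {
            // A value such as "2011-04-19T08:16:31.675Z" fails Long.parseLong,
            // which is the exception shown in the log above.
            return Instant.parse(trimmed).toEpochMilli();
        }
    }

    public static void main(String[] args) {
        long millis = toEpochMillis("2011-04-19T08:16:31.675Z");
        System.out.println(Instant.ofEpochMilli(millis)); // prints "2011-04-19T08:16:31.675Z"
    }
}
```

Either fix named in the issue removes the need for such tolerance: the schema and the code that reads the field simply have to agree on one representation.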
[jira] [Commented] (NUTCH-989) index-basic plugin doesn't use Solr date fieldType
[ https://issues.apache.org/jira/browse/NUTCH-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056982#comment-13056982 ] Hudson commented on NUTCH-989: -- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > index-basic plugin doesn't use Solr date fieldType > -- > > Key: NUTCH-989 > URL: https://issues.apache.org/jira/browse/NUTCH-989 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.3, 2.0 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.3, 2.0 > > > The index-basic plugin actually sends over a properly formatted date with > millis but the schema isn't configured to use the dateField. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-995) Generate POM file using the Ivy makepom task
[ https://issues.apache.org/jira/browse/NUTCH-995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056985#comment-13056985 ] Hudson commented on NUTCH-995: -- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > Generate POM file using the Ivy makepom task > - > > Key: NUTCH-995 > URL: https://issues.apache.org/jira/browse/NUTCH-995 > Project: Nutch > Issue Type: Improvement >Reporter: Julien Nioche >Assignee: Chris A. Mattmann > Fix For: 1.3 > > Attachments: NUTCH-955-1.3.patch, NUTCH-997.branch-1.3.v2.patch, > mvn-template-build.patch > > > We currently have a pom.xml file in the SVN repository and use it for > publishing our artefacts. The trouble with this is that we need to keep its > content in sync with our ivy file. Instead we could use the makepom task > (http://ant.apache.org/ivy/history/2.2.0/use/makepom.html) to generate the > pom.xml automatically. > The existing pom.xml for 1.3 needs fixing anyway as it declares dependencies > to GORA and has the wrong versions for some dependencies. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1010) ContentLength not trimmed
[ https://issues.apache.org/jira/browse/NUTCH-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056987#comment-13056987 ]

Hudson commented on NUTCH-1010:
-------------------------------

Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/])

> ContentLength not trimmed
> -------------------------
>
>                 Key: NUTCH-1010
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1010
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3, 1.4
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1010-1.4.patch, NUTCH-1010-2.0.patch
>
>
> Somewhere in some component the ContentLength field is not trimmed. This
> allows a seemingly numeric field to be treated as a string by the indexer
> when the value carries leading or trailing whitespace. The result is a
> hard to debug exception with no way to identify the bad document (amongst
> thousands) or the bad field.
> {code}
> Jun 22, 2011 1:03:42 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.NumberFormatException: For input string: "32717 "
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>     at java.lang.Long.parseLong(Long.java:419)
>     at java.lang.Long.parseLong(Long.java:468)
> {code}
> This can be quickly fixed in the index-more plugin by simply calling trim()
> when adding the field.
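The failure and the proposed trim() fix from NUTCH-1010 can be shown in isolation. The sketch below is not the index-more plugin code; the class and method names are hypothetical.

```java
// Hypothetical sketch; not the index-more plugin from Nutch.
final class ContentLengthSketch {

    /** Parses a Content-Length header value, tolerating surrounding whitespace. */
    static long parseContentLength(String headerValue) {
        return Long.parseLong(headerValue.trim());
    }

    public static void main(String[] args) {
        System.out.println(parseContentLength("32717 ")); // prints "32717"
        try {
            Long.parseLong("32717 "); // the untrimmed value from the log above
        } catch (NumberFormatException e) {
            System.out.println("untrimmed value rejected: " + e.getMessage());
        }
    }
}
```

Trimming at the point where the field is added keeps the bad value from ever reaching the indexer, where the resulting exception no longer names the document.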
[jira] [Commented] (NUTCH-967) Upgrade to Tika 0.9
[ https://issues.apache.org/jira/browse/NUTCH-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056992#comment-13056992 ] Hudson commented on NUTCH-967: -- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > Upgrade to Tika 0.9 > --- > > Key: NUTCH-967 > URL: https://issues.apache.org/jira/browse/NUTCH-967 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.3, 2.0 >Reporter: Markus Jelsma >Assignee: Julien Nioche > Fix For: 1.3, 2.0 > > Attachments: NUTCH-967-1.3-2.patch, NUTCH-967-1.3-3.patch, > NUTCH-967-1.3.patch > > -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1006) meta equiv with single quotes not accepted
[ https://issues.apache.org/jira/browse/NUTCH-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056990#comment-13056990 ] Hudson commented on NUTCH-1006: --- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > meta equiv with single quotes not accepted > -- > > Key: NUTCH-1006 > URL: https://issues.apache.org/jira/browse/NUTCH-1006 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.2, 1.3, 1.4, 2.0 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1006-104.patch, NUTCH-1006-2.0.patch > > > As posted by Alex F: > the regex metaPattern inside org.apache.nutch.parse.html.HtmlParser is not > suitable for sites using single quotes for > Example: > We experienced a couple of pages with that kind of quotes and Nutch-1.2 > was not able to handle it. > Is there any fallback or would it be good to use the following > regex: "]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>" (single > or regular quotes are accepted)? > See this thread: > http://lucene.472066.n3.nabble.com/Character-encoding-on-Html-Pages-td3034850.html -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
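The idea behind NUTCH-1006, accepting either quote style around the http-equiv value, can be sketched as follows. The pattern here is a hypothetical stand-in written for illustration; it is not the metaPattern from HtmlParser, and the opening part of the regex quoted in the issue did not survive in this archive, so no claim is made that the two are identical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the pattern is an assumption, not the one used in Nutch.
final class MetaEquivSketch {

    // Optional single or double quote on either side of "content-type".
    private static final Pattern META_CONTENT_TYPE = Pattern.compile(
            "<meta\\s+([^>]*http-equiv=(\"|')?content-type(\"|')?[^>]*)>",
            Pattern.CASE_INSENSITIVE);

    /** True if the HTML fragment contains a meta http-equiv content-type tag. */
    static boolean hasContentTypeMeta(String html) {
        return META_CONTENT_TYPE.matcher(html).find();
    }

    public static void main(String[] args) {
        String single = "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>";
        String dbl = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">";
        System.out.println(hasContentTypeMeta(single)); // prints "true"
        System.out.println(hasContentTypeMeta(dbl));    // prints "true"
    }
}
```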
[jira] [Commented] (NUTCH-999) Normalise String representation for Dates in IndexingFilters
[ https://issues.apache.org/jira/browse/NUTCH-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056988#comment-13056988 ] Hudson commented on NUTCH-999: -- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > Normalise String representation for Dates in IndexingFilters > > > Key: NUTCH-999 > URL: https://issues.apache.org/jira/browse/NUTCH-999 > Project: Nutch > Issue Type: Task > Components: indexer >Affects Versions: 2.0 >Reporter: Julien Nioche > Fix For: 2.0 > > Attachments: NUTCH-999.patch > > > NUTCH-997 has been applied to Nutch-1.3 so that various indexing filters > store Date objects as value for fields. However in trunk NutchDocuments can > have only String values which means that we will have to convert the Dates to > Strings in each indexing filter. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-983) Upgrade SolrJ
[ https://issues.apache.org/jira/browse/NUTCH-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056989#comment-13056989 ] Hudson commented on NUTCH-983: -- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > Upgrade SolrJ > - > > Key: NUTCH-983 > URL: https://issues.apache.org/jira/browse/NUTCH-983 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.3, 2.0 >Reporter: Markus Jelsma >Priority: Minor > Fix For: 1.3, 2.0 > > > Solr 3.1 has been released a while ago. The Javabin format between 1.4.1 and > 3.1 has been changed so our SolrJ 1.4.1 cannot send documents to 3.1. Since > Nutch 2.0 won't be released within a short period i believe it would be a > good idea to upgrade our SolrJ to 3.1. New Solr users are encouraged to use > Solr 3.1 or upgrade so i expect more users wanting to use 3.1 as well. Any > thoughts? -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-991) SolrDedup must issue a commit
[ https://issues.apache.org/jira/browse/NUTCH-991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056991#comment-13056991 ] Hudson commented on NUTCH-991: -- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > SolrDedup must issue a commit > - > > Key: NUTCH-991 > URL: https://issues.apache.org/jira/browse/NUTCH-991 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.3, 2.0 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.3, 2.0 > > Attachments: NUTCH-991-1.3-1.patch, NUTCH-991-trunk-1.patch > > > Title says it all. SolrDedup job doesn't commit but it should. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-888) Remove parse-rss
[ https://issues.apache.org/jira/browse/NUTCH-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056993#comment-13056993 ] Hudson commented on NUTCH-888: -- Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/]) > Remove parse-rss > > > Key: NUTCH-888 > URL: https://issues.apache.org/jira/browse/NUTCH-888 > Project: Nutch > Issue Type: Task > Components: parser >Affects Versions: 1.3, 2.0 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 1.3, 2.0 > > > See https://issues.apache.org/jira/browse/NUTCH-887 > {quote} > CM : I wrote parse-rss back in 2005, and used commons-feedparser from Kevin > Burton and his crew. At the time it was well developed, and a little more > flexible and easier for me to pick up than Rome. Since then however, its > development has really become stagnant and it is no longer maintained. > In terms of real differences in terms of functionality, they are roughly > equivalent so there isn't much difference. > {quote} > Already +1 from Andrzej and Chris. Will remove it tomorrow if there aren't > any objections in the meantime -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1013) Migrate RegexURLNormalizer from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059677#comment-13059677 ] Hudson commented on NUTCH-1013: --- Integrated in Nutch-trunk #1536 (See [https://builds.apache.org/job/Nutch-trunk/1536/]) NUTCH-1013 Migrate RegexURLNormalizer from Apache ORO to java.util.regex markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1142687 Files : * /nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java * /nutch/trunk/CHANGES.txt > Migrate RegexURLNormalizer from Apache ORO to java.util.regex > - > > Key: NUTCH-1013 > URL: https://issues.apache.org/jira/browse/NUTCH-1013 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1013-1.4.patch > > > Apache ORO uses old Perl 5-style regular expressions. Features such as the > powerful lookbehind are not available. The project has become retired as > well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061041#comment-13061041 ]

Hudson commented on NUTCH-1011:
-------------------------------

Integrated in Nutch-trunk #1538 (See [https://builds.apache.org/job/Nutch-trunk/1538/])
NUTCH-1011 Remove duplicate slashes from URLs

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1143468
Files :
* /nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
* /nutch/trunk/conf/regex-normalize.xml.template
* /nutch/trunk/CHANGES.txt

> Normalize duplicate slashes in URL's
> ------------------------------------
>
>                 Key: NUTCH-1011
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1011
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.4, 2.0
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch
>
>
> Many websites produce faulty URLs with multiple slashes, e.g.
> http://cocoon.apache.org///1.x/dynamic.html
> This can be really nasty if the number of slashes varies, resulting in many
> URLs actually pointing to the same page and generating new (unique) URLs to
> the same or other duplicate pages.
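The normalization described in NUTCH-1011 can be sketched with a java.util.regex lookbehind. This is an illustration under assumptions: the pattern below is not claimed to be the rule committed to regex-normalize.xml.template, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the rule shipped with Nutch may differ.
final class SlashNormalizerSketch {

    // Two or more slashes, unless the run starts right after a colon ("http://").
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    /** Collapses runs of slashes to a single slash while keeping the scheme separator. */
    static String collapseSlashes(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(collapseSlashes("http://cocoon.apache.org///1.x/dynamic.html"));
        // prints "http://cocoon.apache.org/1.x/dynamic.html"
    }
}
```

Note that this simple pattern also rewrites slash runs inside a query string, so a production rule would probably need to be more selective.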
[jira] [Commented] (NUTCH-1027) Degrade log level of `can't find rules for scope`
[ https://issues.apache.org/jira/browse/NUTCH-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063704#comment-13063704 ] Hudson commented on NUTCH-1027: --- Integrated in Nutch-trunk #1543 (See [https://builds.apache.org/job/Nutch-trunk/1543/]) NUTCH-1027 Degrade log level of 'can't find rules for scope' markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1145131 Files : * /nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java * /nutch/trunk/CHANGES.txt > Degrade log level of `can't find rules for scope` > - > > Key: NUTCH-1027 > URL: https://issues.apache.org/jira/browse/NUTCH-1027 > Project: Nutch > Issue Type: Task >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1027-1.4-1.patch > > > The warning for regex.RegexURLNormalizer - can't find rules for scope > '', using default should be degraded to info because: > # new users are unaware of the normalizer > # the scoping of normalizers is not really documented (meaning wiki/tutorial, > not just javadoc) > # I don't consider it a warning (i.e. having no scope is not bad) > Thoughts?
[jira] [Commented] (NUTCH-1043) Add pattern for filtering .js in default url filters
[ https://issues.apache.org/jira/browse/NUTCH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067481#comment-13067481 ] Hudson commented on NUTCH-1043: --- Integrated in Nutch-trunk #1550 (See [https://builds.apache.org/job/Nutch-trunk/1550/]) NUTCH-1043 Add pattern for filtering .js in default url filters jnioche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147798 Files : * /nutch/trunk/conf/automaton-urlfilter.txt.template * /nutch/trunk/conf/regex-urlfilter.txt.template * /nutch/trunk/CHANGES.txt > Add pattern for filtering .js in default url filters > > > Key: NUTCH-1043 > URL: https://issues.apache.org/jira/browse/NUTCH-1043 > Project: Nutch > Issue Type: Task >Affects Versions: 1.4, 2.0 >Reporter: Julien Nioche >Priority: Minor > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1043.patch > > > The Javascript parser is not used by default as it is extremely noisy; > however, the default URL filters do not filter out URLs ending in .js and the > default parser (Tika) can't parse them. In a nutshell, we are fetching URLs > that we know can't be parsed. > I suggest that we add a regex to the default URL filters. If people are > interested in fetching and parsing .js files they can activate the plugin in > their conf and remove the regex in the URL filters.
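To illustrate the suggestion above, here is a minimal sketch of a suffix rule that rejects URLs ending in .js. The class name, the pattern, and the convention of returning null for a rejected URL are assumptions made for this example; the actual patterns added to the regex and automaton filter templates are not shown in this message.

```java
import java.util.regex.Pattern;

/** Illustrative only: rejecting URLs that end in ".js". */
final class JsUrlFilterSketch {

    // A literal ".js" at the very end of the URL, in any letter case.
    private static final Pattern JS_SUFFIX =
            Pattern.compile("\\.js$", Pattern.CASE_INSENSITIVE);

    /** Returns the URL when it is accepted, or null when it is rejected. */
    static String filter(String url) {
        return JS_SUFFIX.matcher(url).find() ? null : url;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/static/app.js"));
        System.out.println(filter("http://example.com/data/app.json"));
    }
}
```

Because the rule is anchored at the end of the string, a URL such as http://example.com/app.js?v=2 would still pass this particular sketch.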
[jira] [Commented] (NUTCH-1055) upgrade package.html file in language identifier plugin
[ https://issues.apache.org/jira/browse/NUTCH-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067482#comment-13067482 ] Hudson commented on NUTCH-1055: --- Integrated in Nutch-trunk #1550 (See [https://builds.apache.org/job/Nutch-trunk/1550/]) commit and close of NUTCH-1055 and changes.txt, this commit does not affect functionality it is merely a hyperlink reference to the document used as the basis for the language identifier plugin lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1147817 Files : * /nutch/trunk/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/package.html * /nutch/trunk/CHANGES.txt > upgrade package.html file in language identifier plugin > --- > > Key: NUTCH-1055 > URL: https://issues.apache.org/jira/browse/NUTCH-1055 > Project: Nutch > Issue Type: Improvement > Components: documentation >Affects Versions: 1.3 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Labels: documentation > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1055-package-html.patch, > NUTCH-1055-trunk-package-html.patch, europarl.ps > > > package.html within the language identifier plugin contains the following... > however the link is broken. > > > Text document language identifier. Language profiles are based on > material from > <a href="http://www.isi.edu/~koehn/europarl/">http://www.isi.edu/~koehn/europarl/</a>. > > > The correct link should be > http://www.homepages.inf.ed.ac.uk/pkoehn/publications/europarl.ps > I will submit a patch.
[jira] [Commented] (NUTCH-1037) Deduplicate anchors before indexing
[ https://issues.apache.org/jira/browse/NUTCH-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068149#comment-13068149 ] Hudson commented on NUTCH-1037: --- Integrated in Nutch-trunk #1551 (See [https://builds.apache.org/job/Nutch-trunk/1551/]) NUTCH-1037 Option to deduplicate anchors prior to indexing markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1148308 Files : * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/index-anchor/src/java/org/apache/nutch/indexer/anchor/AnchorIndexingFilter.java > Deduplicate anchors before indexing > --- > > Key: NUTCH-1037 > URL: https://issues.apache.org/jira/browse/NUTCH-1037 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1037-1.4-1.patch, NUTCH-1037-1.4-2.patch, > NUTCH-1037-1.4-3.patch, NUTCH-1037-2.0-1.patch, NUTCH-1037-2.0-2.patch > > > Anchors are not deduplicated before indexing. This can result in a very high > number of similar and identical anchors being indexed. Before indexing, > anchors must be deduplicated at least on case. > Use anchorIndexingFilter.deduplicate=true to deduplicate anchors > case-insensitively.
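A minimal sketch of case-insensitive anchor deduplication as described above: keep the first spelling of each anchor and drop later ones that differ only in letter case. The class and method names are hypothetical; this is not the code of AnchorIndexingFilter, only an illustration of the behaviour the anchorIndexingFilter.deduplicate option is described as enabling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: deduplicating anchor texts without regard to case. */
final class AnchorDedupSketch {

    /** Returns the anchors in their original order, minus case-insensitive repeats. */
    static List<String> deduplicate(List<String> anchors) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String anchor : anchors) {
            // Set.add returns false when the lower-cased key is already present.
            if (seen.add(anchor.toLowerCase(Locale.ROOT))) {
                unique.add(anchor);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        System.out.println(deduplicate(Arrays.asList("Home", "home", "HOME", "About")));
    }
}
```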
[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika
[ https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070930#comment-13070930 ] Hudson commented on NUTCH-1045: --- Integrated in Nutch-trunk #1557 (See [https://builds.apache.org/job/Nutch-trunk/1557/]) NUTCH-1045 Mimeutil uses default Tika config unless overridden jnioche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1150670 Files : * /nutch/trunk/conf/tika-mimetypes.xml * /nutch/trunk/src/java/org/apache/nutch/util/MimeUtil.java * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/CHANGES.txt > MimeUtil to rely on default config provided by Tika > --- > > Key: NUTCH-1045 > URL: https://issues.apache.org/jira/browse/NUTCH-1045 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.4, 2.0 >Reporter: Julien Nioche >Assignee: Julien Nioche >Priority: Minor > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch > > > We currently provide conf/tika-mimetypes.xml despite the fact that it is > identical to the one found in tika-core.jar. > Having a mechanism for specifying a custom tika-mimetypes.xml is good, > but if the user hasn't specified one or if it can't be loaded then we should > rely on Tika's default. This way we won't need to provide > conf/tika-mimetypes.xml anymore and keep it in sync with the default Tika one > whenever we upgrade Tika.
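The fallback described above can be sketched generically: try the user-specified file first, and use the library default when none is configured or when loading fails. The sketch deliberately avoids the Tika API; the class name, the method name, and the two loader parameters are hypothetical stand-ins for whatever MimeUtil actually calls.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

/** Illustrative only: prefer a custom config, fall back to the default. */
final class MimeConfigFallbackSketch {

    /**
     * @param customPath    path set by the user, or null/blank when unset
     * @param customLoader  hypothetical loader; returns empty when the file cannot be loaded
     * @param defaultLoader hypothetical supplier of the library's built-in config
     */
    static <T> T loadOrDefault(String customPath,
                               Function<String, Optional<T>> customLoader,
                               Supplier<T> defaultLoader) {
        if (customPath != null && !customPath.trim().isEmpty()) {
            try {
                Optional<T> custom = customLoader.apply(customPath);
                if (custom.isPresent()) {
                    return custom.get();
                }
            } catch (RuntimeException e) {
                // A failed custom load is not fatal here; fall through to the default.
            }
        }
        return defaultLoader.get();
    }

    public static void main(String[] args) {
        String chosen = MimeConfigFallbackSketch.<String>loadOrDefault(
                null, path -> Optional.of("custom:" + path), () -> "default");
        System.out.println(chosen);
    }
}
```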
[jira] [Commented] (NUTCH-1065) New mvn.template
[ https://issues.apache.org/jira/browse/NUTCH-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079783#comment-13079783 ] Hudson commented on NUTCH-1065: --- Integrated in Nutch-trunk #1567 (See [https://builds.apache.org/job/Nutch-trunk/1567/]) commit to address NUTCH-1065 - New mvn.template and update of changes.txt lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1153833 Files : * /nutch/trunk/conf/domain-urlfilter.txt * /nutch/trunk/ivy/mvn.template * /nutch/trunk/CHANGES.txt > New mvn.template > > > Key: NUTCH-1065 > URL: https://issues.apache.org/jira/browse/NUTCH-1065 > Project: Nutch > Issue Type: Task > Components: build >Affects Versions: 1.4, 2.0 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Trivial > Fix For: 1.4, 2.0 > > Attachments: NUTCH-1065-mvn-template-new.patch, > NUTCH-1065-trunk-mvn-template-new.patch > > > Removal of Otis from mvn.template file and addition of myself. This does not > alter functionality of any mvn or ivy tasks or files.
[jira] [Commented] (NUTCH-920) Project Metadata
[ https://issues.apache.org/jira/browse/NUTCH-920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082894#comment-13082894 ] Hudson commented on NUTCH-920: -- Integrated in Nutch-trunk #1573 (See [https://builds.apache.org/job/Nutch-trunk/1573/]) commit to address NUTCH-920 adding trunk 2.0 DOAP file to svn. lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156101 Files : * /nutch/trunk/doap.rdf > Project Metadata > > > Key: NUTCH-920 > URL: https://issues.apache.org/jira/browse/NUTCH-920 > Project: Nutch > Issue Type: Sub-task >Reporter: Julien Nioche >Assignee: Lewis John McGibbney > Attachments: doap_Apache_Nutch.rdf, doap_Nutch_trunk.rdf > >
[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083534#comment-13083534 ] Hudson commented on NUTCH-623: -- Integrated in Nutch-trunk-ant #5 (See [https://builds.apache.org/job/Nutch-trunk-ant/5/]) commit to revert changes by NUTCH-623 which broke tests. commit to address NUTCH-623 and changes.txt lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156712 Files : * /nutch/trunk/src/plugin/languageidentifier/plugin.xml * /nutch/trunk/src/plugin/languageidentifier/build.xml * /nutch/trunk/CHANGES.txt lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156692 Files : * /nutch/trunk/src/plugin/languageidentifier/plugin.xml * /nutch/trunk/src/plugin/languageidentifier/build.xml * /nutch/trunk/CHANGES.txt > Change plugin source directory "languageidentifier" to "language-identifier" > > > Key: NUTCH-623 > URL: https://issues.apache.org/jira/browse/NUTCH-623 > Project: Nutch > Issue Type: Improvement >Reporter: Ignacio J. Ortega >Assignee: Lewis John McGibbney >Priority: Trivial > Fix For: 1.4, 2.0 > > Attachments: NUTCH-623-branch-1.4-20110810.patch, > NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-trunk-2.0-20110810.patch > > > When trying to develop and debug Nutch in Eclipse, following the > instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you can't > run unless languageidentifier is renamed to language-identifier. When you later > issue an svn update, you end up having two languageidentifier src dirs, one with > the dash and another without it. It's only an annoyance, I know, but it > had me stuck for 2 weeks, so if it can be corrected...
[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083909#comment-13083909 ] Hudson commented on NUTCH-623: -- Integrated in Nutch-trunk #1575 (See [https://builds.apache.org/job/Nutch-trunk/1575/]) commit to revert changes by NUTCH-623 which broke tests. commit to address NUTCH-623 and changes.txt lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156712 Files : * /nutch/trunk/src/plugin/languageidentifier/plugin.xml * /nutch/trunk/src/plugin/languageidentifier/build.xml * /nutch/trunk/CHANGES.txt lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1156692 Files : * /nutch/trunk/src/plugin/languageidentifier/plugin.xml * /nutch/trunk/src/plugin/languageidentifier/build.xml * /nutch/trunk/CHANGES.txt > Change plugin source directory "languageidentifier" to "language-identifier" > > > Key: NUTCH-623 > URL: https://issues.apache.org/jira/browse/NUTCH-623 > Project: Nutch > Issue Type: Improvement >Reporter: Ignacio J. Ortega >Assignee: Lewis John McGibbney >Priority: Trivial > Fix For: 1.4, 2.0 > > Attachments: NUTCH-623-branch-1.4-20110810.patch, > NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-trunk-2.0-20110810.patch > > > When trying to develop and debug Nutch in Eclipse, following the > instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you can't > run unless languageidentifier is renamed to language-identifier. When you later > issue an svn update, you end up having two languageidentifier src dirs, one with > the dash and another without it. It's only an annoyance, I know, but it > had me stuck for 2 weeks, so if it can be corrected...
[jira] [Commented] (NUTCH-1099) Add HBase and Cassandra storage properties to nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102414#comment-13102414 ] Hudson commented on NUTCH-1099: --- Integrated in Nutch-trunk-ant #32 (See [https://builds.apache.org/job/Nutch-trunk-ant/32/]) commit to address NUTCH-1099 and update to changes.txt lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1169475 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml > Add HBase and Cassandra storage properties to nutch-default.xml > --- > > Key: NUTCH-1099 > URL: https://issues.apache.org/jira/browse/NUTCH-1099 > Project: Nutch > Issue Type: Improvement > Components: storage >Affects Versions: 2.0 > Environment: Ubuntu 11.04 natty >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Trivial > Fix For: 2.0 > > Attachments: NUTCH-1099-20110829.patch > > > I was getting fed up with manually adding the properties for HBase and Cassandra > to nutch-site.xml and thought that if we could at least add them to > nutch-default.xml and comment them out, then it would be a simple copy-paste > job rather than fetching the content from somewhere else I had it > stored. N.B. this changes no functionality, it just makes people's lives a bit > easier.
[jira] [Commented] (NUTCH-1099) Add HBase and Cassandra storage properties to nutch-default.xml
[ https://issues.apache.org/jira/browse/NUTCH-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102413#comment-13102413 ] Hudson commented on NUTCH-1099: --- Integrated in Nutch-trunk #1601 (See [https://builds.apache.org/job/Nutch-trunk/1601/]) commit to address NUTCH-1099 and update to changes.txt lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1169475 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml > Add HBase and Cassandra storage properties to nutch-default.xml > --- > > Key: NUTCH-1099 > URL: https://issues.apache.org/jira/browse/NUTCH-1099 > Project: Nutch > Issue Type: Improvement > Components: storage >Affects Versions: 2.0 > Environment: Ubuntu 11.04 natty >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Trivial > Fix For: 2.0 > > Attachments: NUTCH-1099-20110829.patch > > > I was getting fed up with manually adding the properties for HBase and Cassandra > to nutch-site.xml and thought that if we could at least add them to > nutch-default.xml and comment them out, then it would be a simple copy-paste > job rather than fetching the content from somewhere else I had it > stored. N.B. this changes no functionality, it just makes people's lives a bit > easier.
[jira] [Commented] (NUTCH-1114) Attr file missing in domain filter
[ https://issues.apache.org/jira/browse/NUTCH-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108360#comment-13108360 ] Hudson commented on NUTCH-1114: --- Integrated in Nutch-branch-1.4 #11 (See [https://builds.apache.org/job/Nutch-branch-1.4/11/]) NUTCH-1114 Attr file missing in domain filter markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1172637 Files : * /nutch/branches/branch-1.4/CHANGES.txt * /nutch/branches/branch-1.4/src/plugin/urlfilter-domain/plugin.xml > Attr file missing in domain filter > -- > > Key: NUTCH-1114 > URL: https://issues.apache.org/jira/browse/NUTCH-1114 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.3 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.4 > > > WARN org.apache.nutch.urlfilter.domain.DomainURLFilter: Attribute "file" is > not defined in plugin.xml for plugin urlfilter-domain > The file element in plugin.xml is commented out but should not be. Uncommenting it > results in an INFO message.
[jira] [Commented] (NUTCH-1067) Configure minimum throughput for fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108359#comment-13108359 ] Hudson commented on NUTCH-1067: --- Integrated in Nutch-branch-1.4 #11 (See [https://builds.apache.org/job/Nutch-branch-1.4/11/]) NUTCH-1067 Nutch-default configuration directives missing markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1172585 Files : * /nutch/branches/branch-1.4/conf/nutch-default.xml > Configure minimum throughput for fetcher > > > Key: NUTCH-1067 > URL: https://issues.apache.org/jira/browse/NUTCH-1067 > Project: Nutch > Issue Type: New Feature > Components: fetcher >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.4 > > Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1067-1.4-1.patch, > NUTCH-1067-1.4-2.patch, NUTCH-1067-1.4-3.patch, NUTCH-1067-1.4-4.patch > > > Large fetches can contain a lot of URLs for the same domain. These can be > very slow to crawl due to politeness from robots.txt, e.g. 10s per URL. If > all other URLs have been fetched, these queues can stall the entire > fetcher; 60 URLs can then take 10 minutes or even more. This can usually be > dealt with using the time bomb but the time bomb value is hard to determine. > This patch adds a fetcher.throughput.threshold setting meaning the minimum > number of pages per second before the fetcher gives up. It doesn't use the > global number of pages / running time but records the actual pages processed > in the previous second. This value is compared with the configured threshold. > Besides the check, the fetcher's status is also updated with the actual number > of pages per second and bytes per second.
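A minimal sketch of the check described above, assuming the fetcher exposes a running total of processed pages and calls the check once per second: the pages handled since the previous call are compared with the configured threshold, rather than dividing the global total by the running time. The class and member names are hypothetical and the real Fetcher code may differ; only the fetcher.throughput.threshold setting name comes from the issue.

```java
/** Illustrative only: per-second throughput check against a minimum threshold. */
final class ThroughputCheckSketch {

    // Minimum pages per second; stands in for fetcher.throughput.threshold.
    private final double thresholdPagesPerSec;

    // Running total observed at the previous check.
    private long pagesAtLastCheck;

    ThroughputCheckSketch(double thresholdPagesPerSec) {
        this.thresholdPagesPerSec = thresholdPagesPerSec;
    }

    /**
     * Intended to be called once per second with the running total of pages.
     * Returns true when the pages processed since the previous call fall
     * below the threshold, i.e. when the fetcher would give up.
     */
    boolean belowThreshold(long totalPagesProcessed) {
        long pagesLastSecond = totalPagesProcessed - pagesAtLastCheck;
        pagesAtLastCheck = totalPagesProcessed;
        return pagesLastSecond < thresholdPagesPerSec;
    }

    public static void main(String[] args) {
        ThroughputCheckSketch check = new ThroughputCheckSketch(5.0);
        System.out.println(check.belowThreshold(10)); // 10 pages in the first second
        System.out.println(check.belowThreshold(12)); // only 2 pages in the next second
    }
}
```

Using the delta per second means a fast start cannot mask a stall later in the fetch, which is the reason the issue gives for not using the global average.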
[jira] [Commented] (NUTCH-1115) Option to disable fixing of embedded params in DomContentUtils
[ https://issues.apache.org/jira/browse/NUTCH-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113148#comment-13113148 ] Hudson commented on NUTCH-1115: --- Integrated in Nutch-branch-1.4 #14 (See [https://builds.apache.org/job/Nutch-branch-1.4/14/]) Recommitted CHANGELOG entry for NUTCH-1115. Was overwritten by NUTCH-1078 commit NUTCH-1115 Option to disable fixing of URL embedded parameters in DomContentUtils markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174222 Files : * /nutch/branches/branch-1.4/CHANGES.txt markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174147 Files : * /nutch/branches/branch-1.4/conf/nutch-default.xml * /nutch/branches/branch-1.4/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java * /nutch/branches/branch-1.4/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java > Option to disable fixing of embedded params in DomContentUtils > -- > > Key: NUTCH-1115 > URL: https://issues.apache.org/jira/browse/NUTCH-1115 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 1.3 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.4 > > Attachments: NUTCH-1115-1.4-1.patch, NUTCH-1115-1.4-2.patch > > > Add option to disable fixing of embedded params: > http://lucene.472066.n3.nabble.com/Outlinks-with-embedded-params-td3332396.html > When enabled, millions of crap URLs are output as outlinks. This results in > many 404s in the DB and many very long URLs that actually lead to the same > page.
[jira] [Commented] (NUTCH-1078) Upgrade all instances of commons logging to slf4j (with log4j backend)
[ https://issues.apache.org/jira/browse/NUTCH-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113147#comment-13113147 ] Hudson commented on NUTCH-1078: --- Integrated in Nutch-branch-1.4 #14 (See [https://builds.apache.org/job/Nutch-branch-1.4/14/]) Recommitted CHANGELOG entry for NUTCH-1115. Was overwritten by NUTCH-1078 commit commit to address NUTCH-1078 and update of changes.txt markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174222 Files : * /nutch/branches/branch-1.4/CHANGES.txt lewismc : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174191 Files : * /nutch/branches/branch-1.4/CHANGES.txt * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Crawl.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDb.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDbFilter.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDbMerger.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDbReader.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/CrawlDbReducer.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/FetchScheduleFactory.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Generator.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Injector.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/LinkDb.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/LinkDbFilter.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/LinkDbMerger.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/LinkDbReader.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/MapWritable.java * 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/SignatureFactory.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/URLPartitioner.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/Fetcher.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/OldFetcher.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/IndexerMapReduce.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/IndexingFilters.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrClean.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrUtils.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/indexer/solr/SolrWriter.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/net/URLNormalizers.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/OutlinkExtractor.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParseOutputFormat.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParsePluginsReader.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParseResult.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParseSegment.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParseUtil.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParserChecker.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/parse/ParserFactory.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/plugin/PluginDescriptor.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/plugin/PluginManifestParser.java * 
/nutch/branches/branch-1.4/src/java/org/apache/nutch/plugin/PluginRepository.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/protocol/ProtocolFactory.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/LinkRank.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/Loops.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/NodeDumper.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/ScoreUpdater.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/segment/SegmentMergeFilters.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/segment/SegmentMerger.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/segment/SegmentReader.java * /nutch/branches/branch-1.4/src/java/org/apache/nutch/tools/Cra
[jira] [Commented] (NUTCH-1074) topN is ignored with maxNumSegments
[ https://issues.apache.org/jira/browse/NUTCH-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113890#comment-13113890 ] Hudson commented on NUTCH-1074: --- Integrated in Nutch-branch-1.4 #15 (See [https://builds.apache.org/job/Nutch-branch-1.4/15/]) NUTCH-1074 topN is ignored with maxNumSegments and generate.max.count markus : http://svn.apache.org/viewvc/nutch/branches/branch-1.4/viewvc/?view=rev&root=&revision=1174689 Files : * /nutch/branches/branch-1.4/CHANGES.txt * /nutch/branches/branch-1.4/src/java/org/apache/nutch/crawl/Generator.java > topN is ignored with maxNumSegments > --- > > Key: NUTCH-1074 > URL: https://issues.apache.org/jira/browse/NUTCH-1074 > Project: Nutch > Issue Type: Bug > Components: generator >Affects Versions: 1.3 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.4 > > Attachments: generator_fix.patch > > > When generating segments with topN and maxNumSegments, topN is not respected. > It looks like the first generated segment contains topN * maxNumSegments > URLs; at least the number of map input records roughly matches.
[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114129#comment-13114129 ] Hudson commented on NUTCH-623: -- Integrated in Nutch-trunk #1611 (See [https://builds.apache.org/job/Nutch-trunk/1611/]) commit to address NUTCH-623 and update to changes.txt lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175188 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/build.xml * /nutch/trunk/src/plugin/language-identifier * /nutch/trunk/src/plugin/language-identifier/build.xml * /nutch/trunk/src/plugin/language-identifier/ivy.xml * /nutch/trunk/src/plugin/language-identifier/plugin.xml * /nutch/trunk/src/plugin/language-identifier/src * /nutch/trunk/src/plugin/languageidentifier > Change plugin source directory "languageidentifier" to "language-identifier" > > > Key: NUTCH-623 > URL: https://issues.apache.org/jira/browse/NUTCH-623 > Project: Nutch > Issue Type: Improvement >Reporter: Ignacio J. Ortega >Assignee: Lewis John McGibbney >Priority: Trivial > Fix For: 1.4, 2.0 > > Attachments: NUTCH-623-branch-1.4-20110810.patch, > NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-branch-1.4-20110910-v2.patch, > NUTCH-623-trunk-1.4-20110924.patch, NUTCH-623-trunk-2.0-20110810.patch > > > When trying to develop and debug Nutch in Eclipse, following the > instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you can't > run unless languageidentifier is renamed to language-identifier. When you later > issue an svn update, you end up having two languageidentifier src dirs, one with > the dash and another without it. It's only an annoyance, I know, but it > had me stuck for 2 weeks, so if it can be corrected...
[jira] [Commented] (NUTCH-623) Change plugin source directory "languageidentifier" to "language-identifier"
[ https://issues.apache.org/jira/browse/NUTCH-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114785#comment-13114785 ] Hudson commented on NUTCH-623: -- Integrated in Nutch-trunk #1613 (See [https://builds.apache.org/job/Nutch-trunk/1613/]) NUTCH-623 fix source directory siren : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1175739 Files : * /nutch/trunk/build.xml > Change plugin source directory "languageidentifier" to "language-identifier" > > > Key: NUTCH-623 > URL: https://issues.apache.org/jira/browse/NUTCH-623 > Project: Nutch > Issue Type: Improvement >Reporter: Ignacio J. Ortega >Assignee: Lewis John McGibbney >Priority: Trivial > Fix For: 1.4, 2.0 > > Attachments: NUTCH-623-branch-1.4-20110810.patch, > NUTCH-623-branch-1.4-20110810.patch, NUTCH-623-branch-1.4-20110910-v2.patch, > NUTCH-623-trunk-1.4-20110924.patch, NUTCH-623-trunk-2.0-20110810.patch > > > When trying to develop and debug Nutch in Eclipse, following the > instructions at http://wiki.apache.org/nutch/RunNutchInEclipse0%2e9, you can't > run unless languageidentifier is renamed to language-identifier. When you later > issue an svn update, you end up having two languageidentifier src dirs, one with > the dash and another without it. It's only an annoyance, I know, but it > had me stuck for 2 weeks, so if it can be corrected...
[jira] [Commented] (NUTCH-1189) add commented out default settings to gora.properties files
[ https://issues.apache.org/jira/browse/NUTCH-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263362#comment-13263362 ] Hudson commented on NUTCH-1189: --- Integrated in Nutch-nutchgora #240 (See [https://builds.apache.org/job/Nutch-nutchgora/240/]) NUTCH-1189 (Update gora.properties for HBase to reflect Gora 0.2) (Revision 1330744) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/conf/gora.properties > add commented out default settings to gora.properties files > > > Key: NUTCH-1189 > URL: https://issues.apache.org/jira/browse/NUTCH-1189 > Project: Nutch > Issue Type: Sub-task > Components: storage >Affects Versions: nutchgora >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: nutchgora > > Attachments: NUTCH-1189-v2.patch, NUTCH-1189-v3.patch, > NUTCH-1189-v4.patch, NUTCH-1189.patch > > > This issue should have been dealt with as part of its parent issue; however, > I think as it is a fairly large task in itself, it needs to be done > independently. The gora.properties file should, amongst other settings, and > beside the extremely basic defaults for sqlstore, include defaults for opening > HBase, Cassandra, etc. servers on their default ports. Leaving this down > to individual interpretation puts a huge onus on the user, hence > constructing a barrier to entry for getting the configuration settings up and > running. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263363#comment-13263363 ] Hudson commented on NUTCH-882: -- Integrated in Nutch-nutchgora #240 (See [https://builds.apache.org/job/Nutch-nutchgora/240/]) NUTCH-882 Design a Host table in GORA (Revision 1330728) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/build.xml * /nutch/branches/nutchgora/conf/gora-hbase-mapping.xml * /nutch/branches/nutchgora/default.properties * /nutch/branches/nutchgora/ivy/ivy.xml * /nutch/branches/nutchgora/src/gora/host.avsc * /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/host * /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDb.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbReader.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbUpdateJob.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDbUpdateReducer.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostInjectorJob.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerReducer.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/Host.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/StorageUtils.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/WebTableCreator.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/util/Histogram.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/util/TableUtil.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/util/domain/DomainStatistics.java > Design a Host table in GORA > --- > > Key: NUTCH-882 > URL: https://issues.apache.org/jira/browse/NUTCH-882 > Project: Nutch > Issue Type: New Feature >Affects Versions: nutchgora >Reporter: Julien Nioche > Fix For: nutchgora > > Attachments: 
NUTCH-882-v1.patch, NUTCH-882-v3.txt, NUTCH-882-v3.txt, > hostdb.patch > > > Having a separate GORA table for storing information about hosts (and > domains?) would be very useful for : > * customising the behaviour of the fetching on a host basis e.g. number of > threads, min time between threads etc... > * storing stats > * keeping metadata and possibly propagate them to the webpages > * keeping a copy of the robots.txt and possibly use that later to filter the > webtable > * store sitemaps files and update the webtable accordingly > I'll try to come up with a GORA schema for such a host table but any comments > are of course already welcome -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-902) Add all necessary files and configuration so that nutch can be used with different backends out-of-the-box
[ https://issues.apache.org/jira/browse/NUTCH-902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263365#comment-13263365 ] Hudson commented on NUTCH-902: -- Integrated in Nutch-nutchgora #240 (See [https://builds.apache.org/job/Nutch-nutchgora/240/]) NUTCH-902 (merge different "storage.data.store.class" entries into one) (Revision 1330807) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/conf/nutch-default.xml > Add all necessary files and configuration so that nutch can be used with > different backends out-of-the-box > -- > > Key: NUTCH-902 > URL: https://issues.apache.org/jira/browse/NUTCH-902 > Project: Nutch > Issue Type: New Feature > Components: documentation, storage >Affects Versions: nutchbase >Reporter: Enis Soztutar >Assignee: Lewis John McGibbney > Fix For: nutchgora > > Attachments: NUTCH-902-v2.patch, NUTCH-902-v3.patch, NUTCH-902.patch > > > As per the discussion in the mailing list and > http://wiki.apache.org/nutch/GORA_HBase, it will be good to include all the > necessary files and configuration. I propose that we maintain configuration > for at least SQL, HBase and Cassandra. > The following changes are needed: > conf/gora-sql-mapping.xml > conf/gora-hbase-mapping.xml > conf/gora-cassandra-mapping.xml > comments on nutch-default and ivy.xml > Shall we also include jars from gora-hbase, gora-cassandra and their > dependencies ? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1340) Increase scalability by only removing markers when they actually exist for DbUpdaterReducer
[ https://issues.apache.org/jira/browse/NUTCH-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13263364#comment-13263364 ] Hudson commented on NUTCH-1340: --- Integrated in Nutch-nutchgora #240 (See [https://builds.apache.org/job/Nutch-nutchgora/240/]) NUTCH-1340 Increase scalability by only removing markers when they actually exist for DbUpdaterReducer (Revision 1330722) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/DbUpdateReducer.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/Mark.java > Increase scalability by only removing markers when they actually exist for > DbUpdaterReducer > --- > > Key: NUTCH-1340 > URL: https://issues.apache.org/jira/browse/NUTCH-1340 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: nutchgora > > Attachments: NUTCH-1340-v1.txt, NUTCH-1340-v2.txt > > > After applying GORA-120 (this already is a huge performance boost by itself) > one of the major bottlenecks of the DbUpdaterReducer is the deletion of the > markers. The update reducer simply sets every row to delete its markers. A > lot of rows do not actually have the markers but the deletes are fired away > in any case. Because the markers are already always on the input, a simple > check to see if they exist greatly improves performance. > In particular it is very expensive in HBase, because every single Delete > immediately triggers a connection to the regionservers. (They ignore the > "autoflush=false" directive). Although deletes can be done in batch, this is > currently not supported by Gora. For one it is very difficult to implement in > the current HBaseStore with regard to multithreading, and secondly I noticed > performance did not increase significantly. > By performance debugging on a real life cluster this currently seems to be > the biggest bottleneck of the DbUpdaterReducer. 
(Remember only after applying > GORA-120)
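The check-before-delete idea described in NUTCH-1340 can be sketched roughly as below. This is a minimal illustration only, not the committed DbUpdateReducer or Mark.java code: the class `MarkerRow`, its methods, and the marker name `fetch_mark` are all hypothetical, and the "store" is a plain in-memory map whose delete counter stands in for the expensive round trips mentioned in the issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a stored row whose marker deletes are expensive. */
class MarkerRow {
    private final Map<String, String> markers = new HashMap<>();
    private int deleteCount = 0;

    void putMarker(String name, String batchId) {
        markers.put(name, batchId);
    }

    /** Simulates one costly delete round trip to the backing store. */
    private void deleteMarker(String name) {
        deleteCount++;
        markers.remove(name);
    }

    /** Issues the delete only when the marker is present; returns true if a delete was fired. */
    boolean removeMarkerIfPresent(String name) {
        if (!markers.containsKey(name)) {
            return false; // nothing to delete, so skip the expensive call
        }
        deleteMarker(name);
        return true;
    }

    int getDeleteCount() {
        return deleteCount;
    }

    public static void main(String[] args) {
        MarkerRow row = new MarkerRow();
        row.putMarker("fetch_mark", "batch-1"); // hypothetical marker name
        row.removeMarkerIfPresent("fetch_mark"); // fires one delete
        row.removeMarkerIfPresent("fetch_mark"); // marker already gone, no delete
        System.out.println("deletes issued: " + row.getDeleteCount());
    }
}
```

The point of the sketch is only that the existence check is cheap because the markers are already on the input, while each avoided delete saves a call to the store.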
[jira] [Commented] (NUTCH-1205) Upgrade gora modules to 0.2 in ivy/ivy.xml
[ https://issues.apache.org/jira/browse/NUTCH-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268112#comment-13268112 ] Hudson commented on NUTCH-1205: --- Integrated in Nutch-nutchgora #244 (See [https://builds.apache.org/job/Nutch-nutchgora/244/]) NUTCH-1205 Upgrade gora modules to 0.2 in ivy/ivy.xml (addition) (Revision 1333551) NUTCH-1205 Upgrade gora modules to 0.2 in ivy/ivy.xml (addition) (Revision 1333547) NUTCH-1205 Upgrade gora modules to 0.2 in ivy/ivy.xml (Revision 1333435) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/ivy/ivy.xml ferdy : Files : * /nutch/branches/nutchgora/ivy/ivy.xml ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/build.xml * /nutch/branches/nutchgora/conf/gora.properties * /nutch/branches/nutchgora/ivy/ivy.xml * /nutch/branches/nutchgora/src/java/org/apache/nutch/storage/StorageUtils.java * /nutch/branches/nutchgora/src/test/gora.properties * /nutch/branches/nutchgora/src/test/org/apache/nutch/storage/TestGoraStorage.java * /nutch/branches/nutchgora/src/test/org/apache/nutch/util/AbstractNutchTest.java * /nutch/branches/nutchgora/src/testprocess * /nutch/branches/nutchgora/src/testprocess/gora.properties > Upgrade gora modules to 0.2 in ivy/ivy.xml > -- > > Key: NUTCH-1205 > URL: https://issues.apache.org/jira/browse/NUTCH-1205 > Project: Nutch > Issue Type: Improvement > Components: storage >Affects Versions: nutchgora >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Blocker > Fix For: nutchgora > > Attachments: NUTCH-1205-v10.patch, NUTCH-1205-v11-addition.patch, > NUTCH-1205-v11.patch, NUTCH-1205-v2.patch, NUTCH-1205-v3.patch, > NUTCH-1205-v4.patch, NUTCH-1205-v5.patch, NUTCH-1205-v5.patch, > NUTCH-1205-v6.patch, NUTCH-1205.patch > > > Although gora trunk is unstable, work is ongoing to get this fixed. For the > time being, I think Nutchgora should use gora trunk as this will identify > more vulnerabilities. 
I'll get the trivial patch submitted shortly.
[jira] [Commented] (NUTCH-1350) remove unused dependancy because of access restriction
[ https://issues.apache.org/jira/browse/NUTCH-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13268894#comment-13268894 ] Hudson commented on NUTCH-1350: --- Integrated in Nutch-nutchgora #245 (See [https://builds.apache.org/job/Nutch-nutchgora/245/]) NUTCH-1350 remove unused dependancy because of access restriction (Revision 1333803) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/test/org/apache/nutch/util/CrawlTestUtil.java > remove unused dependancy because of access restriction > -- > > Key: NUTCH-1350 > URL: https://issues.apache.org/jira/browse/NUTCH-1350 > Project: Nutch > Issue Type: Bug >Reporter: Ferdy Galema >Priority: Trivial > Fix For: nutchgora > > > CrawlTestUtil has an unused dependancy com.sun.net.httpserver.HttpContext > that sometimes causes an "access restriction" error when used with certain > jdks. I figured since it isn't used anyway I can just remove it. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271069#comment-13271069 ] Hudson commented on NUTCH-1352: --- Integrated in Nutch-nutchgora #248 (See [https://builds.apache.org/job/Nutch-nutchgora/248/]) NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 1335066) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java * /nutch/branches/nutchgora/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java * /nutch/branches/nutchgora/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java * /nutch/branches/nutchgora/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * /nutch/branches/nutchgora/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java > Improve regex urlfilters/normalizers synchronization > > > Key: NUTCH-1352 > URL: https://issues.apache.org/jira/browse/NUTCH-1352 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch > > > I noticed that during fetching the fetcher threads spend a lot of the time > blocking on a monitor because of outlink normalizing/filtering. The cause of > this: some of the regex plugins use single lock synchronization. > This patch improves throughput by removing synchronization locks and replacing > them with threadlocals where needed. > It has been extensively tested in production. I will commit this later today > if there is no objection. -- This message is automatically generated by JIRA. 
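The lock-to-threadlocal change described in NUTCH-1352 could look roughly like the sketch below. It is an illustration under assumptions, not the patched RegexRule or RegexURLNormalizer code: the class `ThreadLocalRegexRule` and its `matches` method are hypothetical. It relies only on the documented facts that `java.util.regex.Pattern` is safe to share between threads while `Matcher` is not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical URL rule that keeps one Matcher per thread instead of one shared lock. */
class ThreadLocalRegexRule {
    private final ThreadLocal<Matcher> matcher;

    ThreadLocalRegexRule(String regex) {
        // Pattern is immutable and thread-safe, so it can be shared by all threads.
        Pattern compiled = Pattern.compile(regex);
        // Matcher is stateful, so every thread lazily gets its own instance.
        this.matcher = ThreadLocal.withInitial(() -> compiled.matcher(""));
    }

    /** No synchronized block needed: each caller resets and uses its own Matcher. */
    boolean matches(String url) {
        return matcher.get().reset(url).find();
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadLocalRegexRule rule = new ThreadLocalRegexRule("\\.(gif|jpg|png)$");
        Runnable work = () -> System.out.println(
                Thread.currentThread().getName() + " -> " + rule.matches("http://example.com/logo.png"));
        Thread a = new Thread(work, "fetcher-a");
        Thread b = new Thread(work, "fetcher-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

The trade-off is a little extra memory per thread in exchange for threads never waiting on each other's regex evaluation.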
[jira] [Commented] (NUTCH-1349) Make batchId explcit within debug logging and improve CLI
[ https://issues.apache.org/jira/browse/NUTCH-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271068#comment-13271068 ] Hudson commented on NUTCH-1349: --- Integrated in Nutch-nutchgora #248 (See [https://builds.apache.org/job/Nutch-nutchgora/248/]) Commit to address NUTCH-1349 and update to CHANGES.txt (Revision 1335436) Result = SUCCESS lewismc : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/conf/log4j.properties * /nutch/branches/nutchgora/src/bin/nutch * /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/WebTableReader.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerJob.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java > Make batchId explcit within debug logging and improve CLI > - > > Key: NUTCH-1349 > URL: https://issues.apache.org/jira/browse/NUTCH-1349 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: nutchgora >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: nutchgora > > Attachments: NUTCH-1349-v2.patch, NUTCH-1349-v2.patch, > NUTCH-1349.patch > > > I find this a pain when trying to locate the batchId of some urls which are > skipped when going to the Solr index. 
My DEBUG log output gives me > {code} > 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - > Skipping http://www.glasgowwheelers.com/; different batch id > 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - > Skipping http://www.heraldscotland.com/; different batch id > {code} > when I would actually like > {code} > 2012-05-03 20:44:55,268 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - > Skipping http://www.glasgowwheelers.com/; different batch id (ACTUAL BATCH ID) > 2012-05-03 20:44:55,259 DEBUG indexer.IndexerJob (IndexerJob.java:map(83)) - > Skipping http://www.heraldscotland.com/; different batch id (ACTUAL BATCH ID) > {code} > patch coming up soon -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
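The requested log format from NUTCH-1349 amounts to appending the actual batch id to the existing message. A minimal sketch of such a message builder follows; the class and method names are hypothetical and this is not the code committed to IndexerJob.java.

```java
/** Hypothetical helper that builds the debug message requested in the issue. */
class SkipMessage {
    /** Formats the skip message so the page's actual batch id is visible in the log. */
    static String skipMessage(String url, String actualBatchId) {
        return "Skipping " + url + "; different batch id (" + actualBatchId + ")";
    }

    public static void main(String[] args) {
        // The batch id value here is made up for illustration.
        System.out.println(skipMessage("http://www.glasgowwheelers.com/", "1336070000-12345"));
    }
}
```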
[jira] [Commented] (NUTCH-1353) nutchgora DomainStatistics support crawlId, counter bug and reformatting
[ https://issues.apache.org/jira/browse/NUTCH-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271070#comment-13271070 ] Hudson commented on NUTCH-1353: --- Integrated in Nutch-nutchgora #248 (See [https://builds.apache.org/job/Nutch-nutchgora/248/]) NUTCH-1353 nutchgora DomainStatistics support crawlId, counter bug and reformatting (Revision 1334936) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/java/org/apache/nutch/util/domain/DomainStatistics.java > nutchgora DomainStatistics support crawlId, counter bug and reformatting > > > Key: NUTCH-1353 > URL: https://issues.apache.org/jira/browse/NUTCH-1353 > Project: Nutch > Issue Type: Bug >Reporter: Ferdy Galema >Priority: Minor > Fix For: nutchgora > > Attachments: NUTCH-1353.patch > > > This patch fixes three issues about nutchgora DomainStatistics: > -crawlId support (note I closed NUTCH-1290 because I thought DomainStatistics > was already fixed. This was not the case.) > -A counter bug (NOT_FETCHED should be increased instead of FETCHED) > -reformatting (convert tabs to spaces and clear unused imports) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1354) nutchgora support fetcher.queue.depth.multiplier property
[ https://issues.apache.org/jira/browse/NUTCH-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271071#comment-13271071 ] Hudson commented on NUTCH-1354: --- Integrated in Nutch-nutchgora #248 (See [https://builds.apache.org/job/Nutch-nutchgora/248/]) NUTCH-1354 nutchgora support fetcher.queue.depth.multiplier property (Revision 1334945) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/conf/nutch-default.xml * /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java > nutchgora support fetcher.queue.depth.multiplier property > - > > Key: NUTCH-1354 > URL: https://issues.apache.org/jira/browse/NUTCH-1354 > Project: Nutch > Issue Type: New Feature >Reporter: Ferdy Galema >Priority: Minor > Fix For: nutchgora > > Attachments: NUTCH-1354.patch > > > Like trunk, nutchgora should support fetcher.queue.depth.multiplier property > too. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1355) nutchgora Configure minimum throughput for fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271072#comment-13271072 ] Hudson commented on NUTCH-1355: --- Integrated in Nutch-nutchgora #248 (See [https://builds.apache.org/job/Nutch-nutchgora/248/]) NUTCH-1355 nutchgora Configure minimum throughput for fetcher (Revision 1335063) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/conf/nutch-default.xml * /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherReducer.java > nutchgora Configure minimum throughput for fetcher > -- > > Key: NUTCH-1355 > URL: https://issues.apache.org/jira/browse/NUTCH-1355 > Project: Nutch > Issue Type: New Feature >Reporter: Ferdy Galema > Fix For: nutchgora > > Attachments: NUTCH-1355.patch > > > Like trunk, nutchgora should also have a feature to configure the fetcher > with a minimum throughput. (See NUTCH-1067 for the work done by Markus). > It's implemented in almost the same way, except that the number of times > throughput falls below threshold is measured sequentially. (The counter is > reset when throughput is healthy again; this should work even better against > temporary dips). > Defaults to disabled. Will commit later today if there is no objection. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
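The sequential counting described in NUTCH-1355 (count consecutive low-throughput checks, reset when throughput recovers) can be sketched as follows. This is only an illustration: the class `ThroughputMonitor`, its parameter names, and the choice to trip once the count reaches the limit are assumptions, not the logic committed to FetcherReducer.java.

```java
/** Hypothetical monitor that trips only after several consecutive low-throughput checks. */
class ThroughputMonitor {
    private final double minPagesPerSecond;
    private final int maxConsecutiveLowChecks;
    private int consecutiveLowChecks = 0;

    ThroughputMonitor(double minPagesPerSecond, int maxConsecutiveLowChecks) {
        this.minPagesPerSecond = minPagesPerSecond;
        this.maxConsecutiveLowChecks = maxConsecutiveLowChecks;
    }

    /** Records one measurement; returns true when the fetcher should give up. */
    boolean check(double pagesPerSecond) {
        if (pagesPerSecond < minPagesPerSecond) {
            consecutiveLowChecks++;
        } else {
            consecutiveLowChecks = 0; // healthy again, so a temporary dip is forgiven
        }
        return consecutiveLowChecks >= maxConsecutiveLowChecks;
    }

    public static void main(String[] args) {
        ThroughputMonitor monitor = new ThroughputMonitor(10.0, 3);
        double[] samples = {5, 5, 20, 5, 5, 5};
        for (double sample : samples) {
            System.out.println(sample + " pages/s -> stop=" + monitor.check(sample));
        }
    }
}
```

With the sample series above, the early dip of two low readings is cancelled by the healthy reading, and only the later run of three low readings in a row trips the monitor.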
[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271073#comment-13271073 ] Hudson commented on NUTCH-1356: --- Integrated in Nutch-nutchgora #248 (See [https://builds.apache.org/job/Nutch-nutchgora/248/]) NUTCH-1356 ParseUtil use ExecutorService instead of manually thread handling. (Revision 1335065) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParseUtil.java > ParseUtil use ExecutorService instead of manually thread handling. > -- > > Key: NUTCH-1356 > URL: https://issues.apache.org/jira/browse/NUTCH-1356 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, > NUTCH-1356.patch > > > Because ParseUtil manages its own parser threads by creating a thread for > every parse, it sometimes happens that specific parsers are very expensive. > For example, parsers that have threadlocal fields will initialize them for > every item to be parsed. > By simply introducing a caching ExecutorService the ParseUtil will be able to > cache threads, therefore making parsing more efficient. See attached patch. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
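The caching ExecutorService approach from NUTCH-1356 might be sketched like this. It is not the patched ParseUtil: the class `CachedParseRunner`, the string-returning parse task, the daemon thread factory, and the timeout handling are all assumptions made for the illustration. Only the standard `java.util.concurrent` API is used.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical runner that reuses pooled threads instead of creating one thread per parse. */
class CachedParseRunner {
    // A cached pool reuses idle threads, so per-thread state (e.g. ThreadLocal fields)
    // is not rebuilt for every parsed item. Daemon threads keep this sketch from
    // blocking JVM exit.
    private final ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable);
        thread.setDaemon(true);
        return thread;
    });

    /** Runs the task on a pooled thread; returns null if it does not finish in time. */
    String parseWithTimeout(Callable<String> parseTask, long timeoutSeconds) {
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck parse
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for parse", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Parse failed", e.getCause());
        }
    }

    void shutdown() {
        executor.shutdown();
    }

    public static void main(String[] args) {
        CachedParseRunner runner = new CachedParseRunner();
        System.out.println(runner.parseWithTimeout(() -> "parsed text", 5));
        runner.shutdown();
    }
}
```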
[jira] [Commented] (NUTCH-1358) Do not accept bogus arguments
[ https://issues.apache.org/jira/browse/NUTCH-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273026#comment-13273026 ] Hudson commented on NUTCH-1358: --- Integrated in Nutch-nutchgora #249 (See [https://builds.apache.org/job/Nutch-nutchgora/249/]) NUTCH-1358 Do not accept bogus arguments (Revision 1336204) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/DbUpdaterJob.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/crawl/InjectorJob.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java > Do not accept bogus arguments > - > > Key: NUTCH-1358 > URL: https://issues.apache.org/jira/browse/NUTCH-1358 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema >Priority: Minor > Fix For: nutchgora > > Attachments: NUTCH-1358.patch > > > Some of the tools do not explicitly check every passed argument for > validity. This can mask very frustrating issues because one passes wrong > arguments and the tool does not fail fast. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
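The fail-fast argument checking asked for in NUTCH-1358 can be illustrated with a strict parser that rejects anything it does not recognise. This is a sketch only: the class `StrictArgs` and the option names in `KNOWN` are hypothetical and are not taken from the patched DbUpdaterJob, InjectorJob or FetcherJob.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical strict parser for "-name value" style command-line arguments. */
class StrictArgs {
    // Option names are placeholders for this sketch.
    private static final Set<String> KNOWN = new HashSet<>(Arrays.asList("-crawlId", "-threads"));

    /** Parses the arguments and throws immediately on an unknown or incomplete option. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String name = args[i];
            if (!KNOWN.contains(name)) {
                throw new IllegalArgumentException("Unrecognized argument: " + name);
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + name);
            }
            options.put(name, args[++i]); // consume the value that follows the option
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {"-crawlId", "test", "-threads", "10"}));
        try {
            parse(new String[] {"-thread", "10"}); // typo: silently ignoring it would hide the mistake
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```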
[jira] [Commented] (NUTCH-1026) Strip UTF-8 non-character codepoints
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273027#comment-13273027 ] Hudson commented on NUTCH-1026: --- Integrated in Nutch-nutchgora #249 (See [https://builds.apache.org/job/Nutch-nutchgora/249/]) NUTCH-1026 Strip UTF-8 non-character codepoints (Revision 1336643) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/conf/log4j.properties * /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/solr/SolrWriter.java > Strip UTF-8 non-character codepoints > > > Key: NUTCH-1026 > URL: https://issues.apache.org/jira/browse/NUTCH-1026 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: nutchgora >Reporter: Markus Jelsma > Fix For: nutchgora > > > During a very large crawl I found a few documents producing non-character > codepoints. When indexing to Solr this will yield the following exception: > {code} > SEVERE: java.lang.RuntimeException: [was class > java.io.CharConversionException] Invalid UTF-8 character 0x at char > #1142033, byte #1155068) > at > com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) > at > com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) > {code} > Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the > content field to a method to strip away non-characters. I'm not too sure > about this implementation but the tests I've done locally with a huge dataset > now pass correctly. Here's a list of codepoints to strip away: > http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:] > Please comment! -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
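A filter of the kind NUTCH-1026 describes could be sketched as below. This is not the method committed to SolrWriter.java: the class `NonCharacterStripper` is hypothetical. It applies the Unicode definition of noncharacters, namely U+FDD0 through U+FDEF plus every code point whose low 16 bits are FFFE or FFFF.

```java
/** Hypothetical helper that drops Unicode noncharacter code points from a string. */
class NonCharacterStripper {
    /** True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF. */
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Returns the input with all noncharacter code points removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
                .filter(codePoint -> !isNonCharacter(codePoint))
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "con" + (char) 0xFFFF + "tent";
        System.out.println(strip(dirty)); // prints "content"
    }
}
```

Iterating by code point rather than by `char` keeps supplementary-plane noncharacters such as U+1FFFE in scope, at the cost of a stream pass over the text.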
[jira] [Commented] (NUTCH-1362) Fix error handling of urls with empty fields
[ https://issues.apache.org/jira/browse/NUTCH-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273829#comment-13273829 ] Hudson commented on NUTCH-1362: --- Integrated in Nutch-nutchgora #250 (See [https://builds.apache.org/job/Nutch-nutchgora/250/]) NUTCH-1362 Fix error handling of urls with empty fields (Revision 1337091) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/java/org/apache/nutch/util/TableUtil.java > Fix error handling of urls with empty fields > - > > Key: NUTCH-1362 > URL: https://issues.apache.org/jira/browse/NUTCH-1362 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora >Reporter: Lewis John McGibbney > Fix For: nutchgora > > Attachments: NUTCH-1362.patch > > > Within o.a.n.util.TableUtil.reverseAppendSplits() a simple if (split.length > > 0) block enables us to address this issue. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
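The guard mentioned in NUTCH-1362 can be illustrated with a small host-reversing routine. This is a sketch under assumptions rather than the actual TableUtil code: the class `HostReverser`, the method `reverseHost`, and the use of `String.split` are hypothetical choices for the example. It does show why the check matters: `String.split` drops trailing empty strings, so an input such as "." yields an empty array.

```java
/** Hypothetical helper that reverses the dot-separated parts of a host name. */
class HostReverser {
    static String reverseHost(String host) {
        String[] splits = host.split("\\.");
        StringBuilder buf = new StringBuilder();
        // Append every part except the first, last to second, each followed by a dot.
        for (int i = splits.length - 1; i > 0; i--) {
            buf.append(splits[i]).append('.');
        }
        // Guard: without this check an empty array would cause an
        // ArrayIndexOutOfBoundsException on splits[0].
        if (splits.length > 0) {
            buf.append(splits[0]);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseHost("www.example.com")); // prints "com.example.www"
        System.out.println("[" + reverseHost(".") + "]");   // prints "[]"
    }
}
```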
[jira] [Commented] (NUTCH-1366) speed up indexing by eliminating the indexreducer
Hudson commented on NUTCH-1366 speed up indexing by eliminating the indexreducer Integrated in Nutch-nutchgora #253 (See https://builds.apache.org/job/Nutch-nutchgora/253/) NUTCH-1366 speed up indexing by eliminating the indexreducer (Revision 1338217) Result = SUCCESS ferdy : Files : /nutch/branches/nutchgora/CHANGES.txt /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexUtil.java /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerJob.java /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/IndexerReducer.java This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1378) HostDb NullPointerException
[ https://issues.apache.org/jira/browse/NUTCH-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282201#comment-13282201 ] Hudson commented on NUTCH-1378: --- Integrated in Nutch-nutchgora #262 (See [https://builds.apache.org/job/Nutch-nutchgora/262/]) NUTCH-1378 HostDb NullPointerException (Revision 1341879) Result = SUCCESS ferdy : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/java/org/apache/nutch/host/HostDb.java > HostDb NullPointerException > --- > > Key: NUTCH-1378 > URL: https://issues.apache.org/jira/browse/NUTCH-1378 > Project: Nutch > Issue Type: Bug >Reporter: Ferdy Galema > Fix For: nutchgora > > Attachments: NUTCH-1378.patch > > > This is a no-brainer to fix a NPE when using the HostDb functionality. Will > attach patch and commit right away. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1381) Allow to override default subcollection field name
[ https://issues.apache.org/jira/browse/NUTCH-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291218#comment-13291218 ] Hudson commented on NUTCH-1381: --- Integrated in nutch-trunk-maven #299 (See [https://builds.apache.org/job/nutch-trunk-maven/299/]) NUTCH-1381 Allow to override default subcollection field name (Revision 1347744) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java > Allow to override default subcollection field name > -- > > Key: NUTCH-1381 > URL: https://issues.apache.org/jira/browse/NUTCH-1381 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1381-1.6-1.patch > > > The subcollection filter by default uses the subcollection field name but > since NUTCH-1266 allows to override it per subcollection. This issue should > introduce a configuration directive to override the default field name > globally. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291219#comment-13291219 ] Hudson commented on NUTCH-1320: --- Integrated in nutch-trunk-maven #299 (See [https://builds.apache.org/job/nutch-trunk-maven/299/]) NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java * /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java * /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java > IndexChecker and ParseChecker choke on IDN's > > > Key: NUTCH-1320 > URL: https://issues.apache.org/jira/browse/NUTCH-1320 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1320-1.5-1.patch > > > These handy debug tools do not handle IDN's and throw an NPE > bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81 > {code} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at > org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116) > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1351) DomainStatistics to aggregate by TLD
[ https://issues.apache.org/jira/browse/NUTCH-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291220#comment-13291220 ] Hudson commented on NUTCH-1351: --- Integrated in nutch-trunk-maven #299 (See [https://builds.apache.org/job/nutch-trunk-maven/299/]) NUTCH-1351 DomainStatistics to aggregate by TLD (Revision 1347747) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java * /nutch/trunk/src/java/org/apache/nutch/util/domain/DomainStatistics.java > DomainStatistics to aggregate by TLD > > > Key: NUTCH-1351 > URL: https://issues.apache.org/jira/browse/NUTCH-1351 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1351-1.6-1.patch > > > The DomainStatistics tool aggregates counts by host, domain or suffix but tld > is missing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
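To illustrate the difference between aggregating by suffix and by TLD mentioned above, here is a small sketch that counts hosts by their last label. It is an assumption-laden illustration, not the `DomainStatistics` implementation: the class `TldCountSketch` and its methods are hypothetical, and the TLD is simply taken as the text after the final dot.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; does not use Nutch's URLUtil or DomainStatistics.
final class TldCountSketch {

  /** Returns the last dot-separated label of a host, e.g. "uk" for "www.example.co.uk". */
  static String tld(String host) {
    int dot = host.lastIndexOf('.');
    return dot < 0 ? host : host.substring(dot + 1);
  }

  /** Counts how many hosts fall under each TLD. */
  static Map<String, Integer> countByTld(List<String> hosts) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String host : hosts) {
      counts.merge(tld(host.toLowerCase(Locale.ROOT)), 1, Integer::sum);
    }
    return counts;
  }
}
```

With this approach `www.example.co.uk` is counted under `uk`, whereas a suffix-based aggregation would count it under `co.uk`.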
[jira] [Commented] (NUTCH-1346) Follow outlinks to ignore external
[ https://issues.apache.org/jira/browse/NUTCH-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291596#comment-13291596 ] Hudson commented on NUTCH-1346: --- Integrated in nutch-trunk-maven #301 (See [https://builds.apache.org/job/nutch-trunk-maven/301/]) NUTCH-1346 Follow outlinks to ignore external (Revision 1347897) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java > Follow outlinks to ignore external > -- > > Key: NUTCH-1346 > URL: https://issues.apache.org/jira/browse/NUTCH-1346 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1346-1.6-1.patch > > > The follow outlinks feature already respects the db.ignore.external.links > setting. However, this means that outlinks of fetched pages that are external > are not saved in parse data. There should be a new setting to prevent the > outlink follower from going external but still storing external outlinks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1336) Optionally not index db_notmodified pages
[ https://issues.apache.org/jira/browse/NUTCH-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291628#comment-13291628 ] Hudson commented on NUTCH-1336: --- Integrated in nutch-trunk-maven #302 (See [https://builds.apache.org/job/nutch-trunk-maven/302/]) NUTCH-1336 Optionally not index db_notmodified pages (Revision 1347909) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java > Optionally not index db_notmodified pages > - > > Key: NUTCH-1336 > URL: https://issues.apache.org/jira/browse/NUTCH-1336 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1336-1.6-1.patch > > > IndexerMapReduce already skips pages with fetch_notmodified as status. > However, despite the fetch status, we may still consider a page not modified > if status is db_notmodified. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1381) Allow to override default subcollection field name
[ https://issues.apache.org/jira/browse/NUTCH-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291682#comment-13291682 ] Hudson commented on NUTCH-1381: --- Integrated in Nutch-trunk #1865 (See [https://builds.apache.org/job/Nutch-trunk/1865/]) NUTCH-1381 Allow to override default subcollection field name (Revision 1347744) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347744 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java > Allow to override default subcollection field name > -- > > Key: NUTCH-1381 > URL: https://issues.apache.org/jira/browse/NUTCH-1381 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1381-1.6-1.patch > > > The subcollection filter by default uses the subcollection field name but > since NUTCH-1266 allows to override it per subcollection. This issue should > introduce a configuration directive to override the default field name > globally. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1336) Optionally not index db_notmodified pages
[ https://issues.apache.org/jira/browse/NUTCH-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291686#comment-13291686 ] Hudson commented on NUTCH-1336: --- Integrated in Nutch-trunk #1865 (See [https://builds.apache.org/job/Nutch-trunk/1865/]) NUTCH-1336 Optionally not index db_notmodified pages (Revision 1347909) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347909 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java > Optionally not index db_notmodified pages > - > > Key: NUTCH-1336 > URL: https://issues.apache.org/jira/browse/NUTCH-1336 > Project: Nutch > Issue Type: New Feature > Components: indexer >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1336-1.6-1.patch > > > IndexerMapReduce already skips pages with fetch_notmodified as status. > However, despite the fetch status, we may still consider a page not modified > if status is db_notmodified. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1346) Follow outlinks to ignore external
[ https://issues.apache.org/jira/browse/NUTCH-1346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291684#comment-13291684 ] Hudson commented on NUTCH-1346: --- Integrated in Nutch-trunk #1865 (See [https://builds.apache.org/job/Nutch-trunk/1865/]) NUTCH-1346 Follow outlinks to ignore external (Revision 1347897) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347897 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java > Follow outlinks to ignore external > -- > > Key: NUTCH-1346 > URL: https://issues.apache.org/jira/browse/NUTCH-1346 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1346-1.6-1.patch > > > The follow outlinks feature already respects the db.ignore.external.links > setting. However, this means that outlinks of fetched pages that are external > are not saved in parse data. There should be a new setting to prevent the > outlink follower from going external but still storing external outlinks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1320) IndexChecker and ParseChecker choke on IDN's
[ https://issues.apache.org/jira/browse/NUTCH-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291683#comment-13291683 ] Hudson commented on NUTCH-1320: --- Integrated in Nutch-trunk #1865 (See [https://builds.apache.org/job/Nutch-trunk/1865/]) NUTCH-1320 IndexChecker and ParseChecker choke on IDN's (Revision 1347755) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347755 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java * /nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java * /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java > IndexChecker and ParseChecker choke on IDN's > > > Key: NUTCH-1320 > URL: https://issues.apache.org/jira/browse/NUTCH-1320 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1320-1.5-1.patch > > > These handy debug tools do not handle IDN's and throw an NPE > bin/nutch parsechecker http://例子.測試/%E9%A6%96%E9%A0%81 > {code} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.nutch.indexer.IndexingFiltersChecker.run(IndexingFiltersChecker.java:71) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at > org.apache.nutch.indexer.IndexingFiltersChecker.main(IndexingFiltersChecker.java:116) > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1351) DomainStatistics to aggregate by TLD
[ https://issues.apache.org/jira/browse/NUTCH-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291685#comment-13291685 ] Hudson commented on NUTCH-1351: --- Integrated in Nutch-trunk #1865 (See [https://builds.apache.org/job/Nutch-trunk/1865/]) NUTCH-1351 DomainStatistics to aggregate by TLD (Revision 1347747) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1347747 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/util/URLUtil.java * /nutch/trunk/src/java/org/apache/nutch/util/domain/DomainStatistics.java > DomainStatistics to aggregate by TLD > > > Key: NUTCH-1351 > URL: https://issues.apache.org/jira/browse/NUTCH-1351 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1351-1.6-1.patch > > > The DomainStatistics tool aggregates counts by host, domain or suffix but tld > is missing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293039#comment-13293039 ] Hudson commented on NUTCH-1262: --- Integrated in nutch-trunk-maven #306 (See [https://builds.apache.org/job/nutch-trunk-maven/306/]) NUTCH-1262 Map `duplicating` content-types to a single type (Revision 1348785) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java > Map `duplicating` content-types to a single type > > > Key: NUTCH-1262 > URL: https://issues.apache.org/jira/browse/NUTCH-1262 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1262-1.5-1.patch, NUTCH-1262-1.5-2.patch > > > Similar or duplicating content-types can end-up differently in an index. > With, for example, both application/xhtml+xml and text/html it is impossible > to use a single filter to select `web pages`. > See also: > http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html > Content-Type mapping is disabled by default and is enabled via > moreIndexingFilter.mapMimeTypes. Example mapping file is provided in conf/. > {code} > # target MIME-type type1 [ type2 ...] > # Map XHTML to HTML > text/html application/xhtml+xml > # Map XHTML and HTML to something else > Web pagetext/html application/xhtml+xml > # Map some office documents to each other > Office document application/vnd.oasis.opendocument.text > application/x-tika-msoffice > {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
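The mapping file format quoted above can be read roughly as follows. This is a sketch under assumptions, not the code in `MoreIndexingFilter`: it assumes fields are separated by tabs (targets such as "Web page" contain spaces), that `#` starts a comment line, and the class `MimeMapSketch` with its methods is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for illustration; field separator (tab) is an assumption.
final class MimeMapSketch {

  /** Builds a source-type -> target-type map from "target<TAB>type1[<TAB>type2 ...]" lines. */
  static Map<String, String> parse(List<String> lines) {
    Map<String, String> sourceToTarget = new HashMap<>();
    for (String raw : lines) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      String[] fields = line.split("\t");
      String target = fields[0].trim();
      for (int i = 1; i < fields.length; i++) {
        String source = fields[i].trim();
        if (!source.isEmpty()) {
          sourceToTarget.put(source, target);
        }
      }
    }
    return sourceToTarget;
  }

  /** Returns the mapped type, or the original type when no mapping exists. */
  static String map(Map<String, String> mapping, String contentType) {
    return mapping.getOrDefault(contentType, contentType);
  }
}
```

Falling back to the original type when no mapping exists keeps unmapped content types unchanged in the index.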
[jira] [Commented] (NUTCH-1385) More robust plug-in order properties in "nutch-site.xml"
[ https://issues.apache.org/jira/browse/NUTCH-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293040#comment-13293040 ] Hudson commented on NUTCH-1385: --- Integrated in nutch-trunk-maven #306 (See [https://builds.apache.org/job/nutch-trunk-maven/306/]) NUTCH-1385 More robust plug-in order properties in nutch-site.xml (Revision 1348764) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java * /nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java * /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java * /nutch/trunk/src/java/org/apache/nutch/parse/HtmlParseFilters.java * /nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java > More robust plug-in order properties in "nutch-site.xml" > > > Key: NUTCH-1385 > URL: https://issues.apache.org/jira/browse/NUTCH-1385 > Project: Nutch > Issue Type: Improvement > Components: indexer, parser >Affects Versions: 1.5 >Reporter: Andy Xue >Assignee: Markus Jelsma >Priority: Minor > Labels: filter > Fix For: 1.6 > > Attachments: nutch-1385.txt > > > When listing multiple scoring filters in certain properties (listed below) in > "nutch-site.xml", it is vital that no spaces/newlines/tabs are placed in > front of the value content. > E.g.: > This is fine: > org.apache.nutch.scoring.opic.OPICScoringFilter myFilter > Either of these will generate an exception: > org.apache.nutch.scoring.opic.OPICScoringFilter myFilter > > org.apache.nutch.scoring.opic.OPICScoringFilter > myFilter > > Affects these properties in "nutch-site.xml": > * indexingfilter.order > * urlnormalizer.order > * urlfilter.order > * htmlparsefilter.order > * scoring.filter.order > Solution: replaced {order.split("\\s+")} to {order.trim().split("\\s+")}. > Patch provided. -- This message is automatically generated by JIRA. 
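The effect of the fix described in NUTCH-1385 can be reproduced in isolation. The sketch below compares `order.split("\\s+")` with `order.trim().split("\\s+")` on a value that starts with whitespace; the class `OrderSplitSketch` and the sample value are illustrative assumptions and are not copied from the patch.

```java
import java.util.Arrays;

// Standalone illustration of the split behaviour; not code from the Nutch patch.
final class OrderSplitSketch {

  /** Original behaviour: leading whitespace yields an empty first element. */
  static String[] splitUntrimmed(String order) {
    return order.split("\\s+");
  }

  /** Fixed behaviour: trimming first removes the empty first element. */
  static String[] splitTrimmed(String order) {
    return order.trim().split("\\s+");
  }

  public static void main(String[] args) {
    // A property value as it might appear when the XML value starts on a new line.
    String order = "\n  org.apache.nutch.scoring.opic.OPICScoringFilter myFilter\n";
    System.out.println(Arrays.toString(splitUntrimmed(order)));
    System.out.println(Arrays.toString(splitTrimmed(order)));
  }
}
```

The untrimmed variant returns an empty string as its first element, which a caller would then try to resolve as a filter class name; trailing whitespace is harmless because `split` drops trailing empty strings.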
[jira] [Commented] (NUTCH-1384) Typo in ParseSegment's run-method
[ https://issues.apache.org/jira/browse/NUTCH-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293041#comment-13293041 ] Hudson commented on NUTCH-1384: --- Integrated in nutch-trunk-maven #306 (See [https://builds.apache.org/job/nutch-trunk-maven/306/]) NUTCH-1384 Typo in ParseSegments's run-method (Revision 1348766) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java > Typo in ParseSegment's run-method > - > > Key: NUTCH-1384 > URL: https://issues.apache.org/jira/browse/NUTCH-1384 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.5 >Reporter: Matthias Agethle >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.6 > > > In the class org.apache.nutch.parse.ParseSegment there's a typo in the > run-method: instead of checking whether "-noFilter" was specified on the > command-line, the code looks for "-noilter" (missing f, line 234). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
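A misspelled flag literal such as the one reported in NUTCH-1384 fails silently, because the comparison simply never matches. The sketch below shows the kind of check involved; it is a hypothetical reconstruction, not the code from `ParseSegment.java`, and the class `NoFilterFlagSketch` and the constant `NO_FILTER` are names introduced here for illustration.

```java
// Hypothetical illustration of a command-line flag check; not the Nutch source.
final class NoFilterFlagSketch {

  // Keeping the flag in one constant avoids repeating (and mistyping) the literal.
  static final String NO_FILTER = "-noFilter";

  /** Returns true when the "-noFilter" flag is present among the arguments. */
  static boolean hasNoFilter(String[] args) {
    for (String arg : args) {
      if (NO_FILTER.equals(arg)) {
        return true;
      }
    }
    return false;
  }
}
```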
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293085#comment-13293085 ] Hudson commented on NUTCH-1360: --- Integrated in nutch-trunk-maven #307 (See [https://builds.apache.org/job/nutch-trunk-maven/307/]) commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993) Result = SUCCESS lewismc : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java * /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java * /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java > Suport the storing of IP address connected to when web crawling > --- > > Key: NUTCH-1360 > URL: https://issues.apache.org/jira/browse/NUTCH-1360 > Project: Nutch > Issue Type: New Feature > Components: protocol >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1360-nutchgora-v2.patch, > NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch > > > Simple issue enabling us to capture the specific IP address of the host which > we connect to to fetch a page. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1364) Add a counter in Generator for malformed urls
[ https://issues.apache.org/jira/browse/NUTCH-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293245#comment-13293245 ] Hudson commented on NUTCH-1364: --- Integrated in nutch-trunk-maven #308 (See [https://builds.apache.org/job/nutch-trunk-maven/308/]) commit to address NUTCH-1364 and update to CHANGES.txt (Revision 1349076) Result = SUCCESS lewismc : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java * /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java > Add a counter in Generator for malformed urls > - > > Key: NUTCH-1364 > URL: https://issues.apache.org/jira/browse/NUTCH-1364 > Project: Nutch > Issue Type: New Feature > Components: generator >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Priority: Minor > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1364-nutchgora.patch, NUTCH-1364-trunk.patch > > > This is a simple mechanism for counting the number of malformed urls we > encounter within the Generator. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1360) Suport the storing of IP address connected to when web crawling
[ https://issues.apache.org/jira/browse/NUTCH-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1329#comment-1329 ] Hudson commented on NUTCH-1360: --- Integrated in Nutch-trunk #1868 (See [https://builds.apache.org/job/Nutch-trunk/1868/]) commit to address NUTCH-1360 and update to CHANGES.txt (Revision 1348993) Result = SUCCESS lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348993 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java * /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java * /nutch/trunk/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java > Suport the storing of IP address connected to when web crawling > --- > > Key: NUTCH-1360 > URL: https://issues.apache.org/jira/browse/NUTCH-1360 > Project: Nutch > Issue Type: New Feature > Components: protocol >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Minor > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1360-nutchgora-v2.patch, > NUTCH-1360-nutchgora.patch, NUTCH-1360-trunk.patch > > > Simple issue enabling us to capture the specific IP address of the host which > we connect to to fetch a page. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1385) More robust plug-in order properties in "nutch-site.xml"
[ https://issues.apache.org/jira/browse/NUTCH-1385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293336#comment-13293336 ] Hudson commented on NUTCH-1385: --- Integrated in Nutch-trunk #1868 (See [https://builds.apache.org/job/Nutch-trunk/1868/]) NUTCH-1385 More robust plug-in order properties in nutch-site.xml (Revision 1348764) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348764 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFilters.java * /nutch/trunk/src/java/org/apache/nutch/net/URLFilters.java * /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java * /nutch/trunk/src/java/org/apache/nutch/parse/HtmlParseFilters.java * /nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java > More robust plug-in order properties in "nutch-site.xml" > > > Key: NUTCH-1385 > URL: https://issues.apache.org/jira/browse/NUTCH-1385 > Project: Nutch > Issue Type: Improvement > Components: indexer, parser >Affects Versions: 1.5 >Reporter: Andy Xue >Assignee: Markus Jelsma >Priority: Minor > Labels: filter > Fix For: 1.6 > > Attachments: nutch-1385.txt > > > When listing multiple scoring filters in certain properties (listed below) in > "nutch-site.xml", it is vital that no spaces/newlines/tabs are placed in > front of the value content. > E.g.: > This is fine: > org.apache.nutch.scoring.opic.OPICScoringFilter myFilter > Either of these will generate an exception: > org.apache.nutch.scoring.opic.OPICScoringFilter myFilter > > org.apache.nutch.scoring.opic.OPICScoringFilter > myFilter > > Affects these properties in "nutch-site.xml": > * indexingfilter.order > * urlnormalizer.order > * urlfilter.order > * htmlparsefilter.order > * scoring.filter.order > Solution: replaced {order.split("\\s+")} to {order.trim().split("\\s+")}. > Patch provided. -- This message is automatically generated by JIRA. 
[jira] [Commented] (NUTCH-1262) Map `duplicating` content-types to a single type
[ https://issues.apache.org/jira/browse/NUTCH-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293334#comment-13293334 ] Hudson commented on NUTCH-1262: --- Integrated in Nutch-trunk #1868 (See [https://builds.apache.org/job/Nutch-trunk/1868/]) NUTCH-1262 Map `duplicating` content-types to a single type (Revision 1348785) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348785 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java > Map `duplicating` content-types to a single type > > > Key: NUTCH-1262 > URL: https://issues.apache.org/jira/browse/NUTCH-1262 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1262-1.5-1.patch, NUTCH-1262-1.5-2.patch > > > Similar or duplicating content-types can end-up differently in an index. > With, for example, both application/xhtml+xml and text/html it is impossible > to use a single filter to select `web pages`. > See also: > http://lucene.472066.n3.nabble.com/application-xhtml-xml-gt-text-html-td3699942.html > Content-Type mapping is disabled by default and is enabled via > moreIndexingFilter.mapMimeTypes. Example mapping file is provided in conf/. > {code} > # target MIME-type type1 [ type2 ...] > # Map XHTML to HTML > text/html application/xhtml+xml > # Map XHTML and HTML to something else > Web pagetext/html application/xhtml+xml > # Map some office documents to each other > Office document application/vnd.oasis.opendocument.text > application/x-tika-msoffice > {code} -- This message is automatically generated by JIRA. 
[jira] [Commented] (NUTCH-1384) Typo in ParseSegment's run-method
[ https://issues.apache.org/jira/browse/NUTCH-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293337#comment-13293337 ] Hudson commented on NUTCH-1384: --- Integrated in Nutch-trunk #1868 (See [https://builds.apache.org/job/Nutch-trunk/1868/]) NUTCH-1384 Typo in ParseSegments's run-method (Revision 1348766) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1348766 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java > Typo in ParseSegment's run-method > - > > Key: NUTCH-1384 > URL: https://issues.apache.org/jira/browse/NUTCH-1384 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.5 >Reporter: Matthias Agethle >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.6 > > > In the class org.apache.nutch.parse.ParseSegment there's a typo in the > run-method: instead of checking whether "-noFilter" was specified on the > command-line, the code looks for "-noilter" (missing f, line 234). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1364) Add a counter in Generator for malformed urls
[ https://issues.apache.org/jira/browse/NUTCH-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293335#comment-13293335 ] Hudson commented on NUTCH-1364: --- Integrated in Nutch-trunk #1868 (See [https://builds.apache.org/job/Nutch-trunk/1868/]) commit to address NUTCH-1364 and update to CHANGES.txt (Revision 1349076) Result = SUCCESS lewismc : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349076 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReducer.java * /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java > Add a counter in Generator for malformed urls > - > > Key: NUTCH-1364 > URL: https://issues.apache.org/jira/browse/NUTCH-1364 > Project: Nutch > Issue Type: New Feature > Components: generator >Affects Versions: nutchgora, 1.5 >Reporter: Lewis John McGibbney >Priority: Minor > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1364-nutchgora.patch, NUTCH-1364-trunk.patch > > > This is a simple mechanism for counting the number of malformed urls we > encounter within the Generator. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1330) OutlinkDB to preserve back up
[ https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293545#comment-13293545 ] Hudson commented on NUTCH-1330: --- Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/]) NUTCH-1330 WebGraph OutlinkDB to preserve back up (Revision 1349240) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java > OutlinkDB to preserve back up > - > > Key: NUTCH-1330 > URL: https://issues.apache.org/jira/browse/NUTCH-1330 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1330-1.6-1.patch, NUTCH-1330-1.6-2.patch > > > The webgraph's outlinkDB is the single source for all scoring jobs and GB's > that eventually come out. In case of disaster, which has not happened yet, it > should be able to preserve a backup just like the other DBs. This means users > with an existing outlinkdb must move it from crawl/webgraphdb/outlinks/ to > crawl/webgraphdb/outlinks/current/. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293543#comment-13293543 ] Hudson commented on NUTCH-1024: --- Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/]) NUTCH-1024 Dynamically set fetchInterval by MIME-type (Revision 1349226) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/adaptive-mimetypes.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java * /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java * /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java * /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java > Dynamically set fetchInterval by MIME-type > -- > > Key: NUTCH-1024 > URL: https://issues.apache.org/jira/browse/NUTCH-1024 > Project: Nutch > Issue Type: New Feature > Components: generator >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: AdaptiveFetchSchedule.patch, > MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, > NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, > adaptive-mimetypes.txt > > > Add facility to configure default or fixed fetchInterval values by MIME-type. > This is useful for conserving resources for files that are known to change > frequently or never and everything in between. > * simple key\tvalue\n configuration file > * only set fetchInterval for new documents > * keep max fetchInterval fixed by current config -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293546#comment-13293546 ] Hudson commented on NUTCH-1352: --- Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/]) NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 1349227) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java * /nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java * /nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java * /nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * /nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java > Improve regex urlfilters/normalizers synchronization > > > Key: NUTCH-1352 > URL: https://issues.apache.org/jira/browse/NUTCH-1352 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch > > > I noticed that during fetching, the fetcher threads spend much of their time > blocked on a monitor because of outlink normalizing/filtering. The cause: > some of the regex plugins use single-lock synchronization. > This patch improves throughput by removing the synchronization locks and replacing > them with thread-locals where needed. > It has been extensively tested in production. I will commit this later today > if there are no objections.
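The general technique named in NUTCH-1352, replacing a shared lock with per-thread state, can be sketched as below. This is not the code from the patch: the class ThreadLocalRegexRule is hypothetical and only illustrates the idea. It relies on the fact that java.util.regex.Pattern is safe to share between threads while Matcher is not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule class: one compiled Pattern, one Matcher per thread.
class ThreadLocalRegexRule {

  // Each thread lazily gets its own Matcher, so no lock is needed.
  private final ThreadLocal<Matcher> matcher;

  ThreadLocalRegexRule(String regex) {
    final Pattern compiled = Pattern.compile(regex); // immutable, shareable
    this.matcher = ThreadLocal.withInitial(() -> compiled.matcher(""));
  }

  // Previously such a method might have been "synchronized" around a
  // single shared Matcher; here every caller reuses its own instance.
  boolean matches(String url) {
    return matcher.get().reset(url).find();
  }
}
```

The trade-off is a little extra memory per thread in exchange for fetcher threads no longer queueing on one monitor.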
[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293544#comment-13293544 ] Hudson commented on NUTCH-1300: --- Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/]) NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java * /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java * /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java > Indexer to normalize URL's > -- > > Key: NUTCH-1300 > URL: https://issues.apache.org/jira/browse/NUTCH-1300 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1300-1.5-1.patch > > > Indexers should be able to normalize URL's. This is useful when a new > normalizer is applied to the entire CrawlDB. Without it, some or all records > in a segment cannot be indexed at all. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1386) Headings filter not to add empty values
[ https://issues.apache.org/jira/browse/NUTCH-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293547#comment-13293547 ] Hudson commented on NUTCH-1386: --- Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/]) NUTCH-1386 Headings filter not to add empty values (Revision 1349233) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java > Headings filter not to add empty values > --- > > Key: NUTCH-1386 > URL: https://issues.apache.org/jira/browse/NUTCH-1386 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > > Headings filter can add empty values and doesn't trim the headings. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
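The behaviour asked for in NUTCH-1386 (trim headings, never add empty values) amounts to a small filter. The sketch below is illustrative only; HeadingCleaner is a hypothetical class and not the HeadingsParseFilter code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: trims heading text and drops empty results.
class HeadingCleaner {

  static List<String> clean(List<String> rawHeadings) {
    List<String> cleaned = new ArrayList<>();
    for (String raw : rawHeadings) {
      if (raw == null) {
        continue;
      }
      String heading = raw.trim();
      if (!heading.isEmpty()) {
        cleaned.add(heading); // only non-empty, trimmed values survive
      }
    }
    return cleaned;
  }
}
```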
[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293548#comment-13293548 ] Hudson commented on NUTCH-1356: --- Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/]) NUTCH-1356 ParseUtil use ExecutorService instead of manually thread handling (Revision 1349230) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/ivy/ivy.xml * /nutch/trunk/src/java/org/apache/nutch/parse/ParseUtil.java > ParseUtil use ExecutorService instead of manually thread handling. > -- > > Key: NUTCH-1356 > URL: https://issues.apache.org/jira/browse/NUTCH-1356 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, > NUTCH-1356.patch > > > Because ParseUtil manages its own parser threads by creating a thread for > every parse, some parsers become very expensive. > For example, parsers that have thread-local fields will initialize them for > every item to be parsed. > By simply introducing a caching ExecutorService, ParseUtil will be able to > reuse threads and therefore parse more efficiently. See attached patch.
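The idea behind NUTCH-1356, handing parse tasks to a caching ExecutorService instead of starting a new thread per parse, might look like this. It is a sketch under assumptions rather than the ParseUtil patch: the class CachedParseRunner and the timeout parameter are hypothetical. Executors.newCachedThreadPool() reuses idle worker threads when they are available, so thread-local state in a parser can survive between tasks.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical runner: submits parse tasks to a shared cached thread pool.
class CachedParseRunner {

  // Idle workers are kept around and reused for later tasks.
  private final ExecutorService executor = Executors.newCachedThreadPool();

  // Runs one task and waits for its result; the timeout is an assumption
  // added for illustration, it is not mentioned in the issue text.
  <T> T runWithTimeout(Callable<T> parseTask, long timeoutSeconds) throws Exception {
    Future<T> future = executor.submit(parseTask);
    try {
      return future.get(timeoutSeconds, TimeUnit.SECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the worker running the stuck task
      throw e;
    }
  }

  // Pool threads are non-daemon, so shut the pool down when finished.
  void shutdown() {
    executor.shutdownNow();
  }
}
```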
[jira] [Commented] (NUTCH-1319) HostNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293549#comment-13293549 ] Hudson commented on NUTCH-1319: --- Integrated in nutch-trunk-maven #310 (See [https://builds.apache.org/job/nutch-trunk-maven/310/]) NUTCH-1319 HostNormalizer plugin (Revision 1349236) Result = SUCCESS markus : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/host-urlnormalizer.txt * /nutch/trunk/src/plugin/urlnormalizer-host * /nutch/trunk/src/plugin/urlnormalizer-host/build.xml * /nutch/trunk/src/plugin/urlnormalizer-host/data * /nutch/trunk/src/plugin/urlnormalizer-host/data/hosts.txt * /nutch/trunk/src/plugin/urlnormalizer-host/ivy.xml * /nutch/trunk/src/plugin/urlnormalizer-host/plugin.xml * /nutch/trunk/src/plugin/urlnormalizer-host/src * /nutch/trunk/src/plugin/urlnormalizer-host/src/java * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java * /nutch/trunk/src/plugin/urlnormalizer-host/src/test * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer/host * 
/nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer/host/TestHostURLNormalizer.java > HostNormalizer > -- > > Key: NUTCH-1319 > URL: https://issues.apache.org/jira/browse/NUTCH-1319 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1319-1.5-1.patch > > > Nutch would benefit from having a host normalizer. A host normalizer maps a > given host to the desired host. A basic example is to map www.apache.org to > apache.org. The Apache website is one of many on the internet that has a > duplicate website on the same domain just because it allows both www and > non-www to return HTTP 200 and proper content. > It is also able to handle wildcards such as *.example.org to example.org if > there are multiple sub domains that actually point to the same website. > Large internet crawls tend to get polluted very quickly due to these > problems. It also leads to skewed scores in the webgraph as different > websites link to different versions of the same duplicate website. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
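The host mapping described in NUTCH-1319, including the *.example.org wildcard case, can be sketched as follows. This is an illustration only and not the HostURLNormalizer plugin: the class HostMapSketch, its methods, and the "first matching wildcard wins" rule are assumptions, and the real plugin's rule-file format is not shown in this message.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical host mapper with exact and "*." wildcard rules.
class HostMapSketch {

  private final Map<String, String> exact = new HashMap<>();
  // Keyed by suffix such as ".example.org"; insertion order is kept so
  // that the first matching wildcard rule wins (an assumption).
  private final Map<String, String> wildcard = new LinkedHashMap<>();

  void addRule(String from, String to) {
    if (from.startsWith("*.")) {
      wildcard.put(from.substring(1).toLowerCase(Locale.ROOT), to);
    } else {
      exact.put(from.toLowerCase(Locale.ROOT), to);
    }
  }

  // Returns the desired host, or the input unchanged if no rule applies.
  String normalizeHost(String host) {
    String key = host.toLowerCase(Locale.ROOT);
    String target = exact.get(key);
    if (target != null) {
      return target;
    }
    for (Map.Entry<String, String> rule : wildcard.entrySet()) {
      if (key.endsWith(rule.getKey())) {
        return rule.getValue();
      }
    }
    return host;
  }
}
```

With the rules www.apache.org to apache.org and *.example.org to example.org, both www.apache.org and any sub domain of example.org collapse to a single host, which is what keeps duplicate sites out of the crawl.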
[jira] [Commented] (NUTCH-1398) Upgrade to Hadoop 1.0.3
[ https://issues.apache.org/jira/browse/NUTCH-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295718#comment-13295718 ] Hudson commented on NUTCH-1398: --- Integrated in nutch-trunk-maven #314 (See [https://builds.apache.org/job/nutch-trunk-maven/314/]) NUTCH-1398 Upgrade to Hadoop 1.0.3 (Revision 1350630) Result = SUCCESS jnioche : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/ivy/ivy.xml > Upgrade to Hadoop 1.0.3 > --- > > Key: NUTCH-1398 > URL: https://issues.apache.org/jira/browse/NUTCH-1398 > Project: Nutch > Issue Type: Improvement >Affects Versions: nutchgora, 1.5 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 1.6, 2.1 > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1396) Upgrade to Tika 1.1
[ https://issues.apache.org/jira/browse/NUTCH-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295766#comment-13295766 ] Hudson commented on NUTCH-1396: --- Integrated in Nutch-nutchgora #281 (See [https://builds.apache.org/job/Nutch-nutchgora/281/]) Upgrade to Tika 1.1 NUTCH-1396 (Revision 1350580) Result = SUCCESS lewismc : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/ivy/ivy.xml * /nutch/branches/nutchgora/src/java/org/apache/nutch/util/MimeUtil.java * /nutch/branches/nutchgora/src/plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java * /nutch/branches/nutchgora/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java * /nutch/branches/nutchgora/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java * /nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestMSWordParser.java * /nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestOOParser.java * /nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestPdfParser.java * /nutch/branches/nutchgora/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestRSSParser.java * /nutch/branches/nutchgora/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java > Upgrade to Tika 1.1 > --- > > Key: NUTCH-1396 > URL: https://issues.apache.org/jira/browse/NUTCH-1396 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: nutchgora > > Attachments: NUTCH-1396.patch > > > Copied code from trunk for MimeUtil and upgraded dependency to Tika 1.1 -- This message is automatically generated by JIRA. 
[jira] [Commented] (NUTCH-1392) -force and -resume arguments being ignored in ParserJob
[ https://issues.apache.org/jira/browse/NUTCH-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295767#comment-13295767 ] Hudson commented on NUTCH-1392: --- Integrated in Nutch-nutchgora #281 (See [https://builds.apache.org/job/Nutch-nutchgora/281/]) -force and -resume arguments being ignored in ParserJob NUTCH-1392 (Revision 1350213) Result = SUCCESS lewismc : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java * /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java > -force and -resume arguments being ignored in ParserJob > --- > > Key: NUTCH-1392 > URL: https://issues.apache.org/jira/browse/NUTCH-1392 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: nutchgora >Reporter: Lewis John McGibbney > Fix For: nutchgora > > Attachments: NUTCH-1392.patch > > > From the log below there is obviously something not right here as both > -resume and -force are passed to the CLI but blatantly ignored within the log > output. > lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch parse > Usage: ParserJob ( | -all) [-crawlId ] [-resume] [-force] > - symbolic batch ID created by Generator > -crawlId - the id to prefix the schemas to operate on, > (default: storage.crawl.id) > -all - consider pages from all crawl jobs > -resume - resume a previous incomplete job > -force- force re-parsing even if a page is already parsed > lewis@lewis:~/ASF/nutchgora/runtime/local$ ./bin/nutch parse -all -resume > -force > ParserJob: starting > ParserJob: resuming: false > ParserJob: forced reparse:false > ParserJob: parsing all > Parsing http://www.trancearoundtheworld.com/ > ParserJob: success -- This message is automatically generated by JIRA. 
[jira] [Commented] (NUTCH-1300) Indexer to normalize URL's
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295798#comment-13295798 ] Hudson commented on NUTCH-1300: --- Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/]) NUTCH-1300 Indexer to filter normalize URL's (Revision 1349262) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349262 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java * /nutch/trunk/src/java/org/apache/nutch/indexer/solr/SolrIndexer.java * /nutch/trunk/src/java/org/apache/nutch/net/URLNormalizers.java > Indexer to normalize URL's > -- > > Key: NUTCH-1300 > URL: https://issues.apache.org/jira/browse/NUTCH-1300 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1300-1.5-1.patch > > > Indexers should be able to normalize URL's. This is useful when a new > normalizer is applied to the entire CrawlDB. Without it, some or all records > in a segment cannot be indexed at all. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1356) ParseUtil use ExecutorService instead of manually thread handling.
[ https://issues.apache.org/jira/browse/NUTCH-1356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295803#comment-13295803 ] Hudson commented on NUTCH-1356: --- Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/]) NUTCH-1356 ParseUtil use ExecutorService instead of manually thread handling (Revision 1349230) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349230 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/ivy/ivy.xml * /nutch/trunk/src/java/org/apache/nutch/parse/ParseUtil.java > ParseUtil use ExecutorService instead of manually thread handling. > -- > > Key: NUTCH-1356 > URL: https://issues.apache.org/jira/browse/NUTCH-1356 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1356-trunk-v2.patch, NUTCH-1356-trunk.patch, > NUTCH-1356.patch > > > Because ParseUtil manages its own parser threads by creating a thread for > every parse, some parsers become very expensive. > For example, parsers that have thread-local fields will initialize them for > every item to be parsed. > By simply introducing a caching ExecutorService, ParseUtil will be able to > reuse threads and therefore parse more efficiently. See attached patch.
[jira] [Commented] (NUTCH-1319) HostNormalizer
[ https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295804#comment-13295804 ] Hudson commented on NUTCH-1319: --- Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/]) NUTCH-1319 HostNormalizer plugin (Revision 1349236) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349236 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/host-urlnormalizer.txt * /nutch/trunk/src/plugin/urlnormalizer-host * /nutch/trunk/src/plugin/urlnormalizer-host/build.xml * /nutch/trunk/src/plugin/urlnormalizer-host/data * /nutch/trunk/src/plugin/urlnormalizer-host/data/hosts.txt * /nutch/trunk/src/plugin/urlnormalizer-host/ivy.xml * /nutch/trunk/src/plugin/urlnormalizer-host/plugin.xml * /nutch/trunk/src/plugin/urlnormalizer-host/src * /nutch/trunk/src/plugin/urlnormalizer-host/src/java * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host * /nutch/trunk/src/plugin/urlnormalizer-host/src/java/org/apache/nutch/net/urlnormalizer/host/HostURLNormalizer.java * /nutch/trunk/src/plugin/urlnormalizer-host/src/test * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer * 
/nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer/host * /nutch/trunk/src/plugin/urlnormalizer-host/src/test/org/apache/nutch/net/urlnormalizer/host/TestHostURLNormalizer.java > HostNormalizer > -- > > Key: NUTCH-1319 > URL: https://issues.apache.org/jira/browse/NUTCH-1319 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1319-1.5-1.patch > > > Nutch would benefit from having a host normalizer. A host normalizer maps a > given host to the desired host. A basic example is to map www.apache.org to > apache.org. The Apache website is one of many on the internet that has a > duplicate website on the same domain just because it allows both www and > non-www to return HTTP 200 and proper content. > It is also able to handle wildcards such as *.example.org to example.org if > there are multiple sub domains that actually point to the same website. > Large internet crawls tend to get polluted very quickly due to these > problems. It also leads to skewed scores in the webgraph as different > websites link to different versions of the same duplicate website. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1398) Upgrade to Hadoop 1.0.3
[ https://issues.apache.org/jira/browse/NUTCH-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295800#comment-13295800 ] Hudson commented on NUTCH-1398: --- Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/]) NUTCH-1398 Upgrade to Hadoop 1.0.3 (Revision 1350630) Result = SUCCESS jnioche : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1350630 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/ivy/ivy.xml > Upgrade to Hadoop 1.0.3 > --- > > Key: NUTCH-1398 > URL: https://issues.apache.org/jira/browse/NUTCH-1398 > Project: Nutch > Issue Type: Improvement >Affects Versions: nutchgora, 1.5 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 1.6, 2.1 > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1352) Improve regex urlfilters/normalizers synchronization
[ https://issues.apache.org/jira/browse/NUTCH-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295801#comment-13295801 ] Hudson commented on NUTCH-1352: --- Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/]) NUTCH-1352 Improve regex urlfilters/normalizers synchronization (Revision 1349227) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349227 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java * /nutch/trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java * /nutch/trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java * /nutch/trunk/src/plugin/urlnormalizer-basic/src/java/org/apache/nutch/net/urlnormalizer/basic/BasicURLNormalizer.java * /nutch/trunk/src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java > Improve regex urlfilters/normalizers synchronization > > > Key: NUTCH-1352 > URL: https://issues.apache.org/jira/browse/NUTCH-1352 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: nutchgora, 1.6 > > Attachments: NUTCH-1352-1.6-1.patch, NUTCH-1352.patch > > > I noticed that during fetching, the fetcher threads spend much of their time > blocked on a monitor because of outlink normalizing/filtering. The cause: > some of the regex plugins use single-lock synchronization. > This patch improves throughput by removing the synchronization locks and replacing > them with thread-locals where needed. > It has been extensively tested in production. I will commit this later today > if there are no objections.
[jira] [Commented] (NUTCH-1024) Dynamically set fetchInterval by MIME-type
[ https://issues.apache.org/jira/browse/NUTCH-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295797#comment-13295797 ] Hudson commented on NUTCH-1024: --- Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/]) NUTCH-1024 Dynamically set fetchInterval by MIME-type (Revision 1349226) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349226 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/conf/adaptive-mimetypes.txt * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java * /nutch/trunk/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java * /nutch/trunk/src/java/org/apache/nutch/metadata/HttpHeaders.java * /nutch/trunk/src/java/org/apache/nutch/metadata/Nutch.java > Dynamically set fetchInterval by MIME-type > -- > > Key: NUTCH-1024 > URL: https://issues.apache.org/jira/browse/NUTCH-1024 > Project: Nutch > Issue Type: New Feature > Components: generator >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: AdaptiveFetchSchedule.patch, > MimeAdaptiveFetchSchedule.java, NUTCH-1024-1.5-1.patch, > NUTCH-1024-1.5-2.patch, NUTCH-1024-1.5-3.patch, Nutch.patch, > adaptive-mimetypes.txt > > > Add facility to configure default or fixed fetchInterval values by MIME-type. > This is useful for conserving resources for files that are known to change > frequently or never and everything in between. > * simple key\tvalue\n configuration file > * only set fetchInterval for new documents > * keep max fetchInterval fixed by current config -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1386) Headings filter not to add empty values
[ https://issues.apache.org/jira/browse/NUTCH-1386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295802#comment-13295802 ] Hudson commented on NUTCH-1386: --- Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/]) NUTCH-1386 Headings filter not to add empty values (Revision 1349233) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349233 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/headings/src/java/org/apache/nutch/parse/headings/HeadingsParseFilter.java > Headings filter not to add empty values > --- > > Key: NUTCH-1386 > URL: https://issues.apache.org/jira/browse/NUTCH-1386 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.5 >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > > Headings filter can add empty values and doesn't trim the headings. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1330) OutlinkDB to preserve back up
[ https://issues.apache.org/jira/browse/NUTCH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295799#comment-13295799 ] Hudson commented on NUTCH-1330: --- Integrated in Nutch-trunk #1869 (See [https://builds.apache.org/job/Nutch-trunk/1869/]) NUTCH-1330 WebGraph OutlinkDB to preserve back up (Revision 1349240) Result = SUCCESS markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1349240 Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/scoring/webgraph/WebGraph.java > OutlinkDB to preserve back up > - > > Key: NUTCH-1330 > URL: https://issues.apache.org/jira/browse/NUTCH-1330 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.4 >Reporter: Markus Jelsma >Assignee: Markus Jelsma > Fix For: 1.6 > > Attachments: NUTCH-1330-1.6-1.patch, NUTCH-1330-1.6-2.patch > > > The webgraph's outlinkDB is the single source for all scoring jobs and the GBs > of data that eventually come out. In case of disaster, which has not happened yet, it > should be able to preserve a backup just like the other DBs. This means users > with an existing outlinkdb must move it from crawl/webgraphdb/outlinks/ to > crawl/webgraphdb/outlinks/current/.
[jira] [Commented] (NUTCH-1404) Nutch script fails to find job file in deploy mode
[ https://issues.apache.org/jira/browse/NUTCH-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13396778#comment-13396778 ] Hudson commented on NUTCH-1404: --- Integrated in nutch-trunk-maven #319 (See [https://builds.apache.org/job/nutch-trunk-maven/319/]) NUTCH-1404 Nutch script fails to find job file in deploy mode (sidabatra, jnioche) (Revision 1351709) Result = SUCCESS jnioche : Files : * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/bin/nutch > Nutch script fails to find job file in deploy mode > -- > > Key: NUTCH-1404 > URL: https://issues.apache.org/jira/browse/NUTCH-1404 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.5 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: nutchgora, 1.5.1 > > > See > http://lucene.472066.n3.nabble.com/Nutch-1-5-Deploy-Mode-Doesn-t-Work-like-Nutch-1-4-Deploy-Mode-tp3990169.html -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1399) TestProtocolHttpClient fails
[ https://issues.apache.org/jira/browse/NUTCH-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397249#comment-13397249 ] Hudson commented on NUTCH-1399: --- Integrated in Nutch-nutchgora #286 (See [https://builds.apache.org/job/Nutch-nutchgora/286/]) TestProtocolHttpClient fails NUTCH-1399 (Revision 1351730) Result = SUCCESS lewismc : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/plugin/protocol-httpclient/src/test/org/apache/nutch/protocol/httpclient/TestProtocolHttpClient.java > TestProtocolHttpClient fails > > > Key: NUTCH-1399 > URL: https://issues.apache.org/jira/browse/NUTCH-1399 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: nutchgora > > Attachments: NUTCH-1399.patch > > > the test fails because the http servers are not closed between tests -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1401) Upgrade to Hadoop 1.0.3
[ https://issues.apache.org/jira/browse/NUTCH-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397250#comment-13397250 ] Hudson commented on NUTCH-1401: --- Integrated in Nutch-nutchgora #286 (See [https://builds.apache.org/job/Nutch-nutchgora/286/]) NUTCH-1401 (Revision 1351705) Result = SUCCESS jnioche : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/ivy/ivy.xml > Upgrade to Hadoop 1.0.3 > --- > > Key: NUTCH-1401 > URL: https://issues.apache.org/jira/browse/NUTCH-1401 > Project: Nutch > Issue Type: Improvement >Affects Versions: nutchgora >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: nutchgora > >
[jira] [Commented] (NUTCH-1404) Nutch script fails to find job file in deploy mode
[ https://issues.apache.org/jira/browse/NUTCH-1404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397248#comment-13397248 ] Hudson commented on NUTCH-1404: --- Integrated in Nutch-nutchgora #286 (See [https://builds.apache.org/job/Nutch-nutchgora/286/]) NUTCH-1404 Nutch script fails to find job file in deploy mode (sidabatra, jnioche) (Revision 1351707) Result = SUCCESS jnioche : Files : * /nutch/branches/nutchgora/CHANGES.txt * /nutch/branches/nutchgora/src/bin/nutch > Nutch script fails to find job file in deploy mode > -- > > Key: NUTCH-1404 > URL: https://issues.apache.org/jira/browse/NUTCH-1404 > Project: Nutch > Issue Type: Bug >Affects Versions: nutchgora, 1.5 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: nutchgora, 1.5.1 > > > See > http://lucene.472066.n3.nabble.com/Nutch-1-5-Deploy-Mode-Doesn-t-Work-like-Nutch-1-4-Deploy-Mode-tp3990169.html