Re: [jira] Updated: (NUTCH-627) Minimize host address lookup
On 4/10/08 8:25 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Otis Gospodnetic (JIRA) wrote: If nobody complains, I'll commit by the end of the week. Hi Otis, Thanks for helping with Nutch - we are indeed very shorthanded at the moment, and any help is appreciated, and doubly so that of a person who can commit things ... However, on the formal side I think the Nutch team needs to vote you in as a Nutch committer (even though svn allows you to commit directly) - witness the recent situation with Grant. If you wish I can start a vote, and I'm sure it will be positive, and we will have a clean situation from the formal POV. Ok? +1 +1, as well. Cheers, Chris __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: End-Of-Life status for 0.7.x?
+1 On 1/17/08 12:49 PM, Dennis Kubes [EMAIL PROTECTED] wrote: +1. Andrzej Bialecki wrote: Hi all, I'd like to initiate the discussion about the EOL status of Nutch 0.7.x branch. The question is whether we want to actively support it, whether we have enough resources to make any new releases or apply patches that sit in JIRA? My opinion is that we should mark it EOL, and close all JIRA issues that are relevant only to 0.7.x, with the status Won't Fix. __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Student contributions
Hi Frank, Thanks for your interest in using Nutch! The best way to see what's on the horizon, and needed in Nutch, is to check out our JIRA issue tracking system, at: http://issues.apache.org/jira/browse/NUTCH At present, there are 39 current issues with Nutch, planned to be fixed, or added (as a new feature), or improved (made to an existing feature), for the upcoming 1.0.0 release. There are 222 open issues across all versions of Nutch (including prior releases). To help you digest the wealth of information that's there (and trust me, there's plenty), I would offer a few of my own suggestions for class projects:

(Difficulty: High) 1. Decouple Nutch's crawl infrastructure, and turn it into its own extension point. The current Nutch crawl infrastructure is highly coupled around a few monolithic classes: Fetcher (or its big brother, Fetcher2), Hadoop (as the underlying job/crawl execution platform), etc. There have been several requests on the list to make the crawler its own component, make it light-weight, make it configurable, etc. I think an ambitious 2-week student project would be to take a stab at this decoupling.

(Difficulty: Medium) 2. Analyze the Nutch code base, and propose/suggest architectural improvements. Currently, the Nutch code base is a behemoth of plugins/extension points, configuration properties, and the like. It would be nice to have a fresh look at its architecture, from an outsider's perspective. The students would suggest places to cut/places to add, cleaner interfaces, and the appropriate underlying middleware substrates, e.g., is Hadoop the only logical choice? What about other enterprise solutions such as web services/EJB/JMS/etc.?

(Difficulty: Medium) 3. Use Spring as the underlying configuration framework for Nutch, and overhaul Nutch's home-grown configuration infrastructure. Spring is an open source framework centered around providing configuration and instantiation middleware capabilities: it lets developers focus on the domain objects, and handles the rest. The student would first take a look at Spring, then Nutch, then build a prototype that shows how Spring could be used to configure Nutch (a rough sketch of what this might look like follows at the end of this message).

There are plenty of others, but that should help get the juices flowing; these were just a few ideas off the top of my head. Also, FYI, a course has been taught for a few semesters at the University of Southern California (USC) by Dr. Ellis Horowitz on Search Engines. Here is a pointer to that page. You can find some other Nutch project suggestions there. http://www-scf.usc.edu/~csci572/ Good luck! Cheers, Chris On 1/2/08 2:44 PM, Frank McCown [EMAIL PROTECTED] wrote: Greetings. I'm teaching a class on search engine development this semester, and I am considering having my students use Nutch in their projects (I'm new to Nutch myself). I'd like them to get some experience with an open source project and make a significant contribution. Are there any implementation tasks you guys think would be appropriate for a small group of undergrad, upperclass CS students? I'm looking for ideas for improving Nutch that they could accomplish in a few weeks' time. Thanks, __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
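A rough sketch of what suggestion 3 could look like, purely for illustration: the nutch-components.xml file, the urlFilter bean, and the UrlFilter/PrefixUrlFilter classes below are made up for this example (nothing like them exists in Nutch today); only the Spring API calls themselves are real.

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class SpringConfigDemo {

  /** Minimal interface standing in for a Nutch extension point. */
  public interface UrlFilter {
    String filter(String url);
  }

  /** One possible implementation, configured entirely by Spring. */
  public static class PrefixUrlFilter implements UrlFilter {
    private String prefix;

    // Spring injects this via a <property name="prefix" value="http://"/>
    // entry in the (hypothetical) nutch-components.xml.
    public void setPrefix(String prefix) {
      this.prefix = prefix;
    }

    public String filter(String url) {
      return url.startsWith(prefix) ? url : null;
    }
  }

  public static void main(String[] args) {
    // Instead of NutchConfiguration plus plugin-repository lookups, the
    // object graph is declared in XML and materialized by Spring.
    ApplicationContext ctx =
        new ClassPathXmlApplicationContext("nutch-components.xml");
    UrlFilter filter = (UrlFilter) ctx.getBean("urlFilter");
    System.out.println(filter.filter("http://lucene.apache.org/nutch/"));
  }
}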
Re: Commit Times for Issues
Hi Guys, I'd like to chime in here on this one. My +1 for shortening the time to commit for issues. I fear that development effort on Nutch has teetered on the dwindling side of things for the last year or so, and there (in my opinion, so feel free to disagree) is certainly a stigma to the trunk and its sacred nature that discourages people (including myself) from introducing new code there. I would like to propose even extending Dennis's idea below and developing a new philosophy towards the Nutch CM. To me, the big picture change is the following statement: the trunk is something that can be broke. Let's just accept that it's possible. If it's broke, someone will report it. Nutch has a big enough user base playing around with new builds and revisions now that this will get caught. Guess what. If the trunk is broke, then it can be fixed. I'll tell you guys a story about one of my bosses here at JPL. He used to work for a civil defense contractor in the U.S., with a very rigorous design and software development process. A unit-tests-for-each-line-of-code type of place. In any case, my boss used to break his company's equivalent of the trunk daily build process all the time. Well, one day he gets called in to speak with the vice president of engineering at the company, who proceeds to tell him: "You're really good at breaking the code, eh?" My boss immediately jumps up to defend himself, citing the fact that it wasn't a big problem and that he has fixed it already, but the vice president cuts him off and says, "You probably think I'm mad. Well let me tell you: I'm not. You can break the code all you want, because you know what it tells me? That you're actually *DOING WORK*, unlike the rest of these people who work here and do very little." The above story has stuck with me and made me feel a lot better about situations such as those, in that it gives me the belief that waiting until everything is perfect before acting isn't always the best thing to do, because you may end up waiting forever. It's better to make incremental progress (even falter while doing so), because what you end up with may be just as good (or even better) as if you tried to be a perfectionist and only made progress/did work when you felt everything was right. My 2 cents, Chris On 11/15/07 1:37 PM, Dennis Kubes [EMAIL PROTECTED] wrote: So I have been talking with some of the other committers and I wanted to lay out a suggestion for standardizing some of the nutch committer workflow processes in the hope of speeding up nutch development. The first one I was hoping to tackle is time to commit. At least for me it has been hard to know when to commit something, especially when it was trivial or no one commented on the issue. Here is what is being proposed:

Trivial changes = immediate, this at the discretion of the committers
Minor changes = 24 hours from latest patch or 1 or more +1 from committers
Major and blocker changes = 4 days from latest patch or 2 or more +1 from committers

This way if an issue has been active for some time but no one has taken a look at it, and it has passed all unit tests, then we can go ahead and commit it. Also this should allow more of the smaller changes to be handled faster. So these of course are just some suggestions; I would love to hear from others in the community. What I think would be best is to come to a consensus on this and then have a wiki page describing this and other processes for committers. Dennis Kubes __ Chris Mattmann, Ph.D. 
[EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: JIRA, Resolving and Closing Issues
Dennis, My practice has been to do the following:

1. Resolve the issue, and describe (at a high level) the changes made to the code, e.g.:
* Introduced new classes A, B, C
* Refactored method Y out of class D and into new class E
* Made internal method F of class G use a member variable as an increment check for blah blah

2. Close the issue and list the revision number in which the patch you applied first exists.

That's my practice: not sure if it's right, but it's what I gleaned from watching the other committers for a few years. Cheers, Chris On 10/18/07 9:58 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Quick question about Jira. When we commit, are we supposed to first resolve and then close the issue? What is the process on this? Dennis Kubes __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 Phone: 818-354-8810 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: writing a new parse-exe plugin
      ).getEmptyParse(getConf());
    }
    /// i'm not sure what to return here if i only need to d/l the file
    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, "", null, null, null);
    parseData.setConf(this.conf);
    return new ParseImpl("", parseData);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }

__ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
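For context, the snippet above is only the tail of a Parser implementation. A minimal, self-contained version of such a plugin might look roughly like the following, based on the Nutch 0.9-era parse API (exact interfaces vary between Nutch versions); the class name and the choice to return an empty-but-successful parse are assumptions for illustration, not the poster's actual code.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

/**
 * Hypothetical "parse-exe" style plugin: it extracts no text or outlinks,
 * it simply records a successful, empty parse so the crawl can move on.
 * The downloaded bytes are available via content.getContent() if the plugin
 * wants to write them somewhere instead.
 */
public class ExeParser implements Parser {

  private Configuration conf;

  public Parse getParse(Content content) {
    // No text and no outlinks -- an empty but successful parse.
    ParseData parseData =
        new ParseData(ParseStatus.STATUS_SUCCESS, "", null, null, null);
    parseData.setConf(this.conf);
    return new ParseImpl("", parseData);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}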
Re: [jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework
Hi Guys, I vote for reverting this patch, unless there is an overall consensus among Nutch developers that it's ok to keep it as it is - on one hand considering the added functionality and simplification of Nutch code, and on the other hand considering the (lack of) maturity of Tika. I agree with Andrzej here. I would have waited a bit more before rushing into this. Because at this point (where no Tika releases have been made) it might (even though it does not look like it right now) even be possible that the project will be retired without any releases at all. I'm not out for beating a dead horse here, but the thought comes to mind: what about the vitality of the code as it exists within the Nutch code base? When was the last time anybody at all worked on the mime system? It was pioneered by Jerome, but he's been largely inactive as a committer for more than a year now, and it doesn't look like that's going to change. I ported what was largely Nutch's mime system, with Jerome's improvements, to Tika, where the code is actively being developed, by me (and vetted by the other *active* members of the team) -- in contrast to Nutch. As a developer, I don't want to maintain the code in both places, but I'm willing to maintain the Nutch use of and interface to Tika, which means that Nutch will inherit the benefits of using this approach. Being a member of the Nutch community for almost 2 years now, I can't tell you how many times people have asked for Nutch to be able to reliably detect XML content. This is reified in the form of a number of different JIRA issues that reference that deficiency and that are, for all intents and purposes, not being worked on at all. I'm all for following the process, and so forth, but at the same time, I think the Nutch community needs to take a serious look at itself with regards to the sacred nature of the trunk, which we currently treat with a large amount of sensitivity, etc. On other projects (and of course I'm biased, but I use my own work, e.g., Tika, as an example), the trunk is not something that is expected to always be working; it is regularly treated as somewhere bugs can exist, and where they can be fixed before a release is made. That's not the way it is on this project, and quite honestly I think that stymies progress. Finally, there is precedent for what I did with the Tika patch making its way into Nutch. If I recall, something very similar happened when Hadoop came along: NDFS (as it was called at the time) and MapReduce made their way into an external library, and Nutch was made to rely on that (at the time) in-development library. This makes sense, because the folks working on Hadoop were actively working on updates to the portion of the code that Nutch relied upon, and all the developers that were interested in that portion of the code started developing in that arena. I'm not comparing Hadoop to Tika, but certainly there are some similarities here. -Chris __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
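Purely as an illustration of the kind of mime-detection call Nutch would delegate to Tika. Note the assumption here: the Tika facade class shown below comes from later Tika releases, not from the unreleased snapshot being debated in this thread.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;

public class MimeDetectDemo {
  public static void main(String[] args) throws Exception {
    Tika tika = new Tika();

    // Name-based detection alone falls back to extension/glob rules ...
    System.out.println(tika.detect("feed.xml"));

    // ... while handing over the bytes as well lets content sniffing run,
    // which is what makes reliable XML (and other) detection possible.
    InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
    try {
      System.out.println(tika.detect(in, args[0]));
    } finally {
      in.close();
    }
  }
}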
Re: svn commit: r550669 - in /lucene/nutch/trunk/src: java/org/apache/nutch/util/ plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ plugin/parse-html/src/java/org/apache/nutch/parse/h
No problemo! Thanks! Cheers, Chris On 6/25/07 9:45 PM, Dennis Kubes [EMAIL PROTECTED] wrote: ooops... gotta remember to do that. Done. Dennis Chris Mattmann wrote: On 6/25/07 8:34 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Jun 25 20:33:59 2007 New Revision: 550669 URL: http://svn.apache.org/viewvc?view=rev&rev=550669 Log: NUTCH-497: Fixes problems relating to StackOverflow errors and extreme nested tags. Adds general framework for stack based Node walking. [...snip...] Hi Dennis, Could you update CHANGES.txt to reflect your commit of NUTCH-497? Thanks! Cheers, Chris
Re: Build failed in Hudson: Nutch-Nightly #123
Doğacan, This is strange indeed. I noticed this during my testing of parse-feed, however, thought it was an anomaly. I got this same strange cryptic unit test error message, and then after some frustration figuring it out, I did ant clean, then ant compile-core test, and miraculously the error seemed to go away. Also, if you go into $NUTCH/src/plugin/feed/ and run ant clean test (of course after running ant compile-core from the top-level $NUTCH dir), the unit tests seem to pass? [XXX:src/plugin/feed] mattmann% pwd /Users/mattmann/src/nutch/src/plugin/feed [XXX:src/plugin/feed] mattmann% ant clean test Searching for build.xml ... Buildfile: /Users/mattmann/src/nutch/src/plugin/feed/build.xml clean: [delete] Deleting directory /Users/mattmann/src/nutch/build/feed [delete] Deleting directory /Users/mattmann/src/nutch/build/plugins/feed init: [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data init-plugin: deps-jar: compile: [echo] Compiling plugin: feed [javac] Compiling 2 source files to /Users/mattmann/src/nutch/build/feed/classes compile-test: [javac] Compiling 1 source file to /Users/mattmann/src/nutch/build/feed/test jar: [jar] Building jar: /Users/mattmann/src/nutch/build/feed/feed.jar deps-test: init: init-plugin: compile: jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: protocol-file jar: deps-test: deploy: copy-generated-lib: deploy: [mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed copy-generated-lib: [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 2 files to /Users/mattmann/src/nutch/build/plugins/feed test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.663 sec BUILD SUCCESSFUL Total time: 3 seconds [XXX:src/plugin/feed] mattmann% Any ideas? Cheers, Chris On 6/20/07 6:04 AM, Doğacan Güney [EMAIL PROTECTED] wrote: On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote: This is rather strange. Here is part of the console output: test: [echo] Testing plugin: parse-swf [junit] Running org.apache.nutch.parse.swf.TestSWFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.387 sec init: [junit] Test org.apache.nutch.parse.feed.TestFeedParser FAILED SWFParser fails one of the unit tests but the report says that FeedParser has failed even though it has actually passed its test: test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec (ant test forks processes to test code, that's why we are seeing test outputs out of order.) Anyway, it is not TestSWFParser but TestFeedParser that fails. I am trying to understand why it fails. Chris, can you lend me a hand here? -- Doğacan Güney __ Chris A. 
Mattmann [EMAIL PROTECTED] Key Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Build failed in Hudson: Nutch-Nightly #123
On 6/20/07 8:17 AM, Doğacan Güney [EMAIL PROTECTED] wrote: Since you are doing compile-core, no plugins get compiled (say, urlfilter-prefix), then when you do an ant test in feed only protocol-file gets compiled. So, no urlfilter-prefix, no problem :). I have to say that I am certain that I am not sure of what I just said. Can you retry with just 'ant' instead of 'ant compile-core'? Heh, yep, that replicated the issue. Okay, so I agree with you with regards to the fix that you suggested; however, the larger issue here is one of annoyance. Why should I have to have a version of the urlfilter-prefix plugin compiled for this issue to manifest itself? Plugin development is supposed to be independent, i.e., while developing the feed plugin I shouldn't need to care about how others have developed the urlfilter plugin, etc., or whether or not there is an appropriate test file there to use in unit testing. I have 2 suggestions:

1. We should make urlfilter-prefix use more of a sensible default for its filters (e.g., a default filter perhaps) that takes effect when the plugin cannot find the specified .txt file (a rough sketch of what I mean follows below).

2. We should think about this more general issue and come up with a way that plugin development in Nutch supports the use case that I was trying, which I find to be highly representative of what many other folks using Nutch are doing as well (i.e., why should I have to do a full rebuild/test of other plugins when I'm simply working on a single one?).

For my part, in the interim, I will ensure that next time before I commit a plugin I make sure that it passes the full 'ant clean compile-core test' cycle. Doğacan, thanks for your help in tracking this down. Could you please commit an example test urlfilter file to make the unit test pass, since you are going to make that change to use lib-xml anyways? Let me know, okay? Thanks! Cheers, Chris
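A rough sketch of the kind of fallback suggestion 1 is getting at. This is a hypothetical, simplified class for illustration only: it is not the actual urlfilter-prefix source, none of the names below exist in Nutch, and the "accept everything when no rules are found" default is just one possible choice.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified, hypothetical prefix filter illustrating a "sensible default":
 * if the configured rules file is missing, fall back to an empty rule set
 * and let every URL through instead of failing during plugin init (e.g.
 * while another plugin's unit test is running).
 */
public class LenientPrefixFilter {

  private final List prefixes = new ArrayList();
  private boolean haveRules = false;

  public LenientPrefixFilter(String rulesFile) {
    try {
      BufferedReader reader = new BufferedReader(new FileReader(rulesFile));
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0 && !line.startsWith("#")) {
          prefixes.add(line);
        }
      }
      reader.close();
      haveRules = true;
    } catch (IOException e) {
      // Rules file absent: log and continue with no rules rather than
      // breaking the whole build/test cycle.
      System.err.println("No prefix rules found (" + e.getMessage()
          + "); passing all URLs through.");
    }
  }

  /** Returns the URL if accepted, or null if filtered out. */
  public String filter(String url) {
    if (!haveRules) {
      return url;                 // default: accept everything
    }
    for (int i = 0; i < prefixes.size(); i++) {
      if (url.startsWith((String) prefixes.get(i))) {
        return url;
      }
    }
    return null;
  }
}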
Re: Welcome Doğacan as Nutch committer
+1 Welcome to the team, Doğacan! Cheers, Chris On 6/12/07 9:43 AM, Sami Siren [EMAIL PROTECTED] wrote: Doğacan Güney wrote: Hi all, I hope that together we will make nutch rock even harder. By looking at your earlier efforts there should be no doubt. Welcome!
Committer
Hi Folks, I'd just like to throw out my +1 for Doğacan Güney's committer status. I've been impressed by several of his contributions and the guy just keeps them coming and coming. I'm not a member of the Lucene PMC, so I don't have official voting rights, however, I would like to express my support for his elevation to committer status. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Key Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Nutch Release 0.9 - Waiting for release to propagate to mirrors
Hi Guys, Okay, it looks like Nutch 0.9 has propagated to (at least some of the) Apache mirror sites. So, I will now move forward with the final steps of the release. I will have some free time later this afternoon (PST, Los Angeles time) to finish it up. I'll post an email to the developers list announcing the completion of the release. Thanks! Cheers, Chris On 4/4/07 7:21 PM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Guys, I've just moved forward with step 13 in the release process (waiting for release to propogate to mirrors). Should I just go ahead and do the other steps (update Nutch site, update Lucene site, Update javadoc, create version in JIRA, etc.)? It seems that I could do these without the release having propagated to the mirrors as of yet. What do you guys think? Thanks! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Nutch 0.9 officially released!
Hi Folks, After some hard work from all folks involved, we've managed to push out Apache Nutch, release 0.9. This is the second release of Nutch based entirely on the underlying Hadoop platform. This release includes several critical bug fixes, as well as key speedups described in more detail at Sami Siren's blog: http://blog.foofactory.fi/2007/03/twice-speed-half-size.html See the list of changes made in this version: http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt The release is available here. http://www.apache.org/dyn/closer.cgi/lucene/nutch/ Special thanks to (in no particular order): Andrzej Bialecki, Dennis Kubes, Sami Siren, and the rest of the Nutch development team for providing lots of help along the way, and for allowing me to be the release manager! Enjoy the new release! Cheers, Chris
Re: [VOTE] Release Apache Nutch 0.9
Hi Guys, Alrighty, that's 4 binding votes (Sami, Andrzej, me, and Dennis), so I think we can safely move forward with the release process. I will finish the release up when I get back to my home computer tonight (~5pm Pacific Standard Time, Los Angeles). Thanks, and I will get this thing wrapped up tonight! :-) Cheers, Chris On 4/4/07 8:04 AM, Sami Siren [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/rc2/ Please vote on releasing these packages as Apache Nutch 0.9. +1 -- Sami Siren __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Nutch Release 0.9 - Waiting for release to propagate to mirrors
Hi Guys, I've just moved forward with step 13 in the release process (waiting for the release to propagate to mirrors). Should I just go ahead and do the other steps (update Nutch site, update Lucene site, update javadoc, create version in JIRA, etc.)? It seems that I could do these without the release having propagated to the mirrors as of yet. What do you guys think? Thanks! Cheers, Chris
Re: [VOTE] Release Apache Nutch 0.9
Hi Guys, I think we're discussing the same thing (improving the process), I just don't think 0.9 is out yet :) But to wrap it up for me: +1 for creating the 0.9 branch after fixing the bug (and removing the tag), creating a new rc and starting a vote. +1. +1. So, that's 3 binding votes to change the process. It looks like we have enough to get started. I will begin work tonight (my time, Los Angeles, PST) on removing the tag, and starting the process over again. In the meanwhile, Dennis, do you have the patch that fixes the issue with Hadoop? If so, could you commit it ASAP to the trunk. Once that's done, I'll remove the tag, start the release process over again, and get an RC out for a vote. Then, we can move forward from there. Thanks, guys! Cheers, Chris I still propose that we discuss a bit more (in a separate thread) before rewriting the how to release page in wiki. I agree - the current release process didn't fare too well in this particular situation ... __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: svn commit: r524932 - in /lucene/nutch/trunk/src/java/org/apache/nutch/segment: SegmentMerger.java SegmentReader.java
Hi Dennis, Thanks for taking care of this. :-) Could you update CHANGES.txt as well? Once you take care of that, in about 2 hrs (when I get home), I'll begin the release process again. Thanks! Cheers, Chris On 4/2/07 2:40 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Apr 2 14:40:10 2007 New Revision: 524932 URL: http://svn.apache.org/viewvc?view=revrev=524932 Log: NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. Patch supplied originally by Michael Stack and updated by Doğacan Güney. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/segm ent/SegmentMerger.java?view=diffrev=524932r1=524931r2=524932 == --- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java Mon Apr 2 14:40:10 2007 @@ -18,17 +18,37 @@ package org.apache.nutch.segment; import java.io.IOException; -import java.util.*; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Iterator; +import java.util.TreeMap; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; - -import org.apache.hadoop.conf.*; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.PathFilter; -import org.apache.hadoop.io.*; -import org.apache.hadoop.mapred.*; +import org.apache.hadoop.io.MapFile; +import org.apache.hadoop.io.SequenceFile; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.UTF8; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.WritableComparable; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.hadoop.mapred.InputSplit; +import org.apache.hadoop.mapred.JobClient; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.Mapper; +import org.apache.hadoop.mapred.OutputCollector; +import org.apache.hadoop.mapred.OutputFormatBase; +import org.apache.hadoop.mapred.RecordReader; +import org.apache.hadoop.mapred.RecordWriter; +import org.apache.hadoop.mapred.Reducer; +import org.apache.hadoop.mapred.Reporter; +import org.apache.hadoop.mapred.SequenceFileInputFormat; +import org.apache.hadoop.mapred.SequenceFileRecordReader; import org.apache.hadoop.util.Progressable; import org.apache.nutch.crawl.CrawlDatum; import org.apache.nutch.crawl.Generator; @@ -39,6 +59,7 @@ import org.apache.nutch.parse.ParseText; import org.apache.nutch.protocol.Content; import org.apache.nutch.util.NutchConfiguration; +import org.apache.nutch.util.NutchJob; /** * This tool takes several segments and merges their data together. 
Only the @@ -482,7 +503,7 @@ if (LOG.isInfoEnabled()) { LOG.info(Merging + segs.length + segments to + out + / + segmentName); } -JobConf job = new JobConf(getConf()); +JobConf job = new NutchJob(getConf()); job.setJobName(mergesegs + out + / + segmentName); job.setBoolean(segment.merger.filter, filter); job.setLong(segment.merger.slice, slice); Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/segm ent/SegmentReader.java?view=diffrev=524932r1=524931r2=524932 == --- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Mon Apr 2 14:40:10 2007 @@ -17,18 +17,48 @@ package org.apache.nutch.segment; -import java.io.*; +import java.io.BufferedReader; +import java.io.BufferedWriter; +import java.io.IOException; +import java.io.InputStreamReader; +import java.io.OutputStreamWriter; +import java.io.PrintStream; +import java.io.PrintWriter; +import java.io.Writer; import java.text.SimpleDateFormat; -import java.util.*; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Date; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; - import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; -import org.apache.hadoop.fs.*; -import org.apache.hadoop.io.*; -import org.apache.hadoop.mapred.*; +import org.apache.hadoop.fs.FileSystem; +import
Re: svn commit: r524932 - in /lucene/nutch/trunk/src/java/org/apache/nutch/segment: SegmentMerger.java SegmentReader.java
Hi Dennis, No problem! :-) You did it really fast quite honestly. I will start the release process shortly... Take care! Cheers, Chris On 4/2/07 6:21 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Chris, I have updated changes and resolved and closed the issue. Sorry about not getting to it sooner. Dennis Kubes Chris Mattmann wrote: Hi Dennis, Thanks for taking care of this. :-) Could you update CHANGES.txt as well? Once you take care of that, in about 2 hrs (when I get home), I'll begin the release process again. Thanks! Cheers, Chris On 4/2/07 2:40 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Apr 2 14:40:10 2007 New Revision: 524932 URL: http://svn.apache.org/viewvc?view=revrev=524932 Log: NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. Patch supplied originally by Michael Stack and updated by Doğacan Güney. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/se gm ent/SegmentMerger.java?view=diffrev=524932r1=524931r2=524932 == --- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java Mon Apr 2 14:40:10 2007 @@ -18,17 +18,37 @@ package org.apache.nutch.segment; import java.io.IOException; -import java.util.*; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Iterator; +import java.util.TreeMap; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; - -import org.apache.hadoop.conf.*; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.PathFilter; -import org.apache.hadoop.io.*; -import org.apache.hadoop.mapred.*; +import org.apache.hadoop.io.MapFile; +import org.apache.hadoop.io.SequenceFile; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.UTF8; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.WritableComparable; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.hadoop.mapred.InputSplit; +import org.apache.hadoop.mapred.JobClient; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.Mapper; +import org.apache.hadoop.mapred.OutputCollector; +import org.apache.hadoop.mapred.OutputFormatBase; +import org.apache.hadoop.mapred.RecordReader; +import org.apache.hadoop.mapred.RecordWriter; +import org.apache.hadoop.mapred.Reducer; +import org.apache.hadoop.mapred.Reporter; +import org.apache.hadoop.mapred.SequenceFileInputFormat; +import org.apache.hadoop.mapred.SequenceFileRecordReader; import org.apache.hadoop.util.Progressable; import org.apache.nutch.crawl.CrawlDatum; import org.apache.nutch.crawl.Generator; @@ -39,6 +59,7 @@ import org.apache.nutch.parse.ParseText; import org.apache.nutch.protocol.Content; import org.apache.nutch.util.NutchConfiguration; +import org.apache.nutch.util.NutchJob; /** * This tool takes several segments and merges their data together. 
Only the @@ -482,7 +503,7 @@ if (LOG.isInfoEnabled()) { LOG.info(Merging + segs.length + segments to + out + / + segmentName); } -JobConf job = new JobConf(getConf()); +JobConf job = new NutchJob(getConf()); job.setJobName(mergesegs + out + / + segmentName); job.setBoolean(segment.merger.filter, filter); job.setLong(segment.merger.slice, slice); Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/se gm ent/SegmentReader.java?view=diffrev=524932r1=524931r2=524932 == --- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Mon Apr 2 14:40:10 2007 @@ -17,18 +17,48 @@ package org.apache.nutch.segment; -import java.io.*; +import java.io.BufferedReader; +import java.io.BufferedWriter; +import java.io.IOException; +import java.io.InputStreamReader; +import java.io.OutputStreamWriter; +import java.io.PrintStream; +import java.io.PrintWriter; +import java.io.Writer; import java.text.SimpleDateFormat; -import java.util.*; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Date; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List
[VOTE] Release Apache Nutch 0.9
Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/rc2/ See the included CHANGES-0.9.txt file for details on release contents and latest changes. The release was made from the 0.9-dev trunk, including the recent patch applied by Dennis. I've also created a branch for this release candidate at: http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9. Please vote on releasing these packages as Apache Nutch 0.9. The vote is open for the next 72 hours. Only votes from Nutch committers are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... Thanks! Cheers, Chris
Re: [VOTE] Release Apache Nutch 0.9
Folks, As an FYI, here is a link to the log of the steps that I followed to get to this point in the release: http://people.apache.org/~mattmann/NUTCH_0.9_release_log_v2.doc Cheers, Chris On 4/2/07 10:52 PM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/rc2/ See the included CHANGES-0.9.txt file for details on release contents and latest changes. The release was made from the 0.9-dev trunk, including the recent patch applied by Dennis. I've also created a branch for this release candidate at: http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9. Please vote on releasing these packages as Apache Nutch 0.9. The vote is open for the next 72 hours. Only votes from Nutch committers are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... Thanks! Cheers, Chris
Re: [VOTE] Release Apache Nutch 0.9
Well, it's just going to add more work for me, but in the end, it's probably something that needs to be in there. I could go either way on this though, as in, if we don't commit it, 0.9.1 shouldn't be far off. Here's my +1 for going ahead and committing it... On 3/28/07 10:21 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Yes. This seems to have fixed the problem. All, do we want to create a JIRA and commit this for the 0.9 release? Dennis Andrzej Bialecki wrote: Doğacan Güney wrote: Hi, On 3/28/07, Dennis Kubes [EMAIL PROTECTED] wrote: This is definitely a hadoop problem. This is similar to the classpath issues that we were encountering before with Hadoop and the ReductTaskRunner. When I include the nutch-*.jar in the hadoop class path the errors go away. Not a fix but it proves the point that this is an issue with Hadoop class loading. Dennis Kubes Dennis, you were running SegmentMerger, I presume? This occurs probably because in SegmentMerger and SegmentReader's dump Nutch uses JobConf instead of NutchJob. Because of this Hadoop can't find the necessary job file. I put a simple patch at http://www.ceng.metu.edu.tr/~e1345172/use-nutch-job.patch . Can you try it with this? Duh, the patch seems to be exactly what's needed - thanks Doğacan! In the future we should rework the test suite to execute using a clean Hadoop installation, i.e. one where Hadoop daemons are started without Nutch classes on the classpath.
Re: Next release - 0.10.0 or 1.0.0 ?
My +1 for 1.0.0. I already changed it to 0.10.0, but this can be easily reverted, and was probably something that I should have brought to the attention of the dev list before I did that (sorry about that). In any case, I think 1.0.0 makes a lot of sense, politically and software-wise. Nutch is production quality software (we use it in production environments here at JPL), and deserves to have a 1.0.0 release... My 2 cents, Chris On 3/28/07 11:38 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi all, I know it's a trivial issue, but still ... When this release is out, I propose that we should name the next release 1.0.0, and not 0.10.0. The effect is purely psychological, but it also reflects our confidence in the platform. Many Open Source projects are afraid of going to 1.0.0 and seem to be unable to ever reach this level, as if it were a magic step beyond which they are obliged to make some implied but unjustified promises ... Perhaps it's because in the commercial world everyone knows what a 1.0.0 release means :) The downside of the version numbering that never reaches 1.0.0 is that casual users don't know how usable the software is - e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases to go before it becomes usable. Therefore I propose the following:
* shorten the release cycle, so that we can make a release at least once every quarter. This was discussed before, and I hope we can make it happen, especially with the help of new forces that joined the team ;)
* call the next version 1.0.0, and continue in increments of 0.1.0 for each bi-monthly or quarterly release,
* make critical bugfix / maintenance releases using increments of 0.0.1 - although the need for such would be greatly diminished with the shorter release cycle.
* once we arrive at versions greater than x.5.0 we should plan for a big release (increment of 1.0.0).
* we should use only single digits for small increments, i.e. limit them to values between 0-9.
What do you think?
Re: [VOTE] Release Apache Nutch 0.9
I've gone ahead and figured out how to generate my GPG public key :-) It wasn't as hard as I thought. Anyways, I placed my gpg.txt file in ~mattmann/gpg.txt on people.apache.org. I've also added my GPG key to the KEYS file in the nutch dist directory, /www/www.apache.org/dist/lucene/nutch/, using the same convention as the others. To get the header, I did a gpg --list-keys. Thanks! Cheers, Chris On 3/27/07 8:14 AM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Sami, A very limited acid test shows that I can do crawling and searching through web app so that part is ok. Great! Similar tests of my own showed the same. About signatures: I can't find your public gpg key anywhere (to verify the signature), not in KEYS file nor in keyservers I checked. Am I just blind? Yeah, in my release log, I actually noted this. I was having a hard time figuring out how to generate my public gpg key. Do you know what command to run? I know where the KEYS file is in the dist directory, so I'm guessing I just: 1. Generate my public gpg key (I already have my private one I guess) 2. Add that public gpg key to the KEYS file in the Nutch dist directory on people.apache.org Am I right about this? If so, could you tell me the command to run to generate my public gpg key? The md5 format used differs from rest of lucene sub projects. According to the Apache sign and release guide (http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step), I ran the following command: openssl md5 nutch-0.9.tar.gz > nutch-0.9.tar.gz.md5 To create it in similar format as the rest of lucene one could use md5sum file > file.md5 We should probably adopt to same convention or wdot? It's fine by me, but, just for my reference, what's the difference between using the openssl md5 versus md5sum? If you want me to regenerate it, just let me know... Cheers, Chris -- Sami Siren
Re: [VOTE] Release Apache Nutch 0.9
Hey Sami, Well the sum itself is obviously the same :) The point in this is to use same conventions in Lucene family, not strictly required, but still IMO it just looks better. Okey dok -- I will run the md5sum command, and generate a .md5 for the nutch release that matches that. I will put it in the same place as the current md5 -- it should be there in 5 mins. Thanks! Cheers, Chris -- Sami Siren
Initiation of 0.9 release process
Hi Folks, As your friendly neighborhood 0.9 release manager, I just wanted to give you all a heads up that I'd like to begin the release process today. If I hear no objections by 00:00:00 UTC time, I will begin the release process then. I will notify the list as soon as I'm done. Thanks! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Initiation of 0.9 release process
Hey Dennis, I'm basically going to follow the release process on the wiki (pointed to by Doug), and the steps that I discussed with you and Sami (posted to the dev list). In terms of help, if there's anything in those steps that I get stuck on, I'll hollar at ya. Otherwise, if the process goes smoothly, I can probably get it done on my own. Thanks for the offer: I'll be sure to call on you if I get stuck. :-) Cheers, Chris On 3/26/07 10:06 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Let me know if I can help in any way? Dennis Kubes Chris Mattmann wrote: Hi Folks, As your friendly neighborhood 0.9 release manager, I just wanted to give you all a heads up that I'd like to begin the release process today. If I hear no objections by 00:00:00 UTC time, I will begin the release process then. I will notify the list as soon as I'm done. Thanks! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Nutch 0 .9 release progress update
Hi Folks, Just to update everyone on progress. I've made it to Step 13 (waiting for release to appear on mirrors) in the Release Process: http://wiki.apache.org/nutch/Release_HOWTO You can view a full log of the fun that I've been having by going to: http://people.apache.org/~mattmann/NUTCH_0.9_release_log.doc Tomorrow when I wake up (here in Los Angeles, Pacific Standard Time), I will go ahead and wrap up the rest of the process. Thanks to all the folks who've given me guidance along the way. It's been interesting figuring out the process. Thanks! Cheers, Chris
Re: Nutch 0 .9 release progress update
Hi Sami, Thanks for the heads up! :-) Okay, so I did the following: 1. Removed nutch-0.9.* from people.apache.org:/www/www.apache.org/dist/lucene/nutch 2. Removed CHANGES-0.9.txt from the same place I will send out a separate email calling for a vote (thanks for the pointer to the example!) Thanks! Cheers, Chris On 3/26/07 10:22 PM, Sami Siren [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Folks, Just to update everyone on progress. I've made it to Step 13 (waiting for release to appear on mirrors) in the Release Process: Chris, thanks for your work so far. Seems like we're missing one important point in the rtfm: release review vote. Every apache release should be voted before it is made official. Three binding votes are required (I believe we now have enough active committers to do it this way?). So please put the artifacts in a staging area and call a vote before going further. (there's a nice example here for a vote mail: http://www.mail-archive.com/dev@jackrabbit.apache.org/msg04641.html) -- Sami Siren
[VOTE] Release Apache Nutch 0.9
Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/ See the included CHANGES-0.9.txt file for details on release contents and latest changes. The release was made from the 0.9-dev trunk. Please vote on releasing these packages as Apache Nutch 0.9. The vote is open for the next 72 hours. Only votes from Nutch committers are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... Thanks! Cheers, Chris
Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt
Hi Dennis, Not to nit-pick, but the place where you inserted your change isn't at the end (where they typically should be placed). You inserted in the middle of the file, throwing off the numbering (there are now 2 sets of 18, and 19 in the unreleased changes section). Could you please append your changes to the end of the file, and recommit? Thanks a lot! Cheers, Chris On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Sat Mar 10 10:03:07 2007 New Revision: 516759 URL: http://svn.apache.org/viewvc?view=revrev=516759 Log: Updated to reflect commits of NUTCH-233 and NUTCH-436. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diffrev=5167 59r1=516758r2=516759 == --- lucene/nutch/trunk/CHANGES.txt (original) +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007 @@ -50,6 +50,13 @@ 17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab) +18. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan +Groschupf via kubes) + +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL + path is empty (kubes) + + ** WARNING !!! * This upgrade breaks data format compatibility. A tool 'convertdb' * * was added to migrate existing CrawlDb-s to the new format. Segment data *
Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt
Dennis, No probs. Thanks, a lot! Cheers, Chris On 3/10/07 5:35 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Dennis, Not to nit-pick, but the place where you inserted your change isn't at the end (where they typically should be placed). You inserted in the middle of the file, throwing off the numbering (there are now 2 sets of 18, and 19 in the unreleased changes section). Could you please append your changes to the end of the file, and recommit? Thanks a lot! Cheers, Chris Sorry about that. I say the warning message thinking it was a version break. Everything should be fixed now. Dennis Kubes On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Sat Mar 10 10:03:07 2007 New Revision: 516759 URL: http://svn.apache.org/viewvc?view=revrev=516759 Log: Updated to reflect commits of NUTCH-233 and NUTCH-436. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diffrev=51 67 59r1=516758r2=516759 == --- lucene/nutch/trunk/CHANGES.txt (original) +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007 @@ -50,6 +50,13 @@ 17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab) +18. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan +Groschupf via kubes) + +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL + path is empty (kubes) + + ** WARNING !!! * This upgrade breaks data format compatibility. A tool 'convertdb' * * was added to migrate existing CrawlDb-s to the new format. Segment data *
Re: [jira] Commented: (NUTCH-384) Protocol-file plugin does not allow the parse plugins framework to operate properly
Hi Andrzej, Yep, +1. I also want to make a small update, where instead of creating a new NutchConf object, to just pass it through (maybe via the protocol layer?). Does this make sense? Cheers, Chris On 3/8/07 1:47 PM, Andrzej Bialecki (JIRA) [EMAIL PROTECTED] wrote: [ https://issues.apache.org/jira/browse/NUTCH-384?page=com.atlassian.jira.plugin .system.issuetabpanels:comment-tabpanel#action_12479442 ] Andrzej Bialecki commented on NUTCH-384: - +1 - although the patch needs whitespace cleanup before committing (indentation should be 2 literal spaces, if keyword should be separated by one space from the parens). Protocol-file plugin does not allow the parse plugins framework to operate properly - -- Key: NUTCH-384 URL: https://issues.apache.org/jira/browse/NUTCH-384 Project: Nutch Issue Type: Bug Affects Versions: 0.8, 0.8.1, 0.9.0 Environment: All Reporter: Paul Ramirez Assigned To: Chris A. Mattmann Attachments: file_protocol_mime_patch.diff When using the file protocol one can not map a parse plugin to a content type. The only way to get the plugin called is through the default plugin. The issue is that the content type never gets mapped. Currently the content type does not get set by the file protocol.
Re: [jira] Commented: (NUTCH-384) Protocol-file plugin does not allow the parse plugins framework to operate properly
Hi Andrzej, Ah, yep, you're right. I just did a cursory inspection, and hadn't applied the patch (yet). I didn't notice it was in the main method. Kk, sounds good. I am applying patch now, and will test later this afternoon, fix the whitespace stuff, and then commit. Thanks! Cheers, Chris On 3/8/07 1:55 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Andrzej, Yep, +1. I also want to make a small update, where instead of creating a new NutchConf object, to just pass it through (maybe via the protocol layer?). Does this make sense? I'm not sure what you mean - the only place where this patch creates a Configuration object is in File.main(), which is innocuous.
0.9 release
Hi Folks, As suggested by Sami, I'm moving this discussion to the nutch-dev list. Seems like I am the guy that is going to do the Nutch 0.9 release :-) However, it seems also that there are some issues that need to be sorted out first. I'd like to follow up to Andrzej's email about loose ends before moving forward with the release. So, here are my questions:

1. What remaining issues out there need to be applied to the sources (or have patches contributed, then applied) and make it into 0.9? There were some discussions about this, however, I don't think we have a concrete set yet. The answer I'm looking for would be something like:
A. NUTCH-XXX (has a patch), NUTCH-YYY (has a patch) before 0.9 is made
B. NUTCH-ZZZ (patch in progress) before 0.9 is made
C. We've got enough in 0.9-dev in the trunk right now to make a 0.9 release

2. Any outstanding things that need to get done that aren't really code that needs to get committed, e.g., things we need to close the loop on?

3. Release Manager: I've got this taken care of, as soon as you all give me the green light.

So, please, committer-brethren, let me know what you think about 1-3, as it would help me understand how to move forward. Thanks! Cheers, Chris
Re: Issues pending before 0.9 release
Hi Guys, Blocker * NUTCH-400 (Update add missing license headers) - I believe this is fixed and should be closed +1, thanks to Sami for closing it. * NUTCH-353 (pages that serverside forwards will be refetched every time) - this was partially fixed in NUTCH-273, but a more complete solution would require significant changes to LinkDb. As there are no patches implementing this, I left it open, but it's no longer as critical as it was before. I propose to move it to Major and address it in the next release. +1 * NUTCH-233 (wrong regular expression hang reduce process for ever) - I propose to apply the fix provided by Sean Dean and close this issue for now. +1 Critical * NUTCH-436 (Incorrect handling of relative paths when the embedded URL path is empty). There is no patch available yet. If someone could contribute a patch I'd like to see this fixed before the release. Looks like Dennis is on this one * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's certainly not critical (as this is an optional new feature). I propose to change it to Major, and make a decision - do we want another plugin like parse-mp3 or parse-rtf, or not. Let's hold off on this: it's not necessary for 0.9, and I don't think there's been a bunch of traffic on the list identifying this as critical to get into the sources for the release * NUTCH-381 (Ignore external link not work as expected) - I'll try to reproduce it, and if I find an easy fix I'd like to apply it before the release. +1 * NUTCH-277 (Fetcher dies because of max. redirects) - I wasn't able to reproduce it. If there is no updated information on this I propose to close it with Can't reproduce. +1, I had to do something similar with NUTCH-258 * NUTCH-167 (Observation of META NAME=ROBOTS CONTENT=NOARCHIVE) - there's a patch which I tested in a limited production env. If there are no objections I'd like to apply it before the release. +1 Major = There are 84 major issues, but some of them are either invalid, or should be minor, or no longer apply and should be closed. Please review them if you can and provide some comments or recommendations if you think you have some new information. I will spend some time going through JIRA today and see if there's any issues that I can find that: 1. Have a patch already 2. Sound like something quick, easy, and not so far-reaching across the entire Nutch API One decision also that we need to make is which version of Hadoop should be included in the release. Current trunk uses 0.10.1, I have a set of production-tested patches that use 0.11.2, and today the Hadoop team released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time before our release). The most conservative option is to stay with 0.10.1, but by the time people start using Nutch this will be a fairly old version already. I propose to upgrade to 0.11.2. We could use 0.12.1 - but in this case with the expectation that we release less than stable version of Nutch to be soon followed by a minor stable release ... I'd agree with the upgrade to 0.11.2, +1 Cheers, Chris P.S. I am going to contact Pitor and coordinate with him: I'd like to be the release manager for this Nutch release.
Re: Welcome Dennis Kubes as Nutch committer
Dennis, I take my coffee black: with a single creamer ;) Okay, okay, sorry: I thought we were talking about *real* hazing ;) Cheers, Chris On 2/28/07 12:31 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Hi All, Thank you Andrzej for your kind words. I am looking forward to working together with everyone, and I hope I can continue to be too inquisitive. I don't know if I can introduce myself shortly, but I will try. ;) For those that don't know me, I am based in Plano (Dallas), Texas. I am 28 and have been programming for about 12 years. So as my first commit I need to add my name and re-publish the website. Let the hazing begin. Dennis Kubes Andrzej Bialecki wrote: Hi all, Some time ago I proposed to the Lucene PMC that Dennis should become a Nutch committer. Dennis has been found guilty of providing too many good quality patches, sending too many supportive emails to the mailing lists, and generally being too inquisitive in nature, which led to a constant stream of comments, suggestions and patches. We weren't able to keep up - something had to be done about it ... ;) I'm glad to announce that the Lucene PMC has voted in his favor. Congratulations and welcome aboard! (The tradition on Apache projects is that new committers should (shortly) introduce themselves, and as their first commit they should put their name in the Credits section of the website and re-publish the website). __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: log guards
Hi Doug and Jerome, Ah, yes, the log guard conversation. I remember this from a while back. Hmmm, do you guys know which issue this was recorded as in JIRA? I've had some free time recently, so I will be able to add this to my list of Nutch stuff to work on, and I would be happy to take the lead on removing the guards where needed, and reviewing whether or not the debug ones make sense where they are. Cheers, Chris On 2/13/07 11:17 AM, Jérôme Charron [EMAIL PROTECTED] wrote: These guards were all introduced by a patch some time ago. I complained at the time, and it was promised that this would be repaired, but it has not yet been. Yes, sorry Doug, that's my own fault; I really don't have time to fix this :-( Best regards Jérôme __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: RSS-fetcher and index individual - how can I realize this function
Hi Doug, Okay, I see your points. It seems like this would be really useful for some current folks, and for Nutch going forward. I see that there has been some initial work today toward preparing patches. I'd be happy to shepherd this into the sources. I will begin reviewing what's required, and contacting the folks who've begun work on this issue. Thanks! Cheers, Chris

On 2/7/07 1:31 PM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Got it. So, the logic behind this is: why bother waiting until the following fetch to parse (and create ParseData objects from) the RSS items out of the feed? Okay, I get it, assuming that the RSS feed has *all* of the RSS metadata in it. However, it's perfectly acceptable to have feeds that simply have a title, description, and link in them.

Almost. The feed may have less than the referenced page, but it's also a lot easier to parse, since the link could be an anchor within a large page, or could be a page that has lots of navigation links, spam comments, etc. So feed entries are generally much more precise than the pages they reference, and may make for a higher-quality search experience.

I guess this is still valuable metadata information to have; however, the caveat is that the implication of the proposed change is:
1. We won't have cached copies, or fetched copies, of the Content represented by the item links. Therefore, in this model, we won't be able to pull up a Nutch cache of the page corresponding to the RSS item, because we are circumventing the fetch step.

Good point. We indeed wouldn't have these URLs in the cache.

2. It sounds like a pretty fundamental API shift in Nutch, to support a single type of content, RSS. Even if there are more content types that follow this model, as Doug and Renaud both pointed out, there aren't a multitude of them (perhaps archive files, but can you think of any others)?

Also true. On the other hand, Nutch provides 98% of an RSS search engine. It'd be a shame to have to re-invent everything else, and it would be great if Nutch could evolve to support RSS well. Could image search also benefit from this? One could generate a Parse for each image on a page, whose text was drawn from the page. Product search too, perhaps.

The other main thing that comes to mind for me is that it prevents the fetched Content for the RSS items from being able to provide useful metadata, in the sense that it doesn't explicitly fetch the content. What if we wanted to apply some super cool metadata extractor X that used word-stemming, HTML design analysis, and other techniques to extract metadata from the content pointed to by an RSS item link? In the proposed model, we assume that the RSS item tag already contains all the metadata necessary for indexing, which in my mind limits the model. Does what I am saying make sense? I'm not shooting down the issue; I'm just trying to brainstorm a bit here.

Sure, the RSS feed may contain less than the page it references, but that might be all that one wishes to index. Otherwise, if, e.g., a blog includes titles from other recent posts, you're going to get lots of false positives. Ideally Nutch should support various options: searching the feed only, searching the referenced page only, or perhaps searching both. Doug
Re: RSS-fetcher and index individual - how can I realize this function
Guys, Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry... Cheers, Chris On 2/7/07 9:58 AM, Doug Cutting [EMAIL PROTECTED] wrote: Renaud Richardet wrote: I see. I was thinking that I could index the feed items without having to fetch them individually. Okay, so if Parser#parse returned a Map<String, Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right? So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward. Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
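A minimal sketch of the change Doug describes, assuming the 0.9-era Nutch types (Parse, ParseData, ParseImpl, ParseStatus, Outlink, and Content are real classes; MultiParser, FeedItem, and the abstract items() hook are invented here purely for illustration):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;
    import org.apache.nutch.parse.ParseStatus;
    import org.apache.nutch.protocol.Content;

    // Hypothetical revision of the Parser extension point: one fetched
    // Content may yield several Parse objects, keyed by the URL each
    // parse should be indexed under (for RSS, each item's link).
    interface MultiParser {
      Map<String, Parse> parse(Content content);
    }

    // FeedItem stands in for whatever object model a feed library
    // hands back for a single item element.
    interface FeedItem {
      String getLink();
      String getTitle();
      String getDescription();
    }

    abstract class RssMultiParserSketch implements MultiParser {
      // Walking the XML is the feed library's job; elided here.
      protected abstract List<FeedItem> items(Content content);

      public Map<String, Parse> parse(Content content) {
        Map<String, Parse> parses = new HashMap<String, Parse>();
        for (FeedItem item : items(content)) {
          ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS,
              item.getTitle(), new Outlink[0], content.getMetadata());
          // Keyed by the item's link, so no separate fetch is needed
          // to index the item as its own document.
          parses.put(item.getLink(),
              new ParseImpl(item.getDescription(), data));
        }
        return parses;
      }
    }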
Re: RSS-fetcher and index individual - how can I realize this function
Hi Doug, Since the target of the link must still be indexed separately from the item itself, how much use is all this? If the RSS document is considered a single page that changes frequently, and the items' links are considered ordinary outlinks, isn't much the same effect achieved? IMHO, yes. That's why it's been hard for me to understand the real use case for what Gal et al. are talking about. I've been trying to wrap my head around it, but it seems to me the capability they require is sort of already provided... Cheers, Chris Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: RSS-fetcher and index individual - how can I realize this function
Hi Gal, et al., I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before, and it seems that we keep talking around each other. I'd like to get to the heart of the matter so that the issue (if there is an actual one) gets addressed ;) Okay, so you mention below that the thing you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum, and to parse it in the next fetch phase. Well, there are 2 options here for what you refer to as "it":
1. If you're talking about the RSS file, then in fact it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed.
2. If you're talking about the item links within the RSS file, in fact they are parsed (eventually), and their data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file and make sure that we index their content as well (see the sketch after this message).
Thus, if you had an RSS file R that contained links to a PDF file A and an HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Then queries that would match R (the physical RSS file) would additionally match things such as P and A, and all 3 would be capable of being returned in a Nutch query. Does this make sense? Is this the issue that you're talking about? Am I nuts? ;) Cheers, Chris

On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote: Hi, Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give the users concentrated data, and so forth. Some of the RSS files supplied by sites are created specially for search engines, where each RSS item represents a web page on the site. IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum that would mark the URL as parsable, not fetchable? Just my two cents... Gal.

-Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED]] Sent: Wednesday, January 31, 2007 8:44 AM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fetcher and index individual - how can I realize this function

Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris

On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: Thanks for your reply. Maybe I didn't explain clearly. I want to index the item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains:
title: nutch-open source
description: nutch nutch nutch nutch nutch
url: http://lucene.apache.org/nutch
category: news
author: kauu
So, can the plugin parse-rss satisfy what I need?
<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
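To make the Outlink mechanism described above concrete, here is a minimal sketch of what parse-rss effectively does with item links. RssItem is an invented stand-in for the commons-feedparser object model, and the Outlink constructor is shown in its (toUrl, anchor) form; the exact signature varied across 0.8/0.9-era versions:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.nutch.parse.Outlink;

    class RssOutlinkSketch {
      // Invented stand-in for the feed library's per-item object.
      interface RssItem {
        String getLink();
        String getTitle();
      }

      // Each item's link becomes an ordinary outlink, so the linked page
      // is queued, fetched, parsed, and indexed like any other URL; the
      // resulting array goes into the feed's own ParseData.
      static Outlink[] itemOutlinks(List<RssItem> items) throws Exception {
        List<Outlink> outlinks = new ArrayList<Outlink>();
        for (RssItem item : items) {
          outlinks.add(new Outlink(item.getLink(), item.getTitle()));
        }
        return outlinks.toArray(new Outlink[outlinks.size()]);
      }
    }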
Re: RSS-fetcher and index individual - how can I realize this function
Hi there, I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks, and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris

On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as different documents in the index, so that the searcher can search an item's info as an individual hit. My idea is to create a protocol to fetch the RSS page and store it as several pages, each containing just one item tag. But the unique key is the URL, so how can I store them with the item's link tag as the unique key for a document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the plugin protocol-http, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before storing it - not as one document but several. Can anyone give me some hints? Any reply will be appreciated!

An item's structure:

<item>
  <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
  <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>搜狐焦点图新闻</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
Re: RSS-fetcher and index individual - how can I realize this function
Hi there, On 1/30/07 7:00 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Chris, I saw your name associated with the RSS parser in Nutch. My understanding is that Nutch is using feedparser. I had two questions:
1. Have you looked at vtd as an RSS parser?
I haven't, in fact; what are its benefits over those of commons-feedparser?
2. Any view on asynchronous communication as the underlying protocol? I do not believe that feedparser uses that at this point.
I'm not sure exactly what asynchronous communication affords you when parsing RSS feeds: what type of communications are you talking about above? Nutch handles the communications layer for fetching content using a pluggable, Protocol-based model. The only feature that Nutch's RSS parser uses from the underlying feedparser library is its object model and callback framework for parsing RSS/Atom feed XML documents. When you mention asynchronous above, are you talking about the protocol for fetching the different RSS documents? Thanks! Cheers, Chris
Re: RSS-fetcher and index individual - how can I realize this function
Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris

On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: Thanks for your reply. Maybe I didn't explain clearly. I want to index the item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains:
title: nutch-open source
description: nutch nutch nutch nutch nutch
url: http://lucene.apache.org/nutch
category: news
author: kauu
So, can the plugin parse-rss satisfy what I need?

<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Hi Doug, So, does this render the patch that I wrote obsolete? Cheers, Chris On 1/25/07 10:08 AM, Doug Cutting [EMAIL PROTECTED] wrote: Scott Ganyo (JIRA) wrote: ... since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter static constructor ... FYI, Hadoop no longer does this. Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
It's at least out-of-date, and perhaps obsolete. A quick read of Fetcher.java suggests there might be a case where a fatal error is logged but the fetcher doesn't exit, in FetcherThread#output(). So this raises an interesting question: people out there (such as Scott G.) -- are you folks still experiencing similar problems? Do the recent Hadoop changes alleviate the bad behavior you were experiencing? If so, then maybe this issue should be closed... Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Reviving Nutch 0.7
Before doubling (or, after 0.9.0, tripling?) the maintenance/development work, please consider the following: One option would be refactoring the code in such a way that the parts that are usable by other projects - protocols, parsers (this was actually proposed by Jukka Zitting some time last year), and such - would be modified to be independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it would require a significant amount of work. The more focused, smaller chunks of Nutch would probably also get a bigger audience (perhaps also outside Nutch land), and that way perhaps more people willing to work on them. I don't know about others, but at least I would be more willing to work towards this goal than toward one where there would be many practically separate projects, each sharing common functionality but with a different code base.

+1 ;) This was actually the project proposed by Jerome Charron and myself, called Tika. We went so far as to create a project proposal and send it out to the nutch-dev list, as well as to the Lucene PMC, for potential Lucene sub-project goodness. I could probably dig up the proposal should the need arise. Good ol' Jukka then took that effort and created us a project within Google Code, where it still lives in fact: http://code.google.com/p/tika/ There hasn't been active development on it because:
1. None of us (I'm speaking for Jerome and myself here) ended up having the time to shepherd it going forward
2. There was little, if any, response to the proposal on the nutch-dev list, and few folks willing to contribute (besides people like Jukka)
3. I think, as you correctly note above, most people thought it too much of a Herculean effort that wouldn't pay the necessary dividends in the end
In any case, I think that if we are going to maintain separate branches of the source - in fact, really parallel projects - then an undertaking such as Tika is probably needed ... Cheers, Chris -- Sami Siren
Re: How to Become a Nutch Developer
Hi Dennis, On 1/21/07 11:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote: All, I am working on a "How to Become a Nutch Developer" document for the wiki and I need some input. I need an overview of how the process for JIRA works. If I am a developer new to Nutch, just starting to look at JIRA, and I want to start working on some piece of functionality or to help with bug fixes, where would I look?

JIRA provides a lot of search facilities: it's actually kind of nice. The starting point for browsing bugs and other types of issues is: http://issues.apache.org/jira/browse/NUTCH (in general, for all Apache projects that use JIRA, you'll find that their issue tracking system boils down to http://issues.apache.org/jira/browse/APACHE_PROJ_JIRA_ID). From there, you can access canned filters for open issues by priority: Blocker, Critical, Major, Minor, Trivial. For more detailed search capabilities, click on the Find Issues button in the top breadcrumb bar. Search capabilities there include the ability to look for issues by developer, status, and issue type, and to combine such fields using AND and OR. Additionally, you can issue a free-text query across all issues by using the free-text box there.

Would I just choose something that is unscheduled and begin working on it?

That's a good starting point. Additionally, high-priority issues marked as Blocker, Critical, and Major are always good, because the sooner we (the committers) get a patch for those, the sooner we'll be testing it for inclusion into the sources.

What if I see something that I want to work on but it is scheduled to somebody else?

Walk five paces opposite your opponent: turn, then sho...err, wait. Nah, you don't have to do that. ;) Just speak up on the mailing list and volunteer your support. One of the people listed in the nutch-developers group in JIRA (e.g., the committers) can reassign the issue to you, so long as the other gent it was assigned to doesn't mind...

Are items only scheduled to committers, or can they be scheduled to developers as well? If they can be scheduled to regular developers, how does someone get their name on the list to be scheduled items?

Items can be scheduled to folks listed in the nutch-developers group within JIRA. Most of these folks are the committers; however, not all of them are. I'm not entirely sure how folks get into that group (maybe Doug knows?); however, that's the real criterion for having a JIRA issue officially assigned to you. That doesn't mean you can't work on things without that, though. If there's an issue that you'd like to contribute to, please prepare a patch, attach it to JIRA, and then speak up on the mailing list. Chances are, with the recent busy schedules of the committers (including myself) besides Sami and Andrzej, the committers don't have time to prepare patches for the issues assigned to them. If you contribute a great patch, the committer will pick it up, test it, and apply it, and you'll get the same effect as if the issue were directly assigned to you.

Should I submit a JIRA issue and/or notify the list before I start working on something? What is the common process for this?

Yup, that's pretty much it. Voice your desire to work on a particular task on the nutch-dev list. Many of the developers on that list have been around for a while now, and they know what's been discussed and implemented before.

When I submit a JIRA issue, is there anything else I need to do, either in the JIRA system or with the mailing lists, committers, etc.?
Nope: the nutch-dev list is automatically notified of all JIRA issue submissions, and the committers (and the rest of the folks) will pick up on this and act accordingly. Getting this information together in one place will go a long way toward helping others start contributing more and more. Thanks for all your input. No probs, glad to be of service :-) Cheers, Chris Dennis Kubes
Re: Next Nutch release
Folks, When would you like to make the release? I've been working on NUTCH-185, but got a bit bogged down with other work. If there is interest in having NUTCH-185 included in the release, I could make a push to get a patch out by week's end... As for the rest, my +1 for NUTCH-61 being included sooner rather than later. It seems that the patch has garnered enough use and attention that folks would like to see it in the release. I think the email from the user trying to manage a terabyte of data a few days back was particularly telling. Cheers, Chris

On 1/16/07 8:19 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Sami Siren wrote: Hello, It has been a while since the previous release (0.8.1), and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the JIRA roadmaps, there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in.

Agreed. The replacement regex mentioned in the original comment seems safe enough, and simpler.

The top 10 voted issues are currently:
NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content
  Well ... I'm of a split mind on this. I can bring this patch up to date and apply it before 0.9.0, if we understand that this is a "0" release ... ;) Otherwise I'd prefer to wait with it until right after the release. I would also like to proceed with NUTCH-339 (the Fetcher2 patches, plus some changes I made in the meantime), since I'd like to expose the new fetcher to a broader audience, and it doesn't affect the existing implementation.
NUTCH-48 "Did you mean" query enhancement/refinement feature
NUTCH-251 Administration GUI
NUTCH-289 CrawlDatum should store IP address
  I'm still not entirely convinced about this - and there is already a mechanism in place to support it if someone really wishes to keep this particular info (CrawlDatum.metaData).
NUTCH-36 Chinese in Nutch
NUTCH-185 XMLParser, a configurable XML parser plugin
NUTCH-59 Metadata support in WebDB
NUTCH-92 DistributedSearch incorrectly scores results
  This is too intrusive to fix just before the release - and needs additional discussion.
NUTCH-68 A tool to generate arbitrary fetchlists
  Easy to port this to 0.9.0 - I can do this.
NUTCH-87 Efficient site-specific crawling for a large number of sites
Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
Hi Sami, On 12/9/06 2:27 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: siren Date: Sat Dec 9 14:27:07 2006 New Revision: 485076 URL: http://svn.apache.org/viewvc?view=rev&rev=485076 Log: Optimize SpellCheckedMetadata further by taking into account the fact that it is used only for http-headers. I am starting to believe that spellchecking should just be a utility method used by http protocol plugins.

I think that right now I'm -1 on this change. I would make note of all the comments on NUTCH-139, from which this code was born. In the end, I think what we all realized was that the spell-checking capability is necessary, but not everywhere, as you point out. However, I don't think it's limited entirely to HTTP headers (which is what you've currently changed the code to). I think it should be implemented as a protocol-layer service, also providing spell-checking support to other protocol plugins, like protocol-file, etc., where field headers run the risk of being misspelled as well. What's to stop someone from implementing a protocol-file++ that returns different file header keys than those of protocol-file? Just b/c HTTP is the most pervasively used plugin right now, I think it's too convenient to assume that only HTTP protocol field keys may need spell-checking services. Just my 2 cents... Cheers, Chris
Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
Hi Sami, Indeed, I see your point. I guess what I was advocating was more of a ProtocolHeaders interface, living in org.apache.nutch.metadata. Then we could update the code that you have below to use ProtocolHeaders.class rather than HttpHeaders.class. We would then make ProtocolHeaders extend HttpHeaders, so that by default it inherits all of the HttpHeaders while still allowing more ProtocolHeaders met keys (e.g., we could have an interface for FileHeaders, etc.). What do you think about that? Alternatively, we could just create a ProtocolHeaders interface in org.apache.nutch.metadata that aggregates all the met key fields from HttpHeaders, and it would be the place where the met key fields for FileHeaders, etc. could go. Let me know what you think, and thanks! Cheers, Chris

On 12/9/06 3:53 PM, Sami Siren [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Sami, On 12/9/06 2:27 PM, [EMAIL PROTECTED] wrote: Author: siren Date: Sat Dec 9 14:27:07 2006 New Revision: 485076 URL: http://svn.apache.org/viewvc?view=rev&rev=485076 Log: Optimize SpellCheckedMetadata further by taking into account the fact that it is used only for http-headers. I am starting to believe that spellchecking should just be a utility method used by http protocol plugins. I think that right now I'm -1 on this change. I would make note of all the comments on NUTCH-139, from which this code was born. In the end, I think what we all realized was that the spell-checking capability is necessary, but not everywhere, as you point out. However, I don't think it's limited entirely to HTTP headers (which is what you've currently changed the code to). I think it should be implemented as a protocol-layer service, also providing spell-checking support to other protocol plugins, like protocol-file, etc.,

In protocol-file all headers are artificial and generated in Nutch code, so if there's a spelling mistake there then we should fix the code generating the headers, and not rely on spell checking in the first place.

where field headers run the risk of being misspelled as well. What's to stop someone from implementing a protocol-file++ that returns different file header keys than those of protocol-file? Just b/c HTTP is the most pervasively used plugin right now, I think it's too convenient to assume that only HTTP protocol field keys may need spell-checking services.

If there's a real need for spell checking on other keys, one can just add more classes to the array - no big deal. -- Sami Siren
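A minimal sketch of the ProtocolHeaders idea floated above, assuming the org.apache.nutch.metadata.HttpHeaders interface of that era; the file-header constants are invented purely for illustration and were never part of Nutch:

    package org.apache.nutch.metadata;

    // Hedged sketch, not committed code: inherit the HTTP header names
    // and give other protocols one shared place to declare theirs, so a
    // utility like SpellCheckedMetadata can key off a single interface.
    public interface ProtocolHeaders extends HttpHeaders {
      // Hypothetical keys a file: (or ftp:) protocol plugin might emit.
      String FILE_NAME = "File-Name";
      String FILE_LAST_MODIFIED = "File-Last-Modified";
    }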
Re: [jira] Updated: (NUTCH-379) ParseUtil does not pass through the content's URL to the ParserFactory
Hi Guys, Can we disable the selection of released versions within JIRA for issues, so that people like me don't keep getting confused? Thanks! Cheers, Chris On 10/13/06 9:32 AM, Sami Siren (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-379?page=all ] Sami Siren updated NUTCH-379: Fix Version/s: (was: 0.8.1) (was: 0.8) - cannot fix released versions

ParseUtil does not pass through the content's URL to the ParserFactory
Key: NUTCH-379
URL: http://issues.apache.org/jira/browse/NUTCH-379
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8.1, 0.8, 0.9.0
Environment: Power Mac Dual G5, 2.0 GHz, although the fix is independent of environment
Reporter: Chris A. Mattmann
Assigned To: Chris A. Mattmann
Fix For: 0.8.2, 0.9.0
Attachments: NUTCH-379.Mattmann.100406.patch.txt

Currently the ParseUtil class that is called by the Fetcher to actually perform the parsing of content does not forward through the content's URL for use in the ParserFactory. A bigger issue, however, is that the URL (and, for that matter, the pathSuffix) is no longer used to determine which parsing plugin should be called. My colleague at JPL discovered that more major bug and will soon input a JIRA issue for it. However, in the meantime, this small patch at least sets up the forwarding of the content's URL to the ParserFactory.
Nutch requires JDK 1.5 now?
Hi Folks, I noticed that Nutch now requires JDK 5 in order to compile, due to recent changes to the PluginRepository and some other classes. I think that this is a good move; however, I wasn't sure that I had seen any official announcement that Nutch now requires 1.5... Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Nutch requires JDK 1.5 now?
The switch to 1.5 format was also logged in JIRA issue http://issues.apache.org/jira/browse/NUTCH-360 -- Sami Siren Ahh, I didn't see this. Way to go Sami, I love it when people actually keep records of changes! ;) Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Nutch requires JDK 1.5 now?
Hey Guys, Speaking of which, I noticed that Sami's issue below is a Task in JIRA, which reminded me of a task that I input a long time ago that would be nice to fix real quick (for those with JIRA permissions to do so): http://issues.apache.org/jira/browse/NUTCH-304 We should really change the email address for JIRA to not use the Apache Incubator one anymore, and to use the Lucene one. Sound good? If so, could someone with permissions please take care of it? :-) Cheers, Chris On 10/3/06 9:04 AM, Sami Siren [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Chris Mattmann wrote: Hi Folks, I noticed that Nutch now requires JDK 5 in order to compile, due to recent changes to the PluginRepository and some other classes. I think that this is a good move; however, I wasn't sure that I had seen any official announcement that Nutch now requires 1.5... This is a proactive change - as soon as we upgrade to Hadoop 0.6.x we will lose 1.4 compatibility anyway, so we may as well prepare in advance. Also, "now" refers to the unreleased 0.9; we will keep branch 0.8.x compatible with 1.4. The switch to 1.5 format was also logged in JIRA issue http://issues.apache.org/jira/browse/NUTCH-360 -- Sami Siren __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Patch Available status?
Hi Doug, But the nutch-developers JIRA group pretty closely corresponds to Nutch's committers, so perhaps all committers should be permitted to close, although this should be exercised with caution, and only at releases, since closes cannot be undone in this workflow. Another alternative would be to construct a new workflow that just adds the Patch Available status and still permits issues to be re-opened. Which sounds best for Nutch? Good question. Well, my personal preference would be for one that allows issue closes to be undone, as I've seen several cases (even some recent ones, such as NUTCH-258) where someone in the nutch-developers group (including myself) has closed an issue that users in fact don't believe is resolved. So my +1 for the 2nd option above: an alternative workflow to the Hadoop one that simply adds the Patch Available status and still permits issues to be re-opened. Just my 2 cents. Thanks! Cheers, Chris Doug
Re: Patch Available status?
Hi Doug and Andrzej, +1. I think that workflow makes a lot of sense. Currently users in the nutch-developers group can close and resolve issues. In the Hadoop workflow, would this continue to be the case? Cheers, Chris On 8/30/06 3:14 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doug Cutting wrote: Sami Siren wrote: I am not able to do it either, or then I just don't know how, can Doug help us here? This requires a change the the project's workflow. I'd be happy to move Nutch to use the workflow we use for Hadoop, which supports Patch Available. This workflow has one other non-default feature, which is that bugs, once closed, cannot be re-opened. This works as follows: Only project administrators are allowed to close issues. Bugs are resolved as they're fixed, and only closed when a release is made. This keeps the release notes Jira generates from changing after a release is made. Would you like me to switch Nutch to use this Jira workflow? +1, this would finally make sense with the resolved vs. closed ...
Re: 0.8 not loading plugins
Hi Chris, It seems from your email message that your plugin is located in $NUTCH_HOME/build/custom-meta? Is this where your plugin *code* is currently stored? If so, this is the wrong location and the most likely reason that your plugin isn't being loaded. Plugin code should live in $NUTCH_HOME/src/plugin, so in your case you'd have /usr/local/nutch-0.8/src/plugin/custom-meta, with the underlying plugin code dir structure underneath there. Then, to deploy your plugin to the build directory (which is $NUTCH_HOME/build/plugins), you would type: ant deploy. Give this a shot and see if that fixes it. Cheers, Chris

On 8/17/06 3:05 PM, Chris Stephens [EMAIL PROTECTED] wrote: It's definitely not trying to load my plugin; I added that debug setting and didn't see anything regarding my plugin. One thing I noticed is that my plugin is not in the plugins directory. At what point do the plugins get copied there? Here is the output from my compile:

  compile:
    [echo] Compiling plugin: custom-meta
    [javac] Compiling 3 source files to /usr/local/nutch-0.8/build/custom-meta/classes
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
  jar:
    [jar] Building jar: /usr/local/nutch-0.8/build/custom-meta/custom-meta.jar
  deps-test:
  deploy:
    [copy] Copying 1 file to /usr/local/nutch-0.8/build/plugins/custom-meta

HUYLEBROECK Jeremy RD-ILAB-SSF wrote: Did you check if your plugin.xml is read, by putting the plugin package in debug mode? (Put this in log4j.properties): log4j.logger.org.apache.nutch.plugin=DEBUG

-Original Message- From: Chris Stephens [mailto:[EMAIL PROTECTED]] Sent: Thursday, August 17, 2006 2:30 PM To: nutch-dev@lucene.apache.org Subject: Re: 0.8 not loading plugins

I have this line in src/plugin/build.xml under the deploy section:

  <ant dir="custom-meta" target="deploy" />

The plugin is compiling OK. I spent several days getting errors on compile and investigating how to port them to 0.8.

Jonathan Addison wrote: Hi Chris, Chris Stephens wrote: I think I finally have my plugin ported to 0.8; however, I cannot get my plugin to load. My plugin.includes property in conf/nutch-site.xml has the following value:

  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|custom-meta</value>

My plugin is the 'custom-meta' entry at the end. My plugin never shows up in the Registered Plugins list in hadoop.log, and lines in my plugin that call logger.info never show up either. Is there a step I am missing with 0.8? What should I do next to debug the problem? Have you also added your plugin to src/plugin/build.xml? Thank you, Chris Stephens
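To make the layout concrete, here is a sketch of where a custom plugin's pieces live, assuming the standard Nutch 0.8 plugin conventions (plugin.xml and build.xml are the usual per-plugin files; the source subpath shown is illustrative):

  src/plugin/custom-meta/
    plugin.xml   <- declares the plugin and its extension point(s)
    build.xml    <- per-plugin build file; typically imports ../build-plugin.xml
    src/java/    <- the plugin's Java sources

Combined with the one-line deploy entry in src/plugin/build.xml quoted above, running "ant deploy" then compiles the plugin and copies it into $NUTCH_HOME/build/plugins, where the plugin.includes property can pick it up.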
Re: Tika update
Hi Jukka, Thanks for your email. Indeed, there was discussion on the Lucene PMC email list about the Tika project. It was decided by the powers that be to discuss it more on the Nutch mailing list before moving forward with any vote on making Tika a sub-project of Apache Lucene. With regard to that, my action was to send the Tika proposal to the nutch-dev list, and to help start up a discussion on Tika, to get feedback from the community. Seeing as you've lit the fire under this (thanks!), it's only appropriate for me to send out the Tika project proposal sent to the Lucene PMC. So, here it is, attached. I'd love to hear feedback from the Nutch community on what it thinks of such a project. Cheers, Chris On 8/16/06 4:06 AM, Jukka Zitting [EMAIL PROTECTED] wrote: Hi, There was recently discussion on perhaps starting a new Lucene sub-project, named Tika, to create a general-purpose library from the parser components and other features in Nutch that might interest a wider audience. To keep things rolling we've created a temporary staging area for the project at http://code.google.com/p/tika/ on Google Code, and I've started to flesh out a potential project structure using Maven 2. Note that the project materials in svn refer to the project as Apache Tika even though the project has *not* been officially accepted. The reason for this is that the Google Code project is just a temporary staging ground, and I wanted to give a better idea of what the project could look like if accepted. The jury is still out on whether to start a project like this, so any comments and feedback on the idea are very much welcome. Most, if not all, code in Tika will be based on existing code from Nutch and other Apache projects, so I'm not sure if the project needs to go through the Incubator if accepted by the Lucene PMC. So far the Tika source tree contains just a modified version of my TextExtractor code from the Apache Jackrabbit project, and Jérôme is planning to add some of his stuff. The source tree at Google Code should be considered just a playground for bringing things together and discussing ideas, before migrating back to ASF infrastructure. BR, Jukka Zitting -- Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED] Software craftsmanship, JCR consulting, and Java development
Re: Any plans to move to build Nutchusing Maven?
Hi Steven, On 8/16/06 7:36 AM, steven shingler [EMAIL PROTECTED] wrote: (This thread moved from the User List.) OK Lukas, let's open it up to the dev list! :) Particularly, does the group feel moving to Maven would be _a good thing_?

+1 I suggested this a while back (however, I did not make any progress on realizing it ;) ). I think it makes a *lot of sense*. Maven's dependency system would significantly reduce the size of the CM'ed Nutch source code, as all the jars required by Nutch could be referenced externally (plugins are a different beast, but we're working on that). Additionally, Maven would allow automatic generation of a sort of nightly-build Nutch site, showing recent commits, unit test results, and more.

Even if so, what are the problems?

The main problem I see is the plugin system, and how to appropriately represent plugin dependencies in Maven (or just neglect to handle them elegantly, and treat them like individual projects - like Nutch itself, which requires CM'ing jar files). Additionally, I think it will probably require writing some custom Jelly scripts to do all the neat ant build stuff that Nutch does on the side (e.g., unpack Hadoop, etc.). There are currently two versions of Lucene in the Maven repos, but Hadoop would have to be added manually, I think. It would probably make most sense to run a Maven repo explicitly for Nutch off of the Lucene Nutch site. Something like http://lucene.apache.org/nutch/maven/ might be sensible. Just my 2 cents. Cheers, Chris

All thoughts gratefully received. Cheers Steven

On 8/16/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I would like to help. But first of all I would suggest starting a wider discussion on the dev list to get more feedback/suggestions. I think one problem may be that Nutch depends on both Lucene and Hadoop libraries, and it won't be easy to maintain these dependencies if recent versions are not yet committed into some Maven-accessible repo. Regards, Lukas

On 8/16/06, steven shingler [EMAIL PROTECTED] wrote: Well, I'm up for giving it a try. My current work has me looking at both Nutch and Maven, so what better way to understand both projects :) I agree it is far from trivial - so if anyone here would like to collaborate on it, that would be great. Cheers, Steven

On 8/15/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I would warmly appreciate this activity. At least it would help more people to understand/join this great project. But I don't think this will be an easy step (this reminds me of what N. Armstrong said on the moon: "That's one small step for [a] man, one giant leap for mankind.") :-) Regards, Lukas

On 8/15/06, Sami Siren [EMAIL PROTECTED] wrote: steven shingler wrote: Hi all, I know this has come up at least once before, but I just thought I'd raise the question again: Are there any plans to move to building Nutch using Maven? I haven't heard of such activities; however, if you or somebody else can put such a thing together and it proves to be a good thing to do, then I certainly don't have anything against it. -- Sami Siren
Patch Available status?
Hi Guys, I've seen on the Hadoop mailing list recently that there was a new status added for issues in JIRA called Patch Available to let committers know that a patch is ready for review to commit. How about we add this to the Nutch jira instance as well? I tried doing this, but I don't think I have the permissions to do so. I've got 2 patches for issues that are attached in jira that I'd like to set as having this new status :-) https://issues.apache.org/jira/browse/NUTCH-338 https://issues.apache.org/jira/browse/NUTCH-258 Cheers, Chris
Re: parse-plugins.xml
Hi Marko, Thanks for your question. Basically it was set up as a sort of last resort for getting at least *some* information from the PDF file, albeit littered with garbage. If indeed parse-text does not really make sense as a backup parser to handle PDF files and get at least some text to index, then we may think of either (a) removing it from the default parse-plugins.xml, or (b) writing a simple PdfParser that can handle truncation as a backup to the existing PdfParser. Basically, the philosophy behind each mimeType entry in parse-plugins.xml is to try to map the set of existing Nutch parse plugins to the available content types, giving each mimeType as many options as possible in terms of getting some content out of it. Cheers, Chris

On 8/3/06 4:04 AM, Marko Bauhardt [EMAIL PROTECTED] wrote: Hi all, I have a question about parse-plugins.xml and application/pdf. Why is the TextParser used for parsing PDF files? The mimeType application/pdf is mapped to parse-pdf and parse-text, but the TextParser does not support PDF files. The problem is, if the PDF file is truncated, the TextParser parses this content and the indexer indexes garbage. So what is the reason to map application/pdf to the parse-text plugin?

<mimeType name="application/pdf">
  <plugin id="parse-pdf" />
  <plugin id="parse-text" />
</mimeType>

Thanks for hints, Marko
Re: parse-plugins.xml
Hey Andrzej, On 8/3/06 8:19 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Marko, Thanks for your question. Basically it was set up as a sort of last resort for getting at least *some* information from the PDF file, albeit littered with garbage. If indeed parse-text does not really make sense

IMO it doesn't make sense. PDF text content, even if it's available in plain text, is usually compressed. The percentage of non-compressed PDFs out there is, in my experience, negligible.

as a backup parser to handle PDF files and get at least some text to index, then we may think of either (a) removing it from the default

+1

Okey dok, you'll find a quick patch for this at: http://issues.apache.org/jira/browse/NUTCH-338 I decided to create an issue just to keep track of the fact that we made this change, and additionally because I tried pasting the quick patch into my email program here on my Mac and it looked like it was coming out weird :-)

parse-plugins.xml, or (b) writing a simple PdfParser that can handle truncation as a backup to the existing PdfParser. Basically the philosophy

I think that "simple PDF parser" is an oxymoron ... ;)

Heh, I agree with you on that one. If everyone would just move to XML DocBook, then it would be great! ;) Thanks! Cheers, Chris
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Folks, Before I (or someone else) reopen the issue, I think it's important to understand the implications:

1) Having a *side-effect* of the entire system stopping processing after merely logging a message at a certain event level is a poor practice.

I'm not sure that the Fetcher quitting is a *side-effect*, as you call it. In fact, I think it's clearly stated as the behavior of the system, both within the code and in several mailing list conversations I've seen over the course of the past two years (I can dig these up, if needed).

In fact, I believe that this would make a fantastic anti-pattern. If this kind of behavior is *really* wanted (and I argue that it should not be, below), it should be done through an explicit mechanism, not as a side-effect.

Again, the use of "side-effect" here is strange to me: how is an explicit check for any LOG messages at the SEVERE level before quitting a side-effect? For example, did you realize that since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter static constructor, anyone using Nutch as a library who logs a SEVERE error will suffer by having Nutch stop fetching?

I'm not convinced that having Nutch stop fetching when a SEVERE error is logged is the wrong behavior. Let's think about what possible SEVERE errors may typically be logged: an Out of Memory error, potentially; InterruptedExceptions in threads (possibly); a failure in any of the plugin libraries critical to the fetch running (possibly); the list goes on and on. So, in these cases, you argue that the Fetcher should continue operating?

2) Moreover, having the system stop processing forevermore by use of a static(!) flag makes the use of the Nutch system as a library within a server or service environment impossible. Once this logging is done, no more Fetcher processing in this run *or any other* can take place.

I've been using Nutch in a server environment (JSPs and Tomcat) within a large-scale data system at NASA over the course of the past year, and have never been impeded by the behavior of the fetcher. Can you be more specific here as to the exact use case that's failing in your scenario? I've also been watching the mailing lists for the better part of almost 2 years, and have seen little traffic (outside of the aforementioned clarifications, etc., above) about this issue. I may be out on an island here, but again, I'm not convinced that this is a core issue. Just my 2 cents. If the votes continue that this is an issue, however, I'll have no problem opening it up (or one of the committers can do it as well). Cheers, Chris

On 6/5/06 7:11 AM, Stefan Groschupf (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] Stefan Groschupf commented on NUTCH-258: Scott, I agree with you. However, we need a clean patch to solve the problem; we cannot just comment things out of the code. So I vote for the issue, and I vote to reopen this issue.

Once Nutch logs a SEVERE log item, Nutch fails forevermore
Key: NUTCH-258
URL: http://issues.apache.org/jira/browse/NUTCH-258
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: All
Reporter: Scott Ganyo
Priority: Critical
Attachments: dumbfix.patch

Once a SEVERE log item is written, Nutch shuts down any fetching forevermore.
This is from the run() method in Fetcher.java:

  public void run() {
    synchronized (Fetcher.this) {activeThreads++;} // count threads
    try {
      UTF8 key = new UTF8();
      CrawlDatum datum = new CrawlDatum();
      while (true) {
        if (LogFormatter.hasLoggedSevere())   // something bad happened
          break;                              // exit

Notice the last 2 lines. This will prevent Nutch from ever fetching again once this is hit, as LogFormatter is storing this data as a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable this class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application - this could have any number of side-effects that would be extremely difficult to track down. (As it has already for me.) __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 Phone: 818-354-8810
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Hi Andrzej, The main problem, as Scott observed, is that the static flag affects all instances of the task executing inside the same JVM. If there are several Fetcher tasks (or any other tasks that check for SEVERE flag!), belonging to different jobs, all of them will quit. This is certainly not the intended behavior. Got it. In fact, I believe that this would make a fantastic anti-pattern. If this kind of behavior is *really* wanted (and I argue that it should not be below), it should be done through an explicit mechanism, not as a side-effect. I have a proposal for a simple solution: set a flag in the current Configuration instance, and check for this flag. The Configuration instance provides a task-specific context persisting throughout the lifetime of a task - but limited only to that task. Voila - problem solved. We get rid of the dubious use of LogFormatter (I hope Chris that even you would agree that this pattern is slightly .. unusual ;) ) What, unusual? Huh? :-) and we gain flexible mechanism limited in scope to the current task, which ensures isolation from other tasks in the same JVM. How about that? +1 I like your proposed solution. I haven't used multiple fetchers really inside the same process too, much however, I do have an application that calls fetches in more of a sequential way in the same JVM. So, I guess I just never ran across the behavior. The thing I like about the proposed solution is its separation and isolation of a task context, which I think that Nutch (now relying on Hadoop as the underlying architectural computing platform) needed to address. So, to summarize, the proposed resolution is: * add flag field in Configuration instance to signify whether or not a SEVERE error has been logged within a task's context * check this field within the fetcher to determine whether or not to stop the fetcher, just for that fetching task identified by its Configuration (and no others) Is this representative of what you're proposing Andrzej? If so, I'd like to take the lead on contributing a small patch that handles this, and then it would be great if people like Scott could test this out in their existing environments where this error was manifesting itself. Thanks! Cheers, Chris (BTW: would you like me to re-open the JIRA issue, or do you want to do it?) __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
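A minimal sketch, in Java, of the Configuration-scoped flag proposed above. The helper class and key name are hypothetical illustrations only (the actual patch may use different names); it leans on the same setObject()/getObject() calls on the task's Configuration that appear elsewhere in Nutch, so the flag is visible only to the task owning that Configuration, not to other tasks in the same JVM.

  import org.apache.hadoop.conf.Configuration; // assuming the post-Hadoop-split Configuration with setObject()/getObject()

  // Hypothetical helper -- not the actual NUTCH-258 patch.
  public class FetcherStatus {

    private static final String SEVERE_KEY = "fetcher.severe.error"; // hypothetical key name

    /** Record that a severe condition occurred, scoped to this task's Configuration. */
    public static void setSevere(Configuration conf) {
      conf.setObject(SEVERE_KEY, Boolean.TRUE);
    }

    /** True only if a severe condition was recorded for this task's Configuration. */
    public static boolean hasSevere(Configuration conf) {
      return conf.getObject(SEVERE_KEY) != null;
    }
  }

The fetcher loop would then replace the static LogFormatter.hasLoggedSevere() check with something like if (FetcherStatus.hasSevere(conf)) break; so that only the fetch task whose Configuration carries the flag is stopped.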
Re: Nutch Parser Bug
Hi Alex, I also noticed this issue a while back. It's described here: http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200510.mbox/%3c435 [EMAIL PROTECTED] Cheers, Chris On 4/25/06 2:41 PM, Alex [EMAIL PROTECTED] wrote: Hi there, I'm fairly new to nutch and in working on the nutch search I realize that when I try to search for terms such as #1 top item sales, the search seem to ignored everything after the # sign. I also tried with other symbols such as @, !, $, % , ^ , etc... those seem to be ignored. This seem to be a problem in the Query.parse method, Can this be add to the list of bug fix for the next build? or is it something that's already been done? Please adv. Thank you. Alex - Yahoo! Messenger with Voice. Make PC-to-Phone Calls to the US (and 30+ countries) for 2¢/min or less. __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
0.8 release?
Hi Guys, Any progress on the 0.8 release? Was there any resolution about which JIRA issues to complete before the 0.8 release? We had a bit of conversation there and some ideas, but no definitive answer... Thanks for your help, and sorry to pester ;) Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: 0.8 release schedule (was Re: latest build throws error - critical)
+1 On 4/7/06 10:20 AM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: +1 for a release sooner rather than later. I think this is a good plan. There's no reason we can't do another release in a month. If it is back-compatible we can call it 0.8.x and if it's incompatible we can call it 0.9.0. I'm going to make a Hadoop 0.1.1 release today that can be included in Nutch 0.8.0. (With Hadoop we're going to aim for monthly releases, with potential bugfix releases between when serious bugs are found. The big bug in Hadoop 0.1.0 is http://issues.apache.org/jira/browse/HADOOP-117.) So we could aim for a Nutch 0.8.0 release sometime next week. Does that work for folks? Piotr, would you like to make this release, or should I? Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: 0.8 release schedule (was Re: latest build throws error - critical)
Hi Andrzej, On 4/7/06 12:18 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Do you guys have any additional insights / suggestions whether NUTCH-240 and/or NUTCH-61 should be included in this release? Looking at the JIRA popular issues pane for Nutch ( http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:popularissues-panel ), I note that NUTCH-61 is the most popular issue right now with 7 votes. Additionally, NUTCH-240 shares the 3rd most votes (4) with NUTCH-134. So, all in all, there are 4 issues with >= 4 votes in JIRA. Of those 4 issues, 3 have attached patches in JIRA. Would it be safe to say that the committers should focus on committing NUTCH-61, NUTCH-240, and NUTCH-48, since these 3 issues all have attached patch files, and then freeze it for the 0.8.0 release? As for my own opinion, I recently downloaded and reviewed NUTCH-61, and really like the patch. +1 on my end. I haven't tried out NUTCH-240 yet, but it seems to be a logical extension point for Nutch to be able to plug in different scoring components. So, +1 from me. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: 0.8 release schedule (was Re: latest build throws error - critical)
+1 for a release sooner rather than later. Several interesting features contributed since the 0.7 branch I believe are now tested and production-worthy, at least in my environment. Hats off to the folks who were able to split the MapReduce and NDFS into Hadoop -- I'm going to be experimenting with that portion of the code over the next few weeks on a 16 node, 32 processor Opteron cluster at JPL that will be used as the development machine for a large scale earth science data processing mission. Because the Hadoop code is in its own project now, I can leverage and test the Hadoop processing and HDFS capability without having to include all the search engine specific stuff. Ya! :-) Cheers, Chris On 4/6/06 12:59 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doug Cutting wrote: TDLN wrote: I mean, how do others keep uptodate with the main codeline? Do you advice updating everyday? Should we make a 0.8.0 release soon? What features are still missing that we'd like to get into this release? I think we should make a release soon - instabilities related to Hadoop split are mostly gone now, and we need to endorse the new architecture more officially... The adaptive fetch and scoring API functionality are the top priority for me. While the scoring API change is pretty innocuous, we just need to clean it up, the adaptive fetch changes have a big potential for wrecking the main re-fetch cycle ... ;) We could do it in two ways: I could apply this patch and let people run with it for a while, fixing bugs as they pop up - but then it will be another 3-4 weeks I suppose. Or we could wait with this after the release. __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Null Pointer exception in AnalyzerFactory?
Hi Folks, I updated to the latest SVN revision (385691) today, and I am now seeing a NullPointerException in the AnalyzerFactory.java class. It seems that in some cases, the method:

  private Extension getExtension(String lang) {
    Extension extension = (Extension) this.conf.getObject(lang);
    if (extension == null) {
      extension = findExtension(lang);
      if (extension != null) {
        this.conf.setObject(lang, extension);
      }
    }
    return extension;
  }

has a null lang parameter passed to it, which causes a NullPointerException at line 81 in src/java/org/apache/nutch/analysis/AnalyzerFactory.java. I found that if I checked for null in the lang variable, and returned null if lang == null, my crawl finished. Here is a small patch that will fix the crawl:

Index: /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
===
--- /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java (revision 385691)
+++ /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java (working copy)
@@ -78,14 +78,19 @@
   private Extension getExtension(String lang) {
-    Extension extension = (Extension) this.conf.getObject(lang);
-    if (extension == null) {
-      extension = findExtension(lang);
-      if (extension != null) {
-        this.conf.setObject(lang, extension);
-      }
-    }
-    return extension;
+    if (lang == null) {
+      return null;
+    }
+    else {
+      Extension extension = (Extension) this.conf.getObject(lang);
+      if (extension == null) {
+        extension = findExtension(lang);
+        if (extension != null) {
+          this.conf.setObject(lang, extension);
+        }
+      }
+      return extension;
+    }
   }

   private Extension findExtension(String lang) {

NOTE: not sure if returning null is the right thing to do here, but hey, at least it made my crawl finish! :-) Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
RE: found resource parse-plugins.xm?
Hi Stefan,

after a short time I already had 1602 time this lines in my tasktracker log files. 060307 022707 task_m_2bu9o4 found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml Sounds like this file is loaded 1602 (after lets say 3 minutes) I guess that wasn't the goal or do I oversee anything?

It certainly wasn't the goal at all. After NUTCH-88, Jerome and I had the following line in the ParserFactory.java class:

  /** List of parser plugins. */
  private static final ParsePluginList PARSE_PLUGIN_LIST = new ParsePluginsReader().parse();

(see revision 326889) Looking at the revision history for the ParserFactory file, after the application of NUTCH-169, the above changes to:

  private ParsePluginList parsePluginList;

  //... code here

  public ParserFactory(NutchConf nutchConf) {
    this.nutchConf = nutchConf;
    this.extensionPoint = nutchConf.getPluginRepository().getExtensionPoint(
        Parser.X_POINT_ID);
    this.parsePluginList = new ParsePluginsReader().parse(nutchConf);
    if (this.extensionPoint == null) {
      throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");
    }
    if (this.parsePluginList == null) {
      throw new RuntimeException(
          "Parse Plugins preferences could not be loaded.");
    }
  }

Thus, every time the ParserFactory is constructed, the parse-plugins.xml file is read (it's the result of the call to ParsePluginsReader().parse(nutchConf)). So, if the file is loaded 1602 times, I'd guess that the ParserFactory is constructed 1602 times?

Additionally, I'm wondering why the parse-plugins.xml configuration parameters aren't declared as final static anymore? That could be a serious performance improvement to just load this file once.

Yup, I think that's the reason we made it final static. If there is no reason to not have it final static, I would suggest that it be put back to final static. There may be a problem however: now, since NUTCH-169, the loading requires an existing Configuration object I believe. So, we may need a static Configuration object as well. Thoughts?

I was not able to find the code that is logging this statement, has anyone a idea where this happens?

The statement gets logged within the ParsePluginsReader.java class, line 98:

  ppInputStream = conf.getConfResourceAsInputStream(
      conf.get(PP_FILE_PROP));

HTH, Chris Thanks. Stefan - blog: http://www.find23.org company: http://www.media-style.com
RE: found resource parse-plugins.xm?
Hi Stefan, Hi Chris, thanks for the clarification. No probs. Do you think we can we somehow cache it in the nutchConf instance, since this is the way we doing this on other places as well? Yeah I think we can. Here is a small patch to the ParserFactory that should do the trick. Give it a test and let me know if it works. If it does, I would say +1 to the committers to get this into the sources ASAP, no? Index: src/java/org/apache/nutch/parse/ParserFactory.java === --- src/java/org/apache/nutch/parse/ParserFactory.java (revision 383463) +++ src/java/org/apache/nutch/parse/ParserFactory.java (working copy) @@ -55,7 +55,13 @@ this.conf = conf; this.extensionPoint = PluginRepository.get(conf).getExtensionPoint( Parser.X_POINT_ID); -this.parsePluginList = new ParsePluginsReader().parse(conf); + +if(conf.getObject(parsePluginList) != null){ + this.parsePluginList = (ParsePluginList)conf.getObject(parsePluginList); +} +else{ +this.parsePluginList = new ParsePluginsReader().parse(conf); +} if (this.extensionPoint == null) { throw new RuntimeException(x point + Parser.X_POINT_ID + not found.); Cheers, Chris Cheers, Stefan Am 07.03.2006 um 04:38 schrieb Chris Mattmann: Hi Stefan, after a short time I already had 1602 time this lines in my tasktracker log files. 060307 022707 task_m_2bu9o4 found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml Sounds like this file is loaded 1602 (after lets say 3 minutes) I guess that wasn't the goal or do I oversee anything? It certainly wasn't the goal at all. After NUTCH-88, Jerome and I had the following line in the ParserFactory.java class: /** List of parser plugins. */ private static final ParsePluginList PARSE_PLUGIN_LIST = new ParsePluginsReader().parse(); (see revision 326889) Looking at the revision history for the ParserFactory file, after the application of NUTCH-169, the above changes to: private ParsePluginList parsePluginList; //... code here public ParserFactory(NutchConf nutchConf) { this.nutchConf = nutchConf; this.extensionPoint = nutchConf.getPluginRepository ().getExtensionPoint( Parser.X_POINT_ID); this.parsePluginList = new ParsePluginsReader().parse(nutchConf); if (this.extensionPoint == null) { throw new RuntimeException(x point + Parser.X_POINT_ID + not found.); } if (this.parsePluginList == null) { throw new RuntimeException( Parse Plugins preferences could not be loaded.); } } Thus, every time the ParserFactory is constructed, the parse- plugins.xml file is read (it's the result of the call to ParsePluginsReader().parse(nutchConf)). So, if the fie is loaded 1602 times, I'd guess that the ParserFactory is loaded 1602 times? Additionally, I'm wondering why the parse-plugins.xml configuration parameters aren't declared as final static anymore? That could be a serious performance improvement to just load this file once. Yup, I think that's the reason we made it final static. If there is no reason to not have it final static, I would suggest that it be put back to final static. There may be a problem however, now since NUTCH-169, the loading requires an existing Configuration object I believe. So, we may need a static Configuration object as well. Thoughts? I was not able to find the code that is logging this statement, has anyone a idea where this happens? The statement gets logged within the ParsePluginsReader.java class, line 98: ppInputStream = conf.getConfResourceAsInputStream( conf.get(PP_FILE_PROP)); HTH, Chris Thanks. 
Stefan - blog: http://www.find23.org company: http://www.media-style.com
RE: found resource parse-plugins.xm?
Sorry, my last patch was missing one line. Here's the update:

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===
--- src/java/org/apache/nutch/parse/ParserFactory.java (revision 383463)
+++ src/java/org/apache/nutch/parse/ParserFactory.java (working copy)
@@ -55,7 +55,14 @@
     this.conf = conf;
     this.extensionPoint = PluginRepository.get(conf).getExtensionPoint(
         Parser.X_POINT_ID);
-    this.parsePluginList = new ParsePluginsReader().parse(conf);
+
+    if (conf.getObject("parsePluginList") != null) {
+      this.parsePluginList = (ParsePluginList) conf.getObject("parsePluginList");
+    }
+    else {
+      this.parsePluginList = new ParsePluginsReader().parse(conf);
+      conf.setObject("parsePluginList", this.parsePluginList);
+    }
     if (this.extensionPoint == null) {
       throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");

-Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Monday, March 06, 2006 7:51 PM To: 'nutch-dev@lucene.apache.org' Subject: RE: found resource parse-plugins.xm? Hi Stefan, Hi Chris, thanks for the clarification. No probs. Do you think we can we somehow cache it in the nutchConf instance, since this is the way we doing this on other places as well? Yeah I think we can. Here is a small patch to the ParserFactory that should do the trick. Give it a test and let me know if it works. If it does, I would say +1 to the committers to get this into the sources ASAP, no? Index: src/java/org/apache/nutch/parse/ParserFactory.java === --- src/java/org/apache/nutch/parse/ParserFactory.java(revision 383463) +++ src/java/org/apache/nutch/parse/ParserFactory.java(working copy) @@ -55,7 +55,13 @@ this.conf = conf; this.extensionPoint = PluginRepository.get(conf).getExtensionPoint( Parser.X_POINT_ID); -this.parsePluginList = new ParsePluginsReader().parse(conf); + +if(conf.getObject(parsePluginList) != null){ + this.parsePluginList = (ParsePluginList)conf.getObject(parsePluginList); +} +else{ +this.parsePluginList = new ParsePluginsReader().parse(conf); +} if (this.extensionPoint == null) { throw new RuntimeException(x point + Parser.X_POINT_ID + not found.); Cheers, Chris Cheers, Stefan Am 07.03.2006 um 04:38 schrieb Chris Mattmann: Hi Stefan, after a short time I already had 1602 time this lines in my tasktracker log files. 060307 022707 task_m_2bu9o4 found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml Sounds like this file is loaded 1602 (after lets say 3 minutes) I guess that wasn't the goal or do I oversee anything? It certainly wasn't the goal at all. After NUTCH-88, Jerome and I had the following line in the ParserFactory.java class: /** List of parser plugins. */ private static final ParsePluginList PARSE_PLUGIN_LIST = new ParsePluginsReader().parse(); (see revision 326889) Looking at the revision history for the ParserFactory file, after the application of NUTCH-169, the above changes to: private ParsePluginList parsePluginList; //... code here public ParserFactory(NutchConf nutchConf) { this.nutchConf = nutchConf; this.extensionPoint = nutchConf.getPluginRepository ().getExtensionPoint( Parser.X_POINT_ID); this.parsePluginList = new ParsePluginsReader().parse(nutchConf); if (this.extensionPoint == null) { throw new RuntimeException(x point + Parser.X_POINT_ID + not found.); } if (this.parsePluginList == null) { throw new RuntimeException( Parse Plugins preferences could not be loaded.); } } Thus, every time the ParserFactory is constructed, the parse- plugins.xml file is read (it's the result of the call to ParsePluginsReader().parse(nutchConf)).
So, if the fie is loaded 1602 times, I'd guess that the ParserFactory is loaded 1602 times? Additionally, I'm wondering why the parse-plugins.xml configuration parameters aren't declared as final static anymore? That could be a serious performance improvement to just load this file once. Yup, I think that's the reason we made it final static. If there is no reason to not have it final static, I would suggest that it be put back to final static. There may be a problem however, now since NUTCH-169, the loading requires an existing Configuration object I believe. So, we may need a static Configuration object as well. Thoughts? I was not able to find the code that is logging this statement, has anyone a idea where this happens? The statement gets logged
Re: ignore eclipse .project and .classpath
Thanks a lot! Cheers, Chris On 2/9/06 12:13 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Done. - Original Message From: Stefan Groschupf [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Wed 08 Feb 2006 03:15:15 PM EST Subject: Re: ignore eclipse .project and .classpath +1 Am 08.02.2006 um 06:16 schrieb Chris Mattmann: Hi Folks, Just wondering if someone could add to the svn:ignore property for Nutch the files: .classpath .project I happen to use eclipse to do Nutch development and always ignore these files in my other eclipse projects as well. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. --- company:http://www.media-style.com forum:http://www.text-mining.org blog:http://www.find23.net
ignore eclipse .project and .classpath
Hi Folks, Just wondering if someone could add to the svn:ignore property for Nutch the files: .classpath .project I happen to use eclipse to do Nutch development and always ignore these files in my other eclipse projects as well. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
RE: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type
Hi Gail, Check out: http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/ That's the way that the parser factory currently works. Also added, but not described in that proposal is the ability to call a parser by its id, which is a method present in ParseUtil.java. G'luck! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: Gal Nitzan (JIRA) [mailto:[EMAIL PROTECTED] Sent: Sunday, January 15, 2006 4:10 PM To: nutch-dev@incubator.apache.org Subject: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ] Gal Nitzan updated NUTCH-179: - Description: Sorry, please close this issue. I figured that if I set my parse plugin first. I can always be called first and than decide if I want to parse or not. was: Somtime there are requirements of the real world (usually your boss) where a special parse is required for a certain site. Though the content type is text/html, a specialized parser is needed. Sample: I am required to crawl certain sites where some of them are partners sites. when fetching from the partners site I need to look for certain entries in the text and boost the score. Currently the ParserFactory looks for a plugin based only on the content type. Facing this issue myself I noticed that it would give a very easy implementation for others if ParserFactory could use NutchConf to check for certain properties and if matched to use the correct plugin based on the url and not just the content type. The implementation shouldn be to complicated. Looking to hear more ideas. Proposition: Enable Nutch to use a parser plugin not just based on content type --- Key: NUTCH-179 URL: http://issues.apache.org/jira/browse/NUTCH-179 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Gal Nitzan Sorry, please close this issue. I figured that if I set my parse plugin first. I can always be called first and than decide if I want to parse or not. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Guys, My apologies for the spamming comments -- I tried to submit my comment through JIRA one time and it kept giving me service unavailable. So I resubmitted like 5 times, on the fifth time it finally went through -- but I guess the other comments went through too. I'll try and remove them right away. Sorry again. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: Doug Cutting (JIRA) [mailto:[EMAIL PROTECTED] Sent: Thursday, January 05, 2006 8:04 PM To: nutch-dev@incubator.apache.org Subject: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata [ http://issues.apache.org/jira/browse/NUTCH- 139?page=comments#action_12361922 ] Doug Cutting commented on NUTCH-139: One more thing. Content length should also not need to be stored in the metadata as an x-nutch value. The content length is simply the length of the Content's data. The protocol may have truncated the content, in which case perhaps we need an x-nutch-truncated-content metadata property or something, but we should not be overwriting the HTTP Content-Length header, nor should we trust that it reflects the length of the data actually fetched. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content- TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that CONTENT_TYPE and conTeNT_TyPE and all the permutations are really the same). What about if I named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like: public class ParseData{ . 
public static final String CONTENT_TYPE = content-type; public static final String CREATOR = creator; } In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, text/xml); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named. I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Standard metadata property names in the ParseData metadata
Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and thinking about the way that we store names in there. Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names would be converted to lower case, but in essence this really only fixes half the problem (the case of identifying that CONTENT_TYPE and conTeNT-TyPE and all the permutations are really the same). What about if I named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like:

  public class ParseData {
    ...
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";
  }

In this fashion, users could at least know the names of the standard properties that they can obtain from the ParseData, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without poring over the code base to figure out what they are named. What do you all think? If you guys think that this is a good solution, I'll create an issue in JIRA about it and contribute a patch near the end of the week. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Standard metadata property names in the ParseData metadata
Hi Stefan, Thanks. Yup, I noticed it and I think it will really help out a lot. Great job to the both of you :-) Cheers, Chris On 12/13/05 10:59 AM, Stefan Groschupf [EMAIL PROTECTED] wrote: +1! BTW, did you notice that Jerome committed a patch that makes Content meta data now case insensitive? Stefan Am 13.12.2005 um 18:07 schrieb Chris Mattmann: Hi Folks, I was just thinking about the ParseData java.util.Properties metaata object and thinking about the way that we store names in there. Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that CONTENT_TYPE and conTeNT-TyPE and all the permutations are really the same). What about if named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like: public class ParseData{ . public static final String CONTENT_TYPE = content-type; public static final String CREATOR = creator; } In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get (ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, text/xml); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named. What do you all think? If you guys think that this is a good solution, I'll create an issue in JIRA about it and contribute a patch near the end of the week. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. --- company:http://www.media-style.com forum:http://www.text-mining.org blog:http://www.find23.net __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Idea about aliases in the parse-plugins.xml file
Hi Folks, Jerome and I have been talking about an idea to address the current issue raised by Stefan G. about having a mapping of mimeType -> list of pluginIds rather than mimeType -> list of extensionIds in the parse-plugins.xml file. We've come up with the following proposed update that would seemingly fix this problem. We propose to have the concept of aliases in the parse-plugins.xml file, defined at the end of the file, something like:

  <parse-plugins>
    <mimeType name="text/html">
      <plugin id="parse-html"/>
    </mimeType>
    ...
    <aliases>
      <alias name="parse-html" extension-point="org.apache.nutch.parse.html.HtmlParser"/>
      <alias name="parse-html2" extension-point="my.other.html.Parser"/>
    </aliases>
  </parse-plugins>

What do you guys think? This approach would be flexible enough to allow the mapping of extensionIds to mimeTypes, but without impacting the current pluginId concept. Comments welcome. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Standard metadata property names in the ParseData metadata
Hi Guys, Okay, that makes sense then. I will create an issue in JIRA later today describing the update, and then begin working on this over the next few days. Thanks for your responses and reviews. Cheers, Chris On 12/13/05 12:45 PM, Jérôme Charron [EMAIL PROTECTED] wrote: I agree, too. Perhaps we should use the names as they appear in the Dublin Core for those properties that are defined there A big YES! - just prepended them with X-nutch- in order to avoid name-clashes with other properties (e.g. blindly copied from the protocol headers). Another big YES! -- http://motrech.free.fr/ http://www.frutch.org/ __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
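To make the agreed naming convention concrete, a tiny illustrative sketch in Java: Dublin Core element names, prefixed with X-nutch- to avoid clashes with properties blindly copied from the protocol headers. The constant names and values below are examples only, not the set that was actually committed to Nutch.

  // Illustrative only; the actual names adopted in Nutch may differ.
  public interface NutchMetadataNames {
    String CREATOR  = "X-nutch-creator";   // Dublin Core "creator"
    String LANGUAGE = "X-nutch-language";  // Dublin Core "language"
    String FORMAT   = "X-nutch-format";    // Dublin Core "format"
  }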
NUTCH-112: Link in cached.jsp page to cached content is an absolute link
Hi Guys, Just wondering if any of the committers checked out http://issues.apache.org/jira/browse/NUTCH-112. Turns out the link in the cached.jsp page to the cached content is an absolute link, which breaks when you don't deploy the nutch webapp in the root context. I've attached a pretty simple patch to the issue, and tested it. It would be nice to have this included for those people like me who are using Nutch deployed at a context other than root, e.g., http://myhost/nutch/. Thanks, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Urlfilter Patch
Jerome, I think that this is a great idea and ensures that there isn't replication of so-called management information across the system. It could be easily implemented as a utility method because we have utility java classes that represent the ParsePluginList, that you could get the mimeTypes from. Additionally, we could create a utility method that searches the extension point list for parsing plugins and returns a boolean true or false whether they are activated or not. Using this information, I believe that the url filtering would be a snap. +1 Cheers, Chris On 12/1/05 12:11 PM, Jérôme Charron [EMAIL PROTECTED] wrote: Suggestion: For consistency purpose, and easy of nutch management, why not filtering the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file extensions associated to each content-type, we can build a list of file extensions to include (other ones will be excluded) in the fecth process. No? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/ __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
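A rough sketch of the utility method being described, under stated assumptions: the map from mime type to file extensions is assumed to be built elsewhere (e.g., from parse-plugins.xml plus the mime-type registry), and none of the class or method names below are existing Nutch APIs.

  import java.util.Collection;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  // Hypothetical helper -- not an existing Nutch class.
  public class PluginDrivenSuffixFilter {

    /** Collect the file suffixes of every mime type that has an activated parser plugin. */
    public static Set<String> allowedSuffixes(Collection<String> parseableMimeTypes,
                                              Map<String, List<String>> mimeTypeToExtensions) {
      Set<String> suffixes = new HashSet<String>();
      for (String mimeType : parseableMimeTypes) {
        List<String> exts = mimeTypeToExtensions.get(mimeType);
        if (exts != null) {
          suffixes.addAll(exts);
        }
      }
      return suffixes;
    }

    /** Accept a URL if it has no suffix, or if its suffix belongs to a parseable mime type. */
    public static boolean accept(String url, Set<String> allowedSuffixes) {
      int dot = url.lastIndexOf('.');
      int slash = url.lastIndexOf('/');
      if (dot <= slash) {
        return true; // no file suffix; leave the decision to other URL filters
      }
      return allowedSuffixes.contains(url.substring(dot + 1).toLowerCase());
    }
  }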
RE: Urlfilter Patch
Hi Doug, Chris Mattmann wrote: In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it. Duh, you're right. Sorry about that. Matt Kangas wrote: The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes... I liked Matt's idea of the HEAD request though. I wonder if some benchmarks on performance of this would be useful, because in some cases (such as focused crawling, or non-whole-internet crawling, such as intranet, etc.), it would seem that the performance penalty of performing the HEAD to get the content-type would be useful, and worth the cost... Cheers, Chris
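As a rough illustration of the HEAD-before-GET idea (and of what such a benchmark would exercise), here is a minimal sketch using only java.net. This is not how Nutch's protocol plugins work; it also ignores redirects, robots rules, and politeness delays for brevity.

  import java.net.HttpURLConnection;
  import java.net.URL;

  // Hypothetical probe -- an illustration of the idea, not existing Nutch code.
  public class HeadProbe {

    /** Return the Content-Type reported by a HEAD request, or null if it cannot be determined. */
    public static String probeContentType(String url) {
      try {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no body transferred
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        String type = conn.getContentType();  // e.g. "application/pdf"
        conn.disconnect();
        return type;
      } catch (Exception e) {
        return null; // unknown; the caller decides whether to fetch anyway
      }
    }
  }

The cost is one extra round trip per candidate URL, which is exactly what the suggested benchmark would need to weigh against the bandwidth saved by not fetching unparseable content.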
RE: [proposal] Generic Markup Language Parser
Hi Stefan, -1! Xsl is terrible slow! You have to consider what the XSL will be used for. Our proposal suggests XSL as a means of intermediate transformation of markup content on the backend, as Jerome suggested in his reply. This means that whenever markup content is encountered, specifically, XML based content, then XSL will be used to create an intermediary parse-out xml file, containing the fields to index. I don't think, given the percentage of xml-based markup content out there (of course excluding html), compared to regular content, that this would significantly degrade performance. Xml will blow up memory and storage usage. Possibly, but I would think that we would do it in a clever fashion. For instance, the parse-out xml files would most likely be small (~kb) files that could be deleted if space is a concern. It could be a parameterized option. Dublin core may is good for semantic web, but not for a content storage. I completely disagree with that. In fact, I think many people would disagree with that in fact. Dublin core is a standard metadata model for electronic resources. It is by no means the entire spectrum of metadata that could be stored for electronic content. However, rather than creating your own author field, or content creator, or document creator, or whatever you want to call it, I think it would be nice to provide the DC metadata because at least it is well known and provides interoperability with other content storage systems. Check out DSpace from MIT. Check out ISO-11179 registry systems. Check out the ISO standard OAIS reference model for archiving systems. Each of these systems has recognized that standard metadata is an important concern in any content management system. In general the goal must be to minimalize memory usage and improve performance such a parser would increase memory usage and definitely slow down parsing. I dont think it would slow down parsing significantly, as I mentioned above markup content represents a small portion of the amount of content out there. The magic world is minimalism. So I vote against this suggestion! Stefan In general, this proposal represents a step forward in being able to parse generic XML content in Nutch, which is a very challenging problem. Thanks for your suggestions, however, I think that our proposal would help Nutch to move forward in being to handle generic forms of XML markup content. Cheers, Chris Mattmann Am 24.11.2005 um 00:01 schrieb Jérôme Charron: Hi, We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and me) just add a new proposal on the nutch Wiki: http://wiki.apache.org/nutch/MarkupLanguageParserProposal Here is the Summary of Issue: Currently, Nutch provides some specific markup language parsing plugins: one for handling HTML, another one for RSS, but no generic XML parsing plugin. This is extremely cumbersome as adding support for a new markup language implies that you have to develop the whole XML parsing code from scratch. This methodology causes: (1) code duplication, with little or no reuse of common pieces of XML parsing code, and (2) dependency library duplication, where many XML parsing plugins may rely on similar xml parsing libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing plugin keeps its own local copy of these libraries. It is also very difficult to identify precisely the type of XML content encountered during a parse. That difficult issue is outside the scope of this proposal, and will be identified in a future proposal. 
Thanks for your feedback, comments, suggestions (and votes). Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
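For concreteness, a minimal sketch (standard JAXP only) of the intermediate-transformation step discussed in the preceding messages: apply a stylesheet chosen for the incoming document's schema or DTD and emit a small "parse-out" XML document listing the fields to index. The class name and the notion of a per-schema stylesheet path are illustrative assumptions, not part of the proposal or of Nutch.

  import java.io.StringReader;
  import java.io.StringWriter;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.stream.StreamResult;
  import javax.xml.transform.stream.StreamSource;

  // Illustrative only.
  public class ParseOutTransform {

    /** Transform raw XML content into the intermediate "parse-out" form via an XSL stylesheet. */
    public static String toParseOut(String xmlContent, String xslPath) throws Exception {
      Transformer t = TransformerFactory.newInstance()
          .newTransformer(new StreamSource(xslPath)); // stylesheet chosen per schema/DTD
      StringWriter out = new StringWriter();
      t.transform(new StreamSource(new StringReader(xmlContent)), new StreamResult(out));
      return out.toString(); // small XML document whose elements map to index fields
    }
  }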
RE: [proposal] Generic Markup Language Parser
Hi Stefan, and Jerome, A mail archive is a amazing source of information, isn't it?! :-) To answer your question, just ask your self how many pages per second your plan to fetch and parse and how much queries per second a lucene index is able to handle - and you can deliver in the ui. I have here something like 200++ to a maximal 20 queries per second. http://wiki.apache.org/nutch/HardwareRequirements I'm not sure that our proposal affects the ui, really at all. Parsing occurs only during a fetch, which creates the index for the ui, no? So, why mention the amount of queries per second that the ui can handle? Speed improvement in ui can be done by caching components you use to assemble the ui. There are some ways to improve speed But seriously I don't think there will be any pages that contains 'cacheable' items until parsing. Until last years there is one thing I notice that matters in a search engine - minimalism. There is no usage in nutch of a logging library, Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-) no RMI and no meta data in the web db. Why? Minimalism. Minimalism == speed, speed == scalability, scalability == serious enterprise search engine projects. I don't think it would be a good move to slow down html parsing (most used parser) to make rss parser writing more easier for developers. This proposal isn't meant for RSS, that's seriously constraining the scope. The proposal is meant for making writing * XML * parsers easier. Note the XML. RSS is a significantly small subset of XML as a whole. And, there currently exists no default support for generic XML documents in Nutch. BTW, we already have a html and feed parser that works, as far I know. I guess 90 % of the nutch users use the html parser but only 10 % the feed-parser (since blogs are mostly html as well). This may or may not be true however I wouldn't be surprised if it was because it is representative of the division of content on the web -- HTML definitely is orders of magnitude more pervasive than RSS. From my perspective we have much more general things to solve in nutch (manageability, monitoring, ndfs block based task-routing, more dynamic search servers) than improving thing we already have. I would tend to agree with Jerome on this one -- these seem to be the items on your agenda: a representative set indeed, but by no means an exhaustive set of what's needed to improve, and benefit Nutch. One of the motivations behind our proposal was several emails posted to the Nutch list by users interested in crawling blogs and RSS: http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/23 69417.html One of my replies to this thread was a message on October 19th, 2005, which really identified the main problem: http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/23 69576.html There is a lack of a general XML parser in Nutch that would allow it to deal with general XML content based on user defined schemas and DTDs. Our proposal would be the initial step towards a solution to this overall problem. At least, that's part of its intention. Anyway as you may know we have a plugin system and one goal of the plugin system is to give developers the freedom to develop custom plugins. :-) Indeed. And our goal is help developers in their endeavors by providing at starting point and generic solution for XML based parsing plugins :-) Cheers, Chris Cheers, Stefan B-) P.S. 
Do you think it makes sense to run another public nutch mailing list, since 'THE nutch [...]' (mailing list is nutch- [EMAIL PROTECTED]), 'Isn't it?' http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html Am 24.11.2005 um 19:28 schrieb Jérôme Charron: Hi Stefan, And thanks for taking time to read the doc and giving us your feedback. -1! Xsl is terrible slow! Xml will blow up memory and storage usage. But there still something I don't understand... Regarding a previous discussion we had about the use of OpenSearch API to replace Servlet = HTML by Servlet = XML = HTML (using xsl), here is a copy of one of my comment: In my opinion, it is the front-end dreamed architecture. But more pragmatically, I'm not sure it's a good idea. XSL transformation is a rather slow process!! And the Nutch front-end must be very responsive. and then your response and Doug response too: Stefan: We already done experiments using XSLT. There are some ways to improve speed, however it is 20 ++ % slower then jsp. Doug: I don't think this would make a significant impact on overall Nutch search performance. (the complete thread is available at http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/ msg03811.html ) I'm a little bit confused... why the use of xsl must be considered as too time and memory expansive in the back-end process, but not in the front-end?
Re: developing a parse-/index-/query- plugin set
Hi Doug, On 10/17/05 11:38 AM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: So, one thing it seems is that fields to be indexed, and used in a field query must be fully lowercase to work? Additionally, it seems that they can't have symbols in them, such as _, is that correct? Would you guys consider this to be a bug? Yes, this sounds like a bug. Okay, I will look and see if I can figure out why this is happening and if I can, I will try and submit a patch. Performing Lucene Query: using filter QueryFilter(+contactemail:[EMAIL PROTECTED]) and numHits = 20 051016 190347 11 total hits: 0 A query whose only clause has a boost of 0.0 will return no results. Nutch uses the convention that clauses whose boost is 0.0 may be converted to filters, for efficiency. A filter affects the set of hits, but not their ranking. So a boost of 0.0 is used to declare that a clause does not affect ranking and may not be used in isolation. This makes it akin to searching for filetype:pdf on Google--filetype is only used to filter other queries and may not be a standalone query. Okay, this makes sense. In fact, when I do a query now for: contactemail:[EMAIL PROTECTED] specimen The query actually works. Of the 3 documents I indexed only one of them has the contactemail [EMAIL PROTECTED], and so I only got one result back. So your answer there makes total sense. So, my question to you then is, what type of QueryFilter should I develop in order to get my query for contactemail:email address to work as a standalone query? For instance, right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be the right way to do it now. Is there a class in Nutch that I can sub-class to get most of the functionality for doing a type:value query as a standalone query? Thanks for the help. Cheers, Chris Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: developing a parse-/index-/query- plugin set
Hi Doug, Thanks, that worked. Cheers, Chris On 10/17/05 11:56 AM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: So, my question to you then is, what type of QueryFilter should I develop in order to get my query for contactemail:<email address> to work as a standalone query? For instance, right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be the right way to do it now. Is there a class in Nutch that I can sub-class to get most of the functionality for doing a type:value query as a standalone query? You can simply pass a non-zero boost to the RawFieldQueryFilter constructor, e.g.:

  public class MyQueryFilter extends RawFieldQueryFilter {
    public MyQueryFilter() {
      super("myfield", 1.0f);
    }
  }

Or you can implement QueryFilter directly. There's not that much to it. Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Hi, I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible in a <![CDATA[ ]]> section? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work, and there's no other way to solve the problem without losing text, then your patch has my +1. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: [EMAIL PROTECTED] (JIRA) [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 12, 2005 5:19 PM To: nutch-dev@incubator.apache.org Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars.patch Attached patch runs all xml text through a check for bad xml characters. This patch is brutal, silently dropping illegal characters. Patch was made after hunting xalan, jdk, and nutch itself for a method that would do the above filtering but was unable to find any such method -- perhaps an oversight on my part? OpenSearchServlet outputs illegal xml characters Key: NUTCH-110 URL: http://issues.apache.org/jira/browse/NUTCH-110 Project: Nutch Type: Bug Components: searcher Versions: 0.7 Environment: linux, jdk 1.5 Reporter: [EMAIL PROTECTED] Attachments: fixIllegalXmlChars.patch OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on search result, it's possible for OSS to output xml that is not well-formed. For example, if text has the FF character in it -- -- i.e. the ascii character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
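For reference, a sketch of the kind of filtering the attached patch describes (this is not the patch itself): drop everything outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). Note that a CDATA section would not by itself rescue a character like form feed (#xC), since the Char production excludes it even inside CDATA, which is presumably why the patch drops such characters.

  // Illustrative sketch, not the attached fixIllegalXmlChars.patch.
  public class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isLegalXmlChar(int c) {
      return c == 0x9 || c == 0xA || c == 0xD
          || (c >= 0x20 && c <= 0xD7FF)
          || (c >= 0xE000 && c <= 0xFFFD)
          || (c >= 0x10000 && c <= 0x10FFFF);
    }

    /** Return text with illegal XML characters silently removed. */
    public static String strip(String text) {
      StringBuilder out = new StringBuilder(text.length());
      for (int i = 0; i < text.length(); ) {
        int cp = text.codePointAt(i);   // iterate by code point so surrogate pairs survive
        if (isLegalXmlChar(cp)) {
          out.appendCodePoint(cp);
        }
        i += Character.charCount(cp);
      }
      return out.toString();
    }
  }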
Re: failing of org.apache.nutch.tools.TestSegmentMergeTool?
You know what the crazy thing is: Seemingly, all tests pass now. And I didn't change a thing. Honest. I swear. Very strange, indeed, but I'm happy because at least the tests are passing! :-) Cheers, Chris On 9/27/05 12:29 PM, Paul Baclace [EMAIL PROTECTED] wrote: Chris Mattmann wrote: I just noticed after checking out the latest SVN of Nutch that I am currently failing the TestSegmentMergeTool Junit test when I type ant test for Nutch. I'm on the mapred branch, not the trunk, and all tests pass. One thing I have noticed is that it is best to start with 'ant clean' and if you made any mods to the conf files, rewind them back by copying the x.template files to x. Paul __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
failing of org.apache.nutch.tools.TestSegmentMergeTool?
Hi there, I just noticed after checking out the latest SVN of Nutch that I am currently failing the TestSegmentMergeTool JUnit test when I type ant test for Nutch. Is anyone experiencing the same problem? Here is the relevant information which I captured out of the $NUTCH_HOME/build/test/TEST-org.apache.nutch.tools.TestSegmentMergeTool.txt file: Testsuite: org.apache.nutch.tools.TestSegmentMergeTool Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 46.256 sec - Standard Error - 050926 215316 parsing file:/C:/Program%20Files/eclipse/workspace/nutch/conf/nutch-default.xml 050926 215316 parsing file:/C:/Program%20Files/eclipse/workspace/nutch/build/test/classes/nutch-site.xml 050926 215316 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer 050926 215321 No FS indicated, using default:local 050926 215321 * Opening 10 segments: 050926 215321 - segment seg0: 500 records. 050926 215321 - segment seg1: 500 records. 050926 215321 - segment seg2: 500 records. 050926 215321 - segment seg3: 500 records. 050926 215321 - segment seg4: 500 records. 050926 215321 - segment seg5: 500 records. 050926 215321 - segment seg6: 500 records. 050926 215321 - segment seg7: 500 records. 050926 215321 - segment seg8: 500 records. 050926 215321 - segment seg9: 500 records. 050926 215321 * TOTAL 5000 input records in 10 segments. 050926 215321 * Creating master index... 050926 215328 * Creating index took 6356 ms 050926 215328 * Optimizing index took 0 ms 050926 215328 * Removing duplicate entries... 050926 215328 * Deduplicating took 652 ms 050926 215328 * Merging all segments into output 050926 215333 * Merging took 4381 ms 050926 215333 * Deleting old segments... 050926 215333 Finished SegmentMergeTool: INPUT: 5000 - OUTPUT: 5000 entries in 12.15 s (416.6 entries/sec). 050926 215339 No FS indicated, using default:local 050926 215339 * Opening 10 segments: 050926 215339 - segment seg0: 500 records. 050926 215339 - segment seg1: 500 records. 050926 215339 - segment seg2: 500 records. 050926 215339 - segment seg3: 500 records. 050926 215339 - segment seg4: 500 records. 050926 215339 - segment seg5: 500 records. 050926 215339 - segment seg6: 500 records. 050926 215339 - segment seg7: 500 records. 050926 215339 - segment seg8: 500 records. 050926 215339 - segment seg9: 500 records. 050926 215339 * TOTAL 5000 input records in 10 segments. 050926 215339 * Creating master index... 050926 215344 * Creating index took 5083 ms 050926 215344 * Optimizing index took 0 ms 050926 215344 * Removing duplicate entries... 050926 215344 * Deduplicating took 150 ms 050926 215344 * Merging all segments into output 050926 215345 * Merging took 662 ms 050926 215345 * Deleting old segments... 050926 215345 Finished SegmentMergeTool: INPUT: 5000 - OUTPUT: 500 entries in 6.316 s (833. entries/sec).
java.lang.Exception: Missing or invalid 'fetcher' or 'fetcher_output' directory in c:\DOCUME~1\mattmann\LOCALS~1\Temp\.smttest63088\output\.fastmerge_index at org.apache.nutch.segment.SegmentReader.isParsedSegment(SegmentReader.java:168) at org.apache.nutch.segment.SegmentReader.init(SegmentReader.java:143) at org.apache.nutch.segment.SegmentReader.init(SegmentReader.java:82) at org.apache.nutch.tools.TestSegmentMergeTool.testSameMerge(TestSegmentMergeTool.java:185) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:324) at junit.framework.TestCase.runTest(TestCase.java:154) at junit.framework.TestCase.runBare(TestCase.java:127) at junit.framework.TestResult$1.protect(TestResult.java:106) at junit.framework.TestResult.runProtected(TestResult.java:124) at junit.framework.TestResult.run(TestResult.java:109) at junit.framework.TestCase.run(TestCase.java:118) at junit.framework.TestSuite.runTest(TestSuite.java:208) at junit.framework.TestSuite.run(TestSuite.java:203) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:289) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:523) junit.framework.AssertionFailedError: Missing or invalid 'fetcher' or 'fetcher_output' directory in c:\DOCUME~1\mattmann\LOCALS~1\Temp\.smttest63088\output\.fastmerge_index at junit.framework.Assert.fail(Assert.java:47) at org.apache.nutch.tools.TestSegmentMergeTool.testSameMerge(TestSegmentMergeTool.java:190) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
Re: [Nutch-cvs] [Nutch Wiki] Update of ParserFactoryImprovementProposal by ChrisMattmann
Hi Otis, Point taken. In actuality, since both convey the same information, I think that it's okay to support both, but by default, say, we could code the initial plugins specified in parse-plugins.xml without the order= attribute. Fair enough? Cheers, Chris On 9/15/05 3:23 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Well, you have to tell users about order=N somewhere in the docs. Instead of telling them about order=N, tell them that the order in XML matters. Either case requires education, and the latter one requires less typing and avoids the case described in the proposal. Otis --- Sébastien LE CALLONNEC [EMAIL PROTECTED] wrote: Hi Otis, This issue arose during our discussion for this proposal, and my feeling was that the XML specification doesn't state that the order is significant in an XML file. I therefore read the spec again, and indeed didn't find anything on that subject... I think it is somehow reasonable to consider that a parser _might_ return the elements in a different order—though, as I mentioned to Chris Jerome, that would be quite unheard of, and, to be honest, rather irritating. What do you think? Regards, Sebastien. Quick comment about order=N and the paragraph that describes how to deal with cases where people mess things up and enter multiple plugins for the same content type and the same order: - Why is the order attribute even needed? It looks like a redundant piece of information - why not derive order from the order of plugin definitions in the XML file? For instance: Instead of this: <mimeType name="*"> <plugin id="parse-text" order="1"/> <plugin id="another-one-default-parser" order="2"/> </mimeType> We have this: <mimeType name="*"> <plugin id="parse-text"/> <plugin id="another-one-default-parser"/> </mimeType> parse-text first, another-one-default-parser second. Less typing, and we avoid the case of equal ordering altogether. Otis --- Apache Wiki [EMAIL PROTECTED] wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ChrisMattmann: http://wiki.apache.org/nutch/ParserFactoryImprovementProposal The comment on the change is: Initial Draft of ParserFactoryImprovementProposal New page: = Parser Factory Improvement Proposal = == Summary of Issue == Currently Nutch provides a plugin mechanism wherein plugins register certain metadata about themselves, including their id, classname, and so forth. In particular, the set of parsing plugins register which contentTypes and file suffixes they can support with a PluginRepository. One "adopted practice" in current Nutch parsing plugins (committed in Subversion, e.g., see parse-pdf, parse-rss, etc.) has also been to verify that the content type passed to it during a fetch is indeed one of the contentTypes that it supports (be it application/xml, or application/pdf, etc.). This practice is cumbersome for a few reasons: * Any updates to supported content types for a parsing plugin will require a recompilation of the plugin code * Checking for "hard coded" content types within the parsing plugin is a duplication of information that already exists in the plugin's descriptor file, plugin.xml * By the time that content gets to a parsing plugin (e.g., the parsing plugin is returned by the ParserFactory, and provided content during a fetch), the ParserFactory should have already ensured that the appropriate plugin is getting called for a particular contentType.
In addition to this problem is the fact that several parsing plugins may all support many of the same content types. For instance, the parse-js plugin may be the only well-suited parsing plugin for javascript, but perhaps it also provides a good enough heuristic parser for plain text, and so it may support both types. However, there may be a parsing plugin for text (which there is!), parse-text, whose primary purpose is to parse plain text as well. == Suggested Remedy == To deal with ensuring the desired parsing plugin is called for the appropriate content type, and, in effect, to "kill two birds with one stone", we propose that there be a parsing plugin preference list for each content type that Nutch knows how to handle, i.e., each content type available via the mimeType system. Therefore, during a fetch, once the appropriate mimeType has been determined for content, and the ParserFactory is tasked with returning a parsing plugin, the ParserFactory should consult a preference list for that contentType, allowing it to determine which plugin has the highest preference for the contentType. That parsing plugin should be returned via the ParserFactory to the fetcher (a rough sketch of such a lookup follows this excerpt). If there is any problem using the initial returned parsing plugin for a
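A rough sketch of the preference-list lookup described above: the class and method names are illustrative rather than part of the proposal or the Nutch API, and a real ParserFactory would populate these lists from parse-plugins.xml and resolve plugin ids through the PluginRepository.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Holds, per content type, an ordered list of parsing plugin ids, with the
// order in parse-plugins.xml (or an explicit order attribute) deciding
// preference. A ParserFactory built on this would walk the list and return
// the first plugin it can actually instantiate.
public class ParserPreferenceList {

  // contentType -> plugin ids, most preferred first
  private final Map<String, List<String>> preferences =
      new HashMap<String, List<String>>();

  // Register pluginId as the next-preferred parser for contentType.
  public void addPreference(String contentType, String pluginId) {
    List<String> ids = preferences.get(contentType);
    if (ids == null) {
      ids = new ArrayList<String>();
      preferences.put(contentType, ids);
    }
    ids.add(pluginId);
  }

  // Preference list for contentType, falling back to the wildcard ("*")
  // entry when no exact match is registered.
  public List<String> getPluginList(String contentType) {
    List<String> ids = preferences.get(contentType);
    if (ids == null) {
      ids = preferences.get("*");
    }
    return ids != null ? ids : new ArrayList<String>();
  }
}

Registering parse-text and then another-one-default-parser for the wildcard type would reproduce Otis's example earlier in the thread: parse-text is tried first, the fallback second.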
RE: [jira] Commented: (NUTCH-30) rss feed parser
Hi Folks, In response to Michael's comment, I've gone ahead and uploaded an updated working patch and source distribution for the parse-rss plugin. The latest patch and source work against the new protocol and parsing APIs by Andrzej. The patch was made against the latest SVN from 73005. The patch and source distro are zipped up in the file: parse-rss-73005.zip. Here is a direct link: http://issues.apache.org/jira/secure/attachment/12311475/parse-rss-73005.zip Thanks! Cheers, Chris Mattmann __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 Phone: 818-354-8810 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: Michael Nebel (JIRA) [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 27, 2005 8:42 AM To: [EMAIL PROTECTED] Subject: [jira] Commented: (NUTCH-30) rss feed parser [ http://issues.apache.org/jira/browse/NUTCH-30?page=comments#action_12316928 ] Michael Nebel commented on NUTCH-30: I loaded the latest sources from the svn yesterday and tried to integrate this plugin (I used the Zip from Hasan). I found: - getParse throws a ParseException which isn't supported by getParse - the call to new ParseData needs a new parameter ParseStatus My fixes are far from perfect (I have just identified the problems for now), so I'm not creating a patch. :-( rss feed parser --- Key: NUTCH-30 URL: http://issues.apache.org/jira/browse/NUTCH-30 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Assignee: Chris A. Mattmann Priority: Minor Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-1.0-040605.zip, parse-rss-patch.txt, parse-rss-srcbin-incl-path.zip, parse-rss.zip, parseRss.zip A simple rss feed parser supporting: rss and atom: + version 0.3 + version 0.9 + version 1.0 + version 2.0 Converting of different rss versions is done via xslt. The xslt was contributed by Frank Henze - Thanks! -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira