Re: [jira] Updated: (NUTCH-627) Minimize host address lookup
On 4/10/08 8:25 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Otis Gospodnetic (JIRA) wrote: If nobody complains, I'll commit by the end of the week. Hi Otis, Thanks for helping with Nutch - we are indeed very shorthanded at the moment, and any help is appreciated, and doubly so that of a person who can commit things ... However, on the formal side I think the Nutch team needs to vote you in as a Nutch committer (even though svn allows you to commit directly) - witness the recent situation with Grant. If you wish I can start a vote, and I'm sure it will be positive, and we will have a clean situation from the formal POV. Ok? +1 +1, as well. Cheers, Chris __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: End-Of-Life status for 0.7.x?
+1 On 1/17/08 12:49 PM, Dennis Kubes [EMAIL PROTECTED] wrote: +1. Andrzej Bialecki wrote: Hi all, I'd like to initiate the discussion about the EOL status of Nutch 0.7.x branch. The question is whether we want to actively support it, whether we have enough resources to make any new releases or apply patches that sit in JIRA? My opinion is that we should mark it EOL, and close all JIRA issues that are relevant only to 0.7.x, with the status Won't Fix. __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Student contributions
Hi Frank, Thanks for your interest in using Nutch! The best way to see what's on the horizon, and needed in Nutch, is to check out our JIRA issue tracking system, at: http://issues.apache.org/jira/browse/NUTCH At present, there are 39 current issues with Nutch, planned to be fixed, or added (as a new feature), or improved (made to an existing feature), for the upcoming 1.0.0 release. There are 222 open issues across all versions of Nutch (including prior releases). To help you digest the wealth of information that's there (and trust me, there's plenty), I would offer a few of my own suggestions for class projects:

(Difficulty: High) 1. Decouple Nutch's crawl infrastructure, and turn it into its own extension point. The current Nutch crawl infrastructure is highly coupled around a few monolithic classes: Fetcher (or its big brother, Fetcher2), Hadoop (as the underlying job/crawl execution platform), etc. There have been several requests on the list to make the crawler its own component, make it light-weight, make it configurable, etc. I think an ambitious 2-week student project would be to take a stab at this decoupling.

(Difficulty: Medium) 2. Analyze the Nutch code base, and propose/suggest architectural improvements. Currently, the Nutch code base is a behemoth of plugins/extension points, configuration properties, and the like. It would be nice to have a fresh look at its architecture, from an outsider's perspective. The students would suggest places to cut/places to add, cleaner interfaces, and the appropriate underlying middleware substrates, e.g., is Hadoop the only logical choice? What about other enterprise solutions such as web services/EJB/JMS/etc.?

(Difficulty: Medium) 3. Use Spring as the underlying configuration framework for Nutch, and overhaul Nutch's home-grown configuration infrastructure. Spring is an open source framework centered around providing configuration and instantiation middleware capabilities: it lets developers focus on the domain objects, and handles the rest. The student would first take a look at Spring, then Nutch, then build a prototype that shows how Spring could be used to configure Nutch (a rough sketch of what this might look like follows at the end of this message).

There are plenty of others, but that should help get the juices flowing; these were just a few ideas off the top of my head. Also, FYI, a course has been taught for a few semesters at the University of Southern California (USC) by Dr. Ellis Horowitz on Search Engines. Here is a pointer to that page. You can find some other Nutch project suggestions there. http://www-scf.usc.edu/~csci572/ Good luck! Cheers, Chris On 1/2/08 2:44 PM, Frank McCown [EMAIL PROTECTED] wrote: Greetings. I'm teaching a class on search engine development this semester, and I am considering having my students use Nutch in their projects (I'm new to Nutch myself). I'd like them to get some experience with an open source project and make a significant contribution. Are there any implementation tasks you guys think would be appropriate for a small group of undergrad, upperclass CS students? I'm looking for ideas for improving Nutch that they could accomplish in a few weeks' time. Thanks, __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
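A rough sketch of what suggestion 3 could look like, purely for illustration: the nutch-components.xml file, the urlFilter bean, and the UrlFilter/PrefixUrlFilter classes below are made up for this example (nothing like them exists in Nutch today); only the Spring API calls themselves are real.

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class SpringConfigDemo {

  /** Minimal interface standing in for a Nutch extension point. */
  public interface UrlFilter {
    String filter(String url);
  }

  /** One possible implementation, configured entirely by Spring. */
  public static class PrefixUrlFilter implements UrlFilter {
    private String prefix;

    // Spring injects this via a <property name="prefix" value="http://"/>
    // entry in the (hypothetical) nutch-components.xml.
    public void setPrefix(String prefix) {
      this.prefix = prefix;
    }

    public String filter(String url) {
      return url.startsWith(prefix) ? url : null;
    }
  }

  public static void main(String[] args) {
    // Instead of NutchConfiguration plus plugin-repository lookups, the
    // object graph is declared in XML and materialized by Spring.
    ApplicationContext ctx =
        new ClassPathXmlApplicationContext("nutch-components.xml");
    UrlFilter filter = (UrlFilter) ctx.getBean("urlFilter");
    System.out.println(filter.filter("http://lucene.apache.org/nutch/"));
  }
}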
Re: Commit Times for Issues
Hi Guys, I'd like to chime in here on this one. My +1 for shortening the time to commit for issues. I fear that development effort on Nutch has teetered on the dwindling side of things for the last year or so, and there (in my opinion, so feel free to disagree) is certainly a stigma to the trunk and its sacred nature that discourages people (including myself) from introducing new code there. I would like to propose even extending Dennis's idea below and developing a new philosophy towards the Nutch CM. To me, the big picture change is the following statement: the trunk is something that can be broke. Let's just accept that it's possible. If it's broke, someone will report it. Nutch has a big enough user base playing around with new builds and revisions now that this will get caught. Guess what. If the trunk is broke, then it can be fixed. I'll tell you guys a story about one of my bosses here at JPL. He used to work for a civil defense contractor in the U.S., with a very rigorous design and software development process. A unit-tests-for-each-line-of-code type of place. In any case, my boss used to break his company's equivalent of the trunk daily build process all the time. Well, one day he gets called in to speak with the vice president of engineering at the company, who proceeds to tell him: "You're really good at breaking the code, eh?" My boss immediately jumps up to defend himself, citing the fact that it wasn't a big problem and that he has fixed it already, but the vice president cuts him off and says, "You probably think I'm mad. Well let me tell you: I'm not. You can break the code all you want, because you know what it tells me? That you're actually *DOING WORK*, unlike the rest of these people who work here and do very little." The above story has stuck with me and made me feel a lot better about situations such as those, in that it gives me the belief that waiting until everything is perfect before acting isn't always the best thing to do, because you may end up waiting forever. It's better to make incremental progress (even falter while doing so), because what you end up with may be just as good (or even better) as if you tried to be a perfectionist and only made progress/did work when you felt everything was right. My 2 cents, Chris On 11/15/07 1:37 PM, Dennis Kubes [EMAIL PROTECTED] wrote: So I have been talking with some of the other committers and I wanted to lay out a suggestion for standardizing some of the nutch committer workflow processes in the hope of speeding up nutch development. The first one I was hoping to tackle is time to commit. At least for me it has been hard to know when to commit something, especially when it was trivial or no one commented on the issue. Here is what is being proposed:

Trivial changes = immediate, this at the discretion of the committers
Minor changes = 24 hours from latest patch or 1 or more +1 from committers
Major and blocker changes = 4 days from latest patch or 2 or more +1 from committers

This way if an issue has been active for some time but no one has taken a look at it, and it has passed all unit tests, then we can go ahead and commit it. Also this should allow more of the smaller changes to be handled faster. So these of course are just some suggestions; I would love to hear from others in the community. What I think would be best is to come to a consensus on this and then have a wiki page describing this and other processes for committers. Dennis Kubes __ Chris Mattmann, Ph.D. 
[EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: JIRA, Resolving and Closing Issues
Dennis, My practice has been to do the following:

1. Resolve the issue, and describe (at a high level) the changes made to the code, e.g.:
* Introduced new classes A, B, C
* Refactored method Y out of class D and into new class E
* Made internal method F of class G use a member variable as an increment check for blah blah

2. Close the issue and list the revision number in which the patch you applied first exists.

That's my practice: not sure if it's right, but it's what I gleaned from watching the other committers for a few years. Cheers, Chris On 10/18/07 9:58 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Quick question about Jira. When we commit, are we supposed to first resolve and then close the issue? What is the process on this? Dennis Kubes __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 Phone: 818-354-8810 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: writing a new parse-exe plugin
      ).getEmptyParse(getConf());
    }
    /// i'm not sure what to return here if i only need to d/l the file
    ParseData parseData = new ParseData(ParseStatus.STATUS_SUCCESS, "", null, null, null);
    parseData.setConf(this.conf);
    return new ParseImpl("", parseData);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }

__ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
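For context, the snippet above is only the tail of a Parser implementation. A minimal, self-contained version of such a plugin might look roughly like the following, based on the Nutch 0.9-era parse API (exact interfaces vary between Nutch versions); the class name and the choice to return an empty-but-successful parse are assumptions for illustration, not the poster's actual code.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseStatus;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

/**
 * Hypothetical "parse-exe" style plugin: it extracts no text or outlinks,
 * it simply records a successful, empty parse so the crawl can move on.
 * The downloaded bytes are available via content.getContent() if the plugin
 * wants to write them somewhere instead.
 */
public class ExeParser implements Parser {

  private Configuration conf;

  public Parse getParse(Content content) {
    // No text and no outlinks -- an empty but successful parse.
    ParseData parseData =
        new ParseData(ParseStatus.STATUS_SUCCESS, "", null, null, null);
    parseData.setConf(this.conf);
    return new ParseImpl("", parseData);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}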
Re: [jira] Closed: (NUTCH-562) Port mime type framework to use Tika mime detection framework
Hi Guys, I vote for reverting this patch, unless there is an overall consensus among Nutch developers that it's ok to keep it as it is - on one hand considering the added functionality and simplification of Nutch code, and on the other hand considering the (lack of) maturity of Tika. I agree with Andrzej here. I would have waited a bit more before rushing into this. Because at this point (where no Tika releases have been made) it might (even though it does not look like it right now) even be possible that the project will be retired without any releases at all. I'm not out for beating a dead horse here, but the thought comes to mind: what about the vitality of the code as it exists within the Nutch code base? When was the last time anybody at all worked on the mime system? It was pioneered by Jerome, but he's been largely inactive as a committer for more than a year now, and it doesn't look like that's going to change. I ported what was largely Nutch's mime system, with Jerome's improvements, to Tika, where the code is actively being developed, by me (and vetted by the other *active* members of the team) -- in contrast to Nutch. As a developer, I don't want to maintain the code in both places, but I'm willing to maintain the Nutch use of and interface to Tika, which means that Nutch will inherit the benefits of using this approach. Being a member of the Nutch community for almost 2 years now, I can't tell you how many times people have asked for Nutch to be able to reliably detect XML content. This is reified in the form of a number of different JIRA issues that reference that deficiency and that are, for all intents and purposes, not being worked on at all. I'm all for following the process, and so forth, but at the same time, I think the Nutch community needs to take a serious look at itself with regards to the sacred nature of the trunk, which we currently treat with a large amount of sensitivity, etc. On other projects (and of course I'm biased, but I use my own work, e.g., Tika, as an example), the trunk is not something that is expected to always be working; it is regularly treated as somewhere bugs can exist, and where they can be fixed before a release is made. That's not the way it is on this project, and quite honestly I think that stymies progress. Finally, there is precedent for what I did with the Tika patch making its way into Nutch. If I recall, something very similar happened when Hadoop came along: NDFS (as it was called at the time) and MapReduce made their way into an external library, and Nutch was made to rely on that (at the time) in-development library. This makes sense, because the folks working on Hadoop were actively working on updates to the portion of the code that Nutch relied upon, and all the developers that were interested in that portion of the code started developing in that arena. I'm not comparing Hadoop to Tika, but certainly there are some similarities here. -Chris __ Chris Mattmann, Ph.D. [EMAIL PROTECTED] Cognizant Development Engineer Early Detection Research Network Project _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
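Purely as an illustration of the kind of mime-detection call Nutch would delegate to Tika. Note the assumption here: the Tika facade class shown below comes from later Tika releases, not from the unreleased snapshot being debated in this thread.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.Tika;

public class MimeDetectDemo {
  public static void main(String[] args) throws Exception {
    Tika tika = new Tika();

    // Name-based detection alone falls back to extension/glob rules ...
    System.out.println(tika.detect("feed.xml"));

    // ... while handing over the bytes as well lets content sniffing run,
    // which is what makes reliable XML (and other) detection possible.
    InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
    try {
      System.out.println(tika.detect(in, args[0]));
    } finally {
      in.close();
    }
  }
}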
Re: svn commit: r550669 - in /lucene/nutch/trunk/src: java/org/apache/nutch/util/ plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ plugin/parse-html/src/java/org/apache/nutch/parse/h
No problemo! Thanks! Cheers, Chris On 6/25/07 9:45 PM, Dennis Kubes [EMAIL PROTECTED] wrote: ooops... gotta remember to do that. Done. Dennis Chris Mattmann wrote: On 6/25/07 8:34 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Jun 25 20:33:59 2007 New Revision: 550669 URL: http://svn.apache.org/viewvc?view=rev&rev=550669 Log: NUTCH-497: Fixes problems relating to StackOverflow errors and extreme nested tags. Adds general framework for stack based Node walking. [...snip...] Hi Dennis, Could you update CHANGES.txt to reflect your commit of NUTCH-497? Thanks! Cheers, Chris
Re: Build failed in Hudson: Nutch-Nightly #123
Doğacan, This is strange indeed. I noticed this during my testing of parse-feed, however, thought it was an anomaly. I got this same strange cryptic unit test error message, and then after some frustration figuring it out, I did ant clean, then ant compile-core test, and miraculously the error seemed to go away. Also, if you go into $NUTCH/src/plugin/feed/ and run ant clean test (of course after running ant compile-core from the top-level $NUTCH dir), the unit tests seem to pass? [XXX:src/plugin/feed] mattmann% pwd /Users/mattmann/src/nutch/src/plugin/feed [XXX:src/plugin/feed] mattmann% ant clean test Searching for build.xml ... Buildfile: /Users/mattmann/src/nutch/src/plugin/feed/build.xml clean: [delete] Deleting directory /Users/mattmann/src/nutch/build/feed [delete] Deleting directory /Users/mattmann/src/nutch/build/plugins/feed init: [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/classes [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test [mkdir] Created dir: /Users/mattmann/src/nutch/build/feed/test/data [copy] Copying 1 file to /Users/mattmann/src/nutch/build/feed/test/data init-plugin: deps-jar: compile: [echo] Compiling plugin: feed [javac] Compiling 2 source files to /Users/mattmann/src/nutch/build/feed/classes compile-test: [javac] Compiling 1 source file to /Users/mattmann/src/nutch/build/feed/test jar: [jar] Building jar: /Users/mattmann/src/nutch/build/feed/feed.jar deps-test: init: init-plugin: compile: jar: deps-test: deploy: copy-generated-lib: init: init-plugin: deps-jar: compile: [echo] Compiling plugin: protocol-file jar: deps-test: deploy: copy-generated-lib: deploy: [mkdir] Created dir: /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed copy-generated-lib: [copy] Copying 1 file to /Users/mattmann/src/nutch/build/plugins/feed [copy] Copying 2 files to /Users/mattmann/src/nutch/build/plugins/feed test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.663 sec BUILD SUCCESSFUL Total time: 3 seconds [XXX:src/plugin/feed] mattmann% Any ideas? Cheers, Chris On 6/20/07 6:04 AM, Doğacan Güney [EMAIL PROTECTED] wrote: On 6/20/07, Doğacan Güney [EMAIL PROTECTED] wrote: This is rather strange. Here is part of the console output: test: [echo] Testing plugin: parse-swf [junit] Running org.apache.nutch.parse.swf.TestSWFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.315 sec [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 5.387 sec init: [junit] Test org.apache.nutch.parse.feed.TestFeedParser FAILED SWFParser fails one of the unit tests but the report says that FeedParser has failed even though it has actually passed its test: test: [echo] Testing plugin: feed [junit] Running org.apache.nutch.parse.feed.TestFeedParser [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.304 sec (ant test forks processes to test code, that's why we are seeing test outputs out of order.) Anyway, it is not TestSWFParser but TestFeedParser that fails. I am trying to understand why it fails. Chris, can you lend me a hand here? -- Doğacan Güney __ Chris A. 
Mattmann [EMAIL PROTECTED] Key Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Build failed in Hudson: Nutch-Nightly #123
On 6/20/07 8:17 AM, Doğacan Güney [EMAIL PROTECTED] wrote: Since you are doing compile-core, no plugins get compiled (say, urlfilter-prefix), then when you do an ant test in feed only protocol-file gets compiled. So, no urlfilter-prefix, no problem :). I have to say that I am certain that I am not sure of what I just said. Can you retry with just 'ant' instead of 'ant compile-core'? Heh, yep, that replicated the issue. Okay, so I agree with you with regards to the fix that you suggested; however, the larger issue here is one of annoyance. Why should I have to have a version of the urlfilter-prefix plugin compiled for this issue to manifest itself? Plugin development is supposed to be independent, i.e., while developing the feed plugin I shouldn't need to care about how others have developed the urlfilter plugin, etc., or whether or not there is an appropriate test file there to use in unit testing. I have 2 suggestions:

1. We should make urlfilter-prefix use more of a sensible default for its filters (e.g., a default filter perhaps) that takes effect when the plugin cannot find the specified .txt file (a rough sketch of what I mean follows below).

2. We should think about this more general issue and come up with a way that plugin development in Nutch supports the use case that I was trying, which I find to be highly representative of what many other folks using Nutch are doing as well (i.e., why should I have to do a full rebuild/test of other plugins when I'm simply working on a single one?).

For my part, in the interim, I will ensure that next time before I commit a plugin I make sure that it passes the full 'ant clean compile-core test' cycle. Doğacan, thanks for your help in tracking this down. Could you please commit an example test urlfilter file to make the unit test pass, since you are going to make that change to use lib-xml anyways? Let me know, okay? Thanks! Cheers, Chris
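A rough sketch of the kind of fallback suggestion 1 is getting at. This is a hypothetical, simplified class for illustration only: it is not the actual urlfilter-prefix source, none of the names below exist in Nutch, and the "accept everything when no rules are found" default is just one possible choice.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified, hypothetical prefix filter illustrating a "sensible default":
 * if the configured rules file is missing, fall back to an empty rule set
 * and let every URL through instead of failing during plugin init (e.g.
 * while another plugin's unit test is running).
 */
public class LenientPrefixFilter {

  private final List prefixes = new ArrayList();
  private boolean haveRules = false;

  public LenientPrefixFilter(String rulesFile) {
    try {
      BufferedReader reader = new BufferedReader(new FileReader(rulesFile));
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (line.length() > 0 && !line.startsWith("#")) {
          prefixes.add(line);
        }
      }
      reader.close();
      haveRules = true;
    } catch (IOException e) {
      // Rules file absent: log and continue with no rules rather than
      // breaking the whole build/test cycle.
      System.err.println("No prefix rules found (" + e.getMessage()
          + "); passing all URLs through.");
    }
  }

  /** Returns the URL if accepted, or null if filtered out. */
  public String filter(String url) {
    if (!haveRules) {
      return url;                 // default: accept everything
    }
    for (int i = 0; i < prefixes.size(); i++) {
      if (url.startsWith((String) prefixes.get(i))) {
        return url;
      }
    }
    return null;
  }
}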
Re: Welcome Doğacan as Nutch committer
+1 Welcome to the team, Doğacan! Cheers, Chris On 6/12/07 9:43 AM, Sami Siren [EMAIL PROTECTED] wrote: Doğacan Güney wrote: Hi all, I hope that together we will make nutch rock even harder. By looking at your earlier efforts there should be no doubt. Welcome!
Committer
Hi Folks, I'd just like to throw out my +1 for Doğacan Güney's committer status. I've been impressed by several of his contributions and the guy just keeps them coming and coming. I'm not a member of the Lucene PMC, so I don't have official voting rights, however, I would like to express my support for his elevation to committer status. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Key Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Nutch Release 0.9 - Waiting for release to propagate to mirrors
Hi Guys, Okay, it looks like Nutch 0.9 has propagated to (at least some of the) Apache mirror sites. So, I will now move forward with the final steps of the release. I will have some free time later this afternoon (PST, Los Angeles time) to finish it up. I'll post an email to the developers list announcing the completion of the release. Thanks! Cheers, Chris On 4/4/07 7:21 PM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Guys, I've just moved forward with step 13 in the release process (waiting for release to propogate to mirrors). Should I just go ahead and do the other steps (update Nutch site, update Lucene site, Update javadoc, create version in JIRA, etc.)? It seems that I could do these without the release having propagated to the mirrors as of yet. What do you guys think? Thanks! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Nutch 0.9 officially released!
Hi Folks, After some hard work from all folks involved, we've managed to push out Apache Nutch, release 0.9. This is the second release of Nutch based entirely on the underlying Hadoop platform. This release includes several critical bug fixes, as well as key speedups described in more detail at Sami Siren's blog: http://blog.foofactory.fi/2007/03/twice-speed-half-size.html See the list of changes made in this version: http://www.apache.org/dist/lucene/nutch/CHANGES-0.9.txt The release is available here. http://www.apache.org/dyn/closer.cgi/lucene/nutch/ Special thanks to (in no particular order): Andrzej Bialecki, Dennis Kubes, Sami Siren, and the rest of the Nutch development team for providing lots of help along the way, and for allowing me to be the release manager! Enjoy the new release! Cheers, Chris
Re: [VOTE] Release Apache Nutch 0.9
Hi Guys, Alrighty, that's 4 binding votes (Sami, Andrzej, me, and Dennis), so I think we can safely move forward with the release process. I will finish the release up when I get back to my home computer tonight (~5pm Pacific Standard Time, Los Angeles). Thanks, and I will get this thing wrapped up tonight! :-) Cheers, Chris On 4/4/07 8:04 AM, Sami Siren [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/rc2/ Please vote on releasing these packages as Apache Nutch 0.9. +1 -- Sami Siren __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Nutch Release 0.9 - Waiting for release to propagate to mirrors
Hi Guys, I've just moved forward with step 13 in the release process (waiting for the release to propagate to mirrors). Should I just go ahead and do the other steps (update Nutch site, update Lucene site, update javadoc, create version in JIRA, etc.)? It seems that I could do these without the release having propagated to the mirrors as of yet. What do you guys think? Thanks! Cheers, Chris
Re: [VOTE] Release Apache Nutch 0.9
Hi Guys, I think we're discussing the same thing (improving the process), I just don't think 0.9 is out yet :) But to wrap it up for me: +1 for creating the 0.9 branch after fixing the bug (and removing the tag), creating a new rc and starting a vote. +1. +1. So, that's 3 binding votes to change the process. It looks like we have enough to get started. I will begin work tonight (my time, Los Angeles, PST) on removing the tag, and starting the process over again. In the meanwhile, Dennis, do you have the patch that fixes the issue with Hadoop? If so, could you commit it ASAP to the trunk. Once that's done, I'll remove the tag, start the release process over again, and get an RC out for a vote. Then, we can move forward from there. Thanks, guys! Cheers, Chris I still propose that we discuss a bit more (in a separate thread) before rewriting the how to release page in wiki. I agree - the current release process didn't fare too well in this particular situation ... __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: svn commit: r524932 - in /lucene/nutch/trunk/src/java/org/apache/nutch/segment: SegmentMerger.java SegmentReader.java
Hi Dennis, Thanks for taking care of this. :-) Could you update CHANGES.txt as well? Once you take care of that, in about 2 hrs (when I get home), I'll begin the release process again. Thanks! Cheers, Chris On 4/2/07 2:40 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Apr 2 14:40:10 2007 New Revision: 524932 URL: http://svn.apache.org/viewvc?view=revrev=524932 Log: NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. Patch supplied originally by Michael Stack and updated by Doğacan Güney. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/segm ent/SegmentMerger.java?view=diffrev=524932r1=524931r2=524932 == --- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java Mon Apr 2 14:40:10 2007 @@ -18,17 +18,37 @@ package org.apache.nutch.segment; import java.io.IOException; -import java.util.*; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Iterator; +import java.util.TreeMap; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; - -import org.apache.hadoop.conf.*; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.PathFilter; -import org.apache.hadoop.io.*; -import org.apache.hadoop.mapred.*; +import org.apache.hadoop.io.MapFile; +import org.apache.hadoop.io.SequenceFile; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.UTF8; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.WritableComparable; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.hadoop.mapred.InputSplit; +import org.apache.hadoop.mapred.JobClient; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.Mapper; +import org.apache.hadoop.mapred.OutputCollector; +import org.apache.hadoop.mapred.OutputFormatBase; +import org.apache.hadoop.mapred.RecordReader; +import org.apache.hadoop.mapred.RecordWriter; +import org.apache.hadoop.mapred.Reducer; +import org.apache.hadoop.mapred.Reporter; +import org.apache.hadoop.mapred.SequenceFileInputFormat; +import org.apache.hadoop.mapred.SequenceFileRecordReader; import org.apache.hadoop.util.Progressable; import org.apache.nutch.crawl.CrawlDatum; import org.apache.nutch.crawl.Generator; @@ -39,6 +59,7 @@ import org.apache.nutch.parse.ParseText; import org.apache.nutch.protocol.Content; import org.apache.nutch.util.NutchConfiguration; +import org.apache.nutch.util.NutchJob; /** * This tool takes several segments and merges their data together. 
Only the @@ -482,7 +503,7 @@ if (LOG.isInfoEnabled()) { LOG.info(Merging + segs.length + segments to + out + / + segmentName); } -JobConf job = new JobConf(getConf()); +JobConf job = new NutchJob(getConf()); job.setJobName(mergesegs + out + / + segmentName); job.setBoolean(segment.merger.filter, filter); job.setLong(segment.merger.slice, slice); Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/segm ent/SegmentReader.java?view=diffrev=524932r1=524931r2=524932 == --- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Mon Apr 2 14:40:10 2007 @@ -17,18 +17,48 @@ package org.apache.nutch.segment; -import java.io.*; +import java.io.BufferedReader; +import java.io.BufferedWriter; +import java.io.IOException; +import java.io.InputStreamReader; +import java.io.OutputStreamWriter; +import java.io.PrintStream; +import java.io.PrintWriter; +import java.io.Writer; import java.text.SimpleDateFormat; -import java.util.*; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Date; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; - import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; -import org.apache.hadoop.fs.*; -import org.apache.hadoop.io.*; -import org.apache.hadoop.mapred.*; +import org.apache.hadoop.fs.FileSystem; +import
Re: svn commit: r524932 - in /lucene/nutch/trunk/src/java/org/apache/nutch/segment: SegmentMerger.java SegmentReader.java
Hi Dennis, No problem! :-) You did it really fast quite honestly. I will start the release process shortly... Take care! Cheers, Chris On 4/2/07 6:21 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Chris, I have updated changes and resolved and closed the issue. Sorry about not getting to it sooner. Dennis Kubes Chris Mattmann wrote: Hi Dennis, Thanks for taking care of this. :-) Could you update CHANGES.txt as well? Once you take care of that, in about 2 hrs (when I get home), I'll begin the release process again. Thanks! Cheers, Chris On 4/2/07 2:40 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Apr 2 14:40:10 2007 New Revision: 524932 URL: http://svn.apache.org/viewvc?view=revrev=524932 Log: NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. Patch supplied originally by Michael Stack and updated by Doğacan Güney. Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/se gm ent/SegmentMerger.java?view=diffrev=524932r1=524931r2=524932 == --- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java Mon Apr 2 14:40:10 2007 @@ -18,17 +18,37 @@ package org.apache.nutch.segment; import java.io.IOException; -import java.util.*; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Iterator; +import java.util.TreeMap; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; - -import org.apache.hadoop.conf.*; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.PathFilter; -import org.apache.hadoop.io.*; -import org.apache.hadoop.mapred.*; +import org.apache.hadoop.io.MapFile; +import org.apache.hadoop.io.SequenceFile; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.UTF8; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.WritableComparable; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.hadoop.mapred.InputSplit; +import org.apache.hadoop.mapred.JobClient; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.Mapper; +import org.apache.hadoop.mapred.OutputCollector; +import org.apache.hadoop.mapred.OutputFormatBase; +import org.apache.hadoop.mapred.RecordReader; +import org.apache.hadoop.mapred.RecordWriter; +import org.apache.hadoop.mapred.Reducer; +import org.apache.hadoop.mapred.Reporter; +import org.apache.hadoop.mapred.SequenceFileInputFormat; +import org.apache.hadoop.mapred.SequenceFileRecordReader; import org.apache.hadoop.util.Progressable; import org.apache.nutch.crawl.CrawlDatum; import org.apache.nutch.crawl.Generator; @@ -39,6 +59,7 @@ import org.apache.nutch.parse.ParseText; import org.apache.nutch.protocol.Content; import org.apache.nutch.util.NutchConfiguration; +import org.apache.nutch.util.NutchJob; /** * This tool takes several segments and merges their data together. 
Only the @@ -482,7 +503,7 @@ if (LOG.isInfoEnabled()) { LOG.info(Merging + segs.length + segments to + out + / + segmentName); } -JobConf job = new JobConf(getConf()); +JobConf job = new NutchJob(getConf()); job.setJobName(mergesegs + out + / + segmentName); job.setBoolean(segment.merger.filter, filter); job.setLong(segment.merger.slice, slice); Modified: lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/se gm ent/SegmentReader.java?view=diffrev=524932r1=524931r2=524932 == --- lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java (original) +++ lucene/nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java Mon Apr 2 14:40:10 2007 @@ -17,18 +17,48 @@ package org.apache.nutch.segment; -import java.io.*; +import java.io.BufferedReader; +import java.io.BufferedWriter; +import java.io.IOException; +import java.io.InputStreamReader; +import java.io.OutputStreamWriter; +import java.io.PrintStream; +import java.io.PrintWriter; +import java.io.Writer; import java.text.SimpleDateFormat; -import java.util.*; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Date; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List
[VOTE] Release Apache Nutch 0.9
Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/rc2/ See the included CHANGES-0.9.txt file for details on release contents and latest changes. The release was made from the 0.9-dev trunk, including the recent patch applied by Dennis. I've also created a branch for this release candidate at: http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9. Please vote on releasing these packages as Apache Nutch 0.9. The vote is open for the next 72 hours. Only votes from Nutch committers are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... Thanks! Cheers, Chris
Re: [VOTE] Release Apache Nutch 0.9
Folks, As an FYI, here is a link to the log of the steps that I followed to get to this point in the release: http://people.apache.org/~mattmann/NUTCH_0.9_release_log_v2.doc Cheers, Chris On 4/2/07 10:52 PM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/rc2/ See the included CHANGES-0.9.txt file for details on release contents and latest changes. The release was made from the 0.9-dev trunk, including the recent patch applied by Dennis. I've also created a branch for this release candidate at: http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9. Please vote on releasing these packages as Apache Nutch 0.9. The vote is open for the next 72 hours. Only votes from Nutch committers are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... Thanks! Cheers, Chris
Re: [VOTE] Release Apache Nutch 0.9
Well, it's just going to add more work for me, but in the end, it's probably something that needs to be in there. I could go either way on this though, as in, if we don't commit it, 0.9.1 shouldn't be far off. Here's my +1 for going ahead and committing it... On 3/28/07 10:21 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Yes. This seems to have fixed the problem. All, do we want to create a JIRA and commit this for the 0.9 release? Dennis Andrzej Bialecki wrote: Doğacan Güney wrote: Hi, On 3/28/07, Dennis Kubes [EMAIL PROTECTED] wrote: This is definitely a hadoop problem. This is similar to the classpath issues that we were encountering before with Hadoop and the ReductTaskRunner. When I include the nutch-*.jar in the hadoop class path the errors go away. Not a fix but it proves the point that this is an issue with Hadoop class loading. Dennis Kubes Dennis, you were running SegmentMerger, I presume? This occurs probably because in SegmentMerger and SegmentReader's dump Nutch uses JobConf instead of NutchJob. Because of this Hadoop can't find the necessary job file. I put a simple patch at http://www.ceng.metu.edu.tr/~e1345172/use-nutch-job.patch . Can you try it with this? Duh, the patch seems to be exactly what's needed - thanks Doğacan! In the future we should rework the test suite to execute using a clean Hadoop installation, i.e. one where Hadoop daemons are started without Nutch classes on the classpath.
Re: Next release - 0.10.0 or 1.0.0 ?
My +1 for 1.0.0. I already changed it to 0.10.0, but this can be easily reverted, and was probably something that I should have brought to the attention of the dev list before I did that (sorry about that). In any case, I think 1.0.0 makes a lot of sense, politically and software-wise. Nutch is production quality software (we use it in production environments here at JPL), and deserves to have a 1.0.0 release... My 2 cents, Chris On 3/28/07 11:38 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Hi all, I know it's a trivial issue, but still ... When this release is out, I propose that we should name the next release 1.0.0, and not 0.10.0. The effect is purely psychological, but it also reflects our confidence in the platform. Many Open Source projects are afraid of going to 1.0.0 and seem to be unable to ever reach this level, as if it were a magic step beyond which they are obliged to make some implied but unjustified promises ... Perhaps it's because in the commercial world everyone knows what a 1.0.0 release means :) The downside of the version numbering that never reaches 1.0.0 is that casual users don't know how usable the software is - e.g. Nutch 0.10.0 could possibly mean that there are still 90 releases to go before it becomes usable. Therefore I propose the following:
* shorten the release cycle, so that we can make a release at least once every quarter. This was discussed before, and I hope we can make it happen, especially with the help of new forces that joined the team ;)
* call the next version 1.0.0, and continue in increments of 0.1.0 for each bi-monthly or quarterly release,
* make critical bugfix / maintenance releases using increments of 0.0.1 - although the need for such would be greatly diminished with the shorter release cycle.
* once we arrive at versions greater than x.5.0 we should plan for a big release (increment of 1.0.0).
* we should use only single digits for small increments, i.e. limit them to values between 0-9.
What do you think?
Re: [VOTE] Release Apache Nutch 0.9
I've gone ahead and figured out how to generate my GPG public key :-) It wasn't as hard as I thought. Anyways, I placed my gpg.txt file in ~mattmann/gpg.txt on people.apache.org. I've also added my GPG key to the KEYS file in the nutch dist directory, /www/www.apache.org/dist/lucene/nutch/, using the same convention as the others. To get the header, I did a gpg --list-keys. Thanks! Cheers, Chris On 3/27/07 8:14 AM, Chris Mattmann [EMAIL PROTECTED] wrote: Hi Sami, A very limited acid test shows that I can do crawling and searching through web app so that part is ok. Great! Similar tests of my own showed the same. About signatures: I can't find your public gpg key anywhere (to verify the signature), not in KEYS file nor in keyservers I checked. Am I just blind? Yeah, in my release log, I actually noted this. I was having a hard time figuring out how to generate my public gpg key. Do you know what command to run? I know where the KEYS file is in the dist directory, so I'm guessing I just: 1. Generate my public gpg key (I already have my private one I guess) 2. Add that public gpg key to the KEYS file in the Nutch dist directory on people.apache.org Am I right about this? If so, could you tell me the command to run to generate my public gpg key? The md5 format used differs from rest of lucene sub projects. According to the Apache sign and release guide (http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step), I ran the following command: openssl md5 nutch-0.9.tar.gz > nutch-0.9.tar.gz.md5 To create it in similar format as the rest of lucene one could use md5sum file > file.md5 We should probably adopt to same convention or wdot? It's fine by me, but, just for my reference, what's the difference between using the openssl md5 versus md5sum? If you want me to regenerate it, just let me know... Cheers, Chris -- Sami Siren
Re: [VOTE] Release Apache Nutch 0.9
Hey Sami, Well the sum itself is obviously the same :) The point in this is to use same conventions in Lucene family, not strictly required, but still IMO it just looks better. Okey dok -- I will run the md5sum command, and generate a .md5 for the nutch release that matches that. I will put it in the same place as the current md5 -- it should be there in 5 mins. Thanks! Cheers, Chris -- Sami Siren
Initiation of 0.9 release process
Hi Folks, As your friendly neighborhood 0.9 release manager, I just wanted to give you all a heads up that I'd like to begin the release process today. If I hear no objections by 00:00:00 UTC time, I will begin the release process then. I will notify the list as soon as I'm done. Thanks! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Initiation of 0.9 release process
Hey Dennis, I'm basically going to follow the release process on the wiki (pointed to by Doug), and the steps that I discussed with you and Sami (posted to the dev list). In terms of help, if there's anything in those steps that I get stuck on, I'll hollar at ya. Otherwise, if the process goes smoothly, I can probably get it done on my own. Thanks for the offer: I'll be sure to call on you if I get stuck. :-) Cheers, Chris On 3/26/07 10:06 AM, Dennis Kubes [EMAIL PROTECTED] wrote: Let me know if I can help in any way? Dennis Kubes Chris Mattmann wrote: Hi Folks, As your friendly neighborhood 0.9 release manager, I just wanted to give you all a heads up that I'd like to begin the release process today. If I hear no objections by 00:00:00 UTC time, I will begin the release process then. I will notify the list as soon as I'm done. Thanks! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Nutch 0 .9 release progress update
Hi Folks, Just to update everyone on progress. I've made it to Step 13 (waiting for release to appear on mirrors) in the Release Process: http://wiki.apache.org/nutch/Release_HOWTO You can view a full log of the fun that I've been having by going to: http://people.apache.org/~mattmann/NUTCH_0.9_release_log.doc Tomorrow when I wake up (here in Los Angeles, Pacific Standard Time), I will go ahead and wrap up the rest of the process. Thanks to all the folks who've given me guidance along the way. It's been interesting figuring out the process. Thanks! Cheers, Chris
Re: Nutch 0 .9 release progress update
Hi Sami, Thanks for the heads up! :-) Okay, so I did the following: 1. Removed nutch-0.9.* from people.apache.org:/www/www.apache.org/dist/lucene/nutch 2. Removed CHANGES-0.9.txt from the same place I will send out a separate email calling for a vote (thanks for the pointer to the example!) Thanks! Cheers, Chris On 3/26/07 10:22 PM, Sami Siren [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Folks, Just to update everyone on progress. I've made it to Step 13 (waiting for release to appear on mirrors) in the Release Process: Chris, thanks for your work so far. Seems like we're missing one important point in the rtfm: release review vote. Every apache release should be voted before it is made official. Three binding votes are required (I believe we now have enough active committers to do it this way?). So please put the artifacts in a staging area and call a vote before going further. (there's a nice example here for a vote mail: http://www.mail-archive.com/dev@jackrabbit.apache.org/msg04641.html) -- Sami Siren
[VOTE] Release Apache Nutch 0.9
Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release at http://people.apache.org/~mattmann/nutch_0.9/ See the included CHANGES-0.9.txt file for details on release contents and latest changes. The release was made from the 0.9-dev trunk. Please vote on releasing these packages as Apache Nutch 0.9. The vote is open for the next 72 hours. Only votes from Nutch committers are binding, but everyone is welcome to check the release candidate and voice their approval or disapproval. The vote passes if at least three binding +1 votes are cast. [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... Thanks! Cheers, Chris
Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt
Hi Dennis, Not to nit-pick, but the place where you inserted your change isn't at the end (where they typically should be placed). You inserted in the middle of the file, throwing off the numbering (there are now 2 sets of 18, and 19 in the unreleased changes section). Could you please append your changes to the end of the file, and recommit? Thanks a lot! Cheers, Chris On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Sat Mar 10 10:03:07 2007 New Revision: 516759 URL: http://svn.apache.org/viewvc?view=revrev=516759 Log: Updated to reflect commits of NUTCH-233 and NUTCH-436. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diffrev=5167 59r1=516758r2=516759 == --- lucene/nutch/trunk/CHANGES.txt (original) +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007 @@ -50,6 +50,13 @@ 17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab) +18. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan +Groschupf via kubes) + +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL + path is empty (kubes) + + ** WARNING !!! * This upgrade breaks data format compatibility. A tool 'convertdb' * * was added to migrate existing CrawlDb-s to the new format. Segment data *
Re: svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt
Dennis, No probs. Thanks, a lot! Cheers, Chris On 3/10/07 5:35 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Dennis, Not to nit-pick, but the place where you inserted your change isn't at the end (where they typically should be placed). You inserted in the middle of the file, throwing off the numbering (there are now 2 sets of 18, and 19 in the unreleased changes section). Could you please append your changes to the end of the file, and recommit? Thanks a lot! Cheers, Chris Sorry about that. I say the warning message thinking it was a version break. Everything should be fixed now. Dennis Kubes On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Sat Mar 10 10:03:07 2007 New Revision: 516759 URL: http://svn.apache.org/viewvc?view=revrev=516759 Log: Updated to reflect commits of NUTCH-233 and NUTCH-436. Modified: lucene/nutch/trunk/CHANGES.txt Modified: lucene/nutch/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/lucene/nutch/trunk/CHANGES.txt?view=diffrev=51 67 59r1=516758r2=516759 == --- lucene/nutch/trunk/CHANGES.txt (original) +++ lucene/nutch/trunk/CHANGES.txt Sat Mar 10 10:03:07 2007 @@ -50,6 +50,13 @@ 17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab) +18. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan +Groschupf via kubes) + +19. NUTCH-436 - Incorrect handling of relative paths when the embedded URL + path is empty (kubes) + + ** WARNING !!! * This upgrade breaks data format compatibility. A tool 'convertdb' * * was added to migrate existing CrawlDb-s to the new format. Segment data *
Re: [jira] Commented: (NUTCH-384) Protocol-file plugin does not allow the parse plugins framework to operate properly
Hi Andrzej, Yep, +1. I also want to make a small update, where instead of creating a new NutchConf object, to just pass it through (maybe via the protocol layer?). Does this make sense? Cheers, Chris On 3/8/07 1:47 PM, Andrzej Bialecki (JIRA) [EMAIL PROTECTED] wrote: [ https://issues.apache.org/jira/browse/NUTCH-384?page=com.atlassian.jira.plugin .system.issuetabpanels:comment-tabpanel#action_12479442 ] Andrzej Bialecki commented on NUTCH-384: - +1 - although the patch needs whitespace cleanup before committing (indentation should be 2 literal spaces, if keyword should be separated by one space from the parens). Protocol-file plugin does not allow the parse plugins framework to operate properly - -- Key: NUTCH-384 URL: https://issues.apache.org/jira/browse/NUTCH-384 Project: Nutch Issue Type: Bug Affects Versions: 0.8, 0.8.1, 0.9.0 Environment: All Reporter: Paul Ramirez Assigned To: Chris A. Mattmann Attachments: file_protocol_mime_patch.diff When using the file protocol one can not map a parse plugin to a content type. The only way to get the plugin called is through the default plugin. The issue is that the content type never gets mapped. Currently the content type does not get set by the file protocol.
Re: [jira] Commented: (NUTCH-384) Protocol-file plugin does not allow the parse plugins framework to operate properly
Hi Andrzej, Ah, yep, you're right. I just did a cursory inspection, and hadn't applied the patch (yet). I didn't notice it was in the main method. Kk, sounds good. I am applying patch now, and will test later this afternoon, fix the whitespace stuff, and then commit. Thanks! Cheers, Chris On 3/8/07 1:55 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Andrzej, Yep, +1. I also want to make a small update, where instead of creating a new NutchConf object, to just pass it through (maybe via the protocol layer?). Does this make sense? I'm not sure what you mean - the only place where this patch creates a Configuration object is in File.main(), which is innocuous.
0.9 release
Hi Folks, As suggested by Sami, I'm moving this discussion to the nutch-dev list. Seems like I am the guy that is going to do the Nutch 0.9 release :-) However, it seems also that there are some issues that need to be sorted out first. I'd like to follow up to Andrzej's email about loose ends before moving forward with the release. So, here are my questions:

1. What remaining issues out there need to be applied to the sources (or have patches contributed, then applied) and make it into 0.9? There were some discussions about this, however, I don't think we have a concrete set yet. The answer I'm looking for would be something like:
A. NUTCH-XXX (has a patch), NUTCH-YYY (has a patch) before 0.9 is made
B. NUTCH-ZZZ (patch in progress) before 0.9 is made
C. We've got enough in 0.9-dev in the trunk right now to make a 0.9 release

2. Any outstanding things that need to get done that aren't really code that needs to get committed, e.g., things we need to close the loop on?

3. Release Manager: I've got this taken care of, as soon as you all give me the green light.

So, please, committer-brethren, let me know what you think about 1-3, as it would help me understand how to move forward. Thanks! Cheers, Chris
Re: Issues pending before 0.9 release
Hi Guys, Blocker * NUTCH-400 (Update add missing license headers) - I believe this is fixed and should be closed +1, thanks to Sami for closing it. * NUTCH-353 (pages that serverside forwards will be refetched every time) - this was partially fixed in NUTCH-273, but a more complete solution would require significant changes to LinkDb. As there are no patches implementing this, I left it open, but it's no longer as critical as it was before. I propose to move it to Major and address it in the next release. +1 * NUTCH-233 (wrong regular expression hang reduce process for ever) - I propose to apply the fix provided by Sean Dean and close this issue for now. +1 Critical * NUTCH-436 (Incorrect handling of relative paths when the embedded URL path is empty). There is no patch available yet. If someone could contribute a patch I'd like to see this fixed before the release. Looks like Dennis is on this one * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's certainly not critical (as this is an optional new feature). I propose to change it to Major, and make a decision - do we want another plugin like parse-mp3 or parse-rtf, or not. Let's hold off on this: it's not necessary for 0.9, and I don't think there's been a bunch of traffic on the list identifying this as critical to get into the sources for the release * NUTCH-381 (Ignore external link not work as expected) - I'll try to reproduce it, and if I find an easy fix I'd like to apply it before the release. +1 * NUTCH-277 (Fetcher dies because of max. redirects) - I wasn't able to reproduce it. If there is no updated information on this I propose to close it with Can't reproduce. +1, I had to do something similar with NUTCH-258 * NUTCH-167 (Observation of META NAME=ROBOTS CONTENT=NOARCHIVE) - there's a patch which I tested in a limited production env. If there are no objections I'd like to apply it before the release. +1 Major = There are 84 major issues, but some of them are either invalid, or should be minor, or no longer apply and should be closed. Please review them if you can and provide some comments or recommendations if you think you have some new information. I will spend some time going through JIRA today and see if there's any issues that I can find that: 1. Have a patch already 2. Sound like something quick, easy, and not so far-reaching across the entire Nutch API One decision also that we need to make is which version of Hadoop should be included in the release. Current trunk uses 0.10.1, I have a set of production-tested patches that use 0.11.2, and today the Hadoop team released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time before our release). The most conservative option is to stay with 0.10.1, but by the time people start using Nutch this will be a fairly old version already. I propose to upgrade to 0.11.2. We could use 0.12.1 - but in this case with the expectation that we release less than stable version of Nutch to be soon followed by a minor stable release ... I'd agree with the upgrade to 0.11.2, +1 Cheers, Chris P.S. I am going to contact Pitor and coordinate with him: I'd like to be the release manager for this Nutch release.
Re: Welcome Dennis Kubes as Nutch committer
Dennis, I take my coffee black: with a single creamer ;) Okay, okay, sorry: I thought we were talking about *real* hazing ;) Cheers, Chris On 2/28/07 12:31 PM, Dennis Kubes [EMAIL PROTECTED] wrote: Hi All, Thank you Andrzej for your kind words. I am looking forward to working together with everyone, and I hope I can continue to be too inquisitive. I don't know if I can introduce myself shortly, but I will try. ;) For those that don't know me, I am based in Plano (Dallas), Texas. I am 28 and have been programming for about 12 years. So as my first commit I need to add my name and re-publish the website. Let the hazing begin. Dennis Kubes Andrzej Bialecki wrote: Hi all, Some time ago I proposed to the Lucene PMC that Dennis should become a Nutch committer. Dennis has been found guilty of providing too many good quality patches, sending too many supportive emails to the mailing lists, and generally being too inquisitive in nature, which led to a constant stream of comments, suggestions and patches. We weren't able to keep up - something had to be done about it ... ;) I'm glad to announce that the Lucene PMC has voted in his favor. Congratulations and welcome aboard! (The tradition on Apache projects is that new committers should (shortly) introduce themselves, and as their first commit they should put their name in the Credits section of the website and re-publish the website). __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: log guards
Hi Doug and Jerome, Ah, yes, the log guard conversation. I remember this from a while back. Hmmm, do you guys know which issue this was recorded as in JIRA? I've had some free time recently, so I will be able to add this to my list of Nutch stuff to work on, and I would be happy to take the lead on removing the guards where needed, and reviewing whether or not the debug ones make sense where they are. Cheers, Chris On 2/13/07 11:17 AM, Jérôme Charron [EMAIL PROTECTED] wrote: These guards were all introduced by a patch some time ago. I complained at the time, and it was promised that this would be repaired, but it has not yet been. Yes, sorry Doug, that's my own fault; I really don't have time to fix this :-( Best regards Jérôme __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: RSS-fetcher and index individual - how can I realize this function
Hi Doug, Okay, I see your points. It seems like this would be really useful for some current folks, and for Nutch going forward. I see that there has been some initial work today toward preparing patches. I'd be happy to shepherd this into the sources. I will begin reviewing what's required, and contacting the folks who've begun work on this issue. Thanks! Cheers, Chris

On 2/7/07 1:31 PM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Got it. So, the logic behind this is: why bother waiting until the following fetch to parse (and create ParseData objects from) the RSS items out of the feed? Okay, I get it, assuming that the RSS feed has *all* of the RSS metadata in it. However, it's perfectly acceptable to have feeds that simply have a title, description, and link in them.

Almost. The feed may have less than the referenced page, but it's also a lot easier to parse, since the link could be an anchor within a large page, or could be a page that has lots of navigation links, spam comments, etc. So feed entries are generally much more precise than the pages they reference, and may make for a higher-quality search experience.

I guess this is still valuable metadata information to have; however, the caveat is that the implication of the proposed change is:
1. We won't have cached copies, or fetched copies, of the Content represented by the item links. Therefore, in this model, we won't be able to pull up a Nutch cache of the page corresponding to the RSS item, because we are circumventing the fetch step.

Good point. We indeed wouldn't have these URLs in the cache.

2. It sounds like a pretty fundamental API shift in Nutch, to support a single type of content, RSS. Even if there are more content types that follow this model, as Doug and Renaud both pointed out, there aren't a multitude of them (perhaps archive files, but can you think of any others)?

Also true. On the other hand, Nutch provides 98% of an RSS search engine. It'd be a shame to have to re-invent everything else, and it would be great if Nutch could evolve to support RSS well. Could image search also benefit from this? One could generate a Parse for each image on a page, whose text was drawn from the page. Product search too, perhaps.

The other main thing that comes to mind for me is that it prevents the fetched Content for the RSS items from being able to provide useful metadata, in the sense that it doesn't explicitly fetch the content. What if we wanted to apply some super cool metadata extractor X that used word-stemming, HTML design analysis, and other techniques to extract metadata from the content pointed to by an RSS item link? In the proposed model, we assume that the RSS item tag already contains all the metadata necessary for indexing, which in my mind limits the model. Does what I am saying make sense? I'm not shooting down the issue; I'm just trying to brainstorm a bit here.

Sure, the RSS feed may contain less than the page it references, but that might be all that one wishes to index. Otherwise, if, e.g., a blog includes titles from other recent posts, you're going to get lots of false positives. Ideally Nutch should support various options: searching the feed only, searching the referenced page only, or perhaps searching both. Doug
Re: RSS-fetcher and index individual - how can I realize this function
Guys, Sorry to be so thick-headed, but could someone explain to me in really simple language what this change is requesting that is different from the current Nutch API? I still don't get it, sorry... Cheers, Chris On 2/7/07 9:58 AM, Doug Cutting [EMAIL PROTECTED] wrote: Renaud Richardet wrote: I see. I was thinking that I could index the feed items without having to fetch them individually. Okay, so if Parser#parse returned a Map<String, Parse>, then the URL for each parse should be that of its link, since you don't want to fetch that separately. Right? So now the question is, how much impact would this change to the Parser API have on the rest of Nutch? It would require changes to all Parser implementations, to ParseSegment, to ParseUtil, and to Fetcher. But, as far as I can tell, most of these changes look straightforward. Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
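A minimal sketch of the change Doug describes, assuming the 0.9-era Nutch types (Parse, ParseData, ParseImpl, ParseStatus, Outlink, and Content are real classes; MultiParser, FeedItem, and the abstract items() hook are invented here purely for illustration):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseImpl;
    import org.apache.nutch.parse.ParseStatus;
    import org.apache.nutch.protocol.Content;

    // Hypothetical revision of the Parser extension point: one fetched
    // Content may yield several Parse objects, keyed by the URL each
    // parse should be indexed under (for RSS, each item's link).
    interface MultiParser {
      Map<String, Parse> parse(Content content);
    }

    // FeedItem stands in for whatever object model a feed library
    // hands back for a single item element.
    interface FeedItem {
      String getLink();
      String getTitle();
      String getDescription();
    }

    abstract class RssMultiParserSketch implements MultiParser {
      // Walking the XML is the feed library's job; elided here.
      protected abstract List<FeedItem> items(Content content);

      public Map<String, Parse> parse(Content content) {
        Map<String, Parse> parses = new HashMap<String, Parse>();
        for (FeedItem item : items(content)) {
          ParseData data = new ParseData(ParseStatus.STATUS_SUCCESS,
              item.getTitle(), new Outlink[0], content.getMetadata());
          // Keyed by the item's link, so no separate fetch is needed
          // to index the item as its own document.
          parses.put(item.getLink(),
              new ParseImpl(item.getDescription(), data));
        }
        return parses;
      }
    }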
Re: RSS-fetcher and index individual - how can I realize this function
Hi Doug, Since the target of the link must still be indexed separately from the item itself, how much use is all this? If the RSS document is considered a single page that changes frequently, and the items' links are considered ordinary outlinks, isn't much the same effect achieved? IMHO, yes. That's why it's been hard for me to understand the real use case for what Gal et al. are talking about. I've been trying to wrap my head around it, but it seems to me the capability they require is sort of already provided... Cheers, Chris Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: RSS-fetcher and index individual - how can I realize this function
Hi Gal, et al., I'd like to be explicit when we talk about what the issue with the RSS parsing plugin is here; I think we have had conversations similar to this before, and it seems that we keep talking around each other. I'd like to get to the heart of the matter so that the issue (if there is an actual one) gets addressed ;) Okay, so you mention below that the thing you see missing from the current RSS parsing plugin is the ability to store data in the CrawlDatum, and to parse it in the next fetch phase. Well, there are 2 options here for what you refer to as "it":
1. If you're talking about the RSS file, then in fact it is parsed, and its data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed.
2. If you're talking about the item links within the RSS file, in fact they are parsed (eventually), and their data is stored in the CrawlDatum, akin to any other form of content that is fetched, parsed, and indexed. This is accomplished by adding the RSS items as Outlinks when the RSS file is parsed: in this fashion, we go after all of the links in the RSS file and make sure that we index their content as well (see the sketch after this message).
Thus, if you had an RSS file R that contained links to a PDF file A and an HTML page P, then not only would R get fetched, parsed, and indexed, but so would A and P, because they are item links within R. Then queries that would match R (the physical RSS file) would additionally match things such as P and A, and all 3 would be capable of being returned in a Nutch query. Does this make sense? Is this the issue that you're talking about? Am I nuts? ;) Cheers, Chris

On 1/31/07 10:40 PM, Gal Nitzan [EMAIL PROTECTED] wrote: Hi, Many sites provide RSS feeds for several reasons, usually to save bandwidth, to give the users concentrated data, and so forth. Some of the RSS files supplied by sites are created specially for search engines, where each RSS item represents a web page on the site. IMHO the only thing missing in the parse-rss plugin is storing the data in the CrawlDatum and parsing it in the next fetch phase. Maybe adding a new flag to CrawlDatum that would mark the URL as parsable, not fetchable? Just my two cents... Gal.

-Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED]] Sent: Wednesday, January 31, 2007 8:44 AM To: nutch-dev@lucene.apache.org Subject: Re: RSS-fetcher and index individual - how can I realize this function

Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris

On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: Thanks for your reply. Maybe I didn't explain clearly. I want to index the item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains:
title: nutch-open source
description: nutch nutch nutch nutch nutch
url: http://lucene.apache.org/nutch
category: news
author: kauu
So, can the plugin parse-rss satisfy what I need?
<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
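To make the Outlink mechanism described above concrete, here is a minimal sketch of what parse-rss effectively does with item links. RssItem is an invented stand-in for the commons-feedparser object model, and the Outlink constructor is shown in its (toUrl, anchor) form; the exact signature varied across 0.8/0.9-era versions:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.nutch.parse.Outlink;

    class RssOutlinkSketch {
      // Invented stand-in for the feed library's per-item object.
      interface RssItem {
        String getLink();
        String getTitle();
      }

      // Each item's link becomes an ordinary outlink, so the linked page
      // is queued, fetched, parsed, and indexed like any other URL; the
      // resulting array goes into the feed's own ParseData.
      static Outlink[] itemOutlinks(List<RssItem> items) throws Exception {
        List<Outlink> outlinks = new ArrayList<Outlink>();
        for (RssItem item : items) {
          outlinks.add(new Outlink(item.getLink(), item.getTitle()));
        }
        return outlinks.toArray(new Outlink[outlinks.size()]);
      }
    }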
Re: RSS-fetcher and index individual - how can I realize this function
Hi there, I could most likely be of assistance, if you gave me some more information. For instance: I'm wondering if the use case you describe below is already supported by the current RSS parse plugin? The current RSS parser, parse-rss, does in fact index individual items that are pointed to by an RSS document. The items are added as Nutch Outlinks, and added to the overall queue of URLs to fetch. Doesn't this satisfy what you mention below? Or am I missing something? Cheers, Chris

On 1/30/07 6:01 PM, kauu [EMAIL PROTECTED] wrote: Hi folks: What I want to do is to separate an RSS file into several pages, just as has been discussed before. I want to fetch an RSS page and index it as different documents in the index, so that the searcher can search an item's info as an individual hit. My idea is to create a protocol to fetch the RSS page and store it as several pages, each containing just one item tag. But the unique key is the URL, so how can I store them with the item's link tag as the unique key for a document? So my question is how to realize this function in nutch-0.8.x. I've checked the code of the plugin protocol-http, but I can't find the code where a page is stored as a document. I want to separate the RSS page into several pages before storing it - not as one document but several. Can anyone give me some hints? Any reply will be appreciated!

An item's structure:

<item>
  <title>欧洲暴风雪后发制人 致航班延误交通混乱(组图)</title>
  <description>暴风雪横扫欧洲,导致多次航班延误 1月24日,几架民航客机在德国斯图加特机场内等待去除机身上冰雪。1月24日,工作人员在德国南部的慕尼黑机场清扫飞机跑道上的积雪。 据报道,迟来的暴风雪连续两天横扫中...</description>
  <link>http://news.sohu.com/20070125/n247833568.shtml</link>
  <category>搜狐焦点图新闻</category>
  <author>[EMAIL PROTECTED]</author>
  <pubDate>Thu, 25 Jan 2007 11:29:11 +0800</pubDate>
  <comments>http://comment.news.sohu.com/comment/topic.jsp?id=247833847</comments>
</item>
Re: RSS-fetcher and index individual - how can I realize this function
Hi there, On 1/30/07 7:00 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Chris, I saw your name associated with the RSS parser in Nutch. My understanding is that Nutch is using feedparser. I had two questions:
1. Have you looked at vtd as an RSS parser?
I haven't, in fact; what are its benefits over those of commons-feedparser?
2. Any view on asynchronous communication as the underlying protocol? I do not believe that feedparser uses that at this point.
I'm not sure exactly what asynchronous communication affords you when parsing RSS feeds: what type of communications are you talking about above? Nutch handles the communications layer for fetching content using a pluggable, Protocol-based model. The only feature that Nutch's RSS parser uses from the underlying feedparser library is its object model and callback framework for parsing RSS/Atom feed XML documents. When you mention asynchronous above, are you talking about the protocol for fetching the different RSS documents? Thanks! Cheers, Chris
Re: RSS-fetcher and index individual - how can I realize this function
Hi there, With the explanation that you give below, it seems like parse-rss as it exists would address what you are trying to do. parse-rss parses an RSS channel as a set of items, and indexes overall metadata about the RSS file, including parse text and index data, but it also adds each item (in the channel)'s URL as an Outlink, so that Nutch will process those pieces of content as well. The only thing that you suggest below that parse-rss currently doesn't do is allow you to associate the metadata fields category: and author: with the item Outlink... Cheers, Chris

On 1/30/07 7:30 PM, kauu [EMAIL PROTECTED] wrote: Thanks for your reply. Maybe I didn't explain clearly. I want to index the item as an individual page, so that when I search for something, for example "nutch-open source", Nutch returns a hit which contains:
title: nutch-open source
description: nutch nutch nutch nutch nutch
url: http://lucene.apache.org/nutch
category: news
author: kauu
So, can the plugin parse-rss satisfy what I need?

<item>
  <title>nutch--open source</title>
  <description>nutch nutch nutch nutch nutch</description>
  <link>http://lucene.apache.org/nutch</link>
  <category>news</category>
  <author>kauu</author>
</item>
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Hi Doug, So, does this render the patch that I wrote obsolete? Cheers, Chris On 1/25/07 10:08 AM, Doug Cutting [EMAIL PROTECTED] wrote: Scott Ganyo (JIRA) wrote: ... since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter static constructor ... FYI, Hadoop no longer does this. Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
It's at least out-of-date, and perhaps obsolete. A quick read of Fetcher.java suggests there might be a case where a fatal error is logged but the fetcher doesn't exit, in FetcherThread#output(). So this raises an interesting question: people out there (such as Scott G.) -- are you folks still experiencing similar problems? Do the recent Hadoop changes alleviate the bad behavior you were experiencing? If so, then maybe this issue should be closed... Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Reviving Nutch 0.7
Before doubling (or, after 0.9.0, tripling?) the maintenance/development work, please consider the following: One option would be refactoring the code in such a way that the parts that are usable by other projects - protocols, parsers (this was actually proposed by Jukka Zitting some time last year), and such - would be modified to be independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it would require a significant amount of work. The more focused, smaller chunks of Nutch would probably also get a bigger audience (perhaps also outside Nutch land), and that way perhaps more people willing to work on them. I don't know about others, but at least I would be more willing to work towards this goal than toward one where there would be many practically separate projects, each sharing common functionality but with a different code base.

+1 ;) This was actually the project proposed by Jerome Charron and myself, called Tika. We went so far as to create a project proposal and send it out to the nutch-dev list, as well as to the Lucene PMC, for potential Lucene sub-project goodness. I could probably dig up the proposal should the need arise. Good ol' Jukka then took that effort and created us a project within Google Code, where it still lives in fact: http://code.google.com/p/tika/ There hasn't been active development on it because:
1. None of us (I'm speaking for Jerome and myself here) ended up having the time to shepherd it going forward
2. There was little, if any, response to the proposal on the nutch-dev list, and few folks willing to contribute (besides people like Jukka)
3. I think, as you correctly note above, most people thought it too much of a Herculean effort that wouldn't pay the necessary dividends in the end
In any case, I think that if we are going to maintain separate branches of the source - in fact, really parallel projects - then an undertaking such as Tika is probably needed ... Cheers, Chris -- Sami Siren
Re: How to Become a Nutch Developer
Hi Dennis, On 1/21/07 11:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote: All, I am working on a "How to Become a Nutch Developer" document for the wiki and I need some input. I need an overview of how the process for JIRA works. If I am a developer new to Nutch, just starting to look at JIRA, and I want to start working on some piece of functionality or to help with bug fixes, where would I look?

JIRA provides a lot of search facilities: it's actually kind of nice. The starting point for browsing bugs and other types of issues is: http://issues.apache.org/jira/browse/NUTCH (in general, for all Apache projects that use JIRA, you'll find that their issue tracking system boils down to http://issues.apache.org/jira/browse/APACHE_PROJ_JIRA_ID). From there, you can access canned filters for open issues by priority: Blocker, Critical, Major, Minor, Trivial. For more detailed search capabilities, click on the Find Issues button in the top breadcrumb bar. Search capabilities there include the ability to look for issues by developer, status, and issue type, and to combine such fields using AND and OR. Additionally, you can issue a free-text query across all issues by using the free-text box there.

Would I just choose something that is unscheduled and begin working on it?

That's a good starting point. Additionally, high-priority issues marked as Blocker, Critical, and Major are always good, because the sooner we (the committers) get a patch for those, the sooner we'll be testing it for inclusion into the sources.

What if I see something that I want to work on but it is scheduled to somebody else?

Walk five paces opposite your opponent: turn, then sho...err, wait. Nah, you don't have to do that. ;) Just speak up on the mailing list and volunteer your support. One of the people listed in the nutch-developers group in JIRA (e.g., the committers) can reassign the issue to you, so long as the other gent it was assigned to doesn't mind...

Are items only scheduled to committers, or can they be scheduled to developers as well? If they can be scheduled to regular developers, how does someone get their name on the list to be scheduled items?

Items can be scheduled to folks listed in the nutch-developers group within JIRA. Most of these folks are the committers; however, not all of them are. I'm not entirely sure how folks get into that group (maybe Doug knows?); however, that's the real criterion for having a JIRA issue officially assigned to you. That doesn't mean you can't work on things without that, though. If there's an issue that you'd like to contribute to, please prepare a patch, attach it to JIRA, and then speak up on the mailing list. Chances are, with the recent busy schedules of the committers (including myself) besides Sami and Andrzej, the committers don't have time to prepare patches for the issues assigned to them. If you contribute a great patch, the committer will pick it up, test it, and apply it, and you'll get the same effect as if the issue were directly assigned to you.

Should I submit a JIRA issue and/or notify the list before I start working on something? What is the common process for this?

Yup, that's pretty much it. Voice your desire to work on a particular task on the nutch-dev list. Many of the developers on that list have been around for a while now, and they know what's been discussed and implemented before.

When I submit a JIRA issue, is there anything else I need to do, either in the JIRA system or with the mailing lists, committers, etc.?
Nope: the nutch-dev list is automatically notified of all JIRA issue submissions, and the committers (and the rest of the folks) will pick up on this and act accordingly. Getting this information together in one place will go a long way toward helping others start contributing more and more. Thanks for all your input. No probs, glad to be of service :-) Cheers, Chris Dennis Kubes
Re: Next Nutch release
Folks, When would you like to make the release? I've been working on NUTCH-185, but got a bit bogged down with other work. If there is interest in having NUTCH-185 included in the release, I could make a push to get a patch out by week's end... As for the rest, my +1 for NUTCH-61 being included sooner rather than later. It seems that the patch has garnered enough use and attention that folks would like to see it in the release. I think the email from the user trying to manage a terabyte of data a few days back was particularly telling. Cheers, Chris

On 1/16/07 8:19 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Sami Siren wrote: Hello, It has been a while since the previous release (0.8.1), and looking at the great fixes done in trunk I'd start thinking about baking a new release soon. Looking at the JIRA roadmaps, there is 1 blocking issue (fixing the license headers) for 0.8.2 and two other blocking issues for 0.9.0, of which I think NUTCH-233 is safe to put in.

Agreed. The replacement regex mentioned in the original comment seems safe enough, and simpler.

The top 10 voted issues are currently:
NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content
  Well ... I'm of a split mind on this. I can bring this patch up to date and apply it before 0.9.0, if we understand that this is a "0" release ... ;) Otherwise I'd prefer to wait with it until right after the release. I would also like to proceed with NUTCH-339 (the Fetcher2 patches, plus some changes I made in the meantime), since I'd like to expose the new fetcher to a broader audience, and it doesn't affect the existing implementation.
NUTCH-48 "Did you mean" query enhancement/refinement feature
NUTCH-251 Administration GUI
NUTCH-289 CrawlDatum should store IP address
  I'm still not entirely convinced about this - and there is already a mechanism in place to support it if someone really wishes to keep this particular info (CrawlDatum.metaData).
NUTCH-36 Chinese in Nutch
NUTCH-185 XMLParser, a configurable XML parser plugin
NUTCH-59 Metadata support in WebDB
NUTCH-92 DistributedSearch incorrectly scores results
  This is too intrusive to fix just before the release - and needs additional discussion.
NUTCH-68 A tool to generate arbitrary fetchlists
  Easy to port this to 0.9.0 - I can do this.
NUTCH-87 Efficient site-specific crawling for a large number of sites
Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
Hi Sami, On 12/9/06 2:27 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: siren Date: Sat Dec 9 14:27:07 2006 New Revision: 485076 URL: http://svn.apache.org/viewvc?view=rev&rev=485076 Log: Optimize SpellCheckedMetadata further by taking into account the fact that it is used only for http-headers. I am starting to believe that spellchecking should just be a utility method used by http protocol plugins.

I think that right now I'm -1 on this change. I would make note of all the comments on NUTCH-139, from which this code was born. In the end, I think what we all realized was that the spell-checking capability is necessary, but not everywhere, as you point out. However, I don't think it's limited entirely to HTTP headers (which is what you've currently changed the code to). I think it should be implemented as a protocol-layer service, also providing spell-checking support to other protocol plugins, like protocol-file, etc., where field headers run the risk of being misspelled as well. What's to stop someone from implementing a protocol-file++ that returns different file header keys than those of protocol-file? Just b/c HTTP is the most pervasively used plugin right now, I think it's too convenient to assume that only HTTP protocol field keys may need spell-checking services. Just my 2 cents... Cheers, Chris
Re: svn commit: r485076 - in /lucene/nutch/trunk/src: java/org/apache/nutch/metadata/SpellCheckedMetadata.java test/org/apache/nutch/metadata/TestSpellCheckedMetadata.java
Hi Sami, Indeed, I see your point. I guess what I was advocating was more of a ProtocolHeaders interface, living in org.apache.nutch.metadata. Then we could update the code that you have below to use ProtocolHeaders.class rather than HttpHeaders.class. We would then make ProtocolHeaders extend HttpHeaders, so that by default it inherits all of the HttpHeaders while still allowing more ProtocolHeaders met keys (e.g., we could have an interface for FileHeaders, etc.). What do you think about that? Alternatively, we could just create a ProtocolHeaders interface in org.apache.nutch.metadata that aggregates all the met key fields from HttpHeaders, and it would be the place where the met key fields for FileHeaders, etc. could go. Let me know what you think, and thanks! Cheers, Chris

On 12/9/06 3:53 PM, Sami Siren [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Sami, On 12/9/06 2:27 PM, [EMAIL PROTECTED] wrote: Author: siren Date: Sat Dec 9 14:27:07 2006 New Revision: 485076 URL: http://svn.apache.org/viewvc?view=rev&rev=485076 Log: Optimize SpellCheckedMetadata further by taking into account the fact that it is used only for http-headers. I am starting to believe that spellchecking should just be a utility method used by http protocol plugins. I think that right now I'm -1 on this change. I would make note of all the comments on NUTCH-139, from which this code was born. In the end, I think what we all realized was that the spell-checking capability is necessary, but not everywhere, as you point out. However, I don't think it's limited entirely to HTTP headers (which is what you've currently changed the code to). I think it should be implemented as a protocol-layer service, also providing spell-checking support to other protocol plugins, like protocol-file, etc.,

In protocol-file all headers are artificial and generated in Nutch code, so if there's a spelling mistake there then we should fix the code generating the headers, and not rely on spell checking in the first place.

where field headers run the risk of being misspelled as well. What's to stop someone from implementing a protocol-file++ that returns different file header keys than those of protocol-file? Just b/c HTTP is the most pervasively used plugin right now, I think it's too convenient to assume that only HTTP protocol field keys may need spell-checking services.

If there's a real need for spell checking on other keys, one can just add more classes to the array - no big deal. -- Sami Siren
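A minimal sketch of the ProtocolHeaders idea floated above, assuming the org.apache.nutch.metadata.HttpHeaders interface of that era; the file-header constants are invented purely for illustration and were never part of Nutch:

    package org.apache.nutch.metadata;

    // Hedged sketch, not committed code: inherit the HTTP header names
    // and give other protocols one shared place to declare theirs, so a
    // utility like SpellCheckedMetadata can key off a single interface.
    public interface ProtocolHeaders extends HttpHeaders {
      // Hypothetical keys a file: (or ftp:) protocol plugin might emit.
      String FILE_NAME = "File-Name";
      String FILE_LAST_MODIFIED = "File-Last-Modified";
    }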
Re: [jira] Updated: (NUTCH-379) ParseUtil does not pass through the content's URL to the ParserFactory
Hi Guys, Can we disable the selection of released versions within JIRA for issues, so that people like me don't keep getting confused? Thanks! Cheers, Chris On 10/13/06 9:32 AM, Sami Siren (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-379?page=all ] Sami Siren updated NUTCH-379: Fix Version/s: (was: 0.8.1) (was: 0.8) - cannot fix released versions

ParseUtil does not pass through the content's URL to the ParserFactory
Key: NUTCH-379
URL: http://issues.apache.org/jira/browse/NUTCH-379
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8.1, 0.8, 0.9.0
Environment: Power Mac Dual G5, 2.0 GHz, although the fix is independent of environment
Reporter: Chris A. Mattmann
Assigned To: Chris A. Mattmann
Fix For: 0.8.2, 0.9.0
Attachments: NUTCH-379.Mattmann.100406.patch.txt

Currently the ParseUtil class that is called by the Fetcher to actually perform the parsing of content does not forward through the content's URL for use in the ParserFactory. A bigger issue, however, is that the URL (and, for that matter, the pathSuffix) is no longer used to determine which parsing plugin should be called. My colleague at JPL discovered that more major bug and will soon input a JIRA issue for it. However, in the meantime, this small patch at least sets up the forwarding of the content's URL to the ParserFactory.
Nutch requires JDK 1.5 now?
Hi Folks, I noticed that Nutch now requires JDK 5 in order to compile, due to recent changes to the PluginRepository and some other classes. I think that this is a good move; however, I wasn't sure that I had seen any official announcement that Nutch now requires 1.5... Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Nutch requires JDK 1.5 now?
The switch to 1.5 format was also logged in JIRA issue http://issues.apache.org/jira/browse/NUTCH-360 -- Sami Siren Ahh, I didn't see this. Way to go Sami, I love it when people actually keep records of changes! ;) Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Nutch requires JDK 1.5 now?
Hey Guys, Speaking of which, I noticed that Sami's issue below is a Task in JIRA, which reminded me of a task that I input a long time ago that would be nice to fix real quick (for those with JIRA permissions to do so): http://issues.apache.org/jira/browse/NUTCH-304 We should really change the email address for JIRA to not use the Apache Incubator one anymore, and to use the Lucene one. Sound good? If so, could someone with permissions please take care of it? :-) Cheers, Chris On 10/3/06 9:04 AM, Sami Siren [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: Chris Mattmann wrote: Hi Folks, I noticed that Nutch now requires JDK 5 in order to compile, due to recent changes to the PluginRepository and some other classes. I think that this is a good move; however, I wasn't sure that I had seen any official announcement that Nutch now requires 1.5... This is a proactive change - as soon as we upgrade to Hadoop 0.6.x we will lose 1.4 compatibility anyway, so we may as well prepare in advance. Also, "now" refers to the unreleased 0.9; we will keep branch 0.8.x compatible with 1.4. The switch to 1.5 format was also logged in JIRA issue http://issues.apache.org/jira/browse/NUTCH-360 -- Sami Siren __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Patch Available status?
Hi Doug, But the nutch-developers JIRA group pretty closely corresponds to Nutch's committers, so perhaps all committers should be permitted to close, although this should be exercised with caution, and only at releases, since closes cannot be undone in this workflow. Another alternative would be to construct a new workflow that just adds the Patch Available status and still permits issues to be re-opened. Which sounds best for Nutch? Good question. Well, my personal preference would be for one that allows issue closes to be undone, as I've seen several cases (even some recent ones, such as NUTCH-258) where someone in the nutch-developers group (including myself) has closed an issue that users in fact don't believe is resolved. So my +1 for the 2nd option above: an alternative workflow to the Hadoop one that simply adds the Patch Available status and still permits issues to be re-opened. Just my 2 cents. Thanks! Cheers, Chris Doug
Re: Patch Available status?
Hi Doug and Andrzej, +1. I think that workflow makes a lot of sense. Currently users in the nutch-developers group can close and resolve issues. In the Hadoop workflow, would this continue to be the case? Cheers, Chris On 8/30/06 3:14 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doug Cutting wrote: Sami Siren wrote: I am not able to do it either, or then I just don't know how, can Doug help us here? This requires a change the the project's workflow. I'd be happy to move Nutch to use the workflow we use for Hadoop, which supports Patch Available. This workflow has one other non-default feature, which is that bugs, once closed, cannot be re-opened. This works as follows: Only project administrators are allowed to close issues. Bugs are resolved as they're fixed, and only closed when a release is made. This keeps the release notes Jira generates from changing after a release is made. Would you like me to switch Nutch to use this Jira workflow? +1, this would finally make sense with the resolved vs. closed ...
Re: 0.8 not loading plugins
Hi Chris, It seems from your email message that your plugin is located in $NUTCH_HOME/build/custom-meta? Is this where your plugin *code* is currently stored? If so, this is the wrong location and the most likely reason that your plugin isn't being loaded. Plugin code should live in $NUTCH_HOME/src/plugin, so in your case you'd have /usr/local/nutch-0.8/src/plugin/custom-meta, with the underlying plugin code dir structure underneath there. Then, to deploy your plugin to the build directory (which is $NUTCH_HOME/build/plugins), you would type: ant deploy. Give this a shot and see if that fixes it. Cheers, Chris

On 8/17/06 3:05 PM, Chris Stephens [EMAIL PROTECTED] wrote: It's definitely not trying to load my plugin; I added that debug setting and didn't see anything regarding my plugin. One thing I noticed is that my plugin is not in the plugins directory. At what point do the plugins get copied there? Here is the output from my compile:

  compile:
    [echo] Compiling plugin: custom-meta
    [javac] Compiling 3 source files to /usr/local/nutch-0.8/build/custom-meta/classes
    [javac] Note: Some input files use or override a deprecated API.
    [javac] Note: Recompile with -Xlint:deprecation for details.
  jar:
    [jar] Building jar: /usr/local/nutch-0.8/build/custom-meta/custom-meta.jar
  deps-test:
  deploy:
    [copy] Copying 1 file to /usr/local/nutch-0.8/build/plugins/custom-meta

HUYLEBROECK Jeremy RD-ILAB-SSF wrote: Did you check if your plugin.xml is read, by putting the plugin package in debug mode? (Put this in log4j.properties): log4j.logger.org.apache.nutch.plugin=DEBUG

-Original Message- From: Chris Stephens [mailto:[EMAIL PROTECTED]] Sent: Thursday, August 17, 2006 2:30 PM To: nutch-dev@lucene.apache.org Subject: Re: 0.8 not loading plugins

I have this line in src/plugin/build.xml under the deploy section:

  <ant dir="custom-meta" target="deploy" />

The plugin is compiling OK. I spent several days getting errors on compile and investigating how to port them to 0.8.

Jonathan Addison wrote: Hi Chris, Chris Stephens wrote: I think I finally have my plugin ported to 0.8; however, I cannot get my plugin to load. My plugin.includes property in conf/nutch-site.xml has the following value:

  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|custom-meta</value>

My plugin is the 'custom-meta' entry at the end. My plugin never shows up in the Registered Plugins list in hadoop.log, and lines in my plugin that call logger.info never show up either. Is there a step I am missing with 0.8? What should I do next to debug the problem? Have you also added your plugin to src/plugin/build.xml? Thank you, Chris Stephens
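To make the layout concrete, here is a sketch of where a custom plugin's pieces live, assuming the standard Nutch 0.8 plugin conventions (plugin.xml and build.xml are the usual per-plugin files; the source subpath shown is illustrative):

  src/plugin/custom-meta/
    plugin.xml   <- declares the plugin and its extension point(s)
    build.xml    <- per-plugin build file; typically imports ../build-plugin.xml
    src/java/    <- the plugin's Java sources

Combined with the one-line deploy entry in src/plugin/build.xml quoted above, running "ant deploy" then compiles the plugin and copies it into $NUTCH_HOME/build/plugins, where the plugin.includes property can pick it up.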
Re: Tika update
Hi Jukka, Thanks for your email. Indeed, there was discussion on the Lucene PMC email list about the Tika project. It was decided by the powers that be to discuss it more on the Nutch mailing list before moving forward with any vote on making Tika a sub-project of Apache Lucene. With regard to that, my action was to send the Tika proposal to the nutch-dev list, and to help start up a discussion on Tika, to get feedback from the community. Seeing as you've lit the fire under this (thanks!), it's only appropriate for me to send out the Tika project proposal sent to the Lucene PMC. So, here it is, attached. I'd love to hear feedback from the Nutch community on what it thinks of such a project. Cheers, Chris On 8/16/06 4:06 AM, Jukka Zitting [EMAIL PROTECTED] wrote: Hi, There was recently discussion on perhaps starting a new Lucene sub-project, named Tika, to create a general-purpose library from the parser components and other features in Nutch that might interest a wider audience. To keep things rolling we've created a temporary staging area for the project at http://code.google.com/p/tika/ on Google Code, and I've started to flesh out a potential project structure using Maven 2. Note that the project materials in svn refer to the project as Apache Tika even though the project has *not* been officially accepted. The reason for this is that the Google Code project is just a temporary staging ground, and I wanted to give a better idea of what the project could look like if accepted. The jury is still out on whether to start a project like this, so any comments and feedback on the idea are very much welcome. Most, if not all, code in Tika will be based on existing code from Nutch and other Apache projects, so I'm not sure if the project needs to go through the Incubator if accepted by the Lucene PMC. So far the Tika source tree contains just a modified version of my TextExtractor code from the Apache Jackrabbit project, and Jérôme is planning to add some of his stuff. The source tree at Google Code should be considered just a playground for bringing things together and discussing ideas, before migrating back to ASF infrastructure. BR, Jukka Zitting -- Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED] Software craftsmanship, JCR consulting, and Java development
Re: Any plans to move to build Nutchusing Maven?
Hi Steven, On 8/16/06 7:36 AM, steven shingler [EMAIL PROTECTED] wrote: (This thread moved from the User List.) OK Lukas, let's open it up to the dev list! :) Particularly, does the group feel moving to Maven would be _a good thing_?

+1 I suggested this a while back (however, I did not make any progress on realizing it ;) ). I think it makes a *lot of sense*. Maven's dependency system would significantly reduce the size of the CM'ed Nutch source code, as all the jars required by Nutch could be referenced externally (plugins are a different beast, but we're working on that). Additionally, Maven would allow automatic generation of a sort of nightly-build Nutch site, showing recent commits, unit test results, and more.

Even if so, what are the problems?

The main problem I see is the plugin system, and how to appropriately represent plugin dependencies in Maven (or just neglect to handle them elegantly, and treat them like individual projects - like Nutch itself, which requires CM'ing jar files). Additionally, I think it will probably require writing some custom Jelly scripts to do all the neat ant build stuff that Nutch does on the side (e.g., unpack Hadoop, etc.). There are currently two versions of Lucene in the Maven repos, but Hadoop would have to be added manually, I think. It would probably make most sense to run a Maven repo explicitly for Nutch off of the Lucene Nutch site. Something like http://lucene.apache.org/nutch/maven/ might be sensible. Just my 2 cents. Cheers, Chris

All thoughts gratefully received. Cheers Steven

On 8/16/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I would like to help. But first of all I would suggest starting a wider discussion on the dev list to get more feedback/suggestions. I think one problem may be that Nutch depends on both Lucene and Hadoop libraries, and it won't be easy to maintain these dependencies if recent versions are not yet committed into some Maven-accessible repo. Regards, Lukas

On 8/16/06, steven shingler [EMAIL PROTECTED] wrote: Well, I'm up for giving it a try. My current work has me looking at both Nutch and Maven, so what better way to understand both projects :) I agree it is far from trivial - so if anyone here would like to collaborate on it, that would be great. Cheers, Steven

On 8/15/06, Lukas Vlcek [EMAIL PROTECTED] wrote: Hi, I would warmly appreciate this activity. At least it would help more people to understand/join this great project. But I don't think this will be an easy step (this reminds me of what N. Armstrong said on the moon: "That's one small step for [a] man, one giant leap for mankind.") :-) Regards, Lukas

On 8/15/06, Sami Siren [EMAIL PROTECTED] wrote: steven shingler wrote: Hi all, I know this has come up at least once before, but I just thought I'd raise the question again: Are there any plans to move to building Nutch using Maven? I haven't heard of such activities; however, if you or somebody else can put such a thing together and it proves to be a good thing to do, then I certainly don't have anything against it. -- Sami Siren
Patch Available status?
Hi Guys, I've seen on the Hadoop mailing list recently that there was a new status added for issues in JIRA called Patch Available to let committers know that a patch is ready for review to commit. How about we add this to the Nutch jira instance as well? I tried doing this, but I don't think I have the permissions to do so. I've got 2 patches for issues that are attached in jira that I'd like to set as having this new status :-) https://issues.apache.org/jira/browse/NUTCH-338 https://issues.apache.org/jira/browse/NUTCH-258 Cheers, Chris
Re: parse-plugins.xml
Hi Marko, Thanks for your question. Basically it was set up as a sort of last resort for getting at least *some* information from the PDF file, albeit littered with garbage. If indeed parse-text does not really make sense as a backup parser to handle PDF files and get at least some text to index, then we may think of either (a) removing it from the default parse-plugins.xml, or (b) writing a simple PdfParser that can handle truncation as a backup to the existing PdfParser. Basically, the philosophy behind each mimeType entry in parse-plugins.xml is to try to map the set of existing Nutch parse plugins to the available content types, giving each mimeType as many options as possible in terms of getting some content out of it. Cheers, Chris

On 8/3/06 4:04 AM, Marko Bauhardt [EMAIL PROTECTED] wrote: Hi all, I have a question about parse-plugins.xml and application/pdf. Why is the TextParser used for parsing PDF files? The mimeType application/pdf is mapped to parse-pdf and parse-text, but the TextParser does not support PDF files. The problem is, if the PDF file is truncated, the TextParser parses this content and the indexer indexes garbage. So what is the reason to map application/pdf to the parse-text plugin?

<mimeType name="application/pdf">
  <plugin id="parse-pdf" />
  <plugin id="parse-text" />
</mimeType>

Thanks for hints, Marko
Re: parse-plugins.xml
Hey Andrzej, On 8/3/06 8:19 AM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Chris Mattmann wrote: Hi Marko, Thanks for your question. Basically it was set up as a sort of last resort for getting at least *some* information from the PDF file, albeit littered with garbage. If indeed parse-text does not really make sense

IMO it doesn't make sense. PDF text content, even if it's available in plain text, is usually compressed. The percentage of non-compressed PDFs out there is, in my experience, negligible.

as a backup parser to handle PDF files and get at least some text to index, then we may think of either (a) removing it from the default

+1

Okey dok, you'll find a quick patch for this at: http://issues.apache.org/jira/browse/NUTCH-338 I decided to create an issue just to keep track of the fact that we made this change, and additionally because I tried pasting the quick patch into my email program here on my Mac and it looked like it was coming out weird :-)

parse-plugins.xml, or (b) writing a simple PdfParser that can handle truncation as a backup to the existing PdfParser. Basically the philosophy

I think that "simple PDF parser" is an oxymoron ... ;)

Heh, I agree with you on that one. If everyone would just move to XML DocBook, then it would be great! ;) Thanks! Cheers, Chris
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Folks, Before I (or someone else) reopen the issue, I think it's important to understand the implications:

1) Having a *side-effect* of the entire system stopping processing after merely logging a message at a certain event level is a poor practice.

I'm not sure that the Fetcher quitting is a *side-effect*, as you call it. In fact, I think it's clearly stated as the behavior of the system, both within the code and in several mailing list conversations I've seen over the course of the past two years (I can dig these up, if needed).

In fact, I believe that this would make a fantastic anti-pattern. If this kind of behavior is *really* wanted (and I argue that it should not be, below), it should be done through an explicit mechanism, not as a side-effect.

Again, the use of "side-effect" here is strange to me: how is an explicit check for any LOG messages at the SEVERE level before quitting a side-effect? For example, did you realize that since Hadoop hijacks and reassigns all log formatters (also a bad practice!) in the org.apache.hadoop.util.LogFormatter static constructor, anyone using Nutch as a library who logs a SEVERE error will suffer by having Nutch stop fetching?

I'm not convinced that having Nutch stop fetching when a SEVERE error is logged is the wrong behavior. Let's think about what possible SEVERE errors may typically be logged: an Out of Memory error, potentially; InterruptedExceptions in threads (possibly); a failure in any of the plugin libraries critical to the fetch running (possibly); the list goes on and on. So, in these cases, you argue that the Fetcher should continue operating?

2) Moreover, having the system stop processing forevermore by use of a static(!) flag makes the use of the Nutch system as a library within a server or service environment impossible. Once this logging is done, no more Fetcher processing in this run *or any other* can take place.

I've been using Nutch in a server environment (JSPs and Tomcat) within a large-scale data system at NASA over the course of the past year, and have never been impeded by the behavior of the fetcher. Can you be more specific here as to the exact use case that's failing in your scenario? I've also been watching the mailing lists for the better part of almost 2 years, and have seen little traffic (outside of the aforementioned clarifications, etc., above) about this issue. I may be out on an island here, but again, I'm not convinced that this is a core issue. Just my 2 cents. If the votes continue that this is an issue, however, I'll have no problem opening it up (or one of the committers can do it as well). Cheers, Chris

On 6/5/06 7:11 AM, Stefan Groschupf (JIRA) [EMAIL PROTECTED] wrote: [ http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] Stefan Groschupf commented on NUTCH-258: Scott, I agree with you. However, we need a clean patch to solve the problem; we cannot just comment things out of the code. So I vote for the issue, and I vote to reopen this issue.

Once Nutch logs a SEVERE log item, Nutch fails forevermore
Key: NUTCH-258
URL: http://issues.apache.org/jira/browse/NUTCH-258
Project: Nutch
Type: Bug
Components: fetcher
Versions: 0.8-dev
Environment: All
Reporter: Scott Ganyo
Priority: Critical
Attachments: dumbfix.patch

Once a SEVERE log item is written, Nutch shuts down any fetching forevermore.
This is from the run() method in Fetcher.java:

  public void run() {
    synchronized (Fetcher.this) {activeThreads++;} // count threads
    try {
      UTF8 key = new UTF8();
      CrawlDatum datum = new CrawlDatum();
      while (true) {
        if (LogFormatter.hasLoggedSevere())   // something bad happened
          break;                              // exit

Notice the last 2 lines. This will prevent Nutch from ever fetching again once this is hit, as LogFormatter is storing this data as a static. (Also note that LogFormatter.hasLoggedSevere() is also checked in org.apache.nutch.net.URLFilterChecker and will disable this class as well.) This must be fixed or Nutch cannot be run as any kind of long-running service. Furthermore, I believe it is a poor decision to rely on a logging event to determine the state of the application - this could have any number of side-effects that would be extremely difficult to track down. (As it has already for me.) __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 Phone: 818-354-8810
Re: [jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore
Hi Andrzej, The main problem, as Scott observed, is that the static flag affects all instances of the task executing inside the same JVM. If there are several Fetcher tasks (or any other tasks that check for SEVERE flag!), belonging to different jobs, all of them will quit. This is certainly not the intended behavior. Got it. In fact, I believe that this would make a fantastic anti-pattern. If this kind of behavior is *really* wanted (and I argue that it should not be below), it should be done through an explicit mechanism, not as a side-effect. I have a proposal for a simple solution: set a flag in the current Configuration instance, and check for this flag. The Configuration instance provides a task-specific context persisting throughout the lifetime of a task - but limited only to that task. Voila - problem solved. We get rid of the dubious use of LogFormatter (I hope Chris that even you would agree that this pattern is slightly .. unusual ;) ) What, unusual? Huh? :-) and we gain flexible mechanism limited in scope to the current task, which ensures isolation from other tasks in the same JVM. How about that? +1 I like your proposed solution. I haven't used multiple fetchers really inside the same process too, much however, I do have an application that calls fetches in more of a sequential way in the same JVM. So, I guess I just never ran across the behavior. The thing I like about the proposed solution is its separation and isolation of a task context, which I think that Nutch (now relying on Hadoop as the underlying architectural computing platform) needed to address. So, to summarize, the proposed resolution is: * add flag field in Configuration instance to signify whether or not a SEVERE error has been logged within a task's context * check this field within the fetcher to determine whether or not to stop the fetcher, just for that fetching task identified by its Configuration (and no others) Is this representative of what you're proposing Andrzej? If so, I'd like to take the lead on contributing a small patch that handles this, and then it would be great if people like Scott could test this out in their existing environments where this error was manifesting itself. Thanks! Cheers, Chris (BTW: would you like me to re-open the JIRA issue, or do you want to do it?) __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
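A minimal sketch, in Java, of the Configuration-scoped flag proposed above. The helper class and key name are hypothetical illustrations only (the actual patch may use different names); it leans on the same setObject()/getObject() calls on the task's Configuration that appear elsewhere in Nutch, so the flag is visible only to the task owning that Configuration, not to other tasks in the same JVM.

  import org.apache.hadoop.conf.Configuration; // assuming the post-Hadoop-split Configuration with setObject()/getObject()

  // Hypothetical helper -- not the actual NUTCH-258 patch.
  public class FetcherStatus {

    private static final String SEVERE_KEY = "fetcher.severe.error"; // hypothetical key name

    /** Record that a severe condition occurred, scoped to this task's Configuration. */
    public static void setSevere(Configuration conf) {
      conf.setObject(SEVERE_KEY, Boolean.TRUE);
    }

    /** True only if a severe condition was recorded for this task's Configuration. */
    public static boolean hasSevere(Configuration conf) {
      return conf.getObject(SEVERE_KEY) != null;
    }
  }

The fetcher loop would then replace the static LogFormatter.hasLoggedSevere() check with something like if (FetcherStatus.hasSevere(conf)) break; so that only the fetch task whose Configuration carries the flag is stopped.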
Re: Nutch Parser Bug
Hi Alex, I also noticed this issue a while back. It's described here: http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200510.mbox/%3c435 [EMAIL PROTECTED] Cheers, Chris On 4/25/06 2:41 PM, Alex [EMAIL PROTECTED] wrote: Hi there, I'm fairly new to nutch and in working on the nutch search I realize that when I try to search for terms such as #1 top item sales, the search seem to ignored everything after the # sign. I also tried with other symbols such as @, !, $, % , ^ , etc... those seem to be ignored. This seem to be a problem in the Query.parse method, Can this be add to the list of bug fix for the next build? or is it something that's already been done? Please adv. Thank you. Alex - Yahoo! Messenger with Voice. Make PC-to-Phone Calls to the US (and 30+ countries) for 2¢/min or less. __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
0.8 release?
Hi Guys, Any progress on the 0.8 release? Was there any resolution about which JIRA issues to complete before the 0.8 release? We had a bit of conversation there and some ideas, but no definitive answer... Thanks for your help, and sorry to pester ;) Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: 0.8 release schedule (was Re: latest build throws error - critical)
+1 On 4/7/06 10:20 AM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: +1 for a release sooner rather than later. I think this is a good plan. There's no reason we can't do another release in a month. If it is back-compatible we can call it 0.8.x and if it's incompatible we can call it 0.9.0. I'm going to make a Hadoop 0.1.1 release today that can be included in Nutch 0.8.0. (With Hadoop we're going to aim for monthly releases, with potential bugfix releases between when serious bugs are found. The big bug in Hadoop 0.1.0 is http://issues.apache.org/jira/browse/HADOOP-117.) So we could aim for a Nutch 0.8.0 release sometime next week. Does that work for folks? Piotr, would you like to make this release, or should I? Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: 0.8 release schedule (was Re: latest build throws error - critical)
Hi Andrzej, On 4/7/06 12:18 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Do you guys have any additional insights / suggestions whether NUTCH-240 and/or NUTCH-61 should be included in this release? Looking at the JIRA popular issues pane for Nutch ( http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:popularissues-panel ), I note that NUTCH-61 is the most popular issue right now with 7 votes. Additionally, NUTCH-240 shares the 3rd most votes (4) with NUTCH-134. So, all in all, there are 4 issues with >= 4 votes in JIRA. Of those 4 issues, 3 have attached patches in JIRA. Would it be safe to say that the committers should focus on committing NUTCH-61, NUTCH-240, and NUTCH-48, since these 3 issues all have attached patch files, and then freeze it for the 0.8.0 release? As for my own opinion, I recently downloaded and reviewed NUTCH-61, and really like the patch. +1 on my end. I haven't tried out NUTCH-240 yet, but it seems to be a logical extension point for Nutch to be able to plug in different scoring components. So, +1 from me. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: 0.8 release schedule (was Re: latest build throws error - critical)
+1 for a release sooner rather than later. Several interesting features contributed since the 0.7 branch I believe are now tested and production-worthy, at least in my environment. Hats off to the folks who were able to split the MapReduce and NDFS into Hadoop -- I'm going to be experimenting with that portion of the code over the next few weeks on a 16 node, 32 processor Opteron cluster at JPL that will be used as the development machine for a large scale earth science data processing mission. Because the Hadoop code is in its own project now, I can leverage and test the Hadoop processing and HDFS capability without having to include all the search engine specific stuff. Ya! :-) Cheers, Chris On 4/6/06 12:59 PM, Andrzej Bialecki [EMAIL PROTECTED] wrote: Doug Cutting wrote: TDLN wrote: I mean, how do others keep uptodate with the main codeline? Do you advice updating everyday? Should we make a 0.8.0 release soon? What features are still missing that we'd like to get into this release? I think we should make a release soon - instabilities related to Hadoop split are mostly gone now, and we need to endorse the new architecture more officially... The adaptive fetch and scoring API functionality are the top priority for me. While the scoring API change is pretty innocuous, we just need to clean it up, the adaptive fetch changes have a big potential for wrecking the main re-fetch cycle ... ;) We could do it in two ways: I could apply this patch and let people run with it for a while, fixing bugs as they pop up - but then it will be another 3-4 weeks I suppose. Or we could wait with this after the release. __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Null Pointer exception in AnalyzerFactory?
Hi Folks, I updated to the latest SVN revision (385691) today, and I am now seeing a NullPointerException in the AnalyzerFactory.java class. It seems that in some cases, the method:

  private Extension getExtension(String lang) {
    Extension extension = (Extension) this.conf.getObject(lang);
    if (extension == null) {
      extension = findExtension(lang);
      if (extension != null) {
        this.conf.setObject(lang, extension);
      }
    }
    return extension;
  }

has a null lang parameter passed to it, which causes a NullPointerException at line 81 in src/java/org/apache/nutch/analysis/AnalyzerFactory.java. I found that if I checked for null in the lang variable, and returned null if lang == null, my crawl finished. Here is a small patch that will fix the crawl:

Index: /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java
===
--- /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java (revision 385691)
+++ /Users/mattmann/src/nutch/src/java/org/apache/nutch/analysis/AnalyzerFactory.java (working copy)
@@ -78,14 +78,19 @@
   private Extension getExtension(String lang) {
-    Extension extension = (Extension) this.conf.getObject(lang);
-    if (extension == null) {
-      extension = findExtension(lang);
-      if (extension != null) {
-        this.conf.setObject(lang, extension);
-      }
-    }
-    return extension;
+    if (lang == null) {
+      return null;
+    }
+    else {
+      Extension extension = (Extension) this.conf.getObject(lang);
+      if (extension == null) {
+        extension = findExtension(lang);
+        if (extension != null) {
+          this.conf.setObject(lang, extension);
+        }
+      }
+      return extension;
+    }
   }

   private Extension findExtension(String lang) {

NOTE: not sure if returning null is the right thing to do here, but hey, at least it made my crawl finish! :-) Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
RE: found resource parse-plugins.xm?
Hi Stefan,

after a short time I already had 1602 time this lines in my tasktracker log files. 060307 022707 task_m_2bu9o4 found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml Sounds like this file is loaded 1602 (after lets say 3 minutes) I guess that wasn't the goal or do I oversee anything?

It certainly wasn't the goal at all. After NUTCH-88, Jerome and I had the following line in the ParserFactory.java class:

  /** List of parser plugins. */
  private static final ParsePluginList PARSE_PLUGIN_LIST = new ParsePluginsReader().parse();

(see revision 326889) Looking at the revision history for the ParserFactory file, after the application of NUTCH-169, the above changes to:

  private ParsePluginList parsePluginList;

  //... code here

  public ParserFactory(NutchConf nutchConf) {
    this.nutchConf = nutchConf;
    this.extensionPoint = nutchConf.getPluginRepository().getExtensionPoint(
        Parser.X_POINT_ID);
    this.parsePluginList = new ParsePluginsReader().parse(nutchConf);
    if (this.extensionPoint == null) {
      throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");
    }
    if (this.parsePluginList == null) {
      throw new RuntimeException(
          "Parse Plugins preferences could not be loaded.");
    }
  }

Thus, every time the ParserFactory is constructed, the parse-plugins.xml file is read (it's the result of the call to ParsePluginsReader().parse(nutchConf)). So, if the file is loaded 1602 times, I'd guess that the ParserFactory is constructed 1602 times?

Additionally, I'm wondering why the parse-plugins.xml configuration parameters aren't declared as final static anymore? That could be a serious performance improvement to just load this file once.

Yup, I think that's the reason we made it final static. If there is no reason to not have it final static, I would suggest that it be put back to final static. There may be a problem however: now, since NUTCH-169, the loading requires an existing Configuration object I believe. So, we may need a static Configuration object as well. Thoughts?

I was not able to find the code that is logging this statement, has anyone a idea where this happens?

The statement gets logged within the ParsePluginsReader.java class, line 98:

  ppInputStream = conf.getConfResourceAsInputStream(
      conf.get(PP_FILE_PROP));

HTH, Chris Thanks. Stefan - blog: http://www.find23.org company: http://www.media-style.com
RE: found resource parse-plugins.xm?
Hi Stefan, Hi Chris, thanks for the clarification. No probs. Do you think we can we somehow cache it in the nutchConf instance, since this is the way we doing this on other places as well? Yeah I think we can. Here is a small patch to the ParserFactory that should do the trick. Give it a test and let me know if it works. If it does, I would say +1 to the committers to get this into the sources ASAP, no? Index: src/java/org/apache/nutch/parse/ParserFactory.java === --- src/java/org/apache/nutch/parse/ParserFactory.java (revision 383463) +++ src/java/org/apache/nutch/parse/ParserFactory.java (working copy) @@ -55,7 +55,13 @@ this.conf = conf; this.extensionPoint = PluginRepository.get(conf).getExtensionPoint( Parser.X_POINT_ID); -this.parsePluginList = new ParsePluginsReader().parse(conf); + +if(conf.getObject(parsePluginList) != null){ + this.parsePluginList = (ParsePluginList)conf.getObject(parsePluginList); +} +else{ +this.parsePluginList = new ParsePluginsReader().parse(conf); +} if (this.extensionPoint == null) { throw new RuntimeException(x point + Parser.X_POINT_ID + not found.); Cheers, Chris Cheers, Stefan Am 07.03.2006 um 04:38 schrieb Chris Mattmann: Hi Stefan, after a short time I already had 1602 time this lines in my tasktracker log files. 060307 022707 task_m_2bu9o4 found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml Sounds like this file is loaded 1602 (after lets say 3 minutes) I guess that wasn't the goal or do I oversee anything? It certainly wasn't the goal at all. After NUTCH-88, Jerome and I had the following line in the ParserFactory.java class: /** List of parser plugins. */ private static final ParsePluginList PARSE_PLUGIN_LIST = new ParsePluginsReader().parse(); (see revision 326889) Looking at the revision history for the ParserFactory file, after the application of NUTCH-169, the above changes to: private ParsePluginList parsePluginList; //... code here public ParserFactory(NutchConf nutchConf) { this.nutchConf = nutchConf; this.extensionPoint = nutchConf.getPluginRepository ().getExtensionPoint( Parser.X_POINT_ID); this.parsePluginList = new ParsePluginsReader().parse(nutchConf); if (this.extensionPoint == null) { throw new RuntimeException(x point + Parser.X_POINT_ID + not found.); } if (this.parsePluginList == null) { throw new RuntimeException( Parse Plugins preferences could not be loaded.); } } Thus, every time the ParserFactory is constructed, the parse- plugins.xml file is read (it's the result of the call to ParsePluginsReader().parse(nutchConf)). So, if the fie is loaded 1602 times, I'd guess that the ParserFactory is loaded 1602 times? Additionally, I'm wondering why the parse-plugins.xml configuration parameters aren't declared as final static anymore? That could be a serious performance improvement to just load this file once. Yup, I think that's the reason we made it final static. If there is no reason to not have it final static, I would suggest that it be put back to final static. There may be a problem however, now since NUTCH-169, the loading requires an existing Configuration object I believe. So, we may need a static Configuration object as well. Thoughts? I was not able to find the code that is logging this statement, has anyone a idea where this happens? The statement gets logged within the ParsePluginsReader.java class, line 98: ppInputStream = conf.getConfResourceAsInputStream( conf.get(PP_FILE_PROP)); HTH, Chris Thanks. 
Stefan - blog: http://www.find23.org company: http://www.media-style.com
RE: found resource parse-plugins.xm?
Sorry, my last patch was missing one line. Here's the update:

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===
--- src/java/org/apache/nutch/parse/ParserFactory.java (revision 383463)
+++ src/java/org/apache/nutch/parse/ParserFactory.java (working copy)
@@ -55,7 +55,14 @@
     this.conf = conf;
     this.extensionPoint = PluginRepository.get(conf).getExtensionPoint(
         Parser.X_POINT_ID);
-    this.parsePluginList = new ParsePluginsReader().parse(conf);
+
+    if (conf.getObject("parsePluginList") != null) {
+      this.parsePluginList = (ParsePluginList) conf.getObject("parsePluginList");
+    }
+    else {
+      this.parsePluginList = new ParsePluginsReader().parse(conf);
+      conf.setObject("parsePluginList", this.parsePluginList);
+    }
     if (this.extensionPoint == null) {
       throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");

-Original Message- From: Chris Mattmann [mailto:[EMAIL PROTECTED] Sent: Monday, March 06, 2006 7:51 PM To: 'nutch-dev@lucene.apache.org' Subject: RE: found resource parse-plugins.xm? Hi Stefan, Hi Chris, thanks for the clarification. No probs. Do you think we can we somehow cache it in the nutchConf instance, since this is the way we doing this on other places as well? Yeah I think we can. Here is a small patch to the ParserFactory that should do the trick. Give it a test and let me know if it works. If it does, I would say +1 to the committers to get this into the sources ASAP, no? Index: src/java/org/apache/nutch/parse/ParserFactory.java === --- src/java/org/apache/nutch/parse/ParserFactory.java(revision 383463) +++ src/java/org/apache/nutch/parse/ParserFactory.java(working copy) @@ -55,7 +55,13 @@ this.conf = conf; this.extensionPoint = PluginRepository.get(conf).getExtensionPoint( Parser.X_POINT_ID); -this.parsePluginList = new ParsePluginsReader().parse(conf); + +if(conf.getObject(parsePluginList) != null){ + this.parsePluginList = (ParsePluginList)conf.getObject(parsePluginList); +} +else{ +this.parsePluginList = new ParsePluginsReader().parse(conf); +} if (this.extensionPoint == null) { throw new RuntimeException(x point + Parser.X_POINT_ID + not found.); Cheers, Chris Cheers, Stefan Am 07.03.2006 um 04:38 schrieb Chris Mattmann: Hi Stefan, after a short time I already had 1602 time this lines in my tasktracker log files. 060307 022707 task_m_2bu9o4 found resource parse-plugins.xml at file:/home/joa/nutch/conf/parse-plugins.xml Sounds like this file is loaded 1602 (after lets say 3 minutes) I guess that wasn't the goal or do I oversee anything? It certainly wasn't the goal at all. After NUTCH-88, Jerome and I had the following line in the ParserFactory.java class: /** List of parser plugins. */ private static final ParsePluginList PARSE_PLUGIN_LIST = new ParsePluginsReader().parse(); (see revision 326889) Looking at the revision history for the ParserFactory file, after the application of NUTCH-169, the above changes to: private ParsePluginList parsePluginList; //... code here public ParserFactory(NutchConf nutchConf) { this.nutchConf = nutchConf; this.extensionPoint = nutchConf.getPluginRepository ().getExtensionPoint( Parser.X_POINT_ID); this.parsePluginList = new ParsePluginsReader().parse(nutchConf); if (this.extensionPoint == null) { throw new RuntimeException(x point + Parser.X_POINT_ID + not found.); } if (this.parsePluginList == null) { throw new RuntimeException( Parse Plugins preferences could not be loaded.); } } Thus, every time the ParserFactory is constructed, the parse- plugins.xml file is read (it's the result of the call to ParsePluginsReader().parse(nutchConf)).
So, if the fie is loaded 1602 times, I'd guess that the ParserFactory is loaded 1602 times? Additionally, I'm wondering why the parse-plugins.xml configuration parameters aren't declared as final static anymore? That could be a serious performance improvement to just load this file once. Yup, I think that's the reason we made it final static. If there is no reason to not have it final static, I would suggest that it be put back to final static. There may be a problem however, now since NUTCH-169, the loading requires an existing Configuration object I believe. So, we may need a static Configuration object as well. Thoughts? I was not able to find the code that is logging this statement, has anyone a idea where this happens? The statement gets logged
Re: ignore eclipse .project and .classpath
Thanks a lot! Cheers, Chris On 2/9/06 12:13 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Done. - Original Message From: Stefan Groschupf [EMAIL PROTECTED] To: nutch-dev@lucene.apache.org Sent: Wed 08 Feb 2006 03:15:15 PM EST Subject: Re: ignore eclipse .project and .classpath +1 Am 08.02.2006 um 06:16 schrieb Chris Mattmann: Hi Folks, Just wondering if someone could add to the svn:ignore property for Nutch the files: .classpath .project I happen to use eclipse to do Nutch development and always ignore these files in my other eclipse projects as well. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. --- company:http://www.media-style.com forum:http://www.text-mining.org blog:http://www.find23.net
ignore eclipse .project and .classpath
Hi Folks, Just wondering if someone could add to the svn:ignore property for Nutch the files: .classpath .project I happen to use eclipse to do Nutch development and always ignore these files in my other eclipse projects as well. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
RE: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type
Hi Gail, Check out: http://wiki.apache.org/nutch/ParserFactoryImprovementProposal/ That's the way that the parser factory currently works. Also added, but not described in that proposal is the ability to call a parser by its id, which is a method present in ParseUtil.java. G'luck! Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: Gal Nitzan (JIRA) [mailto:[EMAIL PROTECTED] Sent: Sunday, January 15, 2006 4:10 PM To: nutch-dev@incubator.apache.org Subject: [jira] Updated: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ] Gal Nitzan updated NUTCH-179: - Description: Sorry, please close this issue. I figured that if I set my parse plugin first. I can always be called first and than decide if I want to parse or not. was: Somtime there are requirements of the real world (usually your boss) where a special parse is required for a certain site. Though the content type is text/html, a specialized parser is needed. Sample: I am required to crawl certain sites where some of them are partners sites. when fetching from the partners site I need to look for certain entries in the text and boost the score. Currently the ParserFactory looks for a plugin based only on the content type. Facing this issue myself I noticed that it would give a very easy implementation for others if ParserFactory could use NutchConf to check for certain properties and if matched to use the correct plugin based on the url and not just the content type. The implementation shouldn be to complicated. Looking to hear more ideas. Proposition: Enable Nutch to use a parser plugin not just based on content type --- Key: NUTCH-179 URL: http://issues.apache.org/jira/browse/NUTCH-179 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev Reporter: Gal Nitzan Sorry, please close this issue. I figured that if I set my parse plugin first. I can always be called first and than decide if I want to parse or not. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
RE: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata
Guys, My apologies for the spamming comments -- I tried to submit my comment through JIRA one time and it kept giving me service unavailable. So I resubmitted like 5 times, on the fifth time it finally went through -- but I guess the other comments went through too. I'll try and remove them right away. Sorry again. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: Doug Cutting (JIRA) [mailto:[EMAIL PROTECTED] Sent: Thursday, January 05, 2006 8:04 PM To: nutch-dev@incubator.apache.org Subject: [jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata [ http://issues.apache.org/jira/browse/NUTCH- 139?page=comments#action_12361922 ] Doug Cutting commented on NUTCH-139: One more thing. Content length should also not need to be stored in the metadata as an x-nutch value. The content length is simply the length of the Content's data. The protocol may have truncated the content, in which case perhaps we need an x-nutch-truncated-content metadata property or something, but we should not be overwriting the HTTP Content-Length header, nor should we trust that it reflects the length of the data actually fetched. Standard metadata property names in the ParseData metadata -- Key: NUTCH-139 URL: http://issues.apache.org/jira/browse/NUTCH-139 Project: Nutch Type: Improvement Components: fetcher Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, although bug is independent of environment Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Minor Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6 Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content- TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that CONTENT_TYPE and conTeNT_TyPE and all the permutations are really the same). What about if I named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like: public class ParseData{ . 
public static final String CONTENT_TYPE = content-type; public static final String CREATOR = creator; } In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, text/xml); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named. I'll contribute a patch near the end of the this week, or beg. of next week that addresses this issue. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
Standard metadata property names in the ParseData metadata
Hi Folks, I was just thinking about the ParseData java.util.Properties metadata object and thinking about the way that we store names in there. Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names would be converted to lower case, but in essence this really only fixes half the problem (the case of identifying that CONTENT_TYPE and conTeNT-TyPE and all the permutations are really the same). What about if I named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like:

  public class ParseData {
    ...
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";
  }

In this fashion, users could at least know the names of the standard properties that they can obtain from the ParseData, for example by making a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml"); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without poring over the code base to figure out what they are named. What do you all think? If you guys think that this is a good solution, I'll create an issue in JIRA about it and contribute a patch near the end of the week. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Standard metadata property names in the ParseData metadata
Hi Stefan, Thanks. Yup, I noticed it and I think it will really help out a lot. Great job to the both of you :-) Cheers, Chris On 12/13/05 10:59 AM, Stefan Groschupf [EMAIL PROTECTED] wrote: +1! BTW, did you notice that Jerome committed a patch that makes Content meta data now case insensitive? Stefan Am 13.12.2005 um 18:07 schrieb Chris Mattmann: Hi Folks, I was just thinking about the ParseData java.util.Properties metaata object and thinking about the way that we store names in there. Currently, people are free to name their string-based properties anything that they want, such as having names of Content-type, content-TyPe, CONTENT_TYPE all having the same meaning. Stefan G. I believe proposed a solution in which all property names be converted to lower case, but in essence this really only fixes half the problem right (the case of identifying that CONTENT_TYPE and conTeNT-TyPE and all the permutations are really the same). What about if named it Content Type, or ContentType? I propose that a way to correct this would be to create a standard set of named Strings in the ParseData class that the protocol framework and the parsing framework could use to identify common properties such as Content-type, Creator, Language, etc. The properties would be defined at the top of the ParseData class, something like: public class ParseData{ . public static final String CONTENT_TYPE = content-type; public static final String CREATOR = creator; } In this fashion, users could at least know what the name of the standard properties that they can obtain from the ParseData are, for example by making a call to ParseData.getMetadata().get (ParseData.CONTENT_TYPE) to get the content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, text/xml); Of course, this wouldn't preclude users from doing what they are currently doing, it would just provide a standard method of obtaining some of the more common, critical metadata without pouring over the code base to figure out what they are named. What do you all think? If you guys think that this is a good solution, I'll create an issue in JIRA about it and contribute a patch near the end of the week. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. --- company:http://www.media-style.com forum:http://www.text-mining.org blog:http://www.find23.net __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Idea about aliases in the parse-plugins.xml file
Hi Folks, Jerome and I have been talking about an idea to address the current issue raised by Stefan G. about having a mapping of mimeType -> list of pluginIds rather than mimeType -> list of extensionIds in the parse-plugins.xml file. We've come up with the following proposed update that would seemingly fix this problem. We propose to have the concept of aliases in the parse-plugins.xml file, defined at the end of the file, something like:

  <parse-plugins>
    <mimeType name="text/html">
      <plugin id="parse-html"/>
    </mimeType>
    ...
    <aliases>
      <alias name="parse-html" extension-point="org.apache.nutch.parse.html.HtmlParser"/>
      <alias name="parse-html2" extension-point="my.other.html.Parser"/>
    </aliases>
  </parse-plugins>

What do you guys think? This approach would be flexible enough to allow the mapping of extensionIds to mimeTypes, but without impacting the current pluginId concept. Comments welcome. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Standard metadata property names in the ParseData metadata
Hi Guys, Okay, that makes sense then. I will create an issue in JIRA later today describing the update, and then begin working on this over the next few days. Thanks for your responses and reviews. Cheers, Chris On 12/13/05 12:45 PM, Jérôme Charron [EMAIL PROTECTED] wrote: I agree, too. Perhaps we should use the names as they appear in the Dublin Core for those properties that are defined there A big YES! - just prepended them with X-nutch- in order to avoid name-clashes with other properties (e.g. blindly copied from the protocol headers). Another big YES! -- http://motrech.free.fr/ http://www.frutch.org/ __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
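To make the agreed naming convention concrete, a tiny illustrative sketch in Java: Dublin Core element names, prefixed with X-nutch- to avoid clashes with properties blindly copied from the protocol headers. The constant names and values below are examples only, not the set that was actually committed to Nutch.

  // Illustrative only; the actual names adopted in Nutch may differ.
  public interface NutchMetadataNames {
    String CREATOR  = "X-nutch-creator";   // Dublin Core "creator"
    String LANGUAGE = "X-nutch-language";  // Dublin Core "language"
    String FORMAT   = "X-nutch-format";    // Dublin Core "format"
  }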
NUTCH-112: Link in cached.jsp page to cached content is an absolute link
Hi Guys, Just wondering if any of the committers checked out http://issues.apache.org/jira/browse/NUTCH-112. Turns out the link in the cached.jsp page to the cached content is an absolute link, which breaks when you don't deploy the nutch webapp in the root context. I've attached a pretty simple patch to the issue, and tested it. It would be nice to have this included for those people like me who are using Nutch deployed at a context other than root, e.g., http://myhost/nutch/. Thanks, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: Urlfilter Patch
Jerome, I think that this is a great idea and ensures that there isn't replication of so-called management information across the system. It could be easily implemented as a utility method because we have utility java classes that represent the ParsePluginList, that you could get the mimeTypes from. Additionally, we could create a utility method that searches the extension point list for parsing plugins and returns a boolean true or false whether they are activated or not. Using this information, I believe that the url filtering would be a snap. +1 Cheers, Chris On 12/1/05 12:11 PM, Jérôme Charron [EMAIL PROTECTED] wrote: Suggestion: For consistency purpose, and easy of nutch management, why not filtering the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file extensions associated to each content-type, we can build a list of file extensions to include (other ones will be excluded) in the fecth process. No? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/ __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
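A rough sketch of the utility method being described, under stated assumptions: the map from mime type to file extensions is assumed to be built elsewhere (e.g., from parse-plugins.xml plus the mime-type registry), and none of the class or method names below are existing Nutch APIs.

  import java.util.Collection;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  // Hypothetical helper -- not an existing Nutch class.
  public class PluginDrivenSuffixFilter {

    /** Collect the file suffixes of every mime type that has an activated parser plugin. */
    public static Set<String> allowedSuffixes(Collection<String> parseableMimeTypes,
                                              Map<String, List<String>> mimeTypeToExtensions) {
      Set<String> suffixes = new HashSet<String>();
      for (String mimeType : parseableMimeTypes) {
        List<String> exts = mimeTypeToExtensions.get(mimeType);
        if (exts != null) {
          suffixes.addAll(exts);
        }
      }
      return suffixes;
    }

    /** Accept a URL if it has no suffix, or if its suffix belongs to a parseable mime type. */
    public static boolean accept(String url, Set<String> allowedSuffixes) {
      int dot = url.lastIndexOf('.');
      int slash = url.lastIndexOf('/');
      if (dot <= slash) {
        return true; // no file suffix; leave the decision to other URL filters
      }
      return allowedSuffixes.contains(url.substring(dot + 1).toLowerCase());
    }
  }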
RE: Urlfilter Patch
Hi Doug, Chris Mattmann wrote: In principle, the mimeType system should give us some guidance on determining the appropriate mimeType for the content, regardless of whether it ends in .foo, .bar or the like. Right, but the URL filters run long before we know the mime type, in order to try to keep us from fetching lots of stuff we can't process. The mime type is not known until we've fetched it. Duh, you're right. Sorry about that. Matt Kangas wrote: The latter is not strictly true. Nutch could issue an HTTP HEAD before the HTTP GET, and determine the mime-type before actually grabbing the content. It's not how Nutch works now, but this might be more useful than a super-detailed set of regexes... I liked Matt's idea of the HEAD request though. I wonder if some benchmarks on performance of this would be useful, because in some cases (such as focused crawling, or non-whole-internet crawling, such as intranet, etc.), it would seem that the performance penalty of performing the HEAD to get the content-type would be useful, and worth the cost... Cheers, Chris
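As a rough illustration of the HEAD-before-GET idea (and of what such a benchmark would exercise), here is a minimal sketch using only java.net. This is not how Nutch's protocol plugins work; it also ignores redirects, robots rules, and politeness delays for brevity.

  import java.net.HttpURLConnection;
  import java.net.URL;

  // Hypothetical probe -- an illustration of the idea, not existing Nutch code.
  public class HeadProbe {

    /** Return the Content-Type reported by a HEAD request, or null if it cannot be determined. */
    public static String probeContentType(String url) {
      try {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no body transferred
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        String type = conn.getContentType();  // e.g. "application/pdf"
        conn.disconnect();
        return type;
      } catch (Exception e) {
        return null; // unknown; the caller decides whether to fetch anyway
      }
    }
  }

The cost is one extra round trip per candidate URL, which is exactly what the suggested benchmark would need to weigh against the bandwidth saved by not fetching unparseable content.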
RE: [proposal] Generic Markup Language Parser
Hi Stefan, -1! Xsl is terrible slow! You have to consider what the XSL will be used for. Our proposal suggests XSL as a means of intermediate transformation of markup content on the backend, as Jerome suggested in his reply. This means that whenever markup content is encountered, specifically, XML based content, then XSL will be used to create an intermediary parse-out xml file, containing the fields to index. I don't think, given the percentage of xml-based markup content out there (of course excluding html), compared to regular content, that this would significantly degrade performance. Xml will blow up memory and storage usage. Possibly, but I would think that we would do it in a clever fashion. For instance, the parse-out xml files would most likely be small (~kb) files that could be deleted if space is a concern. It could be a parameterized option. Dublin core may is good for semantic web, but not for a content storage. I completely disagree with that. In fact, I think many people would disagree with that in fact. Dublin core is a standard metadata model for electronic resources. It is by no means the entire spectrum of metadata that could be stored for electronic content. However, rather than creating your own author field, or content creator, or document creator, or whatever you want to call it, I think it would be nice to provide the DC metadata because at least it is well known and provides interoperability with other content storage systems. Check out DSpace from MIT. Check out ISO-11179 registry systems. Check out the ISO standard OAIS reference model for archiving systems. Each of these systems has recognized that standard metadata is an important concern in any content management system. In general the goal must be to minimalize memory usage and improve performance such a parser would increase memory usage and definitely slow down parsing. I dont think it would slow down parsing significantly, as I mentioned above markup content represents a small portion of the amount of content out there. The magic world is minimalism. So I vote against this suggestion! Stefan In general, this proposal represents a step forward in being able to parse generic XML content in Nutch, which is a very challenging problem. Thanks for your suggestions, however, I think that our proposal would help Nutch to move forward in being to handle generic forms of XML markup content. Cheers, Chris Mattmann Am 24.11.2005 um 00:01 schrieb Jérôme Charron: Hi, We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and me) just add a new proposal on the nutch Wiki: http://wiki.apache.org/nutch/MarkupLanguageParserProposal Here is the Summary of Issue: Currently, Nutch provides some specific markup language parsing plugins: one for handling HTML, another one for RSS, but no generic XML parsing plugin. This is extremely cumbersome as adding support for a new markup language implies that you have to develop the whole XML parsing code from scratch. This methodology causes: (1) code duplication, with little or no reuse of common pieces of XML parsing code, and (2) dependency library duplication, where many XML parsing plugins may rely on similar xml parsing libraries, such as jaxen, or jdom, or dom4j, etc., but each parsing plugin keeps its own local copy of these libraries. It is also very difficult to identify precisely the type of XML content encountered during a parse. That difficult issue is outside the scope of this proposal, and will be identified in a future proposal. 
Thanks for your feedback, comments, suggestions (and votes). Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
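For concreteness, a minimal sketch (standard JAXP only) of the intermediate-transformation step discussed in the preceding messages: apply a stylesheet chosen for the incoming document's schema or DTD and emit a small "parse-out" XML document listing the fields to index. The class name and the notion of a per-schema stylesheet path are illustrative assumptions, not part of the proposal or of Nutch.

  import java.io.StringReader;
  import java.io.StringWriter;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.stream.StreamResult;
  import javax.xml.transform.stream.StreamSource;

  // Illustrative only.
  public class ParseOutTransform {

    /** Transform raw XML content into the intermediate "parse-out" form via an XSL stylesheet. */
    public static String toParseOut(String xmlContent, String xslPath) throws Exception {
      Transformer t = TransformerFactory.newInstance()
          .newTransformer(new StreamSource(xslPath)); // stylesheet chosen per schema/DTD
      StringWriter out = new StringWriter();
      t.transform(new StreamSource(new StringReader(xmlContent)), new StreamResult(out));
      return out.toString(); // small XML document whose elements map to index fields
    }
  }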
RE: [proposal] Generic Markup Language Parser
Hi Stefan, and Jerome, A mail archive is a amazing source of information, isn't it?! :-) To answer your question, just ask your self how many pages per second your plan to fetch and parse and how much queries per second a lucene index is able to handle - and you can deliver in the ui. I have here something like 200++ to a maximal 20 queries per second. http://wiki.apache.org/nutch/HardwareRequirements I'm not sure that our proposal affects the ui, really at all. Parsing occurs only during a fetch, which creates the index for the ui, no? So, why mention the amount of queries per second that the ui can handle? Speed improvement in ui can be done by caching components you use to assemble the ui. There are some ways to improve speed But seriously I don't think there will be any pages that contains 'cacheable' items until parsing. Until last years there is one thing I notice that matters in a search engine - minimalism. There is no usage in nutch of a logging library, Correct me if I'm wrong, but isn't log4j used a lot within Nutch? :-) no RMI and no meta data in the web db. Why? Minimalism. Minimalism == speed, speed == scalability, scalability == serious enterprise search engine projects. I don't think it would be a good move to slow down html parsing (most used parser) to make rss parser writing more easier for developers. This proposal isn't meant for RSS, that's seriously constraining the scope. The proposal is meant for making writing * XML * parsers easier. Note the XML. RSS is a significantly small subset of XML as a whole. And, there currently exists no default support for generic XML documents in Nutch. BTW, we already have a html and feed parser that works, as far I know. I guess 90 % of the nutch users use the html parser but only 10 % the feed-parser (since blogs are mostly html as well). This may or may not be true however I wouldn't be surprised if it was because it is representative of the division of content on the web -- HTML definitely is orders of magnitude more pervasive than RSS. From my perspective we have much more general things to solve in nutch (manageability, monitoring, ndfs block based task-routing, more dynamic search servers) than improving thing we already have. I would tend to agree with Jerome on this one -- these seem to be the items on your agenda: a representative set indeed, but by no means an exhaustive set of what's needed to improve, and benefit Nutch. One of the motivations behind our proposal was several emails posted to the Nutch list by users interested in crawling blogs and RSS: http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/23 69417.html One of my replies to this thread was a message on October 19th, 2005, which really identified the main problem: http://www.opensubscriber.com/message/nutch-general@lists.sourceforge.net/23 69576.html There is a lack of a general XML parser in Nutch that would allow it to deal with general XML content based on user defined schemas and DTDs. Our proposal would be the initial step towards a solution to this overall problem. At least, that's part of its intention. Anyway as you may know we have a plugin system and one goal of the plugin system is to give developers the freedom to develop custom plugins. :-) Indeed. And our goal is help developers in their endeavors by providing at starting point and generic solution for XML based parsing plugins :-) Cheers, Chris Cheers, Stefan B-) P.S. 
Do you think it makes sense to run another public nutch mailing list, since 'THE nutch [...]' (mailing list is nutch- [EMAIL PROTECTED]), 'Isn't it?' http://www.mail-archive.com/nutch-user@lucene.apache.org/msg01513.html Am 24.11.2005 um 19:28 schrieb Jérôme Charron: Hi Stefan, And thanks for taking time to read the doc and giving us your feedback. -1! Xsl is terrible slow! Xml will blow up memory and storage usage. But there still something I don't understand... Regarding a previous discussion we had about the use of OpenSearch API to replace Servlet = HTML by Servlet = XML = HTML (using xsl), here is a copy of one of my comment: In my opinion, it is the front-end dreamed architecture. But more pragmatically, I'm not sure it's a good idea. XSL transformation is a rather slow process!! And the Nutch front-end must be very responsive. and then your response and Doug response too: Stefan: We already done experiments using XSLT. There are some ways to improve speed, however it is 20 ++ % slower then jsp. Doug: I don't think this would make a significant impact on overall Nutch search performance. (the complete thread is available at http://www.mail-archive.com/nutch-developers@lists.sourceforge.net/ msg03811.html ) I'm a little bit confused... why the use of xsl must be considered as too time and memory expansive in the back-end process, but not in the front-end?
Re: developing a parse-/index-/query- plugin set
Hi Doug, On 10/17/05 11:38 AM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: So, one thing it seems is that fields to be indexed, and used in a field query must be fully lowercase to work? Additionally, it seems that they can't have symbols in them, such as _, is that correct? Would you guys consider this to be a bug? Yes, this sounds like a bug. Okay, I will look and see if I can figure out why this is happening and if I can, I will try and submit a patch. Performing Lucene Query: using filter QueryFilter(+contactemail:[EMAIL PROTECTED]) and numHits = 20 051016 190347 11 total hits: 0 A query whose only clause has a boost of 0.0 will return no results. Nutch uses the convention that clauses whose boost is 0.0 may be converted to filters, for efficiency. A filter affects the set of hits, but not their ranking. So a boost of 0.0 is used to declare that a clause does not affect ranking and may not be used in isolation. This makes it akin to searching for filetype:pdf on Google--filetype is only used to filter other queries and may not be a standalone query. Okay, this makes sense. In fact, when I do a query now for: contactemail:[EMAIL PROTECTED] specimen The query actually works. Of the 3 documents I indexed only one of them has the contactemail [EMAIL PROTECTED], and so I only got one result back. So your answer there makes total sense. So, my question to you then is, what type of QueryFilter should I develop in order to get my query for contactemail:email address to work as a standalone query? For instance, right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be the right way to do it now. Is there a class in Nutch that I can sub-class to get most of the functionality for doing a type:value query as a standalone query? Thanks for the help. Cheers, Chris Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
Re: developing a parse-/index-/query- plugin set
Hi Doug, Thanks, that worked. Cheers, Chris On 10/17/05 11:56 AM, Doug Cutting [EMAIL PROTECTED] wrote: Chris Mattmann wrote: So, my question to you then is, what type of QueryFilter should I develop in order to get my query for contactemail:<email address> to work as a standalone query? For instance, right now I'm sub-classing the RawFieldQueryFilter, which doesn't seem to be the right way to do it now. Is there a class in Nutch that I can sub-class to get most of the functionality for doing a type:value query as a standalone query? You can simply pass a non-zero boost to the RawFieldQueryFilter constructor, e.g.:

  public class MyQueryFilter extends RawFieldQueryFilter {
    public MyQueryFilter() {
      super("myfield", 1.0f);
    }
  }

Or you can implement QueryFilter directly. There's not that much to it. Doug __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Hi, I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible in a <![CDATA[ ]]> section? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work, and there's no other way to solve the problem without losing text, then your patch has my +1. Cheers, Chris __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory Pasadena, CA Office: 171-266B Mailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: [EMAIL PROTECTED] (JIRA) [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 12, 2005 5:19 PM To: nutch-dev@incubator.apache.org Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ] [EMAIL PROTECTED] updated NUTCH-110: Attachment: fixIllegalXmlChars.patch Attached patch runs all xml text through a check for bad xml characters. This patch is brutal, silently dropping illegal characters. Patch was made after hunting xalan, jdk, and nutch itself for a method that would do the above filtering but was unable to find any such method -- perhaps an oversight on my part? OpenSearchServlet outputs illegal xml characters Key: NUTCH-110 URL: http://issues.apache.org/jira/browse/NUTCH-110 Project: Nutch Type: Bug Components: searcher Versions: 0.7 Environment: linux, jdk 1.5 Reporter: [EMAIL PROTECTED] Attachments: fixIllegalXmlChars.patch OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on search result, it's possible for OSS to output xml that is not well-formed. For example, if text has the FF character in it -- -- i.e. the ascii character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira
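For reference, a sketch of the kind of filtering the attached patch describes (this is not the patch itself): drop everything outside the XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF). Note that a CDATA section would not by itself rescue a character like form feed (#xC), since the Char production excludes it even inside CDATA, which is presumably why the patch drops such characters.

  // Illustrative sketch, not the attached fixIllegalXmlChars.patch.
  public class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isLegalXmlChar(int c) {
      return c == 0x9 || c == 0xA || c == 0xD
          || (c >= 0x20 && c <= 0xD7FF)
          || (c >= 0xE000 && c <= 0xFFFD)
          || (c >= 0x10000 && c <= 0x10FFFF);
    }

    /** Return text with illegal XML characters silently removed. */
    public static String strip(String text) {
      StringBuilder out = new StringBuilder(text.length());
      for (int i = 0; i < text.length(); ) {
        int cp = text.codePointAt(i);   // iterate by code point so surrogate pairs survive
        if (isLegalXmlChar(cp)) {
          out.appendCodePoint(cp);
        }
        i += Character.charCount(cp);
      }
      return out.toString();
    }
  }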
Re: failing of org.apache.nutch.tools.TestSegmentMergeTool?
You know what the crazy thing is: Seemingly, all tests pass now. And I didn't change a thing. Honest. I swear. Very strange, indeed, but I'm happy because at least the tests are passing! :-) Cheers, Chris On 9/27/05 12:29 PM, Paul Baclace [EMAIL PROTECTED] wrote: Chris Mattmann wrote: I just noticed after checking out the latest SVN of Nutch that I am currently failing the TestSegmentMergeTool Junit test when I type ant test for Nutch. I'm on the mapred branch, not the trunk, and all tests pass. One thing I have noticed is that it is best to start with 'ant clean' and if you made any mods to the conf files, rewind them back by copying the x.template files to x. Paul __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion LaboratoryPasadena, CA Office: 171-266BMailstop: 171-246 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
failing of org.apache.nutch.tools.TestSegmentMergeTool?
Hi there, I just noticed after checking out the latest SVN of Nutch that I am currently failing the TestSegmentMergeTool JUnit test when I type ant test for Nutch. Is anyone experiencing the same problem? Here is the relevant information which I captured out of the $NUTCH_HOME/build/test/TEST-org.apache.nutch.tools.TestSegmentMergeTool.txt file: Testsuite: org.apache.nutch.tools.TestSegmentMergeTool Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 46.256 sec - Standard Error - 050926 215316 parsing file:/C:/Program%20Files/eclipse/workspace/nutch/conf/nutch-default.xml 050926 215316 parsing file:/C:/Program%20Files/eclipse/workspace/nutch/build/test/classes/nutch-site.xml 050926 215316 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer 050926 215321 No FS indicated, using default:local 050926 215321 * Opening 10 segments: 050926 215321 - segment seg0: 500 records. 050926 215321 - segment seg1: 500 records. 050926 215321 - segment seg2: 500 records. 050926 215321 - segment seg3: 500 records. 050926 215321 - segment seg4: 500 records. 050926 215321 - segment seg5: 500 records. 050926 215321 - segment seg6: 500 records. 050926 215321 - segment seg7: 500 records. 050926 215321 - segment seg8: 500 records. 050926 215321 - segment seg9: 500 records. 050926 215321 * TOTAL 5000 input records in 10 segments. 050926 215321 * Creating master index... 050926 215328 * Creating index took 6356 ms 050926 215328 * Optimizing index took 0 ms 050926 215328 * Removing duplicate entries... 050926 215328 * Deduplicating took 652 ms 050926 215328 * Merging all segments into output 050926 215333 * Merging took 4381 ms 050926 215333 * Deleting old segments... 050926 215333 Finished SegmentMergeTool: INPUT: 5000 - OUTPUT: 5000 entries in 12.15 s (416.6 entries/sec). 050926 215339 No FS indicated, using default:local 050926 215339 * Opening 10 segments: 050926 215339 - segment seg0: 500 records. 050926 215339 - segment seg1: 500 records. 050926 215339 - segment seg2: 500 records. 050926 215339 - segment seg3: 500 records. 050926 215339 - segment seg4: 500 records. 050926 215339 - segment seg5: 500 records. 050926 215339 - segment seg6: 500 records. 050926 215339 - segment seg7: 500 records. 050926 215339 - segment seg8: 500 records. 050926 215339 - segment seg9: 500 records. 050926 215339 * TOTAL 5000 input records in 10 segments. 050926 215339 * Creating master index... 050926 215344 * Creating index took 5083 ms 050926 215344 * Optimizing index took 0 ms 050926 215344 * Removing duplicate entries... 050926 215344 * Deduplicating took 150 ms 050926 215344 * Merging all segments into output 050926 215345 * Merging took 662 ms 050926 215345 * Deleting old segments... 050926 215345 Finished SegmentMergeTool: INPUT: 5000 - OUTPUT: 500 entries in 6.316 s (833. entries/sec).
java.lang.Exception: Missing or invalid 'fetcher' or 'fetcher_output' directory in c:\DOCUME~1\mattmann\LOCALS~1\Temp\.smttest63088\output\.fastmerge_index at org.apache.nutch.segment.SegmentReader.isParsedSegment(SegmentReader.java:168) at org.apache.nutch.segment.SegmentReader.init(SegmentReader.java:143) at org.apache.nutch.segment.SegmentReader.init(SegmentReader.java:82) at org.apache.nutch.tools.TestSegmentMergeTool.testSameMerge(TestSegmentMergeTool.java:185) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:324) at junit.framework.TestCase.runTest(TestCase.java:154) at junit.framework.TestCase.runBare(TestCase.java:127) at junit.framework.TestResult$1.protect(TestResult.java:106) at junit.framework.TestResult.runProtected(TestResult.java:124) at junit.framework.TestResult.run(TestResult.java:109) at junit.framework.TestCase.run(TestCase.java:118) at junit.framework.TestSuite.runTest(TestSuite.java:208) at junit.framework.TestSuite.run(TestSuite.java:203) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:289) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:523) junit.framework.AssertionFailedError: Missing or invalid 'fetcher' or 'fetcher_output' directory in c:\DOCUME~1\mattmann\LOCALS~1\Temp\.smttest63088\output\.fastmerge_index at junit.framework.Assert.fail(Assert.java:47) at org.apache.nutch.tools.TestSegmentMergeTool.testSameMerge(TestSegmentMergeTool.java:190) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
Re: [Nutch-cvs] [Nutch Wiki] Update of ParserFactoryImprovementProposal by ChrisMattmann
Hi Otis, Point taken. In actuality, since both convey the same information, I think that it's okay to support both, but by default, say, we could code the initial plugins specified in parse-plugins.xml without the order= attribute. Fair enough? Cheers, Chris On 9/15/05 3:23 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Well, you have to tell users about order=N somewhere in the docs. Instead of telling them about order=N, tell them that the order in XML matters. Either case requires education, and the latter one requires less typing and avoids the case described in the proposal. Otis --- Sébastien LE CALLONNEC [EMAIL PROTECTED] wrote: Hi Otis, This issue arose during our discussion for this proposal, and my feeling was that the XML specification doesn't state that the order is significant in an XML file. I therefore read the spec again, and indeed didn't find anything on that subject... I think it is somehow reasonable to consider that a parser _might_ return the elements in a different order—though, as I mentioned to Chris Jerome, that would be quite unheard of, and, to be honest, rather irritating. What do you think? Regards, Sebastien. Quick comment about order=N and the paragraph that describes how to deal with cases where people mess things up and enter multiple plugins for the same content type and the same order: - Why is the order attribute even needed? It looks like a redundant piece of information - why not derive order from the order of plugin definitions in the XML file? For instance: Instead of this: <mimeType name="*"> <plugin id="parse-text" order="1"/> <plugin id="another-one-default-parser" order="2"/> </mimeType> We have this: <mimeType name="*"> <plugin id="parse-text"/> <plugin id="another-one-default-parser"/> </mimeType> parse-text first, another-one-default-parser second. Less typing, and we avoid the case of equal ordering altogether. Otis --- Apache Wiki [EMAIL PROTECTED] wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Nutch Wiki for change notification. The following page has been changed by ChrisMattmann: http://wiki.apache.org/nutch/ParserFactoryImprovementProposal The comment on the change is: Initial Draft of ParserFactoryImprovementProposal New page: = Parser Factory Improvement Proposal = == Summary of Issue == Currently Nutch provides a plugin mechanism wherein plugins register certain metadata about themselves, including their id, classname, and so forth. In particular, the set of parsing plugins register which contentTypes and file suffixes they can support with a PluginRepository. One "adopted practice" in current Nutch parsing plugins (committed in Subversion, e.g., see parse-pdf, parse-rss, etc.) has also been to verify that the content type passed to it during a fetch is indeed one of the contentTypes that it supports (be it application/xml, or application/pdf, etc.). This practice is cumbersome for a few reasons: * Any updates to supported content types for a parsing plugin will require a recompilation of the plugin code * Checking for "hard coded" content types within the parsing plugin is a duplication of information that already exists in the plugin's descriptor file, plugin.xml * By the time that content gets to a parsing plugin (e.g., the parsing plugin is returned by the ParserFactory, and provided content during a fetch), the ParserFactory should have already ensured that the appropriate plugin is getting called for a particular contentType.
In addition to this problem is the fact that several parsing plugins may all support many of the same content types. For instance, the parse-js plugin may be the only well-suited parsing plugin for javascript, but perhaps it also provides a good enough heuristic parser for plain text, and so it may support both types. However, there may be a parsing plugin for text (which there is!), parse-text, whose primary purpose is to parse plain text as well. == Suggested Remedy == To deal with ensuring the desired parsing plugin is called for the appropriate content type, and, in effect, to "kill two birds with one stone", we propose that there be a parsing plugin preference list for each content type that Nutch knows how to handle, i.e., each content type available via the mimeType system. Therefore, during a fetch, once the appropriate mimeType has been determined for content, and the ParserFactory is tasked with returning a parsing plugin, the ParserFactory should consult a preference list for that contentType, allowing it to determine which plugin has the highest preference for the contentType. That parsing plugin should be returned via the ParserFactory to the fetcher (a rough sketch of such a lookup follows this excerpt). If there is any problem using the initial returned parsing plugin for a
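A rough sketch of the preference-list lookup described above: the class and method names are illustrative rather than part of the proposal or the Nutch API, and a real ParserFactory would populate these lists from parse-plugins.xml and resolve plugin ids through the PluginRepository.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Holds, per content type, an ordered list of parsing plugin ids, with the
// order in parse-plugins.xml (or an explicit order attribute) deciding
// preference. A ParserFactory built on this would walk the list and return
// the first plugin it can actually instantiate.
public class ParserPreferenceList {

  // contentType -> plugin ids, most preferred first
  private final Map<String, List<String>> preferences =
      new HashMap<String, List<String>>();

  // Register pluginId as the next-preferred parser for contentType.
  public void addPreference(String contentType, String pluginId) {
    List<String> ids = preferences.get(contentType);
    if (ids == null) {
      ids = new ArrayList<String>();
      preferences.put(contentType, ids);
    }
    ids.add(pluginId);
  }

  // Preference list for contentType, falling back to the wildcard ("*")
  // entry when no exact match is registered.
  public List<String> getPluginList(String contentType) {
    List<String> ids = preferences.get(contentType);
    if (ids == null) {
      ids = preferences.get("*");
    }
    return ids != null ? ids : new ArrayList<String>();
  }
}

Registering parse-text and then another-one-default-parser for the wildcard type would reproduce Otis's example earlier in the thread: parse-text is tried first, the fallback second.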
RE: [jira] Commented: (NUTCH-30) rss feed parser
Hi Folks, In response to Michael's comment, I've gone ahead and uploaded an updated working patch and source distribution for the parse-rss plugin. The latest patch and source work against the new protocol and parsing APIs by Andrzej. The patch was made against the latest SVN from 73005. The patch and source distro are zipped up in the file: parse-rss-73005.zip. Here is a direct link: http://issues.apache.org/jira/secure/attachment/12311475/parse-rss-73005.zip Thanks! Cheers, Chris Mattmann __ Chris A. Mattmann [EMAIL PROTECTED] Staff Member Modeling and Data Management Systems Section (387) Data Management Systems and Technologies Group _ Jet Propulsion Laboratory, Pasadena, CA Office: 171-266B Mailstop: 171-246 Phone: 818-354-8810 ___ Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology. -Original Message- From: Michael Nebel (JIRA) [mailto:[EMAIL PROTECTED] Sent: Wednesday, July 27, 2005 8:42 AM To: [EMAIL PROTECTED] Subject: [jira] Commented: (NUTCH-30) rss feed parser [ http://issues.apache.org/jira/browse/NUTCH-30?page=comments#action_12316928 ] Michael Nebel commented on NUTCH-30: I loaded the latest sources from the svn yesterday and tried to integrate this plugin (I used the Zip from Hasan). I found: - getParse throws a ParseException which isn't supported by getParse - the call to new ParseData needs a new parameter ParseStatus My fixes are far from perfect (I have just identified the problems for now), so I'm not creating a patch. :-( rss feed parser --- Key: NUTCH-30 URL: http://issues.apache.org/jira/browse/NUTCH-30 Project: Nutch Type: Improvement Components: fetcher Reporter: Stefan Groschupf Assignee: Chris A. Mattmann Priority: Minor Attachments: RSSParserPatch.txt, RSS_Parser.zip, parse-rss-1.0-040605.zip, parse-rss-patch.txt, parse-rss-srcbin-incl-path.zip, parse-rss.zip, parseRss.zip A simple rss feed parser supporting: rss and atom: + version 0.3 + version 0.9 + version 1.0 + version 2.0 Converting of different rss versions is done via xslt. The xslt was contributed by Frank Henze - Thanks! -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira