[VOTE] Apache Nutch 1.1 Release Candidate #3

2010-05-08 Thread Mattmann, Chris A (388J)
Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The
source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc3/

The major differences between this release and rc #2 are the application of:
NUTCH-816, NUTCH-732, NUTCH-815, NUTCH-814, and NUTCH-812 based on feedback
from prior release candidates.

For more detailed information on release contents and latest changes, see the
included CHANGES.txt file. The release was made using the Nutch release
process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/


In response to several user requests during the last RC cycle, I've also
included *binary* releases (labeled as apache-nutch-1.1-bin.tar.gz and
apache-nutch-1.1-bin.zip). This addresses Sami Siren's request that the
tutorial be updated to reflect the fact that this release is a source-only
release.

Sami also requested that RAT be integrated into the build; however, in the
interest of getting this 1.1 out and getting going on the Nutch TLP, my
proposal is:

* run RAT and integrate into the build on releases post 1.1



Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours.

Only votes from Nutch PMC are binding, but folks are welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...
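
As a rough illustration of the voting rule above, the tally could be sketched
as follows. This is only a minimal sketch, not part of the Nutch release
process: the function name vote_passes, the voter names, and the data layout
are all hypothetical, and it checks only the threshold stated in this email
(at least three binding +1 votes).

```python
def vote_passes(votes, binding_voters, required=3):
    """Return True if at least `required` binding +1 votes were cast.

    votes: dict mapping voter name -> "+1" or "-1" (hypothetical layout).
    binding_voters: set of names whose votes are binding (here, PMC members).
    """
    binding_plus_ones = sum(
        1 for voter, vote in votes.items()
        if vote == "+1" and voter in binding_voters
    )
    return binding_plus_ones >= required


# Hypothetical example: three binding +1 votes and one non-binding +1.
pmc = {"alice", "bob", "carol"}
votes = {"alice": "+1", "bob": "+1", "carol": "+1", "dave": "+1"}
print(vote_passes(votes, pmc))  # prints True
```

Non-binding votes are simply ignored by the count, which matches the note
above that anyone may vote but only PMC votes are binding.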

Thanks!

Cheers,
Chris

P.S. Here is my +1.

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)
Hi Matthew,

I think Julien may have a fix for this in TIKA-379 [1]. I’ll take a look at
Julien’s patch and see if there is a way to get it committed sooner rather
than later.

One way to help me do that: since you already have an environment and set of
use cases where this is reproducible, can you apply TIKA-379 to a local
checkout of Tika trunk (I’ll show you how) and then let me know if that
fixes parse-tika for you?

Here are the steps:

svn co http://svn.apache.org/repos/asf/lucene/tika/trunk ./tika
cd tika
wget "http://bit.ly/bXeLkf"
(if you don't have SSL support, then manually download the linked file)
patch -p0 < TIKA-379-3.patch
mvn install package

Then grab the tika-core and tika-parsers jars out of their respective
tika-core/target and tika-parsers/target directories and drop them in your
parse-tika/lib folder, replacing the originals. Then try your nutch crawl
again.

See if that works. In the meanwhile, I'll inspect Julien's patch.
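
The jar replacement step described above could be scripted roughly like this.
It is a minimal sketch under stated assumptions: the function name
replace_jars is hypothetical, the jar file names are assumed to follow the
usual <module>-<version>.jar pattern, and the location of your parse-tika/lib
folder depends on your Nutch installation.

```python
import glob
import os
import shutil


def replace_jars(tika_root, plugin_lib, modules=("tika-core", "tika-parsers")):
    """Copy freshly built Tika jars into the parse-tika lib folder.

    For each module, any existing <module>-*.jar in plugin_lib is removed
    and the jar(s) found in <tika_root>/<module>/target are copied in.
    Returns the list of file names copied.
    """
    copied = []
    for module in modules:
        pattern = os.path.join(tika_root, module, "target", module + "-*.jar")
        built = sorted(glob.glob(pattern))
        if not built:
            raise FileNotFoundError("no built jar found for " + module)
        # Remove the originals only once we know a replacement exists.
        for old in glob.glob(os.path.join(plugin_lib, module + "-*.jar")):
            os.remove(old)
        for jar in built:
            # Note: this also copies any extra jars (e.g. sources) in target.
            shutil.copy(jar, plugin_lib)
            copied.append(os.path.basename(jar))
    return copied
```

Usage would be something like replace_jars("./tika", "<your parse-tika/lib
folder>"), run after the Maven build has finished. Keeping a backup copy of
the original jars first is probably wise.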

Thanks!

Cheers,
Chris

On 5/4/10 9:02 PM, "matthew a. grisius"  wrote:

> Hi Chris,
> 
> It appears to me that parse-tika has trouble with HTML FRAMESETS/FRAMES
> and/or javascript. Using the parse-html suggested work around I am able
> to process my simple test cases such as javadoc which does include
> simple embedded javascript (of course I can't verify that it is actually
> parsing it though). I expanded my testing to include two more complex
> examples that heavily use HTML FRAMESET/FRAME and more complex
> javascript:
> 
> 134 mb, 11,269 files
> 1.9 gb, 133,978 files
> 
> They both fail at the top level with the similar errors such as:
> 
> fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js
> fetching http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocQuickRefs/DSDocBanner.htm
> -finishing thread FetcherThread, activeThreads=8
> -finishing thread FetcherThread, activeThreads=7
> -finishing thread FetcherThread, activeThreads=9
> -finishing thread FetcherThread, activeThreads=6
> -finishing thread FetcherThread, activeThreads=5
> -finishing thread FetcherThread, activeThreads=4
> -finishing thread FetcherThread, activeThreads=3
> Error parsing: http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js: UNKNOWN!(-56,0): Can't retrieve Tika parser for mime-type text/javascript
> Attempting to finish item from unknown queue: org.apache.nutch.fetcher.fetcher$fetchi...@1532fc
> fetch of http://192.168.1.101:8080/technical/general/CAADoc/online/CAADocJavaScript/DSDocCommon.js failed with: java.lang.ArrayIndexOutOfBoundsException: -56
> -finishing thread FetcherThread, activeThreads=2
> 
> I tried several property settings to mimic the previous work around and
> could not solve it. Any suggestions?
> 
> So, I'm not sure how to categorize the issues more accurately. I have
> many javadoc sets and lots of simple HTML that will now parse, but I
> have other examples such as the two mentioned above that won't parse and
> therefore can't be crawled. It seems to me to be systematic rather than
> exceptional. I cannot believe that I'm the only one who will experience
> these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks
> for asking.
> 
> -m.

Re: nutch crawl issue

2010-05-03 Thread Mattmann, Chris A (388J)
Hi Matthew,

Awesome! Glad it worked. Now my next question: how often are you seeing
that parse-tika doesn't work on HTML files? Is it all HTML that you are
trying to process? Or just some of them? Or particular ones (categories of
them). The reason I ask is that I'm trying to determine whether I should
commit the update below to 1.1 so it goes out with the 1.1 RC and if it's a
systematic thing versus an exception.

Let me know and thanks!

Cheers,
Chris


On 5/3/10 9:04 AM, "matthew a. grisius"  wrote:

> Hi Chris,
> 
> Yes, that worked. I caught up on email and noticed that Arpit also
> mentioned the same thing. Sorry I missed it, thanks to both of you!
> 
> -m.
> 
> On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote:
>> Hi Matthew,
>> 
>>>> Hi Matthew,
>>>> 
>>>> There is an open issue with Tika (e.g.
>>>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>>>> differences between parse-html and parse-tika. Note that you can specify
>>>> *parse-(html|pdf)* in order to get both HTML and PDF files.
>>> 
>>> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
>>> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
>>> PDFs, but has problems with some html. Nutch 1.1 includes more current
>>> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
>> 
>> Interesting: well one solution comes to mind. Can you test this out?
>> 
>> * uncomment the lines:
>> 
>> 
>> 
>> 
>> 
>> In conf/parse-plugins.xml.
>> 
>> * try your crawl again.
>> 
>>> 
>>> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
>>> with the attached file
>> 
>> Thanks! Let me know what happens after you uncomment the line above.
>> 
>> Cheers,
>> Chris

Re: nutch crawl issue

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Matthew,

>> Hi Matthew,
>> 
>> There is an open issue with Tika (e.g.
>> https://issues.apache.org/jira/browse/TIKA-379) that could explain the
>> differences between parse-html and parse-tika. Note that you can specify
>> *parse-(html|pdf)* in order to get both HTML and PDF files.
> 
> The reason that I am trying Nutch 1.1 is that parse-pdf for Nutch 1.0
> rejects fully 10% of my PDFs. Nutch 1.1 parse-tika parses all of my
> PDFs, but has problems with some html. Nutch 1.1 includes more current
> PDFBox jar files, e.g. 1.1.0, whereas Nutch 1.0 includes 0.7.4.
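
To illustrate the parse-(html|pdf) suggestion quoted above: Nutch selects
plugins by matching plugin ids against a regular expression (the
plugin.includes property). The sketch below shows which of the parser ids
named in this thread the quoted fragment would select. It is not Nutch code;
the helper name is_included is hypothetical, and it assumes an id must match
the whole expression.

```python
import re

# Fragment quoted in the thread; a real plugin.includes value would list
# other plugins as well (assumption: ids must match the whole expression).
PLUGIN_INCLUDES = r"parse-(html|pdf)"


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if plugin_id matches the whole include expression."""
    return re.fullmatch(pattern, plugin_id) is not None


for plugin_id in ("parse-html", "parse-pdf", "parse-tika"):
    print(plugin_id, is_included(plugin_id))
```

With this fragment, parse-html and parse-pdf are selected and parse-tika is
not, which is why it sidesteps the Tika HTML problem but also gives up the
newer PDF parsing that comes with parse-tika.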

Interesting: well one solution comes to mind. Can you test this out?

* uncomment the lines:





In conf/parse-plugins.xml.

* try your crawl again.

> 
> I submitted NUTCH-817 https://issues.apache.org/jira/browse/NUTCH-817
> with the attached file

Thanks! Let me know what happens after you uncomment the line above.

Cheers,
Chris


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Mattmann, Chris A (388J)
Hi Phil,

Thanks for your comments. Mine below:

>> Unfortunately some parts of the documentation on Nutch (namely the
>> tutorial, and other parts of the static site) have been out of date for a
>> while. This has occurred really independent of the releases, and
>> independent of the wiki [1], which hasn't really fallen out of date as
>> quickly.
>> 
> 
> While documentation may not be part of the code, it's certainly part of the
> project. And it's just as important as the code. Yes, I know that
> documentation is the bane of programmers everywhere. I'm a coder. I get it.
> But when you change the way things work in a fundamental way that leaves all
> of  your documentation behind, it's time to spend some time on it.

Sure. So, what fundamental way has Nutch changed from 1.0 to 1.1? Can you
elaborate? Also, in terms of spending time on Nutch's documentation, I'll
try to as I get more time (as I'm sure other committers will as well), but
I'd also say: if there's something to be improved, by all means, go for it,
and patches welcome to contribute it back.

> 
> 
>>> 
>>> For example, my find of broken code in bin/nutch crawl, a most basic way
>>> of getting it running.
>> 
>> Can you elaborate on your find of broken code? Did you file a JIRA issue
>> for this in the Nutch JIRA system [2] ?
>> 
> 
> Yes, it led to another release. The bug fix I contributed was incorporated.

Great!

>> 
>> The more information you provide here about your environment and your
>> situation that caused the error, as well as e.g., detailed information (a
>> stack trace, an exception, something), the easier it is to track down what
>> you're seeing.
>> 
> 
> Yes, that was all in the unanswered emails. It would be easier for you to
> search your inbox than for me to send it all over again.

I wouldn't assume that the inboxes of folks watching the list are always
centered on the Nutch mailing lists. Realize that many of us are subscribed
to several mailing lists, and sometimes, emails go unanswered for a while.

> 
>> That said, one thing to realize is that this is open source software, so in
>> the end, as they say in Apache, "those that do, decide", or "patches
>> welcome!" In other words, if there are things that you see that could be
>> fixed, improved, made more configurable, etc., including the code, but
>> *also the documentation*, then by all means we'd appreciate your feedback
>> and contribution. Nutch is not simply a product of the developers that
>> contribute their (potentially and often unsalaried) time to work on it, but
>> of its user community as well.
>> 
> 
> I've been the leader of a major open source project for over 10 years. Last
> fall I relinquished the reins of that project to a new project leader, so I
> think I know how it works. We wrote an open source cross platform compiler
> for xBase (Clipper) code named Harbour Project, now in release 2.0.
> 
> That would be why I not only raised the flag that it's not ready to release,
> but I tracked down a bug and submitted a bug fix.
> 
> And I'm still saying it's not ready to release. There's still another bug
> that I have found that goes unanswered.

Right, so then you know that bugs aren't just "bugs" -- they must come with
a priority. There are several categories, "High", "Medium", "Critical", or
"Blocker", just to name a few.

When I cut a release as the Release Manager (RM), I always run unit tests
and try and at least run a basic crawl first before cutting the RC. So,
hopefully that catches anything that would be a big problem, but sometimes
even that process breaks down since not everyone has e.g., large scale
deployments, or maybe we're missing a unit test we need, etc. With ~10
releases of Nutch to date, and many, many features, I'd say we have fairly
decent regression coverage.

>> 
>> In certain cases you are right, but I wouldn't take your above comments as
>> verbatim across the board. For example, if you believe there is
>> documentation lacking, then the first step is typically to file JIRA issues
>> to alert committers and other users of Nutch of your concern and then have
>> discussion on the lists regarding the issues. At some point a patch is
>> produced, and then attached to the issue, where the committers can review
>> the patches and then work to get them committed to the code base.
>> 
>> Nutch has a number of unit tests for regression that ship with the product
>> that tell me that it's not broken, and users that are able to make it work
>> in their environments. There have been some recent bug fixes in the 1.1 RC
>> that we caught which have been fixed (NUTCH-812, NUTCH-814, etc.), but
>> that's natural.
>> 
> 
> No, not we. Me. I found a bug, told you about it and provided the fix.
> Before I did that, I told you that your release candidate was broken. Just
> like I'm still saying, unless I'm doing something grossly wrong, it's still
> broken.

Right, gotcha. I didn't map that you had been the guy that contributed the
patch. Thanks

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Mattmann, Chris A (388J)
Hi Matthew,

Thanks for your feedback. If you have any specific updates, improvements, or
actionable items based on your comments below, we'd love to have you contribute
them back to the community. Otherwise, we will take your feedback and put it
into the queue of other items in the Nutch issue tracking system for those who
are committers on the project to work on, as time permits.

Apache has a process for meritocracy [1] in terms of contributing to projects 
and being recognized for those contributions - we welcome feedback and 
actionable things in the forms of patches that improve the code, documentation, 
add new features, etc., while maintaining backwards compatibility with existing 
deployments and existing users.

Thanks and hope to see some issues/feedback/patches continue to come!

Cheers,
Chris

[1] http://www.apache.org/foundation/how-it-works.html#meritocracy

On 4/28/10 7:27 AM, "matthew a. grisius"  wrote:

I also share many of Phil's sentiments. I really want the project
(bin/nutch crawl) to work for me as well and I want to help somehow. I
would like to share a 5gb 'intranet' web site with ~50 people. And I
have not graduated to making the 'deepcrawl' script work yet either, as
I'm thinking that maybe Nutch might not be the 'right tool' for 'little
projects' based on documentation, discussion list feedback, etc. . . .

-m.

On Wed, 2010-04-28 at 06:59 -0400, Phil Barnett wrote:
> On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
> >
> > Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> > open for the next 72 hours.
> >
>
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between how existing documentation tells me,
> a newcomer to the project, how to get it running.
>
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.
>
> And I have yet to get the deepcrawl script which seems to be the suggestion
> of how to get beyond bin/nutch crawl. It doesn't return any data at all and
> has an error in the middle of its run regarding a missing file which the last
> stage apparently failed to write. (I believe because the scheduler excluded
> everything)
>
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.
>
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.
>
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project as seen from the outside, is plainly broken.
>
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with and going is slow weeding through things. I really need
> this project to work for me. I want to help.
>
> 1. Where is the scheduler documented? If I want to crawl everything from
> scratch, where is the information from the last run stored? It seems like
> the schedule is telling my crawl to ignore pages due to scheduler knocking
> them out. It's not obvious to me why this is happening and how to stop it
> from happening. I think right now this is my major roadblock in getting
> bin/nutch crawl working. Maybe the scheduler code no longer works properly
> in bin/nutch crawl. I can't tell if it's that or if the default
> configurations don't work.
>
> 2. Where are the control files in conf documented? How do I know which ones
> do what and when? There's a half dozen *-urlfilters. Why?
>
> 3. Why doesn't your post nightly compile tests include bin/nutch crawl or if
> it does, why didn't it find the error that stopped it from running?
>
> 4. Where is the documentation on how to configure the new tika parser in
> your environment? I see that the old parsers have been removed by default,
> but there's nothing that shows me how to include/exclude document types.
>
> I believe your assessment of 'ready' is not inclusive of some very important
> things and that you would be doing a service to newcomers to bring
> documentation in line with current offerings. This is not trivial code and
> it takes a long time for someone from the outside to understand it. That

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Mattmann, Chris A (388J)
Hi Phil,

Thanks very much for the feedback. I'd like to take a second to address your
points:

> 
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between how existing documentation tells me,
> a newcomer to the project, how to get it running.

Unfortunately some parts of the documentation on Nutch (namely the tutorial,
and other parts of the static site) have been out of date for a while. This
has occurred really independent of the releases, and independent of the wiki
[1], which hasn't really fallen out of date as quickly.

> 
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.

Can you elaborate on your find of broken code? Did you file a JIRA issue for
this in the Nutch JIRA system [2] ?

> 
> And I have yet to get the deepcrawl script which seems to be the suggestion
> of how to get beyond bin/nutch crawl. It doesn't return any data at all and
> has an error in the middle of its run regarding a missing file which the last
> stage apparently failed to write. (I believe because the scheduler excluded
> everything)

The more information you provide here about your environment and your
situation that caused the error, as well as e.g., detailed information (a
stack trace, an exception, something), the easier it is to track down what
you're seeing.

> 
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.

I wouldn't say developers have advanced beyond anything really for that
matter :) The number of active developers in Nutch these days is fairly
small, but interest and the user community is stable and there are some
pretty large scale deployments of Nutch to my knowledge. That said, those
folks have been following the mailing lists for a while, have been using the
software for a while and thus their level of entry into the documentation
may be at a little higher bar than that of a newer user such as yourself.

That said, one thing to realize is that this is open source software, so in
the end, as they say in Apache, "those that do, decide", or "patches
welcome!" In other words, if there are things that you see that could be
fixed, improved, made more configurable, etc., including the code, but *also
the documentation*, then by all means we'd appreciate your feedback and
contribution. Nutch is not simply a product of the developers that
contribute their (potentially and often unsalaried) time to work on it, but
of its user community as well.

> 
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.

I apologize that your questions have gone unanswered and that you're hitting
walls with regards to using Nutch. What questions did you ask? Perhaps it's
the detail that you are providing (or not providing), or perhaps it's the
way you're asking the questions. Or (even more likely) it's the fact that
this is an open source project, so answering user emails is just one of
multiple items on the committers' plates, and we may have missed your
question. Or perhaps those that had the time weren't experts in the one area
of Nutch that you were asking about. There could be a number of reasons.
Regardless, persistence is key, as are *patience* and respectfulness. This has
always, to my knowledge, been a really friendly community, so if you hang
around and keep asking questions, I'm confident they will get answered.

> 
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project as seen from the outside, is plainly broken.

In certain cases you are right, but I wouldn't take your above comments as
verbatim across the board. For example, if you believe there is
documentation lacking, then the first step is typically to file JIRA issues
to alert committers and other users of Nutch of your concern and then have
discussion on the lists regarding the issues. At some point a patch is
produced, and then attached to the issue, where the committers can review
the patches and then work to get them committed to the code base.

Nutch has a number of unit tests for regression that ship with the product
that tell me that it's not broken, and users that are able to make it work
in their environments. There have been some recent bug fixes in the 1.1 RC
that we caught which have been fixed (NUTCH-812, NUTCH-814, etc.), but
that's natural.  

> 
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with and go

Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
Hey Andrzej,

> Actually, we don't have a build target (yet) that produces a binary-only
> distribution that we can ship and which you can run out of the box (not
> counting the build/nutch.job alone, because it needs the Hadoop
> infrastructure to run).

I thought ant tar did this? That's what it sez on the release guide [1] and
what I'm familiar with when I did the Nutch 0.9 release.

> 
> The current mixed (source+binary) distribution worked well enough so
> far, but the size of the distribution is becoming a concern, hence the
> idea to ship only the source. We may have been too hasty with that,
> though... What do others think?

Good question, Andrzej. I'll wait for feedback from others. My pref is for
source-only, but I might be in the minority. :)

Cheers,
Chris

[1] http://wiki.apache.org/nutch/Release_HOWTO


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
Hi Grant,

Thanks. I think it actually makes sense to finish off 1.1. Since there is
overlap between the Nutch PMC and the Lucene PMC, and since the thread started
in Lucene before the TLP, I think it would be great if, e.g., Andrzej and Sami
could check the release. That way we still have the continuity and can safely
push it out as the last Nutch rel under the Lucene umbrella...

Then all releases post 1.1 can cleanly be done under the auspices of the new 
PMC :)

Cheers,
Chris


On 4/26/10 5:34 AM, "Grant Ingersoll" wrote:

Might I suggest, that since Nutch is now a TLP that you delay this release by a 
few weeks and have the vote done under the auspices of the Nutch PMC?

Cheers,
Grant

On Apr 26, 2010, at 1:55 AM, Mattmann, Chris A (388J) wrote:

> Hi Folks,
>
> I have posted an updated candidate for the Apache Nutch 1.1 release. The
> source code is at:
>
> http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/
>
> The major difference between this release and rc #1 is the application of
> NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
> as well as some commits by Sami Siren to fix missing ASL license headers.
>
> For more detailed information, see the included CHANGES.txt file for details
> on release contents and latest changes. The release was made using the Nutch
> release process, documented on the Wiki here:
>
> http://bit.ly/d5ugid
>
> A Nutch 1.1 tag is at:
>
> http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/
>
> 
> There was a request by Sami Siren that the tutorial be updated to reflect
> the fact that this release is a source-only release, as well as a request to
> integrate RAT into the build, however, in the interest of getting this 1.1
> out and getting going on the Nutch TLP, my proposal is:
>
> * update the docs independent of this release (the tutorial as it exists
> right now says 0.7 on it anyways and doesn't look like it's been updated in
> a while, so I think users can live with what's there and support on
> u...@nutch.apache.org or d...@nutch.apache.org until it's updated)
>
> * begin source only releases in general since we've long had the debate as
> to the size of the Nutch release. Most folks that use Nutch are likely
> familiar with running ant IMHO.
>
> * run RAT and integrate into the build
>
> 
>
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> open for the next 72 hours.
>
> Since Nutch is now a TLP and has its own PMC, there is a question of who are
> the binding release VOTES in this particular thread. My gut reaction is that
> since I started this release while we were under the Lucene PMC, for
> continuity purposes, only votes from Lucene PMC are binding, but everyone
> (especially newly minted Nutch PMC members!) is welcome to check the
> release candidate and voice their approval or disapproval. The vote passes
> if at least three binding +1 votes are cast.
>
> [ ] +1 Release the packages as Apache Nutch 1.1.
>
> [ ] -1 Do not release the packages because...
>
> Thanks!
>
> Cheers,
> Chris
>
> P.S. Here is my +1.

Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
Hi David,

Thanks. In fact, running ant is probably simpler than running Nutch. The steps
would be:

 * what OS are you on (Ant is available for all of them to my knowledge)?
 * if you need ant, grab a distro from ant.apache.org, otherwise, I'll assume
   that you've got ant installed and callable from the command line.
 * unpack the nutch src distribution, cd into that directory, type "ant job",
   and there you go.

HTH! You could try it out by taking the Nutch src code from SVN at: 
http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1, and then trying the 
steps above.

Cheers,
Chris


On 4/26/10 7:24 AM, "David M. Cole"  wrote:

At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote:
>Most folks that use Nutch are likely
>familiar with running ant IMHO.

I guess then I fall into the category of "not most folks." Have been
running Nutch for about 14 months and I haven't a clue how to run ant.

If there's a place to vote to suggest that compiled versions still be
distributed, I vote for that.

Thanks.

\dmc

--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole                                       d...@colegroup.com
Editor & Publisher, NewsInc. <http://newsinc.net>   V: (650) 557-2993
Consultant: The Cole Group <http://colegroup.com/>  F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+




[VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-25 Thread Mattmann, Chris A (388J)
Hi Folks,

I have posted an updated candidate for the Apache Nutch 1.1 release. The
source code is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/

The major difference between this release and rc #1 is the application of
NUTCH-812 - Crawl.java incorrectly uses the Generator API resulting in NPE -
as well as some commits by Sami Siren to fix missing ASL license headers.

For more detailed information, see the included CHANGES.txt file for details
on release contents and latest changes. The release was made using the Nutch
release process, documented on the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/


Sami Siren requested that the tutorial be updated to reflect the fact that
this release is a source-only release, and that RAT be integrated into the
build. However, in the interest of getting this 1.1 out and getting going on
the Nutch TLP, my proposal is:

* update the docs independent of this release (the tutorial as it exists
right now says 0.7 on it anyways and doesn't look like it's been updated in
a while, so I think users can live with what's there and support on
u...@nutch.apache.org or d...@nutch.apache.org until it's updated)

* begin source only releases in general since we've long had the debate as
to the size of the Nutch release. Most folks that use Nutch are likely
familiar with running ant IMHO.

* run RAT and integrate into the build



Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours.

Since Nutch is now a TLP and has its own PMC, there is a question of whose
VOTEs are binding in this particular thread. My gut reaction is that
since I started this release while we were under the Lucene PMC, for
continuity purposes, only votes from Lucene PMC are binding, but everyone
(especially newly minted Nutch PMC members!) is welcome to check the
release candidate and voice their approval or disapproval. The vote passes
if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris

P.S. Here is my +1.






Re: About Apache Nutch 1.1 Final Release

2010-04-17 Thread Mattmann, Chris A (388J)
Hey Andrzej,

You got it. I got bogged down yesterday but will apply this patch (was going to 
ask you about it) before I roll the RC.

Safe travels buddy!

Cheers,
Chris


On 4/16/10 11:55 PM, "Andrzej Bialecki" wrote:

On 2010-04-17 05:45, Phil Barnett wrote:
> On Sat, 2010-04-10 at 18:22 +0200, Andrzej Bialecki wrote:
>
>> More details on this (your environment, OS, JDK version) and
>> logs/stacktraces would be highly appreciated! You mentioned that you
>> have some scripts - if you could extract relevant portions from them (or
>> copy the scripts) it would help us to ensure that it's not a simple
>> command-line error.
>
> I posted another thread tonight with the fixed code.

See here: https://issues.apache.org/jira/browse/NUTCH-812

>
> Can you please commit it for all of us?

I'm traveling today ... Chris, can you perhaps apply the patch before
you roll another RC?

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com







Re: About Apache Nutch 1.1 Final Release

2010-04-08 Thread Mattmann, Chris A (388J)
Hi there,

Well as soon as we have 3 +1 binding VOTEs. Right now I'm the only PMC member 
that's VOTE'd +1 on the release.

Hopefully in the next few days someone will have a chance to check...

Cheers,
Chris


On 4/8/10 8:54 PM, "yhdelgado" wrote:

Hi. I have a question. When will the Apache Nutch 1.1 final release be
released? Greetings.
--
View this message in context: 
http://n3.nabble.com/About-Apache-Nutch-1-1-Final-Release-tp707586p707586.html
Sent from the Nutch - User mailing list archive at Nabble.com.






Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-07 Thread Mattmann, Chris A (388J)
Hi,

This is a VOTE thread. Please do not post your user question on this thread as 
we are VOTE'ing on a particular release.

You can re-post a new thread with your question, and I would highly encourage 
it.

Thanks!

Cheers,
Chris



On 4/7/10 6:26 PM, "cefurkan0 cefurkan0" wrote:

Hi folks,
Do you know how I can save the parsed text while crawling?
How can I do this?

Thanks

On 7 April 2010 20:11, tsmori wrote:

>
> I'm not sure what exactly changed that made all my nullpointer errors go
> away, but I'm grateful for it, whatever it was.
>
> So, +1 from me, not that I'm even sure I get a vote in the matter, but if
> it's open to anyone on the list, I'm on board.
>
>
> --
> View this message in context:
> http://n3.nabble.com/VOTE-Apache-Nutch-1-1-Release-Candidate-1-tp702135p703534.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>






Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-06 Thread Mattmann, Chris A (388J)
Oh, per usual, forgot to throw in my +1. So, +1!

Cheers,
Chris


On 4/7/10 1:14 AM, "Mattmann, Chris A (388J)" wrote:

Hi Folks,

I have posted a candidate for the Apache Nutch 1.1 release. The source code
is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/

See the included CHANGES.txt file for details on release contents and latest
changes. The release was made using the Nutch release process, documented on
the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours. Only votes from Lucene PMC are binding, but
everyone is welcome to check the release candidate and voice their approval
or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris









[VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-06 Thread Mattmann, Chris A (388J)
Hi Folks,

I have posted a candidate for the Apache Nutch 1.1 release. The source code
is at:

http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/

See the included CHANGES.txt file for details on release contents and latest
changes. The release was made using the Nutch release process, documented on
the Wiki here:

http://bit.ly/d5ugid

A Nutch 1.1 tag is at:

http://svn.apache.org/repos/asf/lucene/nutch/tags/1.1/

Please vote on releasing these packages as Apache Nutch 1.1. The vote is
open for the next 72 hours. Only votes from Lucene PMC are binding, but
everyone is welcome to check the release candidate and voice their approval
or disapproval. The vote passes if at least three binding +1 votes are cast.

[ ] +1 Release the packages as Apache Nutch 1.1.

[ ] -1 Do not release the packages because...

Thanks!

Cheers,
Chris





Re: [VOTE] Nutch to become a top-level project (TLP)

2010-04-01 Thread Mattmann, Chris A (388J)
Hi Andrzej,

+1 from me.

Cheers,
Chris



On 4/1/10 10:23 AM, "Andrzej Bialecki" wrote:

Hi all,

According to an earlier [DISCUSS] thread on the nutch-dev list I'm
calling for a vote on the proposal to make Nutch a top-level project.

To quickly recap the reasons and consequences of such move: the ASF
board is concerned about the size and diversity of goals across various
subprojects under the Lucene TLP, and suggests that each subproject
should evaluate whether becoming its own TLP would better serve the
project itself and the Lucene TLP.

We discussed this issue and expressed opinions that ranged from positive
(easier management, better exposure, better focus on the mission, not
really dependent on Lucene development) to neutral (no significant
reason, only political change) to moderately negative (increased admin
work, decreased exposure).

Therefore, the proposal is to separate Nutch from under Lucene TLP and
form a top-level project with its own PMC, own svn and own site.

Please indicate one of the following:

[ ] +1 - yes, I vote for the proposal
[ ] -1 - no, I vote against the proposal (because ...)

(Please note that anyone in the Nutch community is invited to express
their opinion, though only Nutch committers cast binding votes.)

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com







Re: need your support

2010-01-20 Thread Mattmann, Chris A (388J)
Hi Sahar,

Can you post your:


 1.  crawl-urlfilter
 2.  nutch-site.xml

Also how are you running this program below?

I'm CC'ing nutch-user@ so the community can benefit from this thread.

Cheers,
Chris



On 1/20/10 1:42 PM, "sahar elkazaz" wrote:


Dear Sir,

I have followed all the steps in your article to run Nutch,

and I use this Java program to access the segments:

package nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.Summary;
import org.apache.nutch.util.NutchConfiguration;

public class nutch {
  /** For debugging. */
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse("animal" + "", conf);
    Hits hits = bean.search(query, 10);
    System.out.println("Total hits: " + hits.getTotal());
    int length = (int) Math.min(hits.getTotal(), 10);
    Hit[] show = hits.getHits(0, length);
    HitDetails[] details = bean.getDetails(show);
    Summary[] summaries = bean.getSummary(details, query);
 for ( int i = 0; i (FileSystem.java:1438)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
    at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:89)
    at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
    at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
    at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
    at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:50)
    at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
    at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
    at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 INFO searcher.SearchBean: opening indexes in crawl/indexes
10/01/20 22:29:28 WARN fs.FileSystem: uri=file:///
javax.security.auth.login.LoginException: Login failed:
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
    at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
    at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
    at org.apache.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:59)
    at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:77)
    at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:51)
    at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
    at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
    at nutch.nutch.main(nutch.java:25)
10/01/20 22:29:28 INFO plugin.PluginRepository: Plugins: looking in: D:\nutch-1.0\plugins
10/01/20 22:29:28 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
10/01/20 22:29:28 INFO plugin.PluginRepository: Registered Plugins:
10/01/20 22:29:28 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
10/01/20 22:29:28 INFO plugin.PluginRepository: Basic Query Filter (query-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
10/01/20 22:29:28 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
10/01/20 22:29:28 INFO plugin.PluginRepository: Site Query Filter (

Re: Nutch & Lucene Installation Instructions

2010-01-06 Thread Mattmann, Chris A (388J)
Hi Ken,

My guess is that your URL filter isn't accepting the URLs that are being 
fetched, so no content is being indexed. You should check your 
$NUTCH_HOME/conf/crawl-urlfilter.txt file and make sure the defaults are 
changed to match your expectations of the sites you are going to crawl.
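
One way to check this is to test a seed URL against the rules by hand. The sketch below is only an approximation, assuming the filter file holds one rule per line, each a + (accept) or - (reject) followed by a regular expression, and that the first matching rule decides. It uses grep -E, whose syntax differs from Java regular expressions in places, so treat the result as a hint only. The function name url_accepted is made up for this example.

```shell
#!/bin/sh
# Sketch only: approximates the filter with grep -E; not Nutch's own matcher.

# Succeed if the first rule in the rules file that matches the URL starts with "+".
# A URL that matches no rule is treated as rejected (an assumption of this sketch).
url_accepted() {
    url="$1"
    rules="$2"
    while IFS= read -r line; do
        # Skip blank lines and comments.
        case "$line" in
            ''|'#'*) continue ;;
        esac
        sign=$(printf '%s' "$line" | cut -c1)
        regex=$(printf '%s' "$line" | cut -c2-)
        if printf '%s\n' "$url" | grep -Eq -- "$regex"; then
            [ "$sign" = "+" ]
            return
        fi
    done < "$rules"
    return 1
}

# Usage:
#   url_accepted 'http://lucene.apache.org/nutch/' "$NUTCH_HOME/conf/crawl-urlfilter.txt" \
#       && echo accepted || echo rejected
```

If the seed URL comes back rejected, the rule that matched first is the place to look.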

One more thing: you should consider asking your question on
nutch-user@lucene.apache.org (which I'm CC'ing on my reply) so that others can
benefit from your question(s).

Best of luck!

Cheers,
Chris


On 1/4/10 5:32 PM, "Ken Ly" wrote:

http://www-scf.usc.edu/~csci572/2007Spring/nutch_lucene_installation.html

Hello Chris,

I read your instruction to install Nutch 1.0.  I believe I got everything 
correct, but when I did a search on "apache" I couldn't get any result to show 
up.

ant and ant war were successful.

bin/nutch crawl urls/seed -dir crawl -depth 3 -topN 50
rm -rf /opt/tomcat/webapps/ROOT*
cp nutch*.war /opt/tomcat/webapps/ROOT.war
/opt/tomcat/bin/catalina.sh start

I went to the site to do a search for "apache", but I didn't get any result for 
"apache"

Do you have any clue where I messed up?

- Ken

--
# /opt/nutch-1.0/urls/seed
http://lucene.apache.org/nutch/

# /opt/nutch-1.0/build/nutch.xml



# /opt/nutch-1.0/conf/crawl-urlfilter.txt
+^http://([a-z0-9]*\.)*apache.org

# echo $JAVA_HOME
/usr/java/jdk1.6.0_17


# echo $NUTCH_JAVA_HOME
/usr/java/jdk1.6.0_17






[ANNOUNCE] New Nutch Committer: Julien Nioche

2009-12-24 Thread Mattmann, Chris A (388J)
All,

A little while ago I nominated Julien Nioche to be a Nutch committer based on
his contributions to the Nutch project (10+ patches in this release alone,
and all the mailing list help and thoughtful design discussion). I'm happy
to announce that the Lucene PMC has voted to make Julien a Nutch committer!

Julien, welcome to the team. The typical first committer task is to modify
the Nutch Forrest credits page and add yourself to the website. If you'd
like to say something about yourself and your background, feel free to do so
as well.

Welcome!

Cheers,
Chris





Re: Web search engine Nutch

2009-10-30 Thread Mattmann, Chris A (388J)
Hi Hari,

Please check out the Nutch website, and 0.8 tutorial here:

http://lucene.apache.org/nutch/tutorial8.html

Much of it is still applicable in terms of the configuration you're looking
for. Also, please ask your questions to nutch-user@lucene.apache.org, so the
rest of the community can benefit from the answers. I'm CC'ing the list on
my reply.

Cheers,
Chris


On 10/30/09 3:39 AM, "hari" wrote:

> Hi Chris,
>
> I got the Nutch-1.0 war file.
>
> Please help me out with the configuration part.
>
> hadoop-site.xml -- what tags do I need to write here?
> nutch-site.xml
>
> Please give me some sample files.
>
> I am only building a web site search engine.
>
> regards,
> Hari
>  
>> - Original Message -
>> From: Chris Mattmann ; mattm...@apache.org
>> To: hari ; mattm...@apache.org
>> Sent: Friday, October 30, 2009 11:17 AM
>> Subject: Re: Web search engine Nutch
>>
>> Hi Hari,
>>
>> You need some type of GNU tar executable; what platform are you on? On
>> Windows, Winzip should work; on Mac, you can use tar, same as for *nix
>> systems. The command is: tar xvzf nutch-1.0.tar.gz.
>>
>> HTH,
>> Chris
>>
>> P.S. Please feel free to send your reply to nutch-user@lucene.apache.org;
>> that mailing list is monitored by many users.
>>
>> On 10/29/09 10:34 PM, "hari" wrote:
>>
>>> Hi Chris,
>>>
>>> I can see the nutch-1.0.tar.gz file,
>>> but I don't know how to unzip this file (tar.gz).
>>>
>>> Regards,
>>> Hari
>>>
 
 - Original Message -
 From: Chris Mattmann ; mattm...@apache.org
 To: hari ; mattm...@apache.org
 Sent: Friday, October 30, 2009 10:55 AM
 Subject: Re: Web search engine Nutch

 Hi Hari,

 The best place to get Nutch is from:

 http://www.apache.org/dyn/closer.cgi/lucene/nutch/

 HTH,
 Chris

 On 10/29/09 10:15 PM, "hari" wrote:

> Hariprasath
 