Re: Web search engine Nutch

2009-10-30 Thread Mattmann, Chris A (388J)
Hi Hari, Please check out the Nutch website, and 0.8 tutorial here: http://lucene.apache.org/nutch/tutorial8.html Much of it is still applicable in terms of the configuration you're looking for. Also, please ask your questions to nutch-user@lucene.apache.org, so the rest of the community can

[ANNOUNCE] New Nutch Committer: Julien Nioche

2009-12-24 Thread Mattmann, Chris A (388J)
All, A little while ago I nominated Julien Nioche to be a Nutch committer based on his contributions to the Nutch project (10+ patches in this release alone, and all the mailing list help and thoughtful design discussion). I'm happy to announce that the Lucene PMC has voted to make Julien a Nutch

Re: Nutch Lucene Installation Instructions

2010-01-06 Thread Mattmann, Chris A (388J)
Hi Ken, My guess is that your URL filter isn't accepting the URLs that are being fetched, so no content is being indexed. You should check your $NUTCH_HOME/conf/crawl-urlfilter.txt file and make sure the defaults are changed to match your expectations of the sites you are going to crawl. One
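A minimal Python sketch of the kind of check this reply suggests. It assumes that each non-comment line of the filter file is a `+` (accept) or `-` (reject) prefix followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The domain example.org, the rule text, and the helper names `parse_rules` and `accepts` are illustrative and not taken from Nutch itself.

```python
import re

# Hypothetical rules written in the style of a URL filter file:
# '+' accepts, '-' rejects, '#' starts a comment.
RULES = r"""
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# accept hosts under example.org (placeholder domain)
+^http://([a-z0-9]*\.)*example\.org/
# reject everything else
-.
"""

def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(url, rules):
    """Apply the rules in order; the first rule that matches decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumed default: no matching rule means rejected

rules = parse_rules(RULES)
```

Running candidate seed URLs through `accepts(url, rules)` before a crawl is one way to confirm that the accept rule really covers the sites to be crawled, rather than a placeholder domain left over from the defaults.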

Re: need your support

2010-01-20 Thread Mattmann, Chris A (388J)
Hi Sahar, Can you post your: 1. crawl-urlfilter 2. nutch-site.xml Also how are you running this program below? I'm CC'ing nutch-user@ so the community can benefit from this thread. Cheers, Chris On 1/20/10 1:42 PM, sahar elkazaz saharelka...@hotmail.com wrote: Dear/ sirur I have

Re: [VOTE] Nutch to become a top-level project (TLP)

2010-04-01 Thread Mattmann, Chris A (388J)
Hi Andrzej, +1 from me. Cheers, Chris On 4/1/10 10:23 AM, Andrzej Bialecki a...@getopt.org wrote: Hi all, According to an earlier [DISCUSS] thread on the nutch-dev list I'm calling for a vote on the proposal to make Nutch a top-level project. To quickly recap the reasons and consequences

[VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-06 Thread Mattmann, Chris A (388J)
Hi Folks, I have posted a candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made using the Nutch release process,

Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-06 Thread Mattmann, Chris A (388J)
Oh, per usual, forgot to throw in my +1. So, +1! Cheers, Chris On 4/7/10 1:14 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Hi Folks, I have posted a candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1

Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-07 Thread Mattmann, Chris A (388J)
Hi, This is a VOTE thread. Please do not post your user question on this thread as we are VOTE'ing on a particular release. You can re-post a new thread with your question, and I would highly encourage it. Thanks! Cheers, Chris On 4/7/10 6:26 PM, cefurkan0 cefurkan0 cefurk...@gmail.com

Re: About Apache Nutch 1.1 Final Release

2010-04-08 Thread Mattmann, Chris A (388J)
Hi there, Well as soon as we have 3 +1 binding VOTEs. Right now I'm the only PMC member that's VOTE'd +1 on the release. Hopefully in the next few days someone will have a chance to check... Cheers, Chris On 4/8/10 8:54 PM, yhdelgado yhdelg...@estudiantes.uci.cu wrote: Hi. I have a

Re: About Apache Nutch 1.1 Final Release

2010-04-17 Thread Mattmann, Chris A (388J)
Hey Andrzej, You got it. I got bogged down yesterday but will apply this patch (was going to ask you about it) before I roll the RC. Safe travels buddy! Cheers, Chris On 4/16/10 11:55 PM, Andrzej Bialecki a...@getopt.org wrote: On 2010-04-17 05:45, Phil Barnett wrote: On Sat, 2010-04-10 at

[VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-25 Thread Mattmann, Chris A (388J)
Hi Folks, I have posted an updated candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/ The major difference between this release and rc #1 is the application of NUTCH-812 - Crawl.java incorrectly uses the Generator API

Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
On 4/26/10 7:24 AM, David M. Cole d...@colegroup.com wrote: At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote: Most folks that use Nutch are likely familiar with running ant IMHO. I guess then I fall into the category of not most folks. Have been running Nutch for about 14 months and I

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
a TLP that you delay this release by a few weeks and have the vote done under the auspices of the Nutch PMC? Cheers, Grant On Apr 26, 2010, at 1:55 AM, Mattmann, Chris A (388J) wrote: Hi Folks, I have posted an updated candidate for the Apache Nutch 1.1 release. The source code

Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
Hey Andrzej, Actually, we don't have a build target (yet) that produces a binary-only distribution that we can ship and which you can run out of the box (not counting the build/nutch.job alone, because it needs the Hadoop infrastructure to run). I thought ant tar did this? That's what it sez

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Mattmann, Chris A (388J)
Hi Phil, Thanks very much for the feedback. I'd like to take a second to address your points: How do you test to see if Nutch works like the documentation says it works? I still find major differences between how existing documentation tells me, a newcomer to the project, how to get it

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Mattmann, Chris A (388J)
, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Please vote on releasing these packages as Apache Nutch 1.1. The vote is open for the next 72 hours. How do you test to see if Nutch works like the documentation says it works? I still find major differences between how

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Phil, Thanks for your comments. Mine below: Unfortunately some parts of the documentation on Nutch (namely the tutorial, and other parts of the static site) have been out of date for a while. This has occurred really independent of the releases, and independent of the wiki [1], which

Re: nutch crawl issue

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify: parse-(html|pdf) in order to get both HTML and PDF files. The reason that I
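To illustrate what the expression parse-(html|pdf) selects, here is a small Python sketch. It assumes the expression is applied as a whole-string regular expression to plugin names; the candidate names and the function `selected_plugins` are hypothetical and only serve the illustration.

```python
import re

# Expression quoted in the reply above.
PARSE_EXPR = r"parse-(html|pdf)"

def selected_plugins(expr, plugin_ids):
    """Return the ids that the expression matches in full."""
    pattern = re.compile(expr)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]

# Hypothetical plugin ids, for illustration only.
candidates = ["parse-html", "parse-pdf", "parse-tika", "parse-text"]
chosen = selected_plugins(PARSE_EXPR, candidates)
```

Under that assumption the alternation picks up both the HTML and the PDF parser names and leaves the others out.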

Re: nutch crawl issue

2010-05-03 Thread Mattmann, Chris A (388J)
that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m. On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote: Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain

Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)
on email and noticed that Arpit also mentioned the same thing. Sorry I missed it, thanks to both of you! -m. On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote: Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA

[VOTE] Apache Nutch 1.1 Release Candidate #3

2010-05-08 Thread Mattmann, Chris A (388J)
Hi Folks, I have posted an updated candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc3/ The major differences between this release and rc #2 are the application of: NUTCH-816, NUTCH-732, NUTCH-815, NUTCH-814, and NUTCH-812