Re: Web search engine Nutch

2009-10-30 Thread Mattmann, Chris A (388J)
Hi Hari, Please check out the Nutch website, and 0.8 tutorial here: http://lucene.apache.org/nutch/tutorial8.html Much of it is still applicable in terms of the configuration you're looking for. Also, please ask your questions to nutch-user@lucene.apache.org, so the rest of the community can ben

[ANNOUNCE] New Nutch Committer: Julien Nioche

2009-12-24 Thread Mattmann, Chris A (388J)
All, A little while ago I nominated Julien Nioche to be Nutch committer based on his contributions to the Nutch project (10+ patches in this release alone, and all the mailing list help and thoughtful design discussion). I'm happy to announce that the Lucene PMC has voted to make Julien a Nutch co

Re: Nutch & Lucene Installation Instructions

2010-01-06 Thread Mattmann, Chris A (388J)
Hi Ken, My guess is that your URL filter isn't accepting the URLs that are being fetched, so no content is being indexed. You should check your $NUTCH_HOME/conf/crawl-urlfilter.txt file and make sure the defaults are changed to match your expectations of the sites you are going to crawl. One t
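
To illustrate the point about the URL filter, here is a minimal Python sketch of how a rule file in the style of crawl-urlfilter.txt decides whether a URL is kept. It is not Nutch code. It assumes the commonly documented behavior that each rule is a `+` (accept) or `-` (reject) prefix followed by a regular expression, that the first matching rule wins, and that a URL matching no rule is rejected. The rules, the `example.org` domain, and the helper names `parse_rules` and `accepts` are hypothetical.

```python
import re

# Hypothetical rules written in the style of crawl-urlfilter.txt.
# example.org is a placeholder for the site you intend to crawl.
RULES = r"""
# skip some common binary and style suffixes
-\.(gif|jpg|png|css|zip)$
# accept anything under example.org
+^http://([a-z0-9]*\.)*example\.org/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: first matching rule decides; no match means reject."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.example.org/index.html",
        "http://www.example.org/logo.png",
        "http://other.com/",
    ):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

If the accept line still names a default domain rather than the site being crawled, every seed URL falls through to the final reject rule and nothing is fetched or indexed, which matches the symptom described above.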

Re: need your support

2010-01-20 Thread Mattmann, Chris A (388J)
Hi Sahar, Can you post your: 1. crawl-urlfilter 2. nutch-site.xml Also how are you running this program below? I'm CC'ing nutch-user@ so the community can benefit from this thread. Cheers, Chris On 1/20/10 1:42 PM, "sahar elkazaz" wrote: Dear/ sirur I have follow all steps on your

Re: [VOTE] Nutch to become a top-level project (TLP)

2010-04-01 Thread Mattmann, Chris A (388J)
Hi Andrzej, +1 from me. Cheers, Chris On 4/1/10 10:23 AM, "Andrzej Bialecki" wrote: Hi all, According to an earlier [DISCUSS] thread on the nutch-dev list I'm calling for a vote on the proposal to make Nutch a top-level project. To quickly recap the reasons and consequences of such move: t

[VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-06 Thread Mattmann, Chris A (388J)
Hi Folks, I have posted a candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/ See the included CHANGES.txt file for details on release contents and latest changes. The release was made using the Nutch release process, docume

Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-06 Thread Mattmann, Chris A (388J)
Oh, per usual, forgot to throw in my +1. So, +1! Cheers, Chris On 4/7/10 1:14 AM, "Mattmann, Chris A (388J)" wrote: Hi Folks, I have posted a candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc1/ See th

Re: [VOTE] Apache Nutch 1.1 Release Candidate #1

2010-04-07 Thread Mattmann, Chris A (388J)
Hi, This is a VOTE thread. Please do not post your user question on this thread as we are VOTE'ing on a particular release. You can re-post a new thread with your question, and I would highly encourage it. Thanks! Cheers, Chris On 4/7/10 6:26 PM, "cefurkan0 cefurkan0" wrote: hi folks do

Re: About Apache Nutch 1.1 Final Release

2010-04-08 Thread Mattmann, Chris A (388J)
Hi there, Well as soon as we have 3 +1 binding VOTEs. Right now I'm the only PMC member that's VOTE'd +1 on the release. Hopefully in the next few days someone will have a chance to check... Cheers, Chris On 4/8/10 8:54 PM, "yhdelgado" wrote: Hi. I have a question. When the Apache Nutch 1

Re: About Apache Nutch 1.1 Final Release

2010-04-17 Thread Mattmann, Chris A (388J)
Hey Andrzej, You got it. I got bogged down yesterday but will apply this patch (was going to ask you about it) before I roll the RC. Safe travels buddy! Cheers, Chris On 4/16/10 11:55 PM, "Andrzej Bialecki" wrote: On 2010-04-17 05:45, Phil Barnett wrote: > On Sat, 2010-04-10 at 18:22 +0200,

[VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-25 Thread Mattmann, Chris A (388J)
Hi Folks, I have posted an updated candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc2/ The major difference between this release and rc #1 is the application of NUTCH-812 - Crawl.java incorrectly uses the Generator API resul

Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
bove. Cheers, Chris On 4/26/10 7:24 AM, "David M. Cole" wrote: At 10:55 PM -0700 4/25/10, Mattmann, Chris A (388J) wrote: >Most folks that use Nutch are likely >familiar with running ant IMHO. I guess then I fall into the category of "not most folks." Have been run

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
P that you delay this release by a few weeks and have the vote done under the auspices of the Nutch PMC? Cheers, Grant On Apr 26, 2010, at 1:55 AM, Mattmann, Chris A (388J) wrote: > Hi Folks, > > I have posted an updated candidate for the Apache Nutch 1.1 release. The > sourc

Re: Running ANT; was -- Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-26 Thread Mattmann, Chris A (388J)
Hey Andrzej, > Actually, we don't have a build target (yet) that produces a binary-only > distribution that we can ship and which you can run out of the box (not > counting the build/nutch.job alone, because it needs the Hadoop > infrastructure to run). I thought ant tar did this? That's what it

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Mattmann, Chris A (388J)
Hi Phil, Thanks very much for the feedback. I'd like to take a second to address your points: > > How do you test to see if Nutch works like the documentation says it works? > I still find major differences between how existing documentation tells me, > a newcomer to the project, how to get it r

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Mattmann, Chris A (388J)
I have not graduated to making the 'deepcrawl' script work yet either, as I'm thinking that maybe Nutch might not be the 'right tool' for 'little projects' based on documentation, discussion list feedback, etc. . . . -m. On Wed, 2010-04-28 at 06:59 -0400, Phil Ba

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-30 Thread Mattmann, Chris A (388J)
Hi Phil, Thanks for your comments. Mine below: >> Unfortunately some parts of the documentation on Nutch (namely the >> tutorial, >> and other parts of the static site) have been out of date for a while. This >> has occurred really independent of the releases, and independent of the >> wiki >> [1

Re: nutch crawl issue

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Matthew, >> Hi Matthew, >> >> There is an open issue with Tika (e.g. >> https://issues.apache.org/jira/browse/TIKA-379) that could explain the >> differences between parse-html and parse-tika. Note that you can specify : >> *parse-(html|pdf) *in order to get both HTML and PDF files. > > The re
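
As a rough illustration of the `parse-(html|pdf)` suggestion quoted above, the Python sketch below shows which plugin names that alternation selects. It assumes the expression is applied to whole plugin identifiers as a full match, which is an assumption about how Nutch evaluates its plugin include setting and is not stated in the thread. The helper name `selected` is hypothetical.

```python
import re

# The alternation quoted in the thread.
PATTERN = re.compile(r"parse-(html|pdf)")


def selected(pattern, plugin_ids):
    """Return the ids that the pattern matches in full (assumed semantics)."""
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


if __name__ == "__main__":
    candidates = ["parse-html", "parse-pdf", "parse-tika"]
    print(selected(PATTERN, candidates))
```

Under that assumption, the expression picks up the HTML and PDF parsers and leaves parse-tika out.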

Re: nutch crawl issue

2010-05-03 Thread Mattmann, Chris A (388J)
oticed that Arpit also > mentioned the same thing. Sorry I missed it, thanks to both of you! > > -m. > > On Sat, 2010-05-01 at 21:06 -0700, Mattmann, Chris A (388J) wrote: >> Hi Matthew, >> >>>> Hi Matthew, >>>> >>>> There is an open

Re: nutch crawl issue

2010-05-04 Thread Mattmann, Chris A (388J)
han > exceptional. I cannot believe that I'm the only one who will experience > these issues with common HTML such as FRAMESET/FRAME/javascript. Thanks > for asking. > > -m. > > > > On Mon, 2010-05-03 at 09:24 -0700, Mattmann, Chris A (388J) wrote: >> Hi Ma

[VOTE] Apache Nutch 1.1 Release Candidate #3

2010-05-08 Thread Mattmann, Chris A (388J)
Hi Folks, I have posted an updated candidate for the Apache Nutch 1.1 release. The source code is at: http://people.apache.org/~mattmann/apache-nutch-1.1/rc3/ The major differences between this release and rc #2 are the application of: NUTCH-816, NUTCH-732, NUTCH-815, NUTCH-814, and NUTCH-812 ba