Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Phil Barnett
On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius mgris...@comcast.net wrote: I also share many of Phil's sentiments. I really want the project (bin/nutch crawl) to work for me as well and I want to help somehow. I would like to share a 5 GB 'intranet' web site with ~50 people. And I have

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Phil Barnett
Oh yeah, I built a presentation and gave it at our local Linux User Group meeting. You might find it useful: http://leap-cf.org/presentations/nutch/NutchWebCrawler.odp On Sat, May 1, 2010 at 2:10 AM, Phil Barnett ph...@philb.us wrote: On Wed, Apr 28, 2010 at 10:27 AM, matthew a. grisius

Re: nutch crawl issue

2010-05-01 Thread Phil Barnett
This sounds exactly like what I have been experiencing. On Wed, Apr 28, 2010 at 12:39 AM, matthew a. grisius mgris...@comcast.net wrote: using Nutch nightly build nutch-2010-04-27_04-00-28: I am trying to bin/nutch crawl a single HTML file generated by javadoc and no links are followed. I

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Phil, Thanks for your comments. Mine below: Unfortunately some parts of the documentation on Nutch (namely the tutorial, and other parts of the static site) have been out of date for a while. This has occurred really independent of the releases, and independent of the wiki [1], which

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-05-01 Thread Phil Barnett
On Sat, May 1, 2010 at 2:34 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: Sure, hopefully you'll find the answer you're looking for. In the meantime, it's my job as the RM to keep cutting release candidates that at least pass the basic criteria for release and right

Re: why does nutch interpret directory as URL

2010-05-01 Thread b k
You are right. I had to add a custom plugin, InvalidUrlIndexFilter, which filters out all the invalid URLs while indexing the pages/files. Check out this blog: http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html Just follow the process of creating/adding a new custom plugin

getting malformed URL exception

2010-05-01 Thread arpit khurdiya
I am trying to index files on a local intranet using Nutch 1.0, so I am giving the path file:hostname/shared/ as the seed. When I use AdaptiveScheduler and crawl the intranet for the first time, it works fine, but when I recrawl, it gives me a malformed URL exception. But when I use the Default

Re: Searching multiple directories

2010-05-01 Thread b k
I just resolved this issue, in a quick and easy way though! 1. Created searchmenu.jsp with a drop-down selection to search from several directories, passing the hidden value to search.jsp 2. In search.jsp, for default value, I am searching the entire /html directory, I just left the code as

Re: getting malformed URL exception

2010-05-01 Thread b k
Maybe you can try file:/hostname// or file:///hostname. Looks like you have 4 slashes... just a guess. On Sat, May 1, 2010 at 2:36 PM, arpit khurdiya arpitkhurd...@gmail.com wrote: I am trying to index files on a local intranet using Nutch 1.0, so I am giving the path as
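
To see why the number of slashes matters, here is a minimal sketch that splits a few file: URL variants into scheme, authority (host), and path. It uses Python's standard urllib.parse rather than the Java URL handling inside Nutch, so it only illustrates generic URL splitting and is not a statement of what Nutch itself does. The helper name describe and the example path /hostname/shared/ are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def describe(url):
    """Return (scheme, authority, path) for a URL string.

    Hypothetical helper for illustration only; it relies on Python's
    generic URL splitting, not on Nutch's own URL handling.
    """
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


# Variants with zero to four slashes after "file:".
candidates = [
    "file:hostname/shared/",
    "file:/hostname/shared/",
    "file://hostname/shared/",
    "file:///hostname/shared/",
    "file:////hostname/shared/",
]

for candidate in candidates:
    scheme, authority, path = describe(candidate)
    print(f"{candidate!r}: authority={authority!r} path={path!r}")
```

With two slashes the first segment is read as the authority (host); with one or three slashes the authority is empty and the whole remainder is the path; with four slashes the path itself begins with a double slash. Which of these forms a given crawler accepts is a separate question that this sketch does not answer.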

Re: skip index directory in search results

2010-05-01 Thread b k
RESOLVED--- I had to add a custom plugin, InvalidUrlIndexFilter, which filters out all the invalid URLs while indexing the pages/files. Check out this blog: http://sujitpal.blogspot.com/2009/07/nutch-getting-my-feet-wet.html Just follow the process of creating/adding a new custom plugin

Re: nutch crawl issue

2010-05-01 Thread Mattmann, Chris A (388J)
Hi Matthew, There is an open issue with Tika (e.g. https://issues.apache.org/jira/browse/TIKA-379) that could explain the differences between parse-html and parse-tika. Note that you can specify: *parse-(html|pdf)* in order to get both HTML and PDF files. The reason that I
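
As a rough illustration of what the parse-(html|pdf) pattern selects, the sketch below checks a few plugin names against it. It assumes the pattern is applied as a whole-name regular-expression match, which is an assumption about how the setting is evaluated and not something stated in this thread. The function name is_included and the list of names are hypothetical. Python is used only because the alternation syntax behaves the same way as in Java regular expressions.

```python
import re

# The pattern quoted in the message above.
PLUGIN_PATTERN = re.compile(r"parse-(html|pdf)")


def is_included(plugin_name):
    """Return True if the whole name matches the pattern.

    Assumption: the pattern must match the full plugin name,
    not just a substring of it.
    """
    return PLUGIN_PATTERN.fullmatch(plugin_name) is not None


# Hypothetical names used only to exercise the pattern.
for name in ["parse-html", "parse-pdf", "parse-tika", "parse-htmlx"]:
    print(name, is_included(name))
```

Under that assumption, parse-html and parse-pdf are selected, while parse-tika is not.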