Re: why does nutch interpret directory as URL

2010-04-28 Thread xiao yang
Because it is indeed a URL.
You can either filter this kind of URL by configuring
crawl-urlfilter.txt (-^.*/$ may help, but I'm not sure about the
regular expression) or filter the search results (you would need to
develop a Nutch plugin).
Thanks!

Xiao
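
For reference, here is a sketch of such a rule for conf/crawl-urlfilter.txt,
assuming the standard regex URL filter syntax (one rule per line, '-'
excludes, '+' includes, and the first matching rule wins); this is untested:

    # skip any URL that ends with a slash, i.e. directory listings
    -/$

Rule order matters, so a line like this has to appear before any catch-all
accept rule (such as a trailing '+.').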

On Thu, Apr 29, 2010 at 4:33 AM, BK  wrote:
> While indexing files on the local file system, why does Nutch interpret the
> directory as a URL - fetching file:/C:/temp/html/
> This causes the index page of this directory to show up in the search results.
> Any solutions for this issue?
>
>
> Bharteesh Kulkarni
>


why does nutch interpret directory as URL

2010-04-28 Thread BK
While indexing files on the local file system, why does Nutch interpret the
directory as a URL - fetching file:/C:/temp/html/
This causes the index page of this directory to show up in the search results.
Any solutions for this issue?


Bharteesh Kulkarni


Fwd: Call for Participation: Technical Talks -- ApacheCon North America 2010

2010-04-28 Thread Grant Ingersoll


Begin forwarded message:

> From: Sally Khudairi 
> Date: April 28, 2010 1:48:57 PM EDT
> To: annou...@apachecon.com
> Subject: Call for Participation: Technical Talks -- ApacheCon North America 
> 2010
> Reply-To: s...@apache.org
> 
> ApacheCon North America 2010
> 1-5 November 2010 -- Westin Peachtree in Atlanta
> 
> Technical Tracks: Call For Participation
> All submissions must be received by Friday, 28 May 2010 at midnight Pacific 
> Time.
> The official conference, trainings, and expo of The Apache Software 
> Foundation (ASF) returns to Atlanta this November, with dozens of technical, 
> business, and community-focused sessions at the beginner, intermediate, and 
> advanced levels.
> 
> Over the past decade, the ASF has gone from strength to strength, developing 
> and shepherding nearly 150 Top-Level Projects and new initiatives in the 
> Apache Incubator and Labs. This year's ApacheCon celebrates how Apache 
> technologies have sparked creativity, challenged processes, streamlined 
> development, improved collaboration, launched businesses, bolstered 
> economies, and improved lives.
> 
> We are proud of our achievements and recognize that the global Apache 
> community -- both developers and users -- is responsible for the success and 
> popularity of our products.
> 
> The ApacheCon Planning Team is soliciting 50-minute technical presentations 
> for the next conference, which will focus on the theme “Servers, the Cloud, 
> and Innovation”.
> 
> We are particularly interested in highly relevant, professionally directed 
> presentations that demonstrate specific problems and real-world solutions. 
> Part of the technical program has already been planned; we welcome proposals 
> based on the following Apache Projects and related technical areas:
> 
> - Cassandra/NoSQL
> - Content Technologies
> - (Java) Enterprise Development
> - Felix/OSGi
> - Geronimo
> - Hadoop + friends/Cloud Computing
> - Lucene, Mahout + friends/Search
> - Tomcat
> - Tuscany
> Submissions are open to anyone with relevant expertise: ASF affiliation is 
> not required to present at, attend, or otherwise participate in ApacheCon.
> 
> Please keep in mind that whilst we encourage submissions that highlight the 
> use of specific Apache solutions, we are unable to accept 
> marketing/commercially-oriented presentations.
> 
> Other proposals, such as panels or sessions longer than 50 minutes, have 
> been considered in the past. You are welcome to submit such an alternate 
> presentation; however, these sessions are accepted only under exceptional 
> circumstances. Please be as descriptive as possible, including names/bios of 
> proposed panelists and any related details.
> 
> All accepted speakers (not co-presenters) qualify for general conference 
> admission and a minimum of two nights lodging at the conference hotel. 
> Additional hotel nights and travel assistance are possible, depending on the 
> number of presentations given and type of assistance needed.
> 
> To submit a presentation proposal, please send an email to submissions AT 
> apachecon DOT com containing the following information in plaintext (no 
> attachments, please):
> 
> 1. Your full name, title, and organization
> 
> 2. Contact information, including your address
> 
> 3. The name of your proposed session (keep your title simple and relevant to 
> the topic)
> 
> 4. The technical category of the intended presentation (Cassandra/NoSQL; 
> Content Technologies; (Java) Enterprise Development; Felix/OSGi; Geronimo; 
> Hadoop + friends/Cloud Computing; Lucene, Mahout + friends/Search; Tomcat; or 
> Tuscany)
> 
> 5. The classification for each presentation (Servers, Cloud, or Innovation) – 
> some presentations may have more than one theme (e.g., a next-generation 
> server can be classified both as "Servers" and "Innovation")
> 
> 6. The intended audience level (beginner, intermediate, advanced)
> 
> 7. A 75-200 word overview of your presentation
> 
> 8. A 100-200-word speaker bio that includes prior conference speaking or 
> related experience
> 
> 9. Feedback or references (with contact information) on presentations given 
> within the last three years
> 
> To be considered, proposals must be received by Friday, 28 May 2010 at 
> midnight Pacific Time. Please email any questions regarding proposal 
> submissions to cfp AT apachecon DOT com.
> 
> Technical Tracks Key Dates
> 
> 23 April 2010: Call For Participation Open
> 28 May 2010: Call For Participation Closes
> 11 June 2010: Speaker Acceptance/Rejection Notification
> 1-5 November 2010: ApacheCon NA 2010
> We look forward to seeing you in Atlanta!
> 
> For the ApacheCon Planning team,
> Sally Khudairi, Program Lead
> 
> 
> 
> 



skip index directory in search results

2010-04-28 Thread BK
Hello all,

I have indexed a few directories which contain html files, and the *index to
each directory* is showing up as one of the search results. Is there any way
to exclude these directories from the search results? e.g. *Index of
C:\temp\html* and *Index of C:\temp\html\dir2* are showing up in the results;
these pages display the list of all files under a specific directory (end
users won't need this info).
Thanks!



Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Mattmann, Chris A (388J)
Hi Matthew,

Thanks for your feedback. If you have any specific 
updates/improvements/actionable items based on your comments below, we'd love 
to have you contribute them back to the community. 
Otherwise, we will take your feedback, put it into the queue of other items in 
the Nutch issue tracking system for those who are committers on the project to 
work on, as time permits.

Apache has a process for meritocracy [1] in terms of contributing to projects 
and being recognized for those contributions - we welcome feedback and 
actionable items in the form of patches that improve the code or documentation, 
add new features, etc., while maintaining backwards compatibility with existing 
deployments and existing users.

Thanks and hope to see some issues/feedback/patches continue to come!

Cheers,
Chris

[1] http://www.apache.org/foundation/how-it-works.html#meritocracy

On 4/28/10 7:27 AM, "matthew a. grisius"  wrote:

I also share many of Phil's sentiments. I really want the project
(bin/nutch crawl) to work for me as well and I want to help somehow. I
would like to share a 5gb 'intranet' web site with ~50 people. And I
have not graduated to making the 'deepcrawl' script work yet either, as
I'm thinking that maybe Nutch might not be the 'right tool' for 'little
projects' based on documentation, discussion list feedback, etc. . . .

-m.

On Wed, 2010-04-28 at 06:59 -0400, Phil Barnett wrote:
> On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
>
> >
> > Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> > open for the next 72 hours.
> >
>
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between what existing documentation tells me,
> a newcomer to the project, and what it actually takes to get it running.
>
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.
>
> And I have yet to get the deepcrawl script working, which seems to be the
> suggested way to get beyond bin/nutch crawl. It doesn't return any data at
> all and has an error in the middle of its run regarding a missing file which
> the last stage apparently failed to write. (I believe because the scheduler
> excluded everything.)
>
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.
>
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.
>
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project, as seen from the outside, is plainly broken.
>
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with, and the going is slow as I weed through things. I really need
> this project to work for me. I want to help.
>
> 1. Where is the scheduler documented? If I want to crawl everything from
> scratch, where is the information from the last run stored? It seems like
> the schedule is telling my crawl to ignore pages due to scheduler knocking
> them out. It's not obvious to me why this is happening and how to stop it
> from happening. I think right now this is my major roadblock in getting
> bin/nutch crawl working. Maybe the scheduler code no longer works properly
> in bin/nutch crawl. I can't tell if it's that or if the default
> configurations don't work.
>
> 2. Where are the control files in conf documented? How do I know which ones
> do what and when? There's a half dozen *-urlfilters. Why?
>
> 3. Why don't your post-nightly-compile tests include bin/nutch crawl, or if
> it does, why didn't it find the error that stopped it from running?
>
> 4. Where is the documentation on how to configure the new tika parser in
> your environment? I see that the old parsers have been removed by default,
> but there's nothing that shows me how to include/exclude document types.
>
> I believe your assessment of 'ready' does not include some very important
> things and that you would be doing a service to newcomers to bring
> documentation in line with current offerings. This is not trivial code and
> it takes a long time for someone from the outside to understand it. That
> process is being stifled on multiple fronts as far as I can see. Either that
> or I have missed an important document that exists and I haven't read it.
>
> Phil Barnett
> Senior Programmer / Analyst
> Walt Disney World, Inc.




++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Mattmann, Chris A (388J)
Hi Phil,

Thanks very much for the feedback. I'd like to take a second to address your
points:

> 
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between what existing documentation tells me,
> a newcomer to the project, and what it actually takes to get it running.

Unfortunately some parts of the documentation on Nutch (namely the tutorial
and other parts of the static site) have been out of date for a while. This
has occurred largely independently of the releases, and independently of the
wiki [1], which hasn't fallen out of date as quickly.

> 
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.

Can you elaborate on your find of broken code? Did you file a JIRA issue for
this in the Nutch JIRA system [2] ?

> 
> And I have yet to get the deepcrawl script working, which seems to be the
> suggested way to get beyond bin/nutch crawl. It doesn't return any data at
> all and has an error in the middle of its run regarding a missing file which
> the last stage apparently failed to write. (I believe because the scheduler
> excluded everything.)

The more information you provide here about your environment and the
situation that caused the error, including details such as a stack trace or
an exception, the easier it is to track down what you're seeing.

> 
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.

I wouldn't say the developers have advanced beyond anything, really :) The
number of active developers in Nutch these days is fairly small, but interest
and the user community are stable, and there are some pretty large-scale
deployments of Nutch to my knowledge. That said, those folks have been
following the mailing lists and using the software for a while, so their bar
of entry into the documentation may be a little higher than that of a newer
user such as yourself.

That said, one thing to realize is that this is open source software, so in
the end, as they say in Apache, "those that do, decide", or "patches
welcome!" In other words, if there are things that you see that could be
fixed, improved, made more configurable, etc., including the code, but *also
the documentation*, then by all means we'd appreciate your feedback and
contribution. Nutch is not simply a product of the developers that
contribute their (often unsalaried) time to work on it, but
of its user community as well.

> 
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.

I apologize that your questions have gone unanswered and that you're hitting
walls with regard to using Nutch. What questions did you ask? Perhaps it's
the detail that you are providing (or not providing), or perhaps it's the
way you're asking the questions. Or (even more likely) it's the fact that
this is an open source project, so the committers get to the user list as
just one of the many items on their plate; we may simply have missed your
question, or perhaps those who had the time weren't particular experts in
the one area of Nutch you were asking about. There could be a number of
reasons. Regardless, persistence is key, as are *patience* and
respectfulness. This has always, to my knowledge, been a really friendly
community, so if you hang around and keep asking questions, they will get
answered; I'm confident of that.

> 
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project as seen from the outside, is plainly broken.

In certain cases you are right, but I wouldn't take your comments above as
true across the board. For example, if you believe documentation is lacking,
then the first step is typically to file JIRA issues to alert committers and
other users of Nutch to your concern, and then have discussion on the lists
regarding the issues. At some point a patch is produced and attached to the
issue, where the committers can review it and work to get it committed to
the code base.

Nutch ships with a number of regression unit tests that tell me it's not
broken, and there are users who are able to make it work in their
environments. There have been some recent bugs in the 1.1 RC that we caught
and fixed (NUTCH-812, NUTCH-814, etc.), but that's natural.

> 
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with, and the going is slow as I weed through things.

Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread matthew a. grisius
I also share many of Phil's sentiments. I really want the project
(bin/nutch crawl) to work for me as well and I want to help somehow. I
would like to share a 5gb 'intranet' web site with ~50 people. And I
have not graduated to making the 'deepcrawl' script work yet either, as
I'm thinking that maybe Nutch might not be the 'right tool' for 'little
projects' based on documentation, discussion list feedback, etc. . . .

-m.

On Wed, 2010-04-28 at 06:59 -0400, Phil Barnett wrote:
> On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> >
> > Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> > open for the next 72 hours.
> >
> 
> How do you test to see if Nutch works like the documentation says it works?
> I still find major differences between what existing documentation tells me,
> a newcomer to the project, and what it actually takes to get it running.
> 
> For example, my find of broken code in bin/nutch crawl, a most basic way of
> getting it running.
> 
> And I have yet to get the deepcrawl script working, which seems to be the
> suggested way to get beyond bin/nutch crawl. It doesn't return any data at
> all and has an error in the middle of its run regarding a missing file which
> the last stage apparently failed to write. (I believe because the scheduler
> excluded everything.)
> 
> I wonder if the developers have advanced so far past these basic scripts as
> to have pretty much left them behind. This leads to these basics that people
> start with not working.
> 
> I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
> I'm getting nowhere at all. It's pretty frustrating to spend that much time
> trying to figure out how it works and keep hitting walls. And then asking
> basic questions here that go unanswered.
> 
> The view from the outside is not so good from my direction. If you don't
> keep documentation up to date and you change the way things work, the
> project, as seen from the outside, is plainly broken.
> 
> I'd be happy to give you feedback on where I find these problems and I'll
> even donate whatever fixes I can come up with, but Java is not a language
> I'm familiar with, and the going is slow as I weed through things. I really need
> this project to work for me. I want to help.
> 
> 1. Where is the scheduler documented? If I want to crawl everything from
> scratch, where is the information from the last run stored? It seems like
> the schedule is telling my crawl to ignore pages due to scheduler knocking
> them out. It's not obvious to me why this is happening and how to stop it
> from happening. I think right now this is my major roadblock in getting
> bin/nutch crawl working. Maybe the scheduler code no longer works properly
> in bin/nutch crawl. I can't tell if it's that or if the default
> configurations don't work.
> 
> 2. Where are the control files in conf documented? How do I know which ones
> do what and when? There's a half dozen *-urlfilters. Why?
> 
> 3. Why don't your post-nightly-compile tests include bin/nutch crawl, or if
> it does, why didn't it find the error that stopped it from running?
> 
> 4. Where is the documentation on how to configure the new tika parser in
> your environment? I see that the old parsers have been removed by default,
> but there's nothing that shows me how to include/exclude document types.
> 
> I believe your assessment of 'ready' does not include some very important
> things and that you would be doing a service to newcomers to bring
> documentation in line with current offerings. This is not trivial code and
> it takes a long time for someone from the outside to understand it. That
> process is being stifled on multiple fronts as far as I can see. Either that
> or I have missed an important document that exists and I haven't read it.
> 
> Phil Barnett
> Senior Programmer / Analyst
> Walt Disney World, Inc.



Re: nutch crawl issue

2010-04-28 Thread matthew a. grisius
My subject should've been clearer, e.g. it should've read 'Nutch 1.1
nightly build crawl issue'.

Also, I did verify that Nutch 1.0 successfully completes crawling the
javadoc html file; this can be verified with luke-1.0.1 and searched from
the command line using bin/nutch org.apache.nutch.searcher.NutchBean java

On Wed, 2010-04-28 at 00:39 -0400, matthew a. grisius wrote:
> using Nutch nightly build nutch-2010-04-27_04-00-28:
> 
> I am trying to bin/nutch crawl a single html file generated by javadoc
> and no links are followed. I verified this with bin/nutch readdb and
> bin/nutch readlinkdb, and also with luke-1.0.1. Only the single base
> seed doc specified is processed.
> 
> I searched and reviewed the nutch-user archive and tried several
> different settings but none of the settings appear to have any effect.
> 
> I then downloaded maven-2.2.1 so that I could mvn install tika and
> produce tika-app-0.7.jar to extract information about the html javadoc
> file from the command line. I am not familiar with Tika, but the command
> line version doesn't return any metadata, e.g. no 'src=' links from the
> html 'frame' tags. Perhaps I'm using it incorrectly, and I am not sure how
> Nutch uses Tika and maybe it's not related . . .
> 
> Has anyone crawled javadoc files or have any suggestions? Thanks.
> 
> -m.
> 
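
A hedged way to narrow this down is to run Tika's HTML parser directly over
the javadoc index page and print every link it reports. The sketch below is
written against the Tika API as I know it (LinkContentHandler and Link live
in org.apache.tika.sax in later releases; I am not certain they exist in
0.7, and FrameLinkCheck is a made-up class name). If the frame targets never
show up in the output, Tika is dropping them before Nutch's parse-tika
plugin ever sees them:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.html.HtmlParser;
    import org.apache.tika.sax.Link;
    import org.apache.tika.sax.LinkContentHandler;

    public class FrameLinkCheck {
      public static void main(String[] args) throws Exception {
        // args[0]: path to a javadoc index.html that uses <frame src=...>
        InputStream in = new FileInputStream(args[0]);
        try {
          // collects every link-like element Tika's HTML mapper emits
          LinkContentHandler links = new LinkContentHandler();
          new HtmlParser().parse(in, links, new Metadata(), new ParseContext());
          for (Link link : links.getLinks()) {
            System.out.println(link.getUri());
          }
        } finally {
          in.close();
        }
      }
    }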



Problem with Standard analyzer

2010-04-28 Thread Srinivas Gokavarapu
Hi,

I have faced a problem while tokenizing text using StandardAnalyzer. When I
try to tokenize the string "internet,art,3d,avatar,portraits"
using StandardAnalyzer, the tokens I get are
internet
art,3d,avatar
portraits

I expected it to be 5 different words. Is this a bug in the analyzer? Has
anyone faced this kind of problem while working with StandardAnalyzer?

-- 
G.R.J.Srinivas
OBH 62
IIIT Hyderabad
9492756712
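
For what it's worth, this looks like tokenizer behavior rather than a bug:
StandardTokenizer's grammar keeps a comma-joined run together as a single
NUM token when one of its segments contains a digit (that is how numbers
like "1,000" survive tokenization), which is why "art,3d,avatar" stays
intact while "internet" and "portraits" split off. A minimal sketch to
reproduce it, written against the Lucene 3.0-era API (TokenDump is a made-up
name; the attribute classes were renamed in later versions):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public class TokenDump {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("f",
            new StringReader("internet,art,3d,avatar,portraits"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
          // expected output: internet / art,3d,avatar / portraits
          System.out.println(term.term());
        }
      }
    }

A common workaround is to replace the commas with spaces before handing the
string to the analyzer.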


Re: [VOTE] Apache Nutch 1.1 Release Candidate #2

2010-04-28 Thread Phil Barnett
On Mon, Apr 26, 2010 at 1:55 AM, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

>
> Please vote on releasing these packages as Apache Nutch 1.1. The vote is
> open for the next 72 hours.
>

How do you test to see if Nutch works like the documentation says it works?
I still find major differences between what existing documentation tells me,
a newcomer to the project, and what it actually takes to get it running.

For example, my find of broken code in bin/nutch crawl, a most basic way of
getting it running.

And I have yet to get the deepcrawl script working, which seems to be the
suggested way to get beyond bin/nutch crawl. It doesn't return any data at
all and has an error in the middle of its run regarding a missing file which
the last stage apparently failed to write. (I believe because the scheduler
excluded everything.)

I wonder if the developers have advanced so far past these basic scripts as
to have pretty much left them behind. This leads to these basics that people
start with not working.

I've spent dozens of hours trying to get 1.1 to work anything like 1.0 and
I'm getting nowhere at all. It's pretty frustrating to spend that much time
trying to figure out how it works and keep hitting walls. And then asking
basic questions here that go unanswered.

The view from the outside is not so good from my direction. If you don't
keep documentation up to date and you change the way things work, the
project, as seen from the outside, is plainly broken.

I'd be happy to give you feedback on where I find these problems and I'll
even donate whatever fixes I can come up with, but Java is not a language
I'm familiar with, and the going is slow as I weed through things. I really need
this project to work for me. I want to help.

1. Where is the scheduler documented? If I want to crawl everything from
scratch, where is the information from the last run stored? It seems like
the schedule is telling my crawl to ignore pages due to scheduler knocking
them out. It's not obvious to me why this is happening and how to stop it
from happening. I think right now this is my major roadblock in getting
bin/nutch crawl working. Maybe the scheduler code no longer works properly
in bin/nutch crawl. I can't tell if it's that or if the default
configurations don't work.

2. Where are the control files in conf documented? How do I know which ones
do what and when? There's a half dozen *-urlfilters. Why?

3. Why don't your post-nightly-compile tests include bin/nutch crawl, or if
it does, why didn't it find the error that stopped it from running?

4. Where is the documentation on how to configure the new tika parser in
your environment? I see that the old parsers have been removed by default,
but there's nothing that shows me how to include/exclude document types.

I believe your assessment of 'ready' does not include some very important
things and that you would be doing a service to newcomers to bring
documentation in line with current offerings. This is not trivial code and
it takes a long time for someone from the outside to understand it. That
process is being stifled on multiple fronts as far as I can see. Either that
or I have missed an important document that exists and I haven't read it.

Phil Barnett
Senior Programmer / Analyst
Walt Disney World, Inc.
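
Two hedged follow-up sketches on questions 1 and 4 above:

On question 1, the per-URL fetch state from previous runs lives in the crawl
db, and the readdb tool can show it. This assumes a crawl directory named
"crawl" (adjust the path to yours); deleting that directory is the blunt way
to force a from-scratch crawl:

    # summary counts by status (fetched, unfetched, gone, ...)
    bin/nutch readdb crawl/crawldb -stats

    # dump per-URL records: status, next fetch time, retry interval, score
    bin/nutch readdb crawl/crawldb -dump crawldb-dump

On question 4, which parsers run is governed by the plugin.includes property
(overridden in conf/nutch-site.xml) together with the content-type-to-parser
mapping in conf/parse-plugins.xml. A sketch only -- the value below is
abbreviated from memory, so start from the one shipped in your
conf/nutch-default.xml rather than copying this verbatim:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|query-(basic|site|url)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Enabled plugins, including which parsers may run.</description>
    </property>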