don't expect polish.
You shouldn't need polish to be able to learn the command required to
resume an aborted crawl, or to index what you have already crawled.
Things like this shouldn't require an Easter egg hunt. They are going
to happen to everyone doing anything more than a simple crawl.
If you
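(For reference, the indexing half of that is a two-command affair. This is a sketch assuming the 0.8-style directory layout from the tutorial -- crawl/crawldb, crawl/linkdb, crawl/segments -- so adjust the paths to wherever your crawl actually lives:

  # build the link database, then the index, from segments already fetched
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Resuming an aborted fetch has no single blessed command that I know of, which is rather the point being made above.)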
Hi Richard,
I told you I was more than willing to help, and I think many users feel
the same way, but I for one feel that there is a lack of documentation
and support. This isn't meant to offend anyone; if you are offended, you
need to toughen up your skin a little bit.
Here you can find
I agree that the doc could be better, but I still take issue with
the earlier use of the phrase proof-of-concept. If there are
dozens of sites using it in production, several of them indexing
hundreds of millions of pages, I don't know how you can call it
proof-of-concept.
Honestly, I'm not sure if
I do thank the nutch developers very, very much for what they have put into
the project :) I think the concept is great and yes, it does work, if you
invest the time needed to learn the interfaces, upgrade the
distribution nightly, relearn the commands, etc. Doug's statement that
nutch is for early
Try using depth=n when you do the crawl. Post-crawl I don't know; I
have the same question. How to make the index go deeper on your next
round of fetching is still something I haven't figured out.
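The depth knob is just an argument to the one-shot crawl command. A minimal example, assuming the stock 0.7/0.8-era syntax, where urls is your seed list (a flat file in 0.7, a directory of files in 0.8):

  # crawl the seeds, following links up to 10 hops deep
  bin/nutch crawl urls -dir crawl -depth 10

Deepening an existing crawl afterwards is exactly the open question above.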
The crawl command creates a crawlDB for each call, so as Richard
mentioned, try a higher depth.
If you would like nutch to go deeper with each iteration, try the
whole-web tutorial, but change the url filter so that it only
crawls your webpage; a sketch of the filter change follows below.
This will go as deep as the number of iterations you run.
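The filter lives in conf/crawl-urlfilter.txt for the crawl command (the step-by-step whole-web tools read conf/regex-urlfilter.txt instead). MY.DOMAIN.NAME below is the tutorial's placeholder, and the default rules in your copy may differ:

  # accept urls within your own domain...
  +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
  # ...and skip everything else
  -.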
Maybe we should move the tutorial to the wiki so it can be commented on.
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org/
Free Open Source Tax Software
Maybe we should move the tutorial to the wiki so it can be commented on.
+1
The nutch dev team isn't focused on PDF parsing. Nutch is a search
engine framework,
IMHO, if you don't parse something correctly, you cannot rely on the
results.
We have all parsed things where you leave a comma out and the parse
results are wrong. If there were a bug in Nutch's html parsing
I am sorry if you don't like my opinion or the way it is expressed.
Hi Richard,
most of your opinion I think is the same as mine. I have been using
nutch since spring 2004 for our page http://www.umkreisfinder.de
It was a big effort to learn how nutch works, and also a big effort
to learn how
I really do think nutch is great, but I echo Matthias's comments that the
community needs to come together and contribute more back. And that
comes with the requirement of making sure volunteers are given access to
make their contributions part of the project.
Also, if you use nutch you should
Maybe we should organize ourselves a little better on this point.
What do you think?
Just a general note: jira has voting functionality.
This allows everybody to vote for an issue, and it can show in a very
compact way what the community is looking for.
However, it is not used that often.
Hi Richard,
IMHO, if you don't parse something correctly, you cannot rely on the
results.
Good, we're on the same page here.
We have all parsed things where you leave a comma out and the parse
results are wrong. If there were a bug in Nutch's html parsing, would
that be a big deal?
Yes,
Stefan,
I think I know what you're saying. When you are new to nutch and you
read the tutorial, it kind of leads you to believe (incorrectly) that
whole-web crawling is different from intranet crawling and that the
steps are somehow different and independent of one another. In fact it
looks
whoops i hit send by accident :(
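For what it's worth, the steps really are the same underneath: the one-shot crawl command essentially runs the whole-web loop for you (plus link inversion and indexing at the end). A sketch of that loop, assuming the 0.8-era command names and the tutorial's directory layout:

  bin/nutch inject crawl/crawldb urls
  # one generate/fetch/updatedb round per level of "depth"
  bin/nutch generate crawl/crawldb crawl/segments
  s1=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s1
  bin/nutch updatedb crawl/crawldb $s1
  # repeat the three steps above for each additional level

Restrict the url filter to your own site and this is the "intranet" crawl; leave it open and it is the whole-web crawl.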
any idea why
http://24.75.221.234:8080/search.jsp?query=e-file+site%3Awww.irs.gov&hitsPerPage=10&hitsPerSite=0&clustering=
returns a list of hits where the