Got the following dump at 100% of generate cycle
(.8 svn release)
060405 080019 parsing file:/home/mozdex/trunk/conf/nutch-site.xml
060405 080019 parsing file:/home/mozdex/trunk/conf/hadoop-site.xml
Exception in thread "main" java.lang.RuntimeException:
class
hehe, just pulled it down and trying again :)
thanks!
--- Jérôme Charron [EMAIL PROTECTED]
wrote:
Andrzej fixed it 2 hours ago.
http://svn.apache.org/viewcvs.cgi?rev=391577&view=rev
Thanks
Jérôme
On 4/5/06, Byron Miller [EMAIL PROTECTED]
wrote:
Got the following dump
I like to think of it as a framework. Building blocks
to build what you ultimately need.
If you're after a one-stop-shop, plug-and-play, no-development-necessary solution, then perhaps some commercial system would be your best bet.
Mailing list is very active, most people get responses
fairly quickly.
Make sure you have language-identifier enabled in your
web deployment as well.
Edit WEB-INF/classes/nutch-site.xml (or nutch-default.xml) and restart your app server.
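(For reference, a minimal nutch-site.xml override sketch, assuming the 0.7-era default plugin list; check your own nutch-default.xml for the exact value to extend:)

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
  <description>Default plugins plus language-identifier.</description>
</property>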
-byron
--- Teruhiko Kurosaka [EMAIL PROTECTED] wrote:
Hello,
I enabled language-identifier plugin and indexed
some documents.
But
You have to add query-more as one of your plugins. If
you don't rebuild your war file then you have to add
query-more to nutch-site.xml (or nutch-default.xml) under
WEB-INF/classes and restart.
You will need to re-index as well so it can index
these values.
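(Once query-more is in plugin.includes and you've re-indexed, a quick sanity check from the command line, assuming the stock NutchBean tester and that searcher.dir points at your index:)

bin/nutch org.apache.nutch.searcher.NutchBean "nutch type:pdf"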
--- Teruhiko Kurosaka [EMAIL PROTECTED] wrote:
I
I've used the Magpie RSS library in PHP with great success to do fast parsing of the OpenSearch XML data.
How long are your OpenSearch queries taking without going through PHP to return results?
-byron
--- Insurance Squared Inc.
[EMAIL PROTECTED] wrote:
We've built a php frontend onto nutch.
The impact of drive speeds isn't that large for
queries as long as the server is only handling
queries. If you're processing data at the same time, then SCSI or SATA II with tagged queuing would be best.
As for RAID 0, it helps more with many smaller drives than with a few larger ones. You will
I use OSCache with great success.
An amazing number (more than I assumed) of the queries we get are duplicates of one fashion or another, so on top of warming up the OS buffer cache as much as possible, we use OSCache as well.
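(Roughly, OSCache can sit in front of the search page as a servlet filter in web.xml; the class name and the "time" parameter are from the OSCache docs of that era, so verify against your version:)

<filter>
  <filter-name>CacheFilter</filter-name>
  <filter-class>com.opensymphony.oscache.web.filter.CacheFilter</filter-class>
  <init-param>
    <!-- cache rendered pages for 10 minutes -->
    <param-name>time</param-name>
    <param-value>600</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>CacheFilter</filter-name>
  <url-pattern>/search.jsp</url-pattern>
</filter-mapping>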
You could also use Squid to cache pages for x amount
of
With all of the discussions of
killing/restarting/pooling the NutchBean, has anyone noticed that you push your luck in doing so?
I often get "GC failed to collect" and out-of-memory errors when trying to do anything but a clean shutdown.
I'm moving to a 64-bit JVM and Java 1.5, so I'll let you know
You have access to all of the cached data, so possibly, with the mapreduce version and some hacking away at the grep demo, you could pull together data to do what Google did.
--- Meryl Silverburgh [EMAIL PROTECTED]
wrote:
Hi,
Is it possible to use Nutch to collect web statistics like the ones
Google did
Is it possible to pass type:pdf or lang:en as defaults, not in the query string?
change it,
I'd add fields to the search form and make the
default be
checked/selected.
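(Something along these lines, as a rough sketch; note the checkbox value still has to be appended to the query string server-side or with a little JavaScript, since as far as I know the stock search.jsp only reads the query parameter:)

<form action="search.jsp">
  <input name="query" size="44">
  <!-- hypothetical field; append its value to the query before searching -->
  <input type="checkbox" name="filter" value="type:pdf" checked> PDF only
  <input type="submit" value="Search">
</form>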
Jake.
-Original Message-
From: Byron Miller [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 31, 2006 2:11 PM
To: nutch-user@lucene.apache.org
Subject: passing type: or lang: as hidden field
Seems very slow.
What is your platform/OS?
I crawl 1 million pages in about an hour in most cases. I have one client with a huge whitelist, so I'll give that a whirl and get some more numbers.
When you do a crawl, is it based upon injected URLs or a large depth? Are you running into max
http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf
Anyone have any further details on this?
that is what they're going for.
Thanks again for the quick follow up.
--- Doug Cutting [EMAIL PROTECTED] wrote:
Byron Miller wrote:
http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf
Anyone have any further details on this?
The first author of the paper is also
Actually, the process would be to generate your new segments, move the segments to your newer/faster server, fetch those segments, and then copy those segments back to your webdb server and run updatedb there.
You could also index your segments on the faster server. The only process that needs the webdb is the
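(With 0.7-era commands the round trip looks roughly like this; the host name, paths, and segment timestamp are made up for illustration:)

bin/nutch generate db segments -topN 1000000      # on the webdb box
scp -r segments/20060405123456 fetchbox:/nutch/segments/
ssh fetchbox 'cd /nutch && bin/nutch fetch segments/20060405123456'
scp -r fetchbox:/nutch/segments/20060405123456 segments/
bin/nutch updatedb db segments/20060405123456     # back on the webdb box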
Just to add my 2 cents: for the most part, if you have a decent NIC you could issue OS commands to drop the port rate of your interface to 10 Mbit and not waste CPU cycles on shaping/proxying.
Although I do recommend Squid for this, since I too use it to further filter/offload regex/hostname
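(On Linux, for example, assuming the interface is eth0; mii-tool does the same on older kernels:)

ethtool -s eth0 speed 10 duplex full autoneg off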
I want to build fetchlists directly from URL submission and URL-only crawls; is that safe?
(Instead of injecting into the webdb first and then running generate to create the fetchlist.)
Create Fetch
Fetch Content
Update WebDB
Index
I noticed the same thing: the outlinks are fetched during subsequent runs even though you have URL filters in place.
-byron
--- carmmello [EMAIL PROTECTED] wrote:
When someone uses the crawl method with, let's say, 100 sites, you establish your URL filters to allow only those
Pretty much any modern app server.
--- Mike Markzon [EMAIL PROTECTED] wrote:
Can I use Nutch with another web server like Sun ONE
or does it only work with Tomcat?
Thanks,
Mike
src/java/org/apache/nutch/searcher/Summary.java
It has a <b> to mark the term. I typically edit this
and use a CSS style sheet to make this easier to
customize for my clients.
-byron
--- Andy Morris [EMAIL PROTECTED] wrote:
How do I change the background color on the results
page for the
Anyone done any work on outputting JSON-formatted results?
http://www.crockford.com/JSON/index.html
Lighter weight than XML, and popular because of its portability and ease of use as a JavaScript object. Made popular through del.icio.us and Yahoo (and now Google).
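(For example, a hit could come back as something as light as this; the field names are just an illustration, not an agreed format:)

{"title": "Welcome to Nutch", "url": "http://lucene.apache.org/nutch/", "summary": "..."}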
Yes and no. It's my personal belief that when you read "they run the search engine on 40 servers", that means 40 query servers. I doubt they have a 2-billion-page index/db/analysis/processing and spidering operation on just 40 servers, but prove me wrong :)
-byron
--- Dan Segel [EMAIL PROTECTED] wrote:
If I'm not mistaken, doesn't the OpenSearch servlet get around this issue? You could then post-process the XML through a stylesheet/CSS or your favorite scripting language.
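(The servlet is mounted at /opensearch in the stock webapp; if I remember the parameter names right, a request looks like this, but check your deployment:)

http://localhost:8080/opensearch?query=nutch&hitsPerPage=10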
-byron
--- Raghavendra Prabhu [EMAIL PROTECTED] wrote:
Right now, whenever a user comes and searches, a NutchBean is
I know you can enable language detection during index-more; however, is there a method for doing this during the crawl?
I'm interested in building an English-only index right now. What is the theory behind that? Anyone have any experience? Would it be building a huge blacklist, ignoring TLDs until
It's more feasible for me to focus on English-based sites as a whole, since the cultural differences and laws are enough to shy me away from the legal messes other nations can potentially enforce or imply :)
-byron
--- Byron Miller [EMAIL PROTECTED] wrote:
I know you can enable language
I would recommend that you search the list for some
great discussions on NDFS. Doug has a nice writeup of his vision of using a MapReduce job to push the indexes to your query servers, so they're updated as the webdb is and managed that way.
NDFS just wasn't designed for the I/O of a query. You
Check the list for my earlier discussions. There are
tweaks you can do to enhance the performance if you
have available memory resources.
How large are the segments you are indexing? What file system do you use? What OS/JVM are you building your index on?
-byron
--- R.Mayoran [EMAIL
phpAdsNew is OK, but not easy to integrate with a keyword-based system such as search.
I've used Inclick before with moderate success; it was under heavy development at the time, but the developers seem to have a strong base to work from.
In my experience it's not affordable to really do your
Which plugins do you have enabled? Have you optimized
any of your nutch-site settings yet?
-byron
--- Goldschmidt, Dave [EMAIL PROTECTED]
wrote:
Hello,
I'm currently indexing ~50 segments, each ~2GB in
size, for a total of
only ~7,000,000 pages. From the log output, I see
an index
I'm noticing searches returning results that have every TLD for the same site listed. For example, the .org, .com and .net of the same site.
Is there any way to do duplicate detection based upon X% duplicate content and either flag/de-score or delete based upon that?
I have it loaded on mozdex.com and it works fairly
well.
The only thing I noticed is it seems to look for longer versions of a matching phrase rather than immediate common mistakes.
For example, "diat pill" (which is a very common query) comes up as "diatribe pill" instead of "diet pill" :)
BUT as my index grows
Yes, I would love to see it committed so it is maintained through the branches.
--- Jérôme Charron [EMAIL PROTECTED] wrote:
I have it loaded on mozdex.com http://mozdex.com
and it works fairly
well.
Thanks for your feedback Byron.
So it is a good candidate for a commit... I note it.
Bill
Glad to see you working! Nutch is fantastic!
--- Bill Goffe [EMAIL PROTECTED] wrote:
Thanks greatly -- it is nice to have Nutch working
as I hoped it would. The bad segment was indeed
slowing down queries by a factor of 5 or maybe more.
There is a big smile on my face for having
Since you're running Debian, can you confirm your JAVA_HOME points to 1.4.2 and not Kaffe, for both Nutch and Tomcat?
If you have corruption, you may want to start over.
My laptop runs queries on 300k pages quicker than this server yields results.
Was your crawl/fetch performing terribly as well, or
I run on CentOS 4.2 (a RHEL clone) with JDK 5 and the Resin free edition.
Works like a charm.
--- AJ Chen [EMAIL PROTECTED] wrote:
Has anyone successfully run Nutch on Fedora Core 3 or 4 Linux? Is Red Hat Linux better, or no different, for a Nutch application? I'm getting an AMD Opteron server and want
I'm looking to see if I can pull a meta description in lieu of a summary for some content, and wondering if this is indexed. Is there an easy way to see the fields indexed by default and how they're exposed through the NutchBean?
Zaheed
On 10/31/05, Byron Miller [EMAIL PROTECTED]
wrote:
I got this to work this evening; it was a problem with the patch on the system I was working on.
Feel free to check it out on slashdot.org: you can try an example of searching for "slashdt" and it should recommend the good site
Can fetchnewonly work with -topN, so you fetch new URLs only, working from the top down, or do they not work together?
Any idea if this will be implemented in the mapred/0.7 branches?
Does it have to get voted in?
Anyone using this patch?
http://issues.apache.org/jira/browse/NUTCH-48
I would like to incorporate this, but I'm not having much luck getting the patch to apply over the svn release (branch 0.7).
-byron
I got this to work this evening; it was a problem with the patch on the system I was working on.
Feel free to check it out on slashdot.org: you can try an example of searching for "slashdt" and it should recommend the good site :)
-byron
--- Byron Miller [EMAIL PROTECTED] wrote:
Anyone using
For what it's worth, I fetch my segments of 1 million URLs with 80 threads at a time and see no slowdowns.
I'll grab some of my stats and publish them, but I haven't had problems with the fetcher slowing down like this in a long time.
(Linux/CentOS 4.2 platform)
-byron
--- Andrzej Bialecki [EMAIL
051028 083415 DONE indexing segment 20051019000305:
total 10 records in 520.156 s (192.3077 rec/s).
051028 083415 done indexing
Been doing some testing, and I've pretty much peaked out at 192-200 rec/s on a 2.8 GHz machine with language identification enabled, on 512 bytes of data @ 3-grams, which after tweaking
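(If I remember the language-identifier property names right, the knobs in question look like this; the values are the ones from my test, so verify the names against the plugin's defaults:)

<property>
  <name>lang.analyze.max.length</name>
  <value>512</value>
  <description>Bytes of content to analyze for language identification.</description>
</property>
<property>
  <name>lang.ngram.max.length</name>
  <value>3</value>
  <description>Maximum n-gram length used by the profiler.</description>
</property>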
Initially, a high index merge factor caused out-of-file-handle errors, but increasing the other settings along with it seemed to help get around that.
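(Concretely, the overrides in question, with illustrative values only; if you raise indexer.mergeFactor, raise the OS open-file limit, e.g. ulimit -n, along with it:)

<property>
  <name>indexer.mergeFactor</name>
  <value>50</value>
  <description>Lucene mergeFactor: higher indexes faster but opens more files at once.</description>
</property>
<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value>
  <description>Docs buffered in RAM before a new on-disk segment is written.</description>
</property>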
-byron
--- Doug Cutting [EMAIL PROTECTED] wrote:
Byron Miller wrote:
For example I've been tweaking max merge/min merge and such and I've
My testing is on 100k documents, but most of the time I work with 1 million, so I don't have a gazillion segments across my servers.
I'll try adjusting that number down and see what happens.
-byron
--- Doug Cutting [EMAIL PROTECTED] wrote:
Byron Miller wrote:
If you use site:mydomain.com instead of
site:www.mydomain.com, shouldn't the query search
home.mydomain.com, news.mydomain.com, or any prefixed URL of that domain?
Thanks for the heads-up on this information! I'll be sure to let you know how my luck goes in tweaking these parameters.
-byron
--- Jérôme Charron [EMAIL PROTECTED] wrote:
Are there any tips/pointers for beefing this up? Anyone else have any index benchmarks with/without this enabled?
Before, with Nutch 0.7 svn defaults:
051027 135317 DONE indexing segment 20051019145225-2:
total 100155 records in 2108.737 s (47.51186 rec/s).
051027 135317 done indexing
After:
051027 142316 DONE indexing segment 20051019145225-3:
total 103838 records in 1413.624 s (73.48762 rec/s).
051027 142316
When generating an index from a segment, is there a measure of peak performance?
For example, I've been tweaking max merge/min merge and such, and I've been able to double my performance without increasing anything but CPU load.
Is there a point where tweaking these will cause a heavier I/O load or