generate failes - class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper

2006-04-05 Thread Byron Miller
Got the following dump at 100% of generate cycle (.8 svn release) 060405 080019 parsing file:/home/mozdex/trunk/conf/nutch-site.xml 060405 080019 parsing file:/home/mozdex/trunk/conf/hadoop-site.xml Exception in thread main java.lang.RuntimeException: class

Re: generate failes - class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper

2006-04-05 Thread Byron Miller
hehe, just pulled it down and trying again :) thanks! --- Jérôme Charron [EMAIL PROTECTED] wrote: Andrzej fixed it 2 hours ago. http://svn.apache.org/viewcvs.cgi?rev=391577&view=rev Thanks Jérôme On 4/5/06, Byron Miller [EMAIL PROTECTED] wrote: Got the following dump

Re: project vitality?

2006-03-05 Thread Byron Miller
I like to think of it as a framework: building blocks to build what you ultimately need. If you're after a one-stop shop, plug and play, no development necessary, then perhaps a commercial system may be your best bet. The mailing list is very active; most people get responses fairly quickly.

Re: language-identifier and language filter

2006-03-05 Thread Byron Miller
Make sure you have language-identifier enabled in your web deployment as well. WEB-INF/classes/nutch-site.xml or nutch-default.xml and restart your app server. -byron --- Teruhiko Kurosaka [EMAIL PROTECTED] wrote: Hello, I enabled language-identifier plugin and indexed some documents. But

Re: query-more and date range

2006-03-02 Thread Byron Miller
You have to add query-more as one of your plugins. If you don't rebuild your war file then you have to add query-more to the nutch-site or nutch-conf under WEB-INF/classes and restart. You will need to re-index as well so it can index these values. --- Teruhiko Kurosaka [EMAIL PROTECTED] wrote: I
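
The check described above (is a given plugin actually enabled in the deployed config?) can be sketched in Python. This is only a sketch under assumptions: the sample XML, its root element name, and the plugin list are made up for illustration, and treating plugin.includes as a pattern that must match the whole plugin id is an assumption about how Nutch applies it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical config content; the root element name and the plugin list
# are illustrative assumptions, not copied from a real deployment.
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)|language-identifier</value>
  </property>
</configuration>
"""


def plugin_includes(xml_text):
    """Return the plugin.includes pattern from the config text, or None."""
    root = ET.fromstring(xml_text)
    # iter() walks all descendants, so the root element name does not matter.
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None


def is_enabled(xml_text, plugin_id):
    """True if plugin_id matches the plugin.includes pattern as a whole."""
    pattern = plugin_includes(xml_text)
    return pattern is not None and re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    for plugin in ("query-more", "language-identifier", "clustering-carrot2"):
        print(plugin, is_enabled(SAMPLE_SITE_XML, plugin))
```

Running the same check against both the crawler-side config and the copy under WEB-INF/classes would show whether the web deployment is missing a plugin that the indexer had enabled.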

Re: speed concerns, calling nutch from php

2006-03-01 Thread Byron Miller
I've used Magpie rss library in PHP with great success to do fast parsing of the OpenSearch XML data. How long are your opensearch queries taking without going through PHP to return results? -byron --- Insurance Squared Inc. [EMAIL PROTECTED] wrote: We've built a php frontend onto nutch.
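
For comparison with the PHP approach, here is a minimal Python sketch of pulling hits out of an OpenSearch RSS response. The sample feed is invented for illustration, and the element names used (totalResults in the OpenSearch RSS 1.0 namespace, plain RSS item/title/link/description) are assumptions about what the servlet returns; adjust them to the actual output.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenSearch RSS 1.0 extension elements (assumed here).
OPENSEARCH_NS = "http://a9.com/-/spec/opensearchrss/1.0/"

# Hypothetical response body, for illustration only.
SAMPLE_FEED = """<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearchrss/1.0/">
  <channel>
    <title>Example search: nutch</title>
    <opensearch:totalResults>2</opensearch:totalResults>
    <item>
      <title>First hit</title>
      <link>http://example.com/a</link>
      <description>Summary of the first hit</description>
    </item>
    <item>
      <title>Second hit</title>
      <link>http://example.com/b</link>
      <description>Summary of the second hit</description>
    </item>
  </channel>
</rss>
"""


def parse_results(feed_text):
    """Return (total_results, hits) where hits is a list of dicts."""
    channel = ET.fromstring(feed_text).find("channel")
    total = int(channel.findtext("{%s}totalResults" % OPENSEARCH_NS, "0"))
    hits = [
        {
            "title": item.findtext("title", ""),
            "link": item.findtext("link", ""),
            "summary": item.findtext("description", ""),
        }
        for item in channel.findall("item")
    ]
    return total, hits


if __name__ == "__main__":
    total, hits = parse_results(SAMPLE_FEED)
    print(total, [hit["link"] for hit in hits])
```

Timing this parse separately from the HTTP request is one way to answer the question above: it shows whether the delay is in the search itself or in the front end.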

Re: Off-topic:scsi vs sata/speed

2006-02-09 Thread Byron Miller
The impact of drive speeds isn't that large for queries as long as the server is only handling queries. If you are processing data at the same time then SCSI or SATA II with tag queuing would be best. As far as RAID 0, that helps more with many smaller drives than with a few larger drives. You will

Re: Speeding up initial searches using cache

2006-02-07 Thread Byron Miller
I use OSCache with great success. I would say an amazing amount (more than I assumed) of the queries we get are duplicates of one fashion or another, so on top of warming things up as much as possible in the OS buffer cache we use OSCache as well. You could also use Squid to cache pages for x amount of

Re: Updating the search index

2006-02-03 Thread Byron Miller
With all of the discussions of killing/restarting/pooling nutch bean has anyone noticed that you push your luck in doing so? I often get GC failed to collect, out of memory errors and such when trying to do anything but a clean shutdown. I'm moving to 64bit jvm and java 1.5 so i'll let you know

Re: Use Nutch to collect web statistic

2006-01-31 Thread Byron Miller
You have access to all of the cached data, so possibly with the mapreduce version and hacking away at the grep demo you could pull together data to do what google did. --- Meryl Silverburgh [EMAIL PROTECTED] wrote: Hi, Is it possible to use Nutch to collect web statistic like the one google did

passing type: or lang: as hidden field (not in query)

2006-01-31 Thread Byron Miller
Is it possible to pass type:pdf or lang:en as defaults not in the query string?

RE: passing type: or lang: as hidden field (not in query)

2006-01-31 Thread Byron Miller
change it, I'd add fields to the search form and make the default be checked/selected. Jake. -Original Message- From: Byron Miller [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 31, 2006 2:11 PM To: nutch-user@lucene.apache.org Subject: passing type: or lang: as hidden field

Re: crawl/update speed

2006-01-23 Thread Byron Miller
Seems very slow. What is your platform/OS? I crawl 1 million pages in about an hour in most cases. I have one client with a huge whitelist so i'll give that a whirl and get some more numbers. When you do a crawl is it based upon injected urls or a large depth? are you running into max

interesting paper with competing index systems

2006-01-19 Thread Byron Miller
http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf Anyone have any further details on this?

Re: interesting paper with competing index systems

2006-01-19 Thread Byron Miller
that is what they're going for. Thanks again for the quick follow up. --- Doug Cutting [EMAIL PROTECTED] wrote: Byron Miller wrote: http://www.cs.yorku.ca/~mladen/pdf/Read6_u.pisa-attardi.tera.pdf Anyone have any further details on this? The first author of the paper is also

Re: Nutch system running on multiple servers | fetcher

2006-01-17 Thread Byron Miller
Actually the process would be to generate your new segments, move the segments to your newer/faster server, fetch those segments and then copy those segments to your webdb and run updatedb there. you could also index your segments on the faster server. The only process that needs webdb is the

Re: throttling bandwidth

2006-01-17 Thread Byron Miller
Just to add my 2 cents, for the most part if you have a decent nic card you could issue OS commands to drop the port rate of your interface to 10mbit and not waste cpu cycles on shaping/proxying. Although i do recommend squid for this since i too use it to further filter/offload regex/hostname

is it safe to inject into fetchlist directly?

2006-01-16 Thread Byron Miller
I want to build fetchlists directly from url submission and url only crawls, is that safe? (instead of injecting into webdb first and then running generate to create the fetchlist) Create Fetch Fetch Content Update WebDB Index

Re: URL filters and outlinks

2006-01-13 Thread Byron Miller
I noticed the same thing that the outlinks are fetched during subsequent runs even though you have URLfilters in place. -byron --- carmmello [EMAIL PROTECTED] wrote: When someone uses the crawl method with, lets say 100 hundred sites, you establish your url filters to allow only those

Re: Only Tomcat?

2006-01-12 Thread Byron Miller
Pretty much any modern app server. --- Mike Markzon [EMAIL PROTECTED] wrote: Can I use Nutch with another web server like Sun ONE or does it only work with Tomcat? Thanks, Mike

Re: Background color searched word

2006-01-11 Thread Byron Miller
src/java/org/apache/nutch/searcher/Summary.java It has a <b> tag to mark the term. I typically edit this and use a CSS style sheet to make this easier to customize for my clients. -byron --- Andy Morris [EMAIL PROTECTED] wrote: How do I change the background color on the results page for the

JSON output

2006-01-07 Thread Byron Miller
Anyone done any work on outputting JSON-formatted results? http://www.crockford.com/JSON/index.html Lighter weight than XML and pretty popular because of its portability and ease of use as a JavaScript object. Made pretty popular through del.icio.us and Yahoo (and now google)
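
As a rough idea of what JSON-formatted results could look like, here is a small Python sketch. The field names (query, totalResults, items, title, url, summary) are hypothetical choices for illustration, not an existing Nutch output format.

```python
import json


def hits_to_json(query, total, hits):
    """Serialize a result page to a JSON string (field names are hypothetical)."""
    payload = {
        "query": query,
        "totalResults": total,
        "items": [
            {"title": h["title"], "url": h["url"], "summary": h["summary"]}
            for h in hits
        ],
    }
    # ensure_ascii=False keeps non-ASCII titles readable in the output.
    return json.dumps(payload, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    sample_hits = [
        {"title": "Example", "url": "http://example.com/", "summary": "An example page"},
    ]
    print(hits_to_json("example", 1, sample_hits))
```

A browser-side script can then consume the response as a plain JavaScript object, which is the ease-of-use point made above.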

Re: please disregard last post....

2006-01-05 Thread Byron Miller
Yes and no. It's my personal belief that when you read they run the search engine on 40 servers that is 40 query servers. I doubt they have a 2billion page index/db/analysis/processing and spidering on just 40 servers but prove me wrong :) -byron --- Dan Segel [EMAIL PROTECTED] wrote:

Re: pooling for nutch bean

2006-01-05 Thread Byron Miller
If i'm not mistaken doesn't the opensearch servlet get around this issue? You could then post process the xml through a stylesheet/css or your favorite scripting language. -byron --- Raghavendra Prabhu [EMAIL PROTECTED] wrote: Right now Whenever an user comes and searches ,a NutchBean is

Limiting search/crawl to specific language

2006-01-03 Thread Byron Miller
I know you can enable language detect during index-more however is there a method to doing this during the crawl? I'm interested in building an index as english only right now. what is the theory behind that? anyone have any experience? would it be building a huge black list, ignoring tlds until

Re: Limiting search/crawl to specific language

2006-01-03 Thread Byron Miller
it's more feasible for me to focus on english based sites as a whole since the cultural differences and laws are enough to shy me away from getting into the legal messes other nations can potentially enforce or imply :) -byron --- Byron Miller [EMAIL PROTECTED] wrote: I know you can enable language

Re: Setting Search over NDFS

2005-12-28 Thread Byron Miller
I would recommend that you search the list for some great discussions on NDFS. Doug has a nice writeup of his vision of using a map reduce job to push the indexes to your query servers so they're updated as the webdb is and managed that way. NDFS just wasn't designed for the I/O of a query. You

Re: Clustering Index job

2005-12-28 Thread Byron Miller
Check the list for my earlier discussions. There are tweaks you can do to enhance the performance if you have available memory resources. How large are your segments that you are indexing? what file system do you use? what OS /JVM are you building your index on? -byron --- R.Mayoran [EMAIL

Re: ad feed for nutch

2005-12-07 Thread Byron Miller
phpadsnew is ok.. not easy to integrate with a keyword based system such as search. I've used Inclick before with moderate success.. was under heavy development at the time however the developers seem to have a strong base to work from. With my experience it's not affordable to really do your

Re: Speed of indexing

2005-12-05 Thread Byron Miller
Which plugins do you have enabled? Have you optimized any of your nutch-site settings yet? -byron --- Goldschmidt, Dave [EMAIL PROTECTED] wrote: Hello, I'm currently indexing ~50 segments, each ~2GB in size, for a total of only ~7,000,000 pages. From the log output, I see an index

How to detect/manage duplicates across multiple tld's

2005-11-21 Thread Byron Miller
I'm noticing searches returning results that have every tld for the same site listed. For example .org, .com and .net of the same site. is there anyway to do duplicate detection based upon X% of duplicate content and either flag/descore or delete based upon that?
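
One way to approach the "X% of duplicate content" idea is word shingling with Jaccard similarity, sketched below in Python. This is a generic near-duplicate technique, not something Nutch is known to do here; the shingle size, the threshold, and the sample pages are all assumptions for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (whole text if shorter than k)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(pages, threshold=0.9):
    """pages maps url -> text. Return (url_a, url_b, similarity) pairs at or above threshold."""
    urls = sorted(pages)
    sets = {url: shingles(pages[url]) for url in urls}
    pairs = []
    for i, a in enumerate(urls):
        for b in urls[i + 1:]:
            similarity = jaccard(sets[a], sets[b])
            if similarity >= threshold:
                pairs.append((a, b, similarity))
    return pairs


if __name__ == "__main__":
    body = "welcome to the example site with news and downloads for everyone"
    pages = {
        "http://example.com/": body,
        "http://example.org/": body,
        "http://other.example/": "a completely different page about cooking pasta at home",
    }
    print(near_duplicates(pages))
```

The pairwise comparison is quadratic in the number of pages, so for a real index it would only be practical within small candidate groups, for example pages whose hosts differ only by tld.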

Re: Spelling

2005-11-18 Thread Byron Miller
I have it loaded on mozdex.com and it works fairly well. Only thing i noticed is it seems to look for longer versions of a matching phrase vs immediate common mistakes. For example diat pill (which is a very common query) comes up as diatribe pill instead of diet pill :) BUT as my index grows

Re: Spelling

2005-11-18 Thread Byron Miller
Yes, i would love to see it committed so it is maintained through the branches.. --- Jérôme Charron [EMAIL PROTECTED] wrote: I have it loaded on mozdex.com and it works fairly well. Thanks for your feedback Byron. So it is a good candidate for a commit... I note it.

Re: Slow Searches -- What to Tune?

2005-11-16 Thread Byron Miller
Bill Glad to see you working! Nutch is fantastic! --- Bill Goffe [EMAIL PROTECTED] wrote: Thanks greatly -- it is nice to have Nutch working as I hoped it would. The bad segment was indeed slowing down queries by a factor of 5 or maybe more. There is a big smile on my face for having

Re: Slow Searches -- What to Tune?

2005-11-14 Thread Byron Miller
Since you're running Debian, can you confirm your java_home points to 1.4.2 and not Kaffe for both Nutch and Tomcat? If you have corruption, you may want to start over. My laptop runs quicker queries on 300k pages than this server yields results. Was your crawl/fetch performing terribly as well or

Re: linux OS question

2005-11-04 Thread Byron Miller
I run on Centos 4.2 (RHEL clone) with JDK 5 and Resin free edition. works like a charm. --- AJ Chen [EMAIL PROTECTED] wrote: Has anyone successfully run Nutch on Fedora 3 or 4 linux ? Is Redhat Linux better or no difference for Nutch application? I'm getting a AMD Opteron server and want

Which fields can you call via detail.getvalue(....) out of the box?

2005-11-01 Thread Byron Miller
I'm looking to see if i can pull a meta description in lieu of summary for some content and wondering if this is indexed - is there an easy way to see the fields indexed by default and how they're exposed through nutch bean?

Re: Jira - Nutch 48 - did you mean patch

2005-10-31 Thread Byron Miller
Zaheed On 10/31/05, Byron Miller [EMAIL PROTECTED] wrote: I got this to work this evening.. was a problem with patch on the system i was working on.. feel free to check it out on slashdot.org.. you can try an example of searching for slashdt and it should recommend the good site

Jira NUTCH-49 - fetchnewonly

2005-10-30 Thread Byron Miller
Can fetchnewonly work with -topN so you fetch new urls only working from the top down or do they not work together?

Jira NUTCH-59 - incorporating dmoz.org metadata

2005-10-30 Thread Byron Miller
Any idea if this will be implemented in the mapread/.07 branches? Does it have to get voted in?

Jira - Nutch 48 - did you mean patch

2005-10-30 Thread Byron Miller
Anyone using this patch? http://issues.apache.org/jira/browse/NUTCH-48 I would like to incorporate this, but not having much luck getting the patch to install over svn release (branch .7) -byron

Re: Jira - Nutch 48 - did you mean patch

2005-10-30 Thread Byron Miller
I got this to work this evening.. was a problem with patch on the system i was working on.. feel free to check it out on slashdot.org.. you can try an example of searching for slashdt and it should recommend the good site :) -byron --- Byron Miller [EMAIL PROTECTED] wrote: Anyone using

Re: fetch questions - freezing

2005-10-28 Thread Byron Miller
For what it's worth i fetch my segments of 1 million urls with 80 threads at a time and no slow downs. I'll grab some of my stats and publish them, but i haven't had problems with fetcher slowing down like this in a long time. (linux/Centos 4.2 platform) -byron --- Andrzej Bialecki [EMAIL

Indexer Performance - up to 200+ rec/s with Lang identification enabled

2005-10-28 Thread Byron Miller
051028 083415 DONE indexing segment 20051019000305: total 100000 records in 520.156 s (192.3077 rec/s). 051028 083415 done indexing Been doing some testing and i've pretty much peaked out at 192-200 rec/s on a 2.8ghz machine with lang ident enabled on 512bytes data @ 3ngrams which after tweaking

Re: Peak index performance

2005-10-28 Thread Byron Miller
</description> </property> Initially, a high index merge factor caused out-of-file-handle errors, but increasing the others along with it seemed to help get around that. -byron --- Doug Cutting [EMAIL PROTECTED] wrote: Byron Miller wrote: For example i've been tweaking max merge/min merge and such and i've

Re: Peak index performance

2005-10-28 Thread Byron Miller
My testing is on 100k documents, but most of the time i work with 1 million so i don't have a gazillion segments across my servers. i'll try and adjust that number down and see what happens. -byron --- Doug Cutting [EMAIL PROTECTED] wrote: Byron Miller wrote: property

using site:mydomain.com searches question

2005-10-28 Thread Byron Miller
If you use site:mydomain.com instead of site:www.mydomain.com, shouldn't the query search home.mydomain.com, news.mydomain.com or any prefixed url of that domain?

Re: Index performance with language identifier enabled

2005-10-27 Thread Byron Miller
Thanks for the headsup on this information! I'll be sure to let you know how my luck goes in tweaking out these parameters. -byron --- Jérôme Charron [EMAIL PROTECTED] wrote: Is there any tips/pointers to beefing this up? Anyone else have any index benchmarks with/without this enabled

Re: Index performance with language identifier enabled

2005-10-27 Thread Byron Miller
Before with nutch .7 svn defaults 051027 135317 DONE indexing segment 20051019145225-2: total 100155 records in 2108.737 s (47.51186 rec/s). 051027 135317 done indexing after 051027 142316 DONE indexing segment 20051019145225-3: total 103838 records in 1413.624 s (73.48762 rec/s). 051027 142316
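
The before/after figures above can be checked with a few lines of Python. The record counts and timings are taken from the two log lines; the observation that the logged rates match division by whole elapsed seconds is an inference from the numbers, not something confirmed against the indexer source.

```python
def rate(records, seconds):
    """Records indexed per second."""
    return records / seconds


# Values copied from the "before" and "after" log lines.
BEFORE = {"records": 100155, "seconds": 2108.737, "logged_rate": 47.51186}
AFTER = {"records": 103838, "seconds": 1413.624, "logged_rate": 73.48762}


def summarize(run):
    """Return (rate using exact seconds, rate using whole seconds)."""
    exact = rate(run["records"], run["seconds"])
    whole = rate(run["records"], int(run["seconds"]))
    return exact, whole


def speedup(before, after):
    """Ratio of the logged rates, after versus before."""
    return after["logged_rate"] / before["logged_rate"]


if __name__ == "__main__":
    for name, run in (("before", BEFORE), ("after", AFTER)):
        exact, whole = summarize(run)
        print(name, round(exact, 2), round(whole, 5), run["logged_rate"])
    print("speedup", round(speedup(BEFORE, AFTER), 2))
```

On these numbers the tuned run is roughly 1.55 times as fast as the defaults, on segments of about the same size.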

Peak index performance

2005-10-27 Thread Byron Miller
When generating an index from a segment, is there a measure of peak performance? For example i've been tweaking max merge/min merge and such and i've been able to double my performance without increasing anything but cpu load.. Is there a point that tweaking these will cause a heavier IO load or