RE: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Howie Wang
I definitely don't expect people to write it just because it happens to be useful to me :-) Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a

RE: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-13 Thread Howie Wang
Thanks for the input, Andrzej. Yes, I'm still working off of 0.7. I might still try it since I'm not planning on upgrading for a while, but it sounds like it's not going to port to the current versions. Howie

RE: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Howie Wang
Please make the following test using your favorite relational DB:
* create a table with 300 million rows and 10 columns of mixed type
* select 1 million rows, sorted by some value
* update 1 million rows to different values
If you find that these operations take less time than with

RE: Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Howie Wang
Sorry about the previous crappily formatted message. In brief, my point was that a relational DB might perform better for small niche users, plus you get the flexibility of SQL. No more writing custom code to tweak webdb. Howie

RE: ask a problem about nutch (from China)

2006-09-15 Thread Howie Wang
Hi, I am using Nutch to develop a search engine. I need to fetch WML pages. I understand that this needs a plugin (parse-wml), which is used to get WML pages from the internet. So my questions are: 1. How can I write the plugin? 2. How can the plugin be configured in Nutch? The following page should get

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Howie Wang
I have made some quick tests with regex-urlfilter... The major problem is that it doesn't use the Perl syntax... For instance, it doesn't support the boundary matchers ^ and $ (which are used in nutch). Are there other ways to match start/end of string in the other regex library? I use ^http a
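On the start/end-of-string question, here is a minimal sketch of one possible workaround. It assumes the alternative library always matches the pattern against the whole input (an assumption, not something confirmed in this thread); java.util.regex's matches() stands in for that behavior. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // Perl-style: an unanchored search, with ^ pinning the match to the start.
    static boolean perlStyle(String url) {
        return Pattern.compile("^http").matcher(url).find();
    }

    // Whole-string style (assumed behavior of the alternative library):
    // no anchors are available, the pattern must cover the entire input,
    // so "starts with http" is written as "http" followed by anything.
    static boolean wholeString(String url) {
        return Pattern.compile("http.*").matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {"http://example.com/", "ftp://example.com/http"};
        for (String url : urls) {
            System.out.println(url + " -> " + perlStyle(url) + " / " + wholeString(url));
        }
    }
}
```

If the whole-string assumption holds, a filter rule such as ^http could be rewritten mechanically as http.* and an end anchor such as \.gif$ as .*\.gif, without changing which URLs match.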

Re: Much faster RegExp lib needed in nutch?

2006-03-13 Thread Howie Wang
Thanks to everybody for your suggestions. But really, my problem is not technical but political: what should we do if we switch to the automaton regexp lib? 1. Keep the well-known Perl syntax for regexps (and then find a way to simulate it with automaton's limited syntax)? 2. Switch to the

Re: quality of search text

2006-03-12 Thread Howie Wang
I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of continuous block to ignore inline tags such as SPAN, I, B, TT etc, so only certain tags would actually break the content into chunks. Snippets then
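A rough sketch of the "continuous block" idea as described above, not the actual Nutch parser: inline tags (SPAN, I, B, TT) are dropped so they do not break a block, and every other tag acts as a chunk boundary. The regex-based approach and all names here are assumptions for illustration; a real implementation would work on parsed HTML.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkDemo {
    // Split HTML into continuous text blocks, ignoring a fixed set of inline tags.
    static List<String> chunks(String html) {
        // 1. Remove inline tags so the text around them stays in one block.
        String noInline = html.replaceAll("(?i)</?(span|i|b|tt)\\b[^>]*>", "");
        // 2. Any remaining tag breaks the content into chunks.
        List<String> result = new ArrayList<>();
        for (String part : noInline.split("<[^>]+>")) {
            String text = part.trim();
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>big</b> world</p><div>Second <span>block</span></div>";
        System.out.println(chunks(html));
    }
}
```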

Re: Nutch Improvement - HTML Parser

2006-02-25 Thread Howie Wang
I wouldn't go so far as to call it stupid, but I wouldn't mind having an html parser not built on DOM. Meta info can still be gotten without a full DOM parse. Boosting phrases within certain tags (H1,H2,...) would be nice, but it won't necessarily be useful for everyone, and we aren't doing it

Re: Strange search results

2005-08-05 Thread Howie Wang
Hello, In my experience it is very important to use anchor text and give it quite a high boost. It allows me to return http://www.aa.com when a user searches for American Airlines - without using anchor text it was impossible to achieve - a lot of sites (spam or not) with american airlines in url and

RE: Strange search results

2005-08-03 Thread Howie Wang
text doesn't show up on the text of the page, so maybe that's it. Andy On 8/3/05, Howie Wang [EMAIL PROTECTED] wrote: Hi, I've been noticing some strange search results recently. I seem to be getting two issues. 1. The fieldNorm for certain terms is unusually high for certain sites

Strange search results

2005-08-02 Thread Howie Wang
Hi, I've been noticing some strange search results recently. I seem to be getting two issues. 1. The fieldNorm for certain terms is unusually high for certain sites for anchors and titles. And they are usually just whole numbers (4.0, 5.0, etc). I find this strange since the lengthNorm used
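For context, a small sketch assuming Lucene's default length normalization, 1/sqrt(numTerms). Nutch may override this per field, so treat the formula as an assumption rather than what this index actually uses. Under that formula the length norm alone never exceeds 1.0, so whole-number values like 4.0 or 5.0 would have to come from something multiplied in on top of it, such as a boost.

```java
class LengthNormDemo {
    // Lucene's default length normalization (assumed here):
    // 1 / sqrt(number of terms in the field).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        int[] fieldLengths = {1, 4, 16, 100};
        for (int n : fieldLengths) {
            System.out.println(n + " terms -> lengthNorm " + defaultLengthNorm(n));
        }
    }
}
```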

RE: fetching behavior of Nutch

2005-07-24 Thread Howie Wang
There are probably two settings you'll need to tweak in nutch-default.xml:
http.content.limit -- by default it's 64K; if the page is larger than that, then it essentially truncates the file. You could be missing lots of links that appear later in the page.
max.outlinks.per.page -- by default
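To illustrate why a content limit can hide links, here is a toy sketch (not Nutch code): a page is cut at 64K and the links are counted before and after. The limit is applied to characters here for simplicity, whereas a fetcher would count bytes; the page layout, link pattern, and all names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TruncationDemo {
    static final int CONTENT_LIMIT = 64 * 1024; // the 64K default mentioned above

    // Cut the content at the limit, as a truncating fetcher would.
    static String truncate(String content, int limit) {
        return content.length() <= limit ? content : content.substring(0, limit);
    }

    // Very rough link count: occurrences of "<a href=".
    static int countLinks(String html) {
        Matcher m = Pattern.compile("<a\\s+href=").matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Hypothetical page: one early link, filler text, one late link.
    static String buildPage(int fillerChars) {
        StringBuilder sb = new StringBuilder("<a href=\"/first\">first</a>");
        for (int i = 0; i < fillerChars; i++) {
            sb.append('x');
        }
        return sb.append("<a href=\"/late\">late</a>").toString();
    }

    public static void main(String[] args) {
        String page = buildPage(70000);
        System.out.println("links in full page:      " + countLinks(page));
        System.out.println("links after truncation:  " + countLinks(truncate(page, CONTENT_LIMIT)));
    }
}
```

Any link that starts past the limit never reaches the parser, so it is never added to the crawl.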

RE: bin/nutch issue - on Mac OS X

2005-07-19 Thread Howie Wang
It works for me and I'm on Cygwin. Howie I'm getting expr: syntax error when running all bin/nutch commands. It comes from this line: if expr match `uname` 'CYGWIN*' /dev/null; then should this be modified to be this instead: if expr `uname` : 'CYGWIN*' /dev/null; then That