I definitely don't expect people to write it just because it happens to be
useful to me :-) Call me crazy, but I'm thinking of implementing this when I
get some free time (whenever that will be). It seems that I would just need to
implement IWebDBWriter and IWebDBReader, and then add a
Thanks for the input, Andrzej. Yes, I'm still working off of 0.7. I might
still try it since I'm not planning on upgrading for a while, but it sounds
like it's not going to port to the current versions.
Howie
Please make the following test using your favorite relational DB:
* create a table with 300 mln rows and 10 columns of mixed type
* select 1 mln rows, sorted by some value
* update 1 mln rows to different values
If you find that these operations take less time than with
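A scaled-down sketch of that proposed benchmark, using Python's bundled sqlite3 purely as a stand-in for "your favorite relational DB" (the row counts here are token-sized for illustration; the post's real test uses the numbers above):

```python
# Sketch only: same three operations as the post, at toy scale.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# 1. create a table with 10 columns of mixed type
cur.execute("""CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER, c2 REAL,
               c3 TEXT, c4 BLOB, c5 INTEGER, c6 REAL, c7 TEXT, c8 INTEGER,
               c9 TEXT)""")
cur.executemany("INSERT INTO t (c1, c3) VALUES (?, ?)",
                [(i % 7, "row%d" % i) for i in range(1000)])

# 2. select rows, sorted by some value
rows = cur.execute("SELECT id, c1 FROM t ORDER BY c1").fetchall()

# 3. update rows to different values
cur.execute("UPDATE t SET c1 = c1 + 100 WHERE id <= 500")
con.commit()
print(len(rows))  # 1000
```

Timing each step (e.g. with `time.perf_counter()`) at real scale is what would make or break the poster's argument.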
Sorry about the previous crappily formatted message. In brief, my point was
that a relational DB might perform better for small niche users, and plus you
get the flexibility of SQL. No more writing custom code to tweak webdb.
Howie
Hi,
I am using nutch to develop a search engine.
I must fetch WML pages.
About this, I have learned that it needs a plugin (parse-wml). The plugin is
used to get WML pages from the internet.
So my problems are:
1. How can I write the plugin?
2. How can the plugin be configured in nutch?
The following page should get
I have made some quick tests with regex-urlfilter...
The major problem is that it doesn't use the Perl syntax...
For instance, it doesn't support the boundary matchers ^ and $ (which are
used in nutch)
Are there other ways to match start/end of string in the other
regex library? I use ^http a
Thanks to everybody for your suggestions.
But really, my problem is not technical, but political:
What should we do if we switch to the automaton regexp lib?
1. Keep the well-known Perl syntax for regexps (and then find a way to
simulate them with automaton's limited syntax)?
2. Switch to the
I'd agree that (2) is quite important for the end user; Richard's
continuous text heuristic may actually work for that. I'd extend the
meaning of continuous block to ignore inline tags such as SPAN, I, B, TT
etc, so only certain tags would actually break the content into chunks.
Snippets then
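The heuristic described above can be mocked up quickly. This is a hedged sketch only (the tag set, the regex tokenizer, and the function name `chunks` are all assumptions for illustration, not Nutch's actual parser): inline tags are skipped so they don't break the running text, while any other tag ends the current chunk.

```python
# Sketch of the "continuous block" idea: inline tags (SPAN, I, B, TT, ...)
# are treated as part of the running text; only other tags break the
# content into chunks.
import re

INLINE = {"span", "i", "b", "tt", "em", "strong", "a"}

def chunks(html):
    parts = []
    buf = []
    # crude tokenizer: split into tags and text runs (illustrative only)
    for token in re.split(r"(<[^>]+>)", html):
        if token.startswith("<"):
            name = re.match(r"</?\s*([a-zA-Z0-9]+)", token)
            if name and name.group(1).lower() in INLINE:
                continue  # inline tag: does not break the chunk
            if "".join(buf).strip():
                parts.append("".join(buf).strip())
            buf = []
        else:
            buf.append(token)
    if "".join(buf).strip():
        parts.append("".join(buf).strip())
    return parts

print(chunks("<p>Hello <b>world</b>, fine.</p><div>Next chunk</div>"))
# → ['Hello world, fine.', 'Next chunk']
```

Snippet extraction could then prefer the longest such chunk containing the query terms.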
I wouldn't go so far as to call it stupid, but I wouldn't mind
having an html parser not built on DOM. Meta info can still
be gotten without a full DOM parse. Boosting phrases within
certain tags (H1,H2,...) would be nice, but it won't necessarily
be useful for everyone, and we aren't doing it
Hello,
In my experience it is very important to use anchor text giving it
quite high boost. It allows me to return http://www.aa.com when user
searches for American Airlines - without using anchor text it was
impossible to achieve - a lot of sites (spam or not) with american
airlines in url and
text doesn't show up on the text of the page, so maybe that's it.
Andy
On 8/3/05, Howie Wang [EMAIL PROTECTED] wrote:
Hi,
I've been noticing some strange search results recently. I seem
to be getting two issues.
1. The fieldNorm for certain terms is unusually high for certain sites
for anchors and titles. And they are usually just whole numbers (4.0, 5.0,
etc).
I find this strange since the lengthNorm used
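For reference, a sketch of the arithmetic behind those norms. In Lucene's DefaultSimilarity of that era, lengthNorm was 1/sqrt(numTerms), multiplied by any field boost and then encoded into a single byte, which quantizes the stored value coarsely. This is general Lucene background, not a diagnosis of the poster's index; a large boost on a short anchor/title field is one way whole-number-looking fieldNorms can arise.

```python
# Background sketch: Lucene's default length normalization.
# Short fields (one- or two-term anchors or titles) get norms near 1.0
# before any boost is applied.
import math

def length_norm(num_terms):
    # DefaultSimilarity-style: 1 / sqrt(number of terms in the field)
    return 1.0 / math.sqrt(num_terms)

print(length_norm(1), length_norm(4))  # 1.0 0.5
```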
There are probably two settings you'll need to tweak
in nutch-default.xml
http.content.limit -- by default it's 64K; if the page is
larger than that, it essentially truncates the file.
You could be missing lots of links that appear later in
the page.
max.outlinks.per.page -- by default
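A sketch of the corresponding entries, using the standard nutch property format (the values below are illustrative, not the shipped defaults; check your own nutch-default.xml, and put overrides in nutch-site.xml rather than editing the defaults file):

```xml
<!-- Illustrative values only; property names come from the message above. -->
<property>
  <name>http.content.limit</name>
  <!-- bytes to fetch per page; -1 is commonly used to mean "no limit" -->
  <value>-1</value>
</property>
<property>
  <name>max.outlinks.per.page</name>
  <!-- raise this if long pages have more links than the default allows -->
  <value>500</value>
</property>
```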
It works for me and I'm on Cygwin.
Howie
I'm getting expr: syntax error when running all bin/nutch commands. It
comes from this line:
if expr match `uname` 'CYGWIN*' > /dev/null; then
should this be modified to be this instead:
if expr `uname` : 'CYGWIN*' > /dev/null; then
That