Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> 100k regexps is still a lot, so I'm not totally sure it would be much
>> faster, but perhaps worth checking. I have worked with this type of
>> technology before (minimized, determinized FSAs, constructed from large
>> sets of strings & expressions) and it should be very fast to
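The FSA idea above can be sketched in miniature: merge all accepted strings into one trie (a simple acyclic automaton), so a lookup costs O(length of the input) no matter whether 100 or 100,000 entries were added, instead of looping over every pattern. This is an illustrative standalone sketch, not the actual automaton library discussed in the thread; the class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy deterministic automaton over exact strings, built as a trie. */
public class HostTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean accepting;   // true if some added string ends here
    }

    private final Node root = new Node();

    /** Add one accepted string; cost is O(length), independent of set size. */
    public void add(String host) {
        Node n = root;
        for (char c : host.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.accepting = true;
    }

    /** Single pass over the input; no per-pattern loop. */
    public boolean accepts(String host) {
        Node n = root;
        for (char c : host.toCharArray()) {
            n = n.next.get(c);
            if (n == null) return false;   // no transition: reject early
        }
        return n.accepting;
    }

    public static void main(String[] args) {
        HostTrie trie = new HostTrie();
        trie.add("www.apache.org");
        trie.add("lucene.apache.org");
        System.out.println(trie.accepts("www.apache.org"));   // true
        System.out.println(trie.accepts("www.example.com"));  // false
    }
}
```

A real determinized FSA additionally shares common suffixes and handles regular expressions, but the lookup-cost argument is the same.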
Gal Nitzan wrote:
> Hi Andrzej,
> Yes, it seems like a good option. However, it is GPL, and I noticed in
> one of the posts that this license is no good for apache.org :).
> Regards,
> Gal
If you refer to the brics automaton library, it's BSD-licensed. I
mentioned in one of the posts that the Innovation httpclient
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi,
>> Well, the reason for this plugin is that I wish to crawl many sites,
>> but they all must be in my list. If it was implemented with regular
>> expressions, the filter would still have to loop over 100K expressions
>> on each URL for a match, right?
>> Regards,
>> Gal
> No, that's the whole point - using t
Hi Otis,
I have only a few thousand URLs in my db at the moment. However, for
100K it should be about 600-800KB. I do not cache the URL itself, only a
hash string. So the next time a URL is searched in the cache, if the hash
exists then it is allowed.
Regards,
Gal
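Gal's scheme can be sketched as follows: store a fixed-size hash per allowed URL rather than the URL itself, so 100k entries stay in the hundreds-of-KB range and a lookup is one set-membership check. This is a hypothetical reconstruction; the class name, the choice of MD5, and the methods are all assumptions, not the plugin's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Whitelist cache keyed by URL hash instead of by full URL. */
public class UrlHashCache {
    private final Set<String> hashes = new HashSet<>();

    /** Hex MD5 of the URL: always 32 chars, regardless of URL length. */
    static String hashOf(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // MD5 is always available
        }
    }

    public void allow(String url) {
        hashes.add(hashOf(url));
    }

    /** The URL passes the filter iff its hash is in the cache. */
    public boolean isAllowed(String url) {
        return hashes.contains(hashOf(url));
    }
}
```

At 32 hex chars plus set overhead per entry, 100k entries land roughly in the size range Gal quotes.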
[EMAIL PROTECTED] wrote:
> Hi Gal,
> I'm curious about the memory consumption of the cache and the speed of
> retrieval of an item from the cache, when the cache has 100k domains in
> it.
Slightly off-topic, but I hope this is relevant to the original reason
for creating this plugin...
There is a B
Hi Gal,
I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.
Thanks,
Otis
--- Gal Nitzan <[EMAIL PROTECTED]> wrote:
> Hi Michael,
> At the moment I have about 3000 domains in my db. I didn't time the
> performance; however, having even 100k domains shouldn't have an
> impact, since the list is fetched only once from the database into the
> cache. A small performance hit may appear above 100k (depends on the
> number of elements defined
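The "fetched only once" behaviour Gal describes can be sketched as a lazily-loaded in-memory cache: the first lookup triggers a single database read, and every later check is an O(1) hash lookup, so the domain count barely matters at filter time. The database call is simulated here; none of these names come from the plugin.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Domain whitelist loaded from the database once, then kept in memory. */
public class DomainCache {
    private Set<String> domains;   // null until the first lookup

    /** Stand-in for the real database query (assumption, not Nutch code). */
    private List<String> queryDatabase() {
        return List.of("apache.org", "example.com");
    }

    /** Loads on first use only: one DB round-trip for the process lifetime. */
    private synchronized Set<String> cache() {
        if (domains == null) {
            domains = new HashSet<>(queryDatabase());
        }
        return domains;
    }

    /** Constant-time membership test against the cached set. */
    public boolean contains(String domain) {
        return cache().contains(domain);
    }
}
```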
hi,
How is performance affected if the size of the domain list
reaches 10,000?
Micheal Ji,
--- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
> Gal Nitzan updated NUTCH-100:
> -----------------------------
> type
[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
Gal Nitzan updated NUTCH-100:
-----------------------------
type: Improvement (was: New Feature)
Description:
Hi,
I have written a new plugin, based on the URLFilter interface: urlfilter-db.
The purpose of this p
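For context on the interface the plugin builds on: Nutch URL filter plugins implement a filter(String) method that returns the URL to keep it or null to reject it. Below is a minimal standalone sketch of a whitelist filter in that shape; it only mirrors the interface rather than implementing org.apache.nutch.net.URLFilter, and the host-extraction logic and names are illustrative assumptions, not the urlfilter-db code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;

/** Whitelist filter mirroring the URLFilter shape: URL in, URL-or-null out. */
public class WhitelistUrlFilter {
    private final Set<String> allowedHosts;

    public WhitelistUrlFilter(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    /** Returns the URL if its host is whitelisted, null otherwise. */
    public String filter(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            return (host != null && allowedHosts.contains(host))
                    ? urlString : null;
        } catch (URISyntaxException e) {
            return null;   // unparsable URLs are rejected
        }
    }
}
```

The null-means-rejected convention is what lets Nutch chain several filters: the first one to return null drops the URL.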
[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
Gal Nitzan updated NUTCH-100:
-----------------------------
Attachment: urlfilter-db.tar.gz
Added DbURLFilter.patch.
Fixed an issue with the swarm cache (removed loading as daemon).
Code cleanup and remarks.
Added so
[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
Gal Nitzan updated NUTCH-100:
-----------------------------
Attachment: urlfilter-db.tar.gz
The plugin. Extract it, and read the README in the myplugin folder.
> New plugin urlfilter-db
> -----------------------
>
> Key: NUTCH-10