Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> 100k regexps is still a lot, so I'm not totally sure it would be much
>> faster, but perhaps worth checking. I have worked with this type of
>> technology before (minimized, determinized FSAs, constructed from large
>> sets of strings & expressions) and it should be very fast to
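The FSA idea above can be sketched in miniature: merge all accepted strings into one trie (a simple acyclic automaton), so a lookup costs O(length of the input) no matter whether 100 or 100,000 entries were added, instead of looping over every pattern. This is an illustrative standalone sketch, not the actual automaton library discussed in the thread; the class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy deterministic automaton over exact strings, built as a trie. */
public class HostTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean accepting;   // true if some added string ends here
    }

    private final Node root = new Node();

    /** Add one accepted string; cost is O(length), independent of set size. */
    public void add(String host) {
        Node n = root;
        for (char c : host.toCharArray()) {
            n = n.next.computeIfAbsent(c, k -> new Node());
        }
        n.accepting = true;
    }

    /** Single pass over the input; no per-pattern loop. */
    public boolean accepts(String host) {
        Node n = root;
        for (char c : host.toCharArray()) {
            n = n.next.get(c);
            if (n == null) return false;   // no transition: reject early
        }
        return n.accepting;
    }

    public static void main(String[] args) {
        HostTrie trie = new HostTrie();
        trie.add("www.apache.org");
        trie.add("lucene.apache.org");
        System.out.println(trie.accepts("www.apache.org"));   // true
        System.out.println(trie.accepts("www.example.com"));  // false
    }
}
```

A real determinized FSA additionally shares common suffixes and handles regular expressions, but the lookup-cost argument is the same.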
Gal Nitzan wrote:
> Hi Andrzej,
> Yes, it seems like a good option. However, it is GPL, and I noticed in
> one of the posts that this license is no good for apache.org :).
> Regards,
> Gal
If you refer to the brics automaton library, it's BSD-licensed. I
mentioned in one of the posts that the Innovation httpclient
Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi,
>> Well, the reason for this plugin is that I wish to crawl many sites,
>> but they all must be in my list. If it was implemented with regular
>> expressions, the filter would still have to loop over 100K expressions
>> on each URL for a match, right?
>> Regards,
>> Gal
> No, that's the whole point - using t
Hi Otis,
I have only a few thousand URLs in my db at the moment. However, for
100K it should be about 600-800KB. I do not cache the URL itself, only a
hash string. So the next time a URL is searched in the cache, if the hash
exists then it is allowed.
Regards,
Gal
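Gal's scheme can be sketched as follows: store a fixed-size hash per allowed URL rather than the URL itself, so 100k entries stay in the hundreds-of-KB range and a lookup is one set-membership check. This is a hypothetical reconstruction; the class name, the choice of MD5, and the methods are all assumptions, not the plugin's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Whitelist cache keyed by URL hash instead of by full URL. */
public class UrlHashCache {
    private final Set<String> hashes = new HashSet<>();

    /** Hex MD5 of the URL: always 32 chars, regardless of URL length. */
    static String hashOf(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);   // MD5 is always available
        }
    }

    public void allow(String url) {
        hashes.add(hashOf(url));
    }

    /** The URL passes the filter iff its hash is in the cache. */
    public boolean isAllowed(String url) {
        return hashes.contains(hashOf(url));
    }
}
```

At 32 hex chars plus set overhead per entry, 100k entries land roughly in the size range Gal quotes.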
[EMAIL PROTECTED] wrote:
> Hi Gal,
> I'm curious about the memory consumption of the cache and the speed of
> retrieval of an item from the cache, when the cache has 100k domains in
> it.
Slightly off-topic, but I hope this is relevant to the original reason
for creating this plugin...
There is a B
Hi Gal,
I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.
Thanks,
Otis
--- Gal Nitzan <[EMAIL PROTECTED]> wrote:
> Hi Michael,
> At the moment I have about 3000 domains in my db. I didn't time the
> performance; however, having even 100k domains shouldn't have an
> impact, since the list is fetched only once from the database into the
> cache. A small performance hit may appear above 100k (depends on the
> number of elements defined
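The "fetched only once" behaviour Gal describes can be sketched as a lazily-loaded in-memory cache: the first lookup triggers a single database read, and every later check is an O(1) hash lookup, so the domain count barely matters at filter time. The database call is simulated here; none of these names come from the plugin.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Domain whitelist loaded from the database once, then kept in memory. */
public class DomainCache {
    private Set<String> domains;   // null until the first lookup

    /** Stand-in for the real database query (assumption, not Nutch code). */
    private List<String> queryDatabase() {
        return List.of("apache.org", "example.com");
    }

    /** Loads on first use only: one DB round-trip for the process lifetime. */
    private synchronized Set<String> cache() {
        if (domains == null) {
            domains = new HashSet<>(queryDatabase());
        }
        return domains;
    }

    /** Constant-time membership test against the cached set. */
    public boolean contains(String domain) {
        return cache().contains(domain);
    }
}
```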
hi,
How is performance affected if the size of the domain list
reaches 10,000?
Micheal Ji,
--- "Gal Nitzan (JIRA)" <[EMAIL PROTECTED]> wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
> Gal Nitzan updated NUTCH-100:
> -----------------------------
> type
[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
Gal Nitzan updated NUTCH-100:
-----------------------------
type: Improvement (was: New Feature)
Description:
Hi,
I have written a new plugin, based on the URLFilter interface: urlfilter-db.
The purpose of this p
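For context on the interface the plugin builds on: Nutch URL filter plugins implement a filter(String) method that returns the URL to keep it or null to reject it. Below is a minimal standalone sketch of a whitelist filter in that shape; it only mirrors the interface rather than implementing org.apache.nutch.net.URLFilter, and the host-extraction logic and names are illustrative assumptions, not the urlfilter-db code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;

/** Whitelist filter mirroring the URLFilter shape: URL in, URL-or-null out. */
public class WhitelistUrlFilter {
    private final Set<String> allowedHosts;

    public WhitelistUrlFilter(Set<String> allowedHosts) {
        this.allowedHosts = allowedHosts;
    }

    /** Returns the URL if its host is whitelisted, null otherwise. */
    public String filter(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            return (host != null && allowedHosts.contains(host))
                    ? urlString : null;
        } catch (URISyntaxException e) {
            return null;   // unparsable URLs are rejected
        }
    }
}
```

The null-means-rejected convention is what lets Nutch chain several filters: the first one to return null drops the URL.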
[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
Gal Nitzan updated NUTCH-100:
-----------------------------
Attachment: urlfilter-db.tar.gz
Added DbURLFilter.patch.
Fixed an issue with the swarm cache (removed loading as daemon).
Code cleanup and remarks.
Added so
[ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
Gal Nitzan updated NUTCH-100:
-----------------------------
Attachment: urlfilter-db.tar.gz
The plugin. Extract it, and read the README in the myplugin folder.
> New plugin urlfilter-db
> -----------------------
>
> Key: NUTCH-10