Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Gal Nitzan wrote:
> Hi Andrzej,
>
> Yes, it seems like a good option. However, it is GPL, and I noticed in
> one of the posts that this license is no good for apache.org :).

If you are referring to the BRICS automaton library, it's BSD-licensed. I mentioned in one of the posts that the Innovation HTTPClient is LGPL, and hence not acceptable for apache.org.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Andrzej Bialecki wrote:
> 100k regexps is still a lot, so I'm not totally sure it would be much
> faster, but perhaps worth checking.

I have worked with this type of technology before (minimized, determinized FSAs, constructed from large sets of string expressions), and lookups should be very fast, even in large, complex FSAs. Construction of the FSA can be time-consuming and should probably be done offline, not at fetcher startup time, so that it is performed only once for a number of fetcher runs.

Doug
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> 100k regexps is still a lot, so I'm not totally sure it would be much
>> faster, but perhaps worth checking.
>
> I have worked with this type of technology before (minimized,
> determinized FSAs, constructed from large sets of string expressions),
> and lookups should be very fast, even in large, complex FSAs.
> Construction of the FSA can be time-consuming and should probably be
> done offline, not at fetcher startup time, so that it is performed only
> once for a number of fetcher runs.

Guess what... this library supports (de)serialization of automata, so they can be compiled once and then just stored/loaded.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
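To make the "compile once offline, then store/load" idea concrete, here is a minimal sketch using only the Java standard library. It is not the BRICS automaton library and does not use its API: the class `HostAutomatonSketch`, its `compile`, `accepts`, `store` and `load` methods, and the choice of a plain character trie over exact host names are all assumptions made for illustration. A trie is a simple deterministic automaton (not minimized), which is enough to show that a lookup costs one transition per input character regardless of how many entries were compiled in.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a compiled automaton: a trie over exact host names. */
class HostAutomatonSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    /** One automaton state: outgoing transitions plus an accept flag. */
    private static final class State implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<Character, State> next = new HashMap<>();
        boolean accepting;
    }

    private final State start = new State();

    /** "Compile" step: meant to run once, offline, over the whole host list. */
    static HostAutomatonSketch compile(List<String> hosts) {
        HostAutomatonSketch automaton = new HostAutomatonSketch();
        for (String host : hosts) {
            State state = automaton.start;
            for (char c : host.toLowerCase(Locale.ROOT).toCharArray()) {
                state = state.next.computeIfAbsent(c, k -> new State());
            }
            state.accepting = true;
        }
        return automaton;
    }

    /** Lookup: follows one transition per character of the input. */
    boolean accepts(String host) {
        State state = start;
        for (char c : host.toLowerCase(Locale.ROOT).toCharArray()) {
            state = state.next.get(c);
            if (state == null) {
                return false; // no transition: the host was never compiled in
            }
        }
        return state.accepting;
    }

    /** Serializes the compiled structure so later runs can skip compilation. */
    byte[] store() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(this);
        }
        return bytes.toByteArray();
    }

    /** Restores a previously stored structure. */
    static HostAutomatonSketch load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (HostAutomatonSketch) in.readObject();
        }
    }
}
```

In a real setup the bytes from `store()` would presumably be written to a file by an offline job and read back at fetcher startup; a byte array is used here only to keep the sketch self-contained.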
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Hi Gal,

I'm curious about the memory consumption of the cache and the speed of retrieval of an item from the cache when the cache has 100k domains in it.

Thanks,
Otis

--- Gal Nitzan [EMAIL PROTECTED] wrote:
> Hi Michael,
>
> At the moment I have about 3000 domains in my db. I didn't time the
> performance; however, even 100k domains shouldn't have an impact, since
> each is fetched only once from the database into the cache. A small
> performance hit should appear above 100k (depending on the number of
> elements defined in the XML file).
>
> After a few teething problems, the plugin works nicely and I do not
> notice any impact.
>
> Regards,
> Gal
>
> Michael Ji wrote:
>> hi,
>>
>> How is the performance if the size of the domain list reaches 10,000?
>>
>> Michael Ji
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
[EMAIL PROTECTED] wrote:
> Hi Gal,
>
> I'm curious about the memory consumption of the cache and the speed of
> retrieval of an item from the cache when the cache has 100k domains in
> it.

Slightly off-topic, but I hope this is relevant to the original reason for creating this plugin...

There is a BSD-licensed library, based on finite automata, that implements a large subset of regexps. It is reported to be scalable and very fast (the benchmarks are certainly impressive):

http://www.brics.dk/~amoeller/automaton/

I suggest running some tests with 100k regexps to see if it survives.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db
Hi Michael,

At the moment I have about 3000 domains in my db. I didn't time the performance; however, even 100k domains shouldn't have an impact, since each is fetched only once from the database into the cache. A small performance hit should appear above 100k (depending on the number of elements defined in the XML file).

After a few teething problems, the plugin works nicely and I do not notice any impact.

Regards,
Gal

Michael Ji wrote:
> hi,
>
> How is the performance if the size of the domain list reaches 10,000?
>
> Michael Ji
>
> --- Gal Nitzan (JIRA) [EMAIL PROTECTED] wrote:
>> [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]
>>
>> Gal Nitzan updated NUTCH-100:
>>
>>     type: Improvement (was: New Feature)
>>     Description: first sentence now reads "Hi, I have written a new
>>         plugin, based on the URLFilter interface: urlfilter-db."
>>         (was: "Hi, I have written (not much) a new plugin, ...");
>>         the rest is unchanged, full text below
>>     Environment: All Nutch versions (was: MapRed)
>>
>> Fixed some issues
>> clean up
>> Added a patch for Subversion
>>
>> New plugin urlfilter-db
>> -----------------------
>>
>>     Key: NUTCH-100
>>     URL: http://issues.apache.org/jira/browse/NUTCH-100
>>     Project: Nutch
>>     Type: Improvement
>>     Components: fetcher
>>     Versions: 0.8-dev
>>     Environment: All Nutch versions
>>     Reporter: Gal Nitzan
>>     Priority: Trivial
>>     Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
>>
>> Hi,
>>
>> I have written a new plugin, based on the URLFilter interface:
>> urlfilter-db.
>>
>> The purpose of this plugin is to filter domains, i.e. I would like to
>> crawl the world but to fetch only certain domains.
>>
>> The plugin uses a caching system (SwarmCache, easier to deploy than
>> JCS) and on the back-end a database.
>>
>> For each url
>>     filter is called
>> end for
>>
>> filter
>>     get the domain name from url
>>     call cache.get domain
>>     if not in cache try the database
>>     if in database cache it and return it
>>     return null
>> end filter
>>
>> The plugin reads the cache size, jdbc driver, connection string,
>> table to use and domain field from nutch-site.xml
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> If you think it was sent incorrectly contact one of the administrators:
>>    http://issues.apache.org/jira/secure/Administrators.jspa
>> -
>> For more information on JIRA, see:
>>    http://www.atlassian.com/software/jira
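The filter pseudocode in the issue description could be sketched in Java roughly as follows. This is not the code from the attached patch: the `DomainStore` interface is a hypothetical stand-in for the JDBC back-end, an access-ordered `LinkedHashMap` stands in for SwarmCache, and treating the URL's host as the "domain" is an assumption. The `filter` method mirrors the shape of Nutch's `URLFilter.filter(String)`, which returns the URL if it passes and `null` otherwise, but the class does not implement that interface so that it compiles on its own.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for the database back-end (the plugin itself uses JDBC). */
interface DomainStore {
    boolean contains(String domain);
}

/** Sketch of the cache-then-database lookup described in the pseudocode. */
class DbUrlFilterSketch {
    private final DomainStore store;
    private final Map<String, Boolean> cache;

    DbUrlFilterSketch(DomainStore store, final int cacheSize) {
        this.store = store;
        // Access-ordered LinkedHashMap acting as a small LRU cache
        // (an assumption; the plugin uses SwarmCache).
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL unchanged if its domain is allowed, otherwise null. */
    String filter(String url) {
        String domain = hostOf(url);          // get the domain name from url
        if (domain == null) {
            return null;
        }
        if (cache.get(domain) != null) {      // call cache.get domain
            return url;
        }
        if (store.contains(domain)) {         // if not in cache try the database
            cache.put(domain, Boolean.TRUE);  // if in database cache it and return it
            return url;
        }
        return null;                          // return null
    }

    /** Extracts the lower-cased host, or null if the URL cannot be parsed. */
    private static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

As in the pseudocode, only domains found in the database are cached, so a domain that is not in the list goes to the back-end on every call. Whether that matters for a crawl that sees many foreign domains is exactly the kind of question the performance discussion in this thread is about.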