Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki

Gal Nitzan wrote:

Hi Andrzej,

Yes, it seems like a good option. However, it is GPL, and I noticed in 
one of the posts that this license is no good for apache.org :).


If you are referring to the brics automaton library, it's BSD-licensed.  I 
mentioned in one of the posts that the Innovation HTTPClient is LGPL, 
and hence not acceptable for apache.org.



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Doug Cutting

Andrzej Bialecki wrote:
100k regexps is still a lot, so I'm not totally sure it would be much 
faster, but it is perhaps worth checking.


I have worked with this type of technology before (minimized, 
determinized FSAs constructed from large sets of strings and expressions), 
and lookups should be very fast, even in large, complex FSAs.  
Constructing the FSA can be time-consuming and should probably be done 
offline, not at fetcher startup time, so that it is performed only once 
for a number of fetcher runs.


Doug


Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-12 Thread Andrzej Bialecki

Doug Cutting wrote:

Andrzej Bialecki wrote:

100k regexps is still a lot, so I'm not totally sure it would be much 
faster, but it is perhaps worth checking.



I have worked with this type of technology before (minimized, 
determinized FSAs constructed from large sets of strings and expressions), 
and lookups should be very fast, even in large, complex FSAs.  
Constructing the FSA can be time-consuming and should probably be done 
offline, not at fetcher startup time, so that it is performed only once 
for a number of fetcher runs.


Guess what... this library supports (de)serialization of automata, so 
they can be compiled once, and then just stored/loaded.
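
The compile-once, store, and reload workflow described above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the automaton library's own API, and `StringAutomatonSketch`, `compile`, `toBytes`, and `fromBytes` are hypothetical names. A simple trie (a deterministic but not minimized automaton over plain strings) stands in for the compiled FSA, and standard Java serialization stands in for whatever storage format the library provides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a compiled automaton: a deterministic trie over strings. */
class StringAutomatonSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    /** One automaton state with its outgoing transitions. */
    private static final class State implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<Character, State> next = new HashMap<>();
        boolean accepting;
    }

    private final State start = new State();

    /** "Compile" step: build the automaton once from the full set of strings. */
    static StringAutomatonSketch compile(Iterable<String> strings) {
        StringAutomatonSketch automaton = new StringAutomatonSketch();
        for (String s : strings) {
            State state = automaton.start;
            for (char c : s.toCharArray()) {
                state = state.next.computeIfAbsent(c, k -> new State());
            }
            state.accepting = true;
        }
        return automaton;
    }

    /** Lookup walks one transition per input character, regardless of how many strings are stored. */
    boolean accepts(String s) {
        State state = start;
        for (int i = 0; i < s.length() && state != null; i++) {
            state = state.next.get(s.charAt(i));
        }
        return state != null && state.accepting;
    }

    /** Serialize the compiled automaton (assumption: plain Java serialization). */
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Load a previously stored automaton without rebuilding it. */
    static StringAutomatonSketch fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (StringAutomatonSketch) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not load automaton", e);
        }
    }

    public static void main(String[] args) {
        // Offline step: compile and store.
        byte[] stored = compile(Arrays.asList("apache.org", "example.org")).toBytes();
        // Fetcher startup: load only.
        StringAutomatonSketch loaded = fromBytes(stored);
        System.out.println(loaded.accepts("apache.org")); // true
        System.out.println(loaded.accepts("apache.com")); // false
    }
}
```

In this sketch, `compile` and `toBytes` would run in an offline step, and each fetcher run would only call `fromBytes` followed by `accepts`.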


--
Best regards,
Andrzej Bialecki 



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-10 Thread ogjunk-nutch
Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.

Thanks,
Otis




Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-10 Thread Andrzej Bialecki

[EMAIL PROTECTED] wrote:

Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.


Slightly off-topic, but I hope this is relevant to the original reason 
for creating this plugin...


There is a BSD-licensed library, based on finite automata, that 
implements a large subset of regexps. It is reported to be scalable and 
very fast (the benchmarks are certainly impressive):


http://www.brics.dk/~amoeller/automaton/

I suggest running some tests with 100k regexps to see if it survives.


--
Best regards,
Andrzej Bialecki 



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

2005-10-09 Thread Gal Nitzan

Hi Michael,

At the moment I have about 3000 domains in my db. I haven't timed the 
performance; however, even 100k domains shouldn't have an impact, since 
each domain is fetched only once from the database into the cache. There 
should be a small performance hit above 100k (depending on the number of 
elements defined in the XML file).


After a few teething problems, the plugin works nicely and I don't notice 
any impact.


Regards,

Gal


Michael Ji wrote:

Hi,

How does performance hold up if the domain list
reaches 10,000 entries?

Michael Ji

--- Gal Nitzan (JIRA) [EMAIL PROTECTED] wrote:

     [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

       type: Improvement  (was: New Feature)
Description: 
Hi,


I have written a new plugin, based on the URLFilter
interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

for each url
    call filter(url)
end for

filter(url)
    get the domain name from the url
    look the domain up in the cache; if found, return the url
    if not in the cache, try the database
    if in the database, cache it and return the url
    otherwise return null
end filter


The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml
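
The lookup flow in the description above (cache first, then database, then reject) can be sketched in self-contained Java. This is not the plugin's actual code: `DomainFilterSketch`, `DomainStore`, and `storeLookups` are hypothetical names, an access-ordered `LinkedHashMap` stands in for SwarmCache, and an in-memory set stands in for the JDBC-backed table. The sketch also assumes that "domain" means the URL's host name.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch of a cache-then-database domain filter; all names are hypothetical. */
class DomainFilterSketch {
    /** Stand-in for the database lookup; a real plugin would query the configured table. */
    interface DomainStore {
        boolean contains(String domain);
    }

    private final DomainStore store;
    private final Map<String, Boolean> cache;
    int storeLookups = 0; // how often the slow (database) path was taken

    DomainFilterSketch(DomainStore store, final int cacheSize) {
        this.store = store;
        // Access-ordered map that evicts the least recently used entry (not thread-safe).
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheSize;
            }
        };
    }

    /** Returns the URL if its host is an allowed domain, otherwise null. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable URL: reject
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (cache.get(host) != null) {
            return url; // cache hit: no database access
        }
        storeLookups++;
        if (store.contains(host)) {
            cache.put(host, Boolean.TRUE); // remember allowed domains only
            return url;
        }
        return null; // domain not listed: reject
    }

    public static void main(String[] args) {
        Set<String> allowed = new HashSet<>(Arrays.asList("example.org", "apache.org"));
        DomainFilterSketch filter = new DomainFilterSketch(allowed::contains, 1000);
        System.out.println(filter.filter("http://apache.org/nutch/")); // prints the URL
        System.out.println(filter.filter("http://other.test/page"));   // prints null
    }
}
```

As in the pseudocode, only allowed domains are cached here, so every URL from a rejected domain goes back to the store; caching negative results as well would be a separate design choice.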


  was:
Hi,

I have written (not much) a new plugin, based on the
URLFilter interface: urlfilter-db .

The purpose of this plugin is to filter domains,
i.e. I would like to crawl the world but to fetch
only certain domains.

The plugin uses a caching system (SwarmCache, easier
to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver,
connection string, table to use and domain field
from nutch-site.xml


Environment: All Nutch versions  (was: MapRed)

Fixed some issues
clean up
Added a patch for Subversion



New plugin urlfilter-db
-----------------------

         Key: NUTCH-100
         URL: http://issues.apache.org/jira/browse/NUTCH-100
     Project: Nutch
        Type: Improvement
  Components: fetcher
    Versions: 0.8-dev
 Environment: All Nutch versions
    Reporter: Gal Nitzan
    Priority: Trivial
 Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz









