Gabriele

What you are describing could be done with Nutch 2.0 by adding a SOLR
backend to GORA. SOLR would then store the webtable and, provided that
you set up the schema accordingly, you could index the appropriate fields
for searching. I think there were plans to add SOLR as a GORA backend.
IMHO Nutch 2.0 would be a more natural fit for what you are describing
than 1.x.
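
As a rough illustration of the idea only: a minimal sketch of a single
per-URL record that holds both the crawl state mentioned in the quoted
message (URL, fetch status, fetch date) and the text to be searched. The
class, field names, and 30-day default are hypothetical and are not the
actual Nutch, GORA, or SOLR schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WebTableRow:
    """Hypothetical per-URL record; names are illustrative, not a real schema."""
    url: str              # unique key, stored once for crawl state and search
    fetch_status: str     # e.g. "fetched" or "unfetched"
    fetch_time: datetime  # when the page was last fetched
    content: str = ""     # text that would be indexed for searching


def due_for_recrawl(row, now, interval=timedelta(days=30)):
    """Return True if the row was never fetched or its re-crawl period has elapsed."""
    if row.fetch_status != "fetched":
        return True
    return now - row.fetch_time >= interval
```

With one record per URL, the same key serves both the re-crawl check and
the search index, so the URL is not kept in two separate stores.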

HTH

Julien



On 16 July 2011 01:00, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:

> Hello,
>
> I had this draft lurking for a while now, and before archiving for personal
> reference I wondered if it's accurate, and if you recommend posting it to
> the wiki.
>
> Nutch maintains a crawldb (and linkdb, for that matter) of the urls it
> crawled, the fetch status, and the date. This data is maintained beyond
> fetch so that pages may be re-crawled after a re-crawling period.
> At the same time Solr maintains an inverted index of all the fetched pages.
> It'd seem more efficient if Nutch relied on the index instead of
> maintaining its own crawldb, to avoid storing the same url twice.
> [BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN
> SOLR]
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
