Gabriele,

What you are describing could be done with Nutch 2.0 by adding a Solr backend to Gora. Solr would be used to store the webtable and, provided that you set up the schema accordingly, you could index the appropriate fields for searching. I think there were plans to add Solr as a Gora backend. IMHO Nutch 2.0 would be a more natural fit for what you are describing than 1.x.
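To make the idea a bit more concrete, here is a rough in-memory sketch in Python. It is not Nutch, Gora or Solr code: `Row`, `WebTable`, `due_for_refetch` and `search` are hypothetical names made up for illustration, and the dictionary merely stands in for a single store that holds both the crawl state and the searchable content, keyed by URL.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Row:
    """One webtable row: crawl state and page content side by side (hypothetical layout)."""
    url: str
    status: str            # e.g. "fetched" or "unfetched"
    fetch_time: datetime   # when the page was last fetched
    content: str = ""      # page text, the part that would be indexed for search


class WebTable:
    """In-memory stand-in for a single store serving both the crawler and search."""

    def __init__(self):
        # Keyed by URL, so each URL is stored exactly once.
        self._rows = {}

    def put(self, row):
        self._rows[row.url] = row

    def due_for_refetch(self, now, interval):
        """URLs whose last fetch is at least `interval` old (the crawler's view)."""
        return sorted(
            r.url for r in self._rows.values()
            if now - r.fetch_time >= interval
        )

    def search(self, term):
        """URLs of fetched pages whose content contains `term` (the search view)."""
        term = term.lower()
        return sorted(
            r.url for r in self._rows.values()
            if r.status == "fetched" and term in r.content.lower().split()
        )


# Example data; the URLs and dates are made up.
now = datetime(2011, 7, 16)
table = WebTable()
table.put(Row("http://example.org/a", "fetched", now - timedelta(days=40), "nutch crawler notes"))
table.put(Row("http://example.org/b", "fetched", now - timedelta(days=1), "solr schema notes"))

stale = table.due_for_refetch(now, timedelta(days=30))
hits = table.search("solr")
```

The point of the sketch is only that the re-crawl query and the search query run against the same rows, so nothing is kept twice. How this would map onto an actual Solr schema (which fields are stored, which are indexed) is left open here.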
HTH

Julien

On 16 July 2011 01:00, Gabriele Kahlout <gabri...@mysimpatico.com> wrote:

> Hello,
>
> I had this draft lurking for a while now, and before archiving it for personal
> reference I wondered if it's accurate, and if you recommend posting it to
> the wiki.
>
> Nutch maintains a crawldb (and linkdb, for that matter) of the urls it
> crawled, the fetch status, and the date. This data is maintained beyond
> fetch so that pages may be re-crawled after a re-crawling period.
> At the same time Solr maintains an inverted index of all the fetched pages.
> It'd seem more efficient if nutch relied on the index instead of
> maintaining its own crawldb, to avoid storing the same url twice.
> [BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN
> SOLR]
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com