Re: Any advice for a searchable web archiver ?

2017-11-27 Thread Ricardo Signes
* Marc Chantreux  [2017-11-20T14:42:24]
> > We're using Xapian as part of Cyrus IMAP, and it's quite useful for
> > what we're doing,
> 
> do you think this should be enough to store mailing lists archives?

Based on personal experience, yes.

-- 
rjbs



Re: Any advice for a searchable web archiver ?

2017-11-21 Thread Marc Chantreux
On Tue, Nov 21, 2017 at 01:17:17AM +, Eric Wong wrote:
> Which parts of the INSTALL, HACKING, or README were unclear?

i just missed the good url to clone from README:

git clone git://repo.or.cz/public-inbox

thank you


Re: Any advice for a searchable web archiver ?

2017-11-20 Thread Eric Wong
Marc Chantreux  wrote:
> hello,
> 
> > public-inbox is Perl, uses Email::MIME, and (optionally) uses
> > Xapian like notmuch.  The Perl bits around search indexing is
> > ported to Perl from what I understood of the C++ code in notmuch.
> 
> it seems you explored very interesting concepts. however i didn't
> understand how to install/test/hack on it.

Which parts of the INSTALL, HACKING, or README were unclear?

https://public-inbox.org/INSTALL
https://public-inbox.org/HACKING
https://public-inbox.org/README

I can try to clarify (sorry, not online much due to upcoming holidays)

It uses MakeMaker like most Perl modules, but I haven't made a
tarball release yet, so you can just:

git clone https://public-inbox.org/ public-inbox

to grab the source


Re: Any advice for a searchable web archiver ?

2017-11-20 Thread Marc Chantreux
On Sun, Nov 19, 2017 at 08:15:15PM -0600, Peter Karman wrote:
> I like to think that Dezi, like Lucy, is stable rather than inactive, but
> your point is taken. :)

thank you for this point! this is very important for me to know we can
test a stable product.

> You could use Dezi::App with something like
> https://metacpan.org/pod/Dezi::Aggregator::MailFS to index email messages.
> Then serve the index(es) with the Dezi server. That's definitely a known use
> case.

thank you very much.

marc


Re: Any advice for a searchable web archiver ?

2017-11-20 Thread Bron Gondwana
On Mon, 20 Nov 2017, at 07:52, Marc Chantreux wrote:
> Hello,

Hi Marc

> As the sympa community (http://www.sympa.org) recently grown, we are
> thinking about revamping the whole UI and we would like to have
> a new web archiver based on:
> 
> * no default frontend but exposing the API through REST, websockets or
>   whatever.
> * maximizing the interactions between Sympa and CPAN
> * trying to avoid other dynamic langage or jvm dependency
>   (or considering them as temporary solutions)
> * being JMAP friendly (we bet on it to become a very healthy
>   community)

I'm glad to see that you're interested in JMAP :)  We're also betting
very heavily on it at FastMail as I'm sure you're aware!

> My first idea was to use notmuch, PEP modules and Dancer on top of
> maildirs then i discover Dezi (inactive since 2015) and the use of
> Lucy (also used by the very active librecat project).
> 
> I know Dezi is a general search engine but i hope that taking care of
> a good email support for it than reinvent the wheel.
> 
> Those are lot of things to look for if i want to have a clear opinion
> on a good strategy. Any advice would be warmly welcome.

We're using Xapian as part of Cyrus IMAP, and it's quite useful for
what we're doing, though I'm sure any search engine would be fine.
There are some pitfalls to look out for, for example if you naively
index everything a search for "references" is going to return quite a
lot of messages.
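
A minimal sketch of one way to avoid that pitfall, assuming the indexer is fed only selected header values plus the plain-text body instead of the raw message. The `INDEXED_HEADERS` choice and the `indexable_text` helper are hypothetical and are not taken from Cyrus or Xapian:

```python
from email import message_from_string

# Hypothetical choice: headers whose *values* are worth searching.
# Header names and threading headers (References, In-Reply-To) are skipped.
INDEXED_HEADERS = ("Subject", "From", "To", "Cc")


def indexable_text(raw_message: str) -> str:
    """Return only the text a user would expect a search to match."""
    msg = message_from_string(raw_message)
    parts = [str(msg[name]) for name in INDEXED_HEADERS if msg[name]]
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)


RAW = (
    "From: marc@example.org\n"
    "Subject: searchable web archiver\n"
    "References: <abc@example.org>\n"
    "\n"
    "Any advice would be welcome.\n"
)

text = indexable_text(RAW)
```

With this input, `text` holds the sender, the subject and the body, but neither the word "References" nor the message-id it carries, so a search for "references" would only hit messages that actually talk about references.
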

Another problem with naive indexing is that Maildir allows message file
names to move as flags are added/removed, and you'll want your indexer
to avoid reindexing them every time.  I expect you might already have a
datastructure that handles that though.
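
A minimal sketch of the idea, assuming the usual Maildir naming where flags live in a `:2,<flags>` suffix after the unique name. The helper names and the in-memory set are hypothetical; a real indexer would persist the keys:

```python
import os


def maildir_key(path: str) -> str:
    """Drop the directory and the ':2,<flags>' suffix so the key
    stays the same when a message moves or its flags change."""
    return os.path.basename(path).split(":2,", 1)[0]


def files_to_index(paths, already_indexed):
    """Yield (key, path) pairs for messages not yet in the index."""
    for path in paths:
        key = maildir_key(path)
        if key not in already_indexed:
            already_indexed.add(key)
            yield key, path


seen = set()
first_pass = list(files_to_index(["new/1511.M1.host"], seen))
# The first message was read (flag S) and moved to cur/; a second one arrived.
second_pass = list(
    files_to_index(["cur/1511.M1.host:2,S", "cur/1511.M2.host:2,"], seen)
)
```

Here `second_pass` contains only the second message, because the first one keeps the key `1511.M1.host` after being renamed.
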

In terms of search usefulness, most of our customers love the stemming
support, but it does have some exciting issues around languages and
diacritics and inability to match on anything other than word prefixes -
so you can't match partial strings inside a word.  That may or may not
be an issue for your usecase.
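
To illustrate why prefixes are cheap and partial strings inside a word are not, here is a toy model that treats the index as a sorted list of terms. This is an assumption made for illustration, not a description of how Xapian stores its terms:

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with `prefix`.

    In a sorted list, terms sharing a prefix sit next to each other,
    so one binary search finds where the run begins.
    """
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


TERMS = sorted(["archive", "archiver", "mail", "search", "searchable"])

by_prefix = prefix_matches(TERMS, "archiv")
by_infix = prefix_matches(TERMS, "chive")
```

`by_prefix` finds both "archive" and "archiver", while `by_infix` is empty: terms containing "chive" are scattered through the sorted order, so finding them would mean scanning every term.
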

I don't know Dezi or Lucy, so I don't have a strong opinion there.

Regards,

Bron.

--
  Bron Gondwana
  br...@fastmail.fm




Re: Any advice for a searchable web archiver ?

2017-11-19 Thread Eric Wong
Marc Chantreux  wrote:
> Hello,
> 
> As the sympa community (http://www.sympa.org) recently grown, we are
> thinking about revamping the whole UI and we would like to have
> a new web archiver based on:
> 
> * no default frontend but exposing the API through REST, websockets or
>   whatever.
> * maximizing the interactions between Sympa and CPAN
> * trying to avoid other dynamic langage or jvm dependency
>   (or considering them as temporary solutions)
> * being JMAP friendly (we bet on it to become a very healthy community)
> 
> My first idea was to use notmuch, PEP modules and Dancer on top of
> maildirs then i discover Dezi (inactive since 2015) and the use of
> Lucy (also used by the very active librecat project).
> 
> I know Dezi is a general search engine but i hope that taking care of
> a good email support for it than reinvent the wheel.

public-inbox is Perl, uses Email::MIME, and (optionally) uses
Xapian like notmuch.  The Perl bits around search indexing is
ported to Perl from what I understood of the C++ code in notmuch.

The web part is PSGI and I consider the URL format a stable API:

https://public-inbox.org/design_www.txt

I will probably add JSON support to it for external web services;
haven't looked into JMAP, yet...

There's also a standalone NNTP server based on Danga::Socket.

You can find an example of it for the git mailing list:

https://public-inbox.org/git/

> Those are lot of things to look for if i want to have a clear opinion
> on a good strategy. Any advice would be warmly welcome.

It's probably not a perfect match for you guys, but it's all AGPL-3+.
The whole thing (code AND data) is designed to be completely
replicable and forkable using git, so anybody can clone any instance
in its entirety.