Re: Any advice for a searchable web archiver ?
* Marc Chantreux [2017-11-20T14:42:24]
> > We're using Xapian as part of Cyrus IMAP, and it's quite useful for
> > what we're doing,
>
> do you think this should be enough to store mailing lists archives?

Based on personal experience, yes.

-- 
rjbs
Re: Any advice for a searchable web archiver ?
On Tue, Nov 21, 2017 at 01:17:17AM +, Eric Wong wrote:
> Which parts of the INSTALL, HACKING, or README were unclear?

i just missed the good url to clone from README:

  git clone git://repo.or.cz/public-inbox

thank you
Re: Any advice for a searchable web archiver ?
Marc Chantreux wrote:
> hello,
>
> > public-inbox is Perl, uses Email::MIME, and (optionally) uses
> > Xapian like notmuch. The Perl bits around search indexing are
> > ported to Perl from what I understood of the C++ code in notmuch.
>
> it seems you explored very interesting concepts. however i didn't
> understand how to install/test/hack on it.

Which parts of the INSTALL, HACKING, or README were unclear?

  https://public-inbox.org/INSTALL
  https://public-inbox.org/HACKING
  https://public-inbox.org/README

I can try to clarify (sorry, not online much due to upcoming holidays).

It uses MakeMaker like most Perl modules, but I haven't made
a tarball release yet, so you can just:

  git clone https://public-inbox.org/ public-inbox

to grab the source.
Re: Any advice for a searchable web archiver ?
On Sun, Nov 19, 2017 at 08:15:15PM -0600, Peter Karman wrote:
> I like to think that Dezi, like Lucy, is stable rather than inactive, but
> your point is taken. :)

thank you for this point! this is very important for me to know we can
test a stable product.

> You could use Dezi::App with something like
> https://metacpan.org/pod/Dezi::Aggregator::MailFS to index email messages.
> Then serve the index(es) with the Dezi server. That's definitely a known use
> case.

thank you very much.

marc
Re: Any advice for a searchable web archiver ?
On Mon, 20 Nov 2017, at 07:52, Marc Chantreux wrote:
> Hello,

Hi Marc

> As the sympa community (http://www.sympa.org) recently grown, we are
> thinking about revamping the whole UI and we would like to have
> a new web archiver based on:
>
> * no default frontend but exposing the API through REST, websockets or
>   whatever.
> * maximizing the interactions between Sympa and CPAN
> * trying to avoid other dynamic langage or jvm dependency
>   (or considering them as temporary solutions)
> * being JMAP friendly (we bet on it to become a very healthy
>   community)

I'm glad to see that you're interested in JMAP :) We're also betting
very heavily on it at FastMail, as I'm sure you're aware!

> My first idea was to use notmuch, PEP modules and Dancer on top of
> maildirs then i discover Dezi (inactive since 2015) and the use of
> Lucy (also used by the very active librecat project).
>
> I know Dezi is a general search engine but i hope that taking care of
> a good email support for it than reinvent the wheel.
>
> Those are lot of things to look for if i want to have a clear opinion
> on a good strategy. Any advice would be warmly welcome.

We're using Xapian as part of Cyrus IMAP, and it's quite useful for
what we're doing, though I'm sure any search engine would be fine.

There are some pitfalls to look out for. For example, if you naively
index everything, a search for "references" is going to return quite a
lot of messages.

Another problem with naive indexing is that Maildir allows message file
names to move as flags are added/removed, and you'll want your indexer
to avoid reindexing them every time. I expect you might already have a
data structure that handles that, though.

In terms of search usefulness, most of our customers love the stemming
support, but it does have some exciting issues around languages and
diacritics, and an inability to match on anything other than word
prefixes, so you can't match partial strings inside a word. That may or
may not be an issue for your use case.

I don't know Dezi or Lucy, so I don't have a strong opinion there.

Regards,

Bron.

-- 
  Bron Gondwana
  br...@fastmail.fm
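
The two indexing pitfalls above can be illustrated with a small sketch. It is written in Python purely for illustration and is not how Cyrus IMAP, Xapian, or Sympa handle this. It assumes the usual Maildir naming convention, where a file is called "<unique>:2,<flags>" and only the part after ":2," changes when flags change. It also assumes the "references" problem comes from indexing header names and threading headers as if they were body text. The names stable_key, needs_indexing, indexable_text and INDEXED_HEADERS are hypothetical.

```python
from email import message_from_string

# Headers treated as searchable text. This selection is an assumption made
# for the sketch; threading headers such as References are left out.
INDEXED_HEADERS = ("subject", "from", "to", "cc")


def stable_key(filename):
    """Return the part of a Maildir file name that survives flag changes.

    Everything from the first ":2," onwards is the flag suffix; a name
    without that suffix (as in new/) is returned unchanged.
    """
    unique, _sep, _flags = filename.partition(":2,")
    return unique


def needs_indexing(filenames, already_indexed):
    """Return the file names whose stable key has not been indexed yet."""
    return [name for name in filenames
            if stable_key(name) not in already_indexed]


def indexable_text(raw_message):
    """Collect selected header values and the body of a simple message.

    Header names are never included, so a search for "references" only
    matches messages that actually contain that word.
    """
    msg = message_from_string(raw_message)
    parts = [msg[name] for name in INDEXED_HEADERS if msg[name]]
    body = msg.get_payload()
    if isinstance(body, str):  # multipart messages are skipped in this sketch
        parts.append(body)
    return "\n".join(parts)
```

An indexer built this way would record stable_key(name) for every message it has processed, so a file renamed from "...:2," to "...:2,S" when it is marked as seen is not picked up again by needs_indexing.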
Re: Any advice for a searchable web archiver ?
Marc Chantreux wrote:
> Hello,
>
> As the sympa community (http://www.sympa.org) recently grown, we are
> thinking about revamping the whole UI and we would like to have
> a new web archiver based on:
>
> * no default frontend but exposing the API through REST, websockets or
>   whatever.
> * maximizing the interactions between Sympa and CPAN
> * trying to avoid other dynamic langage or jvm dependency
>   (or considering them as temporary solutions)
> * being JMAP friendly (we bet on it to become a very healthy community)
>
> My first idea was to use notmuch, PEP modules and Dancer on top of
> maildirs then i discover Dezi (inactive since 2015) and the use of
> Lucy (also used by the very active librecat project).
>
> I know Dezi is a general search engine but i hope that taking care of
> a good email support for it than reinvent the wheel.

public-inbox is Perl, uses Email::MIME, and (optionally) uses
Xapian like notmuch. The Perl bits around search indexing are
ported to Perl from what I understood of the C++ code in notmuch.

The web part is PSGI and I consider the URL format a stable API:

  https://public-inbox.org/design_www.txt

I will probably add JSON support to it for external web
services; haven't looked into JMAP, yet...

There's also a standalone NNTP server based on Danga::Socket.

You can find an example of it for the git mailing list:

  https://public-inbox.org/git/

> Those are lot of things to look for if i want to have a clear opinion
> on a good strategy. Any advice would be warmly welcome.

It's probably not a perfect match for you guys, but it's all
AGPL-3+. The whole thing (code AND data) is designed to be
completely replicatable and forkable using git, so anybody can
clone any instance in its entirety.