Re: [Haskell-cafe] Handling a large database (of ngrams)

2011-05-23 Thread wren ng thornton

On 5/22/11 8:40 AM, Aleksandar Dimitrov wrote:

>> If you have too much trouble trying to get SRILM to work, there's
>> also the Berkeley LM which is easier to install. I'm not familiar
>> with its inner workings, but it should offer pretty much the same
>> sorts of operations.


> Do you know how BerkeleyLM compares to, say, MongoDB or PostgreSQL for large
> data sets? Maybe this is also the wrong list to ask this kind of question.


Well, BerkeleyLM is specifically for n-gram language modeling; it's not 
a general database. According to the paper I mentioned off-list, the 
entire Google Web1T corpus (approx. 1 trillion word tokens, 4 billion 
n-gram types) can be fit into 10GB of memory, which is far smaller than 
SRILM can manage.


Databases aren't really my area so I couldn't give a good comparison. 
Though for this scale of data you're going to want to use something 
specialized for storing n-grams, rather than a general database. There's 
a lot of redundant structure in n-gram counts and you'll want to take 
advantage of that.



>> For regular projects, that integerization would be enough, but for
>> your task you'll probably want to spend some time tweaking the
>> codes. In particular, you'll probably have enough word types to
>> overflow the space of Int32/Word32 or even Int64/Word64.


Again, according to Pauls & Klein (2011), Google Web1T has 13.5M word 
types, which easily fits into 24 bits. That's for English, so 
morphologically rich languages will be different. I wouldn't expect too 
many problems for German, unless you have a lot of technical text with a 
prodigious number of unique compound nouns. Even then I'd be surprised 
if you went over 2^32 (that'd be reserved for languages like Japanese, 
Hungarian, Inuit,... if even they'd ever get that bad).
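For scale, 2^24 = 16,777,216, comfortably above those 13.5M types. A rough sketch (the function names are mine, not from any package): two 24-bit word IDs even pack together into a single Word64 as a bigram key, with bits to spare.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- A 24-bit word ID leaves room for 16,777,216 types (> 13.5M).
-- Two of them pack into one Word64 bigram key, 16 bits to spare.
packBigram :: Word64 -> Word64 -> Word64
packBigram w1 w2 = (w1 `shiftL` 24) .|. (w2 .&. 0xFFFFFF)

unpackBigram :: Word64 -> (Word64, Word64)
unpackBigram k = (k `shiftR` 24, k .&. 0xFFFFFF)

main :: IO ()
main = print (unpackBigram (packBigram 13500000 42))  -- (13500000,42)
```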


--
Live well,
~wren

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Handling a large database (of ngrams)

2011-05-22 Thread Aleksandar Dimitrov
Hi Wren,

First of all, thanks for your elaborate answer! Your input is very much
appreciated!

On Sat, May 21, 2011 at 10:42:57PM -0400, wren ng thornton wrote:
> I've been working on some n-gram stuff lately, which I'm hoping to
> put up on Hackage sometime this summer (i.e., in a month or so,
> after polishing off a few rough edges).

Unfortunately, my current time constraints don't allow me to wait for your code
:-( However, research on that system might continue well into the summer (if I
can manage to get paid for it…) so I'll keep an eye out for your announcement!

> Unless you really need to get your hands dirty, the first thing you
> should do is check out SRILM and see if that can handle it. Assuming
> SRILM can do it, you'd just need to write some FFI code to call into
> it from Haskell. This is relatively straightforward, and if it
> hasn't been done already it'd be a great contribution to the Haskell
> NLP community. One great thing about this approach is that it's
> pretty easy to have the SRILM server running on a dedicated machine,
> so that the memory requirements of the model don't interfere with
> the memory requirements of your program. (I can give you pointers to
> Java code doing all this, if you'd like.) One big caveat to bear in
> mind is that SRILM is not threadsafe, so that'll get in the way of
> any parallelism or distributed computing on the client side of
> things. Also, SRILM is hell to install.

I don't need distributed computing, but I haven't used SRILM because I'd have to
write a wrapper around it, and because I need to do some custom language-model
work: I want functions that operate on n-gram probabilities directly, and I want
to try out different machine learning methods to compare their merits for my
research. I'll see whether SRILM is flexible enough to do this, though.

> If you have too much trouble trying to get SRILM to work, there's
> also the Berkeley LM which is easier to install. I'm not familiar
> with its inner workings, but it should offer pretty much the same
> sorts of operations.

Do you know how BerkeleyLM compares to, say, MongoDB or PostgreSQL for large
data sets? Maybe this is also the wrong list to ask this kind of question.

> But, with that in mind I can offer a few pointers for doing it in
> Haskell. The first one is interning. Assuming the data is already
> tokenized (or can be made to be), then it should be straightforward
> to write a streaming program that converts every unique token into
> an integer (storing the conversion table somewhere for later use).

This is what I meant by using the flyweight pattern. There's of course also the
possibility of computing a hash of every string, but I don't want to deal with
hash collisions. While they are largely avoidable (using, say, SHA-1), that
would force a multi-byte index, and I don't know whether it would help much,
seeing as the average word length isn't dramatically big.

> It doesn't even need to be complete morphological
> segmentation, just break on the obvious boundaries in order to get
> the average word size down to something reasonable.

I will do my best to avoid having to do this. My current research target is
German, which is richer in morphology than English, but not as much as Turkic,
Slavic or Uralic languages. In case of a strong inflectional language or even an
agglutinating or polysynthetic one, the need for a smarter morphological
analysis would arise. Unfortunately, this would have to be in the form of
language-specific morphological analysers.

Since this is something I would want to do in general anyway, I might write a
morphological analyser for German that breaks words down to case and tense
markings as well as lemmas, but this isn't the focus of my project, so I'll
first try to do without.

> For regular projects, that integerization would be enough, but for
> your task you'll probably want to spend some time tweaking the
> codes. In particular, you'll probably have enough word types to
> overflow the space of Int32/Word32 or even Int64/Word64.

I will, most likely, run into problems with integer space because of the long
tail of word frequencies; not only that, but these ultra-rare words aren't
really going to be of much use to me. I'm debating using a trie for common
words while keeping a backlog of rare words in a MongoDB instance. Rare words
could graduate from there if they occur frequently enough, and get accepted into
the trie for easier access. Everything left over in the long tail would just be
mapped to a "RARE-TAG" token in the n-grams, where TAG is the part-of-speech
tag (since my corpus is POS-tagged). What counts as "rare" and "common" will
have to be subject to experimentation.

A bloom filter might guard the trie, too. Since BFs can tell you if a certain
element is certainly *not* in a collection, I could cut down search operations
on the trie itself for all the rare words. AFAICR there's a chapter in RWH on
building Bloom filters.
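The guard idea above might look roughly like this toy sketch (all names are my own; a real setup would use the bloomfilter package from that RWH chapter, with mutable bit arrays and proper hash functions):

```haskell
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')

-- Toy Bloom filter over an Integer used as a bitmap.
filterBits :: Int
filterBits = 8192

-- A crude seeded multiplicative hash, reduced modulo the bitmap size.
hashWith :: Int -> String -> Int
hashWith seed = foldl' (\h c -> (h * 16777619 + ord c) `mod` filterBits) seed

hashes :: String -> [Int]
hashes s = [hashWith seed s | seed <- [7, 97, 31]]

insertWord :: Integer -> String -> Integer
insertWord bf s = foldl' setBit bf (hashes s)

-- No false negatives: a False here means the word is definitely absent,
-- so the (expensive) trie lookup can be skipped entirely.
memberWord :: Integer -> String -> Bool
memberWord bf s = all (testBit bf) (hashes s)

main :: IO ()
main = do
  let bf = foldl' insertWord 0 ["the", "of", "and"]
  print (memberWord bf "the")             -- True
  print (memberWord bf "ueberkompositum") -- almost certainly False
```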

There are 

[Haskell-cafe] Handling a large database (of ngrams)

2011-05-21 Thread Aleksandar Dimitrov
Hi Haskellers,

I'm unaware of a good method or default way of handling large datasets to
which I need non-sequential (i.e. random) access in Haskell.

My use case is linguistic analysis of a ~30GB corpus — the most basic form of
quantitative analysis here is n-gram-based HMMs, which aren't difficult to pull
off in Haskell.

However, I would need a *large* database of ngrams, possibly into the hundreds
of gigabytes. Such a beast will obviously never fit into memory. Moreover, I'll
need this sort of random access both during training and during testing/operation.

My initial idea (that I'm currently following up on) is to use Haskell's
Berkeley DB wrapper. In this case, I would just stuff all ngrams into one big
table (with their counts in an associated column) and let Berkeley DB do the
heavy lifting.

One simplifying assumption is, of course, using the flyweight pattern: have one
table of unigrams (that would even fit into memory!) associated with indices,
and then just reference those indices in the ngram table. But this is only
going to decrease the size of the db by some factor, not an order of magnitude.

Tries come to mind, too. Not only for the unigram table (where they indeed make
a lot of sense), but also for the ngram table, where they might or might not
help. In this case, the cells of the trie would be word indices pointing into
the unigram table. Is there a way to serialize tries over a finite (but possibly
large) domain to disk and get semi-speedy random access to them?

I'm fully aware that I'll have to wait up to a couple of seconds per lookup,
that's OK. I could hack together a primitive implementation of what I need here,
but maybe there's something better out there already? Hackage didn't speak to
me[1], and #haskell was busy discussing something else, so I hope -cafe can 
help me :-)

Thanks for any hints and pointers!
Aleks

[1] There's the ngram package on Hackage, but that only queries Google. I
actually want to build my own ngram database, because I'm training on a specific
corpus and will possibly have to adapt to different domains.




Re: [Haskell-cafe] Handling a large database (of ngrams)

2011-05-21 Thread wren ng thornton

On 5/21/11 3:56 PM, Aleksandar Dimitrov wrote:

> Hi Haskellers,
>
> I'm unaware of a good method or default way of handling large datasets to
> which I need non-sequential (i.e. random) access in Haskell.
>
> My use case is linguistic analysis of a ~30GB corpus — the most basic form of
> quantitative analysis here is n-gram-based HMMs, which aren't difficult to pull
> off in Haskell.


I've been working on some n-gram stuff lately, which I'm hoping to put 
up on Hackage sometime this summer (i.e., in a month or so, after 
polishing off a few rough edges). The project aims to be suitable for 
industrial-scale problems, though I'm not sure if it'll make it quite so 
far as 30GB.


Unless you really need to get your hands dirty, the first thing you 
should do is check out SRILM and see if that can handle it. Assuming 
SRILM can do it, you'd just need to write some FFI code to call into it 
from Haskell. This is relatively straightforward, and if it hasn't been 
done already it'd be a great contribution to the Haskell NLP community. 
One great thing about this approach is that it's pretty easy to have the 
SRILM server running on a dedicated machine, so that the memory 
requirements of the model don't interfere with the memory requirements 
of your program. (I can give you pointers to Java code doing all this, 
if you'd like.) One big caveat to bear in mind is that SRILM is not 
threadsafe, so that'll get in the way of any parallelism or distributed 
computing on the client side of things. Also, SRILM is hell to install.


If you have too much trouble trying to get SRILM to work, there's also 
the Berkeley LM which is easier to install. I'm not familiar with its 
inner workings, but it should offer pretty much the same sorts of 
operations.




But, with that in mind I can offer a few pointers for doing it in 
Haskell. The first one is interning. Assuming the data is already 
tokenized (or can be made to be), then it should be straightforward to 
write a streaming program that converts every unique token into an 
integer (storing the conversion table somewhere for later use). This 
will reduce the effective corpus size significantly for any natural 
language--- and dramatically for morphologically poor languages like 
English. For morphologically rich languages the benefits will be best if 
you can do morphological segmentation before interning, though whether 
that's viable depends on your NLP task. It doesn't even need to be 
complete morphological segmentation, just break on the obvious 
boundaries in order to get the average word size down to something 
reasonable.
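A minimal sketch of that interning step (my own names; a real version would stream ByteString tokens and persist the table, rather than hold Strings in an in-memory Data.Map):

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as M

-- Intern one token: reuse its ID if seen before, else assign the next one.
intern :: M.Map String Int -> String -> (M.Map String Int, Int)
intern table tok = case M.lookup tok table of
  Just i  -> (table, i)
  Nothing -> let i = M.size table in (M.insert tok i table, i)

-- Convert a token stream to IDs, threading the growing table through.
internAll :: [String] -> (M.Map String Int, [Int])
internAll = mapAccumL intern M.empty

main :: IO ()
main = print (snd (internAll (words "the cat saw the dog")))  -- [0,1,2,0,3]
```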


For regular projects, that integerization would be enough, but for your 
task you'll probably want to spend some time tweaking the codes. In 
particular, you'll probably have enough word types to overflow the space 
of Int32/Word32 or even Int64/Word64. (If not, then skip this section 
and go boldly forward!) Consequently, you won't be able to fit an ID 
into a machine register, so you'll want to find a good dynamic 
representation. If you collect unigram counts along the way (or can 
estimate their frequencies reliably from a subcorpus), then you can use 
a Huffman code, Shannon--Fano code, or arithmetic code in order to bring 
the average codeword length down. The vast majority of word tokens 
belong to very few word types, so this coding will enable you to fit the 
vast majority of tokens into a single machine word, saving a single bit 
or two to indicate that you need to fail-over to passing around an array 
of bits identifying the original token.
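The Huffman part of that idea could be sketched like so (my own names, plain lists instead of a priority queue, and Strings instead of interned IDs — purely illustrative):

```haskell
import Data.List (insertBy, sortBy)
import Data.Ord (comparing)

data HTree = Leaf Int String | Node Int HTree HTree

weight :: HTree -> Int
weight (Leaf n _)   = n
weight (Node n _ _) = n

-- Repeatedly merge the two lightest trees; the list stays weight-sorted.
build :: [HTree] -> HTree
build []         = error "huffman: no symbols"
build [t]        = t
build (a:b:rest) =
  build (insertBy (comparing weight) (Node (weight a + weight b) a b) rest)

-- Read the codewords off the tree: left = 0, right = 1.
codes :: HTree -> [(String, String)]
codes (Leaf _ w)   = [(w, "")]
codes (Node _ l r) = [(w, '0':c) | (w, c) <- codes l]
                  ++ [(w, '1':c) | (w, c) <- codes r]

huffman :: [(String, Int)] -> [(String, String)]
huffman freqs = codes (build (sortBy (comparing weight)
                                     [Leaf n w | (w, n) <- freqs]))

main :: IO ()
main = mapM_ print (huffman [("the",900),("of",500),("cat",60),("wug",1)])
-- frequent words get short codes: "the" -> 1 bit, "wug" -> 3 bits
```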


If you can get away with not doing those codeword compression tricks, 
then you can also gather all your n-gram statistics along the way (this 
is what my program does). Or, if you can partition the corpus into a 
handful of subcorpora, each of which can get by without the compression 
tricks, then you can run your training code separately on each subcorpus 
and unify the ID spaces when folding the results together.


Other than interning, another major way to save on space is to use 
tries, as you mentioned. In particular, every tree for n-grams fits into 
a tree for (n+1)-grams, and since you're storing the (n+1)-gram tree 
already you might as well have the n-gram tree reuse the overhead. In my 
experience, unifying all the different n-gram trees has at most a 
trivial speed cost compared to giving them all dedicated representations.
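One way that shared structure could look (a sketch with my own names; corpus-boundary handling is glossed over): every prefix of an inserted n-gram gets its count bumped, so bigram and unigram counts live on the interior nodes of the trigram trie.

```haskell
import qualified Data.IntMap.Strict as IM

-- One trie for all orders: the n-gram counts sit on the interior
-- nodes of the (n+1)-gram trie, so shorter grams ride along for free.
data NGrams = NGrams { count :: !Int, children :: IM.IntMap NGrams }

empty :: NGrams
empty = NGrams 0 IM.empty

-- Insert one (interned) n-gram, bumping the count of every prefix.
insert :: [Int] -> NGrams -> NGrams
insert []     (NGrams c kids) = NGrams (c + 1) kids
insert (w:ws) (NGrams c kids) =
  NGrams (c + 1) (IM.alter (Just . insert ws . maybe empty id) w kids)

countOf :: [Int] -> NGrams -> Int
countOf []     t = count t
countOf (w:ws) t = maybe 0 (countOf ws) (IM.lookup w (children t))

main :: IO ()
main = do
  let t = foldr insert empty [[1,2,3],[1,2,4],[1,5,6]]
  print (countOf [1,2] t, countOf [1] t, countOf [7] t)  -- (2,3,0)
```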


The next question is what sort of trie representation to use. Something 
based on big-endian patricia trees (e.g., containers:Data.IntMap or 
bytestring-trie:Data.Trie) is a good place to start. Though, you'll want 
to make them wider than binary-branching in order to reduce the overhead 
of the pointers et al. If you're still running into memory issues for 
storing the whole trained model, then you should look into 
cache-oblivious tree structures that you can mmap in.


If all that still isn't enough, then that's where you start breaking 
into doing research :)


--