Hi! We have implemented a transformer that computes a co-occurrence matrix for words within a given window. This matrix will then be used for unsupervised learning of vector representations for words (we are basically implementing GloVe: http://nlp.stanford.edu/projects/glove/).
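To make "co-occurrence within a window" concrete, here is a minimal sketch in plain Python. It is only an illustration, not our actual transformer: the function name `cooccurrence_counts`, the symmetric window, and the unweighted counts are all assumptions made for the example.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count symmetric co-occurrences of tokens at most `window` positions apart.

    Hypothetical helper for illustration; plain counts, no distance weighting.
    """
    counts = Counter()
    for i, center in enumerate(tokens):
        # Look only to the right, then record both orders so the matrix is symmetric.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[(center, tokens[j])] += 1
            counts[(tokens[j], center)] += 1
    return counts


# One "unit" of text (currently a line; ideally a sentence).
counts = cooccurrence_counts("the cat sat on the mat".split(), window=2)
```

The window never crosses the boundary of the token list passed in, which is why it matters whether that unit is a line or a sentence.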
Right now, we compute the co-occurrence matrix with a sliding window over the lines that we get from env.readTextFile(...). Instead, it would be nice if we could slide the window over sentences. So far, we have not figured out how to get sentences that (in the worst case) span multiple lines. Is this possible somehow, or would we have to define our own input format? The idea is to read a corpus and allow some kind of user-defined parsing of the text documents (something like a CorpusInputFormat, maybe?).

Thanks!
Felix