I have added a question <https://stackoverflow.com/questions/46636034/apache-beam-python-word-count-and-document-frequency> to Stack Overflow showing what I have put together so far. While it works, it still seems fairly hacky. Could anyone suggest improvements or best practices that I should apply?

On Sun, Oct 8, 2017 at 1:53 PM, James Comfort <[email protected]> wrote:

> Hi all,
>
> I am looking to extend the wordcount
> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py>
> python example to track not only the 'count' of the words in all
> sentences, but also to include the number of unique documents (ie.
> sentences) that the word appears in. This could be used to calculate the
> inverse document frequency for tf-idf or similar. Any suggestions or
> examples that help to illustrate the most efficient way to do this via
> the Python SDK?
>
> Thanks,
> Jimmy
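
For anyone following along, here is a minimal plain-Python sketch of the shape of the computation described above. It is not the code from the Stack Overflow post and it does not use Beam itself. The function names (`per_document_pairs`, `sum_pairs`, `count_and_doc_frequency`) are hypothetical, and the lowercase whitespace tokenization is an assumption made for brevity. The idea is to emit one `(word, (occurrences, 1))` pair per distinct word per sentence, then sum the tuples elementwise per word, so the second element ends up as the number of sentences containing the word.

```python
from collections import Counter, defaultdict


def per_document_pairs(sentence):
    """Yield (word, (occurrences_in_this_sentence, 1)) once per distinct word.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    counts = Counter(sentence.lower().split())
    for word, occurrences in counts.items():
        # The trailing 1 marks "this word appeared in one document".
        yield word, (occurrences, 1)


def sum_pairs(pairs):
    """Elementwise sum of (count, doc_count) tuples for a single word."""
    total_count, doc_count = 0, 0
    for occurrences, docs in pairs:
        total_count += occurrences
        doc_count += docs
    return total_count, doc_count


def count_and_doc_frequency(sentences):
    """Return {word: (total_count, number_of_sentences_containing_word)}."""
    grouped = defaultdict(list)
    for sentence in sentences:
        for word, pair in per_document_pairs(sentence):
            grouped[word].append(pair)  # plays the role of a group-by-key
    return {word: sum_pairs(pairs) for word, pairs in grouped.items()}


result = count_and_doc_frequency(["the cat sat on the mat", "the dog sat"])
```

With the two sentences above, "the" occurs three times across two sentences, so its entry is `(3, 2)`. In a Beam pipeline, the per-sentence step would presumably map onto `beam.FlatMap` and the elementwise sum onto `beam.CombinePerKey`, but I have not verified that exact wiring here.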
