There is already a tf-idf example in the Beam repository [1].

[1] https://github.com/apache/beam/blob/v2.1.1/sdks/python/apache_beam/examples/complete/tfidf.py
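
In case it helps to see the two counts side by side, here is a minimal sketch in plain Python (not Beam code, and not taken from the linked example). The function name and the tokenizing regex are my own assumptions for illustration. The point is only that the document count needs the words deduplicated per sentence before counting, while the total count does not.

```python
import re
from collections import Counter


def word_and_document_counts(documents):
    """Return {word: (total_count, document_count)} for an iterable of strings.

    Hypothetical helper for illustration only; each string is treated as
    one document (here, one sentence).
    """
    total_counts = Counter()     # occurrences of each word across all documents
    document_counts = Counter()  # number of documents that contain each word
    for text in documents:
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        total_counts.update(words)
        # A set counts each word at most once per document.
        document_counts.update(set(words))
    return {word: (total_counts[word], document_counts[word])
            for word in total_counts}


sentences = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]
counts = word_and_document_counts(sentences)
print(counts["the"])  # (3, 2): three occurrences, in two sentences
```

In a pipeline I would expect the same split to show up as two branches over the input, one emitting every word and one emitting the distinct words of each sentence, each followed by a per-element count, but treat that as a rough mapping rather than a tested recipe.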
On Sun, Oct 8, 2017 at 2:04 PM, James Comfort <[email protected]> wrote:

> I have added a question
> <https://stackoverflow.com/questions/46636034/apache-beam-python-word-count-and-document-frequency>
> to stackoverflow showing what I have put together so far. While it works,
> this still seems fairly hacky. Would anybody be able to suggest some
> improvements/best-practices that I should implement?
>
> On Sun, Oct 8, 2017 at 1:53 PM, James Comfort <[email protected]> wrote:
>
>> Hi all,
>>
>> I am looking to extend the wordcount
>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py>
>> python example to track not only the 'count' of the words in all
>> sentences, but also to include the number of unique documents (ie.
>> sentences) that the word appears in. This could be used to calculate the
>> inverse document frequency for tf-idf or similar. Any suggestions or
>> examples that help to illustrate the most efficient way to do this via
>> the Python SDK?
>>
>> Thanks,
>> Jimmy
