There is already a TF-IDF example in the Beam repository [1].

1:
https://github.com/apache/beam/blob/v2.1.1/sdks/python/apache_beam/examples/complete/tfidf.py
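
For the narrower question of getting both the total count and the document frequency in one pass, here is a minimal sketch, not taken from the linked example. It assumes each input element is one document (a sentence) and that a simple regex tokenizer is good enough. The names word_stats, sum_pairs and count_words_and_documents are made up for illustration. The idea is to emit (word, (count_in_document, 1)) once per distinct word per document and then sum the pairs per key.

```python
from collections import Counter
import re


def word_stats(document):
    """Yield (word, (count_in_document, 1)) for each distinct word."""
    # Assumption: lowercase alphabetic tokens are an adequate tokenization.
    words = re.findall(r"[a-z']+", document.lower())
    for word, count in Counter(words).items():
        # The trailing 1 means "this word appeared in one more document".
        yield word, (count, 1)


def sum_pairs(pairs):
    """Add (count, doc_frequency) tuples component-wise."""
    total_count, total_docs = 0, 0
    for count, docs in pairs:
        total_count += count
        total_docs += docs
    return total_count, total_docs


def count_words_and_documents(documents):
    """Beam wiring; `documents` is assumed to be a PCollection of strings."""
    # Imported lazily so the two helpers above can be tried without Beam.
    import apache_beam as beam

    return (
        documents
        | "WordStats" >> beam.FlatMap(word_stats)
        | "SumPerWord" >> beam.CombinePerKey(sum_pairs)
    )
```

Each output element would be (word, (total_count, document_frequency)). Because sum_pairs is associative and commutative, it should be usable directly as the combiner. The input could come from something like `p | beam.Create([...])` or a text source that yields one sentence per element.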

On Sun, Oct 8, 2017 at 2:04 PM, James Comfort <[email protected]> wrote:

> I have added a question
> <https://stackoverflow.com/questions/46636034/apache-beam-python-word-count-and-document-frequency>
> to stackoverflow showing what I have put together so far. While it works,
> this still seems fairly hacky. Would anybody be able to suggest some
> improvements/best-practices that I should implement?
>
> On Sun, Oct 8, 2017 at 1:53 PM, James Comfort <[email protected]> wrote:
>
>> Hi all,
>>
>> I am looking to extend the wordcount
>> <https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount.py>
>> Python
>> example to track not only the 'count' of the words in all sentences, but
>> also to include the number of unique documents (i.e., sentences) that the
>> word appears in.  This could be used to calculate the inverse document
>> frequency for tf-idf or similar.  Any suggestions or examples that help to
>> illustrate the most efficient way to do this via the Python SDK?
>>
>> Thanks,
>> Jimmy
>>
>
>
