Depending on your needs, it's fairly easy to write a lightweight Python wrapper around the Databricks spark-corenlp library: https://github.com/databricks/spark-corenlp
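A minimal sketch of such a wrapper, assuming spark-corenlp (and its CoreNLP model jars) is on the driver and executor classpath. The Scala package path and the function names in the list below come from the spark-corenlp README; the `corenlp` helper itself is hypothetical, not part of any released package, and uses PySpark's internal `_to_java_column` bridge:

```python
# Names exposed by com.databricks.spark.corenlp.functions (per the README).
CORENLP_FUNCTIONS = ["cleanxml", "tokenize", "ssplit", "pos",
                     "lemma", "ner", "sentiment"]

def corenlp(name):
    """Build a PySpark wrapper for one spark-corenlp Scala Column function."""
    if name not in CORENLP_FUNCTIONS:
        raise ValueError("unknown spark-corenlp function: %s" % name)

    def wrapper(col):
        # Imports are deferred so the helper can be defined (and the name
        # validated) without a live SparkContext.
        from pyspark import SparkContext
        from pyspark.sql.column import Column, _to_java_column

        sc = SparkContext._active_spark_context
        jfuncs = sc._jvm.com.databricks.spark.corenlp.functions
        # Call the Scala function on the underlying Java column and wrap
        # the resulting Java Column back into a Python Column.
        return Column(getattr(jfuncs, name)(_to_java_column(col)))

    return wrapper
```

With a running session this would be used as, e.g., `df.select(corenlp("ssplit")(df["text"]))` to split documents into sentences, keeping the heavy annotation work in the JVM.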
Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com

On Sun, Nov 26, 2017 at 8:19 AM, ashish rawat <dceash...@gmail.com> wrote:

> Thanks Holden and Chetan.
>
> Holden - Have you tried it out? Do you know the right way to do it?
> Chetan - Yes, if we use a Java NLP library, there should not be any issue
> integrating it with Spark Streaming, but as I pointed out earlier, we want
> to give data scientists the flexibility to use the language and library of
> their choice, instead of restricting them to a library of our choice.
>
> On Sun, Nov 26, 2017 at 9:42 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>
>> But you can still use the Stanford NLP library and distribute it through
>> Spark, right?
>>
>> On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> So it's certainly doable (it's not super easy, mind you), but until the
>>> Arrow UDF release goes out it will be rather slow.
>>>
>>> On Sun, Nov 26, 2017 at 8:01 AM, ashish rawat <dceash...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Has someone tried running NLTK (Python) with Spark Streaming (Scala)? I
>>>> was wondering if this is a good idea, and what the right Spark operators
>>>> are to do this. The reason we want to try this combination is that we
>>>> don't want to run our transformations in Python (PySpark), but after the
>>>> transformations we need to run some natural language processing
>>>> operations, and we don't want to restrict the functions data scientists
>>>> can use to Spark's natural language libraries. So, Spark Streaming with
>>>> NLTK looks like the right option, from the perspective of fast data
>>>> processing and data science flexibility.
>>>>
>>>> Regards,
>>>> Ashish
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
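One established way to call a Python library like NLTK from a Scala Spark job (before the Arrow-based UDFs Holden mentions) is `RDD.pipe`, which streams each partition's records through an external process over stdin/stdout. A minimal sketch of the Python side, assuming NLTK is installed on every worker; the split-based tokenizer is a stand-in so the script stays self-contained, and the filename `nltk_pipe.py` is illustrative:

```python
#!/usr/bin/env python
# nltk_pipe.py -- hypothetical script for use from Scala via
#   rdd.pipe("nltk_pipe.py")
# Spark writes each record of a partition to this process's stdin and
# reads one result line per record from stdout.
import sys

def tokenize(text):
    # Stand-in for nltk.word_tokenize; swap in the real call once the
    # NLTK data files are available on the workers.
    return text.split()

def process(lines):
    # One output line per input record, tokens joined by tabs.
    return ["\t".join(tokenize(line.rstrip("\n"))) for line in lines]

if __name__ == "__main__":
    for out in process(sys.stdin):
        print(out)
```

Because heavy setup (imports, model loads) happens once per process rather than once per record, this amortizes NLTK's startup cost across each partition, at the price of serializing every record through text pipes.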