Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-16 Thread Davis Varghese
Since we are on Spark 2.2, I backported and fixed it. Here is the diff file comparing against https://github.com/apache/spark/blob/73fe1d8087cfc2d59ac5b9af48b4cf5f5b86f920/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala 24c24 < import org.apache.spark.ml.param.{Param, ParamMap,

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-15 Thread Davis Varghese
Since we are on Spark 2.2, I backported and fixed it. Here is the diff file comparing against https://github.com/apache/spark/blob/73fe1d8087cfc2d59ac5b9af48b4cf5f5b86f920/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala 24c24 < import org.apache.spark.ml.param.{Param, ParamMap,

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-15 Thread Jorge Sánchez
Hi, after seeing that IDF needed refactoring to use ML vectors instead of MLlib ones, I have created a Jira ticket at https://issues.apache.org/jira/browse/SPARK-22531 and submitted a PR for it. If anyone can have a look and suggest any changes

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-14 Thread Bago Amirbekian
There is a known issue with VectorAssembler that causes it to fail in streaming if any of the input columns are of VectorType and don't have size information (https://issues.apache.org/jira/browse/SPARK-22346). This can be fixed by adding size information to the vector columns; I've made a PR to

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-12 Thread Davis Varghese
Bago, I was finally able to create an example that fails consistently. I think the issue is caused by the VectorAssembler in the model. In the new code, I have 2 features (1 text and 1 number), and I have to run them through a VectorAssembler before passing them to LogisticRegression. Code and test data below

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-09 Thread Davis Varghese
Bago, the code I wrote does not reproduce the issue. In our case, we build an ML pipeline from a UI, in a particular fashion, so that a user can create a pipeline behind the scenes using drag and drop. I have yet to dig deeper to recreate the same thing as standalone code. Meanwhile I am sharing

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-09 Thread Bago Amirbekian
Davis, were you able to find an example? Anything you have could help. On Wed, Nov 1, 2017 at 8:53 PM Davis Varghese wrote: > Sure. I will get one over the weekend

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-01 Thread Davis Varghese
Sure. I will get one over the weekend

Re: HashingTFModel/IDFModel in Structured Streaming

2017-11-01 Thread Bago Amirbekian
Davis, I'm looking into this. If you could include some code that I can use to reproduce the error, along with the stack trace, it would be really helpful. On Fri, Oct 20, 2017 at 11:01 AM Joseph Bradley wrote: > Hi Davis, > We've started tracking these issues under this umbrella: >

Re: HashingTFModel/IDFModel in Structured Streaming

2017-10-20 Thread Joseph Bradley
Hi Davis, We've started tracking these issues under this umbrella: https://issues.apache.org/jira/browse/SPARK-21926 I'm hoping we can fix some of these for 2.3. Thanks, Joseph On Mon, Oct 16, 2017 at 9:23 PM, Davis Varghese wrote: > I have built a ML pipeline model on a

HashingTFModel/IDFModel in Structured Streaming

2017-10-16 Thread Davis Varghese
I have built an ML pipeline model on static Twitter data for sentiment analysis. When I use the model on a structured stream, it always throws "Queries with streaming sources must be executed with writeStream.start()". This particular model doesn't contain any documented "unsupported"