1: Why model data is so small and why I'm getting predictions only for a couple of items?

The model size depends on the number of items. It only saves the "item vector" of each item.

2: Is this event data quality problem? If yes, How can I test and improve the data quality?

It could be because your data is too sparse. One way to investigate is to open the engine page (http://localhost:8000 by default) after you run 'pio deploy' and look at the printed model output. You should see the info there; take a look at the size of "productFeatures".
https://github.com/PredictionIO/template-scala-parallel-similarproduct/blob/develop/src/main/scala/ALSAlgorithm.scala#L30

3: Is it safe to remove old duplicate set events with MySQL query and leave only the latest set event for item? Will it help regarding data quality?

Yes, it's safe. (This template doesn't rely on the state changes of item properties to train the model.)

4: I see different settings for ALS algorithm in engine.json file. Can tweaking those settings in someway help? Are those settings explained somewhere?

See here:
http://spark.apache.org/docs/1.6.2/mllib-collaborative-filtering.html#collaborative-filtering

On Mon, Sep 26, 2016 at 3:44 AM, Tahir Mushtaq <[email protected]> wrote:

> Hi everyone,
>
> I am using SimilarProducts template. I have around 3 millions of event
> data for about 180k unique items, which is collected in 2 months of period.
> Original event data size is about 900MB, but after training, model data
> size shrinks to only 16KB. And when I try to get predictions, I receive
> predictions only for 15 items.
>
> I have only $set and view events in my event store, which looks like
> following.
>
> {
>   "event" : "$set",
>   "entityType" : "item",
>   "entityId" : "someEntityId",
>   "properties" : {
>     "property1" : "property1_value",
>     "property2" : "property2_value"
>   }
> }
>
> {
>   "event" : "view",
>   "entityType" : "user",
>   "entityId" : "userSessionId",
>   "targetEntityType" : "item",
>   "targetEntityId" : "someTargetEntityId",
>   "properties" : {}
> }
>
> Few facts about my implementation:
> - I have removed the requirement in engine template to set user before
>   user can view the item, as described here
>   https://github.com/apache/incubator-predictionio/tree/develop/examples/scala-parallel-similarproduct/no-set-user
> - Since I dont want to track users, Im using session id of the user as the
>   entityId in view event.
> - In my case I cannot track if an item is already set in event store or
>   not. for this reason, I'm setting the item before each view event every
>   time. As I read many times in forums that it does not affect predictions,
>   if an item has multiple set events.
> - I'm using MySQL to store everything (event data, model data, metadata
>   etc.) because of certain requirements.
>
> I have following questions about above problem:
> 1: Why model data is so small and why I'm getting predictions only for a
> couple of items?
> 2: Is this event data quality problem? If yes, How can I test and improve
> the data quality?
> 3: Is it safe to remove old duplicate set events with MySQL query and
> leave only the latest set event for item? Will it help regarding data
> quality?
> 4: I see different settings for ALS algorithm in engine.json file. Can
> tweaking those settings in someway help? Are those settings explained
> somewhere?
>
> Currently my ALS algorithm settings looks like this:
>
> "algorithms": [
>   {
>     "name": "als",
>     "params": {
>       "rank": 10,
>       "numIterations" : 10,
>       "lambda": 0.01,
>       "seed": 3
>     }
>   }
> ]
>
> Many thanks for your time and suggestions.
>
> Best,
> Tahir
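
P.S. On question 2: a rough way to check sparsity outside the engine is to count how many distinct items each "user" (here, each session id) has viewed. The snippet below is only a minimal sketch in Python, assuming the view events have been exported as a list of dicts in the same shape as the JSON quoted above. The function name view_sparsity and the sample events are made up for illustration and are not part of PredictionIO.

```python
from collections import defaultdict


def view_sparsity(events):
    """Summarize how sparse the user-item view data is.

    `events` is assumed to be an iterable of dicts shaped like the
    "view" event quoted above (entityId = user/session, targetEntityId = item).
    """
    items_by_user = defaultdict(set)
    users_by_item = defaultdict(set)
    for e in events:
        if e.get("event") != "view":
            continue  # skip $set and any other events
        user, item = e["entityId"], e["targetEntityId"]
        items_by_user[user].add(item)
        users_by_item[item].add(user)

    n_users, n_items = len(items_by_user), len(users_by_item)
    pairs = sum(len(items) for items in items_by_user.values())
    single = sum(1 for items in items_by_user.values() if len(items) == 1)
    return {
        "users": n_users,
        "items": n_items,
        "distinct_pairs": pairs,
        # share of the user x item matrix that has at least one view
        "density": pairs / (n_users * n_items) if n_users and n_items else 0.0,
        # users (sessions) that viewed exactly one distinct item
        "single_item_users": single,
    }


# Hypothetical sample data, only to show the call.
sample_events = [
    {"event": "$set", "entityType": "item", "entityId": "a"},
    {"event": "view", "entityType": "user", "entityId": "s1",
     "targetEntityType": "item", "targetEntityId": "a"},
    {"event": "view", "entityType": "user", "entityId": "s1",
     "targetEntityType": "item", "targetEntityId": "b"},
    {"event": "view", "entityType": "user", "entityId": "s2",
     "targetEntityType": "item", "targetEntityId": "a"},
    {"event": "view", "entityType": "user", "entityId": "s3",
     "targetEntityType": "item", "targetEntityId": "c"},
]
stats = view_sparsity(sample_events)
print(stats)
```

If most session ids turn out to have viewed only one item, there is very little co-view signal in the data, which would be consistent with the sparsity explanation above.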
