Hi Team,

First, if I am posting to the wrong group, please direct me to the right forum or mailing list. Thanks in advance.

Problem: Binary classification
Number of features: 10K - 20K
Number of documents to be trained: 1 million
Model: https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs
Recommended PIO version: 0.9.2

I am new to PredictionIO. So far I have only done small predictions with ~100 features and a training set of 10K documents, which I was able to run on a server with 2 cores and 16 GB of RAM. My actual dataset is far larger, and I don't know where to start in terms of configuration.

I need 3 suggestions:

- For my problem, have I chosen the correct model? Since this model only runs on 0.9.2 and 0.12 is the latest version, am I spending energy on the wrong model?
  - Should I consider changing the code to be compatible with 0.12?
- What hardware should I choose?
  - Should I have a dedicated Spark cluster? If yes, what configuration should I start with?
  - How much memory should I set for the driver and the executors?
- How long can I expect this training to take?

With Regards,
Sachin
⚜KTBFFH⚜
