Hi Sachin,

1. I would highly encourage you to adopt the template, then upgrade and maintain it to track future PIO releases, if that is something you would like to do. Otherwise, you may want to consider following http://predictionio.apache.org/templates/classification/quickstart/ and see whether your use case fits it. Being an official template means it will track the main PredictionIO release.
2. You should definitely have a dedicated Spark cluster if your input data size is going to be much larger. Start with machines that have a core-to-memory ratio of 1:2 to 1:4 (cores to GB of RAM), and scale out the cluster as needed to meet your training time requirement.

Regards,
Donald

On Thu, Oct 26, 2017 at 1:00 AM, Sachin Kamkar <[email protected]> wrote:
> Hi Team,
>
> Firstly, if I am posting to the wrong group, please direct me to the right
> forum or mailing list. Thanks in advance.
>
> Problem: Binary Classification
> Number of Features: 10K - 20K
> Number of documents to be trained: 1 Million
> Model: https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs
> Recommended PIO version: 0.9.2
>
> I am new to PredictionIO. I have done small predictions with ~100
> features and a 10k training set, and I was able to run that on a 2-core,
> 16GB RAM server.
>
> Now that my actual dataset is very large, I don't know where to even start
> in terms of configuration.
>
> I need 3 suggestions:
>
> - For my problem, have I chosen the correct model? As this model only
>   runs on 0.9.2 and with 0.12 being the latest, am I spending energy on the
>   wrong model?
>    - Should I consider changing the code to be compatible with 0.12?
> - What hardware should I choose?
>    - Should I have a dedicated Spark cluster? If yes, what config
>      should I start off with?
>    - How much memory should I set for the driver and executor?
> - How much time can I expect this training to take?
>
>
> With Regards,
>
> Sachin
> ⚜KTBFFH⚜
>
