It sounds like you don't quite have a cold-start problem. You have a few behaviors per user, a few views or clicks, not zero. So you really just need an approach that's comfortable with sparse input. A low-rank factorization model fit with ALS (alternating least squares) works fine in this case, for example.
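To make the factorization idea concrete, here is a minimal sketch of ALS on a tiny view matrix. Everything in it is an assumption for illustration: the toy matrix `R`, the function names `als` and `recommend`, and the values of `rank` and `reg`. It also treats every missing entry as a plain 0 with equal weight, which is a simplification; implicit-feedback variants of ALS usually weight observed interactions more heavily than unobserved ones.

```python
import numpy as np

def als(R, rank=2, reg=0.1, iters=20, seed=0):
    """Factor R (users x items) as U @ V.T by alternating ridge regressions."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Hold V fixed and solve the regularized least-squares problem for U.
        U = np.linalg.solve(V.T @ V + ridge, V.T @ R.T).T
        # Hold U fixed and solve the same kind of problem for V.
        V = np.linalg.solve(U.T @ U + ridge, U.T @ R).T
    return U, V

def recommend(R, U, V, user, n=3):
    """Rank the items this user has not interacted with by predicted score."""
    scores = U[user] @ V.T
    scores[R[user] > 0] = -np.inf  # never recommend something already viewed
    return [int(i) for i in np.argsort(-scores)[:n]]

# Hypothetical toy data: 4 users x 5 items, 1 = viewed, 0 = no interaction.
R = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

U, V = als(R)
recs = recommend(R, U, V, user=0, n=2)
print(recs)
```

The rows of `U` are the per-user representations mentioned in the next paragraph; if you still want clusters, you could run any clustering method on those rows after the fact.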
There's a circularity problem in thinking about solving this with clustering: if you don't have enough data to recommend to users at the start, on what data are you clustering them before that? I don't think you need clustering either. (Of course, you can cluster easily from the representation you get out of something like a low-rank factorization. It can easily be an output rather than an 'input'.)

As to evaluation, it depends a little on what you mean by frequent item sets and evaluation. Are you saying a result is good if it frequently co-occurs with other items the user viewed? That makes some sense, although it sounds like you're just testing whether the recommender does exactly what an item-similarity-based recommender would do when based on co-occurrence between items. That is, if that's defined as the right answer, then why not save yourself the trouble and build the recommender to give exactly that answer?

Usually you check whether the model recommends back things the user actually viewed that were held out of the training data. This has its own problems, but presupposing a correct algorithm isn't one of them.
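A held-out evaluation of that kind might look roughly like the sketch below. The names `leave_one_out_split`, `hit_rate_at_k`, and `make_popularity_recommender` are made up for this example, the `views` data is invented, and the popularity recommender is only a stand-in for whatever model you actually train. Hit rate at k is one common choice of metric here, not the only one.

```python
import random
from collections import Counter

def leave_one_out_split(views, seed=0):
    """Hold out one viewed item for each user who has at least two views."""
    rng = random.Random(seed)
    train, held_out = {}, {}
    for user, items in views.items():
        items = sorted(items)
        if len(items) < 2:
            train[user] = set(items)  # too little data to hold anything out
            continue
        target = rng.choice(items)
        held_out[user] = target
        train[user] = set(items) - {target}
    return train, held_out

def hit_rate_at_k(recommend, train, held_out, k=10):
    """Fraction of users whose held-out item appears in their top-k list."""
    if not held_out:
        return 0.0
    hits = sum(held_out[u] in recommend(u, train[u], k) for u in held_out)
    return hits / len(held_out)

def make_popularity_recommender(train):
    """Placeholder model: recommend the most-viewed items not yet seen."""
    counts = Counter(item for items in train.values() for item in items)
    ranked = [item for item, _ in counts.most_common()]
    def recommend(user, seen, k):
        return [item for item in ranked if item not in seen][:k]
    return recommend

# Hypothetical view logs: user -> set of viewed item ids.
views = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c", "d"},
    "u4": {"d"},
}

train, held_out = leave_one_out_split(views)
model = make_popularity_recommender(train)  # fit on training views only
score = hit_rate_at_k(model, train, held_out, k=2)
print(score)
```

The important part is that the model only ever sees `train`, so the score says something about predicting real behavior rather than about agreeing with a particular algorithm.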