[GitHub] flink issue #2819: [FLINK-4961] [ml] SGD for Matrix Factorization (WIP)
Github user gaborhermann commented on the issue: https://github.com/apache/flink/pull/2819

Hello Theodore,

Thanks for taking a look at this PR!

1. No problem. Some performance evaluations might help us in this case too. I don't believe it should have much effect on performance anyway, as the 3-way join is only used outside the iteration.
2. I really like the idea of doing a performance evaluation! I'm not exactly sure how to do this with a `join` instead of a `coGroup`, so let me sketch an implementation before elaborating on the pros and cons. (It was more straightforward to use `coGroup`.)
3. The benefit of using more blocks is using less memory, just as in the ALS algorithm, at the cost of more network traffic. However, more blocks here mean much more network traffic and bookkeeping than in ALS, because there are more Flink iterations. I've investigated this a bit more and found that the main "bottleneck" is the maximum size of the sort buffer, which was around 100 MB with around 40 GB of TaskManager memory, so large matrix blocks do not fit into it. Even given enough memory, we therefore cannot use large blocks. Unfortunately the maximum size of the sort buffer is not configurable, and I would not like to change the underlying code in the core or runtime, but I tried a workaround which might work: splitting the factor blocks into sub-blocks (see the sketch after this comment). I'll just need to run some measurements. If this workaround turns out fine, then it should be okay to go on with this approach and give instructions in the docs.
4. Okay, I agree. I hope we can generalize this non-overlapping blocking to other optimization problems :) I'll take a look at the paper you linked!
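To make the sub-block workaround from point 3 concrete, below is a minimal sketch with Flink's Scala DataSet API. The `FactorBlock` / `FactorSubBlock` types and the `numSubBlocks` parameter are hypothetical placeholders, not the PR's actual code; the sketch only illustrates splitting a large factor block into smaller records so that a single record stays below the sort-buffer limit.

```scala
import org.apache.flink.api.scala._

// Hypothetical block types -- placeholders for the PR's actual classes.
case class FactorBlock(blockId: Int, factors: Array[Array[Double]])
case class FactorSubBlock(blockId: Int, subBlockId: Int, factors: Array[Array[Double]])

// Split each factor block into at most `numSubBlocks` smaller records so that no
// single record exceeds the sorter's buffer limit.
def splitToSubBlocks(
    blocks: DataSet[FactorBlock],
    numSubBlocks: Int): DataSet[FactorSubBlock] = {
  blocks.flatMap { block =>
    val subBlockSize =
      math.max(1, math.ceil(block.factors.length.toDouble / numSubBlocks).toInt)
    block.factors
      .grouped(subBlockSize)
      .zipWithIndex
      .map { case (part, idx) => FactorSubBlock(block.blockId, idx, part) }
  }
}
```

Downstream operators would then group the sub-blocks by `blockId` and reassemble them (for example with a `groupReduce`) around the SGD update step.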
[GitHub] flink issue #2819: [FLINK-4961] [ml] SGD for Matrix Factorization (WIP)
Github user thvasilo commented on the issue: https://github.com/apache/flink/pull/2819

Hello @gaborhermann,

I really like the idea of introducing a `MatrixFactorization` interface that we can then use for different specialized optimization algorithms. For the open questions I'm afraid I can't be of much help right now; I'll read the relevant paper this week and get back to you if I have any more comments.

1. I don't know enough about joins to answer this :/
2. For this we would need to test the two solutions you have proposed and evaluate their performance. You have listed a couple of pros and cons there; maybe you can elaborate?
3. If there is absolutely no benefit from using more blocks, I don't see the need to investigate further. We'll need to include these instructions in the docs.
4. I think test hardening should be done in a different PR (potentially covering more algorithms). For now manual tests should suffice.
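For illustration, a `MatrixFactorization` interface of the kind mentioned above could look roughly like the sketch below. All names (`Rating`, `Factors`, `factorize`, `SGDMatrixFactorization`) are hypothetical and are not part of FlinkML or this PR.

```scala
import org.apache.flink.api.scala._

// Illustrative types and names -- not FlinkML's or the PR's actual API.
case class Rating(user: Int, item: Int, rating: Double)
case class Factors(id: Int, factors: Array[Double])

// A common interface that both ALS and the SGD-based factorization could implement,
// so callers only depend on how factors are produced, not on the concrete optimizer.
trait MatrixFactorization {
  /** Learns user and item factor matrices from a data set of ratings. */
  def factorize(ratings: DataSet[Rating]): (DataSet[Factors], DataSet[Factors])
}

class SGDMatrixFactorization(
    numFactors: Int,
    learningRate: Double,
    numBlocks: Int,
    iterations: Int) extends MatrixFactorization {

  override def factorize(ratings: DataSet[Rating]): (DataSet[Factors], DataSet[Factors]) = {
    // The blocked SGD updates from this PR would go here.
    ???
  }
}
```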
[GitHub] flink issue #2819: [FLINK-4961] [ml] SGD for Matrix Factorization (WIP)
Github user gaborhermann commented on the issue: https://github.com/apache/flink/pull/2819

There are some open questions:

1. Should we optimize the 3-way join? For now the join order is hard-coded; we might also be able to give hints for the join strategies.
2. How should we handle empty blocks? When matching a rating block with the current factor blocks, there might be no rating block or no factor block with that ID, as the rating block corresponds to a different user and item block at every iteration. For now we match the blocks with a `coGroup` and essentially do a full outer join, because we need to change the rating block ID for every factor block at each iteration (see the sketch after this comment). This might not be the optimal solution (see the comments at the `coGroup`), but I don't see a better one right now.
3. The number of blocks also determines the number of iterations, so a higher number of blocks degrades performance. We conducted experiments on a cluster that show this: see the [plot for movielens data](https://s18.postimg.org/txap3x9o9/movielens_blocks.png) and the [plot for lfm_1b data](https://s11.postimg.org/ysnonuer7/lfm1b_blocks.png). Based on this we would recommend setting the number of blocks to the smallest value such that the blocks still fit into memory (and at least the parallelism of the execution). There might be some way to avoid this and break the computation into more blocks while doing the same number of iterations, but it is not trivial because of the possibly conflicting user-item blocks (which is why the paper uses this blocking in the first place). Should we investigate this further? With the recommended settings (and given enough memory) the algorithm performs well (see the plots).
4. The test data is made by hand to ensure that changes to the code do not change the algorithm's results. The algorithm produces good results on real data. The question is whether we should build a more thorough testing mechanism for matrix factorization (as proposed in the [PR for iALS](https://github.com/apache/flink/pull/2542)) or whether this kind of testing is sufficient.
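To make the `coGroup`-based matching in point 2 more concrete, here is a minimal sketch with Flink's Scala DataSet API. The block types (`FactorBlock`, `RatingBlock`) and the `matchBlocks` helper are hypothetical placeholders, not the PR's actual classes; the sketch only shows the outer-join-like semantics on the factor-block side.

```scala
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

// Hypothetical block types -- placeholders for the PR's actual classes.
case class FactorBlock(blockId: Int, factors: Array[Array[Double]])
case class RatingBlock(blockId: Int, ratings: Array[(Int, Int, Double)])

// Match factor blocks with rating blocks by block ID. Unlike a plain inner join,
// the coGroup also emits factor blocks that have no matching rating block in the
// current iteration (the "empty block" case), so their block IDs can still be
// advanced for the next iteration.
def matchBlocks(
    factors: DataSet[FactorBlock],
    ratings: DataSet[RatingBlock]): DataSet[(FactorBlock, Option[RatingBlock])] = {
  factors.coGroup(ratings)
    .where(_.blockId)
    .equalTo(_.blockId)
    .apply {
      (fs: Iterator[FactorBlock],
       rs: Iterator[RatingBlock],
       out: Collector[(FactorBlock, Option[RatingBlock])]) =>
        val ratingBlock = if (rs.hasNext) Some(rs.next()) else None
        // Forward every factor block, with or without a matching rating block.
        fs.foreach(f => out.collect((f, ratingBlock)))
    }
}
```

A real implementation would also have to handle rating blocks without a matching factor block and update the block IDs for the next iteration; those details are omitted here.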