[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84049606 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends

[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-20 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84232802 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0")

[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...

2016-10-23 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15212 Hi @yanboliang and @srowen , could you please review whether this PR includes all your comments. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...

2016-11-22 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15212 hi @yanboliang , @srowen @jkbradley , I have updated this PR, thanks.

[GitHub] spark pull request #15444: [SPARK-17870][MLLIB][ML]Change statistic to pValu...

2016-10-11 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/15444 [SPARK-17870][MLLIB][ML]Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference ## What changes were proposed in this pull request? For feature selection
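
The reasoning in this PR title can be illustrated with a small sketch. When features have different degrees of freedom (DoF), their raw chi-square statistics are not directly comparable, so ranking by p-value can produce a different order than ranking by statistic. The feature names and numbers below are made up for illustration, the closed-form tail probabilities cover only DoF = 1 and even DoF, and none of this is code from the PR.

```python
import math

def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Closed forms only: dof == 1, or any even dof >= 2. Other values
    are outside the scope of this sketch.
    """
    if dof == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if dof >= 2 and dof % 2 == 0:
        half = x / 2.0
        terms = sum(half ** k / math.factorial(k) for k in range(dof // 2))
        return math.exp(-half) * terms
    raise ValueError("sketch supports dof == 1 or even dof only")

# Hypothetical features: (name, chi-square statistic, degrees of freedom).
features = [("A", 6.0, 1), ("B", 7.0, 4)]

# Ranking by raw statistic puts B first (7.0 > 6.0).
by_statistic = sorted(features, key=lambda f: f[1], reverse=True)
# Ranking by p-value puts A first, because B's statistic is spread over 4 DoF.
by_p_value = sorted(features, key=lambda f: chi2_sf(f[1], f[2]))

print([f[0] for f in by_statistic])
print([f[0] for f in by_p_value])
```

With these numbers the two rankings disagree, which is the situation the title describes for SelectKBest and SelectPercentile.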

[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector method...

2016-10-16 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15212 Hi @yanboliang @srowen , these are the last two feature selection methods based on ChiSquare, which are similar to the methods in scikit-learn. But there is a bug in SelectFDR in scikit-learn. I have
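
For context, FDR-based selection is commonly done with the Benjamini-Hochberg step-up procedure. The sketch below shows that standard procedure on made-up p-values. The function name and inputs are illustrative, and it is an assumption that this matches the code in the PR, since the message above is truncated.

```python
def select_fdr(p_values, alpha):
    """Return sorted indices of the features kept by Benjamini-Hochberg."""
    m = len(p_values)
    # Feature indices ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value passes.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    # Keep every feature up to that rank, even ones that failed their own test.
    return sorted(order[:cutoff])

# Hypothetical p-values for four features.
p = [0.03, 0.04, 0.045, 0.9]
print(select_fdr(p, alpha=0.1))
```

Feature 0 fails its own threshold (0.03 > 0.025) but is still kept because a later rank passes. That step-up behaviour is the detail implementations most often get wrong.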

[GitHub] spark issue #16434: [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor chang...

2017-01-05 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/16434 Hi @jkbradley , I have updated this PR per your comments. Thanks.

[GitHub] spark pull request #16434: [SPARK-17645][MLLIB][ML][FOLLOW-UP] document mino...

2016-12-29 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/16434 [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change ## What changes were proposed in this pull request? This is a follow-up pr for #15212 to address @jkbradley comments on Document change

[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...

2016-12-29 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15212 Thanks @jkbradley , I will send a follow-up PR for your comments.

[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML]add feature selector method base...

2016-12-29 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15212 Hi @jkbradley @yanboliang , I have created a follow-up PR for this PR: https://github.com/apache/spark/pull/16434 I have not added an FDR test to the ML suite. The main reason is the current data set

[GitHub] spark pull request #16452: [ML] fix getThresholds logic error

2017-01-02 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/16452 [ML] fix getThresholds logic error ## What changes were proposed in this pull request? The logic of getThresholds in ML LogisticRegression is not right, and it doesn't match

[GitHub] spark issue #16452: [ML] fix getThresholds logic error

2017-01-02 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/16452 If both threshold and thresholds are not set, the master will return thresholds.

[GitHub] spark pull request #16452: [ML] fix getThresholds logic error

2017-01-02 Thread mpjlu
Github user mpjlu closed the pull request at: https://github.com/apache/spark/pull/16452

[GitHub] spark issue #16452: [ML] fix getThresholds logic error

2017-01-02 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/16452 @sethah , thanks, I got it wrong. I will close it.

[GitHub] spark issue #16434: [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor chang...

2017-01-05 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/16434 Thanks @jkbradley @srowen , I have added a code snippet for verifying with R.

[GitHub] spark pull request #17739: [SPARK-20443][MLLIB][ML] set ALS blockify size

2017-04-24 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/17739 [SPARK-20443][MLLIB][ML] set ALS blockify size ## What changes were proposed in this pull request? The blockSize of MLLIB ALS is very important for ALS performance. In our test

[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-01 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18748 Thanks. This is my test setting: 3 workers, each: 40 cores, 196G memory, 1 executor. Data Size: user 480,000, item 17,000

[GitHub] spark pull request #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/18832 [SPARK-21623][ML]fix RF doc ## What changes were proposed in this pull request? The comments on parentStats in RF are wrong. parentStats is not only used for the first iteration; it is used

[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18832 node.stats is ImpurityStats, and parentStats is Array[Double]; they are different. Maybe this comment should apply to node.stats, not to parentStats. Is my understanding wrong?

[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18832 I see your point. What confuses me is that the code doesn't work that way. The code updates parentStats in each iteration. Actually, we only need to update parentStats for the first iteration

[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18832 parentStats is used in this code: binAggregates.getParentImpurityCalculator(), which is called in every iteration. So that comment seems very misleading. `} else

[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18832 I agree with you. Do you think we should update the comment to help others understand the code, since parentStats is updated and used in each iteration? Thanks.

[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-03 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18832 Thanks @sethah . I strongly think we should update the comment or just delete it, as the current PR does. Another reason: there are three kinds of features: categorical, ordered

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimize Vector compress

2017-08-15 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 I have tested the performance of toSparse and toSparseWithSize separately. There is about 35% performance improvement for this change.
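
The idea behind toSparse versus toSparseWithSize can be sketched as follows: count the non-zeros once, then pass that count to the conversion so it does not scan the values a second time. This is a Python stand-in with illustrative bodies. Only the names echo the message above, and the size rule `1.5 * (nnz + 1) < size` is an assumption based on my recollection of Spark's heuristic, not something stated in this thread.

```python
def num_nonzeros(values):
    """One pass over the values to count the non-zero entries."""
    return sum(1 for v in values if v != 0.0)

def to_sparse_with_size(values, nnz):
    """Build (size, indices, data) from an already-known non-zero count."""
    indices = [0] * nnz
    data = [0.0] * nnz
    k = 0
    for i, v in enumerate(values):
        if v != 0.0:
            indices[k] = i
            data[k] = v
            k += 1
    return len(values), indices, data

def to_sparse(values):
    """Convenience wrapper that counts the non-zeros itself."""
    return to_sparse_with_size(values, num_nonzeros(values))

def compressed(values):
    """Pick the smaller representation, counting non-zeros only once."""
    nnz = num_nonzeros(values)
    # Assumed heuristic: sparse pays off only when few entries are non-zero.
    if 1.5 * (nnz + 1.0) < len(values):
        return "sparse", to_sparse_with_size(values, nnz)
    return "dense", list(values)

print(compressed([0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 5.0, 0.0]))
```

Calling `to_sparse` from `compressed` would count the non-zeros twice. Passing the count in avoids that second pass, which is the kind of saving the benchmark above refers to.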

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 retest this please

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 retest this please

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 For PR-18904, one iteration takes about 58s before this change and about 40s after it.

[GitHub] spark issue #18904: [SPARK-21624]optimzie RF communicaiton cost

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18904 A gentle ping: @sethah @jkbradley

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 Hi @srowen, how about using our first version? It duplicates some code, but the change is small.

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 Yes, my only concern is that if we add toSparse(size) and have to check the size in the code, there will be no performance gain. If we don't need to check the "size" (comparing size with numNonZero) i

[GitHub] spark pull request #18904: [SPARK-21624]optimzie RF communicaiton cost

2017-08-10 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/18904 [SPARK-21624]optimzie RF communicaiton cost ## What changes were proposed in this pull request? The implementation of RF is bound by either the cost of statistics computation on workers

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 Thanks @srowen. I will revise the code per your suggestion. When I wrote the code, my only concern was that a user might call toSparse(size) with a very small size.

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 I did not test this PR on its own. I found this performance difference while working on PR 18904.

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-13 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 Thanks @sethah @srowen . The comment is added.

[GitHub] spark issue #18904: [SPARK-21624]optimzie RF communicaiton cost

2017-08-13 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18904 retest this please

[GitHub] spark pull request #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/18899#discussion_r132610165 --- Diff: project/MimaExcludes.scala --- @@ -1012,6 +1012,10 @@ object MimaExcludes { ProblemFilters.exclude[IncompatibleResultTypeProblem

[GitHub] spark issue #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18899 Hi @sethah , the unit test is added. Thanks

[GitHub] spark pull request #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/18899#discussion_r132610049 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala --- @@ -635,8 +642,9 @@ class SparseVector @Since("2.0.0") (

[GitHub] spark issue #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-08-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18868 Yes, that is right. Thanks.

[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-07-13 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/18624#discussion_r127214361 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -286,40 +288,124 @@ object

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-07-13 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 I have submitted a PR for ALS optimization with GEMM, and it is ready for review. Performance improves by about 50% compared with the master method. https://github.com/apache/spark/pull

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-13 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Hi @srowen , I have added Test Suite for BoundedPriorityQueue. Thanks.

[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-14 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18624 A user block, after the Cartesian product, generates many blocks (as many as there are item blocks); all of these blocks must be aggregated. Thanks.
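
The aggregation described in this message can be sketched as follows: every (user block, item block) pair yields a partial top-k per user, and the partial lists for the same user are then merged across item blocks. This is a minimal pure-Python stand-in. The nested loops take the place of a GEMM call, and the function names, block layout and data are all hypothetical rather than taken from the PR.

```python
import heapq
from itertools import product

def block_scores(user_block, item_block):
    """Dense block product: one row of item scores per user in the block."""
    return [
        [sum(a * b for a, b in zip(u_factors, i_factors))
         for _, i_factors in item_block]
        for _, u_factors in user_block
    ]

def recommend_for_all(user_blocks, item_blocks, k):
    """Top-k items per user, merged across all item blocks."""
    merged = {}
    # Cartesian product: each user block meets every item block.
    for user_block, item_block in product(user_blocks, item_blocks):
        scores = block_scores(user_block, item_block)
        item_ids = [item_id for item_id, _ in item_block]
        for (user_id, _), row in zip(user_block, scores):
            partial = heapq.nlargest(k, zip(row, item_ids))
            # Aggregate this partial result with earlier ones for the user.
            merged[user_id] = heapq.nlargest(k, merged.get(user_id, []) + partial)
    return {u: [(item, score) for score, item in top] for u, top in merged.items()}

# Hypothetical factors: two user blocks of one user, two item blocks.
user_blocks = [[(1, [1.0, 0.0])], [(2, [0.0, 1.0])]]
item_blocks = [[(10, [1.0, 0.0]), (11, [0.0, 2.0])], [(12, [3.0, 1.0])]]
print(recommend_for_all(user_blocks, item_blocks, k=2))
```

Keeping only k candidates per user after each block pair bounds the size of the intermediate data, whatever the number of item blocks.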

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-14 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 retest this please

[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-14 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18624 Without poll, we have to use toArray.sorted, which performs badly.

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-17 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 I have tested poll and toArray.sorted extensively. If the queue is mostly ordered (say, 2000 offers for a queue of size 20), pq.toArray.sorted is faster. If the queue is mostly disordered
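
The two ways of reading a bounded priority queue in order that are compared here can be sketched with Python's `heapq` standing in for Spark's BoundedPriorityQueue. The helper names are made up, and the sketch only shows that both approaches return the same ordered values. It does not reproduce the performance difference discussed in the thread.

```python
import heapq
import random

def offer(heap, value, capacity):
    """Keep the `capacity` largest values seen so far in a min-heap."""
    if len(heap) < capacity:
        heapq.heappush(heap, value)
    elif value > heap[0]:
        # The new value beats the current minimum, so it replaces it.
        heapq.heapreplace(heap, value)

def drain_by_poll(heap):
    """Pop the minimum repeatedly (the poll-style approach)."""
    heap = list(heap)  # a copy of a heap list is still a valid heap
    out = []
    while heap:
        out.append(heapq.heappop(heap))
    return out

def drain_by_sort(heap):
    """Copy the contents and sort them (the toArray.sorted-style approach)."""
    return sorted(heap)

random.seed(0)
pq = []
for _ in range(2000):  # 2000 offers into a queue of size 20, as in the message
    offer(pq, random.random(), capacity=20)

print(drain_by_poll(pq) == drain_by_sort(pq))
```

Both functions return the retained values in ascending order. Which one is faster depends on the queue implementation and the workload, which is the question the benchmarks in this thread try to answer.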

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-17 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Keeping it or closing it are both OK with me. There has been a lot of discussion at: https://issues.apache.org/jira/browse/SPARK-21401

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-17 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Hi @MLnick , pq.toArray.sorted is also used in other places, like Word2Vec and LDA. How about waiting for my other benchmark results and then deciding whether to close it? Thanks.

[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-07-17 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/18624#discussion_r127669102 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -286,40 +288,120 @@ object

[GitHub] spark pull request #18551: [SPARK-21305][ML][MLLIB]Add options to disable mu...

2017-07-11 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/18551#discussion_r126665323 --- Diff: docs/ml-guide.md --- @@ -61,6 +61,12 @@ To configure `netlib-java` / Breeze to use system optimised binaries, include project and read

[GitHub] spark issue #18551: [SPARK-21305][ML][MLLIB]Add options to disable multi-thr...

2017-07-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18551 retest this please

[GitHub] spark pull request #18551: [SPARK-21305][ML][MLLIB]Add options to disable mu...

2017-07-09 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/18551#discussion_r126323336 --- Diff: docs/ml-guide.md --- @@ -61,6 +61,11 @@ To configure `netlib-java` / Breeze to use system optimised binaries, include project and read

[GitHub] spark issue #18551: [SPARK-21305][ML][MLLIB]Add options to disable multi-thr...

2017-07-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18551 retest this please

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-07-12 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 I have rewritten recommendForAll with BLAS GEMM and got about a 20%-30% performance improvement. https://issues.apache.org/jira/browse/SPARK-21389

[GitHub] spark pull request #18620: [MINOR][ML][MLLIB] add poll function for BoundedP...

2017-07-13 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/18620 [MINOR][ML][MLLIB] add poll function for BoundedPriorityQueue ## What changes were proposed in this pull request? Most BoundedPriorityQueue usages in ML/MLLIB are: Get the value

[GitHub] spark issue #18620: [MINOR][ML][MLLIB] add poll function for BoundedPriority...

2017-07-13 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Yes, my following PR will use it.

[GitHub] spark issue #18620: [MINOR][ML][MLLIB] add poll function for BoundedPriority...

2017-07-13 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Ok, thanks @srowen . I will create a JIRA and show the usage and a performance comparison.

[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-14 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18624 We need the values to be in order here.

[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-07-13 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/18624 [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by gemm with about 50% performance improvement ## What changes were proposed in this pull request? In Spark 2.2, we have optimized ALS

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 I am ok to close this. Thanks @MLnick

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Thanks @srowen , my test also showed pq.poll is a little faster in some cases. One possible benefit is that if we provide pq.poll, users may choose pq.poll first rather than pq.toArray.sorted, which

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 I am also very confused about this. You can change https://github.com/apache/spark/pull/18624 to sorted and test.

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 My micro benchmark (a program that tests only pq.toArray.sorted, pq.toArray.sortBy, and pq.poll) did not find a significant performance difference. Only in the Spark job is there a big difference. Confused

[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18624 Hi @srowen @MLnick @jkbradley @mengxr @yanboliang Is this change acceptable? If it is, I will update the ALS ML code following this method and also update the test suite, which is too simple

[GitHub] spark issue #18620: [SPARK-21401][ML][MLLIB] add poll function for BoundedPr...

2017-07-18 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18620 Hi @MLnick , @srowen . My test shows that pq.poll is not significantly faster than pq.toArray.sortBy, but is significantly faster than pq.toArray.sorted. Seems not each pq.toArray.sorted

[GitHub] spark issue #18551: [SPARK-21305][ML][MLLIB]Add options to disable multi-thr...

2017-07-09 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18551 Hi @felixcheung , I have tested one case: a single-threaded Java program that calls native BLAS. Performance is much better with native BLAS multi-threading disabled (the total program

[GitHub] spark issue #18551: [SPARK-21305][ML][MLLIB]Add options to disable multi-thr...

2017-07-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18551 Hi @srowen , Thanks very much for your review. I will revise the document of this PR to soften the language. According to my profiling data, I guess, when the native BLAS is loaded (or when

[GitHub] spark issue #18551: [SPARK-21305][ML][MLLIB]Add options to disable multi-thr...

2017-07-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18551 Hi @srowen , I understand Felix's point. I mean that if you only have 1 task in C/C++ and 2 CPUs, setting native BLAS to use 2 CPUs will be faster. But in a JVM environment, even if you only have one task and 2 CPUs

[GitHub] spark issue #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendForAll by...

2017-07-16 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18624 I have checked the results against the master method; the recommendation results are right. The master test suite is too simple and should be updated. I will update it. Thanks.

[GitHub] spark pull request #18624: [SPARK-21389][ML][MLLIB] Optimize ALS recommendFo...

2017-07-17 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/18624#discussion_r127641933 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -286,40 +288,120 @@ object

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-07-04 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 I have found out why F2j BLAS is much faster than native BLAS for xiangrui's method (which uses GEMM); see https://issues.apache.org/jira/browse/SPARK-21305

[GitHub] spark pull request #18551: [SPARK-21305][ML][MLLIB]Add options to disable mu...

2017-07-06 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/18551 [SPARK-21305][ML][MLLIB]Add options to disable multi-threading of native BLAS ## What changes were proposed in this pull request? Many ML/MLLIB algorithms use native BLAS (like Intel MKL

[GitHub] spark issue #18551: [SPARK-21305][ML][MLLIB]Add options to disable multi-thr...

2017-07-06 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18551 Thanks, @srowen . I have updated the doc. I also validated the current option in spark-env.sh, it works. Thanks.

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-04-26 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 Another case: 3 workers, each with 40 cores, 196G memory, and 1 executor. Data size: 480,000 users, 17,000 items. recommendProductsForUsers with blockSize 4096 takes about 34s

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-04-26 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 Hi @MLnick , the new test results are: 3 workers, each with 10 cores, 30G memory, and 1 executor. Data size: 3,290,000 users, 200,000 items. recommendProductsForUsers with blockSize 4096

[GitHub] spark pull request #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recomm...

2017-04-26 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/17742#discussion_r113444550 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -277,17 +278,39 @@ object

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-04-26 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 Thanks very much @MLnick . I am doing more testing of the MLlib solution. Once it is solid enough, we can submit a follow-up PR for the ML optimization. What do you think?

[GitHub] spark pull request #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recomm...

2017-04-26 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/17742#discussion_r113445477 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -277,17 +278,39 @@ object

[GitHub] spark issue #17739: [SPARK-20443][MLLIB][ML] set ALS blockify size

2017-04-24 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17739 users is 480,000, items is 170,000. Thanks

[GitHub] spark issue #17739: [SPARK-20443][MLLIB][ML] set ALS blockify size

2017-04-24 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17739 Thanks @MLnick . Could you please review my other PR for the recommend-all performance problem: https://github.com/apache/spark/pull/17742. Sorry, I forgot that users cannot call recommendForAll

[GitHub] spark pull request #17742: [Spark-20446][ML][MLLIB]Optimize MLLIB ALS recomm...

2017-04-24 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/17742 [Spark-20446][ML][MLLIB]Optimize MLLIB ALS recommendForAll ## What changes were proposed in this pull request? The recommendForAll of MLLIB ALS is very slow. GC is a key problem

[GitHub] spark issue #17739: [SPARK-20443][MLLIB][ML] set ALS blockify size

2017-04-24 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17739 recommendProductsForUsers. Thanks

[GitHub] spark pull request #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recomm...

2017-04-28 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/17742#discussion_r113862880 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -276,44 +277,53 @@ object

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-04-28 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 Hi @MLnick, I will be on vacation next week. If you have time to create an ML optimization follow-up PR, that is fine with me. Otherwise, I will submit the follow-up PR after my one-week vacation. Thanks

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-04-28 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 retest please

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-04-28 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 retest this please

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-04-28 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 Thanks @MLnick. Please go ahead for ML API optimization.

[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-07-31 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18748 Did you test the performance of this? I tested the performance of MLLIB recommendForUserSubset some days ago, and the performance was not good. Suppose the time of recommendForAll is 35s, recommend for 1

[GitHub] spark pull request #18899: [SPARK-21680][ML][MLLIB]optimzie Vector coompress

2017-08-09 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/18899 [SPARK-21680][ML][MLLIB]optimzie Vector coompress ## What changes were proposed in this pull request? When using Vector.compressed to convert a Vector to a SparseVector, the performance is very

[GitHub] spark pull request #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-08-07 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/18868#discussion_r131605981 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -1107,9 +1108,11 @@ private[spark] object RandomForest extends Logging

[GitHub] spark issue #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-08-07 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18868 retest this please

[GitHub] spark pull request #18868: [SPARK-21638][ML]Fix RF/GBT Warning message error

2017-08-07 Thread mpjlu
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/18868 [SPARK-21638][ML]Fix RF/GBT Warning message error ## What changes were proposed in this pull request? When training an RF model, there are many warning messages like this: > W

[GitHub] spark issue #18832: [SPARK-21623][ML]fix RF doc

2017-08-07 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18832 Thanks @srowen , I revised the comments per Seth's suggestion: "Parent stats need to be explicitly tracked in the DTStatsAggregator because the parent [[Node]] object does not have Impurity

[GitHub] spark issue #18748: [SPARK-20679][ML] Support recommending for a subset of u...

2017-08-20 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/18748 Thanks @MLnick . I have double-checked my test. Since there is no recommendForUserSubset, my previous test used MLLIB MatrixFactorizationModel::predict(RDD(Int, Int)), which predicts the rating

[GitHub] spark pull request #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recomm...

2017-05-03 Thread mpjlu
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/17742#discussion_r114579054 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -276,44 +277,53 @@ object

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-05-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 ** The most optimized version would be doing a quickselect on each row and select top k. ** An easy-to-implement version would be: I tested both methods; the best performance is about 50
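The per-row top-k selection mentioned in this comment can be illustrated with a small sketch. It is not the code from the PR: it uses a bounded min-heap built on the standard library's `mutable.PriorityQueue` rather than quickselect or Spark's own `BoundedPriorityQueue`, and the names `TopKSketch` and `topK` are hypothetical.

```scala
import scala.collection.mutable

// Illustrative only: keeps the k best scores of one row in a bounded min-heap.
object TopKSketch {
  /** Returns the k highest-scoring (index, score) pairs of one row, best first. */
  def topK(scores: Array[Double], k: Int): Array[(Int, Double)] = {
    // Reversed ordering turns the max-heap into a min-heap on score,
    // so the head is always the weakest candidate kept so far.
    val heap = mutable.PriorityQueue.empty[(Int, Double)](
      Ordering.by[(Int, Double), Double](_._2).reverse)
    var i = 0
    while (i < scores.length) {
      if (heap.size < k) {
        heap.enqueue((i, scores(i)))
      } else if (k > 0 && scores(i) > heap.head._2) {
        // The new score beats the weakest kept candidate: swap it in.
        heap.dequeue()
        heap.enqueue((i, scores(i)))
      }
      i += 1
    }
    // Sort the survivors in descending score order.
    heap.toArray.sortBy(-_._2)
  }
}
```

The heap never holds more than k entries, so each row costs O(n log k) time and O(k) extra memory, which is one way to limit temporary allocations when k is much smaller than the number of items.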

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-05-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 I have not validated whether this code is correct; I only tested its performance.

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-05-11 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 val srcBlocks = blockify(rank, srcFeatures) val dstBlocks = blockify(rank, dstFeatures) val pq = new BoundedPriorityQueue[(Int, Double)](num)(Ordering.by(_._2)) val ratings

[GitHub] spark issue #17742: [Spark-11968][ML][MLLIB]Optimize MLLIB ALS recommendForA...

2017-05-10 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17742 F2jBLAS is faster than MKL BLAS. The following test is based on F2jBLAS. Method 1: BLAS 3 + quickselect on each row and select top k. Method 2: this PR BLOCK size: 256 512 1024 2048
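A rough sketch of the shape of Method 1 follows: score a whole block of users against a block of items, then pick the top k per row. It is an assumption-laden illustration, not the benchmarked code: a plain nested loop stands in for the BLAS level-3 matrix product, a full sort stands in for quickselect, and the names `BlockScoreSketch`, `scoreBlock` and `topKPerRow` are hypothetical.

```scala
// Illustrative only: block scoring followed by per-row top-k selection.
object BlockScoreSketch {
  private def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  /** Scores an (m x rank) user block against an (n x rank) item block, giving m x n scores.
    * A real implementation would call a BLAS gemm routine here instead. */
  def scoreBlock(users: Array[Array[Double]],
                 items: Array[Array[Double]]): Array[Array[Double]] =
    users.map(u => items.map(v => dot(u, v)))

  /** For each row, returns the k best (itemIndex, score) pairs, best first.
    * A full sort is used for brevity; quickselect would avoid sorting the whole row. */
  def topKPerRow(scores: Array[Array[Double]], k: Int): Array[Array[(Int, Double)]] =
    scores.map { row =>
      row.zipWithIndex.sortBy(-_._1).take(k).map { case (s, i) => (i, s) }
    }
}
```

The block size controls the trade-off: larger blocks amortize the product better but materialize a larger intermediate score matrix, which is presumably why several block sizes are compared in the comment above.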

[GitHub] spark issue #17919: [SPARK-20677][MLLIB][ML] Follow-up to ALS recommend-all ...

2017-05-09 Thread mpjlu
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/17919 Thanks, I am ok for this change.
