[jira] [Commented] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986259#comment-15986259 ]

Daniel Li commented on SPARK-20468:
-----------------------------------

Great, thanks Sean. How should I go forward with the documentation? Detail proposed documentation in the ticket before starting a PR, or just go ahead and create a PR?

> Refactor the ALS code
> ---------------------
>
>                 Key: SPARK-20468
>                 URL: https://issues.apache.org/jira/browse/SPARK-20468
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Daniel Li
>            Priority: Minor
>              Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all in one file. Here are some things I think could improve the clarity and maintainability of the code:
> * The file can be split into more manageable parts. In particular, {{ALS}}, {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files for better readability.
> * Certain parts can be encapsulated or moved to clarify the intent. For example:
> ** The {{ALS.train}} method is currently defined in the {{ALS}} companion object, and it takes 12 individual parameters that are all members of the {{ALS}} class. This method can be made an instance method.
> ** The code that creates in-blocks and out-blocks in the body of {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods in the {{ALS}} companion object, can be moved into a separate case class that holds the blocks. This has the added benefit of allowing us to write specific Scaladoc to explain the logic behind these block objects, as their usage is certainly nontrivial yet is fundamental to the implementation.
> ** The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> ** Various builder classes could be encapsulated in the companion objects of the classes they build.
> * The code can be formatted more clearly. For example:
> ** Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be formatted more clearly and have comments added explaining the reasoning behind key parts. That these methods form the core of the ALS logic makes this doubly important for maintainability.
> ** Parts of the code that use {{while}} loops with manually incremented counters can be rewritten as {{for}} loops.
> ** Where non-idiomatic Scala code is used that doesn't improve performance much, clearer code can be substituted. (This in particular should be done very carefully if at all, as it's apparent the original author spent much time and pains optimizing the code to significantly improve its runtime profile.)
> * The documentation (both Scaladocs and inline comments) can be clarified where needed and expanded where incomplete. This is especially important for parts of the code that are written imperatively for performance, as these parts don't benefit from the intuitive self-documentation of Scala's higher-level language abstractions. Specifically, I'd like to add documentation fully explaining the key functionality of the in-block and out-block objects, their purpose, how they relate to the overall ALS algorithm, and how they are calculated in such a way that new maintainers can ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level survey. All of these improvements will, I believe, add up to make the code easier to understand, extend, and maintain. This issue will track the progress of this refactoring so that going forward, authors will have an easier time maintaining this part of the project.
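The while-to-for rewrite mentioned in the description can be sketched as follows. This is a made-up snippet for illustration, not code taken from ALS.scala; the real loops there operate over compressed block arrays.

```scala
object LoopStyle {
  // Imperative style common in the current hot paths: a while loop with a
  // manually incremented counter.
  def sumWhile(xs: Array[Double]): Double = {
    var i = 0
    var acc = 0.0
    while (i < xs.length) {
      acc += xs(i)
      i += 1
    }
    acc
  }

  // Equivalent, more idiomatic Scala. Clearer, though potentially slower in
  // hot loops, which is why such rewrites should be benchmarked first.
  def sumFor(xs: Array[Double]): Double = {
    var acc = 0.0
    for (x <- xs) acc += x
    acc
  }
}
```

As the ticket notes, this substitution is only appropriate where it doesn't measurably hurt the runtime profile.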
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Li closed SPARK-20468.
-----------------------------
    Resolution: Won't Fix
[jira] [Commented] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985888#comment-15985888 ]

Daniel Li commented on SPARK-20468:
-----------------------------------

Thanks for the guidance, [~srowen]. I've read the Contributing guide now; my apologies for not reading it through sooner. I've closed the PR and created a few new Jira issues to propose documentation additions and internal code clarity improvements: SPARK-20484, SPARK-20485, and SPARK-20486. How do you want to proceed?
[jira] [Updated] (SPARK-20486) Encapsulate ALS in-block and out-block data structures and methods into a separate class
[ https://issues.apache.org/jira/browse/SPARK-20486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Li updated SPARK-20486:
------------------------------
    Description: 
The in-block and out-block data structures in the ALS code are currently calculated within the {{ALS.train}} method itself. I propose to move this code, along with its helper functions, into a separate class to encapsulate the creation of the blocks. This has the added benefit of allowing us to add comprehensive Scaladoc to this new class explaining in detail how this core part of the algorithm works.

Proposal:

{code}
private[recommendation] final case class RatingBlocks[ID](
    userIn: RDD[(Int, InBlock[ID])],
    userOut: RDD[(Int, OutBlock)],
    itemIn: RDD[(Int, InBlock[ID])],
    itemOut: RDD[(Int, OutBlock)])

private[recommendation] object RatingBlocks {

  def create[ID: ClassTag: Ordering](
      ratings: RDD[Rating[ID]],
      numUserBlocks: Int,
      numItemBlocks: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {
    val userPart = new ALSPartitioner(numUserBlocks)
    val itemPart = new ALSPartitioner(numItemBlocks)
    val blockRatings =
      partitionRatings(ratings, userPart, itemPart).persist(storageLevel)
    val (userInBlocks, userOutBlocks) =
      makeBlocks("user", blockRatings, userPart, itemPart, storageLevel)
    userOutBlocks.count()  // materialize `blockRatings` and user blocks
    val swappedBlockRatings = blockRatings.map {
      case ((userBlockId, itemBlockId), RatingBlock(userIds, itemIds, localRatings)) =>
        ((itemBlockId, userBlockId), RatingBlock(itemIds, userIds, localRatings))
    }
    val (itemInBlocks, itemOutBlocks) =
      makeBlocks("item", swappedBlockRatings, itemPart, userPart, storageLevel)
    itemOutBlocks.count()  // materialize item blocks
    blockRatings.unpersist()
    new RatingBlocks(userInBlocks, userOutBlocks, itemInBlocks, itemOutBlocks)
  }

  private[this] def partitionRatings[ID: ClassTag](...) = {
    // existing code goes here verbatim
  }

  private[this] def makeBlocks[ID: ClassTag](...) = {
    // existing code goes here verbatim
  }
}
{code}

  was:
The in-block and out-block data structures in the ALS code are currently calculated within the {{ALS.train}} method itself. I propose to move this code, along with its helper functions, into a separate class to encapsulate the creation of the blocks. This has the added benefit of allowing us to add comprehensive Scaladoc to this new class explaining in detail how this core part of the algorithm works.

Proposal:

{code}
private[recommendation] final case class RatingBlocks[ID](
    userIn: RDD[(Int, InBlock[ID])],
    userOut: RDD[(Int, OutBlock)],
    itemIn: RDD[(Int, InBlock[ID])],
    itemOut: RDD[(Int, OutBlock)])

private[recommendation] object RatingBlocks {

  def create[ID: ClassTag: Ordering](
      ratings: RDD[Rating[ID]],
      numUserBlocks: Int,
      numItemBlocks: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {
    // In-block and out-block code currently in `ALS.train` goes here
  }

  private[this] def partitionRatings[ID: ClassTag](...) = { ... }

  private[this] def makeBlocks[ID: ClassTag](...) = { ... }
}
{code}
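The encapsulation this ticket proposes follows a standard Scala pattern: a case class whose instances are built only through a factory method on its companion object, so the construction logic (and its Scaladoc) lives in one place. A minimal, self-contained sketch of the pattern, with plain collections standing in for the RDD-based blocks (the names and the trivial modulo "blocking" here are invented for illustration):

```scala
// Simplified stand-in for the proposed RatingBlocks: plain Maps replace the
// RDDs of in-/out-blocks, and "blocking" is just grouping IDs by partition.
final case class SimpleBlocks(
    userBlocks: Map[Int, Seq[Int]],
    itemBlocks: Map[Int, Seq[Int]])

object SimpleBlocks {
  // Factory mirroring the proposed RatingBlocks.create: callers never build
  // the blocks by hand, so the grouping logic is documented in one place.
  def create(ratings: Seq[(Int, Int)], numUserBlocks: Int, numItemBlocks: Int): SimpleBlocks = {
    val users = ratings.map(_._1).distinct
    val items = ratings.map(_._2).distinct
    SimpleBlocks(
      users.groupBy(u => u % numUserBlocks),
      items.groupBy(i => i % numItemBlocks))
  }
}
```

In the real proposal the factory would additionally handle persistence and materialization of the intermediate rating blocks, as shown in the ticket description.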
[jira] [Created] (SPARK-20486) Encapsulate ALS in-block and out-block data structures and methods into a separate class
Daniel Li created SPARK-20486:
---------------------------------

             Summary: Encapsulate ALS in-block and out-block data structures and methods into a separate class
                 Key: SPARK-20486
                 URL: https://issues.apache.org/jira/browse/SPARK-20486
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 2.1.0
            Reporter: Daniel Li
            Priority: Trivial
[jira] [Created] (SPARK-20485) Split ALS.scala into multiple files
Daniel Li created SPARK-20485:
---------------------------------

             Summary: Split ALS.scala into multiple files
                 Key: SPARK-20485
                 URL: https://issues.apache.org/jira/browse/SPARK-20485
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 2.1.0
            Reporter: Daniel Li
            Priority: Trivial

The file {{/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala}} currently contains four top-level classes that span 1,500+ lines. To make the file easier to navigate, and to bring it in line with the [Scala style guide|http://docs.scala-lang.org/style/files.html], I propose we split it into four files holding their respective classes: {{ALS.scala}}, {{ALSParams.scala}}, {{ALSModel.scala}}, and {{ALSModelParams.scala}}.
[jira] [Created] (SPARK-20484) Add documentation to ALS code
Daniel Li created SPARK-20484:
---------------------------------

             Summary: Add documentation to ALS code
                 Key: SPARK-20484
                 URL: https://issues.apache.org/jira/browse/SPARK-20484
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 2.1.0
            Reporter: Daniel Li
            Priority: Trivial

The documentation (both Scaladocs and inline comments) for the ALS code (in package {{org.apache.spark.ml.recommendation}}) can be clarified where needed and expanded where incomplete. This is especially important for parts of the code that are written imperatively for performance, as these parts don't benefit from the intuitive self-documentation of Scala's higher-level language abstractions. Specifically, I'd like to add documentation fully explaining the key functionality of the in-block and out-block objects, their purpose, how they relate to the overall ALS algorithm, and how they are calculated in such a way that new maintainers can ramp up much more quickly.
[jira] [Commented] (SPARK-20468) Refactor the ALS code
[ https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984350#comment-15984350 ]

Daniel Li commented on SPARK-20468:
-----------------------------------

I'd appreciate it if someone with permissions could assign this issue to me.
[jira] [Created] (SPARK-20468) Refactor the ALS code
Daniel Li created SPARK-20468:
---------------------------------

             Summary: Refactor the ALS code
                 Key: SPARK-20468
                 URL: https://issues.apache.org/jira/browse/SPARK-20468
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 2.1.0
            Reporter: Daniel Li
            Priority: Minor
[jira] [Comment Edited] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896177#comment-15896177 ]

Daniel Li edited comment on SPARK-6407 at 3/5/17 10:57 AM:
-----------------------------------------------------------

{quote}
In practice fold-in works fine. Folding in a day or so of updates has been OK. The question isn't RMSE but how it affects actual rankings of items in recommendations, and it takes a while before the effect of the approximation actually changes a rank.
{quote}

Hmm, I see. This would be something I'd be interested in implementing for Spark if there's need. Are there implementations (or papers) of this you know of that I could look at?

> Streaming ALS for Collaborative Filtering
> -----------------------------------------
>
>                 Key: SPARK-6407
>                 URL: https://issues.apache.org/jira/browse/SPARK-6407
>             Project: Spark
>          Issue Type: New Feature
>          Components: DStreams
>            Reporter: Felix Cheung
>            Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply gradient updates to batches of data and reuse existing MLLib implementation?
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896033#comment-15896033 ]

Daniel Li commented on SPARK-6407:
----------------------------------

Appreciate the quick reply, [~srowen]. Yeah, we'd be recomputing them, but not from scratch, since we'd be starting with optimized _U_ and _V_. It would likely take only one or two iterations before reconvergence. Would this still be considered too expensive?

The thing I hesitate about regarding fold-in updating is that the assumption that only the corresponding user row and item row will change may be too simplifying (since, of course, there's a "rippling out" effect: all items the user rated previously need to be updated, then all users that rated any of those items would need updating, etc.). Then again, even if we take this rippling into account, the computation may not be too expensive, since a single update likely won't affect the RMSE enough to delay convergence. (Though I haven't worked out the math showing this; it's just a hunch.) Do you have any insights into this?
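The fold-in update discussed in this thread can be made concrete. Holding the item factors fixed, folding in one user means solving the regularized normal equations (VᵀV + λI)u = Vᵀr restricted to the items that user rated. The snippet below is an illustrative toy implementation, not Spark code; {{FoldIn}}, {{solve}}, and {{foldInUser}} are invented names.

```scala
object FoldIn {
  // Naive Gaussian elimination with partial pivoting; adequate for the tiny
  // rank-k systems that arise per user.
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    val m = Array.tabulate(n, n + 1)((i, j) => if (j < n) a(i)(j) else b(i))
    for (col <- 0 until n) {
      val p = (col until n).maxBy(r => math.abs(m(r)(col)))  // partial pivot
      val tmp = m(col); m(col) = m(p); m(p) = tmp
      for (row <- col + 1 until n) {
        val f = m(row)(col) / m(col)(col)
        for (j <- col to n) m(row)(j) -= f * m(col)(j)
      }
    }
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1)
      x(i) = (m(i)(n) - (i + 1 until n).map(j => m(i)(j) * x(j)).sum) / m(i)(i)
    x
  }

  // Fold in a single user: ratedItemFactors holds the rows of V for the items
  // this user rated; ratings holds the corresponding observed ratings.
  def foldInUser(
      ratedItemFactors: Array[Array[Double]],
      ratings: Array[Double],
      lambda: Double): Array[Double] = {
    val k = ratedItemFactors(0).length
    // Normal equations: A = V^T V + lambda*I, b = V^T r.
    val ata = Array.tabulate(k, k) { (i, j) =>
      ratedItemFactors.map(v => v(i) * v(j)).sum + (if (i == j) lambda else 0.0)
    }
    val atb = Array.tabulate(k) { i =>
      ratedItemFactors.zip(ratings).map { case (v, r) => v(i) * r }.sum
    }
    solve(ata, atb)
  }
}
```

The "rippling out" concern above is exactly what this sketch ignores: it updates one user's row and leaves V untouched, which is why fold-in is an approximation rather than a re-optimization.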
[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering
[ https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895589#comment-15895589 ] Daniel Li commented on SPARK-6407:
--
Reviving this thread since I'm interested in implementing streaming CF for Spark.

bq. Using ALS for online updates is expensive.

Recomputing the factor matrices _U_ and _V_ from scratch for every update would be terribly expensive, but what about keeping _U_ and _V_ around and simply recomputing for another round or two after each new rating comes in? The algorithm would simply be continually following a moving optimum. I can't imagine the RMSE changing much due to small updates if we use a convergence threshold _à la_ [Y. Zhou, et al., “Large-Scale Parallel Collaborative Filtering for the Netflix Prize”|http://dl.acm.org/citation.cfm?id=1424269] instead of a fixed number of iterations. (In fact, since calculating _(U^T) * V_ would probably take a nontrivial slice of time, new updates that come in during a round of calculation could be "batched" into the next round of calculation, increasing efficiency.) Thoughts?
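The "batching" parenthetical above amounts to a queue in front of the solver: ratings that arrive while a factorization round is in flight are buffered and drained into the next round. A minimal illustrative skeleton (this is my sketch of the proposal, not Spark's API):

```python
from collections import deque

class StreamingALSDriver:
    """Illustrative skeleton only: queue incoming ratings between ALS rounds."""

    def __init__(self):
        self.pending = deque()   # ratings that arrived mid-round
        self.ratings = {}        # (user, item) -> rating currently modeled

    def on_rating(self, user, item, value):
        # Called by the ingestion stream at any time; O(1), never blocks a round.
        self.pending.append((user, item, value))

    def next_round(self):
        # Drain everything that arrived since the last round, then run one
        # (or a few) ALS iterations over the updated rating set.
        while self.pending:
            user, item, value = self.pending.popleft()
            self.ratings[(user, item)] = value
        # ... run ALS iteration(s) over self.ratings here ...
        return len(self.ratings)

driver = StreamingALSDriver()
driver.on_rating(1, 10, 4.0)
driver.on_rating(1, 11, 3.0)
driver.on_rating(1, 10, 5.0)   # an update to an earlier rating
n = driver.next_round()        # all three events folded into one round
```

Note that later events for the same (user, item) pair naturally supersede earlier ones before the solver sees them, which is part of the claimed efficiency gain.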
[jira] [Commented] (SPARK-9279) Spark Master Refuses to Bind WebUI to a Privileged Port
[ https://issues.apache.org/jira/browse/SPARK-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649570#comment-14649570 ] Daniel Li commented on SPARK-9279:
--
I echo Omar's point: I understand that my particular use goes against normal best practices, but I don't think it's a developer's place to enforce these kinds of constraints on users just because it makes sense to them. It seems to limit Spark's usability for no gain elsewhere.

I have a case where users can access services running in a data center only through a small number of ports between 1 and 1024. Hard-coding the Spark WebUI port range essentially stopped users from accessing the Spark WebUI in the data center. I submitted a pull request above (https://github.com/apache/spark/pull/7835). Sean, do you have major concerns with configuring the Spark WebUI to use ports between 1 and 1024?

Spark Master Refuses to Bind WebUI to a Privileged Port
---
                Key: SPARK-9279
                URL: https://issues.apache.org/jira/browse/SPARK-9279
            Project: Spark
         Issue Type: Improvement
         Components: Web UI
   Affects Versions: 1.4.1
        Environment: Ubuntu Trusty running in a docker container
           Reporter: Omar Padron
           Priority: Minor

When trying to start a Spark master server as root...

{code}
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=80

spark-class org.apache.spark.deploy.master.Master \
    --host $( hostname ) \
    --port $SPARK_MASTER_PORT \
    --webui-port $SPARK_MASTER_WEBUI_PORT
{code}

The process terminates with an IllegalArgumentException: "requirement failed: startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port." But when SPARK_MASTER_WEBUI_PORT=8080 (or anything above 1024), the process runs fine. I do not understand why the usable ports have been arbitrarily restricted to the non-privileged range. Users choosing to run Spark as root should be allowed to choose their own ports.
Full output from a sample run below:

{code}
2015-07-23 14:36:50,892 INFO  [main] master.Master (SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, HUP, INT]
2015-07-23 14:36:51,399 WARN  [main] util.NativeCodeLoader (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-07-23 14:36:51,586 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: root
2015-07-23 14:36:51,587 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: root
2015-07-23 14:36:51,588 INFO  [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-07-23 14:36:52,295 INFO  [sparkMaster-akka.actor.default-dispatcher-2] slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
2015-07-23 14:36:52,349 INFO  [sparkMaster-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
2015-07-23 14:36:52,489 INFO  [sparkMaster-akka.actor.default-dispatcher-2] Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening on addresses :[akka.tcp://sparkMaster@sparkmaster:7077]
2015-07-23 14:36:52,497 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'sparkMaster' on port 7077.
2015-07-23 14:36:52,717 INFO  [sparkMaster-akka.actor.default-dispatcher-4] server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SelectChannelConnector@sparkmaster:6066
2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] util.Utils (Logging.scala:logInfo(59)) - Successfully started service on port 6066.
2015-07-23 14:36:52,760 INFO  [sparkMaster-akka.actor.default-dispatcher-4] rest.StandaloneRestServer (Logging.scala:logInfo(59)) - Started REST server for submitting applications on port 6066
2015-07-23 14:36:52,765 INFO  [sparkMaster-akka.actor.default-dispatcher-4] master.Master (Logging.scala:logInfo(59)) - Starting Spark master at spark://sparkmaster:7077
2015-07-23 14:36:52,766 INFO  [sparkMaster-akka.actor.default-dispatcher-4] master.Master (Logging.scala:logInfo(59)) - Running Spark version 1.4.1
2015-07-23 14:36:52,772 ERROR [sparkMaster-akka.actor.default-dispatcher-4] ui.MasterWebUI (Logging.scala:logError(96)) - Failed to bind MasterWebUI
java.lang.IllegalArgumentException: requirement failed: startPort should be between 1024 and 65535 (inclusive), or 0 for a random free port.
	at
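The IllegalArgumentException in the log comes from a range check of this shape (the actual check lives in Spark's Scala utility code; this Python sketch only mirrors the logic, and the `allow_privileged` relaxation is the hypothetical change being debated in this ticket):

```python
def check_start_port(port, allow_privileged=False):
    """Reject ports outside the allowed range, as the reported check does.

    Port 0 is always accepted and means "pick a random free port".
    allow_privileged is a hypothetical flag sketching the relaxation
    requested here: let root-run masters bind ports 1-1023.
    """
    low = 1 if allow_privileged else 1024
    if port != 0 and not (low <= port <= 65535):
        raise ValueError(
            "requirement failed: startPort should be between %d and 65535 "
            "(inclusive), or 0 for a random free port." % low)
    return port

check_start_port(8080)                        # accepted today
check_start_port(80, allow_privileged=True)   # accepted with the relaxation
try:
    check_start_port(80)                      # reproduces the reported failure
    failed = False
except ValueError:
    failed = True
```

The design question is exactly where `low` should come from: a hard-coded 1024 (current behavior) versus something configurable for deployments that, like the one described above, can only expose privileged ports.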