[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-27 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986259#comment-15986259
 ] 

Daniel Li commented on SPARK-20468:
---

Great, thanks Sean.  Regarding the documentation, how should I proceed: detail 
the proposed documentation in the ticket before starting a PR, or just go ahead 
and create one?

> Refactor the ALS code
> -
>
> Key: SPARK-20468
> URL: https://issues.apache.org/jira/browse/SPARK-20468
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Daniel Li
>Priority: Minor
>  Labels: documentation, readability, refactoring
>
> The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
> quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all 
> in one file.  Here are some things I think could improve the clarity and 
> maintainability of the code:
> *  The file can be split into more manageable parts.  In particular, {{ALS}}, 
> {{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
> for better readability.
> *  Certain parts can be encapsulated or moved to clarify the intent.  For 
> example:
> **  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
> object, and it seems to take 12 individual parameters that are all members of 
> the {{ALS}} class.  This method can be made an instance method.
> **  The code that creates in-blocks and out-blocks in the body of 
> {{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
> in the {{ALS}} companion object, can be moved into a separate case class that 
> holds the blocks.  This has the added benefit of allowing us to write 
> specific Scaladoc to explain the logic behind these block objects, as their 
> usage is certainly nontrivial yet is fundamental to the implementation.
> **  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be 
> hidden within {{UncompressedInBlock}} to clarify the scope of their usage.
> **  Various builder classes could be encapsulated in the companion objects of 
> the classes they build.
> *  The code can be formatted more clearly.  For example:
> **  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
> formatted more clearly and have comments added explaining the reasoning 
> behind key parts.  That these methods form the core of the ALS logic makes 
> this doubly important for maintainability.
> **  Parts of the code that use {{while}} loops with manually incremented 
> counters can be rewritten as {{for}} loops.
> **  Where non-idiomatic Scala code is used that doesn't improve performance 
> much, clearer code can be substituted.  (This in particular should be done 
> very carefully if at all, as it's apparent the original author spent much 
> time and pains optimizing the code to significantly improve its runtime 
> profile.)
> *  The documentation (both Scaladocs and inline comments) can be clarified 
> where needed and expanded where incomplete.  This is especially important for 
> parts of the code that are written imperatively for performance, as these 
> parts don't benefit from the intuitive self-documentation of Scala's 
> higher-level language abstractions.  Specifically, I'd like to add 
> documentation fully explaining the key functionality of the in-block and 
> out-block objects, their purpose, how they relate to the overall ALS 
> algorithm, and how they are calculated in such a way that new maintainers can 
> ramp up much more quickly.
> The above is not a complete enumeration of improvements but a high-level 
> survey.  All of these improvements will, I believe, add up to make the code 
> easier to understand, extend, and maintain.  This issue will track the 
> progress of this refactoring so that going forward, authors will have an 
> easier time maintaining this part of the project.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20468) Refactor the ALS code

2017-04-27 Thread Daniel Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Li closed SPARK-20468.
-
Resolution: Won't Fix







[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15985888#comment-15985888
 ] 

Daniel Li commented on SPARK-20468:
---

Thanks for the guidance, [~srowen].  I've read the Contributing guide now; my 
apologies for not reading it through sooner.

I've closed the PR and created a few new Jira issues to propose documentation 
additions and internal code clarity improvements: SPARK-20484, SPARK-20485, and 
SPARK-20486.  How do you want to proceed?







[jira] [Updated] (SPARK-20486) Encapsulate ALS in-block and out-block data structures and methods into a separate class

2017-04-26 Thread Daniel Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Li updated SPARK-20486:
--
Description: 
The in-block and out-block data structures in the ALS code are currently 
constructed within the {{ALS.train}} method itself.  I propose moving this 
code, along with its helper functions, into a separate class that encapsulates 
the creation of the blocks.  This has the added benefit of letting us attach 
comprehensive Scaladoc to the new class explaining in detail how this core 
part of the algorithm works.

Proposal:

{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
      ratings: RDD[Rating[ID]],
      numUserBlocks: Int,
      numItemBlocks: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {

    val userPart = new ALSPartitioner(numUserBlocks)
    val itemPart = new ALSPartitioner(numItemBlocks)

    val blockRatings =
      partitionRatings(ratings, userPart, itemPart)
        .persist(storageLevel)

    val (userInBlocks, userOutBlocks) =
      makeBlocks("user", blockRatings, userPart, itemPart, storageLevel)

    userOutBlocks.count()   // materialize `blockRatings` and user blocks

    val swappedBlockRatings = blockRatings.map {
      case ((userBlockId, itemBlockId), RatingBlock(userIds, itemIds, localRatings)) =>
        ((itemBlockId, userBlockId), RatingBlock(itemIds, userIds, localRatings))
    }

    val (itemInBlocks, itemOutBlocks) =
      makeBlocks("item", swappedBlockRatings, itemPart, userPart, storageLevel)

    itemOutBlocks.count()   // materialize item blocks
    blockRatings.unpersist()
    new RatingBlocks(userInBlocks, userOutBlocks, itemInBlocks, itemOutBlocks)
  }

  private[this] def partitionRatings[ID: ClassTag](...) = {
    // existing code goes here verbatim
  }

  private[this] def makeBlocks[ID: ClassTag](...) = {
    // existing code goes here verbatim
  }
}
{code}

  was:
The in-block and out-block data structures in the ALS code are currently 
constructed within the {{ALS.train}} method itself.  I propose moving this 
code, along with its helper functions, into a separate class that encapsulates 
the creation of the blocks.  This has the added benefit of letting us attach 
comprehensive Scaladoc to the new class explaining in detail how this core 
part of the algorithm works.

Proposal:

{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
      ratings: RDD[Rating[ID]],
      numUserBlocks: Int,
      numItemBlocks: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {
    // In-block and out-block code currently in `ALS.train` goes here
  }

  private[this] def partitionRatings[ID: ClassTag](...) = { ... }

  private[this] def makeBlocks[ID: ClassTag](...) = { ... }
}
{code}



[jira] [Created] (SPARK-20486) Encapsulate ALS in-block and out-block data structures and methods into a separate class

2017-04-26 Thread Daniel Li (JIRA)
Daniel Li created SPARK-20486:
-

 Summary: Encapsulate ALS in-block and out-block data structures 
and methods into a separate class
 Key: SPARK-20486
 URL: https://issues.apache.org/jira/browse/SPARK-20486
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Daniel Li
Priority: Trivial


The in-block and out-block data structures in the ALS code are currently 
constructed within the {{ALS.train}} method itself.  I propose moving this 
code, along with its helper functions, into a separate class that encapsulates 
the creation of the blocks.  This has the added benefit of letting us attach 
comprehensive Scaladoc to the new class explaining in detail how this core 
part of the algorithm works.

Proposal:

{code}
private[recommendation] final case class RatingBlocks[ID](
  userIn: RDD[(Int, InBlock[ID])],
  userOut: RDD[(Int, OutBlock)],
  itemIn: RDD[(Int, InBlock[ID])],
  itemOut: RDD[(Int, OutBlock)]
)

private[recommendation] object RatingBlocks {
  def create[ID: ClassTag: Ordering](
      ratings: RDD[Rating[ID]],
      numUserBlocks: Int,
      numItemBlocks: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK): RatingBlocks[ID] = {
    // In-block and out-block code currently in `ALS.train` goes here
  }

  private[this] def partitionRatings[ID: ClassTag](...) = { ... }

  private[this] def makeBlocks[ID: ClassTag](...) = { ... }
}
{code}
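As a toy illustration (plain Scala, no Spark; names and data are invented for the example) of why grouping the four RDDs in a case class beats returning a bare tuple: the call site gets named fields and clean destructuring.

```scala
// Toy sketch (no Spark): a case class grouping four related collections,
// mirroring the shape of the RatingBlocks proposal above.
final case class Blocks(
  userIn: Vector[Int],
  userOut: Vector[Int],
  itemIn: Vector[Int],
  itemOut: Vector[Int]
)

object Blocks {
  // Stand-in for RatingBlocks.create: builds all four pieces in one place.
  def create(n: Int): Blocks =
    Blocks(
      Vector.tabulate(n)(identity),   // stand-in for user in-blocks
      Vector.tabulate(n)(_ * 2),      // stand-in for user out-blocks
      Vector.tabulate(n)(_ + 1),      // stand-in for item in-blocks
      Vector.tabulate(n)(_ * 3)       // stand-in for item out-blocks
    )
}
```

A caller can then write {{val Blocks(uIn, uOut, iIn, iOut) = Blocks.create(n)}} rather than reaching for {{._1}} through {{._4}} on an anonymous tuple.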



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20485) Split ALS.scala into multiple files

2017-04-26 Thread Daniel Li (JIRA)
Daniel Li created SPARK-20485:
-

 Summary: Split ALS.scala into multiple files
 Key: SPARK-20485
 URL: https://issues.apache.org/jira/browse/SPARK-20485
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Daniel Li
Priority: Trivial


The file {{/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala}} 
currently contains four top-level classes that span 1,500+ lines.  To make the 
file easier to navigate, and to bring it in line with the [Scala style 
guide|http://docs.scala-lang.org/style/files.html], I propose we split it into 
four files holding their respective classes: {{ALS.scala}}, 
{{ALSParams.scala}}, {{ALSModel.scala}}, and {{ALSModelParams.scala}}.
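The resulting layout might look like the following (a sketch only; the split simply follows the class list above):

```
mllib/src/main/scala/org/apache/spark/ml/recommendation/
  ALS.scala             (class ALS and its companion object)
  ALSParams.scala       (trait ALSParams)
  ALSModel.scala        (class ALSModel and its companion object)
  ALSModelParams.scala  (trait ALSModelParams)
```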






[jira] [Created] (SPARK-20484) Add documentation to ALS code

2017-04-26 Thread Daniel Li (JIRA)
Daniel Li created SPARK-20484:
-

 Summary: Add documentation to ALS code
 Key: SPARK-20484
 URL: https://issues.apache.org/jira/browse/SPARK-20484
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Daniel Li
Priority: Trivial


The documentation (both Scaladocs and inline comments) for the ALS code (in 
package {{org.apache.spark.ml.recommendation}}) can be clarified where needed 
and expanded where incomplete. This is especially important for parts of the 
code that are written imperatively for performance, as these parts don't 
benefit from the intuitive self-documentation of Scala's higher-level language 
abstractions. Specifically, I'd like to add documentation fully explaining the 
key functionality of the in-block and out-block objects, their purpose, how 
they relate to the overall ALS algorithm, and how they are calculated in such a 
way that new maintainers can ramp up much more quickly.
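As an illustration of the level of detail intended, here is a hypothetical Scaladoc block (the wording is mine and illustrative only, not the actual ALS documentation):

```scala
/**
 * Hypothetical example of the proposed Scaladoc style for the block data
 * structures. (Illustrative wording; not taken from the ALS source.)
 *
 * In-blocks and out-blocks exist to minimize shuffling during the
 * alternating solves: an in-block groups, per partition, the ratings needed
 * to compute that partition's factors, while an out-block records which
 * partitions each computed factor must be sent to for the next half-step.
 */
object BlockDocExample {
  /** Returns the number of block kinds documented above (in and out). */
  def blockKinds: Int = 2
}
```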






[jira] [Commented] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984350#comment-15984350
 ] 

Daniel Li commented on SPARK-20468:
---

I'd appreciate it if someone with permissions could assign this issue to me.







[jira] [Created] (SPARK-20468) Refactor the ALS code

2017-04-26 Thread Daniel Li (JIRA)
Daniel Li created SPARK-20468:
-

 Summary: Refactor the ALS code
 Key: SPARK-20468
 URL: https://issues.apache.org/jira/browse/SPARK-20468
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Daniel Li
Priority: Minor


The current ALS implementation ({{org.apache.spark.ml.recommendation}}) is 
quite the beast --- 21 classes, traits, and objects across 1,500+ lines, all in 
one file.  Here are some things I think could improve the clarity and 
maintainability of the code:

*  The file can be split into more manageable parts.  In particular, {{ALS}}, 
{{ALSParams}}, {{ALSModel}}, and {{ALSModelParams}} can be in separate files 
for better readability.

*  Certain parts can be encapsulated or moved to clarify the intent.  For 
example:
**  The {{ALS.train}} method is currently defined in the {{ALS}} companion 
object, and it seems to take 12 individual parameters that are all members of 
the {{ALS}} class.  This method can be made an instance method.
**  The code that creates in-blocks and out-blocks in the body of 
{{ALS.train}}, along with the {{partitionRatings}} and {{makeBlocks}} methods 
in the {{ALS}} companion object, can be moved into a separate case class that 
holds the blocks.  This has the added benefit of allowing us to write specific 
Scaladoc to explain the logic behind these block objects, as their usage is 
certainly nontrivial yet is fundamental to the implementation.
**  The {{KeyWrapper}} and {{UncompressedInBlockSort}} classes could be hidden 
within {{UncompressedInBlock}} to clarify the scope of their usage.
**  Various builder classes could be encapsulated in the companion objects of 
the classes they build.

*  The code can be formatted more clearly.  For example:
**  Certain methods such as {{ALS.train}} and {{ALS.makeBlocks}} can be 
formatted more clearly and have comments added explaining the reasoning behind 
key parts.  That these methods form the core of the ALS logic makes this doubly 
important for maintainability.
**  Parts of the code that use {{while}} loops with manually incremented 
counters can be rewritten as {{for}} loops.
**  Where non-idiomatic Scala code is used that doesn't improve performance 
much, clearer code can be substituted.  (This in particular should be done very 
carefully if at all, as it's apparent the original author spent much time and 
pains optimizing the code to significantly improve its runtime profile.)

*  The documentation (both Scaladocs and inline comments) can be clarified 
where needed and expanded where incomplete.  This is especially important for 
parts of the code that are written imperatively for performance, as these parts 
don't benefit from the intuitive self-documentation of Scala's higher-level 
language abstractions.  Specifically, I'd like to add documentation fully 
explaining the key functionality of the in-block and out-block objects, their 
purpose, how they relate to the overall ALS algorithm, and how they are 
calculated in such a way that new maintainers can ramp up much more quickly.

The above is not a complete enumeration of improvements but a high-level 
survey.  All of these improvements will, I believe, add up to make the code 
easier to understand, extend, and maintain.  This issue will track the progress 
of this refactoring so that going forward, authors will have an easier time 
maintaining this part of the project.
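The parameter-encapsulation point about {{ALS.train}} can be sketched as follows (hypothetical names and a stand-in computation; the real method reads a dozen configuration values):

```scala
// Before (shape only): a companion-object method
//   train(ratings, rank, maxIter, regParam, ...)
// where every argument mirrors a member already set on the ALS instance.
// After: an instance method that reads its own fields directly.
class AlsSketch(val rank: Int, val maxIter: Int, val regParam: Double) {
  // Stand-in computation; the real train() runs the blocked ALS iterations.
  def train(ratings: Seq[Double]): Double =
    ratings.map(_ * regParam).sum / rank
}
```

The call site shrinks from a twelve-argument invocation to {{new AlsSketch(...).train(ratings)}}, and adding a parameter no longer changes the method signature.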






[jira] [Comment Edited] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896177#comment-15896177
 ] 

Daniel Li edited comment on SPARK-6407 at 3/5/17 10:57 AM:
---

{quote}
In practice fold-in works fine. Folding in a day or so of updates has been OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank.
{quote}

Hmm, I see.  This would be something I'd be interested in implementing for 
Spark if there's need.  Are there implementations (or papers) of this you know 
of that I could look at?


was (Author: danielyli):
bq. In practice fold-in works fine. Folding in a day or so of updates has been 
OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank.

Hmm, I see.  This would be something I'd be interested in implementing for 
Spark if there's need.  Are there implementations (or papers) of this you know 
of that I could look at?

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?






[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-05 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896177#comment-15896177
 ] 

Daniel Li commented on SPARK-6407:
--

bq. In practice fold-in works fine. Folding in a day or so of updates has been 
OK.
The question isn't RMSE but how it affects actual rankings of items in 
recommendations, and it takes a while before the effect of the approximation 
actually changes a rank.

Hmm, I see.  This would be something I'd be interested in implementing for 
Spark if there's need.  Are there implementations (or papers) of this you know 
of that I could look at?




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-04 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15896033#comment-15896033
 ] 

Daniel Li commented on SPARK-6407:
--

Appreciate the quick reply, [~srowen].

Yeah, we'd be recomputing them, but not from scratch since we'd be starting 
with optimized _U_ and _V_.  It would likely take only one or two iterations 
before reconvergence.  Would this still be considered too expensive?

The thing I hesitate about regarding fold-in updating is that the assumption 
that only the corresponding user row and item row will change may be too 
simplifying (since, of course, there's a "rippling out" effect: all items the 
user rated previously need to be updated, then all users that rated any of those 
items would need updating, etc.).  Then again, even if we take this rippling 
into account the computation may not be too expensive, since a single update 
likely won't affect the RMSE enough to delay convergence.  (Though I haven't 
worked out the math showing this; it's just a hunch.)

Do you have any insights into this?
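To make the fold-in-versus-ripple question concrete, here is a rank-1, pure-Python sketch (all names are illustrative; real ALS uses rank-k factors and a regularized k-by-k least-squares solve per row, so this is only an aid to the discussion, not Spark's implementation):

```python
# ratings: dict (user, item) -> rating.  u, v: dicts of rank-1 factors.

def solve_user(ratings, u, v, i, lam=0.01):
    # Closed-form rank-1 update of one user factor, item factors held fixed:
    # u_i = sum_j v_j * r_ij / (lam + sum_j v_j^2)
    rated = [(j, r) for (i2, j), r in ratings.items() if i2 == i]
    u[i] = sum(v[j] * r for j, r in rated) / (lam + sum(v[j] ** 2 for j, _ in rated))

def solve_item(ratings, u, v, j, lam=0.01):
    # Symmetric closed-form update of one item factor.
    raters = [(i, r) for (i, j2), r in ratings.items() if j2 == j]
    v[j] = sum(u[i] * r for i, r in raters) / (lam + sum(u[i] ** 2 for i, _ in raters))

def fold_in(ratings, u, v, new_user, lam=0.01):
    """Plain fold-in: recompute only the new user's row; everything else fixed."""
    solve_user(ratings, u, v, new_user, lam)

def fold_in_ripple(ratings, u, v, new_user, lam=0.01):
    """One 'ripple' round: the new row, then the items that row touches,
    then the other users who rated any of those items."""
    solve_user(ratings, u, v, new_user, lam)
    touched = {j for (i, j) in ratings if i == new_user}
    for j in touched:
        solve_item(ratings, u, v, j, lam)
    for i in {i for (i, j) in ratings if j in touched and i != new_user}:
        solve_user(ratings, u, v, i, lam)
```

Whether one ripple round is worth its cost, or needed at all, is exactly the empirical question raised here.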

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?






[jira] [Commented] (SPARK-6407) Streaming ALS for Collaborative Filtering

2017-03-04 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895589#comment-15895589
 ] 

Daniel Li commented on SPARK-6407:
--

Reviving this thread since I'm interested in implementing streaming CF for 
Spark.

bq. Using ALS for online updates is expensive.

Recomputing the factor matrices _U_ and _V_ from scratch for every update would 
be terribly expensive, but what about keeping _U_ and _V_ around and simply 
recomputing another round or two after each new rating that comes in?  The 
algorithm would simply be continually following a moving optimum.  I can't 
imagine the RMSE changing much due to small updates if we use a convergence 
threshold _à la_ [Y. Zhou, et al., “Large-Scale Parallel Collaborative 
Filtering for the Netflix Prize”|http://dl.acm.org/citation.cfm?id=1424269] 
instead of a fixed number of iterations.

(In fact, since calculating _(U^T) * V_ would probably take a nontrivial slice 
of time, new updates that come in during a round of calculation could be 
"batched" into the next round of calculation, increasing efficiency.)

Thoughts?
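The warm-start loop being proposed might look like this sketch (rank-1, pure Python, illustrative names; not Spark's implementation): keep _U_ and _V_ between updates and iterate until the RMSE improvement falls below a threshold rather than running a fixed number of iterations.

```python
import math

def als_rank1(ratings, u, v, lam=0.01, tol=1e-4, max_iter=50):
    """ratings: dict (user, item) -> rating; u, v: factor dicts (warm start).
    Returns the number of alternating rounds actually run."""
    def rmse():
        se = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
        return math.sqrt(se / len(ratings))

    prev = rmse()
    iters = 0
    for _ in range(max_iter):
        # Solve each user factor with item factors fixed (closed form at rank 1),
        for i in {i for (i, _) in ratings}:
            num = sum(v[j] * r for (i2, j), r in ratings.items() if i2 == i)
            den = lam + sum(v[j] ** 2 for (i2, j), _ in ratings.items() if i2 == i)
            u[i] = num / den
        # then each item factor with user factors fixed.
        for j in {j for (_, j) in ratings}:
            num = sum(u[i] * r for (i, j2), r in ratings.items() if j2 == j)
            den = lam + sum(u[i] ** 2 for (i, j2), _ in ratings.items() if j2 == j)
            v[j] = num / den
        iters += 1
        cur = rmse()
        if prev - cur < tol:  # converged: the moving optimum barely moved
            break
        prev = cur
    return iters
```

Starting from already-converged factors, folding new ratings into {{ratings}} and rerunning typically stops after an iteration or two, which is the "continually following a moving optimum" behavior described above.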

> Streaming ALS for Collaborative Filtering
> -
>
> Key: SPARK-6407
> URL: https://issues.apache.org/jira/browse/SPARK-6407
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Reporter: Felix Cheung
>Priority: Minor
>
> Like MLLib's ALS implementation for recommendation, and applying to streaming.
> Similar to streaming linear regression, logistic regression, could we apply 
> gradient updates to batches of data and reuse existing MLLib implementation?






[jira] [Commented] (SPARK-9279) Spark Master Refuses to Bind WebUI to a Privileged Port

2015-07-31 Thread Daniel Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14649570#comment-14649570
 ] 

Daniel Li commented on SPARK-9279:
--

I echo Omar's point - I understand that my particular use goes against normal 
best practices, but I don't think it's a developer's place to enforce these 
kinds of constraints on users just because it makes sense to them. It seems to 
limit Spark's usability for no gain elsewhere.

I have a case where users can access services running in the data center only 
through a small number of ports between 1 and 1024. Hard-coding the Spark web 
UI port range essentially stopped users from accessing the Spark web UI in the 
data center.

I submitted a pull request above (https://github.com/apache/spark/pull/7835). 
Sean, do you have any major concerns about configuring the Spark web UI to use 
ports between 1 and 1024?
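For reference, the constraint being objected to behaves like this sketch (a simplified, hypothetical stand-in that mirrors the error message quoted in the ticket description, not Spark's actual code):

```python
def check_start_port(port):
    # Mirrors the reported constraint: privileged ports 1-1023 are rejected
    # even when the process (e.g. one running as root) could bind them.
    if not (port == 0 or 1024 <= port <= 65535):
        raise ValueError("requirement failed: startPort should be between "
                         "1024 and 65535 (inclusive), or 0 for a random free port.")
    return port
```

The relaxation proposed in the PR would amount to widening the accepted range to 1-65535 (hypothetical summary; see the PR itself for the actual change).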

 Spark Master Refuses to Bind WebUI to a Privileged Port
 ---

 Key: SPARK-9279
 URL: https://issues.apache.org/jira/browse/SPARK-9279
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.4.1
 Environment: Ubuntu Trusty running in a docker container
Reporter: Omar Padron
Priority: Minor

 When trying to start a spark master server as root...
 {code}
 export SPARK_MASTER_PORT=7077
 export SPARK_MASTER_WEBUI_PORT=80
 spark-class org.apache.spark.deploy.master.Master \
 --host $( hostname ) \
 --port $SPARK_MASTER_PORT \
 --webui-port $SPARK_MASTER_WEBUI_PORT
 {code}
 The process terminates with IllegalArgumentException requirement failed: 
 startPort should be between 1024 and 65535 (inclusive), or 0 for a random 
 free port.
 But, when SPARK_MASTER_WEBUI_PORT=8080 (or anything > 1024), the process runs 
 fine.
 I do not understand why the usable ports have been arbitrarily restricted to 
 the non-privileged.  Users choosing to run spark as root should be allowed to 
 choose their own ports.
 Full output from a sample run below:
 {code}
 2015-07-23 14:36:50,892 INFO  [main] master.Master 
 (SignalLogger.scala:register(47)) - Registered signal handlers for [TERM, 
 HUP, INT]
 2015-07-23 14:36:51,399 WARN  [main] util.NativeCodeLoader 
 (NativeCodeLoader.java:clinit(62)) - Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 2015-07-23 14:36:51,586 INFO  [main] spark.SecurityManager 
 (Logging.scala:logInfo(59)) - Changing view acls to: root
 2015-07-23 14:36:51,587 INFO  [main] spark.SecurityManager 
 (Logging.scala:logInfo(59)) - Changing modify acls to: root
 2015-07-23 14:36:51,588 INFO  [main] spark.SecurityManager 
 (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui 
 acls disabled; users with view permissions: Set(root); users with modify 
 permissions: Set(root)
 2015-07-23 14:36:52,295 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
 slf4j.Slf4jLogger (Slf4jLogger.scala:applyOrElse(80)) - Slf4jLogger started
 2015-07-23 14:36:52,349 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
 Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Starting remoting
 2015-07-23 14:36:52,489 INFO  [sparkMaster-akka.actor.default-dispatcher-2] 
 Remoting (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting started; listening 
 on addresses :[akka.tcp://sparkMaster@sparkmaster:7077]
 2015-07-23 14:36:52,497 INFO  [main] util.Utils (Logging.scala:logInfo(59)) - 
 Successfully started service 'sparkMaster' on port 7077.
 2015-07-23 14:36:52,717 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
 2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started 
 SelectChannelConnector@sparkmaster:6066
 2015-07-23 14:36:52,759 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 util.Utils (Logging.scala:logInfo(59)) - Successfully started service on port 
 6066.
 2015-07-23 14:36:52,760 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 rest.StandaloneRestServer (Logging.scala:logInfo(59)) - Started REST server 
 for submitting applications on port 6066
 2015-07-23 14:36:52,765 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 master.Master (Logging.scala:logInfo(59)) - Starting Spark master at 
 spark://sparkmaster:7077
 2015-07-23 14:36:52,766 INFO  [sparkMaster-akka.actor.default-dispatcher-4] 
 master.Master (Logging.scala:logInfo(59)) - Running Spark version 1.4.1
 2015-07-23 14:36:52,772 ERROR [sparkMaster-akka.actor.default-dispatcher-4] 
 ui.MasterWebUI (Logging.scala:logError(96)) - Failed to bind MasterWebUI
 java.lang.IllegalArgumentException: requirement failed: startPort should be 
 between 1024 and 65535 (inclusive), or 0 for a random free port.
 at