[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10574: -- Shepherd: Joseph K. Bradley > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10574: -- Assignee: Yanbo Liang > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang >Priority: Critical > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-10574: -- Issue Type: Sub-task (was: Improvement) Parent: SPARK-13964 > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Priority: Critical > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10574: -- Target Version/s: 2.0.0 Priority: Major (was: Critical) > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Assignee: Yanbo Liang > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than vectors created > on another platform. This can create significant problems when a model > trained offline is used in another environment for online prediction. The > problem is made harder by the fact that following a hashing transform > features lose human-tractable meaning and a problem such as this may be > extremely difficult to track down. > Second, the native Scala hashing function performs badly on longer strings, > exhibiting [200-500% higher collision > rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for > example, > [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] > which is also included in the standard Scala libraries and is the hashing > choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If > Spark users apply {{HashingTF}} only to very short, dictionary-like strings > the hashing function choice will not be a big problem but why have an > implementation in MLlib with this limitation when there is a better > implementation readily available in the standard Scala library? > Switching to MurmurHash3 solves both problems. If there is agreement that > this is a good change, I can prepare a PR. > Note that changing the hash function would mean that models saved with a > previous version would have to be re-trained. This introduces a problem > that's orthogonal to breaking changes in APIs: breaking changes related to > artifacts, e.g., a saved model, produced by a previous version. Is there a > policy or best practice currently in effect about this? If not, perhaps we > should come up with a few simple rules about how we communicate these in > release notes, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simeon Simeonov updated SPARK-10574: Description: {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are two significant problems with this. First, per the [Scala documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for {{hashCode}}, the implementation is platform specific. This means that feature vectors created on one platform may be different than vectors created on another platform. This can create significant problems when a model trained offline is used in another environment for online prediction. The problem is made harder by the fact that following a hashing transform features lose human-tractable meaning and a problem such as this may be extremely difficult to track down. Second, the native Scala hashing function performs badly on longer strings, exhibiting [200-500% higher collision rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for example, [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] which is also included in the standard Scala libraries and is the hashing choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If Spark users apply {{HashingTF}} only to very short, dictionary-like strings the hashing function choice will not be a big problem but why have an implementation in MLlib with this limitation when there is a better implementation readily available in the standard Scala library? Switching to MurmurHash3 solves both problems. If there is agreement that this is a good change, I can prepare a PR. Note that changing the hash function would mean that models saved with a previous version would have to be re-trained. This introduces a problem that's orthogonal to breaking changes in APIs: breaking changes related to artifacts, e.g., a saved model, produced by a previous version. Is there a policy or best practice currently in effect about this? If not, perhaps we should come up with a few simple rules about how we communicate these in release notes, etc. was: {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are two significant problems with this. First, per the [Scala documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for {{hashCode}}, the implementation is platform specific. This means that feature vectors created on one platform may be different than vectors created on another platform. This can create significant problems when a model trained offline is used in another environment for online prediction. The problem is made harder by the fact that following a hashing transform features lose human-tractable meaning and a problem such as this may be extremely difficult to track down. Second, the native Scala hashing function performs badly on longer strings, exhibiting [200-500% higher collision rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for example, [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$] which is also included in the standard Scala libraries and is the hashing choice of fast learners such as Vowpal Wabbit and others. If Spark users apply {{HashingTF}} only to very short, dictionary-like strings the hashing function choice will not be a big problem but why have an implementation in MLlib with this limitation when there is a better implementation readily available in the standard Scala library? Switching to MurmurHash3 solves both problems. If there is agreement that this is a good change, I can prepare a PR. Note that changing the hash function would mean that models saved with a previous version would have to be re-trained. This introduces a problem that's orthogonal to breaking changes in APIs: breaking changes related to artifacts, e.g., a saved model, produced by a previous version. Is there a policy or best practice currently in effect about this? If not, perhaps we should come up with a few simple rules about how we communicate these in release notes, etc. > HashingTF should use MurmurHash3 > > > Key: SPARK-10574 > URL: https://issues.apache.org/jira/browse/SPARK-10574 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Simeon Simeonov >Priority: Critical > Labels: HashingTF, hashing, mllib > > {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are > two significant problems with this. > First, per the [Scala > documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for > {{hashCode}}, the implementation is platform specific. This means that > feature vectors created on one platform may be different than