Re: HashingTF "compatibility" across Python, Scala?

2016-04-12 Thread Nick Pentreath
I should point out that the "ml" version of HashingTF in PySpark does call
into Java, so its output will be consistent across Python and Java.

It's the "mllib" version in PySpark that has its own implementation using
Python's "hash" function (while Java uses Object.hashCode).
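
To make the difference concrete, here is a rough sketch in plain Python (no Spark needed). It assumes the JVM side reduces a string's hashCode modulo the number of features and the Python side does the same with the built-in hash(); the function names are made up for illustration and are not Spark APIs.

```python
def java_string_hashcode(s: str) -> int:
    """Mimic Java's String.hashCode: 31-based polynomial over UTF-16 code units,
    wrapped to a signed 32-bit integer."""
    data = s.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]  # one UTF-16 code unit
        h = (31 * h + unit) & 0xFFFFFFFF     # keep the low 32 bits
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - (1 << 32) if h >= (1 << 31) else h


def jvm_style_index(term: str, num_features: int) -> int:
    """Hypothetical helper: bucket index derived from a Java-style hashCode.
    Python's % already returns a non-negative result for a positive modulus."""
    return java_string_hashcode(term) % num_features


def python_style_index(term: str, num_features: int) -> int:
    """Hypothetical helper: bucket index derived from Python's built-in hash().
    On Python 3 this also varies between processes unless PYTHONHASHSEED is fixed."""
    return hash(term) % num_features


if __name__ == "__main__":
    num_features = 1 << 20
    for term in ["spark", "hashing", "tf"]:
        print(term,
              jvm_style_index(term, num_features),
              python_style_index(term, num_features))
```

The two index columns will generally disagree for the same term, which is why feature vectors built on one side do not line up with a model trained on the other.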

On Thu, 7 Apr 2016 at 18:19 Nick Pentreath wrote:

> You're right Sean, the implementation depends on hash code currently so
> may differ. I opened a JIRA (which duplicated this one -
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574
> which is the active JIRA), for using murmurhash3 which should then be
> consistent across platforms & langs (as well as more performant).
>
> It's also odd (legacy I think) that the Python version has its own
> implementation rather than calling into Java. That should also be changed
> probably.
> On Thu, 7 Apr 2016 at 17:59, Sean Owen wrote:
>
>> Let's say I use HashingTF in my Pipeline to hash a string feature.
>> This is available in Python and Scala, but they hash strings to
>> different values since both use their respective runtime's native hash
>> implementation. This means that I create different feature vectors for
>> the same input. While I can load/store something like a
>> NaiveBayesModel across the two languages successfully, it seems like
>> the hashing part doesn't translate.
>>
>> Is that accurate, or, have I completely missed a way to get the same
>> hashing for the same input across languages?
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Nick Pentreath
You're right Sean, the implementation currently depends on the hash code, so
results may differ. I opened a JIRA for using murmurhash3, which turned out
to duplicate the active one:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574
Using murmurhash3 should make the hashing consistent across platforms and
languages (as well as more performant).
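
As a sketch of why that helps: a hash defined purely over bytes gives the same bucket index in any language that implements it. Below is a pure-Python version of the standard MurmurHash3 x86 32-bit function applied to UTF-8 bytes. The seed value and the term_index helper are assumptions for illustration only, and this is not claimed to match whatever Spark ends up shipping bit for bit.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 x86 32-bit; returns an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    nblocks = n // 4

    # Body: mix each 4-byte little-endian block into the state.
    for i in range(nblocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def term_index(term: str, num_features: int, seed: int = 0) -> int:
    """Hypothetical helper: bucket index from the term's UTF-8 bytes.
    The default seed is arbitrary; both sides just need to agree on it."""
    return murmur3_32(term.encode("utf-8"), seed) % num_features


if __name__ == "__main__":
    for term in ["spark", "hashing", "tf"]:
        print(term, term_index(term, 1 << 20))
```

The point is that the input is a fixed byte encoding rather than a runtime-specific object hash, so a Scala or Java implementation of the same function with the same seed would produce the same indices.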

It's also odd (legacy, I think) that the Python version has its own
implementation rather than calling into Java. That should probably be
changed too.

On Thu, 7 Apr 2016 at 17:59, Sean Owen wrote:

> Let's say I use HashingTF in my Pipeline to hash a string feature.
> This is available in Python and Scala, but they hash strings to
> different values since both use their respective runtime's native hash
> implementation. This means that I create different feature vectors for
> the same input. While I can load/store something like a
> NaiveBayesModel across the two languages successfully, it seems like
> the hashing part doesn't translate.
>
> Is that accurate, or, have I completely missed a way to get the same
> hashing for the same input across languages?


HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Sean Owen
Let's say I use HashingTF in my Pipeline to hash a string feature.
This is available in Python and Scala, but they hash strings to
different values since both use their respective runtime's native hash
implementation. This means that I create different feature vectors for
the same input. While I can load/store something like a
NaiveBayesModel across the two languages successfully, it seems like
the hashing part doesn't translate.

Is that accurate, or, have I completely missed a way to get the same
hashing for the same input across languages?
