No, the feature vector is not converted. It contains the count n_i of how
often each term t_i occurs (or a TF-IDF transformation of those counts).
You are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
maximized.

In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...

So your n_1 counts (or TF-IDF values) are used as-is, and this is where
the dot product comes from.
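A minimal sketch of that scoring rule in plain Java (no Spark dependency; the pi/theta values below are made-up illustrations, not MLlib's):

```java
public class NBScore {
    // logPi[c] = log P(c); logTheta[c][i] = log P(t_i | c)
    // counts[i] = n_i, the raw term count (or TF-IDF value), used as-is
    static int predict(double[] logPi, double[][] logTheta, double[] counts) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < logPi.length; c++) {
            double score = logPi[c];                  // log prior
            for (int i = 0; i < counts.length; i++) {
                score += counts[i] * logTheta[c][i];  // dot product in log space
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] logPi = {Math.log(0.5), Math.log(0.5)};
        double[][] logTheta = {
            {Math.log(0.7), Math.log(0.3)},
            {Math.log(0.2), Math.log(0.8)}
        };
        double[] counts = {3, 1};  // term counts, not converted to log space
        System.out.println(predict(logPi, logTheta, counts));  // prints 0
    }
}
```

Note only pi and theta are in log space; the counts themselves multiply the log probabilities directly.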

Your bug is probably something lower-level and simple. I'd debug the
Spark example: print its exact values for the log priors, the
conditional probabilities, and the matrix operations, then print yours,
and see where they diverge.

On Thu, Nov 27, 2014 at 11:37 AM, jatinpreet <jatinpr...@gmail.com> wrote:
> Hi,
>
> I have been running through some troubles while converting the code to Java.
> I have done the matrix operations as directed and tried to find the maximum
> score for each category. But the predicted category is mostly different from
> the prediction done by MLlib.
>
> I am fetching iterators of the pi, theta and testData to do my calculations.
> pi and theta are in log space while my testData vector is not; could that
> be a problem? I didn't see an explicit conversion in MLlib either.
>
> For example, for two categories and 5 features, I am doing the following
> operation,
>
> [1,2] + [1 2 3 4 5 ] * [1,2,3,4,5]
>         [6 7 8 9 10]
> These are simple element-wise matrix multiplication and addition operators.
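For reference, with the numbers from the quoted example, the operation should be a matrix-vector product (one dot product per row of theta, plus the prior), not an element-wise multiply. A sketch under that assumption:

```java
public class MatVec {
    public static void main(String[] args) {
        double[] pi = {1, 2};                 // stand-in log priors from the example
        double[][] theta = {{1, 2, 3, 4, 5},  // stand-in log conditional probabilities
                            {6, 7, 8, 9, 10}};
        double[] x = {1, 2, 3, 4, 5};         // feature counts, used as-is

        double[] scores = new double[pi.length];
        for (int c = 0; c < pi.length; c++) {
            scores[c] = pi[c];
            for (int i = 0; i < x.length; i++) {
                scores[c] += theta[c][i] * x[i];  // dot product of row c with x
            }
        }
        // scores = [1 + 55, 2 + 130] = [56.0, 132.0]; argmax picks the class
        System.out.println(java.util.Arrays.toString(scores));
    }
}
```

If your code multiplies element-wise instead of taking the per-row dot product, that alone would explain predictions diverging from MLlib's.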

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org