Re: how idf is calculated

2014-10-31 Thread Andrejs Abele
I found my problem. I assumed, based on the TF-IDF article on Wikipedia, that log
base 10 is used, but as I found in this discussion,
https://groups.google.com/forum/#!topic/scala-language/K5tbYSYqQc8, in
Scala it is actually ln (the natural logarithm).
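
A quick check in plain Scala makes the difference visible (a minimal sketch; the
ratio 4/3 comes from the log((m + 1) / (d(t) + 1)) formula quoted further down,
for a term that appears in 2 of the 3 documents):

    math.log10(4.0 / 3)  // ≈ 0.1249, the base-10 value from the hand calculation
    math.log(4.0 / 3)    // ≈ 0.2877, the value Spark actually reports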

Regards,
Andrejs

On Thu, Oct 30, 2014 at 10:49 PM, Ashic Mahtab as...@live.com wrote:

 Hi Andrejs,
 The calculations are a bit different to what I've come across in Mining
 Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available here:
 http://www.mmds.org/

 Their calculation of IDF is as follows:

 IDFi = log2(N / ni)

 where N is the number of documents and ni is the number of documents in
 which the word appears. This looks different to your IDF function.

 For TF, they use

 TFij = fij / maxk fkj

 That is:

 For document j,
  the term frequency of the term i in j is the number of times i
 appears in j divided by the maximum number of times any term appears in j.
 (Stop words are usually excluded when considering the maximum.)

 So, in your case:

 TFa1 = 2 / 2 = 1
 TFb1 = 1 / 2 = 0.5
 TFc1 = 1 / 2 = 0.5
 TFm1 = 2 / 2 = 1
 ...

 IDFa = log2(3 / 2) = 0.585

 So, TFa1 * IDFa = 0.585
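
 A minimal Scala sketch of the two formulas above, using plain Scala collections
 (nothing Spark-specific) and the three documents from this thread, reproduces
 that 0.585:

     // Documents from the original question, one per line.
     val docs = Seq(
       "a a b c m m".split(" ").toSeq,
       "e a c d e e".split(" ").toSeq,
       "d j k l m m c".split(" ").toSeq)

     // TFij = fij / maxk fkj: term count divided by the largest count in the document.
     def tf(doc: Seq[String]): Map[String, Double] = {
       val counts = doc.groupBy(identity).map { case (t, occs) => t -> occs.size.toDouble }
       val maxCount = counts.values.max
       counts.map { case (t, c) => t -> c / maxCount }
     }

     // IDFi = log2(N / ni): N documents, ni of which contain term i.
     def idf(term: String): Double = {
       val ni = docs.count(_.contains(term))
       math.log(docs.size.toDouble / ni) / math.log(2)
     }

     tf(docs(0))("a") * idf("a")  // 1.0 * log2(3 / 2) ≈ 0.585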

 Wikipedia mentions an adjustment to overcome biases for long documents, by
 calculating TFij = 0.5 + {(0.5*fij)/maxk fkj}, but that doesn't change
 anything for TFa1, as the value remains 1.

 In other words, my calculations don't agree with yours, and neither seems to
 agree with Spark :)

 Regards,
 Ashic.

 --
 Date: Thu, 30 Oct 2014 22:13:49 +
 Subject: how idf is calculated
 From: andr...@sindicetech.com
 To: u...@spark.incubator.apache.org


 Hi,
 I'm writing a paper and I need to calculate tf-idf. With your help I managed to
 get the results I needed, but the problem is that I need to be able to explain
 how each number was obtained. So I tried to understand how idf was calculated,
 and the numbers I get don't correspond to those I should get.

 I have 3 documents (each line a document)
 a a b c m m
 e a c d e e
 d j k l m m c

 When I calculate tf, I get this
 (1048576,[99,100,106,107,108,109],[1.0,1.0,1.0,1.0,1.0,2.0])
 (1048576,[97,98,99,109],[2.0,1.0,1.0,2.0])
 (1048576,[97,99,100,101],[1.0,1.0,1.0,3.0])
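
 As an aside, each triple above is (vector size, indices, values), and with
 single-letter terms the indices happen to match the characters' code points,
 which is what you would get if HashingTF hashes a term via its hashCode
 ("a".hashCode is 97). A small sketch for decoding one of the rows:

     // Indices and values from the second row above, decoded index -> character.
     val row = Seq(97, 98, 99, 109).zip(Seq(2.0, 1.0, 1.0, 2.0))
     row.map { case (i, v) => s"${i.toChar}:$v" }.mkString(" ")
     // "a:2.0 b:1.0 c:1.0 m:2.0" -- the counts for the first document, "a a b c m m"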

 idf is supposedly calculated as idf = log((m + 1) / (d(t) + 1)), where
 m is the number of documents (3 in my case) and
 d(t) is the number of documents in which the term is present:
 a: log(4/3) = 0.1249387366
 b: log(4/2) = 0.3010299957
 c: log(4/4) = 0
 d: log(4/3) = 0.1249387366
 e: log(4/2) = 0.3010299957
 l: log(4/2) = 0.3010299957
 m: log(4/3) = 0.1249387366

 When I output the idf vector with
 `idf.idf.toArray.filter(_ > 0).distinct.foreach(println(_))`
 I get:
 1.3862943611198906
 0.28768207245178085
 0.6931471805599453

 I understand why there are only 3 numbers, because only 3 are unique:
 log(4/2), log(4/3), log(4/4), but I don't understand how the numbers in idf
 were calculated.
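
 For reference, those three printed values line up with the log((m + 1) / (d(t) + 1))
 formula above if log is the natural logarithm, for d(t) = 2, 1 and 0 respectively
 (the d(t) = 0 case presumably corresponding to hash buckets that no term maps to):

     math.log((3 + 1.0) / (2 + 1))  // ≈ 0.2877, terms appearing in 2 documents
     math.log((3 + 1.0) / (1 + 1))  // ≈ 0.6931, terms appearing in 1 document
     math.log((3 + 1.0) / (0 + 1))  // ≈ 1.3863, d(t) = 0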

 Best regards,
 Andrejs




Re: how idf is calculated

2014-10-31 Thread Sean Owen
Yes, here the base doesn't matter as it just multiplies all results by
a constant factor. Math libraries tend to have ln, not log10 or log2.
ln is often the more, er, natural base for several computations. So I
would assume that log = ln in the context of ML.
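
A one-line illustration of the constant-factor point, using the ratios from this
thread (pure math, nothing Spark-specific): for any x > 0, ln(x) / log10(x) = ln(10),
so switching bases rescales every IDF value by the same constant.

    math.log(4.0 / 3) / math.log10(4.0 / 3)  // ≈ 2.3026, i.e. ln(10)
    math.log(4.0 / 2) / math.log10(4.0 / 2)  // ≈ 2.3026 again, same constant for any ratio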

On Fri, Oct 31, 2014 at 11:31 AM, Andrejs Abele andr...@sindicetech.com wrote:
 I found my problem. I assumed, based on the TF-IDF article on Wikipedia, that
 log base 10 is used, but as I found in this discussion, in Scala it is actually
 ln (the natural logarithm).

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


