[Wikidata-bugs] [Maniphest] [Commented On] T214897: data for analyzing and visualizing the identifier landscape of Wikidata

GoranSMilovanovic Sun, 17 Mar 2019 05:45:56 -0700

GoranSMilovanovic added a comment.


  @RazShuty @Lydia_Pintscher @JAllemandou
  
  Our approach here will be to use an approx. 1M sized sample of WD items to 
produce the identifier x identifier Jaccard distance matrix
  
  - Rationale:
    - the dataset as produced in Spark has 250M rows x two columns 
(item-identifier pairs);
    - the desired binary contingency matrix to compute the Jaccard distances is 
of approx. dimension 26M x 1000K+;
    - due to internal constraints, Spark `stat.crosstab()` cannot produce the 
binary contingency matrix that we need to compute the Jaccard distances;
    - while R {data.table} can manage the dataset, it still cannot produce the 
desired contingency matrix;
    - moreover, even if we could have the contingency matrix produced in an 
efficient manner, it is questionable what procedure could deliver the Jaccard 
distances efficiently.
  
  The results of the following experiment indicate that we can safely 
proceed with sampling:
  
  - Take 10 random samples from the ~250M item-identifier pairs,
    - sampling identifiers proportionally (i.e. compute p(identifier) and 
weight the identifier sample accordingly),
    - and including one observation for each identifier with p = 0 (due to 
rounding, not due to the absence of the identifier);
  - for each sample, produce a binary contingency matrix;
  - from each contingency matrix compute all pair-wise identifier-identifier 
Jaccard distances, store as vector;
  - compute Pearson correlation coefficients between the distance vectors 
obtained from 10 random samples.
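
  The steps above can be sketched in a few lines of Python. This is only an 
illustration on synthetic toy data, not the Spark/R code used for the 
experiment: the function names, the made-up identifiers (`P_a` ... `P_d`), 
the sample size, and the reading of "proportional sampling" as a per-identifier 
quota over item-identifier pairs are all assumptions.

```python
import random
from collections import defaultdict
from itertools import combinations


def proportional_sample(pairs, sample_size, rng):
    """Stratified sample of (item, identifier) pairs, proportional to p(identifier)."""
    rows_by_identifier = defaultdict(list)
    for item, identifier in pairs:
        rows_by_identifier[identifier].append((item, identifier))
    sample = []
    for rows in rows_by_identifier.values():
        # Quota proportional to the identifier's share of all pairs; max(1, ...)
        # keeps one observation when the share rounds down to zero.
        quota = max(1, round(sample_size * len(rows) / len(pairs)))
        sample.extend(rng.sample(rows, min(quota, len(rows))))
    return sample


def jaccard_distance_vector(pairs, identifiers):
    """Pairwise Jaccard distances between identifiers, in a fixed pair order."""
    # A dict of item sets is a sparse form of the binary item x identifier matrix.
    items_by_identifier = defaultdict(set)
    for item, identifier in pairs:
        items_by_identifier[identifier].add(item)
    vector = []
    for a, b in combinations(identifiers, 2):
        set_a, set_b = items_by_identifier[a], items_by_identifier[b]
        union = len(set_a | set_b)
        # Assumed convention: distance 1.0 if neither identifier occurs at all.
        vector.append(1.0 - len(set_a & set_b) / union if union else 1.0)
    return vector


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Synthetic toy data (not Wikidata): 2000 items, 4 made-up identifiers,
# each attached to an item with the given probability.
rng = random.Random(0)
usage = {"P_a": 0.8, "P_b": 0.5, "P_c": 0.2, "P_d": 0.05}
pairs = [(item, ident) for item in range(2000)
         for ident, p in usage.items() if rng.random() < p]
identifiers = sorted(usage)  # fixed order so the vectors are comparable

# Three samples here instead of ten, to keep the output short.
vectors = [jaccard_distance_vector(proportional_sample(pairs, 1500, rng), identifiers)
           for _ in range(3)]
corr = [[pearson(u, v) for v in vectors] for u in vectors]
for row in corr:
    print(["%.4f" % r for r in row])
```

  Keeping the identifier order fixed across samples is what makes the distance 
vectors comparable element by element; the toy numbers printed by this sketch 
say nothing about the real dataset.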
  
  Here's the correlation matrix; all off-diagonal coefficients are at least 
0.9989, so ~1M sized proportional random samples of item-property pairs are 
quite representative of the ~26M item-property pairs dataset:
  
  |          | sample1 | sample2 | sample3 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | sample10 |
  | sample1  | 1       | 0.9992  | 0.9992  | 0.9989  | 0.9991  | 0.9991  | 0.9992  | 0.9992  | 0.999   | 0.9992   |
  | sample2  | 0.9992  | 1       | 0.9995  | 0.9992  | 0.9996  | 0.9992  | 0.9994  | 0.9995  | 0.9991  | 0.9994   |
  | sample3  | 0.9992  | 0.9995  | 1       | 0.9993  | 0.9996  | 0.9993  | 0.9994  | 0.9996  | 0.9992  | 0.9994   |
  | sample4  | 0.9989  | 0.9992  | 0.9993  | 1       | 0.9992  | 0.9991  | 0.9992  | 0.9991  | 0.999   | 0.9991   |
  | sample5  | 0.9991  | 0.9996  | 0.9996  | 0.9992  | 1       | 0.9994  | 0.9994  | 0.9995  | 0.9992  | 0.9994   |
  | sample6  | 0.9991  | 0.9992  | 0.9993  | 0.9991  | 0.9994  | 1       | 0.9993  | 0.9992  | 0.9991  | 0.9991   |
  | sample7  | 0.9992  | 0.9994  | 0.9994  | 0.9992  | 0.9994  | 0.9993  | 1       | 0.9994  | 0.9992  | 0.9994   |
  | sample8  | 0.9992  | 0.9995  | 0.9996  | 0.9991  | 0.9995  | 0.9992  | 0.9994  | 1       | 0.9992  | 0.9995   |
  | sample9  | 0.999   | 0.9991  | 0.9992  | 0.999   | 0.9992  | 0.9991  | 0.9992  | 0.9992  | 1       | 0.9991   |
  | sample10 | 0.9992  | 0.9994  | 0.9994  | 0.9991  | 0.9994  | 0.9991  | 0.9994  | 0.9995  | 0.9991  | 1        |
  
  
  
  - Next steps: (1) proceed to produce the dataset; (2) resolve this ticket and 
proceed to visualization: T204440 <https://phabricator.wikimedia.org/T204440>.

TASK DETAIL
  https://phabricator.wikimedia.org/T214897

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic
Cc: RazShuty, Addshore, JAllemandou, Aklapper, GoranSMilovanovic, 
Lydia_Pintscher, alaa_wmde, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, 
rosalieper, Wikidata-bugs, aude, Mbch331

_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
