Hi All,

From man-k.org I was able to create a small dataset of queries, results, and their relevance scores. I am trying out some machine learning models on it to improve the ranking algorithm of apropos(1).

Currently apropos has a weight for each section of a man page, such as NAME, DESCRIPTION, etc., and it multiplies a match in a section by that section's weight. This is needed because a match in one section, for example NAME, is more relevant than a match in another section, such as DESCRIPTION. I chose these weights arbitrarily, as I didn't have any way to learn their optimal values. I am now trying out some machine learning techniques to learn these weights. The results so far are not drastic, but they are definitely an improvement. Hopefully I will be able to get more concrete results soon.

A small comparison of results between the old weights and the weights learned from machine learning is below.

apropos -n 10 -C fork #old weights
fork (2)             create a new process
perlfork (1)         Perls fork() emulation
cpu_lwp_fork (9)     finish a fork operation
pthread_atfork (3)   register handlers to be called when process forks
rlogind (8)          remote login server
rshd (8)             remote shell server
rexecd (8)           remote execution server
script (1)           make typescript of terminal session
moncontrol (3)       control execution profile
vfork (2)            spawn new process in a virtual memory efficient way

apropos -n 10 -C fork #new weights
fork (2)             create a new process
perlfork (1)         Perls fork() emulation
cpu_lwp_fork (9)     finish a fork operation
pthread_atfork (3)   register handlers to be called when process forks
vfork (2)            spawn new process in a virtual memory efficient way
clone (2)            spawn new process with options   <-- clone(2) appears in top 10
daemon (3)           run in the background
script (1)           make typescript of terminal session
openpty (3)          tty utility functions
rlogind (8)          remote login server

With the new weights, clone(2) shows up, rshd(8) and rexecd(8) drop out of the top 10, and rlogind(8) moves down.

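To make the per-section weighting concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not the actual apropos code: the page names, the per-section match scores, and both sets of weights are hypothetical values chosen for the example.

```python
# Illustrative sketch only: not the apropos implementation.
# All weights and match scores below are made-up assumptions.

def page_score(section_matches, weights):
    """Sum each section's match score multiplied by that section's weight."""
    return sum(weights.get(section, 0.0) * match
               for section, match in section_matches.items())

def rank(pages, weights, n=10):
    """Return up to n page names, highest weighted score first."""
    ordered = sorted(pages,
                     key=lambda name: page_score(pages[name], weights),
                     reverse=True)
    return ordered[:n]

# Hypothetical per-section match scores for two pages.
pages = {
    "page_a": {"NAME": 1.0, "DESCRIPTION": 0.5},  # query matches the NAME section
    "page_b": {"NAME": 0.0, "DESCRIPTION": 2.0},  # query matches only DESCRIPTION
}

flat_weights = {"NAME": 1.0, "DESCRIPTION": 1.0}        # every section counts equally
name_heavy_weights = {"NAME": 3.0, "DESCRIPTION": 1.0}  # NAME matches count more

print(rank(pages, flat_weights))
print(rank(pages, name_heavy_weights))
```

With equal weights, the page that matches only in DESCRIPTION ranks first. Once NAME is weighted more heavily, the page with the NAME match overtakes it. Learning the weights amounts to picking the values that make orderings like this agree with the relevance scores in the dataset.
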
apropos -n 10 -C create new process #old weights
init (8)             process control initialization
fork (2)             create a new process
fork1 (9)            create a new process
timer_create (2)     create a per-process timer
getpgrp (2)          get process group
supfilesrv (8)       sup server processes
posix_spawn (3)      spawn a process
master (8)           Postfix master process
popen (3)            process I/O
_lwp_create (2)      create a new light-weight process

apropos -n 10 -C create new process #new weights
fork (2)             create a new process   <-- fork(2) is number 1
fork1 (9)            create a new process
_lwp_create (2)      create a new light-weight process
pthread_create (3)   create a new thread
clone (2)            spawn new process with options
timer_create (2)     create a per-process timer
UI_new (3)           New User Interface
init (8)             process control initialization
posix_spawn (3)      spawn a process
master (8)           Postfix master process

With the new weights, fork(2) moves to number 1, init(8) moves down to number 8, clone(2) appears, etc.

I wrote a blog post about it:
http://abhinav-upadhyay.blogspot.in/2016/05/teaching-apropos-to-rank-work-in.html

The data is available here:
https://github.com/abhinav-upadhyay/man-nlp-experiments/tree/master/data

Let me know your thoughts or concerns :)

--
Abhinav