Hi all,
I have written a Scrapy-based crawler for .onion pages. The crawler is built on Scrapy and stores its results in a PostgreSQL database, so it should be fairly straightforward to build some kind of search algorithm on top of the crawled data. The data stored for each website is:

1) URL
2) Keywords (HTML meta keywords, title, h1, h2, h3, etc.)
3) All the stemmed words from the page together with their counts (word1: count_of_word1, word2: count_of_word2, ...)
4) Domain
5) Public WWW backlinks to the domain
6) Popularity according to the Tor2web stats
7) Number of clicks on the domain in the search results

Note the stemming[1]. I realized that I have to find the words that are close enough to what is searched for. For efficiency, it is useful to store the stemmed words from each page and use the Levenshtein distance[2] to compare the search terms against them. I am working on this.

[1] https://en.wikipedia.org/wiki/Stemming
[2] https://en.wikipedia.org/wiki/Levenshtein_distance

Greetings,
Juha

_______________________________________________
tor-reports mailing list
[email protected]
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-reports
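[Editor's note: the matching approach described above — comparing a stemmed search term against a page's stored stemmed word counts via Levenshtein distance — can be sketched as below. This is not the author's actual code; `crude_stem`, `match_query`, and the distance threshold are illustrative stand-ins (a real deployment would likely use a proper stemmer such as NLTK's PorterStemmer).]

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def crude_stem(word: str) -> str:
    """Toy suffix stripper, purely for illustration of the stemming step."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def match_query(query: str, page_word_counts: dict, max_dist: int = 2):
    """Return (stemmed_word, count, distance) triples for stored words
    whose Levenshtein distance to the stemmed query is within max_dist,
    closest matches first."""
    q = crude_stem(query.lower())
    hits = []
    for word, count in page_word_counts.items():
        d = levenshtein(q, word)
        if d <= max_dist:
            hits.append((word, count, d))
    return sorted(hits, key=lambda t: t[2])
```

For example, searching "crawling" against a page whose stored counts include `{"crawl": 7, "onion": 3}` stems the query to "crawl" and matches the stored word at distance 0, while "onion" falls outside the threshold.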
