Karl,

This is tremendously useful. Thanks very much for your insight.
Brian

On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley <kmhig...@gmail.com> wrote:
> Hi,
>
> It sounds like you're trying to solve the approximate nearest neighbor
> (ANN) problem. With a large dataset, parallelizing a brute-force O(n^2)
> approach isn't likely to help all that much, because the number of
> pairwise comparisons grows quadratically with the size of the dataset.
> I'd look at ways to avoid computing the similarity between all pairs,
> like locality-sensitive hashing. (Unfortunately Spark doesn't yet support
> LSH -- it's currently slated for the Spark 2.0.0 release, but AFAIK
> development on it hasn't started yet.)
>
> There are a bunch of Python libraries that support various approaches to
> the ANN problem (including LSH), though. It sounds like you need fast
> lookups, so you might check out https://github.com/spotify/annoy. For
> other alternatives, see this performance comparison of Python ANN
> libraries: https://github.com/erikbern/ann-benchmarks.
>
> Hope that helps,
> Karl
>
> On Wed, Feb 10, 2016 at 10:29 PM rokclimb15 <rokclim...@gmail.com> wrote:
>
>> Hi everyone, new to this list and Spark, so I'm hoping someone can
>> point me in the right direction.
>>
>> I'm trying to perform the same sort of task described here:
>>
>> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>>
>> and I'm running into the same problem: it doesn't scale. Even on a very
>> fast processor, MySQL pegs one CPU core at 100% and takes 8 hours to
>> find a match among 30 million+ rows.
>>
>> What I would like to do is load this data set from MySQL into Spark,
>> compute the Hamming distance using all available cores, and then select
>> the rows within a maximum distance. I'm most familiar with Python, so I
>> would prefer to use that.
>>
>> I found an example of loading data from MySQL:
>>
>> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>>
>> I also found a related DataFrame commit and docs, but I'm not exactly
>> sure how to put it all together:
>>
>> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>>
>> Could anyone please point me to a similar example that a Spark newb
>> could follow to try this out? Is this even worth attempting, or will it
>> similarly fail performance-wise?
>>
>> Thanks!
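To make the two suggestions in the thread concrete, here is roughly what each
might look like in code.

First, Karl's Annoy suggestion. This is a minimal sketch, not a tested
implementation: it assumes 64-bit integer hashes unpacked into 0/1 vectors,
and it uses Annoy's "hamming" metric, which exists in recent Annoy releases
but may not have been available when this thread was written. The toy data
and all parameter values are illustrative.

    from annoy import AnnoyIndex

    BITS = 64  # assumed hash width; adjust to match the real data

    def to_bit_vector(h):
        """Unpack an integer hash into a list of 0/1 values for Annoy."""
        return [float((h >> i) & 1) for i in range(BITS)]

    # Toy data standing in for the 30M+ hashes loaded from MySQL.
    hashes = [0x0F0F0F0F0F0F0F0F, 0xFFFF000000000000, 0x0F0F0F0F0F0F0F1F]

    index = AnnoyIndex(BITS, 'hamming')
    for item_id, h in enumerate(hashes):
        index.add_item(item_id, to_bit_vector(h))
    index.build(10)  # 10 trees; more trees improve recall but slow the build

    # Approximate nearest neighbors of a query hash, with Hamming distances.
    ids, dists = index.get_nns_by_vector(
        to_bit_vector(0x0F0F0F0F0F0F0F3F), 2, include_distances=True)
    print(list(zip(ids, dists)))

The point of this approach is that each lookup inspects only a small fraction
of the index rather than scanning every row, which is what makes it fast.

Second, the brute-force Spark version the original poster asked about,
combining the JDBC DataSource API from the linked blog post with
Column.bitwiseXOR. Again only a sketch: the connection URL, table name, and
"hash" column are placeholders, and the popcount UDF assumes the hashes are
stored as non-negative integers. Note that this parallelizes the linear scan
but doesn't change its cost per query, which is Karl's point above.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import IntegerType

    sc = SparkContext(appName="hamming-distance")
    sqlContext = SQLContext(sc)

    # Load the MySQL table through the DataSource API (options are placeholders).
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://dbhost:3306/mydb?user=me&password=secret",
        dbtable="image_hashes",
        driver="com.mysql.jdbc.Driver",
    ).load()

    # Hamming distance = number of set bits in the bitwise XOR of two hashes.
    popcount = udf(lambda n: bin(n).count("1"), IntegerType())

    query_hash = 0x0F0F0F0F0F0F0F3F  # the hash to match against (illustrative)
    matches = (df
               .withColumn("dist", popcount(col("hash").bitwiseXOR(query_hash)))
               .filter(col("dist") <= 5))  # keep rows within max distance 5
    matches.show()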