Hello experts,

I’m new to Spark and want to find the K nearest neighbors in a very large, 
high-dimensional point dataset with low latency.

The scenario: the dataset contains more than 10 million points, each with 200 
dimensions. I’m building a web service that receives one new point per request 
and returns the K nearest points from that dataset, and the response time must 
stay low. I have a cluster with several high-memory nodes for this service.
 
So far I only have these ideas:
1. Build several ball-tree instances on each node when the service 
initializes. Queries are fast, but this does not cope well with a growing 
dataset: I cannot insert new points into the ball trees without restarting the 
service and rebuilding them.
2. Use a SQL-based solution. Some databases, such as PostgreSQL and SQL Server, 
have spatial search features, but they may not perform well in a big-data 
environment. (Does Spark SQL have spatial features or a spatial index?)
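
As a rough illustration of idea 1 (a minimal sketch, not actual service code): 
the dataset is split into shards, each shard answers the query locally, and the 
partial results are merged into a global top K. A brute-force scan stands in 
for the ball tree here, and all function names are hypothetical.

```python
import heapq
import math


def knn_in_shard(shard, query, k):
    """Return the k nearest (distance, point_id) pairs within one shard.

    `shard` is a list of (point_id, vector) tuples. A brute-force scan is
    used as a stand-in for a per-shard ball tree (assumption).
    """
    scored = ((math.dist(vec, query), pid) for pid, vec in shard)
    return heapq.nsmallest(k, scored)


def knn_across_shards(shards, query, k):
    """Query every shard, then merge the partial top-k lists."""
    partials = [knn_in_shard(shard, query, k) for shard in shards]
    # Each partial list is already sorted, so a k-way merge is enough.
    return heapq.nsmallest(k, heapq.merge(*partials))


def insert_point(shards, point_id, vector):
    """Append a new point to the smallest shard.

    With a plain list this needs no rebuild; with a real ball tree the
    shard's index would have to be rebuilt, which is the limitation of idea 1.
    """
    min(shards, key=len).append((point_id, vector))
```

Each shard only needs to return its own K best candidates, so the merge step 
handles at most K times the number of shards entries regardless of dataset size.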

Based on your experience, can I implement this scenario with Spark SQL? Or do 
you know of other projects in the Spark stack that work well for this?
Any ideas are appreciated. Thanks very much.

Regards,
Dong



