Hi, I am sure you can use Spark for this, but it looks like a problem that should be delegated to a text-based indexing technology such as Elasticsearch, or something else built on Lucene, to serve the requests. Spark can then be used to prepare the data that is fed to the indexing service.
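To make that concrete, a minimal sketch (in Scala, using the elasticsearch-hadoop connector's saveToEs) of a Spark job that parses raw CV records and bulk-loads them into an Elasticsearch index could look like the one below. The input path, the tab-separated record layout, and the "cv/profile" index name are placeholders, not something from your setup:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs() to RDDs

object IndexCVs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("index-cvs")
      .set("es.nodes", "localhost")  // Elasticsearch node(s) to write to
      .set("es.port", "9200")
    val sc = new SparkContext(conf)

    // Hypothetical input: one tab-separated CV per line (id, skills, years).
    val cvs = sc.textFile("hdfs:///data/cvs.tsv")
      .map(_.split("\t"))
      .map(f => Map(
        "id"     -> f(0),
        "skills" -> f(1).split(",").toSeq,
        "years"  -> f(2).toInt))

    // Bulk-index the documents; "cv/profile" is a placeholder index/type.
    cvs.saveToEs("cv/profile")
  }
}

Once the CVs are indexed this way, each vacancy query becomes an Elasticsearch search against a pre-built index instead of a full scan of the data.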
Using Spark directly seems like there would be a lot of repeated computation between requests, which can be avoided. There are a number of Spark-Elasticsearch bindings that can be used to make the process easier. Again, Spark SQL can help you convert most of the logic directly into Spark jobs, but I would suggest exploring text-indexing technologies too.

-- ankur

-----Original Message-----
From: "Сергей Мелехин" <cpro...@gmail.com>
Sent: 5/24/2015 10:59 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Using Spark like a search engine

Hi!

We are developing a scoring system for recruitment. A recruiter enters vacancy requirements, and we score tens of thousands of CVs against those requirements and return, e.g., the top 10 matches. We do not use full-text search, and sometimes we don't even filter the input CVs prior to scoring (some vacancies have no mandatory requirements that could be used as an effective filter).

So we have a scoring function F(CV, VACANCY) that is currently implemented in SQL and runs on a PostgreSQL cluster. In the worst case, F is executed once for every CV in the database. The VACANCY part is fixed within one query but changes between queries, so there is very little we can compute in advance.

We expect to have about 100,000,000 CVs in the next year, and we do not expect our current implementation to deliver the desired low-latency response (<1 s) on 100M CVs. So we are looking for a horizontally scalable and fault-tolerant in-memory solution.

Will Spark be useful for our task? All the tutorials I could find describe stream processing or ML applications. Which Spark extensions/backends could be useful?

With best regards,
Sergey Melekhin
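For reference, the "direct" Spark approach discussed above (cache all CVs in executor memory, run a full scoring pass per query, keep only the top k) could look roughly like the sketch below. The CV and Vacancy types, the score() function, and the input path are hypothetical stand-ins for the SQL logic described in the question:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical record types; the real fields would come from the CV schema.
case class CV(id: Long, skills: Set[String], years: Int)
case class Vacancy(required: Set[String], minYears: Int)

object ScoreCVs {
  // Stand-in for F(CV, VACANCY); the real logic would port the SQL scoring.
  def score(cv: CV, v: Vacancy): Double =
    cv.skills.intersect(v.required).size +
      (if (cv.years >= v.minYears) 1.0 else 0.0)

  def topMatches(cvs: RDD[CV], v: Vacancy, k: Int = 10): Array[(Double, CV)] =
    // top() keeps a bounded top-k per partition and merges on the driver,
    // so only k records per partition ever cross the network.
    cvs.map(cv => (score(cv, v), cv)).top(k)(Ordering.by(_._1))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("score-cvs"))

    // Load once and keep in memory so repeated queries avoid re-reading disk.
    val cvs = sc.objectFile[CV]("hdfs:///data/cvs")  // hypothetical path
      .persist(StorageLevel.MEMORY_ONLY)
    cvs.count()  // materialize the cache before the first query arrives

    val vacancy = Vacancy(required = Set("scala", "spark"), minYears = 3)
    topMatches(cvs, vacancy).foreach(println)
  }
}

Each query still scans every cached CV, which is exactly the repeated computation ankur points out; whether a full pass over 100M in-memory records fits the <1 s budget depends on cluster size, which is why a pre-built index is worth exploring first.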