Hi,

I am sure you can use Spark for this, but it seems like a problem that should be 
delegated to a text-based indexing technology like Elasticsearch, or something 
else based on Lucene, to serve the requests. Spark can be used to prepare the 
data that is fed to the indexing service. 

Using Spark directly would mean a lot of repeated computation between requests 
that could be avoided.

There are a number of Spark-Elasticsearch bindings that can make the process 
easier. 
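As a rough illustration (not a tested pipeline), here is what the hand-off 
could look like with the elasticsearch-hadoop connector's Spark API. The index 
name "cvs/doc", the ES host settings and the CV fields are made up for the 
example:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.elasticsearch.spark._   // adds saveToEs to RDDs

  object IndexCvs {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("index-cvs")
        .set("es.nodes", "localhost")   // assumption: ES reachable locally
        .set("es.port", "9200")

      val sc = new SparkContext(conf)

      // Prepare CV documents in Spark (parsing, normalisation, feature
      // extraction), then hand them off to Elasticsearch for serving.
      val cvs = sc.parallelize(Seq(
        Map("cv_id" -> "1", "skills" -> "scala spark sql", "experience_years" -> 7),
        Map("cv_id" -> "2", "skills" -> "java lucene",     "experience_years" -> 3)
      ))

      // saveToEs comes from the org.elasticsearch.spark._ import.
      cvs.saveToEs("cvs/doc")

      sc.stop()
    }
  }

The same connector also lets you read an index back into Spark (sc.esRDD) if 
you ever need to post-process search results in a job.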

Again, Spark SQL can help you convert most of the existing SQL logic directly 
to Spark jobs, but I would suggest exploring text-indexing technologies too. 
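For example, a scoring query per vacancy could be expressed roughly like the 
sketch below (run in the spark-shell, where sc already exists). The column 
names and the scoring formula are invented for illustration; your real 
F(CV, VACANCY) logic would go in their place, with the vacancy's requirements 
substituted into the query text for each request:

  import org.apache.spark.sql.SQLContext

  case class Cv(cv_id: String, skills: String, experience_years: Int)

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Toy in-memory CV set; in practice this would be the full CV dataset
  // cached in memory across queries.
  val cvs = sc.parallelize(Seq(
    Cv("1", "scala spark sql", 7),
    Cv("2", "java lucene", 3)
  )).toDF()
  cvs.registerTempTable("cvs")

  // One vacancy's requirements baked into the query; only the CV side is scanned.
  val top10 = sqlContext.sql("""
    SELECT cv_id,
           CASE WHEN skills LIKE '%spark%' THEN 10.0 ELSE 0.0 END
             + experience_years * 0.5 AS score
    FROM cvs
    ORDER BY score DESC
    LIMIT 10
  """)

  top10.show()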

-- ankur

-----Original Message-----
From: "Сергей Мелехин" <cpro...@gmail.com>
Sent: 5/24/2015 10:59 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Using Spark like a search engine

Hi!
We are developing a scoring system for recruitment. A recruiter enters vacancy 
requirements, we score tens of thousands of CVs against these requirements, and 
return e.g. the top 10 matches.
We do not use full-text search, and sometimes we don't even filter the input CVs 
prior to scoring (some vacancies have no mandatory requirements that can be used 
as an effective filter).


So we have a scoring function F(CV, VACANCY) that is currently implemented in SQL 
and runs on a PostgreSQL cluster. In the worst case, F is executed once for every 
CV in the database. The VACANCY part is fixed within one query, but changes 
between queries, so there is very little we can precompute in advance.


We expect to have about 100 000 000 CVs in the next year, and do not expect our 
current implementation to offer the desired low-latency response (<1 s) on 100M 
CVs. So we are looking for a horizontally scalable, fault-tolerant in-memory 
solution.


Will Spark be useful for our task? All the tutorials I could find describe stream 
processing or ML applications. What Spark extensions/backends could be useful?




With best regards, Sergey Melekhin
