If you need anything close to real time (within a few seconds), Hadoop and its ilk are not an option. Solr is fine, but be prepared to dedicate a lot of hardware to it.
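For the aggregation-style queries in the quoted requirements (min, max, average, sd, count), here is a rough sketch of what a single low-latency Solr request could look like. It assumes a Solr version that ships the StatsComponent (the `stats=true` and `stats.field` parameters); the core name `applogs` and the field name `response_time_ms` are hypothetical, not taken from this thread.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_stats_url(base_url, core, field, query="*:*"):
    """Build a Solr select URL that asks only for aggregate stats on one numeric field."""
    params = [
        ("q", query),            # which documents to aggregate over
        ("rows", 0),             # return no documents, only the stats section
        ("stats", "true"),       # enable the StatsComponent (assumed available)
        ("stats.field", field),  # numeric field to compute the statistics on
        ("wt", "json"),          # response format
    ]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_stats_url("http://localhost:8983/solr", "applogs", "response_time_ms")
print(url)

# Decode the query string again to show which parameters were sent.
print(parse_qs(urlsplit(url).query))
```

The request only builds the URL; sending it and reading the statistics out of the response is left out, since that depends on the Solr version and schema in use.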

On Fri, Nov 7, 2008 at 10:53 PM, souravm <[EMAIL PROTECTED]> wrote:

> Hi Shalin,
>
> Thanks for your input.
>
> Yes, I agree that my application is not much about full-text search.
>
> Hive/Chukwa/Pig (a combination) running on Hadoop can be a good bet. But
> where they fall short is in online querying of huge data.
>
> I am specifically talking about Pig in this case, which has benchmark
> figures on the order of 3-10 minutes with 11 nodes for around 4 GB of data
> (200 M records), whereas for Solr I can see processing time is under a
> second on 1 node (but with more memory) for around 1 GB of data
> (0.5 M records).
>
> Since online query performance is one of the key requirements for my
> application (I think that, irrespective of the type of application, no user
> would like to wait on the screen for more than a minute), I'm in a dilemma.
>
> Regards,
> Sourav
>
> -----Original Message-----
> From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 07, 2008 7:48 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Multicore ...
>
> From what I can understand, you have little full-text search involved here.
> You should probably look at Hadoop and its contrib and sub-projects such as
> Pig, Hive and Chukwa.
>
> http://wiki.apache.org/hadoop/
> http://wiki.apache.org/hadoop/Hive
> http://wiki.apache.org/hadoop/Chukwa
> http://incubator.apache.org/pig/
>
> On Fri, Nov 7, 2008 at 9:03 PM, souravm <[EMAIL PROTECTED]> wrote:
>
>> Hi Guys,
>>
>> Here I'm struggling to decide whether Solr would be a fitting solution
>> for me. Highly appreciate you
>>
>> The key requirements can be summarized as below -
>>
>> 1. Need to process a very high volume of data online from log files of
>> various applications - around 100s of millions, with total size possibly
>> varying within a range of 30-40 GB.
>>
>> 2. Flexibility - Log file formats from different applications would be
>> different. Also, for the same application, log file formats can vary.
>> However, the log files would be in XML, and if a new type has to be
>> supported then the schema for the same would be known beforehand.
>>
>> 3. The type of queries to be supported -
>> a) Mostly aggregation-type statistics (min, max, average, sd, count etc.)
>> of response times, sales numbers etc.
>> b) Ability to support ad hoc queries relating multiple fields in a given
>> log file, joining similar fields in multiple log files
>>
>> 4. Expected performance would be around 10 to 20 sec for the majority of
>> the queries. For the rest it may be a bit higher.
>>
>> I'm planning to use Solr with the multicore and distributed search
>> features. However, I'm also considering Hadoop with HBase, as that looks
>> to be a natural solution to support multiple file formats and handle
>> ad hoc queries.
>>
>> I would surely like to have your viewpoints in this regard - whether,
>> given the key requirements above, Solr is the right choice or Hadoop+HBase
>> would be better (or any other open source product).
>>
>> Thanks in advance.
>>
>> Regards,
>> Sourav
>>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

--
--Noble Paul