Re: Solr for large volume data processing with minimal full-text search

Noble Paul നോബിള്‍ नोब्ळ् Fri, 07 Nov 2008 09:28:08 -0800

If you need anything close to real time (a few seconds), Hadoop and its
ilk are not an option. Solr is fine, but be prepared to dedicate a lot
of hardware to it.
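
Whichever engine ends up serving the queries, the aggregation-type statistics listed in requirement 3(a) of the quoted message below (min, max, average, standard deviation, count) can be computed in a single pass over the records. The following is only a minimal sketch in plain Python, not Solr or Hadoop code: the class name `RunningStats`, the sample values, and the use of Welford's online method for the standard deviation are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Single-pass count/min/max/mean/standard deviation (Welford's method)."""
    count: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations from the mean

    def add(self, value: float) -> None:
        # Update every aggregate incrementally; no record is kept in memory.
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        # Population standard deviation; 0.0 when no value has been added.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


# Hypothetical response times in milliseconds, standing in for parsed log records.
response_times_ms = [120.0, 80.0, 100.0, 140.0, 60.0]

stats = RunningStats()
for t in response_times_ms:
    stats.add(t)

print(stats.count, stats.minimum, stats.maximum, stats.mean, stats.stddev)
```

Because each update touches only a handful of numbers, the same accumulator could in principle be kept per field or per log type, which is why these statistics are cheap compared with ad hoc joins across log files.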


On Fri, Nov 7, 2008 at 10:53 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi Shalin,
>
> Thanks for your input.
>
> Yes, I agree that my application is not much about full-text search.
>
> Hive/Chukwa/Pig (in combination) running on Hadoop can be a good bet, but
> they fall short in online querying of huge data.
>
> I am specifically talking about Pig here, whose benchmark figures are on
> the order of 3-10 minutes with 11 nodes for around 4 GB of data
> (200 M records), whereas for Solr I see processing times under a second
> on 1 node (with more memory) for around 1 GB of data (0.5 M records).
>
> Since online query performance is one of the key requirements for my
> application (I think no user, irrespective of the type of application,
> would like to wait at the screen for more than a minute), I'm in a dilemma.
>
> Regards,
> Sourav
>
>
>
> -----Original Message-----
> From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
> Sent: Friday, November 07, 2008 7:48 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Multicore ...
>
> From what I can understand, you have little full-text search involved here.
> You should probably look at Hadoop and its contrib and sub-projects such as
> Pig, Hive and Chukwa.
>
> http://wiki.apache.org/hadoop/
> http://wiki.apache.org/hadoop/Hive
> http://wiki.apache.org/hadoop/Chukwa
> http://incubator.apache.org/pig/
>
> On Fri, Nov 7, 2008 at 9:03 PM, souravm <[EMAIL PROTECTED]> wrote:
>
>> Hi Guys,
>>
>> I'm struggling to decide whether Solr would be a fitting solution
>> for me. Highly appreciate you
>>
>> The key requirements can be summarized as below -
>>
>> 1. Need to process a very high volume of data online from log files of
>> various applications - hundreds of millions of records, with total size
>> varying within a range of 30-40 GB.
>>
>> 2. Flexibility - Log file formats differ between applications, and can
>> also vary for the same application. However, the log files would be in
>> XML, and if a new type has to be supported, its schema would be known
>> beforehand.
>>
>> 3. The type of queries to be supported -
>> a) Mostly aggregation-type statistics (min, max, average, standard
>> deviation, count, etc.) of response times, sales numbers, etc.
>> b) Ability to support ad hoc queries relating multiple fields in a given
>> log file, and joining similar fields across multiple log files
>>
>> 4. Expected performance would be around 10 to 20 sec for the majority of
>> queries. For the rest it may be somewhat higher.
>>
>> I'm planning to use Solr with its multicore and distributed search
>> features. However, I'm also considering Hadoop with HBase, as that looks
>> to be a natural solution for supporting multiple file formats and
>> handling ad hoc queries.
>>
>> I would like to have your viewpoints on this: given the key requirements
>> above, is Solr the right choice, or would Hadoop+HBase (or any other open
>> source product) be better?
>>
>> Thanks in advance.
>>
>> Regards,
>> Sourav
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
--Noble Paul
