Pritam,

Let me rephrase what Bob had to say. His point has some merit, but it probably also has a bit more sting than it needs.
The first question to ask in any textual analysis project is what kind of data you are likely to have. How will the data be presented to you? At one extreme there is the Twitter API, with a very well specified data format and lots of well-coded metadata. At the other extreme there are patient notes in raw image form: hand-written data with no transcriptions and possibly very little metadata. As you can imagine, the tasks you need to do at each extreme are very, very different.

Another key aspect of your data is how big it really is. If you only have millions of examples, then big data tooling is going to be just a hindrance, not a help. If you have billions of text examples, then big data tooling may become a requirement.

Beyond the data source, you need to look at what kind of analysis you need to do. In particular, it is likely that you will need some sort of statistical analysis of the data. You might be looking for indicators of particular test results that show up in social media. Or you might be trying to predict cases of misdiagnosis. In either case, Drill (or Hive) would only be useful for counting up the cases that have specific features. Finding the features and interpreting the counts you produce would require other software.

This means that a SQL system like Drill or Hive will have a very minor role in your analysis. Indeed, many systems that are good for data reduction (like R or Spark) can do all the counting that Drill or Hive can do.

I hope this helps.

On Wed, Jun 7, 2017 at 3:32 AM, Bob Rudis <[email protected]> wrote:

> You should likely spend some time studying statistics and machine
> learning then examine the pluses and minuses of a few "data
> science"-oriented programming languages and focus on one that has
> idioms that make sense to you. Then you'll see just how inappropriate
> your question is.
>
> On Tue, Jun 6, 2017 at 8:07 AM, Pritam Tambe <[email protected]> wrote:
> > Dear Sir,
> >
> > I want to do Social Media Data analysis for Health Domain using Big Data.
> >
> > I am confused whether to go for Apache Drill or HIVE.
> >
> > Please Guide.
> >
> > --
> > Thanks & Regards,
> > Pritam Tambe,
> > Project Engineer - AAI Group,
> > Centre for Development of Advanced Computing [C-DAC],
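
As a rough illustration of the counting step discussed in the reply above, here is what such a tally might look like in plain Python. The sample posts and the two keyword-based feature detectors are made up for this sketch and are not from any real dataset. In a real project, choosing and validating the features is where the work lies. The counting itself is about this small, whichever tool does it.

```python
from collections import Counter

# Hypothetical sample posts; real data would come from your actual source.
posts = [
    {"text": "My blood test came back normal"},
    {"text": "Doctor misdiagnosed my rash twice"},
    {"text": "Waiting on test results, worried about a misdiagnosis"},
    {"text": "Lovely weather today"},
]

# Hypothetical feature detectors (naive keyword matching, for illustration only).
FEATURES = {
    "mentions_test": lambda post: "test" in post["text"].lower(),
    "mentions_misdiagnosis": lambda post: "misdiagnos" in post["text"].lower(),
}


def count_features(posts, features):
    """Count how many posts exhibit each named feature."""
    counts = Counter()
    for post in posts:
        for name, detect in features.items():
            if detect(post):
                counts[name] += 1
    return counts


counts = count_features(posts, FEATURES)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

A SQL engine would express the same tally as a `GROUP BY` with `COUNT(*)`. Neither version says anything about which features matter or what the counts mean.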
