That is trivial to do. I did it once when the tweets were in JSON format.
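
A minimal sketch of one way this could be done with Spark SQL, not necessarily the approach used at the time. It assumes Spark 1.6 built with Hive support and tweets stored as one JSON object per line. The input path and the table name are hypothetical placeholders; the selected columns are standard fields of the Twitter JSON payload.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TweetsToHive {
  def main(args: Array[String]): Unit = {
    // Hypothetical input location: one JSON-encoded tweet per line.
    val inputPath = if (args.nonEmpty) args(0) else "hdfs:///tmp/tweets/json"

    val sc = new SparkContext(new SparkConf().setAppName("TweetsToHive"))
    val hiveContext = new HiveContext(sc)

    // Spark infers the schema from the JSON records themselves.
    val tweets = hiveContext.read.json(inputPath)

    // Keep a handful of fields; "user.screen_name" is a nested field
    // and ends up as a column named "screen_name".
    tweets
      .select("id", "created_at", "text", "lang", "user.screen_name")
      .write
      .mode("append")
      .saveAsTable("tweets") // hypothetical table name

    sc.stop()
  }
}
```

Once the table exists it can be queried from Hive or from Spark SQL like any other table. On Spark 2.x, a SparkSession created with enableHiveSupport() takes the place of HiveContext.
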
> On 08 Jun 2016, at 13:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Interesting. There is also Apache NiFi.
>
> Also, I note that one can store Twitter data in Hive tables as well?
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>> On 7 June 2016 at 15:59, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>> Thanks, I will have a look.
>>
>> Mich
>>
>>> On 7 June 2016 at 13:38, Jörn Franke <jornfra...@gmail.com> wrote:
>>> Solr is basically an in-memory text index with a lot of capabilities for
>>> language analysis and extraction (you can compare it to a Google for your
>>> tweets). The system itself has a lot of features and has a complexity
>>> similar to big data systems. The index files can be backed by HDFS. You
>>> can put the tweets directly into Solr without going via HDFS files.
>>>
>>> Carefully decide which fields you want to index and search. It does not
>>> make sense to index everything.
>>>
>>>> On 07 Jun 2016, at 13:51, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> OK, so basically, for offline predictive work (as opposed to streaming),
>>>> one can in a nutshell use Apache Flume to store Twitter data in HDFS and
>>>> use Solr to query the data?
>>>>
>>>> This is what it says:
>>>>
>>>> Solr is a standalone enterprise search server with a REST-like API. You
>>>> put documents in it (called "indexing") via JSON, XML, CSV or binary over
>>>> HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary
>>>> results.
>>>>
>>>> Thanks
>>>>
>>>>> On 7 June 2016 at 12:39, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Well, I have seen that the algorithms mentioned are used for this.
>>>>> However, some preprocessing through Solr makes sense: it takes care of
>>>>> synonyms, homonyms, stemming, etc.
>>>>>
>>>>>> On 07 Jun 2016, at 13:33, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Thanks Jorn,
>>>>>>
>>>>>> To start, I would like to explore how one can turn some of the data
>>>>>> into useful information.
>>>>>>
>>>>>> I would like to look at certain trend analysis. Simple correlation shows
>>>>>> that the more often a given topic is mentioned, say "organic food", the
>>>>>> more people are inclined to go for it. From this, one could deduce that
>>>>>> organic food is a potential growth area.
>>>>>>
>>>>>> Now, I have all the infrastructure to ingest that data, like using Flume
>>>>>> to store it or Spark Streaming to do near-real-time work.
>>>>>>
>>>>>> Now I want to slice and dice that data for, say, organic food.
>>>>>>
>>>>>> I presume this is a typical question.
>>>>>>
>>>>>> You mentioned Spark ML (machine learning?). Is that something viable?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>>> On 7 June 2016 at 12:22, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>> Spark ML support vector machines or neural networks could be
>>>>>>> candidates.
>>>>>>> For unsupervised learning it could be clustering.
>>>>>>> For doing a graph analysis on the followers you can easily use Spark
>>>>>>> GraphX.
>>>>>>> Keep in mind that each tweet contains a lot of metadata (location,
>>>>>>> followers, etc.) that is more or less structured.
>>>>>>> For unstructured text analytics (e.g. the tweet itself) I recommend
>>>>>>> Solr/Elasticsearch.
>>>>>>>
>>>>>>> However, I am not sure what you want to do with the data exactly.
>>>>>>>
>>>>>>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This is really a general question.
>>>>>>>>
>>>>>>>> I use Spark to get Twitter data. I did some looking at it:
>>>>>>>>
>>>>>>>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>>>>>>>> val tweets = TwitterUtils.createStream(ssc, None)
>>>>>>>> val statuses = tweets.map(status => status.getText())
>>>>>>>> statuses.print()
>>>>>>>>
>>>>>>>> Ok
>>>>>>>>
>>>>>>>> Also, I can use Apache Flume to store data in an HDFS directory:
>>>>>>>>
>>>>>>>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
>>>>>>>> -Dflume.root.logger=DEBUG,console -n TwitterAgent
>>>>>>>>
>>>>>>>> That stores the Twitter data in binary format in the HDFS directory.
>>>>>>>>
>>>>>>>> My question is pretty basic.
>>>>>>>>
>>>>>>>> What is the best tool/language to dig into that data, for example
>>>>>>>> Twitter streaming data? I am getting all sorts of stuff coming in. Say
>>>>>>>> I am only interested in certain topics like sport etc. How can I
>>>>>>>> detect the signal from the noise, and with what tool and language?
>>>>>>>>
>>>>>>>> Thanks