Interesting. There is also Apache NiFi <https://nifi.apache.org/>
Also, I note that one can store Twitter data in Hive tables as well?

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 7 June 2016 at 15:59, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Thanks, I will have a look.
>
> Mich
>
> On 7 June 2016 at 13:38, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Solr is basically an in-memory text index with a lot of capabilities for
>> language analysis extraction (you can compare it to a Google for your
>> tweets). The system itself has a lot of features and has a complexity
>> similar to big data systems. Its index files can be backed by HDFS. You
>> can put the tweets directly into Solr without going via HDFS files.
>>
>> Carefully decide which fields you want to index and search. It does not
>> make sense to index everything.
>>
>> On 07 Jun 2016, at 13:51, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> OK. So basically, for off-line predictive analysis (as opposed to
>> streaming), in a nutshell one can use Apache Flume to store Twitter data
>> in HDFS and use Solr to query the data?
>>
>> This is what it says:
>>
>> Solr is a standalone enterprise search server with a REST-like API. You
>> put documents in it (called "indexing") via JSON, XML, CSV or binary over
>> HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary
>> results.
>>
>> Thanks
>>
>> On 7 June 2016 at 12:39, Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Well, I have seen the algorithms mentioned being used for this.
>>> However, some preprocessing through Solr makes sense: it takes care of
>>> synonyms, homonyms, stemming, etc.
>>>
>>> On 07 Jun 2016, at 13:33, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Thanks Jorn,
>>>
>>> To start, I would like to explore how one can turn some of the data into
>>> useful information.
>>>
>>> I would like to look at certain trend analysis. Simple correlation shows
>>> that the more mentions there are of a typical topic, say for example
>>> "organic food", the more people are inclined to go for it. So one can
>>> deduce that organic food is a potential growth area.
>>>
>>> Now I have all the infrastructure to ingest that data, like using Flume
>>> to store it or Spark Streaming to do near-real-time work.
>>>
>>> Now I want to slice and dice that data for, say, organic food.
>>>
>>> I presume this is a typical question.
>>>
>>> You mentioned Spark ML (machine learning?). Is that something viable?
>>>
>>> Cheers
>>>
>>> On 7 June 2016 at 12:22, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> Spark ML support vector machines or neural networks could be
>>>> candidates.
>>>> For unsupervised learning it could be clustering.
>>>> For doing a graph analysis on the followers you can easily use Spark
>>>> GraphX.
>>>> Keep in mind that each tweet contains a lot of metadata (location,
>>>> followers, etc.) that is more or less structured.
>>>> For unstructured text analytics (e.g. the tweet itself) I recommend
>>>> Solr/Elasticsearch.
>>>>
>>>> However, I am not sure what you want to do with the data exactly.
>>>>
>>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> This is really a general question.
>>>>
>>>> I use Spark to get Twitter data. I did some looking at it:
>>>>
>>>> import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>> import org.apache.spark.streaming.twitter.TwitterUtils
>>>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>>>> val tweets = TwitterUtils.createStream(ssc, None)
>>>> val statuses = tweets.map(status => status.getText())
>>>> statuses.print()
>>>>
>>>> OK.
>>>>
>>>> Also, I can use Apache Flume to store data in an HDFS directory:
>>>>
>>>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
>>>> -Dflume.root.logger=DEBUG,console -n TwitterAgent
>>>>
>>>> Now that stores Twitter data in binary format in an HDFS directory.
>>>>
>>>> My question is pretty basic.
>>>>
>>>> What is the best tool/language to dig into that data? Take the
>>>> Twitter streaming data, for example: I am getting all sorts of stuff
>>>> coming in. Say I am only interested in certain topics like sport, etc.
>>>> How can I detect the signal from the noise, and using what tool and
>>>> language?
>>>>
>>>> Thanks
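
On the original question of picking out topics such as sport from the incoming tweets, here is a minimal sketch in plain Scala, assuming a simple keyword match is good enough as a first pass. The names TopicFilter, tokens, matchesTopic and the keyword list are hypothetical and not from the thread; this does none of the synonym or stemming handling that Solr was suggested for.

```scala
// Hypothetical helper: the object name, method names and keywords are
// illustrative assumptions, not something taken from the thread.
object TopicFilter {
  // Lower-case the text and split on anything that is not a letter,
  // a digit or '#', so hashtags survive as single tokens.
  def tokens(text: String): Set[String] =
    text.toLowerCase.split("[^a-z0-9#]+").filter(_.nonEmpty).toSet

  // A tweet matches when it shares at least one token with the keyword set.
  def matchesTopic(text: String, keywords: Set[String]): Boolean =
    tokens(text).exists(keywords.contains)
}

object TopicFilterDemo extends App {
  // Assumed keyword list for the "sport" topic.
  val sportKeywords = Set("football", "cricket", "tennis", "#sport")

  // Made-up sample texts standing in for tweet bodies.
  val sample = Seq(
    "Great football match tonight!",
    "Trying a new organic food recipe",
    "Tennis final starts at 3pm #sport"
  )

  // Keep only the texts that mention at least one sport keyword.
  val sportTweets = sample.filter(TopicFilter.matchesTopic(_, sportKeywords))
  sportTweets.foreach(println)
}
```

Since a DStream also has a filter method, the same predicate could presumably be applied to the stream above, along the lines of statuses.filter(text => TopicFilter.matchesTopic(text, sportKeywords)). Exact keyword matching will miss plurals and misspellings, which is where the Solr preprocessing mentioned earlier would come in.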