You can load it directly into Solr, but first think about which fields you actually want to index.
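As a rough sketch of what choosing the fields to index could look like, the snippet below keeps a handful of fields per tweet and builds the JSON body for Solr's update handler. The field names come from the schema in the quoted dump. The particular selection of fields, the core name "tweets" and the localhost URL are assumptions for illustration only, and the request is built but never sent. Note also that the quoted file looks like an Avro container (a JSON schema header followed by binary records) rather than plain JSON, so the records would have to be decoded before a step like this.

```python
import json
from urllib.request import Request

# Hypothetical selection of fields worth indexing; the names come from the
# schema shown in the quoted dump, the selection itself is an assumption.
INDEXED_FIELDS = ("id", "user_screen_name", "user_location", "created_at", "text")

# Assumed Solr host and core name ("tweets"); adjust for a real installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/tweets/update?commit=true"


def to_solr_doc(tweet):
    """Keep only the fields we want to index, dropping null values."""
    return {k: tweet[k] for k in INDEXED_FIELDS if tweet.get(k) is not None}


def build_update_request(tweets):
    """Build (but do not send) a POST carrying a JSON array of documents."""
    body = json.dumps([to_solr_doc(t) for t in tweets]).encode("utf-8")
    return Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up sample record, not taken from the real data.
sample = {
    "id": "1",
    "user_screen_name": "example_user",
    "user_location": None,
    "user_followers_count": 42,
    "created_at": "2016-06-03T10:11:13Z",
    "text": "organic food is great",
}

req = build_update_request([sample])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would actually send it. Fields left out of INDEXED_FIELDS (here user_followers_count) never reach the index, which is the point of deciding up front.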
> On 08 Jun 2016, at 15:51, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Yes, using that is reasonable.
>
> What is the format of the Twitter data? Is it primarily JSON?
>
> If I do
>
> duser@rhes564: /usr/lib/nifi-0.6.1/conf> hdfs dfs -cat /twitter_data/FlumeData.1464945101915|more
>
> 16/06/08 14:48:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> {"type":"record","name":"Doc","doc":"adoc","fields":[{"name":"id","type":"string"},{"name":"user_friends_count","type":["int","null"]},{"name":"user_location","type":["string","null"]},{"name":"user_description","type":["string","null"]},{"name":"user_statuses_count","type":["int","null"]},{"name":"user_followers_count","type":["int","null"]},{"name":"user_name","type":["string","null"]},{"name":"user_screen_name","type":["string","null"]},{"name":"created_at","type":["string","null"]},{"name":"text","type":["string","null"]},{"name":"retweet_count","type":["long","null"]},{"name":"retweeted","type":["boolean","null"]},{"name":"in_reply_to_user_id","type":["long","null"]},{"name":"source","type":["string","null"]},{"name":"in_reply_to_status_id","type":["long","null"]},{"name":"media_url_https","type":["string","null"]},{"name":"expanded_url","type":["string","null"]}]}
> [... binary record data: tweet text, user descriptions, client source strings, media URLs and timestamps, unreadable as printed ...]
>
> I assume it is all JSON data. So can I use Solr to build an index on these
> files and search them?
>
> Or alternatively, use them as a staging area for a Hive table?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
>> On 8 June 2016 at 14:01, Jörn Franke <jornfra...@gmail.com> wrote:
>> I mean what you should also look at is ingestion capacity.
>> If you have lots of irregular writes, such as from sensor data, it can
>> make sense to store them first in HBase and flush them regularly to
>> ORC/Parquet Hive tables for analysis.
>>
>>> On 08 Jun 2016, at 13:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Interesting. There is also Apache NiFi.
>>>
>>> Also, I note that one can store Twitter data in Hive tables as well?
>>>
>>>> On 7 June 2016 at 15:59, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> Thanks, I will have a look.
>>>>
>>>> Mich
>>>>
>>>>> On 7 June 2016 at 13:38, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Solr is basically an in-memory text index with a lot of capabilities
>>>>> for language analysis and extraction (you can compare it to a Google
>>>>> for your tweets). The system itself has a lot of features and a
>>>>> complexity similar to Big Data systems. Its index files can be backed
>>>>> by HDFS. You can put the tweets directly into Solr without going via
>>>>> HDFS files.
>>>>>
>>>>> Decide carefully which fields to index, i.e. which ones you want to
>>>>> search. It does not make sense to index everything.
>>>>>
>>>>>> On 07 Jun 2016, at 13:51, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> OK. So basically, for predictive off-line work (as opposed to
>>>>>> streaming), in a nutshell one can use Apache Flume to store Twitter
>>>>>> data in HDFS and use Solr to query the data?
>>>>>>
>>>>>> This is what it says:
>>>>>>
>>>>>> Solr is a standalone enterprise search server with a REST-like API. You
>>>>>> put documents in it (called "indexing") via JSON, XML, CSV or binary
>>>>>> over HTTP.
>>>>>> You query it via HTTP GET and receive JSON, XML, CSV or binary results.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>> On 7 June 2016 at 12:39, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>> Well, I have seen the algorithms mentioned used for this.
>>>>>>> However, some preprocessing through Solr makes sense: it takes care
>>>>>>> of synonyms, homonyms, stemming etc.
>>>>>>>
>>>>>>>> On 07 Jun 2016, at 13:33, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Thanks Jorn,
>>>>>>>>
>>>>>>>> To start, I would like to explore how one can turn some of the data
>>>>>>>> into useful information.
>>>>>>>>
>>>>>>>> I would like to look at certain kinds of trend analysis. Simple
>>>>>>>> correlation shows that the more a typical topic is mentioned, say
>>>>>>>> "organic food", the more people are inclined to go for it. From
>>>>>>>> that one can deduce that organic food is a potential growth area.
>>>>>>>>
>>>>>>>> Now I have all the infrastructure to ingest that data, such as
>>>>>>>> using Flume to store it or Spark Streaming to do near-real-time work.
>>>>>>>>
>>>>>>>> Now I want to slice and dice that data for, say, organic food.
>>>>>>>>
>>>>>>>> I presume this is a typical question.
>>>>>>>>
>>>>>>>> You mentioned Spark ML (machine learning?). Is that something viable?
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>> On 7 June 2016 at 12:22, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>> Spark ML support vector machines or neural networks could be
>>>>>>>>> candidates.
>>>>>>>>> For unsupervised learning it could be clustering.
>>>>>>>>> For graph analysis on the followers you can easily use Spark
>>>>>>>>> GraphX.
>>>>>>>>> Keep in mind that each tweet contains a lot of metadata (location,
>>>>>>>>> followers etc.) that is more or less structured.
>>>>>>>>> For unstructured text analytics (e.g. the tweet itself) I recommend
>>>>>>>>> Solr/Elasticsearch.
>>>>>>>>>
>>>>>>>>> However, I am not sure what exactly you want to do with the data.
>>>>>>>>>
>>>>>>>>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This is really a general question.
>>>>>>>>>>
>>>>>>>>>> I use Spark to get Twitter data. I had a quick look at it:
>>>>>>>>>>
>>>>>>>>>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>>>>>>>>>> val tweets = TwitterUtils.createStream(ssc, None)
>>>>>>>>>> val statuses = tweets.map(status => status.getText())
>>>>>>>>>> statuses.print()
>>>>>>>>>>
>>>>>>>>>> OK.
>>>>>>>>>>
>>>>>>>>>> I can also use Apache Flume to store the data in an HDFS directory:
>>>>>>>>>>
>>>>>>>>>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
>>>>>>>>>>
>>>>>>>>>> That stores the Twitter data in binary format in the HDFS directory.
>>>>>>>>>>
>>>>>>>>>> My question is pretty basic.
>>>>>>>>>>
>>>>>>>>>> What is the best tool/language to dig into that data, for example
>>>>>>>>>> Twitter streaming data? I am getting all sorts of stuff coming in.
>>>>>>>>>> Say I am only interested in certain topics like sport etc. How can
>>>>>>>>>> I detect the signal from the noise, and using what tool and
>>>>>>>>>> language?
>>>>>>>>>>
>>>>>>>>>> Thanks
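For the original question about picking one topic out of the stream and seeing how often it is mentioned, a minimal sketch might look like the following. It assumes the tweets have already been decoded into dictionaries with the "text" and "created_at" fields from the schema quoted earlier in the thread, and that timestamps are ISO-8601 strings. The function name and the sample records are made up for illustration. The matching is plain case-insensitive substring search, so it does none of the synonym or stemming handling that Solr would provide.

```python
from collections import Counter


def mentions_per_day(tweets, phrase):
    """Count, per calendar day, the tweets whose text contains `phrase`."""
    needle = phrase.lower()
    counts = Counter()
    for tweet in tweets:
        text = tweet.get("text") or ""
        created = tweet.get("created_at") or ""
        # Assumes ISO-8601 timestamps such as 2016-06-03T10:11:13Z,
        # so the first 10 characters are the date.
        if needle in text.lower() and len(created) >= 10:
            counts[created[:10]] += 1
    return dict(sorted(counts.items()))


# Made-up sample records using field names from the thread's schema.
sample_tweets = [
    {"created_at": "2016-06-03T10:11:13Z", "text": "Trying organic food this week"},
    {"created_at": "2016-06-03T18:02:00Z", "text": "Organic Food market downtown"},
    {"created_at": "2016-06-04T09:30:00Z", "text": "Football tonight"},
    {"created_at": "2016-06-04T12:00:00Z", "text": None},
    {"created_at": "2016-06-05T08:15:00Z", "text": "organic food prices are up"},
]

print(mentions_per_day(sample_tweets, "organic food"))
```

Daily counts like these are only a starting point for the trend analysis discussed above; the same grouping could be done in Spark or Hive once the volume outgrows a single machine.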