Re: Analyzing twitter data

Jörn Franke Wed, 08 Jun 2016 07:12:55 -0700

You can load it directly into Solr, but think first about which fields
you actually want to index.
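
As a rough illustration of choosing what to index, here is a minimal sketch that trims each tweet down to a handful of fields and serializes the result as a JSON array, which is a shape Solr's `/update` handler accepts. The field names are taken from the Avro schema in the quoted dump below; the particular selection in `INDEXED_FIELDS`, the helper names, and the sample record are assumptions made up for this example, not a recommended schema.

```python
import json

# Fields chosen for indexing (an assumption for illustration); the names
# come from the Avro schema shown in the quoted message below.
INDEXED_FIELDS = ("id", "text", "created_at", "user_screen_name", "user_location")


def to_solr_doc(tweet):
    """Keep only the fields worth indexing and drop null values."""
    return {name: tweet[name] for name in INDEXED_FIELDS if tweet.get(name) is not None}


def to_update_payload(tweets):
    """Serialize tweets as a JSON array of documents for Solr's update handler."""
    return json.dumps([to_solr_doc(t) for t in tweets], ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical sample record, not taken from the real data set.
    sample = [{
        "id": "1",
        "text": "Trying organic food this week",
        "created_at": "2016-06-03T10:11:13Z",
        "user_screen_name": "example_user",
        "user_location": None,
        "user_friends_count": 42,
        "retweet_count": 0,
    }]
    print(to_update_payload(sample))
```

Counters such as `user_friends_count` are left out here on purpose: they may be useful for filtering or ranking, but whether they belong in the index depends on the queries you plan to run.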



> On 08 Jun 2016, at 15:51, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Yes, that use is reasonable.
> 
> What is the format of the Twitter data? Is it primarily JSON?
> 
> If I do
> 
> duser@rhes564: /usr/lib/nifi-0.6.1/conf> hdfs dfs -cat /twitter_data/FlumeData.1464945101915|more
> 
> 16/06/08 14:48:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> {"type":"record","name":"Doc","doc":"adoc","fields":[
> {"name":"id","type":"string"},
> {"name":"user_friends_count","type":["int","null"]},
> {"name":"user_location","type":["string","null"]},
> {"name":"user_description","type":["string","null"]},
> {"name":"user_statuses_count","type":["int","null"]},
> {"name":"user_followers_count","type":["int","null"]},
> {"name":"user_name","type":["string","null"]},
> {"name":"user_screen_name","type":["string","null"]},
> {"name":"created_at","type":["string","null"]},
> {"name":"text","type":["string","null"]},
> {"name":"retweet_count","type":["long","null"]},
> {"name":"retweeted","type":["boolean","null"]},
> {"name":"in_reply_to_user_id","type":["long","null"]},
> {"name":"source","type":["string","null"]},
> {"name":"in_reply_to_status_id","type":["long","null"]},
> {"name":"media_url_https","type":["string","null"]},
> {"name":"expanded_url","type":["string","null"]}]}
> ▒ぷろが説明書✋http://twpf.jp/960_krm
> 8659292711026688
> ▒便座カバー
> 男児「こちらロストボーイ1、"施設"に侵入した。セキュリティが厄介で数日かかるな。偽装工作はうまくいっているか？」
> 父親「あぁ今じゃ立派なダメ親父と可哀想な子供扱いだ。これで万一見つかってもお前は安心さ」
> 男…
> ter.com" rel="nofollow">Twitter Web Client</a>s, and fun!
> Learning a new la... https://t.co/ejHfRcAucy
> 3-2 7番 代議員
> 木村亮太
> 知ってる人はRT
> ちびまる子ちゃんのEDですね！
> この曲は12年前になります。 https://t.co/LfLZ8xX5u9
> twitter.com/hijidora/status/737980634858029056/video/1$738659292677431296
> (ライトニングさんティナラムザアダマンＡ)/かんこれ/EXVS(モチベ↑雑魚後衛)/FGOその他適当
> ▒<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
> itter.com/download/android" rel="nofollow">Twitter for Android</a> Free Lyft 
> credit with Lyft promo code LYFTLUSHpp.com" rel="nofollow">Buffer</a>
> naliar for iPad</a>
> third person
> 男子南ことりが大好きなラブライバーです！　ラブライブ大好きな人ぜひフォローしてください
> 固定ツイートお願いします
> ラブライブに出会えて良かった！
> 9人のみんなのこと忘れない
> #LoveLiveforever
> #ラブライバーと繋がりたいRT https://t.co/kITPDLER9x
> 07114803986434/photo/1$738659292685979648
> :13Z://pbs.twimg.com/media/CkA-exTWYAAK8TU.jpg
> : 1000RT：【資金不足】「学園ハンサム」、クラウドファンディングでアニメ化支援を募集
> https://t.co/CVM2F7rNt1
> 放送局やキャストは「支援額に応じて変わる」とのこと。時期は10月から1クールと発表されている。 http…
> com/media/CkAftVyUYAA0nmn.jpg-06-03T10:11:13Z
> miga, sutiã que é do dia a dia ela só usa de ser obrigada, acha mesmo q ela 
> compraria mais d…r Promoter | Worked with @inkmonstarr @breadboi @ayookd 
> @chapobandz and more | PayPal accepted | DM for beats | Beats Starting at $10 
> |resenting August Redmoon at the Hollywood premiere of Inside Metal: The 
> Metal Scene Explodes! 🤘🏼🤘🏼🤘🏼🤘🏼🎸🎸 https://t.…
> .jpg
> 
> -03T10:11:13Z
>  
> I assume it is all JSON data. So can I use Solr to build an index on these
> files and search them?
> 
> Or, alternatively, use it as a staging area for a Hive table?
> 
> 
> thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 8 June 2016 at 14:01, Jörn Franke <jornfra...@gmail.com> wrote:
>> What you should also look at is ingestion capacity. If you have a lot of
>> irregular writes, such as from sensor data, it can make sense to store
>> them first in HBase and flush them regularly to ORC/Parquet Hive tables
>> for analysis.
>> 
>>> On 08 Jun 2016, at 13:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>> 
>>> Interesting. There is also Apache NiFi.
>>> 
>>> I also note that one can store Twitter data in Hive tables as well?
>>> 
>>> 
>>> 
>>> 
>>>> On 7 June 2016 at 15:59, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>> Thanks, I will have a look.
>>>> 
>>>> Mich
>>>> 
>>>> 
>>>>> On 7 June 2016 at 13:38, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>> Solr is basically an in-memory text index with a lot of capabilities for
>>>>> language analysis and extraction (you can compare it to a Google for your
>>>>> tweets). The system itself has a lot of features, and its complexity is
>>>>> similar to that of big data systems. Its index files can be stored on
>>>>> HDFS. You can put the tweets directly into Solr without going via HDFS
>>>>> files.
>>>>> 
>>>>> Decide carefully which fields you want to index and search. It does not
>>>>> make sense to index everything.
>>>>> 
>>>>>> On 07 Jun 2016, at 13:51, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>>>> wrote:
>>>>>> 
>>>>>> OK. So, in a nutshell, for predictive offline analysis (as opposed to
>>>>>> streaming), one can use Apache Flume to store Twitter data in HDFS and
>>>>>> use Solr to query the data?
>>>>>> 
>>>>>> This is what it says:
>>>>>> 
>>>>>> Solr is a standalone enterprise search server with a REST-like API. You 
>>>>>> put documents in it (called "indexing") via JSON, XML, CSV or binary 
>>>>>> over HTTP. You query it via HTTP GET and receive JSON, XML, CSV or 
>>>>>> binary results.
>>>>>> 
>>>>>> thanks
>>>>>> 
>>>>>> 
>>>>>>> On 7 June 2016 at 12:39, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>> Well, I have seen the algorithms mentioned being used for this.
>>>>>>> However, some preprocessing through Solr makes sense: it takes care of
>>>>>>> synonyms, homonyms, stemming, etc.
>>>>>>> 
>>>>>>>> On 07 Jun 2016, at 13:33, Mich Talebzadeh <mich.talebza...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>> Thanks Jorn,
>>>>>>>> 
>>>>>>>> To start, I would like to explore how one can turn some of the data
>>>>>>>> into useful information.
>>>>>>>> 
>>>>>>>> I would like to look at certain kinds of trend analysis. Simple
>>>>>>>> correlation shows that the more a topic is mentioned, say "organic
>>>>>>>> food", the more people are inclined to go for it. From this, one can
>>>>>>>> deduce that organic food is a potential growth area.
>>>>>>>> 
>>>>>>>> I now have all the infrastructure to ingest that data, such as Flume
>>>>>>>> to store it or Spark Streaming to do near-real-time work.
>>>>>>>> 
>>>>>>>> Now I want to slice and dice that data for, say, organic food.
>>>>>>>> 
>>>>>>>> I presume this is a typical question.
>>>>>>>> 
>>>>>>>> You mentioned Spark ML (machine learning?). Is that a viable option?
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 7 June 2016 at 12:22, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>>>>> Spark ML support vector machines or neural networks could be
>>>>>>>>> candidates.
>>>>>>>>> For unsupervised learning, it could be clustering.
>>>>>>>>> For graph analysis on the followers, you can easily use Spark GraphX.
>>>>>>>>> Keep in mind that each tweet contains a lot of metadata (location,
>>>>>>>>> followers, etc.) that is more or less structured.
>>>>>>>>> For unstructured text analytics (e.g. the tweet itself), I recommend
>>>>>>>>> Solr/Elasticsearch.
>>>>>>>>> 
>>>>>>>>> However I am not sure what you want to do with the data exactly.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh 
>>>>>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> This is really a general question.
>>>>>>>>>> 
>>>>>>>>>> I use Spark to get Twitter data. I had a quick look at it:
>>>>>>>>>> 
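>>>>>>>>>>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>>>>>>>>>>     import org.apache.spark.streaming.twitter.TwitterUtils
>>>>>>>>>>     val ssc = new StreamingContext(sparkConf, Seconds(2))  // 2-second batches; sparkConf is defined elsewhere
>>>>>>>>>>     val tweets = TwitterUtils.createStream(ssc, None)      // None: OAuth credentials come from twitter4j system properties
>>>>>>>>>>     val statuses = tweets.map(status => status.getText())  // keep only the tweet text
>>>>>>>>>>     statuses.print()
>>>>>>>>>>     ssc.start()             // nothing is processed until the context is started
>>>>>>>>>>     ssc.awaitTermination()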
>>>>>>>>>> 
>>>>>>>>>> Ok
>>>>>>>>>> 
>>>>>>>>>> I can also use Apache Flume to store the data in an HDFS directory:
>>>>>>>>>> 
>>>>>>>>>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent
>>>>>>>>>> 
>>>>>>>>>> That stores the Twitter data in binary format in the HDFS directory.
>>>>>>>>>> 
>>>>>>>>>> My question is pretty basic.
>>>>>>>>>> 
>>>>>>>>>> What is the best tool/language to dig into that data, for example
>>>>>>>>>> Twitter streaming data? I am getting all sorts of stuff coming in.
>>>>>>>>>> Say I am only interested in certain topics, like sport. How can I
>>>>>>>>>> separate the signal from the noise, and with what tool and language?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>
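
On the original question at the bottom of the thread (picking out topics such as sport from everything that comes in), a first pass could be a plain keyword filter on the tweet text, before any Solr or Spark ML work. The sketch below is only that: the keyword list and the function names are made up for illustration, and it does no stemming or synonym handling, which is exactly the part Solr's text analysis would add.

```python
import re

# Hypothetical keyword list for the "sport" topic; tune it for real data.
SPORT_KEYWORDS = ("football", "tennis", "cricket", "olympics")


def build_matcher(keywords):
    """Compile one case-insensitive, whole-word pattern for all keywords."""
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def filter_by_topic(texts, keywords):
    """Return only the texts that mention at least one keyword."""
    matcher = build_matcher(keywords)
    return [t for t in texts if matcher.search(t)]


if __name__ == "__main__":
    # Made-up sample tweets, not taken from the real stream.
    tweets = [
        "Great football match tonight",
        "Learning a new language",
        "TENNIS final today",
    ]
    for text in filter_by_topic(tweets, SPORT_KEYWORDS):
        print(text)
```

The same predicate could be applied to the `statuses` stream from the original post with a `filter` call, so that only on-topic tweets are stored or indexed.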
