This is a bit old, but it provides good information for schema design: http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php
Found this link as well: https://gist.github.com/702360

The field types may depend on your search requirements.

Regards,
Anuj

On Mon, May 28, 2012 at 7:21 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Other obvious metadata from the Twitter API to index would be hashtags,
> user mentions (both the user id/screen name and user name), date/time, URLs
> mentioned (expanded if a URL shortener is used), and possibly coordinates
> for spatial search.
>
> You would have to add all these fields and values yourself in your Solr
> input document. Tika can't help you there.
>
> Although, I imagine quite a few people have already done this quite a few
> times before, so maybe somebody could contribute their Twitter Solr schema.
> Anybody?
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: David Radunz
> Sent: Monday, May 28, 2012 8:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: indexing unstructured text (tweets)
>
> Hey,
>
> I think you might be over-thinking this. Tweets are structured. You
> have the content (tweet), the user who tweeted it and various other
> metadata. So your 'document' might look like this:
>
> <add>
>   <doc>
>     <field name="tweetId">ABCD1234</field>
>     <field name="tweet">I bought some apples</field>
>     <field name="user">JohnnyBoy</field>
>   </doc>
> </add>
>
> To get this structure, you can use any programming language you're
> comfortable with and load it into Solr via various means. Obviously you
> can add more 'meta' fields that you get from Twitter if you want as well.
>
> David
>
> On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:
>
>> Hi all.
>>
>> I am in the process of setting up Solr for my application,
>> which is full text search on a bunch of tweets from Twitter.
>>
>> I am afraid I am missing something.
>> From the book I am reading, "Apache Solr 3 Enterprise Search Server",
>> it looks like Solr works with structured input, like XML or CSV,
>> while I have the most wild and unstructured input ever (tweets).
>>
>> A section named "Indexing documents with Solr Cell" seems to address my
>> problem,
>> but also shows that before getting to Solr, I might need to use
>> another Apache tool called Tika.
>>
>> Can anybody provide a brief explanation of the general picture?
>> Can I index my tweets with Solr?
>> Or do I also need to put Tika in my pipeline?
>>
>> Best regards,
>> Giovanni Gherdovich
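
For what it's worth, the <add><doc> structure David describes, extended with some of the metadata Jack lists, could be generated with a few lines of script. What follows is only a rough sketch in Python, not a tested schema: the field names tweetId, tweet and user come from David's example, while createdAt, hashtag, mention and url are hypothetical names that would need matching (multi-valued) field definitions in schema.xml. The input dict is a simplified, made-up shape, not the actual Twitter API payload.

```python
import xml.etree.ElementTree as ET


def tweet_to_solr_xml(tweet):
    """Build a Solr <add><doc> XML string from a simplified tweet dict.

    The dict keys (id, text, user, created_at, hashtags, mentions, urls)
    are assumptions made for this sketch, not the Twitter API format.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def add_field(name, value):
        # ElementTree escapes special characters such as & and < in the text.
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)

    # Fields taken from the example earlier in the thread.
    add_field("tweetId", tweet["id"])
    add_field("tweet", tweet["text"])
    add_field("user", tweet["user"])

    # Hypothetical extra fields; each needs a definition in schema.xml.
    if "created_at" in tweet:
        add_field("createdAt", tweet["created_at"])

    # Multi-valued fields: emit one <field> element per value.
    for tag in tweet.get("hashtags", []):
        add_field("hashtag", tag)
    for mention in tweet.get("mentions", []):
        add_field("mention", mention)
    for url in tweet.get("urls", []):
        add_field("url", url)

    return ET.tostring(add, encoding="unicode")


example = {
    "id": "ABCD1234",
    "text": "I bought some apples & pears #fruit #shopping",
    "user": "JohnnyBoy",
    "created_at": "2012-05-28T12:00:00Z",
    "hashtags": ["fruit", "shopping"],
    "mentions": [],
    "urls": [],
}

solr_xml = tweet_to_solr_xml(example)
print(solr_xml)
```

The resulting string is what you would send to Solr's update handler. Whether each field should be a string, text or date type depends on how you want to search it, as noted at the top of this message.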