Re: indexing unstructured text (tweets)

Anuj Kumar Mon, 28 May 2012 07:05:00 -0700

This is a bit old, but it provides good information for schema design:
http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php


I found this link as well: https://gist.github.com/702360

The field types will depend on your search requirements.
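
To make this concrete, here is a minimal Python sketch of how one tweet could be
flattened into a Solr input document, following the <add><doc> layout from
David's message quoted below. It is only an illustration under assumptions: the
JSON keys (id_str, text, created_at, user.screen_name, entities.*) follow the
tweet layout described in the links above, and the Solr field names createdAt,
hashtag, mention and url are hypothetical. They would have to exist in your own
schema.xml, with the last three declared as multiValued.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def tweet_to_fields(tweet):
    """Flatten one tweet (a dict parsed from JSON) into (field, value) pairs."""
    entities = tweet.get("entities", {})
    # Twitter-style timestamp, e.g. "Mon May 28 14:05:00 +0000 2012".
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    fields = [
        ("tweetId", tweet["id_str"]),
        ("tweet", tweet["text"]),
        ("user", tweet["user"]["screen_name"]),
        # Solr date fields expect ISO 8601 in UTC with a trailing "Z".
        ("createdAt",
         created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
    ]
    # Hypothetical multi-valued fields: one entry per hashtag, mention, URL.
    fields += [("hashtag", h["text"]) for h in entities.get("hashtags", [])]
    fields += [("mention", m["screen_name"])
               for m in entities.get("user_mentions", [])]
    # Prefer the expanded URL when a shortener was used.
    fields += [("url", u.get("expanded_url") or u["url"])
               for u in entities.get("urls", [])]
    return fields


def to_solr_add_xml(tweets):
    """Build an <add> message with one <doc> per tweet."""
    add = ET.Element("add")
    for tweet in tweets:
        doc = ET.SubElement(add, "doc")
        for name, value in tweet_to_fields(tweet):
            # ElementTree escapes &, < and > in the text for us.
            ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


sample = {
    "id_str": "ABCD1234",
    "text": "I bought some apples #fruit",
    "created_at": "Mon May 28 14:05:00 +0000 2012",
    "user": {"screen_name": "JohnnyBoy"},
    "entities": {"hashtags": [{"text": "fruit"}],
                 "user_mentions": [], "urls": []},
}

if __name__ == "__main__":
    print(to_solr_add_xml([sample]))
```

The resulting string can be posted to Solr's XML update handler, followed by a
commit. Whether hashtags and mentions should be plain string fields or analyzed
text fields is exactly the part that depends on the search requirements.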

Regards,
Anuj

On Mon, May 28, 2012 at 7:21 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Other obvious metadata from the Twitter API to index would be hashtags,
> user mentions (both the user id/screen name and user name), date/time, URLs
> mentioned (expanded if a URL shortener is used), and possibly coordinates
> for spatial search.
>
> You would have to add all these fields and values yourself in your Solr
> input document. Tika can't help you there.
>
> Although I imagine quite a few people have already done this, so maybe
> somebody could contribute their Twitter Solr schema.
> Anybody?
>
> -- Jack Krupansky
>
> -----Original Message----- From: David Radunz
> Sent: Monday, May 28, 2012 8:00 AM
> To: solr-user@lucene.apache.org
> Subject: Re: indexing unstructured text (tweets)
>
>
> Hey,
>
>    I think you might be over-thinking this. Tweets are structured. You
> have the content (tweet), the user who tweeted it and various other
> metadata. So your 'document' might look like this:
>
> <add>
> <doc>
> <field name="tweetId">ABCD1234</field>
> <field name="tweet">I bought some apples</field>
> <field name="user">JohnnyBoy</field>
> </doc>
> </add>
>
> To get this structure, you can use any programming language you're
> comfortable with and load it into Solr via various means. Obviously you
> can add more 'meta' fields that you get from Twitter if you want as well.
>
> David
>
> On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:
>
>> Hi all.
>>
>> I am in the process of setting up Solr for my application,
>> which is full text search on a bunch of tweets from twitter.
>>
>> I am afraid I am missing something.
>> From the book I am reading, "Apache Solr 3 Enterprise Search Server",
>> it looks like Solr works with structured input, like XML or CSV,
>> while I have the most wild and unstructured input ever (tweets).
>> A section named "Indexing documents with Solr Cell" seems to address my
>> problem,
>> but also shows that before getting to Solr, I might need to use
>> another Apache tool called Tika.
>>
>> Can anybody provide a brief explanation of the general picture?
>> Can I index my tweets with Solr?
>> Or do I also need to put Tika in my pipeline?
>>
>> Best regards,
>> Giovanni Gherdovich
>>
>
>
