Re: indexing unstructured text (tweets)

Jack Krupansky Mon, 28 May 2012 06:52:08 -0700

Other obvious metadata from the Twitter API to index would be hashtags, user mentions (both the user id/screen name and user name), date/time, URLs mentioned (expanded if a URL shortener is used), and possibly coordinates for spatial search.

You would have to add all these fields and values yourself in your Solr input document. Tika can't help you there.
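
To make that concrete, here is a minimal Python sketch of flattening one tweet into a Solr input document. It assumes the tweet arrives as the JSON object returned by the Twitter REST API (with "entities", "created_at" and GeoJSON-style "coordinates"); the Solr field names (hashtags, mention_screen_names, location, and so on) are made up for illustration and would have to match whatever you declare in your own schema.

```python
from datetime import datetime, timezone


def tweet_to_solr_doc(tweet):
    """Flatten a tweet (a dict parsed from Twitter API JSON) into Solr fields.

    All Solr field names below are hypothetical; adapt them to your schema.
    """
    entities = tweet.get("entities") or {}
    doc = {
        "tweetId": tweet["id_str"],
        "tweet": tweet["text"],
        "user": tweet["user"]["screen_name"],
        # Multi-valued fields: one value per hashtag / mention / URL.
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mention_screen_names": [
            m["screen_name"] for m in entities.get("user_mentions", [])
        ],
        "mention_names": [m["name"] for m in entities.get("user_mentions", [])],
        # Prefer the expanded URL over the shortened one when it is present.
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
    }

    # Assumed input format: "Mon May 28 13:52:08 +0000 2012".
    # Solr date fields expect ISO 8601 in UTC, e.g. 2012-05-28T13:52:08Z.
    created = tweet.get("created_at")
    if created:
        parsed = datetime.strptime(created, "%a %b %d %H:%M:%S %z %Y")
        doc["created_at"] = parsed.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        )

    # GeoJSON points are [longitude, latitude]; Solr's lat/lon field types
    # take a "lat,lon" string, so the order is swapped here.
    coords = tweet.get("coordinates")
    if coords:
        lon, lat = coords["coordinates"]
        doc["location"] = f"{lat},{lon}"

    return doc
```

The hashtag, mention and URL fields would need to be declared multiValued in the schema, and the location field only helps if it uses one of Solr's spatial field types.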

Although, I imagine quite a few people have already done this quite a few times before, so maybe somebody could contribute their Twitter Solr schema. Anybody?


-- Jack Krupansky

-----Original Message-----
From: David Radunz
Sent: Monday, May 28, 2012 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing unstructured text (tweets)

Hey,

    I think you might be over-thinking this. Tweets are structured. You
have the content (tweet), the user who tweeted it and various other
metadata. So your 'document' might look like this:

<add>
<doc>
<field name="tweetId">ABCD1234</field>
<field name="tweet">I bought some apples</field>
<field name="user">JohnnyBoy</field>
</doc>
</add>

To get this structure, you can use any programming language you're
comfortable with and load it into Solr via various means. Obviously you
can add more 'meta' fields that you get from Twitter if you want as well.
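
As a rough illustration, here is a minimal Python sketch that builds the same <add><doc> structure from plain dictionaries using only the standard library. The function name build_add_xml is made up for this example, and how you send the result to Solr is up to you.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of dicts as a Solr XML <add> message.

    A list or tuple value produces one <field> element per item,
    which is how a multi-valued field is expressed in this format.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                # ElementTree escapes &, < and > in the text when serializing.
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


xml_message = build_add_xml([
    {"tweetId": "ABCD1234", "tweet": "I bought some apples", "user": "JohnnyBoy"},
])
```

The resulting string can then be POSTed to Solr's XML update handler (typically at /update, with a text/xml content type) and followed by a commit. A Solr client library for your language would do the same job with less code.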

David

On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:

Hi all.

I am in the process of setting up Solr for my application,
which is full-text search on a bunch of tweets from Twitter.

I am afraid I am missing something.
From the book I am reading, "Apache Solr 3 Enterprise Search Server",
it looks like Solr works with structured input, like XML or CSV,
while I have the most wild and unstructured input ever (tweets).

A section named "Indexing documents with Solr Cell" seems to address my
problem, but also shows that before getting to Solr, I might need to use
another Apache tool called Tika.

Can anybody provide a brief explanation of the general picture?
Can I index my tweets with Solr?
Or do I also need to put Tika in my pipeline?

Best regards,

Giovanni Gherdovich
