Other obvious metadata from the Twitter API to index would be hashtags, user
mentions (both the user id/screen name and user name), date/time, URLs
mentioned (expanded if a URL shortener is used), and possibly coordinates
for spatial search.
You would have to add all these fields and values yourself in your Solr
input document. Tika can't help you there.
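As a rough sketch of what "adding the fields yourself" could look like, here is a small Python function that flattens one tweet into a set of Solr fields. It assumes the tweet has already been parsed from the Twitter API JSON into a dict with the usual "entities", "created_at" and "coordinates" keys. The Solr field names (hashtags, mention_ids, location and so on) are made up for illustration and would have to match whatever you declare in your own schema.

```python
from datetime import datetime, timezone


def tweet_to_solr_doc(tweet):
    """Flatten one tweet (dict parsed from Twitter API JSON) into Solr fields.

    All Solr field names below are hypothetical; adapt them to your schema.
    """
    entities = tweet.get("entities", {})
    mentions = entities.get("user_mentions", [])

    # Twitter's created_at looks like "Mon May 28 12:00:00 +0000 2012";
    # Solr date fields expect UTC in the form "2012-05-28T12:00:00Z".
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")

    doc = {
        "tweetId": tweet["id_str"],
        "tweet": tweet["text"],
        "user": tweet["user"]["screen_name"],
        "created_at": created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Multi-valued fields: one value per hashtag, mention or URL.
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mention_ids": [m["id_str"] for m in mentions],
        "mention_screen_names": [m["screen_name"] for m in mentions],
        "mention_names": [m["name"] for m in mentions],
        # Prefer the expanded URL when a shortener was used.
        "urls": [u.get("expanded_url") or u["url"] for u in entities.get("urls", [])],
    }

    # Tweet coordinates are GeoJSON (longitude first); a Solr lat/lon
    # field is assumed here, which takes "lat,lon".
    coords = tweet.get("coordinates")
    if coords:
        lon, lat = coords["coordinates"]
        doc["location"] = f"{lat},{lon}"
    return doc
```

Whether you want the mention ids, screen names and names in separate fields or copied into one catch-all field depends on how you plan to search them.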
That said, I imagine quite a few people have already done this, so maybe
somebody could contribute their Twitter Solr schema.
Anybody?
-- Jack Krupansky
-----Original Message-----
From: David Radunz
Sent: Monday, May 28, 2012 8:00 AM
To: solr-user@lucene.apache.org
Subject: Re: indexing unstructured text (tweets)
Hey,
I think you might be over-thinking this. Tweets are structured. You
have the content (tweet), the user who tweeted it and various other
metadata. So your 'document' might look like this:
<add>
  <doc>
    <field name="tweetId">ABCD1234</field>
    <field name="tweet">I bought some apples</field>
    <field name="user">JohnnyBoy</field>
  </doc>
</add>
To get this structure, you can use any programming language you're
comfortable with and load it into Solr via various means. Obviously you
can add more 'meta' fields that you get from Twitter if you want as well.
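For instance, a minimal sketch in Python using only the standard library might look like the following. It builds the same kind of add message from a list of dicts and posts it to Solr. The helper names (build_add_xml, post_to_solr) are made up here, and the update URL is an assumption based on a stock single-core install on localhost, so adjust it to your own setup.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field name: value} dicts.

    A list or tuple value becomes one <field> element per item, which is
    how a multi-valued field is sent. ElementTree escapes &, < and >.
    """
    root = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(root, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(root, encoding="unicode")


def post_to_solr(xml_body, url="http://localhost:8983/solr/update?commit=true"):
    """POST an update message to Solr and return the HTTP status code.

    The default URL is an assumption (local single-core install);
    commit=true makes the documents searchable right away.
    """
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

You would call it along the lines of post_to_solr(build_add_xml([{"tweetId": "ABCD1234", "tweet": "I bought some apples", "user": "JohnnyBoy"}])). For a large batch of tweets it is usually better to send many documents per request and commit once at the end rather than on every post.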
David
On 28/05/2012 9:37 PM, Giovanni Gherdovich wrote:
Hi all.
I am in the process of setting up Solr for my application,
which is full-text search on a bunch of tweets from Twitter.
I am afraid I am missing something.
From the book I am reading, "Apache Solr 3 Enterprise Search Server",
it looks like Solr works with structured input, like XML or CSV,
while I have the most wild and unstructured input ever (tweets).
A section named "Indexing documents with Solr Cell" seems to address my
problem,
but also shows that before getting to Solr, I might need to use
another Apache tool called Tika.
Can anybody provide a brief explanation of the general picture?
Can I index my tweets with Solr?
Or do I also need to put Tika in my pipeline?
Best regards,
Giovanni Gherdovich