Re: indexing unstructured text (tweets)

Dmitry Kan Mon, 28 May 2012 04:59:35 -0700

Hi,

You need Tika only if your data is in a binary format such as PDF or Excel:
it extracts the text from the binary for you. If you just want to index the
text contents of tweets (including web links etc.), off-the-shelf Solr is
enough. You will have to wrap your text input (one document per tweet, I
would assume) in XML or another supported structured format, in accordance
with the schema you have defined. At minimum you would have two fields: a
unique document id and the textual content (the tweet). So design your
schema first, then create (e.g.) XML with the documents to add and post it
to Solr.
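
To make the last step concrete, here is a rough sketch in Python of building
the XML and posting it. The field names "id" and "text" and the update URL
(the default of the example Solr server) are assumptions on my side; they
must match your own schema and installation.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumed field names: they must match the fields declared in your schema.
ID_FIELD = "id"
TEXT_FIELD = "text"


def build_add_xml(tweets):
    """Build a Solr <add> message from (tweet_id, tweet_text) pairs."""
    add = ET.Element("add")
    for tweet_id, tweet_text in tweets:
        doc = ET.SubElement(add, "doc")
        ET.SubElement(doc, "field", name=ID_FIELD).text = str(tweet_id)
        # ElementTree escapes &, < and > in the tweet text for us.
        ET.SubElement(doc, "field", name=TEXT_FIELD).text = tweet_text
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(xml_bytes,
                 update_url="http://localhost:8983/solr/update?commit=true"):
    """POST the XML to Solr's update handler (the URL is an assumption)."""
    request = urllib.request.Request(
        update_url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Made-up sample tweets, for illustration only.
    sample = [
        (1, "Trying out Solr for full text search http://example.com"),
        (2, "Tweets & other <short> texts index just fine"),
    ]
    print(build_add_xml(sample).decode("utf-8"))
    # post_to_solr(build_add_xml(sample))  # needs a running Solr instance
```

Committing on every post is fine for a test; for a large batch of tweets
you would rather post many documents per request and commit once at the end.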


Dmitry

On Mon, May 28, 2012 at 2:37 PM, Giovanni Gherdovich <g.gherdov...@gmail.com
> wrote:

> Hi all.
>
> I am in the process of setting up Solr for my application,
> which is full text search on a bunch of tweets from twitter.
>
> I am afraid I am missing something.
> From the book I am reading, "Apache Solr 3 Enterprise Search Server",
> it looks like Solr works with structured input, like XML or CSV,
> while I have the most wild and unstructured input ever (tweets).
> A section named "Indexing documents with Solr Cell" seems to address my
> problem,
> but also shows that before getting to Solr, I might need to use
> another Apache tool called Tika.
>
> Can anybody provide a brief explanation of the general picture?
> Can I index my tweets with Solr?
> Or do I also need to put Tika in my pipeline?
>
> Best regards,
> Giovanni Gherdovich
>



-- 
Regards,

Dmitry Kan
