Re: Indexing Twitter - Hypothetical

Susheel Kumar Sun, 06 Mar 2016 04:43:28 -0800

Entity recognition means recognizing different entities (person names,
email addresses, locations such as city/state/country, etc.) in your
tweets/messages, with the goal of providing more relevant results to users.
NER (named entity recognition) can be applied at query time or at indexing
time (data enrichment).
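
As a rough sketch of the indexing-time case (illustrative only, not a
production approach): the location list, the regular expression, and the
field names entity_email and entity_location below are all assumptions. A
real setup would more likely use a trained NER model than a dictionary lookup.

```python
import re

# Hypothetical gazetteer; a real deployment would use a trained NER model
# or a far larger dictionary.
KNOWN_LOCATIONS = {"berlin", "london", "new york"}

# Deliberately simple email pattern, good enough for an illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_entities(text):
    """Return a dict mapping entity type to a sorted list of matches."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    lowered = text.lower()
    locations = sorted(
        loc for loc in KNOWN_LOCATIONS
        if re.search(r"\b" + re.escape(loc) + r"\b", lowered)
    )
    return {"email": emails, "location": locations}


def enrich_for_indexing(doc):
    """Index-time use: copy the document and add entity fields.

    The added field names are assumptions made for this sketch.
    """
    entities = extract_entities(doc["text"])
    enriched = dict(doc)
    enriched["entity_email"] = entities["email"]
    enriched["entity_location"] = entities["location"]
    return enriched
```

The same extract_entities call could be run on the user's query string to
cover the query-time case, for example to turn a recognized city into a
filter on the location field.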


Thanks,
Susheel

On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you all very much for all the responses so far.  I've enjoyed reading
> them!  We have noticed that storing data inside of Solr results in
> significantly worse performance (particularly faceting); so we store the
> values of all the fields elsewhere, but index all the data with Solr
> Cloud.  I think the suggestion about splitting the data up into blocks of
> date/time is where we would be headed: two Solr Cloud clusters, one to
> handle ~30 days of data and one to handle historical data.  Another
> option is to use a single Solr Cloud cluster, but use multiple
> cores/collections.  Either way you'd need a job to come through and clean
> up old data. The historical cluster would have much worse performance,
> particularly for clustering and faceting the data, but that may be
> acceptable.
> I don't know what you mean by 'entity recognition in the queries' - could
> you elaborate?
>
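
A minimal sketch of the date-based split and clean-up job described above,
assuming one collection per calendar month. The naming scheme
tweets_YYYY_MM and the function names are hypothetical, and actually
creating or deleting collections is left out.

```python
from datetime import date


def collection_for(day, prefix="tweets"):
    """Name of the (hypothetical) monthly collection a document belongs to."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def months_between(older, newer):
    """Whole calendar months from `older` to `newer`, ignoring the day."""
    return (newer.year - older.year) * 12 + (newer.month - older.month)


def expired_collections(names, today, keep_months):
    """Return the collections a clean-up job would drop, oldest first.

    Expects names of the form prefix_YYYY_MM.
    """
    expired = []
    for name in names:
        _, year, month = name.rsplit("_", 2)
        first_of_month = date(int(year), int(month), 1)
        if months_between(first_of_month, today) >= keep_months:
            expired.append(name)
    return sorted(expired)
```

With keep_months=24, a job run in March 2016 would select tweets_2014_03
and anything older for removal.
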
> We would want to index and potentially facet on any of the fields - for
> example entities_media_url, username, even background color, but we do not
> know a priori what fields will be important to users.
> As to why we would want to make the data searchable: well, I don't make
> the rules!  Tweets are not the only data source, but they are certainly
> the largest that we are currently looking at handling.
>
> I will read up on the Berlin Buzzwords - thank you for the info!
>
> -Joe
>
>
>
> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
> > As always, the initial question is how you intend to query the data -
> > query drives data modeling. How real-time do you need queries to be? How
> > fast do you need archive queries to be? How many fields do you need to
> > query on? How much entity recognition do you need in queries?
> >
> >
> > -- Jack Krupansky
> >
> > On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk> wrote:
> >
> > > On 03/03/2016 19:25, Toke Eskildsen wrote:
> > >
> > >> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> > >>
> > >>> Hi All - would it be reasonable to index the Twitter 'firehose'
> > >>> with Solr Cloud - roughly 500-600 million docs per day indexing
> > >>> each of the fields (about 180)?
> > >>>
> > >>
> > >> Possible, yes. Reasonable? It is not going to be cheap.
> > >>
> > >> Twitter index the tweets themselves and have been quite open about
> > >> how they do it. I would suggest looking for their presentations;
> > >> slides or recordings. They have presented at Berlin Buzzwords and
> > >> Lucene/Solr Revolution and probably elsewhere too. The gist is that
> > >> they have done a lot of work and custom coding to handle it.
> > >>
> > >
> > > As I recall they're not using Solr, but rather an in-house layer built
> > > on a customised version of Lucene. They're indexing around half a
> > > trillion tweets.
> > >
> > > If the idea is to provide a searchable archive of all tweets, my first
> > > question would be 'why': if the idea is to monitor new tweets for
> > > particular patterns there are better ways to do this (Luwak for
> > > example).
> > >
> > > Charlie
> > >
> > >
> > > >>> If I were to guess at a sharded setup to handle such data, and keep
> > >>> 2 years worth, I would guess about 2500 shards.  Is that
> > >>> reasonable?
> > >>>
> > >>
> > >> I think you need to think well beyond standard SolrCloud setups. Even
> > >> if you manage to get 2500 shards running, you will want to do a lot
> > >> of tweaking on the way to issue queries so that each request does not
> > > >> require all 2500 shards to be searched. Prioritizing newer material
> > > >> and only querying the older shards if there are not enough recent
> > > >> results is an example.
> > >>
> > >> I highly doubt that a single SolrCloud is the best answer here. Maybe
> > >> one cloud for each month and a lot of external logic?
> > >>
> > >> - Toke Eskildsen
> > >>
> > >>
> > >
> > > --
> > > Charlie Hull
> > > Flax - Open Source Enterprise Search
> > >
> > > tel/fax: +44 (0)8700 118334
> > > mobile:  +44 (0)7767 825828
> > > web: www.flax.co.uk
> > >
> >
>
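
For a sense of scale, the figures quoted in the thread (500-600 million
docs per day, 2 years of retention, 2500 shards) work out as below. This
assumes 365-day years and a perfectly even spread of documents across
shards, neither of which is stated in the original question.

```python
DOCS_PER_DAY_LOW = 500_000_000
DOCS_PER_DAY_HIGH = 600_000_000
RETENTION_DAYS = 2 * 365   # assumption: two 365-day years
SHARDS = 2500


def docs_per_shard(docs_per_day, days=RETENTION_DAYS, shards=SHARDS):
    """Return (total documents retained, documents per shard)."""
    total = docs_per_day * days
    return total, total // shards


for rate in (DOCS_PER_DAY_LOW, DOCS_PER_DAY_HIGH):
    total, per_shard = docs_per_shard(rate)
    print(f"{rate:,} docs/day -> {total:,} total, {per_shard:,} per shard")
```

That is roughly 365 to 438 billion documents in total, or about 146 to 175
million documents per shard.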
