Entity recognition means recognizing different entities (name/person, email, location/city/state/country, etc.) in your tweets/messages, with the goal of providing more relevant results to users. NER (named entity recognition) can be applied at query time or at indexing time (data enrichment).
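As a rough illustration of the indexing-time (data enrichment) case, here is a minimal sketch. It is not a real NER system: the regular expression, the tiny `KNOWN_LOCATIONS` gazetteer, and the `entity_*` field names are all assumptions made up for this example, and a production setup would more likely plug in a trained NER model at the same point.

```python
import re

# Hypothetical pattern and gazetteer, for illustration only.
# A real deployment would likely replace these with a trained NER model.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
KNOWN_LOCATIONS = {"berlin", "london", "new york"}


def extract_entities(text):
    """Return a dict mapping entity type to a sorted list of matches in text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    lowered = text.lower()
    locations = sorted(
        loc
        for loc in KNOWN_LOCATIONS
        if re.search(r"\b" + re.escape(loc) + r"\b", lowered)
    )
    return {"email": emails, "location": locations}


def enrich_document(doc, text_field="text"):
    """Index-time enrichment: copy doc and add one multi-valued field per entity type."""
    enriched = dict(doc)
    entities = extract_entities(doc.get(text_field, ""))
    for entity_type, values in entities.items():
        if values:  # only add fields that actually have values
            enriched["entity_" + entity_type] = values
    return enriched
```

The enriched document would then be sent to the indexer, so the extra fields can be searched or faceted on. For the query-time case, the same `extract_entities` function could be run on the user's query string to decide which fields to search.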
Thanks,
Susheel

On Fri, Mar 4, 2016 at 7:55 PM, Joseph Obernberger <joseph.obernber...@gmail.com> wrote:

> Thank you all very much for all the responses so far. I've enjoyed reading
> them! We have noticed that storing data inside of Solr results in
> significantly worse performance (particularly faceting); so we store the
> values of all the fields elsewhere, but index all the data with Solr
> Cloud. I think the suggestion about splitting the data up into blocks of
> date/time is where we would be headed. Having two Solr-Cloud clusters -
> one to handle ~30 days of data, and one to handle historical. Another
> option is to use a single Solr Cloud cluster, but use multiple
> cores/collections. Either way you'd need a job to come through and clean
> up old data. The historical cluster would have much worse performance,
> particularly for clustering and faceting the data, but that may be
> acceptable.
> I don't know what you mean by 'entity recognition in the queries' - could
> you elaborate?
>
> We would want to index and potentially facet on any of the fields - for
> example entities_media_url, username, even background color, but we do not
> know a-priori what fields will be important to users.
> As to why we would want to make the data searchable; well - I don't make
> the rules! Tweets are not the only data source, but they're certainly the
> largest that we are currently looking at handling.
>
> I will read up on the Berlin Buzzwords - thank you for the info!
>
> -Joe
>
> On Fri, Mar 4, 2016 at 9:59 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>
> > As always, the initial question is how you intend to query the data - query
> > drives data modeling. How real-time do you need queries to be? How fast do
> > you need archive queries to be? How many fields do you need to query on?
> > How much entity recognition do you need in queries?
> >
> > -- Jack Krupansky
> >
> > On Fri, Mar 4, 2016 at 4:19 AM, Charlie Hull <char...@flax.co.uk> wrote:
> >
> > > On 03/03/2016 19:25, Toke Eskildsen wrote:
> > >
> > >> Joseph Obernberger <joseph.obernber...@gmail.com> wrote:
> > >>
> > >>> Hi All - would it be reasonable to index the Twitter 'firehose'
> > >>> with Solr Cloud - roughly 500-600 million docs per day indexing
> > >>> each of the fields (about 180)?
> > >>
> > >> Possible, yes. Reasonable? It is not going to be cheap.
> > >>
> > >> Twitter index the tweets themselves and have been quite open about
> > >> how they do it. I would suggest looking for their presentations;
> > >> slides or recordings. They have presented at Berlin Buzzwords and
> > >> Lucene/Solr Revolution and probably elsewhere too. The gist is that
> > >> they have done a lot of work and custom coding to handle it.
> > >
> > > As I recall they're not using Solr, but rather an in-house layer built on
> > > a customised version of Lucene. They're indexing around half a trillion
> > > tweets.
> > >
> > > If the idea is to provide a searchable archive of all tweets, my first
> > > question would be 'why': if the idea is to monitor new tweets for
> > > particular patterns there are better ways to do this (Luwak for example).
> > >
> > > Charlie
> > >
> > >>> If I were to guess at a sharded setup to handle such data, and keep
> > >>> 2 years worth, I would guess about 2500 shards. Is that
> > >>> reasonable?
> > >>
> > >> I think you need to think well beyond standard SolrCloud setups. Even
> > >> if you manage to get 2500 shards running, you will want to do a lot
> > >> of tweaking on the way to issue queries so that each request does not
> > >> require all 2500 shards to be searched. Prioritizing newer material
> > >> and only querying the older shards if there are not enough recent
> > >> results is an example.
> > >>
> > >> I highly doubt that a single SolrCloud is the best answer here. Maybe
> > >> one cloud for each month and a lot of external logic?
> > >>
> > >> - Toke Eskildsen
> > >
> > > --
> > > Charlie Hull
> > > Flax - Open Source Enterprise Search
> > >
> > > tel/fax: +44 (0)8700 118334
> > > mobile: +44 (0)7767 825828
> > > web: www.flax.co.uk
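
For reference, the "newer material first, older shards only if needed" idea from the quoted thread could look roughly like the sketch below. It assumes one collection per month with a made-up naming scheme (`tweets_YYYY_MM`) and a caller-supplied `search_fn(collection, query, limit)`. Both are hypothetical and stand in for whatever client and layout are actually used. This is not a Solr API.

```python
from datetime import date


def monthly_collections(newest, months, prefix="tweets_"):
    """Return hypothetical per-month collection names, newest first."""
    names = []
    year, month = newest.year, newest.month
    for _ in range(months):
        names.append(f"{prefix}{year:04d}_{month:02d}")
        month -= 1
        if month == 0:  # roll over to December of the previous year
            year, month = year - 1, 12
    return names


def tiered_search(query, collections, search_fn, wanted=10):
    """Query collections newest-first and stop once `wanted` hits are collected.

    search_fn(collection, query, limit) is supplied by the caller and must
    return a list of at most `limit` hits.
    """
    hits = []
    searched = []
    for name in collections:
        searched.append(name)
        hits.extend(search_fn(name, query, wanted - len(hits)))
        if len(hits) >= wanted:
            break  # enough recent results, older collections are not touched
    return hits[:wanted], searched
```

The second return value lists which collections were actually queried, which makes it easy to check how often a request had to fall through to the historical data.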