bq: 10k as a max number of rows

This doesn't matter. In order to facet on the word count, Solr has to be
prepared to facet on all possible docs. For all Solr knows, a _single_
document may contain every word, so the structure that holds the counters
has to be sized for N buckets, where N is the total number of distinct
words in the entire corpus.
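For individual terms, the termfreq() function query mentioned further down the thread sidesteps that cost: you ask for counts of specific terms as pseudo-fields instead of faceting over every token. A minimal sketch follows; the collection name (posts), field names (text, user_id, post_date), and query values are hypothetical placeholders, not from the thread.

```python
# Sketch: per-term counts via Solr's termfreq() function query instead of
# faceting over every token. Collection/field names are hypothetical.
from urllib.parse import urlencode

def termfreq_url(base, collection, q, field, terms):
    # Request termfreq(field,'term') as a pseudo-field for each term of
    # interest, returned alongside each matching document.
    fl = ["id"] + [f"tf_{t}:termfreq({field},'{t}')" for t in terms]
    params = {"q": q, "fl": ",".join(fl), "rows": 10}
    return f"{base}/{collection}/select?{urlencode(params)}"

url = termfreq_url("http://localhost:8983/solr", "posts",
                   "user_id:123 AND post_date:[2017-11-01T00:00:00Z TO *]",
                   "text", ["solr", "facet"])
print(url)
```

termfreq() reads a single term's frequency from the postings, so it stays fast regardless of how many distinct words the corpus contains; the tradeoff is that you must know in advance which terms to ask about.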
You'll really have to find an alternative approach, somehow restrict the
choices, etc., I think.

Best,
Erick

On Tue, Nov 7, 2017 at 12:26 AM, Wael Kader <w...@softech-lb.com> wrote:
> Hi,
>
> The whole index has 100M, but when I add the criteria, it will filter
> the data down to maybe 10k rows at most.
> The facet isn't working now that the total number of records in the
> index is 100M, but it was working at 5M.
>
> I have social media & RSS data in the index and I am trying to get the
> word count for a specific user over specific date intervals.
>
> Regards,
> Wael
>
> On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>> _Why_ do you want to get the word counts? Faceting on all of the
>> tokens for 100M docs isn't something Solr is ordinarily used for. As
>> Emir says, it'll take a huge amount of memory. You can use one of the
>> function queries (termfreq, IIRC) that will give you the count of any
>> individual term, and it will be very fast.
>>
>> But getting all of the word counts in the index is probably not
>> something I'd use Solr for.
>>
>> This may be an XY problem: you're asking how to do something specific
>> (X) without explaining what problem you're trying to solve (Y).
>> Perhaps there's another way to accomplish (Y) if we knew more about
>> what it is.
>>
>> Best,
>> Erick
>>
>> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
>> <emir.arnauto...@sematext.com> wrote:
>>> Hi Wael,
>>> You are faceting on an analyzed field. This results in the field
>>> being uninverted (the fieldValueCache being built) on the first call
>>> after every commit. This is both time and memory consuming (you can
>>> check in the admin console stats how much memory it took).
>>> What you need to do is create a multivalued string field (not text),
>>> do the analysis steps on the client side, and store the resulting
>>> tokens in it. This will let you enable docValues on that field and
>>> avoid building the fieldValueCache.
>>> HTH,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>> On 6 Nov 2017, at 13:06, Wael Kader <w...@softech-lb.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am using a custom field. Below is the field definition.
>>>> I am using this because I don't want stemming.
>>>>
>>>> <fieldType name="text_no_stem2" class="solr.TextField"
>>>>            positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <charFilter class="solr.MappingCharFilterFactory"
>>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.StopFilterFactory"
>>>>             ignoreCase="true"
>>>>             words="stopwords.txt"
>>>>             enablePositionIncrements="true"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             protected="protwords.txt"
>>>>             generateWordParts="0"
>>>>             generateNumberParts="1"
>>>>             catenateWords="1"
>>>>             catenateNumbers="1"
>>>>             catenateAll="0"
>>>>             splitOnCaseChange="1"
>>>>             preserveOriginal="1"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <charFilter class="solr.MappingCharFilterFactory"
>>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
>>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>             ignoreCase="true" expand="true"/>
>>>>     <filter class="solr.StopFilterFactory"
>>>>             ignoreCase="true"
>>>>             words="stopwords.txt"
>>>>             enablePositionIncrements="true"/>
>>>>     <!-- ORIGINAL generateNumberParts="1" -->
>>>>     <filter class="solr.WordDelimiterFilterFactory"
>>>>             protected="protwords.txt"
>>>>             generateWordParts="0"
>>>>             catenateWords="0"
>>>>             catenateNumbers="0"
>>>>             catenateAll="0"
>>>>             splitOnCaseChange="1"
>>>>             preserveOriginal="1"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
>>>>          language="English" protected="protwords.txt"/ -->
>>>>     <!-- Webel: switch off the Porter-stemmer algorithm to enforce
>>>>          whole-word match -->
>>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> Regards,
>>>> Wael
>>>>
>>>> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović
>>>> <emir.arnauto...@sematext.com> wrote:
>>>>> Hi Wael,
>>>>> Can you provide your field definition and a sample query?
>>>>>
>>>>> Thanks,
>>>>> Emir
>>>>> --
>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>>>
>>>>>> On 6 Nov 2017, at 08:30, Wael Kader <w...@softech-lb.com> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have an index with around 100 million documents. I have a
>>>>>> multivalued column in which I save big chunks of text data. The
>>>>>> machine has around 20 GB of RAM and 4 CPUs.
>>>>>>
>>>>>> I was faceting on it to get a word cloud, and it was taking around
>>>>>> 1 second to retrieve when the data was 5-10 million. Now I have
>>>>>> more data and it takes minutes to get the results (that is, if it
>>>>>> gets them at all and Solr doesn't crash). What's the best way to
>>>>>> make it run, or maybe it's not scalable with my current schema and
>>>>>> design for news articles.
>>>>>>
>>>>>> I am looking for the best solution for this. Maybe create another
>>>>>> index to split the data while inserting it, or maybe it would
>>>>>> perform better if I changed some settings in solrconfig or added
>>>>>> some RAM.
>>>>>>
>>>>>> Regards,
>>>>>> Wael
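Emir's suggestion (analyze on the client, then index into a multivalued string field with docValues) could be sketched roughly as below. The field name words_ss and the schema line in the comment are assumptions, and the tokenization is a deliberately crude stand-in for the real text_no_stem2 chain: whitespace split, punctuation strip, lowercase, stop-word removal, de-duplication only.

```python
# Sketch of client-side pre-analysis before indexing, per Emir's advice.
# Assumed schema (hypothetical field name):
#   <field name="words_ss" type="string" multiValued="true" docValues="true"/>
# This only approximates the text_no_stem2 analysis chain.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}  # stand-in for stopwords.txt

def pre_analyze(text):
    seen, tokens = set(), []
    for raw in text.split():                       # whitespace tokenizer
        tok = raw.strip(".,!?;:\"'()").lower()     # strip punctuation, lowercase
        if tok and tok not in STOPWORDS and tok not in seen:
            seen.add(tok)                          # remove duplicates
            tokens.append(tok)
    return tokens

doc = {
    "id": "post-1",        # hypothetical document
    "user_id": "123",
    "words_ss": pre_analyze("The quick brown fox and the lazy dog."),
}
print(doc["words_ss"])     # → ['quick', 'brown', 'fox', 'lazy', 'dog']
```

With the tokens stored this way, faceting on words_ss (e.g. facet.field=words_ss, or the JSON Facet API) reads docValues directly instead of uninverting the large text field after every commit.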