Re: Faceting Word Count

Erick Erickson Mon, 06 Nov 2017 07:42:57 -0800

_Why_ do you want to get the word counts? Faceting on all of the
tokens for 100M docs isn't something Solr is ordinarily used for. As
Emir says, it'll take a huge amount of memory. You can use one of the
function queries (termfreq, IIRC) to get the count of any individual
term, and that will be very fast.
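
For illustration, here is a rough sketch of what such a request could look
like, built in Python. The field name `content`, the `id` field, the term
`solr`, and the core URL are placeholders, not taken from this thread.
`termfreq(field,term)` returns the count within each returned document, and
`ttf(field,term)` returns the total count across the index.

```python
from urllib.parse import urlencode


def termfreq_params(field, term, query="*:*", rows=10):
    """Build Solr /select parameters that return term counts as pseudo-fields."""
    # Assumption: the term contains no quote characters, so no escaping is done.
    if "'" in term or '"' in term:
        raise ValueError("term must not contain quote characters")
    quoted = "'" + term + "'"
    fl = ",".join([
        "id",                                # assumed unique-key field
        f"tf:termfreq({field},{quoted})",    # occurrences in each returned document
        f"total:ttf({field},{quoted})",      # occurrences across the whole index
    ])
    return {"q": query, "fl": fl, "rows": rows, "wt": "json"}


params = termfreq_params("content", "solr")
# Hypothetical core URL; adjust host, port, and core name.
url = "http://localhost:8983/solr/mycore/select?" + urlencode(params)
print(url)
```

This only covers terms you already know you want to count; it does not
enumerate every term in the index.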


But getting all of the word counts in the index is probably not
something I'd use Solr for.

This may be an XY problem: you're asking how to do something specific
(X) without explaining the problem you're trying to solve (Y).
Perhaps there's another way to accomplish (Y) if we knew more about
what it is.

Best,
Erick



On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
<emir.arnauto...@sematext.com> wrote:
> Hi Wael,
> You are faceting on an analyzed field. This results in the field being
> uninverted (the fieldValueCache being built) on the first call after every
> commit. This is both time- and memory-consuming (you can check in the
> admin console stats how much memory it took).
> What you need to do is create a multivalued string field (not text), do
> the parsing (the analysis steps) on the client side, and store the values
> like that. This will allow you to enable docValues on that field and avoid
> building the fieldValueCache.
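
A rough sketch of the client-side step described above, in Python. It
assumes a much-simplified stand-in for the analysis chain (whitespace split,
lowercasing, stop-word removal, de-duplication) and does not replicate
WordDelimiterFilterFactory or the accent mapping. The field names `id` and
`words` and the stop-word set are hypothetical.

```python
STOPWORDS = {"a", "an", "and", "of", "the", "to"}  # stand-in for stopwords.txt


def analyze(text, stopwords=STOPWORDS):
    """Return the distinct lowercased tokens of text, in first-seen order."""
    seen = {}
    for token in text.split():            # whitespace tokenization
        token = token.lower()
        if token not in stopwords:
            seen.setdefault(token, None)  # dict preserves insertion order
    return list(seen)


def to_solr_doc(doc_id, text):
    """Build a document whose `words` field holds the pre-analyzed tokens."""
    return {"id": doc_id, "words": analyze(text)}


doc = to_solr_doc("1", "The price of Oil and the price of gas")
print(doc)
```

Here `words` would be declared as a multivalued string field with docValues
enabled. Facet counts are per document, so dropping repeated tokens within
a document does not change them.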
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 6 Nov 2017, at 13:06, Wael Kader <w...@softech-lb.com> wrote:
>>
>> Hi,
>>
>> I am using a custom field. Below is the field definition.
>> I am using this because I don't want stemming.
>>
>>
>>    <fieldType name="text_no_stem2" class="solr.TextField"
>>               positionIncrementGap="100">
>>      <analyzer type="index">
>>        <charFilter class="solr.MappingCharFilterFactory"
>>                    mapping="mapping-ISOLatin1Accent.txt"/>
>>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>
>>        <filter class="solr.StopFilterFactory"
>>                ignoreCase="true"
>>                words="stopwords.txt"
>>                enablePositionIncrements="true"
>>                />
>>        <filter class="solr.WordDelimiterFilterFactory"
>>                protected="protwords.txt"
>>                generateWordParts="0"
>>                generateNumberParts="1"
>>                catenateWords="1"
>>                catenateNumbers="1"
>>                catenateAll="0"
>>                splitOnCaseChange="1"
>>                preserveOriginal="1"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>
>>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>      </analyzer>
>>      <analyzer type="query">
>>        <charFilter class="solr.MappingCharFilterFactory"
>>                    mapping="mapping-ISOLatin1Accent.txt"/>
>>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>                ignoreCase="true" expand="true"/>
>>        <filter class="solr.StopFilterFactory"
>>                ignoreCase="true"
>>                words="stopwords.txt"
>>                enablePositionIncrements="true"
>>                />
>> <!--ORIGINAL                generateNumberParts="1"-->
>>        <filter class="solr.WordDelimiterFilterFactory"
>>                protected="protwords.txt"
>>                generateWordParts="0"
>>                catenateWords="0"
>>                catenateNumbers="0"
>>                catenateAll="0"
>>                splitOnCaseChange="1"
>>                preserveOriginal="1"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
>>             language="English" protected="protwords.txt"/-->
>>        <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
>>             word match -->
>>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>      </analyzer>
>>    </fieldType>
>>
>>
>> Regards,
>> Wael
>>
>> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
>> emir.arnauto...@sematext.com> wrote:
>>
>>> Hi Wael,
>>> Can you provide your field definition and a sample query?
>>>
>>> Thanks,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>>>
>>>> On 6 Nov 2017, at 08:30, Wael Kader <w...@softech-lb.com> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I have an index with around 100 million documents.
>>>> I have a multivalued field in which I am storing big chunks of text data.
>>>> The machine has around 20 GB of RAM and 4 CPUs.
>>>>
>>>> I was faceting on it to get a word cloud, and it took around 1 second
>>>> to return when the data was 5-10 million documents.
>>>> Now I have more data and it's taking minutes to get the results (that is,
>>>> if it gets them at all and Solr doesn't crash). What's the best way to
>>>> make it run, or is it perhaps not scalable with my current schema and
>>>> design for news articles?
>>>>
>>>> I am looking for the best solution for this. Maybe I should create another
>>>> index to split the data while inserting it, or maybe it would perform
>>>> better if I changed some settings in solrconfig or added some RAM.
>>>>
>>>> --
>>>> Regards,
>>>> Wael
>>>
>>>
>>
>>
>> --
>> Regards,
>> Wael
>
