bq: 10k as a max number of rows.

This doesn't matter. In order to facet on the word count, Solr has to
be prepared to facet on all possible docs. For all Solr knows, a
_single_ document may contain every word so the size of the structure
that contains the counters has to be prepared for N buckets, where N
is the total number of distinct words in the entire corpus.

You'll really have to find an alternative approach or somehow restrict
the choices, I think.
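One such alternative (a minimal sketch, not Solr-specific): since the filter already narrows the result set to ~10k docs, you could fetch the stored text for just those docs and count words client-side. The Solr fetch itself is omitted here; the `docs` list stands in for the filtered results:

```python
from collections import Counter

# Stand-in for the ~10k docs returned by the filtered Solr query
# (in practice you'd page through them with rows/start or cursorMark).
docs = [
    {"id": "1", "text": "solr facet word count word cloud"},
    {"id": "2", "text": "word cloud from solr"},
]

counts = Counter()
for doc in docs:
    counts.update(doc["text"].lower().split())

# Top terms for the word cloud
print(counts.most_common(3))
```

This only touches the documents that match the query, so memory scales with the filtered set rather than with the vocabulary of the whole corpus.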

Best,
Erick

On Tue, Nov 7, 2017 at 12:26 AM, Wael Kader <w...@softech-lb.com> wrote:
> Hi,
>
> The whole index has 100M documents, but when I add the criteria it filters
> the data down to maybe 10k rows at most.
> Faceting isn't working now that the index has 100M records, though it was
> working at 5M.
>
> I have social media & RSS data in the index and I am trying to get the word
> count for a specific user on specific date intervals.
>
> Regards,
> Wael
>
> On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> _Why_ do you want to get the word counts? Faceting on all of the
>> tokens for 100M docs isn't something Solr is ordinarily used for. As
>> Emir says, it'll take a huge amount of memory. You can use one of the
>> function queries (termfreq, IIRC), which will give you the count of any
>> individual term and will be very fast.
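For illustration, a termfreq request might look like the fragment below. The field and term names are hypothetical, and the exact function-query syntax should be checked against your Solr version's documentation:

```text
q=user_id:12345
fl=id,termfreq(text,'solr')
```

This returns, per matching document, the frequency of one known term, rather than building facet buckets for every distinct term in the corpus.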
>>
>> But getting all of the word counts in the index is probably not
>> something I'd use Solr for.
>>
>> This may be an XY problem: you're asking how to do something specific
>> (X) without explaining what problem you're trying to solve (Y).
>> Perhaps there's another way to accomplish (Y) if we knew more about
>> what it is.
>>
>> Best,
>> Erick
>>
>>
>>
>> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
>> <emir.arnauto...@sematext.com> wrote:
>> > Hi Wael,
>> > You are faceting on an analyzed field. This results in the field being
>> > uninverted - the fieldValueCache being built - on the first call after
>> > every commit. This is both time and memory consuming (you can check in
>> > the admin console stats how much memory it took).
>> > What you need to do is create a multivalued string field (not text),
>> > do the analysis steps on the client side, and store the values like
>> > that. This will let you enable docValues on that field and avoid
>> > building the fieldValueCache.
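A minimal sketch of that client-side analysis in Python (the stopword list is an illustrative stand-in for stopwords.txt, and this mirrors only part of the schema's analysis chain: accent folding, whitespace tokenization, stopword removal, lowercasing, de-duplication):

```python
import unicodedata

STOPWORDS = {"the", "a", "an", "and", "or"}  # illustrative subset of stopwords.txt

def fold_accents(text: str) -> str:
    # Rough stand-in for MappingCharFilterFactory's ISOLatin1Accent mapping
    return "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(c)
    )

def analyze(text: str) -> list:
    # Whitespace tokenize, lowercase, drop stopwords, remove duplicates
    seen, out = set(), []
    for token in fold_accents(text).split():
        token = token.lower()
        if token in STOPWORDS or token in seen:
            continue
        seen.add(token)
        out.append(token)
    return out

# The resulting list would be indexed into a multivalued string field
# with docValues enabled, so no fieldValueCache needs to be built.
print(analyze("The Café and the café RESULTS"))
```

Note this is a sketch under stated assumptions, not a faithful reimplementation of the WordDelimiterFilterFactory settings in the schema below.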
>> >
>> > HTH,
>> > Emir
>> > --
>> > Monitoring - Log Management - Alerting - Anomaly Detection
>> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >
>> >
>> >
>> >> On 6 Nov 2017, at 13:06, Wael Kader <w...@softech-lb.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am using a custom field. Below is the field definition.
>> >> I am using this because I don't want stemming.
>> >>
>> >>
>> >>    <fieldType name="text_no_stem2" class="solr.TextField"
>> >> positionIncrementGap="100">
>> >>      <analyzer type="index">
>> >>        <charFilter class="solr.MappingCharFilterFactory"
>> >> mapping="mapping-ISOLatin1Accent.txt"/>
>> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>
>> >>        <filter class="solr.StopFilterFactory"
>> >>                ignoreCase="true"
>> >>                words="stopwords.txt"
>> >>                enablePositionIncrements="true"
>> >>                />
>> >>        <filter class="solr.WordDelimiterFilterFactory"
>> >>                protected="protwords.txt"
>> >>                generateWordParts="0"
>> >>                generateNumberParts="1"
>> >>                catenateWords="1"
>> >>                catenateNumbers="1"
>> >>                catenateAll="0"
>> >>                splitOnCaseChange="1"
>> >>                preserveOriginal="1"/>
>> >>        <filter class="solr.LowerCaseFilterFactory"/>
>> >>
>> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >>      </analyzer>
>> >>      <analyzer type="query">
>> >>        <charFilter class="solr.MappingCharFilterFactory"
>> >> mapping="mapping-ISOLatin1Accent.txt"/>
>> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >>        <filter class="solr.SynonymFilterFactory"
>> synonyms="synonyms.txt"
>> >> ignoreCase="true" expand="true"/>
>> >>        <filter class="solr.StopFilterFactory"
>> >>                ignoreCase="true"
>> >>                words="stopwords.txt"
>> >>                enablePositionIncrements="true"
>> >>                />
>> >> <!--ORIGINAL                generateNumberParts="1"-->
>> >>        <filter class="solr.WordDelimiterFilterFactory"
>> >>                protected="protwords.txt"
>> >>                generateWordParts="0"
>> >>                catenateWords="0"
>> >>                catenateNumbers="0"
>> >>                catenateAll="0"
>> >>                splitOnCaseChange="1"
>> >>                preserveOriginal="1"/>
>> >>        <filter class="solr.LowerCaseFilterFactory"/>
>> >>        <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
>> >> language="English" protected="protwords.txt"/-->
>> >>        <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
>> >> word match -->
>> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>> >>      </analyzer>
>> >>    </fieldType>
>> >>
>> >>
>> >> Regards,
>> >> Wael
>> >>
>> >> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
>> >> emir.arnauto...@sematext.com> wrote:
>> >>
>> >>> Hi Wael,
>> >>> Can you provide your field definition and a sample query?
>> >>>
>> >>> Thanks,
>> >>> Emir
>> >>> --
>> >>> Monitoring - Log Management - Alerting - Anomaly Detection
>> >>> Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
>> >>>
>> >>>
>> >>>
>> >>>> On 6 Nov 2017, at 08:30, Wael Kader <w...@softech-lb.com> wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> I have an index with around 100 million documents.
>> >>>> I have a multivalued field that I save big chunks of text data in.
>> >>>> The server has around 20 GB of RAM and 4 CPUs.
>> >>>>
>> >>>> I was faceting on that field to build a word cloud, and it took
>> >>>> around 1 second to retrieve when the index held 5-10 million
>> >>>> documents. Now I have more data and it's taking minutes to get the
>> >>>> results (that is, if it returns at all and Solr doesn't crash).
>> >>>> What's the best way to make it run - or maybe it's not scalable on
>> >>>> my current schema and design with news articles?
>> >>>>
>> >>>> I am looking for the best solution for this. Maybe create another
>> >>>> index to split the data while inserting it, or maybe if I change
>> >>>> some settings in SolrConfig or add some RAM, it would perform
>> >>>> better.
>> >>>>
>> >>>> --
>> >>>> Regards,
>> >>>> Wael
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Wael
>> >
>>
>
>
>
> --
> Regards,
> Wael
