On top of what Shawn rightly said, two things:
1. Try to benchmark your (best bet) solution with and without the
shingles. Then you will know better and have a story with numbers to tell.
2. If you go with the shingles approach, consider removing duplicates with
https://wiki.apache.org/solr/Analyzers
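
For point 1, a minimal timing sketch in Python might look like the one below. It is only an illustration, not something from your setup: `run_query` is a hypothetical callable that you would write yourself to send one query to the core under test (shingled or not), and the query list should be a sample of your real 3-gram searches.

```python
import statistics
import time


def benchmark(run_query, queries, repeats=5):
    """Time a query callable and return latency statistics in milliseconds.

    run_query is a hypothetical function supplied by the caller; it should
    execute one search against the index being measured.
    """
    timings = []
    for _ in range(repeats):
        for q in queries:
            start = time.perf_counter()
            run_query(q)
            timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    # Nearest-rank style 95th percentile, clamped to the last element.
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "count": len(timings),
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
    }
```

Run it once against the index built with shingles and once against the index built without, using the same queries, and compare the median and 95th percentile. Discard a warm-up pass first so cold caches do not skew the comparison.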
On 10/27/2014 6:20 AM, Robust Links wrote:
> 1) we want to index and search all tokens in a document (i.e. we do not
> rely on external stores)
>
> 2) we need search time to be fast and are willing to pay for it with
> longer indexing time and a larger index,
>
> 3) be able to search ngrams of 3 as fast as possible
Hi
We are trying to upgrade our index from 3.6.1 to 4.9.1 and I wanted to check
whether our existing indexing strategy is still valid. The statistics
of the raw corpus are:
- 4.8 billion tokens in the entire corpus.
- 13MM documents
We have 3 requirements:
1) we want to inde