Re: Localize the largest fields (content) in index

2012-03-29 Thread Erick Erickson
The admin UI (schema browser) will give you the counts of unique terms in your fields, which is where I'd start. I suspect you've already seen this page, but if not: http://lucene.apache.org/java/3_5_0/fileformats.html#file-names. The .fdt and .fdx file extensions are where data goes when you set

Re: Localize the largest fields (content) in index

2012-03-29 Thread Vadim Kisselmann
Hi Erick, thanks :) The admin UI gives me the counts, so I can identify fields with large numbers of unique terms. I know this wiki page, but I read it once more. List of my file extensions with size (index size ~150GB): tvf 90GB, fdt 30GB, tim 18GB, prx 15GB, frq 12GB, tip 200MB, tvx 150MB. tvf is
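A per-extension breakdown like the one in this message can be produced with a short script. The following is a minimal sketch, assuming all index files sit under one directory; the function names `sizes_by_extension` and `format_report` are hypothetical helpers, not part of Solr or Lucene.

```python
import os
from collections import defaultdict


def sizes_by_extension(index_dir):
    """Return {extension: total bytes} for all files under index_dir."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            # Files without an extension (e.g. segments_N) are grouped as "(none)".
            ext = os.path.splitext(name)[1].lstrip(".") or "(none)"
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


def format_report(totals):
    """Return one line per extension, sorted by size, largest first."""
    ordered = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [f"{ext:>8} {size / 1024**3:10.2f} GB" for ext, size in ordered]
```

Calling `print("\n".join(format_report(sizes_by_extension(path))))` with `path` set to the index directory prints the extensions in descending order of size.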

Re: Localize the largest fields (content) in index

2012-03-29 Thread Erick Erickson
Yeah, it's worth a try. The term vectors aren't entirely necessary for highlighting, although they do make things more efficient. As for MLT, does MLT really need such a big field? But you may be on your way to sharding your index if you remove this info and testing shows problems. Best
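To put the term vector discussion in numbers, the sizes Vadim reported earlier in the thread can be summed directly. This is only a rough sketch: it assumes that .tvf and .tvx are the only term vector files among those listed, and the listed sizes add up to roughly 165 GB, somewhat more than the ~150GB total quoted, so the result is approximate. The names `SIZES_GB` and `term_vector_share` are hypothetical.

```python
# Sizes in GB as reported in the thread (200MB and 150MB written as 0.2 and 0.15).
SIZES_GB = {
    "tvf": 90.0, "fdt": 30.0, "tim": 18.0, "prx": 15.0,
    "frq": 12.0, "tip": 0.2, "tvx": 0.15,
}

# Assumption: of the listed extensions, only these two hold term vector data.
TERM_VECTOR_EXTS = ("tvf", "tvx")


def term_vector_share(sizes, tv_exts=TERM_VECTOR_EXTS):
    """Return (term vector GB, total listed GB, fraction of total)."""
    total = sum(sizes.values())
    tv = sum(sizes.get(ext, 0.0) for ext in tv_exts)
    return tv, total, tv / total


tv_gb, total_gb, share = term_vector_share(SIZES_GB)
print(f"term vectors: {tv_gb:.2f} GB of {total_gb:.2f} GB listed ({share:.0%})")
```

By this estimate, term vectors account for a bit over half of the listed file sizes, which is why dropping them is worth testing before sharding.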

Re: Localize the largest fields (content) in index

2012-03-29 Thread Vadim Kisselmann
Yes, I think so, too :) MLT doesn't really need termVectors, but it's faster with them. I found out that MLT works better on the title field in my case than on big text fields. Sharding is planned, but my setup with SolrCloud, ZK and Tomcat doesn't work, see here:

Re: Localize the largest fields (content) in index

2012-03-29 Thread Erick Erickson
I don't think there's really any reason SolrCloud won't work with Tomcat; the setup is probably just tricky. See: http://lucene.472066.n3.nabble.com/SolrCloud-new-td1528872.html It's about a year old, but might prove helpful. Best Erick On Thu, Mar 29, 2012 at 3:41 PM, Vadim Kisselmann

Localize the largest fields (content) in index

2012-03-28 Thread Vadim Kisselmann
Hello folks, I work with Solr 4.0 r1292064 from trunk. My index grows fast: with 10 million docs I get an index size of 150GB (25% stored, 75% indexed). I want to find out which fields (content) are too large, so that I can consider countermeasures. How can I localize/discover the largest fields in my index?