No, sharding into multiple cores on the same machine is still limited by the physical memory available. It's still a lot of stuff on a limited box.

But.... try backing up and re-thinking the problem a bit. Some possibilities off the top of my head:

1> Have a new boolean field, "current". When you update a doc, re-index the old doc with current=0 and put current=1 in the new doc. Getting one and only one version back is then really simple.
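For instance, with SolrJ, something like the sketch below. Totally untested, and the "id" scheme here is invented, so adjust to your schema. Also note Solr 3.x has no atomic updates, so flipping the flag means re-sending the old document in full:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class CurrentFlagSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://172.19.108.78:8080/solr");

        // The old version of A1.txt gets re-added with current=false
        // (all of its other stored fields must be re-sent too)...
        SolrInputDocument oldVersion = new SolrInputDocument();
        oldVersion.addField("id", "A1.txt-jid1"); // hypothetical unique key
        oldVersion.addField("conv_sort", "A1.txt");
        oldVersion.addField("current", false);

        // ...and the new version goes in with current=true.
        SolrInputDocument newVersion = new SolrInputDocument();
        newVersion.addField("id", "A1.txt-jid2");
        newVersion.addField("conv_sort", "A1.txt");
        newVersion.addField("current", true);

        solr.add(oldVersion);
        solr.add(newVersion);
        solr.commit();

        // "Latest version only" becomes a cheap filter query -- no grouping:
        SolrQuery q = new SolrQuery("conv_sort:A1.txt");
        q.addFilterQuery("current:true");
        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound()); // expect 1
    }
}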
2> Use external file fields (EFF) for the same purpose; that won't require you to re-index the doc. The trick here is that you use the value in the EFF as a multiplier for the score (that's what function queries do), so older versions of the doc get scores of 0 and just don't show up.
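Again an untested sketch; the fieldType/field/file names are invented, so check them against the ExternalFileField example in your version's schema.xml:

// Assumes a Solr 3.x ExternalFileField wired up in schema.xml roughly like:
//
//   <fieldType name="extFloat" keyField="id" defVal="1"
//              class="solr.ExternalFileField" valType="pfloat"/>
//   <field name="current_mult" type="extFloat"/>
//
// plus a file named external_current_mult in the core's data directory with
// one "key=value" line per superseded document, e.g.:
//
//   A1.txt-jid1=0
//
// The file is re-read when a new searcher opens (e.g. after a commit), so
// old versions can be zeroed out without re-indexing anything.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class EffMultiplierSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://172.19.108.78:8080/solr");

        // {!boost} multiplies each hit's score by the function value --
        // here the per-document multiplier read from the external file.
        // Superseded docs come back with score 0, current ones unchanged.
        SolrQuery q = new SolrQuery("{!boost b=current_mult v=$qq}");
        q.set("qq", "conv_sort:A1.txt");
        System.out.println(solr.query(q).getResults());
    }
}

(Strictly speaking a zero-scored doc still matches, but with rows=10 and lots of hits the old versions sink to the very bottom and never make the first page.)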
3> Implement a custom collector that replaces older hits with newer hits. Actually, I don't particularly like this one, because it could replace a higher-scoring document with a lower-scoring one in the results list...

Bottom line here is I don't think grouping is a good approach for this problem....

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee
<tchatter...@commvault.com> wrote:
> Hi Erick,
> You are so right on the memory calculations. I am happy that I now know
> I was doing something wrong. Yes, I am getting confused with SQL.
>
> I will back up and let you know the use case. I am tracking file versions,
> and I want to give an option to browse your system for the latest files. So
> in order to remove dups (same filename) I used grouping.
>
> Also, when you say sharding, is it okay if I do multiple cores, and does it
> mean that each core needs a separate Tomcat? I meant to say, can I use the
> same machine? The 150 million docs have 120 million unique paths too.
>
> One more thing: if I need sharding and need a new box, then it won't be
> great, because this system still has horsepower left which I can use.
>
> Thanks a ton for explaining the issue.
>
> Erick Erickson <erickerick...@gmail.com> wrote:
>
> You're putting a lot of data on a single box, then
> asking to group on what I presume is a string
> field. That's just going to eat up a _bunch_ of
> memory.
>
> Let's say your average file name is 16 characters long. Each
> unique value will take up 58 + 32 bytes (58 bytes
> of overhead, I'm presuming Solr 3.x, and 16*2 bytes
> for the chars). So we're up to 90 bytes/string times the number
> of distinct file names. Say you have, for argument's
> sake, 100M distinct file names. You're up to a 9G
> memory requirement for sorting alone. Solr's
> sorting reads all the unique values into memory whether
> or not they satisfy the query...
>
> And grouping can also be expensive. I don't think
> you really want to group in this case; I'd simply use
> a filter query, something like:
> fq=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"
>
> Then you're also grouping on conv_sort, which doesn't
> make much sense. Do you really want individual results returned
> for _each_ file name?
>
> What it looks like to me is you're confusing SQL with
> Solr search and getting into bad situations...
>
> Also, 150M documents in a single shard is... really a lot.
> You're probably at a point where you need to shard. Not
> to mention that your 400G index is trying to be jammed
> into 12G of memory.
>
> This actually feels like an XY problem. Can you back
> up and let us know what the use case you're
> trying to solve is? Perhaps there are less memory-
> consumptive solutions possible.
>
> Best
> Erick
>
> On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee
> <tchatter...@commvault.com> wrote:
>> Editing the query... remove <smb:.....>, I don't know where it came from
>> while I did copy/paste....
>>
>> Tirthankar Chatterjee <tchatter...@commvault.com> wrote:
>>
>> Hi,
>> I have a beefy box: 24GB RAM (12GB for Tomcat 7, which houses Solr 3.6),
>> 2 Intel Xeon 64-bit server processors, 30TB HDD, JDK 1.7.0_03 x64.
>>
>> Data index dir size: 400GB
>> Metadata of files is stored in it. I have around 15 schema fields.
>> Total number of items: ~150 million.
>>
>> I have a scenario which I will try to explain to the best of my
>> knowledge here.
>>
>> Let us consider the fields I am interested in:
>>
>> url: entire path of a file in the Windows file system, including the
>> filename, e.g. C:\Documents\A.txt
>> mtm: modified time of the file
>> jid: job ID
>> conv_sort: string field where the filename is stored
>>
>> I run a job where the following gets inserted:
>>
>> Total items: 2
>> url: C:\personal\A1.txt
>> mtm: 08/14/2012 12:00:00
>> jid: 1
>> conv_sort: A1.txt
>> -----------------------------------
>> url: C:\personal\B1.txt
>> mtm: 08/14/2012 12:01:00
>> jid: 1
>> conv_sort: B1.txt
>>
>> In the second run only one item changes:
>>
>> url: C:\personal\A1.txt
>> mtm: 08/15/2012 1:00:00
>> jid: 2
>> conv_sort: A1.txt
>>
>> When queried, I would like to return the latest A1.txt and B1.txt back to
>> the end user. I am trying to use grouping with no luck. It keeps throwing
>> OOM... can someone please help, as it is critical for my project.
>>
>> The query I am trying: under a folder there are 1000 files, and I am
>> putting a filter query param too, asking it to group by filename or url,
>> and none of them work. What am I doing wrong here?
>>
>> http://172.19.108.78:8080/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&group.query=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"&group=true&group.limit=1&group.field=conv_sort&group.ngroups=true
>>
>> The stack trace:
>>
>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>         at java.util.Arrays.copyOfRange(Unknown Source)
>>         at java.lang.String.<init>(Unknown Source)
>>         at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
>>         at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
>>         at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:882)
>>         at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:233)
>>         at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:856)
>>         at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.setNextReader(TermFirstPassGroupingCollector.java:74)
>>         at org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)
>>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:576)
>>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:364)
>>         at org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:376)
>>         at org.apache.solr.search.Grouping.execute(Grouping.java:298)
>>         at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:372)
>>         at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
>>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
>>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>>         at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
>>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>>         at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>>         at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
>>         at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
>>         at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1770)