No, sharding into multiple cores on the same machine is still
limited by the physical memory available. It's still a lot
of stuff on a limited box.

But.... try backing up and re-thinking the problem a bit.
Some possibilities off the top of my head:

1> have a new field "current". When you update a doc,
     reindex the old doc with current=0 and put current=1
     in the new doc (boolean field). Getting one and
     only one is really simple; there's a sketch below this list.
2> Use external file fields (EFFs) for the same purpose, which
     won't require you to re-index the doc. The trick
     here is that you use the value in the EFF as a multiplier
     for the score (that's what function queries do), so older
     versions of the doc have scores of 0 and just don't
     show up; see the second sketch below the list.
3> Implement a custom collector that replaces older hits
     with newer hits. Actually, I don't particularly like this one
     because it could replace a higher-scoring
     document with a lower-scoring one in the results list...
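
For (1), a minimal sketch, assuming your schema.xml already has the
stock "boolean" field type; the field name and the host/port are just
examples:

  In schema.xml:

    <field name="current" type="boolean" indexed="true" stored="true"/>

  When version 2 of A1.txt arrives, index it with current=true and
  resubmit the complete version-1 document with current=false (3.6
  has no atomic updates, so flipping the flag means reindexing the
  whole old doc). Then "latest versions only" is just a filter, no
  grouping at all:

    http://localhost:8080/solr/select?q=*:*&fq=current:true

  The fq result is kept in the filterCache, so after the first query
  the restriction costs almost nothing.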
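
And a sketch of (2). The type, field and file names here are made up,
and check the ExternalFileField docs for your exact version, but
roughly:

  In schema.xml ("id" stands for whatever your uniqueKey field is):

    <fieldType name="extfile" class="solr.ExternalFileField"
               keyField="id" defVal="1" valType="pfloat"/>
    <field name="current_mult" type="extfile"/>

  Then keep a plain-text file named external_current_mult in the data
  directory (location varies by version, check the docs), one
  key=value line per superseded doc, e.g.:

    doc-id-of-old-A1.txt=0

  and rewrite that file whenever a new version shows up. No reindexing,
  though if I remember right the file is only re-read when a new
  searcher opens, i.e. after a commit. Use the value as a score
  multiplier:

    q={!boost b=current_mult}filefolder:"..."

  or, to actually drop the old versions from the result set rather
  than just score them 0, filter on the function value instead:

    fq={!frange l=1}current_mult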

Bottom line here is I don't think grouping is a good approach
for this problem....

Best
Erick

On Wed, Aug 15, 2012 at 5:04 AM, Tirthankar Chatterjee
<tchatter...@commvault.com> wrote:
> Hi Erick,
> You are so right on the memory calculations. I am happy to now know that
> I was doing something wrong. Yes, I am getting confused with SQL.
>
> I will back up and let you know the use case. I am tracking file versions,
> and I want to give an option to browse the system for the latest files. So
> in order to remove dups (same filename) I used grouping.
>
> Also, when you say sharding, is it okay if I do multiple cores, and does it
> mean that each core needs a separate Tomcat? I meant to say, can I use the
> same machine? The 150 million docs have 120 million unique paths too.
>
> One more thing: if I need sharding and need a new box, then it won't be
> great, because this system still has horsepower left which I can use.
>
> Thanks a ton for explaining the issue.
>
> Erick Erickson <erickerick...@gmail.com> wrote:
>
>
> You're putting a lot of data on a single box, then
> asking to group on what I presume is a string
> field. That's just going to eat up a _bunch_ of
> memory.
>
> Let's say your average file name is 16 characters long. Each
> unique value will take up 58 + 32 bytes (58 bytes
> of overhead, I'm presuming Solr 3.X, and 16*2 bytes
> for the chars). So we're up to 90 bytes/string times the number
> of distinct file names. Say you have, for argument's
> sake, 100M distinct file names. You're up to a 9G
> memory requirement for sorting alone. Solr's
> sorting reads all the unique values into memory whether
> or not they satisfy the query...
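>
> To spell the arithmetic out (all numbers from above):
>
>   16 chars * 2 bytes/char       = 32 bytes
>   58 bytes overhead + 32 bytes  = 90 bytes per unique value
>   90 bytes * 100M unique values = 9,000,000,000 bytes, i.e. ~9G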
>
> And grouping can also be expensive. I don't think
> you really want to group in this case; I'd simply use
> a filter query, something like:
> fq=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"
>
> Then you're also grouping on conv_sort, which doesn't
> make much sense. Do you really want individual results returned
> for _each_ file name?
>
> What it looks like to me is that you're confusing SQL with
> Solr search and getting into bad situations...
>
> Also, 150M documents in a single shard is...really a lot.
> You're probably at a point where you need to shard. Not
> to mention that your 400G index is trying to be jammed
> into 12G of memory.
>
> This actually feels like an XY problem, can you back
> up and let us know what the use-case you're
> trying to solve is? Perhaps there are less memory-
> consumptive solutions possible.
>
> Best
> Erick
>
> On Tue, Aug 14, 2012 at 6:38 AM, Tirthankar Chatterjee
> <tchatter...@commvault.com> wrote:
>> Editing the query... remove <smb:.....>; I don't know where it came from
>> while I did copy/paste....
>>
>> Tirthankar Chatterjee <tchatter...@commvault.com> wrote:
>>
>>
>> Hi,
>> I have a beefy box with 24GB RAM (12GB for Tomcat 7, which houses Solr 3.6),
>> 2 Intel Xeon 64-bit processors, 30TB HDD, and JDK 1.7.0_03 x64.
>>
>>
>> Data index dir size: 400GB
>> It stores the metadata of files; I have around 15 schema fields.
>> Total number of items: 150 million approx.
>>
>> I have a scenario which I will try to explain to the best of my knowledge 
>> here:
>>
>> Let us consider the fields I am interested in:
>>
>> url: entire path of a file in the Windows file system, including the
>> filename, e.g. C:\Documents\A.txt
>> mtm: modified time of the file
>> jid: job ID
>> conv_sort: a string field where the filename is stored
>>
>> I run a job where the following gets inserted:
>>
>> Total items: 2
>>
>> url: C:\personal\A1.txt
>> mtm: 08/14/2012 12:00:00
>> jid: 1
>> conv_sort: A1.txt
>> -----------------------------------
>> url: C:\personal\B1.txt
>> mtm: 08/14/2012 12:01:00
>> jid: 1
>> conv_sort: B1.txt
>>
>> In the second run only one item changes:
>>
>> url: C:\personal\A1.txt
>> mtm: 08/15/2012 1:00:00
>> jid: 2
>> conv_sort: A1.txt
>>
>> When queried, I would like to return the latest A1.txt and B1.txt to the
>> end user. I am trying to use grouping with no luck. It keeps throwing
>> OOM... can someone please help... as it is critical for my project.
>>
>> The query I am trying: under one folder there are 1000 files, and I am
>> putting in a filtered query param too, asking it to group by filename or
>> url, and none of them work... what am I doing wrong here?
>>
>>
>> http://172.19.108.78:8080/solr/select/?q=*:*&version=2.2&start=0&rows=10&indent=on&group.query=filefolder:"E\:\\pd_dst\\646c6907-a948-4b83-ac1d-d44742bb0307"&group=true&group.limit=1&group.field=conv_sort&group.ngroups=true
>>
>>
>> The stack trace:
>>
>>
>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>         at java.util.Arrays.copyOfRange(Unknown Source)
>>         at java.lang.String.<init>(Unknown Source)
>>         at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
>>         at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
>>         at org.apache.lucene.search.FieldCacheImpl$StringIndexCache.createValue(FieldCacheImpl.java:882)
>>         at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:233)
>>         at org.apache.lucene.search.FieldCacheImpl.getStringIndex(FieldCacheImpl.java:856)
>>         at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.setNextReader(TermFirstPassGroupingCollector.java:74)
>>         at org.apache.lucene.search.MultiCollector.setNextReader(MultiCollector.java:113)
>>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:576)
>>         at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:364)
>>         at org.apache.solr.search.Grouping.searchWithTimeLimiter(Grouping.java:376)
>>         at org.apache.solr.search.Grouping.execute(Grouping.java:298)
>>         at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:372)
>>         at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
>>         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>>         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
>>         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
>>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
>>         at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>>         at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>>         at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
>>         at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
>>         at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
>>         at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>>         at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>>         at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>>         at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>>         at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>>         at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
>>         at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
>>         at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:1770)
>>
>>
>>
