Grouping and sorting Together
Hi List,

I need your help with a problem I have been struggling with for days.

Take an example of shoes grouped by size and price.

The first group, with size and price "7 and 7000", has 2 documents:
{id:1,color:blue,item sold:10}
{id:5,color:yellow,item sold:1}

The second group, with size and price "8 and 8000", has 2 documents:
{id:2,color:blue,item sold:3}
{id:3,color:yellow,item sold:5}

Now I want to sort the records by item sold. How should I approach this? Should I remove the grouping, sort the results, and show them? I ask because, as you can see, the first group has items with 10 and 1 sold, and the second group has 3 and 5.

Regards,
Neo

-- Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
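One possible reading of the question is: keep the groups, sort the documents inside each group by items sold, and order the groups by their best seller. The sketch below illustrates that ordering in plain Java on the four example documents; it does not call Solr. The class and field names (GroupSortSketch, Shoe, itemsSold) are hypothetical. In Solr's result grouping, the closest equivalent is probably the `sort` parameter (orders groups) combined with `group.sort` (orders documents within a group), but please verify that against your Solr version.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupSortSketch {

    // Hypothetical document shape, modelled on the example in the question.
    static final class Shoe {
        final int id;
        final int size;
        final int price;
        final String color;
        final int itemsSold;

        Shoe(int id, int size, int price, String color, int itemsSold) {
            this.id = id;
            this.size = size;
            this.price = price;
            this.color = color;
            this.itemsSold = itemsSold;
        }

        @Override
        public String toString() {
            return "{id:" + id + ",color:" + color + ",item sold:" + itemsSold + "}";
        }
    }

    // Groups by "size/price", sorts each group by items sold (descending),
    // then orders the groups by the items sold of their top document.
    static List<List<Shoe>> groupAndSort(List<Shoe> shoes) {
        Map<String, List<Shoe>> groups = new LinkedHashMap<>();
        for (Shoe s : shoes) {
            groups.computeIfAbsent(s.size + "/" + s.price, k -> new ArrayList<>()).add(s);
        }
        Comparator<Shoe> bySoldDesc =
                Comparator.comparingInt((Shoe s) -> s.itemsSold).reversed();
        List<List<Shoe>> result = new ArrayList<>(groups.values());
        for (List<Shoe> group : result) {
            group.sort(bySoldDesc);
        }
        // After the inner sort, element 0 of every group is its best seller.
        result.sort(Comparator.comparingInt((List<Shoe> g) -> g.get(0).itemsSold).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Shoe> shoes = List.of(
                new Shoe(1, 7, 7000, "blue", 10),
                new Shoe(5, 7, 7000, "yellow", 1),
                new Shoe(2, 8, 8000, "blue", 3),
                new Shoe(3, 8, 8000, "yellow", 5));
        for (List<Shoe> group : groupAndSort(shoes)) {
            System.out.println(group);
        }
    }
}
```

If a flat list sorted purely by items sold is what is needed, dropping the grouping and sorting on that field alone is the simpler route.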
Unable to Create a Core
Hi List,

I am unable to create a core and cannot figure out what is wrong. I get the error below.

ERROR: Failed to create collection 'XXX' due to: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://xyz.com:8983/solr: Error CREATEing SolrCore 'docpocc_shard1_replica1': Unable to create core [docpocc_shard1_replica1] Caused by: Missing required init param 'defaultFieldType'

In my Solr config file I have the init param as below:

_text_

Any help or pointers? Thanks in advance.

Regards,
Neo
Re: Indexing part of Binary Documents and not the entire contents
Gus,

You are never biased. I explored JesterJ a bit and it looks quite promising. I will keep you posted on my experience soon.

Regards,
Neo
Re: Indexing part of Binary Documents and not the entire contents
Thanks Erick,

I have already gone through the link to the Tika example you shared. Please look at the parser.parse(...) call marked with a comment in the code below (it was in bold in my original message). I believe the entire contents are still pushed to memory through the handler object. Sorry, I copied lengthy code from the Tika site.

Regards,
Neo

*Streaming the plain text in chunks*

Sometimes, you want to chunk the resulting text up, perhaps to output as you go minimising memory use, perhaps to output to HDFS files, or any other reason! With a small custom content handler, you can do that.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.ContentHandlerDecorator;
    import org.xml.sax.SAXException;

    public List<String> parseToPlainTextChunks() throws IOException, SAXException, TikaException {
        final List<String> chunks = new ArrayList<>();
        chunks.add("");
        ContentHandlerDecorator handler = new ContentHandlerDecorator() {
            @Override
            public void characters(char[] ch, int start, int length) {
                String lastChunk = chunks.get(chunks.size() - 1);
                String thisStr = new String(ch, start, length);

                // Start a new chunk once the current one would grow past the limit.
                if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) {
                    chunks.add(thisStr);
                } else {
                    chunks.set(chunks.size() - 1, lastChunk + thisStr);
                }
            }
        };

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
            parser.parse(stream, handler, metadata); // <-- the call in question
            return chunks;
        }
    }
Re: Indexing part of Binary Documents and not the entire contents
Thanks Shawn,

Yes, I agree that ERH is never recommended in production. I am writing my own custom program. Any pointers on this?

What I am looking for is exactly that: a custom indexing program that compiles precisely the information I need and sends it to Solr.

On the other hand, I see that the method below is very expensive if the document is large:

autoParser.parse(input, textHandler, metadata, context);

This is because the ContentHandler would hold the entire contents in memory. Any suggestions?

Regards,
Neo
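A minimal sketch of one way to avoid holding the whole document: a SAX handler that buffers only the current sentence or line and keeps it only if it contains a configured keyword. This is an assumption about what "parts that match keywords" means, not something taken from the Tika examples. The class name KeywordSnippetHandler and the maxSegmentChars parameter are hypothetical. It uses only org.xml.sax.helpers.DefaultHandler from the JDK, so it runs without Tika on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: retains only text segments that contain a keyword.
final class KeywordSnippetHandler extends DefaultHandler {
    private final List<String> keywords = new ArrayList<>();
    private final int maxSegmentChars;
    private final StringBuilder segment = new StringBuilder();
    private final List<String> snippets = new ArrayList<>();

    KeywordSnippetHandler(List<String> keywords, int maxSegmentChars) {
        for (String k : keywords) {
            this.keywords.add(k.toLowerCase(Locale.ROOT));
        }
        this.maxSegmentChars = maxSegmentChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c == '.' || c == '\n') {
                flush(); // a segment ends at a full stop or a line break
            } else {
                segment.append(c);
                // Hard cap so a document without separators cannot grow the buffer.
                // A keyword that straddles this forced cut would be missed.
                if (segment.length() >= maxSegmentChars) {
                    flush();
                }
            }
        }
    }

    @Override
    public void endDocument() {
        flush(); // do not lose the trailing segment
    }

    // Keeps the buffered segment if it contains any keyword, then clears the buffer.
    private void flush() {
        String text = segment.toString().trim();
        segment.setLength(0);
        if (text.isEmpty()) {
            return;
        }
        String lower = text.toLowerCase(Locale.ROOT);
        for (String k : keywords) {
            if (lower.contains(k)) {
                snippets.add(text);
                return;
            }
        }
    }

    List<String> snippets() {
        return snippets;
    }

    public static void main(String[] args) {
        KeywordSnippetHandler handler = new KeywordSnippetHandler(List.of("invoice"), 1000);
        // Text arrives in arbitrary pieces, as it would from a parser.
        char[] part1 = "Intro text. The invo".toCharArray();
        char[] part2 = "ice number is 42. Footer".toCharArray();
        handler.characters(part1, 0, part1.length);
        handler.characters(part2, 0, part2.length);
        handler.endDocument();
        System.out.println(handler.snippets());
    }
}
```

Since DefaultHandler implements org.xml.sax.ContentHandler, an instance like this could presumably be passed as the handler argument of the parse call quoted above. Memory use is then bounded by maxSegmentChars plus the matched snippets, although the parser itself may still buffer internally for some formats.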
Re: Indexing part of Binary Documents and not the entire contents
Thanks Erick,

I have seen this article in several places but never went through it seriously. Don't you think the method below is very expensive?

autoParser.parse(input, textHandler, metadata, context);

If the document is large, it will need enough memory to hold the whole document (i.e. in the ContentHandler). Is there any alternative?

Regards,
Neo
Indexing part of Binary Documents and not the entire contents
Hi List,

I have a specific requirement where I need to index the following:

1. The metadata of any document
2. Some parts of the document that match keywords that I configure

The first part I am able to achieve through ERH or FileListEntityProcessor. I am struggling with the second part and am looking for an effective and smart approach to handle it. Can anyone give me a pointer or help with this?

Thanks in advance!

Regards,
Neo
Re: Decision on Number of shards and collection
Hi Shawn,

Thanks for the long explanation. So the 2 billion limit can be overcome by using shards.

Now, coming back to collections: unless we have a logical or business reason, we should not go for more than one collection.

Let's say I have 5 different entities with 10, 20, 30, 40, and 50 attributes (columns) each to be indexed/stored. If I store them in a single collection, is empty space created in any way? Put another way, if I store heterogeneous data items in a single collection, is memory poorly utilized because of empty holes?

What are the pros and cons of single vs. multiple collections?

Thanks, team, for spending your valuable time to clarify.

Regards,
Neo
Re: Decision on Number of shards and collection
Emir,

I read from the link you shared that "Shard cannot contain more than 2 billion documents since Lucene is using integer for internal IDs."

In which Java class of the Solr implementation repository can this be found?

Regards,
Neo
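To make the quoted limit concrete: a Java int tops out at 2,147,483,647, which is where the "2 billion" figure comes from. As far as I know, the actual cap is enforced on the Lucene side in org.apache.lucene.index.IndexWriter through its MAX_DOCS constant (Integer.MAX_VALUE - 128), not in Solr's own classes; please check this against the Lucene version you run. The sketch below mirrors that constant in a hypothetical class (DocIdLimitSketch) and does not depend on Lucene.

```java
final class DocIdLimitSketch {

    // Assumption: mirrors Lucene's IndexWriter.MAX_DOCS (Integer.MAX_VALUE - 128).
    static final int MAX_DOCS = Integer.MAX_VALUE - 128;

    // True if a document count could fit in a single Lucene index (one shard replica).
    static boolean fitsInOneIndex(long docCount) {
        return docCount >= 0 && docCount <= MAX_DOCS;
    }

    public static void main(String[] args) {
        System.out.println("Largest int:        " + Integer.MAX_VALUE);
        System.out.println("Assumed doc cap:    " + MAX_DOCS);
        System.out.println("2 billion docs fit: " + fitsInOneIndex(2_000_000_000L));
        System.out.println("3 billion docs fit: " + fitsInOneIndex(3_000_000_000L));
    }
}
```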
Re: Decision on Number of shards and collection
Thanks, everyone, for your beautiful explanations and valuable time.

Thanks Emir for the nice link (http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html)
Thanks Shawn for https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

When should we have more collections?
- We have a business reason to keep them in separate collections.
- We don't need to query all data at once.

When should we have more shards?
- Define the acceptable latency.
- Keep adding documents to a shard for as long as latency stays acceptable. That defines the shard size (SS).
- Get the size of all data to be indexed (TS).
- numshards = TS/SS

One quick question. @Shawn, if I have data in more than one collection, can I still query them all at once? I think yes, from what I read on the Solr site.

What are the pros and cons of a single collection vs. multiple collections?

I have gone through the Lucid article on estimating memory and storage for Solr (https://lucidworks.com/2011/09/14/estimating-memory-and-storage-for-lucenesolr/).

@SOLR4189, I will go through the book and get back to you. Thanks.

Time is too short to explore this long-lived open source technology.

Regards,
Neo
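The numshards = TS/SS rule of thumb above can be written as a small helper. This is only a sketch under the assumptions that TS and SS are measured in the same unit (GB or document count) and that a partial shard rounds up to a whole one. The class name ShardCountSketch and the figures in main are hypothetical.

```java
final class ShardCountSketch {

    // numShards = ceil(TS / SS), never less than 1.
    // totalSize (TS) and shardSize (SS) must use the same unit.
    static long numShards(long totalSize, long shardSize) {
        if (totalSize < 0 || shardSize <= 0) {
            throw new IllegalArgumentException("totalSize must be >= 0 and shardSize > 0");
        }
        long shards = (totalSize + shardSize - 1) / shardSize; // integer ceiling
        return Math.max(1, shards);
    }

    public static void main(String[] args) {
        // Hypothetical figures: 1,210 GB in total, 50 GB per shard at acceptable latency.
        System.out.println(numShards(1210, 50));
    }
}
```

The measured shard size only holds for the hardware and query mix it was benchmarked on, so the result is a starting point rather than a fixed answer.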
Re: Indexing fails with partially done
Thanks Emir.

In the context of DIH, do we have any resume mechanism?

Regards,
Neo
Re: Decision on Number of shards and collection
Hi Emir,

Thanks a lot for your reply.

So when I design a Solr ecosystem, I should start with a rough guess at the number of shards and increase it to improve performance. What is the accepted or ideal response time? There should be a trade-off between response time and the number of shards as data keeps growing.

I agree that we split our index when response time increases. So what should that response time threshold, or query latency, be?

Thanks again!

Regards,
priyadarshi
Indexing fails with partially done
With SolrCloud, what happens if indexing is partially completed and the ZooKeeper ensemble goes down? What are the ways to resume?

In one scenario, I am using 3 ZK nodes in the ensemble. Let's say I am indexing 5 million documents, I have indexed part of the data, and the ZK ensemble goes down. What would be the best approach for handling such a scenario?

Regards,
Neo
Decision on Number of shards and collection
Hi Team,

First of all, I take this opportunity to thank you all for creating a beautiful place where people can explore, learn, and debate.

I have been on my knees for a couple of days trying to decide on this.

When I am creating a SolrCloud ecosystem, I need to decide on the number of shards and collections. What are the best practices for making these decisions?

I believe heterogeneous data can be indexed into the same collection, and I can have multiple shards so that the index is partitioned. So what is the need for a second collection? Yes, when the collection size grows I should look at more collections, but what exactly is that size? What KPI drives the decision to have more collections? Any pointers or links to best practices?

When should I go for multiple shards? When the shard size grows, right? What is that size, and how do I benchmark it?

I am sorry if this question has already been asked, but I have googled all over (Quora, Stack Overflow, Lucid).

Regards,
Neo