Re: how can i use solrj binary format for indexing?
Hi Gora,

I really appreciate it. Your reply was a great help to me. :) I hope everything is fine with you.

Regards,
Jason

Gora Mohanty-3 wrote:
> [...]

--
View this message in context: http://lucene.472066.n3.nabble.com/how-can-i-use-solrj-binary-format-for-indexing-tp1722612p1750669.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: how can i use solrj binary format for indexing?
On Mon, Oct 18, 2010 at 8:22 PM, Jason, Kim hialo...@gmail.com wrote:

Sorry for the delay in replying. I was caught up in various things this week.

> Thank you for the reply, Gora. But I still have several questions.
> Did you use separate indexes? If so, you indexed 0.7 million XML files
> per instance and merged them. Is that right?

Yes, that is correct. We sharded the data by user ID, so that each of the five instances held approximately 0.7 million of the 3.5 million records, spread over its five cores (25 cores in all). We could have used the sharded indices directly for search, but at least for now have decided to go with a single, merged index.

> Please let me know how to work with multiple instances and cores in your case.
[...]

* Multi-core Solr setup is quite easy, via configuration in solr.xml: http://wiki.apache.org/solr/CoreAdmin . The configuration, i.e., schema, solrconfig.xml, etc., needs to be replicated across the cores.
* Decide which XML files you will post to which core, and do the POST with curl, as usual. You might need to write a little script to do this.
* After indexing on the cores is done, make sure to do a commit on each.
* Merge the sharded indexes (if desired) as described here: http://wiki.apache.org/solr/MergingSolrIndexes . One thing to watch out for here is disk space. When merging with Lucene IndexMergeTool, we found that a rough rule of thumb was that intermediate steps in the merge would require about twice as much space as the total size of the indexes to be merged. I.e., if one is merging 40GB of data in sharded indexes, one should have at least 120GB free.

Regards,
Gora
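The "little script" that decides which file goes to which core could look roughly like the sketch below. It is only an illustration, not the setup actually used: the numeric user ID, the modulo routing, the port numbering (8983 upwards), the core names (core0 to core4) and the localhost URLs are all assumptions. Only the five instances with five cores each come from this thread. The URLs it prints are the ones you would hand to curl for the POST and for the per-core commit.

```java
// Sketch only: routes a record to one of 25 shards by user ID and builds
// the update URL for that shard. Ports, core names and host are assumed.
class ShardPlan {
    static final int INSTANCES = 5;          // from the thread
    static final int CORES_PER_INSTANCE = 5; // from the thread
    static final int BASE_PORT = 8983;       // assumption: instance i listens on 8983 + i

    /** Maps a user ID to a shard number in [0, 25), also for negative IDs. */
    static int shardFor(long userId) {
        long shards = (long) INSTANCES * CORES_PER_INSTANCE;
        return (int) Math.floorMod(userId, shards);
    }

    /** Update URL for a shard; the "/solr/coreN/update" layout is an assumption. */
    static String updateUrl(int shard) {
        int instance = shard / CORES_PER_INSTANCE;
        int core = shard % CORES_PER_INSTANCE;
        return "http://localhost:" + (BASE_PORT + instance) + "/solr/core" + core + "/update";
    }

    /** URL to request once per core after all files have been posted. */
    static String commitUrl(int shard) {
        return updateUrl(shard) + "?commit=true";
    }

    public static void main(String[] args) {
        long[] userIds = {0L, 7L, 24L, 25L, 1234567L}; // hypothetical IDs
        for (long id : userIds) {
            int shard = shardFor(id);
            System.out.println("user " + id + " -> shard " + shard + " -> " + updateUrl(shard));
        }
        // One commit per core once indexing is done.
        for (int shard = 0; shard < INSTANCES * CORES_PER_INSTANCE; shard++) {
            System.out.println("commit: " + commitUrl(shard));
        }
    }
}
```

Routing by a stable key such as the user ID keeps all of one user's records in the same core, which matters only if the shards are searched directly instead of being merged.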
Re: how can i use solrj binary format for indexing?
Hi,

you can try to parse the XML in Java yourself and then push the SolrInputDocuments to Solr via SolrJ. Setting the format to binary and using the streaming update processor should improve performance, but I am not sure... And performant (and less memory-hungry!) reading of XML in Java is another topic ... ;-)

Regards,
Peter.

> Hi all,
>
> I have a huge number of XML files to index. I want to index them using
> the SolrJ binary format to gain performance, because I heard that
> indexing from XML files is quite slow. But I don't know how to index
> through the SolrJ binary format, and I can't find any examples.
> Please give me some help.
>
> Thanks,

--
http://jetwick.com twitter search prototype
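On the "reading XML in Java" part, a streaming (StAX) parser keeps memory use low because it never builds a whole document tree. The sketch below uses only the JDK. It assumes the files are already in Solr's `<add><doc><field name="...">` update format, which may not match your data, and the class name `SolrXmlReader` is made up for illustration. The handler callback is the place where you would build a SolrInputDocument and pass it to SolrJ; that step is left out here because it needs the SolrJ jar on the classpath.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Sketch only: streams <doc> elements one at a time instead of loading the file.
class SolrXmlReader {
    /** Calls handler once per <doc> with its fields; returns the number of docs read. */
    static int read(Reader in, Consumer<Map<String, List<String>>> handler) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Indexing input needs no DTDs or external entities, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        int count = 0;
        try {
            XMLStreamReader xml = factory.createXMLStreamReader(in);
            Map<String, List<String>> current = null;
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if ("doc".equals(name)) {
                        current = new LinkedHashMap<>();
                    } else if ("field".equals(name) && current != null) {
                        String fieldName = xml.getAttributeValue(null, "name");
                        String value = xml.getElementText(); // reads up to </field>
                        if (fieldName != null) {
                            // A list per field name keeps multi-valued fields intact.
                            current.computeIfAbsent(fieldName, k -> new ArrayList<>()).add(value);
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && current != null && "doc".equals(xml.getLocalName())) {
                    handler.accept(current); // build and send a SolrInputDocument here
                    current = null;
                    count++;
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML input", e);
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "<add>"
                + "<doc><field name=\"id\">1</field><field name=\"tag\">a</field>"
                + "<field name=\"tag\">b</field></doc>"
                + "<doc><field name=\"id\">2</field></doc>"
                + "</add>";
        int n = read(new StringReader(sample), fields -> System.out.println(fields));
        System.out.println(n + " documents read");
    }
}
```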
Re: how can i use solrj binary format for indexing?
Hi Gora,

I haven't yet tried indexing a huge number of XML files through curl or pure Java (like post.jar). Is indexing through XML really fast? How many files did you index? And how did you do it (using curl or pure Java)?

Thanks, Gora
Re: how can i use solrj binary format for indexing?
On Mon, Oct 18, 2010 at 5:26 PM, Jason, Kim hialo...@gmail.com wrote:
> Hi Gora,
>
> I haven't yet tried indexing a huge number of XML files through curl
> or pure Java (like post.jar). Is indexing through XML really fast?
> How many files did you index? And how did you do it (using curl or
> pure Java)?
[...]

We did it through curl. There were some 3.5 million XML files, and some 60 fields in the Solr schema, with minor tokenising, though with some facets. A total of about 40GB of data. We used five Solr instances, and five cores on each instance. From what I recall, it took 6h, though here we might well have been limited by the read speed on a slow network drive that held the data. If done in this way, one might need to merge the data from the various cores, a task which took us about 1.5h.

Regards,
Gora
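For anyone sizing a similar job, the figures above reduce to a few lines of arithmetic. The sketch below only restates numbers given in this thread (3.5 million files, 5 instances of 5 cores, 6h, 40GB) and applies the "about twice the index size" working-space rule of thumb for merging that is mentioned elsewhere in the thread. The class and method names are made up, and even splits across cores are an assumption.

```java
// Sketch only: back-of-the-envelope figures, assuming an even split across shards.
class IndexingEstimate {
    /** Records per shard when totalDocs is split evenly. */
    static long docsPerShard(long totalDocs, int shards) {
        return totalDocs / shards;
    }

    /** Average indexing rate over the whole run. */
    static double docsPerSecond(long totalDocs, double hours) {
        return totalDocs / (hours * 3600.0);
    }

    /** Working space for a merge, using the thread's "about twice" rule of thumb. */
    static long mergeScratchGb(long indexGb) {
        return 2 * indexGb;
    }

    public static void main(String[] args) {
        long totalDocs = 3_500_000L;
        System.out.println("per instance (5):  " + docsPerShard(totalDocs, 5));
        System.out.println("per core (25):     " + docsPerShard(totalDocs, 25));
        System.out.println("docs per second:   " + docsPerSecond(totalDocs, 6.0));
        System.out.println("merge scratch GB:  " + mergeScratchGb(40));
    }
}
```

Read this way, the 120GB quoted in the thread is consistent with 80GB of working space plus the 40GB of source indexes, though that interpretation is a guess.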
Re: how can i use solrj binary format for indexing?
Do you already have the files as Solr XML? If so, I don't think you need SolrJ.

If you need to build SolrInputDocuments from your existing structure, SolrJ is a good choice. If you are indexing lots of stuff, check the StreamingUpdateSolrServer: http://lucene.apache.org/solr/api/solrj/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

On Sun, Oct 17, 2010 at 11:01 PM, Jason, Kim hialo...@gmail.com wrote:
> Hi all,
>
> I have a huge number of XML files to index. I want to index them using
> the SolrJ binary format to gain performance, because I heard that
> indexing from XML files is quite slow. But I don't know how to index
> through the SolrJ binary format, and I can't find any examples.
> Please give me some help.
>
> Thanks,
Re: how can i use solrj binary format for indexing?
Thank you for the reply, Gora. But I still have several questions.

Did you use separate indexes? If so, you indexed 0.7 million XML files per instance and merged them. Is that right? Please let me know how to work with multiple instances and cores in your case.

Regards,
RE: how can i use solrj binary format for indexing?
> Hi all,
>
> I have a huge number of XML files to index. I want to index them using
> the SolrJ binary format to gain performance, because I heard that
> indexing from XML files is quite slow. But I don't know how to index
> through the SolrJ binary format, and I can't find any examples.
> Please give me some help.
>
> Thanks,

You might want to take a look at this section of the wiki too: http://wiki.apache.org/solr/Solrj#Setting_the_RequestWriter

-Jon

-----Original Message-----
From: Jason, Kim [mailto:hialo...@gmail.com]
Sent: Monday, October 18, 2010 7:52 AM
To: solr-user@lucene.apache.org
Subject: Re: how can i use solrj binary format for indexing?

Thank you for the reply, Gora. But I still have several questions.

Did you use separate indexes? If so, you indexed 0.7 million XML files per instance and merged them. Is that right? Please let me know how to work with multiple instances and cores in your case.

Regards,
how can i use solrj binary format for indexing?
Hi all,

I have a huge number of XML files to index. I want to index them using the SolrJ binary format to gain performance, because I heard that indexing from XML files is quite slow. But I don't know how to index through the SolrJ binary format, and I can't find any examples. Please give me some help.

Thanks,
Re: how can i use solrj binary format for indexing?
On Mon, Oct 18, 2010 at 8:31 AM, Jason, Kim hialo...@gmail.com wrote:
> Hi all,
>
> I have a huge number of XML files to index. I want to index them using
> the SolrJ binary format to gain performance, because I heard that
> indexing from XML files is quite slow.
[...]

I do not know about SolrJ's binary format, but indexing through XML is quite fast in our experience. Have you tried it out to see if it meets your requirements?

Regards,
Gora