Re: Indexing documents/files for production use
Thank you Alexandre, Jürgen and Erick for your replies. It is clear to me.

Regards,
Olivier
Indexing documents/files for production use
Hi All,

I am reading the Solr documentation. I have understood that post.jar (http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29) is not meant for production use and that cURL (https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing) is not recommended. Is SolrJ better for production?

Thank you.

Regards,
Olivier
Re: Indexing documents/files for production use
What is your production use? You have to answer that for yourself.

post.jar makes a couple of things easy. If your production use fits into those (e.g. no cluster), great, use it. It is certainly not any worse than cURL.

But if you are running a cluster and have specific requirements, then yes, use something that's cluster-aware. Whether that is a custom client on top of SolrJ, Spring Data, or a Cloudera pipeline will depend on your particular use case.

Don't make your life over-complicated in advance.

Regards,
Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On 28 October 2014 17:12, Olivier Austina olivier.aust...@gmail.com wrote:
> Hi All,
>
> I am reading the Solr documentation. I have understood that post.jar (http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29) is not meant for production use and that cURL (https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing) is not recommended. Is SolrJ better for production?
>
> Thank you.
>
> Regards,
> Olivier
Re: Indexing documents/files for production use
Hello Olivier,

for real production use, you won't really want to use any toys like post.jar or curl. You want a decent connector to whatever data source there is, one that fetches data, possibly massages it a bit, and then feeds it into Solr, either by means of SolrJ or directly into the web service of Solr via binary protocols. This way, you can properly handle incremental feeding, processing of data from remote locations (with the connector being closer to the data source), and also source data security.

Also think about what happens if you do processing of incoming documents in Solr. What happens if Tika runs out of memory because of PDF problems? What if this crashes your Solr node? In our Solr projects, we generally do not do any sizable processing within Solr, as document processing, document indexing and querying all have different scaling properties.

Production use is most typically not achieved by deploying a vanilla Solr, but rather by adding a bit more glue and wrapper code, so the whole will fit your requirements in terms of functionality, scaling, monitoring and robustness. Some similar platforms like Elasticsearch try to alleviate these pains of going to a production-style infrastructure, but that's at the expense of flexibility and comes with limitations.

For proof-of-concept or demonstrator-style applications, the plain tools out of the box will be fine. For production applications, you want to have more robust components.

Best regards,
--Jürgen

--
Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center Intelligence
Senior Cloud Consultant
Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
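
The connector idea described in this reply (fetch, massage, then feed into Solr in batches) could look roughly like the sketch below. It is only an illustration under stated assumptions: the update URL, the collection name and the field names are placeholders, and it posts JSON to the update handler rather than using SolrJ or a binary protocol, purely to keep the example short. Requests are built but only sent if a `send` callable is supplied, so retries, incremental state and security can be wrapped around it outside Solr.

```python
import json
from urllib.request import Request

# Assumption: host, port and collection name are placeholders for illustration.
SOLR_UPDATE_URL = "http://localhost:8983/solr/collection1/update"


def transform(record):
    """Massage a raw source record into a flat document (hypothetical field names)."""
    return {
        "id": str(record["id"]),
        "title": record.get("title", "").strip(),
        "body": record.get("body", ""),
    }


def batches(docs, size):
    """Yield lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_update_request(batch, url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying one batch as a JSON array."""
    return Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def feed(records, size=500, send=None):
    """Transform, batch and hand each request to `send`; return the number of docs fed."""
    count = 0
    for batch in batches((transform(r) for r in records), size):
        request = build_update_request(batch)
        if send is not None:
            send(request)  # e.g. urllib.request.urlopen, with error handling around it
        count += len(batch)
    return count
```

Because the connector runs as its own process, heavy steps such as text extraction can fail or be restarted there without taking a Solr node down with them.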
Re: Indexing documents/files for production use
And one other consideration, in addition to the two excellent responses so far:

In a SolrCloud environment, SolrJ via CloudSolrServer will automatically route the documents to the correct shard leader, saving some additional overhead. post.jar and cURL send the docs to a node, which in turn forwards them to the correct shard leader, which lowers throughput.

Best,
Erick
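
To make the routing point concrete, here is a toy model of the difference between sending every document to one node and grouping documents by shard leader on the client. It is not how CloudSolrServer is implemented: the shard map is a hard-coded placeholder (a real client would learn it from the cluster state), and `zlib.crc32` is only a stand-in for Solr's own document router.

```python
import zlib
from collections import defaultdict

# Assumption: a hypothetical three-shard layout, shard name -> leader URL.
SHARD_LEADERS = {
    "shard1": "http://node1:8983/solr/collection1",
    "shard2": "http://node2:8983/solr/collection1",
    "shard3": "http://node3:8983/solr/collection1",
}


def shard_for(doc_id, shards):
    """Pick a shard by hashing the document id (stand-in for the real router)."""
    names = sorted(shards)
    return names[zlib.crc32(doc_id.encode("utf-8")) % len(names)]


def group_by_leader(docs, shards=SHARD_LEADERS):
    """Client-side routing: group documents by the leader URL that owns each id."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[shards[shard_for(doc["id"], shards)]].append(doc)
    return dict(grouped)


def forwarded_docs(docs, entry_shard, shards=SHARD_LEADERS):
    """Count docs needing an extra hop if everything is sent to the node
    hosting the leader of `entry_shard`."""
    return sum(1 for doc in docs if shard_for(doc["id"], shards) != entry_shard)
```

With evenly spread ids and three shards, roughly two thirds of the documents sent to a single node would have to be forwarded once more, whereas the grouped batches each go straight to their leader.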