Re: Indexing documents/files for production use

2014-10-30 Thread Olivier Austina
Thank you, Alexandre, Jürgen and Erick, for your replies. It is clear to me now.

Regards
Olivier





Indexing documents/files for production use

2014-10-28 Thread Olivier Austina
Hi All,

I am reading the Solr documentation. I understand that post.jar
http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29
is not meant for production use, and that cURL
https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing
is not recommended. Is SolrJ better for production? Thank you.
Regards
Olivier


Re: Indexing documents/files for production use

2014-10-28 Thread Alexandre Rafalovitch
What is your production use? You have to answer that for yourself.

post.jar makes a couple of things easy. If your production use fits
into those (e.g. no cluster) - great, use it. It is certainly not any
worse than cURL.

But if you are running a cluster and have specific requirements, then
yes, use something that's cluster-aware. Whether it is a custom client
on top of SolrJ, Spring Data, or a Cloudera pipeline will depend on your
particular use case. Don't make your life over-complicated in advance.

Regards,
   Alex.


Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853




Re: Indexing documents/files for production use

2014-10-28 Thread Jürgen Wagner (DVT)
Hello Olivier,
  for real production use, you won't really want to use toys like
post.jar or curl. You want a decent connector to whatever data source
there is, one that fetches data, possibly massages it a bit, and then
feeds it into Solr, either by means of SolrJ or directly into the web
service of Solr via binary protocols. This way, you can properly handle
incremental feeding, processing of data from remote locations (with the
connector being closer to the data source), and also source data
security. Also think about what happens if you process incoming
documents inside Solr. What happens if Tika runs out of memory because
of PDF problems? What if this crashes your Solr node? In our Solr
projects, we generally do not do any sizable processing within Solr, as
document processing, document indexing and querying all have different
scaling properties.

Production use is typically not achieved by deploying a vanilla Solr,
but rather by adding a bit more glue and wrapping, so that the whole
fits your requirements in terms of functionality, scaling, monitoring
and robustness. Some similar platforms like Elasticsearch try to
alleviate these pains of going to a production-style infrastructure, but
that comes at the expense of flexibility and with limitations.

For proof-of-concept or demonstrator-style applications, the plain tools
out of the box will be fine. For production applications, you want to
have more robust components.

Best regards,
--Jürgen
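
The connector idea described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not code from the
thread: the class ConnectorSketch, the method feed, and the generic
"extract" and "sink" parameters are hypothetical names. In a real setup
the sink would hand each batch to Solr (for example through SolrJ), and
the extraction step would more likely run in its own process, since an
out-of-memory condition cannot be reliably caught in-process. The sketch
only models a recoverable extraction failure that stays in the connector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical connector skeleton: all names here are illustrative.
final class ConnectorSketch {

    // Extracts each source item outside the indexer, skips items whose
    // extraction fails, and passes the rest to the sink in batches.
    // Returns the list of source items that could not be extracted.
    static <S, D> List<S> feed(Iterable<S> sources,
                               Function<S, D> extract,
                               Consumer<List<D>> sink,
                               int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<S> failed = new ArrayList<>();
        List<D> batch = new ArrayList<>();
        for (S source : sources) {
            try {
                batch.add(extract.apply(source));
            } catch (RuntimeException e) {
                // The failure is recorded here and never reaches the indexer.
                failed.add(source);
                continue;
            }
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // flush the final partial batch
        }
        return failed;
    }
}
```

The returned list of failed items is what makes incremental feeding
practical: the connector can retry or report them without the search
node ever having seen the problematic input.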



-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center Intelligence
 Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




Re: Indexing documents/files for production use

2014-10-28 Thread Erick Erickson
And one other consideration in addition to the two excellent responses
so far:

In a SolrCloud environment, SolrJ via CloudSolrServer will automatically
route the documents to the correct shard leader, saving some additional
overhead. Post.jar and cURL send the docs to a node, which in turn
forwards them to the correct shard leader, and that extra hop lowers
throughput.

Best,
Erick
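
The routing difference described above can be illustrated with a toy
model that counts network hops. This is not SolrJ code and not how Solr
actually assigns documents to shards: the class RoutingSketch, the
simple hash-modulo rule in leaderFor, and the assumption that the leader
of shard i lives on node i are all stand-ins chosen for the example.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: names and the routing rule are assumptions.
final class RoutingSketch {

    // Stand-in for the real router: picks a shard by hash modulo.
    static int leaderFor(String docId, int shardCount) {
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    // Cluster-aware client: it knows each leader, so one hop per document.
    static int hopsWhenRoutedByClient(List<String> docIds) {
        return docIds.size();
    }

    // Cluster-unaware client: every document goes to one entry node, which
    // forwards it whenever the leader lives on a different node.
    static int hopsViaEntryNode(List<String> docIds, int shardCount, int entryNode) {
        int hops = 0;
        for (String id : docIds) {
            hops += 1; // client -> entry node
            if (leaderFor(id, shardCount) != entryNode) {
                hops += 1; // entry node -> shard leader
            }
        }
        return hops;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b");
        System.out.println("routed by client: " + hopsWhenRoutedByClient(ids));
        System.out.println("via entry node 0: " + hopsViaEntryNode(ids, 2, 0));
    }
}
```

In this model the forwarding path never needs fewer hops than the
direct one, and the gap grows with the share of documents whose leader
is not the entry node.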
