Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj

2013-07-05 Thread Walter Underwood
Why is it better to require another large software system (Hadoop), when it 
works fine without it?

That just sounds like more stuff to configure, misconfigure, and cause problems 
with indexing.

wunder

On Jul 5, 2013, at 4:48 AM, Furkan KAMACI wrote:

 We are using Nutch to crawl web sites, and it stores documents in HBase.
 Nutch uses Solrj to send documents to be indexed. We have Hadoop in our
 ecosystem as well. I think there should be an implementation in Solrj
 that sends documents (via CloudSolrServer or something like that) as
 MapReduce jobs. Is there any implementation for it, or is it not a good idea?





Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj

2013-07-05 Thread Jack Krupansky
Software developers are sometimes compensated based on the degree of 
complexity that they deal with.


And managers are sometimes compensated based on the number of people they 
manage, as well as the degree of complexity of what they manage.


And... training organizations can charge more and have a larger pool of 
eager customers when the subject matter has higher complexity.


And... consultants and contractors will be in higher demand and able to 
charge more, based on the degree of complexity that they have mastered.


So, more complexity results in greater opportunity for higher income!

(Oh, and, writers and book authors have more to write about and readers are 
more eager to purchase those writings as well, especially if the subject 
matter is constantly changing.)


Somebody please remind me I said this any time you catch me trying to argue 
for Solr to be made simpler and easier to use!


-- Jack Krupansky






Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj

2013-07-05 Thread Roman Chyla
I don't want to sound negative, but I think it is a valid question to
consider; a lack of information and a certain mental rigidity may make it
sound worse than it is. First of all, this is probably not for a few
gigabytes of data, and I can imagine that building indexes on the side
where the data lives is much faster/cheaper than sending the data to Solr.
If we think of the index as the product of the map phase, then the
'reduce' part may be this:
http://wiki.apache.org/solr/MergingSolrIndexes

I don't really know enough about CloudSolrServer and how to fit the cloud
in there.

roman
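If the reduce step produces per-partition Lucene indexes on disk, the MergingSolrIndexes page linked above describes CoreAdmin's mergeindexes action for folding them into a live core. As a rough sketch of the request shape (the host, core name, and index paths below are made up for illustration):

```python
from urllib.parse import urlencode

def merge_indexes_url(solr_base, target_core, index_dirs):
    """Build a CoreAdmin 'mergeindexes' request URL.

    The CoreAdmin API merges one or more on-disk Lucene indexes into an
    existing core; each source index directory is passed as a repeated
    indexDir parameter.
    """
    params = [("action", "mergeindexes"), ("core", target_core)]
    params += [("indexDir", d) for d in index_dirs]
    return f"{solr_base}/admin/cores?{urlencode(params)}"

# Hypothetical: two reducer output directories merged into a core "merged".
url = merge_indexes_url(
    "http://localhost:8983/solr",
    "merged",
    ["/data/part-0/index", "/data/part-1/index"],
)
```

Issuing that URL with an HTTP GET against a running Solr (followed by a commit on the target core) performs the merge; whether this beats streaming documents over SolrJ depends on index size and network cost.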



Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj

2013-07-05 Thread Furkan KAMACI
OK, I know that it is really unnecessary to start with a complex design. On
the other hand, if your resources and needs are adequate and you have a
bottleneck in your design, it is a real failure not to plan a new one.

We have terabytes of data, and we have dedicated developers on the Hadoop
and HBase side. There are many machines in our network architecture
(currently we are running tests and making improvements). The data is
stored in a distributed fashion and indexed in SolrCloud, which is also
distributed.

However, there is a bottleneck in this architecture: taking data from
HBase and sending it to SolrCloud is not as fast as the other parts of the
system. If we don't resolve that problem and keep the current
architecture, I think that will be a design fault.

That's why I asked this question, and it seems reasonable to me. I know
that some people store Lucene indexes in HBase, and that is the correct
design for them. Sending data through Solrj as MapReduce jobs may be
another good fit for our needs, and I thought there may be some people in
the community who have tried it or at least thought about it. Thanks
for the answers.
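For a bottleneck like this, one common first step that needs no MapReduce at all is to batch documents instead of issuing one add per document: SolrJ's add() accepts a whole collection, so each HTTP round-trip carries many documents. A minimal, language-agnostic sketch of the batching itself (shown in Python for brevity; the batch size of 500 is an arbitrary starting point to tune):

```python
def batched(docs, batch_size=500):
    """Group an iterable of documents into lists of at most batch_size,
    so each Solr update request carries many documents instead of one."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # trailing partial batch

# e.g. 1200 documents pulled from HBase, sent in three requests
sizes = [len(b) for b in batched(range(1200), 500)]
```

Each yielded batch would then go out in a single SolrJ add() call; combining batching with several parallel sender threads is often enough to move the bottleneck elsewhere.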




Re: Sending Documents via SolrServer as MapReduce Jobs at Solrj

2013-07-05 Thread Otis Gospodnetic
Furkan,

It's perfectly fine.  Some people have small indices and lots of
queries, some have large indices and very few queries, and lucky ones
have very large indices and lots of queries at the same time.

We once helped a client take their indexing down from many hours to a
couple of minutes by using Hadoop MapReduce.  It made experimentation
with indexing, relevance, etc. easy, while before that it was nearly
impossible.

What you are looking to do is perfectly fine. Saw your comment in
JIRA - go for it, but I suggest you look at what's been done before
(beyond Solr!) and learn a bit first, get some ideas, and then
implement. For example, I'd look at
https://github.com/elasticsearch/elasticsearch-hadoop, learn from
any good ideas you see there (or from what look like bad decisions, if
you see any!), and then implement this for Solr(Cloud).

Otis
--
Solr &amp; ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
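For what it's worth, a large part of what CloudSolrServer buys you is client-side routing: each document is hashed on its uniqueKey and sent straight to its shard leader, skipping a forwarding hop. SolrCloud's real compositeId router uses MurmurHash3 over hash ranges; the toy sketch below only illustrates the idea with a deterministic stdlib hash, and all names in it are made up:

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document id.

    (SolrCloud actually uses MurmurHash3 over per-shard hash ranges;
    md5 here is just a deterministic stand-in for illustration.)
    """
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % num_shards

def route(docs, num_shards):
    """Bucket documents per shard so each bucket can be sent
    directly to that shard's leader in one batched update."""
    buckets = {i: [] for i in range(num_shards)}
    for doc in docs:
        buckets[shard_for(doc["id"], num_shards)].append(doc)
    return buckets

docs = [{"id": f"doc{i}"} for i in range(100)]
buckets = route(docs, 4)
```

A MapReduce-driven indexer could do the same partitioning in its map phase, so that each reducer talks to exactly one shard.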


