Re: Replication for SolrCloud

2015-04-19 Thread gengmao
Thanks for the suggestion, Erick. However, what we need here is not a patch
but a clarification from a practical perspective.

I think Solr replication is a great feature for scaling reads, and it also
increases reliability somewhat. However, on HDFS it is not as useful as just
sharding. Sharding can scale both reads and writes at the same time, and it
doesn't carry the consistency concerns that come with replication. So I doubt
that Solr replication on HDFS is really meaningful.

I will try to reach out to Mark Miller and would appreciate it if he or anyone
else can provide more convincing arguments on this.

Thanks,
Mao

On Sat, Apr 18, 2015 at 4:44 PM Erick Erickson erickerick...@gmail.com
wrote:

 AFAIK, the HDFS replication of Solr indexes isn't something that was
 designed; it just came along for the ride with HDFS replication.
 Having a shard with 1 leader and two followers keep 9 copies of the
 index around (3 Solr replicas times an HDFS replication factor of 3)
 _is_ overkill, nobody argues that at all.

 I know the folks at Cloudera (who contributed the original HDFS
 implementation) have discussed various options around this. In the
 grand scheme of things, there have been other priorities without
 tearing into the guts of Solr and/or HDFS since disk space is
 relatively cheap.

 That said, I'm also sure that this will get some attention as
 priorities change. All patches welcome of course ;). But if you're
 inclined to work on this issue, I'd _really_ discuss it with Mark
 Miller etc. before investing too much effort in it. I don't quite
 know the tradeoffs well enough to have an opinion on the right
 implementation.

 Best
 Erick

 On Sat, Apr 18, 2015 at 1:59 AM, Shalin Shekhar Mangar
 shalinman...@gmail.com wrote:
  Some comments inline:
 
  On Sat, Apr 18, 2015 at 2:12 PM, gengmao geng...@gmail.com wrote:
 
  On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) 
  juergen.wag...@devoteam.com wrote:
 
   Replication on the storage layer will provide reliable storage for the
   index and other data of Solr. However, this replication does not
   guarantee that your index files are consistent at any given time, as
   there may be intermediate states that are only partially replicated.
   Replication is only a convergent process, not an instant, atomic
   operation. With frequent changes, this becomes an issue.
  
  Firstly, thanks for your reply. However, I can't agree with you on this.
  HDFS guarantees consistency even with replicas - you always read what
  you write, and no partially replicated state will ever be read, which is
  guaranteed by the HDFS server and client. Hence HBase can rely on HDFS
  for consistency and availability without implementing another replication
  mechanism - if I understand correctly.
 
 
  A Lucene index is not one file but a collection of files which are written
  independently. So if you replicate them out of order, Lucene might consider
  the index corrupted (because of missing files). I don't think HBase
  works that way.
 
 
 
   Replication inside SolrCloud as an application will not only maintain
   the consistency of the search-level interfaces to your indexes, but
   also scale in the sense of the application (query throughput).

  Splitting one shard into two can increase query throughput too.
 
 
   Imagine a database: if you change one record, this may also result in
   an index change. If the record and the index are stored in different
   storage blocks, one will get replicated first. However, the replication
   target will only be consistent again when both have been replicated.
   So, you would have to suspend all accesses until the entire replication
   has completed. That's undesirable. If you replicate on the application
   (database management system) level, the application will employ a more
   fine-grained approach to replication, guaranteeing application
   consistency.
  
  In HBase, a region is located on a single region server at any time, which
  guarantees its consistency. Because a read or write always lands in exactly
  one region, there is no concern about parallel writes happening on multiple
  replicas of the same region.
  The replication of HDFS is totally transparent to HBase. When an HDFS write
  call returns, HBase knows the data is written and replicated, so losing one
  copy of the data won't impact HBase at all.
  So HDFS means consistency and reliability for HBase. However, HBase doesn't
  use replicas (either HBase's own or HDFS's) to scale reads. If one region
  is too hot for reads or writes, you split that region into two regions, so
  that its reads and writes can be distributed across two region servers.
  Hence HBase scales.
  I think this is the simplicity and beauty of HBase. Again, I am curious
  whether SolrCloud has a better reason to use replication on HDFS. As I
  described, HDFS provides consistency and reliability, while scalability
  can be achieved via sharding, even without Solr replication.
 
 
  That's something that has been considered and may even be in the roadmap
  for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237

Re: JSON Facet Analytics API in Solr 5.1

2015-04-19 Thread Lukáš Vlček
Oh... and btw, I think the readability of the JSON will be less and less
important going forward. Queries will grow in size anyway (due to nested
facets), and the ability to quickly validate a query using some parser
will be more useful and practical than relying on the human eye to do the
check instead.

I assume that both ES and Solr will end up having some higher-level
language for people to express queries and facets/aggregations in readable
form (anyone remember SQL?), and this will be transformed into JSON (or
another native format) down the road. In my opinion, the most important
thing for any non-trivial JSON-based language format now is to make sure it
is parser-friendly and that grammars can be defined easily for it.
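
To illustrate the data-as-key point with a sketch (the JSON below is
illustrative only, not the actual syntax of either engine): compare

    {"top_genre": {"type": "terms", "field": "genre"}}

where the user-chosen name "top_genre" floats as an object key, with

    {"name": "top_genre", "type": "terms", "field": "genre"}

where every key comes from a fixed vocabulary, so a JSON Schema or a
grammar can enumerate the allowed keys and validate the query mechanically.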

On Sun, Apr 19, 2015 at 8:09 AM, Lukáš Vlček lukas.vl...@gmail.com wrote:

 Late here but let me add one more thing: IIRC the recommendation for JSON
 is to never use data as keys in objects. One of the benefits of not using
 data as keys in JSON is easier validation using JSON Schema. If one wants
 to validate a JSON query for Elasticsearch today, it is necessary to
 implement a custom parser (and a grammar first, of course).

 Lukas

 On Sat, Apr 18, 2015 at 11:46 PM, Yonik Seeley ysee...@gmail.com wrote:

 Another minor benefit of the flatter structure is that the smart
 merging of multiple JSON parameters works a little better in
 conjunction with facets.

 For example, if you already had a top_genre facet, you could insert
 a top_author facet more easily:

 json.facet.top_genre.facet.top_author={type:terms, field:author, limit:5}

 (For anyone who doesn't know what smart merging is, see
 http://yonik.com/solr-json-request-api/ )

 -Yonik
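
As a sketch of how that composes over HTTP (the endpoint and the genre/author
field names are assumed for illustration, following the blog post above): the
two facets can be sent as separate parameters, and Solr merges them into a
single request:

    curl http://localhost:8983/solr/techproducts/query -d 'q=*:*' \
      --data-urlencode 'json.facet={top_genre:{type:terms, field:genre, limit:5}}' \
      --data-urlencode 'json.facet.top_genre.facet.top_author={type:terms, field:author, limit:5}'

The second parameter splices the top_author sub-facet into the existing
top_genre facet rather than replacing it.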


 On Sat, Apr 18, 2015 at 11:36 AM, Yonik Seeley ysee...@gmail.com wrote:
  Thank you everyone for the feedback!
 
  I've implemented and committed the flatter structure:
  https://issues.apache.org/jira/browse/SOLR-7422
  So either form can now be used (and I'll be switching to the flatter
  method for examples when it actually reduces the levels).
 
  For those who want to try it out, I just made a 5.2-dev snapshot:
  https://github.com/yonik/lucene-solr/releases
 
  -Yonik





Re: Replication for SolrCloud

2015-04-19 Thread juergen.wag...@devoteam.com
In simple terms:

HDFS is good for file-oriented replication. Solr is good for index replication.

Consequently, if the atomic update operations of an application (like Solr)
are not atomic at the file level, HDFS is not adequate - as is the case for
Solr with live index updates. Running Solr on HDFS (as a file system) will
pose limitations due to HDFS properties. Indexing, however, still won't use
Hadoop.

If you produce indexes and distribute them as finalized, read-only structures 
(e.g., through Hadoop jobs), HDFS is fine. Solr does not need to be much aware 
of HDFS.

The third option in the picture is record-based replication, handled by
HBase, Cassandra or ZooKeeper, depending on requirements.

Cheers,
Jürgen

Re: JSON Facet Analytics API in Solr 5.1

2015-04-19 Thread Lukáš Vlček
Late here but let me add one more thing: IIRC the recommendation for JSON
is to never use data as keys in objects. One of the benefits of not using
data as keys in JSON is easier validation using JSON Schema. If one wants
to validate a JSON query for Elasticsearch today, it is necessary to
implement a custom parser (and a grammar first, of course).

Lukas

On Sat, Apr 18, 2015 at 11:46 PM, Yonik Seeley ysee...@gmail.com wrote:

 Another minor benefit of the flatter structure is that the smart
 merging of multiple JSON parameters works a little better in
 conjunction with facets.

 For example, if you already had a top_genre facet, you could insert
 a top_author facet more easily:

 json.facet.top_genre.facet.top_author={type:terms, field:author, limit:5}

 (For anyone who doesn't know what smart merging is, see
 http://yonik.com/solr-json-request-api/ )

 -Yonik


 On Sat, Apr 18, 2015 at 11:36 AM, Yonik Seeley ysee...@gmail.com wrote:
  Thank you everyone for the feedback!
 
  I've implemented and committed the flatter structure:
  https://issues.apache.org/jira/browse/SOLR-7422
  So either form can now be used (and I'll be switching to the flatter
  method for examples when it actually reduces the levels).
 
  For those who want to try it out, I just made a 5.2-dev snapshot:
  https://github.com/yonik/lucene-solr/releases
 
  -Yonik



Re: Replication for SolrCloud

2015-04-19 Thread gengmao
Please see my responses inline:

On Fri, Apr 17, 2015 at 10:59 PM Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Some comments inline:

 On Sat, Apr 18, 2015 at 2:12 PM, gengmao geng...@gmail.com wrote:

  On Sat, Apr 18, 2015 at 12:20 AM Jürgen Wagner (DVT) 
  juergen.wag...@devoteam.com wrote:
 
   Replication on the storage layer will provide reliable storage for the
   index and other data of Solr. However, this replication does not
   guarantee that your index files are consistent at any given time, as
   there may be intermediate states that are only partially replicated.
   Replication is only a convergent process, not an instant, atomic
   operation. With frequent changes, this becomes an issue.
  
  Firstly, thanks for your reply. However, I can't agree with you on this.
  HDFS guarantees consistency even with replicas - you always read what
  you write, and no partially replicated state will ever be read, which is
  guaranteed by the HDFS server and client. Hence HBase can rely on HDFS
  for consistency and availability without implementing another replication
  mechanism - if I understand correctly.
 
 
 A Lucene index is not one file but a collection of files which are written
 independently. So if you replicate them out of order, Lucene might consider
 the index corrupted (because of missing files). I don't think HBase
 works that way.
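
To make that concrete, a typical Lucene index directory looks something like
this (file names illustrative; the exact set varies by codec and segment
count):

    $ ls data/index
    _0.cfe  _0.cfs  _0.si  _1.fdt  _1.fdx  _1.si  segments_2  write.lock

The segments_N file references the per-segment files, so copying files
independently and out of order can produce a segments_N that points at files
which haven't arrived yet.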

Again, HDFS replication is transparent to HBase. You can set the HDFS
replication factor to 1 and HBase will still work, but it will lose the
fault tolerance against disk failures that HDFS replicas provide. Also,
HBase doesn't directly utilize HDFS replicas: increasing the HDFS
replication factor won't improve HBase's scalability. To achieve better
read/write throughput, splitting shards is the only approach.
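
For what it's worth, SolrCloud does have a direct analogue of that region
split: the Collections API SPLITSHARD action. A sketch, with collection and
shard names hypothetical:

    curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1'

This divides the shard's hash range in two and creates two sub-shards,
distributing the original shard's reads and writes.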



 
   Replication inside SolrCloud as an application will not only maintain
   the consistency of the search-level interfaces to your indexes, but
   also scale in the sense of the application (query throughput).

  Splitting one shard into two can increase query throughput too.
 
 
   Imagine a database: if you change one record, this may also result in
   an index change. If the record and the index are stored in different
   storage blocks, one will get replicated first. However, the replication
   target will only be consistent again when both have been replicated.
   So, you would have to suspend all accesses until the entire replication
   has completed. That's undesirable. If you replicate on the application
   (database management system) level, the application will employ a more
   fine-grained approach to replication, guaranteeing application
   consistency.
  
  In HBase, a region is located on a single region server at any time, which
  guarantees its consistency. Because a read or write always lands in exactly
  one region, there is no concern about parallel writes happening on multiple
  replicas of the same region.
  The replication of HDFS is totally transparent to HBase. When an HDFS write
  call returns, HBase knows the data is written and replicated, so losing one
  copy of the data won't impact HBase at all.
  So HDFS means consistency and reliability for HBase. However, HBase doesn't
  use replicas (either HBase's own or HDFS's) to scale reads. If one region
  is too hot for reads or writes, you split that region into two regions, so
  that its reads and writes can be distributed across two region servers.
  Hence HBase scales.
  I think this is the simplicity and beauty of HBase. Again, I am curious
  whether SolrCloud has a better reason to use replication on HDFS. As I
  described, HDFS provides consistency and reliability, while scalability
  can be achieved via sharding, even without Solr replication.
 
 
 That's something that has been considered and may even be in the roadmap
 for the Cloudera guys. See https://issues.apache.org/jira/browse/SOLR-6237

 But one problem that isn't solved by HDFS replication is near-real-time
 indexing, where you want documents to be available to searchers as fast
 as possible. SolrCloud replication supports that by replicating documents
 as they come in and indexing them in several replicas. A new index searcher
 is opened on the flushed index files as well as on the internal data
 structures of the index writer. If we switched to relying on HDFS
 replication, this would be awfully expensive. However, as Jürgen mentioned,
 HDFS can certainly help with replicating static indexes.

My understanding is that near-real-time indexing does not need to rely on
replication.
https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
just describes soft commit and doesn't mention replication. Also, Cloudera
Search, which is Solr on HDFS, claims near-real-time indexing but doesn't
mention replication either. Quote from
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-search/v1-latest/Cloudera-Search-User-Guide/csug_introducing.html
:
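
For reference, the soft-commit behavior that page describes is configured
independently of replication - for example via the Config API in Solr 5+
(the collection name and interval below are just examples):

    curl http://localhost:8983/solr/gettingstarted/config \
      -H 'Content-type:application/json' \
      -d '{"set-property": {"updateHandler.autoSoftCommit.maxTime": 1000}}'

This makes each node open a new searcher at most every second, with no
cross-replica coordination involved.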

Re: Addition to solr wiki editor list

2015-04-19 Thread Erick Erickson
Done and thanks!

The Reference Guide
(https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide)
is another place to look; it's getting considerable attention at this
point. It's curated a bit more than the wiki, however, so if you see
anything wrong there, please leave a comment on the page in question.
If you haven't downloaded the PDF of the ref guide, I highly recommend it!

On Sun, Apr 19, 2015 at 3:06 PM, Mirko Cegledi mrkc...@gmail.com wrote:
 Hi there!

 I'd like to be added to the list of people who are able to edit the Solr
 wiki at https://wiki.apache.org/solr. I'm working as a Java developer for a
 German company that uses Solr a lot (and I like it a lot), and I would like
 to be able to correct things as soon as I find them, without going to the
 IRC channel to get things changed.

 My wiki name should be campfire.

 Thanks in advance


Addition to solr wiki editor list

2015-04-19 Thread Mirko Cegledi
Hi there!

I'd like to be added to the list of people who are able to edit the Solr
wiki at https://wiki.apache.org/solr. I'm working as a Java developer for a
German company that uses Solr a lot (and I like it a lot), and I would like
to be able to correct things as soon as I find them, without going to the
IRC channel to get things changed.

My wiki name should be campfire.

Thanks in advance


Re: help with schema containing nested documents

2015-04-19 Thread Alexandre Rafalovitch
There are no nested schemas as such. There is only a superset schema that
includes all the fields for parents and children. Obviously, the
fields that are not common should be optional.

The rest depends on what parent/child relation you are trying to
set up - whether it is explicit, with block indexing, or looser, with
some other kind of cross-referencing.
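
A minimal sketch of the block-indexing variant (collection name, field names,
and ids here are hypothetical; the Solr-specific pieces are _childDocuments_
and the {!parent} block-join query parser):

    # Index a parent with nested children in one block
    curl 'http://localhost:8983/solr/mycoll/update?commit=true' \
      -H 'Content-Type: application/json' -d '[
      {"id": "book1", "content_type": "parent", "title": "Some Book",
       "_childDocuments_": [
         {"id": "book1-ch1", "content_type": "chapter", "chapter_title": "Intro"}
       ]}
    ]'

    # Block-join query: return parents whose children match
    curl 'http://localhost:8983/solr/mycoll/select' \
      --data-urlencode 'q={!parent which="content_type:parent"}chapter_title:Intro'

Note how the superset-schema point applies: id and content_type are shared,
while title and chapter_title must be optional fields.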

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 18 April 2015 at 06:47, Nicolae Pandrea npand...@expedia.com wrote:
 Hi,

 I need some documentation/samples on how to create a Solr schema with nested
 documents.
 I have been looking online but could not find anything.

 Thank you in advance,
 Nick Pandrea



Re: MoreLikeThis (mlt) in sharded SolrCloud

2015-04-19 Thread Ere Maijala
Thanks, Anshum. Looks like there's no way for this to work in 5.1 for us,
so I'll just have to wait for the fixes. It's a relief to know it wasn't
just me, though.


--Ere

On 18.4.2015 at 2.45, Anshum Gupta wrote:

The other issue that would fix half of your problems is:
https://issues.apache.org/jira/browse/SOLR-7143

On Fri, Apr 17, 2015 at 4:35 PM, Anshum Gupta ans...@anshumgupta.net
wrote:


Ah, I meant SOLR-7418: https://issues.apache.org/jira/browse/SOLR-7418.

On Fri, Apr 17, 2015 at 4:30 PM, Anshum Gupta ans...@anshumgupta.net
wrote:


Hi Ere,

Those seem like valid issues. I've created an issue, SOLR-7275
(https://issues.apache.org/jira/browse/SOLR-7275), and will create more
as I find more of those.
I plan to get to them and fix over the weekend.

On Wed, Apr 15, 2015 at 5:13 AM, Ere Maijala ere.maij...@helsinki.fi
wrote:


Hi,

I'm trying to gather information on how mlt works or is supposed to work
with SolrCloud and a sharded collection. I've read issues SOLR-6248,
SOLR-5480 and SOLR-4414, and docs at 
https://wiki.apache.org/solr/MoreLikeThis, but I'm still struggling
with multiple issues. I've been testing with Solr 5.1 and the Getting
Started sample cloud. So, with a freshly extracted Solr, these are the
steps I've done:

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted docs/
bin/post -c gettingstarted example/exampledocs/books.json

After this I've tried different variations of queries with limited
success:

http://localhost:8983/solr/gettingstarted/select?q={!mlt}non-existing
causes java.lang.NullPointerException at
org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:80)

http://localhost:8983/solr/gettingstarted/select?q={!mlt}978-0641723445
causes java.lang.NullPointerException at
org.apache.solr.search.mlt.CloudMLTQParser.parse(CloudMLTQParser.java:84)

http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=title}978-0641723445
causes java.lang.NullPointerException at
org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)

http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=cat}978-0641723445
actually gives results

http://localhost:8983/solr/gettingstarted/select?q={!mlt%20qf=author,cat}978-0641723445
again causes java.lang.NullPointerException at
org.apache.lucene.queries.mlt.MoreLikeThis.retrieveTerms(MoreLikeThis.java:759)


I guess the actual question is: how am I supposed to use the handler to
replicate the behavior of non-distributed mlt that was formerly used with
qt=morelikethis and the following configuration in solrconfig.xml:

   <requestHandler name="morelikethis" class="solr.MoreLikeThisHandler">
     <lst name="defaults">
       <str name="mlt.fl">title,title_short,callnumber-label,topic,language,author,publishDate</str>
       <str name="mlt.qf">
         title^75
         title_short^100
         callnumber-label^400
         topic^300
         language^30
         author^75
         publishDate
       </str>
       <int name="mlt.mintf">1</int>
       <int name="mlt.mindf">1</int>
       <str name="mlt.boost">true</str>
       <int name="mlt.count">5</int>
       <int name="rows">5</int>
     </lst>
   </requestHandler>
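
(For comparison, a request against a handler configured like that would have
looked something like the following - the core name and document id are
hypothetical:

    curl 'http://localhost:8983/solr/biblio/select?qt=morelikethis&q=id:12345'

i.e., the handler finds the source document by query and applies the mlt.*
defaults from the configuration above.)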

Real-life full schema and config can be found at
https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf .


--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland





--
Anshum Gupta





--
Anshum Gupta








--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: Solr Cloud reclaiming disk space from deleted documents

2015-04-19 Thread Gili Nachum
I assume you don't have much free space available on your disk. Note that
during optimization (a merge into a single segment) your shard replica's
space usage may peak at 2x-3x of its normal size until the optimization
completes. Is that a problem? Not if optimization occurs over shards
serially and your index is broken into many small shards.
On Apr 18, 2015 1:54 AM, Rishi Easwaran rishi.easwa...@aol.com wrote:

 Thanks Shawn for the quick reply.
 Our indexes are running on SSD, so 3 should be OK.
 Any recommendation on bumping it up?

 I guess we will have to run optimize for the entire Solr cloud and see if
 we can reclaim space.

 Thanks,
 Rishi.

 -Original Message-
 From: Shawn Heisey apa...@elyograg.org
 To: solr-user solr-user@lucene.apache.org
 Sent: Fri, Apr 17, 2015 6:22 pm
 Subject: Re: Solr Cloud reclaiming disk space from deleted documents


 On 4/17/2015 2:15 PM, Rishi Easwaran wrote:
  Running into an issue and wanted to see if anyone had some suggestions.
  We are seeing this with both Solr 4.6 and 4.10.3 code.
  We are running an extremely update-heavy application, with millions of
  writes and deletes happening to our indexes constantly. An issue we are
  seeing is that Solr Cloud is not reclaiming the disk space that could be
  used for new inserts by cleaning up deletes.

  We used to run optimize periodically with our old multicore setup; not
  sure if that works for Solr Cloud.

  Num Docs: 28762340
  Max Doc: 48079586
  Deleted Docs: 19317246

  Version: 1429299216227
  Gen: 16525463
  Size: 109.92 GB

  In our solrconfig.xml we use the following configs:
 
  <indexConfig>
    <!-- Values here affect all index writers and act as a default unless overridden. -->
    <useCompoundFile>false</useCompoundFile>
    <maxBufferedDocs>1000</maxBufferedDocs>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <mergeFactor>10</mergeFactor>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy"/>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxThreadCount">3</int>
      <int name="maxMergeCount">15</int>
    </mergeScheduler>
    <ramBufferSizeMB>64</ramBufferSizeMB>
  </indexConfig>

 This part of my response won't help the issue you wrote about, but it
 can affect performance, so I'm going to mention it. If your indexes are
 stored on regular spinning disks, reduce mergeScheduler/maxThreadCount
 to 1. If they are stored on SSD, then a value of 3 is OK. Spinning
 disks cannot do seeks (read/write head moves) fast enough to handle
 multiple merging threads properly. All the seek activity required will
 really slow down merging, which is a very bad thing when your indexing
 load is high. SSD disks do not have to seek, so multiple threads are OK
 there.

 An optimize is the only way to reclaim all of the disk space held by
 deleted documents. Over time, as segments are merged automatically,
 deleted-doc space will be automatically recovered, but it won't be
 perfect, especially as segments are merged multiple times into very
 large segments.

 If you send an optimize command to a core/collection in SolrCloud, the
 entire collection will be optimized ... the cloud will do one shard
 replica (core) at a time until the entire collection has been optimized.
 There is no way (currently) to ask it to only optimize a single core, or
 to do multiple cores simultaneously, even if they are on different
 servers.
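
For reference, an optimize is triggered through the update handler - a
sketch, with the collection name hypothetical:

    curl 'http://localhost:8983/solr/mycollection/update?optimize=true&maxSegments=1'

As described above, issuing this against a SolrCloud collection optimizes
the whole collection, one shard replica at a time.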

 Thanks,
 Shawn