Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Shawn Heisey
On 12/16/2016 11:58 AM, Chetas Joshi wrote:
> How different the index data caching mechanism is for the Streaming
> API from the cursor approach?

Solr and Lucene do not handle that caching.  Systems external to Solr
(like the OS, or HDFS) handle the caching.  The cache effectiveness will
be a combination of the cache size, overall data size, and the data
access patterns of the application.  I do not know enough to tell you
how the cursorMark feature and the streaming API work when they access
the index data.  I would imagine them to be pretty similar, but cannot
be sure about that.

Thanks,
Shawn



Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Chetas Joshi
Thank you everyone. I would add nodes to the SolrCloud and split the shards.

Shawn,

Thank you for explaining why putting index data on local file system could
be a better idea than using HDFS. I need to find out how HDFS caches the
index files in a resource constrained environment.

I would also like to add that when I try the Streaming API instead of using
the cursor approach, it starts running into JSON parsing exceptions when my
nodes (running Solr shards) don't have enough RAM to fit the entire index
into memory. FYI: I have other services (Yarn, Spark) deployed on the same
boxes as well. Spark jobs also use a lot of disk cache.
When I have enough RAM (more than 70 GB so that all the index data could
fit in memory), the streaming API succeeds without running into any
exceptions. How different the index data caching mechanism is for the
Streaming API from the cursor approach?

Thanks!



On Fri, Dec 16, 2016 at 6:52 AM, Shawn Heisey  wrote:

> On 12/14/2016 11:58 AM, Chetas Joshi wrote:
> > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> > the following config.
> > maxShardsperNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
>
> Query times will increase as the index size increases, but significant
> jumps in the query time may be an indication of a performance problem.
> Performance problems are usually caused by insufficient resources,
> memory in particular.
>
> With HDFS, I am honestly not sure *where* the cache memory is needed.  I
> would assume that it's needed on the HDFS hosts, that a lot of spare
> memory on the Solr (HDFS client) hosts probably won't make much
> difference.  I could be wrong -- I have no idea what kind of caching
> HDFS does.  If the HDFS client can cache data, then you probably would
> want extra memory on the Solr machines.
>
> > I am using cursor approach (/export handler) using SolrJ client to get
> back
> > results from Solr. All the fields I am querying on and all the fields
> that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
>
> If actual disk access is required to satisfy a query, Solr is going to
> be slow.  Caching is absolutely required for good performance.  If your
> query times are really long but used to be short, chances are that your
> index size has exceeded your system's ability to cache it effectively.
>
> One thing to keep in mind:  Gigabit Ethernet is comparable in speed to
> the sustained transfer rate of a single modern SATA magnetic disk, so if
> the data has to traverse a gigabit network, it probably will be nearly
> as slow as it would be if it were coming from a single disk.  Having a
> 10gig network for your storage is probably a good idea ... but current
> fast memory chips can leave 10gig in the dust, so if the data can come
> from cache and the chips are new enough, then it can be faster than
> network storage.
>
> Because the network can be a potential bottleneck, I strongly recommend
> putting index data on local disks.  If you have enough memory, the disk
> doesn't even need to be super-fast.
>
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
>
> Caching the files on the disk is not handled by Solr, so Solr won't wait
> for the entire index to be cached unless the underlying storage waits
> for some reason.  The caching is usually handled by the OS.  For HDFS,
> it might be handled by a combination of the OS and Hadoop, but I don't
> know enough about HDFS to comment.  Solr makes a request for the parts
> of the index files that it needs to satisfy the request.  If the
> underlying system is capable of caching the data, if that feature is
> enabled, and if there's memory available for that purpose, then it gets
> cached.
>
> Thanks,
> Shawn
>
>


Re: Solr on HDFS: increase in query time with increase in data

2016-12-16 Thread Piyush Kunal
I think 70GB is too huge for a shard.
How much memory does the system is having?
Incase solr does not have sufficient memory to load the indexes, it will
use only the amount of memory defined in your Solr Caches.

Although you are on HFDS, solr performances will be really bad if it has do
disk IO at the query time.

The best option for you is to shard it into atleast 8-10 nodes and create
appropriate replicas according to your read traffic.

Regards,
Piyush

On Fri, Dec 16, 2016 at 12:15 PM, Reth RM  wrote:

> I think the shard index size is huge and should be split.
>
> On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi 
> wrote:
>
> > Hi everyone,
> >
> > I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> > the following config.
> > maxShardsperNode: 1
> > replicationFactor: 1
> >
> > I have been ingesting data into Solr for the last 3 months. With increase
> > in data, I am observing increase in the query time. Currently the size of
> > my indices is 70 GB per shard (i.e. per node).
> >
> > I am using cursor approach (/export handler) using SolrJ client to get
> back
> > results from Solr. All the fields I am querying on and all the fields
> that
> > I get back from Solr are indexed and have docValues enabled as well. What
> > could be the reason behind increase in query time?
> >
> > Has this got something to do with the OS disk cache that is used for
> > loading the Solr indices? When a query is fired, will Solr wait for all
> > (70GB) of disk cache being available so that it can load the index file?
> >
> > Thnaks!
> >
>


Re: Solr on HDFS: increase in query time with increase in data

2016-12-15 Thread Reth RM
I think the shard index size is huge and should be split.

On Wed, Dec 14, 2016 at 10:58 AM, Chetas Joshi 
wrote:

> Hi everyone,
>
> I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
> the following config.
> maxShardsperNode: 1
> replicationFactor: 1
>
> I have been ingesting data into Solr for the last 3 months. With increase
> in data, I am observing increase in the query time. Currently the size of
> my indices is 70 GB per shard (i.e. per node).
>
> I am using cursor approach (/export handler) using SolrJ client to get back
> results from Solr. All the fields I am querying on and all the fields that
> I get back from Solr are indexed and have docValues enabled as well. What
> could be the reason behind increase in query time?
>
> Has this got something to do with the OS disk cache that is used for
> loading the Solr indices? When a query is fired, will Solr wait for all
> (70GB) of disk cache being available so that it can load the index file?
>
> Thnaks!
>


Solr on HDFS: increase in query time with increase in data

2016-12-14 Thread Chetas Joshi
Hi everyone,

I am running Solr 5.5.0 on HDFS. It is a solrCloud of 50 nodes and I have
the following config.
maxShardsperNode: 1
replicationFactor: 1

I have been ingesting data into Solr for the last 3 months. With increase
in data, I am observing increase in the query time. Currently the size of
my indices is 70 GB per shard (i.e. per node).

I am using cursor approach (/export handler) using SolrJ client to get back
results from Solr. All the fields I am querying on and all the fields that
I get back from Solr are indexed and have docValues enabled as well. What
could be the reason behind increase in query time?

Has this got something to do with the OS disk cache that is used for
loading the Solr indices? When a query is fired, will Solr wait for all
(70GB) of disk cache being available so that it can load the index file?

Thnaks!