RE: DIH Caching w/ BerkleyBackedCache
Todd,

I have no idea if this will perform acceptably with so many multiple values. I doubt the Solr/patch code was really optimized for such a use case. In my production environment, I have je-6.2.31.jar on the classpath. I don't think I've tried it with other versions.

James Dyer
Ingram Content Group

-----Original Message-----
From: Todd Long [mailto:lon...@gmail.com]
Sent: Wednesday, December 16, 2015 10:21 AM
To: solr-user@lucene.apache.org
Subject: RE: DIH Caching w/ BerkleyBackedCache

James,

I apologize for the late response.

Dyer, James-2 wrote
> With the DIH request, are you specifying "cacheDeletePriorData=false"?

We are not specifying that property (it looks like it defaults to "false"). I'm actually seeing this issue when running a full clean/import. It appears that the Berkeley DB "cleaner" is always removing the oldest file once there are three. In this case, I'll see two 1GB files, and then as the third file is being written (after ~200MB) the oldest 1GB file will fall off (i.e. get deleted). I'm only utilizing ~13% of disk space at the time. I'm using Berkeley DB version 4.1.6 with Solr 4.8.1. I'm not specifying any configuration properties beyond what I mentioned before. I simply cannot figure out what is going on with the "cleaner" logic that would deem that file "lowest utilized". Is there any other Berkeley DB/system configuration I could consider that would affect this?

It's possible that this caching simply might not be suitable for our data set, where one document might contain a field with tens of thousands of values... maybe this is the bottleneck with using this database, as every add copies in the prior data and then the "cleaner" removes the old stuff. Maybe it's working like it should but just incredibly slowly... I can get a full index without caching in about two hours; however, when using this caching it was still running after 24 hours (still caching the sub-entity).

Thanks again for the reply.
Respectfully,
Todd

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-Caching-w-BerkleyBackedCache-tp4240142p4245777.html
Sent from the Solr - User mailing list archive at Nabble.com.
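If the cleaner is the suspect, JE's behavior can be probed without code changes: it reads a je.properties file from the environment home directory (the persistCacheBaseDir in this setup). A diagnostic sketch follows — the property names are standard JE, the values are illustrative only, and it is an open question whether BerkleyBackedCache's own EnvironmentConfig settings override them:

```properties
# je.properties in the JE environment home (values illustrative only)

# A log file becomes a cleaning candidate when its live-data utilization
# falls below this percentage (JE's default is 50)
je.cleaner.minUtilization=40

# Rename cleaned files to *.del instead of deleting them, so you can
# inspect afterward what the cleaner actually reclaimed
je.cleaner.expunge=false
```

With expunge disabled, the "deleted" 1GB files would remain on disk as *.del, which at least confirms whether the cleaner (and not something else) is removing them.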
RE: DIH Caching w/ BerkleyBackedCache
Todd,

With the DIH request, are you specifying "cacheDeletePriorData=false"? Looking at the BerkleyBackedCache code, if this is set to true, it deletes the cache and assumes the current update is to fully repopulate it. If you want to do an incremental update to the cache, it needs to be false. You might also need to specify "clean=false", but I'm not sure if this is a requirement.

I've used DIH with BerkleyBackedCache for a few years and it works well for us. But rather than using it inline, we have a number of DIH handlers that just build caches; then, when they're all built, a final DIH joins data from the caches and indexes it to Solr. We also do like you are, with several handlers running at once, each doing part of the data. But I have to warn you that this code hasn't been maintained by anyone. I'm using an older DIH jar (4.6) with newer Solr. I think there might have been an API change or something that prevented the uncommitted caching code from working with newer versions, but I honestly forget. This is probably a viable solution if you don't want to write any code, but it might take some trial and error getting it to work.

James Dyer
Ingram Content Group

-----Original Message-----
From: Todd Long [mailto:lon...@gmail.com]
Sent: Tuesday, November 17, 2015 8:11 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH Caching w/ BerkleyBackedCache

Mikhail Khludnev wrote
> It's worth mentioning that for a really complex relations scheme it might
> be challenging to organize all of them into parallel ordered streams.

This will most likely be the issue for us, which is why I would like to have the Berkeley cache solution to fall back on, if possible. Again, I'm not sure why, but it appears that the Berkeley cache is overwriting itself (i.e. cleaning up unused data) when building the database... I've read plenty of other threads where it appears folks are having success using that caching solution.

Mikhail Khludnev wrote
> threads... you said? Which ones? Declarative parallelization in
> EntityProcessor worked only with a certain 3.x version.

We are running multiple DIH instances which query against specific partitions of the data (i.e. mod of the document id we're indexing).

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-Caching-w-BerkleyBackedCache-tp4240142p4240562.html
Sent from the Solr - User mailing list archive at Nabble.com.
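For reference, the "several handlers at once, each doing part of the data" scheme can be expressed with DIH request parameters, so a single data-config serves every partition. A sketch, assuming a SQL dialect with MOD() and hypothetical parameter names `partitions` and `partition` (neither appears in the original thread):

```xml
<!-- data-config.xml fragment (hypothetical entity and parameter names) -->
<entity name="parent"
        query="select ID, tp.* from TABLE_PARENT tp
               where MOD(tp.ID, ${dataimporter.request.partitions})
                     = ${dataimporter.request.partition}">
  <!-- sub-entities as usual -->
</entity>
```

Each handler instance would then be invoked with its own slice, e.g. /dataimport?command=full-import&partitions=4&partition=0.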
Re: DIH Caching w/ BerkleyBackedCache
On Mon, Nov 16, 2015 at 5:08 PM, Todd Long <lon...@gmail.com> wrote:

> Mikhail Khludnev wrote
> > "External merge" join helps to avoid boilerplate caching in such simple
> > cases.
>
> Thank you for the reply. I can certainly look into this, though I would
> have to apply the patch for our version (i.e. 4.8.1). I really just
> simplified our data configuration here, which actually consists of many
> sub-entities that are successfully using the SortedMapBackedCache cache. I
> imagine this would still apply to those, as the queries themselves are
> simple for the most part.

It's worth mentioning that for a really complex relations scheme it might be challenging to organize all of them into parallel ordered streams.

> I assume performance-wise this would only require the single table scan?

It sounds like that. But I'm not enough of an expert to comment in precise terms.

> I'm still very much interested in resolving this Berkeley database cache
> issue. I'm sure there is some minor configuration I'm missing that is
> causing this behavior. Again, I've had no issues with the
> SortedMapBackedCache for its caching purpose... I've tried simplifying our
> data configuration to only one thread with a single sub-entity, with the
> same results. Again, any help would be greatly appreciated with this.

threads... you said? Which ones? Declarative parallelization in EntityProcessor worked only with a certain 3.x version.

> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/DIH-Caching-w-BerkleyBackedCache-tp4240142p4240356.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours,
Mikhail Khludnev
Principal Engineer, Grid Dynamics
<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>
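The "parallel ordered streams" idea above is essentially a streaming merge join: both the parent and child queries are sorted by the join key, and a single forward pass stitches them together with no cache at all. A toy sketch of the algorithm in Python — not the Solr implementation, just the shape of it:

```python
def zipper_join(parents, children, key="ID"):
    """Merge-join two streams that are BOTH sorted ascending by `key`.

    Yields each parent row with a 'children' list attached. Runs in one
    forward pass, so nothing needs to be held in an external cache.
    """
    children = iter(children)
    child = next(children, None)
    for parent in parents:
        # skip orphan children whose key precedes this parent's key
        while child is not None and child[key] < parent[key]:
            child = next(children, None)
        matched = []
        # collect every child sharing this parent's key
        while child is not None and child[key] == parent[key]:
            matched.append(child)
            child = next(children, None)
        yield {**parent, "children": matched}

# Both "queries" ordered by ID, as the SQL result sets would need to be
parents = [{"ID": 1}, {"ID": 2}]
children = [{"ID": 1, "NAME": "a"}, {"ID": 1, "NAME": "b"},
            {"ID": 2, "NAME": "c"}]
joined = list(zipper_join(parents, children))
```

If either stream arrives out of order, matches are silently dropped — which is exactly Mikhail's caveat about organizing a complex relation scheme into parallel ordered streams.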
DIH Caching w/ BerkleyBackedCache
We currently index using DIH along with the SortedMapBackedCache cache implementation, which had worked well until recently, when we needed to index a much larger table. We were running into memory issues using the SortedMapBackedCache, so we tried switching to the BerkleyBackedCache, but we appear to have some configuration issues. I've included our basic setup below.

The issue we're running into is that the Berkeley database appears to be evicting database files (see message below) before they've completed. When I watch the cache directory, I only ever see two database files at a time, with each one being ~1GB in size (this appears to be hard coded). Is there some additional configuration I'm missing to prevent the process from "cleaning" up database files before the index has finished? I think this "cleanup" continues to kick off the caching, which never completes... without caching the indexing takes ~2 hours. Any help would be greatly appreciated. Thanks.

Cleaning message: "Chose lowest utilized file for cleaning. fileChosen: 0x0 ..."

--
View this message in context: http://lucene.472066.n3.nabble.com/DIH-Caching-w-BerkleyBackedCache-tp4240142.html
Sent from the Solr - User mailing list archive at Nabble.com.
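For intuition on that cleaning message: JE stores everything in an append-only log, so each time a cached key is re-added, its whole record is rewritten into the newest log file and the old copy becomes obsolete. A file whose live fraction falls below a threshold becomes the "lowest utilized" file and is reclaimed, no matter how empty the disk is. A rough toy model of that bookkeeping (all sizes and thresholds are made up, not JE's actual accounting):

```python
FILE_MAX = 1000          # toy log-file size limit, in "bytes"
MIN_UTILIZATION = 0.5    # files with less live data than this get cleaned

class AppendOnlyLog:
    """Toy model of a log-structured store with utilization-based cleaning."""

    def __init__(self):
        self.files = [[]]    # each file is a list of [key, size, live] entries
        self.latest = {}     # key -> (file_idx, entry_idx) of the live copy

    def put(self, key, size):
        # a re-add obsoletes the previous copy instead of updating in place
        if key in self.latest:
            f, e = self.latest[key]
            self.files[f][e][2] = False
        if sum(s for _, s, _ in self.files[-1]) + size > FILE_MAX:
            self.files.append([])          # roll over to a new log file
        self.files[-1].append([key, size, True])
        self.latest[key] = (len(self.files) - 1, len(self.files[-1]) - 1)

    def cleanable(self):
        """Indexes of the non-current files the cleaner would pick."""
        picks = []
        for i, entries in enumerate(self.files[:-1]):
            total = sum(s for _, s, _ in entries) or 1
            live = sum(s for _, s, alive in entries if alive)
            if live / total < MIN_UTILIZATION:
                picks.append(i)
        return picks

log = AppendOnlyLog()
for _ in range(5):       # keep re-adding one large multi-valued record
    log.put("doc-1", 400)
# every file but the newest now holds only obsolete copies of doc-1
picks = log.cleanable()
```

With a record that large, each re-add turns the previous log file into dead weight almost immediately, which would match seeing old ~1GB files deleted while only ~13% of the disk is used.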
Re: DIH Caching w/ BerkleyBackedCache
Hello Todd,

"External merge" join helps to avoid boilerplate caching in such simple cases. It should be something

On Fri, Nov 13, 2015 at 10:54 PM, Todd Long <lon...@gmail.com> wrote:

> We currently index using DIH along with the SortedMapBackedCache cache
> implementation, which had worked well until recently, when we needed to
> index a much larger table. We were running into memory issues using the
> SortedMapBackedCache, so we tried switching to the BerkleyBackedCache, but
> we appear to have some configuration issues. I've included our basic setup
> below. The issue we're running into is that the Berkeley database appears
> to be evicting database files (see message below) before they've completed.
> When I watch the cache directory, I only ever see two database files at a
> time, with each one being ~1GB in size (this appears to be hard coded). Is
> there some additional configuration I'm missing to prevent the process from
> "cleaning" up database files before the index has finished? I think this
> "cleanup" continues to kick off the caching, which never completes...
> without caching the indexing takes ~2 hours. Any help would be greatly
> appreciated. Thanks.
>
> Cleaning message: "Chose lowest utilized file for cleaning. fileChosen: 0x0
> ..."
>
> <dataConfig>
>   <document>
>     <entity name="parent"
>             query="select ID, tp.* from TABLE_PARENT tp">
>       <entity name="child"
>               query="select ID, NAME, VALUE from TABLE_CHILD"
>               cacheImpl="org.apache.solr.handler.dataimport.BerkleyBackedCache"
>               cacheKey="ID"
>               cacheLookup="parent.ID"
>               persistCacheName="CHILD"
>               persistCacheBaseDir="/some/cache/dir"
>               persistCacheFieldNames="ID,NAME,VALUE"
>               persistCacheFieldTypes="STRING,STRING,STRING"
>               berkleyInternalCacheSize="100"
>               berkleyInternalShared="true" />
>     </entity>
>   </document>
> </dataConfig>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/DIH-Caching-w-BerkleyBackedCache-tp4240142.html
> Sent from the Solr - User mailing list archive at Nabble.com.

--
Sincerely yours,
Mikhail Khludnev
Principal Engineer, Grid Dynamics
<http://www.griddynamics.com>
<mkhlud...@griddynamics.com>