Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?
Thanks Colvin. Can you share the details in the ticket? I plan to debug this today. It's unlikely to be a synchronization issue because serialization/deserialization usually happens in single thread. On Sun, Nov 24, 2019, 4:09 AM Colvin Cowie wrote: > https://issues.apache.org/jira/browse/SOLR-13963 > > I'll see about modifying the test I have to fit in with the existing tests, > and if there's a better option then open to whatever > > On Sat, 23 Nov 2019 at 16:43, Colvin Cowie > wrote: > > > I've found the problem, JavaBinCodec has a CharArr,* arr*, which is > > modified in two different locations, but only one of which is protected > > with a synchronized block > > > > getStringProvider(), which is used when you call getValue() rather than > > getRawValue() on the string based SolrInputFields, synchronizes: > > > > > https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L966 > > but _readStr() doesn't: > > > > > https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L930 > > > > Adding a synchronized block into _readStr() fixes the problem. At least > as > > far as my test goes. > > > > I'll raise a JIRA issue and can provide a patch with the synchronized > > block, but not sure what test(s) should be updated / added to cover this? > > > > > > > > On Thu, 21 Nov 2019 at 18:23, Colvin Cowie > > wrote: > > > >> *> the difference is because the _default config has the dynamic schema > >> building in it, which I assume is pushing it down a different code > path. * > >> > >> Also to add to that, I assumed initially that this just meant that it > was > >> working because the corrupted field names would just cause it to create > a > >> field with the dodgy name (since that's the idea for the dynamic > schema), > >> but checking the documents on retrieval showed they all had the right > field > >> names... > >> So I assume it's a result of going into a different branch of code > >> instead. > >> > >> > >> On an unrelated matter, I saw this in the logs when running with > embedded > >> zookeeper... I don't think I've seen it mentioned anywhere else, so I > will > >> raise an issue for it > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> *2019-11-21 17:25:14.292 INFO (main) [ ] o.a.s.c.SolrZkServer > STARTING > >> EMBEDDED STANDALONE ZOOKEEPER SERVER at port 99832019-11-21 17:25:14.792 > >> INFO (main) [ ] o.a.s.c.ZkContainer Zookeeper > >> client=localhost:99832019-11-21 17:25:18.833 WARN (Thread-13) [ ] > >> o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting > >> JettyAdminServer => java.lang.NoClassDefFoundError: > >> org/eclipse/jetty/server/Connector at java.lang.Class.forName0(Native > >> Method)java.lang.NoClassDefFoundError: > org/eclipse/jetty/server/Connector > >> at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191] at > >> java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191] at > >> > org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43) > >> ~[?:?] at > >> > org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136) > >> ~[?:?] at > org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121) > >> ~[?:?]Caused by: java.lang.ClassNotFoundException: > >> org.eclipse.jetty.server.Connector at > >> > org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577) > >> ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610] at > >> java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191] > ... 5 > >> more2019-11-21 17:25:19.365 INFO (main) [ ] > o.a.s.c.c.ConnectionManager > >> Waiting for client to connect to ZooKeeper2019-11-21 17:25:19.396 INFO > >> (zkConnectionManagerCallback-7-thread-1) [ ] > o.a.s.c.c.ConnectionManager > >> zkClient has connected2019-11-21 17:25:19.396 INFO (main) [ ] > >> o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper* > >> > >> On Thu, 21 Nov 2019 at 17:30, Colvin Cowie > >> wrote: > >> > >>> I've been a bit snowed under, but I've found the difference is because > >>> the _default config has the dynamic schema building in it, which I > assume > >>> is pushing it down a different code path. > >>> > >>>>>> default="${update.autoCreateFields:true}" > >>> > >>> > processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields"> > >>> > >>> I'm using the vanilla Solr 8.3.0 binary8.3.0 > >>> 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22 > with > >>> Eclipse OpenJ9 Eclipse OpenJ9 VM 1.8.0_232 openj9-0.17.0 > >>> and I've checked with Oracle Corporation Java HotSpot(TM) 64-Bit Server > >>> VM 1.8.0_191 25.191-b12 as well > >>> > >>> I've put a testcase and configsets in Google Drive: > >>> https://drive.google.com/open?id=1ibKNWvowT8cXTwSa3bcTwKYLSRNur86U > >>> The configsets are a copy of
Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?
https://issues.apache.org/jira/browse/SOLR-13963 I'll see about modifying the test I have to fit in with the existing tests, and if there's a better option then open to whatever On Sat, 23 Nov 2019 at 16:43, Colvin Cowie wrote: > I've found the problem, JavaBinCodec has a CharArr,* arr*, which is > modified in two different locations, but only one of which is protected > with a synchronized block > > getStringProvider(), which is used when you call getValue() rather than > getRawValue() on the string based SolrInputFields, synchronizes: > > https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L966 > but _readStr() doesn't: > > https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L930 > > Adding a synchronized block into _readStr() fixes the problem. At least as > far as my test goes. > > I'll raise a JIRA issue and can provide a patch with the synchronized > block, but not sure what test(s) should be updated / added to cover this? > > > > On Thu, 21 Nov 2019 at 18:23, Colvin Cowie > wrote: > >> *> the difference is because the _default config has the dynamic schema >> building in it, which I assume is pushing it down a different code path. * >> >> Also to add to that, I assumed initially that this just meant that it was >> working because the corrupted field names would just cause it to create a >> field with the dodgy name (since that's the idea for the dynamic schema), >> but checking the documents on retrieval showed they all had the right field >> names... >> So I assume it's a result of going into a different branch of code >> instead. >> >> >> On an unrelated matter, I saw this in the logs when running with embedded >> zookeeper... I don't think I've seen it mentioned anywhere else, so I will >> raise an issue for it >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> *2019-11-21 17:25:14.292 INFO (main) [ ] o.a.s.c.SolrZkServer STARTING >> EMBEDDED STANDALONE ZOOKEEPER SERVER at port 99832019-11-21 17:25:14.792 >> INFO (main) [ ] o.a.s.c.ZkContainer Zookeeper >> client=localhost:99832019-11-21 17:25:18.833 WARN (Thread-13) [ ] >> o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting >> JettyAdminServer => java.lang.NoClassDefFoundError: >> org/eclipse/jetty/server/Connector at java.lang.Class.forName0(Native >> Method)java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector >> at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191] at >> java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191] at >> org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43) >> ~[?:?] at >> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136) >> ~[?:?] at org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121) >> ~[?:?]Caused by: java.lang.ClassNotFoundException: >> org.eclipse.jetty.server.Connector at >> org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577) >> ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610] at >> java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191] ... 5 >> more2019-11-21 17:25:19.365 INFO (main) [ ] o.a.s.c.c.ConnectionManager >> Waiting for client to connect to ZooKeeper2019-11-21 17:25:19.396 INFO >> (zkConnectionManagerCallback-7-thread-1) [ ] o.a.s.c.c.ConnectionManager >> zkClient has connected2019-11-21 17:25:19.396 INFO (main) [ ] >> o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper* >> >> On Thu, 21 Nov 2019 at 17:30, Colvin Cowie >> wrote: >> >>> I've been a bit snowed under, but I've found the difference is because >>> the _default config has the dynamic schema building in it, which I assume >>> is pushing it down a different code path. >>> >>> >> default="${update.autoCreateFields:true}" >>> >>> >>> processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields"> >>> >>> I'm using the vanilla Solr 8.3.0 binary8.3.0 >>> 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22 with >>> Eclipse OpenJ9 Eclipse OpenJ9 VM 1.8.0_232 openj9-0.17.0 >>> and I've checked with Oracle Corporation Java HotSpot(TM) 64-Bit Server >>> VM 1.8.0_191 25.191-b12 as well >>> >>> I've put a testcase and configsets in Google Drive: >>> https://drive.google.com/open?id=1ibKNWvowT8cXTwSa3bcTwKYLSRNur86U >>> The configsets are a copy of the _default configset, except the >>> "problem" configset has autoCreateFields set to false. >>> I created a collection with 4 shards, replication factor 1 for each >>> configset. The test case reliably fails on the "problem" collection and >>> reliably passes against the "no_problem" collection. >>> >>> The test (well it's not actually a @Test but still) has static data >>> (though it was originally generated randomly). The data is a bit mad... but >>> it was easier to reproduce the problem reliabl
Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?
I've found the problem, JavaBinCodec has a CharArr,* arr*, which is modified in two different locations, but only one of which is protected with a synchronized block getStringProvider(), which is used when you call getValue() rather than getRawValue() on the string based SolrInputFields, synchronizes: https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L966 but _readStr() doesn't: https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L930 Adding a synchronized block into _readStr() fixes the problem. At least as far as my test goes. I'll raise a JIRA issue and can provide a patch with the synchronized block, but not sure what test(s) should be updated / added to cover this? On Thu, 21 Nov 2019 at 18:23, Colvin Cowie wrote: > *> the difference is because the _default config has the dynamic schema > building in it, which I assume is pushing it down a different code path. * > > Also to add to that, I assumed initially that this just meant that it was > working because the corrupted field names would just cause it to create a > field with the dodgy name (since that's the idea for the dynamic schema), > but checking the documents on retrieval showed they all had the right field > names... > So I assume it's a result of going into a different branch of code instead. > > > On an unrelated matter, I saw this in the logs when running with embedded > zookeeper... I don't think I've seen it mentioned anywhere else, so I will > raise an issue for it > > > > > > > > > > > > > > > > > *2019-11-21 17:25:14.292 INFO (main) [ ] o.a.s.c.SolrZkServer STARTING > EMBEDDED STANDALONE ZOOKEEPER SERVER at port 99832019-11-21 17:25:14.792 > INFO (main) [ ] o.a.s.c.ZkContainer Zookeeper > client=localhost:99832019-11-21 17:25:18.833 WARN (Thread-13) [ ] > o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting > JettyAdminServer => java.lang.NoClassDefFoundError: > org/eclipse/jetty/server/Connector at java.lang.Class.forName0(Native > Method)java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector > at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191] at > java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191] at > org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43) > ~[?:?] at > org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136) > ~[?:?] at org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121) > ~[?:?]Caused by: java.lang.ClassNotFoundException: > org.eclipse.jetty.server.Connector at > org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577) > ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610] at > java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191] ... 5 > more2019-11-21 17:25:19.365 INFO (main) [ ] o.a.s.c.c.ConnectionManager > Waiting for client to connect to ZooKeeper2019-11-21 17:25:19.396 INFO > (zkConnectionManagerCallback-7-thread-1) [ ] o.a.s.c.c.ConnectionManager > zkClient has connected2019-11-21 17:25:19.396 INFO (main) [ ] > o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper* > > On Thu, 21 Nov 2019 at 17:30, Colvin Cowie > wrote: > >> I've been a bit snowed under, but I've found the difference is because >> the _default config has the dynamic schema building in it, which I assume >> is pushing it down a different code path. >> >> > default="${update.autoCreateFields:true}" >> >> >> processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields"> >> >> I'm using the vanilla Solr 8.3.0 binary8.3.0 >> 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22 with >> Eclipse OpenJ9 Eclipse OpenJ9 VM 1.8.0_232 openj9-0.17.0 >> and I've checked with Oracle Corporation Java HotSpot(TM) 64-Bit Server >> VM 1.8.0_191 25.191-b12 as well >> >> I've put a testcase and configsets in Google Drive: >> https://drive.google.com/open?id=1ibKNWvowT8cXTwSa3bcTwKYLSRNur86U >> The configsets are a copy of the _default configset, except the "problem" >> configset has autoCreateFields set to false. >> I created a collection with 4 shards, replication factor 1 for each >> configset. The test case reliably fails on the "problem" collection and >> reliably passes against the "no_problem" collection. >> >> The test (well it's not actually a @Test but still) has static data >> (though it was originally generated randomly). The data is a bit mad... but >> it was easier to reproduce the problem reliably with this data, than with >> the normal documents we use in our product. >> Each document has a different (dynamically named) field to index data >> into, but it's the same data in each field. >> The problem only appears (or probably is just more likely to appear?) >> when the field names in the request are of different lengths. >> The length / value of t