Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

2019-11-23 Thread Noble Paul
Thanks Colvin.
Can you share the details in the ticket?

I plan to debug this today.

It's unlikely to be a synchronization issue because
serialization/deserialization usually happens in single thread.

On Sun, Nov 24, 2019, 4:09 AM Colvin Cowie 
wrote:

> https://issues.apache.org/jira/browse/SOLR-13963
>
> I'll see about modifying the test I have to fit in with the existing tests,
> and if there's a better option then open to whatever
>
> On Sat, 23 Nov 2019 at 16:43, Colvin Cowie 
> wrote:
>
> > I've found the problem, JavaBinCodec has a CharArr,* arr*, which is
> > modified in two different locations, but only one of which is protected
> > with a synchronized block
> >
> > getStringProvider(), which is used when you call getValue() rather than
> > getRawValue() on the string based SolrInputFields, synchronizes:
> >
> >
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L966
> > but  _readStr() doesn't:
> >
> >
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L930
> >
> > Adding a synchronized block into _readStr() fixes the problem. At least
> as
> > far as my test goes.
> >
> > I'll raise a JIRA issue and can provide a patch with the synchronized
> > block, but not sure what test(s) should be updated / added to cover this?
> >
> >
> >
> > On Thu, 21 Nov 2019 at 18:23, Colvin Cowie 
> > wrote:
> >
> >> *> the difference is because the _default config has the dynamic schema
> >> building in it, which I assume is pushing it down a different code
> path. *
> >>
> >> Also to add to that, I assumed initially that this just meant that it
> was
> >> working because the corrupted field names would just cause it to create
> a
> >> field with the dodgy name (since that's the idea for the dynamic
> schema),
> >> but checking the documents on retrieval showed they all had the right
> field
> >> names...
> >> So I assume it's a result of going into a different branch of code
> >> instead.
> >>
> >>
> >> On an unrelated matter, I saw this in the logs when running with
> embedded
> >> zookeeper... I don't think I've seen it mentioned anywhere else, so I
> will
> >> raise an issue for it
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> *2019-11-21 17:25:14.292 INFO  (main) [   ] o.a.s.c.SolrZkServer
> STARTING
> >> EMBEDDED STANDALONE ZOOKEEPER SERVER at port 99832019-11-21 17:25:14.792
> >> INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper
> >> client=localhost:99832019-11-21 17:25:18.833 WARN  (Thread-13) [   ]
> >> o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting
> >> JettyAdminServer => java.lang.NoClassDefFoundError:
> >> org/eclipse/jetty/server/Connector at java.lang.Class.forName0(Native
> >> Method)java.lang.NoClassDefFoundError:
> org/eclipse/jetty/server/Connector
> >> at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191] at
> >> java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191] at
> >>
> org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43)
> >> ~[?:?] at
> >>
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136)
> >> ~[?:?] at
> org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121)
> >> ~[?:?]Caused by: java.lang.ClassNotFoundException:
> >> org.eclipse.jetty.server.Connector at
> >>
> org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577)
> >> ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610] at
> >> java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191]
> ... 5
> >> more2019-11-21 17:25:19.365 INFO  (main) [   ]
> o.a.s.c.c.ConnectionManager
> >> Waiting for client to connect to ZooKeeper2019-11-21 17:25:19.396 INFO
> >>  (zkConnectionManagerCallback-7-thread-1) [   ]
> o.a.s.c.c.ConnectionManager
> >> zkClient has connected2019-11-21 17:25:19.396 INFO  (main) [   ]
> >> o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper*
> >>
> >> On Thu, 21 Nov 2019 at 17:30, Colvin Cowie 
> >> wrote:
> >>
> >>> I've been a bit snowed under, but I've found the difference is because
> >>> the _default config has the dynamic schema building in it, which I
> assume
> >>> is pushing it down a different code path.
> >>>
> >>>>>> default="${update.autoCreateFields:true}"
> >>>
> >>>
> processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
> >>>
> >>> I'm using the vanilla Solr 8.3.0 binary8.3.0
> >>> 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22
> with
> >>> Eclipse OpenJ9 Eclipse OpenJ9 VM 1.8.0_232 openj9-0.17.0
> >>> and I've checked with Oracle Corporation Java HotSpot(TM) 64-Bit Server
> >>> VM 1.8.0_191 25.191-b12 as well
> >>>
> >>> I've put a testcase and configsets in Google Drive:
> >>> https://drive.google.com/open?id=1ibKNWvowT8cXTwSa3bcTwKYLSRNur86U
> >>> The configsets are a copy of

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

2019-11-23 Thread Colvin Cowie
https://issues.apache.org/jira/browse/SOLR-13963

I'll see about modifying the test I have to fit in with the existing tests,
and if there's a better option then open to whatever

On Sat, 23 Nov 2019 at 16:43, Colvin Cowie 
wrote:

> I've found the problem, JavaBinCodec has a CharArr,* arr*, which is
> modified in two different locations, but only one of which is protected
> with a synchronized block
>
> getStringProvider(), which is used when you call getValue() rather than
> getRawValue() on the string based SolrInputFields, synchronizes:
>
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L966
> but  _readStr() doesn't:
>
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L930
>
> Adding a synchronized block into _readStr() fixes the problem. At least as
> far as my test goes.
>
> I'll raise a JIRA issue and can provide a patch with the synchronized
> block, but not sure what test(s) should be updated / added to cover this?
>
>
>
> On Thu, 21 Nov 2019 at 18:23, Colvin Cowie 
> wrote:
>
>> *> the difference is because the _default config has the dynamic schema
>> building in it, which I assume is pushing it down a different code path. *
>>
>> Also to add to that, I assumed initially that this just meant that it was
>> working because the corrupted field names would just cause it to create a
>> field with the dodgy name (since that's the idea for the dynamic schema),
>> but checking the documents on retrieval showed they all had the right field
>> names...
>> So I assume it's a result of going into a different branch of code
>> instead.
>>
>>
>> On an unrelated matter, I saw this in the logs when running with embedded
>> zookeeper... I don't think I've seen it mentioned anywhere else, so I will
>> raise an issue for it
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *2019-11-21 17:25:14.292 INFO  (main) [   ] o.a.s.c.SolrZkServer STARTING
>> EMBEDDED STANDALONE ZOOKEEPER SERVER at port 99832019-11-21 17:25:14.792
>> INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper
>> client=localhost:99832019-11-21 17:25:18.833 WARN  (Thread-13) [   ]
>> o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting
>> JettyAdminServer => java.lang.NoClassDefFoundError:
>> org/eclipse/jetty/server/Connector at java.lang.Class.forName0(Native
>> Method)java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector
>> at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191] at
>> java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191] at
>> org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43)
>> ~[?:?] at
>> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136)
>> ~[?:?] at org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121)
>> ~[?:?]Caused by: java.lang.ClassNotFoundException:
>> org.eclipse.jetty.server.Connector at
>> org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577)
>> ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610] at
>> java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191] ... 5
>> more2019-11-21 17:25:19.365 INFO  (main) [   ] o.a.s.c.c.ConnectionManager
>> Waiting for client to connect to ZooKeeper2019-11-21 17:25:19.396 INFO
>>  (zkConnectionManagerCallback-7-thread-1) [   ] o.a.s.c.c.ConnectionManager
>> zkClient has connected2019-11-21 17:25:19.396 INFO  (main) [   ]
>> o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper*
>>
>> On Thu, 21 Nov 2019 at 17:30, Colvin Cowie 
>> wrote:
>>
>>> I've been a bit snowed under, but I've found the difference is because
>>> the _default config has the dynamic schema building in it, which I assume
>>> is pushing it down a different code path.
>>>
>>>   >> default="${update.autoCreateFields:true}"
>>>
>>>  
>>> processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
>>>
>>> I'm using the vanilla Solr 8.3.0 binary8.3.0
>>> 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22 with
>>> Eclipse OpenJ9 Eclipse OpenJ9 VM 1.8.0_232 openj9-0.17.0
>>> and I've checked with Oracle Corporation Java HotSpot(TM) 64-Bit Server
>>> VM 1.8.0_191 25.191-b12 as well
>>>
>>> I've put a testcase and configsets in Google Drive:
>>> https://drive.google.com/open?id=1ibKNWvowT8cXTwSa3bcTwKYLSRNur86U
>>> The configsets are a copy of the _default configset, except the
>>> "problem" configset has autoCreateFields set to false.
>>> I created a collection with 4 shards, replication factor 1 for each
>>> configset. The test case reliably fails on the "problem" collection and
>>> reliably passes against the "no_problem" collection.
>>>
>>> The test (well it's not actually a @Test but still) has static data
>>> (though it was originally generated randomly). The data is a bit mad... but
>>> it was easier to reproduce the problem reliabl

Re: Possible data corruption in JavaBinCodec in Solr 8.3 during distributed update?

2019-11-23 Thread Colvin Cowie
I've found the problem, JavaBinCodec has a CharArr,* arr*, which is
modified in two different locations, but only one of which is protected
with a synchronized block

getStringProvider(), which is used when you call getValue() rather than
getRawValue() on the string based SolrInputFields, synchronizes:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L966
but  _readStr() doesn't:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L930

Adding a synchronized block into _readStr() fixes the problem. At least as
far as my test goes.

I'll raise a JIRA issue and can provide a patch with the synchronized
block, but not sure what test(s) should be updated / added to cover this?



On Thu, 21 Nov 2019 at 18:23, Colvin Cowie 
wrote:

> *> the difference is because the _default config has the dynamic schema
> building in it, which I assume is pushing it down a different code path. *
>
> Also to add to that, I assumed initially that this just meant that it was
> working because the corrupted field names would just cause it to create a
> field with the dodgy name (since that's the idea for the dynamic schema),
> but checking the documents on retrieval showed they all had the right field
> names...
> So I assume it's a result of going into a different branch of code instead.
>
>
> On an unrelated matter, I saw this in the logs when running with embedded
> zookeeper... I don't think I've seen it mentioned anywhere else, so I will
> raise an issue for it
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *2019-11-21 17:25:14.292 INFO  (main) [   ] o.a.s.c.SolrZkServer STARTING
> EMBEDDED STANDALONE ZOOKEEPER SERVER at port 99832019-11-21 17:25:14.792
> INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper
> client=localhost:99832019-11-21 17:25:18.833 WARN  (Thread-13) [   ]
> o.a.z.s.a.AdminServerFactory Unable to load jetty, not starting
> JettyAdminServer => java.lang.NoClassDefFoundError:
> org/eclipse/jetty/server/Connector at java.lang.Class.forName0(Native
> Method)java.lang.NoClassDefFoundError: org/eclipse/jetty/server/Connector
> at java.lang.Class.forName0(Native Method) ~[?:1.8.0_191] at
> java.lang.Class.forName(Class.java:264) ~[?:1.8.0_191] at
> org.apache.zookeeper.server.admin.AdminServerFactory.createAdminServer(AdminServerFactory.java:43)
> ~[?:?] at
> org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:136)
> ~[?:?] at org.apache.solr.cloud.SolrZkServer$1.run(SolrZkServer.java:121)
> ~[?:?]Caused by: java.lang.ClassNotFoundException:
> org.eclipse.jetty.server.Connector at
> org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:577)
> ~[jetty-webapp-9.4.19.v20190610.jar:9.4.19.v20190610] at
> java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_191] ... 5
> more2019-11-21 17:25:19.365 INFO  (main) [   ] o.a.s.c.c.ConnectionManager
> Waiting for client to connect to ZooKeeper2019-11-21 17:25:19.396 INFO
>  (zkConnectionManagerCallback-7-thread-1) [   ] o.a.s.c.c.ConnectionManager
> zkClient has connected2019-11-21 17:25:19.396 INFO  (main) [   ]
> o.a.s.c.c.ConnectionManager Client is connected to ZooKeeper*
>
> On Thu, 21 Nov 2019 at 17:30, Colvin Cowie 
> wrote:
>
>> I've been a bit snowed under, but I've found the difference is because
>> the _default config has the dynamic schema building in it, which I assume
>> is pushing it down a different code path.
>>
>>   > default="${update.autoCreateFields:true}"
>>
>>  
>> processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,parse-date,add-schema-fields">
>>
>> I'm using the vanilla Solr 8.3.0 binary8.3.0
>> 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:15:22 with
>> Eclipse OpenJ9 Eclipse OpenJ9 VM 1.8.0_232 openj9-0.17.0
>> and I've checked with Oracle Corporation Java HotSpot(TM) 64-Bit Server
>> VM 1.8.0_191 25.191-b12 as well
>>
>> I've put a testcase and configsets in Google Drive:
>> https://drive.google.com/open?id=1ibKNWvowT8cXTwSa3bcTwKYLSRNur86U
>> The configsets are a copy of the _default configset, except the "problem"
>> configset has autoCreateFields set to false.
>> I created a collection with 4 shards, replication factor 1 for each
>> configset. The test case reliably fails on the "problem" collection and
>> reliably passes against the "no_problem" collection.
>>
>> The test (well it's not actually a @Test but still) has static data
>> (though it was originally generated randomly). The data is a bit mad... but
>> it was easier to reproduce the problem reliably with this data, than with
>> the normal documents we use in our product.
>> Each document has a different (dynamically named) field to index data
>> into, but it's the same data in each field.
>> The problem only appears (or probably is just more likely to appear?)
>> when the field names in the request are of different lengths.
>> The length / value of t