I think I found the problem. Inside RegexStringComparator, compareTo() uses
find() rather than matches():
  public int compareTo(byte[] value, int offset, int length) {
    // Use find() for subsequence match instead of matches() (full sequence
    // match) to adhere to the principle of least surprise.
    String tmp;
    if (length < value.length / 2) {
      // See HBASE-9428. Make a copy of the relevant part of the byte[],
      // or the JDK will copy the entire byte[] during String decode
      tmp = new String(Arrays.copyOfRange(value, offset, offset + length),
          charset);
    } else {
      tmp = new String(value, offset, length, charset);
    }
    return pattern.matcher(tmp).find() ? 0 : 1;
  }
I used a simple program to test the difference between matches() and find():
  String s = "hung";
  Pattern p = Pattern.compile("u", Pattern.DOTALL);
  Matcher m = p.matcher(s);
  System.out.println(m.matches()); // prints false
  System.out.println(m.find());    // prints true
  p = Pattern.compile(".*u.*", Pattern.DOTALL);
  m = p.matcher(s);
  System.out.println(m.matches()); // prints true
  System.out.println(m.find());    // prints false: the Matcher was not reset after matches()
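One caveat about the test above: a Matcher keeps state between calls, so a find() issued right after a successful matches() resumes at the end of that match instead of starting over. The following is a minimal sketch of that effect; the class name MatcherStateDemo and the helper probe() are made up for illustration and are not part of HBase.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherStateDemo {
    // Returns { matches(), find() without reset, find() after reset() },
    // all evaluated on the same Matcher instance.
    static boolean[] probe(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        boolean full = m.matches();
        boolean findNoReset = m.find();   // continues from the previous match, if any
        m.reset();                        // discard the matcher's state
        boolean findAfterReset = m.find();
        return new boolean[] { full, findNoReset, findAfterReset };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(probe("u", "hung")));     // [false, true, true]
        System.out.println(Arrays.toString(probe(".*u.*", "hung"))); // [true, false, true]
    }
}
```

On a fresh or reset Matcher, find() succeeds for both "u" and ".*u.*", which is why the two patterns select the same cells in the filter.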
matches() is what I need right now, and to me it is the more reasonable
choice, but I don't know how to switch to it without modifying the source
code.
@Ted:
What you are suggesting is true, but for our user base it is rather
counterintuitive: we are accustomed to searching with an expression such as
"abc.*" to match the prefix "abc", rather than having to write "^abc.*"
explicitly.
If I can't change RegexStringComparator.compareTo() from find() to
matches(), then I suppose I can implement a hard fix by adding "^" at the
beginning of the search keyword.
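A possible shape for that hard fix, as a sketch only: wrap the user's expression in a non-capturing group and anchor both ends, so that a find()-based check behaves like matches(). The class AnchoredRegexDemo and the helpers anchored() and findMatches() are hypothetical names; findMatches() only imitates the find() call quoted from compareTo() and is not HBase code.

```java
import java.util.regex.Pattern;

class AnchoredRegexDemo {
    // Hypothetical helper: anchor a user-supplied regex at both ends.
    // The (?: ... ) group keeps an alternation such as "abc|def" fully anchored.
    static String anchored(String regex) {
        return "^(?:" + regex + ")$";
    }

    // Imitates the find()-based check from compareTo(), using DOTALL as in the test above.
    static boolean findMatches(String regex, String value) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(findMatches("u", "hung"));                 // true
        System.out.println(findMatches(anchored("u"), "hung"));       // false
        System.out.println(findMatches(anchored("abc.*"), "abcdef")); // true
        System.out.println(findMatches(anchored("abc.*"), "xabc"));   // false
        // A bare "^" prefix anchors only the first alternative:
        System.out.println(findMatches("^abc|def", "xdef"));          // true
        System.out.println(findMatches(anchored("abc|def"), "xdef")); // false
    }
}
```

The anchored string could then be passed to the comparator, for example new RegexStringComparator(anchored(keyword)), where keyword stands for the user's input. Note that "$" also matches just before a trailing line terminator; "\z" is the stricter choice if values may end with a newline.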
Thanks for your quick responses.
Best regards,
Henry
-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Monday, June 16, 2014 11:32 AM
To: [email protected]
Subject: Re: RegexStringComparator problem: Why pattern "u" has the same result
as ".*u.*" ?
"u" is part of "hung", producing a match.
Do you want to find a string whose value is exactly "u" (not a substring)?
In that case you can specify "^u$".
Cheers
On Sun, Jun 15, 2014 at 8:20 PM, Henry Hung <[email protected]> wrote:
>
> I have this data set and the value I want to test is "cf:c" = "hung":
>
> hbase(main):001:0> scan 'TEST'
> ROW COLUMN+CELL
> \x00\x00\x00\x03abc\x00\x00\x00\x02 column=cf:a,
> timestamp=1402649511909, value=abc
> \x00\x00\x00\x03abc\x00\x00\x00\x02 column=cf:b,
> timestamp=1402649511909, value=\x00\x00\x00\x02
> \x00\x00\x00\x03abc\x00\x00\x00\x02 column=cf:c,
> timestamp=1402649511909, value=def
> \x00\x00\x00\x03abc\x00\x00\x00\x02 column=cf:d,
> timestamp=1402649511909, value=\x00\x00\x01F\x93\x81s\xA8
> \x00\x00\x00\x03abc\x00\x00\x00\x03 column=cf:a,
> timestamp=1402649610557, value=abc
> \x00\x00\x00\x03abc\x00\x00\x00\x03 column=cf:b,
> timestamp=1402649610557, value=\x00\x00\x00\x03
> \x00\x00\x00\x03abc\x00\x00\x00\x03 column=cf:c,
> timestamp=1402649610557, value=def
> \x00\x00\x00\x03abc\x00\x00\x00\x03 column=cf:d,
> timestamp=1402649610557, value=\x00\x00\x01F\x93\x81s\xA8
> \x00\x00\x00\x03abc\x00\x00\x00\x04 column=cf:a,
> timestamp=1402650015602, value=abc
> \x00\x00\x00\x03abc\x00\x00\x00\x04 column=cf:b,
> timestamp=1402650015602, value=\x00\x00\x00\x04
> \x00\x00\x00\x03abc\x00\x00\x00\x04 column=cf:c,
> timestamp=1402650015602, value=def
> \x00\x00\x00\x03abc\x00\x00\x00\x04 column=cf:d,
> timestamp=1402650015602, value=\x00\x00\x01F\x93\x81s\xA8
> \x00\x00\x00\x05henry\x00\x00\x00\x06 column=cf:a,
> timestamp=1402886404698, value=henry
> \x00\x00\x00\x05henry\x00\x00\x00\x06 column=cf:b,
> timestamp=1402886404698, value=\x00\x00\x00\x06
> \x00\x00\x00\x05henry\x00\x00\x00\x06 column=cf:c,
> timestamp=1402886404698, value=hung
> \x00\x00\x00\x05henry\x00\x00\x00\x06 column=cf:d,
> timestamp=1402886404698, value=\x00\x00\x01F\xA2\x8A\xBD\xA0
> \x00\x00\x00\x06abcdef\x00\x00\x00\x01 column=cf:a,
> timestamp=1402650022755, value=abcdef
> \x00\x00\x00\x06abcdef\x00\x00\x00\x01 column=cf:b,
> timestamp=1402650022755, value=\x00\x00\x00\x01
> \x00\x00\x00\x06abcdef\x00\x00\x00\x01 column=cf:c,
> timestamp=1402650022755, value=def
> \x00\x00\x00\x06abcdef\x00\x00\x00\x01 column=cf:d,
> timestamp=1402650022755, value=\x00\x00\x01F\x93\x81s\xA8
> \x00\x00\x00\x06abcdef\x00\x00\x00\x02 column=cf:a,
> timestamp=1402650025763, value=abcdef
> \x00\x00\x00\x06abcdef\x00\x00\x00\x02 column=cf:b,
> timestamp=1402650025763, value=\x00\x00\x00\x02
> \x00\x00\x00\x06abcdef\x00\x00\x00\x02 column=cf:c,
> timestamp=1402650025763, value=def
> \x00\x00\x00\x06abcdef\x00\x00\x00\x02 column=cf:d,
> timestamp=1402650025763, value=\x00\x00\x01F\x93\x81s\xA8
> 6 row(s) in 0.1090 seconds
>
>
> I wrote a small program to test it:
>
> HTable conn = new HTable(HBaseConfiguration.create(), "TEST");
> try {
>     Scan scan = new Scan();
>     RegexStringComparator comp = new RegexStringComparator("u");
>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>         Bytes.toBytes("cf"), Bytes.toBytes("c"), CompareOp.EQUAL, comp);
>     FilterList filters = new FilterList(Operator.MUST_PASS_ALL);
>     filters.addFilter(filter);
>     scan.setFilter(filters);
>     ResultScanner rs = conn.getScanner(scan);
>     try {
>         Result r = rs.next();
>         System.out.println(Bytes.toString(
>             r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("c"))));
>     } finally {
>         rs.close();
>     }
> } finally {
>     conn.close();
> }
>
> Because I use the regex "u" as the value comparator, I expected no row to
> match, so the program should throw a null pointer exception.
> But when I execute it, the result is "hung".
>
> The question is: why does SingleColumnValueFilter not abide by the regex
> comparator? Or why does the regex comparator "u" behave the same as ".*u.*"?
>
> Best regards,
> Henry Hung
>
> ________________________________
> The privileged confidential information contained in this email is
> intended for use only by the addressees as indicated by the original
> sender of this email. If you are not the addressee indicated in this
> email or are not responsible for delivery of the email to such a
> person, please kindly reply to the sender indicating this fact and
> delete all copies of it from your computer and network server
> immediately. Your cooperation is highly appreciated. It is advised
> that any unauthorized use of confidential information of Winbond is
> strictly prohibited; and any information in this email irrelevant to
> the official business of Winbond shall be deemed as neither given nor
> endorsed by Winbond.
>