[ https://issues.apache.org/jira/browse/ASTERIXDB-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15475341#comment-15475341 ]
Taewoo Kim commented on ASTERIXDB-1487: --------------------------------------- What's the main issue here? > Fuzzy select-join on inverted index poses inconsistent results. > --------------------------------------------------------------- > > Key: ASTERIXDB-1487 > URL: https://issues.apache.org/jira/browse/ASTERIXDB-1487 > Project: Apache AsterixDB > Issue Type: Bug > Components: AsterixDB > Environment: MAC 4 cores, 8GB memory. The current master till > 3/17/2016. > Reporter: Wenhai > Assignee: Wenhai > Priority: Critical > Attachments: csx-small-multi-id.txt, dblp-small-multi-id.txt > > > As shown in below. After we switching the two "for" branches of the fuzzy > join over a select, the results are consistent. > Schema > {noformat} > drop dataverse test if exists; > create dataverse test; > use dataverse test; > create type DBLPNestedType as closed { > id: int64, > dblpid: string, > title: string, > authors: string, > misc: string > } > create type DBLPType as closed { > nested: DBLPNestedType > } > create type CSXNestedType as closed { > id: int64, > csxid: string, > title: string, > authors: string, > misc: string > } > create type CSXType as closed { > nested: CSXNestedType > } > create dataset DBLPtmp(DBLPNestedType) primary key id; > create dataset CSXtmp(CSXNestedType) primary key id; > create dataset DBLP(DBLPType) primary key nested.id; > create dataset CSX(CSXType) primary key nested.id; > use dataverse test; > load dataset DBLPtmp > using localfs > (("path"="asterix_nc1://data/dblp-small/dblp-small-multi-id.txt"),("format"="delimited-text"),("delimiter"=":"),("quote"="\u0000")) > pre-sorted; > load dataset CSXtmp > using localfs > (("path"="asterix_nc1://data/pub-small/csx-small-multi-id.txt"),("format"="delimited-text"),("delimiter"=":"),("quote"="\u0000")); > insert into dataset DBLP( > for $x in dataset DBLPtmp > return { > "nested": $x > } > ); > insert into dataset CSX( > for $x in dataset CSXtmp > return { > "nested": $x > } > ); > {noformat} > Indexes > {noformat} > create index keyword_index on DBLP(nested.title) type keyword; > create index keyword_indexdbauhors on DBLP(nested.authors) type keyword; > create index keyword_indexcsxauthors on CSX(nested.authors) type keyword; > {noformat} > The following query > {noformat} > use dataverse test; > set simthresholds '.1' > let $s := count( > for $o in dataset DBLP > for $t in dataset CSX > where contains($o.nested.title, "System") and word-tokens($o.nested.authors) > ~= word-tokens($t.nested.authors) > return $o > ) > return $s > {noformat} > will return 28, while the query > {noformat} > use dataverse test; > set simthresholds '.1' > let $s := count( > for $t in dataset CSX > for $o in dataset DBLP > where contains($o.nested.title, "System") and word-tokens($o.nested.authors) > ~= word-tokens($t.nested.authors) > return $o > ) > return $s > {noformat} > will return 3 or pose a error in a big dataset. -- This message was sent by Atlassian JIRA (v6.3.4#6332)