FWIW, you can take the same general approach on the client side to intersect results in an inverted index. This is pretty close to your standard sort-merge-join.

You can create a Scanner over the row for each term you want to intersect. Whenever the top elements of all the Scanners have equal IDs (the cq), that's a match. If the top elements are not equal, advance the Scanner with the lowest ID (by lexicographic order).

The only difference is that you have to do this in your client instead of pushing it down to Accumulo in an SKVI (SortedKeyValueIterator), but this is still a very efficient approach.
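
To make the merge concrete, here is a minimal sketch in plain Java. It assumes each term's Scanner has already been reduced to a sorted stream of unique IDs (the column qualifiers). The names SortedIntersect, Cursor, and intersect are hypothetical and are not part of the Accumulo API, and String.compareTo stands in for Accumulo's byte-wise ordering, which it matches only for ASCII IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch: sort-merge intersection of sorted ID streams, one stream per term. */
class SortedIntersect {

    /** Lets us look at the top element of a stream without consuming it. */
    static final class Cursor {
        private final Iterator<String> it;
        private String top;

        Cursor(Iterator<String> it) {
            this.it = it;
            advance();
        }

        String top() { return top; }
        boolean exhausted() { return top == null; }
        void advance() { top = it.hasNext() ? it.next() : null; }
    }

    /** Returns the IDs present in every stream; each stream must be sorted ascending. */
    static List<String> intersect(List<Iterator<String>> sortedStreams) {
        List<String> matches = new ArrayList<>();
        if (sortedStreams.isEmpty()) {
            return matches;
        }
        List<Cursor> cursors = new ArrayList<>();
        for (Iterator<String> s : sortedStreams) {
            cursors.add(new Cursor(s));
        }
        while (true) {
            Cursor lowest = null;
            boolean allEqual = true;
            for (Cursor c : cursors) {
                if (c.exhausted()) {
                    return matches; // one stream is done, so no further match is possible
                }
                if (lowest == null) {
                    lowest = c;
                } else {
                    int cmp = c.top().compareTo(lowest.top());
                    if (cmp != 0) allEqual = false;
                    if (cmp < 0) lowest = c;
                }
            }
            if (allEqual) {
                matches.add(lowest.top());          // every top element agrees: a match
                for (Cursor c : cursors) c.advance();
            } else {
                lowest.advance();                   // skip the smallest ID and compare again
            }
        }
    }

    public static void main(String[] args) {
        List<Iterator<String>> streams = new ArrayList<>();
        streams.add(Arrays.asList("a", "b", "d", "f").iterator());
        streams.add(Arrays.asList("b", "c", "d", "f").iterator());
        streams.add(Arrays.asList("b", "d", "e", "f").iterator());
        System.out.println(intersect(streams)); // prints [b, d, f]
    }
}
```

In a real client each Iterator would presumably wrap a Scanner restricted to one term and map every entry to its column qualifier; that wiring is left out here.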

David Boyd wrote:
Josh:

     Thanks for the reply.

As I thought it through, I realized I had made an incorrect assumption:
the "anding" only happens within a rowid.   So it is time to come up
with another approach.

FYI - Love to hear the reasoning on taking down documentation ;-) I did
find the javadoc at:
https://static.javadoc.io/org.apache.accumulo/accumulo-core/1.6.1



On 5/26/16 11:57 AM, Josh Elser wrote:
Hi David,

Generally, I think you're confusing the type of table with what these
Iterators are meant to run over.

Remember, "shard" or "sharded" refers to distributing some amount of
data across many servers by some hash partitioning (commonly, at
least). This involves setting some "salt" or bit in the rowId to
distribute your records across many servers instead of on a single
server.
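
As a rough illustration of the "salt" idea, the sketch below derives a shard number from a hash of the record ID and prepends it to the rowId. The class name ShardKey, the three-digit prefix, and the underscore separator are assumptions made for this example only; real tables pick their own scheme.

```java
import java.util.Locale;

/** Sketch: hash-partition a record ID into one of numShards row prefixes. */
class ShardKey {

    /** Returns a rowId such as "002_abc": a zero-padded shard number, then the record ID. */
    static String shardedRowId(String recordId, int numShards) {
        // floorMod keeps the result in [0, numShards) even when hashCode() is negative
        int shard = Math.floorMod(recordId.hashCode(), numShards);
        return String.format(Locale.ROOT, "%03d_%s", shard, recordId);
    }

    public static void main(String[] args) {
        System.out.println(shardedRowId("abc", 16)); // prints 002_abc
    }
}
```

Because the prefix sorts first, records with different shard numbers land in different parts of the table, and so typically on different servers once the table is split on those prefixes.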

The table you describe is what's referred to as an "inverted index".
The term is the primary sort order. This makes it very quick to find
all pointers to "documents" which contain the given term.

The iterators you're trying to use are designed to operate over what's
referred to as a "local index". In this form, the index records are
co-located with the data records in a separate column. So, for each
rowId, one column (family) is devoted to storing index records, while
another is devoted to storing the actual data records. This structure
is what the iterators are designed to work over. These iterators are
novel because of some of the assumptions they can make on the physical
data model of Accumulo tables, but let's ignore that for now :)
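
A small in-memory model may help show why the table layout matters. It is only a sketch of the behavior described in this thread (terms are intersected within a single row), not the iterators' actual implementation; the class RowLocalIntersection, the row name "shard0", and the ID "rec1" are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Sketch: a toy table where terms are "anded" only within one row. */
class RowLocalIntersection {
    // row -> column family (term) -> column qualifiers (record IDs)
    private final Map<String, Map<String, Set<String>>> table = new TreeMap<>();

    void put(String row, String cf, String cq) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(cf, f -> new TreeSet<>())
             .add(cq);
    }

    /** For each row, returns the record IDs that appear under every requested term. */
    List<String> intersectWithinRows(List<String> terms) {
        List<String> hits = new ArrayList<>();
        for (Map<String, Set<String>> families : table.values()) {
            Set<String> common = null;
            for (String term : terms) {
                Set<String> ids = families.get(term);
                if (ids == null) {          // this row lacks the term: no match in this row
                    common = new TreeSet<>();
                    break;
                }
                if (common == null) {
                    common = new TreeSet<>(ids);
                } else {
                    common.retainAll(ids);
                }
            }
            if (common != null) {
                hits.addAll(common);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("GUID", "TestEntity");

        // Layout from the question: rowId = field name, so each term sits in its own row.
        RowLocalIntersection perField = new RowLocalIntersection();
        perField.put("entityidtype", "GUID", "rec1");
        perField.put("name", "TestEntity", "rec1");
        System.out.println(perField.intersectWithinRows(terms)); // prints []

        // Layout where all terms for a record share one row.
        RowLocalIntersection sameRow = new RowLocalIntersection();
        sameRow.put("shard0", "GUID", "rec1");
        sameRow.put("shard0", "TestEntity", "rec1");
        System.out.println(sameRow.intersectWithinRows(terms)); // prints [rec1]
    }
}
```

With rowId = field name, no single row ever contains more than one of the terms, which is consistent with the scan returning nothing.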

I know this isn't super helpful to you as-is. I'll see if I can find
any time to make a better write-up for you.

Finally, the iterator javadocs not being published was an intentional
change, but one I believe we should revert. <-- **ping Christopher**

- Josh

David Boyd wrote:
All:

    I am using Accumulo 1.6.1.  I am using a sharded index to search for
data that matches the values of certain fields.  Here is the situation:

I have four fields EntityId, EntityIdType, EntityName, EntitySource.

Sometimes I need all records which match EntityId and EntityIdType.
Other times I need all records which match all four fields.

The plan was to use ranges in the scanner to determine which fields to
match against.  I have tried both subsets of ranges and setting all
ranges, and gotten the same result.

I created an Index as follows:
RowId = fieldname
ColumnFamily = fieldvalue
ColumnQualifier = the overall record id (RowID) of my main record in
another table.

Here is the output of a scan of my index table:

entityid 1707945d-34d8-455d-85b1-55610739ce62:1707945d-34d8-455d-85b1-55610739ce62 []
entityidtype GUID:1707945d-34d8-455d-85b1-55610739ce62 []
name TestEntity:1707945d-34d8-455d-85b1-55610739ce62 []
source Unit Test:1707945d-34d8-455d-85b1-55610739ce62 []

NOTE:  While in this case the entityid equals the overall RowID from the
other table, that is not always true.

When I run the code below it does not return any rows in the scanner.
In the debugger when running the code below terms show as follows:
[1707945d-34d8-455d-85b1-55610739ce62, GUID, TestEntity, Unit Test]

I have tried both IntersectingIterator and IndexedDocIterator; both give
the same result.
For whatever reason the API docs for these classes are not showing up on
the Apache Accumulo site.

Am I missing the purpose/function of this iterator?

Do I have to call IndexedDocIterator.setColfs with some values so I get
the column qualifiers back?

Below is my code:

public List<String> getCoalesceEntityKeysForEntityId(String entityId,
                                                     String entityIdType,
                                                     String entityName,
                                                     String entitySource) throws CoalescePersistorException
{
    // Use our sharded term index to find the merged keys
    Connector dbConnector = null;

    ArrayList<String> keys = new ArrayList<String>();

    Text[] terms = {new Text(entityId), new Text(entityIdType),
            new Text(entityName), new Text(entitySource)};

    try {
        dbConnector = AccumuloDataConnector.getDBConnector();

        BatchScanner keyscanner = dbConnector.createBatchScanner(
                AccumuloDataConnector.coalesceEntityIndex, Authorizations.EMPTY, 4);

        // Set up an IntersectingIterator for the values
        IteratorSetting iter = new IteratorSetting(1, "intersect", IndexedDocIterator.class);
        IndexedDocIterator.setColumnFamilies(iter, terms);
        keyscanner.addScanIterator(iter);

        // Use ranges to limit the bins searched
        //ArrayList<Range> ranges = new ArrayList<Range>();
        // May not be necessary to restrict ranges but will do it to be safe
        //ranges.add(new Range("entityid"));
        //ranges.add(new Range("entityitype"));
        //ranges.add(new Range("entityname"));
        //ranges.add(new Range("source"));
        //keyscanner.setRanges(ranges);
        keyscanner.setRanges(Collections.singleton(new Range()));

        // Return the list of keys
        for (Entry<Key,Value> entry : keyscanner) {
            keys.add(entry.getKey().getColumnQualifier().toString());
        }

    } catch (TableNotFoundException ex) {
        System.err.println(ex.getLocalizedMessage());
        return null;
    }

    return keys;
}



--
=========mailto:[email protected]  ============
David W. Boyd
VP,  Data Solutions
10432 Balls Ford, Suite 240
Manassas, VA 20109
office:   +1-703-552-2862
cell:     +1-703-402-7908
==============http://www.incadencecorp.com/  ============
ISO/IEC JTC1 WG9, editor ISO/IEC 20547 Big Data Reference Architecture
Chair ANSI/INCITS TC Big Data
Co-chair NIST Big Data Public Working Group Reference Architecture
First Robotic Mentor - FRC, FTC -www.iliterobotics.org
Board Member- USSTEM Foundation -www.usstem.org

The information contained in this message may be privileged
and/or confidential and protected from disclosure.
If the reader of this message is not the intended recipient
or an employee or agent responsible for delivering this message
to the intended recipient, you are hereby notified that any
dissemination, distribution or copying of this communication
is strictly prohibited.  If you have received this communication
in error, please notify the sender immediately by replying to
this message and deleting the material from any computer.



