[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-rya/pull/153


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-21 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r134297278
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +283,84 @@ private static Bson makeExplicitTypeFilter(final 
RyaURI typeId) {
 
 return Stream.of(dataTypeFilter, valueFilter);
 }
+
+private boolean detectDuplicates(final Entity entity) throws 
EntityStorageException {
+boolean hasDuplicate = false;
+if (duplicateDataDetector.isDetectionEnabled()) {
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
+try {
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+if (hasDuplicate) {
+break;
+}
+}
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+final List hasAllExplicitTypesEntities = new ArrayList<>();
+if (!explicitTypeIds.isEmpty()) {
+// Grab the first type from the explicit type IDs.
+final RyaURI firstType = explicitTypeIds.get(0);
+
+// Check if that type exists anywhere in storage.
+final List subjects = new ArrayList<>();
+Optional type;
+try {
+if (mongoTypeStorage == null) {
+mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
+}
+type = mongoTypeStorage.get(firstType);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity 
type: " + firstType, e);
+}
+if (type.isPresent()) {
+// Grab the subjects for all the types we found matching 
"firstType"
+final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
--- End diff --

To clarify the first point: querying with fuzzy matching on properties is 
not supported by Rya and would need to be done at the MongoDB level.  A new 
JIRA ticket for that improvement has been opened, 
[RYA-349](https://issues.apache.org/jira/browse/RYA-349).
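
As a rough illustration of what fuzzy matching at the MongoDB level could look like, the sketch below turns a value and a percentage tolerance into a range filter. It only builds the filter text with the standard `$gte`/`$lte` operators; the class, the method names, and the `longitude` field are hypothetical and are not taken from Rya or from RYA-349.

```java
import java.util.Locale;

// Hypothetical helper, not part of Rya or the MongoDB Java driver.
final class ToleranceRangeSketch {

    /** Returns {lower, upper} for all values within percentTolerance of value. */
    static double[] bounds(final double value, final double percentTolerance) {
        final double delta = Math.abs(value) * percentTolerance;
        return new double[] { value - delta, value + delta };
    }

    /** Renders the bounds as a MongoDB-style range filter on an assumed field name. */
    static String toRangeFilter(final String field, final double value, final double percentTolerance) {
        final double[] b = bounds(value, percentTolerance);
        return String.format(Locale.ROOT,
                "{ \"%s\": { \"$gte\": %.2f, \"$lte\": %.2f } }", field, b[0], b[1]);
    }

    public static void main(final String[] args) {
        // A 1% tolerance around 100 covers the range 99 to 101.
        System.out.println(toRangeFilter("longitude", 100.0, 0.01));
    }
}
```

A range query like this only covers numeric tolerances; string or URI tolerances would need a different mechanism.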




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-21 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r134259051
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +283,84 @@ private static Bson makeExplicitTypeFilter(final 
RyaURI typeId) {
 
 return Stream.of(dataTypeFilter, valueFilter);
 }
+
+private boolean detectDuplicates(final Entity entity) throws 
EntityStorageException {
+boolean hasDuplicate = false;
+if (duplicateDataDetector.isDetectionEnabled()) {
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
+try {
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+if (hasDuplicate) {
+break;
+}
+}
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+final List hasAllExplicitTypesEntities = new ArrayList<>();
+if (!explicitTypeIds.isEmpty()) {
+// Grab the first type from the explicit type IDs.
+final RyaURI firstType = explicitTypeIds.get(0);
+
+// Check if that type exists anywhere in storage.
+final List subjects = new ArrayList<>();
+Optional type;
+try {
+if (mongoTypeStorage == null) {
+mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
+}
+type = mongoTypeStorage.get(firstType);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity 
type: " + firstType, e);
+}
+if (type.isPresent()) {
+// Grab the subjects for all the types we found matching 
"firstType"
+final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
--- End diff --

Unfortunately, we can't use the properties to query for potential 
duplicates.  That would only bring back Entities whose properties match 
EXACTLY.  We also want to include Entities that are NEAR matches (based on 
the tolerance).  So, with a tolerance of 1% for longitudes, we'd want to 
treat an Entity at 99° as the same as one at 100°, which means we couldn't 
find it by querying on the property value of 100°.
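
To make the tolerance example concrete, here is a minimal sketch of a percentage-based near-match check. It assumes the tolerance is measured relative to the larger of the two magnitudes; the actual rule in `DuplicateDataDetector` may differ, and the class and method names here are hypothetical.

```java
// Hypothetical sketch of a percentage-tolerance comparison; not Rya's implementation.
final class PercentToleranceSketch {

    /**
     * Returns true when a and b differ by no more than percentTolerance
     * (0.01 means 1%) of the larger magnitude.
     */
    static boolean approxEquals(final double a, final double b, final double percentTolerance) {
        if (a == b) {
            return true; // covers the zero-tolerance, exact-match case
        }
        final double reference = Math.max(Math.abs(a), Math.abs(b));
        return Math.abs(a - b) <= reference * percentTolerance;
    }

    public static void main(final String[] args) {
        // 99 and 100 are near matches at 1%, but an exact-match query on 100 would miss 99.
        System.out.println(approxEquals(100.0, 99.0, 0.01));
        System.out.println(approxEquals(100.0, 98.0, 0.01));
    }
}
```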

I think Entities should be judged as a whole and not by their individual 
components.  If we have 2 completely different Joe Smiths, it's possible 
that each has an Employee TypedEntity and only one of them has a Person 
TypedEntity.  If we try to create the one that only has an Employee 
TypedEntity and we say it's a duplicate of the other Joe Smith's Employee 
TypedEntity (due to setting high tolerance values), then it won't get created.  
But if we saw that they had different TypedEntities associated with them, 
we'd consider them not to be duplicates and both Joe Smith Entities would be 
created.
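
The Joe Smith case can be sketched as follows. This is a simplified model, not Rya's `Entity` class: each entity is a map from type name to a property map, and the `closeEnough` predicate stands in for whatever tolerance comparison is configured. All names are hypothetical.

```java
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch of whole-entity duplicate judgement.
final class WholeEntitySketch {

    static boolean isDuplicate(
            final Map<String, Map<String, String>> a,
            final Map<String, Map<String, String>> b,
            final BiPredicate<String, String> closeEnough) {
        // Entities with different sets of types are never duplicates.
        if (!a.keySet().equals(b.keySet())) {
            return false;
        }
        for (final String type : a.keySet()) {
            final Map<String, String> propsA = a.get(type);
            final Map<String, String> propsB = b.get(type);
            if (!propsA.keySet().equals(propsB.keySet())) {
                return false;
            }
            for (final String name : propsA.keySet()) {
                if (!closeEnough.test(propsA.get(name), propsB.get(name))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(final String[] args) {
        final Map<String, Map<String, String>> joe1 = Map.of(
                "Employee", Map.of("name", "Joe Smith"),
                "Person", Map.of("name", "Joe Smith"));
        final Map<String, Map<String, String>> joe2 = Map.of(
                "Employee", Map.of("name", "joe smith"));
        // The Employee parts match loosely, but the type sets differ.
        System.out.println(isDuplicate(joe1, joe2, String::equalsIgnoreCase)); // prints false
    }
}
```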




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-18 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r134025772
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +283,84 @@ private static Bson makeExplicitTypeFilter(final 
RyaURI typeId) {
 
 return Stream.of(dataTypeFilter, valueFilter);
 }
+
+private boolean detectDuplicates(final Entity entity) throws 
EntityStorageException {
+boolean hasDuplicate = false;
+if (duplicateDataDetector.isDetectionEnabled()) {
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
+try {
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+if (hasDuplicate) {
+break;
+}
+}
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+final List hasAllExplicitTypesEntities = new ArrayList<>();
+if (!explicitTypeIds.isEmpty()) {
+// Grab the first type from the explicit type IDs.
+final RyaURI firstType = explicitTypeIds.get(0);
+
+// Check if that type exists anywhere in storage.
+final List subjects = new ArrayList<>();
+Optional type;
+try {
+if (mongoTypeStorage == null) {
+mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
+}
+type = mongoTypeStorage.get(firstType);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity 
type: " + firstType, e);
+}
+if (type.isPresent()) {
+// Grab the subjects for all the types we found matching 
"firstType"
+final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
--- End diff --

Instead of getting all of the TypedEntities in the database with the given 
Type, you should call the method Event.makeTypedEntity(...) for each typeId.  
Then use the Type and Property map of each TypedEntity to query the DB.  This 
will provide a more constrained query that uses the actual property values.  
Finally, I think that you should add a compareTypedEntities method to your 
DuplicateDataDetector so that you can apply it to compare the returned 
TypedEntities with the TypedEntity that you created from the original Entity.  
This eliminates the need to re-query the DB to get the Entities that each 
TypedEntity is derived from.
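
A minimal sketch of the suggested per-TypedEntity check is shown below. The `TypedEntity` class here is a simplified stand-in for Rya's, and `compareTypedEntities` is the method proposed above, which does not exist yet; exact equality is used where the tolerance comparison would go.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; simplified stand-ins for Rya's types.
final class TypedEntitySketch {

    static final class TypedEntity {
        final String type;
        final Map<String, String> properties;

        TypedEntity(final String type, final Map<String, String> properties) {
            this.type = type;
            this.properties = properties;
        }
    }

    /** Stand-in for the proposed compareTypedEntities method. */
    static boolean compareTypedEntities(final TypedEntity a, final TypedEntity b) {
        return a.type.equals(b.type) && a.properties.equals(b.properties);
    }

    /** True if any TypedEntity derived from the new Entity matches a stored one. */
    static boolean hasDuplicateTypedEntity(final List<TypedEntity> derived, final List<TypedEntity> stored) {
        for (final TypedEntity candidate : derived) {
            for (final TypedEntity existing : stored) {
                if (compareTypedEntities(candidate, existing)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(final String[] args) {
        final List<TypedEntity> stored = List.of(
                new TypedEntity("Employee", Map.of("name", "Joe Smith")));
        final List<TypedEntity> derived = List.of(
                new TypedEntity("Person", Map.of("name", "Joe Smith")),
                new TypedEntity("Employee", Map.of("name", "Joe Smith")));
        System.out.println(hasDuplicateTypedEntity(derived, stored)); // prints true
    }
}
```

In a real implementation the `stored` list would come from the constrained query on Type and properties rather than from a full scan.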
Also, comparing all TypedEntities derived from a given Entity with all 
other TypedEntities in the database provides a stricter notion of duplicate data 
detection.  For example, if an Entity contains the Types People and Employee 
with associated properties, then the approach I'm describing would compare the 
People TypedEntity and the Employee TypedEntity with all other People and 
Employee TypedEntities in the DB.  For the Entity to be deemed a 
non-duplicate, none of those TypedEntities could be duplicates.  As it's 
currently implemented, if an Employee TypedEntity was ingested and derived from 
an Entity whose sole type was Employee, then an Entity with Types Person and 
Employee would not be considered a duplicate even if the Employee properties 
were exactly the same!  So in effect, I think we should detect if any 
TypedEntities derived from an Entity are duplicates, to avoid duplicating 
TypedEntities (I think that these are more meaningful and concrete than 
Entities, which are e
 ssen

[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133825842
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) 
throws EntityStorageExcept
 if (mongoTypeStorage == null) {
 mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
 }
-final Builder builder = new Builder();
-builder.setSubject(entity.getSubject());
-boolean abort = false;
-for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
-Optional type;
+
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
 try {
-type = mongoTypeStorage.get(typeRyaUri);
-} catch (final TypeStorageException e) {
-throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
 }
-if (type.isPresent()) {
-final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
-while (cursor.hasNext()) {
-final TypedEntity typedEntity = cursor.next();
-builder.setExplicitType(typeRyaUri);
-for (final Property property : 
typedEntity.getProperties()) {
-builder.setProperty(typeRyaUri, property);
-}
-}
-} else {
-abort = true;
+if (hasDuplicate) {
 break;
 }
 }
-if (!abort) {
-final Entity entity2 = builder.build();
-try {
-hasDuplicate = 
duplicateDataDetector.compareEntities(entity, entity2);
-} catch (final SmartUriException e) {
-throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+// Grab the first type from the explicit type IDs.
+RyaURI firstType = null;
+if (!explicitTypeIds.isEmpty()) {
+firstType = explicitTypeIds.get(0);
+}
+
+// Check if that type exists anywhere in storage.
+final List subjects = new ArrayList<>();
+Optional type;
+try {
+type = mongoTypeStorage.get(firstType);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity type: " 
+ firstType, e);
+}
+if (type.isPresent()) {
+// Grab the subjects for all the types we found matching 
"firstType"
+final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
+final TypedEntity typedEntity = cursor.next();
+final RyaURI subject = typedEntity.getSubject();
+subjects.add(subject);
+}
+}
+
+// Now grab all the Entities that have the subjects we found.
+final List hasAllExplicitTypesEntities = new ArrayList<>();
+for (final RyaURI subject : subjects) {
+final Optional entityF

[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133825074
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) 
throws EntityStorageExcept
 if (mongoTypeStorage == null) {
 mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
 }
-final Builder builder = new Builder();
-builder.setSubject(entity.getSubject());
-boolean abort = false;
-for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
-Optional type;
+
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
 try {
-type = mongoTypeStorage.get(typeRyaUri);
-} catch (final TypeStorageException e) {
-throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
 }
-if (type.isPresent()) {
-final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
-while (cursor.hasNext()) {
-final TypedEntity typedEntity = cursor.next();
-builder.setExplicitType(typeRyaUri);
-for (final Property property : 
typedEntity.getProperties()) {
-builder.setProperty(typeRyaUri, property);
-}
-}
-} else {
-abort = true;
+if (hasDuplicate) {
 break;
 }
 }
-if (!abort) {
-final Entity entity2 = builder.build();
-try {
-hasDuplicate = 
duplicateDataDetector.compareEntities(entity, entity2);
-} catch (final SmartUriException e) {
-throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+// Grab the first type from the explicit type IDs.
+RyaURI firstType = null;
+if (!explicitTypeIds.isEmpty()) {
+firstType = explicitTypeIds.get(0);
--- End diff --

The remaining typeIds get compared against the Entity query results 
below.  But I hastily pulled all this working logic into a separate function 
and ended up comparing them to themselves.  I'll fix that.
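
For reference, the intended filtering step can be sketched like this: after finding candidates through the first type, keep only those whose own explicit types include every required type. The class, the method, and the subject URIs are hypothetical, and plain strings stand in for `RyaURI`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the "has all explicit types" filter.
final class ExplicitTypeFilterSketch {

    static List<String> retainHavingAllTypes(
            final Map<String, Set<String>> candidateTypesBySubject,
            final List<String> requiredTypes) {
        final List<String> matches = new ArrayList<>();
        for (final Map.Entry<String, Set<String>> candidate : candidateTypesBySubject.entrySet()) {
            // Check the candidate's own types. Comparing requiredTypes with
            // itself would always pass, which is the bug described above.
            if (candidate.getValue().containsAll(requiredTypes)) {
                matches.add(candidate.getKey());
            }
        }
        return matches;
    }

    public static void main(final String[] args) {
        final Map<String, Set<String>> candidates = new LinkedHashMap<>();
        candidates.put("urn:joe1", Set.of("Employee", "Person"));
        candidates.put("urn:joe2", Set.of("Employee"));
        System.out.println(retainHavingAllTypes(candidates, List.of("Employee", "Person"))); // prints [urn:joe1]
    }
}
```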




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133821430
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) 
throws EntityStorageExcept
 if (mongoTypeStorage == null) {
 mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
 }
-final Builder builder = new Builder();
-builder.setSubject(entity.getSubject());
-boolean abort = false;
-for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
-Optional type;
+
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
 try {
-type = mongoTypeStorage.get(typeRyaUri);
-} catch (final TypeStorageException e) {
-throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
 }
-if (type.isPresent()) {
-final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
-while (cursor.hasNext()) {
-final TypedEntity typedEntity = cursor.next();
-builder.setExplicitType(typeRyaUri);
-for (final Property property : 
typedEntity.getProperties()) {
-builder.setProperty(typeRyaUri, property);
-}
-}
-} else {
-abort = true;
+if (hasDuplicate) {
 break;
 }
 }
-if (!abort) {
-final Entity entity2 = builder.build();
-try {
-hasDuplicate = 
duplicateDataDetector.compareEntities(entity, entity2);
-} catch (final SmartUriException e) {
-throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+// Grab the first type from the explicit type IDs.
+RyaURI firstType = null;
+if (!explicitTypeIds.isEmpty()) {
+firstType = explicitTypeIds.get(0);
+}
+
+// Check if that type exists anywhere in storage.
+final List subjects = new ArrayList<>();
+Optional type;
+try {
+type = mongoTypeStorage.get(firstType);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity type: " 
+ firstType, e);
+}
+if (type.isPresent()) {
+// Grab the subjects for all the types we found matching 
"firstType"
+final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
+final TypedEntity typedEntity = cursor.next();
+final RyaURI subject = typedEntity.getSubject();
+subjects.add(subject);
+}
+}
+
+// Now grab all the Entities that have the subjects we found.
+final List hasAllExplicitTypesEntities = new ArrayList<>();
+for (final RyaURI subject : subjects) {
+final Optional entityF

[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133820424
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) 
throws EntityStorageExcept
 if (mongoTypeStorage == null) {
 mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
 }
-final Builder builder = new Builder();
-builder.setSubject(entity.getSubject());
-boolean abort = false;
-for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
-Optional type;
+
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
 try {
-type = mongoTypeStorage.get(typeRyaUri);
-} catch (final TypeStorageException e) {
-throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
 }
-if (type.isPresent()) {
-final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
-while (cursor.hasNext()) {
-final TypedEntity typedEntity = cursor.next();
-builder.setExplicitType(typeRyaUri);
-for (final Property property : 
typedEntity.getProperties()) {
-builder.setProperty(typeRyaUri, property);
-}
-}
-} else {
-abort = true;
+if (hasDuplicate) {
 break;
 }
 }
-if (!abort) {
-final Entity entity2 = builder.build();
-try {
-hasDuplicate = 
duplicateDataDetector.compareEntities(entity, entity2);
-} catch (final SmartUriException e) {
-throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+// Grab the first type from the explicit type IDs.
+RyaURI firstType = null;
+if (!explicitTypeIds.isEmpty()) {
+firstType = explicitTypeIds.get(0);
--- End diff --

It doesn't seem like you do anything with the remaining typeIds.




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133817921
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1066 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import 
org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a 
set
+ * tolerance for each field's type. Two entities are considered nearly
+ * identical if all their properties are equal and/or within the specified
+ * tolerance for the property's object type. Setting all object type 
tolerances
+ * to 0 means that the objects need to be exactly equal to each other to be
+ * considered duplicates. Duplicate data detection can be enabled/disabled
+ * through configuration and each object type can have a tolerance based on
+ * either the difference or the percentage difference between the objects 
being
+ * compared.
+ */
+public class DuplicateDataDetector {
+private final Map> uriMap = new 
HashMap<>();
+private final Map, ApproxEqualsDetector> classMap = new 
HashMap<>();
+
+private boolean isDetectionEnabled;
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the
+ * values provided by the configuration file.
+ * @param duplicateDataConfig the {@link DuplicateDataConfig}
+ */
+public DuplicateDataDetector(final DuplicateDataConfig 
duplicateDataConfig) {
+this(duplicateDataConfig.getBooleanTolerance(),
+duplicateDataConfig.getByteTolerance(),
+duplicateDataConfig.getDateTolerance(),
+duplicateDataConfig.getDoubleTolerance(),
+duplicateDataConfig.getFloatTolerance(),
+duplicateDataConfig.getIntegerTolerance(),
+duplicateDataConfig.getLongTolerance(),
+duplicateDataConfig.getShortTolerance(),
+duplicateDataConfig.getStringTolerance(),
+duplicateDataConfig.getUriTolerance(),
+duplicateDataConfig.getEquivalentTermsMap(),
+duplicateDataConfig.isDetectionEnabled()
+);
+}
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the 
values
+ * from the config.
+ * @throws ConfigurationException
+ */
+public DuplicateDataDetector() throws ConfigurationException {
+this(new DuplicateDataConfig());
+}
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector}.
+ * @param tolera

[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133804678
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1066 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a set
+ * tolerance for each field's type. Two entities are considered nearly
+ * identical if all their properties are equal and/or within the specified
+ * tolerance for the property's object type. Setting all object type tolerances
+ * to 0 means that the objects need to be exactly equal to each other to be
+ * considered duplicates. Duplicate data detection can be enabled/disabled
+ * through configuration and each object type can have a tolerance based on
+ * either the difference or the percentage difference between the objects being
+ * compared.
+ */
+public class DuplicateDataDetector {
+    private final Map<URI, ApproxEqualsDetector<?>> uriMap = new HashMap<>();
+    private final Map<Class<?>, ApproxEqualsDetector<?>> classMap = new HashMap<>();
+
+    private boolean isDetectionEnabled;
+
+    /**
+     * Creates a new instance of {@link DuplicateDataDetector} with the
+     * values provided by the configuration file.
+     * @param duplicateDataConfig the {@link DuplicateDataConfig}
+     */
+    public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) {
+        this(duplicateDataConfig.getBooleanTolerance(),
+            duplicateDataConfig.getByteTolerance(),
+            duplicateDataConfig.getDateTolerance(),
+            duplicateDataConfig.getDoubleTolerance(),
+            duplicateDataConfig.getFloatTolerance(),
+            duplicateDataConfig.getIntegerTolerance(),
+            duplicateDataConfig.getLongTolerance(),
+            duplicateDataConfig.getShortTolerance(),
+            duplicateDataConfig.getStringTolerance(),
+            duplicateDataConfig.getUriTolerance(),
+            duplicateDataConfig.getEquivalentTermsMap(),
+            duplicateDataConfig.isDetectionEnabled()
+        );
+    }
+
+    /**
+     * Creates a new instance of {@link DuplicateDataDetector} with the values
+     * from the config.
+     * @throws ConfigurationException
+     */
+    public DuplicateDataDetector() throws ConfigurationException {
+        this(new DuplicateDataConfig());
+    }
+
+    /**
+     * Creates a new instance of {@link DuplicateDataDetector}.
+     * @param tolera

[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133803255
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
         return Stream.of(dataTypeFilter, valueFilter);
     }
+
+    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+        boolean hasDuplicate = false;
+        if (duplicateDataDetector.isDetectionEnabled()) {
+            if (mongoTypeStorage == null) {
+                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+            }
+            final Builder builder = new Builder();
+            builder.setSubject(entity.getSubject());
+            boolean abort = false;
+            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+                Optional<Type> type;
+                try {
+                    type = mongoTypeStorage.get(typeRyaUri);
+                } catch (final TypeStorageException e) {
+                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+                }
+                if (type.isPresent()) {
+                    final ConvertingCursor<TypedEntity> cursor = search(Optional.empty(), type.get(), Collections.emptySet());
+                    while (cursor.hasNext()) {
--- End diff --

Oops, it was only grabbing one Entity to compare.  I reworked it so that it
now finds a set of potential Entities to compare, based on them having all
the same explicit type IDs.

The subjects don't matter, and querying for properties doesn't help us since
we're trying to find properties that are CLOSE but not quite equal.  That
leaves us with only the Types to narrow our initial search of Entities to
check.  Once we grab the Entities, the (near) duplicate data detector is run
over them.
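
For anyone following along, the two-step approach described above can be
sketched roughly as follows.  This is only an illustration under simplifying
assumptions: SketchEntity, DuplicateSearchSketch, the in-memory list standing
in for Mongo, and the single numeric tolerance are all hypothetical and are
not the classes or signatures in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for an Entity: explicit type IDs plus numeric properties. */
final class SketchEntity {
    final Set<String> typeIds;
    final Map<String, Double> properties;

    SketchEntity(final Set<String> typeIds, final Map<String, Double> properties) {
        this.typeIds = typeIds;
        this.properties = properties;
    }
}

final class DuplicateSearchSketch {
    /** Step 1: keep only stored entities that carry every explicit type of the new entity. */
    static List<SketchEntity> searchHasAllExplicitTypes(final List<SketchEntity> stored,
            final Set<String> typeIds) {
        final List<SketchEntity> hits = new ArrayList<>();
        for (final SketchEntity candidate : stored) {
            if (candidate.typeIds.containsAll(typeIds)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    /** Step 2: same property names, and every value within the (assumed) tolerance. */
    static boolean nearlyEqual(final SketchEntity a, final SketchEntity b, final double tolerance) {
        if (!a.properties.keySet().equals(b.properties.keySet())) {
            return false;
        }
        for (final Map.Entry<String, Double> entry : a.properties.entrySet()) {
            if (Math.abs(entry.getValue() - b.properties.get(entry.getKey())) > tolerance) {
                return false;
            }
        }
        return true;
    }

    /** Narrow by type first, then compare; stop at the first near duplicate. */
    static boolean detectDuplicates(final List<SketchEntity> stored, final SketchEntity entity,
            final double tolerance) {
        for (final SketchEntity candidate : searchHasAllExplicitTypes(stored, entity.typeIds)) {
            if (nearlyEqual(entity, candidate, tolerance)) {
                return true;
            }
        }
        return false;
    }
}
```

The point of the sketch is only the ordering: the type filter is an exact
match that a query can answer, while the tolerance check has to run in code
over whatever the filter returns.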


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133773816
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
         return Stream.of(dataTypeFilter, valueFilter);
     }
+
+    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+        boolean hasDuplicate = false;
+        if (duplicateDataDetector.isDetectionEnabled()) {
+            if (mongoTypeStorage == null) {
+                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+            }
+            final Builder builder = new Builder();
+            builder.setSubject(entity.getSubject());
+            boolean abort = false;
+            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+                Optional<Type> type;
+                try {
+                    type = mongoTypeStorage.get(typeRyaUri);
+                } catch (final TypeStorageException e) {
+                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+                }
+                if (type.isPresent()) {
+                    final ConvertingCursor<TypedEntity> cursor = search(Optional.empty(), type.get(), Collections.emptySet());
+                    while (cursor.hasNext()) {
+                        final TypedEntity typedEntity = cursor.next();
--- End diff --

If I recall correctly, a TypedEntity is one part of an Entity, and an Entity
is composed of possibly several TypedEntities (e.g., an Entity could be made
up of a person TypedEntity and an employee TypedEntity).  So, TypedEntity
shouldn't extend Entity.
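
To make the "part of" relationship concrete, here is a minimal sketch of
composition rather than inheritance.  The names (ComposedEntitySketch,
TypedSliceSketch, typedSlice) are hypothetical and the properties are plain
strings; this is not the actual Entity or TypedEntity API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a TypedEntity: the slice of an entity that belongs to one type. */
final class TypedSliceSketch {
    final String subject;
    final String typeId;
    final Map<String, String> properties;

    TypedSliceSketch(final String subject, final String typeId, final Map<String, String> properties) {
        this.subject = subject;
        this.typeId = typeId;
        this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
    }
}

/** Hypothetical stand-in for an Entity: one subject holding properties grouped by type. */
final class ComposedEntitySketch {
    private final String subject;
    private final Map<String, Map<String, String>> propertiesByType = new LinkedHashMap<>();

    ComposedEntitySketch(final String subject) {
        this.subject = subject;
    }

    ComposedEntitySketch setProperty(final String typeId, final String name, final String value) {
        propertiesByType.computeIfAbsent(typeId, k -> new LinkedHashMap<>()).put(name, value);
        return this;
    }

    /** Returns the slice for one type, or empty if the entity does not have that type. */
    Optional<TypedSliceSketch> typedSlice(final String typeId) {
        final Map<String, String> props = propertiesByType.get(typeId);
        if (props == null) {
            return Optional.empty();
        }
        return Optional.of(new TypedSliceSketch(subject, typeId, props));
    }
}
```

An entity with both a person and an employee type yields two independent
slices, which is why a slice cannot stand in for the whole entity.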




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133742379
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1066 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a set
+ * tolerance for each field's type. Two entities are considered nearly
+ * identical if all their properties are equal and/or within the specified
+ * tolerance for the property's object type. Setting all object type tolerances
+ * to 0 means that the objects need to be exactly equal to each other to be
+ * considered duplicates. Duplicate data detection can be enabled/disabled
+ * through configuration and each object type can have a tolerance based on
+ * either the difference or the percentage difference between the objects being
+ * compared.
+ */
+public class DuplicateDataDetector {
+    private final Map<URI, ApproxEqualsDetector<?>> uriMap = new HashMap<>();
+    private final Map<Class<?>, ApproxEqualsDetector<?>> classMap = new HashMap<>();
+
+    private boolean isDetectionEnabled;
+
+    /**
+     * Creates a new instance of {@link DuplicateDataDetector} with the
+     * values provided by the configuration file.
+     * @param duplicateDataConfig the {@link DuplicateDataConfig}
+     */
+    public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) {
+        this(duplicateDataConfig.getBooleanTolerance(),
+            duplicateDataConfig.getByteTolerance(),
+            duplicateDataConfig.getDateTolerance(),
+            duplicateDataConfig.getDoubleTolerance(),
+            duplicateDataConfig.getFloatTolerance(),
+            duplicateDataConfig.getIntegerTolerance(),
+            duplicateDataConfig.getLongTolerance(),
+            duplicateDataConfig.getShortTolerance(),
+            duplicateDataConfig.getStringTolerance(),
+            duplicateDataConfig.getUriTolerance(),
+            duplicateDataConfig.getEquivalentTermsMap(),
+            duplicateDataConfig.isDetectionEnabled()
+        );
+    }
+
+    /**
+     * Creates a new instance of {@link DuplicateDataDetector} with the values
+     * from the config.
+     * @throws ConfigurationException
+     */
+    public DuplicateDataDetector() throws ConfigurationException {
+        this(new DuplicateDataConfig());
+    }
+
+    /**
+     * Creates a new instance of {@link DuplicateDataDetector}.
+     * @param tolera

[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133739286
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
         return Stream.of(dataTypeFilter, valueFilter);
     }
+
+    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+        boolean hasDuplicate = false;
+        if (duplicateDataDetector.isDetectionEnabled()) {
+            if (mongoTypeStorage == null) {
+                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+            }
+            final Builder builder = new Builder();
+            builder.setSubject(entity.getSubject());
+            boolean abort = false;
+            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+                Optional<Type> type;
+                try {
+                    type = mongoTypeStorage.get(typeRyaUri);
+                } catch (final TypeStorageException e) {
+                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+                }
+                if (type.isPresent()) {
+                    final ConvertingCursor<TypedEntity> cursor = search(Optional.empty(), type.get(), Collections.emptySet());
+                    while (cursor.hasNext()) {
+                        final TypedEntity typedEntity = cursor.next();
--- End diff --

Why doesn't TypedEntity extend Entity?  Also, I noticed that your 
compareEntities method for the DuplicateDataDetector applies to Entities but 
then immediately checks for Type.  Obviously you want to return false if two 
Entities don't have the same Type, but maybe it would be useful to have a 
compareTypedEntities method?  Then your compareEntities method could 
effectively delegate to that (check for Type and if the Types are the same, 
convert the Entities to TypedEntities and call the compareTypedEntities 
method).   It seems like a compareTypedEntities method would align with the use 
case better -- there would be no need to convert all of the TypedEntities 
returned in this loop to an Entity.  You could convert the given Entity to a 
TypedEntity if it has a Type, and do a direct comparison to each TypedEntity in 
this loop.
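
A rough sketch of the delegation suggested above, under simplifying
assumptions: entities are modelled as plain maps from type ID to numeric
properties, a single tolerance applies to every value, and the class name
TypedCompareSketch is hypothetical.  It is meant only to show the shape of
the two methods, not the detector's real signatures.

```java
import java.util.Map;

final class TypedCompareSketch {
    /** Compares the properties of one type: same names, every value within the tolerance. */
    static boolean compareTypedEntities(final Map<String, Double> a, final Map<String, Double> b,
            final double tolerance) {
        if (!a.keySet().equals(b.keySet())) {
            return false;
        }
        for (final Map.Entry<String, Double> entry : a.entrySet()) {
            if (Math.abs(entry.getValue() - b.get(entry.getKey())) > tolerance) {
                return false;
            }
        }
        return true;
    }

    /** Checks that the type sets match, then delegates to the per-type comparison. */
    static boolean compareEntities(final Map<String, Map<String, Double>> a,
            final Map<String, Map<String, Double>> b, final double tolerance) {
        if (!a.keySet().equals(b.keySet())) {
            return false;
        }
        for (final String typeId : a.keySet()) {
            if (!compareTypedEntities(a.get(typeId), b.get(typeId), tolerance)) {
                return false;
            }
        }
        return true;
    }
}
```

With this split, a caller that already holds typed slices can call the
per-type method directly and skip rebuilding a full entity.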




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133737467
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
         return Stream.of(dataTypeFilter, valueFilter);
     }
+
+    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+        boolean hasDuplicate = false;
+        if (duplicateDataDetector.isDetectionEnabled()) {
+            if (mongoTypeStorage == null) {
+                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+            }
+            final Builder builder = new Builder();
+            builder.setSubject(entity.getSubject());
+            boolean abort = false;
+            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+                Optional<Type> type;
+                try {
+                    type = mongoTypeStorage.get(typeRyaUri);
+                } catch (final TypeStorageException e) {
+                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+                }
+                if (type.isPresent()) {
+                    final ConvertingCursor<TypedEntity> cursor = search(Optional.empty(), type.get(), Collections.emptySet());
+                    while (cursor.hasNext()) {
--- End diff --

I'm not quite following what you are doing with the Entity Builder here.  
It seems like you are using it primarily to convert each TypedEntity returned 
in this loop to an Entity.   If that is the case, you should be creating a new 
Builder for each TypedEntity and then doing your duplicate comparison within 
this loop.  As it is currently written, it seems like you are just overwriting 
properties as you iterate through the TypedEntities.
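
To illustrate the concern, here is a small sketch contrasting one shared
builder with a fresh builder per candidate.  The Builder below is a
hypothetical toy (string properties only), not the Entity.Builder in the
codebase.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BuilderPerCandidateSketch {
    /** Toy builder: later calls with the same property name overwrite earlier ones. */
    static final class Builder {
        private final Map<String, String> properties = new HashMap<>();

        Builder setProperty(final String name, final String value) {
            properties.put(name, value);
            return this;
        }

        Map<String, String> build() {
            return new HashMap<>(properties);
        }
    }

    /** Shared builder: every candidate writes into the same state, so only the last values survive. */
    static Map<String, String> buildWithSharedBuilder(final List<Map<String, String>> typedEntities) {
        final Builder shared = new Builder();
        for (final Map<String, String> typedEntity : typedEntities) {
            typedEntity.forEach(shared::setProperty);
        }
        return shared.build();
    }

    /** Fresh builder per candidate, with the comparison done inside the loop. */
    static boolean anyCandidateMatches(final List<Map<String, String>> typedEntities,
            final Map<String, String> target) {
        for (final Map<String, String> typedEntity : typedEntities) {
            final Builder fresh = new Builder();
            typedEntity.forEach(fresh::setProperty);
            if (fresh.build().equals(target)) {
                return true;
            }
        }
        return false;
    }
}
```

With candidates {name=Alice} and {name=Bob}, the shared builder ends up
holding only {name=Bob}, so a match against Alice is lost; the per-candidate
version still finds it.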




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-11 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r132695882
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1059 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a set
--- End diff --

I updated the javadocs.




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-11 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r132695871
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +282,49 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
         return Stream.of(dataTypeFilter, valueFilter);
     }
+
+    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+        boolean hasDuplicate = false;
+        if (duplicateDataDetector.isDetectionEnabled()) {
+            if (mongoTypeStorage == null) {
+                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+            }
+            final Builder builder = new Builder();
+            builder.setSubject(entity.getSubject());
+            boolean abort = false;
+            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+                final ImmutableMap<RyaURI, Property> typePropertyMap = entity.getProperties().get(typeRyaUri);
+                final Set<Property> properties = new HashSet<>(typePropertyMap.values());
+                Optional<Type> type;
+                try {
+                    type = mongoTypeStorage.get(typeRyaUri);
+                } catch (final TypeStorageException e) {
+                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+                }
+                if (type.isPresent()) {
+                    //final ConvertingCursor<TypedEntity> cursor = search(Optional.empty(), type.get(), properties);
--- End diff --

Removed.




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-10 Thread isper3at
Github user isper3at commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r132585736
  
--- Diff: 
common/rya.api/src/main/java/org/apache/rya/api/domain/RyaTypeUtils.java ---
@@ -24,12 +24,44 @@
 import org.joda.time.DateTimeZone;
 import org.joda.time.format.ISODateTimeFormat;
 import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
 import org.openrdf.model.vocabulary.XMLSchema;
 
+import com.google.common.collect.ImmutableMap;
+
 /**
  * Utility methods for using {@link RyaType}.
  */
 public final class RyaTypeUtils {
+    private static final ImmutableMap<Class<?>, RyaTypeMethod> METHOD_MAP =
+        ImmutableMap.<Class<?>, RyaTypeMethod>builder()
+            .put(Boolean.class, (v) -> booleanRyaType((Boolean) v))
+            .put(Byte.class, (v) -> byteRyaType((Byte) v))
+            .put(Date.class, (v) -> dateRyaType((Date) v))
+            .put(DateTime.class, (v) -> dateRyaType((DateTime) v))
+            .put(Double.class, (v) -> doubleRyaType((Double) v))
+            .put(Float.class, (v) -> floatRyaType((Float) v))
+            .put(Integer.class, (v) -> intRyaType((Integer) v))
+            .put(Long.class, (v) -> longRyaType((Long) v))
+            .put(Short.class, (v) -> shortRyaType((Short) v))
+            .put(String.class, (v) -> stringRyaType((String) v))
+            .put(URI.class, (v) -> uriRyaType((URI) v))
+            .put(URIImpl.class, (v) -> uriRyaType((URIImpl) v))
+            .build();
+
+    /**
+     * Represents a method inside the {@link RyaTypeUtils} class that can be
+     * called.
+     */
+    private static interface RyaTypeMethod {
--- End diff --

ignore this




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-10 Thread isper3at
Github user isper3at commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r132576212
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1059 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a set
--- End diff --

Nearly identical?  Can you define that a bit more in the class docs?
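
For context, one plausible reading of "nearly identical" for numeric values
is a tolerance expressed either as an absolute difference or as a percentage
difference.  The sketch below is an assumption about what such a check could
look like; ToleranceSketch, its enum, the fraction-style percentage (0.05
meaning 5%), and measuring the percentage against the larger magnitude are
all hypothetical choices, not the detector's implementation.

```java
final class ToleranceSketch {
    /** Hypothetical tolerance kinds: absolute difference or relative (percentage) difference. */
    enum ToleranceType { DIFFERENCE, PERCENTAGE }

    /**
     * Returns true if the two values are equal or within the tolerance.
     * For PERCENTAGE the tolerance is a fraction (0.05 means 5%) of the
     * larger magnitude of the two values.
     */
    static boolean approxEquals(final double lhs, final double rhs, final double tolerance,
            final ToleranceType type) {
        if (lhs == rhs) {
            return true;
        }
        final double diff = Math.abs(lhs - rhs);
        if (type == ToleranceType.DIFFERENCE) {
            return diff <= tolerance;
        }
        // lhs != rhs here, so at least one value is non-zero and the divisor is positive.
        final double scale = Math.max(Math.abs(lhs), Math.abs(rhs));
        return diff / scale <= tolerance;
    }
}
```

A tolerance of 0 reduces both modes to exact equality, which is consistent
with documenting 0 as "must be exactly equal".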




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-10 Thread isper3at
Github user isper3at commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r132571187
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +282,49 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
         return Stream.of(dataTypeFilter, valueFilter);
     }
+
+    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+        boolean hasDuplicate = false;
+        if (duplicateDataDetector.isDetectionEnabled()) {
+            if (mongoTypeStorage == null) {
+                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+            }
+            final Builder builder = new Builder();
+            builder.setSubject(entity.getSubject());
+            boolean abort = false;
+            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+                final ImmutableMap<RyaURI, Property> typePropertyMap = entity.getProperties().get(typeRyaUri);
+                final Set<Property> properties = new HashSet<>(typePropertyMap.values());
+                Optional<Type> type;
+                try {
+                    type = mongoTypeStorage.get(typeRyaUri);
+                } catch (final TypeStorageException e) {
+                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+                }
+                if (type.isPresent()) {
+                    //final ConvertingCursor<TypedEntity> cursor = search(Optional.empty(), type.get(), properties);
--- End diff --

Commented-out code.




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-10 Thread isper3at
Github user isper3at commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r132557770
  
--- Diff: 
common/rya.api/src/main/java/org/apache/rya/api/domain/RyaTypeUtils.java ---
@@ -24,12 +24,44 @@
 import org.joda.time.DateTimeZone;
 import org.joda.time.format.ISODateTimeFormat;
 import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
 import org.openrdf.model.vocabulary.XMLSchema;
 
+import com.google.common.collect.ImmutableMap;
+
 /**
  * Utility methods for using {@link RyaType}.
  */
 public final class RyaTypeUtils {
+    private static final ImmutableMap<Class<?>, RyaTypeMethod> METHOD_MAP =
+        ImmutableMap.<Class<?>, RyaTypeMethod>builder()
+            .put(Boolean.class, (v) -> booleanRyaType((Boolean) v))
+            .put(Byte.class, (v) -> byteRyaType((Byte) v))
+            .put(Date.class, (v) -> dateRyaType((Date) v))
+            .put(DateTime.class, (v) -> dateRyaType((DateTime) v))
+            .put(Double.class, (v) -> doubleRyaType((Double) v))
+            .put(Float.class, (v) -> floatRyaType((Float) v))
+            .put(Integer.class, (v) -> intRyaType((Integer) v))
+            .put(Long.class, (v) -> longRyaType((Long) v))
+            .put(Short.class, (v) -> shortRyaType((Short) v))
+            .put(String.class, (v) -> stringRyaType((String) v))
+            .put(URI.class, (v) -> uriRyaType((URI) v))
+            .put(URIImpl.class, (v) -> uriRyaType((URIImpl) v))
+            .build();
+
+    /**
+     * Represents a method inside the {@link RyaTypeUtils} class that can be
+     * called.
+     */
+    private static interface RyaTypeMethod {
--- End diff --

Confused.  Is this for reflection?
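
For readers of the thread: the quoted diff appears to key a map by class and
store a lambda per class, which dispatches on the runtime type without using
java.lang.reflect.  A minimal sketch of that general pattern follows; the
names (DispatchMapSketch, describe) and the string results are hypothetical
and only stand in for the RyaType factory calls.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class DispatchMapSketch {
    /** Maps a value's exact runtime class to the function that handles it. */
    private static final Map<Class<?>, Function<Object, String>> METHOD_MAP = new HashMap<>();

    static {
        METHOD_MAP.put(Integer.class, v -> "int:" + (Integer) v);
        METHOD_MAP.put(String.class, v -> "string:" + (String) v);
        METHOD_MAP.put(Boolean.class, v -> "boolean:" + (Boolean) v);
    }

    /** Looks up the handler for the value's class and applies it; no reflective invocation involved. */
    static String describe(final Object value) {
        final Function<Object, String> method = METHOD_MAP.get(value.getClass());
        if (method == null) {
            throw new IllegalArgumentException("Unsupported type: " + value.getClass().getName());
        }
        return method.apply(value);
    }
}
```

Because the lookup uses the exact class, subclasses need their own entries,
which may be why the diff registers both URI and URIImpl.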




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-04-13 Thread ejwhite922
GitHub user ejwhite922 opened a pull request:

https://github.com/apache/incubator-rya/pull/153

RYA-250 Smart URI avoiding data duplication

## Description
Added data duplication detection methods to Smart URI/Entities. These use 
configured tolerances for each data type to decide whether two Entities are 
considered nearly equal. String terms that should be treated as equivalent 
can also be configured.
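
As a rough illustration of the equivalent-terms idea (a sketch with
hypothetical names and a hypothetical map layout, not the configuration
format or code in this PR), two strings could be treated as equal when they
match ignoring case or when one is listed as an equivalent of the other:

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class EquivalentTermsSketch {
    /**
     * Returns true if the terms match ignoring case, or if either term is
     * listed as an equivalent of the other. Map keys are assumed lower-case.
     */
    static boolean termsEquivalent(final String a, final String b,
            final Map<String, List<String>> equivalentTerms) {
        if (a.equalsIgnoreCase(b)) {
            return true;
        }
        return isListed(a, b, equivalentTerms) || isListed(b, a, equivalentTerms);
    }

    /** Checks whether {@code other} appears in the equivalents configured for {@code term}. */
    private static boolean isListed(final String term, final String other,
            final Map<String, List<String>> equivalentTerms) {
        final List<String> equivalents =
                equivalentTerms.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
        for (final String candidate : equivalents) {
            if (candidate.equalsIgnoreCase(other)) {
                return true;
            }
        }
        return false;
    }
}
```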

**!  NOTE  !**
Only review the latest commit.  The other commit is from another PR.

### Tests
Unit tests

### Links
[Jira](https://issues.apache.org/jira/browse/RYA-250)

### Checklist
- [ ] Code Review
- [ ] Squash Commits

### People To Review
@kchilton2
@isper3at
@meiercaleb 
@pujav65
@amihalik
@DLotts


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ejwhite922/incubator-rya RYA-250_SmartURIAvoidingDataDuplication

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-rya/pull/153.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #153


commit 7f84778a90d5b33d5332e21e43a838b7c6f54f6d
Author: eric.white 
Date:   2017-02-27T16:12:05Z

RYA-250 Smart URI

commit 7e02475a024267afa9f2dca74d384819618c9c61
Author: eric.white 
Date:   2017-04-12T15:15:03Z

RYA-250 Added data duplication detection methods to Smart URI/Entities.  
These use configured tolerances for each data type to decide if an Entity is 
considered nearly equal.  Also, string terms that are considered equivalent can 
be configured.



