[jira] [Commented] (RYA-250) Smart URI avoid data duplication

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131264#comment-16131264
 ] 

ASF GitHub Bot commented on RYA-250:


Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133820424
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) 
throws EntityStorageExcept
 if (mongoTypeStorage == null) {
 mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
 }
-final Builder builder = new Builder();
-builder.setSubject(entity.getSubject());
-boolean abort = false;
-for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
-Optional type;
+
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
 try {
-type = mongoTypeStorage.get(typeRyaUri);
-} catch (final TypeStorageException e) {
-throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
 }
-if (type.isPresent()) {
-final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
-while (cursor.hasNext()) {
-final TypedEntity typedEntity = cursor.next();
-builder.setExplicitType(typeRyaUri);
-for (final Property property : 
typedEntity.getProperties()) {
-builder.setProperty(typeRyaUri, property);
-}
-}
-} else {
-abort = true;
+if (hasDuplicate) {
 break;
 }
 }
-if (!abort) {
-final Entity entity2 = builder.build();
-try {
-hasDuplicate = 
duplicateDataDetector.compareEntities(entity, entity2);
-} catch (final SmartUriException e) {
-throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+// Grab the first type from the explicit type IDs.
+RyaURI firstType = null;
+if (!explicitTypeIds.isEmpty()) {
+firstType = explicitTypeIds.get(0);
--- End diff --

It doesn't seem like you do anything with the remaining typeIds.


> Smart URI avoid data duplication
>
> Key: RYA-250
> URL: https://issues.apache.org/jira/browse/RYA-250
> Project: Rya
> Issue Type: Task
> Components: dao
> Affects Versions: 3.2.10
> Reporter: Eric White
> Assignee: Eric White
> Fix For: 3.2.10
>
> Implement Smart URI methods for avoiding data duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] incubator-rya pull request #209: Rya 296 hasSelf

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/209#discussion_r133823522
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/HasSelfVisitor.java 
---
@@ -0,0 +1,59 @@
+package org.apache.rya.rdftriplestore.inference;
+
+import org.apache.rya.api.RdfCloudTripleStoreConfiguration;
+import org.openrdf.model.Resource;
+import org.openrdf.model.URI;
+import org.openrdf.model.vocabulary.RDF;
+import org.openrdf.query.algebra.Extension;
+import org.openrdf.query.algebra.ExtensionElem;
+import org.openrdf.query.algebra.StatementPattern;
+import org.openrdf.query.algebra.Var;
+
+/**
+ *
+ */
+public class HasSelfVisitor extends AbstractInferVisitor {
+private static final Var TYPE_VAR = new Var("p", RDF.TYPE);
+public HasSelfVisitor(final RdfCloudTripleStoreConfiguration conf, 
final InferenceEngine inferenceEngine) {
+super(conf, inferenceEngine);
+include = conf.isInferInverseOf();
--- End diff --

I think it's probably best to make these rules configurable.  The config 
getter could simply default to true, and having the flag gives users a way 
of prioritizing some rules over others.
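
One way to read this suggestion, as a minimal sketch: each inference rule gets its own flag, and the getter falls back to true when the flag is unset. The class name `InferenceFlags`, the property key `infer.hasself`, and the use of `java.util.Properties` are all assumptions made for illustration; they are not taken from Rya's `RdfCloudTripleStoreConfiguration`.

```java
import java.util.Properties;

// Hypothetical stand-in for a configuration class; not Rya's actual API.
final class InferenceFlags {
    // Hypothetical property key, chosen only for this sketch.
    static final String INFER_HAS_SELF = "infer.hasself";

    private final Properties props;

    InferenceFlags(final Properties props) {
        this.props = props;
    }

    // Returns true when the flag is unset, so the rule is on by default.
    // An explicit value other than "true" (ignoring case) turns it off.
    boolean isInferHasSelf() {
        return Boolean.parseBoolean(props.getProperty(INFER_HAS_SELF, "true"));
    }
}
```

With a getter like this, a visitor's constructor could set `include` from the rule's own flag, and a deployment could switch off individual rules without code changes.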


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (RYA-250) Smart URI avoid data duplication

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131274#comment-16131274
 ] 

ASF GitHub Bot commented on RYA-250:


Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133825074
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) 
throws EntityStorageExcept
 if (mongoTypeStorage == null) {
 mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
 }
-final Builder builder = new Builder();
-builder.setSubject(entity.getSubject());
-boolean abort = false;
-for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
-Optional type;
+
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
 try {
-type = mongoTypeStorage.get(typeRyaUri);
-} catch (final TypeStorageException e) {
-throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
 }
-if (type.isPresent()) {
-final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
-while (cursor.hasNext()) {
-final TypedEntity typedEntity = cursor.next();
-builder.setExplicitType(typeRyaUri);
-for (final Property property : 
typedEntity.getProperties()) {
-builder.setProperty(typeRyaUri, property);
-}
-}
-} else {
-abort = true;
+if (hasDuplicate) {
 break;
 }
 }
-if (!abort) {
-final Entity entity2 = builder.build();
-try {
-hasDuplicate = 
duplicateDataDetector.compareEntities(entity, entity2);
-} catch (final SmartUriException e) {
-throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+// Grab the first type from the explicit type IDs.
+RyaURI firstType = null;
+if (!explicitTypeIds.isEmpty()) {
+firstType = explicitTypeIds.get(0);
--- End diff --

The remaining typeIds are compared further down, against the Entity query 
results.  But when I hastily pulled this working logic into a separate 
function, I ended up comparing them to themselves.  I'll fix that.
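
A minimal sketch of the check under discussion, assuming subjects and type IDs can be represented as plain strings: each candidate found through the first type is kept only if its own explicit types contain every requested type ID. The `ExplicitTypeFilter` and `Candidate` classes are hypothetical stand-ins, not the `Entity` or `MongoEntityStorage` API. Testing the requested list against itself would always succeed, which is why the comparison has to use the candidate's types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExplicitTypeFilter {
    // Hypothetical stand-in for an Entity: a subject plus its explicit type IDs.
    static final class Candidate {
        final String subject;
        final List<String> explicitTypeIds;

        Candidate(final String subject, final String... explicitTypeIds) {
            this.subject = subject;
            this.explicitTypeIds = Arrays.asList(explicitTypeIds);
        }
    }

    // Keeps only the candidates whose own types include every requested type ID.
    static List<Candidate> hasAllExplicitTypes(final List<Candidate> candidates,
                                               final List<String> requestedTypeIds) {
        final List<Candidate> matches = new ArrayList<>();
        for (final Candidate candidate : candidates) {
            // Compare against the candidate's types, not requestedTypeIds against itself.
            if (candidate.explicitTypeIds.containsAll(requestedTypeIds)) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```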




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133821430
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) 
throws EntityStorageExcept
 if (mongoTypeStorage == null) {
 mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
 }
-final Builder builder = new Builder();
-builder.setSubject(entity.getSubject());
-boolean abort = false;
-for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
-Optional type;
+
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
 try {
-type = mongoTypeStorage.get(typeRyaUri);
-} catch (final TypeStorageException e) {
-throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
 }
-if (type.isPresent()) {
-final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
-while (cursor.hasNext()) {
-final TypedEntity typedEntity = cursor.next();
-builder.setExplicitType(typeRyaUri);
-for (final Property property : 
typedEntity.getProperties()) {
-builder.setProperty(typeRyaUri, property);
-}
-}
-} else {
-abort = true;
+if (hasDuplicate) {
 break;
 }
 }
-if (!abort) {
-final Entity entity2 = builder.build();
-try {
-hasDuplicate = 
duplicateDataDetector.compareEntities(entity, entity2);
-} catch (final SmartUriException e) {
-throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+// Grab the first type from the explicit type IDs.
+RyaURI firstType = null;
+if (!explicitTypeIds.isEmpty()) {
+firstType = explicitTypeIds.get(0);
+}
+
+// Check if that type exists anywhere in storage.
+final List subjects = new ArrayList<>();
+Optional type;
+try {
+type = mongoTypeStorage.get(firstType);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity type: " 
+ firstType, e);
+}
+if (type.isPresent()) {
+// Grab the subjects for all the types we found matching 
"firstType"
+final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
+final TypedEntity typedEntity = cursor.next();
+final RyaURI subject = typedEntity.getSubject();
+subjects.add(subject);
+}
+}
+
+// Now grab all the Entities that have the subjects we found.
+final List hasAllExplicitTypesEntities = new ArrayList<>();
+for (final RyaURI subject : subjects) {
+final Optional 

[GitHub] incubator-rya pull request #209: Rya 296 hasSelf

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/209#discussion_r133824669
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -416,22 +419,57 @@ private void 
refreshHasValueRestrictions(Map restrictions) throws
 }
 }
 
-private static Vertex getVertex(Graph graph, Object id) {
-Iterator it = graph.vertices(id.toString());
+private void refreshHasSelfRestrictions(final Map 
restrictions) throws QueryEvaluationException {
+hasSelfByType = new HashMap<>();
+hasSelfByProperty = new HashMap<>();
+
+restrictions.forEach((type, property) -> {
+try {
+final CloseableIteration iter = RyaDAOHelper.query(ryaDAO, type, HASSELF, 
null, conf);
+try {
+if (iter.hasNext()) {
+Set typeSet = hasSelfByType.get(type);
+Set propSet = 
hasSelfByProperty.get(property);
+
+if (typeSet == null) {
+typeSet = new HashSet<>();
+}
+if (propSet == null) {
+propSet = new HashSet<>();
+}
+typeSet.add(property);
+propSet.add(type);
+
+hasSelfByType.put(type, typeSet);
+hasSelfByProperty.put(property, propSet);
+}
+} catch (final QueryEvaluationException e) {
--- End diff --

Looks like the Exception is getting swallowed here as well.
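
For illustration, a sketch of one way to keep such a failure from being swallowed: iterate with a plain `for` loop instead of `forEach`, so a checked exception can propagate to the caller. The names `HasSelfIndex`, `Lookup`, and `LookupException` are hypothetical stand-ins for the DAO query and `QueryEvaluationException`. Using `computeIfAbsent` in place of the null checks in the diff is a stylistic substitution made for this sketch, not the author's code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class HasSelfIndex {
    // Hypothetical checked exception standing in for a query failure.
    static final class LookupException extends Exception {
        LookupException(final String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for the DAO query: does this type carry the restriction?
    interface Lookup {
        boolean hasRestriction(String type) throws LookupException;
    }

    final Map<String, Set<String>> byType = new HashMap<>();
    final Map<String, Set<String>> byProperty = new HashMap<>();

    // A plain for loop lets LookupException reach the caller instead of
    // being caught and dropped inside a lambda.
    void refresh(final Map<String, String> restrictions, final Lookup lookup) throws LookupException {
        byType.clear();
        byProperty.clear();
        for (final Map.Entry<String, String> entry : restrictions.entrySet()) {
            final String type = entry.getKey();
            final String property = entry.getValue();
            if (lookup.hasRestriction(type)) {
                byType.computeIfAbsent(type, k -> new HashSet<>()).add(property);
                byProperty.computeIfAbsent(property, k -> new HashSet<>()).add(type);
            }
        }
    }
}
```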




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133825842
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) 
throws EntityStorageExcept
 if (mongoTypeStorage == null) {
 mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
 }
-final Builder builder = new Builder();
-builder.setSubject(entity.getSubject());
-boolean abort = false;
-for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
-Optional type;
+
+// Grab all entities that have all the same explicit types as 
our
+// original Entity.
+final List comparisonEntities = 
searchHasAllExplicitTypes(entity.getExplicitTypeIds());
+
+// Now that we have our set of potential duplicates, compare 
them.
+// We can stop when we find one duplicate.
+for (final Entity compareEntity : comparisonEntities) {
 try {
-type = mongoTypeStorage.get(typeRyaUri);
-} catch (final TypeStorageException e) {
-throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+hasDuplicate = 
duplicateDataDetector.compareEntities(entity, compareEntity);
+} catch (final SmartUriException e) {
+throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
 }
-if (type.isPresent()) {
-final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
-while (cursor.hasNext()) {
-final TypedEntity typedEntity = cursor.next();
-builder.setExplicitType(typeRyaUri);
-for (final Property property : 
typedEntity.getProperties()) {
-builder.setProperty(typeRyaUri, property);
-}
-}
-} else {
-abort = true;
+if (hasDuplicate) {
 break;
 }
 }
-if (!abort) {
-final Entity entity2 = builder.build();
-try {
-hasDuplicate = 
duplicateDataDetector.compareEntities(entity, entity2);
-} catch (final SmartUriException e) {
-throw new EntityStorageException("Encountered an error 
while comparing entities.", e);
+}
+return hasDuplicate;
+}
+
+/**
+ * Searches the Entity storage for all Entities that contain all the
+ * specified explicit type IDs.
+ * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s 
that
+ * are being searched for.
+ * @return the {@link List} of {@link Entity}s that have all the 
specified
+ * explicit type IDs. If nothing was found an empty {@link List} is
+ * returned.
+ * @throws EntityStorageException
+ */
+private List searchHasAllExplicitTypes(final 
ImmutableList explicitTypeIds) throws EntityStorageException {
+// Grab the first type from the explicit type IDs.
+RyaURI firstType = null;
+if (!explicitTypeIds.isEmpty()) {
+firstType = explicitTypeIds.get(0);
+}
+
+// Check if that type exists anywhere in storage.
+final List subjects = new ArrayList<>();
+Optional type;
+try {
+type = mongoTypeStorage.get(firstType);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity type: " 
+ firstType, e);
+}
+if (type.isPresent()) {
+// Grab the subjects for all the types we found matching 
"firstType"
+final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
+final TypedEntity typedEntity = cursor.next();
+final RyaURI subject = typedEntity.getSubject();
+subjects.add(subject);
+}
+}
+
+// Now grab all the Entities that have the subjects we found.
+final List hasAllExplicitTypesEntities = new ArrayList<>();
+for (final RyaURI subject : subjects) {
+final Optional 

[GitHub] incubator-rya issue #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/153
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/407/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (RYA-250) Smart URI avoid data duplication

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131347#comment-16131347
 ] 

ASF GitHub Bot commented on RYA-250:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/153
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/407/



> Smart URI avoid data duplication
> 
>
> Key: RYA-250
> URL: https://issues.apache.org/jira/browse/RYA-250
> Project: Rya
>  Issue Type: Task
>  Components: dao
>Affects Versions: 3.2.10
>Reporter: Eric White
>Assignee: Eric White
> Fix For: 3.2.10
>
>
> Implement Smart URI methods for avoiding data duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RYA-298) Implement rdfs:domain inference

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131376#comment-16131376
 ] 

ASF GitHub Bot commented on RYA-298:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/197
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/408/



> Implement rdfs:domain inference
> ---
>
> Key: RYA-298
> URL: https://issues.apache.org/jira/browse/RYA-298
> Project: Rya
>  Issue Type: Sub-task
>  Components: sail
>Reporter: Jesse Hatfield
>Assignee: Jesse Hatfield
>
> If a predicate has an *{{rdfs:domain}}* of some class, then the subject of 
> any triple including that predicate belongs to the class.
> If the ontology states that {{:advisor}} has the domain of {{:Person}}, then 
> the inference engine should rewrite queries of the form {{?x rdf:type 
> :Person}} to check for resources which have any {{:advisor}} (as well as any 
> specifically stated to have type {{:Person}} ).
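
The rule described above can be illustrated with a minimal sketch. This is not Rya's inference engine: the class name, the String[] triple representation, and the domain map are all assumptions made for illustration. It only shows that asking for instances of a class should return both the explicitly typed resources and the subjects of any predicate whose domain is that class.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names come from Rya.
final class DomainInferenceSketch {
    static final String RDF_TYPE = "rdf:type";

    /**
     * Returns every subject that belongs to clazz, either because a triple
     * states its type explicitly or because it is the subject of a predicate
     * whose rdfs:domain is clazz.
     *
     * @param clazz   the class being queried, e.g. ":Person"
     * @param triples each triple as {subject, predicate, object}
     * @param domains predicate -> class given by rdfs:domain
     */
    static Set<String> instancesOf(final String clazz,
                                   final List<String[]> triples,
                                   final Map<String, String> domains) {
        final Set<String> result = new TreeSet<>();
        for (final String[] triple : triples) {
            // Explicit statement: ?x rdf:type clazz
            final boolean explicit = RDF_TYPE.equals(triple[1]) && clazz.equals(triple[2]);
            // Inferred: the predicate's domain is clazz, so the subject belongs to it.
            final boolean inferred = clazz.equals(domains.get(triple[1]));
            if (explicit || inferred) {
                result.add(triple[0]);
            }
        }
        return result;
    }
}
```

With {{:advisor}} mapped to {{:Person}}, a resource that has any {{:advisor}} is returned alongside resources explicitly typed as {{:Person}}, while resources that only appear with unrelated predicates are not.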





[GitHub] incubator-rya issue #197: RYA-298, RYA-299 Domain/range inference.

2017-08-17 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/197
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/408/





[GitHub] incubator-rya pull request #198: Rya 283

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/198#discussion_r133719717
  
--- Diff: 
extras/rya.pcj.fluo/pcj.fluo.app/src/main/java/org/apache/rya/indexing/pcj/fluo/app/JoinResultUpdater.java
 ---
@@ -160,8 +183,55 @@ public void updateJoinResults(
 public static enum Side {
 LEFT, RIGHT;
 }
+
+
+/**
+ * Fetches batch to be processed by scanning over the Span specified 
by the
+ * {@link JoinBatchInformation}. The number of results is less than or 
equal
+ * to the batch size specified by the JoinBatchInformation.
+ * 
+ * @param tx - Fluo transaction in which batch operation is performed
+ * @param siblingSpan - span of sibling to retrieve elements to join 
with
+ * @param bsSet- set that batch results are added to
+ * @return Set - containing results of sibling scan.
+ * @throws Exception 
+ */
+private Optional fillSiblingBatch(TransactionBase tx, Span 
siblingSpan, Column siblingColumn, Set bsSet, int 
batchSize) throws Exception {
+
+RowScanner rs = 
tx.scanner().over(siblingSpan).fetch(siblingColumn).byRow().build();
+Iterator colScannerIter = rs.iterator();
+
+boolean batchLimitMet = false;
+Bytes row = siblingSpan.getStart().getRow();
+while (colScannerIter.hasNext() && !batchLimitMet) {
+ColumnScanner colScanner = colScannerIter.next();
+row = colScanner.getRow();
+Iterator iter = colScanner.iterator();
+while (iter.hasNext()) {
--- End diff --

Yeah, okay.  That'd probably be a bit cleaner.  Done.




[GitHub] incubator-rya pull request #198: Rya 283

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/198#discussion_r133722022
  
--- Diff: 
extras/rya.pcj.fluo/pcj.fluo.integration/src/test/java/org/apache/rya/indexing/pcj/fluo/integration/KafkaExportIT.java
 ---
@@ -425,6 +432,160 @@ public void groupByManyBindings_avaerages() throws 
Exception {
 assertEquals(expectedResults, results);
 }
 
+
+@Test
+public void nestedGroupByManyBindings_averages() throws Exception {
+// A query that groups what is aggregated by two of the keys.
+final String sparql =
+"SELECT ?type ?location ?averagePrice {" +
+"FILTER(?averagePrice > 4) " +
+"{SELECT ?type ?location (avg(?price) as ?averagePrice) {" +
+"?id <urn:type> ?type . " +
+"?id <urn:location> ?location ." +
+"?id <urn:price> ?price ." +
+"} " +
+"GROUP BY ?type ?location }}";
+
+// Create the Statements that will be loaded into Rya.
+final ValueFactory vf = new ValueFactoryImpl();
+final Collection statements = Sets.newHashSet(
+// American items that will be averaged.
+vf.createStatement(vf.createURI("urn:1"), 
vf.createURI("urn:type"), vf.createLiteral("apple")),
+vf.createStatement(vf.createURI("urn:1"), 
vf.createURI("urn:location"), vf.createLiteral("USA")),
+vf.createStatement(vf.createURI("urn:1"), 
vf.createURI("urn:price"), vf.createLiteral(2.50)),
+
+vf.createStatement(vf.createURI("urn:2"), 
vf.createURI("urn:type"), vf.createLiteral("cheese")),
+vf.createStatement(vf.createURI("urn:2"), 
vf.createURI("urn:location"), vf.createLiteral("USA")),
+vf.createStatement(vf.createURI("urn:2"), 
vf.createURI("urn:price"), vf.createLiteral(4.25)),
+
+vf.createStatement(vf.createURI("urn:3"), 
vf.createURI("urn:type"), vf.createLiteral("cheese")),
+vf.createStatement(vf.createURI("urn:3"), 
vf.createURI("urn:location"), vf.createLiteral("USA")),
+vf.createStatement(vf.createURI("urn:3"), 
vf.createURI("urn:price"), vf.createLiteral(5.25)),
+
+// French items that will be averaged.
+vf.createStatement(vf.createURI("urn:4"), 
vf.createURI("urn:type"), vf.createLiteral("cheese")),
+vf.createStatement(vf.createURI("urn:4"), 
vf.createURI("urn:location"), vf.createLiteral("France")),
+vf.createStatement(vf.createURI("urn:4"), 
vf.createURI("urn:price"), vf.createLiteral(8.5)),
+
+vf.createStatement(vf.createURI("urn:5"), 
vf.createURI("urn:type"), vf.createLiteral("cigarettes")),
+vf.createStatement(vf.createURI("urn:5"), 
vf.createURI("urn:location"), vf.createLiteral("France")),
+vf.createStatement(vf.createURI("urn:5"), 
vf.createURI("urn:price"), vf.createLiteral(3.99)),
+
+vf.createStatement(vf.createURI("urn:6"), 
vf.createURI("urn:type"), vf.createLiteral("cigarettes")),
+vf.createStatement(vf.createURI("urn:6"), 
vf.createURI("urn:location"), vf.createLiteral("France")),
+vf.createStatement(vf.createURI("urn:6"), 
vf.createURI("urn:price"), vf.createLiteral(4.99)));
+
+// Create the PCJ in Fluo and load the statements into Rya.
+final String pcjId = loadData(sparql, statements);
+
+// Create the expected results of the SPARQL query once the PCJ 
has been computed.
+final Set expectedResults = new HashSet<>();
+
+MapBindingSet bs = new MapBindingSet();
+bs.addBinding("type", vf.createLiteral("cheese", 
XMLSchema.STRING));
+bs.addBinding("location", vf.createLiteral("France", 
XMLSchema.STRING));
+bs.addBinding("averagePrice", vf.createLiteral("8.5", 
XMLSchema.DECIMAL));
+expectedResults.add( new VisibilityBindingSet(bs));
+
+bs = new MapBindingSet();
+bs.addBinding("type", vf.createLiteral("cigarettes", 
XMLSchema.STRING));
+bs.addBinding("location", vf.createLiteral("France", 
XMLSchema.STRING));
+bs.addBinding("averagePrice", vf.createLiteral("4.49", 
XMLSchema.DECIMAL));
+expectedResults.add( new VisibilityBindingSet(bs) );
+
+bs = new MapBindingSet();
+bs.addBinding("type", vf.createLiteral("cheese", 
XMLSchema.STRING));
+bs.addBinding("location", vf.createLiteral("USA", 
XMLSchema.STRING));
+bs.addBinding("averagePrice", vf.createLiteral("4.75", 
XMLSchema.DECIMAL));
+expectedResults.add( new VisibilityBindingSet(bs) );
   

[GitHub] incubator-rya issue #206: RYA-292 Added owl:intersectionOf inference.

2017-08-17 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/206
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/404/

Failed Tests: 3

incubator-rya-master-with-optionals-pull-requests/org.apache.rya:rya.prospector: 3
org.apache.rya.prospector.mr.ProspectorTest.testCount
org.apache.rya.prospector.service.ProspectorServiceEvalStatsDAOTest.testCount
org.apache.rya.prospector.service.ProspectorServiceEvalStatsDAOTest.testNoAuthsCount





[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133742379
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1066 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import 
org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a 
set
+ * tolerance for each field's type. Two entities are considered nearly
+ * identical if all their properties are equal and/or within the specified
+ * tolerance for the property's object type. Setting all object type 
tolerances
+ * to 0 means that the objects need to be exactly equal to each other to be
+ * considered duplicates. Duplicate data detection can be enabled/disabled
+ * through configuration and each object type can have a tolerance based on
+ * either the difference or the percentage difference between the objects 
being
+ * compared.
+ */
+public class DuplicateDataDetector {
+private final Map uriMap = new 
HashMap<>();
+private final Map classMap = new 
HashMap<>();
+
+private boolean isDetectionEnabled;
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the
+ * values provided by the configuration file.
+ * @param duplicateDataConfig the {@link DuplicateDataConfig}
+ */
+public DuplicateDataDetector(final DuplicateDataConfig 
duplicateDataConfig) {
+this(duplicateDataConfig.getBooleanTolerance(),
+duplicateDataConfig.getByteTolerance(),
+duplicateDataConfig.getDateTolerance(),
+duplicateDataConfig.getDoubleTolerance(),
+duplicateDataConfig.getFloatTolerance(),
+duplicateDataConfig.getIntegerTolerance(),
+duplicateDataConfig.getLongTolerance(),
+duplicateDataConfig.getShortTolerance(),
+duplicateDataConfig.getStringTolerance(),
+duplicateDataConfig.getUriTolerance(),
+duplicateDataConfig.getEquivalentTermsMap(),
+duplicateDataConfig.isDetectionEnabled()
+);
+}
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the 
values
+ * from the config.
+ * @throws ConfigurationException
+ */
+public DuplicateDataDetector() throws ConfigurationException {
+this(new DuplicateDataConfig());
+}
+
+/**
+ * Creates a new instance of {@link 

[jira] [Commented] (RYA-250) Smart URI avoid data duplication

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130672#comment-16130672
 ] 

ASF GitHub Bot commented on RYA-250:


Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133737467
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final 
RyaURI typeId) {
 
 return Stream.of(dataTypeFilter, valueFilter);
 }
+
+private boolean detectDuplicates(final Entity entity) throws 
EntityStorageException {
+boolean hasDuplicate = false;
+if (duplicateDataDetector.isDetectionEnabled()) {
+if (mongoTypeStorage == null) {
+mongoTypeStorage = new MongoTypeStorage(mongo, 
ryaInstanceName);
+}
+final Builder builder = new Builder();
+builder.setSubject(entity.getSubject());
+boolean abort = false;
+for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+Optional type;
+try {
+type = mongoTypeStorage.get(typeRyaUri);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity 
type: " + typeRyaUri, e);
+}
+if (type.isPresent()) {
+final ConvertingCursor cursor = 
search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
--- End diff --

I'm not quite following what you are doing with the Entity Builder here.  
It seems like you are using it primarily to convert each TypedEntity returned 
in this loop to an Entity.   If that is the case, you should be creating a new 
Builder for each TypedEntity and then doing your duplicate comparison within 
this loop.  As it is currently written, it seems like you are just overwriting 
properties as you iterate through the TypedEntities.
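
A minimal sketch of the per-entity approach suggested above. The types here (SimpleEntity, SimpleBuilder) and the exact-match isDuplicate are hypothetical stand-ins, not Rya's Entity, Builder, or DuplicateDataDetector; the point is only that a fresh builder is created for each stored entity and the comparison happens inside the loop, so properties from one entity never overwrite another's.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are not the Rya classes.
final class DuplicateScanSketch {

    /** Simplified entity: a subject plus a property-name -> value map. */
    static final class SimpleEntity {
        final String subject;
        final Map<String, String> properties;

        SimpleEntity(final String subject, final Map<String, String> properties) {
            this.subject = subject;
            this.properties = properties;
        }
    }

    /** Simplified builder, intended to be created once per stored entity. */
    static final class SimpleBuilder {
        private String subject;
        private final Map<String, String> properties = new LinkedHashMap<>();

        SimpleBuilder setSubject(final String subject) {
            this.subject = subject;
            return this;
        }

        SimpleBuilder setProperty(final String name, final String value) {
            properties.put(name, value);
            return this;
        }

        SimpleEntity build() {
            return new SimpleEntity(subject, new LinkedHashMap<>(properties));
        }
    }

    /** Placeholder comparison: exact property equality, no tolerances. */
    static boolean isDuplicate(final SimpleEntity a, final SimpleEntity b) {
        return a.properties.equals(b.properties);
    }

    /** Compares the candidate against each stored entity, one at a time. */
    static boolean hasDuplicate(final SimpleEntity candidate, final List<SimpleEntity> stored) {
        for (final SimpleEntity typed : stored) {
            // A new builder per iteration keeps each stored entity's properties separate.
            final SimpleBuilder builder = new SimpleBuilder().setSubject(typed.subject);
            typed.properties.forEach(builder::setProperty);
            if (isDuplicate(candidate, builder.build())) {
                return true; // Stop at the first duplicate found.
            }
        }
        return false;
    }
}
```

With a single shared builder, a later entity's "price" would replace an earlier one's, and the comparison would run against a merged entity that does not exist in storage.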


> Smart URI avoid data duplication
> 
>
> Key: RYA-250
> URL: https://issues.apache.org/jira/browse/RYA-250
> Project: Rya
>  Issue Type: Task
>  Components: dao
>Affects Versions: 3.2.10
>Reporter: Eric White
>Assignee: Eric White
> Fix For: 3.2.10
>
>
> Implement Smart URI methods for avoiding data duplication.





[GitHub] incubator-rya pull request #209: Rya 296 hasSelf

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/209#discussion_r133745206
  
--- Diff: 
sail/src/test/java/org/apache/rya/rdftriplestore/inference/HasSelfVisitorTest.java
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.rdftriplestore.inference;
+
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.when;
+
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.rya.accumulo.AccumuloRdfConfiguration;
+import org.junit.Assert;
+import org.junit.Test;
+import org.openrdf.model.Resource;
+import org.openrdf.model.URI;
+import org.openrdf.model.ValueFactory;
+import org.openrdf.model.impl.ValueFactoryImpl;
+import org.openrdf.model.vocabulary.RDF;
+import org.openrdf.query.algebra.Extension;
+import org.openrdf.query.algebra.ExtensionElem;
+import org.openrdf.query.algebra.Projection;
+import org.openrdf.query.algebra.ProjectionElem;
+import org.openrdf.query.algebra.ProjectionElemList;
+import org.openrdf.query.algebra.StatementPattern;
+import org.openrdf.query.algebra.Union;
+import org.openrdf.query.algebra.Var;
+
+public class HasSelfVisitorTest {
+private final AccumuloRdfConfiguration conf = new 
AccumuloRdfConfiguration();
+private final ValueFactory vf = new ValueFactoryImpl();
+
+private final URI narcissist = vf.createURI("urn:Narcissist");
+private final URI love = vf.createURI("urn:love");
+private final URI self = vf.createURI("urn:self");
--- End diff --

These URIs and vf can be made static final.




[GitHub] incubator-rya pull request #209: Rya 296 hasSelf

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/209#discussion_r133737670
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -43,28 +52,17 @@
 import org.openrdf.model.vocabulary.RDF;
 import org.openrdf.model.vocabulary.RDFS;
 import org.openrdf.query.QueryEvaluationException;
-import org.apache.tinkerpop.gremlin.structure.Direction;
-import org.apache.tinkerpop.gremlin.structure.Edge;
-import org.apache.tinkerpop.gremlin.structure.Graph;
-import org.apache.tinkerpop.gremlin.structure.T;
-import org.apache.tinkerpop.gremlin.structure.VertexProperty;
-import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;
-import org.apache.tinkerpop.gremlin.structure.Vertex;
-
-import com.google.common.collect.Iterators;
 
 import info.aduna.iteration.CloseableIteration;
-import org.apache.rya.api.RdfCloudTripleStoreConfiguration;
-import org.apache.rya.api.persist.RyaDAO;
-import org.apache.rya.api.persist.RyaDAOException;
-import org.apache.rya.api.persist.utils.RyaDAOHelper;
 
 /**
  * Will pull down inference relationships from dao every x seconds. 
  * Will infer extra relationships. 
  * Will cache relationships in Graph for later use. 
  */
 public class InferenceEngine {
+private static final ValueFactory VF = ValueFactoryImpl.getInstance();
+private static final URI HASSELF = VF.createURI(OWL.NAMESPACE, 
"hasSelf");
--- End diff --

Rename to HAS_SELF




[GitHub] incubator-rya pull request #209: Rya 296 hasSelf

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/209#discussion_r133735126
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/HasSelfVisitor.java 
---
@@ -0,0 +1,59 @@
+package org.apache.rya.rdftriplestore.inference;
+
+import org.apache.rya.api.RdfCloudTripleStoreConfiguration;
+import org.openrdf.model.Resource;
+import org.openrdf.model.URI;
+import org.openrdf.model.vocabulary.RDF;
+import org.openrdf.query.algebra.Extension;
+import org.openrdf.query.algebra.ExtensionElem;
+import org.openrdf.query.algebra.StatementPattern;
+import org.openrdf.query.algebra.Var;
+
+/**
+ *
+ */
+public class HasSelfVisitor extends AbstractInferVisitor {
+private static final Var TYPE_VAR = new Var("p", RDF.TYPE);
+public HasSelfVisitor(final RdfCloudTripleStoreConfiguration conf, 
final InferenceEngine inferenceEngine) {
--- End diff --

javadocs




[GitHub] incubator-rya pull request #209: Rya 296 hasSelf

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/209#discussion_r133744669
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -416,22 +419,57 @@ private void 
refreshHasValueRestrictions(Map restrictions) throws
 }
 }
 
-private static Vertex getVertex(Graph graph, Object id) {
-Iterator it = graph.vertices(id.toString());
+private void refreshHasSelfRestrictions(final Map 
restrictions) throws QueryEvaluationException {
+hasSelfByType = new HashMap<>();
+hasSelfByProperty = new HashMap<>();
+
+restrictions.forEach((type, property) -> {
+try {
+final CloseableIteration iter = RyaDAOHelper.query(ryaDAO, type, HASSELF, 
null, conf);
+try {
+if (iter.hasNext()) {
+Set typeSet = hasSelfByType.get(type);
+Set propSet = 
hasSelfByProperty.get(property);
+
+if (typeSet == null) {
+typeSet = new HashSet<>();
+}
+if (propSet == null) {
+propSet = new HashSet<>();
+}
+typeSet.add(property);
+propSet.add(type);
+
+hasSelfByType.put(type, typeSet);
+hasSelfByProperty.put(property, propSet);
+}
+} catch (final QueryEvaluationException e) {
+if (iter != null) {
+iter.close();
+}
+}
+} catch (final QueryEvaluationException e) {
--- End diff --

We shouldn't silently swallow this Exception.  Maybe don't use a Java 8 
lambda here since they're poor at handling throwing checked Exceptions.  There 
are some workarounds for throwing an Exception but most of them aren't ideal.
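
As a rough sketch of the non-lambda alternative: a plain for-each over the map's entry set lets the checked exception propagate to the caller instead of being caught and dropped inside the lambda body. LookupException and hasSelfStatement below are hypothetical stand-ins for QueryEvaluationException and the DAO query; they are not Rya or OpenRDF APIs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; the exception and query method are stand-ins.
final class CheckedLoopSketch {

    /** Stand-in for a checked exception such as QueryEvaluationException. */
    static final class LookupException extends Exception {
        LookupException(final String message) {
            super(message);
        }
    }

    /** Stand-in for the DAO query; fails on an empty type to simulate an error. */
    static boolean hasSelfStatement(final String type) throws LookupException {
        if (type.isEmpty()) {
            throw new LookupException("Cannot query an empty type");
        }
        return type.startsWith("urn:");
    }

    /** Builds type -> properties, letting any lookup failure reach the caller. */
    static Map<String, Set<String>> indexByType(final Map<String, String> restrictions)
            throws LookupException {
        final Map<String, Set<String>> byType = new HashMap<>();
        // Map.forEach takes a BiConsumer, which cannot throw checked exceptions;
        // a for-each loop has no such restriction.
        for (final Map.Entry<String, String> entry : restrictions.entrySet()) {
            if (hasSelfStatement(entry.getKey())) {
                byType.computeIfAbsent(entry.getKey(), k -> new HashSet<>())
                      .add(entry.getValue());
            }
        }
        return byType;
    }
}
```

The computeIfAbsent call also replaces the get / null-check / put sequence, though that part is a stylistic choice rather than something the exception handling requires.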




[GitHub] incubator-rya pull request #209: Rya 296 hasSelf

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/209#discussion_r133735003
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/HasSelfVisitor.java 
---
@@ -0,0 +1,59 @@
+package org.apache.rya.rdftriplestore.inference;
+
+import org.apache.rya.api.RdfCloudTripleStoreConfiguration;
+import org.openrdf.model.Resource;
+import org.openrdf.model.URI;
+import org.openrdf.model.vocabulary.RDF;
+import org.openrdf.query.algebra.Extension;
+import org.openrdf.query.algebra.ExtensionElem;
+import org.openrdf.query.algebra.StatementPattern;
+import org.openrdf.query.algebra.Var;
+
+/**
+ *
+ */
+public class HasSelfVisitor extends AbstractInferVisitor {
--- End diff --

Add javadocs




[GitHub] incubator-rya pull request #209: Rya 296 hasSelf

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/209#discussion_r133736937
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/HasSelfVisitor.java 
---
@@ -0,0 +1,59 @@
+package org.apache.rya.rdftriplestore.inference;
+
+import org.apache.rya.api.RdfCloudTripleStoreConfiguration;
+import org.openrdf.model.Resource;
+import org.openrdf.model.URI;
+import org.openrdf.model.vocabulary.RDF;
+import org.openrdf.query.algebra.Extension;
+import org.openrdf.query.algebra.ExtensionElem;
+import org.openrdf.query.algebra.StatementPattern;
+import org.openrdf.query.algebra.Var;
+
+/**
+ *
+ */
+public class HasSelfVisitor extends AbstractInferVisitor {
+private static final Var TYPE_VAR = new Var("p", RDF.TYPE);
+public HasSelfVisitor(final RdfCloudTripleStoreConfiguration conf, 
final InferenceEngine inferenceEngine) {
+super(conf, inferenceEngine);
+include = conf.isInferInverseOf();
--- End diff --

Change to:
include = true;
(@jessehatfield, @meiercaleb) Or should we have a config option for each 
restriction?  (i.e. conf.isInferHasSelf())  It appears that some visitors do 
have a unique config option while some don't.




[jira] [Commented] (RYA-250) Smart URI avoid data duplication

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130665#comment-16130665
 ] 

ASF GitHub Bot commented on RYA-250:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/153
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/405/



> Smart URI avoid data duplication
> 
>
> Key: RYA-250
> URL: https://issues.apache.org/jira/browse/RYA-250
> Project: Rya
>  Issue Type: Task
>  Components: dao
>Affects Versions: 3.2.10
>Reporter: Eric White
>Assignee: Eric White
> Fix For: 3.2.10
>
>
> Implement Smart URI methods for avoiding data duplication.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133739286
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
 return Stream.of(dataTypeFilter, valueFilter);
 }
+
+private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+boolean hasDuplicate = false;
+if (duplicateDataDetector.isDetectionEnabled()) {
+if (mongoTypeStorage == null) {
+mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+}
+final Builder builder = new Builder();
+builder.setSubject(entity.getSubject());
+boolean abort = false;
+for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+Optional type;
+try {
+type = mongoTypeStorage.get(typeRyaUri);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+}
+if (type.isPresent()) {
+final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
+final TypedEntity typedEntity = cursor.next();
--- End diff --

Why doesn't TypedEntity extend Entity?  Also, I noticed that your 
compareEntities method for the DuplicateDataDetector applies to Entities but 
then immediately checks for Type.  Obviously you want to return false if two 
Entities don't have the same Type, but maybe it would be useful to have a 
compareTypedEntities method?  Then your compareEntities method could 
effectively delegate to that (check for Type and if the Types are the same, 
convert the Entities to TypedEntities and call the compareTypedEntities 
method).   It seems like a compareTypedEntities method would align with the use 
case better -- there would be no need to convert all of the TypedEntities 
returned in this loop to an Entity.  You could convert the given Entity to a 
TypedEntity if it has a Type, and do a direct comparison to each TypedEntity in 
this loop.
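
A rough sketch of the delegation suggested above, using simplified stand-in classes. `TypedEntitySketch`, `EntitySketch`, and `DetectorSketch` are hypothetical and only model a type ID plus string properties; exact equality stands in for the real tolerance checks.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical simplified model; the real Rya classes carry more state. */
final class TypedEntitySketch {
    final String typeId;
    final Map<String, String> properties;

    TypedEntitySketch(final String typeId, final Map<String, String> properties) {
        this.typeId = typeId;
        this.properties = properties;
    }
}

final class EntitySketch {
    // type ID -> (property name -> value)
    final Map<String, Map<String, String>> propertiesByType;

    EntitySketch(final Map<String, Map<String, String>> propertiesByType) {
        this.propertiesByType = propertiesByType;
    }

    /** Returns the view of this entity under a single type. */
    TypedEntitySketch asTyped(final String typeId) {
        return new TypedEntitySketch(typeId, propertiesByType.get(typeId));
    }
}

final class DetectorSketch {
    /** Direct comparison of two typed views; exact equality stands in for tolerance checks. */
    static boolean compareTypedEntities(final TypedEntitySketch a, final TypedEntitySketch b) {
        return Objects.equals(a.typeId, b.typeId) && Objects.equals(a.properties, b.properties);
    }

    /** Checks that the types match first, then delegates to the typed comparison per type. */
    static boolean compareEntities(final EntitySketch a, final EntitySketch b) {
        if (!a.propertiesByType.keySet().equals(b.propertiesByType.keySet())) {
            return false;
        }
        for (final String typeId : a.propertiesByType.keySet()) {
            if (!compareTypedEntities(a.asTyped(typeId), b.asTyped(typeId))) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the split is that a caller already holding TypedEntities can call the typed method directly and skip the conversion.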




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133737467
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
 return Stream.of(dataTypeFilter, valueFilter);
 }
+
+private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+boolean hasDuplicate = false;
+if (duplicateDataDetector.isDetectionEnabled()) {
+if (mongoTypeStorage == null) {
+mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+}
+final Builder builder = new Builder();
+builder.setSubject(entity.getSubject());
+boolean abort = false;
+for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+Optional type;
+try {
+type = mongoTypeStorage.get(typeRyaUri);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+}
+if (type.isPresent()) {
+final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
--- End diff --

I'm not quite following what you are doing with the Entity Builder here.  
It seems like you are using it primarily to convert each TypedEntity returned 
in this loop to an Entity.   If that is the case, you should be creating a new 
Builder for each TypedEntity and then doing your duplicate comparison within 
this loop.  As it is currently written, it seems like you are just overwriting 
properties as you iterate through the TypedEntities.
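
To illustrate the overwrite concern, here is a minimal sketch with a hypothetical `PropertyBagBuilder` (not the real Entity.Builder). It only assumes that setting a property with the same name twice keeps the last value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal builder; it only mimics "last write wins" per property name. */
final class PropertyBagBuilder {
    private final Map<String, String> properties = new LinkedHashMap<>();

    PropertyBagBuilder setProperty(final String name, final String value) {
        properties.put(name, value); // a later call with the same name overwrites
        return this;
    }

    Map<String, String> build() {
        return new LinkedHashMap<>(properties);
    }
}

final class BuilderPerItemSketch {
    /** One shared builder: every item writes into the same map, so only one result survives. */
    static Map<String, String> sharedBuilder(final List<Map<String, String>> items) {
        final PropertyBagBuilder builder = new PropertyBagBuilder();
        for (final Map<String, String> item : items) {
            item.forEach(builder::setProperty);
        }
        return builder.build();
    }

    /** A fresh builder per item: each item yields its own result that can be compared. */
    static List<Map<String, String>> builderPerItem(final List<Map<String, String>> items) {
        final List<Map<String, String>> results = new ArrayList<>();
        for (final Map<String, String> item : items) {
            final PropertyBagBuilder builder = new PropertyBagBuilder();
            item.forEach(builder::setProperty);
            results.add(builder.build());
        }
        return results;
    }
}
```

With the shared builder, two items that both set "name" leave only the second value; with one builder per item, both survive and each can be passed to the duplicate check inside the loop.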




[GitHub] incubator-rya pull request #198: Rya 283

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/198#discussion_r133721289
  
--- Diff: 
extras/rya.pcj.fluo/pcj.fluo.app/src/main/java/org/apache/rya/indexing/pcj/fluo/app/query/QueryMetadataVisitorBase.java
 ---
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.pcj.fluo.app.query;
+
+import org.apache.rya.indexing.pcj.fluo.app.NodeType;
+
+import com.google.common.base.Optional;
+import com.google.common.base.Preconditions;
+
+public abstract class QueryMetadataVisitorBase {
--- End diff --

Because there is a need to navigate the FluoQuery.Builder (which is 
essentially a tree of Builders) as well as the QueryMetadata (which is a tree 
of metadata).




[GitHub] incubator-rya issue #198: Rya 283

2017-08-17 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/198
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/403/





[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133773816
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
 return Stream.of(dataTypeFilter, valueFilter);
 }
+
+private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+boolean hasDuplicate = false;
+if (duplicateDataDetector.isDetectionEnabled()) {
+if (mongoTypeStorage == null) {
+mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+}
+final Builder builder = new Builder();
+builder.setSubject(entity.getSubject());
+boolean abort = false;
+for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+Optional type;
+try {
+type = mongoTypeStorage.get(typeRyaUri);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+}
+if (type.isPresent()) {
+final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
+final TypedEntity typedEntity = cursor.next();
--- End diff --

If I recall correctly, a TypedEntity is one part of an Entity, and an Entity 
is composed of possibly several TypedEntities (e.g., an Entity could be made 
up of a person TypedEntity and an employee TypedEntity).  So, TypedEntity 
shouldn't extend Entity.
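
As a rough picture of that relationship (composition rather than inheritance), here is a sketch with hypothetical classes. `TypedView` and `ComposedEntity` are illustrative names, not the real TypedEntity and Entity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical typed view: the properties an entity has under one type. */
final class TypedView {
    final String typeId;
    final Map<String, String> properties;

    TypedView(final String typeId, final Map<String, String> properties) {
        this.typeId = typeId;
        this.properties = properties;
    }
}

/** Hypothetical entity: one subject that owns zero or more typed views. */
final class ComposedEntity {
    final String subject;
    private final Map<String, TypedView> views = new LinkedHashMap<>();

    ComposedEntity(final String subject) {
        this.subject = subject;
    }

    ComposedEntity addView(final TypedView view) {
        views.put(view.typeId, view);
        return this;
    }

    /** Looks up the view for one type; empty when the entity does not have that type. */
    Optional<TypedView> viewFor(final String typeId) {
        return Optional.ofNullable(views.get(typeId));
    }

    int typeCount() {
        return views.size();
    }
}
```

An entity here has typed views; a typed view is not itself an entity, which is why a subclass relationship would not fit.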




[jira] [Updated] (RYA-345) KafkaBindingSetExporterFactory does not support multiple Kafka bootstrap servers.

2017-08-17 Thread Jeff Dasch (JIRA)

 [ 
https://issues.apache.org/jira/browse/RYA-345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Dasch updated RYA-345:
---
Description: 
KafkaBindingSetExporterFactory needs to be updated to be able to handle a CSV 
(or ;SV) for the key ProducerConfig.BOOTSTRAP_SERVERS_CONFIG in the event the 
user is providing multiple bootstrap servers for Kafka.  This implementation 
would be similar to how RyaBindingSetExporterFactory handles the CSV for the 
zookeeper connect.

An alternative option is to migrate to Fluo 1.1.0-incubating and implement an 
ObserverProvider.  This may then eliminate the need for the above modification.

Or possibly take advantage of this section of the fluo.properties file:
{noformat}
#Application properties
#---
#Properties with a prefix of fluo.app are stored in zookeeper at
#initialization time and can easily be retrieved by a Fluo application running
#on any node in the cluster.
#fluo.app.config1=val1
{noformat}




  was:
KafkaBindingSetExporterFactory needs to be updated to be able to handle a CSV 
(or ;SV) for the key ProducerConfig.BOOTSTRAP_SERVERS_CONFIG in the event the 
user is providing multiple bootstrap servers for Kafka.  This implementation 
would be similar to how RyaBindingSetExporterFactory handles the CSV for the 
zookeeper connect.

An alternative option is to migrate to Fluo 1.1.0-incubating and implement an 
ObserverProvider.  This may then eliminate the need for the above modification.


> KafkaBindingSetExporterFactory does not support multiple Kafka bootstrap 
> servers.
> -
>
> Key: RYA-345
> URL: https://issues.apache.org/jira/browse/RYA-345
> Project: Rya
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 3.2.10
>Reporter: Jeff Dasch
>Priority: Minor
>
> KafkaBindingSetExporterFactory needs to be updated to be able to handle a CSV 
> (or ;SV) for the key ProducerConfig.BOOTSTRAP_SERVERS_CONFIG in the event the 
> user is providing multiple bootstrap servers for Kafka.  This implementation 
> would be similar to how RyaBindingSetExporterFactory handles the CSV for the 
> zookeeper connect.
> An alternative option is to migrate to Fluo 1.1.0-incubating and implement an 
> ObserverProvider.  This may then eliminate the need for the above 
> modification.
> Or possibly take advantage of this section of the fluo.properties file:
> {noformat}
> #Application properties
> #---
> #Properties with a prefix of fluo.app are stored in zookeeper at
> #initialization time and can easily be retrieved by a Fluo application running
> #on any node in the cluster.
> #fluo.app.config1=val1
> {noformat}
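
A minimal sketch of the first option, assuming the factory receives the server list as one raw string. `BootstrapServersSketch` and its methods are hypothetical helpers, not existing Rya code. Kafka's `bootstrap.servers` property itself takes a comma-separated list of host:port pairs, so the sketch only normalizes the separators.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class BootstrapServersSketch {
    /** Splits on commas or semicolons, trims whitespace, and drops empty entries. */
    static List<String> parseServers(final String raw) {
        final List<String> servers = new ArrayList<>();
        for (final String part : raw.split("[,;]")) {
            final String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                servers.add(trimmed);
            }
        }
        return servers;
    }

    /** Builds producer properties with the list rejoined in comma-separated form. */
    static Properties producerProperties(final String raw) {
        final Properties props = new Properties();
        // "bootstrap.servers" is the value of ProducerConfig.BOOTSTRAP_SERVERS_CONFIG.
        props.setProperty("bootstrap.servers", String.join(",", parseServers(raw)));
        return props;
    }
}
```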





[jira] [Resolved] (RYA-343) AccumuloLoadStatementsFile fails when loading data to a PCJ-enabled table.

2017-08-17 Thread Jeff Dasch (JIRA)

 [ 
https://issues.apache.org/jira/browse/RYA-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Dasch resolved RYA-343.

   Resolution: Fixed
Fix Version/s: 3.2.11

Fix merged to master.

> AccumuloLoadStatementsFile fails when loading data to a PCJ-enabled table.
> --
>
> Key: RYA-343
> URL: https://issues.apache.org/jira/browse/RYA-343
> Project: Rya
>  Issue Type: Sub-task
>  Components: clients
>Reporter: Jeff Dasch
>Assignee: Jeff Dasch
> Fix For: 3.2.11
>
>
> Issue occurs when calling {{AccumuloLoadStatementsFile.loadStatements()}} to 
> load data to a PCJ-enabled table.  I believe this is a recent regression.
> {noformat}
> 2017-08-14 13:46:51,802 [Spring Shell] WARN  
> org.apache.rya.api.client.accumulo.AccumuloLoadStatementsFile - Exception 
> while loading:
> org.apache.rya.api.persist.RyaDAOException: 
> java.lang.IllegalArgumentException: The 'rya.indexing.pcj.storageType' 
> property must have one of the following values: [ACCUMULO]
>   at org.apache.rya.accumulo.AccumuloRyaDAO.init(AccumuloRyaDAO.java:165)
>   at 
> org.apache.rya.sail.config.RyaSailFactory.getAccumuloDAO(RyaSailFactory.java:155)
>   at 
> org.apache.rya.sail.config.RyaSailFactory.getRyaSail(RyaSailFactory.java:100)
>   at 
> org.apache.rya.sail.config.RyaSailFactory.getInstance(RyaSailFactory.java:67)
>   at 
> org.apache.rya.api.client.accumulo.AccumuloLoadStatementsFile.loadStatements(AccumuloLoadStatementsFile.java:91)
>   at org.apache.rya.shell.RyaCommands.loadData(RyaCommands.java:121)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:210)
>   at 
> org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:64)
>   at 
> org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:57)
>   at 
> org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:127)
>   at 
> org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533)
>   at org.springframework.shell.core.JLineShell.run(JLineShell.java:179)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: The 
> 'rya.indexing.pcj.storageType' property must have one of the following 
> values: [ACCUMULO]
>   at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
>   at 
> org.apache.rya.indexing.external.PrecomputedJoinStorageSupplier.get(PrecomputedJoinStorageSupplier.java:68)
>   at 
> org.apache.rya.indexing.external.PrecomputedJoinIndexer.init(PrecomputedJoinIndexer.java:139)
>   at org.apache.rya.accumulo.AccumuloRyaDAO.init(AccumuloRyaDAO.java:156)
>   ... 16 more
> {noformat}





[GitHub] incubator-rya pull request #198: Rya 283

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/198#discussion_r133720806
  
--- Diff: 
extras/rya.pcj.fluo/pcj.fluo.app/src/main/java/org/apache/rya/indexing/pcj/fluo/app/batch/JoinBatchInformation.java
 ---
@@ -149,12 +137,12 @@ public boolean equals(Object other) {
 
 JoinBatchInformation batch = (JoinBatchInformation) other;
 return super.equals(other) &&  Objects.equals(this.bs, batch.bs) && Objects.equals(this.join, batch.join)
--- End diff --

java.util.Objects.equals() returns a boolean, so you cannot chain these 
calls.
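
For reference, a small self-contained sketch of the pattern in question. `BatchKey` is a hypothetical two-field class, not JoinBatchInformation. Because `java.util.Objects.equals` returns a primitive boolean, there is nothing to call a further method on, so the per-field results are combined with `&&`.

```java
import java.util.Objects;

/** Hypothetical two-field value class used only to show the equals pattern. */
final class BatchKey {
    private final String bs;
    private final String join;

    BatchKey(final String bs, final String join) {
        this.bs = bs;
        this.join = join;
    }

    @Override
    public boolean equals(final Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof BatchKey)) {
            return false;
        }
        final BatchKey that = (BatchKey) other;
        // Objects.equals is null-safe and returns boolean; results are joined with &&.
        return Objects.equals(bs, that.bs) && Objects.equals(join, that.join);
    }

    @Override
    public int hashCode() {
        return Objects.hash(bs, join);
    }
}
```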




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133803255
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java
 ---
@@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
 
 return Stream.of(dataTypeFilter, valueFilter);
 }
+
+private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
+boolean hasDuplicate = false;
+if (duplicateDataDetector.isDetectionEnabled()) {
+if (mongoTypeStorage == null) {
+mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
+}
+final Builder builder = new Builder();
+builder.setSubject(entity.getSubject());
+boolean abort = false;
+for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
+Optional type;
+try {
+type = mongoTypeStorage.get(typeRyaUri);
+} catch (final TypeStorageException e) {
+throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
+}
+if (type.isPresent()) {
+final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet());
+while (cursor.hasNext()) {
--- End diff --

Oops, it's only grabbing one Entity to compare.  I reworked it so it now finds 
a set of potential Entities to compare based on them having all the same 
explicit type IDs.

The subjects don't matter, and querying for properties doesn't help us since 
we're trying to find properties that are CLOSE but not quite equal.  That 
leaves us with only the Types to narrow our initial search of Entities to 
check. Once we grab the Entities, the (near) duplicate data detector is run over 
them.
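
The reworked flow can be sketched in memory as follows. This is an illustration under assumptions, not the Mongo-backed code: `CandidateEntity` and `NearDuplicateSearchSketch` are hypothetical, properties are reduced to numeric values, and a single absolute tolerance stands in for the per-type tolerances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical simplified entity: a subject, its explicit type IDs, and numeric properties. */
final class CandidateEntity {
    final String subject;
    final Set<String> explicitTypeIds;
    final Map<String, Double> properties;

    CandidateEntity(final String subject, final Set<String> explicitTypeIds,
            final Map<String, Double> properties) {
        this.subject = subject;
        this.explicitTypeIds = explicitTypeIds;
        this.properties = properties;
    }
}

final class NearDuplicateSearchSketch {
    /** In-memory stand-in for the type-based query: keep entities that have every requested type. */
    static List<CandidateEntity> searchHasAllExplicitTypes(final List<CandidateEntity> stored,
            final Set<String> typeIds) {
        final List<CandidateEntity> matches = new ArrayList<>();
        for (final CandidateEntity candidate : stored) {
            if (candidate.explicitTypeIds.containsAll(typeIds)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    /** Same property names and every value within the tolerance; subjects are ignored. */
    static boolean isNearDuplicate(final CandidateEntity a, final CandidateEntity b,
            final double tolerance) {
        if (!a.properties.keySet().equals(b.properties.keySet())) {
            return false;
        }
        for (final Map.Entry<String, Double> entry : a.properties.entrySet()) {
            final double other = b.properties.get(entry.getKey());
            if (Math.abs(entry.getValue() - other) > tolerance) {
                return false;
            }
        }
        return true;
    }

    /** Narrows by type first, then runs the comparison and stops at the first duplicate. */
    static boolean detectDuplicates(final CandidateEntity entity,
            final List<CandidateEntity> stored, final double tolerance) {
        for (final CandidateEntity candidate : searchHasAllExplicitTypes(stored, entity.explicitTypeIds)) {
            if (isNearDuplicate(entity, candidate, tolerance)) {
                return true;
            }
        }
        return false;
    }
}
```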




[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread ejwhite922
Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133804678
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1066 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a set
+ * tolerance for each field's type. Two entities are considered nearly
+ * identical if all their properties are equal and/or within the specified
+ * tolerance for the property's object type. Setting all object type tolerances
+ * to 0 means that the objects need to be exactly equal to each other to be
+ * considered duplicates. Duplicate data detection can be enabled/disabled
+ * through configuration and each object type can have a tolerance based on
+ * either the difference or the percentage difference between the objects being
+ * compared.
+ */
+public class DuplicateDataDetector {
+private final Map uriMap = new HashMap<>();
+private final Map classMap = new HashMap<>();
+
+private boolean isDetectionEnabled;
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the
+ * values provided by the configuration file.
+ * @param duplicateDataConfig the {@link DuplicateDataConfig}
+ */
+public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) {
+this(duplicateDataConfig.getBooleanTolerance(),
+duplicateDataConfig.getByteTolerance(),
+duplicateDataConfig.getDateTolerance(),
+duplicateDataConfig.getDoubleTolerance(),
+duplicateDataConfig.getFloatTolerance(),
+duplicateDataConfig.getIntegerTolerance(),
+duplicateDataConfig.getLongTolerance(),
+duplicateDataConfig.getShortTolerance(),
+duplicateDataConfig.getStringTolerance(),
+duplicateDataConfig.getUriTolerance(),
+duplicateDataConfig.getEquivalentTermsMap(),
+duplicateDataConfig.isDetectionEnabled()
+);
+}
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the values
+ * from the config.
+ * @throws ConfigurationException
+ */
+public DuplicateDataDetector() throws ConfigurationException {
+this(new DuplicateDataConfig());
+}
+
+/**
+ * Creates a new instance of {@link 
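
The class Javadoc quoted above mentions two kinds of tolerance, an absolute difference and a percentage difference. A minimal sketch of how the two checks might differ follows. `ToleranceSketch` is hypothetical and is not the class from the diff; in particular, expressing the percentage as a fraction of the first value is an assumption made here for illustration.

```java
/** Hypothetical tolerance check for numeric values; not the class from the diff. */
final class ToleranceSketch {
    enum Kind { DIFFERENCE, PERCENTAGE }

    private final double value;
    private final Kind kind;

    ToleranceSketch(final double value, final Kind kind) {
        this.value = value;
        this.kind = kind;
    }

    /** Returns true when b is close enough to a under this tolerance. */
    boolean isWithinTolerance(final double a, final double b) {
        final double difference = Math.abs(a - b);
        if (kind == Kind.DIFFERENCE) {
            return difference <= value;
        }
        // PERCENTAGE: value is a fraction (0.05 means 5%) of the first value (assumption).
        return difference <= Math.abs(a) * value;
    }
}
```

A tolerance of 0 in either mode reduces to exact equality, which matches the Javadoc's description.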

[jira] [Commented] (RYA-250) Smart URI avoid data duplication

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131033#comment-16131033
 ] 

ASF GitHub Bot commented on RYA-250:


Github user ejwhite922 commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133804678
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1066 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import 
org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a 
set
+ * tolerance for each field's type. Two entities are considered nearly
+ * identical if all their properties are equal and/or within the specified
+ * tolerance for the property's object type. Setting all object type 
tolerances
+ * to 0 means that the objects need to be exactly equal to each other to be
+ * considered duplicates. Duplicate data detection can be enabled/disabled
+ * through configuration and each object type can have a tolerance based on
+ * either the difference or the percentage difference between the objects 
being
+ * compared.
+ */
+public class DuplicateDataDetector {
+private final Map uriMap = new 
HashMap<>();
+private final Map classMap = new 
HashMap<>();
+
+private boolean isDetectionEnabled;
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the
+ * values provided by the configuration file.
+ * @param duplicateDataConfig the {@link DuplicateDataConfig}
+ */
+public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) {
+this(duplicateDataConfig.getBooleanTolerance(),
+duplicateDataConfig.getByteTolerance(),
+duplicateDataConfig.getDateTolerance(),
+duplicateDataConfig.getDoubleTolerance(),
+duplicateDataConfig.getFloatTolerance(),
+duplicateDataConfig.getIntegerTolerance(),
+duplicateDataConfig.getLongTolerance(),
+duplicateDataConfig.getShortTolerance(),
+duplicateDataConfig.getStringTolerance(),
+duplicateDataConfig.getUriTolerance(),
+duplicateDataConfig.getEquivalentTermsMap(),
+duplicateDataConfig.isDetectionEnabled()
+);
+}
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the values
+ * from the config.
+ * @throws 

[jira] [Resolved] (RYA-297) Implement owl:equivalentClass inference

2017-08-17 Thread Jesse Hatfield (JIRA)

 [ 
https://issues.apache.org/jira/browse/RYA-297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesse Hatfield resolved RYA-297.

Resolution: Resolved

> Implement owl:equivalentClass inference
> ---
>
> Key: RYA-297
> URL: https://issues.apache.org/jira/browse/RYA-297
> Project: Rya
>  Issue Type: Sub-task
>  Components: sail
>Reporter: Jesse Hatfield
>Assignee: Jesse Hatfield
>
> An *{{owl:equivalentClass}}* statement is equivalent to stating that two 
> classes are each subclasses of the other.
> The inference engine already supports subclass reasoning, but appears not to 
> check for equivalent class statements. This can likely be handled by adding 
> the relationship to the subclass graph in both directions, as seems to be 
> done for equivalent properties.
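
The approach suggested in the description above (adding the relationship to the subclass graph in both directions) can be sketched as follows. This is only an illustration under stated assumptions: `SubclassGraphSketch`, its string-keyed map, and its method names are hypothetical stand-ins, not the data structures or API of Rya's inference engine.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a subclass graph; not Rya's actual implementation.
class SubclassGraphSketch {
    // Maps each class to the classes it is directly declared a subclass of.
    private final Map<String, Set<String>> superClasses = new HashMap<>();

    void addSubClassOf(String sub, String sup) {
        superClasses.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
    }

    // owl:equivalentClass is recorded as two subclass edges, one per direction.
    void addEquivalentClass(String a, String b) {
        addSubClassOf(a, b);
        addSubClassOf(b, a);
    }

    // Returns every class reachable through subclass edges, excluding the start class.
    // The 'seen' set keeps the traversal from looping on the equivalence cycle.
    Set<String> findSuperClasses(String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> toVisit = new ArrayDeque<>();
        toVisit.push(start);
        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            for (String parent : superClasses.getOrDefault(current, Collections.emptySet())) {
                if (!parent.equals(start) && seen.add(parent)) {
                    toVisit.push(parent);
                }
            }
        }
        return new TreeSet<>(seen);
    }

    public static void main(String[] args) {
        SubclassGraphSketch graph = new SubclassGraphSketch();
        graph.addEquivalentClass(":Human", ":Person");
        graph.addSubClassOf(":Person", ":Agent");
        System.out.println(graph.findSuperClasses(":Human")); // prints [:Agent, :Person]
    }
}
```

Because equivalence introduces a cycle into the graph, any traversal over it needs a visited set, as above.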



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[GitHub] incubator-rya issue #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread asfgit
Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/153
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/406/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (RYA-250) Smart URI avoid data duplication

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131174#comment-16131174
 ] 

ASF GitHub Bot commented on RYA-250:


Github user asfgit commented on the issue:

https://github.com/apache/incubator-rya/pull/153
  

Refer to this link for build results (access rights to CI server needed): 

https://builds.apache.org/job/incubator-rya-master-with-optionals-pull-requests/406/



> Smart URI avoid data duplication
> 
>
> Key: RYA-250
> URL: https://issues.apache.org/jira/browse/RYA-250
> Project: Rya
>  Issue Type: Task
>  Components: dao
>Affects Versions: 3.2.10
>Reporter: Eric White
>Assignee: Eric White
> Fix For: 3.2.10
>
>
> Implement Smart URI methods for avoiding data duplication.





[GitHub] incubator-rya pull request #201: RYA-295 owl:allValuesFrom inference

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/201#discussion_r133811286
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -416,6 +418,36 @@ private void refreshHasValueRestrictions(Map restrictions) throws
 }
 }
 
+private void refreshAllValuesFromRestrictions(Map restrictions) throws QueryEvaluationException {
--- End diff --

I think it would be good to outline the flow of logic here.  E.g. refreshes 
allValuesRestrictions by creating a map of maps from the value class to a map 
that associates restrictions with properties.  




[jira] [Commented] (RYA-295) Implement owl:allValuesFrom inference

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131206#comment-16131206
 ] 

ASF GitHub Bot commented on RYA-295:


Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/201#discussion_r133811286
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -416,6 +418,36 @@ private void refreshHasValueRestrictions(Map restrictions) throws
 }
 }
 
+private void refreshAllValuesFromRestrictions(Map restrictions) throws QueryEvaluationException {
--- End diff --

I think it would be good to outline the flow of logic here.  E.g. refreshes 
allValuesRestrictions by creating a map of maps from the value class to a map 
that associates restrictions with properties.  


> Implement owl:allValuesFrom inference
> -
>
> Key: RYA-295
> URL: https://issues.apache.org/jira/browse/RYA-295
> Project: Rya
>  Issue Type: Sub-task
>  Components: sail
>Reporter: Jesse Hatfield
>Assignee: Jesse Hatfield
>
> An *{{owl:allValuesFrom}}* restriction defines the set of resources for 
> which, given a particular predicate and other type, every value of that 
> predicate is a member of that type. Note that there may be no values at all.
> For example, the ontology may state that resources of type {{:Person}} have 
> all values from {{:Person}} for type {{:parent}}: that is, a person's parents 
> are all people as well. Therefore, a pattern of the form {{?x rdf:type 
> :Person}} should be expanded to:
> {noformat}
> { ?y rdf:type :Person .
>   ?y :parent ?x }
> UNION
> { ?x rdf:type :Person }
> {noformat}
> i.e. we can infer {{?x}}'s personhood from the fact that child {{?y}} is 
> known to satisfy the restriction.
> Notes:
> -We can infer "x is a person, therefore all of x's parents are people". But 
> we can't infer "all of x's parents are people, therefore x is a person", 
> because of the open world semantics: we don't know that the parents given by 
> the data are in fact all of x's parents. (If there were also a cardinality 
> restriction and we could presume consistency, then we could infer this in the 
> right circumstances, but this is outside the scope of basic allValuesFrom 
> support.) This differs from most other property restriction rules in that we 
> can't infer that an object belongs to the class defined by the restriction, 
> but rather use the fact that an object is already known to belong in that 
> class in order to infer something about its neighbors in the graph (the types 
> of the values).
> -The example above could be applied recursively, but to implement this as a 
> simple query rewrite we'll need to limit recursion depth (and interactions 
> with other rules, for the same reasons).
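
The single-step expansion described in the issue above can be illustrated on in-memory data. This is a minimal sketch, assuming a plain map from each resource to its `:parent` values; the class and method names are hypothetical and it does not use Rya's query rewriting classes. It applies the rule once, without recursion, in line with the note about limiting recursion depth.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a one-step owl:allValuesFrom expansion.
class AllValuesFromSketch {
    /**
     * Mirrors the UNION in the issue: returns resources explicitly typed as the
     * class, plus every value of the restricted property on those resources.
     * Values of the property on resources NOT known to be in the class are
     * ignored, since nothing can be inferred from them.
     */
    static Set<String> expandTypeQuery(Set<String> explicitMembers,
                                       Map<String, Set<String>> propertyValues) {
        Set<String> result = new TreeSet<>(explicitMembers);
        for (String member : explicitMembers) {
            result.addAll(propertyValues.getOrDefault(member, Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> persons = new HashSet<>(Collections.singletonList("alice"));
        Map<String, Set<String>> parents = new HashMap<>();
        parents.put("alice", new HashSet<>(Arrays.asList("bob", "carol")));
        parents.put("dave", new HashSet<>(Collections.singletonList("erin")));
        System.out.println(expandTypeQuery(persons, parents)); // prints [alice, bob, carol]
    }
}
```

Here `erin` is not returned, because `dave` is not known to be a `:Person`, and `dave` is not returned either, since the rule cannot be run in the reverse direction under open world semantics.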





[jira] [Commented] (RYA-298) Implement rdfs:domain inference

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131227#comment-16131227
 ] 

ASF GitHub Bot commented on RYA-298:


Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/197#discussion_r133815357
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -363,13 +367,173 @@ public void refreshGraph() throws InferenceEngineException {
}
 }
 
+refreshDomainRange();
+
 refreshPropertyRestrictions();
 
 } catch (QueryEvaluationException e) {
 throw new InferenceEngineException(e);
 }
 }
 
+/**
+ * Queries domain and range information, then populates the inference engine with direct
+ * domain/range relations and any that can be inferred from the subclass graph, subproperty
+ * graph, and inverse property map. Should be called after that class and property information
+ * has been refreshed.
+ *
+ * Computes indirect domain/range:
+ *  - If p1 has domain c, and p2 is a subproperty of p1, then p2 also has domain c.
+ *  - If p1 has range c, and p2 is a subproperty of p1, then p2 also has range c.
+ *  - If p1 has domain c, and p2 is the inverse of p1, then p2 has range c.
+ *  - If p1 has range c, and p2 is the inverse of p1, then p2 has domain c.
+ *  - If p has domain c1, and c1 is a subclass of c2, then p also has domain c2.
+ *  - If p has range c1, and c1 is a subclass of c2, then p also has range c2.
+ * @throws QueryEvaluationException
+ */
+private void refreshDomainRange() throws QueryEvaluationException {
+Map domainByTypePartial = new ConcurrentHashMap<>();
+Map rangeByTypePartial = new ConcurrentHashMap<>();
+// First, populate domain and range based on direct domain/range triples.
+CloseableIteration iter = RyaDAOHelper.query(ryaDAO, null, RDFS.DOMAIN, null, conf);
+try {
+while (iter.hasNext()) {
--- End diff --

It seems like this map building loop is extremely prevalent in the refresh 
methods for all of the property restrictions.  Is there any way to pull this 
logic out into a buildRestrictionMap(...) method that takes in the map that is 
being updated/created/refreshed, and/or the iterator/property URI used to 
create the iterator?
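
One possible shape for such a helper is sketched below. It is only an illustration of the refactoring idea: the name `buildRestrictionMap` comes from the comment above, but the signature, the `Triple` stand-in, and the choice to key the map by statement object are assumptions, and a plain `java.util.Iterator` is used in place of the iteration type returned by `RyaDAOHelper.query(...)`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a shared map-building helper for the refresh methods.
class RestrictionMapSketch {
    // Minimal stand-in for an RDF statement.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /**
     * Drains the iterator and records each statement's subject under its object,
     * e.g. (p rdfs:domain c) adds p to target[c]. Existing entries are extended,
     * so the same map can be filled from several queries.
     */
    static void buildRestrictionMap(Iterator<Triple> statements, Map<String, Set<String>> target) {
        while (statements.hasNext()) {
            Triple statement = statements.next();
            target.computeIfAbsent(statement.object, k -> new HashSet<>()).add(statement.subject);
        }
    }

    public static void main(String[] args) {
        List<Triple> domainTriples = Arrays.asList(
                new Triple(":advisor", "rdfs:domain", ":Person"),
                new Triple(":teaches", "rdfs:domain", ":Person"));
        Map<String, Set<String>> domainByType = new ConcurrentHashMap<>();
        buildRestrictionMap(domainTriples.iterator(), domainByType);
        System.out.println(domainByType.get(":Person").size()); // prints 2
    }
}
```

A real version would also need to close the underlying iteration in a `finally` block, which this sketch leaves out.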


> Implement rdfs:domain inference
> ---
>
> Key: RYA-298
> URL: https://issues.apache.org/jira/browse/RYA-298
> Project: Rya
>  Issue Type: Sub-task
>  Components: sail
>Reporter: Jesse Hatfield
>Assignee: Jesse Hatfield
>
> If a predicate has an *{{rdfs:domain}}* of some class, then the subject of 
> any triple including that predicate belongs to the class.
> If the ontology states that {{:advisor}} has the domain of {{:Person}}, then 
> the inference engine should rewrite queries of the form {{?x rdf:type 
> :Person}} to check for resources which have any {{:advisor}} (as well as any 
> specifically stated to have type {{:Person}} ).
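
The rewrite described in the issue above amounts to a union of two checks, which can be sketched over an in-memory list of statements. This is an illustrative sketch only: `DomainInferenceSketch`, `Triple`, and `findInstances` are hypothetical names, and the set of properties with the given domain is assumed to have been computed already.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of answering "?x rdf:type :Person" with rdfs:domain inference.
class DomainInferenceSketch {
    // Minimal stand-in for an RDF statement.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /**
     * Returns subjects that are explicitly typed as the given class, plus subjects
     * of any statement whose predicate is known to have that class as its domain.
     */
    static Set<String> findInstances(String type, Set<String> propertiesWithDomain, List<Triple> data) {
        Set<String> result = new TreeSet<>();
        for (Triple t : data) {
            boolean explicitlyTyped = t.predicate.equals("rdf:type") && t.object.equals(type);
            boolean impliedByDomain = propertiesWithDomain.contains(t.predicate);
            if (explicitlyTyped || impliedByDomain) {
                result.add(t.subject);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Triple> data = Arrays.asList(
                new Triple("alice", "rdf:type", ":Person"),
                new Triple("bob", ":advisor", "carol"),
                new Triple("dave", ":name", "Dave"));
        Set<String> domainPerson = Collections.singleton(":advisor");
        System.out.println(findInstances(":Person", domainPerson, data)); // prints [alice, bob]
    }
}
```

Note that `carol` is not returned: a domain statement says something about the subject of `:advisor` only, and the object would need an `rdfs:range` statement instead.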





[GitHub] incubator-rya pull request #197: RYA-298, RYA-299 Domain/range inference.

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/197#discussion_r133814275
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -73,6 +74,8 @@
 private Set transitivePropertySet;
 private Map> hasValueByType;
 private Map> hasValueByProperty;
+private Map domainByType;
--- End diff --

Can one of you create a ticket for this?  The solution to this is probably 
not within the scope of this task, but this should be resolved.




[GitHub] incubator-rya pull request #197: RYA-298, RYA-299 Domain/range inference.

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/197#discussion_r133815357
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -363,13 +367,173 @@ public void refreshGraph() throws InferenceEngineException {
}
 }
 
+refreshDomainRange();
+
 refreshPropertyRestrictions();
 
 } catch (QueryEvaluationException e) {
 throw new InferenceEngineException(e);
 }
 }
 
+/**
+ * Queries domain and range information, then populates the inference engine with direct
+ * domain/range relations and any that can be inferred from the subclass graph, subproperty
+ * graph, and inverse property map. Should be called after that class and property information
+ * has been refreshed.
+ *
+ * Computes indirect domain/range:
+ *  - If p1 has domain c, and p2 is a subproperty of p1, then p2 also has domain c.
+ *  - If p1 has range c, and p2 is a subproperty of p1, then p2 also has range c.
+ *  - If p1 has domain c, and p2 is the inverse of p1, then p2 has range c.
+ *  - If p1 has range c, and p2 is the inverse of p1, then p2 has domain c.
+ *  - If p has domain c1, and c1 is a subclass of c2, then p also has domain c2.
+ *  - If p has range c1, and c1 is a subclass of c2, then p also has range c2.
+ * @throws QueryEvaluationException
+ */
+private void refreshDomainRange() throws QueryEvaluationException {
+Map domainByTypePartial = new ConcurrentHashMap<>();
+Map rangeByTypePartial = new ConcurrentHashMap<>();
+// First, populate domain and range based on direct domain/range triples.
+CloseableIteration iter = RyaDAOHelper.query(ryaDAO, null, RDFS.DOMAIN, null, conf);
+try {
+while (iter.hasNext()) {
--- End diff --

It seems like this map building loop is extremely prevalent in the refresh 
methods for all of the property restrictions.  Is there any way to pull this 
logic out into a buildRestrictionMap(...) method that takes in the map that is 
being updated/created/refreshed, and/or the iterator/property URI used to 
create the iterator?




[jira] [Commented] (RYA-298) Implement rdfs:domain inference

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131226#comment-16131226
 ] 

ASF GitHub Bot commented on RYA-298:


Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/197#discussion_r133814275
  
--- Diff: 
sail/src/main/java/org/apache/rya/rdftriplestore/inference/InferenceEngine.java 
---
@@ -73,6 +74,8 @@
 private Set transitivePropertySet;
 private Map> hasValueByType;
 private Map> hasValueByProperty;
+private Map domainByType;
--- End diff --

Can one of you create a ticket for this?  The solution to this is probably 
not within the scope of this task, but this should be resolved.


> Implement rdfs:domain inference
> ---
>
> Key: RYA-298
> URL: https://issues.apache.org/jira/browse/RYA-298
> Project: Rya
>  Issue Type: Sub-task
>  Components: sail
>Reporter: Jesse Hatfield
>Assignee: Jesse Hatfield
>
> If a predicate has an *{{rdfs:domain}}* of some class, then the subject of 
> any triple including that predicate belongs to the class.
> If the ontology states that {{:advisor}} has the domain of {{:Person}}, then 
> the inference engine should rewrite queries of the form {{?x rdf:type 
> :Person}} to check for resources which have any {{:advisor}} (as well as any 
> specifically stated to have type {{:Person}} ).





[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication

2017-08-17 Thread meiercaleb
Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133817921
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1066 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a set
+ * tolerance for each field's type. Two entities are considered nearly
+ * identical if all their properties are equal and/or within the specified
+ * tolerance for the property's object type. Setting all object type tolerances
+ * to 0 means that the objects need to be exactly equal to each other to be
+ * considered duplicates. Duplicate data detection can be enabled/disabled
+ * through configuration and each object type can have a tolerance based on
+ * either the difference or the percentage difference between the objects being
+ * compared.
+ */
+public class DuplicateDataDetector {
+private final Map uriMap = new HashMap<>();
+private final Map classMap = new HashMap<>();
+
+private boolean isDetectionEnabled;
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the
+ * values provided by the configuration file.
+ * @param duplicateDataConfig the {@link DuplicateDataConfig}
+ */
+public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) {
+this(duplicateDataConfig.getBooleanTolerance(),
+duplicateDataConfig.getByteTolerance(),
+duplicateDataConfig.getDateTolerance(),
+duplicateDataConfig.getDoubleTolerance(),
+duplicateDataConfig.getFloatTolerance(),
+duplicateDataConfig.getIntegerTolerance(),
+duplicateDataConfig.getLongTolerance(),
+duplicateDataConfig.getShortTolerance(),
+duplicateDataConfig.getStringTolerance(),
+duplicateDataConfig.getUriTolerance(),
+duplicateDataConfig.getEquivalentTermsMap(),
+duplicateDataConfig.isDetectionEnabled()
+);
+}
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the values
+ * from the config.
+ * @throws ConfigurationException
+ */
+public DuplicateDataDetector() throws ConfigurationException {
+this(new DuplicateDataConfig());
+}
+
+/**
+ * Creates a new instance of {@link 

[jira] [Commented] (RYA-250) Smart URI avoid data duplication

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/RYA-250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131229#comment-16131229
 ] 

ASF GitHub Bot commented on RYA-250:


Github user meiercaleb commented on a diff in the pull request:

https://github.com/apache/incubator-rya/pull/153#discussion_r133817921
  
--- Diff: 
extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java
 ---
@@ -0,0 +1,1066 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.rya.indexing.smarturi.duplication;
+
+import static java.util.Objects.requireNonNull;
+
+import java.math.BigDecimal;
+import java.util.ArrayList;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.Map.Entry;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.commons.configuration.ConfigurationException;
+import org.apache.commons.lang.StringUtils;
+import org.apache.rya.api.domain.RyaType;
+import org.apache.rya.api.domain.RyaURI;
+import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
+import org.apache.rya.indexing.entity.model.Entity;
+import org.apache.rya.indexing.entity.model.Property;
+import org.apache.rya.indexing.smarturi.SmartUriAdapter;
+import org.apache.rya.indexing.smarturi.SmartUriException;
+import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
+import org.calrissian.mango.types.exception.TypeEncodingException;
+import org.joda.time.DateTime;
+import org.openrdf.model.URI;
+import org.openrdf.model.impl.URIImpl;
+import org.openrdf.model.vocabulary.XMLSchema;
+
+import com.google.common.collect.ImmutableMap;
+
+/**
+ * Detects if two entities contain data that's nearly identical based on a set
+ * tolerance for each field's type. Two entities are considered nearly
+ * identical if all their properties are equal and/or within the specified
+ * tolerance for the property's object type. Setting all object type tolerances
+ * to 0 means that the objects need to be exactly equal to each other to be
+ * considered duplicates. Duplicate data detection can be enabled/disabled
+ * through configuration and each object type can have a tolerance based on
+ * either the difference or the percentage difference between the objects being
+ * compared.
+ */
+public class DuplicateDataDetector {
+private final Map uriMap = new HashMap<>();
+private final Map classMap = new HashMap<>();
+
+private boolean isDetectionEnabled;
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the
+ * values provided by the configuration file.
+ * @param duplicateDataConfig the {@link DuplicateDataConfig}
+ */
+public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) {
+this(duplicateDataConfig.getBooleanTolerance(),
+duplicateDataConfig.getByteTolerance(),
+duplicateDataConfig.getDateTolerance(),
+duplicateDataConfig.getDoubleTolerance(),
+duplicateDataConfig.getFloatTolerance(),
+duplicateDataConfig.getIntegerTolerance(),
+duplicateDataConfig.getLongTolerance(),
+duplicateDataConfig.getShortTolerance(),
+duplicateDataConfig.getStringTolerance(),
+duplicateDataConfig.getUriTolerance(),
+duplicateDataConfig.getEquivalentTermsMap(),
+duplicateDataConfig.isDetectionEnabled()
+);
+}
+
+/**
+ * Creates a new instance of {@link DuplicateDataDetector} with the values
+ * from the config.
+ * @throws