[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user asfgit closed the pull request at: https://github.com/apache/incubator-rya/pull/153 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/153#discussion_r134297278 --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java --- @@ -242,4 +283,84 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) { return Stream.of(dataTypeFilter, valueFilter); } + +private boolean detectDuplicates(final Entity entity) throws EntityStorageException { +boolean hasDuplicate = false; +if (duplicateDataDetector.isDetectionEnabled()) { +// Grab all entities that have all the same explicit types as our +// original Entity. +final List comparisonEntities = searchHasAllExplicitTypes(entity.getExplicitTypeIds()); + +// Now that we have our set of potential duplicates, compare them. +// We can stop when we find one duplicate. +for (final Entity compareEntity : comparisonEntities) { +try { +hasDuplicate = duplicateDataDetector.compareEntities(entity, compareEntity); +} catch (final SmartUriException e) { +throw new EntityStorageException("Encountered an error while comparing entities.", e); +} +if (hasDuplicate) { +break; +} +} +} +return hasDuplicate; +} + +/** + * Searches the Entity storage for all Entities that contain all the + * specified explicit type IDs. + * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s that + * are being searched for. + * @return the {@link List} of {@link Entity}s that have all the specified + * explicit type IDs. If nothing was found an empty {@link List} is + * returned. + * @throws EntityStorageException + */ +private List searchHasAllExplicitTypes(final ImmutableList explicitTypeIds) throws EntityStorageException { +final List hasAllExplicitTypesEntities = new ArrayList<>(); +if (!explicitTypeIds.isEmpty()) { +// Grab the first type from the explicit type IDs. +final RyaURI firstType = explicitTypeIds.get(0); + +// Check if that type exists anywhere in storage. 
+final List subjects = new ArrayList<>(); +Optional type; +try { +if (mongoTypeStorage == null) { +mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName); +} +type = mongoTypeStorage.get(firstType); +} catch (final TypeStorageException e) { +throw new EntityStorageException("Unable to get entity type: " + firstType, e); +} +if (type.isPresent()) { +// Grab the subjects for all the types we found matching "firstType" +final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); --- End diff --
To clarify the first point: querying with fuzzy matching on properties is not supported by Rya and would need to be done at the MongoDB level. A new JIRA ticket for that improvement has been opened, [RYA-349](https://issues.apache.org/jira/browse/RYA-349).
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/153#discussion_r134259051 --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java --- @@ -242,4 +283,84 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) { return Stream.of(dataTypeFilter, valueFilter); } + +private boolean detectDuplicates(final Entity entity) throws EntityStorageException { +boolean hasDuplicate = false; +if (duplicateDataDetector.isDetectionEnabled()) { +// Grab all entities that have all the same explicit types as our +// original Entity. +final List comparisonEntities = searchHasAllExplicitTypes(entity.getExplicitTypeIds()); + +// Now that we have our set of potential duplicates, compare them. +// We can stop when we find one duplicate. +for (final Entity compareEntity : comparisonEntities) { +try { +hasDuplicate = duplicateDataDetector.compareEntities(entity, compareEntity); +} catch (final SmartUriException e) { +throw new EntityStorageException("Encountered an error while comparing entities.", e); +} +if (hasDuplicate) { +break; +} +} +} +return hasDuplicate; +} + +/** + * Searches the Entity storage for all Entities that contain all the + * specified explicit type IDs. + * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s that + * are being searched for. + * @return the {@link List} of {@link Entity}s that have all the specified + * explicit type IDs. If nothing was found an empty {@link List} is + * returned. + * @throws EntityStorageException + */ +private List searchHasAllExplicitTypes(final ImmutableList explicitTypeIds) throws EntityStorageException { +final List hasAllExplicitTypesEntities = new ArrayList<>(); +if (!explicitTypeIds.isEmpty()) { +// Grab the first type from the explicit type IDs. +final RyaURI firstType = explicitTypeIds.get(0); + +// Check if that type exists anywhere in storage. 
+final List subjects = new ArrayList<>(); +Optional type; +try { +if (mongoTypeStorage == null) { +mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName); +} +type = mongoTypeStorage.get(firstType); +} catch (final TypeStorageException e) { +throw new EntityStorageException("Unable to get entity type: " + firstType, e); +} +if (type.isPresent()) { +// Grab the subjects for all the types we found matching "firstType" +final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); --- End diff --
Unfortunately, we can't use the properties to query for potential duplicates. That would bring back only Entities whose properties match EXACTLY. We want to include Entities that are NEAR matches (based on the tolerance). So, with a tolerance of 1% for longitudes, we'd want to treat an Entity with 99° as the same as an Entity with 100°, which means we couldn't query for it based on its property value of 100°.

I kind of think Entities should be judged as a whole and not by their components. If we have 2 completely different Joe Smiths, it's possible that they each have an Employee TypedEntity and only one of them has a Person TypedEntity. If we try to create the one that only has an Employee TypedEntity and we say it's a duplicate of the other Joe Smith's Employee TypedEntity (due to high tolerance values), then it won't get created. But if we saw that they had different TypedEntities associated with them, we'd consider them not to be duplicates and both Joe Smiths' Entities would be created.
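The longitude example above can be sketched in a few lines of Java. This is only an illustration of why an exact-match query misses near duplicates. The class `ToleranceSketch` and both method names are hypothetical, and the percentage formula (difference relative to the larger magnitude) is an assumption, not necessarily how DuplicateDataDetector computes it.

```java
// Hypothetical helper, not part of Rya: shows tolerance-based comparison.
final class ToleranceSketch {
    private ToleranceSketch() {
    }

    /**
     * Returns true if a and b differ by at most the given fraction of the
     * larger magnitude (assumed definition of "percentage tolerance").
     */
    static boolean withinPercent(final double a, final double b, final double fraction) {
        if (a == b) {
            return true;
        }
        final double scale = Math.max(Math.abs(a), Math.abs(b));
        return Math.abs(a - b) <= fraction * scale;
    }

    /** Returns true if a and b differ by at most maxDifference (absolute tolerance). */
    static boolean withinDifference(final double a, final double b, final double maxDifference) {
        return Math.abs(a - b) <= maxDifference;
    }
}
```

With a 1% tolerance, a stored longitude of 100 and an incoming 99.5 compare as near-equal even though `99.5 == 100.0` is false, which is why an equality filter on the property value cannot find the candidate.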
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user meiercaleb commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/153#discussion_r134025772 --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java --- @@ -242,4 +283,84 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) { return Stream.of(dataTypeFilter, valueFilter); } + +private boolean detectDuplicates(final Entity entity) throws EntityStorageException { +boolean hasDuplicate = false; +if (duplicateDataDetector.isDetectionEnabled()) { +// Grab all entities that have all the same explicit types as our +// original Entity. +final List comparisonEntities = searchHasAllExplicitTypes(entity.getExplicitTypeIds()); + +// Now that we have our set of potential duplicates, compare them. +// We can stop when we find one duplicate. +for (final Entity compareEntity : comparisonEntities) { +try { +hasDuplicate = duplicateDataDetector.compareEntities(entity, compareEntity); +} catch (final SmartUriException e) { +throw new EntityStorageException("Encountered an error while comparing entities.", e); +} +if (hasDuplicate) { +break; +} +} +} +return hasDuplicate; +} + +/** + * Searches the Entity storage for all Entities that contain all the + * specified explicit type IDs. + * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s that + * are being searched for. + * @return the {@link List} of {@link Entity}s that have all the specified + * explicit type IDs. If nothing was found an empty {@link List} is + * returned. + * @throws EntityStorageException + */ +private List searchHasAllExplicitTypes(final ImmutableList explicitTypeIds) throws EntityStorageException { +final List hasAllExplicitTypesEntities = new ArrayList<>(); +if (!explicitTypeIds.isEmpty()) { +// Grab the first type from the explicit type IDs. +final RyaURI firstType = explicitTypeIds.get(0); + +// Check if that type exists anywhere in storage. 
+final List subjects = new ArrayList<>(); +Optional type; +try { +if (mongoTypeStorage == null) { +mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName); +} +type = mongoTypeStorage.get(firstType); +} catch (final TypeStorageException e) { +throw new EntityStorageException("Unable to get entity type: " + firstType, e); +} +if (type.isPresent()) { +// Grab the subjects for all the types we found matching "firstType" +final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); --- End diff --
Instead of getting all of the TypedEntities in the database with the given Type, you should call the method Entity.makeTypedEntity(...) for each typeId. Then use the Type and Property map of each TypedEntity to query the DB. This will provide a more constrained query that uses the actual property values. Finally, I think that you should add a compareTypedEntities method to your DuplicateDataDetector so that you can apply it to compare the returned TypedEntities with the TypedEntity that you created from the original Entity. This eliminates the need to re-query the DB to get the Entities that each TypedEntity is derived from.

Also, comparing all TypedEntities derived from a given Entity with all other TypedEntities in the database provides a stricter notion of duplicate data detection. For example, if an Entity contains the Types Person and Employee with associated properties, then the approach I'm describing would compare the Person TypedEntity and the Employee TypedEntity with all other Person and Employee TypedEntities in the DB. For the Entity to be deemed a non-duplicate, none of those TypedEntities could be duplicates. As it's currently implemented, if an Employee TypedEntity was ingested and derived from an Entity whose sole type was Employee, then an Entity with Types Person and Employee would not be considered a duplicate even if the Employee properties were exactly the same!

So in effect, I think we should detect if any TypedEntities derived from an Entity are duplicates to avoid duplicating TypedEntities (I think that these are more meaningful and concrete than Entities, which are essen
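The difference between the two notions of "duplicate" discussed here can be made concrete with a small sketch. It models an Entity as a map from type name to that type's properties, which is a simplification and not Rya's Entity/TypedEntity API. The class `TypedDuplicateSketch`, its methods, and the sample property names are all hypothetical, and properties are compared with exact equality instead of tolerances to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: entity = map of type name -> (property name -> value).
final class TypedDuplicateSketch {
    private TypedDuplicateSketch() {
    }

    /** Whole-entity notion: duplicates only if every type and property matches. */
    static boolean wholeEntityDuplicate(final Map<String, Map<String, String>> a,
            final Map<String, Map<String, String>> b) {
        return a.equals(b);
    }

    /** Per-type notion: duplicate as soon as any shared type has matching properties. */
    static boolean anyTypedEntityDuplicate(final Map<String, Map<String, String>> a,
            final Map<String, Map<String, String>> b) {
        for (final Map.Entry<String, Map<String, String>> entry : a.entrySet()) {
            final Map<String, String> other = b.get(entry.getKey());
            if (other != null && other.equals(entry.getValue())) {
                return true;
            }
        }
        return false;
    }

    /** Sample entity whose sole type is Employee. */
    static Map<String, Map<String, String>> employeeOnly() {
        final Map<String, String> employee = new HashMap<>();
        employee.put("name", "Joe Smith");
        employee.put("employeeId", "1234");
        final Map<String, Map<String, String>> entity = new HashMap<>();
        entity.put("Employee", employee);
        return entity;
    }

    /** Sample entity with the same Employee properties plus a Person type. */
    static Map<String, Map<String, String>> personAndEmployee() {
        final Map<String, String> person = new HashMap<>();
        person.put("name", "Joe Smith");
        final Map<String, Map<String, String>> entity = employeeOnly();
        entity.put("Person", person);
        return entity;
    }
}
```

For the two sample entities, `wholeEntityDuplicate` returns false because the type sets differ, while `anyTypedEntityDuplicate` returns true because the Employee properties are identical. That is the case the comment above describes as slipping through the current implementation.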
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/153#discussion_r133825842 --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java --- @@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) throws EntityStorageExcept if (mongoTypeStorage == null) { mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName); } -final Builder builder = new Builder(); -builder.setSubject(entity.getSubject()); -boolean abort = false; -for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) { -Optional type; + +// Grab all entities that have all the same explicit types as our +// original Entity. +final List comparisonEntities = searchHasAllExplicitTypes(entity.getExplicitTypeIds()); + +// Now that we have our set of potential duplicates, compare them. +// We can stop when we find one duplicate. +for (final Entity compareEntity : comparisonEntities) { try { -type = mongoTypeStorage.get(typeRyaUri); -} catch (final TypeStorageException e) { -throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e); +hasDuplicate = duplicateDataDetector.compareEntities(entity, compareEntity); +} catch (final SmartUriException e) { +throw new EntityStorageException("Encountered an error while comparing entities.", e); } -if (type.isPresent()) { -final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); -while (cursor.hasNext()) { -final TypedEntity typedEntity = cursor.next(); -builder.setExplicitType(typeRyaUri); -for (final Property property : typedEntity.getProperties()) { -builder.setProperty(typeRyaUri, property); -} -} -} else { -abort = true; +if (hasDuplicate) { break; } } -if (!abort) { -final Entity entity2 = builder.build(); -try { -hasDuplicate = duplicateDataDetector.compareEntities(entity, entity2); -} catch (final SmartUriException e) { -throw new EntityStorageException("Encountered an error 
while comparing entities.", e); +} +return hasDuplicate; +} + +/** + * Searches the Entity storage for all Entities that contain all the + * specified explicit type IDs. + * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s that + * are being searched for. + * @return the {@link List} of {@link Entity}s that have all the specified + * explicit type IDs. If nothing was found an empty {@link List} is + * returned. + * @throws EntityStorageException + */ +private List searchHasAllExplicitTypes(final ImmutableList explicitTypeIds) throws EntityStorageException { +// Grab the first type from the explicit type IDs. +RyaURI firstType = null; +if (!explicitTypeIds.isEmpty()) { +firstType = explicitTypeIds.get(0); +} + +// Check if that type exists anywhere in storage. +final List subjects = new ArrayList<>(); +Optional type; +try { +type = mongoTypeStorage.get(firstType); +} catch (final TypeStorageException e) { +throw new EntityStorageException("Unable to get entity type: " + firstType, e); +} +if (type.isPresent()) { +// Grab the subjects for all the types we found matching "firstType" +final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); +while (cursor.hasNext()) { +final TypedEntity typedEntity = cursor.next(); +final RyaURI subject = typedEntity.getSubject(); +subjects.add(subject); +} +} + +// Now grab all the Entities that have the subjects we found. +final List hasAllExplicitTypesEntities = new ArrayList<>(); +for (final RyaURI subject : subjects) { +final Optional entityF
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/153#discussion_r133825074 --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java --- @@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) throws EntityStorageExcept if (mongoTypeStorage == null) { mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName); } -final Builder builder = new Builder(); -builder.setSubject(entity.getSubject()); -boolean abort = false; -for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) { -Optional type; + +// Grab all entities that have all the same explicit types as our +// original Entity. +final List comparisonEntities = searchHasAllExplicitTypes(entity.getExplicitTypeIds()); + +// Now that we have our set of potential duplicates, compare them. +// We can stop when we find one duplicate. +for (final Entity compareEntity : comparisonEntities) { try { -type = mongoTypeStorage.get(typeRyaUri); -} catch (final TypeStorageException e) { -throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e); +hasDuplicate = duplicateDataDetector.compareEntities(entity, compareEntity); +} catch (final SmartUriException e) { +throw new EntityStorageException("Encountered an error while comparing entities.", e); } -if (type.isPresent()) { -final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); -while (cursor.hasNext()) { -final TypedEntity typedEntity = cursor.next(); -builder.setExplicitType(typeRyaUri); -for (final Property property : typedEntity.getProperties()) { -builder.setProperty(typeRyaUri, property); -} -} -} else { -abort = true; +if (hasDuplicate) { break; } } -if (!abort) { -final Entity entity2 = builder.build(); -try { -hasDuplicate = duplicateDataDetector.compareEntities(entity, entity2); -} catch (final SmartUriException e) { -throw new EntityStorageException("Encountered an error 
while comparing entities.", e); +} +return hasDuplicate; +} + +/** + * Searches the Entity storage for all Entities that contain all the + * specified explicit type IDs. + * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s that + * are being searched for. + * @return the {@link List} of {@link Entity}s that have all the specified + * explicit type IDs. If nothing was found an empty {@link List} is + * returned. + * @throws EntityStorageException + */ +private List searchHasAllExplicitTypes(final ImmutableList explicitTypeIds) throws EntityStorageException { +// Grab the first type from the explicit type IDs. +RyaURI firstType = null; +if (!explicitTypeIds.isEmpty()) { +firstType = explicitTypeIds.get(0); --- End diff --
The remaining typeIds get compared further down, against the Entity query results. But I hastily pulled all this working logic into a separate function and ended up comparing them to themselves. I'll fix that.
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user meiercaleb commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/153#discussion_r133821430 --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java --- @@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) throws EntityStorageExcept if (mongoTypeStorage == null) { mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName); } -final Builder builder = new Builder(); -builder.setSubject(entity.getSubject()); -boolean abort = false; -for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) { -Optional type; + +// Grab all entities that have all the same explicit types as our +// original Entity. +final List comparisonEntities = searchHasAllExplicitTypes(entity.getExplicitTypeIds()); + +// Now that we have our set of potential duplicates, compare them. +// We can stop when we find one duplicate. +for (final Entity compareEntity : comparisonEntities) { try { -type = mongoTypeStorage.get(typeRyaUri); -} catch (final TypeStorageException e) { -throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e); +hasDuplicate = duplicateDataDetector.compareEntities(entity, compareEntity); +} catch (final SmartUriException e) { +throw new EntityStorageException("Encountered an error while comparing entities.", e); } -if (type.isPresent()) { -final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); -while (cursor.hasNext()) { -final TypedEntity typedEntity = cursor.next(); -builder.setExplicitType(typeRyaUri); -for (final Property property : typedEntity.getProperties()) { -builder.setProperty(typeRyaUri, property); -} -} -} else { -abort = true; +if (hasDuplicate) { break; } } -if (!abort) { -final Entity entity2 = builder.build(); -try { -hasDuplicate = duplicateDataDetector.compareEntities(entity, entity2); -} catch (final SmartUriException e) { -throw new EntityStorageException("Encountered an error 
while comparing entities.", e); +} +return hasDuplicate; +} + +/** + * Searches the Entity storage for all Entities that contain all the + * specified explicit type IDs. + * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s that + * are being searched for. + * @return the {@link List} of {@link Entity}s that have all the specified + * explicit type IDs. If nothing was found an empty {@link List} is + * returned. + * @throws EntityStorageException + */ +private List searchHasAllExplicitTypes(final ImmutableList explicitTypeIds) throws EntityStorageException { +// Grab the first type from the explicit type IDs. +RyaURI firstType = null; +if (!explicitTypeIds.isEmpty()) { +firstType = explicitTypeIds.get(0); +} + +// Check if that type exists anywhere in storage. +final List subjects = new ArrayList<>(); +Optional type; +try { +type = mongoTypeStorage.get(firstType); +} catch (final TypeStorageException e) { +throw new EntityStorageException("Unable to get entity type: " + firstType, e); +} +if (type.isPresent()) { +// Grab the subjects for all the types we found matching "firstType" +final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); +while (cursor.hasNext()) { +final TypedEntity typedEntity = cursor.next(); +final RyaURI subject = typedEntity.getSubject(); +subjects.add(subject); +} +} + +// Now grab all the Entities that have the subjects we found. +final List hasAllExplicitTypesEntities = new ArrayList<>(); +for (final RyaURI subject : subjects) { +final Optional entityF
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user meiercaleb commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/153#discussion_r133820424 --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java --- @@ -288,39 +290,79 @@ private boolean detectDuplicates(final Entity entity) throws EntityStorageExcept if (mongoTypeStorage == null) { mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName); } -final Builder builder = new Builder(); -builder.setSubject(entity.getSubject()); -boolean abort = false; -for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) { -Optional type; + +// Grab all entities that have all the same explicit types as our +// original Entity. +final List comparisonEntities = searchHasAllExplicitTypes(entity.getExplicitTypeIds()); + +// Now that we have our set of potential duplicates, compare them. +// We can stop when we find one duplicate. +for (final Entity compareEntity : comparisonEntities) { try { -type = mongoTypeStorage.get(typeRyaUri); -} catch (final TypeStorageException e) { -throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e); +hasDuplicate = duplicateDataDetector.compareEntities(entity, compareEntity); +} catch (final SmartUriException e) { +throw new EntityStorageException("Encountered an error while comparing entities.", e); } -if (type.isPresent()) { -final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet()); -while (cursor.hasNext()) { -final TypedEntity typedEntity = cursor.next(); -builder.setExplicitType(typeRyaUri); -for (final Property property : typedEntity.getProperties()) { -builder.setProperty(typeRyaUri, property); -} -} -} else { -abort = true; +if (hasDuplicate) { break; } } -if (!abort) { -final Entity entity2 = builder.build(); -try { -hasDuplicate = duplicateDataDetector.compareEntities(entity, entity2); -} catch (final SmartUriException e) { -throw new EntityStorageException("Encountered an error 
while comparing entities.", e); +} +return hasDuplicate; +} + +/** + * Searches the Entity storage for all Entities that contain all the + * specified explicit type IDs. + * @param explicitTypeIds the {@link ImmutableList} of {@link RyaURI}s that + * are being searched for. + * @return the {@link List} of {@link Entity}s that have all the specified + * explicit type IDs. If nothing was found an empty {@link List} is + * returned. + * @throws EntityStorageException + */ +private List searchHasAllExplicitTypes(final ImmutableList explicitTypeIds) throws EntityStorageException { +// Grab the first type from the explicit type IDs. +RyaURI firstType = null; +if (!explicitTypeIds.isEmpty()) { +firstType = explicitTypeIds.get(0); --- End diff --
It doesn't seem like you do anything with the remaining typeIds.
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user meiercaleb commented on a diff in the pull request: https://github.com/apache/incubator-rya/pull/153#discussion_r133817921 --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java --- @@ -0,0 +1,1066 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.rya.indexing.smarturi.duplication; + +import static java.util.Objects.requireNonNull; + +import java.math.BigDecimal; +import java.util.ArrayList; +import java.util.Date; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Objects; +import java.util.Optional; +import java.util.Set; +import java.util.TreeSet; + +import org.apache.commons.configuration.ConfigurationException; +import org.apache.commons.lang.StringUtils; +import org.apache.rya.api.domain.RyaType; +import org.apache.rya.api.domain.RyaURI; +import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver; +import org.apache.rya.indexing.entity.model.Entity; +import org.apache.rya.indexing.entity.model.Property; +import org.apache.rya.indexing.smarturi.SmartUriAdapter; +import org.apache.rya.indexing.smarturi.SmartUriException; +import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig; +import org.calrissian.mango.types.exception.TypeEncodingException; +import org.joda.time.DateTime; +import org.openrdf.model.URI; +import org.openrdf.model.impl.URIImpl; +import org.openrdf.model.vocabulary.XMLSchema; + +import com.google.common.collect.ImmutableMap; + +/** + * Detects if two entities contain data that's nearly identical based on a set + * tolerance for each field's type. Two entities are considered nearly + * identical if all their properties are equal and/or within the specified + * tolerance for the property's object type. Setting all object type tolerances + * to 0 means that the objects need to be exactly equal to each other to be + * considered duplicates. Duplicate data detection can be enabled/disabled + * through configuration and each object type can have a tolerance based on + * either the difference or the percentage difference between the objects being + * compared. 
+ */ +public class DuplicateDataDetector { +private final Map> uriMap = new HashMap<>(); +private final Map, ApproxEqualsDetector> classMap = new HashMap<>(); + +private boolean isDetectionEnabled; + +/** + * Creates a new instance of {@link DuplicateDataDetector} with the + * values provided by the configuration file. + * @param duplicateDataConfig the {@link DuplicateDataConfig} + */ +public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) { +this(duplicateDataConfig.getBooleanTolerance(), +duplicateDataConfig.getByteTolerance(), +duplicateDataConfig.getDateTolerance(), +duplicateDataConfig.getDoubleTolerance(), +duplicateDataConfig.getFloatTolerance(), +duplicateDataConfig.getIntegerTolerance(), +duplicateDataConfig.getLongTolerance(), +duplicateDataConfig.getShortTolerance(), +duplicateDataConfig.getStringTolerance(), +duplicateDataConfig.getUriTolerance(), +duplicateDataConfig.getEquivalentTermsMap(), +duplicateDataConfig.isDetectionEnabled() +); +} + +/** + * Creates a new instance of {@link DuplicateDataDetector} with the values + * from the config. + * @throws ConfigurationException + */ +public DuplicateDataDetector() throws ConfigurationException { +this(new DuplicateDataConfig()); +} + +/** + * Creates a new instance of {@link DuplicateDataDetector}. + * @param tolera
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r133804678

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java ---
    @@ -0,0 +1,1066 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.rya.indexing.smarturi.duplication;
    +
    +import static java.util.Objects.requireNonNull;
    +
    +import java.math.BigDecimal;
    +import java.util.ArrayList;
    +import java.util.Date;
    +import java.util.HashMap;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Map.Entry;
    +import java.util.Objects;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +
    +import org.apache.commons.configuration.ConfigurationException;
    +import org.apache.commons.lang.StringUtils;
    +import org.apache.rya.api.domain.RyaType;
    +import org.apache.rya.api.domain.RyaURI;
    +import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
    +import org.apache.rya.indexing.entity.model.Entity;
    +import org.apache.rya.indexing.entity.model.Property;
    +import org.apache.rya.indexing.smarturi.SmartUriAdapter;
    +import org.apache.rya.indexing.smarturi.SmartUriException;
    +import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
    +import org.calrissian.mango.types.exception.TypeEncodingException;
    +import org.joda.time.DateTime;
    +import org.openrdf.model.URI;
    +import org.openrdf.model.impl.URIImpl;
    +import org.openrdf.model.vocabulary.XMLSchema;
    +
    +import com.google.common.collect.ImmutableMap;
    +
    +/**
    + * Detects if two entities contain data that's nearly identical based on a set
    + * tolerance for each field's type. Two entities are considered nearly
    + * identical if all their properties are equal and/or within the specified
    + * tolerance for the property's object type. Setting all object type tolerances
    + * to 0 means that the objects need to be exactly equal to each other to be
    + * considered duplicates. Duplicate data detection can be enabled/disabled
    + * through configuration and each object type can have a tolerance based on
    + * either the difference or the percentage difference between the objects being
    + * compared.
    + */
    +public class DuplicateDataDetector {
    +    private final Map<URI, ApproxEqualsDetector<?>> uriMap = new HashMap<>();
    +    private final Map<Class<?>, ApproxEqualsDetector<?>> classMap = new HashMap<>();
    +
    +    private boolean isDetectionEnabled;
    +
    +    /**
    +     * Creates a new instance of {@link DuplicateDataDetector} with the
    +     * values provided by the configuration file.
    +     * @param duplicateDataConfig the {@link DuplicateDataConfig}
    +     */
    +    public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) {
    +        this(duplicateDataConfig.getBooleanTolerance(),
    +            duplicateDataConfig.getByteTolerance(),
    +            duplicateDataConfig.getDateTolerance(),
    +            duplicateDataConfig.getDoubleTolerance(),
    +            duplicateDataConfig.getFloatTolerance(),
    +            duplicateDataConfig.getIntegerTolerance(),
    +            duplicateDataConfig.getLongTolerance(),
    +            duplicateDataConfig.getShortTolerance(),
    +            duplicateDataConfig.getStringTolerance(),
    +            duplicateDataConfig.getUriTolerance(),
    +            duplicateDataConfig.getEquivalentTermsMap(),
    +            duplicateDataConfig.isDetectionEnabled()
    +        );
    +    }
    +
    +    /**
    +     * Creates a new instance of {@link DuplicateDataDetector} with the values
    +     * from the config.
    +     * @throws ConfigurationException
    +     */
    +    public DuplicateDataDetector() throws ConfigurationException {
    +        this(new DuplicateDataConfig());
    +    }
    +
    +    /**
    +     * Creates a new instance of {@link DuplicateDataDetector}.
    +     * @param tolera
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r133803255

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java ---
    @@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
             return Stream.of(dataTypeFilter, valueFilter);
         }
    +
    +    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
    +        boolean hasDuplicate = false;
    +        if (duplicateDataDetector.isDetectionEnabled()) {
    +            if (mongoTypeStorage == null) {
    +                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
    +            }
    +            final Builder builder = new Builder();
    +            builder.setSubject(entity.getSubject());
    +            boolean abort = false;
    +            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
    +                Optional type;
    +                try {
    +                    type = mongoTypeStorage.get(typeRyaUri);
    +                } catch (final TypeStorageException e) {
    +                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
    +                }
    +                if (type.isPresent()) {
    +                    final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet());
    +                    while (cursor.hasNext()) {
    --- End diff --

    Oops, it's only grabbing one Entity to compare. I reworked it so it now finds a set of potential Entities to compare, based on their having all the same explicit type IDs. The subjects don't matter, and querying for properties doesn't help us since we're trying to find properties that are CLOSE but not quite equal. That leaves us with only the Types to narrow our initial search of Entities to check. Once we grab the Entities, the (near) duplicate data detector is run over them.
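The reworked flow described in this comment (narrow by explicit type IDs, ignore subjects, then run the near-duplicate check and stop at the first hit) can be sketched roughly as follows. This is a minimal in-memory illustration, not the code from the PR: `SketchEntity`, `CandidateSearchSketch`, the string type IDs, and the single numeric `value` property are all hypothetical stand-ins for the Rya classes and the MongoDB query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

/** Hypothetical stand-in for an Entity: a subject, its explicit type IDs, and one numeric property. */
final class SketchEntity {
    final String subject;
    final Set<String> explicitTypeIds;
    final double value;

    SketchEntity(final String subject, final Set<String> explicitTypeIds, final double value) {
        this.subject = subject;
        this.explicitTypeIds = explicitTypeIds;
        this.value = value;
    }
}

final class CandidateSearchSketch {
    /** Convenience helper for building a set of type IDs. */
    static Set<String> types(final String... ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    /** Returns the stored entities that carry every one of the given explicit type IDs. */
    static List<SketchEntity> searchHasAllExplicitTypes(final List<SketchEntity> stored, final Set<String> typeIds) {
        final List<SketchEntity> matches = new ArrayList<>();
        if (typeIds.isEmpty()) {
            return matches; // nothing to narrow by, so no candidates
        }
        for (final SketchEntity candidate : stored) {
            if (candidate.explicitTypeIds.containsAll(typeIds)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    /** Runs the near-duplicate check over the candidates and stops at the first hit. Subjects are ignored. */
    static boolean detectDuplicates(final SketchEntity entity, final List<SketchEntity> stored,
            final BiPredicate<SketchEntity, SketchEntity> nearDuplicate) {
        for (final SketchEntity candidate : searchHasAllExplicitTypes(stored, entity.explicitTypeIds)) {
            if (nearDuplicate.test(entity, candidate)) {
                return true;
            }
        }
        return false;
    }
}
```

The comparison itself is passed in as a predicate here so the sketch stays independent of how tolerances are configured.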
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r133773816

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java ---
    @@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
             return Stream.of(dataTypeFilter, valueFilter);
         }
    +
    +    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
    +        boolean hasDuplicate = false;
    +        if (duplicateDataDetector.isDetectionEnabled()) {
    +            if (mongoTypeStorage == null) {
    +                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
    +            }
    +            final Builder builder = new Builder();
    +            builder.setSubject(entity.getSubject());
    +            boolean abort = false;
    +            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
    +                Optional type;
    +                try {
    +                    type = mongoTypeStorage.get(typeRyaUri);
    +                } catch (final TypeStorageException e) {
    +                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
    +                }
    +                if (type.isPresent()) {
    +                    final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet());
    +                    while (cursor.hasNext()) {
    +                        final TypedEntity typedEntity = cursor.next();
    --- End diff --

    If I recall correctly, a TypedEntity is one part of an Entity, and an Entity may be made up of several TypedEntities (e.g., an Entity could be made up of a person TypedEntity and an employee TypedEntity). So TypedEntity shouldn't extend Entity.
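To illustrate the composition relationship described here, the following is a rough sketch in which an entity groups its properties by type and hands out one typed view per type. The class and method names (`ComposedEntitySketch`, `TypedViewSketch`, `makeTypedView`) are hypothetical and are not claimed to match the Rya API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical typed view: the properties one subject holds for a single type. */
final class TypedViewSketch {
    final String subject;
    final String typeId;
    final Map<String, String> properties;

    TypedViewSketch(final String subject, final String typeId, final Map<String, String> properties) {
        this.subject = subject;
        this.typeId = typeId;
        this.properties = properties;
    }
}

/** Hypothetical entity: one subject whose properties are grouped by type ID. */
final class ComposedEntitySketch {
    private final String subject;
    private final Map<String, Map<String, String>> propertiesByType = new HashMap<>();

    ComposedEntitySketch(final String subject) {
        this.subject = subject;
    }

    /** Stores a property under the given type and returns this entity for chaining. */
    ComposedEntitySketch setProperty(final String typeId, final String name, final String value) {
        propertiesByType.computeIfAbsent(typeId, k -> new HashMap<>()).put(name, value);
        return this;
    }

    /** Returns the view for one type, or empty if the entity does not have that type. */
    Optional<TypedViewSketch> makeTypedView(final String typeId) {
        final Map<String, String> properties = propertiesByType.get(typeId);
        if (properties == null) {
            return Optional.empty();
        }
        final Map<String, String> copy = Collections.unmodifiableMap(new HashMap<>(properties));
        return Optional.of(new TypedViewSketch(subject, typeId, copy));
    }
}
```

Because a typed view is a slice of an entity rather than a kind of entity, inheritance would not model the relationship well, which is the point made in the comment.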
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user meiercaleb commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r133742379

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java ---
    @@ -0,0 +1,1066 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.rya.indexing.smarturi.duplication;
    +
    +import static java.util.Objects.requireNonNull;
    +
    +import java.math.BigDecimal;
    +import java.util.ArrayList;
    +import java.util.Date;
    +import java.util.HashMap;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Map.Entry;
    +import java.util.Objects;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +
    +import org.apache.commons.configuration.ConfigurationException;
    +import org.apache.commons.lang.StringUtils;
    +import org.apache.rya.api.domain.RyaType;
    +import org.apache.rya.api.domain.RyaURI;
    +import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
    +import org.apache.rya.indexing.entity.model.Entity;
    +import org.apache.rya.indexing.entity.model.Property;
    +import org.apache.rya.indexing.smarturi.SmartUriAdapter;
    +import org.apache.rya.indexing.smarturi.SmartUriException;
    +import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
    +import org.calrissian.mango.types.exception.TypeEncodingException;
    +import org.joda.time.DateTime;
    +import org.openrdf.model.URI;
    +import org.openrdf.model.impl.URIImpl;
    +import org.openrdf.model.vocabulary.XMLSchema;
    +
    +import com.google.common.collect.ImmutableMap;
    +
    +/**
    + * Detects if two entities contain data that's nearly identical based on a set
    + * tolerance for each field's type. Two entities are considered nearly
    + * identical if all their properties are equal and/or within the specified
    + * tolerance for the property's object type. Setting all object type tolerances
    + * to 0 means that the objects need to be exactly equal to each other to be
    + * considered duplicates. Duplicate data detection can be enabled/disabled
    + * through configuration and each object type can have a tolerance based on
    + * either the difference or the percentage difference between the objects being
    + * compared.
    + */
    +public class DuplicateDataDetector {
    +    private final Map<URI, ApproxEqualsDetector<?>> uriMap = new HashMap<>();
    +    private final Map<Class<?>, ApproxEqualsDetector<?>> classMap = new HashMap<>();
    +
    +    private boolean isDetectionEnabled;
    +
    +    /**
    +     * Creates a new instance of {@link DuplicateDataDetector} with the
    +     * values provided by the configuration file.
    +     * @param duplicateDataConfig the {@link DuplicateDataConfig}
    +     */
    +    public DuplicateDataDetector(final DuplicateDataConfig duplicateDataConfig) {
    +        this(duplicateDataConfig.getBooleanTolerance(),
    +            duplicateDataConfig.getByteTolerance(),
    +            duplicateDataConfig.getDateTolerance(),
    +            duplicateDataConfig.getDoubleTolerance(),
    +            duplicateDataConfig.getFloatTolerance(),
    +            duplicateDataConfig.getIntegerTolerance(),
    +            duplicateDataConfig.getLongTolerance(),
    +            duplicateDataConfig.getShortTolerance(),
    +            duplicateDataConfig.getStringTolerance(),
    +            duplicateDataConfig.getUriTolerance(),
    +            duplicateDataConfig.getEquivalentTermsMap(),
    +            duplicateDataConfig.isDetectionEnabled()
    +        );
    +    }
    +
    +    /**
    +     * Creates a new instance of {@link DuplicateDataDetector} with the values
    +     * from the config.
    +     * @throws ConfigurationException
    +     */
    +    public DuplicateDataDetector() throws ConfigurationException {
    +        this(new DuplicateDataConfig());
    +    }
    +
    +    /**
    +     * Creates a new instance of {@link DuplicateDataDetector}.
    +     * @param tolera
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user meiercaleb commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r133739286

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java ---
    @@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
             return Stream.of(dataTypeFilter, valueFilter);
         }
    +
    +    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
    +        boolean hasDuplicate = false;
    +        if (duplicateDataDetector.isDetectionEnabled()) {
    +            if (mongoTypeStorage == null) {
    +                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
    +            }
    +            final Builder builder = new Builder();
    +            builder.setSubject(entity.getSubject());
    +            boolean abort = false;
    +            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
    +                Optional type;
    +                try {
    +                    type = mongoTypeStorage.get(typeRyaUri);
    +                } catch (final TypeStorageException e) {
    +                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
    +                }
    +                if (type.isPresent()) {
    +                    final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet());
    +                    while (cursor.hasNext()) {
    +                        final TypedEntity typedEntity = cursor.next();
    --- End diff --

    Why doesn't TypedEntity extend Entity? Also, I noticed that your compareEntities method for the DuplicateDataDetector applies to Entities but then immediately checks for Type. Obviously you want to return false if two Entities don't have the same Type, but maybe it would be useful to have a compareTypedEntities method? Then your compareEntities method could effectively delegate to that (check for Type and if the Types are the same, convert the Entities to TypedEntities and call the compareTypedEntities method). It seems like a compareTypedEntities method would align with the use case better -- there would be no need to convert all of the TypedEntities returned in this loop to an Entity. You could convert the given Entity to a TypedEntity if it has a Type, and do a direct comparison to each TypedEntity in this loop.
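The delegation suggested in this comment could look roughly like the sketch below: the entity-level comparison checks that both sides carry the same types and then hands each type's properties to a typed comparison. Everything here is an assumption made for illustration: entities are modeled as plain maps from type ID to numeric properties, and a single absolute tolerance replaces the per-datatype tolerances used in the PR.

```java
import java.util.Map;

final class TypedComparisonSketch {
    /** Compares the properties held for one type: same property names, each value within the tolerance. */
    static boolean compareTypedEntities(final Map<String, Double> first, final Map<String, Double> second,
            final double tolerance) {
        if (!first.keySet().equals(second.keySet())) {
            return false;
        }
        for (final Map.Entry<String, Double> entry : first.entrySet()) {
            final double other = second.get(entry.getKey());
            if (Math.abs(entry.getValue() - other) > tolerance) {
                return false;
            }
        }
        return true;
    }

    /** Checks that both entities have the same types, then delegates per type. */
    static boolean compareEntities(final Map<String, Map<String, Double>> first,
            final Map<String, Map<String, Double>> second, final double tolerance) {
        if (!first.keySet().equals(second.keySet())) {
            return false; // different types can never be duplicates
        }
        for (final String typeId : first.keySet()) {
            if (!compareTypedEntities(first.get(typeId), second.get(typeId), tolerance)) {
                return false;
            }
        }
        return true;
    }
}
```

With this split, a caller that already holds typed results from a cursor could call the typed comparison directly instead of first rebuilding a full entity.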
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user meiercaleb commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r133737467

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java ---
    @@ -242,4 +281,46 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
             return Stream.of(dataTypeFilter, valueFilter);
         }
    +
    +    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
    +        boolean hasDuplicate = false;
    +        if (duplicateDataDetector.isDetectionEnabled()) {
    +            if (mongoTypeStorage == null) {
    +                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
    +            }
    +            final Builder builder = new Builder();
    +            builder.setSubject(entity.getSubject());
    +            boolean abort = false;
    +            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
    +                Optional type;
    +                try {
    +                    type = mongoTypeStorage.get(typeRyaUri);
    +                } catch (final TypeStorageException e) {
    +                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
    +                }
    +                if (type.isPresent()) {
    +                    final ConvertingCursor cursor = search(Optional.empty(), type.get(), Collections.emptySet());
    +                    while (cursor.hasNext()) {
    --- End diff --

    I'm not quite following what you are doing with the Entity Builder here. It seems like you are using it primarily to convert each TypedEntity returned in this loop to an Entity. If that is the case, you should be creating a new Builder for each TypedEntity and then doing your duplicate comparison within this loop. As it is currently written, it seems like you are just overwriting properties as you iterate through the TypedEntities.
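The overwriting concern raised here can be reproduced with a small sketch. It assumes a hypothetical builder (`EntityBuilderSketch`) whose later `setProperty` calls replace earlier values with the same name, and it models each typed result as a plain map; neither is taken from the Rya code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical builder: a later value replaces an earlier one with the same property name. */
final class EntityBuilderSketch {
    private final Map<String, String> properties = new LinkedHashMap<>();

    EntityBuilderSketch setProperty(final String name, final String value) {
        properties.put(name, value);
        return this;
    }

    Map<String, String> build() {
        return new LinkedHashMap<>(properties);
    }
}

final class BuilderPerCandidateSketch {
    /** One shared builder: every typed result writes into the same property map, so values collide. */
    static Map<String, String> buildWithSharedBuilder(final List<Map<String, String>> typedResults) {
        final EntityBuilderSketch builder = new EntityBuilderSketch();
        for (final Map<String, String> typed : typedResults) {
            typed.forEach(builder::setProperty);
        }
        return builder.build();
    }

    /** A fresh builder per typed result: each candidate keeps its own values and can be compared on its own. */
    static List<Map<String, String>> buildOnePerCandidate(final List<Map<String, String>> typedResults) {
        final List<Map<String, String>> candidates = new ArrayList<>();
        for (final Map<String, String> typed : typedResults) {
            final EntityBuilderSketch builder = new EntityBuilderSketch();
            typed.forEach(builder::setProperty);
            candidates.add(builder.build());
        }
        return candidates;
    }
}
```

In the shared-builder variant only the last candidate's values survive, so any comparison made afterwards sees a single blended entity rather than each stored candidate.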
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r132695882

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java ---
    @@ -0,0 +1,1059 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.rya.indexing.smarturi.duplication;
    +
    +import static java.util.Objects.requireNonNull;
    +
    +import java.math.BigDecimal;
    +import java.util.ArrayList;
    +import java.util.Date;
    +import java.util.HashMap;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Map.Entry;
    +import java.util.Objects;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +
    +import org.apache.commons.configuration.ConfigurationException;
    +import org.apache.commons.lang.StringUtils;
    +import org.apache.rya.api.domain.RyaType;
    +import org.apache.rya.api.domain.RyaURI;
    +import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
    +import org.apache.rya.indexing.entity.model.Entity;
    +import org.apache.rya.indexing.entity.model.Property;
    +import org.apache.rya.indexing.smarturi.SmartUriAdapter;
    +import org.apache.rya.indexing.smarturi.SmartUriException;
    +import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
    +import org.calrissian.mango.types.exception.TypeEncodingException;
    +import org.joda.time.DateTime;
    +import org.openrdf.model.URI;
    +import org.openrdf.model.impl.URIImpl;
    +import org.openrdf.model.vocabulary.XMLSchema;
    +
    +import com.google.common.collect.ImmutableMap;
    +
    +/**
    + * Detects if two entities contain data that's nearly identical based on a set
    --- End diff --

    I updated the javadocs.
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user ejwhite922 commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r132695871

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java ---
    @@ -242,4 +282,49 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
             return Stream.of(dataTypeFilter, valueFilter);
         }
    +
    +    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
    +        boolean hasDuplicate = false;
    +        if (duplicateDataDetector.isDetectionEnabled()) {
    +            if (mongoTypeStorage == null) {
    +                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
    +            }
    +            final Builder builder = new Builder();
    +            builder.setSubject(entity.getSubject());
    +            boolean abort = false;
    +            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
    +                final ImmutableMap typePropertyMap = entity.getProperties().get(typeRyaUri);
    +                final Set properties = new HashSet<>(typePropertyMap.values());
    +                Optional type;
    +                try {
    +                    type = mongoTypeStorage.get(typeRyaUri);
    +                } catch (final TypeStorageException e) {
    +                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
    +                }
    +                if (type.isPresent()) {
    +                    //final ConvertingCursor cursor = search(Optional.empty(), type.get(), properties);
    --- End diff --

    Removed.
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user isper3at commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r132585736

    --- Diff: common/rya.api/src/main/java/org/apache/rya/api/domain/RyaTypeUtils.java ---
    @@ -24,12 +24,44 @@
     import org.joda.time.DateTimeZone;
     import org.joda.time.format.ISODateTimeFormat;
     import org.openrdf.model.URI;
    +import org.openrdf.model.impl.URIImpl;
     import org.openrdf.model.vocabulary.XMLSchema;

    +import com.google.common.collect.ImmutableMap;
    +
     /**
      * Utility methods for using {@link RyaType}.
      */
     public final class RyaTypeUtils {
    +    private static final ImmutableMap<Class<?>, RyaTypeMethod> METHOD_MAP =
    +        ImmutableMap.<Class<?>, RyaTypeMethod>builder()
    +            .put(Boolean.class, (v) -> booleanRyaType((Boolean) v))
    +            .put(Byte.class, (v) -> byteRyaType((Byte) v))
    +            .put(Date.class, (v) -> dateRyaType((Date) v))
    +            .put(DateTime.class, (v) -> dateRyaType((DateTime) v))
    +            .put(Double.class, (v) -> doubleRyaType((Double) v))
    +            .put(Float.class, (v) -> floatRyaType((Float) v))
    +            .put(Integer.class, (v) -> intRyaType((Integer) v))
    +            .put(Long.class, (v) -> longRyaType((Long) v))
    +            .put(Short.class, (v) -> shortRyaType((Short) v))
    +            .put(String.class, (v) -> stringRyaType((String) v))
    +            .put(URI.class, (v) -> uriRyaType((URI) v))
    +            .put(URIImpl.class, (v) -> uriRyaType((URIImpl) v))
    +            .build();
    +
    +    /**
    +     * Represents a method inside the {@link RyaTypeUtils} class that can be
    +     * call.
    +     */
    +    private static interface RyaTypeMethod {
    --- End diff --

    ignore this
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user isper3at commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r132576212

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/smarturi/duplication/DuplicateDataDetector.java ---
    @@ -0,0 +1,1059 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements. See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership. The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing,
    + * software distributed under the License is distributed on an
    + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    + * KIND, either express or implied. See the License for the
    + * specific language governing permissions and limitations
    + * under the License.
    + */
    +package org.apache.rya.indexing.smarturi.duplication;
    +
    +import static java.util.Objects.requireNonNull;
    +
    +import java.math.BigDecimal;
    +import java.util.ArrayList;
    +import java.util.Date;
    +import java.util.HashMap;
    +import java.util.LinkedHashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Map.Entry;
    +import java.util.Objects;
    +import java.util.Optional;
    +import java.util.Set;
    +import java.util.TreeSet;
    +
    +import org.apache.commons.configuration.ConfigurationException;
    +import org.apache.commons.lang.StringUtils;
    +import org.apache.rya.api.domain.RyaType;
    +import org.apache.rya.api.domain.RyaURI;
    +import org.apache.rya.api.resolver.impl.DateTimeRyaTypeResolver;
    +import org.apache.rya.indexing.entity.model.Entity;
    +import org.apache.rya.indexing.entity.model.Property;
    +import org.apache.rya.indexing.smarturi.SmartUriAdapter;
    +import org.apache.rya.indexing.smarturi.SmartUriException;
    +import org.apache.rya.indexing.smarturi.duplication.conf.DuplicateDataConfig;
    +import org.calrissian.mango.types.exception.TypeEncodingException;
    +import org.joda.time.DateTime;
    +import org.openrdf.model.URI;
    +import org.openrdf.model.impl.URIImpl;
    +import org.openrdf.model.vocabulary.XMLSchema;
    +
    +import com.google.common.collect.ImmutableMap;
    +
    +/**
    + * Detects if two entities contain data that's nearly identical based on a set
    --- End diff --

    Nearly identical? Can you define that a bit more in the class docs?
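As one possible reading of "nearly identical", the class javadoc elsewhere in this thread mentions a tolerance based on either the difference or the percentage difference between two values, with a tolerance of 0 meaning exact equality. The sketch below illustrates that idea for doubles only. The names `ToleranceSketch` and `ToleranceType` are hypothetical, and two details are assumptions rather than anything stated in the thread: the percentage is expressed as a fraction (0.02 for 2%) and is measured relative to the first value.

```java
final class ToleranceSketch {
    /** Hypothetical tolerance kinds: absolute difference or relative (percentage) difference. */
    enum ToleranceType { DIFFERENCE, PERCENTAGE }

    /** Returns true when the two values are equal or differ by no more than the tolerance. */
    static boolean approxEquals(final double lhs, final double rhs, final double tolerance,
            final ToleranceType type) {
        if (lhs == rhs) {
            return true; // exact matches always count, including when the tolerance is 0
        }
        final double difference = Math.abs(lhs - rhs);
        switch (type) {
            case DIFFERENCE:
                return difference <= tolerance;
            case PERCENTAGE:
                if (lhs == 0.0) {
                    return false; // assumption: a zero reference value only matches exactly
                }
                return difference / Math.abs(lhs) <= tolerance;
            default:
                throw new IllegalArgumentException("Unknown tolerance type: " + type);
        }
    }
}
```

Under these assumptions, 100.0 and 101.0 are near duplicates with a difference tolerance of 1.5 or a percentage tolerance of 0.02, while 100.0 and 103.0 fail the 0.02 percentage check.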
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user isper3at commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r132571187

    --- Diff: extras/indexing/src/main/java/org/apache/rya/indexing/entity/storage/mongo/MongoEntityStorage.java ---
    @@ -242,4 +282,49 @@ private static Bson makeExplicitTypeFilter(final RyaURI typeId) {
             return Stream.of(dataTypeFilter, valueFilter);
         }
    +
    +    private boolean detectDuplicates(final Entity entity) throws EntityStorageException {
    +        boolean hasDuplicate = false;
    +        if (duplicateDataDetector.isDetectionEnabled()) {
    +            if (mongoTypeStorage == null) {
    +                mongoTypeStorage = new MongoTypeStorage(mongo, ryaInstanceName);
    +            }
    +            final Builder builder = new Builder();
    +            builder.setSubject(entity.getSubject());
    +            boolean abort = false;
    +            for (final RyaURI typeRyaUri : entity.getExplicitTypeIds()) {
    +                final ImmutableMap typePropertyMap = entity.getProperties().get(typeRyaUri);
    +                final Set properties = new HashSet<>(typePropertyMap.values());
    +                Optional type;
    +                try {
    +                    type = mongoTypeStorage.get(typeRyaUri);
    +                } catch (final TypeStorageException e) {
    +                    throw new EntityStorageException("Unable to get entity type: " + typeRyaUri, e);
    +                }
    +                if (type.isPresent()) {
    +                    //final ConvertingCursor cursor = search(Optional.empty(), type.get(), properties);
    --- End diff --

    Commented-out code.
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
Github user isper3at commented on a diff in the pull request:

    https://github.com/apache/incubator-rya/pull/153#discussion_r132557770

    --- Diff: common/rya.api/src/main/java/org/apache/rya/api/domain/RyaTypeUtils.java ---
    @@ -24,12 +24,44 @@
     import org.joda.time.DateTimeZone;
     import org.joda.time.format.ISODateTimeFormat;
     import org.openrdf.model.URI;
    +import org.openrdf.model.impl.URIImpl;
     import org.openrdf.model.vocabulary.XMLSchema;

    +import com.google.common.collect.ImmutableMap;
    +
     /**
      * Utility methods for using {@link RyaType}.
      */
     public final class RyaTypeUtils {
    +    private static final ImmutableMap<Class<?>, RyaTypeMethod> METHOD_MAP =
    +        ImmutableMap.<Class<?>, RyaTypeMethod>builder()
    +            .put(Boolean.class, (v) -> booleanRyaType((Boolean) v))
    +            .put(Byte.class, (v) -> byteRyaType((Byte) v))
    +            .put(Date.class, (v) -> dateRyaType((Date) v))
    +            .put(DateTime.class, (v) -> dateRyaType((DateTime) v))
    +            .put(Double.class, (v) -> doubleRyaType((Double) v))
    +            .put(Float.class, (v) -> floatRyaType((Float) v))
    +            .put(Integer.class, (v) -> intRyaType((Integer) v))
    +            .put(Long.class, (v) -> longRyaType((Long) v))
    +            .put(Short.class, (v) -> shortRyaType((Short) v))
    +            .put(String.class, (v) -> stringRyaType((String) v))
    +            .put(URI.class, (v) -> uriRyaType((URI) v))
    +            .put(URIImpl.class, (v) -> uriRyaType((URIImpl) v))
    +            .build();
    +
    +    /**
    +     * Represents a method inside the {@link RyaTypeUtils} class that can be
    +     * call.
    +     */
    +    private static interface RyaTypeMethod {
    --- End diff --

    Confused. Is this for reflection?
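For readers with the same question, the quoted map appears to be a class-keyed dispatch table of lambdas rather than reflection: the runtime class of a value selects a conversion function that was registered ahead of time. A minimal sketch of that pattern follows. It uses `java.util.function.Function` and plain strings in place of the PR's `RyaTypeMethod` and `RyaType`, and the names `ClassDispatchSketch` and `convert` are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class ClassDispatchSketch {
    /** Maps a value's exact runtime class to the conversion to apply. */
    private static final Map<Class<?>, Function<Object, String>> METHOD_MAP = new HashMap<>();

    static {
        METHOD_MAP.put(Boolean.class, v -> "boolean:" + (Boolean) v);
        METHOD_MAP.put(Integer.class, v -> "int:" + (Integer) v);
        METHOD_MAP.put(String.class, v -> "string:" + (String) v);
    }

    /** Looks up the converter registered for the value's class and applies it. */
    static String convert(final Object value) {
        final Function<Object, String> method = METHOD_MAP.get(value.getClass());
        if (method == null) {
            throw new IllegalArgumentException("No converter registered for " + value.getClass().getName());
        }
        return method.apply(value);
    }
}
```

Since the lookup uses the exact runtime class, an interface and its implementation would each need their own entry, which may be why the quoted map registers both `URI.class` and `URIImpl.class`.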
[GitHub] incubator-rya pull request #153: RYA-250 Smart URI avoiding data duplication
GitHub user ejwhite922 opened a pull request:

    https://github.com/apache/incubator-rya/pull/153

    RYA-250 Smart URI avoiding data duplication

    ## Description
    Added data duplication detection methods to Smart URI/Entities. These use configured tolerances for each data type to decide if an Entity is considered nearly equal. Also, string terms that are considered equivalent can be configured.

    **! NOTE !** Only review the latest commit. The other commit is from another PR.

    ### Tests
    Unit tests

    ### Links
    [Jira](https://issues.apache.org/jira/browse/RYA-250)

    ### Checklist
    - [ ] Code Review
    - [ ] Squash Commits

    People To Review
    @kchilton2 @isper3at @meiercaleb @pujav65 @amihalik @DLotts

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ejwhite922/incubator-rya RYA-250_SmartURIAvoidingDataDuplication

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-rya/pull/153.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #153

commit 7f84778a90d5b33d5332e21e43a838b7c6f54f6d
Author: eric.white
Date: 2017-02-27T16:12:05Z

    RYA-250 Smart URI

commit 7e02475a024267afa9f2dca74d384819618c9c61
Author: eric.white
Date: 2017-04-12T15:15:03Z

    RYA-250 Added data duplication detection methods to Smart URI/Entities. These use configured tolerances for each data type to decide if an Entity is considered nearly equal. Also, string terms that are considered equivalent can be configured.
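The description mentions that string terms considered equivalent can be configured. One way such a lookup could work is sketched below, assuming the configuration is a map from a term to a list of its equivalents; the class `EquivalentTermsSketch`, the method `areEquivalent`, and the symmetric lookup are assumptions for illustration and are not taken from the PR.

```java
import java.util.List;
import java.util.Map;

final class EquivalentTermsSketch {
    /** Returns true when the terms are equal or one is configured as an equivalent of the other. */
    static boolean areEquivalent(final String first, final String second,
            final Map<String, List<String>> equivalentTerms) {
        if (first.equals(second)) {
            return true;
        }
        // Check both directions so the order of the arguments does not matter.
        return isListed(first, second, equivalentTerms) || isListed(second, first, equivalentTerms);
    }

    /** Returns true when other appears in the list configured for term. */
    private static boolean isListed(final String term, final String other,
            final Map<String, List<String>> equivalentTerms) {
        final List<String> equivalents = equivalentTerms.get(term);
        return equivalents != null && equivalents.contains(other);
    }
}
```

In this sketch two terms that are both listed under the same key, but where neither is itself a key, are not treated as equivalent; a fuller version would need to decide whether equivalence should be transitive.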