daniel added a comment.

To reduce the memory consumption of the approach I suggested above, use only part of the hash (the first few digits) as the key in the "seen" array, and keep the full hash as the value. For instance, using 4 hex digits would limit the size of the "seen" list to 2^16 entries.
When looking up `$x` (with `$x` the full hash and `key($x)` its first few digits):

- `!isset( $seen[ key($x) ] )` -> not seen
- `isset( $seen[ key($x) ] ) && $seen[ key($x) ] === $x` -> seen
- `isset( $seen[ key($x) ] ) && $seen[ key($x) ] !== $x` -> probably not seen

Finally, set `$seen[ key($x) ] = $x`.

Hat tip to http://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/ and https://news.ycombinator.com/item?id=4251313

TASK DETAIL
  https://phabricator.wikimedia.org/T92586
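For illustration, the lookup rules in this comment could be sketched in PHP roughly as follows. This is only a minimal sketch, not code from the task: the names `seenKey` and `checkAndMark` are hypothetical (chosen to avoid clashing with PHP's built-in `key()`), the hashes are assumed to be hex strings, and the sample hashes are made up.

```php
<?php
// Hypothetical helper: the first $digits hex digits of the hash form the bucket key.
// With 4 hex digits there are at most 16^4 = 2^16 distinct keys.
function seenKey(string $hash, int $digits = 4): string {
    return substr($hash, 0, $digits);
}

// Returns true only if $hash is the value currently stored in its bucket.
// In every case the bucket is overwritten with $hash afterwards.
function checkAndMark(array &$seen, string $hash, int $digits = 4): bool {
    $key = seenKey($hash, $digits);
    // An empty bucket, or a bucket holding a different hash, counts as "not seen".
    $wasSeen = isset($seen[$key]) && $seen[$key] === $hash;
    $seen[$key] = $hash;
    return $wasSeen;
}

// Made-up sample hashes; the first two share the key "abcd".
$seen = [];
var_dump(checkAndMark($seen, 'abcd1111')); // bool(false): bucket was empty
var_dump(checkAndMark($seen, 'abcd1111')); // bool(true): same hash is stored
var_dump(checkAndMark($seen, 'abcd2222')); // bool(false): same key, different hash
var_dump(checkAndMark($seen, 'abcd1111')); // bool(false): it was evicted by the previous call
```

As the last call shows, a "seen" answer is reliable as long as the full hashes do not collide, while a "not seen" answer can be wrong once a colliding hash has overwritten the bucket. That is the trade-off accepted in exchange for the bounded array size.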
