My use case is simple: count the unique users within a sliding window. The
common solution I found on the Internet is to use a HashSet to filter out
duplicate users, like this:
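// Package names are for Storm 0.9.x; newer releases use org.apache.storm.trident.*
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import storm.trident.operation.BaseFilter;
import storm.trident.tuple.TridentTuple;

public class Distinct extends BaseFilter {
    private static final long serialVersionUID = 1L;

    // Every id seen so far; entries are never removed, so this grows with the data.
    private Set<String> distincter =
            Collections.synchronizedSet(new HashSet<String>());

    @Override
    public boolean isKeep(TridentTuple tuple) {
        String id = this.getId(tuple);
        // add() returns false when the id is already present, so duplicates are dropped.
        return distincter.add(id);
    }

    // Builds one key from all fields of the tuple (assumes every field is a String).
    public String getId(TridentTuple t) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.size(); i++) {
            // Separator (assumed absent from field values) keeps ("ab","c") and ("a","bc") distinct.
            sb.append(t.getString(i)).append('\u0001');
        }
        return sb.toString();
    }
}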
However, the HashSet is held in memory, so when the data grows very large I
think it will cause an OutOfMemoryError (OOM).
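To illustrate the concern, here is a minimal standalone sketch in plain Java
with no Storm dependency. The class name DistinctGrowthSketch and the "user-N"
ids are made up for illustration only. It mirrors the add-based check from the
filter above and shows that the set retains one entry per distinct id, since
nothing in the filter ever removes entries:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class DistinctGrowthSketch {
    // Same structure as the filter's field: an exact set that is never evicted.
    private final Set<String> seen =
            Collections.synchronizedSet(new HashSet<String>());

    // Mirrors isKeep(): true the first time an id is seen, false afterwards.
    public boolean isKeep(String id) {
        return seen.add(id);
    }

    // Number of ids currently held in memory.
    public int retainedIds() {
        return seen.size();
    }

    public static void main(String[] args) {
        DistinctGrowthSketch filter = new DistinctGrowthSketch();
        int kept = 0;
        for (int i = 0; i < 100000; i++) {
            // Hypothetical ids: 50000 distinct users, each appearing twice.
            String id = "user-" + (i % 50000);
            if (filter.isKeep(id)) {
                kept++;
            }
        }
        // The retained count equals the number of distinct ids ever seen.
        System.out.println("kept=" + kept + " retained=" + filter.retainedIds());
    }
}
```

The retained count tracks the number of distinct ids seen over the whole
lifetime of the filter, not just the current window, which is why memory use
keeps growing with the data.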
So is there a scalable solution?
2014-07-14
唐思成