distinctapprox
distinctapprox, aggregator
def process distinctapprox(value: text): number
def process distinctapprox(value: number): number
The aggregator approximates the number of distinct elements in a group. Two elements a and b are considered identical iff a == b holds true. It is intended as a faster alternative to distinct for especially large datasets, typically those beyond one million elements.
If the group is empty, this aggregator returns 0.
Examples
table T = with
  [| as A, as B |]
  [| 1, "a" |]
  [| 1, "a" |]
  [| 2, "b" |]
  [| 3, "b" |]
  [| 4, "c" |]
table G[gdim] = by T.B
where T.B != "c"
  show table "" a1b4 with
    gdim
    distinctapprox(T.A)
    group by gdim
The above code results in the following table:
gdim | distinctapprox(T.A) |
---|---|
a | 1 |
b | 2 |
c | 0 |
Since the only line where T.B equals "c" is filtered out by where T.B != "c", the group c is empty and distinctapprox(T.A) returns 0 for this group.
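The grouping behavior of this example can be mimicked outside Envision. The Python sketch below is an illustration only, not a description of how Envision evaluates the script: it uses an exact set-based count in place of the approximation, and the names rows and count_distinct_by_group are hypothetical.

```python
# Rows of T as (A, B) pairs, mirroring the Envision table above.
rows = [(1, "a"), (1, "a"), (2, "b"), (3, "b"), (4, "c")]


def count_distinct_by_group(rows, keep):
    """Exact stand-in for distinctapprox(T.A) grouped by T.B (hypothetical helper)."""
    # Groups come from the unfiltered table, like `table G[gdim] = by T.B`.
    seen = {b: set() for _, b in rows}
    # Only rows that pass the filter contribute values to their group.
    for a, b in rows:
        if keep(b):
            seen[b].add(a)
    # A group left without rows yields 0, as stated for empty groups.
    return {b: len(values) for b, values in seen.items()}


result = count_distinct_by_group(rows, keep=lambda b: b != "c")
print(result)  # {'a': 1, 'b': 2, 'c': 0}
```

The group c still appears in the output because the groups are defined before the filter is applied; only its content is empty.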
Remarks
The aggregator distinctapprox supports only the text and number data types. The rationale is that, for other data types, the maximum number of distinct elements is small enough that distinct can achieve satisfactory performance without approximation.
Advanced remark: The approximation employed by distinctapprox is based on the HyperLogLog algorithm.
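To give an intuition of how HyperLogLog trades accuracy for speed and memory, here is a minimal textbook-style sketch in Python. It is not Envision's implementation: the function name hll_estimate, the register count (2**12), the choice of hash function, and the use of repr to serialize values are all assumptions made for illustration.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values using 2**p registers.

    Textbook HyperLogLog with the small-range (linear counting) correction.
    Assumes p >= 7 and that equal values share the same repr().
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Derive a 64-bit hash from the value's textual representation.
        digest = hashlib.blake2b(repr(v).encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - p)                  # top p bits select a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit among the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)         # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # For small cardinalities, fall back to linear counting on empty registers.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)
    return raw


# One million inputs, but only 100,000 distinct values.
data = [i % 100_000 for i in range(1_000_000)]
estimate = hll_estimate(data)
print(round(estimate))
```

The memory used depends only on the number of registers, not on the size of the input, which is what makes the approach attractive for very large groups. With 2**12 registers, the typical relative error of the textbook algorithm is about 1.04 / sqrt(4096), roughly 1.6%. The sketch also returns 0 for an empty input, in line with the behavior of distinctapprox on empty groups.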