distinctapprox

distinctapprox, aggregator

def process distinctapprox(value: text  ): number
def process distinctapprox(value: number): number

The aggregator approximates the number of distinct elements in a group. Two elements a and b are considered identical if and only if a == b holds. It is intended as a faster alternative to distinct for very large datasets, typically those beyond one million elements.

If the group is empty, this aggregator returns 0.

Examples

table T = with
  [| as A, as B |]
  [| 1,   "a"   |]
  [| 1,   "a"   |]
  [| 2,   "b"   |]
  [| 3,   "b"   |]
  [| 4,   "c"   |]

table G[gdim] = by T.B

where T.B != "c"
  show table "" a1b4 with
    gdim
    distinctapprox(T.A)
    group by gdim

The above code results in the following table:

gdim  distinctapprox(T.A)
a     1
b     2
c     0

Since every line with T.B equal to "c" is filtered out by where T.B != "c", the group c is empty and distinctapprox(T.A) returns 0 for it.

Remarks

The aggregator distinctapprox supports only the text and number data types. The reasoning is that, for other data types, the maximum number of distinct elements is small enough for distinct to achieve satisfactory performance without approximation.

Advanced remark: The approximation employed by distinctapprox is based on the HyperLogLog algorithm.
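
To give an intuition for the remark above, here is a minimal HyperLogLog sketch written in Python. It is an illustration only, not the implementation used by distinctapprox: the function name hll_estimate, the parameter p (the number of hash bits used to select a register), and the choice of SHA-256 as the hash function are all assumptions made for this example.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Approximate the number of distinct values with a basic HyperLogLog sketch.

    Hypothetical helper for illustration; not the Envision implementation.
    """
    m = 1 << p                 # number of registers (4096 when p = 12)
    registers = [0] * m
    for v in values:
        # Derive a 64-bit hash from the value's repr() (an assumption of this sketch).
        digest = hashlib.sha256(repr(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                        # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)           # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1    # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)               # bias correction, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting while registers are empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return round(m * math.log(m / zeros))
    return round(raw)


if __name__ == "__main__":
    print(hll_estimate([1, 1, 2, 3]))        # small group, duplicates ignored
    print(hll_estimate(range(1_000_000)))    # large group, approximate count
```

The memory used by the sketch depends only on the number of registers, not on the number of elements, which is what makes this family of algorithms attractive for very large groups. With 4096 registers, the typical relative error of HyperLogLog is around 1.6%. Like distinctapprox, this sketch returns 0 for an empty input.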
