Standard SQL CookBook - 3/N

RCTにおけるunit of observationのsampling - Farm Fingerprint

公開日: 2022-07-13
更新日: 2022-07-20

  概要
目的 RCTにおけるunit of observationのsampling - Farm Fingerprint
分類 SQL Cookbook
SQL Standard SQL
環境 BigQuery

Table of Contents

問題設定:

以下のような顧客マスタが与えられたとします.

  • customer_id:顧客ID, ユニークに割り振られている, 100万人程度登録されているとします, INT64
  • created_timestamp_jst: 顧客ID作成時点のtimestamp
  • birth_month: 顧客が入力した誕生月, NaNはないものとする, STRING
1
2
3
4
5
6
7
customer_id	created_timestamp	birth_month
3207193037871	2022-08-18 03:31:02	00
3110943233023	2022-07-08 23:55:30	01
3114655870191	2022-07-09 03:29:23	01
3096080323583	2022-07-08 11:03:10	01
3096393819039	2022-07-08 11:19:01	01
...

この顧客マスターから、顧客ID単位で5つのグループ(A, B, C, D, E)にランダムに分けたいとします. なおグループ分けにあたって, A, B, C, Dのサンプルサイズは 24,000 とし、さらにこのグループ内部ではbirth_month毎に2000人ずつ確保されている状態を目指します. A, B, C, Dに振り分けられなかった顧客IDはすべてグループEに振り分けられるものとします.

この記事ではGoogle Farm Fingerprintの値の順序を用いたランダムサンプリングを紹介します.

Farm Fingerprint

オープンソースの FarmHash ライブラリの Fingerprint64 関数を使用して、 STRING または BYTES 入力のフィンガープリントを計算します. 特定の入力に対するこの関数の出力は、決して変わらないという特徴があります (= ランダムサンプリング時には、何かしらのSaltを用いる必要がある).

FARM_FINGERPRINT function Syntax

1
FARM_FINGERPRINT(value) 

Return Value type

1
INT64

Example

1
2
3
4
5
6
7
8
9
10
11
12
13
WITH example AS (
  SELECT 1 AS x, "foo" AS y, true AS z UNION ALL
  SELECT 2 AS x, "apple" AS y, false AS z UNION ALL
  SELECT 3 AS x, "" AS y, true AS z UNION ALL
  SELECT 4 AS x, "" AS y, false AS z
)
SELECT
  *,
  FARM_FINGERPRINT(CAST(x AS STRING)) AS x_fingerprint,
  FARM_FINGERPRINT(y) AS y_fingerprint,
  FARM_FINGERPRINT(CAST(z AS STRING)) AS z_fingerprint,
  FARM_FINGERPRINT(CONCAT(CAST(x AS STRING), y, CAST(z AS STRING))) AS concat_fingerprint
FROM example;

Then,

1
2
3
4
5
Row x	y	z	x_fingerprint	y_fingerprint	z_fingerprint	row_fingerprint
1	foo	true	-9142586270102516767	6150913649986995171	8572766038233537570	-1541654101129638711
2	apple	false	6920640749119438759	6447335267136888601	3372626016653902757	2794438866806483259
3		true	-6455268178307048695	-7286425919675154353	8572766038233537570	-4880158226897771312
4		false	-3919889686402291073	-7286425919675154353	3372626016653902757	-5671577889189715922

(1) Farm Fingerprintはsaltさえランダムに選べば離散一様分布になるのか?

検証内容

  • Google Farmfingerprintの値の順序を用いてsortした場合、そのsortのindex順序は離散一様分布しているのか?を検証します
  • Google Farmfingerprintの値自体が離散一様分布しているかは検証していません

検証方針

  1. 任意の長さのunique valueからなるリストを作成する(計算資源の兼ね合いで今回は長さ6としている)
  2. (1)で作成したリストに対して、ランダムに作成したsaltと合わせてFarm Fingerprint valueを計算する
  3. Farm Fingerprint valueの順序に基づいて、リストをシャッフルする
  4. リストのシャッフル結果が、離散一様分布しているかchi square testにかける

Remarks

  • 重複ありの場合のPermutation Indexの取得方法としてstack overflow > Finding the index of a given permutationが存在する
  • こちらベースで一部修正して計算も試みたが、メモリエラーが発生したので今回は諦めます(実行時間もそんなに変わらなかった)

Python code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
#  Imports
## basics
import numpy as np
import matplotlib.pyplot as plt
import math
import random
import string
import seaborn as sns

## statistical analysis
from scipy import stats

## hash
import farmhash

## for cyclic list
from typing import List

# Class definition
class RandomShuffleTest:
    def __init__(self, ar, num_trial:int, salt_length:int=12, kind='farm_fingerprint'):
        # initialize
        self.ar          = ar
        self.num_trial   = num_trial
        self.salt_length = salt_length
        self.kind        = kind

        self.sorted_key = None
        self.result_salt = None
        self.freq = None

    def get_random_salt(self):
        # choose from all lowercase letter
        letters = string.ascii_lowercase
        self.result_salt = ''.join(random.choice(letters) for i in range(self.salt_length))

    def fit(self, output=False):
        n_perm = math.factorial(self.ar.length) if self.kind == 'naive_slide' else math.factorial(len(self.ar))
        trials = self.num_trial * n_perm
        freq = [0] * n_perm

        for _ in range(trials):
            if self.kind == 'farm_fingerprint':
              self.get_random_salt()
              self.farm_fingerprint_shuffle(ar=self.ar, salt=self.result_salt)
            
            elif self.kind == 'naive_slide':
              offset = _ % (self.ar.length - 1)
              self.sorted_key = self.ar.extract(offset=offset, width=self.ar.length)
            
            else:
              self.sorted_key = np.argsort(np.random.rand(len(self.ar)))
            
            freq[self.get_permutation_index(self.sorted_key)] += 1

        self.freq = freq


    def farm_fingerprint_shuffle(self, ar, salt):
        self.sorted_key = list(map(lambda x:abs(self.google_farm_fingerprint(x=x,salt=salt)), ar))

    @staticmethod
    def google_farm_fingerprint(x: str, salt:str):
        ### BigQuery version returns int64, while Python version returns unsigned int
        ### value must be between -9223372036854775808 and 9223372036854775807
        return np.uint64(farmhash.fingerprint64(x+salt)).astype('int64')

    @staticmethod
    def get_permutation_index(permutation):
        n = len(permutation)
        t = 0
        for i in range(n):
          t = t * (n - i)
          for j in range(i + 1, n):
            if permutation[i] > permutation[j]:
                t += 1
        return t

class CyclicList(object):

    def __init__(self, *, item: List[int]):
        self.item = item
        self.length: int = len(item)

    def extract(self, *, offset: int, width: int) -> List[int]:
        top: int = offset % self.length
        result: List[int] = self.item[top:top + width]
        length: int = len(result)

        while length < width:
            plus: List[int] = self.item[:width - length]
            result += plus
            length += len(plus)

        return result

# 検証
random.seed(100)
N= 2000
ar = list(range(6))
#cyclic_ar = CyclicList(item=ar)
#Knuth_test = RandomShuffleTest(ar=cyclic_ar,num_trial=N,kind='naive_slide')
Knuth_test = RandomShuffleTest(ar=list(map(str,ar)),num_trial=N)
#Knuth_test = RandomShuffleTest(ar=list(map(str,ar)),num_trial=N, kind='random')

Knuth_test.fit()
freq = Knuth_test.freq

ax = sns.displot(freq, kind="kde")
print(stats.chisquare(freq, f_exp=[N]*len(freq)))

Then,

1
Power_divergenceResult(statistic=697.7860000000001, pvalue=0.7079480282766375)

  • Permutation index結果をplotしてもトライアル回数(N=2000)が最頻値となっており、特定の順序が出やすいというわけではなさそう
  • Chi square testでも帰無仮説は棄却されていない

(2) Farm Fingerprintはsaltさえランダムに選べば離散一様分布になるのか?:簡易版

以下ではIDのpercentileとIDに対してfarm fingerprintを当てた値のpercentileの組を取得して、chi square testを実施する方法となります.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
%%bigquery df --project <your project-id>
create temp function max_const() AS(
      100000000000
    );

create temp function bucket_const() AS(
      100000
    );

WITH
    test_data AS(
        SELECT 
            x AS key_id, 
            FARM_FINGERPRINT(CAST(x AS STRING)) hash_value
        FROM 
            UNNEST((SELECT GENERATE_ARRAY(1,max_const(),bucket_const()) xs)) AS x
    )
SELECT DISTINCT
    CAST(CEIL(key_id/(max_const()/100)) AS INT64) AS nth_tile_key,
    CAST(CEIL(hash_value/(9223372036854775807/50))+50 AS INT64) AS nth_tile_hash,
    COUNT(DISTINCT key_id) AS frequency --- the extepected value should be 100
FROM
    test_data
GROUP BY 
    1, 2
ORDER BY
    1, 2

Then,

1
2
3
4
freq = df['frequency'].values
print(stats.chisquare(freq, f_exp=[100]*len(freq)))

>>> Power_divergenceResult(statistic=9808.0, pvalue=0.9122301443605709)

Standard SQLで実装

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
CREATE TEMP FUNCTION rand_range(min_value INT64, max_value INT64) AS (
  CAST(ROUND(min_value + rand() * (max_value - min_value)) AS INT64)
);

CREATE TEMP FUNCTION rand_date(min_date DATE,max_date DATE) AS ( 
  TIMESTAMP_SECONDS(
    rand_range(UNIX_SECONDS(CAST(min_date AS TIMESTAMP)), UNIX_SECONDS(CAST(max_date AS TIMESTAMP)))
  ) 
);

CREATE TEMP FUNCTION assign_group(lottery_value INT64) AS ( 
  CASE
      WHEN lottery_value BETWEEN 0001 AND 2000 THEN 'A'
      WHEN lottery_value BETWEEN 2001 AND 4000 THEN 'B'
      WHEN lottery_value BETWEEN 4001 AND 6000 THEN 'C'
      WHEN lottery_value BETWEEN 6001 AND 8000 THEN 'D'
      ELSE 'E'
  END 
);


WITH
    test_data AS(
        SELECT 
            CAST(x AS STRING) AS user_id, 
            rand_date("2020-01-01", "2020-12-31") AS created_timestamp,
            RIGHT(CONCAT('0', CAST(fhoffa.x.random_int(1,13) AS STRING)), 2) AS birth_month
        FROM 
            UNNEST((SELECT GENERATE_ARRAY(1000001,2000000,1) xs)) AS x
    ),
    lottery AS(
        SELECT
            t1.user_id,
            t1.created_timestamp,
            t1.birth_month,
            FARM_FINGERPRINT(user_id) AS fingerprint_value,
            ROW_NUMBER() OVER (PARTITION BY birth_month ORDER BY FARM_FINGERPRINT(user_id)) AS lottery_number
        FROM
            test_data AS t1
    )
SELECT
    t1.user_id,
    t1.created_timestamp,
    t1.birth_month,
    assign_group(t1.lottery_number) AS test_group
FROM
    lottery AS t1

これを上手く分けられているか確認したい場合は

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
CREATE TEMP FUNCTION rand_range(min_value INT64, max_value INT64) AS (
  CAST(ROUND(min_value + rand() * (max_value - min_value)) AS INT64)
);

CREATE TEMP FUNCTION rand_date(min_date DATE,max_date DATE) AS ( 
  TIMESTAMP_SECONDS(
    rand_range(UNIX_SECONDS(CAST(min_date AS TIMESTAMP)), UNIX_SECONDS(CAST(max_date AS TIMESTAMP)))
  ) 
);

CREATE TEMP FUNCTION assign_group(lottery_value INT64) AS ( 
  CASE
      WHEN lottery_value BETWEEN 0001 AND 2000 THEN 'A'
      WHEN lottery_value BETWEEN 2001 AND 4000 THEN 'B'
      WHEN lottery_value BETWEEN 4001 AND 6000 THEN 'C'
      WHEN lottery_value BETWEEN 6001 AND 8000 THEN 'D'
      ELSE 'E'
  END 
);


WITH
    test_data AS(
        SELECT 
            CAST(x AS STRING) AS user_id, 
            rand_date("2020-01-01", "2020-12-31") AS created_timestamp,
            RIGHT(CONCAT('0', CAST(fhoffa.x.random_int(1,13) AS STRING)), 2) AS birth_month
        FROM 
            UNNEST((SELECT GENERATE_ARRAY(1000001,2000000,1) xs)) AS x
    ),
    lottery AS(
        SELECT
            t1.user_id,
            t1.created_timestamp,
            t1.birth_month,
            FARM_FINGERPRINT(user_id) AS fingerprint_value,
            ROW_NUMBER() OVER (PARTITION BY birth_month ORDER BY FARM_FINGERPRINT(user_id)) AS lottery_number
        FROM
            test_data AS t1
    ),
    group_table AS(
        SELECT
            t1.user_id,
            t1.created_timestamp,
            t1.birth_month,
            assign_group(t1.lottery_number) AS test_group
        FROM
            lottery AS t1
    )
SELECT
    test_group,
    birth_month,
    COUNT(DISTINCT user_id) AS user_cnt
FROM
    group_table
GROUP BY
    1, 2
ORDER BY
    1, 2

Then,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
test_group	birth_month	user_cnt
A	01	2000
A	02	2000
A	03	2000
A	04	2000
A	05	2000
A	06	2000
A	07	2000
A	08	2000
A	09	2000
A	10	2000
A	11	2000
A	12	2000
B	01	2000
B	02	2000
B	03	2000
B	04	2000
B	05	2000
B	06	2000
B	07	2000
B	08	2000
B	09	2000
B	10	2000
B	11	2000
B	12	2000
C	01	2000
C	02	2000
C	03	2000
C	04	2000
C	05	2000
C	06	2000
C	07	2000
C	08	2000
C	09	2000
C	10	2000
C	11	2000
C	12	2000
D	01	2000
D	02	2000
D	03	2000
D	04	2000
D	05	2000
D	06	2000
D	07	2000
D	08	2000
D	09	2000
D	10	2000
D	11	2000
D	12	2000
E	01	75459
E	02	74957
E	03	75219
E	04	75066
E	05	75470
E	06	75300
E	07	74803
E	08	75121
E	09	76075
E	10	75410
E	11	75362
E	12	75758

参考資料

オンラインマテリアル



Share Buttons
Share on:

Feature Tags
Leave a Comment
(注意:GitHub Accountが必要となります)