Boxcox transformation on workflow

Concept

λͺ‡κ°€μ§€ κ°œλ…μ •λ¦¬ 및 μ½”λ“œλ₯Ό 보고 μ§„ν–‰ν•©λ‹ˆλ‹€.

Outlier removal workflow

μ°¨λΆ„(Difference):

sort ν›„ 각 ν•­λͺ©μ˜ 차이 κ°’

μ°¨λΆ„ κ°’μ˜ 배열을 λ§Œλ“€ 수 있음.

μ‚¬λΆ„μœ„μˆ˜(Quartile):

데이터λ₯Ό κ°€μž₯ μž‘μ€ μˆ˜λΆ€ν„° κ°€μž₯ 큰 μˆ˜κΉŒμ§€ 크기가 μ»€μ§€λŠ” μˆœμ„œλŒ€λ‘œ μ •λ ¬ν•˜μ˜€μ„ λ•Œ,
1/4, 2/4, 3/4 μœ„μΉ˜μ— μžˆλŠ” 수λ₯Ό λ§ν•œλ‹€. 

각각 1μ‚¬λΆ„μœ„μˆ˜, 2μ‚¬λΆ„μœ„μˆ˜, 3μ‚¬λΆ„μœ„μˆ˜λΌκ³  ν•œλ‹€. 

1/4의 μœ„μΉ˜λž€ 전체 λ°μ΄ν„°μ˜ μˆ˜κ°€ λ§Œμ•½ 100개이면 25번째 μˆœμ„œ, 즉 ν•˜μœ„ 25%λ₯Ό λ§ν•œλ‹€. 
λ”°λΌμ„œ 2μ‚¬λΆ„μœ„μˆ˜λŠ” 쀑앙값과 κ°™λ‹€.

λ•Œλ‘œλŠ” μœ„μΉ˜λ₯Ό 1/100 λ‹¨μœ„λ‘œ λ‚˜λˆˆ λ°±λΆ„μœ„μˆ˜(percentile)을 μ‚¬μš©ν•˜κΈ°λ„ ν•œλ‹€. 1μ‚¬λΆ„μœ„μˆ˜λŠ” 25% λ°±λΆ„μœ„μˆ˜μ™€ κ°™λ‹€.

np.percentile(x, 0)  # μ΅œμ†Œκ°’
np.percentile(x, 25)  # 1μ‚¬λΆ„μœ„ 수
np.percentile(x, 50)  # 2μ‚¬λΆ„μœ„ 수
np.percentile(x, 75)  # 3μ‚¬λΆ„μœ„ 수
np.percentile(x, 100)  # μ΅œλŒ“κ°’

μ•„λž˜λŠ” μœ„μ˜ μ„€λͺ…ν•œ κ°œλ…μ„ μ΄μš©ν•œ 예제 ν”„λ‘œκ·Έλž¨μž…λ‹ˆλ‹€.

Oulier removal using diffrence series

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
  
# Set lower and upper bound of a sequence
# Percentile -> Difference -> x sigma
 
import numpy as np
import pandas ad pd

"""
# numpy.argmin
# min(), max() : μ΅œμ†Œ, μ΅œλŒ€κ°’μ„ κ΅¬ν•˜κΈ°. 
# argmin(), argmax() : μ΅œμ†Œ, μ΅œλŒ€κ°’μ΄ μ‘΄μž¬ν•˜λŠ” μœ„μΉ˜(인덱슀)λ₯Ό κ΅¬ν•˜κΈ°.

# numpy.arange([start, ] stop, [step, ] dtype=None)
numpy λͺ¨λ“ˆμ˜ arange ν•¨μˆ˜λŠ” λ°˜μ—΄λ¦°κ΅¬κ°„ [start, stop) μ—μ„œ 
step 의 크기만큼 μΌμ •ν•˜κ²Œ λ–¨μ–΄μ Έ μžˆλŠ” μˆ«μžλ“€μ„ array ν˜•νƒœλ‘œ λ°˜ν™˜ν•΄ μ£ΌλŠ” ν•¨μˆ˜λ‹€.

# numpy.mean
평균 κ΅¬ν•˜κΈ°

# numpy.diff(arr, [n])
n order 차뢄을 ꡬ함

"""

"""
@self :
@series : μ–‘μ˜ μ‹€μˆ˜ κ°’μœΌλ‘œ 이루어진 1차원 λ°°μ—΄
@scale : 
@cut_pt : 
@cut_ratio : μ •κ·œ λΆ„ν¬λ‘œ λ³€ν™˜λœ λ°μ΄ν„°μ—μ„œ 선택할 ν‘œμ€€ 편차(μ‹œκ·Έλ§ˆ)κ°’

"""

class Self:

    def __init__(self):
        self.series = np.array([]);
        
    def setSeries(arr)
        self.series = np.array(arr);

    def calc_bound_from_dist(self, series = self.series, scale = 1000, cut_pt = 10, cut_ratio = 2):
        
        unit = 100 / scale
        center = int(scale / 2)
        cut_point = int(cut_pt / unit) if cut_pt / unit > 1 else 1 # 100
        
        # np.percentile(1000, np.arange(0, 100.1, 0.1))
        # np.percentile(1000, [   0.     0.1    0.2 ...,   99.8   99.9  100. ])
        percentiles = np.percentile(series, np.arange(0, 100 + unit, unit))
        
        # nonstationary process -> stationary process
        
        # TODO : Operator Overloading
        # percentiles[1:1001] - percentiles[0:1000]
        #   -> e(n+1) - e(n) -> 1 dimensional array
        # 1st order differincing
        diff = percentiles[1:scale + 1] - percentiles[0:scale]
        # diff = np.diff(percentiles)
        
        # TODO : Operator Overloading 
        # 1e-6 delta less than (>) filter -> 1 dimensional array
        diff_nnz_idx = np.abs(diff) > 1e-6 # 1 * 10^(-6)
        
        # Fix code to cpu caching
        
        """ Upper Bound Operation """
        
        ev_upper_range = np.arange(center, scale - cut_point + 1, 1) # [500, 501, ... 900]
        diff_ev_upper_part = diff[ev_upper_range][diff_nnz_idx[ev_upper_range]]
        
        upper_bound_diff = np.mean(diff_ev_upper_part) + np.std(diff_ev_upper_part) * cut_ratio
        
        # upper
        upper_bound_idx = np.argmax(diff[center:] > upper_bound_diff)
            
        if upper_bound_idx == 0 or upper_bound_idx == scale-center:
            upper_bound_idx = scale - 1
        else:
            upper_bound_idx += center
            
        upper_bound = percentiles[upper_bound_idx]
        
        """ Lower Bound Operation """
                
        ev_lower_range = np.arange(cut_point, center + 1, 1) # [100, 101, ... 500]
        diff_ev_lower_part = diff[ev_lower_range][diff_nnz_idx[ev_lower_range]]
        
        lower_bound_diff = np.mean(diff_ev_lower_part) + np.std(diff_ev_lower_part) * cut_ratio
                
        # lower
        lower_bound_idx = -np.argmax(diff[center::-1] > lower_bound_diff)
            
        if lower_bound_idx == 0 or lower_bound_idx == -center:
            lower_bound_idx = 1
        else:
            lower_bound_idx += center
                
        lower_bound = percentiles[lower_bound_idx]

        return upper_bound, lower_bound

Boxcox Transformation

boxcox transformation

λ³΅μž‘ν•œ μ§€μˆ˜ν•¨μˆ˜μ˜ 승수의 ν•΄λ₯Ό 뉴턴 λ©”μ„œλ“œλ‘œ κ΅¬ν•˜κ³  이λ₯Ό μ •κ·œλΆ„ν¬λ‘œ λ³€ν™˜ν•©λ‹ˆλ‹€.

λ¬΄μž‘μœ„ κ°’μœΌλ‘œ boxcox 값이 μ–Όλ§ˆλ‚˜ μ •κ·œλΆ„ν¬λ‘œ 잘 λ°”κΏ”μ£ΌλŠ”μ§€

box-cox-transformation-using-python에 ν•œλˆˆμ— 보기 μ‰¬μš΄ μ˜ˆμ œκ°€ μžˆμŠ΅λ‹ˆλ‹€.

μ•„λž˜ μ˜ˆμ œλŠ” μ‹€μ œ 츑정값을 예제둜 ν•˜λŠ” μ˜ˆμ‹œμž…λ‹ˆλ‹€.

C.Doom의 Cygnus λ°©ν–₯으둜 47 개의 별을 ν¬ν•¨ν•˜λŠ” 성단 CYG OB1의 Hertzsprung-Russell Diagram μ‹€μΈ‘ λ°μ΄ν„°μž…λ‹ˆλ‹€.

첫 번째 λ³€μˆ˜λŠ” 별 ν‘œλ©΄μ—μ„œ 유효 μ˜¨λ„μ˜ 둜그 (log.Te)이고 두 번째 λ³€μˆ˜λŠ” λΉ› κ°•λ„μ˜ 둜그(log.light)μž…λ‹ˆλ‹€.

μ•„λž˜μ˜ μ˜ˆμ œμ—μ„œλŠ” log.Te 만 λ°μ΄ν„°λ‘œ μ‚¬μš©ν•©λ‹ˆλ‹€.

원본 데이터

원본 데이터 csv

예제λ₯Ό 돌리기 μœ„ν•΄μ„œ μ•„λ‚˜μ½˜λ‹€ ν”„λ ˆμž„μ›Œν¬λ₯Ό μ„€μΉ˜ν•˜λ©΄ νŽΈλ¦¬ν•©λ‹ˆλ‹€.

μ•„λž˜λŠ” μ˜ˆμ œμ—μ„œλŠ” scipy.stats.boxcox 라이브러리λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.

scipy.stats.boxcox(x, lmbda=None, alpha=None)[source]
Return a dataset transformed by a Box-Cox power transformation.

Parameters
    xndarray
    Input array. Must be positive 1-dimensional. Must not be constant.

    lmbda{None, scalar}, optional
    If lmbda is not None, do the transformation for that value.

    If lmbda is None, find the lambda that maximizes the log-likelihood function and return it as the second output argument.

    alpha{None, float}, optional
    If alpha is not None, return the 100 * (1-alpha)% confidence interval for lmbda as the third output argument. Must be between 0.0 and 1.0.

Returns
    boxcoxndarray
    Box-Cox power transformed array.

    maxlogfloat, optional
    If the lmbda parameter is None, the second returned argument is the lambda that maximizes the log-likelihood function.

    (min_ci, max_ci)tuple of float, optional
    If lmbda parameter is None and alpha is not None, this returned tuple of floats represents the minimum and maximum confidence limits given alpha.

비정상 ν™•λ₯  과정을 정상 ν™•λ₯  κ³Όμ •μœΌλ‘œ λ³€ν™˜ν•˜κΈ°

Boxcox transformation example

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import boxcox
import seaborn as sns
import dautil as dl
import pandas_datareader as pd
import numpy as np
# import csv

#from IPython.display import HTML

# Load the data and transform it as follows :
context = dl.nb.Context('normalizing_boxcox')
lr = dl.nb.LatexRenderer(chapter=4, start=3, context=context)
lr.render(r'y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt] \ln{(y_i)} & \text{if } \lambda = 0, \end{cases} ')

starsCYG = sm.datasets.get_rdataset("starsCYG", "robustbase", cache=True).data

whichData = 'log.Te'

# Data must be positive
# Data must be 1-dimensional.
transformed, lamda = boxcox(starsCYG[whichData])

# export CSV
#with open('test', 'w', newline='', encoding='utf-8') as csv_file:
    #writer = csv.writer(transformed, delimiter=',')
    #writer.writerow('my_utf8_string')

print("1. input data : ")
print(starsCYG)
print(type(starsCYG))

np.savetxt('./input_data.txt'
   , (starsCYG)
   , header='--input data start--'
   , footer='--input data end--'
   , fmt='%1.2f')

print("\n\n")

print("2. input data [%s] : " %whichData)
print(starsCYG[whichData])
print(type(starsCYG[whichData]))

np.savetxt('./input_data_x.txt'
   , (starsCYG)
   , header='--input data start--'
   , footer='--input data end--'
   , fmt='%1.2f')

print("\n\n")

print("3. transformed output : ")
print(transformed)
print(type(transformed))

print("max lamda value : ")
print(lamda)
print(type(lamda))

print("\n\n")

np.savetxt('D:/PythonProject/output_data_x.txt'
   , (transformed)
   , header='--output data start--'
   , footer='--output data end--'
   , fmt='%1.2f')

#region Plot
# Display the Q - Q plots and the distribution as follows :

#"""
sp = dl.plotting.Subplotter(2, 2, context)
sp.label()
sm.qqplot(starsCYG[whichData], fit=True, line='s', ax=sp.ax)

sp.label(advance=True)
sm.qqplot(transformed, fit=True, line='s', ax=sp.ax)

sp.label(advance=True)
sns.distplot(starsCYG[whichData], ax=sp.ax)

sp.label(advance=True)
sns.distplot(transformed, ax=sp.ax)                                       
plt.tight_layout()
plt.show()

#"""

#endregion Plot

λ‹€μŒμ€ postgres 에 μ μž¬ν•˜λŠ” μ˜ˆμ‹œλ₯Ό λ³΄κ² μŠ΅λ‹ˆλ‹€.

Postgres 연동은 Citus membership-manager.py λ₯Ό μ°Έκ³ ν–ˆμŠ΅λ‹ˆλ‹€.

Boxcox transformation batch process example on postgres

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
import datetime
import csv
import numpy as np
from scipy.stats import boxcox
from scipy.special  import inv_boxcox

import os
import sys
import psycopg2

from datetime import datetime as dt

print("WORKFLOW START [BOXCOX] :", dt.isoformat(dt.utcnow()))

idValue = (sys.argv[1])
print("idValue : %(idValue)s")

timeValue = (sys.argv[2])
print("timeValue : %(timeValue)s")

cdate = (sys.argv[3])
print("cdate : %(cdate)s")

sigma = float(sys.argv[4])
print("sigma : %(sigma)s")

connParam = "dbname = 'db' user = 'user' host = '192.168.10.1' password = 'paranlee'"
conn = psycopg2.connect(connParam)
cur = conn.cursor()

sql = """SELECT value 
        FROM ts_bg_alarm_outlier_threshold_data 
        WHERE cdate = %(cdate)s 
            AND idValue = %(idValue)s 
            AND timeValue = %(timeValue)s ;"""

cur.execute(sql)

result = [r[0] for r in cur.fetchall()]

inputSeries = np.array(result, dtype=float)
print("inputSeries: ", inputSeries)

# tryboxcox transformation
try:
    #inputSeries = np.array(list_file,dtype=float)
    transformed, lamda = boxcox(inputSeries)

    print("transformed output : ", transformed)

    print("max lamda value : ", lamda)

    print("transformed: ", transformed[0], transformed[1])

    stddev = np.std(transformed)
    avg = np.mean(transformed)

    orgAvg = np.mean(inputSeries)
    orgStd = np.std(inputSeries)
 
    print("avg value : %(avg)s")
    print("std value : %(stddev)s")

    print("orgAvg value : %(orgAvg)s")
    print("orgStd value : %(orgStd)s")

    print("sigma value : %(sigma)s")

    # up, down boundary
    up = avg + (stddev * sigma)
    down = avg - (stddev * sigma)

    orgUp = orgAvg + (orgStd * sigma)
    orgDown = orgAvg - (orgStd * sigma)

    print("up bodundary output : %(up)s")
    print("down boundary output : %(down)s")

    print("orgUp output : %(orgUp)s")
    print("orgDown output : %(orgDown)s")

    #inverse up,down
    invup = inv_boxcox(up, lamda)
    invdown = inv_boxcox(down, lamda)

    print("invup output : %(invup)s")
    print("invdown output : %(invdown)s")

    invup2 = ((up * lamda) ** (1 / lamda))
    invdown2 = ((down * lamda) ** (1 / lamda))

    print("invup2 output : %(invup2)s")
    print("invdown2 output : %(invdown2)s")

    if str(invup) == 'nan' or str(invup) == 'inf':
        invup = orgUp

    if str(invdown) == 'nan' or str(invdown) == 'inf':
        invdown = orgDown

    invup = orgUp
    invdown = orgDown

except:
    print('except')
    orgAvg = np.mean(inputSeries)
    orgStd = np.std(inputSeries)

    orgUp = orgAvg + (orgStd * sigma)
    orgDown = orgAvg - (orgStd * sigma)

    invup = orgUp
    invdown = orgDown

    print(orgAvg)
    print(orgStd)

    if str(invup) == 'nan' or str(invup) == 'inf':
        invup = psycopg2.extensions.AsIs('NULL')

    if str(invdown) == 'nan' or str(invdown) == 'inf':
        invdown = psycopg2.extensions.AsIs('NULL')

# minus change 0
# if invup < 0:
#     invup = 0
# if invdown < 0:
#     invdown = 0

# threshold update

sql = """UPDATE
        outlier_threshold_table
    SET threshold_min = %(invdown)s,
        threshold_max = %(invup)s 
    WHERE 1=1
        AND cdate = %(cdate)s 
        AND idValue = %(idValue)s
        AND timeValue = %(timeValue)s ;"""

cur.execute(sql)

conn.commit()

print("WORKFLOW END [BOXCOX] :", dt.isoformat(dt.utcnow()))

PG μ—λŠ” λ‚΄μž₯ ν•¨μˆ˜μ— Boxcox transformation 이 μ—†κ³ ,

λΆ€λ™μ†Œμˆ˜μ μ— λŒ€ν•œ μ˜ˆμ™Έμ²˜λ¦¬κ°€ μ’€ 더 νŽΈλ¦¬ν•˜κ²Œ ν•  수 μžˆμ–΄,

PL/SQL ν”„λ‘œμ‹œμ €λ‘œ κ΅¬ν˜„ν•˜μ§€ μ•Šκ³ , Python3 둜 κ΅¬ν˜„ν•œ 배치 ν”„λ‘œκ·Έλž¨ μ˜ˆμ‹œλ₯Ό κ΅¬ν˜„ν•΄λ³΄μ•˜μŠ΅λ‹ˆλ‹€.

Summary

μ΅œμ’…μ μœΌλ‘œ 데이터 적재λ₯Ό μœ„ν•΄μ„œλŠ”

  1. 이상점 μΆ”μΆœ 및 μ œμ™Έν•˜κΈ°

  2. μ •κ·œλΆ„ν¬ λ³€ν™˜ ν›„ ν‘œμ€€νŽΈμ°¨λ‘œ μ μž¬ν•  λ°μ΄ν„°μ˜ λ²”μœ„ μ„€μ •, μ—­λ³€ν™˜ν•œ 값을 μ μž¬ν•˜κΈ°

크게 2가지 μ›Œν¬ ν”Œλ‘œμš°λ‘œ μ΄λ£¨μ–΄μ§€λŠ” 것을 ν™•μΈν–ˆμŠ΅λ‹ˆλ‹€.