๐Ÿ’ก ๋ฐ์ดํ„ฐ ๋ถ„์„ ํ”„๋กœ์ ํŠธ

Olist ๋งค์ถœ ๋ถ„์„

rtw0202 2025. 8. 2. 13:53

1. ํ™œ์šฉ ๋ฐ์ดํ„ฐ

1-1. ๋ฐ์ดํ„ฐ : Olist ๋ฐ์ดํ„ฐ(์ถœ์ฒ˜ : kaggle)

Olist(์˜ฌ๋ฆฌ์ŠคํŠธ)๋Š” ๋ธŒ๋ผ์งˆ์˜ ์ด์ปค๋จธ์Šค ํ”Œ๋žซํผ์œผ๋กœ, ํŒ๋งค์ž(๋ธŒ๋žœ๋“œ)๊ฐ€ ์ƒํ’ˆ์„ ์˜จ๋ผ์ธ ๋งˆ์ผ“ํ”Œ๋ ˆ์ด์Šค์— ํŒ๋งคํ•  ์ˆ˜ ์žˆ๋„๋ก ์œ ํ†ต ์†”๋ฃจ์…˜(์ƒํ’ˆ ๋“ฑ๋ก, ๋ฌผ๋ฅ˜, ๊ณ ๊ฐ ์‘๋Œ€ ๋“ฑ)์„ ์ œ๊ณตํ•œ๋‹ค.

 

1-2. ๋ฐ์ดํ„ฐ ์Šคํ‚ค๋งˆ

๊ฐ ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๋”ฐ๋ผ ์ฃผ๋ฌธ ์•„์ด๋””(order_id), ๊ณ ๊ฐ ์•„์ด๋””(customer_id) ๋“ฑ์„ ์ด์šฉํ•ด ์ฐธ์กฐ๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.

 

1-3. ๋ฐ์ดํ„ฐ ์ปฌ๋Ÿผ ์ •๋ณด

* ์—ด ์ด๋ฆ„ ์˜คํƒ€ ์ˆ˜์ •

  • products ๋ฐ์ดํ„ฐ์˜ product_name_leng'ht' -> product_name_leng'th'
  • products ๋ฐ์ดํ„ฐ์˜ product_description_leng'ht' -> product_description_leng'th'

 

2. ๋Œ€์‹œ๋ณด๋“œ ์„ค๊ณ„

2-1. ์ฃผ์ œ ์„ ์ •

๋ถ„์„์„ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์ „ ๋Œ€์‹œ๋ณด๋“œ๋ฅผ ์„ค๊ณ„ํ•˜์—ฌ ์–ด๋–ค ๋ถ€๋ถ„์„ ์ค‘์ ์œผ๋กœ ๋ถ„์„ํ• ์ง€ ๋ฐฉํ–ฅ์„ ์ •ํ–ˆ๋‹ค.

์šฐ์„  ๋ฐ์ดํ„ฐ๊ฐ€ ์ฃผ๋กœ ํŒ๋งค ํ˜„ํ™ฉ์„ ๋‹ค๋ฃจ๊ณ  ์žˆ๊ธฐ์— ๋งค์ถœ ๋ถ„์„์ด๋ผ๋Š” ํฐ ์ฃผ์ œ๋ฅผ ์žก๊ณ , ์„ธ๋ถ€ ์ฃผ์ œ๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์„ ์ •ํ–ˆ๋‹ค.

  • ์ œํ’ˆ ์นดํ…Œ๊ณ ๋ฆฌ๋ณ„ ๋งค์ถœ
  • ์›”๋ณ„ ๋งค์ถœ
  • ๊ณ ๊ฐ ๋“ฑ๊ธ‰๋ณ„ ๋งค์ถœ
  • ์ง€์—ญ๋ณ„ ๋งค์ถœ(๋„ํ‘œ, ๋งต)

 

2-2. ์ƒ‰์ƒ

๋Œ€์‹œ๋ณด๋“œ์˜ ํฌ์ธํŠธ ์ปฌ๋Ÿฌ๋Š” olist์˜ ๋กœ๊ณ  ์ƒ‰์ƒ์„ ํ™œ์šฉํ•˜์˜€๋‹ค.

 

3. ๋ฐ์ดํ„ฐ ๋ถ„์„

์‹ค์ œ ๊ณ ๊ฐ ์ˆ˜ ํ™•์ธ

customers ์ •์ œ ์ „

customers ๋ฐ์ดํ„ฐ ์ค‘ customer_id๋Š” ์ฃผ๋ฌธ ๋‹น์‹œ ๊ตฌ๋งค์ž์—๊ฒŒ ๋ฐœ๊ธ‰๋˜๋Š” id๋กœ, ๊ฐ™์€ customer_unique_id๋ฅผ ๊ฐ€์ง„ ๊ณ ๊ฐ์ด๋ผ ํ•ด๋„ customer_id๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋‹ค.

์‹ค์ œ ๊ณ ๊ฐ์˜ ์ˆ˜๋Š” 96,096๋ช…์œผ๋กœ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

์ฃผ(state) ๋งคํ•‘

๋ธŒ๋ผ์งˆ์˜ 27๊ฐœ ์ฃผ(state)๊ฐ€ ์•ฝ์–ด๋กœ ๋˜์–ด ์žˆ์–ด ๋Œ€์‹œ๋ณด๋“œ ์ƒ์—์„œ ๊ตฌ๋ถ„์ด ์–ด๋ ค์šธ ๊ฒƒ ๊ฐ™์•„ ์•ฝ์–ด๋ฅผ ํ’€์–ด ์ ์šฉํ–ˆ๋‹ค. 

state_mapping = {
    'SP': 'Sao Paulo',
    'SC': 'Santa Catarina',
    'MG': 'Minas Gerais',
    'PR': 'Parana',
    'RJ': 'Rio de Janeiro',
    'RS': 'Rio Grande do Sul',
    'PA': 'Para',
    'GO': 'Goias',
    'ES': 'Espirito Santo',
    'BA': 'Bahia',
    'MA': 'Maranhao',
    'MS': 'Mato Grosso do Sul',
    'CE': 'Ceara',
    'DF': 'Distrito Federal',
    'RN': 'Rio Grande do Norte',
    'PE': 'Pernambuco',
    'MT': 'Mato Grosso',
    'AM': 'Amazonas',
    'AP': 'Amapa',
    'AL': 'Alagoas',
    'RO': 'Rondonia',
    'PB': 'Paraiba',
    'TO': 'Tocantins',
    'PI': 'Piaui',
    'AC': 'Acre',
    'SE': 'Sergipe',
    'RR': 'Roraima'
}

customers['customer_state'] = customers['customer_state'].map(state_mapping)

 

product_category_name ์˜๋ฌธ ๋ณ€ํ™˜

product_category_name ์˜์–ด ๋ณ€ํ™˜ ์ „

๊ธฐ์กด products ๋ฐ์ดํ„ฐ์—๋Š” product_category_name์ด ํฌ๋ฅดํˆฌ๊ฐˆ์–ด๋กœ ๋˜์–ด ์žˆ๋‹ค. 

product_category_name_translation ๋ฐ์ดํ„ฐ๋Š” product_category_name์ด ํฌ๋ฅดํˆฌ๊ฐˆ์–ด์™€ ์˜์–ด๊ฐ€ ๋งคํ•‘๋˜์–ด ์žˆ์–ด ์ด๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํฌ๋ฅดํˆฌ๊ฐˆ์–ด๋ฅผ ์˜์–ด๋กœ ๋ณ€ํ™˜ํ–ˆ๋‹ค.

# ๋”•์…”๋„ˆ๋ฆฌ๋กœ ๋งคํ•‘
category_dict = product_translation.set_index('product_category_name')['product_category_name_english']

# 2๋ฒˆ์งธ ์ธ๋ฑ์Šค์— ์ถ”๊ฐ€
products.insert(2, 'product_category_name_english', products['product_category_name'].map(category_dict))

product_category_name ์˜์–ด ๋ณ€ํ™˜ ํ›„

 

RFM

RFM์ด๋ž€ ๊ณ ๊ฐ์˜ ์ตœ๊ทผ ๊ตฌ๋งค ์‹œ์ (Recency), ๊ตฌ๋งค ๋นˆ๋„(Frequency), ๊ตฌ๋งค ๊ธˆ์•ก(Monetary)๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ ๊ฐ์„ ๋ถ„๋ฅ˜ํ•˜๊ณ  ๋ถ„์„ํ•˜๋Š” ๊ณ ๊ฐ ๊ฐ€์น˜ ๋ถ„์„ ๊ธฐ๋ฒ•์ด๋‹ค.

  • ๊ตฌ๋งค ์‹œ์ (Recency) :  ๋ชจ๋“  ์ฃผ๋ฌธ ์ผ์ž ์ค‘ ์ตœ๋Œ“๊ฐ’ - ํ•ด๋‹น ๊ณ ๊ฐ์˜ ๋งˆ์ง€๋ง‰ ์ฃผ๋ฌธ ์ผ์ž(Day)
  • ๊ตฌ๋งค ๋นˆ๋„(Frequency) : ๊ณ ๊ฐ๋ณ„ ๋ˆ„์  ์ฃผ๋ฌธ ํšŸ์ˆ˜
  • ๊ตฌ๋งค ๊ธˆ์•ก(Monetary) : ๊ณ ๊ฐ๋ณ„ ์ฃผ๋ฌธ ์ด์•ก

 

์‚ฌ์šฉํ•œ ๋ฐ์ดํ„ฐ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

  • orders :  ์ฃผ๋ฌธ ID(order_id), ์ฃผ๋ฌธ ์ƒํƒœ(order_status) ํฌํ•จ
  • order_payments : ์ฃผ๋ฌธ๋ณ„ ๊ฒฐ์ œ ๋‚ด์—ญ
  • customers : ๊ณ ๊ฐ ๊ณ ์œ  ID(customer_unique_id), ์ฃผ๋ฌธ ๋‹น์‹œ ๋ฐœ๊ธ‰๋˜๋Š” ID(customer_id) ํฌํ•จ

 

๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

orders์™€ order_payments ๋ฐ์ดํ„ฐ๋ฅผ order_id๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋น„๊ตํ•œ ๊ฒฐ๊ณผ, orders์—๋Š” ์žˆ์ง€๋งŒ order_payments์—๋Š” ์—†๋Š” order_id๊ฐ€ ์กด์žฌํ•˜์—ฌ ํ•ด๋‹น ํ–‰์„ ์ œ์™ธํ•˜์˜€๋‹ค.

 

์ฃผ๋ฌธ ์ƒํƒœ๊ฐ€ '๋ฐฐ์†ก ์™„๋ฃŒ(order_status='delivered')'์ธ ์ฃผ๋ฌธ๋งŒ ๋Œ€์ƒ์— ํฌํ•จํ•˜์˜€๋‹ค.

 

๊ธฐ์กด orders ๋ฐ์ดํ„ฐ์—๋Š” ์ฃผ๋ฌธ ์‹œ ๋ฐœ๊ธ‰๋œ customer_id๋งŒ ์กด์žฌํ•˜๋ฏ€๋กœ, ๊ณ ๊ฐ๋ณ„ ์ฃผ๋ฌธ ๋‚ด์—ญ์„ ์ทจํ•ฉํ•˜๊ธฐ ์œ„ํ•ด customers์™€ orders๋ฅผ ์กฐ์ธํ•˜์—ฌ ๊ฐ ์ฃผ๋ฌธ์„ ๊ณ ๊ฐ์˜ ๊ณ ์œ  ์‹๋ณ„์ž(customer_unique_id) ๊ธฐ์ค€์œผ๋กœ ์žฌ๊ตฌ์„ฑํ•˜์˜€๋‹ค.

 

RFM

Recency๋Š” ๋ชจ๋“  ๊ณ ๊ฐ์˜ ๋งˆ์ง€๋ง‰ ๊ตฌ๋งค ์‹œ๊ฐ„์—์„œ ๊ณ ๊ฐ๋ณ„ ๋งˆ์ง€๋ง‰ ๊ตฌ๋งค ์‹œ๊ฐ„์„ ๋บ€ ์ผ์ž๋กœ ๊ณ„์‚ฐํ•˜์˜€๋‹ค.

 

Frequency๋Š” ๊ณ ๊ฐ๋ณ„ ๋ˆ„์  ์ฃผ๋ฌธ ํšŸ์ˆ˜๋กœ ๊ณ„์‚ฐํ•˜์˜€๋‹ค.

 

Monetary๋Š” ๊ณ ๊ฐ๋ณ„ ๋ˆ„์  ์ฃผ๋ฌธ์— ๋Œ€ํ•œ ์ด ๊ธˆ์•ก์œผ๋กœ ๊ณ„์‚ฐํ•˜์˜€๋‹ค.

 

ํ•œ ๋ฒˆ๋„ ์ฃผ๋ฌธํ•˜์ง€ ์•Š์€ ๊ณ ๊ฐ์€ RFM ๋ถ„์„ ๋Œ€์ƒ์—์„œ ์ œ์™ธํ•˜์˜€์œผ๋ฉฐ, 5์  ๋งŒ์  ๊ธฐ์ค€์œผ๋กœ ์š”์†Œ๋ณ„ ์ ์ˆ˜๋ฅผ ํ• ๋‹นํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๋ฅผ ํ™•์ธํ•˜์˜€๋‹ค.

1. Recency(๊ตฌ๋งค ์‹œ์ )

Recency์˜ ๊ฒฝ์šฐ ์ƒ๋Œ€์ ์œผ๋กœ ๊ณ ๋ฅด๊ฒŒ ๋ถ„ํฌ๋˜์–ด ์ ์ˆ˜ ํ• ๋‹น์„ ๋ฐ”๋กœ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

rfm_score = rfm.copy()
rfm_score['R_score'] = pd.qcut(rfm['Recency'], 5, labels=[5, 4, 3, 2, 1])

๊ฐ’์ด ๋‚ฎ์„์ˆ˜๋ก ์ตœ์‹ ์„ฑ์ด๋ฏ€๋กœ ์ ์ˆ˜๋ฅผ ์—ญ์ˆœ์œผ๋กœ ํ• ๋‹นํ•˜์˜€๋‹ค.

 

2. Frequency(๊ตฌ๋งค ๋นˆ๋„)

Frequency์˜ ๊ฒฝ์šฐ 1ํšŒ ๊ตฌ๋งค ๊ณ ๊ฐ์ด ๊ฐ€์žฅ ๋งŽ์•˜๊ณ , ์ฃผ๋ฌธ ํšŸ์ˆ˜๊ฐ€ ๋งŽ์„์ˆ˜๋ก ๊ณ ๊ฐ ์ˆ˜๋Š” ๊ฐ์†Œํ•˜๋Š” ํŒจํ„ด์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

def score_frequency(freq):
    if freq == 1:
        return 1
    elif freq == 2:
        return 2
    elif freq == 3:
        return 3
    elif freq == 4:
        return 4
    else:
        return 5

rfm_score['F_score'] = rfm['Frequency'].apply(score_frequency)

1, 2, 3, 4, 5๋ฒˆ ์ด์ƒ์œผ๋กœ ๊ตฌ๋ถ„ํ•˜์—ฌ ์ ์ˆ˜๋ฅผ ํ• ๋‹นํ•˜์˜€๋‹ค.

 

3. Monetary(๊ตฌ๋งค ๊ธˆ์•ก)

Monetary์˜ ๊ฒฝ์šฐ ์ ์ˆ˜๋ณ„ ๊ธฐ์ค€๊ฐ’์„ ์ •ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ๋ถ„์œ„์ˆ˜๋ฅผ ํ™•์ธํ•˜์˜€๋‹ค.

def score_monetary(mone):
    if mone < 20:
        return 1
    elif mone < 50:
        return 2
    elif mone < 100:
        return 3
    elif mone < 500:
        return 4
    else:
        return 5

rfm_score['M_score'] = rfm['Monetary'].apply(score_monetary)

20 ๋ฏธ๋งŒ, 50 ๋ฏธ๋งŒ, 100 ๋ฏธ๋งŒ, 500 ๋ฏธ๋งŒ, 500 ์ด์ƒ์œผ๋กœ ๊ตฌ๋ถ„ํ•˜์—ฌ ์ ์ˆ˜๋ฅผ ํ• ๋‹นํ•˜์˜€๋‹ค.

 

๊ณ ๊ฐ ๋“ฑ๊ธ‰์€ ์ ์ˆ˜ ์กฐํ•ฉ์— ๋”ฐ๋ผ ์ด 6๊ฐœ๋กœ ๋ถ„๋ฅ˜ํ•˜์˜€๋‹ค.

  • VIP ๊ณ ๊ฐ
  • ์ถฉ์„ฑ ๊ณ ๊ฐ
  • ์ž ์žฌ ์ถฉ์„ฑ ๊ณ ๊ฐ
  • ์‹ ๊ทœ ๊ณ ๊ฐ
  • ์ž ์žฌ ์ดํƒˆ ๊ณ ๊ฐ
  • ์ดํƒˆ ๊ณ ๊ฐ

 

4. ๊ฒฐ๊ณผ

ํŠน์ • ์ง€์—ญ์„ ํด๋ฆญํ•˜๋ฉด ํ•ด๋‹น ์ง€์—ญ์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ๋งŒ ๋‚˜์˜ค๋„๋ก ์ง€์—ญ๋ณ„ ๋งค์ถœ(๋งต)์„ ํ•„ํ„ฐ๋กœ ์„ค์ •ํ•˜์˜€๋‹ค.