AgglomerativeClustering

- 계층적 군집화(Hierarchical Clustering) 방식 중 하향식(Bottom-Up, 병합형)
- 초기에는 각 샘플을 개별 클러스터로 간주하고, 거리/유사도 기준으로 가장 가까운 두 클러스터를 반복적으로 병합
- 병합 과정이 트리 형태 (dendrogram) 로 나타남 → 군집 수 결정 전에도 계층 구조를 확인 가능

- 이상치가 개별 클러스터로 존재하거나 single linkage에 포함될 경우 전체 군집 구조에 영향

- ward linkage는 이상치 영향이 조금 덜하지만, 여전히 취약

- linkage 방법에 따라 병합 기준이 달라짐

- 시간 복잡도 : O(M·N²) ~ O(N³)

- 공간 복잡도 : O(N²)

장점

- 트리 구조로 계층적 관계를 한눈에 볼 수 있음.
- 클러스터 수를 사전에 몰라도 distance_threshold로 자동 결정 가능

- 병합 과정 완료 후, 군집의 수를 설정할 수 있음
- 다양한 linkage와 거리(metric) 선택 가능 → 데이터 특성에 맞춤 적용 가능
- 소규모 데이터셋에서는 해석력이 뛰어나고 시각화 용이

- 동일한 결과 보장

단점

- 높은 계산 비용 (연산량이 다른 모델에 비해 큰 편)
- 대규모 데이터셋에서는 비실용적
- 이상치(outlier)에 민감 → single linkage 등에서는 연결이 왜곡될 수 있음.
- 클러스터를 한 번 병합하면 되돌릴 수 없음 → 초기 단계의 잘못된 병합이 전체 구조에 영향

하이퍼 파라미터

from sklearn.cluster import AgglomerativeClustering

AgglomerativeClustering(
    n_clusters=2,
    affinity='euclidean',
    memory=None,
    connectivity=None,
    compute_full_tree='auto',
    linkage='ward',
    pooling_func='deprecated',
    distance_threshold=None,
)

n_clusters

- 찾을 클러스터의 수
- distance_threshold가 None이 아닌 경우에는 반드시 None으로 설정

affinity

- 문자열 또는 callable, 기본값: euclidean
- 연결(linkage)을 계산할 때 사용할 거리(metric)
- "euclidean", "l1", "l2", "manhattan", "cosine", "precomputed"
- linkage가 "ward"일 경우에는 "euclidean"만 허용
- "precomputed"일 경우, fit 메서드에 입력할 때 유사도(similarity) 행렬이 아닌 거리(distance) 행렬이 필요

connectivity
- 각 샘플의 이웃 샘플을 정의하는 연결성(connectivity) 행렬
- 자체가 연결성 행렬이거나, kneighbors_graph 등으로 변환 가능한 callable이 될 수 있음
- 기본값은 None으로, 이 경우 계층적 군집화 알고리즘은 구조화되지 않는다.

compute_full_tree

- bool 또는 'auto'
- 트리 구성을 n_clusters에서 조기 종료할지 여부
- 샘플 수에 비해 클러스터 수가 작지 않을 경우 계산 시간을 줄이는 데 유용
- 이 옵션은 connectivity 행렬을 지정할 때만 유용
- distance_threshold가 None이 아닌 경우에는 반드시 True 설정

linkage

- "ward", "complete", "average", "single"
- 사용할 연결(linkage) 기준

- 두 클러스터 집합 간의 거리를 결정하며, 알고리즘은 이 기준을 최소화하는 클러스터 쌍을 병합
* ward : 병합되는 클러스터의 분산을 최소화
* average : 두 집합의 모든 관측치 거리 평균을 사용
* complete (또는 maximum) : 두 집합의 모든 관측치 거리 중 최대값을 사용
* single : 두 집합의 모든 관측치 거리 중 최소값을 사용

distance_threshold

- 이 거리(threshold)보다 큰 경우 클러스터는 병합되지 않는다.
- None이 아닌 경우, n_clusters는 None이어야 하며 compute_full_tree는 True로 설정

Attributes

    n_clusters_ : int
        The number of clusters found by the algorithm. If
        ``distance_threshold=None``, it will be equal to the given
        ``n_clusters``.

    labels_ : array [n_samples]
        cluster labels for each point

    n_leaves_ : int
        Number of leaves in the hierarchical tree.

    n_connected_components_ : int
        The estimated number of connected components in the graph.

    children_ : array-like, shape (n_samples-1, 2)
        The children of each non-leaf node. Values less than `n_samples`
        correspond to leaves of the tree which are the original samples.
        A node `i` greater than or equal to `n_samples` is a non-leaf
        node and has children `children_[i - n_samples]`. Alternatively
        at the i-th iteration, children[i][0] and children[i][1]
        are merged to form node `n_samples + i`

n_clusters_
- 알고리즘이 찾은 클러스터 수
- distance_threshold=None일 경우, 주어진 n_clusters와 동일

labels_
- 각 샘플에 대한 클러스터 레이블

n_leaves_
- 계층적 트리의 리프 수

n_connected_components_
-그래프에서 추정된 연결된 컴포넌트 수

children_

- 노드 별 관측치의 수 참고

- 각 비리프 노드(non-leaf node)의 자식
- n_samples보다 작은 값은 원래 샘플(리프)을 나타냄
- i >= n_samples인 노드는 비리프 노드이며, 자식은 children_[i - n_samples]
- 또는 i 번째 반복(iteration)에서 children[i][0]과 children[i][1]이 병합되어 노드 n_samples + i 를 형성

'개발 > Python' 카테고리의 다른 글

파생변수 vs 요약변수 (0)	2025.11.25
표본조사와 표본 추출 방법 (0)	2025.11.23
IsolationForest Hyper Parameters and Attributes (0)	2025.11.22
LocalOutlierFactor Hyper Parameters and Attributes (0)	2025.11.22
LightGBM Hyper Parameters and Attributes (0)	2025.11.22

피로물든딸기의 라이브러리

AgglomerativeClustering

AgglomerativeClustering

'개발 > Python' 카테고리의 다른 글

댓글

티스토리툴바

AgglomerativeClustering

AgglomerativeClustering

'개발 > Python' 카테고리의 다른 글

관련글

댓글

티스토리툴바