OpenTelemetry Collector: Running It in Two Tiers, Agent and Gateway

gfrog 2026. 7. 2. 15:47

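When people first deploy the OTel Collector, they usually start with a single instance. Our team did too: one Deployment, with every application sending OTLP to it. That works fine early on. Then traffic grows, you add tail sampling, you start attaching k8s node labels to spans, and the single Collector begins to buckle.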

This post covers the next step: a practical setup for a two-tier layout with an Agent (DaemonSet) and a Gateway (Deployment). It also summarizes what changed in the last few releases (v0.154 at the time of writing).

Why two tiers

If you stay on a single tier, you eventually run into these problems.

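First, node-local information is hard to attach. A central Collector can see a Pod's IP, but the node name, kernel version, instance type and the like have to be added as resource attributes somewhere. You might think the applications can add all of that themselves, but our team uses SDKs in four languages, and each one attaches attributes in a subtly different shape. Rather than fight to unify them, it is far easier to let an Agent running on the node handle it.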

Second, tail-based sampling needs all spans of a trace to land on the same instance. Applications have no way of knowing that. It only works if the tier in front of the Gateway routes spans by trace ID with the loadbalancingexporter.

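Third, when batching, retries, and backpressure are concentrated in the Gateway, you worry less about tuning queues in the application SDKs. An Agent that accepts data with minimal work and delegates the rest to the Gateway is easier to operate.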

Laying out the structure

The basic flow looks like this.

App SDK  →  Agent (DaemonSet, one per node)  →  Gateway (Deployment, N replicas)  →  backend

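There is one Agent per node. Node-local receivers such as hostmetrics and kubeletstats run here. Applications send OTLP to the node IP (or to an IP injected through hostPort and the downward API). That keeps the network hop short, and when a node has trouble its data does not travel far.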
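As a rough sketch of the downward API approach, an application Pod can read its node's IP from status.hostIP and build the OTLP endpoint from it. The Deployment name, labels, and image below are hypothetical placeholders, and the sketch assumes the Agent exposes port 4317 as a hostPort.

```yaml
# Hypothetical application Deployment; name, labels, and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: app
          image: example.com/checkout:latest
          env:
            # Downward API: IP of the node this Pod is scheduled on.
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            # Standard SDK env var; $(NODE_IP) is expanded by Kubernetes
            # because NODE_IP is declared earlier in the same list.
            # Assumes the Agent listens on hostPort 4317 (gRPC).
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://$(NODE_IP):4317
```

Because the endpoint comes from an environment variable, the same manifest pattern should work across SDK languages without code changes.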

The Gateway is a Deployment of 3 to 5 replicas. Tail sampling, redaction, routing, and per-backend export all happen here. The Agent and the Gateway talk OTLP over gRPC.

Agent configuration

Deploy it as a DaemonSet and enable only hostmetrics, kubeletstats, and otlp. OTLP coming from applications is accepted as-is, gets minimal processing, and is handed on to the gateway.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
      filesystem:
      network:
  kubeletstats:
    collection_interval: 20s
    auth_type: serviceAccount
    endpoint: ${env:K8S_NODE_NAME}:10250

processors:
  batch:
    timeout: 5s
    send_batch_size: 8192
  resource:
    attributes:
      - key: k8s.node.name
        value: ${env:K8S_NODE_NAME}
        action: insert
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    filter:
      node_from_env_var: K8S_NODE_NAME

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, resource, batch]
      exporters: [otlp/gateway]
    metrics:
      receivers: [otlp, hostmetrics, kubeletstats]
      processors: [k8sattributes, resource, batch]
      exporters: [otlp/gateway]
    logs:
      receivers: [otlp]
      processors: [k8sattributes, resource, batch]
      exporters: [otlp/gateway]

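An easy thing to miss here is filter.node_from_env_var on the k8sattributes processor. Without it, every Agent tries to watch Pod metadata for the entire cluster. With dozens of nodes each running informers over every Pod in the cluster, the API server will not be happy. Always filter down to the Agent's own node.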
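Both the kubeletstats endpoint and the k8sattributes filter read K8S_NODE_NAME, so the DaemonSet has to inject it. A minimal sketch, assuming hypothetical names (otel-agent, the observability namespace, the service account) and omitting the config volume, args, and RBAC:

```yaml
# Hypothetical Agent DaemonSet fragment; names and image tag are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      serviceAccountName: otel-agent  # needs RBAC for Pods and kubelet stats
      containers:
        - name: otel-agent
          image: otel/opentelemetry-collector-contrib:latest  # pin a version in practice
          env:
            # Downward API: name of the node this Agent Pod runs on.
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          ports:
            # hostPort lets applications reach the Agent via the node IP.
            - containerPort: 4317
              hostPort: 4317
            - containerPort: 4318
              hostPort: 4318
```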

Gateway configuration

The Gateway's main jobs are tail sampling and routing. It has to absorb traffic spikes, so attach an HPA as well.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 32

processors:
  batch:
    timeout: 10s
    send_batch_size: 16384
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow
        type: latency
        latency:
          threshold_ms: 500
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
  transform/redact:
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["http.url"], "token=[^&]+", "token=REDACTED")

exporters:
  otlphttp/backend:
    endpoint: https://otlp.your-backend.example.com
    headers:
      x-api-key: ${env:BACKEND_API_KEY}

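service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform/redact, tail_sampling, batch]
      exporters: [otlphttp/backend]
    # The Agent also forwards metrics and logs, so the Gateway needs pipelines for them.
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/backend]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/backend]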
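For the HPA mentioned at the top of this section, something along these lines could serve as a starting point. The Deployment name otel-gateway, the 70% CPU target, and the 3 to 5 replica range are assumptions for illustration; tail sampling holds traces in memory, so a memory-based metric may fit your workload better.

```yaml
# Hypothetical HPA for the Gateway Deployment; thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway
  namespace: observability
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 3
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

One caveat, as far as I understand trace-ID routing: when the replica count changes, some trace IDs get remapped to a different Gateway, so traces in flight during a scale event may be sampled on partial data.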

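The important question here is whether to use the loadbalancingexporter on the Agent → Gateway hop. For tail sampling to work correctly, every span of a trace has to reach the same Gateway instance. With the plain otlp/gateway exporter shown earlier, the Service spreads Agent traffic across Gateway Pods with no regard for trace ID, so spans of one trace end up on different instances and tail sampling becomes inaccurate. If you need accuracy, switch the Agent's exporter to loadbalancing and point its resolver at the Gateway Service.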

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-gateway.observability
        ports: [4317]

With this, spans are routed by a hash of the trace ID. Our team did not know about this at first and spent a long time puzzling over strange tail sampling results.
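Wired into the Agent config, the change might look like the fragment below. It is a sketch rather than a drop-in file: it assumes the receivers, processors, and the otlp/gateway exporter stay as defined in the Agent configuration above, and that only traces need trace-ID routing. routing_key is spelled out for clarity, though traceID is the default for traces as far as I know.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID          # keep all spans of a trace on one Gateway
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      k8s:
        service: otel-gateway.observability
        ports: [4317]

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, resource, batch]
      exporters: [loadbalancing]  # trace-ID aware routing
    metrics:
      receivers: [otlp, hostmetrics, kubeletstats]
      processors: [k8sattributes, resource, batch]
      exporters: [otlp/gateway]   # unchanged; no per-trace affinity needed
    logs:
      receivers: [otlp]
      processors: [k8sattributes, resource, batch]
      exporters: [otlp/gateway]
```

The k8s resolver watches the Service's backing Pods through the Kubernetes API, so the Agent's service account will also need permission to read them.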

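Things to watch in recent releases

If you upgrade after not looking for a while, a few names will have changed. Things that caught my eye while skimming the recent release notes: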

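The k8snode detector has been deprecated and moved to k8s_api. It is slated for complete removal in December 2026, so it is better to switch now.
Names are being unified with underscores, such as apachespark → apache_spark and envoyals → envoy_als. The old names remain as aliases, but deprecation warnings pile up in the logs.
The Kafka metrics receiver's UseFranzGo feature gate has been promoted to stable, and the Sarama-based implementation has been removed. If your team uses Kafka, validate your config.
The OTLP protobuf definitions have moved up to v1.10.0. Most of it is backward compatible, but teams that built their own exporter should check compatibility with their vendor backend.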

Handling the renames is simple. Run otelcol validate to collect the list of deprecation warnings, then change the names one by one. There is no big risk.

When this structure is overkill

Honestly, if the team is small and there are only a few services, a single Gateway tier is enough. Two tiers become justified roughly at these points.

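You have 20 or more nodes, you really are using tail sampling, or you have two or more backends and need routing. If any one of these applies, move to two tiers. Otherwise, you are better off tuning a single Deployment well. The more you split your infrastructure, the more there is to observe.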

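Wrapping up

After our team recently moved to this structure, tail sampling accuracy returned to normal, and handling traffic spikes became much easier with the Gateway HPA. We did, however, have to re-tune the Agent DaemonSet's resource requests/limits per node size. Requests sized for an m5.large node are wasteful when applied as-is to an m5.4xlarge. I plan to write that part up separately.

If you run things differently, I would love to hear how you do it.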
