[k8s] Troubleshooting Deep Dive: CoreDNS and DNS Lookup Problems Inside a k8s Cluster

hoonii2 2023. 8. 1. 21:06

1. Situation

I covered DNS lookup problems in a k8s cluster in an earlier post, but it left out the details, so I am writing them up properly here.

The test environment runs on VMware Workstation with one Master Node and two Worker Nodes, all attached to a NAT-type VMnet virtual network adapter.

I am not sure whether this is specific to VMware Workstation, but the Linux guests on this network were assigned a specific IP on that adapter's network as their nameserver. ( Section 3 below shows this in detail )

If you build a test k8s cluster on a similar virtualized setup, you will probably hit the same problem. If you install Linux on bare-metal servers and build the cluster there, it probably will not occur.

As a result, the Worker and Master Nodes, and the Pods created on them, did not send DNS queries to CoreDNS and instead used the adapter-side IP described above (192.168.236.2) as their nameserver. This post records what I learned while fixing that.

The fix involves two things: systemd-resolved, the process that handles DNS lookups on Linux, and checking whether CoreDNS logs the incoming queries.

2. (Status check) CoreDNS logging configuration and real-time DNS query monitoring

root@k8s-master:~# kubectl edit configmaps -n kube-system coredns

# (omitted)
    .:53 {
        errors
        health {
           lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
           ttl 30
        }
        prometheus :9153
        forward . 8.8.8.8 {
           max_concurrent 1000
        }
        cache 30
        loop
        reload
        loadbalance
        log    # added
        debug  # added
    }
# (omitted)

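As shown above, I added the log and debug plugins to the Corefile in the CoreDNS ConfigMap so that incoming queries show up in the CoreDNS Pod logs. ( After the change I deleted the CoreDNS Pods so that they would be recreated with the new configuration )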
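
Before chasing a missing query, it can help to confirm that the stored Corefile really contains the log plugin. The following is only a minimal sketch and not part of my original setup: corefile_has_plugin is a hypothetical helper, and the usage comment assumes the ConfigMap keeps the file under the Corefile key, as kubeadm-built clusters usually do.

```shell
# Sketch: succeed (exit 0) if the Corefile read from stdin enables the plugin
# named in $1. corefile_has_plugin is a hypothetical helper, not a kubectl or
# CoreDNS command.
corefile_has_plugin() {
  # Drop '#' comments, then look for a line whose first word is the plugin name.
  sed 's/#.*$//' |
    awk -v p="$1" '$1 == p { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage against a live cluster (assumption: the data key is "Corefile"):
#   kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' \
#     | corefile_has_plugin log && echo "log plugin is enabled"
```

The match is on the first word of each line, so log does not accidentally match loadbalance or loop.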

To watch the CoreDNS Pod logs in real time while running DNS lookups, use the -f ( stream logs ) option.

root@k8s-master:~# kubectl get pods -n kube-system -o wide | grep coredns
coredns-5d78c9869d-2m4l2             0/1     ImagePullBackOff   0                168m   192.168.58.251    k8s-node02   <none>           <none>
coredns-5d78c9869d-bljql             1/1     Running            0                172m   192.168.85.247    k8s-node01   <none>           <none>

root@k8s-master:~# kubectl logs -n kube-system coredns-5d78c9869d-bljql -f | grep hoonii2.tistory.com

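( The 2m4l2 Pod is currently broken, but that is because it cannot pull its image due to the DNS query failures. As shown below, only bljql is registered as an Endpoint of kube-dns, the Service in front of CoreDNS, so the CoreDNS service itself is not affected )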

root@k8s-master:~# kubectl describe services -n kube-system kube-dns
Name:              kube-dns
Namespace:         kube-system
Labels:            k8s-app=kube-dns
                   kubernetes.io/cluster-service=true
                   kubernetes.io/name=CoreDNS
Annotations:       prometheus.io/port: 9153
                   prometheus.io/scrape: true
Selector:          k8s-app=kube-dns
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.96.0.10
IPs:               10.96.0.10
Port:              dns  53/UDP
TargetPort:        53/UDP
Endpoints:         192.168.85.247:53   # only the bljql Pod IP is registered
Port:              dns-tcp  53/TCP
TargetPort:        53/TCP
Endpoints:         192.168.85.247:53
Port:              metrics  9153/TCP
TargetPort:        9153/TCP
Endpoints:         192.168.85.247:9153
Session Affinity:  None
Events:            <none>
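
When there are more CoreDNS Pods than in my lab, reading the Endpoints lines by eye gets tedious. As a rough sketch, the hypothetical helper below pulls the unique endpoint IPs out of the describe output so they can be compared with the Pod IPs from "kubectl get pods -o wide". It assumes IPv4 endpoints printed as ip:port, separated by commas.

```shell
# Sketch: print the unique endpoint IPs found in `kubectl describe service`
# output read from stdin. service_endpoint_ips is a hypothetical helper name.
service_endpoint_ips() {
  awk '$1 == "Endpoints:" {
         n = split($2, eps, ",")            # several endpoints are comma-separated
         for (i = 1; i <= n; i++) {
           sub(/:[0-9]+$/, "", eps[i])      # strip the port
           print eps[i]
         }
       }' | sort -u
}

# Usage:
#   kubectl describe services -n kube-system kube-dns | service_endpoint_ips
```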

With the " logs ~~ -f " command above, you can check in real time whether a DNS lookup from a given Node or Pod actually reaches CoreDNS. ( I will use hoonii2.tistory.com for the test queries )

3. (Status check) DNS resolution state on the Worker Node

I am using Ubuntu Linux, where systemd-resolved handles DNS name resolution by default.

Check whether that process is running and which nameserver it is configured to use for name resolution.

root@k8s-node02:~# sudo systemctl status systemd-resolved
● systemd-resolved.service - Network Name Resolution
     Loaded: loaded (/lib/systemd/system/systemd-resolved.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/systemd-resolved.service.d
             └─override.conf
     Active: active (running) since Tue 2023-08-01 11:21:48 UTC; 9min ago
       Docs: man:systemd-resolved.service(8)
             https://www.freedesktop.org/wiki/Software/systemd/resolved
             https://www.freedesktop.org/wiki/Software/systemd/writing-network-configuration-managers
             https://www.freedesktop.org/wiki/Software/systemd/writing-resolver-clients
   Main PID: 137629 (systemd-resolve)
     Status: "Processing requests..."
      Tasks: 1 (limit: 4573)
     Memory: 4.7M
     CGroup: /system.slice/systemd-resolved.service
             └─137629 /lib/systemd/systemd-resolved

Aug 01 11:22:08 k8s-node02 systemd-resolved[137629]: Using degraded feature set (UDP) for DNS server 192.168.236.2.
Aug 01 11:22:13 k8s-node02 systemd-resolved[137629]: Using degraded feature set (TCP) for DNS server 192.168.236.2.
Aug 01 11:25:40 k8s-node02 systemd-resolved[137629]: Using degraded feature set (UDP) for DNS server 192.168.236.2.
Aug 01 11:25:56 k8s-node02 systemd-resolved[137629]: Using degraded feature set (TCP) for DNS server 192.168.236.2.
Aug 01 11:26:33 k8s-node02 systemd-resolved[137629]: Using degraded feature set (UDP) for DNS server 192.168.236.2.
Aug 01 11:26:43 k8s-node02 systemd-resolved[137629]: Using degraded feature set (TCP) for DNS server 192.168.236.2.
Aug 01 11:26:53 k8s-node02 systemd-resolved[137629]: Using degraded feature set (UDP) for DNS server 192.168.236.2.
Aug 01 11:27:03 k8s-node02 systemd-resolved[137629]: Using degraded feature set (TCP) for DNS server 192.168.236.2.
Aug 01 11:27:14 k8s-node02 systemd-resolved[137629]: Using degraded feature set (UDP) for DNS server 192.168.236.2.
Aug 01 11:27:24 k8s-node02 systemd-resolved[137629]: Using degraded feature set (TCP) for DNS server 192.168.236.2.

From this we can see that systemd-resolved is running and that 192.168.236.2 is involved. ( As mentioned in Section 1, this IP appears to be a quirk of running under a type 2 hypervisor. If you know more about it, please leave a comment! )

So which file does systemd-resolved use to decide where DNS queries go?

You can find out from /run/systemd/resolve/resolv.conf, which currently looks like this.

# This file is managed by man:systemd-resolved(8). Do not edit.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs must not access this file directly, but only through the
# symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a different way,
# replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 192.168.236.2
search localdomain

So far we have confirmed that DNS resolution on the Linux node is not pointed at CoreDNS.
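
On Ubuntu, /etc/resolv.conf is typically a symlink to the systemd-resolved stub file, which lists only 127.0.0.53. That is why nslookup reports 127.0.0.53 as the server even though the real upstream is the address shown above. As a small sketch for comparing the two views, the hypothetical helper below prints the nameserver entries of any resolv.conf-style file.

```shell
# Sketch: print the nameserver addresses listed in a resolv.conf-style file.
# list_nameservers is a hypothetical helper name.
list_nameservers() {
  # Lines look like "nameserver 192.168.236.2"; comments and "search" are skipped.
  awk '$1 == "nameserver" { print $2 }' "$1"
}

# Usage on a node running systemd-resolved:
#   list_nameservers /etc/resolv.conf                   # what applications see
#   list_nameservers /run/systemd/resolve/resolv.conf   # the upstream servers
```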

(Test 1) Now let's confirm that a DNS query from this Node does not reach CoreDNS, even though the Node belongs to the k8s cluster.

root@k8s-node02:~# nslookup hoonii2.tistory.com
Server:         127.0.0.53
Address:        127.0.0.53#53

** server can't find hoonii2.tistory.com: SERVFAIL



root@k8s-master:~# kubectl logs -n kube-system coredns-5d78c9869d-bljql -f | grep hoonii2.tistory.com

(Test 2) Next, check whether the query succeeds when the nameserver is specified explicitly. ( In Section 2 we saw that the ClusterIP of the CoreDNS Service is 10.96.0.10, so I will use that )

root@k8s-node02:~# nslookup hoonii2.tistory.com 10.96.0.10
Server:         10.96.0.10
Address:        10.96.0.10#53

Non-authoritative answer:
hoonii2.tistory.com     canonical name = wildcard-tistory-fz0x1pwf.kgslb.com.
Name:   wildcard-tistory-fz0x1pwf.kgslb.com
Address: 211.231.99.250



root@k8s-master:~# kubectl logs -n kube-system coredns-5d78c9869d-bljql -f | grep hoonii2.tistory.com
[INFO] 192.168.58.192:42030 - 11135 "A IN hoonii2.tistory.com. udp 37 false 512" NOERROR qr,rd,ra 156 0.159602539s

When CoreDNS is specified directly, the DNS query works and the corresponding log entry appears.
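
The log entry above packs the client address, query type, name, response code and duration into one line. As a hedged sketch, the hypothetical helper below reduces such lines to just those fields. It assumes the default format of the log plugin, exactly as it appears in the entry above, and ignores any line that does not have that shape.

```shell
# Sketch: summarize CoreDNS query log lines read from stdin as
#   <client-ip> <qtype> <qname> <rcode> <duration>
# coredns_query_summary is a hypothetical helper; it assumes the default
# log plugin format shown in the entry above.
coredns_query_summary() {
  awk '$1 == "[INFO]" && NF >= 15 {
         client = $2; sub(/:[0-9]+$/, "", client)   # strip the source port
         qtype = $5;  sub(/^"/, "", qtype)          # strip the opening quote
         print client, qtype, $7, $12, $15
       }'
}

# Usage:
#   kubectl logs -n kube-system coredns-5d78c9869d-bljql -f | coredns_query_summary
```

The client IP is the useful part here: it shows which Node or Pod the query came from.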

This confirms that the nodes in the k8s cluster were not using CoreDNS as their resolver.

4. (Configuration change) Changing the systemd-resolved configuration

My fix was to keep using systemd-resolved and change the nameserver it uses.

To do that, edit /etc/systemd/resolved.conf as follows.

#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details

[Resolve]
DNS=10.96.0.10
#FallbackDNS=
#Domains=
#LLMNR=no
#MulticastDNS=no
#DNSSEC=no
#DNSOverTLS=no
#Cache=no-negative
#DNSStubListener=yes
#ReadEtcHosts=yes

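The DNS address is now set to the ClusterIP of CoreDNS. ( This works here because the Corefile in Section 2 forwards to 8.8.8.8. If CoreDNS forwarded to the node's resolv.conf instead, pointing the node at CoreDNS would likely create a forwarding loop )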

root@k8s-node02:~# systemctl restart systemd-resolved

Then I restarted systemd-resolved so that the change is applied to /run/systemd/resolve/resolv.conf, the file that is actually in effect.
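
As an alternative to editing the main file, systemd also reads drop-in files from /etc/systemd/resolved.conf.d/, which keeps the packaged resolved.conf untouched. The following is only a sketch of that variant and not what I did above: write_resolved_dropin and the file name 10-coredns.conf are hypothetical.

```shell
# Sketch: write a resolved.conf drop-in that sets the DNS server.
# write_resolved_dropin and the file name 10-coredns.conf are hypothetical.
write_resolved_dropin() {
  dropin_dir="$1"   # e.g. /etc/systemd/resolved.conf.d
  dns_ip="$2"       # e.g. 10.96.0.10 (the CoreDNS ClusterIP)
  mkdir -p "$dropin_dir" &&
    printf '[Resolve]\nDNS=%s\n' "$dns_ip" > "$dropin_dir/10-coredns.conf"
}

# Usage (as root), followed by the same restart as above:
#   write_resolved_dropin /etc/systemd/resolved.conf.d 10.96.0.10
#   systemctl restart systemd-resolved
```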

5. Verifying normal operation

After that, lookups went through CoreDNS without specifying the nameserver explicitly, as shown below.

root@k8s-node02:~# nslookup hoonii2.tistory.com
Server:         127.0.0.53
Address:        127.0.0.53#53

Non-authoritative answer:
hoonii2.tistory.com     canonical name = wildcard-tistory-fz0x1pwf.kgslb.com.
Name:   wildcard-tistory-fz0x1pwf.kgslb.com
Address: 211.249.222.33

To be sure, check that the query is logged by CoreDNS as shown below.

root@k8s-master:~# kubectl logs -n kube-system coredns-5d78c9869d-bljql -f | grep hoonii2.tistory.com
[INFO] 192.168.58.192:42030 - 11135 "A IN hoonii2.tistory.com. udp 37 false 512" NOERROR qr,rd,ra 156 0.159602539s
[INFO] 192.168.58.192:34609 - 53042 "A IN hoonii2.tistory.com. udp 48 false 512" NOERROR qr,rd,ra 156 0.16294326s

Along the way we saw which process handles DNS resolution on Linux and how to enable query logging in CoreDNS.

If CoreDNS resolves only cluster-internal domains and external DNS lookups fail, please refer to the ConfigMap changes in the post below.

[k8s] Appendix A. External DNS lookup problems through CoreDNS on a k8s node

hoonii2.tistory.com