Istio Observability Series: Metrics Not Being Reported

Background

Istio version: 1.13.4

Aeraki version: 1.1.5

A business team reported that during testing the istio_requests_total metric could not be found. Checking in Prometheus showed that only some services were collecting and reporting this metric normally.
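A quick way to see which workloads are, and are not, reporting the metric is to query Prometheus directly. A minimal sketch, assuming Prometheus is reachable inside the cluster at prometheus.istio-system:9090 (adjust the address and any label filters to your deployment):

# curl -sG 'http://prometheus.istio-system:9090/api/v1/query' \
      --data-urlencode 'query=sum(istio_requests_total) by (destination_workload, destination_workload_namespace)'

Workloads that are serving traffic but missing from the result are the ones whose sidecars are not producing the metric.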

Preview the cluster state with istioctl ps:

# ./istioctl ps
NAME                                                             CLUSTER        CDS          LDS          EDS          RDS          ISTIOD                      VERSION
test-1.istio-system                                                             NOT SENT     NOT SENT     NOT SENT     NOT SENT     istiod-66bd9d59d8-k6qzk     65536.65536.65536
test-1.istio-system                                                             NOT SENT     NOT SENT     NOT SENT     NOT SENT     istiod-66bd9d59d8-k6qzk     65536.65536.65536
components-comsumer-sfkt-default-test-rfgds.components-nacos     Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.14-dev
components-provider-sfkt-default-test-6gw8b.components-nacos     Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.14-dev
components-server-sfkt-default-test-frjwm.components-nacos       Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.14-dev
istio-egressgateway-c5b57c584-rz9rw.istio-system                 Kubernetes     SYNCED       SYNCED       SYNCED       NOT SENT     istiod-66bd9d59d8-k6qzk     1.13.4
istio-ingressgateway-7767c959b4-jkbgb.istio-system               Kubernetes     SYNCED       SYNCED       SYNCED       NOT SENT     istiod-66bd9d59d8-k6qzk     1.13.4
mesh-demo-dubbo-consumer-test-rpzzh.pulsar-manager               Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.14-dev
mesh-demo-dubbo-pv1111-test-rfmtg.pulsar-manager                 Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.14-dev
mesh-demo-dubbo-pv2-test-gjhwb.pulsar-manager                    Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.14-dev
mesh-demo-httpv1-test-8b7f5.pulsar-manager                       Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.14-dev
mesh-demo-httpv2-test-qlkhb.pulsar-manager                       Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.14-dev
sl-job-agentd-sl-job-agentd-test-fskpl-lg8hq.sl-devops           Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.12-dev
sl-job-agentd-sl-job-server-test-fskpl-z5b4t.sl-devops           Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.12-dev
sl-job-web-api-sl-job-web-api-dev-2pm6n.sl-devops                Kubernetes     SYNCED       SYNCED       SYNCED       SYNCED       istiod-66bd9d59d8-k6qzk     1.12-dev

The output shows multiple proxy versions in the same Istio cluster. 1.13.4 is the version installed by Istio itself; pods running 1.12-dev report istio_requests_total normally, while pods running 1.14-dev do not. So where do 1.12-dev and 1.14-dev come from? Inspecting these pods shows that their istio-proxy image is the meta-protocol-proxy image from the Aeraki project, and the image versions are not consistent. After changing the istio-proxy image of the pods that failed to report, the metric became visible again. The two pod specs below show the relevant annotation; a one-liner for surveying every pod follows them.

# kubectl get pod -n ns sl-starter-mesh-demo-sfkt-deve-default-deve-8hlgb -o yaml|more
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-container: sl-starter-mesh-demo-hades
    kubectl.kubernetes.io/default-logs-container: sl-starter-mesh-demo-hades
    lifecycle.apps.kruise.io/timestamp: "2023-01-03T07:45:40Z"
    lxcfs-admission-webhook.aliyun.com/status: mutated
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15020"
    prometheus.io/scrape: "true"
    sidecar.istio.io/bootstrapOverride: aeraki-bootstrap-config
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/proxyImage: aeraki/meta-protocol-proxy-debug:1.1.2
...
...

# kubectl get pod -n ns components-comsumer-sfkt-default-test-rfgds -o yaml|more
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-container: components-comsumer
    kubectl.kubernetes.io/default-logs-container: components-comsumer
    lifecycle.apps.kruise.io/timestamp: "2023-01-12T05:32:51Z"
    lxcfs-admission-webhook.aliyun.com/status: mutated
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "15020"
    prometheus.io/scrape: "true"
    sidecar.istio.io/bootstrapOverride: aeraki-bootstrap-config
    sidecar.istio.io/inject: "true"
    sidecar.istio.io/proxyImage: aeraki/meta-protocol-proxy-debug:1.1.5
...
...
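Rather than dumping every pod's YAML, the proxy image in use across the mesh can be surveyed with a jsonpath one-liner over the sidecar.istio.io/proxyImage annotation (a sketch; pods that do not override the image simply print an empty third column):

# kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.metadata.annotations.sidecar\.istio\.io/proxyImage}{"\n"}{end}'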

Root Cause

According to the official Istio documentation, the default Istio metrics are also generated by EnvoyFilter resources, and each of those filters only applies to proxies of a specific version. The cluster has Istio 1.13.4 installed, which by default ships filters for proxy versions 1.11, 1.12, and 1.13, but not for 1.14. Since aeraki/meta-protocol-proxy-debug:1.1.5 is built on Istio 1.14 and no 1.14 EnvoyFilter exists in the cluster, the 1.12 proxies report the metric normally while the 1.14 proxies do not (the excerpt after the listing below shows why the version matters).

# kubectl get envoyfilter -n istio-system
NAME                    AGE
stats-filter-1.11       69d
stats-filter-1.12       69d
stats-filter-1.13       69d
tcp-stats-filter-1.11   69d
tcp-stats-filter-1.12   69d
tcp-stats-filter-1.13   69d
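The version numbers in the filter names matter because each filter only applies to sidecars whose reported version matches a proxyVersion regex in its match clause. An abbreviated, illustrative excerpt of that structure (inspect stats-filter-1.13 in the cluster for the full spec):

# kubectl get envoyfilter stats-filter-1.13 -n istio-system -o yaml
...
spec:
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_OUTBOUND
      proxy:
        proxyVersion: ^1\.13.*    # only sidecars reporting a 1.13.x version are matched
      listener:
        ...
    patch:
      ...

A proxy that reports 1.14-dev matches none of the installed filters, so the istio.stats filter is never inserted into its filter chains and istio_requests_total is never generated for it.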


After the corresponding 1.14 filters are created (one way to create them is sketched further below), the same command shows:

# kubectl get envoyfilter -n istio-system
NAME                    AGE
stats-filter-1.11       69d
stats-filter-1.12       69d
stats-filter-1.13       69d
stats-filter-1.14       2m15s
tcp-stats-filter-1.11   69d
tcp-stats-filter-1.12   69d
tcp-stats-filter-1.13   69d
tcp-stats-filter-1.14   2m8s

After the 1.14 EnvoyFilter resources are added, the metric is reported normally.
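One way to add them, assuming the 1.13 filters shipped with this installation are a usable template (a sketch; verify the result against the manifests of your Istio release):

# kubectl get envoyfilter stats-filter-1.13 -n istio-system -o yaml > stats-filter-1.14.yaml

(edit stats-filter-1.14.yaml: rename it to stats-filter-1.14, change every proxyVersion match from ^1\.13.* to ^1\.14.*, and strip server-generated fields such as resourceVersion and uid)

# kubectl apply -f stats-filter-1.14.yaml

The same steps applied to tcp-stats-filter-1.13 produce tcp-stats-filter-1.14.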

Scraping the merged metrics endpoint (port 15020) of one of the previously affected pods confirms that istio_requests_total is now present:

# curl 10.98.211.214:15020/stats/prometheus|grep istio_request|more
# TYPE istio_requests_total counter
istio_requests_total{response_code="200",reporter="destination",source_workload="components-server-sfkt-default-test",source_workload_namespace="components-nacos",source_principal="spiffe://kubeyy.com/ns/components-nacos/sa/default",source_app="components-server-sfkt-default-test",source_version="unknown",source_cluster="Kubernetes",destination_workload="components-comsumer-sfkt-default-test",destination_workload_namespace="components-nacos",destination_principal="spiffe://kubeyy.com/ns/components-nacos/sa/default",destination_app="components-comsumer-sfkt-default-test",destination_version="unknown",destination_service="components-comsumer-test-ps.components-nacos.svc.kubeyy.com",destination_service_name="components-comsumer-test-ps",destination_service_namespace="components-nacos",destination_cluster="Kubernetes",request_protocol="http",response_flags="-",grpc_response_status="",connection_security_policy="mutual_tls",source_canonical_service="components-server-sfkt-default-test",destination_canonical_service="components-comsumer-sfkt-default-test",source_canonical_revision="latest",destination_canonical_revision="latest"} 699
istio_requests_total{response_code="200",reporter="destination",source_workload="unknown",source_workload_namespace="unknown",source_principal="unknown",source_app="unknown",source_version="unknown",source_cluster="unknown",destination_workload="components-comsumer-sfkt-default-test",destination_workload_namespace="components-nacos",destination_principal="unknown",destination_app="components-comsumer-sfkt-default-test",destination_version="unknown",destination_service="svc-metrics-components-nacos-test.components-nacos.svc.kubeyy.com",destination_service_name="svc-metrics-components-nacos-test",destination_service_namespace="components-nacos",destination_cluster="Kubernetes",request_protocol="http",response_flags="-",grpc_response_status="",connection_security_policy="none",source_canonical_service="unknown",destination_canonical_service="components-comsumer-sfkt-default-test",source_canonical_revision="latest",destination_canonical_revision="latest"} 2