Prometheus monitoring for enterprise applications on Kubernetes

Source: GitHub (blackface111), published 2021-03-25
# Taint the 21 node. On the 22 machine:
~]# kubectl taint node hdss7-21.host.com node-role.kubernetes.io/master=master:NoSchedule


# On both the 21 and 22 machines, fix the cgroup symlink:
~]# mount -o remount,rw /sys/fs/cgroup/
~]# ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
~]# ls -l /sys/fs/cgroup/

**mount -o remount,rw /sys/fs/cgroup:** remounts the already-mounted /sys/fs/cgroup read-write

**ln -s:** creates the corresponding symbolic link

**ls -l:** lists details of the non-hidden files and directories
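Note that neither the remount nor the symlink survives a reboot. One way to persist them, assuming a CentOS-style /etc/rc.d/rc.local (this step is not part of the original walkthrough), is:

~]# cat >> /etc/rc.d/rc.local <<'EOF'
mount -o remount,rw /sys/fs/cgroup/
ln -sf /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
EOF
~]# chmod +x /etc/rc.d/rc.local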

# On the 22 machine, apply the resource manifest:
~]# kubectl apply -f http://k8s-yaml.od.com/cadvisor/ds.yaml
~]# kubectl get pods -n kube-system -o wide

The cadvisor pod only runs on the 22 machine, exactly as expected.

# On the 21 machine, remove the taint:
~]# kubectl taint node hdss7-21.host.com node-role.kubernetes.io/master-
# out: node/hdss7-21.host.com untainted

Check the dashboard: the taint is gone.

Then look at Pods again: with the taint removed, the cadvisor pod starts on the node automatically.

Done.

Then make one more small adjustment if needed.

Deploy blackbox-exporter

**WHAT:** monitors the liveness of business containers by probing them over HTTP or TCP.

# On the 200 machine, pull the image and prepare the manifests:
~]# docker pull prom/blackbox-exporter:v0.15.1
~]# docker images|grep blackbox-exporter
~]# docker tag 81b70b6158be harbor.od.com/public/blackbox-exporter:v0.15.1
~]# docker push harbor.od.com/public/blackbox-exporter:v0.15.1
~]# mkdir /data/k8s-yaml/blackbox-exporter
~]# cd /data/k8s-yaml/blackbox-exporter
blackbox-exporter]# vi cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: kube-system
data:
  blackbox.yml: |-
    modules:
      http_2xx:
        prober: http
        timeout: 2s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          valid_status_codes: [200,301,302]
          method: GET
          preferred_ip_protocol: "ip4"
      tcp_connect:
        prober: tcp
        timeout: 2s
blackbox-exporter]# vi dp.yaml
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: blackbox-exporter
  namespace: kube-system
  labels:
    app: blackbox-exporter
  annotations:
    deployment.kubernetes.io/revision: 1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      volumes:
      - name: config
        configMap:
          name: blackbox-exporter
          defaultMode: 420
      containers:
      - name: blackbox-exporter
        image: harbor.od.com/public/blackbox-exporter:v0.15.1
        imagePullPolicy: IfNotPresent
        args:
        - --config.file=/etc/blackbox_exporter/blackbox.yml
        - --log.level=info
        - --web.listen-address=:9115
        ports:
        - name: blackbox-port
          containerPort: 9115
          protocol: TCP
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 50Mi
        volumeMounts:
        - name: config
          mountPath: /etc/blackbox_exporter
        readinessProbe:
          tcpSocket:
            port: 9115
          initialDelaySeconds: 5
          timeoutSeconds: 5
          periodSeconds: 10
          successThreshold: 1
          failureThreshold: 3
blackbox-exporter]# vi svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  selector:
    app: blackbox-exporter
  ports:
  - name: blackbox-port
    protocol: TCP
    port: 9115
blackbox-exporter]# vi ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: blackbox-exporter
  namespace: kube-system
spec:
  rules:
  - host: blackbox.od.com
    http:
      paths:
      - path: /
        backend:
          serviceName: blackbox-exporter
          servicePort: blackbox-port


# On the 11 machine, add a DNS record:
~]# vi /var/named/od.com.zone
# increment the serial number and add:
blackbox           A    10.4.7.10
~]# systemctl restart named
# On the 22 machine:
~]# dig -t A blackbox.od.com @192.168.0.2 +short
# out: 10.4.7.10


# On the 22 machine, apply the manifests:
~]# kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/cm.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/dp.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/svc.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/blackbox-exporter/ingress.yaml


Open blackbox.od.com; the blackbox-exporter UI should load.

Done.
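As a quick sanity check you can also hit the probe endpoint through the new ingress; this is only a sketch, and 10.4.7.21:22 is an arbitrary TCP target, not something used elsewhere in this article:

~]# curl -s "http://blackbox.od.com/probe?module=tcp_connect&target=10.4.7.21:22" | grep probe_success
# probe_success 1 means the TCP connection succeeded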

Install and deploy Prometheus server

**WHAT:** the core component of the monitoring service. It pulls metrics from the exporters, stores them, and offers a flexible query language (PromQL).

Prometheus server image on Docker Hub

# On the 200 machine, prepare the image and manifests:
~]# docker pull prom/prometheus:v2.14.0
~]# docker images|grep prometheus
~]# docker tag 7317640d555e harbor.od.com/infra/prometheus:v2.14.0
~]# docker push harbor.od.com/infra/prometheus:v2.14.0
~]# mkdir /data/k8s-yaml/prometheus
~]# cd /data/k8s-yaml/prometheus
prometheus]# vi rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
  namespace: infra
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - nodes/metrics
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: infra
prometheus]# vi dp.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "5"
  labels:
    name: prometheus
  name: prometheus
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      app: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: harbor.od.com/infra/prometheus:v2.14.0
        imagePullPolicy: IfNotPresent
        command:
        - /bin/prometheus
        args:
        - --config.file=/data/etc/prometheus.yml
        - --storage.tsdb.path=/data/prom-db
        - --storage.tsdb.min-block-duration=10m
        - --storage.tsdb.retention=72h
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: /data
          name: data
        resources:
          requests:
            cpu: "1000m"
            memory: "1.5Gi"
          limits:
            cpu: "2000m"
            memory: "3Gi"
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      serviceAccountName: prometheus
      volumes:
      - name: data
        nfs:
          server: hdss7-200
          path: /data/nfs-volume/prometheus
prometheus]# vi svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: infra
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
prometheus]# vi ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: traefik
  name: prometheus
  namespace: infra
spec:
  rules:
  - host: prometheus.od.com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus
          servicePort: 9090
# Prepare the Prometheus configuration:
prometheus]# mkdir /data/nfs-volume/prometheus
prometheus]# cd /data/nfs-volume/prometheus
prometheus]# mkdir {etc,prom-db}
prometheus]# cd etc/
etc]# cp /opt/certs/ca.pem .
etc]# cp -a /opt/certs/client.pem .
etc]# cp -a /opt/certs/client-key.pem .
etc]# vi prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
- job_name: 'etcd'
  tls_config:
    ca_file: /data/etc/ca.pem
    cert_file: /data/etc/client.pem
    key_file: /data/etc/client-key.pem
  scheme: https
  static_configs:
  - targets:
    - '10.4.7.12:2379'
    - '10.4.7.21:2379'
    - '10.4.7.22:2379'
- job_name: 'kubernetes-apiservers'
  kubernetes_sd_configs:
  - role: endpoints
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  relabel_configs:
  - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
    action: keep
    regex: default;kubernetes;https
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'kubernetes-kubelet'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:10255
- job_name: 'kubernetes-cadvisor'
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __address__
    replacement: ${1}:4194
- job_name: 'kubernetes-kube-state'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
  - source_labels: [__meta_kubernetes_pod_label_grafanak8sapp]
    regex: .*true.*
    action: keep
  - source_labels: ['__meta_kubernetes_pod_label_daemon', '__meta_kubernetes_pod_node_name']
    regex: 'node-exporter;(.*)'
    action: replace
    target_label: nodename
- job_name: 'blackbox_http_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [http_2xx]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: http
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+);(.+)
    replacement: $1:$2$3
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'blackbox_tcp_pod_probe'
  metrics_path: /probe
  kubernetes_sd_configs:
  - role: pod
  params:
    module: [tcp_connect]
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
    action: keep
    regex: tcp
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __param_target
  - action: replace
    target_label: __address__
    replacement: blackbox-exporter.kube-system:9115
  - source_labels: [__param_target]
    target_label: instance
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name
- job_name: 'traefik'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
    action: keep
    regex: traefik
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_pod_name]
    action: replace
    target_label: kubernetes_pod_name

**cp -a:** copy while preserving links and file attributes; with directories it also copies everything underneath


# On the 11 machine, add a DNS record (anything with an ingress needs one):
~]# vi /var/named/od.com.zone
# increment the serial number and add:
prometheus         A    10.4.7.10
~]# systemctl restart named
~]# dig -t A prometheus.od.com @10.4.7.11 +short
# out: 10.4.7.10


# On the 22 machine, apply the manifests:
~]# kubectl apply -f http://k8s-yaml.od.com/prometheus/rbac.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/prometheus/dp.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/prometheus/svc.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/prometheus/ingress.yaml


Open prometheus.od.com.

This is Prometheus' built-in UI. It is fairly spartan, which is exactly why we will put Grafana in front of it; if that is not obvious yet, it will be once you see the Grafana pages.
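The same data the UI shows can also be pulled from the HTTP API, which is handy for scripting; a minimal sketch (the PromQL query itself is just an example):

~]# curl -sG "http://prometheus.od.com/api/v1/query" --data-urlencode 'query=up{job="kubernetes-kubelet"}'
# returns a JSON vector with one "up" sample per kubelet target; value 1 means the scrape succeeded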

Done.

Configure Prometheus to monitor business containers

Start with traefik.

# In the dashboard, Edit a Daemon Set and add the following annotations block
# (remember the comma on the preceding line; paste it in and Update, the indentation is fixed up automatically):
"annotations": {
  "prometheus_io_scheme": "traefik",
  "prometheus_io_path": "/metrics",
  "prometheus_io_port": "8080"
}
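If you prefer the command line to the dashboard edit, the same pod-template annotations can be applied with kubectl patch; a minimal sketch, assuming the DaemonSet is named traefik-ingress in kube-system (check the real name with kubectl get ds -n kube-system):

~]# kubectl -n kube-system patch daemonset traefik-ingress \
      -p '{"spec":{"template":{"metadata":{"annotations":{"prometheus_io_scheme":"traefik","prometheus_io_path":"/metrics","prometheus_io_port":"8080"}}}}}'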

Delete the two traefik pods so they restart with the new annotations.

# On the 22 machine, check; if a pod will not come back up, force-delete it from the command line:
~]# kubectl get pods -n kube-system
~]# kubectl delete pods traefik-ingress-g26kw -n kube-system --force --grace-period=0


Once they are running again, check Prometheus.

After a refresh, the traefik target shows up as 2/2.


Done.

blackbox

Next, start a dubbo-service. The last version we deployed used Apollo, but Apollo has since been shut down (it eats resources), so bring up the earlier, non-Apollo version.

Look the image up in Harbor.

My Harbor may list one more Apollo tag than yours; ignore it, it was left over from an experiment.

Change the image tag in the Deployment.

Then set the scale to 1.

Check the pod's logs.

Page through the logs: the service has started.

To monitor its liveness, all that is needed is a configuration change.

# Edit the dubbo-service Deployment (TCP probe) and add the following
# (paste it in and Update; the indentation is fixed up automatically):
"annotations": {
  "blackbox_port": "20880",
  "blackbox_scheme": "tcp"
}

After Update, the pod is running again.

Refresh prometheus.od.com: the service is discovered automatically.

Refresh blackbox.od.com as well.

In the same way, bring dubbo-consumer in.

First find a non-Apollo tag in Harbor (the reason was explained above).

Change the image tag and add the annotations.

# Edit the dubbo-consumer Deployment (HTTP probe) and add the following
# (remember the comma on the preceding line; paste it in and Update):
"annotations": {
  "blackbox_path": "/hello?name=health",
  "blackbox_port": "8080",
  "blackbox_scheme": "http"
}
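To see what the http_2xx module will end up probing, you can issue the same request by hand; a minimal sketch, where the pod IP is an example (look up the real one with kubectl get pods -o wide):

~]# curl -sG "http://blackbox.od.com/probe" \
      --data-urlencode "module=http_2xx" \
      --data-urlencode "target=172.7.21.6:8080/hello?name=health" | grep probe_success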

After Update, set the scale to 1.

Make sure it comes up.

Refresh prometheus.od.com: the service is discovered automatically.

Refresh blackbox.od.com: the HTTP probe is listed as well.

Install, deploy, and configure Grafana

**WHAT:** a polished, powerful tool for visualizing monitoring metrics.

**WHY:** to replace the bare-bones native Prometheus UI.

# On the 200 machine, prepare the image and manifests:
~]# docker pull grafana/grafana:5.4.2
~]# docker images|grep grafana
~]# docker tag 6f18ddf9e552 harbor.od.com/infra/grafana:v5.4.2
~]# docker push harbor.od.com/infra/grafana:v5.4.2
~]# mkdir /data/k8s-yaml/grafana/ /data/nfs-volume/grafana
~]# cd /data/k8s-yaml/grafana/
grafana]# vi rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
rules:
- apiGroups:
  - "*"
  resources:
  - namespaces
  - deployments
  - pods
  verbs:
  - get
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/cluster-service: "true"
  name: grafana
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: grafana
subjects:
- kind: User
  name: k8s-node
grafana]# vi dp.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: grafana
    name: grafana
  name: grafana
  namespace: infra
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 7
  selector:
    matchLabels:
      name: grafana
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: grafana
        name: grafana
    spec:
      containers:
      - name: grafana
        image: harbor.od.com/infra/grafana:v5.4.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: data
      imagePullSecrets:
      - name: harbor
      securityContext:
        runAsUser: 0
      volumes:
      - nfs:
          server: hdss7-200
          path: /data/nfs-volume/grafana
        name: data
grafana]# vi svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: infra
spec:
  ports:
  - port: 3000
    protocol: TCP
    targetPort: 3000
  selector:
    app: grafana
grafana]# vi ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: infra
spec:
  rules:
  - host: grafana.od.com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000


# On the 11 machine, add a DNS record:
~]# vi /var/named/od.com.zone
# increment the serial number and add:
grafana            A    10.4.7.10
~]# systemctl restart named
~]# ping grafana.od.com


# On the 22 machine, apply the manifests:
~]# kubectl apply -f http://k8s-yaml.od.com/grafana/rbac.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/grafana/dp.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/grafana/svc.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/grafana/ingress.yaml


Open grafana.od.com.

The default username and password are both admin.

Change the password to admin123.

Adjust the basic preferences in the Grafana configuration.

Install the plugins.

Exec into the Grafana container.
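If you are not using the dashboard's exec button, you can get the same shell from the command line; a minimal sketch (the pod name is an example, look up yours with kubectl get pods -n infra):

~]# kubectl -n infra exec -it grafana-d6588db94-xxxxx -- /bin/bash
# use /bin/sh instead if bash is not present in the image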

# 1. Kubernetes App
grafana# grafana-cli plugins install grafana-kubernetes-app
# 2. Clock Panel
grafana# grafana-cli plugins install grafana-clock-panel
# 3. Pie Chart
grafana# grafana-cli plugins install grafana-piechart-panel
# 4. D3 Gauge
grafana# grafana-cli plugins install briangann-gauge-panel
# 5. Discrete
grafana# grafana-cli plugins install natel-discrete-panel


After installation you can verify on the 200 machine:

# On the 200 machine:
~]# cd /data/nfs-volume/grafana/plugins/
plugins]# ll


Delete the Grafana pod so it restarts.

Once it has restarted, check grafana.od.com: the five plugins you just installed should all be listed (do verify this).


Add a data source: Add data source.

# Fill in the parameters:
URL: http://prometheus.od.com
TLS Client Auth ✔    With CA Cert ✔


# Paste in the corresponding PEM contents, taken from the 200 machine:
~]# cat /opt/certs/ca.pem
~]# cat /opt/certs/client.pem
~]# cat /opt/certs/client-key.pem

Save.

Then configure the Kubernetes app under Plugins.

A new entry appears in the side menu; click into it.

# Fill in the parameters:
Name: myk8s
URL: https://10.4.7.10:7443
Access: Server
TLS Client Auth ✔    With CA Cert ✔


# Paste in the PEM contents, again from the 200 machine:
~]# cat /opt/certs/ca.pem
~]# cat /opt/certs/client.pem
~]# cat /opt/certs/client-key.pem

After saving, click the plugin's icon in the side menu and then click the data source name.

It can take a little while (about two minutes) before data starts to come in.

Use the K8s Cluster drop-down at the top right to pick what you want to look at.

The bundled K8s Container dashboard is missing data, so we will swap the bundled dashboards out: delete Cluster,

delete Container,

delete Deployment,

and delete Node.

Now import the dashboard JSON files I prepared for you.

Import node, deployment, cluster, and container the same way, one at a time.

Look through them: they all render correctly now.

Then import etcd, generic, and traefik as well.

There is another way to import dashboards, using the ones published on the Grafana site:

Grafana official dashboard site

Find one that somebody has already written and open it.

Its ID can be used directly in Grafana's import dialog.

For blackbox, the dashboard ID we install is 9965.

Adjust the name and select the Prometheus data source.

Alternatively, use the one I uploaded (I use 7587).

You can keep both and compare them; that is fine, it just costs a little extra resource.

JMX

This dashboard is still empty for now.

Getting Dubbo microservice metrics into Grafana

dubbo-service

# Edit the dubbo-service Deployment and add the following annotations
# (remember the comma on the preceding line; paste them in and Update):
"prometheus_io_scrape": "true",
"prometheus_io_port": "12346",
"prometheus_io_path": "/"

dubbo-consumer

# Edit the dubbo-consumer Deployment and add the same annotations
# (remember the comma on the preceding line; paste them in and Update):
"prometheus_io_scrape": "true",
"prometheus_io_port": "12346",
"prometheus_io_path": "/"
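These annotations are what the kubernetes-pods scrape job keys on. To confirm the JMX agent endpoint they point at is actually serving metrics, you can curl a pod directly from a node; the pod IP below is an example (get the real one with kubectl get pods -o wide):

~]# curl -s http://172.7.21.5:12346/ | head
# jvm_* and jmx_* metrics should be listed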

Refresh the JMX dashboard (it can be slow; on my struggling machine it took about a minute for the service to show up).

Done.

By now you can really feel how much friendlier Grafana is than the native UI.

Install and deploy Alertmanager

**WHAT:** receives alerts from the Prometheus server, deduplicates and groups them, routes them to the configured receivers, and sends the notifications. Common receivers include email, PagerDuty, and so on.

**WHY:** so that the system's warnings reach us as they happen.

# On the 200 machine, prepare the image and manifests:
~]# mkdir /data/k8s-yaml/alertmanager
~]# cd /data/k8s-yaml/alertmanager
alertmanager]# docker pull docker.io/prom/alertmanager:v0.14.0
# Note: versions other than v0.14.0 may error out here
alertmanager]# docker images|grep alert
alertmanager]# docker tag 23744b2d645c harbor.od.com/infra/alertmanager:v0.14.0
alertmanager]# docker push harbor.od.com/infra/alertmanager:v0.14.0
# Remember to change the mail settings below to your own; the comments are informational and can be dropped
alertmanager]# vi cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: infra
data:
  config.yml: |-
    global:
      # time after which an alert is declared resolved if it stops firing
      resolve_timeout: 5m
      # outgoing mail settings
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'ben909336740@163.com'
      smtp_auth_username: 'ben909336740@163.com'
      smtp_auth_password: 'xxxxxx'
      smtp_require_tls: false
    # root route: how incoming alerts are grouped and dispatched
    route:
      # alerts are regrouped by these labels; e.g. alerts carrying cluster=A and
      # alertname=LatencyHigh are aggregated into a single group
      group_by: ['alertname', 'cluster']
      # wait after a new group is created, so related alerts can be sent in one notification
      group_wait: 30s
      # wait before notifying about new alerts added to an existing group
      group_interval: 5m
      # wait before re-sending a notification that was already sent successfully
      repeat_interval: 5m
      # default receiver for alerts that match no other route
      receiver: default
    receivers:
    - name: 'default'
      email_configs:
      - to: '909336740@qq.com'
        send_resolved: true
alertmanager]# vi dp.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: harbor.od.com/infra/alertmanager:v0.14.0
        args:
        - "--config.file=/etc/alertmanager/config.yml"
        - "--storage.path=/alertmanager"
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: alertmanager-cm
          mountPath: /etc/alertmanager
      volumes:
      - name: alertmanager-cm
        configMap:
          name: alertmanager-config
      imagePullSecrets:
      - name: harbor
alertmanager]# vi svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: infra
spec:
  selector:
    app: alertmanager
  ports:
  - port: 80
    targetPort: 9093


# On the 22 machine, apply the manifests:
~]# kubectl apply -f http://k8s-yaml.od.com/alertmanager/cm.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/alertmanager/dp.yaml
~]# kubectl apply -f http://k8s-yaml.od.com/alertmanager/svc.yaml


# On the 200 machine, configure the alerting rules:
~]# vi /data/nfs-volume/prometheus/etc/rules.yml
groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    expr: sum(avg without (cpu)(irate(node_cpu{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }}%)"
  - alert: hostMemUsageAlert
    expr: (node_memory_MemTotal - node_memory_MemAvailable)/node_memory_MemTotal > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }}%)"
  - alert: OutOfInodes
    expr: node_filesystem_free{fstype="overlay",mountpoint ="/"} / node_filesystem_size{fstype="overlay",mountpoint ="/"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of inodes (instance {{ $labels.instance }})"
      description: "Disk is almost running out of available inodes (< 10% left) (current value: {{ $value }})"
  - alert: OutOfDiskSpace
    expr: node_filesystem_free{fstype="overlay",mountpoint ="/rootfs"} / node_filesystem_size{fstype="overlay",mountpoint ="/rootfs"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Out of disk space (instance {{ $labels.instance }})"
      description: "Disk is almost full (< 10% left) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputIn
    expr: sum by (instance) (irate(node_network_receive_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput in (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualNetworkThroughputOut
    expr: sum by (instance) (irate(node_network_transmit_bytes[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual network throughput out (instance {{ $labels.instance }})"
      description: "Host network interfaces are probably sending too much data (> 100 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadRate
    expr: sum by (instance) (irate(node_disk_bytes_read[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read rate (instance {{ $labels.instance }})"
      description: "Disk is probably reading too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskWriteRate
    expr: sum by (instance) (irate(node_disk_bytes_written[2m])) / 1024 / 1024 > 50
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write rate (instance {{ $labels.instance }})"
      description: "Disk is probably writing too much data (> 50 MB/s) (current value: {{ $value }})"
  - alert: UnusualDiskReadLatency
    expr: rate(node_disk_read_time_ms[1m]) / rate(node_disk_reads_completed[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk read latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (read operations > 100ms) (current value: {{ $value }})"
  - alert: UnusualDiskWriteLatency
    expr: rate(node_disk_write_time_ms[1m]) / rate(node_disk_writes_completed[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Unusual disk write latency (instance {{ $labels.instance }})"
      description: "Disk latency is growing (write operations > 100ms) (current value: {{ $value }})"
- name: http_status
  rules:
  - alert: ProbeFailed
    expr: probe_success == 0
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Probe failed (instance {{ $labels.instance }})"
      description: "Probe failed (current value: {{ $value }})"
  - alert: StatusCode
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Status Code (instance {{ $labels.instance }})"
      description: "HTTP status code is not 200-399 (current value: {{ $value }})"
  - alert: SslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
      description: "SSL certificate expires in 30 days (current value: {{ $value }})"
  - alert: SslCertificateHasExpired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 5m
    labels:
      severity: error
    annotations:
      summary: "SSL certificate has expired (instance {{ $labels.instance }})"
      description: "SSL certificate has expired already (current value: {{ $value }})"
  - alert: BlackboxSlowPing
    expr: probe_icmp_duration_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow ping (instance {{ $labels.instance }})"
      description: "Blackbox ping took more than 2s (current value: {{ $value }})"
  - alert: BlackboxSlowRequests
    expr: probe_http_duration_seconds > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Blackbox slow requests (instance {{ $labels.instance }})"
      description: "Blackbox request took more than 2s (current value: {{ $value }})"
  - alert: PodCpuUsagePercent
    expr: sum(sum(label_replace(irate(container_cpu_usage_seconds_total[1m]),"pod","$1","container_label_io_kubernetes_pod_name", "(.*)"))by(pod) / on(pod) group_right kube_pod_container_resource_limits_cpu_cores *100 )by(container,namespace,node,pod,severity) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod cpu usage percent has exceeded 80% (current value: {{ $value }}%)"

# Append the following at the very end of prometheus.yml:
~]# vi /data/nfs-volume/prometheus/etc/prometheus.yml
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager"]
rule_files:
- "/data/etc/rules.yml"
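Before reloading, it is worth syntax-checking the rule file. The Prometheus image ships promtool alongside the server binary, so a quick, sketch-only way to run it from the 200 machine is:

~]# docker run --rm --entrypoint /bin/promtool \
      -v /data/nfs-volume/prometheus/etc:/data/etc \
      harbor.od.com/infra/prometheus:v2.14.0 check rules /data/etc/rules.yml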

**rules.yml:** this file holds the alerting rules.

At this point you could simply restart the Prometheus pod, but in production Prometheus is so heavy that deleting the pod can drag the cluster down, so we use another approach that Prometheus supports: a graceful reload.

# On the 21 machine (that is where our Prometheus pod landed), reload gracefully:
~]# ps aux|grep prometheus
~]# kill -SIGHUP 1488
# use the PID shown by ps
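An equivalent approach is Prometheus' HTTP reload endpoint; note this only works if the server is started with --web.enable-lifecycle, which the dp.yaml above does not set, so treat it as a sketch:

~]# curl -X POST http://prometheus.od.com/-/reload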

The alerting rules are now all loaded.

Test the Alertmanager notification flow

First enable SMTP on both of the mailboxes involved.

Now test it: stop dubbo-service, which will make the consumer start failing.

Set the service's scale to 0.

Check blackbox.od.com: the probe has gone to failure.

Check prometheus.od.com/alerts: two alerts have turned red (they start out yellow while pending).

The 163 mailbox now shows the alert in its sent mail, and the QQ mailbox receives it.

Done (remember to set the service's scale back to 1).

About rules.yml: alerts must neither fire falsely nor go missing. In practice you keep tuning the rules so that they match your company's actual needs.

When resources are tight, non-essential components can be scaled down:

# On the 22 machine (the dashboard works too):
~]# kubectl scale deployment grafana --replicas=0 -n infra
# out: deployment.extensions/grafana scaled
~]# kubectl scale deployment alertmanager --replicas=0 -n infra
# out: deployment.extensions/alertmanager scaled
~]# kubectl scale deployment prometheus --replicas=0 -n infra
# out: deployment.extensions/prometheus scaled

Connecting the Dubbo microservices deployed on Kubernetes to an ELK stack

**WHAT:** ELK is an acronym for three open-source projects:

E, Elasticsearch: a distributed search engine that collects, analyzes, and stores data.
L, Logstash: a tool for collecting, parsing, and filtering logs, supporting a large number of input sources.
K, Kibana: a friendly web UI for analyzing the logs in Logstash and Elasticsearch, helping you aggregate, analyze, and search the data that matters.
Plus the newer Filebeat, a lightweight streaming log shipper: it uses little resource, is well suited to collecting logs on each server and forwarding them to Logstash, and is officially recommended to take over part of what Logstash used to do.

**WHY:** with container orchestration, business containers are constantly created, destroyed, migrated, and scaled up and down. Faced with that volume of log data scattered across so many places, logging into each machine the traditional way is impossible, so we need a centralized log collection and analysis system that provides:

Collection: gather log data from many sources (a streaming log collector)
Transport: ship the log data reliably to a central system (a message queue)
Storage: store the logs as structured data (a search engine)
Analysis: convenient analysis and search with a GUI (a front end)
Alerting: error reporting and monitoring (a monitoring tool)

That is ELK.

ELK Stack overview

**c1/c2:** short for container.

**filebeat:** collects the business container's logs; the container and filebeat run in the same pod so they stay tightly coupled.

**kafka:** a high-throughput distributed publish/subscribe messaging system, able to handle all of a site's activity-stream data. Filebeat publishes what it collects to Kafka as topics.

**Topic:** the basic unit Kafka writes data into.

**logstash:** consumes the topics from Kafka and uploads the data to Elasticsearch (an asynchronous fetch-and-forward process).

**index-pattern:** splits the data by environment (prod vs. test) and feeds it to Kibana.

**kibana:** displays the data.
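The data path just described boils down to two small configuration fragments. The following is only a sketch: the broker address, topic name, log path, Elasticsearch address, and index name are assumptions, not values taken from this article.

# Filebeat side: tail the app log and publish it to a Kafka topic
~]# cat > filebeat.yml <<'EOF'
filebeat.inputs:
- type: log
  paths:
    - /logm/stdout.log
output.kafka:
  hosts: ["10.4.7.11:9092"]
  topic: "k8s-fb-test-dubbo"
EOF
# Logstash side: consume that topic and index it into Elasticsearch
~]# cat > logstash-test.conf <<'EOF'
input {
  kafka {
    bootstrap_servers => "10.4.7.11:9092"
    topics_pattern    => "k8s-fb-test-.*"
  }
}
output {
  elasticsearch {
    hosts => ["10.4.7.12:9200"]
    index => "k8s-test-%{+YYYY.MM.DD}"
  }
}
EOF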

Build the Tomcat base image

We try the Tomcat route this time, because many companies' legacy projects still run on Tomcat; up to now we have been using Spring Boot.

Tomcat official site


# On the 200 machine:
~]# cd /opt/src/
# You can also use the copy I uploaded: the version keeps moving and older releases drop off the mirror.
# Check the current version on the site above.
src]# wget http://mirrors.tuna.tsinghua.edu.cn/apache/tomcat/tomcat-8/v8.5.51/bin/apache-tomcat-8.5.51.tar.gz
src]# mkdir /data/dockerfile/tomcat
src]# tar xfv apache-tomcat-8.5.51.tar.gz -C /data/dockerfile/tomcat
src]# cd /data/dockerfile/tomcat
# Configure Tomcat: disable the AJP connector
tomcat]# vi apache-tomcat-8.5.51/conf/server.xml
# Find the AJP connector and comment that line out; in 8.5.51 it already ships commented out


# On the 200 machine, drop the logging we do not need:
tomcat]# vi apache-tomcat-8.5.51/conf/logging.properties
# Remove the 3manager and 4host-manager handlers and comment out their related settings
# Change the log level to INFO

# On the 200 machine, prepare the Dockerfile:
tomcat]# vi Dockerfile
FROM harbor.od.com/public/jre:8u112
RUN /bin/cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime &&\
    echo 'Asia/Shanghai' > /etc/timezone
ENV CATALINA_HOME /opt/tomcat
ENV LANG zh_CN.UTF-8
ADD apache-tomcat-8.5.51/ /opt/tomcat
ADD config.yml /opt/prom/config.yml
ADD jmx_javaagent-0.3.1.jar /opt/prom/jmx_javaagent-0.3.1.jar
WORKDIR /opt/tomcat
ADD entrypoint.sh /entrypoint.sh
CMD ["/entrypoint.sh"]
tomcat]# vi config.yml
rules:
  - pattern: '-*'
tomcat]# wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.3.1/jmx_prometheus_javaagent-0.3.1.jar -O jmx_javaagent-0.3.1.jar
tomcat]# vi entrypoint.sh
#!/bin/bash
M_OPTS="-Duser.timezone=Asia/Shanghai -javaagent:/opt/prom/jmx_javaagent-0.3.1.jar=$(hostname -i):${M_PORT:-"12346"}:/opt/prom/config.yml"
C_OPTS=${C_OPTS}
MIN_HEAP=${MIN_HEAP:-"128m"}
MAX_HEAP=${MAX_HEAP:-"128m"}
JAVA_OPTS=${JAVA_OPTS:-"-Xmn384m -Xss256k -Duser.timezone=GMT+08 -XX:+DisableExplicitGC -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSClassUnloadingEnabled -XX:LargePageSizeInBytes=128m -XX:+UseFastAccessorMethods -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=80 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+PrintClassHistogram -Dfile.encoding=UTF8 -Dsun.jnu.encoding=UTF8"}
CATALINA_OPTS="${CATALINA_OPTS}"
JAVA_OPTS="${M_OPTS} ${C_OPTS} -Xms${MIN_HEAP} -Xmx${MAX_HEAP} ${JAVA_OPTS}"
sed -i -e "1a\JAVA_OPTS=\"$JAVA_OPTS\"" -e "1a\CATALINA_OPTS=\"$CATALINA_OPTS\"" /opt/tomcat/bin/catalina.sh
cd /opt/tomcat && /opt/tomcat/bin/catalina.sh run 2>&1 >> /opt/tomcat/logs/stdout.log
tomcat]# chmod u+x entrypoint.sh
tomcat]# ll
tomcat]# docker build . -t harbor.od.com/base/tomcat:v8.5.51
tomcat]# docker push harbor.od.com/base/tomcat:v8.5.51

Dockerfile breakdown:

FROM: the base image
RUN: set the time zone
ENV: set CATALINA_HOME so Tomcat lives under /opt/tomcat
ENV: use the zh_CN.UTF-8 character set
ADD: unpack apache-tomcat-8.5.51 into /opt/tomcat
ADD: config.yml for Prometheus' file-based discovery; optional, since it is not used here
ADD: the jmx_javaagent-0.3.1.jar agent, which exports JVM metrics over an HTTP endpoint
WORKDIR: the working directory
ADD: copy in entrypoint.sh
CMD: run entrypoint.sh
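A quick way to smoke-test the image before using it as a base is to run it directly and curl the JMX agent port; this is only a sketch (the container name and the published port are arbitrary choices, not part of the original walkthrough):

~]# docker run --rm -d --name tomcat-smoke -p 12346:12346 harbor.od.com/base/tomcat:v8.5.51
~]# curl -s http://127.0.0.1:12346/ | head
# jvm_* metrics from the JMX java agent should appear
~]# docker rm -f tomcat-smoke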
