分布式链路追踪
核心概念
Trace 是一个逻辑概念,表示一次(分布式)请求经过的所有局部操作(Span)构成的一条完整的有向无环图,其中所有的 Span 的 TraceId 相同。
Span 则是真实的数据实体模型,表示一次(分布式)请求过程的一个步骤或操作,代表系统中一个逻辑运行单元,Span 之间通过嵌套或者顺序排列建立因果关系。Span 数据在采集端生成,之后上报到服务端,做进一步的处理。其包含如下关键属性:
- Name:操作名称,如一个 RPC 方法的名称,一个函数名
- StartTime/EndTime:起始时间和结束时间,操作的生命周期
- ParentSpanId:父级 Span 的 ID
- Attributes:属性,一组 <K,V> 键值对构成的集合
- Event:操作期间发生的事件
- SpanContext:Span 上下文内容,通常用于在 Span 间传播,其核心字段包括 TraceId、SpanId
一般架构
在应用端需要通过侵入或者非侵入的方式,注入 Tracing Sdk,以跟踪、生成、传播和上报请求调用链路数据;
Collect agent 一般是在靠近应用侧的一个边缘计算层,主要用于提高 Tracing Sdk 的写性能,和减少 back-end 的计算压力;
采集的链路跟踪数据上报到后端时,首先经过 Gateway 做一个鉴权,之后进入 kafka 这样的 MQ 进行消息的缓冲存储;
在数据写入存储层之前,我们可能需要对消息队列中的数据做一些清洗和分析的操作,清洗是为了规范和适配不同的数据源上报的数据,分析通常是为了支持更高级的业务功能,比如流量统计、错误分析等,这部分通常采用flink这类的流处理框架来完成;
存储层会是服务端设计选型的一个重点,要考虑数据量级和查询场景的特点来设计选型,通常的选择包括使用 Elasticsearch、Cassandra、或 Clickhouse 这类开源产品;
流处理分析后的结果,一方面作为存储持久化下来,另一方面也会进入告警系统,以主动发现问题来通知用户,如错误率超过指定阈值发出告警通知这样的需求等。
开源实现
SkyWalking
前置条件:
搭建skywalking需要三个镜像:
1.es 存储数据
2.oap 服务器
3.ui 界面
这里涉及到多个服务,采用docker-compose部署:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
|
version: '3.8'
services:
elasticsearch:
# image: docker.elastic.co/elasticsearch/elasticsearch-oss:${ES_VERSION}
image: elasticsearch:7.9.0
container_name: elasticsearch
# --restart=always : 开机启动,失败也会一直重启;
# --restart=on-failure:10 : 表示最多重启10次
restart: always
ports:
- "9200:9200"
- "9300:9300"
healthcheck:
test: [ "CMD-SHELL", "curl --silent --fail localhost:9200/_cluster/health || exit 1" ]
interval: 30s
timeout: 10s
retries: 3
start_period: 10s
environment:
- discovery.type=single-node
# 锁定物理内存地址,防止elasticsearch内存被交换出去,也就是避免es使用swap交换分区,频繁的交换,会导致IOPS变高;
- bootstrap.memory_lock=true
# 设置时区
- TZ=Asia/Shanghai
# - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
ulimits:
memlock:
soft: -1
hard: -1
oap:
image: apache/skywalking-oap-server:8.9.1
container_name: oap
restart: always
# 设置依赖的容器
depends_on:
elasticsearch:
condition: service_healthy
# 关联ES的容器,通过容器名字来找到相应容器,解决IP变动问题
links:
- elasticsearch
# 端口映射
ports:
- "11800:11800"
- "12800:12800"
# 监控检查
healthcheck:
test: [ "CMD-SHELL", "/skywalking/bin/swctl ch" ]
# 每间隔30秒执行一次
interval: 30s
# 健康检查命令运行超时时间,如果超过这个时间,本次健康检查就被视为失败;
timeout: 10s
# 当连续失败指定次数后,则将容器状态视为 unhealthy,默认 3 次。
retries: 3
# 应用的启动的初始化时间,在启动过程中的健康检查失效不会计入,默认 0 秒。
start_period: 10s
environment:
# 指定存储方式
SW_STORAGE: elasticsearch
# 指定存储的地址
SW_STORAGE_ES_CLUSTER_NODES: elasticsearch:9200
SW_HEALTH_CHECKER: default
SW_TELEMETRY: prometheus
TZ: Asia/Shanghai
# JAVA_OPTS: "-Xms2048m -Xmx2048m"
ui:
image: apache/skywalking-ui:8.9.1
container_name: skywalking-ui
restart: always
depends_on:
oap:
condition: service_healthy
links:
- oap
ports:
- "8088:8080"
environment:
SW_OAP_ADDRESS: http://oap:12800
TZ: Asia/Shanghai
|
执行成功后,docker会启动对应的容器
验证步骤:
打开localhost:8088 看下ui界面是否能访问
访问:http://localhost:9200/ 是否有数据返回,以验证es是否正常启动
部署
1.下载对应系统的skywalking的agent
2.执行如下命令:
1
|
cd /path/to/skywalking && make build
|
3.执行成功会在bin文件夹下多出对应系统的可执行文件
不同的操作系统对应的可执行文件不同。例如,mac系统需选择skywalking-go-agent–darwin-amd64
4.打开Go项目,在main package中导入skywalking module
1
2
3
4
5
|
package main
import (
_ "github.com/apache/skywalking-go"
)
|
5.配置config.yaml文件
配置文件示例:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
|
agent:
# Service name is showed in UI.
service_name: ${SW_AGENT_NAME:Your_ApplicationName}
# To obtain the environment variable key for the instance name, if it cannot be obtained, an instance name will be automatically generated.
instance_env_name: SW_AGENT_INSTANCE_NAME
# Sampling rate of tracing data, which is a floating-point value that must be between 0 and 1.
sampler: ${SW_AGENT_SAMPLE:1}
meter:
# The interval of collecting metrics, in seconds.
collect_interval: ${SW_AGENT_METER_COLLECT_INTERVAL:20}
sampling:
rate: 100
reporter:
grpc:
# The gRPC server address of the backend service.
backend_service: ${SW_AGENT_REPORTER_GRPC_BACKEND_SERVICE:127.0.0.1:11800}
# The maximum count of segment for reporting tracing data.
max_send_queue: ${SW_AGENT_REPORTER_GRPC_MAX_SEND_QUEUE:5000}
# The interval(s) of checking service and backend service
check_interval: ${SW_AGENT_REPORTER_GRPC_CHECK_INTERVAL:20}
# The authentication string for communicate with backend.
authentication: ${SW_AGENT_REPORTER_GRPC_AUTHENTICATION:}
# The interval(s) of fetching dynamic configuration from backend.
cds_fetch_interval: ${SW_AGENT_REPORTER_GRPC_CDS_FETCH_INTERVAL:20}
tls:
# Whether to enable TLS with backend.
enable: ${SW_AGENT_REPORTER_GRPC_TLS_ENABLE:false}
# The file path of ca.crt. The config only works when opening the TLS switch.
ca_path: ${SW_AGENT_REPORTER_GRPC_TLS_CA_PATH:}
# The file path of client.pem. The config only works when mTLS.
client_key_path: ${SW_AGENT_REPORTER_GRPC_TLS_CLIENT_KEY_PATH:}
# The file path of client.crt. The config only works when mTLS.
client_cert_chain_path: ${SW_AGENT_REPORTER_GRPC_TLS_CLIENT_CERT_CHAIN_PATH:}
# Controls whether a client verifies the server's certificate chain and host name.
insecure_skip_verify: ${SW_AGENT_REPORTER_GRPC_TLS_INSECURE_SKIP_VERIFY:false}
log:
# The type determines which logging type is currently used by the system.
# The Go agent wourld use this log type to generate custom logs. It supports: "auto", "logrus", or "zap".
# auto: Automatically identifies the source of the log.
# If logrus is present in the project, it wourld automatically use logrus.
# If zap has been initialized in the project, it would use the zap framework.
# By default, it would use std errors to output log content.
# logrus: Specifies that the Agent should use the logrus framework.
# zap: Specifies that the Agent should use the zap framework.
# The system must have already been initialized through methods such as "zap.New", "zap.NewProduction", etc.
type: ${SW_AGENT_LOG_TYPE:auto}
tracing:
# Whether to automatically integrate Tracing information into the logs.
enable: ${SW_AGENT_LOG_TRACING_ENABLE:true}
# If tracing information is enabled, the tracing information would be stored in the current Key in each log.
key: ${SW_AGENT_LOG_TRACING_KEY:SW_CTX}
reporter:
# Whether to upload logs to the backend.
enable: ${SW_AGENT_LOG_REPORTER_ENABLE:true}
# The fields name list that needs to added to the label of the log.(multiple split by ",")
label_keys: ${SW_AGENT_LOG_REPORTER_LABEL_KEYS:}
plugin:
# List the names of excluded plugins, multiple plugin names should be splitted by ","
# NOTE: This parameter only takes effect during the compilation phase.
excluded: ${SW_AGENT_PLUGIN_EXCLUDES:}
config:
http:
# Collect the parameters of the HTTP request on the server side
server_collect_parameters: ${SW_AGENT_PLUGIN_CONFIG_HTTP_SERVER_COLLECT_PARAMETERS:true}
mongo:
# Collect the statement of the MongoDB request
collect_statement: ${SW_AGENT_PLUGIN_CONFIG_MONGO_COLLECT_STATEMENT:false}
sql:
# Collect the parameter of the SQL request
collect_parameter: ${SW_AGENT_PLUGIN_CONFIG_SQL_COLLECT_PARAMETER:true}
|
设置参数的方式:
方式一:使用配置文件配置
方式二:设置环境变量
6.构建项目
1
|
sudo go build -toolexec "path/to/skywalking-go-agent -config path/to/config.yaml"
|
7.启动项目,观察控制台是否有对应设置的service

备注:
配置文件中的plugin是用于配置链路追踪所需要用到的插件。插件支持则直接能扫描到对应的依赖。如果插件不支持则需要手动添加span进行分析
手动添加span
参考如下链接
https://skywalking.apache.org/zh/2023-10-18-skywalking-toolkit-trace/
Ref:
https://www.alibabacloud.com/help/zh/arms/tracing-analysis/untitled-document-1690516495864?spm=a2c63.p38356.0.0.59bc6bd1ao2W05#6f7d899246n3f
https://github.com/alibabacloud-observability/skywalking-demo/tree/master/skywalking-go-demo/skywalking-go-agent-demo
https://skywalking.apache.org/zh/2023-06-01-quick-start-with-skywalking-go-agent/
https://juejin.cn/post/7250317954993619000