Using Prometheus to monitor Cloudflare's global network

Matt Bostock explained how Cloudflare uses Prometheus to monitor its global architecture and network in his talk at the SREcon 2017 Europe conference. Prometheus is a metrics-based monitoring tool. Cloudflare is a CDN, DNS and DDoS mitigation service provider.

Prometheus, an open source metrics-based monitoring project, was first launched in 2012. It is a member project of the CNCF (Cloud Native Computing Foundation). Prometheus supports dynamic configuration, and its query language, PromQL, lets users write complex queries for alerting. Cloudflare provides CDN (content delivery network), distributed DNS and DDoS mitigation services, which means its architecture spans the globe. Monitoring such an architecture and network is a complex task. In his talk, Bostock described the role Prometheus plays: at Cloudflare, Prometheus has replaced 87% of the functionality of the earlier Nagios deployment.

Cloudflare delivers services such as its CDN over anycast. Anycast DNS lets DNS queries be handled by the server closest to the user, and anycast HTTP lets content be served from the server closest to the user. As an intermediary between the origin website and its users, Cloudflare also inspects visitor traffic for threatening patterns. It operates 116 data centers across 150 countries, handling 5 million HTTP requests and 1.2 million DNS requests per second, which accounts for 10% of global internet requests. Each point of presence (PoP) provides HTTP, DNS, DDoS mitigation and key-value storage services. At the time of the talk, 188 Prometheus servers were running in production.

Prometheus is metrics-based: it collects time-series metrics and builds the rest of its features on top of them. It works in pull mode: each monitored server runs a process called an exporter, which publishes the collected metrics over HTTP. Cloudflare deploys an exporter for each service domain and uses them to collect metrics for systems (CPU, memory, TCP, disk and so on), networks (HTTP, ping and so on), log matching (error messages), and containers/namespaces. Container and namespace metrics are collected with cAdvisor, an open source project from Google. Prometheus does not keep all data permanently, because it focuses on monitoring the here and now. Data is not downsampled, and in Cloudflare's configuration it is retained for 15 days.
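
To make the pull model concrete, here is a minimal sketch of an exporter-style process written with only the Python standard library. It is an illustration under stated assumptions, not Cloudflare's exporter or an official client: the names render_metrics, collect_samples, MetricsHandler and serve, the demo_* metric names, and port 9999 are all hypothetical. The only parts taken from Prometheus itself are the plain-text line format and the /metrics path convention.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text-format lines."""
    lines = []
    for name, labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect_samples():
    """Hypothetical collector: a real exporter would read system or service state."""
    return [
        ("demo_up", {}, 1),
        ("demo_process_cpu_seconds_total", {"service": "example"},
         round(time.process_time(), 3)),
    ]


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus pulls: it sends an HTTP GET and the exporter just answers.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_samples()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and fetching http://localhost:9999/metrics would return the current samples. The exporter keeps no history of its own; storing the time series is left to the Prometheus server that scrapes it.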

In Cloudflare's core data centers, services include log access, business analytics, and APIs, built on a technology stack that includes Marathon, Mesos, Chronos, Docker, Sentry, Ceph (for storage), Kafka, Spark, Elasticsearch and Kibana. In each PoP, Prometheus queries servers and services through exporters to obtain metrics. High availability in each PoP is provided by running multiple Prometheus servers.

Prometheus' alerting component is called Alertmanager. Cloudflare's deployment includes a single Alertmanager, to which every Prometheus server pushes events, and which is configured with high availability in mind. Alerts are tested against historical data to verify that they work correctly. Newer monitoring tools such as Bosun include similar features. Other qualities that make alerts more useful include descriptive names, ease of use, and information that lets the recipient act immediately.
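
The idea of testing an alert against historical data can be sketched in a few lines of Python. This is a simplified illustration, not the tooling described in the talk: the function name backtest_alert and its parameters are hypothetical, and the rule shape (a threshold that must be exceeded continuously for a minimum duration) is an assumption chosen because it is a common form of alert condition.

```python
def backtest_alert(samples, threshold, for_seconds):
    """Replay (timestamp, value) samples and return when the alert would fire.

    The alert fires once the value has stayed above `threshold` for at
    least `for_seconds`, and resets as soon as the value drops back.
    """
    fired_at = []
    pending_since = None  # start of the current run of breaching samples
    firing = False
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if not firing and ts - pending_since >= for_seconds:
                firing = True
                fired_at.append(ts)
        else:
            pending_since = None
            firing = False
    return fired_at


# Hypothetical history: one short spike, then a sustained breach.
history = [(0, 0.1), (60, 0.9), (120, 0.95), (180, 0.2),
           (240, 0.9), (300, 0.9), (360, 0.9), (420, 0.9)]
print(backtest_alert(history, threshold=0.8, for_seconds=120))  # [360]
```

Replaying the history shows that the short spike between 60 and 120 seconds would not have paged anyone, while the sustained breach starting at 240 seconds would have fired once, at 360 seconds. Running a candidate rule over past data in this way gives a rough sense of how noisy it would have been before it is enabled.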

The Cloudflare team uses jiralerts to integrate the JIRA ticketing system with Alertmanager. JIRA workflows can be customized, so alert handling can include user-defined states specific to the monitoring workflow. Another tool, alertmanager2es, receives alerts and writes them to an Elasticsearch index for later search and analysis. Cloudflare has also built its own dashboard for Alertmanager, called unsee.
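
As a rough illustration of the receive-and-index step, the sketch below converts an Alertmanager webhook payload into the newline-delimited JSON that Elasticsearch's bulk API accepts. It does not reproduce how alertmanager2es is implemented: the function name alerts_to_bulk, the index name "alerts", the choice of document fields, and the sample payload are all assumptions made for the example.

```python
import json


def alerts_to_bulk(payload, index_name="alerts"):
    """Turn a webhook payload into bulk-API lines: one action line per document."""
    lines = []
    for alert in payload.get("alerts", []):
        doc = {
            "status": alert.get("status"),
            "labels": alert.get("labels", {}),
            "annotations": alert.get("annotations", {}),
            "startsAt": alert.get("startsAt"),
            "endsAt": alert.get("endsAt"),
            "receiver": payload.get("receiver"),
        }
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc, sort_keys=True))
    # The bulk body must end with a newline; an empty payload yields nothing.
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical payload with a single firing alert.
example_payload = {
    "receiver": "elasticsearch",
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "HighErrorRate", "cluster": "example"},
        "annotations": {"summary": "Error rate above threshold"},
        "startsAt": "2017-08-30T12:00:00Z",
        "endsAt": "0001-01-01T00:00:00Z",
    }],
}
bulk_body = alerts_to_bulk(example_payload)
```

The resulting bulk_body could then be sent to an Elasticsearch _bulk endpoint by whatever HTTP client is at hand. Storing one document per alert keeps labels and annotations searchable individually.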

How is Prometheus itself monitored? There are two approaches. One is the mesh approach, in which one Prometheus monitors another Prometheus in the same data center. The other is the top-down approach, in which top-level Prometheus servers monitor the Prometheus servers at the data center level.
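
The two arrangements can be compared by working out who scrapes whom. The sketch below is only a model of the topologies under assumed names: the functions mesh_targets and top_down_targets and the server names are hypothetical, and nothing here reflects Cloudflare's actual configuration.

```python
def mesh_targets(servers_by_dc):
    """Each Prometheus scrapes every other Prometheus in its own data center."""
    return {
        server: [peer for peer in servers if peer != server]
        for servers in servers_by_dc.values()
        for server in servers
    }


def top_down_targets(top_level, servers_by_dc):
    """Each top-level Prometheus scrapes all data-center-level servers."""
    all_dc_servers = [s for servers in servers_by_dc.values() for s in servers]
    return {top: list(all_dc_servers) for top in top_level}


# Hypothetical inventory: two servers in one data center, one in another.
inventory = {
    "dc-a": ["prom-a-1", "prom-a-2"],
    "dc-b": ["prom-b-1"],
}
mesh = mesh_targets(inventory)
top_down = top_down_targets(["prom-core-1"], inventory)
```

In this model, a data center with a single Prometheus has no peer to watch it in the mesh arrangement (mesh["prom-b-1"] is an empty list), whereas the top-down arrangement still covers it. That suggests one reason the two approaches might be used together.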

One lesson from Cloudflare's SRE team is to standardize labels and group identifiers for environments and clusters as early as possible. Other lessons concern how to create visualizations and win buy-in from peers and stakeholders. Experience showed that involving teams early helps services integrate with the monitoring system faster. Alerts themselves need many iterations of tuning and improvement, which is an ongoing process.
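
One way to enforce a label standard early is a small check that runs before a new target is accepted. The sketch below is an assumption-laden example rather than anything described in the talk: the required label names, the allowed environment values, and the function name check_labels are all hypothetical. The name pattern is the classic Prometheus label-name rule (letters, digits and underscores, not starting with a digit).

```python
import re

# Hypothetical house rules; a real team would define its own.
REQUIRED_LABELS = ("environment", "cluster")
ALLOWED_ENVIRONMENTS = {"production", "staging"}
LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def check_labels(labels):
    """Return a list of problems found in a target's label set."""
    problems = []
    for name in labels:
        if not LABEL_NAME_RE.match(name):
            problems.append(f"invalid label name: {name}")
    for required in REQUIRED_LABELS:
        if required not in labels:
            problems.append(f"missing label: {required}")
    env = labels.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems


print(check_labels({"environment": "production", "cluster": "example-01"}))  # []
```

A label set that returns an empty list conforms to the convention. Anything else can be rejected with a message that tells the service owner exactly what to fix.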
