Event Record | Tooling, Productization, Operations: The Story Behind the Team
The first transcript from Shurenyun's "When Western SRE Meets the Eastern Internet" Meetup is here!
This session's guest speaker is Zhong Hongjun, a senior director at Meituan-Dianping. He gave a detailed account of how Meituan-Dianping has explored the concepts and practice of large-scale operations over the past three years, especially the work done on operations automation and data-driven operations, and the results it produced.
Zhong Hongjun / Senior Director, Meituan-Dianping
He has worked at Baidu, Tencent, PPTV and other Internet companies, and is familiar with systems, networking, operations, security, data, development and other fields.
Today I will share some of the operations work Meituan-Dianping has done over the past few years, along with my own thinking. The whole Meituan-Dianping operations team has more than 100 people, based in Beijing and Shanghai. Meituan and Dianping merged in 2015, so the team is split across the two cities. Alongside the SRE team, the operations center has a database team and an automation development team.
Phase 1: Tooling
I joined Dianping from Baidu in 2013; before that I was at Tencent. My idea at the time was very clear. The operations systems at Tencent and Baidu were relatively mature, with complete sets of operations and automation tools. My biggest regret was that I had not built those tools myself: at Tencent I was just a user, working with them every day without knowing how they were made. So when I arrived I gave myself a mission: bring the company's operations up to the level of Tencent and Baidu, and build with my own hands the process and growth path I had missed. In 2014-2015 the business was growing very fast. The company had tens of thousands of employees and a very large R&D team, while operations was still at a fairly basic stage. Problems came at any hour of the day or night, and the operational pressure was heavy: incidents needed emergency handling, holidays needed all kinds of preparation, and on-call duty was chaotic.
The original idea was simple. First, I wanted operations to be minimal, standardized and consistent, so that an operation could be done the same way for decades no matter who performs it. Installing a machine or deploying an application should go exactly the same way the hundredth or the thousandth time. Second, programs should replace tedious manual work. Third, every operation should be recorded, so that after an incident we are never unable to find out who did what. Fourth, operations work should be pushed forward: I wanted operational tasks to be carried out by R&D rather than by operations, because the demand itself comes from R&D, not from operations, so whoever has the demand should carry it out.
Last year I summed this up in four sentences. They look ordinary, but they are the main thread running through Meituan-Dianping's automation work. The first sentence: we do not recognize any specification that cannot become a tool. Anyone doing operations will think of specifications: release specifications, deployment specifications, naming conventions, standardized machine names, because non-standard operations lead to errors. In my view, a specification that cannot be enforced by a tool is meaningless. Writing a document or sending a request for R&D to read makes no sense as long as it cannot become a tool, because without a tool behind it, perhaps one time in 100 someone will not comply. It would actually be better for people to fail to comply every time than to fail one time in 100: when everyone deviates, the problem is easy to find, but when only one in 100 deviates, it is hard to find. Suppose a service goes down at night across a thousand servers. If only one of them has the problem, it is very hard to track down; if all thousand share the same problem, it is easy.
A specification has no meaning in itself; it only makes sense once it becomes a tool, because the point is consistency. If something goes wrong, we want it to be the same mistake every time, not a different mistake each time. So our team had no how-tos and no documentation; operations as a whole rarely wrote documents. Of course we do some now: with more than 100 people there are specifications that cannot yet be turned into tools, but we still hold to the principle that a specification should find its way into a tool. For example, we have a quiet period: no releases are allowed in the three days before the Spring Festival holiday. This rule was enforced from 2013 onward because the release system had a switch to support it. At a set time the switch turned off, and any release had to go through an approval workflow, which was itself automated. In 2015 the new release system did not support this switch, so we stopped enforcing the rule, because enforcing a specification without tool support makes no sense. Send out a notice telling everyone to keep quiet and first you get scolded, and then, once the cursing is over, everyone does as they like anyway. So we suspended the rule until this year's Spring Festival, when the feature was finally supported again.
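To make the idea of "a specification enforced by a tool" concrete, here is a minimal sketch of a quiet-period gate in a release system. It is an illustration only, not the actual release system described in the talk: the `QuietPeriod` class, the `release_allowed` function and the example dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class QuietPeriod:
    """A window during which releases are blocked unless approved (hypothetical model)."""
    start: datetime
    end: datetime
    reason: str


def release_allowed(now: datetime,
                    quiet_periods: List[QuietPeriod],
                    approved: bool = False) -> Tuple[bool, str]:
    """Return (allowed, message). The rule lives in the tool, not in a document."""
    for period in quiet_periods:
        if period.start <= now < period.end:
            if approved:
                return True, f"approved exception during quiet period: {period.reason}"
            return False, f"blocked by quiet period: {period.reason}"
    return True, "no quiet period in effect"


# Example window; the dates are made up for illustration.
periods = [
    QuietPeriod(datetime(2017, 1, 24), datetime(2017, 1, 27),
                "three days before the Spring Festival holiday"),
]
```

Calling `release_allowed(datetime(2017, 1, 25, 10), periods)` is refused, while the same call with `approved=True` goes through, which mirrors the "switch plus approval workflow" behaviour described above.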
The second sentence: a tool should reduce power, not increase it. While we were building operations automation tools in 2014-2016 there was a lot of controversy inside the team. One important argument was that a considerable number of colleagues believed automation tools should give operations staff more power to do more things. We imagined everything a tool might do, such as restarting every machine with one button, which would in fact be a disaster. My view is that a tool exists to reduce power, not to increase it. Why? If the goal were more power, manual operation is already the most powerful: an ssh window and a root account can do anything. The essence of a tool is to limit, not to enhance; it defines what cannot be done rather than what can be done. Take an automated process system. When evaluating one, look at how many processes it has: the more processes, the worse it is. For operations, I think there should be no more than 10 processes. Common operations number fewer than 10: add a machine, remove a machine, restart a machine, configure a domain name, and so on. Finer-grained tasks, such as configuring a web IP, should not need operations at all, so having more than 10 processes is a sign of a problem.
The third sentence: solving one complex problem must not come at the price of introducing another. Many people in operations, especially after some time in the job, have picked up all sorts of concepts, from the early ITIL to today's SRE, and easily make the mistake of using complex methods to solve complex problems. Whether it is the operations system or operations automation, the question is actually simple. Go back to the beginning: operations exists to protect the stability of production, and that is the only thing. What does operations solve? It minimizes the damage that third-party or human factors do to production stability. The first way to minimize it is to change as little as possible. We put forward this idea during my later period at Tencent: no change is the best change. People used to talk about change management, but change is never managed well; it is best not to change at all. Take capacity expansion: many colleagues say holiday capacity is insufficient and want one-click expansion. As I see it, the goal should be to need no expansion.
If a complex problem is solved in a complex way, or by introducing another complex problem, things only get more complicated, and that is wrong. Monitoring, for example, should be done by subtraction rather than addition, because making it too complicated makes it meaningless. If the system raises more than a thousand alarms a day, that is no different from raising none, because the person on duty will certainly turn off the phone. So subtract, do not introduce complex problems, and find a simple way.
The fourth sentence is similar to the third: keeping tools "minimalist" is a mission. I have seen many operations automation tools, at Tencent, at Baidu and at many other domestic Internet companies. When I was hiring, I interviewed people who built tools at Internet companies, and unfortunately none of them worked out in the end. I found that their tools did not match my thinking. Many people who build automation tools, in order to show the tool's value, make it complicated and make it look gorgeous. That is not my idea. My idea is minimal.
For example, an operations automation tool with only one button would be ideal, but that is not achievable; we are not Steve Jobs. Take capacity expansion. There are many options one could expose: which data center, what type of machine, how much memory, how many CPUs, and so on. Many colleagues think that the more options a tool has, the better and more powerful it is. In my view, the fewer options the better. More options cause trouble later. First, they are error-prone, and a wrong choice leads to arguments between development and operations. Second, they waste time: expanding a machine should take no time at all, but with so many options it takes half a day just to read through them. In terms of presentation, too, the simpler the tool the better. One result we did not expect was that the tools turned out very ugly, so later we hired front-end engineers to make them look better, not to make them more complex. These four sentences summarize what we learned from several years of operations automation.
Meituan-Dianping's automation tools
Let me talk about Meituan-Dianping's automation experience. The system breaks down into four main components: a CMDB in the middle, a process system, a control platform and a monitoring system. Automation comes down to four things:
First, information. All automation is built on accurate and detailed information. In the era of virtualization and cloud computing in particular, which switch a machine sits behind is very important information. During automatic expansion, for example, we certainly do not want two machines of the same application to land on the same switch, so the system must know this. We record information in great detail, such as how many network cards a machine has and whether its connection is two-way or one-way. Information matters because Meituan-Dianping runs at large scale, with a great many machines deployed in different cities, and nobody can look things up by hand each time an operation is needed. Spreading out a deployment is another critical case. When an application is deployed on 100 or 200 virtual machines, those virtual machines must be spread according to certain rules, not concentrated on the same switch, the same physical machine, the same cabinet or the same IDC, so that a single failure takes out only a bounded share of them, such as a quarter, a third or a half. This depends on a complete database; even a small inaccuracy causes problems.
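The spreading rule described above can be sketched as a round-robin over switches. This is a simplified illustration under stated assumptions, not Meituan-Dianping's scheduler: the host records stand in for CMDB data, the function name `spread_across_switches` is hypothetical, and only the switch dimension is considered (a real placement would also look at physical machine, cabinet and IDC).

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Dict, List


def spread_across_switches(hosts: List[Dict[str, str]], count: int) -> List[str]:
    """Pick `count` hosts so that switches are used as evenly as possible.

    `hosts` stands in for CMDB records: each entry has a 'name' and the
    'switch' it is connected to.
    """
    by_switch = defaultdict(list)
    for host in hosts:
        by_switch[host["switch"]].append(host["name"])

    # Round-robin: take one host from every switch before taking a second from any.
    ordered = [
        name
        for layer in zip_longest(*by_switch.values())
        for name in layer
        if name is not None
    ]
    if count > len(ordered):
        raise ValueError("not enough hosts in the CMDB to satisfy the request")
    return ordered[:count]


# Made-up inventory for illustration.
inventory = [
    {"name": "h1", "switch": "sw-a"},
    {"name": "h2", "switch": "sw-a"},
    {"name": "h3", "switch": "sw-b"},
    {"name": "h4", "switch": "sw-c"},
]
```

With this inventory, asking for three hosts yields one host per switch; a second host on `sw-a` is used only when a fourth is requested. The sketch works only if the switch field in the inventory is accurate, which is the point the paragraph makes about information quality.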
Second, standard operations. I said earlier that there should be no more than 10 processes; likewise, the standard operations in operations work number no more than a dozen or so. We distill them into standard operations, which we call atomic operations. Think about doing an expansion or a launch yourself: apply for a machine, initialize its environment, add it to monitoring, apply a configuration. These steps are relatively fixed, so they can be broken out as atomic operations, each a standardized action. Each standardized action goes into an operations library, and someone is responsible for keeping it correct. Restarting a machine, for example, must be made especially robust and must always be available, and whenever a restart is needed, the same standard operation must be used.
Third, scenarios. A scenario is a complex action, which we call a process. Deploying 300 machines for a business today, or launching a new business today, is a scenario. A process completes a scenario by breaking it down into many standardized operations, which is why we developed the process system.
Independent of these three is monitoring. Monitoring is multi-level: it covers the operating system and the applications, and it also monitors releases and changes, because I want to know how many changes and how many releases there have been. Overall, Meituan-Dianping's automation system is built on this framework of four parts, and each part contains many products. As long as the framework is right, the individual products do not matter much. Building two process systems is fine, for example, as long as they fit the same framework.
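The relationship between atomic operations, processes and the requirement that every operation be recorded can be sketched as follows. This is a toy model under stated assumptions: the registry, the operation names, the `scale_out` process and the audit log format are all hypothetical, and the atomic operations only return strings instead of touching real machines.

```python
from typing import Callable, Dict, List

# Registry of atomic operations: the only actions a process may use.
ATOMIC_OPS: Dict[str, Callable[[str], str]] = {}
# Every executed step is recorded with the operator who triggered it.
AUDIT_LOG: List[str] = []


def atomic(name: str):
    """Register a function as a named atomic operation."""
    def register(func):
        ATOMIC_OPS[name] = func
        return func
    return register


@atomic("apply_machine")
def apply_machine(app: str) -> str:
    return f"machine allocated for {app}"


@atomic("init_environment")
def init_environment(app: str) -> str:
    return f"environment initialized for {app}"


@atomic("add_monitoring")
def add_monitoring(app: str) -> str:
    return f"monitoring added for {app}"


# A process (scenario) is just an ordered list of atomic operation names.
PROCESSES = {"scale_out": ["apply_machine", "init_environment", "add_monitoring"]}


def run_process(process: str, app: str, operator: str) -> List[str]:
    """Run each atomic step in order and record who ran what."""
    results = []
    for step in PROCESSES[process]:
        result = ATOMIC_OPS[step](app)  # KeyError if the step is not registered
        AUDIT_LOG.append(f"{operator} ran {step} on {app}")
        results.append(result)
    return results
```

A developer could then trigger `run_process("scale_out", "order-service", "dev-alice")` without operations involvement, and the audit log would show three entries attributed to `dev-alice`. Keeping `PROCESSES` small reflects the argument that a process system with too many processes is a warning sign.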
With the automation tools covered, let me talk about what happened next. Once many of our ideas had been built into automation tools, we soon fell into confusion: is this all there is to operations? Life felt rather bleak, and the tools also brought side effects. At first we were very happy. With the tools, efficiency rose quickly, fewer things needed emergency handling in the middle of the night, and some things really could be handled by R&D. Once, just after we had set out for the operations team's annual meeting, I suddenly got a phone call: a developer on a business team needed an operation done on production. I told him there was a process for it, that he could apply through the process system and it would run automatically, and he got it done himself.
Compare that with the past. One year at Tencent, most of the team went on a team-building trip to Vietnam, but who would handle failures? So everyone else went and I alone stayed home, guarding the computer and waiting for failures. Later at Meituan-Dianping, R&D could get these things done themselves through the processes, which showed that the automation tools really worked. At the end of 2014 this work won a company quarterly award, and this year our operations team won the Meituan-Dianping annual award quite easily. When the automation work was done I felt very happy, but after a few happy days we were confused again. We had built too many tools, and things soon got out of control. Solving problems started creating problems, and many of them: development became chaotic, information began to be inconsistent, and the tools grew more and more dangerous. So we began to think.
Phase 2: Productization
The result of that thinking is what we call productization. In the beginning we saw a tool as just a means of achieving automation and did not understand it as a product. Then our thinking changed: a tool is a product, just like the Meituan app. The CMDB or the process system is positioned as a product rather than a tool. Once you think of it this way, things suddenly become clear. A product is different: building a product starts with a product manager, you can hire PMs for it, and product operations work gets done as well.
Phase 3: Operations
Productization did solve the problem of tools running out of control, but we fell into a new confusion, which put simply was about the relationship between operations and the business. Operations sits at the end of the technology chain, at the bottom of the food chain, so how does it show its value? We worked out a new set of ideas, called quality operations, some of which resemble SRE. The idea is simple: keep refining data from the monitoring system, turn monitoring data into quality indicators, and use those indicators to drive the whole R&D organization. Many problems are related to development. If R&D writes poor SQL, for example, there are more slow queries and failures become more likely; once production load rises, the database cannot hold up. In the past, when a production incident was traced to a slow SQL statement, the two sides would start arguing. R&D would say "I did not change anything, this SQL has always been like this", and operations would say "it was your SQL that caused it". That was the usual routine. Now it is the other way around. Operations routinely monitors the number of slow SQL statements for each application. If there are too many, we consider the application to be in a sub-healthy state, and even if nothing has gone wrong yet, the number should be brought down.
At Meituan-Dianping this goes well beyond slow SQL. We operate on many kinds of quality data: we use the data to push R&D to improve, and operations tracks the data until it really does go down. DOM is Meituan-Dianping's quality platform, similar to a reporting platform, on which we keep adding quality data and use it to push R&D. Basically everything we can think of, including jump hosts, quality operations, message queues, CAT, the cloud platform and Nginx, has been defined in the plan with its own set of quality indicators. Quality indicators can be traced back and drilled into. Traceable means you can see the whole history; drill-down means you can keep clicking for more detail. Say a department's DB score is 75 points. Click to see why it is 75: there may be 5,000 slow SQL statements. Click again to see exactly which 5,000 they are and who owns them. Because the CMDB records all application information and the R&D staff responsible for each application, this is very efficient.
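The drill-down path described above (department, then application, then owner) can be sketched with a small join between slow-query records and CMDB ownership data. This is a hypothetical illustration, not the DOM platform's implementation: the `CMDB` mapping, the record format and the function names are assumptions, and the talk does not say how the score itself is computed, so only the counts are shown.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical CMDB excerpt: application -> (department, owner).
CMDB: Dict[str, Tuple[str, str]] = {
    "order-service": ("trade", "alice"),
    "pay-service": ("trade", "bob"),
    "search-api": ("search", "carol"),
}

# Hypothetical slow-query records from monitoring: (application, SQL fingerprint).
SLOW_QUERIES: List[Tuple[str, str]] = [
    ("order-service", "select_orders_by_user"),
    ("order-service", "select_orders_by_date"),
    ("order-service", "update_order_status"),
    ("pay-service", "select_payments_by_order"),
    ("search-api", "select_shops_by_keyword"),
]


def slow_sql_by_department() -> Counter:
    """Top-level view: number of slow queries per department."""
    counts = Counter()
    for app, _sql in SLOW_QUERIES:
        department, _owner = CMDB[app]
        counts[department] += 1
    return counts


def drill_down(department: str) -> List[Tuple[str, str, int]]:
    """Second-level view: (application, owner, slow query count), worst first."""
    per_app = Counter(
        app for app, _sql in SLOW_QUERIES if CMDB[app][0] == department
    )
    return [(app, CMDB[app][1], n) for app, n in per_app.most_common()]
```

Here `slow_sql_by_department()` gives the department-level picture and `drill_down("trade")` lists the applications and owners behind that number. The ownership lookup is what makes the report actionable, which is why the paragraph stresses the CMDB.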
There is also a DB health table, with scores for slow queries, disk usage, locking, delay consistency, green hat library, large tables, capacity coefficient and so on, and this data keeps being iterated. Because the company has so many people, Meituan-Dianping generally uses horizontal comparison. For any given thing there is always one team doing better and one doing worse, so a horizontal comparison shows which is which. This motivates everyone to improve, because no team wants to do worse than the others; that is a matter of professional pride for a technical team.
Quality operations, in one sentence, means extracting indicators ahead of time. We do not wait until something has happened, and we do not merely respond to R&D requests. Operations takes the initiative to extract an indicator and pushes R&D to bring down whatever affects it. We did more of this in 2016. One indicator is the average response time of applications. For a Java application, if the average response time of its calls is long, it is prone to fail under high load, with timeouts and all kinds of other failures. A response time of 100 milliseconds looks fine in normal conditions but becomes a problem once load rises, so we require that it not exceed 50 milliseconds. Once this requirement is set, the quality report shows, for every application in the company, the current average, the highest and the lowest, broken down by team and department, and immediately produces TOP10 and TOP20 lists that push the worse performers to improve. App response time is handled in a similar way. For slow SQL, this kind of pressure proved quite effective: after we started doing this, failures caused by slow SQL dropped considerably.
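The response-time report described above can be sketched as a threshold check plus a ranking. The 50 ms limit comes from the talk; everything else is an assumption for illustration, including the function name `response_time_report`, the input format (raw samples per application) and the sample values.

```python
from statistics import mean
from typing import Dict, List, Tuple

THRESHOLD_MS = 50.0  # the limit quoted in the talk


def response_time_report(samples: Dict[str, List[float]],
                         top_n: int = 10) -> List[Tuple[str, float]]:
    """Return the slowest applications whose average exceeds the threshold.

    `samples` maps an application name to its observed response times in ms.
    """
    averages = {app: mean(values) for app, values in samples.items() if values}
    offenders = [(app, avg) for app, avg in averages.items() if avg > THRESHOLD_MS]
    # Worst first, so the head of the list is the "TOP N" to follow up on.
    return sorted(offenders, key=lambda item: item[1], reverse=True)[:top_n]


# Made-up samples for illustration.
observed = {
    "order-service": [40.0, 60.0, 80.0],
    "pay-service": [20.0, 30.0],
    "search-api": [100.0, 120.0],
}
```

With these samples, `response_time_report(observed)` lists `search-api` first and `order-service` second, while `pay-service` stays off the list because its average is under the limit. Grouping the same averages by team or department, as the talk describes, would only need an extra ownership lookup.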
Since then, the operations team has changed a great deal, moving from a purely auxiliary, passive state into what you might call a leading one. In the past, R&D decided what operations needed to do, and operations did it. Now operations decides what R&D needs to do, and R&D does it. The team's duties are very different from before and now fall into roughly three parts: the first is quality operations, the second is automation development, and the third is the O in DO separation. Over the past three years Meituan-Dianping has basically been working on these three parts. D is development and O is operations; under DO separation, the O part is gradually shrinking.
Summary and reflections
To summarize briefly, Meituan-Dianping's exploration of operations went from building tools, to building products, to running operations. The main effort in 2016 went into operations, and the team has become four parts. In the past, automation work focused on features, and team performance was judged by which features were built that year. In the last two years we no longer look at features; we look at how the tools are promoted and how they are operated. We are now data-driven. Early on we were failure-driven: problems drove everyone to make improvements, build aids and run case studies. Then we were process-driven: operations designed all kinds of rules, which turned out to be useless, since none of them had any effect. Now we are data-driven: we look at data reports and iterate continuously.
Finally, I will leave you with two questions. First, in the cloud era, as everyone moves farther and farther from the infrastructure, how does operations show its value? Second, should we go up or go down? Going up means moving toward the business; going down is the more traditional route, such as studying the OS more deeply. Which way should we go? This is a topic we are still thinking about. Thank you all.