Quality Management

The night shift.

To ensure end-to-end quality, T-Systems has top decision makers on stand-by – ready to step in at a moment’s notice in the event of a critical IT incident or complex changes. This entails night and weekend shifts for SVPs and other senior executives.
Author: Thomas van Zütphen
Photos: Norbert Ittermann, Lodewijk Duijvesteijn/Gallery Stock, PR
Weekend work is on the docket – but Luz Mauch does not blink an eye when he sees his schedule. Manager on Duty (MoD) shifts are divvied up among T-Systems executives months in advance. “That way, we know we are able to make decisions, even in potentially complex situations,” explains Mauch, SVP of Automotive & Manufacturing at T-Systems. All SVPs and other senior executives work six or seven of these MoD shifts a year – two or three on Saturdays and Sundays. The schedule also includes a deputy MoD, should the primary MoD be ill or otherwise unable to work – in fact, the back-up is ready to spring into action in a matter of minutes.
“No matter how high an incident is escalated within the customer organization, we are able to make effective decisions and provide a contact at the appropriate management level.”
Luz G. Mauch, SVP Automotive & Manufacturing
Friday, 4:30 PM
Certain matters demand the attention of upper management. So a senior executive must be available as MoD at all times. Mauch receives the baton from co-worker Stephan Kasulke, and begins his weekend shift with a briefing call. Kasulke, the SVP in charge of Quality at T-Systems, provides the key facts and figures. But he is not the only decision maker participating in the conference call: the core team comprises three other MoDs from T-Systems’ Digital Division, TC Division and IT Division, representatives from the quality organization, and a chairperson, Doris Reitter, who is responsible for the results of the meeting, including details of who has MoD duties, and when. Kasulke confirms that there are no serious unresolved issues, and half an hour later, Mauch is heading home. Weekend plans for his daughter include a tennis tournament; for his son, a drama performance; and for a neighbor, a birthday party. But Mauch himself will have to stay within his own four walls until Monday – on stand-by.
Friday, 9:30 PM
A few hours into his shift, Mauch’s cell phone buzzes briefly – indicating the arrival of the report emailed every four hours from the Global MoD Service team to all senior executives at T-Systems. The report describes all workstreams for ongoing changes of the kind typical for a global corporation, including their current status. There is no suggestion of a major incident – the sort that would require an SVP to take the reins. So far, Luz Mauch is having a quiet evening.
Saturday, 1:30 AM
Mauch’s alarm clock goes off – the first time during this shift. He knows the drill: it is time to check in again. “We need to be updated on the current situation, and notified of any high priority and critical changes,” he explains. “Because we have a complete picture, we can make decisions immediately if a change snags – and if that snag becomes a full-blown incident.” But for the moment, the Global MoD Service’s latest email states that all is calm: there has been no glitch that could not be quickly resolved.
Saturday, 2:30 AM
Mauch has just drifted off to sleep when his phone rings and wakes him. The Global Lead Incident Manager is calling, and that can only mean one thing: somewhere in the world, a T-Systems customer has encountered a serious problem. At the same time, the Global MoD Service has sent out a message that, at 2:09 AM, a major carmaker reported the failure of its EDI system. As the central platform for standardized, electronic processes, it is key to the carmaker’s just-in-time manufacturing. Mauch can tell in an instant that the situation has critical CBI (customer business impact) status. The EDI system supports the flow of data for all of the enterprise’s business lines, user departments, and suppliers, and for the central ERP system. This is clearly a major incident, impacting production lines in four European countries and the USA. At this point, Mauch and the Global Lead Incident Manager trigger the largely automatic incident management process; this comprises a clearly defined and well-rehearsed chain of activities. From now on, Mauch will receive status updates every ten minutes. And instead of taking the usual power-nap during the four-hour interval between reports, he will stay awake and alert the rest of the night.
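How such a largely automated process might look at its most basic can be sketched in a few lines of Python: classify the customer business impact, widen the circle of recipients when the impact is critical, and push a status update at a fixed interval. The CBI levels, recipient names and the notify helper below are illustrative assumptions, not T-Systems’ actual tooling.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical CBI (customer business impact) levels, highest first.
CBI_LEVELS = ("critical", "major", "minor")

@dataclass
class Incident:
    ticket_id: str
    description: str
    cbi: str             # one of CBI_LEVELS
    affected_sites: int

def notify(recipients: list[str], message: str) -> None:
    """Stand-in for the real mail/SMS/paging integration."""
    stamp = datetime.now(timezone.utc).strftime("%H:%M:%S")
    print(f"[{stamp}] -> {', '.join(recipients)}: {message}")

def run_status_updates(incident: Incident, interval_s: int = 600, rounds: int = 3) -> None:
    """Push status updates at a fixed interval (ten minutes in the article),
    escalating critical incidents to a wider circle of recipients."""
    recipients = ["mod_on_duty", "global_lead_incident_manager"]
    if incident.cbi == "critical":
        recipients.append("all_senior_executives")  # SVP-level escalation
    for round_no in range(1, rounds + 1):
        notify(recipients,
               f"{incident.ticket_id} update {round_no}: {incident.description} "
               f"(CBI={incident.cbi}, {incident.affected_sites} sites affected)")
        time.sleep(interval_s)

if __name__ == "__main__":
    edi_outage = Incident("INC-4711", "EDI platform unavailable", "critical", 5)
    run_status_updates(edi_outage, interval_s=1)  # short interval for the demo
```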
Electronic Data Interchange – for just-in-time manufacturing chains
Electronic Data Interchange: as the central platform for standardized, electronic processes, EDI is key to just-in-time manufacturing, e.g. in the automotive industry.
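Purely as an illustration (the article does not say which EDI standard the carmaker uses), an EDIFACT-style EDI message is plain text built from segments, and a few lines of Python suffice to split such a message into its segments and data elements. The sample message below is invented.

```python
# Minimal, illustrative EDIFACT-style segment splitter. The sample message and
# its contents are invented for demonstration; real EDI traffic in automotive
# supply chains is far richer and is validated against the relevant standard.
SAMPLE = (
    "UNB+UNOC:3+SUPPLIER+CARMAKER+240101:0209+REF001'"
    "UNH+1+DELFOR:D:96A:UN'"
    "BGM+241+DOC123'"
    "UNT+3+1'"
    "UNZ+1+REF001'"
)

def parse_segments(message: str) -> list[list[str]]:
    """Split a message into segments (terminated by an apostrophe) and each
    segment into its '+'-separated data elements."""
    segments = [s for s in message.split("'") if s]
    return [segment.split("+") for segment in segments]

for tag, *data in parse_segments(SAMPLE):
    print(tag, data)
```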
Saturday, 3:30 AM
In the ensuing management conference call, Mauch learns that no root cause has been identified, despite the latest round of automatic checks. In the meantime, the incident has been escalated to the highest level at the carmaker. Mauch calls his counterpart SVP at the customer’s organization, and explains their next course of action. They will reboot the database and applications, and involve their hardware/software vendor in ongoing fault analysis. Via a WebEx conference, Mauch brings the vendor’s database specialist up to speed on the steps already taken. Fittingly, the conference call ends just as the mail slot on Mauch’s front door claps shut – announcing the early morning arrival of his local newspaper. As far as the incident is concerned, Mauch can do little else but wait. He skims through his paper, though his mind is elsewhere. “Your attention is divided. In the back of your mind, you are thinking of the several dozen IT engineers on the ground, working under huge pressure.”
Saturday, 4:17 AM
The second call with the customer yields no good news. The reboots did not remedy the problem. Mauch decides that all senior executives at T-Systems need to be made aware of a Critical Delivery Situation (CriDS). “Beyond a certain, critical level, all customer-facing units must be informed; news of the major incident goes all the way to the top, to T-Systems CEO Reinhard Clemens,” notes Mauch. He and his contact at the customer organization go through each action taken by the Incident Management Team step by step. “Any activity that can rule out a possible root cause is a small victory.” The participants in the conference call decide to reboot the application server again. In addition, Mauch has received two text messages in quick succession – one from Reinhard Clemens, the other from Dr. Ferri Abolhassan, Director of the IT Division at T-Systems. Both want him to call back. He has 45 minutes.
Saturday, 5:05 AM
The vendor in the USA has both good and bad news in the next management call. It is not a hardware or operating system error – but the frantic search for the underlying problem continues. During the call, however, an MoD in Slovakia reports a suspicious SQL statement. The MoD wonders if a script meant to provide application software with data from the database system contains faulty instructions for data transfer. Mauch opens up logs of previous incidents on his tablet. He spends the next hour combing through these records, in hopes of finding similar patterns – “a correlation that could provide an indication of a possible course of action.” But he has no luck.
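In its simplest form, that kind of search can be pictured as a script that scans archived incident logs for recurring error signatures, so that matching patterns stand out. The log directory, file layout and signature pattern in the sketch below are hypothetical, not a description of T-Systems’ real tooling.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical signature pattern for suspicious database/data-transfer errors.
SIGNATURE = re.compile(r"(ORA-\d{5}|deadlock|SQLSTATE \S+|transfer failed)", re.IGNORECASE)

def correlate(log_dir: str) -> Counter:
    """Count how often each error signature appears across archived
    incident logs, so recurring patterns become visible."""
    counts: Counter = Counter()
    for log_file in Path(log_dir).glob("*.log"):
        for line in log_file.read_text(errors="ignore").splitlines():
            for match in SIGNATURE.findall(line):
                counts[match.upper()] += 1
    return counts

if __name__ == "__main__":
    for signature, hits in correlate("incident_archive").most_common(10):
        print(f"{hits:5d}  {signature}")
```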
"The cloud is the basis for digitized business models and processes that will shape the future."
Dr. Ferri Abolhassan,
Director of the IT Division at T-Systems and responsible for Telekom Security
Saturday, 6:11 AM
After yet another reboot, the customer reports that they are still working in emergency mode: they are processing bills of materials and other vital documentation manually at three sites. As later noted in the incident log, there was “no production outage, but huge additional manual effort on the part of the customer to keep production running.” Meanwhile, a review of the most recent changes has been completed – but it simply adds to the growing list of ruled-out sources. A check of all layers relevant to the application is made – infrastructure, database, middleware and applications are given a clean bill of health. The next status update gives the Incident Management Team little to work with. Against this background, Mauch decides to escalate the incident to top management level at the hardware/software vendor. He makes a call to Armonk, New York, on the US east coast. It is just past midnight at the other end of the line, but Mauch’s contact appreciates the gravity of the situation. He arranges an immediate second check of all hardware components and operating systems. Mauch knows that the search for the root cause in the customer’s IT infrastructure is now going on around the world. “We are closing in on the error; our people are getting nearer and nearer to finding the underlying problem. That’s for certain.” It is just a question of time. He continues: “Everybody involved in the search is mindful of mean time to repair. It puts them under pressure. But nobody is frantic or jittery – they aren’t rushing. They know that in this situation, staying focused and paying attention to detail are what count.”
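That layer-by-layer verification lends itself to a simple checklist structure. The sketch below, with stand-in probe functions, shows the general shape: each layer gets its own check, and the stack only receives a clean bill of health if every probe passes.

```python
from typing import Callable

# Stand-in probes only; in practice each would call monitoring APIs,
# run health queries against the database, or ping service endpoints.
def check_infrastructure() -> bool: return True
def check_database() -> bool: return True
def check_middleware() -> bool: return True
def check_application() -> bool: return True

LAYERS: list[tuple[str, Callable[[], bool]]] = [
    ("infrastructure", check_infrastructure),
    ("database", check_database),
    ("middleware", check_middleware),
    ("application", check_application),
]

def health_check() -> bool:
    """Run the probes bottom-up and report per layer; a single failure
    means the stack as a whole is not healthy."""
    all_ok = True
    for name, probe in LAYERS:
        ok = probe()
        print(f"{name:15s} {'OK' if ok else 'FAIL'}")
        all_ok = all_ok and ok
    return all_ok

if __name__ == "__main__":
    print("clean bill of health" if health_check() else "root cause still open")
```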
Saturday, 8:22 AM
After further developments, there is hope of a breakthrough. An automatic database cleanup has disconnected the database from the applications. And sure enough, after a final reboot, the entire system is stable – as long as the SQL script remains isolated.
Saturday, 9:30 AM
Following three more management, customer and IT engineer conference calls, the report finally reads: end of incident. But what exactly was the cause of all the hassle? “A nasty little bug,” Mauch relays to his family, waiting for him around the breakfast table. The program error had been introduced by a software upgrade seven weeks earlier.
Sunday, 10:00 AM
A mandatory safeguarding phase ensues for the 24 hours or so after any incident’s resolution. The latest report confirms that the system is still stable. A major incident review call follows, in which all participants analyze the critical disruption and pass on their conclusions to the Global Problem Management team. As Mauch reflects: “Analyzing the root cause is important. Identifying the source is central to continuous improvement. That, and finding a solution as soon as possible, is our main goal.”
Monday, 6:30 AM
Except for the routine four-hourly reports, the second half of the weekend remains quiet. On Monday, Mauch sums up the events of the last 60 hours for his co-worker, Dr. Stefan Bucher, SVP of the Solutions & Projects Delivery Unit. Bucher will assume MoD responsibilities from Mauch until the next update call at 6:30 PM. Mauch has just a few more chores to complete. Detailed reports must be provided to everyone involved in the incident’s resolution. He also uses the intranet to thank everyone who was on duty over the weekend. And he contacts Stephan Kasulke, SVP of Quality, and confirms: “We have a fantastic team out there.”
