Big data in automotive development

Networking and distributed data analysis open up new ways of evaluating sensor data from vehicles and promptly deploying changes to a test vehicle.
Today, a single test vehicle delivers up to one terabyte of data per hour. Anyone using several vehicles in parallel therefore has to process data in the multi-digit petabyte range on a daily basis. In the future, networked and autonomous cars will increase the volume of data even further. Car manufacturers must evaluate the output of the measurement and control units, sensors and actuators in test vehicles as quickly as possible, and at big data scale. Speed matters because the analysis results are fed back into the further development of the vehicles. In addition, millions of test kilometers are being driven not only physically but increasingly virtually as well, and data provides an important basis for this too.

Shortening development processes

The automotive industry faces the challenge of consolidating large amounts of data during development and analyzing it as quickly as possible. During a test drive, a logger records data onto solid-state disks, which are read out at the end of the working day and fed into the evaluation software. The analysis results are normally available within a few hours, so that critical faults can be rectified before the next test drive and the next tests can be prepared. This allows development processes to be shortened and costs to be reduced. However, existing data connections are not designed to quickly assemble vast amounts of data from globally distributed test sites. The volume of data pushes classic analysis architectures and transmission techniques beyond their limits.

Code to data

Whether it be 4G mobile radio networks, WiFi, VPN or Ethernet, the typical bandwidths in use today (which for vehicle tests must be available globally) are not sufficient for fast data throughput. Instead, automobile manufacturers have to pre-process data as close as possible to the point of origin and then merge and analyze the results centrally. The ability to analyze large amounts of data close to a test bench or directly in the vehicle is becoming increasingly important. Due to the sheer amount of test data and the increasing need for virtual and physical testing, co-designing workloads and the underlying platform as well as selecting an appropriate topology are now a must.
A significant reduction in the effort required for data analysis can only be achieved through skillful selection and placement of infrastructure and algorithms. There are three complementary approaches to this: data locality (code to data), highly scalable and parallel data processing (parallel code) and coordination of the hardware with the software (codesign). With the code-to-data principle, rather than bringing the data to the algorithms, we send the code, which has a much smaller volume, to the data. Because each location works on its own data, the analysis runs in parallel and central resources remain free for further processing. The analysis code is executed where the data is generated, and only the results of the evaluation are sent to the central analysis via the usual connections. This accelerates the analysis process many times over and reduces costs. Communication is always costly and time-consuming, because the time and bandwidth it requires cannot be reduced at will. Less data transmission means less energy consumption and lower costs, and this is a decisive advantage.
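The code-to-data principle can be illustrated with a minimal sketch. Everything in it is hypothetical: the site names, the sample values, the limit and the helpers `analyze_locally`, `run_at_site` and `merge` are assumptions made for illustration, and a thread pool merely stands in for remote execution at the test sites. The point is only that a small function travels to the data and that nothing but a compact summary travels back.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for raw measurements stored at three test sites.
SITE_DATA = {
    "site_a": [71.0, 73.5, 90.2, 88.1],
    "site_b": [65.4, 66.0, 67.3],
    "site_c": [95.0, 97.5, 99.1, 101.2, 96.3],
}

def analyze_locally(samples, limit=90.0):
    """The code that travels to the data: reduce raw samples to a small summary."""
    return {
        "count": len(samples),
        "total": sum(samples),
        "over_limit": sum(1 for s in samples if s > limit),
    }

def run_at_site(site):
    # Assumption: in a real setup this call would execute on hardware at the
    # site; here it simply reads the local list.
    return analyze_locally(SITE_DATA[site])

def merge(summaries):
    """Central step: combine the small per-site summaries."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    over = sum(s["over_limit"] for s in summaries)
    return {"count": count, "mean": total / count, "over_limit": over}

def run_campaign():
    # The sites work independently, so their analyses can run in parallel.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(run_at_site, SITE_DATA))
    return merge(summaries)

print(run_campaign())
```

Each site returns three numbers regardless of how many samples it holds, so the volume sent over the network no longer grows with the volume of raw data.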

Integrated development and simulation platform

Bridging distances is not the only thing that matters for performing simulations efficiently without duplicating data and orchestration code; it is just as essential to bridge system boundaries. Re-simulation must move seamlessly from numerical simulations to physical simulations on HIL (hardware-in-the-loop) systems and test beds. Comprehensive orchestration and integrated semantic models form the basis for this kind of integrated development and simulation platform. Seamlessly embedding development and test pipelines by optimizing data flows reduces time and costs while achieving the same results.
From a technological point of view, reading data is also a particular challenge. Until now, signal data has been difficult to compress and interpret efficiently, as it cannot be divided into more manageable sections. If it could, computers could evaluate the individual parts in parallel and another computer could then merge the results. With an entire cluster and parallel software, the result would be available after a few seconds.
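The divide, evaluate and merge pattern described here can be sketched for a signal that is already decoded and therefore divisible. The example counts rising edges (0 to 1 transitions) in a flag channel; the function names and the choice of task are assumptions for illustration, not the method of any particular product. It also shows why splitting is not trivial: an edge can straddle a chunk boundary, so the merge step needs the first and last sample of every chunk.

```python
from concurrent.futures import ThreadPoolExecutor

def split(signal, n_chunks):
    """Cut the signal into at most n_chunks contiguous pieces."""
    size = -(-len(signal) // n_chunks)  # ceiling division
    return [signal[i:i + size] for i in range(0, len(signal), size)]

def summarize(chunk):
    """Partial result for one chunk: inner edges plus its boundary samples."""
    edges = sum(1 for a, b in zip(chunk, chunk[1:]) if a == 0 and b == 1)
    return {"first": chunk[0], "last": chunk[-1], "edges": edges}

def merge(parts):
    """Add up inner edges and the edges that fall between two chunks."""
    total = sum(p["edges"] for p in parts)
    for prev, cur in zip(parts, parts[1:]):
        if prev["last"] == 0 and cur["first"] == 1:
            total += 1
    return total

def count_rising_edges(signal, n_chunks=4):
    if not signal:
        return 0
    chunks = split(signal, n_chunks)
    # Threads keep the sketch simple; CPU-bound work at scale would use
    # separate processes or cluster nodes instead.
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(summarize, chunks))
    return merge(parts)

flags = [0, 0, 1, 1, 0, 1]
print(count_rising_edges(flags, n_chunks=3))
```

With three chunks, the first rising edge in `flags` lies exactly between the first and second chunk and is only found because `merge` compares the boundary samples.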
Until now, however, it has not been possible to use such a method in automobile development, since machine signals use variable, situation-dependent coding. Classical decoding methods are pushed to their limits and do not scale to the required extent. T-Systems' software-based Big Data Signal Processing works in parallel; it decodes and normalizes logger and trace files from the vehicles. The signal channels (traces, videos, logs etc.) can be recombined, filtered and recoded, and remain horizontally scalable. A useful side effect is lossless compression for channels with a low rate of change (e.g. flags). The solution ensures rapid and compressed storage and processing of test data, including in the cloud. Big Data Signal Processing can decode, compress, recombine, intersect, filter and apply mathematical operators to data without any loss of information, and it does so simultaneously on all computer cores of a provided cluster. In practice, this makes the process 40 times faster than previous approaches. At the same time, depending on the channels measured, the stored data volume shrinks to as little as ten percent of the original volume.
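Why channels with a low rate of change compress so well without loss can be shown with run-length encoding. The article does not say which compression scheme is used, so this is an assumed, generic technique chosen purely for illustration: a flag that rarely changes collapses into a handful of (value, run length) pairs, and decoding restores every sample exactly.

```python
from itertools import groupby

def rle_encode(samples):
    """Collapse consecutive equal samples into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(samples)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original samples."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical flag channel: 2000 samples, the flag is set for 5 of them.
flags = [0] * 1000 + [1] * 5 + [0] * 995

encoded = rle_encode(flags)
assert rle_decode(encoded) == flags  # lossless round trip
print(len(flags), "samples ->", len(encoded), "runs")
```

A rapidly changing channel would gain little or nothing from this scheme, which is consistent with the statement that the achievable reduction depends on the channels measured.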

Simple data analysis, machine learning or AI

In addition to the problem of speed, analyzing large amounts of data also raises the issue of analysis quality. Today, even simple analysis algorithms are often stamped with the AI label. The label usually refers to machine learning, which is currently the most widely used form of data analysis and which mostly concerns only data correlation. An algorithm recognizes patterns and regularities in the training data. So-called "learning" is based on the calculation of conditional probabilities, so it has nothing to do with intelligence, even if the results obtained are impressive. However, it is interesting to see how many tasks can already be solved based on association.
When it comes to machine intelligence, however, tools should be used that enable causal thinking, i.e. models that make decisions comprehensible. Analysis quality can be improved in three stages. The simplest level invokes purely statistical relationships. A simple example: the fact that a customer buys a black car increases the likelihood that they will also want black leather seats. Conditional probabilities can be calculated by evaluating large amounts of data, and they establish an association between two observations. At the intervention stage, it is a matter not only of identifying the “what” but also of answering the question “why”: Did the customer buy black leather seats because they bought a black car? The top level is the counterfactual level: What would happen if you doubled the price? Such questions cannot be answered from the correlations of sales data alone, as they involve a change in customer behavior in response to the new pricing.
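The first stage, association, amounts to estimating conditional probabilities as relative frequencies. The following sketch uses invented order records and a hypothetical helper `probability`; the numbers carry no empirical meaning.

```python
# Hypothetical order records: (car color, seat color).
ORDERS = [
    ("black", "black"), ("black", "black"), ("black", "black"),
    ("black", "black"), ("black", "beige"),
    ("red", "black"), ("red", "beige"), ("white", "beige"),
    ("white", "grey"), ("blue", "grey"),
]

def probability(records, outcome, given=lambda r: True):
    """Estimate P(outcome | given) as a relative frequency."""
    selected = [r for r in records if given(r)]
    if not selected:
        raise ValueError("no records satisfy the condition")
    return sum(1 for r in selected if outcome(r)) / len(selected)

def black_car(record):
    return record[0] == "black"

def black_seats(record):
    return record[1] == "black"

p_seats = probability(ORDERS, black_seats)
p_seats_given_car = probability(ORDERS, black_seats, given=black_car)
print(f"P(black seats) = {p_seats:.2f}")
print(f"P(black seats | black car) = {p_seats_given_car:.2f}")
```

In this invented data set, black seats are more frequent among buyers of black cars than among all buyers. That is all the calculation shows: it establishes the association but says nothing about why it exists or how customers would react to a price change.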
Through knowledge of the data-generating processes, or through causal models, something like machine intelligence can be constructed and functional objects can be created that trigger comprehensible actions. So-called black box algorithms, which are based purely on correlation, cannot explain their internal decision-making process. Only the use of causal inference with corresponding causal models enables transparency in automated analysis.

Simulation and protection

Analyzing constantly growing amounts of data requires a high degree of automation. Automation in this regard refers to the fact that a system does not require the constant intervention of an operator. Current standards help to save on manual translation stages in order to accelerate the development process: executable model descriptions replace descriptive modeling. Models coded in this way thus serve a dual purpose – as documentation and as a basis for simulations.
In order to enable function-centric development of vehicles and components, a seamless coupling of digital models and physical simulations ("HIL", "SIL", "MIL", "test bench") is required in order to test and simulate new or modified vehicle functions promptly. This combination of digital and physical resources (co-simulation) is made possible via standardized protocols and simulation frameworks, cross-system orchestration and a parallel scalable persistence layer. In order to not delay the development process by data transport, simulations are also executed in a geographically distributed way. At the same time, data centralization takes place asynchronously to enable retrospective simulations to be carried out on the basis of a consolidated database.
Development engineers make a note of prominent sequences so that they can be reused for later re-simulations. This enables new software versions to be sent back to the vehicle more quickly. The same driving sequence must be repeated until the control software on the vehicle functions correctly. To accelerate this process, data analysis is performed on geo-distributed clusters. This enables development teams to view the results of the evaluation in real time, regardless of the data location.
