Recently, Ashok Elluswamy, director of Autopilot software at Tesla, gave a talk at the CVPR 2022 conference introducing the many advances made by the Tesla Autopilot team over the past year, in particular a neural network model called Occupancy Networks (hereinafter, the occupancy network).
He said that the semantic segmentation and depth estimation methods traditionally used in autonomous driving systems have many problems, such as the difficulty of lifting 2D predictions into 3D and inaccurate depth estimates.
With the occupancy network, the model can predict the space occupied by objects around the vehicle, including the space that moving objects may occupy with their next action.
Based on this, the vehicle can take evasive action without identifying the specific obstacle. Ashok Elluswamy even joked on Twitter that a Tesla could avoid a UFO!
The technology also lets the vehicle check for obstacles around blind corners, enabling unprotected turns the way a human driver makes them.
In short, the occupancy network has significantly enhanced Tesla's autonomous driving capability.
Reportedly, Tesla's Autopilot system prevents some 40 accidents caused by driver error every day!
In addition, Ashok Elluswamy highlighted the Autopilot system's work on guarding against driver misoperation.
By sensing both the external environment and the driver's inputs, the vehicle can recognize misoperation, such as pressing the accelerator pedal at the wrong moment; it will then stop accelerating and brake automatically!
▲ Tesla active braking
In other words, some of the "brake failure" incidents frequently reported in China that were in fact caused by driver misoperation will be prevented at the technical level.
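To make the idea concrete, here is a minimal sketch of what such pedal-misapplication logic could look like. The signal names and thresholds are invented for illustration; Tesla has not published its actual trigger conditions.

```python
# Hypothetical pedal-misapplication mitigation logic (illustrative only).
def mitigate(accel_pedal: float, obstacle_distance_m: float, speed_mps: float) -> str:
    # Rough stopping distance assuming ~6 m/s^2 of braking (an assumption).
    stopping_distance = speed_mps ** 2 / (2 * 6.0)
    # Full throttle with an obstacle inside braking range looks like a
    # mistaken pedal press: cut acceleration and brake automatically.
    if accel_pedal > 0.9 and obstacle_distance_m < stopping_distance + 2.0:
        return "cut torque, apply brakes"
    return "no intervention"

print(mitigate(accel_pedal=1.0, obstacle_distance_m=4.0, speed_mps=8.0))
```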
It must be said that Tesla is very good at pushing technology forward. The following is a compilation of Ashok Elluswamy's talk, lightly edited.
First, a powerful pure-vision algorithm: turning two-dimensional images into three dimensions.
At the beginning of the talk, Ashok said that not everyone knows what Tesla's Autopilot system actually does, so he briefly introduced it.
▲ Ashok
According to him, Tesla's Autopilot system provides lane keeping, vehicle following, deceleration, cornering, and so on. On top of these, it ships with standard safety functions such as emergency braking and obstacle avoidance, which can prevent a variety of collisions.
In addition, since 2019, about one million Tesla vehicles have been able to use more advanced navigation on highways, checking adjacent lanes to perform lane changes and recognizing highway entrances and exits.
Moreover, the Autopilot system can park itself in parking lots, recognize traffic lights and road signs, and turn to steer around obstacles such as other cars. These functions have already been validated by hundreds of thousands of Tesla owners.
In the talk, Ashok also showed a video recorded by a user. As the user drives on a crowded San Francisco street, the car's screen displays the surrounding environment: road boundaries, lane lines, and the positions and speeds of nearby vehicles.
▲ The system recognizes the surrounding environment.
On one hand this relies on hardware, the Tesla vehicle and its cameras; on the other, it relies on the algorithms and neural networks built into the Autopilot system.
According to Ashok, each Tesla is equipped with eight 1.2-megapixel cameras that capture a 360-degree view of the surroundings, producing on average 36 frames per second. The car's onboard computer then processes this information, performing up to 144 trillion operations per second (TOPS).
Moreover, all of this runs on a pure-vision algorithm, without lidar or ultrasonic radar, and without high-definition maps.
So how does Tesla’s autopilot system identify general obstacles?
Ashok said that for general obstacles the system used a space-segmentation approach: it labels every pixel as "drivable" or "non-drivable", and the self-driving chip then acts on that scene. However, this method has several problems.
▲ Marking of objects
First of all, the pixels the system labels live in two-dimensional image space. To navigate the car in three-dimensional space, those labels must be converted into corresponding predictions in 3D so that Tesla's system can build a physical model it can interact with and handle the navigation task smoothly.
However, converting an object's pixels from a 2D image into 3D requires semantic segmentation, that is, classifying the image at the pixel level by labeling the object category each pixel belongs to.
This process produces many pixels that planning does not need, while a handful of pixels near the ground plane of the image carry outsized influence, because they directly determine how the 2D image is lifted into 3D. Tesla therefore does not want planning to hinge on such sensitive pixels.
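A quick back-of-the-envelope sketch shows why those ground-plane pixels matter so much. Assuming an idealized pinhole camera over flat ground (the focal length and camera height below are made-up values), the ground distance recovered from a pixel row blows up near the horizon:

```python
# Flat-ground unprojection for an idealized pinhole camera: pixel row v maps
# to ground distance Z = f * h / (v - v_horizon). f and h are assumptions.
f, h = 1000.0, 1.5  # focal length (pixels), camera height (metres)

def ground_distance(v: float, v_horizon: float = 500.0) -> float:
    return f * h / (v - v_horizon)

# A one-pixel error near the horizon shifts the recovered distance by metres;
# the same error lower in the image barely matters.
print(ground_distance(510) - ground_distance(511))  # ~13.6 m per pixel
print(ground_distance(700) - ground_distance(701))  # ~0.04 m per pixel
```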
In addition, different obstacles need to be judged by different methods.
Generally, the object's depth value is used: the distance from the camera viewpoint to the object, obtained after projection, normalization to device coordinates, and scaling and translation.
In some scenes the system can predict obstacles directly; in others it can instead estimate the depth of the image's pixels, so that each pixel yields a depth value.
▲ Depth map (right)
However, although the depth map that comes out looks beautiful, it has to be unprojected into 3D points before it can be used for prediction.
And when those 3D points are visualized, they look fine up close but become increasingly distorted with distance, making them hard to keep using in later stages.
Walls, for example, may appear bent, and objects near ground level are determined by only a few points, so the system cannot judge obstacles correctly during planning.
And because these depth maps are produced independently from the flat images of multiple cameras, it is hard to fuse them into a single consistent obstacle, and hard for the system to predict the obstacle's boundary.
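The same toy pinhole model from above also illustrates the sparsity problem: the number of pixel rows, and hence unprojected 3D points, covering each metre of ground falls off quadratically with distance.

```python
f, h = 1000.0, 1.5  # same assumed pinhole parameters as before

def rows_covering(z0: float, z1: float, v_horizon: float = 500.0) -> float:
    """How many pixel rows image the ground between distances z0 and z1."""
    return (v_horizon + f * h / z0) - (v_horizon + f * h / z1)

print(rows_covering(5, 6))    # ~50 rows -> dense 3D points up close
print(rows_covering(50, 51))  # ~0.6 rows -> barely a point at range
```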
Therefore, Tesla proposed the occupancy network scheme to solve these problems.
Second, computing space occupancy and encoding objects.
During the talk, Ashok demonstrated the occupancy network with a video. In this scheme, he said, the system processes the images captured by the eight cameras, computes the space occupancy of the objects around the car, and finally renders a visualization.
▲ The generated simulated image
Moreover, every time the car moves, the network recomputes the occupancy of the surrounding space. It computes occupancy not only for static objects, such as trees and walls, but also for dynamic objects, including moving cars.
The network then outputs a three-dimensional reconstruction, and it can even predict the occluded parts of objects, so even when the cameras capture only part of an object's outline, the object can still be distinguished clearly.
In addition, although objects appear at different resolutions in the camera images depending on distance, the simulated 3D output has the same resolution everywhere.
▲ The generated images have the same resolution.
All of this runs very efficiently. Ashok said the computation takes about 10 milliseconds on the onboard platform, so the network can run at roughly 100 Hz, faster than many cameras record frames.
So how is this done? That requires a look at the architecture of the occupancy network scheme.
Explaining the architecture, Ashok took the images from Tesla's fisheye camera and left-facing camera as examples and compared how each is rectified.
First, the system stretches and rectifies the images, extracts image features, and determines which points in 3D space are occupied: it applies 3D position encodings, maps features to fixed spatial positions, and aggregates the information in subsequent computation.
▲ Preliminary processing of the images.
After that, the system embeds image-space positions and keeps processing the image stream with 3D queries, finally producing 3D occupancy features. Because these occupancy features are high-dimensional, producing them at every point in space is expensive, so the system generates them at lower resolution first and then applies typical upsampling techniques to recover high-resolution space occupancy.
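As a rough illustration of this data flow (per-camera features, learned 3D position queries, attention, then upsampling), here is a minimal PyTorch sketch. Every layer size, grid shape, and module choice below is an assumption made for illustration, not Tesla's actual architecture:

```python
import torch
import torch.nn as nn

class OccupancySketch(nn.Module):
    def __init__(self, feat_dim=64, grid=(16, 16, 4)):
        super().__init__()
        self.grid = grid
        # Stand-in per-camera image backbone.
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)
        # One learned 3D position query per coarse voxel.
        n_vox = grid[0] * grid[1] * grid[2]
        self.voxel_queries = nn.Parameter(torch.randn(n_vox, feat_dim))
        # Voxel queries gather evidence from all cameras' image features.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # Upsample coarse occupancy features into a finer occupancy volume.
        self.upsample = nn.ConvTranspose3d(feat_dim, 1, kernel_size=2, stride=2)

    def forward(self, images):                         # (B, n_cams, 3, H, W)
        b = images.shape[0]
        feats = self.backbone(images.flatten(0, 1))    # (B*n_cams, F, h, w)
        feats = feats.flatten(2).transpose(1, 2)       # tokens per camera
        feats = feats.reshape(b, -1, feats.shape[-1])  # pool tokens of all cameras
        q = self.voxel_queries.unsqueeze(0).expand(b, -1, -1)
        occ_feat, _ = self.attn(q, feats, feats)       # (B, n_vox, F)
        x, y, z = self.grid
        occ_feat = occ_feat.transpose(1, 2).reshape(b, -1, x, y, z)
        return torch.sigmoid(self.upsample(occ_feat))  # fine occupancy grid

occ = OccupancySketch()(torch.randn(1, 8, 3, 128, 256))  # eight cameras
print(occ.shape)  # torch.Size([1, 1, 32, 32, 8])
```

Note that the fixed set of voxel queries is what gives the output grid uniform resolution regardless of how far away a pixel's content is.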
▲ Calculate the space occupancy of objects.
Interestingly, Ashok revealed that the occupancy network scheme was originally meant only for static objects, but the team found it was hard to deal with static trees alone, and at first the system also had great difficulty telling "real pedestrians from fake ones".
The team eventually realized that whether an obstacle is moving or static does not matter, as long as the car can avoid it in the end.
▲ Real and fake pedestrians
So the occupancy network scheme no longer distinguishes dynamic obstacles from static ones; it handles them with other classifications and computes the instantaneous occupancy of space. But instantaneous occupancy alone is not enough to guarantee that a Tesla drives safely.
If the system only knew instantaneous occupancy, then on meeting a car ahead at high speed all a Tesla could reasonably do is start slowing down; what the system really wants to know is how that car's occupancy changes across time.
That way, the system can predict when the car will move out of the way. The scheme therefore also predicts occupancy flow.
▲ Computing occupancy flow
Occupancy flow is the first (or a higher-order) time derivative of occupancy. It enables more precise control and unifies occupancy and its motion in the same coordinate system. The system produces occupancy and occupancy flow by the same mechanism, which provides strong protection against all kinds of obstacles.
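As a toy illustration of what occupancy flow buys the planner, the sketch below advects a 2D (top-down) occupancy grid along per-cell flow vectors to predict where the occupied space will be a moment later. The grid size, time step, and flow values are made up:

```python
import numpy as np

def advect(occ: np.ndarray, flow: np.ndarray, dt: float = 0.1) -> np.ndarray:
    """Shift occupied cells along their flow vectors (nearest-cell rounding)."""
    out = np.zeros_like(occ)
    for i, j in np.argwhere(occ > 0.5):
        di, dj = np.round(flow[i, j] * dt).astype(int)
        ni, nj = i + di, j + dj
        if 0 <= ni < occ.shape[0] and 0 <= nj < occ.shape[1]:
            out[ni, nj] = occ[i, j]
    return out

occ = np.zeros((20, 20)); occ[5, 5] = 1.0           # an occupied cell (a car)
flow = np.zeros((20, 20, 2)); flow[5, 5] = (0, 50)  # moving +y at 50 cells/s
print(np.argwhere(advect(occ, flow) > 0.5))         # [[ 5 10]]: moved on in 0.1 s
```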
Third, the type of obstacle does not matter: the system avoids the collision anyway.
Ashok also said that conventional motion networks cannot judge an object's type, for example whether it is a static object or a moving vehicle.
At the control level, however, the object's type does not actually matter, and the occupancy network scheme offers good protection where classification is hard.
No matter what the obstacle is, the system treats that part of space as occupied and moving at some velocity. Some special vehicles have strange protrusions that are difficult to model with traditional techniques; the system simply represents moving objects with cuboids or other polygons.
This way an object's shape can be approximated freely, and this occupancy representation needs no complex mesh-topology modeling; a small example follows below.
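For example, rasterizing a rotated rectangle (a top-down slice of a cuboid) into an occupancy grid takes only a few lines, with no mesh involved. Grid and box dimensions here are arbitrary:

```python
import numpy as np

def rasterize_box(grid_shape, center, size, yaw):
    """Mark cells whose centres fall inside a rotated rectangle (top-down)."""
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]]
    dx, dy = xs - center[0], ys - center[1]
    # Rotate cell offsets into the box frame, then test the half-extents.
    lx = np.cos(yaw) * dx + np.sin(yaw) * dy
    ly = -np.sin(yaw) * dx + np.cos(yaw) * dy
    return (np.abs(lx) <= size[0] / 2) & (np.abs(ly) <= size[1] / 2)

occ = rasterize_box((40, 40), center=(20, 20), size=(10, 4), yaw=np.pi / 6)
print(occ.sum(), "cells occupied")
```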
When the vehicle makes an unprotected (or protected) turn, geometric information can be used to infer occlusion, reasoning not only about what the vehicle's cameras can see but also about what they cannot.
For example, when the car makes an unprotected turn at a fork in the road ahead, there may be vehicles hidden behind trees and road signs; the car "knows" that it cannot see past those obstructions. With different control strategies, the car can probe and resolve the occlusion.
Likewise, for a stationary object, the car can recognize when it becomes visible as the car moves. And because it holds a complete three-dimensional picture of obstacles, the car can predict at what distance it would hit an object, and the system then identifies and passes the once-occluded object under smooth control.
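A minimal sketch of that kind of geometric occlusion reasoning: march a ray through the occupancy grid and report whether the line of sight is blocked. The grid and the "wall" of obstructions are invented for illustration:

```python
import numpy as np

def visible(occ: np.ndarray, origin, target, steps: int = 200) -> bool:
    """True if the straight line from origin to target crosses no occupied cell."""
    o, t = np.asarray(origin, float), np.asarray(target, float)
    for s in np.linspace(0.0, 1.0, steps)[1:-1]:
        i, j = np.round(o + s * (t - o)).astype(int)
        if occ[i, j]:
            return False
    return True

occ = np.zeros((30, 30), dtype=bool)
occ[10:12, 10:20] = True                # trees/signs blocking the view
print(visible(occ, (0, 15), (20, 15)))  # False: a car could hide back there
print(visible(occ, (0, 25), (20, 25)))  # True: clear line of sight
```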
The occupancy network scheme thus improves the control stack in many different ways. The scheme is an extension of neural radiance fields (NeRF), which have largely taken over computer vision research in the past few years.
▲ Schematic of the relationship between NeRF and the occupancy network
NeRF reconstructs a single scene or a single location from images captured from viewpoints at that location.
Ashok said that because Tesla's vehicles keep collecting images as they drive, the background processing can align accurate images across time and produce more accurate 3D reconstructions through NeRF-style models and differentiable rendering of the 3D state.
But images of the real world have a problem: they contain many artifacts that do not match the underlying scene.
For example, sun glare, or dirt and dust on the windshield, alters the light through diffraction, and raindrops distort light propagation further, ultimately producing artifacts.
The way to improve robustness is to use higher-level descriptors, which to a large extent do not change under local lighting artifacts such as glare.
Because RGB images can be very noisy, adding descriptors on top of RGB provides a layer of semantic protection against fluctuations in the RGB values. Tesla's goal is to use this approach in the occupancy network scheme.
▲ Descriptors are more robust than RGB.
Because the occupancy network scheme has to produce occupancy within a few frames, a full neural optimization cannot be run in the car; but the optimization can be moved to the background, verifying that the occupancy the network generates explains the images from all the sensors the car received while driving.
In addition, descriptors can be layered in during training to provide good supervision for these networks; at the same time, data from different sensors can be rendered differentiably to supervise the occupancy the system holds.
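One plausible reading of this, sketched below, is a reconstruction loss computed partly in descriptor space: a frozen feature extractor stands in for the "higher-level descriptors" (any pretrained backbone could play this role; the tiny one here is a made-up placeholder), down-weighting raw RGB, which is noisy under glare and raindrops:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen stand-in descriptor network (hypothetical; not Tesla's).
encoder = nn.Sequential(nn.Conv2d(3, 16, 5, 2, 2), nn.ReLU(),
                        nn.Conv2d(16, 32, 5, 2, 2)).eval()
for p in encoder.parameters():
    p.requires_grad_(False)

def reconstruction_loss(rendered, observed, w_desc=1.0, w_rgb=0.1):
    # Lean on descriptor-space agreement for a layer of semantic protection,
    # and down-weight raw RGB, which fluctuates under local lighting artifacts.
    rgb = F.mse_loss(rendered, observed)
    desc = F.mse_loss(encoder(rendered), encoder(observed))
    return w_rgb * rgb + w_desc * desc

x = torch.rand(1, 3, 64, 64)
print(reconstruction_loss(x, x + 0.1 * torch.rand_like(x)).item())
```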
Tesla now has a network that captures obstacles; the next step is to avoid any collision. Autopilot already has many safety functions.
Ashok then showed three videos of Autopilot intervening to avoid collisions.
The collisions here are those caused by a driver mistakenly pressing the accelerator pedal instead of the brake pedal.
Ashok's point is that when a driver mistakes the accelerator for the brake, the car would normally accelerate into a crash, but the system recognizes the error, automatically stops the acceleration, and brakes to prevent the collision.
Describing the first video, Ashok said that if Autopilot had not engaged and stopped the car from accelerating, the driver would probably have ended up in the river.
▲ Tesla AP starts to avoid the car falling into the river.
Similarly, the second video shows a Tesla driver mistakenly stepping on the accelerator while parking; Autopilot engaged quickly and prevented the car from hitting shops and pedestrians.
▲ Tesla AP starts to avoid the car hitting the store.
Fourth, automatic path planning from occupancy.
However, braking a car smoothly to a stop can take several seconds or more, and while driving there may not be enough time to identify obstacles and run the computation.
So a neural network is used to achieve this, especially for the more complicated occluded scenes seen recently. What Tesla's autonomous driving team does is take the space occupancy produced by the network upstream.
First, the occupancy is encoded into a highly compressed MLP. In essence, this MLP is an implicit representation of whether a collision can be avoided from any given query state, and this collision-avoidance method provides guarantees over some time horizon: for example, that a collision can be avoided within 2 seconds, within 4 seconds, or within some other window.
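A minimal sketch of such an implicit collision field: a small MLP maps a query state plus a time horizon to a collision probability. The input layout, layer sizes, and (untrained) weights are all placeholders; in practice such an MLP would be trained from the occupancy network's output:

```python
import torch
import torch.nn as nn

# Implicit collision field: state (x, y, heading) + horizon -> P(collision).
collision_mlp = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),
)

def collision_risk(x: float, y: float, heading: float, horizon_s: float) -> float:
    state = torch.tensor([[x, y, heading, horizon_s]])
    return collision_mlp(state).item()

# Query the same pose at a 2 s and a 4 s horizon, as in the talk's example.
print(collision_risk(3.0, 1.5, 0.2, 2.0), collision_risk(3.0, 1.5, 0.2, 4.0))
```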
Ashok gave another example here: a top-down view of a road in which black pixels are obstacles, gray pixels are road, and white pixels are lane markings. In this top-down view of the three-dimensional space, the car can be placed at any pixel position to simulate whether a collision can be avoided there.
▲ Schematic diagram of vehicle driving situation
He said: "If you think of the car as a single point and the period of avoiding collision is set to an instant, then whether there will be a collision at the current time depends only on the position of the obstacle; But the problem is that the car is not a point, it has a rectangular shape and can also turn. "
Therefore, only when the shape is convolved with the obstacle can we immediately know whether the car is in a collision state.
As the car turns (or rotates out of control), the collision field will change. Green means that the car is in a safe position without collision, and red means collision, so when the car rotates, there are more collision positions; But when the car position is aligned, the green position expands, which means that the car will not collide.
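The sketch below reproduces that idea in 2D: convolve the car's rotated rectangular footprint with the obstacle map, so every grid cell reports whether placing the car's centre there causes a collision. Footprint and grid sizes are arbitrary:

```python
import numpy as np
from scipy.signal import convolve2d

def collision_field(obstacles, yaw, length=5, width=3):
    """Cells where a car footprint centred there overlaps any obstacle."""
    r = max(length, width)
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    # Rasterize the car's rectangular footprint at this heading.
    lx = np.cos(yaw) * xs + np.sin(yaw) * ys
    ly = -np.sin(yaw) * xs + np.cos(yaw) * ys
    footprint = (np.abs(lx) <= length / 2) & (np.abs(ly) <= width / 2)
    overlap = convolve2d(obstacles.astype(int), footprint.astype(int), mode="same")
    return overlap > 0  # True = "red" (collision) at that car-centre position

obstacles = np.zeros((30, 30), dtype=bool)
obstacles[14, 14] = True                    # a single obstacle cell
print(collision_field(obstacles, 0.0).sum(),        # 15 colliding centres
      collision_field(obstacles, np.pi / 4).sum())  # 17: rotation adds more
```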
Overall, Ashok showed how videos from multiple cameras are used to produce dense space occupancy and occupancy flow, and how a neural network turns that occupancy into an effective collision-avoidance field, so that the vehicle can "see" through its cameras and, guided by experience, traverse the road at an appropriate speed and heading.
▲ Implicit neural network for collision avoidance.
Ashok also shared an experiment in a simulated environment: the driver steps on the accelerator without steering, the car detects that a collision is coming, and it plans a path that takes the car through safely.
At the end of his talk, Ashok said that if they can successfully implement all of the technologies above, they can build a car that never crashes.
Obviously, that work is not finished. On his final slide, Ashok warmly invited engineers to join Tesla and build a car that never crashes!
▲ Ashok Elluswamy welcomes more talents to join Tesla.
Conclusion: Tesla continues to explore autonomous driving.
Ever since Tesla's self-driving technology took off, a large number of followers have appeared on the autonomous driving track. But it must be said that Tesla remains at the forefront of the industry, constantly exploring new approaches to autonomous driving.
The head of the Tesla Autopilot project brought a fresh technical interpretation and, to some extent, previewed the highlights of Tesla's future autonomous driving technology. With its spirit of continuous exploration, Tesla's autonomous driving looks set to keep leading the automobile market.