Experts at the Table: Semiconductor Engineering sat down to discuss silicon lifecycle management, how it is expanding and evolving, and the issues that remain, with Prashant Goteti, Principal Engineer at Intel; Rob Aitken, R&D researcher at Arm; Zoe Conroy, senior hardware engineer at Cisco; Subhasish Mitra, professor of electrical engineering and computer science at Stanford University; and Mehdi Tahoori, Chair of Dependable Nano Computing at the Karlsruhe Institute of Technology. The following are excerpts from that conversation, held live (virtually) at the recent Synopsys User Group Conference.
SE: As semiconductors are used in safety- and mission-critical applications, and as complexity increases with heterogeneous designs, there is much more focus on silicon lifecycle management. Chips need to last longer in automotive, industrial, and data center applications, and rising design costs increase the need for longer semiconductor life, even in cell phones.
Goteti: Traditionally, this was about extending life and collecting data that you integrate for yield and manufacturing purposes. But the scope has now changed significantly, and silicon lifecycle management needs to change accordingly. We’re going to see a stream of data from chiplets, multiple chiplets inside a packaged system, and it can be used for all sorts of things in data centers, from workload balancing and dynamic performance improvement and management to traditional telemetry-type applications. So it’s definitely a nascent field, and there’s a lot of work to be done. But it’s not as if this is brand new. It’s been going on for a while.
Conroy: From the point of view of data centers and network products, it is a combination of hardware and software. The two must work together constantly, without bugs. On the hardware side, you envision heterogeneous integration with many different components from different vendors. The first challenge is really understanding that and asking, ‘Okay, what are the components? What does each one do? What is the risk for each one if I go to SLM? What critical components do I want to monitor in my product that could negatively affect my network?’ The first step is to really understand your product, how that product is tested, and what kinds of functions it will perform throughout its lifecycle. And then you’re going to say, ‘Okay, if I want to monitor with SLM end-to-end, I’m going to go from wafer sort to the field. So if I’m going to test and monitor my chips, what exactly do I want to monitor? How am I going to monitor it? What data should I collect? How am I going to transport this data, whether from wafer sort, test, or the field, through a network and into a place where I can do real-time analytics?’ SLM has many components. And now we have things like cloud solutions, where we are able to do end-to-end analytics. But it’s very complicated, and we’re only at the tip of the iceberg of what’s going to happen in the future.
Aitken: It’s not just about test pads. We need to think about what really needs to be present in a CPU, in the surrounding logic, the I/O, and so on, to actually provide the data. And what can you do with the data? What we’ve come across often, even in the IoT space, is that if you’re going to manage devices in some way as part of silicon lifecycle management, how do you do that? How do you handle upgrades? How is the software updated? How does a device trust the software provider? How does the cloud service know to trust the device? There are a lot of issues and challenges throughout this process, and there is a lot of work to be done. But there has already been a lot of progress.
Mitra: It’s interesting to hear my colleagues from industry talk about how they are already doing this. We are in the dark ages, very far from where we want to be. If the network goes down, we know we have a problem. But what is happening in the real world today is not that things simply stop working. It’s that they produce incorrect results, and no one knows those results are incorrect. These are called silent errors, and the industry doesn’t seem to have a fix for them.
Aitken: It is possible to be in the dark ages and still make progress. There is general agreement that there is a lot of work to be done, but that does not mean nothing has happened.
Mitra: But progress is being made at a glacial pace.
Tahoori: On the positive side, there are a lot of opportunities. As we move forward, systems become more and more complex, and we have to address many issues beyond chip and system quality, including trust. SLM can be a solution. Much progress remains to be made, but SLM promises to solve some of the problems of designing, verifying, and trusting very complex hardware and software systems. If it is done right, we can meet the challenges of increasing complexity.
SE: Is the solution better design, with more verification and simulation, combined with in-circuit monitoring when a chip is in the field?
Goteti: It depends on what you want to achieve. Take silent data corruption, or silent data errors. These can be due to things like manufacturing defects, and that is where better design, verification, and test content could help. But if you’re considering things like dynamic workload balancing or performance-per-watt tuning, better verification won’t help you in those situations. So you can approach some things with better design, better verification, and better test content, but not everything. You have to choose your battles, and the strategies will be different.
Mitra: I agree and disagree. Many of these elements are dynamic in nature. You can’t just do everything statically at time zero and expect it all to work. You need adaptability in the system. But when you have adaptability, it has to be verified, and you have to make sure things don’t go wrong in the field. So adaptability will require more verification and more testing at the same time.
Aitken: This also involves security. You mentioned silent data corruption as a challenge. But your device being hacked, or being used as the start of a botnet, is also a challenge, and you need to make sure that the monitoring capability you have on the device can identify when the device is under attack and do something about it. That is yet another vector you could potentially pursue in this area.
Tahoori: Going forward with system requirements, adaptability is something we have to deal with, but it’s not necessarily SLM. They overlap, but they are not necessarily the same thing. SLM covers a wider field, and it basically allows us to collect data across a population of systems and chips. From that kind of data, we can derive much more useful information than would be possible by simply doing the adaptation on a single system or device. A population of devices and systems makes it possible to detect anomalies, whether that’s faulty behavior, silent data corruption, or some sort of security breach.
SE: That opens a Pandora’s box, because it’s very difficult to get some of this data. For 20 years we’ve been talking about data ownership, how much will be shared, and the privacy issues associated with that data. Has any of that improved?
Conroy: When you make your own chip, you have your own data. If you purchase components from other vendors, you may or may not want that data, depending on the type of component. Usually, when you buy silicon from other vendors, they really don’t want to share any data around that silicon other than that it’s a passing chip and meets spec. But with SLM, the bottom line is that you want data to flow through your supply chain. If a part fails, and it’s not your part, you want to know why. You want more data to help you diagnose it and identify the root cause. There is still a reluctance in the industry to give out companies’ private data, because it becomes a support burden for them to manage it.
Aitken: It is also potentially a liability burden. When someone owns the data, someone else can own the problem. You need a combination of design data, foundry data, test data, production distribution data, and field data, and it all belongs to five different companies. Everyone, at some level, would like to own some aspect of the problem, and at other levels would like someone else to own the problem. That shifting of who owns what, and who will guarantee what, is part of the challenge. Who has what motivation to collect and use what data, at what time?
Mitra: This is an important point involving data reliability and security. I’ve seen many forums where we get into this data ownership discussion, but the problem is figuring out what data we’re talking about. Most of the time, people don’t even know what data to collect, let alone who owns it or who is responsible for it. Ownership is important, but the real focus should be what data to collect, what the mechanisms are, what the instrumentation is, and what to put in the architecture to be able to collect that data. And how do you analyze the data? This is where we are far, far behind.
Goteti: I agree that the volume of data is going to be a big issue, and we are going to have a flood of data. If you assume you have 50 or 60 chiplets in a package, you’ll get a lot of telemetry from all of them, and processing it will be difficult unless you have an efficient system to do so. But getting back to the question of who owns the data, that’s an open question that needs to be resolved quickly. And we are not the only pioneers here. The aviation industry has been doing this for some time with big data. Engine manufacturers collect engine data, and then decide whether or not to share that data with the airlines or with the aircraft manufacturers themselves. It’s something we in the semiconductor industry need to figure out, and soon, because this data is coming. We already have a lot of data, and we are trying to use the right data.
Mitra: Your signal-to-noise ratio is very low.
Goteti: It is important to find the signal in the noise, but we have two problems to solve. We need to figure out how the data is processed and how we deal with such large amounts of data. And then we also have to figure out who can use that data, and how, regardless of who collects it.