Oops, missed that.

On Mon, Apr 3, 2023, 2:47 PM Jed Rothwell <[email protected]> wrote:
> I wrote:
>
>> The human brain has 86 billion neurons, all operating simultaneously.
>> In other words, complete parallel processing with 86 billion
>> "processors" operating simultaneously. ChatGPT tells us she has 175
>> billion parameters in Version 3. I assume each parameter is roughly
>> equivalent to a neuron.
>
> Wrong! I got that wrong. Each parameter is roughly equivalent to a
> neuron, but human brain neurons have roughly 7,000 synapses:
>
> "Each neuron has, on average, about 7,000 synaptic connections with
> other neurons. That puts the synapse count in the neighborhood of 600
> trillion. In young children, before synaptic pruning begins in earnest,
> the estimated number reaches as high as 1 quadrillion."
>
> https://www.jax.org/news-and-insights/jax-blog/2018/december/600-trillion-synapses-and-alzheimers-disease
>
> ChatGPT tells me that each parameter is stored as a 16- or 32-bit
> floating point number. That is roughly equivalent to having 32 synapses
> per neuron. The human neuron can have a much wider range of values. I
> don't know how that affects the model, but to me the human neuron looks
> more like a quantum computer qubit than a 32-bit floating point number.
> Presumably, brains evolved to the optimum number of neurons and
> synapses.
>
> So, ChatGPT Ver. 3 was smaller than the human brain in terms of the
> range of values the ANN can hold. Granted, comparing it to a brain is
> comparing apples to oranges, but there is probably some advantage to a
> much larger range of values per neuron. A super-brain would need far
> more synapse equivalents than a human brain has, so we have a long way
> to go before reaching something like a million times 1 quadrillion
> artificial synapses. Plus, of course, it would need programming methods
> not yet invented.
>
> ChatGPT tells me that with the specialized hardware provided by OpenAI,
> all 175 billion parameters are loaded into RAM at the same time. With
> other configurations they are overlaid. The speed of MPP operation
> depends on whether the parameters are loaded or overlaid, and I think
> more importantly on whether the operation is done with specialized
> hardware (GPUs and Google's TPUs) or emulated with ordinary computer
> hardware.
>
> I asked ChatGPT for some of these details. Here is our exchange:
>
> ME: You told me that the largest version of GPT-3 had 175 billion
> parameters. Is each parameter stored as 1 byte? Or some other number of
> bits?
>
> ChatGPT: No, each parameter in GPT-3 is not stored as 1 byte. Instead,
> each parameter is typically stored as a 16-bit or 32-bit floating-point
> number, depending on the implementation. This is because floating-point
> numbers can represent a wider range of values and provide greater
> precision than integers.
>
> For example, GPT-3's 175 billion parameters would require around 700 GB
> of memory if each parameter was stored as a 32-bit floating-point
> number. If each parameter was stored as a 16-bit floating-point number,
> it would require around 350 GB of memory.
>
> It's worth noting that the actual memory usage of GPT-3 may be different
> due to various optimizations and compression techniques used by the
> implementation.
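> A quick sanity check of the arithmetic above, in plain Python (the 4-
> and 2-byte figures are just the standard fp32 and fp16 widths, nothing
> specific to OpenAI's setup):
>
>     NEURONS = 86e9               # human brain neurons
>     SYNAPSES_PER_NEURON = 7_000  # average connections per neuron
>     PARAMS = 175e9               # GPT-3 parameters
>
>     synapses = NEURONS * SYNAPSES_PER_NEURON
>     print(f"brain synapses:     {synapses:.2e}")  # ~6.0e14 (600 trillion)
>
>     # Storage at 4 bytes (fp32) and 2 bytes (fp16) per parameter:
>     print(f"fp32 weights, GB:   {PARAMS * 4 / 1e9:.0f}")   # 700
>     print(f"fp16 weights, GB:   {PARAMS * 2 / 1e9:.0f}")   # 350
>
>     # Brain synapses per GPT-3 parameter:
>     print(f"synapses per param: {synapses / PARAMS:.0f}")  # ~3440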
> ME: With the 32-bit floating point number configuration, I assume all
> 700 GB are online in RAM. Not overlaid. Correct?
>
> ChatGPT: No, it's not necessary for all 700 GB of GPT-3's parameters to
> be stored in RAM at the same time. In fact, it's not practical to store
> such a large model entirely in memory on most machines.
>
> Most deep learning frameworks use a technique called "memory-mapping" to
> allow the model parameters to be loaded and unloaded from disk as needed
> during training or inference. Memory-mapping allows the model parameters
> to be accessed as if they were in memory, even though they are stored on
> disk.
>
> In practice, the memory usage of GPT-3 will depend on the specific
> implementation and hardware configuration. Some implementations may use
> techniques like model parallelism or gradient checkpointing to reduce
> memory usage, while others may rely on larger hardware configurations
> with more memory.
>
> ME: I meant in the "specialized computing infrastructure provided by
> OpenAI" that you described earlier. Is that all on-line?
>
> ChatGPT: Ah, I see what you're asking now. Yes, in the specialized
> computing infrastructure provided by OpenAI, the GPT-3 model with 175
> billion parameters is fully loaded into the RAM of the computing cluster
> during operation. The computing cluster used by OpenAI is composed of
> many machines with large amounts of RAM, and the model parameters are
> distributed across those machines.
>
> However, it's worth noting that not all of the machines in the cluster
> will necessarily have a full copy of the model parameters in memory at
> all times. Instead, the model parameters may be partitioned across the
> machines and loaded into memory as needed for different parts of the
> computation. This approach allows the computing cluster to handle the
> large memory requirements of the GPT-3 model while also being scalable
> and efficient.
>
> ME: I realize that memory mapping is very rapid with today's
> peripherals, so even an overlay configuration will be rapid.
>
> ChatGPT: Yes, memory mapping can be very fast with modern storage
> devices and is commonly used to efficiently access large datasets that
> cannot fit entirely in memory. With memory mapping, the operating system
> maps a portion of a file into the virtual address space of a process,
> allowing the process to access the file's contents as if they were in
> memory.
>
> However, it's important to note that memory-mapped I/O does have some
> overhead, such as page faults and disk I/O. Page faults can occur if the
> requested memory page is not in physical memory and must be fetched from
> disk, while disk I/O can occur if the requested page has not yet been
> read into the page cache. These overheads can be minimized through
> various techniques such as preloading data into the page cache, using
> large pages, and optimizing disk access patterns.
>
> In any case, the actual performance of memory-mapped I/O depends on the
> specific hardware and software configuration, as well as the
> characteristics of the workload being performed.
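> To make the memory-mapping idea concrete, here is a toy sketch in
> Python using NumPy's memmap (the file name and sizes are made up;
> GPT-3's actual weight files and loading code are not public):
>
>     import numpy as np
>
>     # Write a toy "weight file": a million fp32 parameters on disk.
>     w = np.memmap("weights.bin", dtype=np.float32, mode="w+",
>                   shape=(1000, 1000))
>     w[:] = np.random.default_rng(0).standard_normal((1000, 1000))
>     w.flush()
>     del w
>
>     # Map the file back read-only. Nothing is bulk-loaded here; the OS
>     # pages data in from disk only when it is actually touched.
>     weights = np.memmap("weights.bin", dtype=np.float32, mode="r",
>                         shape=(1000, 1000))
>     x = np.ones(1000, dtype=np.float32)
>     y = weights @ x      # touching the rows triggers the page-ins
>     print(y[:3])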
> ME: One more question about this. I read that the "specialized
> computing infrastructure provided by OpenAI" is an MPP configuration.
> Approximately how many processors are there? How many processors per
> parameter? I guess one GPU or TPU is equivalent to one CPU processor in
> a traditional MPP configuration.
>
> ChatGPT: I'm sorry, but I don't have access to detailed information
> about the specific hardware and configuration of OpenAI's computing
> infrastructure, so I can't provide specific numbers about the number of
> processors or GPUs/TPUs used per parameter.
>
> However, I can provide some general information about MPP (Massively
> Parallel Processing) configurations and their relationship to GPUs/TPUs.
>
> In an MPP configuration, the processing power of a system is distributed
> across multiple processors or nodes, allowing for massive parallelism
> and high throughput. This is typically achieved through the use of
> specialized hardware and software, such as high-speed interconnects and
> parallel file systems.
>
> In the context of deep learning, GPUs and TPUs are often used as the
> primary processing units in MPP configurations due to their ability to
> perform large-scale matrix operations in parallel. GPUs and TPUs are
> designed with a large number of processing cores that can perform these
> operations simultaneously, making them well-suited for deep learning
> workloads.
>
> The number of processors or GPUs/TPUs used per parameter will depend on
> various factors, such as the size of the model, the complexity of the
> workload, and the hardware configuration. In general, larger models and
> more complex workloads will require more processing power per parameter,
> while smaller models and simpler workloads may be able to make do with
> less.
>
> ME: So, a larger model actually needs a higher ratio of GPU or TPU units
> per parameter? A model is "larger" in the sense that it does more
> processing for each parameter, not that it has more parameters. That is
> interesting.
>
> ChatGPT: Yes, you're correct. The number of processing units, whether
> CPUs, GPUs, or TPUs, required per parameter in a deep learning model
> depends not only on the number of parameters but also on the complexity
> of the computation required for each parameter.
>
> Larger models, in terms of the computation required for each parameter,
> will require more processing units per parameter, as the computation
> becomes more computationally intensive. This is especially true for
> models that require more complex operations, such as recurrent neural
> networks, convolutional neural networks, and transformers like GPT-3.
>
> In general, the number of processing units required per parameter is
> proportional to the computational requirements of the model. Therefore,
> larger models require a higher ratio of processing units per parameter,
> which means that more powerful hardware configurations, such as clusters
> of GPUs or TPUs, are needed to efficiently train these models or run
> inference on them.
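> To put rough numbers on that last point (a sketch only: the ~2 floating
> point operations per parameter per token figure is a common rule of
> thumb for transformer forward passes from the scaling-law literature,
> and the per-GPU numbers below are hypothetical, not OpenAI's):
>
>     PARAMS = 175e9            # GPT-3-sized model
>     FLOPS_PER_PARAM = 2       # per token, forward pass (rule of thumb)
>     TOKENS_PER_SEC = 1_000    # hypothetical throughput target
>
>     compute_needed = PARAMS * FLOPS_PER_PARAM * TOKENS_PER_SEC
>     GPU_FLOPS = 100e12        # hypothetical 100 TFLOP/s usable per GPU
>     print(f"compute needed:   {compute_needed:.1e} FLOP/s")        # 3.5e14
>     print(f"GPUs for compute: {compute_needed / GPU_FLOPS:.1f}")   # ~3.5
>
>     # Memory often sets the real floor; the fp32 weights alone:
>     mem = PARAMS * 4          # bytes at 4 bytes per parameter
>     GPU_MEM = 80e9            # hypothetical 80 GB per GPU
>     print(f"GPUs for weights: {mem / GPU_MEM:.1f}")                # ~8.8
>
> So with these made-up numbers, just holding the weights takes more GPUs
> than the raw arithmetic does, and the balance shifts as the computation
> per parameter grows, which is ChatGPT's point above.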

