Oops, missed that

On Mon, Apr 3, 2023, 2:47 PM Jed Rothwell <[email protected]> wrote:

> I wrote:
>
>
>> The human brain has 86 billion neurons, all operating simultaneously. In
>> other words, complete parallel processing with 86 billion "processors"
>> operating simultaneously. ChatGPT tells us she has 175 billion
>> parameters in Version 3. I assume each parameter is roughly equivalent to a
>> neuron.
>>
>
> Wrong! I got that wrong. Each parameter may be roughly equivalent to a
> neuron, but human brain neurons have roughly 7,000 synapses each:
>
> Each neuron has, on average, about 7,000 synaptic connections with other
> neurons. That puts the synapse count in the neighborhood of 600 trillion.
> In young children, before synaptic pruning begins in earnest, the estimated
> number reaches as high as 1 quadrillion.
>
>
> https://www.jax.org/news-and-insights/jax-blog/2018/december/600-trillion-synapses-and-alzheimers-disease
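>
> As a sanity check, the arithmetic behind that figure works out (a quick
> Python sketch, using the neuron and synapse estimates quoted above):
>
>     neurons = 86e9               # estimated neurons in the human brain
>     synapses_per_neuron = 7_000  # average connections per neuron (see above)
>     print(f"{neurons * synapses_per_neuron:.2e}")  # 6.02e+14, ~600 trillion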
>
> ChatGPT tells me that each parameter is stored as a 16-bit or 32-bit
> floating-point number. That is roughly equivalent to having 32 synapses
> per neuron. The human neuron can hold a much wider range of values. I
> don't know how that affects the model, but to me the human neuron looks
> more like a quantum computer qubit than a 32-bit floating-point number.
> Presumably, brains evolved to the optimum number of neurons and synapses.
>
> So, ChatGPT Ver. 3 was smaller than the human brain in terms of the range
> of values the ANN can hold. Granted, comparing it to a brain is comparing
> apples to oranges, but there is probably some advantage to a much larger
> range of values per neuron. A super-brain would need far more synapse
> equivalents than a human brain, so we have a long way to go before
> reaching something like a million times 1 quadrillion artificial synapses.
> Plus, of course, it would need programming methods not yet invented.
>
> ChatGPT tells me that with the specialized hardware provided by OpenAI,
> all 175 billion parameters are loaded into RAM at the same time. With
> other configurations they are overlaid. The speed of MPP operation depends
> on whether the parameters are loaded or overlaid, and I think more
> importantly on whether the operation is done with specialized hardware
> (GPUs or Google's TPUs) or emulated with ordinary computer hardware.
>
> I asked ChatGPT for some of these details. Here is our exchange:
>
>
> ME: You told me that the largest version of GPT-3 had 175 billion
> parameters. Is each parameter stored as 1 byte? Or some other number of
> bits?
>
> ChatGPT: No, each parameter in GPT-3 is not stored as 1 byte. Instead,
> each parameter is typically stored as a 16-bit or 32-bit floating-point
> number, depending on the implementation. This is because floating-point
> numbers can represent a wider range of values and provide greater precision
> than integers.
>
> For example, GPT-3's 175 billion parameters would require around 700 GB of
> memory if each parameter was stored as a 32-bit floating-point number. If
> each parameter was stored as a 16-bit floating-point number, it would
> require around 350 GB of memory.
>
> It's worth noting that the actual memory usage of GPT-3 may be different
> due to various optimizations and compression techniques used by the
> implementation.
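>
> ChatGPT's arithmetic checks out. A quick sketch (decimal gigabytes,
> ignoring activations, optimizer state, and other overhead):
>
>     params = 175e9                 # GPT-3 parameter count
>     bytes_fp32 = params * 4        # a 32-bit float is 4 bytes
>     bytes_fp16 = params * 2        # a 16-bit float is 2 bytes
>     print(bytes_fp32 / 1e9, "GB")  # 700.0 GB
>     print(bytes_fp16 / 1e9, "GB")  # 350.0 GB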
>
> ME: With the 32-bit floating point number configuration, I assume all 700
> GB are online in RAM. Not overlaid. Correct?
>
> ChatGPT: No, it's not necessary for all 700 GB of GPT-3's parameters to be
> stored in RAM at the same time. In fact, it's not practical to store such a
> large model entirely in memory on most machines.
>
> Most deep learning frameworks use a technique called "memory-mapping" to
> allow the model parameters to be loaded and unloaded from disk as needed
> during training or inference. Memory-mapping allows the model parameters to
> be accessed as if they were in memory, even though they are stored on disk.
>
> In practice, the memory usage of GPT-3 will depend on the specific
> implementation and hardware configuration. Some implementations may use
> techniques like model parallelism or gradient checkpointing to reduce
> memory usage, while others may rely on larger hardware configurations with
> more memory.
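>
> For illustration, here is a minimal sketch of memory-mapping with NumPy.
> The file name and shard size are hypothetical, but this is how
> numpy.memmap behaves: the array is backed by the file on disk, and pages
> are pulled into RAM only as they are touched.
>
>     import numpy as np
>
>     # Map one (hypothetical) shard of fp16 weights without reading the
>     # whole file into RAM.
>     weights = np.memmap("weights_shard_00.bin", dtype=np.float16,
>                         mode="r", shape=(50_000_000,))
>
>     # Only the pages backing this slice are actually read from disk.
>     print(weights[:1_000_000].mean())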
>
> ME: I meant in the "specialized computing infrastructure provided by
> OpenAI" that you described earlier. Is that all online?
>
> ChatGPT: Ah, I see what you're asking now. Yes, in the specialized
> computing infrastructure provided by OpenAI, the GPT-3 model with 175
> billion parameters is fully loaded into the RAM of the computing cluster
> during operation. The computing cluster used by OpenAI is composed of many
> machines with large amounts of RAM, and the model parameters are
> distributed across those machines.
>
> However, it's worth noting that not all of the machines in the cluster
> will necessarily have a full copy of the model parameters in memory at all
> times. Instead, the model parameters may be partitioned across the machines
> and loaded into memory as needed for different parts of the computation.
> This approach allows the computing cluster to handle the large memory
> requirements of the GPT-3 model while also being scalable and efficient.
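>
> As a toy illustration of that kind of partitioning (pure Python with
> hypothetical sizes; real systems use frameworks built for model
> parallelism):
>
>     import numpy as np
>
>     n_machines = 8
>     params = np.zeros(175_000, dtype=np.float32)  # stand-in for the model
>
>     # Each machine holds only its own shard in memory, not the full model.
>     shards = np.array_split(params, n_machines)
>     for rank, shard in enumerate(shards):
>         print(f"machine {rank}: {shard.size} of {params.size} parameters")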
>
> ME: I realize that memory mapping is very rapid with today's peripherals,
> so even an overlay configuration will be fast.
>
> ChatGPT: Yes, memory mapping can be very fast with modern storage devices
> and is commonly used to efficiently access large datasets that cannot fit
> entirely in memory. With memory mapping, the operating system maps a
> portion of a file into the virtual address space of a process, allowing the
> process to access the file's contents as if they were in memory.
>
> However, it's important to note that memory-mapped I/O does have some
> overhead, such as page faults and disk I/O. Page faults can occur if the
> requested memory page is not in physical memory and must be fetched from
> disk, while disk I/O can occur if the requested page has not yet been read
> into the page cache. These overheads can be minimized through various
> techniques such as preloading data into the page cache, using large pages,
> and optimizing disk access patterns.
>
> In any case, the actual performance of memory-mapped I/O depends on the
> specific hardware and software configuration, as well as the
> characteristics of the workload being performed.
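>
> For example, on Linux with Python 3.8 or later, a process can hint the
> kernel to preload pages before touching them (a sketch; the file name is
> hypothetical, and MADV_WILLNEED is platform-dependent):
>
>     import mmap
>
>     with open("weights_shard_00.bin", "rb") as f:
>         mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
>         # Ask the kernel to start reading these pages into the page cache
>         # now, so later accesses are less likely to stall on page faults.
>         mm.madvise(mmap.MADV_WILLNEED)
>         header = mm[:16]  # may already be cached when we touch it
>         mm.close()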
>
> ME: One more question about this. I read that the "specialized computing
> infrastructure provided by OpenAI" is an MPP configuration. Approximately
> how many processors are there? How many processors per parameter? I guess
> one GPU or TPU is equivalent to one CPU processor in a traditional MPP
> configuration.
>
> ChatGPT: I'm sorry, but I don't have access to detailed information
> about the specific hardware and configuration of OpenAI's computing
> infrastructure, so I can't provide specific numbers about the number of
> processors or GPUs/TPUs used per parameter.
>
> However, I can provide some general information about MPP (Massively
> Parallel Processing) configurations and their relationship to GPUs/TPUs.
>
> In an MPP configuration, the processing power of a system is distributed
> across multiple processors or nodes, allowing for massive parallelism and
> high throughput. This is typically achieved through the use of specialized
> hardware and software, such as high-speed interconnects and parallel file
> systems.
>
> In the context of deep learning, GPUs and TPUs are often used as the
> primary processing units in MPP configurations due to their ability to
> perform large-scale matrix operations in parallel. GPUs and TPUs are
> designed with a large number of processing cores that can perform these
> operations simultaneously, making them well-suited for deep learning
> workloads.
>
> The number of processors or GPUs/TPUs used per parameter will depend on
> various factors, such as the size of the model, the complexity of the
> workload, and the hardware configuration. In general, larger models and
> more complex workloads will require more processing power per parameter,
> while smaller models and simpler workloads may be able to make do with less.
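>
> A minimal PyTorch sketch of the kind of parallel matrix operation GPUs
> and TPUs are built for (assumes PyTorch is installed; it falls back to
> the CPU if no CUDA device is present):
>
>     import torch
>
>     device = "cuda" if torch.cuda.is_available() else "cpu"
>
>     # One layer's worth of work: a single large matrix multiply. On a GPU,
>     # thousands of cores compute the output elements in parallel.
>     x = torch.randn(1024, 4096, device=device)
>     w = torch.randn(4096, 4096, device=device)
>     y = x @ w
>     print(y.shape, "computed on", device)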
>
> ME: So, a larger model actually needs a higher ratio of GPU or TPU units
> per parameter? A model is "larger" in the sense that it does more
> processing for each parameter, not that it has more parameters. That is
> interesting.
>
> ChatGPT: Yes, you're correct. The number of processing units, whether
> CPUs, GPUs, or TPUs, required per parameter in a deep learning model
> depends not only on the number of parameters but also on the complexity
> of the computation required for each parameter.
>
> Larger models, in the sense of more computation per parameter, will
> require more processing units per parameter as the workload becomes more
> computationally intensive. This is especially true for models that
> involve more complex operations, such as recurrent neural networks,
> convolutional neural networks, and transformers like GPT-3.
>
> In general, the number of processing units required per parameter is
> proportional to the computational requirements of the model. Therefore,
> larger models require a higher ratio of processing units per parameter,
> which means that more powerful hardware configurations, such as clusters
> of GPUs or TPUs, are needed to efficiently train these models or run
> inference on them.
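>
> A common rule of thumb from the scaling-law literature makes "compute per
> parameter" concrete: a transformer's forward pass costs roughly 2 FLOPs
> per parameter per token (an approximation, not an exact figure):
>
>     params = 175e9               # GPT-3 parameter count
>     flops_per_param_token = 2    # rough forward-pass rule of thumb
>     tokens = 1_000               # tokens in one hypothetical request
>     print(f"{params * flops_per_param_token * tokens:.1e} FLOPs")  # 3.5e+14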
>
>
