GPU Orchestration

Use Case

Maximize AI Workload Throughput and Cost Efficiency with GPU Fractionalization

Executive Summary

GPUs are the powerhouse of modern AI, but they are also a significant investment, and low utilization is a common problem. To maximize ROI, organizations must share GPUs effectively. However, traditional methods like NVIDIA MIG and time-slicing introduce their own challenges of waste, rigidity, and inefficiency.

MemVerge’s Fractional GPU technology offers a smarter way forward, ensuring workloads get precisely the resources they need, right when they need them.

A Common Problem

Modern AI workloads vary widely in their GPU resource requirements:

  • Lightweight inference jobs may use only 5–15% of a GPU
  • Early-phase model development or parameter tuning may use 20–30%
  • Even production batch pipelines often sit idle between data transfers

Yet conventional GPU orchestration assigns jobs one GPU at a time, regardless of need. The result:

  • Low GPU utilization (often 20–30% average across a cluster)
  • High costs for short or low-power jobs
  • Resource contention that slows scheduling of new jobs

A Typical Scenario

Let’s explore a common situation: a team runs 4 concurrent inference workloads. Each uses only 10% of an A100’s memory and compute, yet each occupies a full GPU. As a result, 4 jobs consume 4 GPUs, while less than half of one GPU’s worth of capacity is actually used.

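The waste here is simple arithmetic. A quick sketch, assuming 80 GB A100s purely for illustration:

```python
# Back-of-the-envelope math for the scenario above: four inference jobs,
# each using ~10% of a GPU's memory, one dedicated GPU apiece.
# The 80 GB A100 figure is an assumption (40 GB variants also exist).

GPU_MEMORY_GB = 80
jobs = 4
memory_per_job_gb = 0.10 * GPU_MEMORY_GB   # ~8 GB each

provisioned_gb = jobs * GPU_MEMORY_GB      # 320 GB across 4 dedicated GPUs
used_gb = jobs * memory_per_job_gb         # 32 GB actually in use

print(f"Provisioned: {provisioned_gb} GB, used: {used_gb:.0f} GB "
      f"({used_gb / provisioned_gb:.0%} utilization)")
# -> Provisioned: 320 GB, used: 32 GB (10% utilization)
```
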
Understanding GPU Sharing Options

The key to efficient GPU utilization is how workloads access the physical hardware. Let’s compare the common methods.

NVIDIA Multi-Instance GPU (MIG): Rigid Partitions, Wasted Potential

NVIDIA MIG works by partitioning a single physical GPU into several smaller, fully isolated vGPUs. Each MIG instance has its own dedicated memory and streaming multiprocessors (SMs), which provides isolation between workloads. However, this rigidity creates significant operational challenges:

  • Static & Rigid: Requires manual, upfront configuration by an administrator and a server reboot for any changes. Any changes to this configuration require a time-consuming server reboot, leading to downtime.
  • Limited & Inflexible Profiles: NVIDIA provides a limited list of supported MIG configurations. This means you are forced to fit your workloads into predefined slices, which rarely match their actual requirements.
  • Mismatched Resources: Workloads rarely align with the fixed partition sizes, resulting in unused and unallocatable memory and compute resources in every slice. If a workload is smaller than the instance, the leftover memory and compute power in that slice are wasted and cannot be allocated. If a workload is too large, it cannot run at all. This is a critical problem for platform teams who cannot predict user application needs at the time of configuration.
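
A minimal sketch that fits each workload into the smallest MIG slice able to hold it and tallies the stranded memory. The slice sizes follow the published A100-40GB MIG memory profiles; the workload footprints are invented for illustration:

```python
# Fit each workload into the smallest MIG slice that can hold it and tally
# the memory stranded inside each slice. Slice sizes follow the A100-40GB
# MIG memory profiles (1g.5gb, 2g.10gb, 3g.20gb, 7g.40gb); the workload
# footprints are invented for illustration.

MIG_SLICES_GB = [5, 10, 20, 40]
workload_needs_gb = [3, 6, 6, 12, 22]

stranded_gb = 0
for need in workload_needs_gb:
    slice_gb = next((s for s in MIG_SLICES_GB if s >= need), None)
    if slice_gb is None:
        print(f"{need} GB job: larger than any slice, cannot run")
        continue
    stranded_gb += slice_gb - need
    print(f"{need} GB job -> {slice_gb} GB slice ({slice_gb - need} GB stranded)")

print(f"Total stranded memory: {stranded_gb} GB")
# -> 36 GB stranded across 5 jobs: capacity no other workload can use.
```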

Time-Slicing: High Overhead and Low Throughput

Time-slicing allows multiple workloads to share a GPU by giving each one access to all of the GPU’s resources, but only for a very short period. While this ensures all assigned tasks make some progress, it comes at a high cost:

  • Inefficient All-or-Nothing Allocation: A workload gets exclusive access to the entire GPU, even when it only needs a fraction of the resources, leading to massive waste during its time slice.
  • High Latency, Slower Time to Result: Workloads spend a significant amount of time waiting for their turn on the GPU, hindering overall throughput and delaying results.
  • Context-Switching Overhead: Each time a new workload gets its turn, its data must be copied to the GPU. This constant data movement adds significant overhead and reduces the time available for actual computation; the toy model below illustrates the cumulative cost.
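
A rough illustration of the overhead: the toy model below (all numbers are invented assumptions, not measurements) simulates round-robin time-slicing in which every turn pays a fixed context-swap cost:

```python
# Toy round-robin model: N jobs share one GPU in fixed time slices, and
# every switch pays a context-swap penalty (state copied on/off the GPU).
# All numbers are invented assumptions, not measurements.

jobs = 4
compute_needed_s = 10.0     # useful GPU compute each job needs
slice_s = 0.5               # length of each time slice
switch_cost_s = 0.1         # overhead paid at every context switch

slices_per_job = compute_needed_s / slice_s       # 20 turns per job
total_switches = jobs * slices_per_job            # one switch per turn
compute_total = jobs * compute_needed_s           # 40 s of useful work
overhead_total = total_switches * switch_cost_s   # 8 s lost to switching

wall_clock = compute_total + overhead_total       # 48 s end to end
print(f"Useful compute: {compute_total:.0f} s, switching: {overhead_total:.0f} s "
      f"({overhead_total / wall_clock:.0%} of wall clock)")
# With round-robin, every job also finishes near the 48 s mark instead of
# after its own 10 s of work: that is the latency cost of taking turns.
```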

Fractional GPU: Parallel Processing with Right-Sized Resources

MemVerge introduces a superior approach with Fractional GPU. Instead of rigid partitions or inefficient turn-taking, Fractional GPU enables multiple workloads to run in parallel, dynamically allocating the precise amount of memory each workload requires.

This “bin-packing” approach allows workloads to run concurrently for as long as needed, maximizing use of the physical GPU; a simple packing sketch follows the list below.

  • Run in Parallel: Multiple workloads execute simultaneously, dramatically increasing throughput.
  • Right-Sized Resources: Each workload receives exactly the memory it needs, eliminating the waste inherent in MIG’s fixed partitions and the over-allocation of time-slicing.
  • Eliminate Idle Time: By running workloads concurrently, Fractional GPU ensures that the hardware is always productive, maximizing utilization and delivering faster results for your AI and ML teams.
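
Here is a minimal sketch of the bin-packing idea, using first-fit decreasing by memory footprint. This is a generic packing heuristic for illustration, not MemVerge’s actual placement algorithm, and the workload sizes are assumed:

```python
# Illustrative first-fit-decreasing bin-packing of workloads onto GPUs by
# memory footprint. A generic packing sketch, not MemVerge's actual
# placement algorithm; workload sizes are assumptions.

GPU_MEMORY_GB = 40
workloads_gb = [4, 4, 4, 4, 12, 20, 8, 6]   # assumed per-job memory needs

gpus = []  # each entry = remaining free memory on that GPU
for need in sorted(workloads_gb, reverse=True):
    for i, free in enumerate(gpus):
        if free >= need:
            gpus[i] -= need                 # place on first GPU that fits
            break
    else:
        gpus.append(GPU_MEMORY_GB - need)   # open a new GPU

used = sum(workloads_gb)
provisioned = len(gpus) * GPU_MEMORY_GB
print(f"{len(workloads_gb)} jobs packed onto {len(gpus)} GPUs, "
      f"{used}/{provisioned} GB in use ({used / provisioned:.0%})")
# -> 8 jobs packed onto 2 GPUs, versus 8 GPUs at one job apiece.
```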

The Benefits of GPU Sharing Using Fractionalization

The following table compares MemVerge’s fractional GPU scheduling with NVIDIA MIG and time-slicing.

Feature              | NVIDIA MIG                               | Time-Slicing                               | MemVerge Fractional GPU
---------------------|------------------------------------------|--------------------------------------------|------------------------------------------
Execution Model      | Isolated static partitions               | Serial (one workload at a time)            | True parallel (concurrent workloads)
Workload Granularity | Coarse-grained (fixed hardware slices)   | None (entire GPU per turn)                 | Fine-grained (precise memory allocation)
GPU Utilization      | Low to medium                            | Low                                        | High
Primary Inefficiency | Stranded memory in oversized partitions  | High overhead from context-switching       | Minimal; workloads are bin-packed
Throughput           | Low; limited by partition size           | Low; hindered by latency and waiting       | High; maximized by parallel execution
Management           | Manual, static, requires server reboots  | Managed by scheduler, no resource control  | Fully dynamic and automatic

Quantified Impact

Let’s say a company operates a 20-GPU cluster of NVIDIA A100s for inference and training. Typical workloads:

  • 70% are light inference jobs using <15% of a GPU
  • 30% are moderate development jobs using 30–50%

Without fractionalization

  • Each job takes a full GPU
  • Only ~25% of each GPU is used
  • Cluster is maxed out with ~20 jobs at once

With fractionalization

  • Inference jobs are packed 4–7 per GPU
  • Development jobs share GPUs with light workloads
  • Effective concurrent job count increases to 50–70
  • Utilization jumps to 80%+

Cost Efficiency

  • A100 spot instance cost: ~$2.50/hr
  • 10 lightweight jobs on 10 GPUs = $25/hr
  • Same jobs on 2 GPUs with fractionalization = $5/hr
  • Savings: $20/hr, or ~$175,000/year for a mid-size team (reproduced in the sketch below)
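
The cost arithmetic above is easy to reproduce. The sketch below uses the same assumed $2.50/hr spot price; actual prices vary by provider and region:

```python
# Reproduces the cost arithmetic above. The $2.50/hr A100 spot price is the
# same assumption used in the text; real prices vary by provider and region.

SPOT_PRICE_HR = 2.50
HOURS_PER_YEAR = 24 * 365

dedicated_hr = 10 * SPOT_PRICE_HR    # 10 light jobs, one whole GPU each
fractional_hr = 2 * SPOT_PRICE_HR    # same 10 jobs packed onto 2 GPUs

savings_hr = dedicated_hr - fractional_hr
print(f"Dedicated: ${dedicated_hr:.2f}/hr, fractional: ${fractional_hr:.2f}/hr")
print(f"Savings: ${savings_hr:.2f}/hr, ${savings_hr * HOURS_PER_YEAR:,.0f}/year")
# -> Savings: $20.00/hr, $175,200/year (the ~$175,000 figure in the text)
```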

Application Scenarios

1. High-Volume Inference

Companies running thousands of inference requests per second (e.g., chatbots, recommendation engines, real-time analytics) can:

  • Consolidate requests across shared GPUs
  • Assign fractional slices to microservices
  • Isolate workloads while reducing idle time

Impact

  • Reduces GPU footprint by 60–80% with no SLA degradation.

2. Interactive Model Development

Data scientists running notebooks and training experiments often underuse GPU resources. Fractionalization enables:

  • Multiple notebooks to share a single GPU
  • Faster experiment cycles (no waiting for GPU assignment)
  • Better ROI on limited GPU inventory

Impact

  • Doubles developer productivity and reduces GPU starvation.

3. Hyperparameter Optimization

Tuning models with small batch sizes and fast iteration times leads to short-lived, low-utilization jobs. With fractionalization:

  • Dozens of tuning jobs run in parallel
  • Resources are fully consumed, not blocked
  • Jobs complete faster without scaling GPU count

Impact

  • Completes HPO runs 2–4× faster at 50% of the cost.

Technical Enablers

  • MIG Support: Native to NVIDIA A100/H100 via CUDA APIs
  • Kubernetes Plugins: Device plugins that expose GPU slices as schedulable resources (see the sketch below)
  • Schedulers (e.g., Volcano, KubeSlice): Handle fair sharing and resource guarantees
  • Runtime Isolation: Ensures security and stability for mixed-tenant jobs
  • Transparent Memory Isolation: Prevents leaks and overconsumption
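
A device plugin typically surfaces GPU slices as an extended resource that pods request like any other. The sketch below shows the shape of such a request; the resource name memverge.ai/gpu-memory is hypothetical and stands in for whatever name the deployed plugin advertises:

```python
# Shape of a Kubernetes Pod spec requesting a fractional GPU slice through
# an extended resource. The resource name "memverge.ai/gpu-memory" is
# hypothetical; the real name depends on the device plugin deployed.

import json

pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "light-inference"},
    "spec": {
        "containers": [{
            "name": "server",
            "image": "my-inference-image:latest",  # placeholder image
            "resources": {
                "limits": {
                    # Ask for ~8 GiB of GPU memory instead of a whole GPU
                    "memverge.ai/gpu-memory": "8Gi",
                },
            },
        }],
    },
}

print(json.dumps(pod_manifest, indent=2))  # kubectl accepts JSON manifests
```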

Future Outlook

As AI workloads diversify and GPU demand intensifies, fractionalization will become a default strategy for cost-effective scaling. Enterprises will:

  • Design job scheduling policies that default to shared GPUs
  • Use fractionalization-aware autoscalers to right-size clusters
  • Integrate usage-based billing to charge per fraction consumed

Ultimately, GPU fractionalization will be as fundamental as multi-core CPU sharing, unlocking massive scale without massive cost.

Conclusion

GPU fractionalization maximizes the value of every GPU: more jobs, higher throughput, and greater savings. By dynamically partitioning GPU resources to fit the size of each job, organizations can:

  • Boost utilization from 30% to 90%
  • Cut costs by up to 70%
  • Accelerate AI delivery without expanding infrastructure

In a world where every GPU minute matters, fractionalization is the key to affordable, scalable AI.

MemVerge.ai GPU Orchestration

[Screenshot: GPU Orchestration Dashboard]

[Screenshot: Creating Fractional GPUs]

Schedule a Demo