Overview
The UC Intel Final platform provides three families of neural network architectures, each designed for different use cases and computational constraints:
Custom CNN Build convolutional neural networks from scratch with configurable layer stacks
Transfer Learning Fine-tune pre-trained models (VGG, ResNet, EfficientNet) for faster convergence
Vision Transformer State-of-the-art transformer architecture with self-attention mechanisms
Base Model Interface
All models inherit from the BaseModel abstract class, ensuring consistent interfaces:
Location : app/models/base.py:11-71
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple
import torch.nn as nn

class BaseModel(ABC):
    """Abstract base class for model implementations"""

    def __init__(self, config: Dict[str, Any]):
        """
        Initialize model with configuration

        Args:
            config: Model configuration dictionary
        """
        self.config = config
        self.model = None

    @abstractmethod
    def build(self) -> nn.Module:
        """
        Build and return the model

        Returns:
            PyTorch model (nn.Module)
        """
        pass

    @abstractmethod
    def get_parameters_count(self) -> Tuple[int, int]:
        """
        Get total and trainable parameter counts

        Returns:
            Tuple of (total_params, trainable_params)
        """
        pass

    def get_model_summary(self) -> Dict[str, Any]:
        """Get model summary statistics"""
        if self.model is None:
            self.model = self.build()
        total_params, trainable_params = self.get_parameters_count()
        return {
            "total_parameters": total_params,
            "trainable_parameters": trainable_params,
            "model_type": self.config.get("model_type", "Unknown"),
            "architecture": self.config.get("architecture", "Unknown"),
            "num_classes": self.config.get("num_classes", 0),
        }
All models implement the same interface, making it easy to swap architectures during experimentation.
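For example, a minimal subclass only needs to implement build() and get_parameters_count(). The class below is a hypothetical illustration (it does not exist in the repository); the counting idiom with p.numel() is standard PyTorch:

import torch.nn as nn
from typing import Tuple

class TinyMLP(BaseModel):  # hypothetical example, not part of the codebase
    """Toy model illustrating the BaseModel contract."""

    def build(self) -> nn.Module:
        # Flatten a 28x28 single-channel image and classify it
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, self.config.get("num_classes", 10)),
        )
        return self.model

    def get_parameters_count(self) -> Tuple[int, int]:
        if self.model is None:
            self.model = self.build()
        total = sum(p.numel() for p in self.model.parameters())
        trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        return total, trainable

print(TinyMLP({"num_classes": 9}).get_model_summary())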
Custom CNN
Overview
The Custom CNN builder allows you to construct convolutional neural networks from a layer stack configuration. This provides maximum flexibility for architecture experimentation.
Location : app/models/pytorch/cnn_builder.py
Architecture
Supported Layer Types
Convolutional
Pooling
Normalization
Regularization
Transition
Dense
Conv2D - 2D convolutional layer
Parameters:
filters (int): Number of output channels (default: 32)
kernel_size (int): Kernel size (default: 3)
activation (str): Activation function - "relu", "leaky_relu", "gelu", "swish" (default: "relu")
padding (str): "same" or "valid" (default: "same")
Implementation (app/models/pytorch/cnn_builder.py:178-194):
def _build_conv2d(self, in_channels: int, params: dict) -> tuple:
    filters = params.get("filters", 32)
    kernel_size = params.get("kernel_size", 3)
    activation = params.get("activation", "relu")
    padding_mode = params.get("padding", "same")
    padding = kernel_size // 2 if padding_mode == "same" else 0
    layers = [
        nn.Conv2d(in_channels, filters,
                  kernel_size=kernel_size,
                  padding=padding),
        self._get_activation(activation)
    ]
    return nn.Sequential(*layers), filters
Output shape: (batch, filters, height, width)

MaxPooling2D - Max pooling layer
Parameters:
pool_size (int): Pooling window size (default: 2)
Implementation:
layer = nn.MaxPool2d(kernel_size=pool_size, stride=pool_size)
current_spatial = current_spatial // pool_size
Output shape: (batch, channels, height//pool_size, width//pool_size)

AveragePooling2D - Average pooling layer
Parameters:
pool_size (int): Pooling window size (default: 2)
Implementation:
layer = nn.AvgPool2d(kernel_size=pool_size, stride=pool_size)
current_spatial = current_spatial // pool_size
BatchNorm - Batch normalization layer
Normalizes activations across the batch dimension.
Implementation (app/models/pytorch/cnn_builder.py:133-135):
layer = nn.BatchNorm2d(current_channels)
self.feature_layers.append(layer)
Benefits:
Stabilizes training
Allows higher learning rates
Reduces internal covariate shift
Acts as regularization
Dropout - Dropout layer
Parameters:
rate (float): Dropout probability (default: 0.25)
Implementation (app/models/pytorch/cnn_builder.py:137-144):
rate = params.get("rate", 0.25)
if in_classifier:
    layer = nn.Dropout(rate)    # 1D dropout for FC layers
    self.classifier_layers.append(layer)
else:
    layer = nn.Dropout2d(rate)  # 2D dropout for conv layers
    self.feature_layers.append(layer)
Usage:
Use Dropout2d (spatial dropout) after convolutional layers
Use regular Dropout after dense layers
Flatten - Flatten spatial dimensions
Converts (batch, channels, height, width) → (batch, channels * height * width)
Implementation (app/models/pytorch/cnn_builder.py:146-148):
flatten_features = current_channels * current_spatial * current_spatial
in_classifier = True
Used in forward pass:
x = torch.flatten(x, 1)  # Flatten all dims except batch

GlobalAvgPool - Global average pooling
Converts (batch, channels, height, width) → (batch, channels)
Implementation (app/models/pytorch/cnn_builder.py:150-152):
flatten_features = current_channels
in_classifier = True
Used in forward pass:
x = torch.mean(x, dim=[2, 3])  # Average over spatial dims
Advantage: Reduces parameters compared to Flatten

Dense - Fully connected layer
Parameters:
units (int): Number of output units (default: 256)
activation (str): Activation function (default: "relu")
Implementation (app/models/pytorch/cnn_builder.py:154-167):
units = params.get("units", 256)
activation = params.get("activation", "relu")
layer = nn.Linear(flatten_features, units)
self.classifier_layers.append(layer)
self.classifier_layers.append(self._get_activation(activation))
flatten_features = units  # Update for next layer
Note: Must come after Flatten or GlobalAvgPool
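The full build loop is not reproduced in this section. Conceptually, the builder walks the layer list, dispatches on each entry's "type", and tracks the running channel count and spatial size; the sketch below illustrates that idea (only _build_conv2d is documented above, the rest of the method body is an assumption):

def _build_layers(self, layers_config: list, in_channels: int, spatial: int) -> None:
    # Illustrative sketch of the dispatch loop, not the actual cnn_builder.py code.
    current_channels, current_spatial = in_channels, spatial
    for spec in layers_config:
        layer_type = spec["type"]
        params = spec.get("params", {})
        if layer_type == "Conv2D":
            block, current_channels = self._build_conv2d(current_channels, params)
            self.feature_layers.append(block)
        elif layer_type == "MaxPooling2D":
            pool_size = params.get("pool_size", 2)
            self.feature_layers.append(nn.MaxPool2d(kernel_size=pool_size, stride=pool_size))
            current_spatial = current_spatial // pool_size
        # BatchNorm, Dropout, Flatten, GlobalAvgPool, and Dense are handled analogously,
        # using the snippets shown above for each layer type.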
Activation Functions
Location : app/models/pytorch/cnn_builder.py:75-81
ACTIVATION_MAP = {
    "relu": nn.ReLU(inplace=True),
    "leaky_relu": nn.LeakyReLU(0.1, inplace=True),
    "gelu": nn.GELU(),
    "swish": nn.SiLU(inplace=True),
    "none": nn.Identity(),
}
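The _get_activation helper referenced by the layer builders presumably resolves a name against this map; a minimal sketch (the fallback behaviour is an assumption):

def _get_activation(self, name: str) -> nn.Module:
    # Activations here are stateless, so reusing the mapped instance is safe;
    # unknown names fall back to ReLU.
    return ACTIVATION_MAP.get(name, ACTIVATION_MAP["relu"])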
ReLU
Formula: f(x) = max(0, x)
Pros:
Fast computation
Sparse activation
Widely used
Cons:
Dying ReLU: units with consistently negative inputs receive zero gradient and can stop learning
Leaky ReLU
Formula: f(x) = x if x > 0 else 0.1x
Pros:
Fixes dying ReLU
Allows gradient flow for negative inputs
Use when: Training deep networks

GELU
Formula: f(x) = x * Φ(x) (Gaussian Error Linear Unit)
Pros:
Smooth activation
Better for transformers
State-of-the-art results
Use when: Using transformer-style architectures

Swish (SiLU)
Formula: f(x) = x * sigmoid(x)
Pros:
Self-gated activation
Smooth and non-monotonic
Often outperforms ReLU
Use when: Need smooth gradients
Example Configuration
Simple CNN for MNIST-style data :
config = {
    "model_type": "Custom CNN",
    "num_classes": 9,
    "cnn_config": {
        "layers": [
            # Block 1
            {"type": "Conv2D", "params": {"filters": 32, "kernel_size": 3, "activation": "relu"}},
            {"type": "Conv2D", "params": {"filters": 32, "kernel_size": 3, "activation": "relu"}},
            {"type": "MaxPooling2D", "params": {"pool_size": 2}},
            {"type": "BatchNorm"},
            {"type": "Dropout", "params": {"rate": 0.25}},
            # Block 2
            {"type": "Conv2D", "params": {"filters": 64, "kernel_size": 3, "activation": "relu"}},
            {"type": "Conv2D", "params": {"filters": 64, "kernel_size": 3, "activation": "relu"}},
            {"type": "MaxPooling2D", "params": {"pool_size": 2}},
            {"type": "BatchNorm"},
            {"type": "Dropout", "params": {"rate": 0.25}},
            # Block 3
            {"type": "Conv2D", "params": {"filters": 128, "kernel_size": 3, "activation": "relu"}},
            {"type": "GlobalAvgPool"},
            # Classifier
            {"type": "Dense", "params": {"units": 256, "activation": "relu"}},
            {"type": "Dropout", "params": {"rate": 0.5}},
        ]
    }
}
Parameter Count : ~200K parameters
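Assuming the builder class in cnn_builder.py follows the BaseModel interface described earlier (the name CNNBuilder below is a guess, not the confirmed class name), the configuration is consumed like any other model:

# Hypothetical usage; the actual class name in cnn_builder.py may differ.
builder = CNNBuilder(config)
net = builder.build()                # returns an nn.Module
print(builder.get_model_summary())   # total/trainable parameters, model_type, num_classes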
Forward Pass
Location : app/models/pytorch/cnn_builder.py:205-232
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Forward pass

    Args:
        x: Input tensor of shape (batch, channels, height, width)

    Returns:
        Output logits of shape (batch, num_classes)
    """
    # Apply feature extraction layers
    for layer in self.feature_layers:
        x = layer(x)

    # Apply transition (flatten or global pool)
    if self.use_global_pool:
        x = torch.mean(x, dim=[2, 3])
    else:
        x = torch.flatten(x, 1)

    # Apply classifier layers
    for layer in self.classifier_layers:
        x = layer(x)

    # Output layer
    x = self.output_layer(x)
    return x
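A quick sanity check after building is to push a dummy batch through the network and confirm the logits shape; the 3-channel 224x224 input below is an assumption about the configured image size:

import torch

net.eval()                                # `net` is the nn.Module built above
with torch.no_grad():
    dummy = torch.randn(2, 3, 224, 224)   # (batch, channels, height, width)
    logits = net(dummy)
print(logits.shape)                       # expected: torch.Size([2, 9]) for num_classes = 9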
Transfer Learning
Overview
Transfer learning leverages models pre-trained on ImageNet (1.2M images, 1,000 classes) to accelerate training and improve performance on smaller datasets.
Location : app/models/pytorch/transfer.py
Supported Base Models
VGG
ResNet
InceptionV3
EfficientNet
VGG16 / VGG19
Architecture: Deep CNNs with small 3x3 filters
Characteristics:
16 or 19 layers
Simple, uniform architecture
Large number of parameters (~138M for VGG16)
Input size: 224x224
Feature dimensions: 512 (after global pooling)
Use when: Need simple, well-understood architecture
Implementation (app/models/pytorch/transfer.py:152-154):
"VGG16": lambda: models.vgg16(pretrained=use_pretrained),
"VGG19": lambda: models.vgg19(pretrained=use_pretrained),
ResNet50 / ResNet101
Architecture: Residual connections to enable very deep networks
Characteristics:
50 or 101 layers
Skip connections prevent vanishing gradients
Moderate parameter count (~25M for ResNet50)
Input size: 224x224
Feature dimensions: 2048
Use when: Need deeper network with good performance/cost ratio
Implementation (app/models/pytorch/transfer.py:155-156):
"ResNet50": lambda: models.resnet50(pretrained=use_pretrained),
"ResNet101": lambda: models.resnet101(pretrained=use_pretrained),
Residual block:
x --> Conv --> BN --> ReLU --> Conv --> BN --> (+) --> ReLU
|                                               |
+-----------------------------------------------+
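As a concrete illustration of the diagram (not torchvision's exact ResNet code), a basic residual block with an identity skip looks roughly like this:

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Illustrative residual block: two conv-BN stages plus an identity skip."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity   # skip connection lets gradients bypass the convolutions
        return self.relu(out)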
InceptionV3
Architecture: Multi-scale feature extraction with inception modules
Characteristics:
Parallel convolutions at multiple scales
Factorized convolutions
Efficient parameter usage (~24M params)
Input size: 299x299 (different from the others!)
Feature dimensions: 2048
Use when: Need multi-scale features
Implementation (app/models/pytorch/transfer.py:157-159):
"InceptionV3": lambda: models.inception_v3(
    pretrained=use_pretrained,
    aux_logits=False  # Disable auxiliary classifier
),
EfficientNetB0
Architecture: Compound scaling of depth, width, and resolution
Characteristics:
State-of-the-art efficiency
Mobile-friendly architecture
Few parameters (~5M for B0)
Input size: 224x224
Feature dimensions: 1280
Use when: Need efficient inference or limited compute
Implementation (app/models/pytorch/transfer.py:160):
"EfficientNetB0": lambda: models.efficientnet_b0(pretrained=use_pretrained),
Scaling strategy: Jointly scale depth, width, and resolution
Fine-Tuning Strategies
Location : app/models/pytorch/transfer.py:194-217
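The strategy code itself is not reproduced here. As a general sketch (not the exact contents of transfer.py:194-217), feature extraction freezes every base-model parameter, partial fine-tuning re-enables gradients for the last few blocks, and full fine-tuning leaves everything trainable:

import torch.nn as nn

def apply_fine_tuning_strategy(base_model: nn.Module, strategy: str, unfreeze_last: int = 2) -> None:
    # Sketch of typical freezing logic; names and defaults are illustrative.
    if strategy == "feature_extraction":
        for param in base_model.parameters():
            param.requires_grad = False            # train only the new classifier head
    elif strategy == "partial":
        for param in base_model.parameters():
            param.requires_grad = False
        for module in list(base_model.children())[-unfreeze_last:]:
            for param in module.parameters():
                param.requires_grad = True         # unfreeze the last N top-level blocks
    elif strategy == "full":
        for param in base_model.parameters():
            param.requires_grad = True             # fine-tune everything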
Custom Classifier Head
Location : app/models/pytorch/transfer.py:125-146
# Build custom classifier head
classifier_layers = []

if global_pooling:
    self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
else:
    self.global_pool = None

if add_dense:
    # Two-layer classifier
    classifier_layers.extend([
        nn.Linear(in_features, dense_units),
        nn.ReLU(inplace=True),
        nn.Dropout(dropout),
        nn.Linear(dense_units, num_classes)
    ])
else:
    # Single-layer classifier
    classifier_layers.extend([
        nn.Dropout(dropout),
        nn.Linear(in_features, num_classes)
    ])

self.classifier = nn.Sequential(*classifier_layers)
Options :
Global Pooling : Reduces spatial dimensions to 1x1
Extra Dense Layer : Adds capacity (useful for complex domains)
Dropout : Regularization (default: 0.5)
Forward Pass
Location : app/models/pytorch/transfer.py:219-243
def forward(self, x: torch.Tensor) -> torch.Tensor:
    # Extract features with frozen/unfrozen base model
    features = self.base_model(x)

    # Apply global pooling if needed
    if self.global_pool is not None and len(features.shape) == 4:
        features = self.global_pool(features)
        features = torch.flatten(features, 1)
    elif len(features.shape) == 4:
        features = torch.flatten(features, 1)

    # Apply custom classifier
    output = self.classifier(features)
    return output
Vision Transformer
Overview
Vision Transformer (ViT) applies the transformer architecture (originally designed for NLP) to image classification by treating images as sequences of patches.
Location : app/models/pytorch/transformer.py
Paper : “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020)
Architecture
Patch Embedding
Location : app/models/pytorch/transformer.py:72-114
Converts 2D image into sequence of patch embeddings:
class PatchEmbedding(nn.Module):
    def __init__(
        self,
        image_size: int = 224,
        patch_size: int = 16,   # 16x16 patches
        in_channels: int = 3,
        embed_dim: int = 768,
    ):
        super().__init__()
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2  # 196 for 224x224

        # Use convolution to extract and embed patches
        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size  # Non-overlapping patches
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, embed_dim, H/P, W/P)
        x = self.proj(x)
        # (B, embed_dim, H/P, W/P) -> (B, num_patches, embed_dim)
        x = x.flatten(2).transpose(1, 2)
        return x
Example :
Input: (1, 3, 224, 224)
After projection: (1, 768, 14, 14)
After flatten: (1, 196, 768)
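These shapes can be checked directly against the module above:

import torch

patch_embed = PatchEmbedding(image_size=224, patch_size=16, in_channels=3, embed_dim=768)
tokens = patch_embed(torch.randn(1, 3, 224, 224))
print(patch_embed.num_patches)   # 196
print(tokens.shape)              # torch.Size([1, 196, 768])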
Multi-Head Self-Attention
Location : app/models/pytorch/transformer.py:117-164
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5

        # Single linear layer to compute Q, K, V
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.attn_drop = nn.Dropout(dropout)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.proj_drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape

        # Generate Q, K, V
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # Each: (B, num_heads, N, head_dim)

        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        # Apply attention to values
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)

        # Output projection
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
Attention Mechanism :
Linear projection to Q, K, V
Split into multiple heads
Compute attention scores: Attention(Q, K, V) = softmax(QK^T / √d_k)V
Concatenate heads
Output projection
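Using the class above, attention preserves the sequence shape; for ViT-Base the sequence has 197 tokens (196 patches plus the CLS token):

import torch

attn = MultiHeadAttention(embed_dim=768, num_heads=12, dropout=0.0)
tokens = torch.randn(2, 197, 768)   # (batch, num_patches + 1, embed_dim)
out = attn(tokens)
print(out.shape)                    # torch.Size([2, 197, 768])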
Transformer Block
Location : app/models/pytorch/transformer.py:195-220
class TransformerBlock(nn.Module):
    def __init__(
        self,
        embed_dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = MLP(
            in_features=embed_dim,
            hidden_features=int(embed_dim * mlp_ratio),  # 3072 for 768-dim
            dropout=dropout
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention with residual (pre-norm)
        x = x + self.attn(self.norm1(x))
        # MLP with residual (pre-norm)
        x = x + self.mlp(self.norm2(x))
        return x
Structure : LayerNorm → Attention → Residual → LayerNorm → MLP → Residual
Configuration Options
ViT-Base Configuration :
Patch size: 16
Embed dim: 768
Depth: 12 blocks
Heads: 12
MLP ratio: 4.0
Parameters: ~86M
Use when: Standard accuracy/speed tradeoff
ViT-Large Configuration :
Patch size: 16
Embed dim: 1024
Depth: 24 blocks
Heads: 16
MLP ratio: 4.0
Parameters: ~307M
Use when: Maximum accuracy, large dataset
ViT-Small Configuration :
Patch size: 16
Embed dim: 384
Depth: 12 blocks
Heads: 6
MLP ratio: 4.0
Parameters: ~22M
Use when: Limited compute, faster inference
Custom
Configurable parameters:
Patch size (8, 16, 32)
Embed dimension
Number of blocks
Number of heads
MLP ratio
Dropout rate
Use when : Specific requirements
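The parameter counts quoted above can be sanity-checked from the configuration: each block holds roughly 4·d² weights for attention (QKV plus the output projection) and 2·mlp_ratio·d² for the MLP. A back-of-the-envelope estimate that ignores the patch/position embeddings, biases, and classification head:

def approx_vit_params(embed_dim: int, depth: int, mlp_ratio: float = 4.0) -> float:
    # Rough estimate in millions: (4 + 2*mlp_ratio) * d^2 weights per block.
    per_block = (4 + 2 * mlp_ratio) * embed_dim ** 2
    return depth * per_block / 1e6

print(approx_vit_params(768, 12))    # ~85M  (ViT-Base, quoted above as ~86M)
print(approx_vit_params(384, 12))    # ~21M  (ViT-Small, quoted as ~22M)
print(approx_vit_params(1024, 24))   # ~302M (ViT-Large, quoted as ~307M)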
Forward Pass
Location : app/models/pytorch/transformer.py:304-338
def forward(self, x: torch.Tensor) -> torch.Tensor:
    B = x.shape[0]

    # 1. Patch embedding
    x = self.patch_embed(x)  # (B, num_patches, embed_dim)

    # 2. Add CLS token
    cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, embed_dim)
    x = torch.cat((cls_tokens, x), dim=1)          # (B, num_patches + 1, embed_dim)

    # 3. Add position embeddings
    x = x + self.pos_embed
    x = self.pos_drop(x)

    # 4. Apply transformer blocks
    for block in self.blocks:
        x = block(x)

    # 5. Normalize
    x = self.norm(x)

    # 6. Extract CLS token and classify
    cls_output = x[:, 0]       # (B, embed_dim)
    x = self.head(cls_output)  # (B, num_classes)
    return x
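The forward pass assumes the constructor has created the patch embedder, CLS token, position embeddings, block stack, and head. A minimal sketch of that wiring, reusing the components shown above (the class name and default argument values are assumptions, not the repository's exact code):

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):   # hypothetical wrapper name
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768,
                 depth=12, num_heads=12, mlp_ratio=4.0, dropout=0.0, num_classes=9):
        super().__init__()
        self.patch_embed = PatchEmbedding(image_size, patch_size, in_channels, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        self.pos_drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList(
            TransformerBlock(embed_dim, num_heads, mlp_ratio, dropout) for _ in range(depth)
        )
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)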
Model Selection Guide
Dataset Size
Computational Budget
Domain Similarity
Inference Speed
Small (<1000 images/class) :
✅ Transfer Learning (Feature Extraction)
✅ Transfer Learning (Partial Fine-tuning)
⚠️ Custom CNN (risk of overfitting)
❌ Vision Transformer (requires large dataset)
Medium (1000-5000 images/class) :
✅ Transfer Learning (Partial/Full Fine-tuning)
✅ Custom CNN (with regularization)
⚠️ Vision Transformer (may underperform)
Large (>5000 images/class) :
✅ All architectures
✅ Vision Transformer (best performance)
✅ Transfer Learning (Full Fine-tuning)
✅ Custom CNN (deep architectures)
Low (CPU, <8GB RAM) :
✅ Transfer Learning (Feature Extraction, small models)
✅ Custom CNN (shallow, <1M params)
⚠️ EfficientNetB0
❌ Vision Transformer
❌ Large ResNets
Medium (GPU, 8-16GB VRAM) :
✅ Transfer Learning (all strategies)
✅ Custom CNN (deep)
✅ ViT-Small
⚠️ ViT-Base (small batch size)
High (GPU, >16GB VRAM) :
✅ All architectures
✅ Large batch sizes
✅ ViT-Large
Similar to ImageNet (natural images) :
✅ Transfer Learning (Feature Extraction)
Early layers capture generic features
Somewhat different (medical, satellite) :
✅ Transfer Learning (Partial Fine-tuning)
✅ Custom CNN
Adapt mid-to-high level features
Very different (grayscale, textures) :
✅ Transfer Learning (Full Fine-tuning)
✅ Custom CNN
✅ Vision Transformer (if enough data)
Need to learn domain-specific features
Real-time required (<50ms) :
✅ EfficientNetB0
✅ Custom CNN (shallow)
⚠️ ResNet50 (optimized)
❌ Vision Transformer
❌ Large models
Batch processing OK (>100ms) :
✅ All architectures
Optimize for accuracy over speed
Typical Results on Malware Dataset
Architecture | Parameters | Training Time | Accuracy | GPU Memory
--- | --- | --- | --- | ---
Custom CNN (Small) | ~200K | 1-2 hours | 85-88% | 2 GB
Custom CNN (Deep) | ~2M | 3-4 hours | 88-91% | 4 GB
ResNet50 (Feature Ext.) | ~25M | 1-2 hours | 90-93% | 4 GB
ResNet50 (Partial FT) | ~25M | 3-5 hours | 92-95% | 6 GB
ResNet50 (Full FT) | ~25M | 6-10 hours | 93-96% | 8 GB
EfficientNetB0 | ~5M | 2-4 hours | 91-94% | 3 GB
ViT-Small | ~22M | 8-12 hours | 90-93% | 8 GB
ViT-Base | ~86M | 12-24 hours | 94-97% | 16 GB
Results vary based on dataset size, quality, and training configuration. These are representative ranges.
References
Custom CNN implementation: app/models/pytorch/cnn_builder.py
Transfer learning implementation: app/models/pytorch/transfer.py
Vision Transformer implementation: app/models/pytorch/transformer.py
Base model interface: app/models/base.py
Model building in training worker: app/training/worker.py:29-42