Overview
The UC Intel Final platform provides three families of neural network architectures, each designed for different use cases and computational constraints:
Custom CNN Build convolutional neural networks from scratch with configurable layer stacks
Transfer Learning Fine-tune pre-trained models (VGG, ResNet, EfficientNet) for faster convergence
Vision Transformer State-of-the-art transformer architecture with self-attention mechanisms
Base Model Interface
All models inherit from the BaseModel abstract class, ensuring consistent interfaces:
Location : app/models/base.py:11-71
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple
import torch.nn as nn

class BaseModel(ABC):
    """Abstract base class for model implementations"""

    def __init__(self, config: Dict[str, Any]):
        """
        Initialize model with configuration

        Args:
            config: Model configuration dictionary
        """
        self.config = config
        self.model = None

    @abstractmethod
    def build(self) -> nn.Module:
        """
        Build and return the model

        Returns:
            PyTorch model (nn.Module)
        """
        pass

    @abstractmethod
    def get_parameters_count(self) -> Tuple[int, int]:
        """
        Get total and trainable parameter counts

        Returns:
            Tuple of (total_params, trainable_params)
        """
        pass

    def get_model_summary(self) -> Dict[str, Any]:
        """Get model summary statistics"""
        if self.model is None:
            self.model = self.build()
        total_params, trainable_params = self.get_parameters_count()
        return {
            "total_parameters": total_params,
            "trainable_parameters": trainable_params,
            "model_type": self.config.get("model_type", "Unknown"),
            "architecture": self.config.get("architecture", "Unknown"),
            "num_classes": self.config.get("num_classes", 0),
        }
All models implement the same interface, making it easy to swap architectures during experimentation.
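For example, a minimal subclass only needs to implement build() and get_parameters_count(). The class below is a hypothetical illustration (it does not exist in the repository); the counting idiom with p.numel() is standard PyTorch:

import torch.nn as nn
from typing import Tuple

class TinyMLP(BaseModel):  # hypothetical example, not part of the codebase
    """Toy model illustrating the BaseModel contract."""

    def build(self) -> nn.Module:
        # Flatten a 28x28 single-channel image and classify it
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, self.config.get("num_classes", 10)),
        )
        return self.model

    def get_parameters_count(self) -> Tuple[int, int]:
        if self.model is None:
            self.model = self.build()
        total = sum(p.numel() for p in self.model.parameters())
        trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
        return total, trainable

print(TinyMLP({"num_classes": 9}).get_model_summary())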
Custom CNN
Overview
The Custom CNN builder allows you to construct convolutional neural networks from a layer stack configuration. This provides maximum flexibility for architecture experimentation.
Location : app/models/pytorch/cnn_builder.py
Architecture
Supported Layer Types
Convolutional
Pooling
Normalization
Regularization
Transition
Dense
Conv2D - 2D convolutional layer
Parameters:
filters (int): Number of output channels (default: 32)
kernel_size (int): Kernel size (default: 3)
activation (str): Activation function - "relu", "leaky_relu", "gelu", "swish" (default: "relu")
padding (str): "same" or "valid" (default: "same")
Implementation (app/models/pytorch/cnn_builder.py:178-194):
def _build_conv2d(self, in_channels: int, params: dict) -> tuple:
    filters = params.get("filters", 32)
    kernel_size = params.get("kernel_size", 3)
    activation = params.get("activation", "relu")
    padding_mode = params.get("padding", "same")
    padding = kernel_size // 2 if padding_mode == "same" else 0
    layers = [
        nn.Conv2d(in_channels, filters,
                  kernel_size=kernel_size,
                  padding=padding),
        self._get_activation(activation)
    ]
    return nn.Sequential(*layers), filters
Output shape: (batch, filters, height, width)

MaxPooling2D - Max pooling layer
Parameters:
pool_size (int): Pooling window size (default: 2)
Implementation:
layer = nn.MaxPool2d(kernel_size=pool_size, stride=pool_size)
current_spatial = current_spatial // pool_size
Output shape: (batch, channels, height//pool_size, width//pool_size)

AveragePooling2D - Average pooling layer
Parameters:
pool_size (int): Pooling window size (default: 2)
Implementation:
layer = nn.AvgPool2d(kernel_size=pool_size, stride=pool_size)
current_spatial = current_spatial // pool_size
BatchNorm - Batch normalization layer
Normalizes activations across the batch dimension.
Implementation (app/models/pytorch/cnn_builder.py:133-135):
layer = nn.BatchNorm2d(current_channels)
self.feature_layers.append(layer)
Benefits:
Stabilizes training
Allows higher learning rates
Reduces internal covariate shift
Acts as regularization
Dropout - Dropout layer
Parameters:
rate (float): Dropout probability (default: 0.25)
Implementation (app/models/pytorch/cnn_builder.py:137-144):
rate = params.get("rate", 0.25)
if in_classifier:
    layer = nn.Dropout(rate)    # 1D dropout for FC layers
    self.classifier_layers.append(layer)
else:
    layer = nn.Dropout2d(rate)  # 2D dropout for conv layers
    self.feature_layers.append(layer)
Usage:
Use Dropout2d (spatial dropout) after convolutional layers
Use regular Dropout after dense layers
Flatten - Flatten spatial dimensions
Converts (batch, channels, height, width) → (batch, channels * height * width)
Implementation (app/models/pytorch/cnn_builder.py:146-148):
flatten_features = current_channels * current_spatial * current_spatial
in_classifier = True
Used in forward pass:
x = torch.flatten(x, 1)  # Flatten all dims except batch

GlobalAvgPool - Global average pooling
Converts (batch, channels, height, width) → (batch, channels)
Implementation (app/models/pytorch/cnn_builder.py:150-152):
flatten_features = current_channels
in_classifier = True
Used in forward pass:
x = torch.mean(x, dim=[2, 3])  # Average over spatial dims
Advantage: Reduces parameters compared to Flatten

Dense - Fully connected layer
Parameters:
units (int): Number of output units (default: 256)
activation (str): Activation function (default: "relu")
Implementation (app/models/pytorch/cnn_builder.py:154-167):
units = params.get("units", 256)
activation = params.get("activation", "relu")
layer = nn.Linear(flatten_features, units)
self.classifier_layers.append(layer)
self.classifier_layers.append(self._get_activation(activation))
flatten_features = units  # Update for next layer
Note: Must come after Flatten or GlobalAvgPool
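The full build loop is not reproduced in this section. Conceptually, the builder walks the layer list, dispatches on each entry's "type", and tracks the running channel count and spatial size; the sketch below illustrates that idea (only _build_conv2d is documented above, the rest of the method body is an assumption):

def _build_layers(self, layers_config: list, in_channels: int, spatial: int) -> None:
    # Illustrative sketch of the dispatch loop, not the actual cnn_builder.py code.
    current_channels, current_spatial = in_channels, spatial
    for spec in layers_config:
        layer_type = spec["type"]
        params = spec.get("params", {})
        if layer_type == "Conv2D":
            block, current_channels = self._build_conv2d(current_channels, params)
            self.feature_layers.append(block)
        elif layer_type == "MaxPooling2D":
            pool_size = params.get("pool_size", 2)
            self.feature_layers.append(nn.MaxPool2d(kernel_size=pool_size, stride=pool_size))
            current_spatial = current_spatial // pool_size
        # BatchNorm, Dropout, Flatten, GlobalAvgPool, and Dense are handled analogously,
        # using the snippets shown above for each layer type.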
Activation Functions
Location : app/models/pytorch/cnn_builder.py:75-81
ACTIVATION_MAP = {
    "relu": nn.ReLU(inplace=True),
    "leaky_relu": nn.LeakyReLU(0.1, inplace=True),
    "gelu": nn.GELU(),
    "swish": nn.SiLU(inplace=True),
    "none": nn.Identity(),
}
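The _get_activation helper referenced by the layer builders presumably resolves a name against this map; a minimal sketch (the fallback behaviour is an assumption):

def _get_activation(self, name: str) -> nn.Module:
    # Activations here are stateless, so reusing the mapped instance is safe;
    # unknown names fall back to ReLU.
    return ACTIVATION_MAP.get(name, ACTIVATION_MAP["relu"])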
ReLU
Formula: f(x) = max(0, x)
Pros:
Fast computation
Sparse activation
Widely used
Cons:
Dying ReLU: units with consistently negative inputs receive zero gradient and can stop learning
Leaky ReLU
Formula: f(x) = x if x > 0 else 0.1x
Pros:
Fixes dying ReLU
Allows gradient flow for negative inputs
Use when: Training deep networks

GELU
Formula: f(x) = x * Φ(x) (Gaussian Error Linear Unit)
Pros:
Smooth activation
Better for transformers
State-of-the-art results
Use when: Using transformer-style architectures

Swish (SiLU)
Formula: f(x) = x * sigmoid(x)
Pros:
Self-gated activation
Smooth and non-monotonic
Often outperforms ReLU
Use when: Need smooth gradients
Example Configuration
Simple CNN for MNIST-style data :
config = {
    "model_type": "Custom CNN",
    "num_classes": 9,
    "cnn_config": {
        "layers": [
            # Block 1
            {"type": "Conv2D", "params": {"filters": 32, "kernel_size": 3, "activation": "relu"}},
            {"type": "Conv2D", "params": {"filters": 32, "kernel_size": 3, "activation": "relu"}},
            {"type": "MaxPooling2D", "params": {"pool_size": 2}},
            {"type": "BatchNorm"},
            {"type": "Dropout", "params": {"rate": 0.25}},
            # Block 2
            {"type": "Conv2D", "params": {"filters": 64, "kernel_size": 3, "activation": "relu"}},
            {"type": "Conv2D", "params": {"filters": 64, "kernel_size": 3, "activation": "relu"}},
            {"type": "MaxPooling2D", "params": {"pool_size": 2}},
            {"type": "BatchNorm"},
            {"type": "Dropout", "params": {"rate": 0.25}},
            # Block 3
            {"type": "Conv2D", "params": {"filters": 128, "kernel_size": 3, "activation": "relu"}},
            {"type": "GlobalAvgPool"},
            # Classifier
            {"type": "Dense", "params": {"units": 256, "activation": "relu"}},
            {"type": "Dropout", "params": {"rate": 0.5}},
        ]
    }
}
Parameter Count : ~200K parameters
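Assuming the builder class in cnn_builder.py follows the BaseModel interface described earlier (the name CNNBuilder below is a guess, not the confirmed class name), the configuration is consumed like any other model:

# Hypothetical usage; the actual class name in cnn_builder.py may differ.
builder = CNNBuilder(config)
net = builder.build()                # returns an nn.Module
print(builder.get_model_summary())   # total/trainable parameters, model_type, num_classes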
Forward Pass
Location : app/models/pytorch/cnn_builder.py:205-232
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Forward pass

    Args:
        x: Input tensor of shape (batch, channels, height, width)

    Returns:
        Output logits of shape (batch, num_classes)
    """
    # Apply feature extraction layers
    for layer in self.feature_layers:
        x = layer(x)

    # Apply transition (flatten or global pool)
    if self.use_global_pool:
        x = torch.mean(x, dim=[2, 3])
    else:
        x = torch.flatten(x, 1)

    # Apply classifier layers
    for layer in self.classifier_layers:
        x = layer(x)

    # Output layer
    x = self.output_layer(x)
    return x
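A quick sanity check after building is to push a dummy batch through the network and confirm the logits shape; the 3-channel 224x224 input below is an assumption about the configured image size:

import torch

net.eval()                                # `net` is the nn.Module built above
with torch.no_grad():
    dummy = torch.randn(2, 3, 224, 224)   # (batch, channels, height, width)
    logits = net(dummy)
print(logits.shape)                       # expected: torch.Size([2, 9]) for num_classes = 9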
Transfer Learning
Overview
Transfer learning leverages models pre-trained on ImageNet (1.2M images, 1,000 classes) to accelerate training and improve performance on smaller datasets.
Location : app/models/pytorch/transfer.py
Supported Base Models
VGG
ResNet
InceptionV3
EfficientNet
VGG16 / VGG19
Architecture: Deep CNNs with small 3x3 filters
Characteristics:
16 or 19 layers
Simple, uniform architecture
Large number of parameters (~138M for VGG16)
Input size: 224x224
Feature dimensions: 512 (after global pooling)
Use when: Need simple, well-understood architecture
Implementation (app/models/pytorch/transfer.py:152-154):
"VGG16": lambda: models.vgg16(pretrained=use_pretrained),
"VGG19": lambda: models.vgg19(pretrained=use_pretrained),
ResNet50 / ResNet101
Architecture: Residual connections to enable very deep networks
Characteristics:
50 or 101 layers
Skip connections prevent vanishing gradients
Moderate parameter count (~25M for ResNet50)
Input size: 224x224
Feature dimensions: 2048
Use when: Need deeper network with good performance/cost ratio
Implementation (app/models/pytorch/transfer.py:155-156):
"ResNet50": lambda: models.resnet50(pretrained=use_pretrained),
"ResNet101": lambda: models.resnet101(pretrained=use_pretrained),
Residual block:
x --> Conv --> BN --> ReLU --> Conv --> BN --> (+) --> ReLU
|                                               |
+-----------------------------------------------+
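As a concrete illustration of the diagram (not torchvision's exact ResNet code), a basic residual block with an identity skip looks roughly like this:

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Illustrative residual block: two conv-BN stages plus an identity skip."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity   # skip connection lets gradients bypass the convolutions
        return self.relu(out)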
InceptionV3
Architecture: Multi-scale feature extraction with inception modules
Characteristics:
Parallel convolutions at multiple scales
Factorized convolutions
Efficient parameter usage (~24M params)
Input size: 299x299 (different from the others!)
Feature dimensions: 2048
Use when: Need multi-scale features
Implementation (app/models/pytorch/transfer.py:157-159):
"InceptionV3": lambda: models.inception_v3(
    pretrained=use_pretrained,
    aux_logits=False  # Disable auxiliary classifier
),
EfficientNetB0
Architecture: Compound scaling of depth, width, and resolution
Characteristics:
State-of-the-art efficiency
Mobile-friendly architecture
Few parameters (~5M for B0)
Input size: 224x224
Feature dimensions: 1280
Use when: Need efficient inference or limited compute
Implementation (app/models/pytorch/transfer.py:160):
"EfficientNetB0": lambda: models.efficientnet_b0(pretrained=use_pretrained),
Scaling strategy: Jointly scale depth, width, and resolution
Fine-Tuning Strategies
Location : app/models/pytorch/transfer.py:194-217
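The strategy code itself is not reproduced here. As a general sketch (not the exact contents of transfer.py:194-217), feature extraction freezes every base-model parameter, partial fine-tuning re-enables gradients for the last few blocks, and full fine-tuning leaves everything trainable:

import torch.nn as nn

def apply_fine_tuning_strategy(base_model: nn.Module, strategy: str, unfreeze_last: int = 2) -> None:
    # Sketch of typical freezing logic; names and defaults are illustrative.
    if strategy == "feature_extraction":
        for param in base_model.parameters():
            param.requires_grad = False            # train only the new classifier head
    elif strategy == "partial":
        for param in base_model.parameters():
            param.requires_grad = False
        for module in list(base_model.children())[-unfreeze_last:]:
            for param in module.parameters():
                param.requires_grad = True         # unfreeze the last N top-level blocks
    elif strategy == "full":
        for param in base_model.parameters():
            param.requires_grad = True             # fine-tune everything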
Custom Classifier Head
Location : app/models/pytorch/transfer.py:125-146
# Build custom classifier head
classifier_layers = []

if global_pooling:
    self.global_pool = nn.AdaptiveAvgPool2d((1, 1))
else:
    self.global_pool = None

if add_dense:
    # Two-layer classifier
    classifier_layers.extend([
        nn.Linear(in_features, dense_units),
        nn.ReLU(inplace=True),
        nn.Dropout(dropout),
        nn.Linear(dense_units, num_classes)
    ])
else:
    # Single-layer classifier
    classifier_layers.extend([
        nn.Dropout(dropout),
        nn.Linear(in_features, num_classes)
    ])

self.classifier = nn.Sequential(*classifier_layers)
Options :
Global Pooling : Reduces spatial dimensions to 1x1
Extra Dense Layer : Adds capacity (useful for complex domains)
Dropout : Regularization (default: 0.5)
Forward Pass
Location : app/models/pytorch/transfer.py:219-243
def forward(self, x: torch.Tensor) -> torch.Tensor:
    # Extract features with frozen/unfrozen base model
    features = self.base_model(x)

    # Apply global pooling if needed
    if self.global_pool is not None and len(features.shape) == 4:
        features = self.global_pool(features)
        features = torch.flatten(features, 1)
    elif len(features.shape) == 4:
        features = torch.flatten(features, 1)

    # Apply custom classifier
    output = self.classifier(features)
    return output
Vision Transformer
Overview
Vision Transformer (ViT) applies the transformer architecture (originally designed for NLP) to image classification by treating images as sequences of patches.
Location : app/models/pytorch/transformer.py
Paper : “An Image is Worth 16x16 Words” (Dosovitskiy et al., 2020)
Architecture
Patch Embedding
Location : app/models/pytorch/transformer.py:72-114
Converts 2D image into sequence of patch embeddings:
class PatchEmbedding(nn.Module):
    def __init__(
        self,
        image_size: int = 224,
        patch_size: int = 16,   # 16x16 patches
        in_channels: int = 3,
        embed_dim: int = 768,
    ):
        super().__init__()
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_patches = (image_size // patch_size) ** 2  # 196 for 224x224

        # Use convolution to extract and embed patches
        self.proj = nn.Conv2d(
            in_channels,
            embed_dim,
            kernel_size=patch_size,
            stride=patch_size  # Non-overlapping patches
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, embed_dim, H/P, W/P)
        x = self.proj(x)
        # (B, embed_dim, H/P, W/P) -> (B, num_patches, embed_dim)
        x = x.flatten(2).transpose(1, 2)
        return x
Example :
Input: (1, 3, 224, 224)
After projection: (1, 768, 14, 14)
After flatten: (1, 196, 768)
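These shapes can be checked directly against the module above:

import torch

patch_embed = PatchEmbedding(image_size=224, patch_size=16, in_channels=3, embed_dim=768)
tokens = patch_embed(torch.randn(1, 3, 224, 224))
print(patch_embed.num_patches)   # 196
print(tokens.shape)              # torch.Size([1, 196, 768])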
Multi-Head Self-Attention
Location : app/models/pytorch/transformer.py:117-164
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5

        # Single linear layer to compute Q, K, V
        self.qkv = nn.Linear(embed_dim, embed_dim * 3)
        self.attn_drop = nn.Dropout(dropout)
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.proj_drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape

        # Generate Q, K, V
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # Each: (B, num_heads, N, head_dim)

        # Scaled dot-product attention
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        # Apply attention to values
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)

        # Output projection
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
Attention Mechanism :
Linear projection to Q, K, V
Split into multiple heads
Compute attention scores: Attention(Q, K, V) = softmax(QK^T / √d_k)V
Concatenate heads
Output projection
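Using the class above, attention preserves the sequence shape; for ViT-Base the sequence has 197 tokens (196 patches plus the CLS token):

import torch

attn = MultiHeadAttention(embed_dim=768, num_heads=12, dropout=0.0)
tokens = torch.randn(2, 197, 768)   # (batch, num_patches + 1, embed_dim)
out = attn(tokens)
print(out.shape)                    # torch.Size([2, 197, 768])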
Transformer Block
Location : app/models/pytorch/transformer.py:195-220
class TransformerBlock(nn.Module):
    def __init__(
        self,
        embed_dim: int,
        num_heads: int,
        mlp_ratio: float = 4.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = MLP(
            in_features=embed_dim,
            hidden_features=int(embed_dim * mlp_ratio),  # 3072 for 768-dim
            dropout=dropout
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention with residual (pre-norm)
        x = x + self.attn(self.norm1(x))
        # MLP with residual (pre-norm)
        x = x + self.mlp(self.norm2(x))
        return x
Structure : LayerNorm → Attention → Residual → LayerNorm → MLP → Residual
Configuration Options
ViT-Base Configuration :
Patch size: 16
Embed dim: 768
Depth: 12 blocks
Heads: 12
MLP ratio: 4.0
Parameters: ~86M
Use when: Standard accuracy/speed tradeoff
ViT-Large Configuration :
Patch size: 16
Embed dim: 1024
Depth: 24 blocks
Heads: 16
MLP ratio: 4.0
Parameters: ~307M
Use when: Maximum accuracy, large dataset
ViT-Small Configuration :
Patch size: 16
Embed dim: 384
Depth: 12 blocks
Heads: 6
MLP ratio: 4.0
Parameters: ~22M
Use when: Limited compute, faster inference
Custom
Configurable parameters:
Patch size (8, 16, 32)
Embed dimension
Number of blocks
Number of heads
MLP ratio
Dropout rate
Use when : Specific requirements
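The parameter counts quoted above can be sanity-checked from the configuration: each block holds roughly 4·d² weights for attention (QKV plus the output projection) and 2·mlp_ratio·d² for the MLP. A back-of-the-envelope estimate that ignores the patch/position embeddings, biases, and classification head:

def approx_vit_params(embed_dim: int, depth: int, mlp_ratio: float = 4.0) -> float:
    # Rough estimate in millions: (4 + 2*mlp_ratio) * d^2 weights per block.
    per_block = (4 + 2 * mlp_ratio) * embed_dim ** 2
    return depth * per_block / 1e6

print(approx_vit_params(768, 12))    # ~85M  (ViT-Base, quoted above as ~86M)
print(approx_vit_params(384, 12))    # ~21M  (ViT-Small, quoted as ~22M)
print(approx_vit_params(1024, 24))   # ~302M (ViT-Large, quoted as ~307M)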
Forward Pass
Location : app/models/pytorch/transformer.py:304-338
def forward(self, x: torch.Tensor) -> torch.Tensor:
    B = x.shape[0]

    # 1. Patch embedding
    x = self.patch_embed(x)  # (B, num_patches, embed_dim)

    # 2. Add CLS token
    cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, embed_dim)
    x = torch.cat((cls_tokens, x), dim=1)          # (B, num_patches + 1, embed_dim)

    # 3. Add position embeddings
    x = x + self.pos_embed
    x = self.pos_drop(x)

    # 4. Apply transformer blocks
    for block in self.blocks:
        x = block(x)

    # 5. Normalize
    x = self.norm(x)

    # 6. Extract CLS token and classify
    cls_output = x[:, 0]       # (B, embed_dim)
    x = self.head(cls_output)  # (B, num_classes)
    return x
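The forward pass assumes the constructor has created the patch embedder, CLS token, position embeddings, block stack, and head. A minimal sketch of that wiring, reusing the components shown above (the class name and default argument values are assumptions, not the repository's exact code):

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):   # hypothetical wrapper name
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768,
                 depth=12, num_heads=12, mlp_ratio=4.0, dropout=0.0, num_classes=9):
        super().__init__()
        self.patch_embed = PatchEmbedding(image_size, patch_size, in_channels, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        self.pos_drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList(
            TransformerBlock(embed_dim, num_heads, mlp_ratio, dropout) for _ in range(depth)
        )
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)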
Model Selection Guide
Dataset Size
Computational Budget
Domain Similarity
Inference Speed
Small (<1000 images/class) :
✅ Transfer Learning (Feature Extraction)
✅ Transfer Learning (Partial Fine-tuning)
⚠️ Custom CNN (risk of overfitting)
❌ Vision Transformer (requires large dataset)
Medium (1000-5000 images/class) :
✅ Transfer Learning (Partial/Full Fine-tuning)
✅ Custom CNN (with regularization)
⚠️ Vision Transformer (may underperform)
Large (>5000 images/class) :
✅ All architectures
✅ Vision Transformer (best performance)
✅ Transfer Learning (Full Fine-tuning)
✅ Custom CNN (deep architectures)
Low (CPU, <8GB RAM) :
✅ Transfer Learning (Feature Extraction, small models)
✅ Custom CNN (shallow, <1M params)
⚠️ EfficientNetB0
❌ Vision Transformer
❌ Large ResNets
Medium (GPU, 8-16GB VRAM) :
✅ Transfer Learning (all strategies)
✅ Custom CNN (deep)
✅ ViT-Small
⚠️ ViT-Base (small batch size)
High (GPU, >16GB VRAM) :
✅ All architectures
✅ Large batch sizes
✅ ViT-Large
Similar to ImageNet (natural images) :
✅ Transfer Learning (Feature Extraction)
Early layers capture generic features
Somewhat different (medical, satellite) :
✅ Transfer Learning (Partial Fine-tuning)
✅ Custom CNN
Adapt mid-to-high level features
Very different (grayscale, textures) :
✅ Transfer Learning (Full Fine-tuning)
✅ Custom CNN
✅ Vision Transformer (if enough data)
Need to learn domain-specific features
Real-time required (<50ms) :
✅ EfficientNetB0
✅ Custom CNN (shallow)
⚠️ ResNet50 (optimized)
❌ Vision Transformer
❌ Large models
Batch processing OK (>100ms) :
✅ All architectures
Optimize for accuracy over speed
Typical Results on Malware Dataset
Architecture | Parameters | Training Time | Accuracy | GPU Memory
--- | --- | --- | --- | ---
Custom CNN (Small) | ~200K | 1-2 hours | 85-88% | 2 GB
Custom CNN (Deep) | ~2M | 3-4 hours | 88-91% | 4 GB
ResNet50 (Feature Ext.) | ~25M | 1-2 hours | 90-93% | 4 GB
ResNet50 (Partial FT) | ~25M | 3-5 hours | 92-95% | 6 GB
ResNet50 (Full FT) | ~25M | 6-10 hours | 93-96% | 8 GB
EfficientNetB0 | ~5M | 2-4 hours | 91-94% | 3 GB
ViT-Small | ~22M | 8-12 hours | 90-93% | 8 GB
ViT-Base | ~86M | 12-24 hours | 94-97% | 16 GB
Results vary based on dataset size, quality, and training configuration. These are representative ranges.
References
Custom CNN implementation: app/models/pytorch/cnn_builder.py
Transfer learning implementation: app/models/pytorch/transfer.py
Vision Transformer implementation: app/models/pytorch/transformer.py
Base model interface: app/models/base.py
Model building in training worker: app/training/worker.py:29-42