# Dataset Preparation

> Learn how to prepare and augment datasets for training malware classification models

## Overview

The UC Intel Final platform prepares malware image datasets for classification. The dataset preparation system scans a dataset directory, splits it into train/val/test sets, applies augmentation, and loads malware binary visualizations into PyTorch DataLoaders.

## Dataset Structure

The platform expects datasets organized in a directory structure where each subdirectory represents a malware family:

```
dataset/
├── Family1/
│   ├── sample1.png
│   ├── sample2.png
│   └── ...
├── Family2/
│   ├── sample1.png
│   └── ...
└── Family3/
    └── ...
```

Supported image formats: `.png`, `.jpg`, `.jpeg`, `.bmp`

## Scanning Datasets

The `scan_dataset()` function automatically discovers images and labels from your dataset directory.

**Source:** `app/training/dataset.py:37-60`

```python theme={null}
from pathlib import Path

def scan_dataset(
    dataset_path: Path, selected_families: list[str] | None = None
) -> tuple[list[Path], list[int], list[str]]:
    """Scan dataset directory and return image paths, labels, and class names."""
    image_paths = []
    labels = []
    class_names = []

    # Get all family directories
    family_dirs = sorted([d for d in dataset_path.iterdir() if d.is_dir()])

    # Filter if selected_families specified
    if selected_families:
        family_dirs = [d for d in family_dirs if d.name in selected_families]

    for class_idx, family_dir in enumerate(family_dirs):
        class_names.append(family_dir.name)
        # Get all images in this family
        for img_file in family_dir.iterdir():
            if img_file.suffix.lower() in [".png", ".jpg", ".jpeg", ".bmp"]:
                image_paths.append(img_file)
                labels.append(class_idx)

    return image_paths, labels, class_names
```

<Note>
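  The function returns three values:

  * `image_paths`: List of `Path` objects, one per image
  * `labels`: Integer class index (`0` to `num_classes - 1`) for each image, assigned in sorted family-directory order
  * `class_names`: Sorted list of malware family names; index `i` is the name for label `i`

  When `selected_families` is given, indices are assigned over the filtered families only, so the same family can map to a different label than in a full scan.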
</Note>
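
Before scanning, it can help to confirm that a directory matches the expected layout. The sketch below is not part of the platform: `count_images_per_family` is a hypothetical helper that applies the same extension filter as `scan_dataset()` and reports how many images each family folder contains. The throwaway folders and file names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

# Same extensions that scan_dataset() accepts.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images_per_family(dataset_path: Path) -> dict[str, int]:
    """Hypothetical helper: map each family folder to its image count."""
    counts = {}
    for family_dir in sorted(d for d in dataset_path.iterdir() if d.is_dir()):
        counts[family_dir.name] = sum(
            1 for f in family_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


# Build a throwaway dataset that follows the expected layout.
demo_files = {"Family1": ["a.png", "b.JPG"], "Family2": ["c.bmp", "notes.txt"]}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for family, files in demo_files.items():
        (root / family).mkdir()
        for name in files:
            (root / family / name).touch()
    demo_counts = count_images_per_family(root)

print(demo_counts)  # {'Family1': 2, 'Family2': 1}
```

The upper-case `.JPG` file is counted because the suffix is lower-cased before comparison, and `notes.txt` is ignored.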

## Creating Train/Val/Test Splits

The platform uses stratified splitting to maintain class distribution across splits.

**Source:** `app/training/dataset.py:63-96`

### Split Function

```python theme={null}
from pathlib import Path

from sklearn.model_selection import train_test_split

def create_splits(
    image_paths: list[Path],
    labels: list[int],
    train_ratio: float = 0.7,
    val_ratio: float = 0.15,
    test_ratio: float = 0.15,
    stratified: bool = True,
    random_seed: int = 72,
) -> dict:
    """Create train/val/test splits."""
    # First split: train vs (val+test)
    train_paths, temp_paths, train_labels, temp_labels = train_test_split(
        image_paths,
        labels,
        test_size=(val_ratio + test_ratio),
        random_state=random_seed,
        stratify=labels if stratified else None,
    )

    # Second split: val vs test
    val_test_ratio = test_ratio / (val_ratio + test_ratio)
    val_paths, test_paths, val_labels, test_labels = train_test_split(
        temp_paths,
        temp_labels,
        test_size=val_test_ratio,
        random_state=random_seed,
        stratify=temp_labels if stratified else None,
    )

    return {
        "train": {"paths": train_paths, "labels": train_labels},
        "val": {"paths": val_paths, "labels": val_labels},
        "test": {"paths": test_paths, "labels": test_labels},
    }
```

<Steps>
  <Step title="First Split">
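    Hold out `val_ratio + test_ratio` of the data for validation and testing; the rest becomes the training set. `train_ratio` is not passed to `train_test_split`, so the three ratios should sum to 1.
  </Step>

  <Step title="Second Split">
    Split the held-out data into validation and test sets, using `test_ratio / (val_ratio + test_ratio)` as the test fraction
  </Step>

  <Step title="Stratification">
    When `stratified=True`, each split keeps approximately the same class proportions as the full dataset. scikit-learn requires at least two samples per class in the data being split and raises a `ValueError` otherwise.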
  </Step>
</Steps>
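
The two `test_size` values used inside `create_splits()` can be checked by hand. The sketch below uses a hypothetical helper, `two_stage_fractions`, to reproduce that arithmetic. It does not call scikit-learn, and actual split sizes can differ by a sample because of rounding.

```python
def two_stage_fractions(val_ratio: float, test_ratio: float) -> tuple[float, float]:
    """Hypothetical helper: the test_size passed to each train_test_split call."""
    # First split: fraction of the full dataset held out for val + test.
    first = val_ratio + test_ratio
    # Second split: fraction of the held-out data that becomes the test set.
    second = test_ratio / (val_ratio + test_ratio)
    return first, second


first, second = two_stage_fractions(val_ratio=0.15, test_ratio=0.15)
print(first, second)  # 0.3 0.5

# With unequal val/test ratios, the second fraction is no longer one half.
first_uneven, second_uneven = two_stage_fractions(val_ratio=0.2, test_ratio=0.1)
print(round(first_uneven, 3), round(second_uneven, 3))
```

With the default 70/15/15 split, 30% of the data is held out and then divided evenly between validation and test.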

### Configuration Example

```python theme={null}
split_config = {
    "train": 70,        # 70% for training
    "val": 15,          # 15% for validation
    "test": 15,         # 15% for testing
    "stratified": True, # Maintain class distribution
    "random_seed": 72   # For reproducibility
}
```
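
The configuration expresses splits as percentages, while `create_splits()` takes fractions. How the platform converts between the two is not shown on this page. A minimal sketch of one way to do it, using a hypothetical `split_config_to_kwargs` helper that also rejects percentages that do not sum to 100:

```python
def split_config_to_kwargs(split_config: dict) -> dict:
    """Hypothetical helper: percentage config -> create_splits() keyword arguments."""
    total = split_config["train"] + split_config["val"] + split_config["test"]
    if total != 100:
        raise ValueError(f"Split percentages must sum to 100, got {total}")
    return {
        "train_ratio": split_config["train"] / 100,
        "val_ratio": split_config["val"] / 100,
        "test_ratio": split_config["test"] / 100,
        # Fall back to the defaults in the create_splits() signature.
        "stratified": split_config.get("stratified", True),
        "random_seed": split_config.get("random_seed", 72),
    }


kwargs = split_config_to_kwargs(
    {"train": 70, "val": 15, "test": 15, "stratified": True, "random_seed": 72}
)
print(kwargs["train_ratio"], kwargs["val_ratio"], kwargs["test_ratio"])  # 0.7 0.15 0.15
```

Validating the sum up front matters because `create_splits()` derives the training share from `val_ratio + test_ratio` and would silently ignore an inconsistent `train_ratio`.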

## Data Augmentation

Augmentation helps prevent overfitting and improves model generalization. The platform supports preset and custom augmentation strategies.

**Source:** `app/training/transforms.py:6-90`

### Augmentation Presets

<Accordion title="Light Augmentation">
  * Random horizontal flip (50% probability)
  * Random 90-degree rotations (0°, 90°, 180°, 270°)
  * Suitable for datasets with moderate diversity
</Accordion>

<Accordion title="Moderate Augmentation">
  * Random horizontal flip (50%)
  * Random vertical flip (50%)
  * Random 90-degree rotations
  * Color jitter: brightness ±10%, contrast ±10%
  * Balanced approach for most use cases
</Accordion>

<Accordion title="Heavy Augmentation">
  * All moderate augmentations
  * Increased color jitter: brightness ±20%, contrast ±20%
  * Gaussian blur (kernel=3, sigma=0.1-0.5)
  * Use with small or highly imbalanced datasets
</Accordion>

### Custom Augmentation

```python theme={null}
augmentation_config = {
    "preset": "Custom",
    "custom": {
        "horizontal_flip": True,
        "vertical_flip": True,
        "rotation": True,
        "rotation_angles": [90, 180, 270],
        "brightness_range": 20,  # ±20%
        "contrast_range": 20,    # ±20%
        "gaussian_noise": True
    }
}
```

### Transform Pipeline

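The `create_train_transforms()` function builds the training transform pipeline. The excerpt below is a condensed outline: it shows only the Moderate preset and elides the rotation list.

```python theme={null}
from torchvision import transforms


def create_train_transforms(config: dict) -> transforms.Compose:
    # Assumption: config has the same layout as the dataset_config
    # passed to create_dataloaders() (see "Creating DataLoaders").
    preprocessing = config["preprocessing"]
    target_size = preprocessing["target_size"]
    color_mode = preprocessing["color_mode"]
    normalization = preprocessing["normalization"]
    preset = config["augmentation"]["preset"]

    transform_list = []

    # 1. Resize to target size
    transform_list.append(transforms.Resize(target_size))

    # 2. Color mode conversion (if needed)
    if color_mode == "Grayscale":
        transform_list.append(transforms.Grayscale(num_output_channels=3))

    # 3. Augmentation transforms (based on preset/custom)
    if preset == "Moderate":
        transform_list.extend([
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomVerticalFlip(p=0.5),
            transforms.RandomChoice([...]),  # Rotations (list elided in this excerpt)
            transforms.ColorJitter(brightness=0.1, contrast=0.1),
        ])

    # 4. Convert to tensor
    transform_list.append(transforms.ToTensor())

    # 5. Normalization
    if normalization == "ImageNet Mean/Std":
        transform_list.append(
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]
            )
        )

    return transforms.Compose(transform_list)
```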

<Warning>
  Validation and test sets should not be augmented. Use `create_val_transforms()`, which applies only resizing and normalization.
</Warning>
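
The normalization step computes `(x - mean) / std` per channel, on values that `ToTensor()` has already scaled from `0-255` to `[0, 1]`. The plain-Python sketch below works through that arithmetic for a single RGB pixel using the ImageNet statistics from the pipeline. `normalize_pixel` is a hypothetical helper for illustration only and does not use torchvision.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb_uint8: tuple[int, int, int]) -> list[float]:
    """Hypothetical helper: scale to [0, 1], then apply (x - mean) / std per channel."""
    scaled = [value / 255 for value in rgb_uint8]
    return [(x - m) / s for x, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

After normalization, pixel values fall roughly between -2.1 and 2.6 rather than between 0 and 1, which is the input range that ImageNet-pretrained backbones were trained on.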

## Handling Class Imbalance

Malware datasets often have imbalanced class distributions. The platform provides two strategies:

### 1. Class Weights

Compute inverse frequency weights to penalize misclassification of rare classes more heavily.

**Source:** `app/training/dataset.py:99-110`

```python theme={null}
from collections import Counter

import torch

def compute_class_weights(labels: list[int], num_classes: int) -> torch.Tensor:
    """Compute inverse frequency class weights for imbalanced data."""
    counter = Counter(labels)
    total = len(labels)

    weights = []
    for i in range(num_classes):
        count = counter.get(i, 1)
        weight = total / (num_classes * count)
        weights.append(weight)

    return torch.tensor(weights, dtype=torch.float32)
```
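
To see what the formula `total / (num_classes * count)` produces, the sketch below mirrors it in plain Python (no PyTorch) on a made-up label list with an 80/15/5 class imbalance. `inverse_frequency_weights` is a hypothetical stand-in for `compute_class_weights()` that returns a list instead of a tensor.

```python
from collections import Counter


def inverse_frequency_weights(labels: list[int], num_classes: int) -> list[float]:
    """Plain-Python mirror of the formula: total / (num_classes * count)."""
    counter = Counter(labels)
    total = len(labels)
    # As in compute_class_weights(), a class with no samples is treated as count 1.
    return [total / (num_classes * counter.get(i, 1)) for i in range(num_classes)]


labels = [0] * 80 + [1] * 15 + [2] * 5
weights = inverse_frequency_weights(labels, num_classes=3)
print([round(w, 3) for w in weights])  # [0.417, 2.222, 6.667]

# A fourth class with no samples falls back to count=1 and gets a large weight.
print(inverse_frequency_weights(labels, num_classes=4)[3])  # 25.0
```

The rarest class receives a weight 16 times larger than the most common one, matching the 80:5 ratio of their counts. The `count=1` fallback avoids division by zero but assigns a very large weight to any class missing from the training labels.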

### 2. Weighted Random Sampler

Oversample minority classes during training to balance batch composition.

**Source:** `app/training/dataset.py:113-126`

```python theme={null}
from torch.utils.data import WeightedRandomSampler

def create_weighted_sampler(
    labels: list[int], num_classes: int
) -> WeightedRandomSampler:
    """Create weighted random sampler for imbalanced data."""
    class_weights = compute_class_weights(labels, num_classes)

    # Assign weight to each sample
    sample_weights = [class_weights[label].item() for label in labels]

    return WeightedRandomSampler(
        weights=sample_weights,
        num_samples=len(labels),
        replacement=True,
    )
```
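
`WeightedRandomSampler` draws each sample with probability proportional to its weight, so per-sample inverse-frequency weights give every class the same total probability mass. The plain-Python sketch below checks this on a made-up 80/15/5 label list. It reproduces only the weight arithmetic and does not run the sampler itself.

```python
from collections import Counter

labels = [0] * 80 + [1] * 15 + [2] * 5
num_classes = 3
counts = Counter(labels)
total = len(labels)

# Per-sample weight: the class weight of that sample's label.
sample_weights = [total / (num_classes * counts[label]) for label in labels]

# Probability that a single draw lands in each class.
weight_sum = sum(sample_weights)
class_probability = {
    c: sum(w for w, label in zip(sample_weights, labels) if label == c) / weight_sum
    for c in range(num_classes)
}
print({c: round(p, 3) for c, p in class_probability.items()})
# {0: 0.333, 1: 0.333, 2: 0.333}
```

Because `num_samples=len(labels)` and `replacement=True`, an epoch still contains 100 draws in this example, so each of the five class-2 images is expected to appear several times per epoch.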

<Note>
  **When to use each strategy:**

  * **Class Weights**: Use with Cross-Entropy or Focal Loss when you want to keep natural class distribution but penalize errors on rare classes
  * **Weighted Sampler**: Use when you want balanced batches by oversampling minority classes (can increase training time)
</Note>

## PyTorch Dataset and DataLoader

### MalwareDataset Class

**Source:** `app/training/dataset.py:13-34`

```python theme={null}
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class MalwareDataset(Dataset):
    """PyTorch Dataset for malware images."""

    def __init__(self, image_paths: list[Path], labels: list[int], transform=None):
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        label = self.labels[idx]

        # Load image
        image = Image.open(img_path).convert("RGB")

        if self.transform:
            image = self.transform(image)

        return image, label
```

### Creating DataLoaders

The `create_dataloaders()` function orchestrates the entire pipeline:

**Source:** `app/training/dataset.py:129-249`

```python theme={null}
dataloaders, class_names, class_weights = create_dataloaders(
    dataset_config={
        "dataset_path": "dataset",
        "selected_families": None,  # or ["Family1", "Family2"]
        "split": {
            "train": 70,
            "val": 15,
            "test": 15,
            "stratified": True,
            "random_seed": 72
        },
        "preprocessing": {
            "target_size": (224, 224),
            "normalization": "ImageNet Mean/Std",
            "color_mode": "RGB"
        },
        "augmentation": {
            "preset": "Moderate"
        }
    },
    training_config={
        "batch_size": 32,
        "class_weights": "Auto Class Weights"
    },
    num_workers=4
)
```

**Returns:**

* `dataloaders`: Dictionary with `'train'`, `'val'`, and `'test'` DataLoader objects
* `class_names`: List of malware family names
* `class_weights`: Tensor of class weights (or None)

<Steps>
  <Step title="Scan Dataset">
    Discover all images and create label mappings
  </Step>

  <Step title="Create Splits">
    Split data into train/val/test with stratification
  </Step>

  <Step title="Create Transforms">
    Build augmentation pipelines for training and validation
  </Step>

  <Step title="Create Datasets">
    Instantiate PyTorch Dataset objects for each split
  </Step>

  <Step title="Compute Class Weights">
    Calculate weights for imbalanced data handling (if enabled)
  </Step>

  <Step title="Create DataLoaders">
    Build DataLoader objects with proper batching and sampling
  </Step>
</Steps>

## Best Practices

### Split Ratios

* **70/15/15**: Standard split for moderate-sized datasets (1000+ samples per class)
* **80/10/10**: Use when you have larger datasets and want more training data
* **60/20/20**: Use when validation is critical and dataset is smaller

### Batch Size

* **32**: Good default for most GPUs
* **64-128**: Use with larger GPUs and simpler models
* **8-16**: Use with limited memory or very large models

### Augmentation Strategy

* Start with **Light** or **Moderate** augmentation
* Use **Heavy** only if you observe significant overfitting
* For malware binaries, geometric transforms (rotation, flip) are usually more important than color transforms

### Random Seed

* Always set `random_seed` for reproducibility
* Use the same seed across experiments for fair comparison
* Document the seed in experiment logs

## Common Issues

<Accordion title="Out of Memory Errors">
  **Solutions:**

  * Reduce `batch_size`
  * Reduce `num_workers` (try 2 or 0)
  * Reduce `target_size` (e.g., from 224 to 128)
  * Disable `pin_memory` if using MPS/CPU
</Accordion>

<Accordion title="Slow Data Loading">
  **Solutions:**

  * Increase `num_workers` (try 4-8 on multi-core CPUs)
  * Enable `pin_memory` when using CUDA
  * Convert images to a faster format (PNG is good, avoid BMP)
  * Use smaller image sizes if possible
</Accordion>
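
The memory and throughput suggestions above pull `batch_size`, `num_workers`, and `pin_memory` in opposite directions. As a rough sketch only, a hypothetical `suggest_loader_kwargs` helper (not part of the platform) could encode a starting point. The numbers come from the suggestions on this page and should be tuned per machine. The returned keys are standard `torch.utils.data.DataLoader` arguments.

```python
def suggest_loader_kwargs(device_type: str, low_memory: bool = False) -> dict:
    """Hypothetical helper: starting DataLoader settings for a device."""
    return {
        # Smaller batches and fewer workers reduce memory pressure.
        "batch_size": 16 if low_memory else 32,
        "num_workers": 2 if low_memory else 4,
        # Pinned host memory only helps host-to-GPU copies on CUDA.
        "pin_memory": device_type == "cuda",
    }


cuda_kwargs = suggest_loader_kwargs("cuda")
mps_kwargs = suggest_loader_kwargs("mps", low_memory=True)
print(cuda_kwargs)
print(mps_kwargs)
```

The resulting dictionary can be unpacked into a loader, for example `DataLoader(dataset, **cuda_kwargs)`.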

<Accordion title="Unbalanced Batches">
  **Solutions:**

  * Enable stratified splitting
  * Use WeightedRandomSampler with `class_weights="Auto Class Weights"`
  * Increase batch size for better class distribution
</Accordion>

## Next Steps

<Card title="Model Selection" icon="brain" href="/training/model-selection">
  Learn how to choose and configure model architectures
</Card>

<Card title="Hyperparameters" icon="sliders" href="/training/hyperparameters">
  Optimize training hyperparameters for best performance
</Card>
