# Dataset Configuration

> Configure malware image datasets with splits, augmentation, and class balancing

## Overview

The Dataset Configuration page (`/dataset`) prepares your malware image dataset for training. It scans the dataset directory automatically and provides tools for class selection, data splitting, augmentation, and imbalance handling.

<Info>
  The dataset is automatically scanned from `repo/malware/` on page load. All malware families are detected and indexed with sample counts.
</Info>

## Page Structure

The Dataset page is organized into **4 tabs**:

1. **Overview & Split** - Dataset statistics and train/val/test configuration
2. **Class Distribution** - Class selection and distribution visualization
3. **Samples & Preprocessing** - Sample viewer and preprocessing preview
4. **Augmentation** - Data augmentation settings and configuration save

***

## Tab 1: Overview & Split

### Dataset Overview

Displays key statistics about your dataset:

* **Total Samples**: Total number of images across all classes
* **Number of Classes**: Count of unique malware families
* **Dataset Location**: Relative path to dataset directory
* **Class Imbalance Ratio**: Max/min samples ratio (warns if >2x)

<Tip>
  The dashboard automatically calculates class imbalance and warns if the ratio exceeds 2:1, indicating potential training bias.
</Tip>
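
The ratio check described above can be sketched in a few lines. This is a minimal illustration, not the dashboard's code: the function names and the sample counts are hypothetical, and only the max/min definition and the 2:1 threshold come from this page.

```python
def imbalance_ratio(class_counts: dict[str, int]) -> float:
    """Return max/min sample count across non-empty classes (1.0 = balanced)."""
    counts = [c for c in class_counts.values() if c > 0]
    if not counts:
        raise ValueError("no non-empty classes")
    return max(counts) / min(counts)


def is_imbalanced(class_counts: dict[str, int], threshold: float = 2.0) -> bool:
    """True when the ratio exceeds the warning threshold (2:1 on this page)."""
    return imbalance_ratio(class_counts) > threshold


# Hypothetical counts, for illustration only.
counts = {"FamilyA": 1500, "FamilyB": 500, "FamilyC": 400}
print(imbalance_ratio(counts))  # 3.75
print(is_imbalanced(counts))    # True
```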

### Train/Validation/Test Split

Two split methods are available:

#### Fixed Split (Default)

<Steps>
  <Step title="Set Training Percentage">
    Use the slider to allocate data for training (0-100%)

    * Default: 70%
  </Step>

  <Step title="Set Validation Percentage">
    From the remaining data, allocate for validation

    * Default: 50% of remaining (15% of total)
    * Rest automatically goes to test set
  </Step>

  <Step title="Configure Options">
    * **Stratified Split**: Maintains class proportions in each split
    * **Random Seed**: For reproducible splits (default: 73)
  </Step>
</Steps>

**Example Split:**

* Train: 70% → 70% of 10,000 = 7,000 samples
* Val: 50% of remaining 30% → 15% of 10,000 = 1,500 samples
* Test: Remaining 15% → 1,500 samples

A **pie chart** visualizes the final split distribution.
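
Because the validation slider is a percentage of the *remaining* data rather than of the total, the arithmetic is easy to misread. The sketch below reproduces the example numbers; the function names are hypothetical, and assigning the rounding remainder to the test set is an assumption.

```python
def fixed_split_percentages(train_pct: float, val_pct_of_remaining: float) -> dict[str, float]:
    """Convert the two slider values into percentages of the total dataset."""
    remaining = 100.0 - train_pct
    val = remaining * val_pct_of_remaining / 100.0
    return {"train": train_pct, "val": val, "test": remaining - val}


def split_counts(total: int, pcts: dict[str, float]) -> dict[str, int]:
    """Turn percentages into sample counts; the test set absorbs rounding (assumption)."""
    train = round(total * pcts["train"] / 100.0)
    val = round(total * pcts["val"] / 100.0)
    return {"train": train, "val": val, "test": total - train - val}


pcts = fixed_split_percentages(train_pct=70, val_pct_of_remaining=50)
print(pcts)                        # {'train': 70, 'val': 15.0, 'test': 15.0}
print(split_counts(10_000, pcts))  # {'train': 7000, 'val': 1500, 'test': 1500}
```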

#### K-Fold Cross-Validation

Enable **Use Cross-Validation** for advanced validation:

* **Number of Folds (K)**: 2-20 (typically 5 or 10)
* **Stratified K-Fold**: Maintains class proportions per fold
* **Training per iteration**: (K-1)/K of data
* **Validation per iteration**: 1/K of data

<Info>
  Cross-validation runs K training iterations, rotating validation folds. Final metrics are averaged across folds.
</Info>
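
To make the fold rotation concrete, here is a small standard-library sketch of stratified K-fold index generation. It is an illustration under assumptions, not the dashboard's implementation (which may well use a library splitter): `stratified_kfold` is a hypothetical name, and the round-robin dealing is one simple way to keep class proportions roughly equal per fold.

```python
import random
from collections import defaultdict


def stratified_kfold(labels: list[str], k: int, seed: int = 73):
    """Yield (train_indices, val_indices) for each of the K iterations."""
    if not 2 <= k <= 20:
        raise ValueError("k must be between 2 and 20")
    rng = random.Random(seed)

    # Group sample indices by class so each class is spread over all folds.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal round-robin into folds

    # Each fold serves as the validation set exactly once.
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


labels = ["a"] * 10 + ["b"] * 5  # hypothetical labels
for train, val in stratified_kfold(labels, k=5):
    print(len(train), len(val))  # 12 3 on every iteration
```

With K=5, each iteration trains on 4/5 of the samples and validates on the remaining 1/5, matching the (K-1)/K and 1/K fractions listed above.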

### Class Imbalance Handling

The dashboard provides several strategies to handle imbalanced classes:

<Tabs>
  <Tab title="Auto Class Weights (Recommended)">
    Automatically calculates class weights inversely proportional to frequencies.

    * Classes with fewer samples get higher weights
    * Balanced loss function during training
    * No data duplication or removal

    **Best for**: Most scenarios with moderate imbalance
  </Tab>

  <Tab title="Selective Augmentation (H2)">
    Applies additional augmentation to minority classes only.

    * Set **Minority Threshold**: Classes below this are considered minority (default: 200)
    * Set **Augmentation Multiplier**: How much more augmentation (1.5-5x)
    * Tests hypothesis H2: augmentation improves minority class recall

    **Best for**: Research experiments testing augmentation effectiveness
  </Tab>

  <Tab title="Manual Class Weights">
    Set custom weights for each class (0.1-10.0).

    * Default weights auto-calculated from class frequencies
    * Fine-tune per-class importance

    **Best for**: Domain-specific requirements
  </Tab>

  <Tab title="Oversampling (SMOTE)">
    ⚠️ **Not recommended for image data!**

    SMOTE interpolates between feature vectors, creating unrealistic synthetic images.

    * Sampling Ratio: Target minority/majority ratio (0.1-1.0)

    **Use Auto Class Weights instead**
  </Tab>

  <Tab title="Undersampling">
    Reduces samples from majority classes to balance dataset.

    * Balances by removing data
    * May lose important information

    ⚠️ Reduces total training data size
  </Tab>

  <Tab title="No Adjustment">
    Train with natural class distribution.

    * May lead to bias towards majority classes
    * Consider if imbalance reflects real-world distribution
  </Tab>
</Tabs>
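
As an illustration of "inversely proportional to frequencies", the sketch below uses the common balanced heuristic `total / (n_classes * count)`. Whether the dashboard uses exactly this formula is an assumption; the function name and the counts are hypothetical.

```python
def balanced_class_weights(class_counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by total / (n_classes * count): rarer classes weigh more."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * count) for name, count in class_counts.items()}


# Hypothetical counts, for illustration only.
counts = {"FamilyA": 600, "FamilyB": 300, "FamilyC": 100}
for name, weight in balanced_class_weights(counts).items():
    print(f"{name}: {weight:.3f}")
# FamilyA: 0.556
# FamilyB: 1.111
# FamilyC: 3.333
```

Under this heuristic, `count * weight` is the same for every class, so each class contributes equally to the weighted loss.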

***

## Tab 2: Class Distribution

### Class Selection Interface

**Controls:**

* **Select All / Deselect All**: Quick selection buttons
* **Min samples per class**: Filter classes by sample count threshold
* **Multi-select dropdown**: Search and select individual classes

<Tip>
  Use the "Min samples per class" filter to automatically select only well-represented classes. For example, set to 100 to exclude rare malware families.
</Tip>
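
The threshold filter amounts to a simple selection over the per-class counts. A minimal sketch, assuming the threshold is inclusive (the page does not say) and using a hypothetical helper name and counts:

```python
def select_classes(class_counts: dict[str, int], min_samples: int) -> list[str]:
    """Return class names with at least `min_samples` images (inclusive: assumption)."""
    return sorted(name for name, n in class_counts.items() if n >= min_samples)


# Hypothetical counts, for illustration only.
counts = {"FamilyA": 1500, "FamilyB": 100, "FamilyC": 42}
print(select_classes(counts, min_samples=100))  # ['FamilyA', 'FamilyB']
```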

### Selection Summary

Displays metrics based on selected classes:

**For Fixed Split:**

* Selected Classes count
* Total Samples
* Train samples (with percentage)
* Val samples (with percentage)
* Test samples (with percentage)

**For Cross-Validation:**

* Selected Classes count
* Total Samples
* K-Folds count

### Distribution Visualization

#### Original Distribution Chart

**Bar chart** showing samples per malware family:

* **Fixed Split**: Stacked bars with Train/Val/Test breakdown
  * Green bars: Training samples
  * Blue bars: Validation samples
  * Orange bars: Test samples
* **Cross-Validation**: Single bar showing total samples used in CV

#### Effective Distribution (After Balancing)

Shows the **effective contribution** of each class after applying imbalance handling:

* **Auto Class Weights**: All classes contribute equally (flat bars)
* **Selective Augmentation**: Minority classes show increased samples
* **Manual Weights**: Bars scaled by custom weights
* **SMOTE**: Minority classes boosted to target ratio
* **Undersampling**: All classes reduced to smallest size

**Improvement Metrics:**

* Original Ratio (before balancing)
* Effective Ratio (after balancing)
* Imbalance Reduction percentage
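
One plausible way to compute these three metrics is sketched below. The definitions are assumptions, since this page does not give the formulas: effective contribution is taken as `count * weight`, each ratio as max/min, and the reduction as the relative drop from the original ratio to the effective ratio.

```python
def max_min_ratio(values) -> float:
    """Ratio between the largest and smallest value."""
    values = list(values)
    return max(values) / min(values)


def imbalance_summary(counts: dict[str, int], weights: dict[str, float]) -> dict[str, float]:
    """Compare imbalance before and after weighting (assumed definitions)."""
    effective = {name: counts[name] * weights[name] for name in counts}
    original = max_min_ratio(counts.values())
    after = max_min_ratio(effective.values())
    return {
        "original_ratio": original,
        "effective_ratio": after,
        "reduction_pct": (original - after) / original * 100.0,
    }


# Hypothetical counts and manual weights, for illustration only.
summary = imbalance_summary({"FamilyA": 600, "FamilyC": 100}, {"FamilyA": 1.0, "FamilyC": 3.0})
print(summary["original_ratio"], summary["effective_ratio"])  # 6.0 2.0
print(round(summary["reduction_pct"], 1))                     # 66.7
```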

### Top/Bottom Classes

Two columns showing:

* **Most Common Selected Classes**: Top 5 by sample count
* **Least Common Selected Classes**: Bottom 5 by sample count

***

## Tab 3: Samples & Preprocessing

### Preprocessing Preview

**Family Selector**: Choose malware family to preview

**Preprocessing Options:**

<Tabs>
  <Tab title="Target Size">
    * 224x224 (default)
    * 256x256
    * 299x299
    * 512x512

    All images are resized to this dimension using LANCZOS resampling.
  </Tab>

  <Tab title="Normalization">
    * **\[0,1] Scale**: Divide by 255 (default)
    * **\[-1,1] Scale**: Rescale to -1 to 1 range
    * **ImageNet Mean/Std**: Normalize using ImageNet statistics
  </Tab>

  <Tab title="Color Mode">
    * **RGB**: 3-channel color (default)
    * **Grayscale**: Single channel (reduces parameters)
  </Tab>
</Tabs>
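
The three normalization modes differ only in how an 8-bit pixel value is rescaled. The sketch below applies each mode to a single RGB pixel in plain Python. The mode keys `"[-1,1]"` and `"imagenet"` and the function name are hypothetical; the mean and standard deviation are the widely used ImageNet channel statistics.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: tuple[int, int, int], mode: str) -> list[float]:
    """Normalize one 8-bit RGB pixel according to the selected mode."""
    scaled = [v / 255.0 for v in rgb]  # [0,1] scale: divide by 255
    if mode == "[0,1]":
        return scaled
    if mode == "[-1,1]":
        return [2.0 * v - 1.0 for v in scaled]
    if mode == "imagenet":
        return [(v - m) / s for v, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]
    raise ValueError(f"unknown normalization mode: {mode}")


print(normalize_pixel((0, 255, 255), "[0,1]"))   # [0.0, 1.0, 1.0]
print(normalize_pixel((0, 255, 255), "[-1,1]"))  # [-1.0, 1.0, 1.0]
```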

**Side-by-side comparison:**

* **Original Image**: As-is from dataset with original dimensions
* **After Preprocessing**: Resized and color-converted preview

### Dataset Samples

Displays a **6x6 grid** (36 images) showing random samples across all classes.

* Each image shows dimensions below (e.g., "128x128")
* Grid samples are **cached in session state** to prevent re-randomizing on rerun
* Images displayed at 150px width

<Note>
  The sample grid helps verify image quality and diversity before training.
</Note>
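
The caching behavior can be sketched with a plain dictionary standing in for the session state. This is an illustration only: the key `"grid_samples"` and the function name are hypothetical, and the real page presumably stores the sample list in its framework's session object.

```python
import random


def get_grid_samples(state: dict, image_paths: list[str], n: int = 36) -> list[str]:
    """Draw the random sample grid once and reuse it on later reruns."""
    if "grid_samples" not in state:
        # First run: pick up to n distinct images at random and cache the choice.
        state["grid_samples"] = random.sample(image_paths, min(n, len(image_paths)))
    return state["grid_samples"]


state = {}  # stand-in for session state
paths = [f"img_{i}.png" for i in range(100)]  # hypothetical file names
first = get_grid_samples(state, paths)
second = get_grid_samples(state, paths)  # simulated rerun
print(len(first), first == second)  # 36 True
```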

***

## Tab 4: Augmentation

### Augmentation Presets

Choose from predefined presets or create custom configuration:

<Tabs>
  <Tab title="None">
    No augmentation applied
  </Tab>

  <Tab title="Light">
    * Horizontal flip
    * Orthogonal rotation (90°/180°/270°)
  </Tab>

  <Tab title="Moderate">
    * H/V flip
    * Orthogonal rotation
    * Brightness (±10%)
  </Tab>

  <Tab title="Heavy">
    * All flips
    * Orthogonal rotation
    * Brightness/Contrast (±20%)
    * Gaussian noise
  </Tab>

  <Tab title="Custom">
    Configure individual augmentation transforms (see below)
  </Tab>
</Tabs>

<Info>
  Augmentation is applied **during training**, not during dataset preparation. This allows on-the-fly augmentation with minimal disk space.
</Info>
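
One way to picture the presets is as a lookup table that expands into a full set of transform settings at training time. The key names and structure below are hypothetical; only the transforms and percentages per preset are taken from the tabs above.

```python
ORTHOGONAL = (90, 180, 270)

# Every transform is off unless a preset turns it on.
DEFAULTS = {
    "horizontal_flip": False,
    "vertical_flip": False,
    "rotations": (),
    "brightness": 0.0,   # maximum +/- change as a fraction
    "contrast": 0.0,
    "gaussian_noise": False,
}

PRESETS = {
    "None": {},
    "Light": {"horizontal_flip": True, "rotations": ORTHOGONAL},
    "Moderate": {"horizontal_flip": True, "vertical_flip": True,
                 "rotations": ORTHOGONAL, "brightness": 0.10},
    "Heavy": {"horizontal_flip": True, "vertical_flip": True,
              "rotations": ORTHOGONAL, "brightness": 0.20,
              "contrast": 0.20, "gaussian_noise": True},
}


def resolve_preset(name: str) -> dict:
    """Expand a preset name into a complete settings dictionary."""
    if name not in PRESETS:
        raise KeyError(f"unknown preset: {name}")
    return {**DEFAULTS, **PRESETS[name]}


print(resolve_preset("Moderate")["brightness"])  # 0.1
```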

### Custom Augmentation Configuration

When **Custom** preset is selected:

**Geometric Transforms:**

* **Horizontal Flip**: Flip left-right (safe for malware images)
* **Vertical Flip**: Flip top-bottom
* **Orthogonal Rotation**: Select from \[90°, 180°, 270°]
  * Only 90° multiples (lossless, no interpolation)

**Photometric Transforms:**

* **Brightness Adjustment**: ±0-50% brightness change
* **Contrast Adjustment**: ±0-50% contrast change
* **Gaussian Noise**: Add random Gaussian noise

<Tip>
  Orthogonal rotations (90°/180°/270°) are lossless because they don't require interpolation. Avoid arbitrary angles for malware images.
</Tip>
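
The lossless property is easy to verify: a 90° rotation only rearranges pixels, so four rotations in a row reproduce the original image exactly. A small sketch on a nested-list "image" (helper names are hypothetical):

```python
def rotate90(image: list[list[int]]) -> list[list[int]]:
    """Rotate a 2D pixel grid 90° clockwise by rearranging values (no interpolation)."""
    return [list(row) for row in zip(*image[::-1])]


def horizontal_flip(image: list[list[int]]) -> list[list[int]]:
    """Mirror a 2D pixel grid left-to-right."""
    return [row[::-1] for row in image]


image = [[1, 2],
         [3, 4]]
print(rotate90(image))  # [[3, 1], [4, 2]]

# Four quarter turns give back the exact original pixel values.
restored = image
for _ in range(4):
    restored = rotate90(restored)
print(restored == image)  # True
```

An arbitrary angle such as 37° would map pixels to non-integer positions and force interpolation, which alters pixel values.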

### Augmentation Preview

**Select Family to Preview**: Choose malware family

Displays **3 side-by-side images:**

1. **After Preprocessing**: Base image (resized, normalized)
2. **Augmented (Example 1)**: Random augmentation applied
3. **Augmented (Example 2)**: Different random augmentation

**Refresh Button**: Generate new random augmentation examples

<Note>
  Augmentations are applied randomly, so each example shows different transforms. Refresh multiple times to see the full range of augmentations.
</Note>

***

## Configuration Summary & Save

The final section (bottom of Tab 4) displays:

### Summary Metrics

3 metric cards:

* **Selected Classes**: Number of families included
* **Total Samples**: Sum across selected classes
* **Method**: "5-Fold CV" or "Augmentation Preset"

### Save Configuration Button

**💾 Save Configuration** (primary button, centered)

* Validates all settings
* Saves the configuration to session state (managed in `state/workflow.py`)
* Shows success message with balloons 🎈
* Enables Model page navigation

<Info>
  After saving, a green checkmark (✅) appears in the sidebar next to "Dataset configured".
</Info>

### Full Configuration JSON

Expandable section showing complete config structure:

```json
{
  "dataset_path": "repo/malware",
  "total_samples": 10000,
  "num_classes": 25,
  "selected_families": ["Ramnit", "Lollipop", ...],
  "split": {
    "method": "fixed_split",
    "train": 70.0,
    "val": 15.0,
    "test": 15.0,
    "stratified": true,
    "random_seed": 73
  },
  "augmentation": {
    "preset": "Moderate"
  },
  "preprocessing": {
    "target_size": [224, 224],
    "normalization": "[0,1]",
    "color_mode": "RGB"
  },
  "imbalance_handling": {
    "strategy": "Auto Class Weights (Recommended)",
    "class_weights": null,
    "smote_ratio": null
  }
}
```
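
A configuration with this shape can be sanity-checked before training. The checks below are examples of what validation might cover, not the dashboard's actual rules; `validate_config` is a hypothetical helper, and the sample dictionary is a shortened version of the structure above.

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of problems found in a dataset config (empty list = valid)."""
    errors = []

    if not config.get("selected_families"):
        errors.append("no malware families selected")

    split = config.get("split", {})
    if split.get("method") == "fixed_split":
        total = sum(split.get(key, 0.0) for key in ("train", "val", "test"))
        if abs(total - 100.0) > 1e-6:
            errors.append(f"split percentages sum to {total}, expected 100")

    size = config.get("preprocessing", {}).get("target_size")
    if not (isinstance(size, (list, tuple)) and len(size) == 2):
        errors.append("target_size must be [width, height]")

    return errors


config = {
    "selected_families": ["Ramnit", "Lollipop"],
    "split": {"method": "fixed_split", "train": 70.0, "val": 15.0, "test": 15.0},
    "preprocessing": {"target_size": [224, 224]},
}
print(validate_config(config))  # []
```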

***

## Tips & Best Practices

<Tip>
  **Start with Auto Class Weights**: This strategy handles class imbalance without duplicating or removing data, which makes it a safe default.
</Tip>

<Tip>
  **Use Stratified Splits**: Always enable stratified splitting to maintain class proportions across train/val/test sets.
</Tip>

<Tip>
  **Preview Before Training**: Check the augmentation preview to ensure transforms are appropriate for malware images.
</Tip>

<Warning>
  Avoid SMOTE for image data. It creates unrealistic blended pixels. Use Auto Class Weights instead.
</Warning>

## Next Steps

After saving your dataset configuration:

<Card title="Model Builder" icon="brain" href="/dashboard/model-builder">
  Design your neural network architecture on the Model page
</Card>
