# Results & Evaluation

> Analyze training curves, confusion matrices, and per-class performance metrics

## Overview

The Results & Evaluation page (`/results`) provides comprehensive analysis of completed training experiments. View training curves, test set performance, confusion matrices, and detailed per-class metrics.

<Info>
  Only **completed experiments** appear on this page. Experiments must finish training (or be manually stopped) to generate evaluation results.
</Info>

## Page Structure

The page displays:

* **Experiment Counter**: Number of completed experiments
* **Experiment Cards**: Expandable cards for each experiment (newest first)
* **Per-Experiment Analysis**: Training curves and advanced metrics within each card

***

## Experiment Selection

Completed experiments display as **expandable cards**.

### Card Label

The collapsed card shows:

* **Experiment Name**: User-defined or auto-generated name
* **Model Name**: Model used for training
* **Validation Accuracy**: Final validation accuracy (e.g., "Val Acc: 87.3%")

**Example:**

```
ResNet50_Baseline | ResNet50_v1 | Val Acc: 87.3%
```

<Tip>
  Cards are sorted by creation time (newest first). Recent experiments appear at the top.
</Tip>

### Expanding Cards

Click a card to expand it and view the full analysis.

***

## Summary Section

The top section of an expanded card shows the experiment overview.

### Experiment Info

**3 columns:**

<Tabs>
  <Tab title="Column 1: Model">
    * **Model Name**: "ResNet50\_v1"
    * **Type**: "Transfer Learning", "Custom CNN", or "Transformer"
  </Tab>

  <Tab title="Column 2: Training Config">
    * **Training Config Name**: "Adam\_Default"
    * **Epochs**: Number of epochs trained (e.g., "Epochs: 45")
  </Tab>

  <Tab title="Column 3: Duration">
    * **Duration**: Total training time (e.g., "1:23:45")
    * **Best Epoch**: Epoch with best validation metric (e.g., "Best Epoch: 38")
  </Tab>
</Tabs>

### Final Metrics Row

**5 metric cards** showing final validation performance:

<CardGroup cols={5}>
  <Card title="Val Loss">
    Final validation loss (e.g., "0.3456")
  </Card>

  <Card title="Val Accuracy">
    Classification accuracy (e.g., "87.3%")
  </Card>

  <Card title="Val Precision">
    Macro-averaged precision (e.g., "86.5%")
  </Card>

  <Card title="Val Recall">
    Macro-averaged recall (e.g., "85.9%")
  </Card>

  <Card title="Val F1">
    Macro-averaged F1 score (e.g., "86.2%")
  </Card>
</CardGroup>

<Info>
  **Macro-averaging** computes the metric for each class independently, then takes the unweighted mean. This treats all classes equally regardless of size.
</Info>
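
As a minimal sketch of macro-averaging, the snippet below computes per-class precision and then its unweighted mean. The function name and the toy labels are illustrative assumptions, not part of the application; returning 0.0 for a class that is never predicted is one common convention.

```python
def macro_precision(y_true, y_pred):
    """Per-class precision, then an unweighted mean over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        # Assumed convention: precision is 0.0 when the class is never predicted.
        per_class[c] = tp / (tp + fp) if (tp + fp) else 0.0
    macro = sum(per_class.values()) / len(per_class)
    return macro, per_class


# Toy labels: class "A" has four samples, class "B" has two.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "B", "B", "A"]

macro, per_class = macro_precision(y_true, y_pred)
print(per_class)  # precision per class
print(macro)      # unweighted mean of the per-class values
```

Class "B" contributes as much to the macro average as class "A" even though it has half as many samples, which is why a weak minority class pulls the macro score down noticeably.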

***

## Training Curves Tab

The first tab shows **training history visualizations**.

### Core Training Metrics (Row 1)

Three charts side-by-side:

<Tabs>
  <Tab title="Loss">
    **Train vs Validation Loss**

    * **Blue line**: Training loss per epoch
    * **Red line**: Validation loss per epoch
    * **X-axis**: Epoch number
    * **Y-axis**: Loss value

    **What to look for:**

    * Both curves should decrease over time
    * Validation loss should follow training loss
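    * **Widening gap** (validation loss flattens or rises while training loss keeps falling) = overfitting
    * **Divergence** (loss spiking or growing instead of settling) = training instability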
  </Tab>

  <Tab title="Accuracy">
    **Train vs Validation Accuracy**

    * **Blue line**: Training accuracy per epoch
    * **Red line**: Validation accuracy per epoch
    * **X-axis**: Epoch number
    * **Y-axis**: Accuracy (0-1 or 0-100%)

    **What to look for:**

    * Both curves should increase over time
    * Training accuracy typically higher than validation
    * Plateau indicates convergence
  </Tab>

  <Tab title="Precision / Recall / F1">
    **Multi-line Chart**

    * **Precision**: Green line
    * **Recall**: Blue line
    * **F1 Score**: Purple line
    * All validation metrics

    **What to look for:**

    * F1 balances precision and recall
    * High precision, low recall = conservative predictions
    * High recall, low precision = aggressive predictions
  </Tab>
</Tabs>

<Tip>
  **Ideal convergence**: Loss decreases smoothly, accuracy increases steadily, train/val curves stay close together.
</Tip>

### Learning Dynamics (Row 2)

Three additional charts:

<Tabs>
  <Tab title="Learning Rate">
    **LR Schedule Visualization**

    * Shows learning rate per epoch
    * **Constant**: Flat line
    * **ReduceLROnPlateau**: Stepped decreases
    * **Cosine Annealing**: Smooth cosine curve

    **What to look for:**

    * LR reductions correlate with loss plateaus
    * Verify schedule executed as expected
  </Tab>

  <Tab title="Overfitting Gap">
    **Generalization Gap**

    * **Formula**: `Train Accuracy - Val Accuracy`
    * Positive gap = overfitting (train better than val)
    * Negative gap = validation better than train (unusual; can happen when dropout or augmentation is applied only during training)

    **What to look for:**

    * Small gap (0-5%) = good generalization
    * Growing gap = increasing overfitting
    * Gap > 10% = concerning overfitting
  </Tab>

  <Tab title="Train vs Val F1">
    **F1 Score Comparison**

    * **Blue line**: Training F1
    * **Red line**: Validation F1
    * Shows classification quality across splits

    **What to look for:**

    * Close curves = model generalizes well
    * Large gap = overfitting to training set
  </Tab>
</Tabs>

<Note>
  These charts help diagnose training issues: overfitting, underfitting, learning rate problems, and convergence quality.
</Note>
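
The overfitting gap formula and its rule-of-thumb bands can be sketched as follows. The function names are hypothetical, accuracies are assumed to be fractions in the 0-1 range, and the "moderate gap" label for the 5-10% band is an assumption, since this page only names the bands below 5% and above 10%.

```python
def overfitting_gap(train_acc, val_acc):
    """Per-epoch gap: train accuracy minus validation accuracy."""
    return [t - v for t, v in zip(train_acc, val_acc)]


def describe_gap(gap):
    """Map a gap (fraction, 0-1) to the rule-of-thumb bands on this page."""
    if gap < 0:
        return "validation ahead of training"
    if gap <= 0.05:
        return "good generalization"
    if gap <= 0.10:
        return "moderate gap"  # assumed label for the unnamed 5-10% band
    return "concerning overfitting"


# Toy per-epoch accuracies (illustrative values only).
train_acc = [0.60, 0.80, 0.95]
val_acc = [0.58, 0.74, 0.83]

gaps = overfitting_gap(train_acc, val_acc)
labels = [describe_gap(g) for g in gaps]
for epoch, (g, label) in enumerate(zip(gaps, labels), start=1):
    print(f"epoch {epoch}: gap={g:.2f} -> {label}")
```

A gap that moves through the bands over successive epochs, as in this toy series, is the "growing gap" pattern described in the Overfitting Gap tab.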

### Export Section

Bottom of Training Curves tab:

**Two download buttons:**

1. **Download Training History (CSV)**
   * Exports all epoch-level metrics to CSV
   * Columns: epoch, train\_loss, train\_acc, val\_loss, val\_acc, learning\_rate, etc.
   * Import into Excel/Python for custom analysis

2. **Download Model (.pt)** *(currently disabled)*
   * Will export trained PyTorch model weights
   * Load with `torch.load()` for inference
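
For custom analysis of the exported history, a minimal sketch using only the Python standard library is shown below. It assumes the header names match the columns listed above; the actual export may contain additional columns, and the inline sample data is made up for illustration.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded file. A real export
# would be read with open("<your file>.csv", newline="") instead.
sample = """epoch,train_loss,train_acc,val_loss,val_acc,learning_rate
1,1.20,0.55,1.10,0.58,0.001
2,0.80,0.72,0.85,0.70,0.001
3,0.50,0.85,0.90,0.69,0.0005
"""


def load_history(fileobj):
    """Parse epoch-level rows, converting epoch to int and metrics to float."""
    rows = []
    for row in csv.DictReader(fileobj):
        rows.append(
            {k: (int(v) if k == "epoch" else float(v)) for k, v in row.items()}
        )
    return rows


history = load_history(io.StringIO(sample))

# Pick the epoch with the highest validation accuracy.
best = max(history, key=lambda r: r["val_acc"])
print(f"best epoch: {best['epoch']} (val_acc={best['val_acc']:.2f})")
```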

***

## Advanced Metrics Tab

The second tab runs **test set evaluation** and displays detailed performance analysis.

<Info>
  Test evaluation runs automatically when you open this tab. Results are cached for subsequent views.
</Info>

### Test Set Performance

**Accuracy Summary Cards:**

Displays test set metrics in card format:

* **Test Accuracy**: Overall classification accuracy
* **Test Precision**: Macro-averaged precision
* **Test Recall**: Macro-averaged recall
* **Test F1 Score**: Macro-averaged F1

<Tip>
  Test metrics are typically slightly lower than validation metrics: the best epoch is selected using validation performance, while the test set is held out from both training and model selection.
</Tip>

### Confusion Matrix

**Left column: Heatmap visualization**

<Steps>
  <Step title="Matrix Structure">
    * **Rows**: True labels (actual classes)
    * **Columns**: Predicted labels
    * **Diagonal**: Correct predictions (darker = more correct)
    * **Off-diagonal**: Misclassifications
  </Step>

  <Step title="Color Scale">
    * Dark colors: High counts
    * Light colors: Low counts
    * Helps identify confusion patterns
  </Step>

  <Step title="Interpretation">
    * **Strong diagonal**: Good performance
    * **Off-diagonal clusters**: Specific confusion pairs
    * Example: Ramnit confused with Lollipop (dark off-diagonal cell)
  </Step>
</Steps>

**How to use:**

* Hover over cells to see exact counts
* Identify which classes are frequently confused
* Diagonal dominance = good classification

<Warning>
  Heavy off-diagonal values indicate systematic misclassification. Investigate whether those classes are visually similar.
</Warning>
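
To find the most confused class pairs programmatically rather than by eye, a sketch along these lines can help. It assumes you have lists of true and predicted labels available outside the app; the helper names and the toy labels are illustrative.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs; true labels index rows, predictions columns."""
    return Counter(zip(y_true, y_pred))


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent off-diagonal (true, predicted) pairs."""
    counts = confusion_counts(y_true, y_pred)
    off_diag = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(off_diag.items(), key=lambda item: item[1], reverse=True)[:k]


# Toy labels (not real results).
y_true = ["Ramnit", "Ramnit", "Ramnit", "Lollipop", "Lollipop", "Ramnit"]
y_pred = ["Ramnit", "Lollipop", "Lollipop", "Lollipop", "Ramnit", "Ramnit"]

pairs = top_confusions(y_true, y_pred)
for (true_label, predicted_label), n in pairs:
    print(f"{true_label} predicted as {predicted_label}: {n}")
```

Each returned pair corresponds to one off-diagonal cell of the heatmap, ordered from the heaviest confusion down.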

### Per-Class Metrics

**Right column: Bar chart**

Shows **Precision, Recall, F1** for each malware family.

* **Precision bars**: Green
* **Recall bars**: Blue
* **F1 bars**: Purple
* **X-axis**: Class names
* **Y-axis**: Metric value (0-1)

**What to look for:**

* **High F1**: Model performs well on this class
* **Low F1**: Struggling class (investigate why)
* **High precision, low recall**: Model is cautious (few predictions, mostly correct)
* **Low precision, high recall**: Model is aggressive (many predictions, many wrong)

<Tip>
  Identify underperforming classes and consider:

  * Adding more training samples
  * Increasing selective augmentation
  * Reviewing data quality
</Tip>

### Classification Report Table

**Bottom section: Detailed table**

Tabular breakdown of per-class metrics:

| Class            | Precision | Recall | F1-Score | Support |
| ---------------- | --------- | ------ | -------- | ------- |
| Ramnit           | 0.87      | 0.89   | 0.88     | 150     |
| Lollipop         | 0.82      | 0.79   | 0.80     | 142     |
| ...              | ...       | ...    | ...      | ...     |
| **Macro Avg**    | 0.85      | 0.86   | 0.85     | 3000    |
| **Weighted Avg** | 0.86      | 0.87   | 0.86     | 3000    |

**Columns:**

* **Class**: Malware family name
* **Precision**: TP / (TP + FP)
* **Recall**: TP / (TP + FN)
* **F1-Score**: Harmonic mean of precision and recall
* **Support**: Number of test samples for this class

**Averages:**

* **Macro Avg**: Unweighted mean (all classes equal)
* **Weighted Avg**: Weighted by support (larger classes matter more)

<Info>
  Use **Macro Avg** to evaluate overall performance treating all classes equally. Use **Weighted Avg** if class sizes reflect real-world deployment.
</Info>
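
The difference between the two averages shows up most clearly on imbalanced data. The following sketch uses made-up F1 scores and supports for two classes; the function names are illustrative.

```python
def macro_avg(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


def weighted_avg(values, support):
    """Mean weighted by the number of test samples per class."""
    total = sum(support)
    return sum(v * s for v, s in zip(values, support)) / total


# Toy example: a large class with a high F1 and a small class with a low F1.
f1_scores = [0.90, 0.50]
support = [900, 100]

macro = macro_avg(f1_scores)
weighted = weighted_avg(f1_scores, support)
print(f"macro avg:    {macro:.2f}")
print(f"weighted avg: {weighted:.2f}")
```

The weighted average is dominated by the large class, so a noticeably lower macro average is a hint that one or more small classes are underperforming.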

***

## Interpreting Results

### Good Performance Indicators

✅ **Training curves:**

* Smooth loss decrease
* Steady accuracy increase
* Small train/val gap (\<5%)

✅ **Test metrics:**

* Test accuracy close to validation accuracy
* High F1 scores (>0.85)
* Balanced precision and recall

✅ **Confusion matrix:**

* Strong diagonal
* Minimal off-diagonal clusters

### Poor Performance Indicators

❌ **Overfitting:**

* Train accuracy >> Val accuracy (gap >10%)
* Training loss decreases but val loss increases
* **Solutions**: Add regularization, reduce model size, increase dropout, more data augmentation

❌ **Underfitting:**

* Both train and val accuracy are low
* Loss plateaus at high value
* **Solutions**: Increase model capacity, train longer, adjust the learning rate (one that is too low can stall progress)

❌ **Class confusion:**

* Specific off-diagonal clusters in confusion matrix
* Low recall for certain classes
* **Solutions**: Collect more data for confused classes, apply selective augmentation, use class weights

***

## Experiment Comparison

Compare multiple experiments to identify best configurations:

<Steps>
  <Step title="Expand Multiple Cards">
    Expand several experiment cards and scroll between them
  </Step>

  <Step title="Compare Metrics">
    Look at Final Metrics row across experiments

    * Which has highest val accuracy?
    * Which has best F1 score?
    * Which trained fastest?
  </Step>

  <Step title="Analyze Curves">
    Compare training curves

    * Which converged faster?
    * Which shows better generalization?
    * Which avoided overfitting?
  </Step>

  <Step title="Review Confusion Matrices">
    Identify which model makes fewer critical misclassifications
  </Step>
</Steps>

<Tip>
  Download CSV files for all experiments and create comparison plots in Python/Excel for side-by-side analysis.
</Tip>

***

## Metric Definitions

### Accuracy

**Formula**: `(TP + TN) / (TP + TN + FP + FN)`

* Percentage of correct predictions
* **Caution**: Can be misleading with imbalanced data
* Example: a model that always predicts a class making up 95% of the data reaches 95% accuracy but never detects any other class
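
The imbalance caveat can be checked with a few lines of Python. The labels below are synthetic and the helper names are illustrative; the "model" simply predicts the majority class every time.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """TP / (TP + FN) for one class; 0.0 if the class has no samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Synthetic data: 95 majority samples, 5 minority samples.
y_true = ["majority"] * 95 + ["minority"] * 5
y_pred = ["majority"] * 100  # always predict the majority class

acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, "minority")
print(f"accuracy: {acc:.2f}, minority recall: {minority_recall:.2f}")
```

Accuracy looks strong while recall on the minority class is zero, which is why the per-class metrics on this page matter.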

### Precision

**Formula**: `TP / (TP + FP)`

* Of all positive predictions, how many were correct?
* High precision = few false alarms
* **Use case**: When false positives are costly

### Recall (Sensitivity)

**Formula**: `TP / (TP + FN)`

* Of all actual positives, how many did we find?
* High recall = few missed detections
* **Use case**: When false negatives are costly (e.g., malware detection)

### F1 Score

**Formula**: `2 × (Precision × Recall) / (Precision + Recall)`

* Harmonic mean of precision and recall
* Balances both metrics
* **Use case**: When you need balanced performance

<Info>
  For malware classification, **recall is often more important** than precision. Missing malware (false negative) is worse than flagging benign files (false positive).
</Info>
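
The precision, recall, and F1 formulas above can be worked through with concrete counts. This is a minimal sketch with made-up TP/FP/FN values; returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a class cannot score a high F1 with strong precision but weak recall.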

***

## Tips & Best Practices

<Tip>
  **Focus on F1 Score**: It balances precision and recall, providing a single metric for model quality.
</Tip>

<Tip>
  **Check Overfitting Gap**: Train/val gap > 10% indicates overfitting. Add regularization or more data.
</Tip>

<Tip>
  **Analyze Confusion Matrix**: Identify specific class pairs that are confused. This guides data collection and augmentation.
</Tip>

<Warning>
  Don't rely solely on accuracy with imbalanced data. A model predicting only the majority class can have high accuracy but zero utility.
</Warning>

<Tip>
  **Export History CSV**: Download training history for custom visualizations and deeper analysis in Jupyter/Excel.
</Tip>

## Next Steps

After analyzing results:

<Card title="Interpretability Tools" icon="microscope" href="/dashboard/interpretability">
  Visualize model attention with Grad-CAM and explore embeddings
</Card>