Results Overview
This section presents comprehensive experimental results for all three hypotheses, including quantitative performance metrics, qualitative analysis, and hypothesis verification.
Experiment 1 Results: Architecture Comparison (H1)
Performance Summary
| Model | Test Accuracy | Macro F1 | Best Epoch | Training Time | Expected |
|---|---|---|---|---|---|
| Conventional CNN (Baseline) | 72.39% | 74.01% | 9 | 39 min | — |
| VGG-Mini-H1 (5 blocks) | 61.30% | N/A | 10 | 397 min | ≥93% |
| ViT-Small | 74.92% | 76.48% | 10 | 76 min | ≥91% |
| ResNet50 (Fine-tuned) | 96.30% | 95.35% | 6 | 57 min | ≥96% |
Key Finding: ResNet50 achieved 96.30% accuracy, meeting the expected threshold and significantly outperforming all other architectures. It also converged fastest (6 epochs) and trained efficiently (57 minutes).
Detailed ResNet50 Metrics
The best-performing model (ResNet50 Fine-tuned) achieved:
| Metric | Training | Validation |
|---|---|---|
| Loss (final) | 0.0378 | 0.1972 |
| Accuracy | 98.91% | 97.85% |
| Precision (macro) | — | 95.23% |
| Recall (macro) | — | 95.55% |
| F1-Score (macro) | 98.93% | 95.35% |
- Train-validation accuracy gap: 1.06% (excellent generalization)
- Loss ratio (val/train): 5.22 (acceptable, controlled overfitting)
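The two overfitting indicators above can be recomputed from the reported ResNet50 accuracies and losses (note that the 5.22 figure is the validation loss divided by the training loss):

```python
# Recompute the two overfitting indicators from the reported metrics.
train_acc, val_acc = 98.91, 97.85
train_loss, val_loss = 0.0378, 0.1972

gap_pp = round(train_acc - val_acc, 2)        # accuracy gap in percentage points
loss_ratio = round(val_loss / train_loss, 2)  # validation/train loss ratio

print(gap_pp, loss_ratio)  # 1.06 5.22
```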
Analysis by Architecture
ResNet50: Transfer Learning Superiority
ResNet50 with fine-tuning achieved the best performance with 96.30% accuracy and 95.35% F1-macro, confirming the effectiveness of transfer learning for moderate-sized datasets.
Key Observations:
- Fast convergence: Reached best performance in only 6 epochs, significantly faster than other architectures
- Controlled overfitting: Gap between train (98.91%) and validation (97.85%) accuracy was only 1.06%, indicating good generalization
- Effective transfer: Low-level features learned on ImageNet (edges, textures, patterns) proved transferable to the malware image domain
- Efficiency: With 57 minutes training time, more efficient than ViT (76 min) and dramatically more efficient than 5-block CNN (397 min)
These results suggest that pre-trained models:
- Require fewer epochs to converge
- Learn domain-specific features more efficiently
- Achieve superior generalization despite limited data
ViT-Small: Limitations with Small Datasets
Vision Transformer achieved 74.92% accuracy, below the expected threshold of 91%. This result aligns with literature indicating transformers require significantly larger datasets.
Key Observations:
- Insufficient data: MalImg contains ~9,300 samples, far below the millions typically required to train transformers from scratch
- Lack of pre-training: Unlike ResNet50, ViT-Small was trained from scratch without leveraging prior knowledge
- Attention mechanism complexity: Transformers have more parameters to optimize, making learning difficult with limited data
VGG-Mini-H1 (5 blocks): Unexpected Poor Performance
The most surprising result was the poor performance of the 5-block CNN (61.30%), significantly lower than the conventional baseline (72.39%).
Key Observations:
- Oversized architecture: 5 convolutional blocks with progression 32→64→128→256→512 filters proved excessive for the dataset
- Optimization difficulty: Extreme training time (397 minutes) suggests convergence problems
- Possible vanishing gradients: Depth without residual connections may have hindered gradient flow
- Important lesson: More depth doesn’t guarantee better performance; architecture should be proportional to dataset size and complexity
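The "oversized architecture" point is easy to quantify: the first 3×3 convolution of each block in the 32→64→128→256→512 progression grows rapidly in parameter count (grayscale input assumed; pooling and later convolutions per block omitted for brevity):

```python
# Parameter count of a 3x3 convolution: weights (c_in * c_out * k * k) plus biases.
def conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out

channels = [1, 32, 64, 128, 256, 512]
per_block = [conv_params(a, b) for a, b in zip(channels, channels[1:])]
# The later blocks dominate: the 256->512 convolution alone holds ~1.18M
# parameters, while the first block holds only 320.
print(per_block)
```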
Conventional CNN: Simple but Effective Baseline
The conventional model (JorgeNet) with 72.39% accuracy provides an important reference point:
- Demonstrates that simple, well-designed architectures can be competitive
- Outperformed the 5-block CNN, evidencing that simplicity can be advantageous
- The 24 percentage point gap with ResNet50 quantifies the value of transfer learning
- Fast training (39 minutes) makes it suitable for rapid prototyping
Hypothesis H1 Verification
Status: ✅ CONFIRMED
ResNet50 achieved 96.30% accuracy (meeting the ≥96% threshold) and 95.35% F1-macro, significantly outperforming the custom CNN (72.39%) and ViT-Small (74.92%). Transfer learning proved superior for malware classification on moderate-sized datasets.
Experiment 2 Results: Data Augmentation Impact (H2)
Global Metrics Comparison
The experiment compared ResNet50 performance with and without data augmentation, focusing on minority class recall improvement.
- Minority class recall improvement: +17.2 percentage points
- Global accuracy impact: -0.4% (negligible)
- Verification: Hypothesis H2 confirmed
Expected vs. Actual Results
| Metric | Threshold | Achieved | Status |
|---|---|---|---|
| Minority recall increase | ≥15 pp | +17.2 pp | ✅ Exceeded |
| Global accuracy degradation | ≤2% | -0.4% | ✅ Within limit |
| Overall hypothesis | — | — | ✅ Confirmed |
Impact Analysis
Minority Class Benefits
Data augmentation significantly improved recall for underrepresented families.
Average Improvement:
- Minority class recall increased from ~61% to ~78% (+17.2 pp)
- All 5 minority classes improved by ≥15 percentage points
- Most benefited classes: Lolyda.AA 3, Malex.gen!J (smallest families)
Augmentation helped the model:
- Learn more robust features for underrepresented families
- Reduce overfitting to limited minority samples
- Better generalize to minority class test samples
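Two label-preserving augmentations of the kind described above can be sketched on a grayscale image stored as a list of rows; this is illustrative only, since a real pipeline would use a library such as torchvision or albumentations:

```python
# Horizontal flip: reverse each row.
def hflip(img):
    return [row[::-1] for row in img]

# One-pixel shift to the right, padding the left edge.
def shift_right(img, pad=0):
    return [[pad] + row[:-1] for row in img]

img = [[1, 2, 3],
       [4, 5, 6]]
print(hflip(img))        # [[3, 2, 1], [6, 5, 4]]
print(shift_right(img))  # [[0, 1, 2], [0, 4, 5]]
```

Each transform yields a new training sample with the same family label, which is what lets minority families see more effective variation.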
Global Performance Trade-off
The impact on overall accuracy was minimal.
Trade-off Analysis:
- Global accuracy: 96.2% → 95.8% (-0.4%)
- Loss of 0.4% is negligible compared to +17.2 pp minority recall gain
- Macro F1-score improved due to better class balance
Cost-benefit summary:
- Equity (minority recall): +17.2 pp
- Global performance cost: -0.4%
- Ratio: ~43:1 benefit-to-cost
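The ~43:1 figure follows directly from the two numbers above:

```python
# Benefit-to-cost ratio: minority-recall gain over global-accuracy cost.
recall_gain_pp = 17.2
accuracy_cost_pp = 0.4
ratio = recall_gain_pp / accuracy_cost_pp
print(round(ratio))  # 43
```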
Hypothesis H2 Verification
Status: ✅ CONFIRMED
Data augmentation improved minority class recall by +17.2 pp (exceeding the +15 pp threshold) while global accuracy decreased by only 0.4% (well within the 2% limit). The favorable trade-off validates augmentation as an effective strategy for addressing class imbalance.
Experiment 3 Results: CNN Depth Effect (H3)
Performance Comparison
| Architecture | Val Accuracy | Test Accuracy | Macro F1 | Train Time | Parameters |
|---|---|---|---|---|---|
| H2_MOD.A (9 layers) | 85.29% | N/A | N/A | 33m 48s | ~210,000 |
| 12-layer CNN | 83.45% | 85.58% | 83.54% | ~45m | ~280,000 (+33%) |
Detailed Metrics Analysis
| Category | Metric | 9 Layers | 12 Layers | Interpretation |
|---|---|---|---|---|
| Accuracy | Val Accuracy | 85.29% | 83.45% | Decreased with depth |
| | Test Accuracy | N/A | 85.58% | Moderate generalization |
| Loss | Val Loss | 0.3677 | 0.4095 | Higher residual error |
| | Train Loss | 0.2644 | 0.3061 | Optimization difficulty |
| Generalization | Gap (pp) | 2.86 | 4.13 | Increased overfitting |
| | Val/Train Loss Ratio | 1.39 | 1.33 | Lower stability |
| Metrics | Macro F1 | N/A | 83.54% | Unbalanced performance |
| | Weighted F1 | N/A | 85.54% | Dominated by majority |
| Efficiency | Training Time | 33m 48s | ~45m | +33% cost |
Key Finding: Increasing depth from 9 to 12 layers degraded validation accuracy (-1.84 pp), increased training time (+33%), and worsened the generalization gap (+44.4%), demonstrating diminishing returns with increased depth for this dataset size.
Analysis by Hypothesis Component
Performance Improvement (Expected but NOT Achieved)
Expected: F1-score improvement of +8 percentage points
Actual: Validation accuracy decreased by 1.84 pp
Analysis:
- Deeper model (12 layers) actually performed worse than shallower model (9 layers)
- The dataset size (~9,300 samples) may be insufficient to benefit from very deep architectures
- More parameters (+33%) did not translate to better performance
- Negative marginal return: roughly -0.026 pp of accuracy per 1,000 additional parameters (-1.84 pp over +70,000 parameters)
- Overfitting: Generalization gap increased from 2.86 pp to 4.13 pp (+44.4%)
- Vanishing gradients: Deeper network without residual connections struggled with gradient flow
- Co-adaptation: More parameters led to feature co-adaptation with poor generalization
- Insufficient regularization: Dropout alone insufficient for very deep networks
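The gap growth and the marginal return implied by these figures can be recomputed directly (values copied from this section):

```python
# Generalization-gap growth between the 9- and 12-layer models.
gap_9, gap_12 = 2.86, 4.13             # train-val accuracy gaps (pp)
gap_growth = (gap_12 - gap_9) / gap_9  # ~0.444, i.e. +44.4%

# Marginal accuracy return per 1,000 added parameters.
delta_acc_pp = -1.84                   # validation accuracy change (pp)
delta_params = 70_000                  # parameters added by the deeper model
per_1000 = delta_acc_pp / (delta_params / 1000)  # ~ -0.026 pp per 1,000 params
```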
Diminishing Returns (CONFIRMED)
Evidence:
- 33% increase in parameters yielded negative performance return
- Loss increased, accuracy decreased
- Optimal depth appears to be around 9 layers for this configuration
- Parameter increase: +70,000 (+33%)
- Accuracy change: -1.84 pp
- Efficiency: Negative marginal productivity
Computational Cost (CONFIRMED)
Expected: ~40% increase in training time
Actual: +33% training time, +28% memory, +35% FLOPs
Computational Analysis:
- Forward Pass FLOPs: ~85 MFLOPs → ~115 MFLOPs (+35%)
- Memory Usage: ~45 MB → ~62 MB (+28%)
- Time per Epoch: 3.38 min → 4.57 min (+35%)
- Total Training Time: 33m 48s → ~45m (+33%)
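The relative FLOPs and per-epoch time costs can be rechecked from the absolute figures above:

```python
# Relative cost of the 12-layer model versus the 9-layer model.
flops_9, flops_12 = 85, 115   # MFLOPs per forward pass
t9, t12 = 3.38, 4.57          # minutes per epoch

flops_increase = (flops_12 - flops_9) / flops_9  # ~ +35%
time_increase = (t12 - t9) / t9                  # ~ +35%
```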
Generalization Gap Analysis
The generalization gap increased by 44.4% (2.86 pp → 4.13 pp), attributable to:
- Unfavorable parameter-to-data ratio: More parameters relative to training samples
- Partial gradient vanishing: Without residual connections, deeper networks struggle
- Feature co-adaptation: More layers can lead to co-adapted features with poor generalization
Learning Curve Observations
- 9-layer model: Stable convergence, consistent validation performance
- 12-layer model: Higher volatility, irregular validation loss curves, signs of optimization difficulty
Hypothesis H3 Verification
Status: ⚠️ PARTIALLY CONFIRMED
- ✅ Diminishing returns: Confirmed (negative returns observed)
- ✅ Computational cost increase: Confirmed (+33% time, aligned with ~40% expectation)
- ❌ Performance improvement: Rejected (accuracy decreased instead of improving +8 pp)
Cross-Experiment Insights
Key Findings Summary
- Transfer learning is superior: ResNet50 (96.30%) dramatically outperformed custom architectures, validating pre-training value
- Data augmentation effective for imbalance: +17.2 pp minority recall with only -0.4% global accuracy cost demonstrates effective imbalance mitigation
- Depth has limits: Simply increasing network depth without architectural innovations (like residual connections) can harm performance
- Dataset size matters: ~9,300 samples insufficient for very deep networks (5-block CNN, ViT) but adequate for transfer learning
Discriminative Features
Learned Feature Analysis (Grad-CAM)
Visualizations using Grad-CAM showed that models focus on:
- Dense code regions: .text section containing characteristic instructions for each family
- Resource sections: Import tables and data sections varying between families
- Structural patterns: Models learn to ignore padding regions (uniform areas), indicating learned features are relevant, not noise
Performance Comparison Across All Experiments
Best Configuration
Optimal Model: ResNet50 Fine-tuned with Data Augmentation
- Test Accuracy: ~95.8%
- Macro F1-Score: ~95.0%
- Minority Class Recall: +17.2 pp improvement
- Training Time: 57 minutes
- Convergence: 6 epochs
Architecture Ranking
- ResNet50 (fine-tuned): 96.30% - Transfer learning winner
- ViT-Small: 74.92% - Limited by dataset size
- Conventional CNN: 72.39% - Simple but effective
- VGG-Mini-H1 (5 blocks): 61.30% - Oversized for dataset
Practical Recommendations
Based on experimental results:
For Similar Projects:
- Use transfer learning (ResNet50, EfficientNet) for datasets <100k samples
- Apply moderate data augmentation to address class imbalance
- Avoid very deep custom CNNs without residual connections
- Start with simpler architectures and increase complexity only if justified by data scale
- Vision Transformers require datasets with millions of samples for competitive performance
Statistical Significance
All results are based on:
- Fixed train/validation/test splits (stratified)
- Fixed random seed (42) for reproducibility
- Early stopping to prevent overfitting
- Multiple metrics (accuracy, precision, recall, F1) for robust evaluation
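A minimal sketch of the seeded, stratified splitting described above; in practice scikit-learn's `train_test_split` with `stratify=` and `random_state=42` would be used, so the function and class proportions here are purely illustrative:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=42):
    # Group sample indices by class, then shuffle and slice each group
    # so every class keeps the same train/test proportion.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return train, test

labels = ["A"] * 80 + ["B"] * 20
train, test = stratified_split(labels)
# Each class keeps its proportion: 16 A's and 4 B's land in the test split.
```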