Predictions Feature – Technical Overview

1. Introduction

Our churn prediction system analyzes customer journeys end to end by ingesting both structured activity data and unstructured customer interaction data. At its core, an ensemble of machine learning models generates dynamic churn and retention scores instead of relying on fixed, linearly weighted inputs. This document explains the architecture, data collection, preprocessing, feature engineering, model training, and inference processes that make the system robust and adaptive.

2. System Architecture & Data Flow

A. Data Collection

Sources:
- Structured Data: Relational databases, data warehouses, CRMs, support systems, and marketing platforms.
- Unstructured Data: Meeting transcripts, call recordings, notes, email bodies, chat messages, and support tickets.
Collection Methods:
- SQL queries against databases.
- Reverse ETL from data warehouses.
- APIs (RESTful/GraphQL) to retrieve external data.
- Webhooks for real-time event notifications.
- File uploads (CSV/JSON) for batch data processing.
- LLM-driven extraction (e.g., audio transcription, image-to-text) for unstructured sources.

B. Data Storage & Processing

Data Storage:
A centralized storage layer securely stores both raw and processed data for continuous analysis.
Pre-Processing Tasks:
- Cleaning: Duplicate removal, outlier filtering (e.g., using Z-scores), and consistency validation.
- Normalization: Techniques like log scaling for high-frequency events.
- Fallback Strategy: When structured data is missing, rely on features extracted via unstructured data pipelines.
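
The cleaning and normalization steps above can be illustrated with a minimal sketch. The function names, the Z-score threshold of 3.0, and the sample data are assumptions made for illustration; the production pipeline may use different thresholds or a different scaling function.

```python
import math
import statistics


def filter_outliers_zscore(values, threshold=3.0):
    """Drop values whose absolute Z-score exceeds the threshold.

    The threshold of 3.0 is an illustrative assumption, not a documented setting.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mean) / stdev <= threshold]


def log_scale(count):
    """Compress high-frequency event counts with log(1 + count)."""
    return math.log1p(count)


# Hypothetical daily event counts with one obvious outlier (900).
daily_events = [12, 15, 11, 14, 13, 12, 16, 14, 13, 15, 12, 900]
cleaned = filter_outliers_zscore(daily_events, threshold=3.0)
scaled = [log_scale(v) for v in cleaned]
```

Using log(1 + count) rather than a plain logarithm keeps zero counts well defined, which matters for accounts with no recent activity.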
Aggregation:
Summary statistics (e.g., event counts over various time windows) are calculated to provide context for churn predictions.
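
As a rough sketch of the aggregation step, the snippet below counts events per time window and emits keys in the activity:<window>_days:<event> style used elsewhere in this document. The 7- and 90-day windows, the function name, and the sample events are assumptions for illustration.

```python
from datetime import datetime, timedelta


def count_events_in_windows(events, as_of, windows_days=(7, 30, 90)):
    """Count events per (window, event name) relative to a reference time.

    `events` is an iterable of (event_name, timestamp) pairs.
    The window sizes are illustrative assumptions.
    """
    counts = {}
    for event_name, timestamp in events:
        age = as_of - timestamp
        if age < timedelta(0):
            # Ignore events that are later than the reference time.
            continue
        for days in windows_days:
            if age <= timedelta(days=days):
                key = f"activity:{days}_days:{event_name}"
                counts[key] = counts.get(key, 0) + 1
    return counts


as_of = datetime(2024, 6, 30)
events = [
    ("App Page Viewed", datetime(2024, 6, 28)),
    ("App Page Viewed", datetime(2024, 6, 10)),
    ("Conversation Created", datetime(2024, 4, 15)),
    ("App Page Viewed", datetime(2024, 1, 1)),
]
window_counts = count_events_in_windows(events, as_of)
```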

3. Machine Learning Pipeline

A. Data Preparation & Feature Engineering

Labeling:
Historical records are labeled using known churn events (sourced from CRM timestamps or manual annotations).
Feature Creation:
- Activity Features:
  - e.g., activity:30_days:Conversation Created, activity:30_days:App Page Viewed
- Metric Features:
  - e.g., Conversation Sentiment, Support Sentiment.
- Trait Features:
  - e.g., customer country (trait:country=austria, trait:country=united states), property flags such as property:missing_statement_months_not_available.
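
One way to picture how the three feature families combine into a single model input is sketched below. The helper names, the sentiment value range, and the one-hot encoding of traits are assumptions for illustration; only the feature key formats are taken from the examples above.

```python
def build_feature_row(activity_counts, metrics, traits, flags):
    """Merge activity, metric, trait, and property-flag features into one dict."""
    row = {}
    for key, count in activity_counts.items():
        row[key] = float(count)
    for name, value in metrics.items():
        row[name] = float(value)
    for name, value in traits.items():
        # Categorical traits become one-hot keys such as trait:country=austria.
        row[f"trait:{name}={str(value).lower()}"] = 1.0
    for flag in flags:
        row[f"property:{flag}"] = 1.0
    return row


def to_vector(row, columns):
    """Project a feature dict onto a fixed column order, using 0.0 when absent."""
    return [row.get(column, 0.0) for column in columns]


row = build_feature_row(
    activity_counts={
        "activity:30_days:Conversation Created": 3,
        "activity:30_days:App Page Viewed": 42,
    },
    # Sentiment values in [-1, 1] are an assumed scale.
    metrics={"Conversation Sentiment": 0.6, "Support Sentiment": -0.2},
    traits={"country": "Austria"},
    flags=["missing_statement_months_not_available"],
)
columns = [
    "activity:30_days:Conversation Created",
    "activity:30_days:App Page Viewed",
    "Conversation Sentiment",
    "Support Sentiment",
    "trait:country=austria",
    "trait:country=united states",
    "property:missing_statement_months_not_available",
]
vector = to_vector(row, columns)
```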
Transformations:
- Normalization (scaling) of numeric features.
- Regularization (lasso, ridge) during model fitting to mitigate multicollinearity.
- Business rules applied as penalty adjustments.

B. Feature Selection

Statistical Analysis:
- Correlation analysis (e.g., Pearson, Spearman) to identify strong churn predictors.
Importance Ranking:
- Algorithms like Random Forest or XGBoost rank features based on their predictive value.
Dimensionality Reduction:
- Applying PCA (Principal Component Analysis) to reduce feature space while retaining essential information.
Adaptability:
- The system can select features solely from unstructured data if needed.
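
The correlation step can be sketched as a Pearson ranking of each feature against the churn label. This is a simplified illustration with made-up data and hypothetical function names; it does not cover Spearman correlation, tree-based importance, or PCA.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant inputs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features_by_correlation(feature_columns, labels):
    """Sort features by absolute correlation with the churn label, strongest first."""
    scores = {name: pearson(values, labels) for name, values in feature_columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical data: 1 = churned, 0 = retained.
churned = [1, 1, 1, 0, 0, 0]
feature_columns = {
    "activity:30_days:App Page Viewed": [1, 2, 0, 9, 8, 10],
    "Conversation Sentiment": [0.2, 0.6, 0.4, 0.5, 0.3, 0.7],
    "trait:country=austria": [1, 1, 1, 1, 1, 1],
}
ranked = rank_features_by_correlation(feature_columns, churned)
```

In this toy data, page views correlate strongly and negatively with churn, while the constant trait column carries no signal and ranks last.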

C. Model Training

Ensemble Strategy:
- Uses a variety of algorithms (logistic regression, decision trees, random forest, SVMs, neural networks) based on data context.
Dynamic Calibration:
- The model dynamically adjusts the weight of each factor; it does not operate on a simple fixed-sum weighting.
Hyperparameter Tuning:
- Grid search or random search, scored with k-fold cross-validation, is used to select hyperparameters.
Validation:
- Held-out validation sets are used to check that models generalize to unseen data.
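
To make the tuning loop concrete, here is a minimal grid search with k-fold cross-validation written against the standard library only. The toy "model" is a single activity threshold and nothing is actually fitted on the training folds; the function names, data, and grid values are assumptions for illustration, not the production training code.

```python
import itertools
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def grid_search_cv(param_grid, fit_and_score, n_samples, k=5):
    """Return the parameter combination with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            fit_and_score(params, train, validation)
            for train, validation in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy data: accounts with fewer than 8 page views churned.
page_views = list(range(20))
churned = [1 if views < 8 else 0 for views in page_views]


def fit_and_score(params, train, validation):
    """Accuracy of a threshold rule on the validation fold (train is unused here)."""
    hits = sum(
        1
        for idx in validation
        if (1 if page_views[idx] < params["threshold"] else 0) == churned[idx]
    )
    return hits / len(validation)


best_params, best_score = grid_search_cv(
    {"threshold": [4, 8, 12]}, fit_and_score, n_samples=len(page_views), k=5
)
```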

D. Model Evaluation

Performance Metrics:
- Accuracy, precision, recall, F1 score, and AUC-ROC.
Adaptability:
- The evaluation process accommodates both structured and unstructured data inputs, so model performance can be assessed regardless of data modality.
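
For reference, the listed metrics can be computed as in the sketch below. The 0.5 decision threshold and the sample labels and scores are assumptions for illustration. Because churners are often a minority class, accuracy alone can look good while recall is poor, which is why the other metrics are reported alongside it.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = churned)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """AUC-ROC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted churn probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
churn_scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in churn_scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, churn_scores)
```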

4. Prediction Generation & Scoring

A. Inference Mechanism

Score Calculation:
- Two separate models generate a churnScore and a retentionScore (both between 0 and 1).
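- The composite score is computed as:
  composite score = retentionScore - churnScore
  Because both inputs lie between 0 and 1, the composite score ranges from -1 (high churn risk) to 1 (strong retention).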
Ensemble Aggregation:
- Predictions are aggregated from multiple models, with weights derived dynamically based on validation performance.
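
The scoring step can be sketched as follows. The composite formula is the one given above; the weighting rule (weights proportional to each model's validation AUC) is an assumption chosen for illustration, since this document does not specify how the weights are derived. The model names and values are hypothetical.

```python
def ensemble_predict(predictions, validation_scores):
    """Weighted average of per-model predictions.

    Weights are proportional to each model's validation score (assumed rule).
    """
    total = sum(validation_scores[name] for name in predictions)
    if total <= 0:
        raise ValueError("validation scores must sum to a positive value")
    return sum(
        predictions[name] * validation_scores[name] / total for name in predictions
    )


def composite_score(retention_score, churn_score):
    """composite score = retentionScore - churnScore, with inputs in [0, 1]."""
    for value in (retention_score, churn_score):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
    return retention_score - churn_score


# Hypothetical per-model churn predictions and validation AUC values.
churn_predictions = {
    "logistic_regression": 0.30,
    "random_forest": 0.20,
    "neural_network": 0.40,
}
validation_auc = {
    "logistic_regression": 0.80,
    "random_forest": 0.90,
    "neural_network": 0.70,
}

churn_score = ensemble_predict(churn_predictions, validation_auc)
retention_score = 0.75  # would come from the separate retention model
composite = composite_score(retention_score, churn_score)
```

With these inputs the better-validated random forest pulls the ensemble churn score below the simple average of 0.30, and the composite score lands on the positive (healthy) side of the -1 to 1 range.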

B. Context-Dependent Factor Importance

Dynamic Weighting:
- The system adjusts the importance of each feature based on customer behavior and historical context.
- For example, activity:30_days:App Page Viewed might have a low, positive impact on a customer’s health score in one context and carry a different significance when combined with other metrics.
Unstructured Data Signals:
- In addition to structured signals, the analysis incorporates insights from conversation transcripts, sentiment data, and, when available, non-verbal cues (gestures, tone of voice).
Scenario Adaptability:
- The relative importance of factors may shift dynamically, especially in cases where only unstructured data is available.

5. User Interface & Visualization (Internal View)

Dashboard Overview:
- Displays churn and composite scores alongside granular data on feature contributions.
Visualization Components:
- Charts and graphs illustrating trends, patterns, and what-if scenarios.
- Detailed metrics showing the impact of each feature:
  - Example:
    - activity:30_days:App Page Viewed: Indicates the number of app page views over the past 30 days; shows a low, positive impact.
    - property:missing_statement_months_not_available: The presence of this flag may indicate a high negative impact on customer health.
Interactivity:
- Allows internal users to filter data by account type, industry, region, etc., to quickly diagnose key drivers of churn risk.

6. Why This Approach?

A. Ensemble-Based and Data-Driven Weights

No Fixed Weights:
- Unlike linear equations that assign fixed percentages (e.g., 40/30/30), our model learns and dynamically adjusts the weights based on predictive power.
Enhanced Accuracy:
- This approach captures the nuanced, context-specific impact of various signals, which is intended to yield more accurate churn predictions.

B. Adaptability of the System

Data Flexibility:
- The system functions effectively even when only unstructured data is available, ensuring resilience against data sparsity.
Continuous Learning:
- As customer behavior evolves, the model recalibrates its weighting strategy to continuously reflect current engagement patterns.

7. Conclusion

Our churn prediction model is a sophisticated, adaptive system that leverages both traditional activity metrics and rich, contextual insights drawn from unstructured data. The ensemble-based approach, with dynamic weighting and rigorous validation, ensures that our churn and retention predictions accurately convey customer health. This internal document serves as a reference for engineering and product teams involved in further development, troubleshooting, or integration work related to the churn prediction feature. For additional technical insights, please refer to the technical deep dive on revenue protection AI available here.