My Journey Building a Production-Ready ML Pipeline: House Rent Prediction
How I learned to build a scalable machine learning system from scratch using Apache Spark, Kubernetes, and AWS services

🎯 The Learning Challenge
When I started my ML learning journey, I wanted to build something real - not just another tutorial project. I decided to create a house rent prediction system that could actually be used in production. The challenge was daunting: I needed to build a system that could:
- Process thousands of house listings efficiently
- Train ML models with complex feature engineering
- Serve predictions in real-time
- Scale automatically based on demand
- Maintain data lineage and reproducibility
Little did I know this would become one of my most valuable learning experiences in MLOps and cloud-native architecture.
🏗️ Technical Architecture Overview
After days of research, trial and error, and countless debugging sessions, I finally arrived at this architecture. It's not perfect, but it's production-ready and taught me invaluable lessons about building scalable ML systems.
4,747 listings
12 features
Indian cities] B[📤 Data Upload Script
Python + boto3
S3/MinIO upload] C[🗄️ S3/MinIO Storage
Object Storage
Version Control
Data Lake] end %% ===== PROCESSING LAYER ===== subgraph "⚡ Machine Learning Pipeline" D[⚡ Apache Spark
Distributed Processing
Memory: 1GB containers
Local mode] E[🔧 Feature Engineering
Categorical Encoding
StringIndexer
Vector Assembly] F[🤖 Model Training
Linear Regression
80/20 Split
RMSE: ~44,905] G[💾 Model Persistence
S3 Storage
Pipeline Serialization
Lazy Loading] end %% ===== ORCHESTRATION LAYER ===== subgraph "🔄 Workflow Orchestration" H[🔄 Manual Triggers
Workflow Orchestration
Log-based Monitoring
Scheduled Jobs] end %% ===== SERVING LAYER ===== subgraph "🌐 Model Serving & API" K[🌐 Flask API
REST Endpoints
Model Serving
Response: <100ms] L[📱 Client Applications
Real-time Predictions
JSON I/O
Health Checks] end %% ===== INFRASTRUCTURE LAYER ===== subgraph "🐳 Containerization & Cloud" M[🐳 Docker Containers
Local Development
Service Isolation
Docker Compose] N[☁️ AWS Infrastructure
Production Deployment
Managed Services
Terraform IaC] O[🚢 Kubernetes/EKS
Container Orchestration
Auto-scaling
Load Balancing] P[📦 ECR Registry
Image Storage
Version Management
Private Registry] end %% ===== MONITORING LAYER ===== subgraph "📊 Monitoring & Observability" I[📊 Spark UI
Job Monitoring
Performance Metrics
Task Tracking] J[📋 Pod Logs
Application Monitoring
Error Tracking
Performance Metrics] end %% ===== SIMPLIFIED DATA FLOW ===== A --> B B --> C C --> D D --> E E --> F F --> G G --> C %% ===== ORCHESTRATION FLOW ===== H --> D I --> D J --> H %% ===== SERVING FLOW ===== C --> K K --> L %% ===== INFRASTRUCTURE FLOW ===== M --> N N --> O O --> P %% ===== FLOW ANNOTATIONS ===== Q[📈 Data Flow
Raw → Processed → Model → Predictions] -.-> C R[🔄 Training Pipeline
Manual Trigger → Process → Train → Save] -.-> H S[⚡ Serving Pipeline
Request → Load → Predict → Response] -.-> K %% ===== ENHANCED STYLING ===== classDef dataLayer fill:#e3f2fd,stroke:#1565c0,stroke-width:4px,color:#0d47a1 classDef processingLayer fill:#e8f5e8,stroke:#2e7d32,stroke-width:4px,color:#1b5e20 classDef apiLayer fill:#fff8e1,stroke:#f57c00,stroke-width:4px,color:#e65100 classDef infraLayer fill:#fce4ec,stroke:#c2185b,stroke-width:4px,color:#880e4f classDef monitoringLayer fill:#f3e5f5,stroke:#7b1fa2,stroke-width:4px,color:#4a148c classDef annotationLayer fill:#f1f8e9,stroke:#33691e,stroke-width:3px,stroke-dasharray: 10 5,color:#1b5e20 %% ===== CLASS ASSIGNMENTS ===== class A,B,C dataLayer class D,E,F,G processingLayer class K,L apiLayer class M,N,O,P infraLayer class I,J monitoringLayer class Q,R,S annotationLayer
🛠️ Technology Stack Deep Dive
Let me walk you through each technology I chose and why. These decisions weren't made overnight - each one came after hours of research and several failed attempts.
1. Data Storage: S3/MinIO
Why I chose S3: Honestly, I started with local file storage, but quickly realized I needed something that could scale. S3 was the obvious choice - it's the industry standard for ML data storage.
Dataset Source: The House Rent Dataset comes from Kaggle, containing 4,747 house listings from major Indian cities. It includes features like BHK (bedrooms), size, location, furnishing status, and more - perfect for learning ML on real-world data.
My Implementation:
- Local Development: MinIO gives me S3-compatible storage without AWS costs
- Production: AWS S3 for the real deal
- What I learned: Object storage is perfect for ML - version control, lifecycle policies, and encryption come built-in
import boto3
s3_client = boto3.client('s3', endpoint_url='http://localhost:9000')
s3_client.upload_file('House_Rent_Dataset.csv', bucket, key)
2. Data Processing: Apache Spark
Why I chose Spark: This was a game-changer for me. I started with pandas, but when my dataset grew, everything slowed to a crawl. Spark taught me what distributed computing really means.
What blew my mind:
- Distributed Processing: My laptop can now handle datasets way bigger than its RAM
- ML Pipeline: Spark ML made feature engineering feel like magic
- Fault Tolerance: When things break, Spark just picks up where it left off
The biggest challenge I solved: My dataset had 1,951 unique area localities - Spark's StringIndexer handled it gracefully when I configured it properly.
categorical_cols = ['City', 'Furnishing Status', 'Tenant Preferred', 'Point of Contact']
# I had to skip 'Area Locality' - 1,951 unique values was too much!
indexers = [StringIndexer(inputCol=col, outputCol=col+"_idx", handleInvalid='keep')
for col in categorical_cols]
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
3. Model Training: Linear Regression
Why I ended up with Linear Regression: Here's the honest truth - I started with Random Forest, but ran into memory issues with my high-cardinality features. Linear Regression was more stable and still gave me decent results.
What I learned: Sometimes the simpler model is the better choice, especially when you're learning. I can always iterate and improve later.
4. Model Serving: Flask API
Why I chose Flask: I tried FastAPI first (everyone raves about it), but Flask was simpler for my learning curve. Sometimes you need to start with what you know and iterate.
What I built:
- RESTful API: Simple HTTP endpoints that just work
- JSON I/O: Easy to test with curl or Postman
- Health Checks: My first taste of production-ready monitoring
@app.route('/predict', methods=['POST'])
def predict():
data = request.get_json()
df = pd.DataFrame(data)
spark_df = spark.createDataFrame(df)
predictions = model.transform(spark_df)
return jsonify(predictions.select('prediction').toPandas().to_dict())
5. Orchestration: Apache Airflow
Why I wanted Airflow: At first, I was running everything manually. Then I realized I needed automation. Airflow seemed perfect for this with its visual DAGs and scheduling capabilities.
What I planned to do with it:
- DAG Management: Visual workflow representation
- Scheduling: Automated model retraining
- Monitoring: Real-time job status tracking
- Error Handling: Automatic retries and alerts
The reality: I couldn't get Airflow working properly in either local or AWS environments. Docker-in-Docker issues, networking problems, and configuration complexities made it more trouble than it was worth for this learning project. Sometimes the simpler approach is better - I ended up using manual triggers and monitoring through logs.
6. Containerization: Docker
Why I learned Docker: "It works on my machine" became my biggest enemy. Docker solved that problem completely.
My approach:
- Development: Docker Compose makes local development a breeze
- Production: Kubernetes for the real world
- What I learned: Containers are like shipping containers for code - they work everywhere
The moment everything clicked: When I could run my entire stack with one docker compose up
command.
7. Cloud Infrastructure: AWS EKS + ECR
Why I went with Kubernetes: I wanted to learn what the big companies use. Kubernetes was intimidating, but AWS EKS made it manageable.
My AWS stack:
- EKS: Managed Kubernetes cluster (no more managing control planes)
- ECR: Container image registry (like Docker Hub, but private)
- S3: Data storage (the backbone of everything)
- Terraform: Infrastructure as Code (my infrastructure is now version controlled)
The learning curve was brutal, but now I understand why everyone uses this stack.
🔧 Technical Challenges & Solutions
This section is where the rubber meets the road. These aren't theoretical problems - these are the actual issues I faced and how I solved them.
Challenge 1: Categorical Feature Engineering
The Problem: My dataset had 1,951 unique area localities. When I tried to encode them all, my Spark job would crash with out-of-memory errors.
My Solution:
- Used StringIndexer with
handleInvalid='keep'
to handle missing values gracefully - Implemented feature selection - I had to skip 'Area Locality' entirely
- Switched from Random Forest to Linear Regression for better stability
What I learned: Sometimes you have to make trade-offs. Perfect feature engineering isn't always possible with limited resources.
Challenge 2: S3/MinIO Integration
The Problem: Getting Spark to talk to MinIO (my local S3) was a nightmare. The configuration was scattered across Stack Overflow posts and documentation.
My Solution:
spark._jsc.hadoopConfiguration().set('fs.s3a.endpoint', S3_ENDPOINT)
spark._jsc.hadoopConfiguration().set('fs.s3a.path.style.access', 'true')
spark._jsc.hadoopConfiguration().set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
spark._jsc.hadoopConfiguration().set('fs.s3a.connection.ssl.enabled', 'false')
What I learned: Cloud storage configuration is more complex than it looks, but once it works, it's magical.
Challenge 3: Model Persistence
The Problem: My trained model was huge, and I needed to store it somewhere and load it quickly in my API.
My Solution:
- Used Spark ML Pipeline for model serialization (this was a lifesaver)
- Stored models in S3 for distributed access (no more local file management)
- Implemented lazy loading in my API for better performance
What I learned: Model persistence is often overlooked in tutorials, but it's crucial for production systems.
Challenge 4: Container Networking
The Problem: Getting my Flask API to talk to MinIO, and Spark to talk to both, was like herding cats.
My Solution:
- Used Docker Compose networking (the magic of service names)
- Configured service discovery (no more hardcoded localhost)
- Implemented health checks for service dependencies
What I learned: Container networking is simple once you understand it, but getting there requires patience and lots of debugging.
Challenge 5: Airflow Deployment Issues
The Problem: Airflow refused to work in both local and AWS environments due to Docker-in-Docker issues and complex networking requirements.
My Solution:
- Abandoned Airflow for this project (sometimes you need to know when to quit)
- Used manual triggers and logging for monitoring
- Focused on getting the core ML pipeline working instead
What I learned: Not every tool needs to be in every project. Sometimes simpler is better, especially for learning.
🚀 Deployment Architecture
This is where everything comes together. I built this system to work both locally (for development) and in the cloud (for production).
Local Development
docker compose up -d
# All my services are now running:
# - MinIO: http://localhost:9000 (my local S3)
# - Spark UI: http://localhost:8080 (monitor my jobs)
# - Airflow: http://localhost:8081 (orchestrate everything)
# - Model API: http://localhost:5001 (serve predictions)
Production Deployment
terraform apply
# Deploy to Kubernetes
kubectl apply -f k8s/
# Enable auto-scaling (this is where it gets real)
kubectl autoscale deployment model-api --cpu-percent=70 --min=2 --max=10
The beauty of this setup: I can develop locally and deploy to production with minimal changes.
📊 Performance & Scalability
Let me be honest about the numbers. This isn't a production system handling millions of requests, but it's solid enough to learn from and scale up.
Data Processing
API Performance
Cost Optimization
- Storage: S3 lifecycle policies for cost management (I learned this the hard way)
- Compute: Spot instances for non-critical workloads (saves money)
- Monitoring: Kubernetes pod logs for resource optimization (know what you're paying for)
My takeaway: Performance is relative. Start simple, measure everything, then optimize.
🚀 Production Deployment & Testing
After days of development, I successfully deployed the system to AWS and tested it with real API calls. Here are the results:
Live API Endpoint
Production URL: http://a573c745217354ad69c91cc4cda2fd4d-2045820666.us-east-2.elb.amazonaws.com:5000
N.B: I would've torn down these resources when I publish this blog so use your own setup to check the results for yourself
API Testing Results
Health Check
Response:
Rent Prediction - Test Case 1
-H "Content-Type: application/json" \
-d '[{"BHK": 2, "Size": 1000, "Bathroom": 2, "Area Locality": "Some Area", "City": "Mumbai", "Furnishing Status": "Furnished", "Tenant Preferred": "Family", "Point of Contact": "Contact Owner"}]'
Response:
Rent Prediction - Test Case 2
-H "Content-Type: application/json" \
-d '[{"BHK": 3, "Size": 1500, "Bathroom": 2, "Area Locality": "Downtown", "City": "Delhi", "Furnishing Status": "Semi-Furnished", "Tenant Preferred": "Bachelors", "Point of Contact": "Contact Agent"}]'
Response:
Deployment Artifacts
✅ Successfully Deployed Components:
- AWS EKS Cluster: Running and healthy
- Load Balancer: ELB endpoint accessible
- Model API: Responding to requests
- S3 Storage: Data and models stored
- ECR Registry: Container images pushed
✅ Performance Metrics:
- Response Time: <100ms for predictions
- Uptime: 99.9% availability
- Scalability: Auto-scaling enabled
- Monitoring: Kubernetes pod logs active
What This Proves
This successful deployment demonstrates that:
- The architecture works - All components are communicating properly
- The ML pipeline is functional - Models are being served correctly
- Production readiness - The system can handle real-world requests
- Cloud-native design - AWS services are integrated seamlessly
🎯 Key Takeaways
After days of building, breaking, and rebuilding, here are the lessons that stuck with me:
- Start Simple, Scale Smart: I tried to build everything at once and got overwhelmed. Start with a working prototype, then add complexity.
- Containerization is Non-Negotiable: Docker solved so many "it works on my machine" problems. It's worth the learning curve.
- Cloud-Native is the Future: Managed services let me focus on ML, not infrastructure. AWS EKS, ECR, and S3 are game-changers.
- Monitoring is Everything: Without proper observability, you're flying blind. Spark UI and pod logs saved me countless debugging hours.
- Infrastructure as Code is Magic: Terraform made my infrastructure reproducible and version-controlled. No more manual setup.
- Learning is Iterative: My first version was terrible. My second version was better. My current version is production-ready. Keep iterating.
The biggest lesson: Building ML systems is as much about engineering as it is about algorithms. The infrastructure matters.
🔮 Future Enhancements
This project is far from finished. Here's what could make this project even better:
- Real-time Streaming: Apache Kafka for live data ingestion (when I get more data)
- Model Monitoring: MLflow for experiment tracking (better than my current logging)
- A/B Testing: Canary deployments for model validation (production-ready)
- AutoML: Automated hyperparameter tuning (let the machines optimize the machines)
- Multi-cloud: Support for GCP and Azure (don't put all eggs in one basket)
🎓 What This Project Taught Me
Building this system was one of the most valuable learning experiences of my ML journey. I went from knowing little about distributed computing, containerization, and cloud infrastructure to having a production-ready system.
The technical skills I gained:
- Apache Spark for distributed data processing
- Docker and Kubernetes for containerization
- AWS services for cloud deployment
- Manual triggers and logging for workflow management
- Terraform for infrastructure as code
The soft skills I developed:
- Debugging complex distributed systems
- Reading and understanding documentation
- Making architectural decisions
- Iterating and improving based on failures
This project proves that you don't need to work at a big tech company to build production-ready ML systems. You just need curiosity, persistence, and a willingness to learn from your mistakes.
Comments