Ep 63 | Open Lakehouse Architecture: How to Scale AI to Production

Open lakehouse architecture is becoming the foundation for production-ready, enterprise-scale AI.

In this episode of The AI Forecast, Dipankar Mazumdar, Director of Developer Relations at Cloudera and co-author of the book “Engineering Lakehouse with Open Table Formats,” joins Paul Muller to explain why open lakehouse architecture is critical for moving from AI pilots to production AI.

They break down:
✅ How Apache Iceberg and open table formats decouple storage from compute
✅ How schema evolution enables change without costly data rewrites
✅ How multiple engines can securely access the same data without duplication
✅ How to prevent small-file performance bottlenecks
✅ How to control AI compute costs at scale
✅ How to embed governance, metadata, and data lineage into AI workloads

Production-ready AI requires scalable data architecture with governance built in from day one. AI and GenAI pilots may be everywhere, but architecture is what decides which ones survive.

Chapters:

00:00 Intro & Welcome to The AI Forecast

01:44 The Fast Four: Getting to Know Dipankar Mazumdar

06:51 3 Best Practices for Working With Data

09:27 Dipankar's Journey to Developer Advocacy

14:10 What Exactly Is a Data Lakehouse?

20:39 Why Write a Book on Open Table Formats?

25:35 Common Misconceptions in Lakehouse Adoption

28:43 Anti-Patterns: Streaming Workloads & The "Small Files" Problem

34:26 Balancing Cloud Costs & Context Engineering for AI

39:31 Connecting Lakehouse Architecture to Business Value

41:52 Why Governance and Lineage are Non-Negotiable

44:22 Operational Pain Points: Moving from POC to Production

47:25 The Future: How Lakehouses Power Generative AI

51:09 Conclusion & Where to Learn More

#DataLakehouse #ApacheIceberg #DataEngineering #Cloudera #OpenTableFormats #DataArchitecture #GenerativeAI #TechPodcast #DataScience