
Let's Learn and Share AI


Coalesce vs Repartition vs Repartition-by-Range - My Experience with them
If you've spent any time tuning Spark jobs, you've run into the classic question: do I call `coalesce()`, `repartition()`, or `repartitionByRange()`? All three change how your data is partitioned across the cluster, but they behave very differently under the hood — and choosing the wrong one is one of the easiest ways to either tank performance or quietly produce skewed output. This post walks through each one visually, explains the shuffle that sits underneath two of them,
Aashish Arora
May 26 · 7 min read
From Custom Docker Images to One‑Click Libraries: My Experience Customizing AWS EMR Serverless vs Azure Databricks Compute
Modern data platforms live or die by how quickly you can ship code to production. For one of our recent projects, that speed was determined by something deceptively simple: adding a custom Python module to our distributed jobs. We started on AWS EMR Serverless and later moved to Azure Databricks for compute jobs. Both platforms can absolutely run serious Spark workloads, but the experience of customizing the runtime for a small Python dependency could not have been more differen
Jaskirat Singh
May 18 · 9 min read
Why Databricks as a First-Party Azure Service Changes the Game
You’ve seen what Databricks can do. Here’s why running it on Azure unlocks a completely different experience. We’ve been backing Databricks for a while now. Our customers have used it on AWS to build lakehouses, run ML pipelines, and unify analytics at serious scale - and the platform has delivered. The technology isn’t in question. But Databricks on Azure isn’t just the same product in a different cloud. The partnership between Microsoft and Databricks goes deeper than hosti
Harsh Dhariwal
Apr 22 · 7 min read
Strengthening Container Security: A Practical Guide to Docker Hardened Images
Docker containers have become the backbone of modern application deployment, but with widespread adoption comes increased security scrutiny. Organizations face mounting pressure to secure their software supply chain, especially when using open-source container images that may contain packages with known Common Vulnerabilities and Exposures (CVEs). In December 2025, Docker made a groundbreaking move by releasing over 1,000 hardened container images completely free under the Ap
Jaskirat Singh
Jan 8 · 3 min read


Generating Synthetic Data Beyond Tabular Data Generation
Why This Pipeline Needed to Exist Most teams now hit a common wall: they need production‑like data, but real tables are locked behind privacy rules, legal reviews, or pure operational friction. Synthetic data promises a way out—but only if it behaves like the real thing, not just “passes the schema.” The project goal was clear and unforgiving: build a synthetic data pipeline that can plug into any PostgreSQL database with zero code changes, and still maintain close to 90% fid
Harsh Dhariwal
Dec 24, 2025 · 5 min read
Challenges in Relational Multi-Table Synthetic Data Generation
1. Introduction Synthetic data generation is increasingly important when working with sensitive or regulated datasets. While generating synthetic data for single tables is straightforward using GANs or statistical models, generating relational multi-table synthetic data is significantly more complex. Relational databases do not exist in isolation. They contain relationships that define how information flows across the system: Foreign keys (parent → child) Many-to-one (St
Akshat Gupta
Nov 19, 2025 · 5 min read


Semantic Data Matching for Large Datasets: A Scalable Pipeline
In the realm of data management, integrating information from diverse sources poses significant challenges due to variations in terminology, structure, and content. Traditional matching methods, which depend on exact or approximate string comparisons, often fail to capture underlying meanings, leading to incomplete or inaccurate alignments. To overcome this, fuzzy logic and phonetic matching became prominent approaches. Fuzzy matching uses algorithms like Levenshtein distanc
Akshat Gupta
Oct 22, 2025 · 8 min read


Entity resolution using Artificial intelligence
In the age of big data, organizations are swimming in vast oceans of information. While this data holds immense potential, its true value...
Harsh Dhariwal
Sep 23, 2025 · 8 min read