Generating Synthetic Data Beyond Tabular Data Generation
- Harsh Dhariwal
- Dec 24, 2025
- 5 min read
Why This Pipeline Needed to Exist
Most teams now hit a common wall: they need production‑like data, but real tables are locked behind privacy rules, legal reviews, or pure operational friction. Synthetic data promises a way out—but only if it behaves like the real thing, not just “passes the schema.”
The project goal was clear and unforgiving: build a synthetic data pipeline that can plug into any PostgreSQL database with zero code changes, and still maintain close to 90% fidelity to the original data’s behavior. That meant preserving correlations, skewness, foreign‑key graphs, and ugly real‑world edge cases, not just generating random rows. This kind of database‑agnostic approach is increasingly seen as critical for scaling synthetic data across diverse environments.

Beyond “Just Use a GAN”
GANs like CTGAN and related tabular architectures are a natural starting point for synthetic tabular data, and they are well documented to outperform many classical methods on single‑table benchmarks. However, anyone who has trained them on real production schemas discovers the cracks quickly: they tend to learn single‑column distributions reasonably well but struggle with preserving correlations and logical relationships across columns and tables.
This showed up starkly in a marks table where the correlation between max_score and score_obtained dropped significantly in early synthetic runs, weakening any downstream model that cared about realistic performance patterns. Similar issues are noted in studies that compare CTGAN, CopulaGAN, and copula‑only models: each has strengths, but none cleanly solves correlation preservation alone. Correlations matter in every domain—balances vs. transaction history, prices vs. costs, scores vs. max scores—so losing them turns “synthetic data” into little more than a fancy randomizer.
The key design decision was to split responsibilities: use a GAN to learn marginal distributions, and layer a Gaussian Copula model on top to restore correlation structure. Copulas are a standard technique for modeling joint distributions and dependency structures in tabular data. In practice, the pipeline transforms data, trains a GAN, then applies a copula‑based post‑correction that nudges synthetic samples to match the original correlation matrix. Internal metrics, aligned with multi‑metric evaluation frameworks such as SDMetrics, showed correlation gaps dropping into the low single‑digit percentage range and average correlation preservation climbing above 97% when this combo was enabled.
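To make the idea concrete, here is a minimal sketch of a Gaussian‑copula post‑correction of this kind: convert real and synthetic numeric columns to normal scores, re‑impose the real correlation structure via Cholesky factors, and map the result back through the synthetic marginals. It assumes clean numeric columns and a well‑conditioned correlation matrix, and the `copula_correct` helper is an illustrative name rather than the pipeline's actual API.

```python
# Minimal sketch of a Gaussian-copula post-correction: keep the GAN's marginals,
# but re-impose the real data's correlation structure. Assumes numeric columns
# with no NULLs; `copula_correct` is an illustrative name, not the pipeline's API.
import numpy as np
import pandas as pd
from scipy import stats

def copula_correct(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    cols = real.columns

    # 1. Map both datasets to Gaussian "normal scores" via their column ranks.
    def normal_scores(df: pd.DataFrame) -> np.ndarray:
        ranks = df.rank(method="average") / (len(df) + 1)  # values strictly in (0, 1)
        return stats.norm.ppf(ranks.to_numpy())

    z_real, z_syn = normal_scores(real[cols]), normal_scores(synthetic[cols])

    # 2. Cholesky factors of the correlation matrices in Gaussian space.
    l_real = np.linalg.cholesky(np.corrcoef(z_real, rowvar=False))
    l_syn = np.linalg.cholesky(np.corrcoef(z_syn, rowvar=False))

    # 3. Decorrelate the synthetic scores, then re-correlate with the real structure.
    z_fixed = z_syn @ np.linalg.inv(l_syn).T @ l_real.T

    # 4. Map back through each synthetic column's empirical quantiles so the
    #    GAN-learned marginals survive; only the dependence structure changes.
    u_fixed = stats.norm.cdf(z_fixed)
    corrected = {
        col: np.quantile(synthetic[col].to_numpy(), u_fixed[:, j])
        for j, col in enumerate(cols)
    }
    return pd.DataFrame(corrected, index=synthetic.index)
```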

Fixing High‑Cardinality Categorical Nightmares
The next major pain point was categorical columns with huge cardinality: qualifications with dozens of variants, hundreds of pickup points, long‑tail merchant names, and more. Classical encoding strategies break down here. One‑hot encoding explodes dimensionality; frequency encoding throws away meaning; and pure resampling does not handle out‑of‑vocabulary synthetic values gracefully. Similar challenges with high‑cardinality categories are widely noted as a core difficulty in tabular generative models.
To stabilize these columns, the pipeline introduced semantic awareness. Sentence‑transformer models like all‑MiniLM‑L6‑v2 are designed to map short text into dense vectors that capture semantic similarity, making them ideal for matching noisy or approximate strings against a catalog of real values. The system embeds all real categorical values once, then, whenever the GAN produces garbled or off‑distribution text, it finds the closest valid value in embedding space and replaces it. This mirrors recommended usage for sentence transformers in semantic similarity and search scenarios, where even lightweight models provide robust matching.
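A minimal sketch of that "snap back to a valid category" step is shown below, assuming the sentence-transformers package; the helper names and the example catalog are illustrative, not the pipeline's actual API.

```python
# Sketch of semantic repair for categorical values: embed the real catalog once,
# then map any garbled GAN output to its nearest valid value in embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def build_catalog(valid_values):
    """Embed the real categorical values once, up front."""
    values = sorted(set(valid_values))
    embeddings = model.encode(values, convert_to_tensor=True, normalize_embeddings=True)
    return values, embeddings

def snap_to_catalog(generated_value, values, embeddings):
    """Replace an off-distribution GAN output with the closest real value."""
    if generated_value in values:          # already valid, keep as-is
        return generated_value
    query = model.encode(generated_value, convert_to_tensor=True, normalize_embeddings=True)
    scores = util.cos_sim(query, embeddings)[0]     # cosine similarity to every valid value
    return values[int(scores.argmax())]

# Example: a half-real string gets mapped onto the nearest real pickup point.
values, embeddings = build_catalog(["Main Gate", "Library Stop", "Sports Complex"])
print(snap_to_catalog("librry stp", values, embeddings))   # -> "Library Stop"
```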
On top of that, the pipeline uses aggressive resampling: for extreme high‑cardinality columns, a very high fraction of GAN outputs is replaced with samples drawn from the original value distribution. This aligns with insights from recent synthetic‑data work showing that mixing generative outputs with targeted resampling often yields better fidelity than relying purely on a single model. The net effect is that high‑cardinality columns retain both semantic meaning and distributional shape, rather than turning into a soup of half‑real strings.
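The resampling step itself fits in a few lines. The sketch below overwrites a high fraction of GAN outputs with draws from the real column's empirical distribution; the 0.9 fraction and function name are assumptions for illustration, not the pipeline's tuned values.

```python
# Sketch of aggressive resampling for an extreme high-cardinality column:
# most GAN outputs are swapped for draws from the real value distribution,
# keeping the long tail intact. The 0.9 fraction is an assumed default.
import numpy as np
import pandas as pd

def resample_high_cardinality(real_col: pd.Series, synthetic_col: pd.Series,
                              replace_fraction: float = 0.9, seed: int = 0) -> pd.Series:
    rng = np.random.default_rng(seed)
    out = synthetic_col.copy()

    # Empirical distribution of the real column (value -> relative frequency).
    freqs = real_col.value_counts(normalize=True)

    # Choose which synthetic rows to overwrite.
    n_replace = int(len(out) * replace_fraction)
    idx = rng.choice(out.index.to_numpy(), size=n_replace, replace=False)

    # Draw replacements proportionally to the real frequencies.
    out.loc[idx] = rng.choice(freqs.index.to_numpy(), size=n_replace, p=freqs.to_numpy())
    return out
```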
Making It Truly Database‑Agnostic
The most important architectural shift was moving from a hardcoded, database‑specific implementation to a configuration‑driven, auto‑detecting system. Early prototypes baked in table names, transform maps, and column‑name logic tailored to a single school database—exactly the sort of anti‑pattern that platform‑agnostic pipeline guidance warns about. That model would have required code edits for every new customer schema, which is a non‑starter at scale.
The revised design follows three principles often emphasized in synthetic data and pipeline engineering guides: introspect, profile, and adapt. The pipeline now:
- Uses database metadata to discover tables, columns, types, and foreign keys at runtime (a minimal introspection sketch follows this list).
- Profiles each column for skewness, cardinality, NULL behavior, and PII-like patterns to decide which transforms and strategies to apply.
- Applies a layered config system where a default auto-detection policy works out of the box, and optional JSON configs let advanced users override behavior for specific schemas.
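The introspection and profiling pieces can be sketched roughly as below, assuming SQLAlchemy against PostgreSQL; the connection string, the 0.8 email threshold, and the helper names are placeholders rather than the pipeline's actual defaults.

```python
# Sketch of runtime schema discovery plus per-column profiling.
from sqlalchemy import create_engine, inspect
import pandas as pd

engine = create_engine("postgresql+psycopg2://user:pass@localhost/dbname")  # placeholder DSN
inspector = inspect(engine)

# Discover tables, columns, types, and foreign keys from database metadata.
schema = {}
for table in inspector.get_table_names():
    schema[table] = {
        "columns": {c["name"]: str(c["type"]) for c in inspector.get_columns(table)},
        "foreign_keys": [
            {"columns": fk["constrained_columns"], "references": fk["referred_table"]}
            for fk in inspector.get_foreign_keys(table)
        ],
    }

def profile_column(series: pd.Series) -> dict:
    """Lightweight per-column profile used to pick transforms and strategies."""
    values = series.dropna()
    profile = {
        "null_fraction": float(series.isna().mean()),
        "cardinality": int(values.nunique()),
        "skewness": float(values.skew()) if pd.api.types.is_numeric_dtype(series) else None,
        "pii_like_email": False,
    }
    if series.dtype == object and len(values) > 0:
        # Flag columns where most values look like emails (0.8 is a placeholder threshold).
        email_re = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
        profile["pii_like_email"] = bool(values.astype(str).str.match(email_re).mean() > 0.8)
    return profile
```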
Foreign key handling is implemented with a topological sort over the table dependency graph, an established technique that ensures parent tables are generated before their children, thereby maintaining referential integrity. This is crucial for multi‑table synthetic data, and it aligns with best practices from relational synthetic data systems that highlight dependency ordering as a first‑class concern. In tests across school, e‑commerce, healthcare, HR, finance, and IoT schemas, the same pipeline—with no code changes—maintained 100% foreign key integrity.
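A small sketch of that dependency ordering, using Python's standard-library graphlib and the schema dictionary shape from the introspection sketch above; again illustrative rather than the pipeline's code.

```python
# Dependency-ordered generation: parents first, children after, so every
# generated foreign key can point at an already-generated parent row.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def generation_order(schema: dict) -> list:
    ts = TopologicalSorter()
    for table, meta in schema.items():
        # A table depends on every table it references (self-references excluded).
        parents = {fk["references"] for fk in meta["foreign_keys"] if fk["references"] != table}
        ts.add(table, *parents)
    return list(ts.static_order())  # parents appear before their children

# Example: marks references students and exams, so both are generated first.
example = {
    "students": {"foreign_keys": []},
    "exams": {"foreign_keys": []},
    "marks": {"foreign_keys": [{"references": "students"}, {"references": "exams"}]},
}
print(generation_order(example))   # e.g. ['students', 'exams', 'marks']
```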

Evaluating Fidelity: More Than One Number
Evaluating synthetic data quality is notoriously tricky. Recent work has argued strongly for multi‑objective, model‑agnostic metrics that capture different failure modes rather than collapsing everything into a single score. Libraries such as SDMetrics embody this principle by providing a battery of metrics for distributions, correlations, boundaries, and privacy behavior.
This pipeline adopts a similar philosophy. It tracks the following, with a small computation sketch after the list:
- Correlation gap between real and synthetic numeric correlations.
- Univariate similarity via distribution tests such as Kolmogorov–Smirnov.
- Categorical similarity using overlap measures like Total Variation Distance.
- Datetime behavior via distributional and periodicity comparisons.
- Format compliance using regex checks for emails, UUIDs, phone numbers, and other structured fields.
- Foreign key integrity and logical relationship checks across tables.
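The sketch below shows how a few of these can be computed with pandas and SciPy: the correlation gap, a KS-based univariate similarity, a TVD-based categorical similarity, and regex format compliance. The function names, the simplified email regex, and the "similarity = 1 − distance" convention are assumptions for illustration, not the pipeline's exact scoring code.

```python
# Illustrative implementations of a few fidelity metrics.
import re
import numpy as np
import pandas as pd
from scipy import stats

def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between the numeric correlation matrices."""
    cols = real.select_dtypes("number").columns
    diff = real[cols].corr() - synthetic[cols].corr()
    return float(np.abs(diff.to_numpy()).mean())

def ks_similarity(real: pd.Series, synthetic: pd.Series) -> float:
    """1 - KS statistic: 1.0 means the empirical distributions coincide."""
    result = stats.ks_2samp(real.dropna(), synthetic.dropna())
    return 1.0 - float(result.statistic)

def tvd_similarity(real: pd.Series, synthetic: pd.Series) -> float:
    """1 - Total Variation Distance between the two category frequency tables."""
    p = real.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    categories = p.index.union(q.index)
    tvd = 0.5 * np.abs(p.reindex(categories, fill_value=0)
                       - q.reindex(categories, fill_value=0)).sum()
    return 1.0 - float(tvd)

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simplified, illustrative pattern

def format_compliance(synthetic: pd.Series, pattern: re.Pattern = EMAIL_RE) -> float:
    """Fraction of synthetic values matching the expected structured format."""
    values = synthetic.dropna().astype(str)
    return float(values.apply(lambda v: bool(pattern.match(v))).mean())
```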
In internal evaluations, this multi‑stage architecture consistently produced correlation gaps around 0.03–0.04, univariate similarity near or above 0.79, categorical similarity in the mid‑0.8s, datetime similarity around 0.95, and perfect foreign key adherence, resulting in an overall weighted fidelity score in the high‑80% range. These figures are competitive with, and often stronger than, what individual models like CTGAN or copula‑only approaches achieve on comparable workloads. Crucially, similar quality numbers held when the pipeline was pointed at entirely different domains, validating the database‑agnostic design in practice.
| Table | Mean abs correlation gap (↓ better) | Categorical similarity (↑ better) |
| --- | --- | --- |
| canteen_transactions | 0.0604 | 0.9463 |
| fees | 0.0816 | 0.9837 |
| marks | 0.0444 | 0.9563 |

Table 1: Per-table correlation gap and categorical similarity
| Table | Column | Similarity |
| --- | --- | --- |
| canteen_transactions | quantity | 0.7708 |
| canteen_transactions | amount | 0.8605 |
| courses | credits | 0.58 |
| fees | amount_due | 0.6366 |
| fees | amount_paid | 0.8102 |
| games_participation | season_year | 0.545 |
| marks | max_score | 0.6047 |
| marks | score_obtained | 0.7623 |
| marks | weightage_percent | 0.6077 |
| student_courses | grade_points | 0.7999 |
| teachers | experience_years | 0.7917 |
| transport_assignments | monthly_fee | 0.6925 |

Table 2: Numeric (univariate) similarity by column

Where This Goes Next
The next step on the roadmap is integrating Large Language Models (LLMs) and Small Language Models (SLMs) to generate 'smarter' synthetic data, moving beyond statistical preservation toward semantically meaningful, highly realistic examples. In healthcare, for instance, that could mean generating synthetic records for both healthy individuals and cancer patients where physiological variables such as blood pressure and heart rate stay consistent with their health labels. A crucial, parallel step is to keep Personally Identifiable Information (PII) out of the training process entirely, protecting the privacy of the underlying data.
Beyond this, the pipeline is set to expand beyond its current focus on tabular, relational schemas. Time-series-aware models are a natural next step, especially for IoT and financial data, where temporal structure is as important as per-row distributions. Graph-based methods could model richer relationship patterns and multi-hop dependencies beyond simple foreign keys. Furthermore, differential privacy techniques, increasingly discussed in healthcare and high-sensitivity settings, would add formal guarantees on top of the already synthetic nature of the data.
Finally, extending beyond PostgreSQL toward MySQL, SQL Server, cloud data warehouses, and even NoSQL backends would push the system from database-agnostic within one family to truly platform-agnostic synthetic infrastructure. The core lesson remains: high-fidelity synthetic data is achieved by orchestrating specialized components—GANs, copulas, embeddings, resampling, transforms, configs—into a pipeline that understands data structure, respects statistical properties, and adapts automatically.


