Tabular GANs for uneven distribution

Oct 1, 2020

Insaf Ashrapov

Introduction

GANs are renowned for their success in realistic image generation. However, their application to tabular data is still under-explored. This blog post sheds light on some recent papers about tabular GANs in action, highlighting their potential when the data distribution differs between training and test data.

Understanding GANs

A GAN consists of two deep networks, the generator and the discriminator, trained simultaneously. The generator's goal is to produce samples that the discriminator cannot distinguish from real ones. Modern architectures such as StyleGAN 2 can create outstanding photorealistic images, but issues remain: training is slow and computationally demanding, and some domains are still challenging.
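
For intuition, here is a minimal sketch of that adversarial training loop in PyTorch. All network sizes and the synthetic "real" data are illustrative placeholders, not taken from any of the papers discussed below.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; all sizes here are illustrative.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32), nn.Tanh())
D = nn.Sequential(nn.Linear(32, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(128, 32)  # stand-in for a batch of real samples
    z = torch.randn(128, 64)     # noise input to the generator
    fake = G(z)

    # Discriminator step: push real samples toward 1, generated toward 0.
    loss_d = bce(D(real), torch.ones(128, 1)) + \
             bce(D(fake.detach()), torch.zeros(128, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make the discriminator label fakes as real.
    loss_g = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```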

Challenges

Training a model like StyleGAN 2 demands significant computational resources, and even then it may fail on tasks such as generating cats and dogs, where the data distribution is non-trivial and object types vary widely. Furthermore, GANs often struggle to generate a plausible image background.

Tabular GANs for Uneven Distribution

In light of these challenges, the question arises: what can GANs achieve on tabular data? Recent papers such as "TGAN: Synthesizing Tabular Data using Generative Adversarial Networks" and "Modeling Tabular Data using Conditional GAN (CTGAN)" propose interesting approaches to this problem.

Preprocessing Numerical Variables

With tanh, neural networks can effectively generate values distributed within (-1, 1). However, they have been shown to struggle with multimodal data. To address this, the values of each of the nc continuous variables are clustered with a Gaussian Mixture Model (GMM) with m (m = 5) components.
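
Below is a rough sketch of this mode-specific normalization using scikit-learn's GaussianMixture. The exact scaling and clipping constants vary between the papers, so treat the ones here as assumptions in the same spirit: each value is represented by its mode probabilities u and a scalar v normalized within its most likely mode.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def encode_continuous(column, n_modes=5):
    """Mode-specific normalization sketch: represent each value by its
    responsibilities over GMM modes (u) plus a scalar (v) normalized
    within the most likely mode. Constants are assumptions, not copied
    verbatim from the papers."""
    x = column.reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_modes, random_state=0).fit(x)

    u = gmm.predict_proba(x)                    # (n, m) mode probabilities
    means = gmm.means_.flatten()                # (m,) per-mode means
    stds = np.sqrt(gmm.covariances_).flatten()  # (m,) per-mode std devs

    k = u.argmax(axis=1)                        # most likely mode per value
    v = (column - means[k]) / (2 * stds[k])     # normalize within that mode
    v = np.clip(v, -0.99, 0.99)                 # keep inside tanh's range
    return v, u, gmm

# Example: a bimodal column becomes a well-scaled (v, u) pair.
col = np.concatenate([np.random.normal(0, 1, 500),
                      np.random.normal(50, 5, 500)])
v, u, gmm = encode_continuous(col)
```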

Preprocessing Categorical Variables

Categorical variables typically have low cardinality, so their probability distribution can be generated directly with softmax. It is still necessary to convert them to a one-hot encoding and add noise to the binary indicators. After preprocessing, the table T with nc + nd columns is converted into the vectors V, U, and D. These vectors form the generator's output and the discriminator's input; the GAN itself has no access to the GMM parameters.
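
A minimal sketch of that categorical preprocessing: one-hot encode, perturb each binary indicator with uniform noise, then renormalize rows so they remain valid probability vectors. The noise scale of 0.2 is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def encode_categorical(series, noise=0.2):
    """One-hot encode a categorical column, add uniform noise to each
    binary indicator, then renormalize each row into a probability
    vector. The noise scale is an assumed value."""
    rng = np.random.default_rng(0)
    onehot = pd.get_dummies(series).to_numpy(dtype=float)  # (n, n_categories)
    noisy = onehot + rng.uniform(0, noise, size=onehot.shape)
    return noisy / noisy.sum(axis=1, keepdims=True)

d = encode_categorical(pd.Series(["cat", "dog", "dog", "bird"]))
```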

Generator and Discriminator

A two-step process is used to generate each numerical variable: first the scalar value V, then the cluster vector U. Categorical features are generated as a probability distribution over all possible labels with softmax.
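
A sketch of what the generator's output heads for one continuous and one categorical column might look like in PyTorch: tanh for the scalar V, softmax over GMM modes for U, and softmax over labels for D. Layer sizes and column counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TabularGeneratorHead(nn.Module):
    """Illustrative output heads for one continuous and one categorical
    column; a real implementation emits one (v, u) pair per continuous
    column and one softmax per categorical column."""
    def __init__(self, hidden=256, n_modes=5, n_labels=4):
        super().__init__()
        self.v_head = nn.Linear(hidden, 1)         # scalar value, via tanh
        self.u_head = nn.Linear(hidden, n_modes)   # cluster/mode vector
        self.d_head = nn.Linear(hidden, n_labels)  # categorical distribution

    def forward(self, h):
        v = torch.tanh(self.v_head(h))
        u = torch.softmax(self.u_head(h), dim=-1)
        d = torch.softmax(self.d_head(h), dim=-1)
        return v, u, d

head = TabularGeneratorHead()
v, u, d = head(torch.randn(8, 256))  # batch of 8 hidden states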

To distinguish real from fake data, a Multi-Layer Perceptron (MLP) with LeakyReLU activations and batch normalization was used. The loss function adds a KL divergence term for the input variables to the ordinary GAN log loss.
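
A minimal discriminator sketch matching that description. The input width equals the total width of the concatenated (V, U, D) representation; the hidden size and depth are assumptions.

```python
import torch.nn as nn

def make_discriminator(in_dim, hidden=256):
    """MLP with LeakyReLU and batch normalization, as described above;
    depth and widths are illustrative."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.BatchNorm1d(hidden),
        nn.LeakyReLU(0.2),
        nn.Linear(hidden, hidden),
        nn.BatchNorm1d(hidden),
        nn.LeakyReLU(0.2),
        nn.Linear(hidden, 1),  # single real/fake logit
    )

disc = make_discriminator(in_dim=30)
```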

CTGAN Approach

The CTGAN paper underlines key improvements over the earlier TGAN, such as mode-specific normalization to cope with non-Gaussian and multimodal distributions. It also proposes a conditional generator and a training-by-sampling method to address uneven (imbalanced) category distributions in the data.
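
The idea behind training-by-sampling is to pick a discrete column uniformly at random, then pick one of its categories with probability proportional to the log of its frequency, so rare categories are seen often enough during training. A sketch under those assumptions (the `+ 1` smoothing and the data structure are illustrative):

```python
import numpy as np

def sample_condition(discrete_columns, rng=None):
    """Training-by-sampling sketch: choose a discrete column uniformly,
    then a category with probability proportional to log frequency.
    `discrete_columns` maps column name -> array of category counts."""
    rng = rng or np.random.default_rng()
    col = rng.choice(list(discrete_columns))       # uniform over columns
    counts = np.asarray(discrete_columns[col], dtype=float)
    logf = np.log(counts + 1)                      # log-frequency mass
    probs = logf / logf.sum()
    category = rng.choice(len(counts), p=probs)
    return col, category  # would be encoded as a one-hot condition vector

col, cat = sample_condition({"city": [900, 80, 20], "gender": [500, 500]})
```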

Conclusion

In conclusion, GANs bring us closer to higher model performance in cases of uneven distribution between training and test data. In particular, tools like TGAN and CTGAN offer promising avenues for tabular data synthesis, although there is still room for optimization and improvement before their efficacy is established across practical applications.
