
Effortless mock data generation for Kafka with LLMs

Ever had to generate fake data to quickly validate an idea, or to try and replicate an issue that happened in production? You fire up your local environment, Kafka cluster, and all other parts of your application, and then you realize you have to write a custom script to generate mock data! Ugh.

But it's not the end of the world, right? There are tools that help you with this! All you have to do is download the tool, configure it, connect it to your Kafka cluster, create the test topics, define the schema of the messages you need to send, and... UGH! Way too many steps!

My ideal workflow is the following:

  1. Grab a record that I can use to replicate an issue
  2. Start sending fake data based on the record
  3. Identify issue / fix / test / do whatever

Three steps, maximum!

datagen

The folks over at Materialize made a great little CLI tool that almost satisfies my requirements, but it still requires assembling a schema by hand. Even though they provide a nice interface through their schema, it's still grunt work to manually create something like this:

[
  {
    "_meta": {
      "topic": "<my kafka topic>",
      "key": "<field to be used for kafka record key>" ,
      "relationships": [
        {
          "topic": "<topic for dependent dataset>",
          "parent_field": "<field in this dataset>",
          "child_field": "<matching field in dependent dataset>",
          "records_per": <number of records in dependent dataset per record in this dataset>
        },
        ...
      ]
    },
    "<my first field>": "<method from the faker API>",
    "<my second field>": "<another method from the faker API>",
    ...
  },
  {
    ...
  },
  ...
]
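
For instance, even a tiny two-field dataset needs something like this written out by hand (the topic and field names here are made up for illustration):

[
  {
    "_meta": {
      "topic": "users",
      "key": "id",
      "relationships": []
    },
    "id": "faker.datatype.number(1000)",
    "email": "faker.internet.exampleEmail()"
  }
]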

That's a lot of thinking and writing I really don't want to do. I already have a sample record; I should be able to generate this schema from it!

Enter GPT

The answer is, obviously, LLMs! With some very basic prompt engineering, we should be able to get GPT to generate the above JSON with the proper FakerJS methods for each field type.

With a very simple function, we can make a call to the OpenAI API (or any other LLM available through LangChain.js).

import * as fs from 'fs';
import { OpenAI } from 'langchain/llms/openai';
import { PromptTemplate } from 'langchain/prompts';

export async function generateFakerRecordFromExampleRecord(exampleRecord: any) {
    // Load the few-shot prompt template from disk.
    const promptText = await fs.promises.readFile('prompt.txt', 'utf8');
    // temperature 0 keeps the output deterministic; maxTokens -1 lets the
    // completion use the remainder of the context window.
    const model = new OpenAI({ temperature: 0.0, maxTokens: -1 });
    const prompt = new PromptTemplate({
        template: promptText,
        inputVariables: ["example_json_record"],
    });
    // Substitute the example record into the template and call the model.
    const input = await prompt.format({ example_json_record: exampleRecord });
    const response = await model.call(input);
    // The prompt instructs the model to return a JSON array of faker records.
    return JSON.parse(response);
}
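
As a quick sanity check, the helper can be exercised on its own. A minimal sketch, assuming OPENAI_API_KEY is set in the environment and the sample record lives in example_record.json (both assumptions for illustration):

// Minimal usage sketch: read a sample record and print the resulting
// faker schema. 'example_record.json' is an assumed file name.
const example = await fs.promises.readFile('example_record.json', 'utf8');
const fakerRecords = await generateFakerRecordFromExampleRecord(example);
console.log(JSON.stringify(fakerRecords, null, 2));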

In the prompt, we'll instruct the model to take on the role of a JSON interpreter and give it some examples of the kind of response we expect. This technique is called few-shot prompting.

The full prompt looks like this:

You are a JSON record interpreter. You are given a JSON record and you need to generate a faker record
that will generate the same JSON record.

A random integer number is always translated to faker.datatype.number and a float is faker.datatype.float.

An example JSON record is:
{{
  "id": 1,
  "title": "iPhone 9",
  "description": "An apple mobile which is nothing like apple",
  "price": 549,
  "discountPercentage": 12.96,
  "createdAt": "2020-10-01T09:00:00.000Z",
  "camera_features": {{
    "front_camera": "12MP",
    "rear_camera": "12MP"
  }},
  "chief_designer": "John Doe",
  "Designed at": "California",
  "Produced in": "China"
}}
And the Faker JSON Array of records generated from this looks like this:
[{{
    "_meta": {{
        "topic": "mz_datagen_devices",
        "key": "id",
        "relationships": []
    }},
    "id" "faker.datatype.number({{min: 1, max: 100}})"",
    "title": "faker.commerce.productName()"",
    "description": "faker.commerce.productDescription()",
    "price": "faker.commerce.price()",
    "discountPercentage": "faker.finance.amount()",
    "createdAt": "faker.date.past()"",
    "camera_features": {{
        "front_camera": "faker.commerce.productAdjective()",
        "rear_camera": "faker.commerce.productAdjective()"
    }},
    "chief_designer": "faker.name.findName()",
    "Designed at": "faker.address.state()",
    "Produced in": "faker.address.country()"
}}]

A second example input:
{{
    "nested" : {{
        "phone": "1234567890",
        "website": "www.example.com"
    }},
    "id": 1,
    "name": "John Doe",
    "email": "johndoe@google.com",
    "website": "www.example.com"
}}

And its output:
[
    {{
        "_meta": {{
            "topic": "mz_datagen_users",
            "key": "id",
            "relationships": []
        }},
        "nested": {{
            "phone": "faker.phone.imei()",
            "website": "faker.internet.domainName()"
        }},
        "id": "faker.datatype.number(100)",
        "name": "faker.internet.userName()",
        "email": "faker.internet.exampleEmail()",
        "website": "faker.internet.domainName()"
    }}
]

Your input JSON is this:
{example_json_record}

And the generated Faker JSON Array of records are:

After adding this functionality to the datagen tool, our lazy workflow finally becomes reality!
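
Under the hood, the glue can be as simple as this sketch; options.example and runDataGenerator are illustrative names, not datagen's actual internals:

// Hypothetical wiring inside datagen's CLI entry point.
if (options.example) {
    const record = await fs.promises.readFile(options.example, 'utf8');
    // Ask the LLM to turn the sample record into a datagen schema...
    const schema = await generateFakerRecordFromExampleRecord(record);
    // ...and hand it to the normal generator loop, as if it had been
    // loaded from a hand-written schema file.
    await runDataGenerator(schema, options);
}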

Let's say we have a record like this:

{
  "id": 1,
  "title": "iPhone 9",
  "description": "An apple mobile which is nothing like apple",
  "price": 549,
  "discountPercentage": 12.96,
  "rating": 4.69,
  "stock": 94,
  "brand": "Apple",
  "category": "smartphones",
  "thumbnail": "https://i.dummyjson.com/data/products/1/thumbnail.jpg",
  "images": [
    "https://i.dummyjson.com/data/products/1/1.jpg",
    "https://i.dummyjson.com/data/products/1/2.jpg"
  ],
  "createdAt": "2020-10-01T09:00:00.000Z",
  "camera_features": {
    "front_camera": "12MP",
    "rear_camera": "12MP"
  }
}

Not that complex, just a few non-trivial data structures, but we still want to save time!

All we have to do is save this example as example_record.json, run the script, and watch the magic happen!

 datagen --example 'example_record.json'

Our prompt produced this schema for datagen:

[
    {
        "_meta": {
            "topic": "mz_datagen_products",
            "key": "id",
            "relationships": []
        },
        "id": "faker.datatype.number({min: 1, max: 100})",
        "title": "faker.commerce.productName()",
        "description": "faker.commerce.productDescription()",
        "price": "faker.commerce.price()",
        "discountPercentage": "faker.finance.amount()",
        "rating": "faker.finance.amount()",
        "stock": "faker.datatype.number()",
        "brand": "faker.company.companyName()",
        "category": "faker.commerce.department()",
        "thumbnail": "faker.image.imageUrl()",
        "images": [
            "faker.image.imageUrl()",
            "faker.image.imageUrl()"
        ],
        "createdAt": "faker.date.past()",
        "camera_features": {
            "front_camera": "faker.commerce.productAdjective()",
            "rear_camera": "faker.commerce.productAdjective()"
        }
    }
]

This fits the schema required by datagen perfectly, and produces mock data events like this:

{
   "id":45,
   "title":"Rustic Fresh Pants",
   "description":"Boston's most advanced compression wear technology increases muscle oxygenation, stabilizes active muscles",
   "price":"92.00",
   "discountPercentage":"285.32",
   "rating":"994.25",
   "stock":100,
   "brand":"Witting, Kiehn and Hettinger",
   "category":"Grocery",
   "thumbnail":"https://loremflickr.com/640/480",
   "images":{
      "0":"https://loremflickr.com/640/480",
      "1":"https://loremflickr.com/640/480"
   },
   "createdAt":"2022-10-24T19:31:13.475Z",
   "camera_features":{
      "front_camera":"Unbranded",
      "rear_camera":"Licensed"
   }
}

That's all! With this little extension, I can spin up a development environment and start producing test data in around 5 seconds. It's a small example of how LLMs can make daily life easier for developers. More to come!

You can find my fork with the LLM-based function in this repository.