Feb 4, 2023 3 min read data engineering

Fixing iMessage search with DuckDB

Searching through your iMessage history is a terrible experience, let's fix it with DuckDB!

iMessage is an instant messaging service developed by Apple for its devices running iOS and macOS operating systems. It allows users to send text messages, photos, videos, music, and other multimedia content over Wi-Fi or cellular data to other Apple devices.

It's one of the best features of the Apple ecosystem, most of the time it actually does just work.

Except for search. Search, for some reason, is still an unsolved topic at Apple, which in current-year is honestly pretty embarrassing.

So, let's fix it.

Getting the data

Did you know that all of your iMessages are available in an SQLite database on your Mac? The "production" database is located in the folder ~/Library/Messages/chat.db. Instead of running queries on this, let's create a working copy first.

💡

This database is synced to iCloud! Always start with a copy and don't modify the original file.

❯ mkdir duckdb-imessage && cd duckdb-imessage
❯ cp ~/Library/Messages/chat.db .

Great, now we definitely won't mess up anything important!

O Pato

DuckDB is an

in-process SQL OLAP database management system

which is perfect for our use case as we want something lightweight that enables us to do some more analytical type querying.

Let's start it up and load the required extension that allows us to connect to an SQLite database on our filesystem.

❯ duckdb

v0.6.1 919cad22e8
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D INSTALL sqlite;
D LOAD sqlite;

Now, we can attach our session to the SQLite file with the sqlite_attach function.

CALL sqlite_attach('chat.db');

Time to explore!

To get an idea of what tables we can use, run the following command:

D PRAGMA show_tables;

┌───────────────────────────┐
│           name            │
│          varchar          │
├───────────────────────────┤
│ _SqliteDatabaseProperties │
│ attachment                │
│ chat                      │
│ chat_handle_join          │
│ chat_message_join         │
│ deleted_messages          │
│ handle                    │
│ kvtable                   │
│ message                   │
│ message_attachment_join   │
│ message_processing_task   │
│ sqlite_sequence           │
│ sqlite_stat1              │
│ sync_deleted_attachments  │
│ sync_deleted_chats        │
│ sync_deleted_messages     │
├───────────────────────────┤
│          16 rows          │
└───────────────────────────┘

Alright, not bad, looks like it's fairly well organized. Let's run a few sample queries to get an idea of what we can do!

Let's see how we could rank our messaging partners based on the number of messages exchanged!

SELECT
 chat.chat_identifier,
 COUNT(chat.chat_identifier) AS message_count
FROM chat
JOIN chat_message_join ON chat."ROWID" = chat_message_join.chat_id
JOIN message ON chat_message_join.message_id = message."ROWID"
GROUP BY chat.chat_identifier
ORDER BY message_count DESC LIMIT 5;

This returns something like the following (phone numbers masked for privacy!)

┌───────────────────────┬───────────────┐
│    chat_identifier    │ message_count │
│        varchar        │     int64     │
├───────────────────────┼───────────────┤
│ +xxxxxxxxxxx          │         25530 │
│ +xxxxxxxxxxx          │         14913 │
│ +xxxxxxxxxxx          │          4551 │
│ xxxxxxxxxxxxxxxxxxxxx │          2585 │
│ +xxxxxxxxxxx          │          2112 │
└───────────────────────┴───────────────┘

Let's get searching

Now, the original mission of this article is to create something that allows us to search in our iMessage history properly. Let's clean up our data a little first.

We can create a table with cleaned-up timestamps and only the fields that we need

CREATE OR REPLACE TABLE messages_clean AS (
SELECT
    message.ROWID AS message_id,
    to_timestamp(message.date / 1000000000 + epoch('2001-01-01'::TIMESTAMP)) AS message_at,
    chat.chat_identifier,
    message.text
FROM
    chat
    JOIN chat_message_join ON chat."ROWID" = chat_message_join.chat_id
    JOIN message ON chat_message_join.message_id = message. "ROWID"
ORDER BY
    message_date DESC
  );

Now we have a usable dataset that can serve as the source for our search queries.

D select column_name, data_type from information_schema.columns where table_name = 'messages_clean';

┌─────────────────┬───────────┐
│   column_name   │ data_type │
│     varchar     │  varchar  │
├─────────────────┼───────────┤
│ message_id      │ BIGINT    │
│ message_at      │ TIMESTAMP │
│ chat_identifier │ VARCHAR   │
│ text            │ VARCHAR   │
└─────────────────┴───────────┘

To search we could use the basic SQL functionality to compare strings, but we want to go a bit further than that, so let's build a Full-text Search index over our dataset.

To achieve this, DuckDB provides an extension, that we can use straight out of the box! Let's create our search index:

PRAGMA create_fts_index('messages_clean', 'message_id', '*');

This PRAGMA builds the index under a newly created schema. The schema will be named after the input table: if an index is created on the table 'main.table_name', then the schema will be named 'fts_main_table_name', so in our case, the schema is going to be called fts_main_messages_clean.

And that's it! We can run full search queries against our index now.

For example, if we want to see all the messages that contain the phrase "cute dog", we can do so using something like this:

SELECT text, score
FROM (SELECT *,
      fts_main_messages_clean.match_bm25(message_id, 'cute dog') AS score
    FROM messages_clean) sq
WHERE score IS NOT NULL
ORDER BY score DESC LIMIT 5;

And our results will look like this

┌──────────────────────────────────────────────────┬───────────────────┐
│                       text                       │       score       │
│                     varchar                      │      double       │
├──────────────────────────────────────────────────┼───────────────────┤
│ Their dog is still cute tho                      │ 8.606972176668958 │
│ There’s a very cute dog here                     │ 8.606972176668958 │
│ Ok the dog is cute but Misha is waaaaaay better  │ 5.085938104395293 │
│ Too cute                                         │ 4.594079954266784 │
│ Theyre cute                                      │ 4.594079954266784 │
└──────────────────────────────────────────────────┴───────────────────┘

Nice! The score gives a great estimation of the accuracy of the search results, as we can see from the actual text.

Exactly what we are looking for ~ and way better than anything Apple provides us so far except for all the plumbing required, but at least we had a reason to play around with DuckDB!

Getting the data

O Pato

Let's get searching

You might also like...

Distributing DuckDB From The CLI

Low-latency trading with Estuary Flow

My Data Engineering Study Framework for 2024

SQL for Google Sheets with DuckDB

BigQuery optimization fundamentals