Jan Doubravský 33aa126ff6 Add incremental delta refresh and fix Decimal/datetime cache binding

2026-06-05 11:09:16 +02:00

SQLmem

SQLmem is a transparent in-memory cache layer between SQLAlchemy and your database. Drop it in front of any SQLAlchemy engine: SELECT queries are served from a fast in-memory SQLite cache, and writes are rejected (the cache is read-only).

Goals

SQLmem sits between your application and the database and behaves like a normal SQLAlchemy connection. It transparently:

Intercepts every query that passes through it and learns, from the SQL itself, which tables and which columns the application actually uses.
Holds exactly those tables/columns locally in SQLite — primarily in RAM, secondarily persisted to disk (cache.db) at regular intervals and on shutdown.
Serves repeated queries from RAM with no database round-trip.
Stays in sync incrementally (see Incremental refresh): for large tables you declare a change-timestamp column, and SQLmem re-fetches only the rows that changed since the previous sync (a few minutes ago while running, or since the last shutdown after a restart) instead of reloading tens of millions of rows on every start.

The application keeps calling SQL as usual — the cache is an implementation detail behind the interface.

How it works

flowchart TB
    App["Application (SQLAlchemy code)"]
    DB[("Source database")]

    subgraph SM["SQLmem - transparent cache layer"]
        direction TB
        P["SQL Parser (sqlglot)<br/>detect SELECT vs write<br/>extract tables + columns"]
        R["Column Registry<br/>tracks tables + columns in cache"]
        QE["Query Executor<br/>cache hit / miss / refetch"]
        MEM[("In-memory SQLite - PRIMARY")]
        DISK[("cache.db on disk - SECONDARY")]
        P --> R --> QE --> MEM
        MEM -->|"backup every N s + on shutdown"| DISK
        DISK -->|"load on startup"| MEM
    end

    App -->|"execute(sql, params)"| P
    QE -->|"cache miss / delta refresh only"| DB
    DB -->|"rows"| MEM
    MEM -->|"list of dicts"| App

On the first SELECT touching a table, SQLmem fetches the required columns for every row of that table from the database and stores them in the in-memory SQLite database. Subsequent queries for the same columns hit RAM with no database round-trip. When a query requests a column not yet cached, SQLmem re-fetches the table with the expanded column set. Parametrized queries, JOINs and SELECT * are all supported; each table in a JOIN is cached independently and the JOIN runs inside the in-memory SQLite database.

Query lifecycle

sequenceDiagram
    participant App
    participant SQLmem
    participant Mem as In-memory SQLite
    participant DB as Source DB

    App->>SQLmem: execute(SELECT a, b FROM t WHERE id = ?, params)
    SQLmem->>SQLmem: parse -> table = t, columns = {a, b, id}
    alt columns already cached
        SQLmem->>Mem: run query in RAM (with params)
        Mem-->>SQLmem: rows
    else cache miss or new column
        SQLmem->>DB: SELECT a, b, id FROM t   (whole table, no WHERE)
        DB-->>SQLmem: rows
        SQLmem->>Mem: store / expand table
        SQLmem->>Mem: run query in RAM (with params)
        Mem-->>SQLmem: rows
    end
    SQLmem-->>App: list[dict]

Note: query parameters are applied only to the in-memory query, never to the source fetch — a cache load always pulls the full table so the cache can answer any later WHERE on those columns.
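
The lifecycle above can be sketched with the standard library alone. This is an illustration under stated assumptions, not SQLmem's implementation: a second in-memory SQLite database stands in for the source database, the `execute` helper and the `cached_columns` registry are hypothetical names, and the table and column set are passed in by hand where SQLmem would obtain them from the parser.

```python
import sqlite3

# Stand-in for the source database (assumption: any DB-API connection would do).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE t (id INTEGER, a TEXT, b TEXT, c TEXT)")
source.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, "a1", "b1", "c1"), (2, "a2", "b2", "c2")],
)

cache = sqlite3.connect(":memory:")
cache.row_factory = sqlite3.Row
cached_columns: dict[str, set[str]] = {}  # hypothetical registry: table -> columns held
source_fetches = 0


def execute(sql, params, table, columns):
    """Serve a SELECT from the cache, loading the table first if needed."""
    global source_fetches
    have = cached_columns.get(table, set())
    if not columns <= have:
        wanted = sorted(have | columns)
        col_list = ", ".join(wanted)
        # Source fetch: whole table, no WHERE, no params.
        rows = source.execute(f"SELECT {col_list} FROM {table}").fetchall()
        source_fetches += 1
        cache.execute(f"DROP TABLE IF EXISTS {table}")
        cache.execute(f"CREATE TABLE {table} ({col_list})")
        marks = ", ".join("?" for _ in wanted)
        cache.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
        cached_columns[table] = set(wanted)
    # Params are bound only here, against the in-memory copy.
    return [dict(r) for r in cache.execute(sql, params)]


first = execute("SELECT a, b FROM t WHERE id = ?", (1,), "t", {"a", "b", "id"})
second = execute("SELECT a, b FROM t WHERE id = ?", (2,), "t", {"a", "b", "id"})
print(first, second, source_fetches)
```

The second call uses a different parameter value but triggers no source fetch, because the first load pulled every row of the three columns.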

Installation

pip install sqlmem
# or with Poetry
poetry add sqlmem

Requires Python 3.14.

Quick start

from sqlmem import CachingEngine
from sqlalchemy import create_engine

base_engine = create_engine("postgresql://user:pass@host/db")
engine = CachingEngine(base_engine)

# Use exactly like a regular SQLAlchemy engine:
results = engine.execute("SELECT id, name FROM users WHERE status = 'active'")
for row in results:
    print(row["id"], row["name"])

# Positional parameters (?):
engine.execute("SELECT id, name FROM users WHERE id = ?", ("42",))

# Named parameters (:name):
engine.execute("SELECT id, name FROM users WHERE id = :id", {"id": "42"})

# JOINs — each table is cached independently:
engine.execute(
    "SELECT u.name, o.total FROM users u "
    "JOIN orders o ON o.user_id = u.id WHERE u.id = ?",
    ("42",),
)

# SELECT * — loads and caches the whole table:
engine.execute("SELECT * FROM users")

execute() returns a list of dicts. Parameters are passed straight through to SQLite, so positional (?) and named (:name) styles both work.
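
Both parameter styles are native to SQLite, which is why they can be passed through unchanged. The snippet below shows this with the standard-library `sqlite3` module directly (not through SQLmem); the `users` table and its single row are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute("CREATE TABLE users (id TEXT, name TEXT)")
conn.execute("INSERT INTO users VALUES ('42', 'Ada')")

# Positional style: a sequence bound to ? placeholders.
positional = [dict(r) for r in conn.execute(
    "SELECT id, name FROM users WHERE id = ?", ("42",))]

# Named style: a mapping bound to :name placeholders.
named = [dict(r) for r in conn.execute(
    "SELECT id, name FROM users WHERE id = :id", {"id": "42"})]

print(positional == named)
```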

Cache behaviour

Column accumulation — SQLmem learns which columns your app needs at runtime, no upfront configuration required:

Query 1: SELECT a, b FROM orders   → cache miss → fetch orders(a, b) from DB
Query 2: SELECT a, d FROM orders   → new column d → re-fetch orders(a, b, d)
Query 3: SELECT b FROM orders      → cache hit, no DB query
Query 4: SELECT * FROM orders      → fetches all columns, marks the table fully cached
Query 5: SELECT a FROM orders      → cache hit (table already full)

SELECT * loads every column and marks the table as fully cached, so any later column query is a guaranteed cache hit with no re-fetch.
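
The hit / miss / re-fetch decision described above fits in a small registry. The following is a sketch of the idea only; `TableEntry`, `ColumnRegistry` and `resolve` are hypothetical names, and passing `columns=None` to mean SELECT * is a convention chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    columns: set[str] = field(default_factory=set)
    full: bool = False  # set once SELECT * has loaded every column


class ColumnRegistry:
    """Hypothetical registry: decides hit / miss / refetch per query."""

    def __init__(self):
        self.tables: dict[str, TableEntry] = {}

    def resolve(self, table, columns):
        """Return (outcome, columns_to_fetch); columns=None stands for SELECT *."""
        entry = self.tables.get(table)
        if entry is None:
            entry = self.tables[table] = TableEntry()
            outcome = "miss"
        elif entry.full or (columns is not None and columns <= entry.columns):
            return "hit", set()
        else:
            outcome = "refetch"
        if columns is None:
            entry.full = True
            return outcome, {"*"}
        entry.columns |= columns
        return outcome, set(entry.columns)


registry = ColumnRegistry()
queries = [{"a", "b"}, {"a", "d"}, {"b"}, None, {"a"}]  # the five queries above
outcomes = [registry.resolve("orders", q)[0] for q in queries]
print(outcomes)
```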

Writes are blocked — INSERT, UPDATE, and DELETE raise ReadOnlyError. SQLmem is a read-only cache.
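
As a rough illustration of the write guard, the sketch below classifies a statement by its leading keyword. This is a deliberate simplification: SQLmem detects writes with its sqlglot-based parser, and the `ReadOnlyError` class and `reject_writes` function defined here are local stand-ins, not the library's own objects.

```python
class ReadOnlyError(Exception):
    """Local stand-in for the exception of the same name exported by sqlmem."""


WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE"}


def reject_writes(sql: str) -> None:
    """Raise ReadOnlyError if the statement starts with a write keyword."""
    stripped = sql.lstrip()
    first_word = stripped.split(None, 1)[0].upper() if stripped else ""
    if first_word in WRITE_KEYWORDS:
        raise ReadOnlyError(f"{first_word} is not allowed: the cache is read-only")


reject_writes("SELECT id FROM users")  # passes silently
try:
    reject_writes("UPDATE users SET name = 'x'")
except ReadOnlyError as exc:
    print(exc)
```

A keyword check like this would misjudge statements that begin with a comment or a CTE, which is one reason a real parser is the better tool.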

Incremental (delta) refresh

Reloading a table with tens of millions of rows on every startup is unacceptable. To avoid it, SQLmem keeps the cache in sync by pulling only changed rows. For each delta-tracked table you declare its last-change timestamp column and the key column(s) that identify a row:

from sqlmem import CachingEngine, DeltaConfig

engine = CachingEngine(
    base_engine,
    delta={
        "VW_P_PRATVALUES": DeltaConfig(
            change_column="LAST_CHANGE_DATE",   # required — the row's change timestamp
            key_columns=["PRODUCT_PRODUCTNR"],  # optional for base tables (auto-discovered)
        ),
    },
)

What you must configure, and what is automatic:

| Item | Source |
| --- | --- |
| which tables / columns to cache | automatic: learned from the queries that pass through |
| `change_column` (timestamp) | manual, always: its meaning can't be inferred from the column type (*) |
| `key_columns` (primary key) | auto-discovered for real tables (`inspect(engine).get_pk_constraint`); manual for views, which carry no key in the DB catalog |

(*) The one exception is a true MSSQL rowversion/timestamp-typed column, which is unique per table and auto-maintained, so it could be detected automatically. A plain DATETIME like LAST_CHANGE_DATE cannot.

If key_columns is omitted, SQLmem tries to read the primary key from the source DB on startup and raises a clear error if it can't (e.g. for a view) so you can supply it explicitly.
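
The difference between tables and views is easy to see in SQLite, used here as a stand-in catalog. This sketch swaps in `PRAGMA table_info` for SQLAlchemy's inspector so that it needs only the standard library; `primary_key_columns` and `resolve_key_columns` are hypothetical helper names, and the `product` table is invented for the example.

```python
import sqlite3


def primary_key_columns(conn, name):
    """Return the primary-key columns of a table or view, in key order."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk);
    # pk is the 1-based position in the primary key, or 0 if not part of it.
    info = conn.execute(f"PRAGMA table_info({name})").fetchall()
    keyed = sorted((row[5], row[1]) for row in info if row[5] > 0)
    return [col for _, col in keyed]


def resolve_key_columns(conn, name, configured=None):
    """Prefer explicitly configured keys, fall back to discovery, else fail."""
    keys = configured or primary_key_columns(conn, name)
    if not keys:
        raise ValueError(f"no primary key found for {name!r}; pass key_columns explicitly")
    return keys


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (productnr TEXT PRIMARY KEY, label TEXT)")
conn.execute("CREATE VIEW vw_product AS SELECT productnr, label FROM product")

print(resolve_key_columns(conn, "product"))                     # discovered from the catalog
print(resolve_key_columns(conn, "vw_product", ["productnr"]))   # supplied by hand for the view
```

Calling `resolve_key_columns(conn, "vw_product")` without a configured key raises `ValueError`, mirroring the startup error described above.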

How sync works

The boundary of "what changed since last time" is a data-driven watermark, not a wall-clock window. SQLmem persists, per delta-tracked table, last_synced_at = the maximum change_column value actually present in the cache after the previous sync (stored in cache.db, so it survives restarts). The next sync pulls WHERE change_column >= last_synced_at.

Why a watermark and not now − 5 min:

No clock dependency — it compares DB values to DB values, so app-server vs database clock skew is irrelevant.
Survives downtime for free — after hours offline, >= watermark pulls everything since then; "catch up since last shutdown" needs no special case.
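Handles late commits better: a wall-clock window drops a row whose timestamp has already fallen outside the window by the time it commits, whereas the watermark only advances to values actually seen in the data. (A row that commits with a timestamp older than the current watermark can still be missed, so this relies on change_column being non-decreasing in commit order.)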

The filter is >= (not >) so rows sharing the exact boundary timestamp are re-read. Combined with an idempotent upsert by key_columns, re-reading a handful of boundary rows each tick is harmless (they overwrite themselves), and no row at the boundary is skipped. The 5-minute interval is only the polling cadence, never the filter boundary.

sequenceDiagram
    participant Trigger as Startup / every 5 min
    participant SQLmem
    participant Mem as In-memory SQLite
    participant DB as Source DB

    Trigger->>SQLmem: refresh delta-tracked tables
    SQLmem->>Mem: read last_synced_at for table
    SQLmem->>DB: SELECT * FROM t WHERE LAST_CHANGE_DATE >= last_synced_at
    DB-->>SQLmem: only rows changed since the watermark
    SQLmem->>Mem: upsert rows by key_columns (INSERT OR REPLACE)
    SQLmem->>Mem: last_synced_at = max(LAST_CHANGE_DATE)

First use of a delta table → full load; the watermark is set to the table's current max(change_column).
On startup → for each delta table restored from disk, a single catch-up query pulls everything changed since the last shutdown and upserts it, bringing the cache back in sync without a full reload.
While running → a background thread repeats the delta pull every SQLMEM_REFRESH_INTERVAL seconds (default 5 minutes), so the cache trails the source DB by at most that interval.
Tables without a DeltaConfig keep the default behaviour: full load on miss, never auto-refreshed.
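
The watermark sync can be sketched end to end with two SQLite databases. This is a minimal illustration, not SQLmem's code: the `sync_state` table, the `delta_sync` function and the column names `k`, `v`, `changed` are all assumptions made for the example, and timestamps are stored as ISO-8601 text so that string comparison matches chronological order.

```python
import sqlite3

source = sqlite3.connect(":memory:")  # stand-in for the source database
source.execute("CREATE TABLE t (k TEXT PRIMARY KEY, v TEXT, changed TEXT)")
source.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("p1", "old", "2026-01-01 10:00:00"),
    ("p2", "old", "2026-01-01 10:05:00"),
])

cache = sqlite3.connect(":memory:")
# The PRIMARY KEY on the key column is what makes INSERT OR REPLACE an upsert.
cache.execute("CREATE TABLE t (k TEXT PRIMARY KEY, v TEXT, changed TEXT)")
cache.execute("CREATE TABLE sync_state (table_name TEXT PRIMARY KEY, last_synced_at TEXT)")


def delta_sync():
    """Pull rows at or after the watermark, upsert them, advance the watermark."""
    row = cache.execute(
        "SELECT last_synced_at FROM sync_state WHERE table_name = 't'").fetchone()
    if row is None:
        # First use: no watermark yet, so do a full load.
        rows = source.execute("SELECT k, v, changed FROM t").fetchall()
    else:
        rows = source.execute(
            "SELECT k, v, changed FROM t WHERE changed >= ?", (row[0],)).fetchall()
    cache.executemany("INSERT OR REPLACE INTO t VALUES (?, ?, ?)", rows)
    watermark = cache.execute("SELECT max(changed) FROM t").fetchone()[0]
    cache.execute("INSERT OR REPLACE INTO sync_state VALUES ('t', ?)", (watermark,))
    return len(rows)


pulled_first = delta_sync()   # full load of both rows
source.execute("UPDATE t SET v = 'new', changed = '2026-01-01 10:09:00' WHERE k = 'p1'")
pulled_second = delta_sync()  # boundary row p2 plus the changed row p1
print(pulled_first, pulled_second)
```

The second sync re-reads `p2` only because it sits exactly on the watermark; the upsert overwrites it with identical values, so the cache still holds two rows.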

Requirements and limits of delta sync

The change_column must be set by the source DB on every insert/update and be non-decreasing (e.g. a DATETIME/rowversion/timestamp maintained by a trigger or the application).
key_columns must uniquely identify a row — they are used to upsert changed rows in place.
Updates, including "deletes by nulling" (a row that keeps its identity but has values cleared), are handled automatically: the change timestamp bumps, the row is re-pulled and overwritten in place.
Structural changes are not covered by delta sync — adding/removing attributes, or clearing values without bumping change_column, won't be picked up. For those, force a clean reload with engine.reset() (or invalidate() for a single table).
Hard DELETEs of whole rows are not detected by a change timestamp. If your workload deletes rows, use a soft-delete flag column or reset().

Persistence

The in-memory cache is persisted to cache.db on disk:

On startup: if cache.db exists, it is loaded into memory.
Periodically: a background thread writes a snapshot to disk every SQLMEM_BACKUP_INTERVAL seconds.
On shutdown: a final flush runs via an atexit hook and a SIGTERM handler.

The schema version is checked on load — if it does not match, the stale file is discarded and the cache is rebuilt from the database.
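
One way to implement this snapshot cycle is SQLite's online backup API, exposed in Python as `Connection.backup`. The sketch below is an assumption about how it could be done, not a description of SQLmem's internals: `save_snapshot`, `load_snapshot` and `SCHEMA_VERSION` are hypothetical names, and storing the version in `PRAGMA user_version` is a choice made for this example.

```python
import os
import sqlite3
import tempfile

SCHEMA_VERSION = 1  # hypothetical version number, stored in PRAGMA user_version


def save_snapshot(mem, path):
    """Copy the in-memory database to disk using SQLite's online backup API."""
    mem.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
    mem.commit()
    disk = sqlite3.connect(path)
    try:
        mem.backup(disk)
    finally:
        disk.close()


def load_snapshot(path):
    """Return an in-memory database, restored from `path` when it is usable."""
    mem = sqlite3.connect(":memory:")
    if not os.path.exists(path):
        return mem
    disk = sqlite3.connect(path)
    try:
        version = disk.execute("PRAGMA user_version").fetchone()[0]
        if version == SCHEMA_VERSION:
            disk.backup(mem)
    finally:
        disk.close()
    if version != SCHEMA_VERSION:
        os.remove(path)  # stale file: discard it and start with an empty cache
    return mem


path = os.path.join(tempfile.mkdtemp(), "cache.db")
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE orders (a, b)")
mem.execute("INSERT INTO orders VALUES (1, 2)")
save_snapshot(mem, path)
restored = load_snapshot(path)
print(restored.execute("SELECT a, b FROM orders").fetchall())
```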

Manual cache control

engine.invalidate("orders")   # drop one table from cache; next query re-fetches it from DB
engine.reset()                # wipe the whole cache (RAM + cache.db) — full clean slate
engine.refresh()              # pull deltas for all delta-tracked tables now
engine.close()                # flush to disk and shut down background thread

Use reset() after a structural change in the source (columns added/removed, values cleared in bulk without bumping the change timestamp) so the cache rebuilds from scratch. invalidate(table) is the targeted version for a single table.

Runtime statistics

stats = engine.stats          # Stats snapshot
print(stats.hits, stats.misses, stats.refetches)
for name, t in stats.tables.items():
    print(name, t.rows, t.columns, t.last_refresh)

Configuration

Set via environment variables or a .env file:

| Variable | Default | Description |
| --- | --- | --- |
| `SQLMEM_DEBUG` | `false` | `true` enables DEBUG-level logging |
| `SQLMEM_CACHE_DB` | `cache.db` | Path to the on-disk persistence file |
| `SQLMEM_BACKUP_INTERVAL` | `3600` | Disk backup interval in seconds |
| `SQLMEM_SQL_DIALECT` | `tsql` | sqlglot dialect used to parse incoming SQL (e.g. `tsql`, `postgres`, `mysql`) |
| `SQLMEM_REFRESH_INTERVAL` | `300` | Delta-refresh interval in seconds for delta-tracked tables |
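
For reference, the variables and defaults in the table map onto a settings object like the one below. This is a sketch of how such values could be read, with `Settings` and `load_settings` as hypothetical names; it covers environment variables only and does not parse a .env file.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    debug: bool
    cache_db: str
    backup_interval: int
    sql_dialect: str
    refresh_interval: int


def load_settings(env=None):
    """Build Settings from an environment mapping, applying the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        debug=env.get("SQLMEM_DEBUG", "false").strip().lower() == "true",
        cache_db=env.get("SQLMEM_CACHE_DB", "cache.db"),
        backup_interval=int(env.get("SQLMEM_BACKUP_INTERVAL", "3600")),
        sql_dialect=env.get("SQLMEM_SQL_DIALECT", "tsql"),
        refresh_interval=int(env.get("SQLMEM_REFRESH_INTERVAL", "300")),
    )


defaults = load_settings({})
tuned = load_settings({"SQLMEM_DEBUG": "TRUE", "SQLMEM_REFRESH_INTERVAL": "60"})
print(defaults)
print(tuned.debug, tuned.refresh_interval)
```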

Exceptions

| Exception | When raised |
| --- | --- |
| `ReadOnlyError` | INSERT, UPDATE, or DELETE statement |
| `UnsupportedQueryError` | non-SELECT statement, `SELECT` without `FROM`, or an unqualified column in a multi-table query |

from sqlmem import ReadOnlyError, UnsupportedQueryError

Logging

SQLmem is silent by default. Call add_sink() to opt in:

import sys
from sqlmem import add_sink

add_sink(sys.stderr)                      # INFO by default
add_sink(sys.stderr, level="DEBUG")       # verbose: every query, cache hit/miss, backup
add_sink("sqlmem.log", rotation="10 MB")  # to a file

Set SQLMEM_DEBUG=true in .env to make the default level DEBUG when no explicit level is passed to add_sink().

Limitations

In a multi-table (JOIN) query, every column must be qualified with its table or alias; unqualified columns raise UnsupportedQueryError.
Tables are keyed by their base name — two tables with the same name in different schemas share one cache entry.
No distributed cache backend (Redis etc.).
No transactional consistency guarantees; the cache trails the source DB.
Write operations (INSERT/UPDATE/DELETE) are always blocked.

Roadmap

Incremental (delta) refresh via per-table change-timestamp + key columns (see above) — the key feature for large tables.
Primary-key auto-discovery from the source DB (inspect(engine).get_pk_constraint) so key_columns is only needed for views.
engine.reset() — wipe RAM + cache.db for a clean rebuild after structural changes.
Per-table TTL (time-to-live) expiry.

Dependencies

| Layer | Library |
| --- | --- |
| SQL parsing | `sqlglot` |
| Cache storage | `sqlite3` (stdlib) |
| Integration | SQLAlchemy 2.x |
| Logging | `loguru`, `python-dotenv` |

License

MIT
