# SQLmem

Transparent in-memory cache layer between SQLAlchemy and your database. Drop it in front of any SQLAlchemy engine: `SELECT` queries are served from a fast in-memory SQLite cache, and writes are rejected (the cache is read-only).

## Goals

SQLmem sits **between your application and the database** and behaves like a normal SQLAlchemy connection. It transparently:

1. **Intercepts every query** that passes through it and learns, from the SQL itself, **which tables and which columns** the application actually uses.
2. **Holds exactly those tables/columns locally in SQLite**, primarily in **RAM** and secondarily persisted to **disk** (`cache.db`) at regular intervals and on shutdown.
3. **Serves repeated queries from RAM** with no database round-trip.
4. **Stays in sync incrementally** (see [Incremental refresh](#incremental-delta-refresh)): for large tables you declare a *change-timestamp* column, and SQLmem only re-fetches rows that changed since the last sync (the last few minutes while running, or since the last shutdown) instead of reloading tens of millions of rows on every start.

The application keeps calling SQL as usual; the cache is an implementation detail behind the interface.

## How it works

```mermaid
flowchart TB
    App["Application (SQLAlchemy code)"]
    DB[("Source database")]

    subgraph SM["SQLmem - transparent cache layer"]
        direction TB
        P["SQL Parser (sqlglot)<br/>detect SELECT vs write<br/>extract tables + columns"]
        R["Column Registry<br/>tracks tables + columns in cache"]
        QE["Query Executor<br/>cache hit / miss / refetch"]
        MEM[("In-memory SQLite - PRIMARY")]
        DISK[("cache.db on disk - SECONDARY")]
        P --> R --> QE --> MEM
        MEM -->|"backup every N s + on shutdown"| DISK
        DISK -->|"load on startup"| MEM
    end

    App -->|"execute(sql, params)"| P
    QE -->|"cache miss / delta refresh only"| DB
    DB -->|"rows"| MEM
    MEM -->|"list of dicts"| App
```

On the first `SELECT` touching a table, SQLmem fetches the required columns of that table from the database and stores them in the in-memory SQLite database. Subsequent queries for the same columns hit RAM with no database round-trip. When a query requests a column not yet cached, SQLmem re-fetches the table with the expanded column set. Parameterized queries, JOINs and `SELECT *` are all supported; each table in a JOIN is cached independently and the JOIN runs inside the in-memory SQLite database.

### Query lifecycle

```mermaid
sequenceDiagram
    participant App
    participant SQLmem
    participant Mem as In-memory SQLite
    participant DB as Source DB

    App->>SQLmem: execute(SELECT a, b FROM t WHERE id = ?, params)
    SQLmem->>SQLmem: parse -> table = t, columns = {a, b, id}
    alt columns already cached
        SQLmem->>Mem: run query in RAM (with params)
        Mem-->>SQLmem: rows
    else cache miss or new column
        SQLmem->>DB: SELECT a, b, id FROM t (whole table, no WHERE)
        DB-->>SQLmem: rows
        SQLmem->>Mem: store / expand table
        SQLmem->>Mem: run query in RAM (with params)
        Mem-->>SQLmem: rows
    end
    SQLmem-->>App: list[dict]
```

Note: query **parameters are applied only to the in-memory query**, never to the source fetch. A cache load always pulls the full table so the cache can answer any later `WHERE` on those columns.

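The split between "load without `WHERE`" and "answer with parameters" can be sketched with two stdlib `sqlite3` connections. This is only an illustration under stated assumptions, not SQLmem's actual code: the in-memory `source` connection stands in for the real database, and `load_table` and `run_cached` are hypothetical helper names.

```python
import sqlite3

# Stand-in for the source database (hypothetical data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE t (id INTEGER, a TEXT, b TEXT, unused TEXT)")
source.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, "a1", "b1", "x"), (2, "a2", "b2", "y"), (3, "a3", "b3", "z")],
)

# The cache: a separate in-memory SQLite database.
cache = sqlite3.connect(":memory:")
cache.row_factory = sqlite3.Row


def load_table(table, columns):
    """Copy the needed columns of the whole table into the cache (no WHERE)."""
    cols = ", ".join(columns)
    rows = source.execute(f"SELECT {cols} FROM {table}").fetchall()
    cache.execute(f"DROP TABLE IF EXISTS {table}")
    cache.execute(f"CREATE TABLE {table} ({cols})")
    marks = ", ".join("?" for _ in columns)
    cache.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)


def run_cached(sql, params=()):
    """Run the caller's query, with its parameters, against the cache only."""
    return [dict(row) for row in cache.execute(sql, params)]


# One unfiltered load serves any later WHERE on the cached columns.
load_table("t", ["a", "b", "id"])
first = run_cached("SELECT a, b FROM t WHERE id = ?", (2,))
second = run_cached("SELECT a, b FROM t WHERE id = ?", (3,))
```

Because the load is unfiltered, the second query needs no further access to `source` even though its parameter differs from the first.
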
## Installation

```bash
pip install sqlmem
# or with Poetry
poetry add sqlmem
```

Requires Python 3.14.

## Quick start

```python
from sqlmem import CachingEngine
from sqlalchemy import create_engine

base_engine = create_engine("postgresql://user:pass@host/db")
engine = CachingEngine(base_engine)

# Use exactly like a regular SQLAlchemy engine:
results = engine.execute("SELECT id, name FROM users WHERE status = 'active'")
for row in results:
    print(row["id"], row["name"])

# Positional parameters (?):
engine.execute("SELECT id, name FROM users WHERE id = ?", ("42",))

# Named parameters (:name):
engine.execute("SELECT id, name FROM users WHERE id = :id", {"id": "42"})

# JOINs: each table is cached independently.
engine.execute(
    "SELECT u.name, o.total FROM users u "
    "JOIN orders o ON o.user_id = u.id WHERE u.id = ?",
    ("42",),
)

# SELECT * loads and caches the whole table:
engine.execute("SELECT * FROM users")
```

`execute()` returns a list of dicts. Parameters are passed straight through to SQLite, so positional (`?`) and named (`:name`) styles both work.

## Cache behaviour

**Column accumulation**: SQLmem learns at runtime which columns your app needs, with no upfront configuration required:

```
Query 1: SELECT a, b FROM orders   → cache miss → fetch orders(a, b) from DB
Query 2: SELECT a, d FROM orders   → new column d → re-fetch orders(a, b, d)
Query 3: SELECT b FROM orders      → cache hit, no DB query
Query 4: SELECT * FROM orders      → fetches all columns, marks the table fully cached
Query 5: SELECT a FROM orders      → cache hit (table already full)
```

**`SELECT *`** loads every column and marks the table as fully cached, so any later column query is a guaranteed cache hit with no re-fetch.

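The five-query trace above can be reproduced with a small bookkeeping class. This is a minimal sketch, assuming a registry that only tracks column names; `ColumnRegistry` and `plan` are hypothetical names, and labelling the `SELECT *` step as a "refetch" is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnRegistry:
    """Tracks which columns of each table are cached (hypothetical helper)."""

    columns: dict[str, set[str]] = field(default_factory=dict)
    full: set[str] = field(default_factory=set)

    def plan(self, table: str, requested: set[str]) -> tuple[str, list[str]]:
        """Return (outcome, columns to fetch from the source) for one query."""
        if table in self.full:
            return "hit", []  # every column is already cached
        cached = self.columns.get(table)
        if "*" in requested:
            self.full.add(table)
            return ("miss" if cached is None else "refetch"), ["*"]
        if cached is not None and requested <= cached:
            return "hit", []
        # Accumulate: fetch the union of old and new columns in one go.
        merged = (cached or set()) | requested
        self.columns[table] = merged
        return ("miss" if cached is None else "refetch"), sorted(merged)


registry = ColumnRegistry()
outcomes = [
    registry.plan("orders", {"a", "b"}),  # Query 1
    registry.plan("orders", {"a", "d"}),  # Query 2
    registry.plan("orders", {"b"}),       # Query 3
    registry.plan("orders", {"*"}),       # Query 4
    registry.plan("orders", {"a"}),       # Query 5
]
```

The column set only ever grows, so a table is re-fetched at most once per newly seen column and never because of a query that asks for fewer columns.
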
**Writes are blocked**: `INSERT`, `UPDATE`, and `DELETE` raise `ReadOnlyError`. SQLmem is a read-only cache.

## Incremental (delta) refresh

Reloading a table with tens of millions of rows on every startup is unacceptable. To avoid it, SQLmem keeps the cache in sync by pulling **only changed rows**. For each delta-tracked table you declare its **last-change timestamp** column and the **key column(s)** that identify a row:

```python
from sqlmem import CachingEngine, DeltaConfig

engine = CachingEngine(
    base_engine,
    delta={
        "VW_P_PRATVALUES": DeltaConfig(
            change_column="LAST_CHANGE_DATE",    # required: the row's change timestamp
            key_columns=["PRODUCT_PRODUCTNR"],   # optional for base tables (auto-discovered)
        ),
    },
)
```

**What you must configure, and what is automatic:**

| Item | Source |
|---|---|
| which **tables / columns** to cache | **automatic**: learned from the queries that pass through |
| `change_column` (timestamp) | **manual, always**: its meaning can't be inferred from the column type<sup>*</sup> |
| `key_columns` (primary key) | **auto-discovered** for real tables (`inspect(engine).get_pk_constraint`); **manual** for views, which carry no key in the DB catalog |

<sup>*</sup> The one exception is a true MSSQL `rowversion`/`timestamp`-typed column, which is unique per table and auto-maintained, so it could be detected automatically. A plain `DATETIME` like `LAST_CHANGE_DATE` cannot.

If `key_columns` is omitted, SQLmem tries to read the primary key from the source DB on startup and raises a clear error if it can't (e.g. for a view) so you can supply it explicitly.

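Why views need explicit `key_columns` can be seen in any database catalog. The sketch below uses stdlib SQLite and `PRAGMA table_info` as a stand-in for the SQLAlchemy inspector call named in the table, so it illustrates the idea rather than SQLmem's own discovery code; `discover_key_columns` and the choice of `LookupError` are assumptions.

```python
import sqlite3


def discover_key_columns(conn, name):
    """Return primary-key columns in key order; raise if the catalog has none."""
    info = conn.execute(f"PRAGMA table_info({name})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk); pk is the 1-based
    # position within the primary key, or 0 if the column is not part of it.
    keyed = sorted((row[5], row[1]) for row in info if row[5] > 0)
    if not keyed:
        raise LookupError(
            f"no primary key found for {name!r}; set key_columns explicitly"
        )
    return [col for _, col in keyed]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (productnr TEXT PRIMARY KEY, price REAL)")
conn.execute("CREATE VIEW vw_products AS SELECT productnr, price FROM products")

# A base table exposes its key through the catalog.
table_keys = discover_key_columns(conn, "products")

# The view selects the same columns but carries no key information.
try:
    discover_key_columns(conn, "vw_products")
    view_error = None
except LookupError as exc:
    view_error = str(exc)
```
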
### How sync works

The boundary of "what changed since last time" is a **data-driven watermark**, not a wall-clock window. For each delta-tracked table, SQLmem persists `last_synced_at`, the **maximum `change_column` value** actually present in the cache after the previous sync (stored in `cache.db`, so it survives restarts). The next sync pulls `WHERE change_column >= last_synced_at`.

Why a watermark and not `now − 5 min`:

- **No clock dependency**: it compares DB values to DB values, so app-server vs database clock skew is irrelevant.
- **Survives downtime for free**: after hours offline, `>= watermark` pulls *everything* since then; "catch up since last shutdown" needs no special case.
- **Never misses late commits**: a wall-clock window can drop a row whose timestamp falls outside the window by the time it commits.

The filter is `>=` (not `>`) so rows sharing the exact boundary timestamp are re-read; combined with **idempotent upsert by `key_columns`**, re-reading a handful of boundary rows each tick is harmless (they overwrite themselves), and no row is ever skipped. The 5-minute interval is only the **polling cadence**, never the filter boundary.

```mermaid
sequenceDiagram
    participant Trigger as Startup / every 5 min
    participant SQLmem
    participant Mem as In-memory SQLite
    participant DB as Source DB

    Trigger->>SQLmem: refresh delta-tracked tables
    SQLmem->>Mem: read last_synced_at for table
    SQLmem->>DB: SELECT * FROM t WHERE LAST_CHANGE_DATE >= last_synced_at
    DB-->>SQLmem: only rows changed since the watermark
    SQLmem->>Mem: upsert rows by key_columns (INSERT OR REPLACE)
    SQLmem->>Mem: last_synced_at = max(LAST_CHANGE_DATE)
```

- **First use** of a delta table → full load; the watermark is set to the table's current `max(change_column)`.
- **On startup** → for each delta table restored from disk, a single catch-up query pulls everything changed **since the last shutdown** and upserts it, bringing the cache back in sync without a full reload.
- **While running** → a background thread repeats the delta pull every `SQLMEM_REFRESH_INTERVAL` seconds (default `300`, i.e. 5 minutes), so the cache trails the source DB by at most that interval.
- Tables **without** a `DeltaConfig` keep the default behaviour: full load on miss, never auto-refreshed, unless they are given a [TTL](#time-based-refresh-tables-without-a-change-column).

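The watermark loop described above can be sketched with stdlib `sqlite3`. Everything here is illustrative: the `sync_state` table, the `delta_sync` function and the sample schema are assumptions, the source is another in-memory SQLite database, and timestamps are ISO-8601 strings so that plain string comparison orders them correctly.

```python
import sqlite3

# Stand-in source database and cache (both in-memory SQLite here).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT, changed TEXT)")
cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT, changed TEXT)")
cache.execute("CREATE TABLE sync_state (tbl TEXT PRIMARY KEY, last_synced_at TEXT)")


def delta_sync(table="t", change_column="changed"):
    """Pull rows at or after the watermark and upsert them by primary key."""
    row = cache.execute(
        "SELECT last_synced_at FROM sync_state WHERE tbl = ?", (table,)
    ).fetchone()
    select = f"SELECT id, val, {change_column} FROM {table}"
    if row is None or row[0] is None:
        rows = source.execute(select).fetchall()  # first use: full load
    else:
        rows = source.execute(
            f"{select} WHERE {change_column} >= ?", (row[0],)
        ).fetchall()
    # Idempotent upsert: re-read boundary rows simply overwrite themselves.
    cache.executemany(f"INSERT OR REPLACE INTO {table} VALUES (?, ?, ?)", rows)
    # The new watermark is the newest change value now present in the cache.
    newest = cache.execute(f"SELECT max({change_column}) FROM {table}").fetchone()[0]
    cache.execute("INSERT OR REPLACE INTO sync_state VALUES (?, ?)", (table, newest))
    return len(rows)


source.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01 10:00:00"), (2, "b", "2024-01-01 10:05:00")],
)
first = delta_sync()

# One update, plus one insert that shares the exact watermark timestamp.
source.execute("UPDATE t SET val = 'a2', changed = '2024-01-01 10:07:00' WHERE id = 1")
source.execute("INSERT INTO t VALUES (3, 'c', '2024-01-01 10:05:00')")
second = delta_sync()
```

Row 3 carries the same timestamp as the previous watermark, so a strict `>` filter would have skipped it; with `>=` it is picked up, at the cost of re-reading row 2.
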
### Requirements and limits of delta sync

- The `change_column` must be **set by the source DB on every insert/update** and be non-decreasing (e.g. a `DATETIME`/`rowversion`/`timestamp` maintained by a trigger or the application).
- `key_columns` must uniquely identify a row; they are used to upsert changed rows in place.
- **Updates, including "deletes by nulling"** (a row that keeps its identity but has values cleared), are handled automatically: the change timestamp bumps, the row is re-pulled and overwritten in place.
- **Structural changes are not covered by delta sync**: adding/removing attributes, or clearing values *without* bumping `change_column`, won't be picked up. For those, force a clean reload with [`engine.reset()`](#manual-cache-control) (or `invalidate()` for a single table).
- Hard `DELETE`s of whole rows are not detected by a change timestamp; this workload doesn't delete rows, but if yours does, use a soft-delete flag column or `reset()`.

## Time-based refresh (tables without a change column)

Some tables can't be delta-synced because they have no change timestamp. For those you can set a **TTL** (max age in seconds): SQLmem keeps serving from cache and guarantees the cached copy is **never older than the TTL** by doing a full reload when it expires.

```python
engine = CachingEngine(
    base_engine,
    ttl={
        "VW_LOOKUP_CODES": 300,   # full reload if the cache is older than 5 minutes
        "VW_SETTINGS": 3600,      # full reload if the cache is older than 1 hour
    },
)
```

- **Read-time guarantee**: when a query touches a TTL table whose cache is older than its TTL, the table is fully reloaded *before* the query is answered, so a stale copy is never returned.
- **Proactive**: the background thread also full-reloads expired TTL tables every `SQLMEM_REFRESH_INTERVAL` seconds, keeping them warm so reads usually don't pay the reload latency.
- TTL age is measured from `last_refresh_at`, which is persisted in `cache.db`, so the guarantee holds across restarts (an expired table is reloaded on first use after start).
- A table may be in **either** `delta` **or** `ttl`, not both (delta already keeps it fresh); supplying both raises `ValueError`.

```python
engine.refresh()   # also reloads any expired TTL tables on demand
```

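The read-time guarantee boils down to an age check before every read. Here is a minimal sketch, assuming a per-table wrapper with an injectable clock; `TtlTable`, `loader` and `clock` are hypothetical names, and the real library persists `last_refresh_at` in `cache.db` rather than keeping it only in memory.

```python
import time


class TtlTable:
    """Hypothetical wrapper: reloads via `loader` once the copy exceeds `ttl` seconds."""

    def __init__(self, loader, ttl, clock=time.time):
        self.loader = loader
        self.ttl = ttl
        self.clock = clock
        self.rows = None
        self.last_refresh_at = None

    def expired(self):
        """True if the table was never loaded or is older than its TTL."""
        if self.last_refresh_at is None:
            return True
        return self.clock() - self.last_refresh_at > self.ttl

    def read(self):
        """Reload first if the copy is expired, then answer from the cache."""
        if self.expired():
            self.rows = self.loader()
            self.last_refresh_at = self.clock()
        return self.rows


# A fake clock makes the behaviour observable without sleeping.
now = [0.0]
loads = []


def load_codes():
    loads.append(now[0])  # record when the source was actually queried
    return [{"code": "A"}]


codes = TtlTable(load_codes, ttl=300, clock=lambda: now[0])
codes.read()      # first use: full load
now[0] = 299.0
codes.read()      # still fresh: served from cache
now[0] = 301.0
codes.read()      # older than 300 s: reloaded before answering
```

A background thread calling `read()` (or an equivalent refresh) on expired tables would give the proactive behaviour; the read-time check stays as the safety net.
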
## Persistence

The in-memory cache is persisted to `cache.db` on disk:

- **On startup**: if `cache.db` exists, it is loaded into memory.
- **Periodically**: a background thread writes a snapshot to disk every `SQLMEM_BACKUP_INTERVAL` seconds.
- **On shutdown**: a final flush via `atexit` and SIGTERM handler.

The schema version is checked on load. If it does not match, the stale file is discarded and the cache is rebuilt from the database.

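Copying between an in-memory SQLite database and a file can be done with the stdlib `sqlite3.Connection.backup()` method. The sketch below shows one plausible way to implement the snapshot and the startup load; whether SQLmem uses this exact call is an assumption, and `snapshot_to_disk` and `load_from_disk` are hypothetical names.

```python
import os
import sqlite3
import tempfile


def snapshot_to_disk(mem, path):
    """Write a consistent copy of the in-memory database to `path`."""
    disk = sqlite3.connect(path)
    try:
        mem.backup(disk)
    finally:
        disk.close()


def load_from_disk(path):
    """Return a new in-memory database populated from `path`."""
    mem = sqlite3.connect(":memory:")
    disk = sqlite3.connect(path)
    try:
        disk.backup(mem)
    finally:
        disk.close()
    return mem


# Build a tiny cache, snapshot it, then restore it as after a restart.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE users (id INTEGER, name TEXT)")
mem.execute("INSERT INTO users VALUES (1, 'Ada')")
mem.commit()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.db")
    snapshot_to_disk(mem, path)
    restored = load_from_disk(path)

rows = restored.execute("SELECT id, name FROM users").fetchall()
```

The restored connection is fully in memory, so it stays usable after the temporary file is gone.
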
## Manual cache control

```python
engine.invalidate("orders")   # drop one table from cache; next query re-fetches it from DB
engine.reset()                # wipe the whole cache (RAM + cache.db) for a full clean slate
engine.refresh()              # pull deltas for all delta-tracked tables and reload expired TTL tables now
engine.close()                # flush to disk and shut down background thread
```

Use `reset()` after a **structural change** in the source (columns added/removed, values cleared in bulk without bumping the change timestamp) so the cache rebuilds from scratch. `invalidate(table)` is the targeted version for a single table.

## Runtime statistics

```python
stats = engine.stats   # Stats snapshot
print(stats.hits, stats.misses, stats.refetches)
for name, t in stats.tables.items():
    print(name, t.rows, t.columns, t.last_refresh)
```

## Configuration

Set via environment variables or a `.env` file:

| Variable | Default | Description |
|---|---|---|
| `SQLMEM_DEBUG` | `false` | `true` enables DEBUG-level logging |
| `SQLMEM_CACHE_DB` | `cache.db` | Path to the on-disk persistence file |
| `SQLMEM_BACKUP_INTERVAL` | `3600` | Disk backup interval in seconds |
| `SQLMEM_SQL_DIALECT` | `tsql` | sqlglot dialect used to parse incoming SQL (e.g. `tsql`, `postgres`, `mysql`) |
| `SQLMEM_REFRESH_INTERVAL` | `300` | Background refresh tick in seconds (delta pulls and proactive TTL reloads) |

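For reference, the table maps onto a settings object roughly like the one below. This is a sketch of how such variables are commonly read, not SQLmem's actual loader: `Settings` and `load_settings` are hypothetical names, the case-insensitive boolean parsing is an assumption, and loading a `.env` file (for example with `python-dotenv`) is left out so the example stays stdlib-only.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Typed view of the SQLMEM_* variables (hypothetical container)."""

    debug: bool
    cache_db: str
    backup_interval: int
    sql_dialect: str
    refresh_interval: int


def load_settings(env=os.environ):
    """Read the SQLMEM_* variables, falling back to the documented defaults."""
    return Settings(
        debug=env.get("SQLMEM_DEBUG", "false").strip().lower() == "true",
        cache_db=env.get("SQLMEM_CACHE_DB", "cache.db"),
        backup_interval=int(env.get("SQLMEM_BACKUP_INTERVAL", "3600")),
        sql_dialect=env.get("SQLMEM_SQL_DIALECT", "tsql"),
        refresh_interval=int(env.get("SQLMEM_REFRESH_INTERVAL", "300")),
    )


# An empty mapping yields the defaults from the table above.
defaults = load_settings({})
# Overrides arrive as strings and are converted to the target types.
custom = load_settings({"SQLMEM_DEBUG": "TRUE", "SQLMEM_REFRESH_INTERVAL": "60"})
```
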
## Exceptions

| Exception | When raised |
|---|---|
| `ReadOnlyError` | `INSERT`, `UPDATE`, or `DELETE` statement |
| `UnsupportedQueryError` | any other non-`SELECT` statement, `SELECT` without `FROM`, or an unqualified column in a multi-table query |

```python
from sqlmem import ReadOnlyError, UnsupportedQueryError
```

## Logging

SQLmem is silent by default. Call `add_sink()` to opt in:

```python
import sys
from sqlmem import add_sink

add_sink(sys.stderr)                        # INFO by default
add_sink(sys.stderr, level="DEBUG")         # verbose: every query, cache hit/miss, backup
add_sink("sqlmem.log", rotation="10 MB")    # to a file
```

Set `SQLMEM_DEBUG=true` in `.env` to make the default level DEBUG when no explicit `level` is passed to `add_sink()`.

## Limitations

- In a multi-table (JOIN) query, every column must be qualified with its table or alias; unqualified columns raise `UnsupportedQueryError`.
- Tables are keyed by their base name, so two tables with the same name in different schemas share one cache entry.
- No distributed cache backend (Redis etc.).
- No transactional consistency guarantees; the cache trails the source DB.
- Write operations (`INSERT`/`UPDATE`/`DELETE`) are always blocked.

## Roadmap

- [x] **Incremental (delta) refresh** via per-table change-timestamp + key columns (see above), the key feature for large tables.
- [x] **Primary-key auto-discovery** from the source DB (`inspect(engine).get_pk_constraint`) so `key_columns` is only needed for views.
- [x] **`engine.reset()`**: wipe RAM + `cache.db` for a clean rebuild after structural changes.
- [x] **Per-table TTL** (time-to-live): bounded-staleness full refresh for tables without a change column.

## Dependencies

| Layer | Library |
|---|---|
| SQL parsing | `sqlglot` |
| Cache storage | `sqlite3` (stdlib) |
| Integration | SQLAlchemy 2.x |
| Logging | `loguru` |
| Configuration (`.env`) | `python-dotenv` |

## License

MIT