Files
SQLmem/README.md
T

454 lines
24 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# SQLmem
Transparent in-memory cache layer between SQLAlchemy and your database. Drop it in front of any SQLAlchemy engine — `SELECT` queries are served from a fast in-memory SQLite cache, writes are rejected (read-only cache).
## Goals
SQLmem sits **between your application and the database** and behaves like a normal SQLAlchemy connection. It transparently:
1. **Intercepts every query** that passes through it and learns, from the SQL itself, **which tables and which columns** the application actually uses.
2. **Holds exactly those tables/columns locally in SQLite** — primarily in **RAM**, secondarily persisted to **disk** (`cache.db`) at regular intervals and on shutdown.
3. **Serves repeated queries from RAM** with no database round-trip.
4. **Stays in sync incrementally** (see [Incremental refresh](#incremental-delta-refresh)): for large tables you declare a *change-timestamp* column, and SQLmem only re-fetches rows that changed in the last few minutes (or since the last shutdown) instead of reloading tens of millions of rows on every start.
The application keeps calling SQL as usual — the cache is an implementation detail behind the interface.
## How it works
```mermaid
flowchart TB
App["Application (SQLAlchemy code)"]
DB[("Source database")]
subgraph SM["SQLmem - transparent cache layer"]
direction TB
P["SQL Parser (sqlglot)<br/>detect SELECT vs write<br/>extract tables + columns"]
R["Column Registry<br/>tracks tables + columns in cache"]
QE["Query Executor<br/>cache hit / miss / refetch"]
MEM[("In-memory SQLite - PRIMARY")]
DISK[("cache.db on disk - SECONDARY")]
P --> R --> QE --> MEM
MEM -->|"backup every N s + on shutdown"| DISK
DISK -->|"load on startup"| MEM
end
App -->|"execute(sql, params)"| P
QE -->|"cache miss / delta refresh only"| DB
DB -->|"rows"| MEM
MEM -->|"list of dicts"| App
```
On the first `SELECT` touching a table, SQLmem fetches the required rows from the database and stores them in the in-memory SQLite. Subsequent queries for the same columns hit RAM with no database round-trip. When a query requests a column not yet cached, SQLmem re-fetches the table with the expanded column set. Parametrized queries, JOINs and `SELECT *` are all supported; each table in a JOIN is cached independently and the JOIN runs inside the in-memory SQLite.
### Query lifecycle
```mermaid
sequenceDiagram
participant App
participant SQLmem
participant Mem as In-memory SQLite
participant DB as Source DB
App->>SQLmem: execute(SELECT a, b FROM t WHERE id = ?, params)
SQLmem->>SQLmem: parse -> table = t, columns = {a, b, id}
alt columns already cached
SQLmem->>Mem: run query in RAM (with params)
Mem-->>SQLmem: rows
else cache miss or new column
SQLmem->>DB: SELECT a, b, id FROM t (whole table, no WHERE)
DB-->>SQLmem: rows
SQLmem->>Mem: store / expand table
SQLmem->>Mem: run query in RAM (with params)
Mem-->>SQLmem: rows
end
SQLmem-->>App: list[dict]
```
Note: query **parameters are applied only to the in-memory query**, never to the source fetch — a cache load always pulls the full table so the cache can answer any later `WHERE` on those columns.
## Installation
```bash
pip install sqlmem
# or with Poetry
poetry add sqlmem
```
Requires Python 3.14.
## Quick start
```python
from sqlmem import CachingEngine
from sqlalchemy import create_engine
base_engine = create_engine("postgresql://user:pass@host/db")
engine = CachingEngine(base_engine)
# Use exactly like a regular SQLAlchemy engine:
results = engine.execute("SELECT id, name FROM users WHERE status = 'active'")
for row in results:
print(row["id"], row["name"])
# Positional parameters (?):
engine.execute("SELECT id, name FROM users WHERE id = ?", ("42",))
# Named parameters (:name):
engine.execute("SELECT id, name FROM users WHERE id = :id", {"id": "42"})
# JOINs — each table is cached independently:
engine.execute(
"SELECT u.name, o.total FROM users u "
"JOIN orders o ON o.user_id = u.id WHERE u.id = ?",
("42",),
)
# SELECT * — loads and caches the whole table:
engine.execute("SELECT * FROM users")
```
`execute()` returns a list of dicts. Parameters are passed straight through to SQLite, so positional (`?`) and named (`:name`) styles both work.
## Cache behaviour
**Column accumulation** — SQLmem learns which columns your app needs at runtime, no upfront configuration required:
```
Query 1: SELECT a, b FROM orders → cache miss → fetch orders(a, b) from DB
Query 2: SELECT a, d FROM orders → new column d → re-fetch orders(a, b, d)
Query 3: SELECT b FROM orders → cache hit, no DB query
Query 4: SELECT * FROM orders → fetches all columns, marks the table fully cached
Query 5: SELECT a FROM orders → cache hit (table already full)
```
**`SELECT *`** loads every column and marks the table as fully cached, so any later column query is a guaranteed cache hit with no re-fetch.
**Writes are blocked** — INSERT, UPDATE, and DELETE raise `ReadOnlyError`. SQLmem is a read-only cache.
## Incremental (delta) refresh
Reloading a table with tens of millions of rows on every startup is unacceptable. To avoid it, SQLmem keeps the cache in sync by pulling **only changed rows**. For each delta-tracked table you declare its **last-change timestamp** column and the **key column(s)** that identify a row:
```python
from sqlmem import CachingEngine, DeltaConfig
engine = CachingEngine(
base_engine,
delta={
"VW_P_PRATVALUES": DeltaConfig(
change_column="LAST_CHANGE_DATE", # required — the row's change timestamp
key_columns=["PRODUCT_PRODUCTNR"], # optional for base tables (auto-discovered)
),
},
)
```
**What you must configure, and what is automatic:**
| Item | Source |
|---|---|
| which **tables / columns** to cache | **automatic** — learned from the queries that pass through |
| `change_column` (timestamp) | **manual, always** — its meaning can't be inferred from the column type<sup>*</sup> |
| `key_columns` (primary key) | **auto-discovered** for real tables (`inspect(engine).get_pk_constraint`); **manual** for views, which carry no key in the DB catalog |
<sup>*</sup> The one exception is a true MSSQL `rowversion`/`timestamp`-typed column, which is unique per table and auto-maintained — that could be detected automatically. A plain `DATETIME` like `LAST_CHANGE_DATE` cannot.
If `key_columns` is omitted, SQLmem tries to read the primary key from the source DB on startup and raises a clear error if it can't (e.g. for a view) so you can supply it explicitly.
### How sync works
The boundary of "what changed since last time" is a **data-driven watermark**, not a wall-clock window. SQLmem persists, per delta-tracked table, `last_synced_at` = the **maximum `change_column` value** actually present in the cache after the previous sync (stored in `cache.db`, so it survives restarts). The next sync pulls `WHERE change_column >= last_synced_at`.
Why a watermark and not `now 5 min`:
- **No clock dependency** — it compares DB values to DB values, so app-server vs database clock skew is irrelevant.
- **Survives downtime for free** — after hours offline, `>= watermark` pulls *everything* since then; "catch up since last shutdown" needs no special case.
- **Never misses late commits** — a wall-clock window can drop a row whose timestamp falls outside the window by the time it commits.
The filter is `>=` (not `>`) so rows sharing the exact boundary timestamp are re-read; combined with **idempotent upsert by `key_columns`**, re-reading a handful of boundary rows each tick is harmless (they overwrite themselves), and no row is ever skipped. The 5-minute interval is only the **polling cadence**, never the filter boundary.
```mermaid
sequenceDiagram
participant Trigger as Startup / every 5 min
participant SQLmem
participant Mem as In-memory SQLite
participant DB as Source DB
Trigger->>SQLmem: refresh delta-tracked tables
SQLmem->>Mem: read last_synced_at for table
SQLmem->>DB: SELECT * FROM t WHERE LAST_CHANGE_DATE >= last_synced_at
DB-->>SQLmem: only rows changed since the watermark
SQLmem->>Mem: upsert rows by key_columns (INSERT OR REPLACE)
SQLmem->>Mem: last_synced_at = max(LAST_CHANGE_DATE)
```
- **First use** of a delta table → full load; the watermark is set to the table's current `max(change_column)`.
- **On startup** → for each delta table restored from disk, a single catch-up query pulls everything changed **since the last shutdown** and upserts it, bringing the cache back in sync without a full reload.
- **While running** → a background thread repeats the delta pull every `SQLMEM_REFRESH_INTERVAL` seconds (default 5 minutes), so the cache trails the source DB by at most that interval.
- Tables **without** a `DeltaConfig` keep the default behaviour: full load on miss, never auto-refreshed — unless they are given a [TTL](#time-based-refresh-tables-without-a-change-column).
### Requirements and limits of delta sync
- The `change_column` must be **set by the source DB on every insert/update** and be non-decreasing (e.g. a `DATETIME`/`rowversion`/`timestamp` maintained by a trigger or the application).
- `key_columns` must uniquely identify a row — they are used to upsert changed rows in place.
- **Updates, including "deletes by nulling"** (a row that keeps its identity but has values cleared), are handled automatically: the change timestamp bumps, the row is re-pulled and overwritten in place.
- **Structural changes are not covered by delta sync** — adding/removing attributes, or clearing values *without* bumping `change_column`, won't be picked up. For those, force a clean reload with [`engine.reset()`](#manual-cache-control) (or `invalidate()` for a single table).
- Hard `DELETE`s of whole rows are not detected by a change-timestamp; this workload doesn't delete rows, but if yours does, use a soft-delete flag column or `reset()`.
## Time-based refresh (tables without a change column)
Some tables can't be delta-synced because they have no change timestamp. For those you can set a **TTL** (max age in seconds): SQLmem keeps serving from cache and guarantees the cached copy is **never older than the TTL** by doing a full reload when it expires.
```python
engine = CachingEngine(
base_engine,
ttl={
"VW_LOOKUP_CODES": 300, # full-reload if the cache is older than 5 minutes
"VW_SETTINGS": 3600,
},
)
```
- **Read-time guarantee** — when a query touches a TTL table whose cache is older than its TTL, the table is fully reloaded *before* the query is answered, so a stale copy is never returned.
- **Proactive** — the background thread also full-reloads expired TTL tables every `SQLMEM_REFRESH_INTERVAL` seconds, keeping them warm so reads usually don't pay the reload latency.
- TTL age is measured from `last_refresh_at`, which is persisted in `cache.db`, so the guarantee holds across restarts (an expired table is reloaded on first use after start).
- A table may be in **either** `delta` **or** `ttl`, not both (delta already keeps it fresh) — supplying both raises `ValueError`.
```python
engine.refresh() # also reloads any expired TTL tables on demand
```
## Secondary indexes
To accelerate lookups, you can declare **secondary indexes** per table — they are created on the in-memory SQLite copy so `WHERE`/`JOIN` filters on those columns run as indexed searches instead of full scans:
```python
engine = CachingEngine(
base_engine,
indexes={
"VW_P_PRATVALUES": ["PRODUCT_PRODUCTNR"], # single-column index
"VW_ELEMENTS": [["ELEMENT_ID", "ELEMENTVARIANT_ID"], "ELEMENTVARIANT_NAME"],
},
)
```
Each value is a list of index definitions: a string is a single-column index, a nested list is a composite index.
- **Index columns are pulled into the cache automatically** (like delta key columns), so the index exists from the first load even if your queries don't select those columns.
- Indexes are **recreated after every (re)load** — full loads, TTL reloads, and `invalidate()` + re-fetch all rebuild them — so they're always present, and they persist in `cache.db` across restarts.
- Delta-tracked tables already get a unique index on their key columns; secondary indexes are independent and can be combined with `delta` or `ttl`.
## Persistence
By default the cache lives in an **in-memory SQLite** and is persisted to `cache.db` on disk:
- **On startup**: if `cache.db` exists, it is loaded into memory.
- **Periodically**: a background thread writes a snapshot to disk every `SQLMEM_BACKUP_INTERVAL` seconds.
- **On shutdown**: a final flush via `atexit` and SIGTERM handler.
The schema version is checked on load — if it does not match, the stale file is discarded and the cache is rebuilt from the database.
### Disk-backed cache (no RAM copy)
Set `in_memory=False` (or `SQLMEM_IN_MEMORY=false`) to query the on-disk `cache.db` **directly** instead of mirroring it in RAM:
```python
engine = CachingEngine(base_engine, in_memory=False)
```
- The cache can **exceed available memory** — nothing is held in RAM beyond SQLite's page cache.
- Every write **persists immediately** (WAL + `synchronous=NORMAL`), so there is no hourly backup thread, no load-into-memory step on startup, and no shutdown flush to lose.
- **Reads run concurrently** — each thread reads through its own read-only WAL connection, so a slow `SELECT` doesn't block writers (loads/upserts) or other readers.
- On open, a cache file with a mismatched schema version is wiped in place and rebuilt; `engine.reset()` drops the cached tables and `VACUUM`s the file (it does not delete the open file).
The constructor argument wins over the env var; when `in_memory` is omitted it falls back to `SQLMEM_IN_MEMORY`.
#### Tuning the SQLite layer (`pragmas=`)
For a large disk-backed cache, pass SQLite PRAGMAs to tune the read path and on-disk layout without bypassing SQLmem:
```python
engine = CachingEngine(
base_engine,
in_memory=False,
pragmas={
"mmap_size": 32 * 1024**3, # map the DB into the address space (32 GB)
"cache_size": -262144, # 256 MB page cache (negative = KiB)
"temp_store": 2, # ORDER BY / GROUP BY scratch in RAM
"page_size": 8192, # larger pages → fewer reads on range scans
"auto_vacuum": "INCREMENTAL",# reclaim free pages with vacuum() (see below)
},
)
```
- Every entry is applied as `PRAGMA <key> = <value>` when the cache connection opens. **Unknown or inapplicable pragmas are silently ignored** by SQLite, so a bad value degrades gracefully instead of crashing startup.
- **`page_size` and `auto_vacuum` are layout pragmas** — they only take effect on a *fresh* file (set before the first table). On an existing cache, `page_size` is ignored with a one-time warning; use [`hard_reset()`](#manual-cache-control) to rebuild the file with the new value.
#### INTEGER datetime columns (`datetime_columns=`)
A pure datetime column stored as an ISO `TEXT` string costs ~28 bytes per row and compares by string collation. For a large table you can store named datetime columns as **INTEGER microseconds since the Unix epoch** instead — 8 bytes, native integer comparison:
```python
engine = CachingEngine(
base_engine,
delta={"VW_P_PRATVALUES": DeltaConfig("CHANGE_DATE", ["PRATVALUE_ID"])},
datetime_columns={"VW_P_PRATVALUES": ["CHANGE_DATE"]},
)
```
- **Opt-in per column.** Only the columns you name change; everything else keeps the default lossless `TEXT` storage.
- ⚠️ **It changes the output contract for those columns**`execute()` returns them as `int` (µs since epoch), not ISO strings, and a `WHERE` on them must compare against integer µs. Don't list a column your callers read as a string.
- The delta watermark is handled transparently: it is persisted as the integer and bound back to a real `datetime` for the source query, so incremental refresh keeps working.
- ⚠️ This is a **breaking on-disk change** (`SCHEMA_VERSION` 4): an existing cache is wiped and reloaded on first start after enabling it — schedule a maintenance window for a large reload.
## Manual cache control
```python
engine.invalidate("orders") # drop one table from cache; next query re-fetches it from DB
engine.reset() # wipe the whole cache (RAM + cache.db) — full clean slate
engine.hard_reset() # disk mode: delete the file and reopen with current pragmas/page_size
engine.vacuum() # disk mode: incremental VACUUM (reclaim free pages from delta churn)
engine.refresh() # pull deltas for all delta-tracked tables now
engine.close() # flush to disk and shut down background thread
```
Use `reset()` after a **structural change** in the source (columns added/removed, values cleared in bulk without bumping the change timestamp) so the cache rebuilds from scratch. `invalidate(table)` is the targeted version for a single table.
`hard_reset()` goes further than `reset()` in disk mode: it closes every connection, deletes `cache.db` (and its `-wal`/`-shm` sidecars) and reopens from scratch — the only way to change a baked-in `page_size`/`auto_vacuum`. In memory mode it falls back to `reset()`.
`vacuum()` reclaims free pages left behind by delta `INSERT OR REPLACE` churn. Incremental (the default) is cheap and non-blocking but needs `auto_vacuum=INCREMENTAL`; `vacuum(incremental=False)` runs a full VACUUM that rewrites the file (~2× disk, blocks readers) — schedule it in a maintenance window. Both are no-ops in memory mode.
## Runtime statistics
```python
stats = engine.stats # Stats snapshot
print(stats.hits, stats.misses, stats.refetches, stats.errors)
for name, t in stats.tables.items():
print(name, t.rows, t.state, t.tracking, t.last_upsert, t.last_refresh)
if t.consecutive_failures:
print(f" {name} failing ×{t.consecutive_failures}: {t.last_error} ({t.last_error_at})")
```
`Stats.errors` is the total number of load/refresh failures since start. Each `TableStats` also carries `last_error`, `last_error_at` and `consecutive_failures` (reset to 0 on the next success) — so a delta that fails *before* streaming (which otherwise leaves `state` looking `ready`) is still visible, and the table is marked `error`.
Two timestamps distinguish *data freshness* from *liveness*:
| field | meaning |
|---|---|
| `last_upsert` | wall-clock (UTC) of the last actual **data write** — full load or a delta cycle that wrote rows. Persisted, survives restarts. Answers *"when did the data last change?"* |
| `last_refresh` | wall-clock (UTC) of the last time a **refresh cycle ran** for the table — bumped **even when it wrote nothing**. In-memory per process (`None` until the first cycle runs after start). Answers *"is the refresh loop alive?"* |
A delta table that runs every cycle but finds no new rows keeps `last_refresh` ticking while `last_upsert` stays put — that's healthy, not stuck. (Both are UTC ISO strings; the default log timestamps are local time, so expect an offset.)
Each `TableStats` reports a live processing **state** and how the table is kept fresh (**tracking**):
| `state` | Meaning |
|---|---|
| `loading` | a full load is in progress |
| `refreshing` | an incremental (delta) refresh is in progress |
| `ready` | cached and idle (up to date) |
| `stale` | a TTL table whose cache has expired; reloads on next access |
| `error` | the last load failed |
| `tracking` | Meaning |
|---|---|
| `delta` | kept in sync incrementally via a change column |
| `ttl` | full-reloaded when older than its TTL |
| `static` | loaded on demand, never auto-refreshed |
## Memory and very large tables
By default the cache is **in-memory SQLite**, so a cached table lives in RAM — it must fit in available memory. To keep huge tables manageable:
- **Use [disk-backed mode](#disk-backed-cache-no-ram-copy)** (`in_memory=False`) when the working set simply doesn't fit in RAM — queries then run against `cache.db` on disk instead of a memory copy.
- **Loads are streamed in batches** (`SQLMEM_FETCH_BATCH` rows at a time, default 10 000) into a staging table and swapped in atomically. A multi-million-row table never gets fully materialized in Python at once, so the load doesn't spike memory or crash the process, and readers keep seeing the previous copy until the swap completes.
- Use **[delta refresh](#incremental-delta-refresh)** for large tables that have a change column — after the first load only changed rows are pulled, so restarts and refreshes don't re-read the whole table.
- A **single query that returns a huge result set** (e.g. `SELECT *` over a multi-million-row cached table) still materializes that result as a list of dicts; bound it with a `WHERE`/`LIMIT` rather than selecting everything.
## Configuration
Set via environment variables or a `.env` file:
| Variable | Default | Description |
|---|---|---|
| `SQLMEM_DEBUG` | `false` | `true` enables DEBUG-level logging |
| `SQLMEM_CACHE_DB` | `cache.db` | Path to the on-disk persistence file |
| `SQLMEM_IN_MEMORY` | `true` | `false` queries `cache.db` on disk directly (no RAM copy); overridden by the `in_memory` constructor arg |
| `SQLMEM_BACKUP_INTERVAL` | `3600` | Disk backup interval in seconds (in-memory mode only) |
| `SQLMEM_SQL_DIALECT` | `tsql` | sqlglot dialect used to parse incoming SQL (e.g. `tsql`, `postgres`, `mysql`) |
| `SQLMEM_REFRESH_INTERVAL` | `300` | background refresh tick (seconds) — delta pulls and proactive TTL reloads |
| `SQLMEM_FETCH_BATCH` | `10000` | rows fetched per batch when loading a table — caps peak memory for huge tables |
Most of these can also be passed **per engine** to the constructor, overriding the env default — handy for running two engines (with separate cache files) in one process, and for tests:
```python
engine = CachingEngine(
base_engine,
cache_db_path="orders_cache.db", # SQLMEM_CACHE_DB
in_memory=False, # SQLMEM_IN_MEMORY
backup_interval=3600, # SQLMEM_BACKUP_INTERVAL
refresh_interval=300, # SQLMEM_REFRESH_INTERVAL
fetch_batch=10000, # SQLMEM_FETCH_BATCH
dialect="tsql", # SQLMEM_SQL_DIALECT
pragmas={"mmap_size": 32 * 1024**3, "page_size": 8192}, # disk-mode SQLite tuning
datetime_columns={"orders": ["created_at"]}, # store these as INTEGER µs (opt-in)
blocking_startup_refresh=False, # block startup until caught up? (default: no)
)
```
By default the **startup catch-up** (delta pulls and TTL reloads for tables restored from disk) runs on the background thread so it never blocks application startup; the cache may serve slightly stale data until the first refresh completes. Set `blocking_startup_refresh=True` to catch up synchronously before the engine starts serving.
## Exceptions
| Exception | When raised |
|---|---|
| `ReadOnlyError` | INSERT, UPDATE, or DELETE statement |
| `UnsupportedQueryError` | non-SELECT statement, `SELECT` without `FROM`, or an unqualified column in a multi-table query |
```python
from sqlmem import ReadOnlyError, UnsupportedQueryError
```
## Logging
SQLmem is silent by default. Call `add_sink()` to opt in:
```python
import sys
from sqlmem import add_sink
add_sink(sys.stderr) # INFO by default
add_sink(sys.stderr, level="DEBUG") # verbose: every query, cache hit/miss, backup
add_sink("sqlmem.log", rotation="10 MB") # to a file
```
Set `SQLMEM_DEBUG=true` in `.env` to make the default level DEBUG when no explicit `level` is passed to `add_sink()`.
## Limitations
- In a multi-table (JOIN) query, every column must be qualified with its table or alias; unqualified columns raise `UnsupportedQueryError`.
- Tables are keyed by their base name — two tables with the same name in different schemas share one cache entry.
- No distributed cache backend (Redis etc.).
- No transactional consistency guarantees; the cache trails the source DB.
- Write operations (INSERT/UPDATE/DELETE) are always blocked.
## Roadmap
- [x] **Incremental (delta) refresh** via per-table change-timestamp + key columns (see above) — the key feature for large tables.
- [x] **Primary-key auto-discovery** from the source DB (`inspect(engine).get_pk_constraint`) so `key_columns` is only needed for views.
- [x] **`engine.reset()`** — wipe RAM + `cache.db` for a clean rebuild after structural changes.
- [x] **Per-table TTL** (time-to-live) — bounded-staleness full refresh for tables without a change column.
## Dependencies
| Layer | Library |
|---|---|
| SQL parsing | `sqlglot` |
| Cache storage | `sqlite3` (stdlib) |
| Integration | SQLAlchemy 2.x |
| Logging | `loguru`, `python-dotenv` |
## License
MIT