Add incremental delta refresh and fix Decimal/datetime cache binding

2026-06-05 11:09:16 +02:00
parent 530c2618cf
commit 33aa126ff6
13 changed files with 798 additions and 53 deletions
@@ -1,28 +1,70 @@
 # SQLmem

-Transparent in-memory cache layer between SQLAlchemy and your database. Drop it in front of any SQLAlchemy engine — SELECT queries are served from a fast in-memory SQLite cache, writes pass through unchanged.
+Transparent in-memory cache layer between SQLAlchemy and your database. Drop it in front of any SQLAlchemy engine: `SELECT` queries are served from a fast in-memory SQLite cache, and writes are rejected (read-only cache).
+
+## Goals
+
+SQLmem sits **between your application and the database** and behaves like a normal SQLAlchemy connection. It transparently:
+
+1. **Intercepts every query** that passes through it and learns, from the SQL itself, **which tables and which columns** the application actually uses.
+2. **Holds exactly those tables/columns locally in SQLite** — primarily in **RAM**, secondarily persisted to **disk** (`cache.db`) at regular intervals and on shutdown.
+3. **Serves repeated queries from RAM** with no database round-trip.
+4. **Stays in sync incrementally** (see [Incremental refresh](#incremental-delta-refresh)): for large tables you declare a *change-timestamp* column, and SQLmem re-fetches only the rows that changed since the last sync (or since the last shutdown) instead of reloading tens of millions of rows on every start.
+
+The application keeps calling SQL as usual — the cache is an implementation detail behind the interface.

 ## How it works

-```
-Application (SQLAlchemy)
-        │
-        ▼
-  [ SQLmem Proxy ]
-  ┌──────────────────────────────┐
-  │  SQL Parser                  │  → detects SELECT vs. write
-  │  Column Registry             │  → tracks which columns are cached per table
-  │  Cache Manager (SQLite RAM)  │  → stores data in memory
-  │  Query Executor              │  → cache hit / miss logic
-  └──────────────────────────────┘
-        │
-        ▼
-  Database (via original SQLAlchemy engine)
+```mermaid
+flowchart TB
+    App["Application (SQLAlchemy code)"]
+    DB[("Source database")]
+
+    subgraph SM["SQLmem - transparent cache layer"]
+        direction TB
+        P["SQL Parser (sqlglot)<br/>detect SELECT vs write<br/>extract tables + columns"]
+        R["Column Registry<br/>tracks tables + columns in cache"]
+        QE["Query Executor<br/>cache hit / miss / refetch"]
+        MEM[("In-memory SQLite - PRIMARY")]
+        DISK[("cache.db on disk - SECONDARY")]
+        P --> R --> QE --> MEM
+        MEM -->|"backup every N s + on shutdown"| DISK
+        DISK -->|"load on startup"| MEM
+    end
+
+    App -->|"execute(sql, params)"| P
+    QE -->|"cache miss / delta refresh only"| DB
+    DB -->|"rows"| MEM
+    MEM -->|"list of dicts"| App
 ```

-On the first SELECT for a table, SQLmem fetches the required rows from the database and stores them in an in-memory SQLite instance. Subsequent queries for the same columns hit the in-memory cache with no database round-trip. When a query requests a column not yet in cache, SQLmem re-fetches the table with the expanded column set.
+On the first `SELECT` touching a table, SQLmem fetches the required columns for every row of that table from the database and stores them in the in-memory SQLite database. Subsequent queries for the same columns are served from RAM with no database round-trip. When a query requests a column not yet cached, SQLmem re-fetches the table with the expanded column set. Parametrized queries, JOINs and `SELECT *` are all supported; each table in a JOIN is cached independently and the JOIN runs inside the in-memory SQLite database.

-Parametrized queries, JOINs and `SELECT *` are all supported. Each table referenced in a JOIN is cached independently; the JOIN itself runs in the in-memory SQLite. Query parameters are applied during in-memory filtering, so cache loads always fetch the full table regardless of the `WHERE` values.
+### Query lifecycle
+
+```mermaid
+sequenceDiagram
+    participant App
+    participant SQLmem
+    participant Mem as In-memory SQLite
+    participant DB as Source DB
+
+    App->>SQLmem: execute(SELECT a, b FROM t WHERE id = ?, params)
+    SQLmem->>SQLmem: parse -> table = t, columns = {a, b, id}
+    alt columns already cached
+        SQLmem->>Mem: run query in RAM (with params)
+        Mem-->>SQLmem: rows
+    else cache miss or new column
+        SQLmem->>DB: SELECT a, b, id FROM t   (whole table, no WHERE)
+        DB-->>SQLmem: rows
+        SQLmem->>Mem: store / expand table
+        SQLmem->>Mem: run query in RAM (with params)
+        Mem-->>SQLmem: rows
+    end
+    SQLmem-->>App: list[dict]
+```
+
+Note: query **parameters are applied only to the in-memory query**, never to the source fetch — a cache load always pulls the full table so the cache can answer any later `WHERE` on those columns.
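
The lifecycle above can be sketched with nothing but the standard library. This is a toy model under stated assumptions, not SQLmem's actual code: `TinyCache`, its attributes and the explicit `table`/`columns` arguments are hypothetical (SQLmem derives them by parsing the SQL), and a second in-memory SQLite database stands in for the source DB.

```python
import sqlite3


class TinyCache:
    """Toy model of the lifecycle above; hypothetical, not SQLmem's real code."""

    def __init__(self, source: sqlite3.Connection) -> None:
        self.source = source
        self.mem = sqlite3.connect(":memory:")
        self.mem.row_factory = sqlite3.Row
        self.cached: dict[str, set[str]] = {}  # table -> columns held in RAM
        self.source_fetches = 0                # counts round-trips to the source

    def execute(self, sql: str, params: tuple, table: str, columns: set[str]) -> list[dict]:
        # Assumption: `table` and `columns` are passed in by hand here;
        # a real implementation would extract them from `sql`.
        have = self.cached.get(table, set())
        if not columns <= have:
            self._load(table, have | columns)  # cache miss or new column
        # Parameters are applied only to the in-memory query.
        return [dict(row) for row in self.mem.execute(sql, params)]

    def _load(self, table: str, columns: set[str]) -> None:
        cols = ", ".join(sorted(columns))
        # Whole table, no WHERE: the cache must answer any later filter.
        rows = self.source.execute(f"SELECT {cols} FROM {table}").fetchall()
        self.source_fetches += 1
        self.mem.execute(f"DROP TABLE IF EXISTS {table}")
        self.mem.execute(f"CREATE TABLE {table} ({cols})")
        marks = ", ".join("?" for _ in columns)
        self.mem.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
        self.cached[table] = set(columns)


# Stand-in source database with two rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE t (id INTEGER, a TEXT, b TEXT, c TEXT)")
source.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, "x", "y", "z"), (2, "p", "q", "r")],
)

cache = TinyCache(source)
first = cache.execute("SELECT a, b FROM t WHERE id = ?", (1,), "t", {"a", "b", "id"})
second = cache.execute("SELECT a, b FROM t WHERE id = ?", (2,), "t", {"a", "b", "id"})
fetches_after_two = cache.source_fetches  # the second call was served from RAM
third = cache.execute("SELECT c FROM t WHERE id = ?", (1,), "t", {"c", "id"})
```

The second call uses a different parameter value but triggers no source fetch, because the first load pulled the whole table. The third call asks for a new column, so the table is re-fetched with the expanded column set.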

 ## Installation

@@ -38,7 +80,7 @@ Requires Python 3.14.

 ```python
 from sqlmem import CachingEngine
-from sqlalchemy import create_engine, text
+from sqlalchemy import create_engine

 base_engine = create_engine("postgresql://user:pass@host/db")
 engine = CachingEngine(base_engine)
@@ -65,7 +107,7 @@ engine.execute(
 engine.execute("SELECT * FROM users")
 ```

-`execute()` returns a list of dicts. Parameters are passed straight through to SQLite, so positional (`?`) and named (`:name`) styles both work. Results are compatible with standard iteration patterns.
+`execute()` returns a list of dicts. Parameters are passed straight through to SQLite, so positional (`?`) and named (`:name`) styles both work.

 ## Cache behaviour

@@ -83,23 +125,106 @@ Query 5: SELECT a FROM orders      → cache hit (table already full)

 **Writes are blocked** — INSERT, UPDATE, and DELETE raise `ReadOnlyError`. SQLmem is a read-only cache.

-## Persistence
+## Incremental (delta) refresh

-The in-memory cache is optionally persisted to `cache.db` on disk:
-
- **On startup**: if `cache.db` exists, it is loaded into memory.
- **Hourly**: a background thread writes a snapshot to disk.
- **On shutdown**: a final flush via `atexit` and SIGTERM handler.
-
-Schema version is checked on load — if it does not match, the stale file is discarded and the cache is rebuilt from the database.
-
-## Manual cache invalidation
+Reloading a table with tens of millions of rows on every startup is unacceptable. To avoid it, SQLmem keeps the cache in sync by pulling **only changed rows**. For each delta-tracked table you declare its **last-change timestamp** column and the **key column(s)** that identify a row:

 ```python
-engine.invalidate("orders")   # drops the table from cache; next query re-fetches from DB
+from sqlmem import CachingEngine, DeltaConfig
+
+engine = CachingEngine(
+    base_engine,
+    delta={
+        "VW_P_PRATVALUES": DeltaConfig(
+            change_column="LAST_CHANGE_DATE",   # required — the row's change timestamp
+            key_columns=["PRODUCT_PRODUCTNR"],  # optional for base tables (auto-discovered)
+        ),
+    },
+)
+```
+
+**What you must configure, and what is automatic:**
+
+| Item | Source |
+|---|---|
+| which **tables / columns** to cache | **automatic** — learned from the queries that pass through |
+| `change_column` (timestamp) | **manual, always** — its meaning can't be inferred from the column type<sup>*</sup> |
+| `key_columns` (primary key) | **auto-discovered** for real tables (`inspect(engine).get_pk_constraint`); **manual** for views, which carry no key in the DB catalog |
+
+<sup>*</sup> The one exception is a true MSSQL `rowversion`/`timestamp`-typed column: a table can have at most one, and the server maintains it automatically, so it could be detected automatically. A plain `DATETIME` column like `LAST_CHANGE_DATE` cannot.
+
+If `key_columns` is omitted, SQLmem tries to read the primary key from the source DB on startup and raises a clear error if it can't (e.g. for a view) so you can supply it explicitly.
+
+### How sync works
+
+The boundary of "what changed since last time" is a **data-driven watermark**, not a wall-clock window. SQLmem persists, per delta-tracked table, `last_synced_at` = the **maximum `change_column` value** actually present in the cache after the previous sync (stored in `cache.db`, so it survives restarts). The next sync pulls `WHERE change_column >= last_synced_at`.
+
+Why a watermark and not `now − 5 min`:
+
+- **No clock dependency** — it compares DB values to DB values, so app-server vs database clock skew is irrelevant.
+- **Survives downtime for free** — after hours offline, `>= watermark` pulls *everything* since then; "catch up since last shutdown" needs no special case.
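+- **Less exposed to late commits**: a wall-clock window can drop a row whose timestamp has already fallen outside the window by the time it commits, whereas the watermark only advances to values actually present in the cache. A row that commits with a timestamp older than the current watermark can still be missed, so long-running transactions remain a caveat.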
+
+The filter is `>=` (not `>`) so that rows sharing the exact boundary timestamp are re-read. Combined with **idempotent upsert by `key_columns`**, re-reading a handful of boundary rows on each tick is harmless (they overwrite themselves), and no row sharing the boundary timestamp is skipped. The 5-minute interval is only the **polling cadence**, never the filter boundary.
+
+```mermaid
+sequenceDiagram
+    participant Trigger as Startup / every 5 min
+    participant SQLmem
+    participant Mem as In-memory SQLite
+    participant DB as Source DB
+
+    Trigger->>SQLmem: refresh delta-tracked tables
+    SQLmem->>Mem: read last_synced_at for table
+    SQLmem->>DB: SELECT * FROM t WHERE LAST_CHANGE_DATE >= last_synced_at
+    DB-->>SQLmem: only rows changed since the watermark
+    SQLmem->>Mem: upsert rows by key_columns (INSERT OR REPLACE)
+    SQLmem->>Mem: last_synced_at = max(LAST_CHANGE_DATE)
+```
+
+- **First use** of a delta table → full load; the watermark is set to the table's current `max(change_column)`.
+- **On startup** → for each delta table restored from disk, a single catch-up query pulls everything changed **since the last shutdown** and upserts it, bringing the cache back in sync without a full reload.
+- **While running** → a background thread repeats the delta pull every `SQLMEM_REFRESH_INTERVAL` seconds (default 5 minutes), so the cache trails the source DB by at most that interval.
+- Tables **without** a `DeltaConfig` keep the non-delta behaviour: full load on cache miss, never auto-refreshed.
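
A minimal sketch of one sync tick, assuming SQLite on both sides so it runs without a real source database. The function name `delta_sync`, the `items` table and its columns are hypothetical; the upsert relies on the cache table declaring the key column as `PRIMARY KEY`, and timestamps are ISO strings so that string comparison matches chronological order.

```python
import sqlite3


def delta_sync(source, cache, table, change_col, watermark):
    """Pull rows at or after the watermark, upsert them, return (new watermark, rows pulled)."""
    # Identifiers are interpolated directly; this sketch assumes they are trusted.
    if watermark is None:
        # First use of a delta table: full load.
        rows = source.execute(f"SELECT * FROM {table}").fetchall()
    else:
        # `>=` re-reads rows that share the boundary timestamp.
        rows = source.execute(
            f"SELECT * FROM {table} WHERE {change_col} >= ?", (watermark,)
        ).fetchall()
    if rows:
        marks = ", ".join("?" for _ in rows[0])
        # Idempotent upsert: a row with an existing primary key overwrites itself.
        cache.executemany(f"INSERT OR REPLACE INTO {table} VALUES ({marks})", rows)
    # Data-driven watermark: the maximum change value actually present in the cache.
    new_mark = cache.execute(f"SELECT MAX({change_col}) FROM {table}").fetchone()[0]
    return new_mark, len(rows)


DDL = "CREATE TABLE items (nr TEXT PRIMARY KEY, value TEXT, changed TEXT)"
source = sqlite3.connect(":memory:")
cache = sqlite3.connect(":memory:")
source.execute(DDL)
cache.execute(DDL)
source.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("A", "1", "2026-01-01 10:00:00"), ("B", "2", "2026-01-01 10:05:00")],
)

mark1, pulled1 = delta_sync(source, cache, "items", "changed", None)

# One update and one insert happen in the source after the first sync.
source.execute("UPDATE items SET value = '9', changed = '2026-01-01 10:07:00' WHERE nr = 'A'")
source.execute("INSERT INTO items VALUES ('C', '3', '2026-01-01 10:08:00')")

mark2, pulled2 = delta_sync(source, cache, "items", "changed", mark1)
cached_rows = cache.execute("SELECT nr, value FROM items ORDER BY nr").fetchall()
```

The second tick pulls three rows rather than two: row `B` sits exactly on the boundary and is re-read, which is harmless because the upsert overwrites it with identical values.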
+
+### Requirements and limits of delta sync
+
+- The `change_column` must be **updated on every insert/update in the source DB** and be non-decreasing (e.g. a `DATETIME` maintained by a trigger or the application, or a `rowversion`/`timestamp` column maintained by the server).
+- `key_columns` must uniquely identify a row — they are used to upsert changed rows in place.
+- **Updates, including "deletes by nulling"** (a row that keeps its identity but has values cleared), are handled automatically: the change timestamp bumps, the row is re-pulled and overwritten in place.
+- **Structural changes are not covered by delta sync** — adding/removing attributes, or clearing values *without* bumping `change_column`, won't be picked up. For those, force a clean reload with [`engine.reset()`](#manual-cache-control) (or `invalidate()` for a single table).
+- Hard `DELETE`s of whole rows are not detected by a change timestamp. The workload SQLmem was built for doesn't delete rows; if yours does, use a soft-delete flag column or `reset()`.
+
+## Persistence
+
+The in-memory cache is persisted to `cache.db` on disk:
+
+- **On startup**: if `cache.db` exists, it is loaded into memory.
+- **Periodically**: a background thread writes a snapshot to disk every `SQLMEM_BACKUP_INTERVAL` seconds.
+- **On shutdown**: a final flush via `atexit` and SIGTERM handler.
+
+The schema version is checked on load — if it does not match, the stale file is discarded and the cache is rebuilt from the database.
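
The snapshot and restore steps can be approximated with the standard library's `sqlite3.Connection.backup`. This is only a sketch of the idea: the helper names, the `SCHEMA_VERSION` constant, and the use of `PRAGMA user_version` to hold the schema version are assumptions, not necessarily how SQLmem stores it.

```python
import os
import sqlite3
import tempfile

SCHEMA_VERSION = 1  # hypothetical version number


def save_snapshot(mem: sqlite3.Connection, path: str) -> None:
    """Copy the in-memory database to disk and stamp it with the schema version."""
    mem.commit()  # make sure pending inserts are part of the snapshot
    disk = sqlite3.connect(path)
    try:
        mem.backup(disk)
        disk.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
    finally:
        disk.close()


def load_snapshot(path: str) -> sqlite3.Connection:
    """Return an in-memory database, restored from disk when the version matches."""
    mem = sqlite3.connect(":memory:")
    if not os.path.exists(path):
        return mem
    disk = sqlite3.connect(path)
    try:
        version = disk.execute("PRAGMA user_version").fetchone()[0]
        matches = version == SCHEMA_VERSION
        if matches:
            disk.backup(mem)
    finally:
        disk.close()
    if not matches:
        os.remove(path)  # stale file: discard it and start with an empty cache
    return mem


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.db")

    mem = sqlite3.connect(":memory:")
    mem.execute("CREATE TABLE t (id INTEGER, a TEXT)")
    mem.execute("INSERT INTO t VALUES (1, 'x')")
    save_snapshot(mem, path)

    restored = load_snapshot(path)
    restored_rows = restored.execute("SELECT id, a FROM t").fetchall()

    # Simulate a snapshot written by an incompatible schema version.
    stale = sqlite3.connect(path)
    stale.execute("PRAGMA user_version = 99")
    stale.close()

    rebuilt = load_snapshot(path)
    rebuilt_tables = rebuilt.execute("SELECT name FROM sqlite_master").fetchall()
    stale_file_removed = not os.path.exists(path)
```

In this sketch a version mismatch simply yields an empty in-memory database; the tables would then be repopulated by the normal cache-miss path on the next queries.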
+
+## Manual cache control
+
+```python
+engine.invalidate("orders")   # drop one table from cache; next query re-fetches it from DB
+engine.reset()                # wipe the whole cache (RAM + cache.db) — full clean slate
+engine.refresh()              # pull deltas for all delta-tracked tables now
 engine.close()                # flush to disk and shut down background thread
 ```

+Use `reset()` after a **structural change** in the source (columns added/removed, values cleared in bulk without bumping the change timestamp) so the cache rebuilds from scratch. `invalidate(table)` is the targeted version for a single table.
+
+## Runtime statistics
+
+```python
+stats = engine.stats          # Stats snapshot
+print(stats.hits, stats.misses, stats.refetches)
+for name, t in stats.tables.items():
+    print(name, t.rows, t.columns, t.last_refresh)
+```
+
 ## Configuration

 Set via environment variables or a `.env` file:
@@ -108,8 +233,9 @@ Set via environment variables or a `.env` file:
 |---|---|---|
 | `SQLMEM_DEBUG` | `false` | `true` enables DEBUG-level logging |
 | `SQLMEM_CACHE_DB` | `cache.db` | Path to the on-disk persistence file |
-| `SQLMEM_BACKUP_INTERVAL` | `3600` | Backup interval in seconds |
+| `SQLMEM_BACKUP_INTERVAL` | `3600` | Disk backup interval in seconds |
 | `SQLMEM_SQL_DIALECT` | `tsql` | sqlglot dialect used to parse incoming SQL (e.g. `tsql`, `postgres`, `mysql`) |
+| `SQLMEM_REFRESH_INTERVAL` | `300` | Delta-refresh interval in seconds for delta-tracked tables |

 ## Exceptions

@@ -132,7 +258,7 @@ from sqlmem import add_sink

 add_sink(sys.stderr)                      # INFO by default
 add_sink(sys.stderr, level="DEBUG")       # verbose: every query, cache hit/miss, backup
-add_sink("sqlmem.log", rotation="10 MB") # to a file
+add_sink("sqlmem.log", rotation="10 MB")  # to a file
 ```

 Set `SQLMEM_DEBUG=true` in `.env` to make the default level DEBUG when no explicit `level` is passed to `add_sink()`.
@@ -142,9 +268,16 @@ Set `SQLMEM_DEBUG=true` in `.env` to make the default level DEBUG when no explic
 - In a multi-table (JOIN) query, every column must be qualified with its table or alias; unqualified columns raise `UnsupportedQueryError`.
 - Tables are keyed by their base name — two tables with the same name in different schemas share one cache entry.
 - No distributed cache backend (Redis etc.).
- No transactional consistency guarantees.
+- No transactional consistency guarantees; the cache trails the source DB.
 - Write operations (INSERT/UPDATE/DELETE) are always blocked.

+## Roadmap
+
+- [x] **Incremental (delta) refresh** via per-table change-timestamp + key columns (see above) — the key feature for large tables.
+- [x] **Primary-key auto-discovery** from the source DB (`inspect(engine).get_pk_constraint`) so `key_columns` is only needed for views.
+- [x] **`engine.reset()`** — wipe RAM + `cache.db` for a clean rebuild after structural changes.
+- [ ] Per-table TTL (time-to-live) expiry.
+
 ## Dependencies

 | Layer | Library |