Honza/SQLmem

Jan Doubravský 46370fe651 Fix cache stampede with double-checked locking in load_table

2026-06-11 13:03:22 +02:00

Changelog

All notable changes to this project will be documented in this file.

[Unreleased]

[1.15.0] - 2026-06-11

Fixed

Cache stampede (thundering herd) on cold loads — the decision to load a table was made before the load lock was taken, and load_table never re-checked after acquiring it. During a slow cold load of a large table (observed: 212M rows, ~2 h), a second query for the same table passed the pre-lock "not cached" check, queued on the load lock, and then ran a redundant second full reload instead of seeing the first had finished — doubling a multi-hour load. load_table now does double-checked locking: after acquiring the load lock it re-evaluates a caller-supplied predicate (table cached, all needed columns present, not TTL-expired) and skips the load when it is already satisfied. Invisible on small tables; on large ones it removes hours of redundant indexing under concurrent cold-start traffic.
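The double-checked locking described above can be sketched as follows. This is a minimal illustration under assumptions, not the project's code: `TableLoader`, `is_cached` and `query` are hypothetical stand-ins, and only the shape of the `recheck` callback follows the entry.

```python
import threading
from typing import Callable, Optional


class TableLoader:
    """Toy stand-in for a cache manager; only the locking pattern matters."""

    def __init__(self) -> None:
        self._load_lock = threading.Lock()
        self._cached = set()
        self.loads = 0  # number of real (expensive) loads performed

    def is_cached(self, table: str) -> bool:
        return table in self._cached

    def load_table(self, table: str,
                   recheck: Optional[Callable[[], bool]] = None) -> bool:
        """Load `table` unless `recheck` reports it is already satisfied.

        Returns True if a load ran, False if it was skipped.
        """
        with self._load_lock:
            # Second check, made while holding the lock: another thread may
            # have finished the same load while this one was queued.
            if recheck is not None and recheck():
                return False
            self.loads += 1  # stand-in for the slow source fetch
            self._cached.add(table)
            return True


def query(loader: TableLoader, table: str) -> None:
    # First check, made without the lock (cheap fast path).
    if not loader.is_cached(table):
        loader.load_table(table, recheck=lambda: loader.is_cached(table))


loader = TableLoader()
threads = [threading.Thread(target=query, args=(loader, "VW_X")) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the threads interleave, only the first one through the lock performs a load; the rest see the re-check succeed and return.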

Changed

pyproject.toml — bumped version to 1.15.0.
CacheManager.load_table gained an optional recheck callback (the double-check predicate); QueryExecutor supplies it for both column and SELECT * loads.

[1.14.0] - 2026-06-10

Follow-up to 1.12.0 from running datetime_columns in production: the feature was only half-wired (writes were coerced, reads and query params were not).

Fixed

WHERE on an INTEGER-µs datetime_columns column silently returned 0 rows — execute_in_memory() coerced query params with to_sqlite(), which leaves an ISO string a string. Comparing the stored INTEGER against a TEXT param is always false under SQLite affinity, so WHERE CHANGE_DATE > '2026-05-01T…' matched nothing. Params for a query that touches a datetime_columns table are now coerced to epoch µs (datetime objects and ISO-datetime strings alike), so the comparison matches the stored integer. Scoped to the query's tables, so non-datetime queries are unaffected.
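The failure mode can be reproduced with the standard `sqlite3` module alone. The sketch below uses a hypothetical `epoch_us` helper and a made-up one-row table; it does not use sqlmem itself.

```python
import sqlite3
from datetime import datetime, timezone


def epoch_us(year: int, month: int, day: int) -> int:
    """Midnight UTC of the given date as integer microseconds since the epoch."""
    dt = datetime(year, month, day, tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1_000_000


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "VW_X" ("ID" INTEGER PRIMARY KEY, "CHANGE_DATE" INTEGER)')
conn.execute('INSERT INTO "VW_X" VALUES (1, ?)', (epoch_us(2026, 6, 1),))

sql = 'SELECT "ID" FROM "VW_X" WHERE "CHANGE_DATE" > ?'

# ISO string parameter: it is not numeric text, so it stays TEXT, and SQLite
# sorts every INTEGER before every TEXT value. The predicate is never true.
rows_text = conn.execute(sql, ("2026-05-01T00:00:00",)).fetchall()

# The same instant as integer microseconds: a plain integer comparison.
rows_int = conn.execute(sql, (epoch_us(2026, 5, 1),)).fetchall()

print(rows_text)  # []
print(rows_int)   # [(1,)]
```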

Added

Read-time coercion — datetime_columns come back as datetime — execute() now returns those columns as real datetime objects (UTC) instead of the raw INTEGER µs, restoring the transparent-proxy contract (you get the same type a direct source query would give). Opt out with CachingEngine(..., return_datetime=False) to get the raw integers.
Stats.db_size_bytes — on-disk size of the cache file (0 in memory mode), so engine.stats exposes cache growth for monitoring without an external file check.
Public datetime_to_epoch_us helper — from sqlmem import datetime_to_epoch_us exposes the same datetime→epoch-µs conversion used internally, so callers building WHERE change_col > ? params don't have to re-implement it.

Changed

pyproject.toml — bumped version to 1.14.0.
vacuum(incremental=True) now warns instead of silently doing nothing when the cache was not created with auto_vacuum=INCREMENTAL (the only mode in which incremental vacuum can reclaim pages); it logs how to fix it (hard_reset() with the pragma, or a full vacuum(incremental=False)) and returns.
CacheManager.execute_in_memory() gained an optional tables argument (the query's tables) used to scope datetime param/result coercion; CacheManager/CachingEngine gained a return_datetime flag.

[1.12.0] - 2026-06-09

⚠️ Breaking

SCHEMA_VERSION bumped 3 → 4 — on upgrade the existing cache is wiped automatically (disk mode wipes the file in place, in-memory discards the backup) and reloaded from the source on next use. For a large cache (e.g. a multi-hundred-million-row table) the full reload can take a while; deploy in a maintenance window.
datetime_columns change the public output contract for the chosen columns — a column listed in datetime_columns is stored and returned as an INTEGER (microseconds since the Unix epoch, UTC), not an ISO TEXT string. This is opt-in per column, so no table is affected unless you name its columns; consumers that read or filter such a column must adapt (compare against integer µs, or convert on read).

Added

datetime_columns= parameter on CachingEngine / CacheManager — datetime_columns={"VW_X": ["CHANGE_DATE"]} stores the named datetime columns as INTEGER µs-since-epoch instead of ~28-byte ISO TEXT. Saves ~20 bytes per row and makes index comparisons on the column operate on native integers instead of string collation — worthwhile for a pure datetime column on a very large table (e.g. a delta change column that is also range-scanned).
- _coerce.to_sqlite_datetime() converts datetimes (and ISO/date values) to exact integer microseconds via integer arithmetic (no float rounding); a naive datetime is treated as UTC, None passes through.
- load_table declares those columns INTEGER and upsert_rows coerces them the same way, so full loads and delta upserts agree on the on-disk representation.
- The delta high-watermark for such a column is the stored integer; delta._bind_watermark(..., epoch_us=True) reconstructs a real UTC datetime before binding, so the source still receives a typed timestamp (and the watermark fix from 1.8.0 keeps holding).
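A conversion along the lines described above might look like the following sketch. The function names `to_epoch_us` and `from_epoch_us` are hypothetical and this is not the body of `_coerce.to_sqlite_datetime()`; it only illustrates the integer-arithmetic approach, with naive values taken as UTC and `None` passed through.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Optional, Union

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_us(value: Union[datetime, date, str, None]) -> Optional[int]:
    """Convert a datetime, date or ISO string to microseconds since the epoch."""
    if value is None:
        return None
    if isinstance(value, str):
        value = datetime.fromisoformat(value)
    elif not isinstance(value, datetime):  # a plain date: use midnight
        value = datetime(value.year, value.month, value.day)
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # naive means UTC here
    delta = value - _EPOCH
    # timedelta stores days, seconds and microseconds as integers, so this
    # never passes through a float.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def from_epoch_us(us: int) -> datetime:
    """Inverse of to_epoch_us: integer microseconds back to an aware UTC datetime."""
    return _EPOCH + timedelta(microseconds=us)


stamp = datetime(2026, 6, 5, 14, 54, 24, 823001)
stored = to_epoch_us(stamp)
restored = from_epoch_us(stored)
```

Going through `datetime.timestamp()` and multiplying by 1e6 would involve a float, which cannot represent every microsecond value at this magnitude exactly; the integer path round-trips every value.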

Changed

pyproject.toml — bumped version to 1.12.0.
CacheManager.max_value / set_last_synced_at now accept/return int watermarks alongside str (the INTEGER-µs watermark round-trips through the last_synced_at TEXT column as its digit string).

[1.11.0] - 2026-06-09

Added

pragmas= parameter on CachingEngine / CacheManager — pass a dict of SQLite PRAGMAs (e.g. mmap_size, cache_size, temp_store, page_size, auto_vacuum) applied to the cache connection at open time, so disk-backed caches can be tuned for the host's I/O profile without bypassing CacheManager. Unknown/inapplicable pragmas are silently ignored by SQLite (graceful degradation, no startup crash).
- page_size is a layout pragma: it is applied only on a fresh file (set before WAL / the first table). On an existing cache with a different page size the request is ignored and a one-time warning is logged — the new value takes effect only after hard_reset() or a rebuild.
- auto_vacuum is set before the database header is materialized (before switching to WAL) on a fresh file, so INCREMENTAL/FULL actually stick instead of silently reverting to NONE.
CachingEngine.hard_reset() / CacheManager.hard_reset() — close every connection, delete the on-disk cache file (and its -wal/-shm sidecars) and reopen from scratch with all current pragmas applied. Unlike reset() (which drops tables but keeps the open file), this lets page_size/auto_vacuum change, since those are baked into the file at creation. Disk mode only — falls back to reset() in memory mode. All tables reload on next use.
CachingEngine.vacuum(incremental=True, pages=10_000) / CacheManager.vacuum(...) — run maintenance VACUUM on the on-disk cache to reclaim free pages left by delta INSERT OR REPLACE churn. Incremental (default) reclaims up to pages pages without blocking readers or extra disk (requires auto_vacuum=INCREMENTAL); incremental=False runs a full VACUUM (rewrites the file, ~2× disk, blocks readers — maintenance window only). No-op in memory mode.
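Why `auto_vacuum` has to be set on a fresh file can be seen with plain `sqlite3`. The sketch below is independent of sqlmem, and `auto_vacuum_mode` is a hypothetical helper. It creates two databases that differ only in whether the pragma is issued before or after the first table.

```python
import os
import sqlite3
import tempfile


def auto_vacuum_mode(path: str, pragma_first: bool) -> int:
    """Create a database at `path` and return its effective auto_vacuum mode
    (0 = NONE, 1 = FULL, 2 = INCREMENTAL)."""
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit
    try:
        if pragma_first:
            conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
            conn.execute("CREATE TABLE t (x)")
        else:
            conn.execute("CREATE TABLE t (x)")
            conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
        return conn.execute("PRAGMA auto_vacuum").fetchone()[0]
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    fresh = auto_vacuum_mode(os.path.join(tmp, "fresh.db"), pragma_first=True)
    late = auto_vacuum_mode(os.path.join(tmp, "late.db"), pragma_first=False)

print(fresh)  # 2: the mode is baked into the new file
print(late)   # 0: ignored once a table exists, until a full VACUUM
```

Once the file has content, the only ways to change the mode are a full VACUUM or recreating the file, which is the case `hard_reset()` covers.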

Changed

pyproject.toml — bumped version to 1.11.0.
ColumnRegistry gained rebind() so it follows the cache connection swap performed by hard_reset() (the registry previously captured the connection for the process lifetime).

[1.10.0] - 2026-06-09

Added

last_upsert (persisted write) vs last_refresh (run/liveness) in stats — TableStats.last_refresh previously came from the persisted last_refresh_at column, which is only written when rows are actually written (a delta cycle with total == 0 early-returns and leaves it unchanged). A healthy delta that keeps finding no new rows therefore looked frozen. The single value is now split:
- last_upsert — wall-clock (UTC) of the last actual data write (full load / delta with rows). Persisted, survives restarts (this is the existing last_refresh_at column, surfaced under a clearer name).
- last_refresh — wall-clock (UTC) of the last time a refresh cycle ran for the table, even when it wrote nothing. In-memory per process (None until the first cycle after start), tracked like _states/_errors — so no schema change and no cache wipe.
- CacheManager gained mark_refresh_ran() / get_last_runs(); an empty delta cycle now records a run. TTL staleness still uses the last write (seconds_since_refresh reads last_refresh_at), so behaviour is unchanged.

Changed

pyproject.toml — bumped version to 1.10.0.
TableStats.last_refresh is now str | None (was str) and a new required last_upsert: str | None field is added. Consumers reading last_refresh for "when did data change?" should switch to last_upsert.

[1.8.0] - 2026-06-08

Fixed

Frozen delta watermark on datetime change columns — the delta high-watermark is read back from the cache as an ISO TEXT string (e.g. '2026-06-05T14:54:24.823000') and was bound straight back to the source. SQL Server then had to implicitly convert that nvarchar to datetime and failed (T-separated ISO with 6 fractional digits exceeds datetime's 3 — error 241 / SQLSTATE 22007), so every delta refresh and the startup catch-up died before streaming and the watermark never advanced (the cache silently froze at the last full load). The watermark is now parsed back to a real datetime (delta._bind_watermark) so the driver sends a typed timestamp and the comparison runs natively; non-datetime change columns (e.g. integer rowversions) pass through unchanged. Regression tests added.
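The fix amounts to re-typing the watermark before it is bound. The sketch below is an assumption about the general shape, not the code of `delta._bind_watermark`; the name `bind_watermark` is hypothetical.

```python
from datetime import datetime
from typing import Union

Watermark = Union[str, int, datetime, None]


def bind_watermark(value: Watermark) -> Watermark:
    """Turn an ISO-8601 datetime string back into a datetime; leave anything
    else (integers, None, strings that are not ISO datetimes) unchanged."""
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value)
        except ValueError:
            return value
    return value


typed = bind_watermark("2026-06-05T14:54:24.823000")
rowversion = bind_watermark(12345)
```

With a `datetime` parameter the driver sends a typed timestamp, so the source never has to parse a string. This sketch guesses from the value alone; a real implementation would more likely decide from the change column's declared type.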

Added

Refresh/load failures are now visible in stats — TableStats gained last_error, last_error_at and consecutive_failures, and Stats gained a total errors counter. A delta that fails before streaming (e.g. the watermark bug above) previously left state = ready, hiding the problem; it now also marks the table error and records the message. consecutive_failures resets to 0 on the next success.
Per-engine configuration — CachingEngine accepts cache_db_path, backup_interval, refresh_interval, fetch_batch and dialect (each defaults to its env var / config global when omitted), so two engines with independent cache files can run in one process and config is testable without env vars.
blocking_startup_refresh flag (default False) — the startup catch-up (deltas/TTL reloads for tables restored from disk) now runs on the background thread by default, so it never blocks application startup. Pass blocking_startup_refresh=True to catch up synchronously before serving.

Changed

SQL identifiers are quoted — table/column names are now quoted everywhere they are interpolated into statements (SQLite double-quote for the cache, the configured dialect — e.g. T-SQL [brackets] — for the source), so reserved words or names with spaces work and the f-string interpolation is hardened.
Source connection opened lazily — execute() no longer opens a source connection on every call; a pure cache hit never touches the source (and never occupies a pool slot). The misleading cast(sqlite3.Connection, …) on the source handle was removed (it is a pyodbc connection in production).
Concurrent reads in disk mode — disk-backed reads now use a per-thread read-only WAL connection instead of sharing the single write connection under a lock, so a slow SELECT no longer blocks writers (loads/upserts) or other readers. In-memory mode is unchanged (a :memory: database can't be shared across connections).
add_sink is idempotent — calling it again for the same sink is a no-op, so a double import no longer duplicates every log line.
pyproject.toml — bumped version to 1.8.0; added a scoped pytest filterwarnings for the SQLite test source's legacy datetime-adapter deprecation.
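The "Concurrent reads in disk mode" change above can be pictured with `threading.local` and SQLite's read-only URI mode. This is a minimal sketch under assumptions: `DiskReader` is a hypothetical class, and the real connection handling in sqlmem may differ.

```python
import os
import pathlib
import sqlite3
import tempfile
import threading


class DiskReader:
    """Hands each thread its own read-only connection to a WAL database."""

    def __init__(self, path: str) -> None:
        # as_uri() needs an absolute path and takes care of escaping.
        self._uri = pathlib.Path(path).resolve().as_uri() + "?mode=ro"
        self._local = threading.local()

    def _conn(self) -> sqlite3.Connection:
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self._uri, uri=True)
            self._local.conn = conn
        return conn

    def fetchall(self, sql: str, params: tuple = ()) -> list:
        return self._conn().execute(sql, params).fetchall()


results = []
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.db")
    writer = sqlite3.connect(path)  # the single write connection
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE t (x INTEGER)")
    writer.execute("INSERT INTO t VALUES (1)")
    writer.commit()

    reader = DiskReader(path)

    def worker() -> None:
        results.append(reader.fetchall("SELECT x FROM t"))

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    writer.close()
```

In WAL mode readers do not block the writer and the writer does not block readers, so no shared lock is needed around reads.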

Note

Cache type fidelity (returning real datetime/Decimal/numeric types from execute() instead of TEXT strings, and giving numeric columns proper affinity) was evaluated but deferred — it changes the public output contract that consumers currently rely on (and that test_coerce.py pins). Decimal/datetime stay stored as exact, lossless TEXT.

[1.7.0] - 2026-06-08

Added

Disk-backed cache mode — CachingEngine(engine, in_memory=False) (or env SQLMEM_IN_MEMORY=false) queries the on-disk cache.db directly instead of loading it into an in-memory SQLite. Every write persists immediately (no hourly backup thread, no load-on-startup copy, no atexit/SIGTERM flush needed), and the cache may exceed available RAM. The disk connection uses WAL + synchronous=NORMAL for write throughput. In-memory mode (backed up to disk periodically) remains the default. in_memory defaults to the SQLMEM_IN_MEMORY config when omitted.
- On open, a disk cache with a mismatched schema_version is wiped in place and rebuilt.
- engine.reset() in disk mode drops the cached tables and VACUUMs the file (it does not unlink the open file).
SQLMEM_IN_MEMORY env var (default true).

Changed

pyproject.toml — bumped version to 1.7.0
cache.py — CacheManager gained an in_memory flag; the cache connection (_mem_conn → _conn) is opened either on :memory: or directly on the on-disk file. Disk mode skips the load-on-startup copy, backup thread, and shutdown flush, and reset() VACUUMs in place instead of unlinking the open file.
.gitignore — ignore cache.db and its WAL sidecars (cache.db-wal, cache.db-shm).

[1.6.0] - 2026-06-05

Added

Secondary indexes — CachingEngine(engine, indexes={"VW_X": ["col", ["a", "b"]]}) creates indexes on the in-memory cache to accelerate WHERE/JOIN lookups. Index columns are auto-loaded so the index exists from the first load, and indexes are recreated after every (re)load and persist in cache.db. Combines freely with delta and ttl.
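One way to turn the `indexes=` specification into SQLite statements is sketched below. The `ix_<table>_<cols>` naming scheme and the `create_indexes` helper are assumptions for illustration, not the library's actual behaviour.

```python
import sqlite3
from typing import List, Sequence, Union

IndexSpec = Union[str, Sequence[str]]


def quote(name: str) -> str:
    """Quote an identifier for SQLite, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_indexes(conn: sqlite3.Connection, table: str,
                   specs: Sequence[IndexSpec]) -> List[str]:
    """Create one index per spec; a bare string is a single-column index,
    a list of strings is a composite index. Returns the index names."""
    names = []
    for spec in specs:
        cols = [spec] if isinstance(spec, str) else list(spec)
        name = "ix_" + "_".join([table, *cols])  # hypothetical naming scheme
        col_sql = ", ".join(quote(c) for c in cols)
        conn.execute(
            f"CREATE INDEX IF NOT EXISTS {quote(name)} ON {quote(table)} ({col_sql})"
        )
        names.append(name)
    return names


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "VW_X" ("col", "a", "b")')
created = create_indexes(conn, "VW_X", ["col", ["a", "b"]])
```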

Changed

pyproject.toml — bumped version to 1.6.0

[1.5.0] - 2026-06-05

Added

Per-table processing state in stats — TableStats now carries state (loading / refreshing / ready / stale / error) and tracking (delta / ttl / static), so callers can see whether each table is up to date or being processed. In-progress first loads and failed loads also surface in stats.tables.
SQLMEM_FETCH_BATCH env var (default 10000) — rows fetched per batch when loading a table.

Changed

pyproject.toml — bumped version to 1.5.0
Large-table loads are streamed in batches — load_table no longer fetchall()s the whole table (which double-buffered every row in Python and could OOM/crash on tens of millions of rows). Rows are now fetched SQLMEM_FETCH_BATCH at a time into a staging table and swapped in atomically, so peak memory stays bounded, the previous copy stays queryable during a reload, and the network fetch no longer holds the cache lock. Delta catch-ups are streamed the same way.
Orphan staging tables left by an interrupted load (crash/backup mid-load) are dropped on startup.
Delta upserts compute row_count once per refresh instead of a full COUNT(*) after every batch (avoids O(rows×batches) work on large catch-ups).
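The streamed load described above (batched fetch into a staging table, then an atomic swap) can be sketched as follows. `load_table_streamed` and the `_staging_` prefix are hypothetical, both connections are assumed to be default-mode `sqlite3` connections with no open transaction, and a SQLite database stands in for the real source.

```python
import sqlite3
from typing import Sequence


def _q(name: str) -> str:
    """Quote an identifier for SQLite."""
    return '"' + name.replace('"', '""') + '"'


def load_table_streamed(source: sqlite3.Connection, cache: sqlite3.Connection,
                        table: str, columns: Sequence[str],
                        batch: int = 10_000) -> int:
    """Copy `table` from source to cache in batches; return the row count."""
    staging = f"_staging_{table}"
    col_list = ", ".join(_q(c) for c in columns)
    marks = ", ".join("?" for _ in columns)

    cache.execute(f"DROP TABLE IF EXISTS {_q(staging)}")  # orphan from a crash
    cache.execute(f"CREATE TABLE {_q(staging)} ({col_list})")

    cur = source.execute(f"SELECT {col_list} FROM {_q(table)}")
    total = 0
    while True:
        rows = cur.fetchmany(batch)  # at most `batch` rows held in Python
        if not rows:
            break
        cache.executemany(f"INSERT INTO {_q(staging)} VALUES ({marks})", rows)
        cache.commit()
        total += len(rows)

    # Swap inside one transaction: readers see the old table or the new one.
    cache.execute("BEGIN")
    cache.execute(f"DROP TABLE IF EXISTS {_q(table)}")
    cache.execute(f"ALTER TABLE {_q(staging)} RENAME TO {_q(table)}")
    cache.commit()
    return total


source = sqlite3.connect(":memory:")
source.execute('CREATE TABLE "VW_X" ("ID" INTEGER, "NAME" TEXT)')
source.executemany('INSERT INTO "VW_X" VALUES (?, ?)',
                   [(i, f"row{i}") for i in range(25)])
source.commit()

cache = sqlite3.connect(":memory:")
loaded = load_table_streamed(source, cache, "VW_X", ["ID", "NAME"], batch=10)
```

Until the swap commits, the previous copy of the table stays in place and queryable, which is the property the entry relies on during long reloads.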

[1.4.0] - 2026-06-05

Fixed

decimal.Decimal (and datetime) binding error — NUMERIC/DECIMAL/MONEY columns from SQL Server (pyodbc) arrive as decimal.Decimal, which sqlite3 cannot bind, crashing the cache load with type 'decimal.Decimal' is not supported. Values are now coerced to sqlite-bindable types (Decimal→str, datetime/date/time→ISO, uuid.UUID→str, bytearray→bytes) at the cache boundary — on full load, on delta upsert, and for WHERE parameters. Coercion is local (no global sqlite3.register_adapter), so the host application's sqlite3 behaviour is untouched. Cache columns are TEXT, so the conversion is lossless and exact (no rounding).
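The coercion listed above can be sketched as a small local function. This follows the mapping given in the entry, but the function body is an illustration and not the project's `to_sqlite()` implementation.

```python
import sqlite3
import uuid
from datetime import date, datetime, time
from decimal import Decimal


def to_sqlite(value):
    """Map values sqlite3 cannot bind to lossless bindable equivalents."""
    if isinstance(value, Decimal):
        return str(value)  # exact digits, no float rounding
    if isinstance(value, (datetime, date, time)):
        return value.isoformat()
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, bytearray):
        return bytes(value)
    return value  # int, float, str, bytes, None pass through


row = (Decimal("19.90"), datetime(2026, 6, 5, 14, 54, 24, 823000), 7)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (price TEXT, changed TEXT, qty INTEGER)")
conn.execute("INSERT INTO t VALUES (?, ?, ?)", tuple(to_sqlite(v) for v in row))
stored = conn.execute("SELECT price, changed, qty FROM t").fetchone()
```

Because the conversion is applied per value at the call site, nothing is registered globally with `sqlite3`, matching the "local coercion" point in the entry.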

Added

Incremental (delta) refresh — CachingEngine(engine, delta={...}) with DeltaConfig(change_column, key_columns). Delta-tracked tables are kept in sync by pulling only changed rows (WHERE change_column >= watermark) and upserting them by key, instead of full reloads.
- The high-watermark is data-driven: max(change_column) over the cached rows, persisted in cache.db. Pulling with >= (an overlapping boundary) combined with an idempotent upsert means no row is missed and boundary rows are harmlessly re-read.
- Catch-up on startup (since last shutdown) and a background thread refreshing every SQLMEM_REFRESH_INTERVAL seconds (default 300); engine.refresh() triggers a pull on demand.
- Primary key is auto-discovered from the source DB (inspect(engine).get_pk_constraint) when key_columns is omitted; required explicitly for views (raises ValueError).
Per-table TTL (time-based refresh) — CachingEngine(engine, ttl={"VW_X": 300}) for tables with no change column that can't be delta-synced. The cached copy is guaranteed never older than the TTL: a query touching an expired table triggers a full reload before it is answered (read-time guarantee), and the background thread proactively reloads expired tables. TTL age uses the persisted last_refresh_at, so the bound holds across restarts. A table in both delta and ttl raises ValueError.
DeltaConfig exported from the public API.
engine.reset() — wipes the whole cache (RAM + cache.db) for a clean rebuild after structural source changes.
SQLMEM_REFRESH_INTERVAL env var (default 300) — background refresh tick for delta pulls and proactive TTL reloads.
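The delta refresh from this release's Added list (data-driven watermark, `>=` overlap, upsert by key) can be sketched with two SQLite connections. `delta_refresh` and its parameters are hypothetical, a single key column is assumed, and a SQLite database stands in for the source.

```python
import sqlite3
from typing import Sequence


def delta_refresh(source: sqlite3.Connection, cache: sqlite3.Connection,
                  table: str, columns: Sequence[str],
                  key: str, change: str) -> int:
    """Pull rows at or above the cached watermark and upsert them by key.

    Returns the number of rows pulled (boundary rows are re-read on purpose).
    """
    cols = ", ".join(f'"{c}"' for c in columns)
    marks = ", ".join("?" for _ in columns)

    # The unique index is what lets INSERT OR REPLACE act as an upsert.
    cache.execute(
        f'CREATE UNIQUE INDEX IF NOT EXISTS "ux_{table}" ON "{table}" ("{key}")'
    )
    watermark = cache.execute(f'SELECT MAX("{change}") FROM "{table}"').fetchone()[0]

    sql = f'SELECT {cols} FROM "{table}"'
    params = ()
    if watermark is not None:  # empty cache: pull everything
        sql += f' WHERE "{change}" >= ?'
        params = (watermark,)
    rows = source.execute(sql, params).fetchall()

    cache.executemany(
        f'INSERT OR REPLACE INTO "{table}" ({cols}) VALUES ({marks})', rows
    )
    cache.commit()
    return len(rows)


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT, ver INTEGER)")
source.executemany("INSERT INTO t VALUES (?, ?, ?)", [(1, "a", 1), (2, "b", 2)])
source.commit()

cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE t (id INTEGER, val TEXT, ver INTEGER)")

first = delta_refresh(source, cache, "t", ["id", "val", "ver"], "id", "ver")

source.execute("UPDATE t SET val = 'a2', ver = 3 WHERE id = 1")
source.execute("INSERT INTO t VALUES (3, 'c', 3)")
source.commit()

second = delta_refresh(source, cache, "t", ["id", "val", "ver"], "id", "ver")
```

The second pull re-reads the boundary row with `ver = 2` along with the two changed rows; the upsert makes that re-read harmless.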

Changed

pyproject.toml — bumped version to 1.4.0
cache.py — schema version bumped to 3; _sqlmem_tables gained a last_synced_at watermark column. New methods: execute_in_memory (lock-serialized read), get_table_columns, create_unique_index, get/set_last_synced_at, max_value, upsert_rows, seconds_since_refresh, reset. Existing on-disk caches are discarded and rebuilt on load.
executor.py — delta-tracked tables augment their column set with key/change columns (unique key index + initial watermark); TTL-tracked tables full-reload at read time when expired; in-memory reads go through the cache lock.

[1.2.0] - 2026-06-04

Added

Parametrized queries (R1) — execute(sql, params) accepts positional (? tuple/list) and named (:name dict) parameters; passed straight to SQLite during in-memory filtering. Cache loads still fetch the full table (parameters are not applied to source fetches).
JOIN support (R2) — multi-table SELECTs are parsed into per-table column sets; each table is cached independently and the JOIN runs in the in-memory SQLite. Columns in a multi-table query must be qualified by table or alias.
SELECT * support (R3) — wildcard (and alias.*) queries discover all columns from the source DB, cache the whole table, and mark it is_full so later column queries are guaranteed cache hits without re-fetch.
Three-part table names (R4) — [catalog].[schema].[table] is parsed to its base name for caching; the in-memory query is rewritten to strip catalog/schema prefixes so it runs under SQLite.
SQLMEM_SQL_DIALECT env var (default tsql) — sqlglot dialect used to parse incoming SQL; T-SQL also accepts ANSI SQL and MSSQL bracket quoting.
CacheManager.discover_columns() and CacheManager.is_table_full(); load_table() gained a full flag.

Changed

pyproject.toml — bumped version to 1.2.0
parser.py — ParsedQuery.table: str replaced by tables: list[str] plus columns_by_table, sqlite_sql, params, and wildcard_tables; SQL is parsed with the configured dialect and rendered to SQLite for execution.
executor.py — loads each referenced table independently and applies query parameters during in-memory execution.
cache.py — schema version bumped to 2; _sqlmem_tables gained an is_full column (existing on-disk caches are discarded and rebuilt on load).

[1.1.0] - 2026-06-03

Added

Stats and TableStats frozen dataclasses — snapshot of runtime cache statistics (hit/miss/refetch counts, per-table row count, columns, last refresh timestamp)
StatsCollector — internal thread-safe counter; increments on every cache hit, miss, and re-fetch
engine.stats property — returns a Stats snapshot at any point in time
Stats and TableStats exported from the public API

Changed

pyproject.toml — bumped version to 1.1.0

[1.0.0] - 2026-06-03

Changed

pyproject.toml — bumped version to 1.0.0

[0.4.0] - 2026-06-03

Added

add_sink(sink, *, level, **kwargs) — public API for routing sqlmem log records to any loguru-compatible sink (stream, file, callable); supports all loguru logger.add() kwargs including rotation, retention, etc.

Changed

pyproject.toml — bumped version to 0.4.0
config.py — replaced destructive logger.remove() + forced default sink with logger.disable("sqlmem"); sqlmem is now silent by default and does not interfere with the host application's logging setup

[0.3.0] - 2026-06-03

Added

README.md — full project documentation: architecture overview, quick start, cache behaviour, persistence, configuration, exceptions, logging, and limitations

Changed

pyproject.toml — bumped version to 0.3.0
parser.py — _extract_columns now deduplicates column names while preserving order
.gitignore — added .env and .env.* to prevent accidental commit of environment files

Security

Removed .env from git tracking (git rm --cached)

[0.2.0] - 2026-06-01

Added

Project specification in project.md — architecture, API design, cache backend, metadata schema, logging strategy, and TODO for future features (JOIN, SELECT * support)
.gitignore for Python/Poetry project
pyproject.toml dependencies: sqlglot, sqlalchemy, loguru, python-dotenv; dev dependencies: pytest, ruff, mypy
src/sqlmem/ package structure with src layout
src/sqlmem/exceptions.py — ReadOnlyError (blocks INSERT/UPDATE/DELETE), UnsupportedQueryError (blocks JOIN and SELECT *)
src/sqlmem/config.py — loads .env, configures loguru with DEBUG/INFO level based on SQLMEM_DEBUG
src/sqlmem/_meta.py — package version constant
src/sqlmem/parser.py — SQL Parser using sqlglot; extracts table and columns from SELECT, raises on writes/JOIN/wildcard
src/sqlmem/registry.py — Column Registry; accumulates requested columns per table, detects missing columns requiring re-fetch
src/sqlmem/cache.py — Cache Manager; SQLite in-memory storage, load from cache.db on startup (with schema version check), hourly backup thread, atexit/SIGTERM flush, metadata tables (_sqlmem_meta, _sqlmem_tables, _sqlmem_columns)
src/sqlmem/executor.py — Query Executor; cache hit/miss logic, re-fetch on new columns with WARNING log
src/sqlmem/engine.py — CachingEngine wrapper; public API compatible with SQLAlchemy, invalidate(table) for manual cache clearing
src/sqlmem/__init__.py — public exports: CachingEngine, ReadOnlyError, UnsupportedQueryError
tests/test_parser.py — parser tests: SELECT parsing, ReadOnlyError, UnsupportedQueryError
tests/test_cache.py — cache tests: load, data correctness, metadata, disk backup/reload
tests/test_registry.py — registry tests: accumulation, needs_refetch, table isolation
