Honza/SQLmem

Jan Doubravský 4a86b2282f Add declarative TableSpec API with preload and fail-fast; fix shared-connection race

2026-06-11 13:39:56 +02:00

Changelog

All notable changes to this project will be documented in this file.

[Unreleased]

[1.16.0] - 2026-06-11

Added

Declarative table specs — CachingEngine(tables=[TableSpec(...)]) — declare each cached table up front (columns, indexes, refresh strategy, datetime columns, preload) instead of letting the engine learn columns lazily from queries. New public types TableSpec, TTL, Delta (a friendly alias of DeltaConfig) and exception UndeclaredError.
- Background preload — preload=True tables are loaded at startup (on the background thread by default, so startup isn't blocked; blocking_startup_refresh=True loads them synchronously). A copy already fresh in the persistent cache is skipped via the same double-checked locking added in 1.15.0, so a warm restart is instant.
- Fail-fast on undeclared access — in declarative mode a query referencing a table that has no TableSpec, or a column outside a spec's declared columns (including SELECT * on a column-restricted table), raises UndeclaredError instead of silently triggering an expensive lazy load / column-expansion. Declare columns=None to cache the whole table and allow any column.
- Solves the lazy second-reload — because columns are declared, a first query for a previously unseen column no longer forces a full re-fetch.
executor.ensure_loaded(table, columns) — preloads a table into the cache (reusing the full load path: delta/index augmentation, registry, watermark, double-checked locking) without materializing any rows.
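
The fail-fast rule above can be illustrated with a small stand-alone sketch. It does not use sqlmem's real classes: `Spec`, `Undeclared` and `check_access` are hypothetical stand-ins for `TableSpec`, `UndeclaredError` and the engine's internal check, whose actual signatures may differ.

```python
from dataclasses import dataclass
from typing import Optional


class Undeclared(Exception):
    """Hypothetical stand-in for sqlmem's UndeclaredError."""


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for TableSpec; columns=None means 'whole table'."""
    name: str
    columns: Optional[frozenset] = None


def check_access(specs: dict, table: str, columns: list) -> None:
    """Raise instead of lazily loading anything that was not declared."""
    spec = specs.get(table)
    if spec is None:
        raise Undeclared(f"table {table!r} has no spec")
    if spec.columns is None:
        return  # the whole table is cached, so any column is allowed
    if "*" in columns:
        raise Undeclared(f"SELECT * on column-restricted table {table!r}")
    missing = [c for c in columns if c not in spec.columns]
    if missing:
        raise Undeclared(f"undeclared columns on {table!r}: {missing}")


specs = {
    "VW_X": Spec("VW_X", frozenset({"ID", "CHANGE_DATE"})),
    "VW_ALL": Spec("VW_ALL"),  # columns=None: any column allowed
}
check_access(specs, "VW_X", ["ID"])          # declared column: passes
check_access(specs, "VW_ALL", ["ANYTHING"])  # whole-table spec: passes
```

The point of the design is that every rejected case fails before any source query is issued, so a typo in a column name costs an exception rather than a reload.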

Fixed

Race on the shared cache connection — the metadata reads (is_table_cached, is_table_full, seconds_since_refresh, get_table_columns, get_last_synced_at, max_value, count_rows) touched the single shared SQLite connection without the connection lock, so a query thread reading while the background refresh/preload thread wrote could raise sqlite3.InterfaceError. These reads now take the lock. More likely to surface now that startup preload adds background-thread activity.
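
As a rough illustration of the fix, the sketch below guards every use of one shared `sqlite3` connection, reads included, with the same lock. `SharedCache` and its methods are hypothetical names for this example only, not sqlmem's `CacheManager`.

```python
import sqlite3
import threading


class SharedCache:
    """One SQLite connection shared by all threads, guarded by one lock."""

    def __init__(self) -> None:
        # check_same_thread=False lets other threads use the connection;
        # the lock is what makes that shared use safe.
        self._conn = sqlite3.connect(":memory:", check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
            self._conn.commit()

    def insert(self, value: int) -> None:
        with self._lock:
            self._conn.execute("INSERT INTO t (id) VALUES (?)", (value,))
            self._conn.commit()

    def count_rows(self) -> int:
        # Metadata reads take the same lock as writes.
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]


def run_demo(n: int = 500) -> int:
    """Run a writer and a reader thread concurrently; return the final count."""
    cache = SharedCache()
    writer = threading.Thread(target=lambda: [cache.insert(i) for i in range(n)])
    reader = threading.Thread(target=lambda: [cache.count_rows() for _ in range(n)])
    writer.start()
    reader.start()
    writer.join()
    reader.join()
    return cache.count_rows()
```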

Changed

pyproject.toml — bumped version to 1.16.0.
Fully backward compatible — omit tables= and the legacy delta=/ttl=/indexes=/datetime_columns= kwargs behave exactly as before (lazy mode, no fail-fast). Passing both tables= and any of those kwargs raises ValueError; tables= is internally converted to the same config.

[1.15.0] - 2026-06-11

Fixed

Cache stampede (thundering herd) on cold loads — the decision to load a table was made before the load lock was taken, and load_table never re-checked after acquiring it. During a slow cold load of a large table (observed: 212M rows, ~2 h), a second query for the same table passed the pre-lock "not cached" check, queued on the load lock, and then ran a redundant second full reload instead of seeing the first had finished — doubling a multi-hour load. load_table now does double-checked locking: after acquiring the load lock it re-evaluates a caller-supplied predicate (table cached, all needed columns present, not TTL-expired) and skips the load when it is already satisfied. Invisible on small tables; on large ones it removes hours of redundant indexing under concurrent cold-start traffic.
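
The double-checked locking pattern described above can be sketched in a few lines. This is a generic illustration under simplifying assumptions (one lock, a set standing in for the cache, `time.sleep` standing in for the slow load); `Loader` and `ensure_loaded` here are hypothetical names and not the library's `load_table`.

```python
import threading
import time


class Loader:
    def __init__(self) -> None:
        self._load_lock = threading.Lock()
        self._cached: set = set()
        self.loads = 0  # how many real (expensive) loads actually ran

    def _is_satisfied(self, table: str) -> bool:
        return table in self._cached

    def ensure_loaded(self, table: str) -> None:
        if self._is_satisfied(table):      # first check, without the lock
            return
        with self._load_lock:
            if self._is_satisfied(table):  # second check, under the lock
                return                     # another thread finished the load
            time.sleep(0.05)               # stand-in for the slow source fetch
            self.loads += 1
            self._cached.add(table)


def run_demo(threads: int = 8) -> int:
    """Start several concurrent cold requests; return the number of loads."""
    loader = Loader()
    workers = [
        threading.Thread(target=loader.ensure_loaded, args=("BIG",))
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return loader.loads
```

Without the second check, every thread that queued on the lock during the first load would repeat it once it got the lock.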

Changed

pyproject.toml — bumped version to 1.15.0.
CacheManager.load_table gained an optional recheck callback (the double-check predicate); QueryExecutor supplies it for both column and SELECT * loads.

[1.14.0] - 2026-06-10

Follow-up to 1.12.0 from running datetime_columns in production: the feature was only half-wired (writes were coerced, reads and query params were not).

Fixed

WHERE on an INTEGER-µs datetime_columns column silently returned 0 rows — execute_in_memory() coerced query params with to_sqlite(), which leaves an ISO string a string. Comparing the stored INTEGER against a TEXT param is always false under SQLite affinity, so WHERE CHANGE_DATE > '2026-05-01T…' matched nothing. Params for a query that touches a datetime_columns table are now coerced to epoch µs (datetime objects and ISO-datetime strings alike), so the comparison matches the stored integer. Scoped to the query's tables, so non-datetime queries are unaffected.
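
The comparison behaviour behind this bug can be reproduced with plain `sqlite3`. The sketch below is independent of sqlmem; the table `t` and the helper `to_epoch_us` are made up for the demonstration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_us(dt: datetime) -> int:
    """Exact integer microseconds since the Unix epoch (naive means UTC)."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return (dt - EPOCH) // timedelta(microseconds=1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (CHANGE_DATE INTEGER)")
conn.execute("INSERT INTO t VALUES (?)", (to_epoch_us(datetime(2026, 6, 1)),))

sql = "SELECT COUNT(*) FROM t WHERE CHANGE_DATE > ?"
# An ISO string is not numeric, so it stays TEXT, and SQLite orders every
# INTEGER value below every TEXT value: the predicate is never true.
text_hits = conn.execute(sql, ("2026-05-01T00:00:00",)).fetchone()[0]
# The same instant expressed as epoch microseconds compares integer to integer.
int_hits = conn.execute(sql, (to_epoch_us(datetime(2026, 5, 1)),)).fetchone()[0]
```

Here `text_hits` is 0 and `int_hits` is 1, even though both parameters describe the same instant.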

Added

Read-time coercion — datetime_columns come back as datetime — execute() now returns those columns as real datetime objects (UTC) instead of the raw INTEGER µs, restoring the transparent-proxy contract (you get the same type a direct source query would give). Opt out with CachingEngine(..., return_datetime=False) to get the raw integers.
Stats.db_size_bytes — on-disk size of the cache file (0 in memory mode), so engine.stats exposes cache growth for monitoring without an external file check.
Public datetime_to_epoch_us helper — from sqlmem import datetime_to_epoch_us exposes the same datetime→epoch-µs conversion used internally, so callers building WHERE change_col > ? params don't have to re-implement it.

Changed

pyproject.toml — bumped version to 1.14.0.
vacuum(incremental=True) now warns instead of silently doing nothing when the cache was not created with auto_vacuum=INCREMENTAL (the only mode in which incremental vacuum can reclaim pages); it logs how to fix it (hard_reset() with the pragma, or a full vacuum(incremental=False)) and returns.
CacheManager.execute_in_memory() gained an optional tables argument (the query's tables) used to scope datetime param/result coercion; CacheManager/CachingEngine gained a return_datetime flag.

[1.12.0] - 2026-06-09

⚠️ Breaking

SCHEMA_VERSION bumped 3 → 4 — on upgrade the existing cache is wiped automatically (disk mode wipes the file in place, in-memory discards the backup) and reloaded from the source on next use. For a large cache (e.g. a multi-hundred-million-row table) the full reload can take a while; deploy in a maintenance window.
datetime_columns change the public output contract for the chosen columns — a column listed in datetime_columns is stored and returned as an INTEGER (microseconds since the Unix epoch, UTC), not an ISO TEXT string. This is opt-in per column, so no table is affected unless you name its columns; consumers that read or filter such a column must adapt (compare against integer µs, or convert on read).

Added

datetime_columns= parameter on CachingEngine / CacheManager — datetime_columns={"VW_X": ["CHANGE_DATE"]} stores the named datetime columns as INTEGER µs-since-epoch instead of ~28-byte ISO TEXT. Saves ~20 bytes per row and makes index comparisons on the column operate on native integers instead of string collation — worthwhile for a pure datetime column on a very large table (e.g. a delta change column that is also range-scanned).
- _coerce.to_sqlite_datetime() converts datetimes (and ISO/date values) to exact integer microseconds via integer arithmetic (no float rounding); a naive datetime is treated as UTC, None passes through.
- load_table declares those columns INTEGER and upsert_rows coerces them the same way, so full loads and delta upserts agree on the on-disk representation.
- The delta high-watermark for such a column is the stored integer; delta._bind_watermark(..., epoch_us=True) reconstructs a real UTC datetime before binding, so the source still receives a typed timestamp (and the watermark fix from 1.8.0 keeps holding).
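
A minimal sketch of the float-free conversion described above, assuming only the standard library. `to_epoch_us` and `from_epoch_us` are illustrative names; the library's own `_coerce.to_sqlite_datetime()` may handle more input types or differ in detail.

```python
from datetime import date, datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
_ONE_US = timedelta(microseconds=1)


def to_epoch_us(value):
    """Convert a datetime, date or ISO string to integer µs since the epoch."""
    if value is None:
        return None                                   # NULL passes through
    if isinstance(value, str):
        value = datetime.fromisoformat(value)
    elif isinstance(value, date) and not isinstance(value, datetime):
        value = datetime(value.year, value.month, value.day)
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)    # naive means UTC
    # timedelta // timedelta is exact integer division: no float is involved.
    return (value - _EPOCH) // _ONE_US


def from_epoch_us(us: int) -> datetime:
    """Inverse conversion back to an aware UTC datetime, also without floats."""
    return _EPOCH + timedelta(microseconds=us)
```

Going through `datetime.timestamp()` instead would introduce a float and can lose the last microsecond digit for dates far from the epoch, which is why the integer route is used.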

Changed

pyproject.toml — bumped version to 1.12.0.
CacheManager.max_value / set_last_synced_at now accept/return int watermarks alongside str (the INTEGER-µs watermark round-trips through the last_synced_at TEXT column as its digit string).

[1.11.0] - 2026-06-09

Added

pragmas= parameter on CachingEngine / CacheManager — pass a dict of SQLite PRAGMAs (e.g. mmap_size, cache_size, temp_store, page_size, auto_vacuum) applied to the cache connection at open time, so disk-backed caches can be tuned for the host's I/O profile without bypassing CacheManager. Unknown/inapplicable pragmas are silently ignored by SQLite (graceful degradation, no startup crash).
- page_size is a layout pragma: it is applied only on a fresh file (set before WAL / the first table). On an existing cache with a different page size the request is ignored and a one-time warning is logged — the new value takes effect only after hard_reset() or a rebuild.
- auto_vacuum is set before the database header is materialized (before switching to WAL) on a fresh file, so INCREMENTAL/FULL actually stick instead of silently reverting to NONE.
CachingEngine.hard_reset() / CacheManager.hard_reset() — close every connection, delete the on-disk cache file (and its -wal/-shm sidecars) and reopen from scratch with all current pragmas applied. Unlike reset() (which drops tables but keeps the open file), this lets page_size/auto_vacuum change, since those are baked into the file at creation. Disk mode only — falls back to reset() in memory mode. All tables reload on next use.
CachingEngine.vacuum(incremental=True, pages=10_000) / CacheManager.vacuum(...) — run maintenance VACUUM on the on-disk cache to reclaim free pages left by delta INSERT OR REPLACE churn. Incremental (default) reclaims up to pages pages without blocking readers or extra disk (requires auto_vacuum=INCREMENTAL); incremental=False runs a full VACUUM (rewrites the file, ~2× disk, blocks readers — maintenance window only). No-op in memory mode.
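
The interaction between `auto_vacuum` and incremental vacuum can be tried out directly against SQLite. The following is a sketch with plain `sqlite3` on a throwaway file, not the library's `vacuum()` implementation; the table layout and row counts are arbitrary.

```python
import os
import sqlite3
import tempfile


def demo_incremental_vacuum() -> tuple:
    """Return (auto_vacuum mode, free pages before, free pages after)."""
    with tempfile.TemporaryDirectory() as tmp:
        # isolation_level=None: autocommit, so no implicit transaction lingers.
        conn = sqlite3.connect(os.path.join(tmp, "cache.db"), isolation_level=None)
        try:
            # Must be set on a fresh file, before the first table is created.
            conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
            conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
            conn.execute("BEGIN")
            conn.executemany(
                "INSERT INTO t (payload) VALUES (?)",
                [("x" * 500,) for _ in range(2000)],
            )
            conn.execute("COMMIT")
            conn.execute("DELETE FROM t")  # freed pages go onto the freelist
            mode = conn.execute("PRAGMA auto_vacuum").fetchone()[0]
            before = conn.execute("PRAGMA freelist_count").fetchone()[0]
            # fetchall() steps the pragma to completion; reclaim up to 10_000 pages.
            conn.execute("PRAGMA incremental_vacuum(10000)").fetchall()
            after = conn.execute("PRAGMA freelist_count").fetchone()[0]
            return mode, before, after
        finally:
            conn.close()
```

`PRAGMA auto_vacuum` reports 2 for INCREMENTAL. On a file created without it, the incremental pragma has nothing to work with, which matches the requirement stated above.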

Changed

pyproject.toml — bumped version to 1.11.0.
ColumnRegistry gained rebind() so it follows the cache connection swap performed by hard_reset() (the registry previously captured the connection for the process lifetime).

[1.10.0] - 2026-06-09

Added

last_upsert (persisted write) vs last_refresh (run/liveness) in stats — TableStats.last_refresh previously came from the persisted last_refresh_at column, which is only written when rows are actually written (a delta cycle with total == 0 early-returns and leaves it unchanged). A healthy delta that keeps finding no new rows therefore looked frozen. The single value is now split:
- last_upsert — wall-clock (UTC) of the last actual data write (full load / delta with rows). Persisted, survives restarts (this is the existing last_refresh_at column, surfaced under a clearer name).
- last_refresh — wall-clock (UTC) of the last time a refresh cycle ran for the table, even when it wrote nothing. In-memory per process (None until the first cycle after start), tracked like _states/_errors — so no schema change and no cache wipe.
- CacheManager gained mark_refresh_ran() / get_last_runs(); an empty delta cycle now records a run. TTL staleness still uses the last write (seconds_since_refresh reads last_refresh_at), so behaviour is unchanged.

Changed

pyproject.toml — bumped version to 1.10.0.
TableStats.last_refresh is now str | None (was str) and a new required last_upsert: str | None field is added. Consumers reading last_refresh for "when did data change?" should switch to last_upsert.

[1.8.0] - 2026-06-08

Fixed

Frozen delta watermark on datetime change columns — the delta high-watermark is read back from the cache as an ISO TEXT string (e.g. '2026-06-05T14:54:24.823000') and was bound straight back to the source. SQL Server then had to implicitly convert that nvarchar to datetime and failed (T-separated ISO with 6 fractional digits exceeds datetime's 3 — error 241 / SQLSTATE 22007), so every delta refresh and the startup catch-up died before streaming and the watermark never advanced (the cache silently froze at the last full load). The watermark is now parsed back to a real datetime (delta._bind_watermark) so the driver sends a typed timestamp and the comparison runs natively; non-datetime change columns (e.g. integer rowversions) pass through unchanged. Regression tests added.
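
The idea of the fix, parsing the TEXT watermark back into a typed value before binding it, can be sketched as follows. `bind_watermark` is a hypothetical simplification written for this example; the real `delta._bind_watermark` may take other arguments and cover more cases.

```python
from datetime import datetime


def bind_watermark(value):
    """Turn an ISO TEXT watermark back into a datetime; pass others through."""
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value)
        except ValueError:
            return value   # not an ISO datetime: bind it unchanged
    return value           # e.g. an integer rowversion
```

Passing the resulting `datetime` as a query parameter lets the database driver send a typed timestamp, so the source does not have to convert a string implicitly.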

Added

Refresh/load failures are now visible in stats — TableStats gained last_error, last_error_at and consecutive_failures, and Stats gained a total errors counter. A delta that fails before streaming (e.g. the watermark bug above) previously left state = ready, hiding the problem; it now also marks the table error and records the message. consecutive_failures resets to 0 on the next success.
Per-engine configuration — CachingEngine accepts cache_db_path, backup_interval, refresh_interval, fetch_batch and dialect (each defaults to its env var / config global when omitted), so two engines with independent cache files can run in one process and config is testable without env vars.
blocking_startup_refresh flag (default False) — the startup catch-up (deltas/TTL reloads for tables restored from disk) now runs on the background thread by default, so it never blocks application startup. Pass blocking_startup_refresh=True to catch up synchronously before serving.

Changed

SQL identifiers are quoted — table/column names are now quoted everywhere they are interpolated into statements (SQLite double-quote for the cache, the configured dialect — e.g. T-SQL [brackets] — for the source), so reserved words or names with spaces work and the f-string interpolation is hardened.
Source connection opened lazily — execute() no longer opens a source connection on every call; a pure cache hit never touches the source (and never occupies a pool slot). The misleading cast(sqlite3.Connection, …) on the source handle was removed (it is a pyodbc connection in production).
Concurrent reads in disk mode — disk-backed reads now use a per-thread read-only WAL connection instead of sharing the single write connection under a lock, so a slow SELECT no longer blocks writers (loads/upserts) or other readers. In-memory mode is unchanged (a :memory: database can't be shared across connections).
add_sink is idempotent — calling it again for the same sink is a no-op, so a double import no longer duplicates every log line.
pyproject.toml — bumped version to 1.8.0; added a scoped pytest filterwarnings for the SQLite test source's legacy datetime-adapter deprecation.
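
The concurrent-read change in this release (one read-only WAL connection per thread, separate from the single write connection) might look roughly like the sketch below. `DiskCache` is a hypothetical class for illustration and assumes a file path that can be turned into a `file:` URI; it is not the library's `CacheManager`.

```python
import os
import sqlite3
import tempfile
import threading
from pathlib import Path


class DiskCache:
    def __init__(self, path: str) -> None:
        self._uri = Path(path).resolve().as_uri() + "?mode=ro"
        self._local = threading.local()
        self._write_lock = threading.Lock()
        self._writer = sqlite3.connect(path, check_same_thread=False)
        self._writer.execute("PRAGMA journal_mode=WAL")
        self._writer.execute("PRAGMA synchronous=NORMAL")

    def write(self, sql: str, params: tuple = ()) -> None:
        with self._write_lock:  # writers are still serialized
            self._writer.execute(sql, params)
            self._writer.commit()

    def _reader(self) -> sqlite3.Connection:
        # One read-only connection per thread, created on first use.
        conn = getattr(self._local, "conn", None)
        if conn is None:
            conn = sqlite3.connect(self._uri, uri=True)
            self._local.conn = conn
        return conn

    def read(self, sql: str, params: tuple = ()) -> list:
        return self._reader().execute(sql, params).fetchall()  # no lock taken

    def close(self) -> None:
        """Close the writer and the calling thread's reader, if any."""
        conn = getattr(self._local, "conn", None)
        if conn is not None:
            conn.close()
        self._writer.close()


def run_demo() -> tuple:
    """Read from a worker thread, then show that readers cannot write."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = DiskCache(os.path.join(tmp, "cache.db"))
        cache.write("CREATE TABLE t (id INTEGER)")
        cache.write("INSERT INTO t VALUES (?)", (1,))
        results = []
        worker = threading.Thread(
            target=lambda: results.append(cache.read("SELECT id FROM t"))
        )
        worker.start()
        worker.join()
        try:
            cache._reader().execute("INSERT INTO t VALUES (2)")
            read_only = False
        except sqlite3.OperationalError:
            read_only = True
        cache.close()
        return results[0], read_only
```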

Note

Cache type fidelity (returning real datetime/Decimal/numeric types from execute() instead of TEXT strings, and giving numeric columns proper affinity) was evaluated but deferred — it changes the public output contract that consumers currently rely on (and that test_coerce.py pins). Decimal/datetime stay stored as exact, lossless TEXT.

[1.7.0] - 2026-06-08

Added

Disk-backed cache mode — CachingEngine(engine, in_memory=False) (or env SQLMEM_IN_MEMORY=false) queries the on-disk cache.db directly instead of loading it into an in-memory SQLite. Every write persists immediately (no hourly backup thread, no load-on-startup copy, no atexit/SIGTERM flush needed), and the cache may exceed available RAM. The disk connection uses WAL + synchronous=NORMAL for write throughput. In-memory mode (backed up to disk periodically) remains the default. in_memory defaults to the SQLMEM_IN_MEMORY config when omitted.
- On open, a disk cache with a mismatched schema_version is wiped in place and rebuilt.
- engine.reset() in disk mode drops the cached tables and VACUUMs the file (it does not unlink the open file).
SQLMEM_IN_MEMORY env var (default true).

Changed

pyproject.toml — bumped version to 1.7.0
cache.py — CacheManager gained an in_memory flag; the cache connection (_mem_conn → _conn) is opened either on :memory: or directly on the on-disk file. Disk mode skips the load-on-startup copy, backup thread, and shutdown flush, and reset() VACUUMs in place instead of unlinking the open file.
.gitignore — ignore cache.db and its WAL sidecars (cache.db-wal, cache.db-shm).

[1.6.0] - 2026-06-05

Added

Secondary indexes — CachingEngine(engine, indexes={"VW_X": ["col", ["a", "b"]]}) creates indexes on the in-memory cache to accelerate WHERE/JOIN lookups. Index columns are auto-loaded so the index exists from the first load, and indexes are recreated after every (re)load and persist in cache.db. Combines freely with delta and ttl.

Changed

pyproject.toml — bumped version to 1.6.0

[1.5.0] - 2026-06-05

Added

Per-table processing state in stats — TableStats now carries state (loading / refreshing / ready / stale / error) and tracking (delta / ttl / static), so callers can see whether each table is up to date or being processed. In-progress first loads and failed loads also surface in stats.tables.
SQLMEM_FETCH_BATCH env var (default 10000) — rows fetched per batch when loading a table.

Changed

pyproject.toml — bumped version to 1.5.0
Large-table loads are streamed in batches — load_table no longer fetchall()s the whole table (which double-buffered every row in Python and could OOM/crash on tens of millions of rows). Rows are now fetched SQLMEM_FETCH_BATCH at a time into a staging table and swapped in atomically, so peak memory stays bounded, the previous copy stays queryable during a reload, and the network fetch no longer holds the cache lock. Delta catch-ups are streamed the same way.
Orphan staging tables left by an interrupted load (crash/backup mid-load) are dropped on startup.
Delta upserts compute row_count once per refresh instead of a full COUNT(*) after every batch (avoids O(rows×batches) work on large catch-ups).
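
The batched load with a staging table and swap, as described in this release, can be sketched with two `sqlite3` connections standing in for the source and the cache. The function name, the fixed `(id, name)` column layout and the `_staging_` prefix are assumptions made for the example.

```python
import sqlite3


def load_table_streamed(source, cache, table: str, batch: int = 10_000) -> int:
    """Copy `table` from source to cache in batches via a staging table."""
    staging = f"_staging_{table}"
    cur = source.execute(f'SELECT id, name FROM "{table}"')
    cache.execute(f'DROP TABLE IF EXISTS "{staging}"')
    cache.execute(f'CREATE TABLE "{staging}" (id INTEGER PRIMARY KEY, name TEXT)')
    total = 0
    while True:
        rows = cur.fetchmany(batch)  # at most `batch` rows held in Python
        if not rows:
            break
        cache.executemany(f'INSERT INTO "{staging}" VALUES (?, ?)', rows)
        total += len(rows)
    cache.commit()
    # Swap inside one transaction: the old copy stays in place until the
    # new one is complete, then both statements commit together.
    cache.execute("BEGIN")
    cache.execute(f'DROP TABLE IF EXISTS "{table}"')
    cache.execute(f'ALTER TABLE "{staging}" RENAME TO "{table}"')
    cache.commit()
    return total
```

A possible usage, with an in-memory source:

```python
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO items VALUES (?, ?)", [(i, f"n{i}") for i in range(25)])
source.commit()
cache = sqlite3.connect(":memory:")
loaded = load_table_streamed(source, cache, "items", batch=10)
```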

[1.4.0] - 2026-06-05

Fixed

decimal.Decimal (and datetime) binding error — NUMERIC/DECIMAL/MONEY columns from SQL Server (pyodbc) arrive as decimal.Decimal, which sqlite3 cannot bind, crashing the cache load with type 'decimal.Decimal' is not supported. Values are now coerced to sqlite-bindable types (Decimal→str, datetime/date/time→ISO, uuid.UUID→str, bytearray→bytes) at the cache boundary — on full load, on delta upsert, and for WHERE parameters. Coercion is local (no global sqlite3.register_adapter), so the host application's sqlite3 behaviour is untouched. Cache columns are TEXT, so the conversion is lossless and exact (no rounding).
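
A local coercion of this kind can be sketched as a single function applied to each value before binding. `coerce_value` is a hypothetical name for the example; it mirrors the conversions listed above but is not the library's code.

```python
import datetime as dt
import decimal
import sqlite3
import uuid


def coerce_value(value):
    """Coerce one value to a type sqlite3 can bind; others pass through."""
    if isinstance(value, decimal.Decimal):
        return str(value)                  # exact text form, no float rounding
    if isinstance(value, (dt.datetime, dt.date, dt.time)):
        return value.isoformat()
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, bytearray):
        return bytes(value)
    return value


def coerce_row(row):
    """Apply coerce_value to every field of a row."""
    return tuple(coerce_value(v) for v in row)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (id INTEGER, amount TEXT, changed TEXT)")
row = (1, decimal.Decimal("19.90"), dt.datetime(2026, 6, 5, 14, 54, 24))
conn.execute("INSERT INTO prices VALUES (?, ?, ?)", coerce_row(row))
stored = conn.execute("SELECT amount, changed FROM prices").fetchone()
```

Because nothing is registered globally with `sqlite3.register_adapter`, other code in the same process that uses `sqlite3` is unaffected.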

Added

Incremental (delta) refresh — CachingEngine(engine, delta={...}) with DeltaConfig(change_column, key_columns). Delta-tracked tables are kept in sync by pulling only changed rows (WHERE change_column >= watermark) and upserting them by key, instead of full reloads.
- The high-watermark is data-driven: max(change_column) over the cached rows, persisted in cache.db. The source is queried with >= (a deliberate overlap) and rows are upserted idempotently, so no row is missed and boundary rows are harmlessly re-read.
- Catch-up on startup (since last shutdown) and a background thread refreshing every SQLMEM_REFRESH_INTERVAL seconds (default 300); engine.refresh() triggers a pull on demand.
- Primary key is auto-discovered from the source DB (inspect(engine).get_pk_constraint) when key_columns is omitted; required explicitly for views (raises ValueError).
Per-table TTL (time-based refresh) — CachingEngine(engine, ttl={"VW_X": 300}) for tables with no change column that can't be delta-synced. The cached copy is guaranteed never older than the TTL: a query touching an expired table triggers a full reload before it is answered (read-time guarantee), and the background thread proactively reloads expired tables. TTL age uses the persisted last_refresh_at, so the bound holds across restarts. A table in both delta and ttl raises ValueError.
DeltaConfig exported from the public API.
engine.reset() — wipes the whole cache (RAM + cache.db) for a clean rebuild after structural source changes.
SQLMEM_REFRESH_INTERVAL env var (default 300) — background refresh tick for delta pulls and proactive TTL reloads.
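
The incremental (delta) refresh introduced in this release combines a data-driven watermark with a `>=` pull and an idempotent upsert. The sketch below shows that combination with two `sqlite3` connections; the table `t`, its columns and the function names are assumptions for the example, and the real implementation streams rows in batches against a different source.

```python
import sqlite3

DDL = "CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT, changed INTEGER)"


def make_db(rows=()):
    """Create an in-memory database with table t and optional rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def delta_refresh(source, cache) -> int:
    """Pull rows changed since the cached watermark and upsert them by key."""
    watermark = cache.execute("SELECT MAX(changed) FROM t").fetchone()[0]
    if watermark is None:
        # Empty cache: nothing to compare against, so take everything.
        rows = source.execute("SELECT id, val, changed FROM t").fetchall()
    else:
        # >= re-reads the boundary rows; the upsert makes that harmless.
        rows = source.execute(
            "SELECT id, val, changed FROM t WHERE changed >= ?", (watermark,)
        ).fetchall()
    cache.executemany(
        "INSERT OR REPLACE INTO t (id, val, changed) VALUES (?, ?, ?)", rows
    )
    cache.commit()
    return len(rows)


def snapshot(conn):
    """All rows ordered by key, for comparing source and cache."""
    return conn.execute("SELECT id, val, changed FROM t ORDER BY id").fetchall()
```

Using `>` instead of `>=` would skip a row that was written with the same change value as the current watermark after the previous pull, which is the case the overlap protects against.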

Changed

pyproject.toml — bumped version to 1.4.0
cache.py — schema version bumped to 3; _sqlmem_tables gained a last_synced_at watermark column. New methods: execute_in_memory (lock-serialized read), get_table_columns, create_unique_index, get/set_last_synced_at, max_value, upsert_rows, seconds_since_refresh, reset. Existing on-disk caches are discarded and rebuilt on load.
executor.py — delta-tracked tables augment their column set with key/change columns (unique key index + initial watermark); TTL-tracked tables full-reload at read time when expired; in-memory reads go through the cache lock.

[1.2.0] - 2026-06-04

Added

Parametrized queries (R1) — execute(sql, params) accepts positional (? tuple/list) and named (:name dict) parameters; passed straight to SQLite during in-memory filtering. Cache loads still fetch the full table (parameters are not applied to source fetches).
JOIN support (R2) — multi-table SELECTs are parsed into per-table column sets; each table is cached independently and the JOIN runs in the in-memory SQLite. Columns in a multi-table query must be qualified by table or alias.
SELECT * support (R3) — wildcard (and alias.*) queries discover all columns from the source DB, cache the whole table, and mark it is_full so later column queries are guaranteed cache hits without re-fetch.
Three-part table names (R4) — [catalog].[schema].[table] is parsed to its base name for caching; the in-memory query is rewritten to strip catalog/schema prefixes so it runs under SQLite.
SQLMEM_SQL_DIALECT env var (default tsql) — sqlglot dialect used to parse incoming SQL; T-SQL also accepts ANSI SQL and MSSQL bracket quoting.
CacheManager.discover_columns() and CacheManager.is_table_full(); load_table() gained a full flag.

Changed

pyproject.toml — bumped version to 1.2.0
parser.py — ParsedQuery.table: str replaced by tables: list[str] plus columns_by_table, sqlite_sql, params, and wildcard_tables; SQL is parsed with the configured dialect and rendered to SQLite for execution.
executor.py — loads each referenced table independently and applies query parameters during in-memory execution.
cache.py — schema version bumped to 2; _sqlmem_tables gained an is_full column (existing on-disk caches are discarded and rebuilt on load).

[1.1.0] - 2026-06-03

Added

Stats and TableStats frozen dataclasses — snapshot of runtime cache statistics (hit/miss/refetch counts, per-table row count, columns, last refresh timestamp)
StatsCollector — internal thread-safe counter; increments on every cache hit, miss, and re-fetch
engine.stats property — returns a Stats snapshot at any point in time
Stats and TableStats exported from the public API

Changed

pyproject.toml — bumped version to 1.1.0

[1.0.0] - 2026-06-03

Changed

pyproject.toml — bumped version to 1.0.0

[0.4.0] - 2026-06-03

Added

add_sink(sink, *, level, **kwargs) — public API for routing sqlmem log records to any loguru-compatible sink (stream, file, callable); supports all loguru logger.add() kwargs including rotation, retention, etc.

Changed

pyproject.toml — bumped version to 0.4.0
config.py — replaced destructive logger.remove() + forced default sink with logger.disable("sqlmem"); sqlmem is now silent by default and does not interfere with the host application's logging setup

[0.3.0] - 2026-06-03

Added

README.md — full project documentation: architecture overview, quick start, cache behaviour, persistence, configuration, exceptions, logging, and limitations

Changed

pyproject.toml — bumped version to 0.3.0
parser.py — _extract_columns now deduplicates column names while preserving order
.gitignore — added .env and .env.* to prevent accidental commit of environment files

Security

Removed .env from git tracking (git rm --cached)

[0.2.0] - 2026-06-01

Added

Project specification in project.md — architecture, API design, cache backend, metadata schema, logging strategy, and TODO for future features (JOIN, SELECT * support)
.gitignore for Python/Poetry project
pyproject.toml dependencies: sqlglot, sqlalchemy, loguru, python-dotenv; dev dependencies: pytest, ruff, mypy
src/sqlmem/ package structure with src layout
src/sqlmem/exceptions.py — ReadOnlyError (blocks INSERT/UPDATE/DELETE), UnsupportedQueryError (blocks JOIN and SELECT *)
src/sqlmem/config.py — loads .env, configures loguru with DEBUG/INFO level based on SQLMEM_DEBUG
src/sqlmem/_meta.py — package version constant
src/sqlmem/parser.py — SQL Parser using sqlglot; extracts table and columns from SELECT, raises on writes/JOIN/wildcard
src/sqlmem/registry.py — Column Registry; accumulates requested columns per table, detects missing columns requiring re-fetch
src/sqlmem/cache.py — Cache Manager; SQLite in-memory storage, load from cache.db on startup (with schema version check), hourly backup thread, atexit/SIGTERM flush, metadata tables (_sqlmem_meta, _sqlmem_tables, _sqlmem_columns)
src/sqlmem/executor.py — Query Executor; cache hit/miss logic, re-fetch on new columns with WARNING log
src/sqlmem/engine.py — CachingEngine wrapper; public API compatible with SQLAlchemy, invalidate(table) for manual cache clearing
src/sqlmem/__init__.py — public exports: CachingEngine, ReadOnlyError, UnsupportedQueryError
tests/test_parser.py — parser tests: SELECT parsing, ReadOnlyError, UnsupportedQueryError
tests/test_cache.py — cache tests: load, data correctness, metadata, disk backup/reload
tests/test_registry.py — registry tests: accumulation, needs_refetch, table isolation
