Files
SQLmem/CHANGELOG.md
T

204 lines
18 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Changelog
All notable changes to this project will be documented in this file.
## [Unreleased]
---
## [1.11.0] - 2026-06-09
### Added
- **`pragmas=` parameter on `CachingEngine` / `CacheManager`** — pass a dict of SQLite PRAGMAs (e.g. `mmap_size`, `cache_size`, `temp_store`, `page_size`, `auto_vacuum`) applied to the cache connection at open time, so disk-backed caches can be tuned for the host's I/O profile without bypassing `CacheManager`. Unknown/inapplicable pragmas are silently ignored by SQLite (graceful degradation, no startup crash).
- **`page_size`** is a layout pragma: it is applied only on a *fresh* file (set before WAL / the first table). On an existing cache with a different page size the request is ignored and a one-time warning is logged — the new value takes effect only after `hard_reset()` or a rebuild.
- **`auto_vacuum`** is set before the database header is materialized (before switching to WAL) on a fresh file, so `INCREMENTAL`/`FULL` actually stick instead of silently reverting to `NONE`.
- **`CachingEngine.hard_reset()` / `CacheManager.hard_reset()`** — close every connection, delete the on-disk cache file (and its `-wal`/`-shm` sidecars) and reopen from scratch with all current pragmas applied. Unlike `reset()` (which drops tables but keeps the open file), this lets `page_size`/`auto_vacuum` change, since those are baked into the file at creation. Disk mode only — falls back to `reset()` in memory mode. All tables reload on next use.
- **`CachingEngine.vacuum(incremental=True, pages=10_000)` / `CacheManager.vacuum(...)`** — run maintenance VACUUM on the on-disk cache to reclaim free pages left by delta `INSERT OR REPLACE` churn. Incremental (default) reclaims up to `pages` pages without blocking readers or extra disk (requires `auto_vacuum=INCREMENTAL`); `incremental=False` runs a full VACUUM (rewrites the file, ~2× disk, blocks readers — maintenance window only). No-op in memory mode.
### Changed
- `pyproject.toml` — bumped version to `1.11.0`.
- `ColumnRegistry` gained `rebind()` so it follows the cache connection swap performed by `hard_reset()` (the registry previously captured the connection for the process lifetime).
---
## [1.10.0] - 2026-06-09
### Added
- **`last_upsert` (persisted write) vs `last_refresh` (run/liveness) in `stats`** — `TableStats.last_refresh` previously came from the persisted `last_refresh_at` column, which is only written when rows are actually written (a delta cycle with `total == 0` early-returns and leaves it unchanged). A healthy delta that keeps finding no new rows therefore *looked* frozen. The single value is now split:
- `last_upsert` — wall-clock (UTC) of the last actual data write (full load / delta with rows). Persisted, survives restarts (this is the existing `last_refresh_at` column, surfaced under a clearer name).
- `last_refresh` — wall-clock (UTC) of the last time a refresh cycle *ran* for the table, **even when it wrote nothing**. In-memory per process (`None` until the first cycle after start), tracked like `_states`/`_errors` — so **no schema change and no cache wipe**.
- `CacheManager` gained `mark_refresh_ran()` / `get_last_runs()`; an empty delta cycle now records a run. TTL staleness still uses the last *write* (`seconds_since_refresh` reads `last_refresh_at`), so behaviour is unchanged.
### Changed
- `pyproject.toml` — bumped version to `1.10.0`.
- **`TableStats.last_refresh` is now `str | None`** (was `str`) and a new required `last_upsert: str | None` field is added. Consumers reading `last_refresh` for "when did data change?" should switch to `last_upsert`.
---
## [1.8.0] - 2026-06-08
### Fixed
- **Frozen delta watermark on `datetime` change columns** — the delta high-watermark is read back from the cache as an ISO `TEXT` string (e.g. `'2026-06-05T14:54:24.823000'`) and was bound straight back to the source. SQL Server then had to implicitly convert that `nvarchar` to `datetime` and **failed** (`T`-separated ISO with 6 fractional digits exceeds `datetime`'s 3 — error 241 / SQLSTATE 22007), so every delta refresh and the startup catch-up died before streaming and the watermark never advanced (the cache silently froze at the last full load). The watermark is now parsed back to a real `datetime` (`delta._bind_watermark`) so the driver sends a typed timestamp and the comparison runs natively; non-datetime change columns (e.g. integer rowversions) pass through unchanged. Regression tests added.
### Added
- **Refresh/load failures are now visible in `stats`** — `TableStats` gained `last_error`, `last_error_at` and `consecutive_failures`, and `Stats` gained a total `errors` counter. A delta that fails *before* streaming (e.g. the watermark bug above) previously left `state = ready`, hiding the problem; it now also marks the table `error` and records the message. `consecutive_failures` resets to 0 on the next success.
- **Per-engine configuration** — `CachingEngine` accepts `cache_db_path`, `backup_interval`, `refresh_interval`, `fetch_batch` and `dialect` (each defaults to its env var / config global when omitted), so two engines with independent cache files can run in one process and config is testable without env vars.
- **`blocking_startup_refresh` flag** (default `False`) — the startup catch-up (deltas/TTL reloads for tables restored from disk) now runs on the background thread by default, so it never blocks application startup. Pass `blocking_startup_refresh=True` to catch up synchronously before serving.
### Changed
- **SQL identifiers are quoted** — table/column names are now quoted everywhere they are interpolated into statements (SQLite double-quote for the cache, the configured dialect — e.g. T-SQL `[brackets]` — for the source), so reserved words or names with spaces work and the f-string interpolation is hardened.
- **Source connection opened lazily** — `execute()` no longer opens a source connection on every call; a pure cache hit never touches the source (and never occupies a pool slot). The misleading `cast(sqlite3.Connection, …)` on the source handle was removed (it is a pyodbc connection in production).
- **Concurrent reads in disk mode** — disk-backed reads now use a per-thread read-only WAL connection instead of sharing the single write connection under a lock, so a slow `SELECT` no longer blocks writers (loads/upserts) or other readers. In-memory mode is unchanged (a `:memory:` database can't be shared across connections).
- **`add_sink` is idempotent** — calling it again for the same sink is a no-op, so a double import no longer duplicates every log line.
- `pyproject.toml` — bumped version to `1.8.0`; added a scoped pytest `filterwarnings` for the SQLite test source's legacy datetime-adapter deprecation.
### Note
- Cache type fidelity (returning real `datetime`/`Decimal`/numeric types from `execute()` instead of `TEXT` strings, and giving numeric columns proper affinity) was evaluated but **deferred** — it changes the public output contract that consumers currently rely on (and that `test_coerce.py` pins). Decimal/datetime stay stored as exact, lossless `TEXT`.
---
## [1.7.0] - 2026-06-08
### Added
- **Disk-backed cache mode** — `CachingEngine(engine, in_memory=False)` (or env `SQLMEM_IN_MEMORY=false`) queries the on-disk `cache.db` directly instead of loading it into an in-memory SQLite. Every write persists immediately (no hourly backup thread, no load-on-startup copy, no `atexit`/`SIGTERM` flush needed), and the cache may exceed available RAM. The disk connection uses WAL + `synchronous=NORMAL` for write throughput. In-memory mode (backed up to disk periodically) remains the default. `in_memory` defaults to the `SQLMEM_IN_MEMORY` config when omitted.
- On open, a disk cache with a mismatched `schema_version` is wiped in place and rebuilt.
- `engine.reset()` in disk mode drops the cached tables and `VACUUM`s the file (it does not unlink the open file).
- `SQLMEM_IN_MEMORY` env var (default `true`).
### Changed
- `pyproject.toml` — bumped version to `1.7.0`
- `cache.py``CacheManager` gained an `in_memory` flag; the cache connection (`_mem_conn``_conn`) is opened either on `:memory:` or directly on the on-disk file. Disk mode skips the load-on-startup copy, backup thread, and shutdown flush, and `reset()` VACUUMs in place instead of unlinking the open file.
- `.gitignore` — ignore `cache.db` and its WAL sidecars (`cache.db-wal`, `cache.db-shm`).
---
## [1.6.0] - 2026-06-05
### Added
- **Secondary indexes** — `CachingEngine(engine, indexes={"VW_X": ["col", ["a", "b"]]})` creates indexes on the in-memory cache to accelerate `WHERE`/`JOIN` lookups. Index columns are auto-loaded so the index exists from the first load, and indexes are recreated after every (re)load and persist in `cache.db`. Combines freely with `delta` and `ttl`.
### Changed
- `pyproject.toml` — bumped version to `1.6.0`
---
## [1.5.0] - 2026-06-05
### Added
- **Per-table processing state in `stats`** — `TableStats` now carries `state` (`loading` / `refreshing` / `ready` / `stale` / `error`) and `tracking` (`delta` / `ttl` / `static`), so callers can see whether each table is up to date or being processed. In-progress first loads and failed loads also surface in `stats.tables`.
- `SQLMEM_FETCH_BATCH` env var (default `10000`) — rows fetched per batch when loading a table.
### Changed
- `pyproject.toml` — bumped version to `1.5.0`
- **Large-table loads are streamed in batches** — `load_table` no longer `fetchall()`s the whole table (which double-buffered every row in Python and could OOM/crash on tens of millions of rows). Rows are now fetched `SQLMEM_FETCH_BATCH` at a time into a staging table and swapped in atomically, so peak memory stays bounded, the previous copy stays queryable during a reload, and the network fetch no longer holds the cache lock. Delta catch-ups are streamed the same way.
- Orphan staging tables left by an interrupted load (crash/backup mid-load) are dropped on startup.
- Delta upserts compute `row_count` once per refresh instead of a full `COUNT(*)` after every batch (avoids O(rows×batches) work on large catch-ups).
---
## [1.4.0] - 2026-06-05
### Fixed
- **`decimal.Decimal` (and `datetime`) binding error** — `NUMERIC`/`DECIMAL`/`MONEY` columns from SQL Server (pyodbc) arrive as `decimal.Decimal`, which `sqlite3` cannot bind, crashing the cache load with `type 'decimal.Decimal' is not supported`. Values are now coerced to sqlite-bindable types (`Decimal``str`, `datetime`/`date`/`time`→ISO, `uuid.UUID``str`, `bytearray``bytes`) at the cache boundary — on full load, on delta upsert, and for WHERE parameters. Coercion is local (no global `sqlite3.register_adapter`), so the host application's `sqlite3` behaviour is untouched. Cache columns are `TEXT`, so the conversion is lossless and exact (no rounding).
### Added
- **Incremental (delta) refresh** — `CachingEngine(engine, delta={...})` with `DeltaConfig(change_column, key_columns)`. Delta-tracked tables are kept in sync by pulling only changed rows (`WHERE change_column >= watermark`) and upserting them by key, instead of full reloads.
- Data-driven high-watermark = `max(change_column)` cached, persisted in `cache.db`; `>=` overlap + idempotent upsert so no row is missed and boundary rows are harmlessly re-read.
- Catch-up on startup (since last shutdown) and a background thread refreshing every `SQLMEM_REFRESH_INTERVAL` seconds (default 300); `engine.refresh()` triggers a pull on demand.
- Primary key is auto-discovered from the source DB (`inspect(engine).get_pk_constraint`) when `key_columns` is omitted; required explicitly for views (raises `ValueError`).
- **Per-table TTL (time-based refresh)** — `CachingEngine(engine, ttl={"VW_X": 300})` for tables with no change column that can't be delta-synced. The cached copy is guaranteed never older than the TTL: a query touching an expired table triggers a full reload before it is answered (read-time guarantee), and the background thread proactively reloads expired tables. TTL age uses the persisted `last_refresh_at`, so the bound holds across restarts. A table in both `delta` and `ttl` raises `ValueError`.
- `DeltaConfig` exported from the public API.
- `engine.reset()` — wipes the whole cache (RAM + `cache.db`) for a clean rebuild after structural source changes.
- `SQLMEM_REFRESH_INTERVAL` env var (default `300`) — background refresh tick for delta pulls and proactive TTL reloads.
### Changed
- `pyproject.toml` — bumped version to `1.4.0`
- `cache.py` — schema version bumped to `3`; `_sqlmem_tables` gained a `last_synced_at` watermark column. New methods: `execute_in_memory` (lock-serialized read), `get_table_columns`, `create_unique_index`, `get/set_last_synced_at`, `max_value`, `upsert_rows`, `seconds_since_refresh`, `reset`. Existing on-disk caches are discarded and rebuilt on load.
- `executor.py` — delta-tracked tables augment their column set with key/change columns (unique key index + initial watermark); TTL-tracked tables full-reload at read time when expired; in-memory reads go through the cache lock.
---
## [1.2.0] - 2026-06-04
### Added
- **Parametrized queries (R1)** — `execute(sql, params)` accepts positional (`?` tuple/list) and named (`:name` dict) parameters; passed straight to SQLite during in-memory filtering. Cache loads still fetch the full table (parameters are not applied to source fetches).
- **JOIN support (R2)** — multi-table SELECTs are parsed into per-table column sets; each table is cached independently and the JOIN runs in the in-memory SQLite. Columns in a multi-table query must be qualified by table or alias.
- **`SELECT *` support (R3)** — wildcard (and `alias.*`) queries discover all columns from the source DB, cache the whole table, and mark it `is_full` so later column queries are guaranteed cache hits without re-fetch.
- **Three-part table names (R4)** — `[catalog].[schema].[table]` is parsed to its base name for caching; the in-memory query is rewritten to strip catalog/schema prefixes so it runs under SQLite.
- `SQLMEM_SQL_DIALECT` env var (default `tsql`) — sqlglot dialect used to parse incoming SQL; T-SQL also accepts ANSI SQL and MSSQL bracket quoting.
- `CacheManager.discover_columns()` and `CacheManager.is_table_full()`; `load_table()` gained a `full` flag.
### Changed
- `pyproject.toml` — bumped version to `1.2.0`
- `parser.py``ParsedQuery.table: str` replaced by `tables: list[str]` plus `columns_by_table`, `sqlite_sql`, `params`, and `wildcard_tables`; SQL is parsed with the configured dialect and rendered to SQLite for execution.
- `executor.py` — loads each referenced table independently and applies query parameters during in-memory execution.
- `cache.py` — schema version bumped to `2`; `_sqlmem_tables` gained an `is_full` column (existing on-disk caches are discarded and rebuilt on load).
---
## [1.1.0] - 2026-06-03
### Added
- `Stats` and `TableStats` frozen dataclasses — snapshot of runtime cache statistics (hit/miss/refetch counts, per-table row count, columns, last refresh timestamp)
- `StatsCollector` — internal thread-safe counter; increments on every cache hit, miss, and re-fetch
- `engine.stats` property — returns a `Stats` snapshot at any point in time
- `Stats` and `TableStats` exported from the public API
### Changed
- `pyproject.toml` — bumped version to `1.1.0`
---
## [1.0.0] - 2026-06-03
### Changed
- `pyproject.toml` — bumped version to `1.0.0`
---
## [0.4.0] - 2026-06-03
### Added
- `add_sink(sink, *, level, **kwargs)` — public API for routing sqlmem log records to any loguru-compatible sink (stream, file, callable); supports all loguru `logger.add()` kwargs including `rotation`, `retention`, etc.
### Changed
- `pyproject.toml` — bumped version to `0.4.0`
- `config.py` — replaced destructive `logger.remove()` + forced default sink with `logger.disable("sqlmem")`; sqlmem is now silent by default and does not interfere with the host application's logging setup
---
## [0.3.0] - 2026-06-03
### Added
- `README.md` — full project documentation: architecture overview, quick start, cache behaviour, persistence, configuration, exceptions, logging, and limitations
### Changed
- `pyproject.toml` — bumped version to `0.3.0`
- `parser.py``_extract_columns` now deduplicates column names while preserving order
- `.gitignore` — added `.env` and `.env.*` to prevent accidental commit of environment files
### Security
- Removed `.env` from git tracking (`git rm --cached`)
---
## [0.2.0] - 2026-06-01
### Added
- Project specification in `project.md` — architecture, API design, cache backend, metadata schema, logging strategy, and TODO for future features (JOIN, SELECT * support)
- `.gitignore` for Python/Poetry project
- `pyproject.toml` dependencies: `sqlglot`, `sqlalchemy`, `loguru`, `python-dotenv`; dev dependencies: `pytest`, `ruff`, `mypy`
- `src/sqlmem/` package structure with src layout
- `src/sqlmem/exceptions.py``ReadOnlyError` (blocks INSERT/UPDATE/DELETE), `UnsupportedQueryError` (blocks JOIN and SELECT *)
- `src/sqlmem/config.py` — loads `.env`, configures `loguru` with DEBUG/INFO level based on `SQLMEM_DEBUG`
- `src/sqlmem/_meta.py` — package version constant
- `src/sqlmem/parser.py` — SQL Parser using `sqlglot`; extracts table and columns from SELECT, raises on writes/JOIN/wildcard
- `src/sqlmem/registry.py` — Column Registry; accumulates requested columns per table, detects missing columns requiring re-fetch
- `src/sqlmem/cache.py` — Cache Manager; SQLite in-memory storage, load from `cache.db` on startup (with schema version check), hourly backup thread, `atexit`/SIGTERM flush, metadata tables (`_sqlmem_meta`, `_sqlmem_tables`, `_sqlmem_columns`)
- `src/sqlmem/executor.py` — Query Executor; cache hit/miss logic, re-fetch on new columns with WARNING log
- `src/sqlmem/engine.py``CachingEngine` wrapper; public API compatible with SQLAlchemy, `invalidate(table)` for manual cache clearing
- `src/sqlmem/__init__.py` — public exports: `CachingEngine`, `ReadOnlyError`, `UnsupportedQueryError`
- `tests/test_parser.py` — parser tests: SELECT parsing, ReadOnlyError, UnsupportedQueryError
- `tests/test_cache.py` — cache tests: load, data correctness, metadata, disk backup/reload
- `tests/test_registry.py` — registry tests: accumulation, needs_refetch, table isolation