SQLmem/CHANGELOG.md

# Changelog

All notable changes to this project will be documented in this file.

## [Unreleased]

---

## [1.11.0] - 2026-06-09

### Added
- **`pragmas=` parameter on `CachingEngine` / `CacheManager`** — pass a dict of SQLite PRAGMAs (e.g. `mmap_size`, `cache_size`, `temp_store`, `page_size`, `auto_vacuum`) applied to the cache connection at open time, so disk-backed caches can be tuned for the host's I/O profile without bypassing `CacheManager`. Unknown/inapplicable pragmas are silently ignored by SQLite (graceful degradation, no startup crash).
  - **`page_size`** is a layout pragma: it is applied only on a *fresh* file (set before WAL / the first table). On an existing cache with a different page size the request is ignored and a one-time warning is logged — the new value takes effect only after `hard_reset()` or a rebuild.
  - **`auto_vacuum`** is set before the database header is materialized (before switching to WAL) on a fresh file, so `INCREMENTAL`/`FULL` actually stick instead of silently reverting to `NONE`.
- **`CachingEngine.hard_reset()` / `CacheManager.hard_reset()`** — close every connection, delete the on-disk cache file (and its `-wal`/`-shm` sidecars) and reopen from scratch with all current pragmas applied. Unlike `reset()` (which drops tables but keeps the open file), this lets `page_size`/`auto_vacuum` change, since those are baked into the file at creation. Disk mode only — falls back to `reset()` in memory mode. All tables reload on next use.
- **`CachingEngine.vacuum(incremental=True, pages=10_000)` / `CacheManager.vacuum(...)`** — run maintenance VACUUM on the on-disk cache to reclaim free pages left by delta `INSERT OR REPLACE` churn. Incremental (default) reclaims up to `pages` pages without blocking readers or extra disk (requires `auto_vacuum=INCREMENTAL`); `incremental=False` runs a full VACUUM (rewrites the file, ~2× disk, blocks readers — maintenance window only). No-op in memory mode.

### Changed
- `pyproject.toml` — bumped version to `1.11.0`.
- `ColumnRegistry` gained `rebind()` so it follows the cache connection swap performed by `hard_reset()` (the registry previously captured the connection for the process lifetime).

---

## [1.10.0] - 2026-06-09

### Added
- **`last_upsert` (persisted write) vs `last_refresh` (run/liveness) in `stats`** — `TableStats.last_refresh` previously came from the persisted `last_refresh_at` column, which is only written when rows are actually written (a delta cycle with `total == 0` early-returns and leaves it unchanged). A healthy delta that keeps finding no new rows therefore *looked* frozen. The single value is now split:
  - `last_upsert` — wall-clock (UTC) of the last actual data write (full load / delta with rows). Persisted, survives restarts (this is the existing `last_refresh_at` column, surfaced under a clearer name).
  - `last_refresh` — wall-clock (UTC) of the last time a refresh cycle *ran* for the table, **even when it wrote nothing**. In-memory per process (`None` until the first cycle after start), tracked like `_states`/`_errors` — so **no schema change and no cache wipe**.
  - `CacheManager` gained `mark_refresh_ran()` / `get_last_runs()`; an empty delta cycle now records a run. TTL staleness still uses the last *write* (`seconds_since_refresh` reads `last_refresh_at`), so behaviour is unchanged.

### Changed
- `pyproject.toml` — bumped version to `1.10.0`.
- **`TableStats.last_refresh` is now `str | None`** (was `str`) and a new required `last_upsert: str | None` field is added. Consumers reading `last_refresh` for "when did data change?" should switch to `last_upsert`.

---

## [1.8.0] - 2026-06-08

### Fixed
- **Frozen delta watermark on `datetime` change columns** — the delta high-watermark is read back from the cache as an ISO `TEXT` string (e.g. `'2026-06-05T14:54:24.823000'`) and was bound straight back to the source. SQL Server then had to implicitly convert that `nvarchar` to `datetime` and **failed** (`T`-separated ISO with 6 fractional digits exceeds `datetime`'s 3 — error 241 / SQLSTATE 22007), so every delta refresh and the startup catch-up died before streaming and the watermark never advanced (the cache silently froze at the last full load). The watermark is now parsed back to a real `datetime` (`delta._bind_watermark`) so the driver sends a typed timestamp and the comparison runs natively; non-datetime change columns (e.g. integer rowversions) pass through unchanged. Regression tests added.

### Added
- **Refresh/load failures are now visible in `stats`** — `TableStats` gained `last_error`, `last_error_at` and `consecutive_failures`, and `Stats` gained a total `errors` counter. A delta that fails *before* streaming (e.g. the watermark bug above) previously left `state = ready`, hiding the problem; it now also marks the table `error` and records the message. `consecutive_failures` resets to 0 on the next success.
- **Per-engine configuration** — `CachingEngine` accepts `cache_db_path`, `backup_interval`, `refresh_interval`, `fetch_batch` and `dialect` (each defaults to its env var / config global when omitted), so two engines with independent cache files can run in one process and config is testable without env vars.
- **`blocking_startup_refresh` flag** (default `False`) — the startup catch-up (deltas/TTL reloads for tables restored from disk) now runs on the background thread by default, so it never blocks application startup. Pass `blocking_startup_refresh=True` to catch up synchronously before serving.

### Changed
- **SQL identifiers are quoted** — table/column names are now quoted everywhere they are interpolated into statements (SQLite double-quote for the cache, the configured dialect — e.g. T-SQL `[brackets]` — for the source), so reserved words or names with spaces work and the f-string interpolation is hardened.
- **Source connection opened lazily** — `execute()` no longer opens a source connection on every call; a pure cache hit never touches the source (and never occupies a pool slot). The misleading `cast(sqlite3.Connection, …)` on the source handle was removed (it is a pyodbc connection in production).
- **Concurrent reads in disk mode** — disk-backed reads now use a per-thread read-only WAL connection instead of sharing the single write connection under a lock, so a slow `SELECT` no longer blocks writers (loads/upserts) or other readers. In-memory mode is unchanged (a `:memory:` database can't be shared across connections).
- **`add_sink` is idempotent** — calling it again for the same sink is a no-op, so a double import no longer duplicates every log line.
- `pyproject.toml` — bumped version to `1.8.0`; added a scoped pytest `filterwarnings` for the SQLite test source's legacy datetime-adapter deprecation.

### Note
- Cache type fidelity (returning real `datetime`/`Decimal`/numeric types from `execute()` instead of `TEXT` strings, and giving numeric columns proper affinity) was evaluated but **deferred** — it changes the public output contract that consumers currently rely on (and that `test_coerce.py` pins). Decimal/datetime stay stored as exact, lossless `TEXT`.

---

## [1.7.0] - 2026-06-08

### Added
- **Disk-backed cache mode** — `CachingEngine(engine, in_memory=False)` (or env `SQLMEM_IN_MEMORY=false`) queries the on-disk `cache.db` directly instead of loading it into an in-memory SQLite. Every write persists immediately (no hourly backup thread, no load-on-startup copy, no `atexit`/`SIGTERM` flush needed), and the cache may exceed available RAM. The disk connection uses WAL + `synchronous=NORMAL` for write throughput. In-memory mode (backed up to disk periodically) remains the default. `in_memory` defaults to the `SQLMEM_IN_MEMORY` config when omitted.
  - On open, a disk cache with a mismatched `schema_version` is wiped in place and rebuilt.
  - `engine.reset()` in disk mode drops the cached tables and `VACUUM`s the file (it does not unlink the open file).
- `SQLMEM_IN_MEMORY` env var (default `true`).

### Changed
- `pyproject.toml` — bumped version to `1.7.0`
- `cache.py` — `CacheManager` gained an `in_memory` flag; the cache connection (`_mem_conn` → `_conn`) is opened either on `:memory:` or directly on the on-disk file. Disk mode skips the load-on-startup copy, backup thread, and shutdown flush, and `reset()` VACUUMs in place instead of unlinking the open file.
- `.gitignore` — ignore `cache.db` and its WAL sidecars (`cache.db-wal`, `cache.db-shm`).

---

## [1.6.0] - 2026-06-05

### Added
- **Secondary indexes** — `CachingEngine(engine, indexes={"VW_X": ["col", ["a", "b"]]})` creates indexes on the in-memory cache to accelerate `WHERE`/`JOIN` lookups. Index columns are auto-loaded so the index exists from the first load, and indexes are recreated after every (re)load and persist in `cache.db`. Combines freely with `delta` and `ttl`.

### Changed
- `pyproject.toml` — bumped version to `1.6.0`

---

## [1.5.0] - 2026-06-05

### Added
- **Per-table processing state in `stats`** — `TableStats` now carries `state` (`loading` / `refreshing` / `ready` / `stale` / `error`) and `tracking` (`delta` / `ttl` / `static`), so callers can see whether each table is up to date or being processed. In-progress first loads and failed loads also surface in `stats.tables`.
- `SQLMEM_FETCH_BATCH` env var (default `10000`) — rows fetched per batch when loading a table.

### Changed
- `pyproject.toml` — bumped version to `1.5.0`
- **Large-table loads are streamed in batches** — `load_table` no longer `fetchall()`s the whole table (which double-buffered every row in Python and could OOM/crash on tens of millions of rows). Rows are now fetched `SQLMEM_FETCH_BATCH` at a time into a staging table and swapped in atomically, so peak memory stays bounded, the previous copy stays queryable during a reload, and the network fetch no longer holds the cache lock. Delta catch-ups are streamed the same way.
- Orphan staging tables left by an interrupted load (crash/backup mid-load) are dropped on startup.
- Delta upserts compute `row_count` once per refresh instead of a full `COUNT(*)` after every batch (avoids O(rows×batches) work on large catch-ups).

---

## [1.4.0] - 2026-06-05

### Fixed
- **`decimal.Decimal` (and `datetime`) binding error** — `NUMERIC`/`DECIMAL`/`MONEY` columns from SQL Server (pyodbc) arrive as `decimal.Decimal`, which `sqlite3` cannot bind, crashing the cache load with `type 'decimal.Decimal' is not supported`. Values are now coerced to sqlite-bindable types (`Decimal`→`str`, `datetime`/`date`/`time`→ISO, `uuid.UUID`→`str`, `bytearray`→`bytes`) at the cache boundary — on full load, on delta upsert, and for WHERE parameters. Coercion is local (no global `sqlite3.register_adapter`), so the host application's `sqlite3` behaviour is untouched. Cache columns are `TEXT`, so the conversion is lossless and exact (no rounding).

### Added
- **Incremental (delta) refresh** — `CachingEngine(engine, delta={...})` with `DeltaConfig(change_column, key_columns)`. Delta-tracked tables are kept in sync by pulling only changed rows (`WHERE change_column >= watermark`) and upserting them by key, instead of full reloads.
  - Data-driven high-watermark = `max(change_column)` cached, persisted in `cache.db`; `>=` overlap + idempotent upsert so no row is missed and boundary rows are harmlessly re-read.
  - Catch-up on startup (since last shutdown) and a background thread refreshing every `SQLMEM_REFRESH_INTERVAL` seconds (default 300); `engine.refresh()` triggers a pull on demand.
  - Primary key is auto-discovered from the source DB (`inspect(engine).get_pk_constraint`) when `key_columns` is omitted; required explicitly for views (raises `ValueError`).
- **Per-table TTL (time-based refresh)** — `CachingEngine(engine, ttl={"VW_X": 300})` for tables with no change column that can't be delta-synced. The cached copy is guaranteed never older than the TTL: a query touching an expired table triggers a full reload before it is answered (read-time guarantee), and the background thread proactively reloads expired tables. TTL age uses the persisted `last_refresh_at`, so the bound holds across restarts. A table in both `delta` and `ttl` raises `ValueError`.
- `DeltaConfig` exported from the public API.
- `engine.reset()` — wipes the whole cache (RAM + `cache.db`) for a clean rebuild after structural source changes.
- `SQLMEM_REFRESH_INTERVAL` env var (default `300`) — background refresh tick for delta pulls and proactive TTL reloads.

### Changed
- `pyproject.toml` — bumped version to `1.4.0`
- `cache.py` — schema version bumped to `3`; `_sqlmem_tables` gained a `last_synced_at` watermark column. New methods: `execute_in_memory` (lock-serialized read), `get_table_columns`, `create_unique_index`, `get/set_last_synced_at`, `max_value`, `upsert_rows`, `seconds_since_refresh`, `reset`. Existing on-disk caches are discarded and rebuilt on load.
- `executor.py` — delta-tracked tables augment their column set with key/change columns (unique key index + initial watermark); TTL-tracked tables full-reload at read time when expired; in-memory reads go through the cache lock.

---

## [1.2.0] - 2026-06-04

### Added
- **Parametrized queries (R1)** — `execute(sql, params)` accepts positional (`?` tuple/list) and named (`:name` dict) parameters; passed straight to SQLite during in-memory filtering. Cache loads still fetch the full table (parameters are not applied to source fetches).
- **JOIN support (R2)** — multi-table SELECTs are parsed into per-table column sets; each table is cached independently and the JOIN runs in the in-memory SQLite. Columns in a multi-table query must be qualified by table or alias.
- **`SELECT *` support (R3)** — wildcard (and `alias.*`) queries discover all columns from the source DB, cache the whole table, and mark it `is_full` so later column queries are guaranteed cache hits without re-fetch.
- **Three-part table names (R4)** — `[catalog].[schema].[table]` is parsed to its base name for caching; the in-memory query is rewritten to strip catalog/schema prefixes so it runs under SQLite.
- `SQLMEM_SQL_DIALECT` env var (default `tsql`) — sqlglot dialect used to parse incoming SQL; T-SQL also accepts ANSI SQL and MSSQL bracket quoting.
- `CacheManager.discover_columns()` and `CacheManager.is_table_full()`; `load_table()` gained a `full` flag.

### Changed
- `pyproject.toml` — bumped version to `1.2.0`
- `parser.py` — `ParsedQuery.table: str` replaced by `tables: list[str]` plus `columns_by_table`, `sqlite_sql`, `params`, and `wildcard_tables`; SQL is parsed with the configured dialect and rendered to SQLite for execution.
- `executor.py` — loads each referenced table independently and applies query parameters during in-memory execution.
- `cache.py` — schema version bumped to `2`; `_sqlmem_tables` gained an `is_full` column (existing on-disk caches are discarded and rebuilt on load).

---

## [1.1.0] - 2026-06-03

### Added
- `Stats` and `TableStats` frozen dataclasses — snapshot of runtime cache statistics (hit/miss/refetch counts, per-table row count, columns, last refresh timestamp)
- `StatsCollector` — internal thread-safe counter; increments on every cache hit, miss, and re-fetch
- `engine.stats` property — returns a `Stats` snapshot at any point in time
- `Stats` and `TableStats` exported from the public API

### Changed
- `pyproject.toml` — bumped version to `1.1.0`

---

## [1.0.0] - 2026-06-03

### Changed
- `pyproject.toml` — bumped version to `1.0.0`

---

## [0.4.0] - 2026-06-03

### Added
- `add_sink(sink, *, level, **kwargs)` — public API for routing sqlmem log records to any loguru-compatible sink (stream, file, callable); supports all loguru `logger.add()` kwargs including `rotation`, `retention`, etc.

### Changed
- `pyproject.toml` — bumped version to `0.4.0`
- `config.py` — replaced destructive `logger.remove()` + forced default sink with `logger.disable("sqlmem")`; sqlmem is now silent by default and does not interfere with the host application's logging setup

---

## [0.3.0] - 2026-06-03

### Added
- `README.md` — full project documentation: architecture overview, quick start, cache behaviour, persistence, configuration, exceptions, logging, and limitations

### Changed
- `pyproject.toml` — bumped version to `0.3.0`
- `parser.py` — `_extract_columns` now deduplicates column names while preserving order
- `.gitignore` — added `.env` and `.env.*` to prevent accidental commit of environment files

### Security
- Removed `.env` from git tracking (`git rm --cached`)

---

## [0.2.0] - 2026-06-01

### Added
- Project specification in `project.md` — architecture, API design, cache backend, metadata schema, logging strategy, and TODO for future features (JOIN, SELECT * support)
- `.gitignore` for Python/Poetry project
- `pyproject.toml` dependencies: `sqlglot`, `sqlalchemy`, `loguru`, `python-dotenv`; dev dependencies: `pytest`, `ruff`, `mypy`
- `src/sqlmem/` package structure with src layout
- `src/sqlmem/exceptions.py` — `ReadOnlyError` (blocks INSERT/UPDATE/DELETE), `UnsupportedQueryError` (blocks JOIN and SELECT *)
- `src/sqlmem/config.py` — loads `.env`, configures `loguru` with DEBUG/INFO level based on `SQLMEM_DEBUG`
- `src/sqlmem/_meta.py` — package version constant
- `src/sqlmem/parser.py` — SQL Parser using `sqlglot`; extracts table and columns from SELECT, raises on writes/JOIN/wildcard
- `src/sqlmem/registry.py` — Column Registry; accumulates requested columns per table, detects missing columns requiring re-fetch
- `src/sqlmem/cache.py` — Cache Manager; SQLite in-memory storage, load from `cache.db` on startup (with schema version check), hourly backup thread, `atexit`/SIGTERM flush, metadata tables (`_sqlmem_meta`, `_sqlmem_tables`, `_sqlmem_columns`)
- `src/sqlmem/executor.py` — Query Executor; cache hit/miss logic, re-fetch on new columns with WARNING log
- `src/sqlmem/engine.py` — `CachingEngine` wrapper; public API compatible with SQLAlchemy, `invalidate(table)` for manual cache clearing
- `src/sqlmem/__init__.py` — public exports: `CachingEngine`, `ReadOnlyError`, `UnsupportedQueryError`
- `tests/test_parser.py` — parser tests: SELECT parsing, ReadOnlyError, UnsupportedQueryError
- `tests/test_cache.py` — cache tests: load, data correctness, metadata, disk backup/reload
- `tests/test_registry.py` — registry tests: accumulation, needs_refetch, table isolation