Honza/SQLmem

Jan Doubravský 209ae667ab Add disk-backed SQLite cache mode as an alternative to in-memory

2026-06-08 11:39:04 +02:00

Changelog

All notable changes to this project will be documented in this file.

[Unreleased]

[1.7.0] - 2026-06-08

Added

Disk-backed cache mode — CachingEngine(engine, in_memory=False) (or env SQLMEM_IN_MEMORY=false) queries the on-disk cache.db directly instead of loading it into an in-memory SQLite. Every write persists immediately (no hourly backup thread, no load-on-startup copy, no atexit/SIGTERM flush needed), and the cache may exceed available RAM. The disk connection uses WAL + synchronous=NORMAL for write throughput. In-memory mode (backed up to disk periodically) remains the default. in_memory defaults to the SQLMEM_IN_MEMORY config when omitted.
- On open, a disk cache with a mismatched schema_version is wiped in place and rebuilt.
- engine.reset() in disk mode drops the cached tables and VACUUMs the file (it does not unlink the open file).
SQLMEM_IN_MEMORY env var (default true).
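
As a rough illustration of the disk-backed mode described above, the sketch below opens an on-disk SQLite file with WAL and synchronous=NORMAL using only the standard sqlite3 module. The helper name open_disk_cache and the temporary path are assumptions made for this example; this is not sqlmem's actual code.

```python
import os
import sqlite3
import tempfile


def open_disk_cache(path: str) -> sqlite3.Connection:
    """Open an on-disk SQLite file tuned for write throughput (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # WAL lets readers keep working while a writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    # NORMAL syncs less often than FULL, trading some durability for speed.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


# Use a throwaway directory so the example does not touch a real cache.db.
db_path = os.path.join(tempfile.mkdtemp(), "cache.db")

conn = open_disk_cache(db_path)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
with conn:  # commits on success, so the write is persisted right away
    conn.execute("INSERT INTO t (v) VALUES (?)", ("persisted",))
journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
conn.close()

# A fresh connection sees the row without any backup or flush step.
reopened = sqlite3.connect(db_path)
rows = reopened.execute("SELECT v FROM t").fetchall()
reopened.close()
```

Because every committed write already lives in the file, there is nothing left to copy at startup or flush at shutdown, which matches the behaviour listed for disk mode.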

Changed

pyproject.toml — bumped version to 1.7.0
cache.py — CacheManager gained an in_memory flag; the cache connection (_mem_conn → _conn) is opened either on :memory: or directly on the on-disk file. Disk mode skips the load-on-startup copy, backup thread, and shutdown flush, and reset() VACUUMs in place instead of unlinking the open file.
.gitignore — ignore cache.db and its WAL sidecars (cache.db-wal, cache.db-shm).

[1.6.0] - 2026-06-05

Added

Secondary indexes — CachingEngine(engine, indexes={"VW_X": ["col", ["a", "b"]]}) creates indexes on the in-memory cache to accelerate WHERE/JOIN lookups. Index columns are auto-loaded so the index exists from the first load, and indexes are recreated after every (re)load and persist in cache.db. Combines freely with delta and ttl.
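
The indexes mapping mixes plain strings (single-column indexes) with lists (composite indexes). A minimal sketch of how such a spec could be turned into CREATE INDEX statements follows; the function name create_indexes and the ix_ naming scheme are assumptions for illustration, not the library's internals.

```python
import sqlite3


def create_indexes(conn, table, specs):
    """Create one index per spec; a str is one column, a list is a composite index."""
    names = []
    for spec in specs:
        cols = [spec] if isinstance(spec, str) else list(spec)
        name = f"ix_{table}_{'_'.join(cols)}"  # hypothetical naming scheme
        col_sql = ", ".join(f'"{c}"' for c in cols)
        # IF NOT EXISTS makes it safe to call again after every (re)load.
        conn.execute(f'CREATE INDEX IF NOT EXISTS "{name}" ON "{table}" ({col_sql})')
        names.append(name)
    return names


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "VW_X" (col TEXT, a TEXT, b TEXT)')
index_names = create_indexes(conn, "VW_X", ["col", ["a", "b"]])

# List the indexes SQLite now knows about for the table.
existing = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ? ORDER BY name",
        ("VW_X",),
    )
]
```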

Changed

pyproject.toml — bumped version to 1.6.0

[1.5.0] - 2026-06-05

Added

Per-table processing state in stats — TableStats now carries state (loading / refreshing / ready / stale / error) and tracking (delta / ttl / static), so callers can see whether each table is up to date or being processed. In-progress first loads and failed loads also surface in stats.tables.
SQLMEM_FETCH_BATCH env var (default 10000) — rows fetched per batch when loading a table.

Changed

pyproject.toml — bumped version to 1.5.0
Large-table loads are streamed in batches — load_table no longer fetchall()s the whole table (which double-buffered every row in Python and could OOM/crash on tens of millions of rows). Rows are now fetched SQLMEM_FETCH_BATCH at a time into a staging table and swapped in atomically, so peak memory stays bounded, the previous copy stays queryable during a reload, and the network fetch no longer holds the cache lock. Delta catch-ups are streamed the same way.
Orphan staging tables left by an interrupted load (crash/backup mid-load) are dropped on startup.
Delta upserts compute row_count once per refresh instead of a full COUNT(*) after every batch (avoids O(rows×batches) work on large catch-ups).
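
The batched load described in this release can be sketched roughly as follows: rows are pulled with fetchmany() into a staging table, which then replaces the live table in a single transaction. An in-memory SQLite database stands in for the remote source, and the function name load_table_streamed and the __staging suffix are assumptions made for the example.

```python
import sqlite3

FETCH_BATCH = 3  # stand-in for SQLMEM_FETCH_BATCH (default 10000)


def load_table_streamed(source, cache, table, columns, batch=FETCH_BATCH):
    """Copy a table in bounded batches, then swap the staging copy in."""
    staging = f"{table}__staging"  # hypothetical staging-table name
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    cache.execute(f'DROP TABLE IF EXISTS "{staging}"')
    cache.execute(f'CREATE TABLE "{staging}" ({col_sql})')
    cursor = source.execute(f'SELECT {col_sql} FROM "{table}"')
    total = 0
    while True:
        rows = cursor.fetchmany(batch)  # at most `batch` rows held in Python
        if not rows:
            break
        cache.executemany(f'INSERT INTO "{staging}" VALUES ({placeholders})', rows)
        total += len(rows)
    # The old table stays readable until this transaction commits.
    with cache:
        cache.execute(f'DROP TABLE IF EXISTS "{table}"')
        cache.execute(f'ALTER TABLE "{staging}" RENAME TO "{table}"')
    return total


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(i, str(i * 10)) for i in range(7)])

cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE orders (id INTEGER, amount TEXT)")
cache.execute("INSERT INTO orders VALUES (99, 'stale')")

loaded = load_table_streamed(source, cache, "orders", ["id", "amount"])
cached_count = cache.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```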

[1.4.0] - 2026-06-05

Fixed

decimal.Decimal (and datetime) binding error — NUMERIC/DECIMAL/MONEY columns from SQL Server (pyodbc) arrive as decimal.Decimal, which sqlite3 cannot bind, crashing the cache load with type 'decimal.Decimal' is not supported. Values are now coerced to sqlite-bindable types (Decimal→str, datetime/date/time→ISO, uuid.UUID→str, bytearray→bytes) at the cache boundary — on full load, on delta upsert, and for WHERE parameters. Coercion is local (no global sqlite3.register_adapter), so the host application's sqlite3 behaviour is untouched. Cache columns are TEXT, so the conversion is lossless and exact (no rounding).
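
The coercion rules listed above can be illustrated with a small local converter that is applied to each value before binding, without registering any global sqlite3 adapter. The function name to_sqlite is an assumption for this sketch.

```python
import datetime as dt
import decimal
import sqlite3
import uuid


def to_sqlite(value):
    """Coerce one value to a type sqlite3 can bind (hypothetical helper)."""
    if isinstance(value, decimal.Decimal):
        return str(value)  # exact text form, no float rounding
    if isinstance(value, (dt.datetime, dt.date, dt.time)):
        return value.isoformat()
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, bytearray):
        return bytes(value)
    return value


row = (
    decimal.Decimal("19.9900"),
    dt.date(2026, 6, 5),
    uuid.UUID("12345678-1234-5678-1234-567812345678"),
    bytearray(b"\x01\x02"),
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (price TEXT, day TEXT, uid TEXT, raw BLOB)")
conn.execute("INSERT INTO t VALUES (?, ?, ?, ?)", tuple(to_sqlite(v) for v in row))
stored = conn.execute("SELECT price, day, uid, raw FROM t").fetchone()

# The text form round-trips to the same Decimal, trailing zeros included.
restored_price = decimal.Decimal(stored[0])
```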

Added

Incremental (delta) refresh — CachingEngine(engine, delta={...}) with DeltaConfig(change_column, key_columns). Delta-tracked tables are kept in sync by pulling only changed rows (WHERE change_column >= watermark) and upserting them by key, instead of full reloads.
- The high-watermark is data-driven: it is the max(change_column) value present in the cache, persisted in cache.db. Pulls use an inclusive >= comparison and upserts are idempotent, so no row is missed and boundary rows are harmlessly re-read.
- Catch-up on startup (since last shutdown) and a background thread refreshing every SQLMEM_REFRESH_INTERVAL seconds (default 300); engine.refresh() triggers a pull on demand.
- When key_columns is omitted, the primary key is auto-discovered from the source DB (inspect(engine).get_pk_constraint). For views, key_columns must be given explicitly; omitting it raises ValueError.
Per-table TTL (time-based refresh) — CachingEngine(engine, ttl={"VW_X": 300}) for tables with no change column that can't be delta-synced. The cached copy is guaranteed never older than the TTL: a query touching an expired table triggers a full reload before it is answered (read-time guarantee), and the background thread proactively reloads expired tables. TTL age uses the persisted last_refresh_at, so the bound holds across restarts. A table in both delta and ttl raises ValueError.
DeltaConfig exported from the public API.
engine.reset() — wipes the whole cache (RAM + cache.db) for a clean rebuild after structural source changes.
SQLMEM_REFRESH_INTERVAL env var (default 300) — background refresh tick for delta pulls and proactive TTL reloads.
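
The delta refresh idea (inclusive watermark plus idempotent upsert by key) can be sketched with two SQLite connections, one standing in for the source database. The table, the column names, and the delta_pull function are assumptions for the example, and INSERT OR REPLACE is used here as one simple way to upsert on a primary key; the library may do this differently.

```python
import sqlite3


def delta_pull(source, cache, watermark):
    """Pull rows changed at or after the watermark and upsert them by key."""
    if watermark is None:  # first load: take everything
        rows = source.execute("SELECT id, name, updated_at FROM items").fetchall()
    else:
        rows = source.execute(
            "SELECT id, name, updated_at FROM items WHERE updated_at >= ?",
            (watermark,),
        ).fetchall()
    with cache:
        # Replacing by primary key makes re-reading a boundary row harmless.
        cache.executemany(
            "INSERT OR REPLACE INTO items (id, name, updated_at) VALUES (?, ?, ?)", rows
        )
    # The new watermark comes from the cached data itself.
    new_mark = cache.execute("SELECT MAX(updated_at) FROM items").fetchone()[0]
    return len(rows), new_mark


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
source.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "a", "2026-06-01T10:00:00"), (2, "b", "2026-06-01T11:00:00")],
)
cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")

pulled1, mark1 = delta_pull(source, cache, None)

# One row changes and one row is added at the source.
source.execute("UPDATE items SET name = 'a2', updated_at = '2026-06-02T09:00:00' WHERE id = 1")
source.execute("INSERT INTO items VALUES (3, 'c', '2026-06-02T09:30:00')")

pulled2, mark2 = delta_pull(source, cache, mark1)
names = [r[0] for r in cache.execute("SELECT name FROM items ORDER BY id")]
```

In the second pull, row 2 sits exactly on the watermark and is read again, but the upsert leaves the cache unchanged for it.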

Changed

pyproject.toml — bumped version to 1.4.0
cache.py — schema version bumped to 3; _sqlmem_tables gained a last_synced_at watermark column. New methods: execute_in_memory (lock-serialized read), get_table_columns, create_unique_index, get/set_last_synced_at, max_value, upsert_rows, seconds_since_refresh, reset. Existing on-disk caches are discarded and rebuilt on load.
executor.py — delta-tracked tables augment their column set with key/change columns (unique key index + initial watermark); TTL-tracked tables full-reload at read time when expired; in-memory reads go through the cache lock.
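
The read-time TTL check mentioned for TTL-tracked tables can be pictured as below. The class name TtlTracker, the injected clock, and the choice to treat an age equal to the TTL as expired are all assumptions of this sketch rather than confirmed details of executor.py.

```python
import time


class TtlTracker:
    """Decide at read time whether a TTL-tracked table needs a full reload."""

    def __init__(self, ttl, clock=time.time):
        self.ttl = dict(ttl)  # table name -> max age in seconds
        self.clock = clock  # injectable so the example is deterministic
        self.last_refresh_at = {}

    def is_expired(self, table):
        if table not in self.ttl:
            return False  # not TTL-tracked
        last = self.last_refresh_at.get(table)
        if last is None:
            return True  # never loaded
        return self.clock() - last >= self.ttl[table]

    def ensure_fresh(self, table, reload):
        """Reload before answering a query if the cached copy is too old."""
        if self.is_expired(table):
            reload(table)
            self.last_refresh_at[table] = self.clock()
            return True
        return False


now = [1000.0]  # fake clock value, advanced by hand below
reloads = []
tracker = TtlTracker({"VW_X": 300}, clock=lambda: now[0])

first = tracker.ensure_fresh("VW_X", reloads.append)  # never loaded yet
now[0] = 1299.0
second = tracker.ensure_fresh("VW_X", reloads.append)  # 299 s old
now[0] = 1300.0
third = tracker.ensure_fresh("VW_X", reloads.append)  # 300 s old
```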

[1.2.0] - 2026-06-04

Added

Parametrized queries (R1) — execute(sql, params) accepts positional (? tuple/list) and named (:name dict) parameters; passed straight to SQLite during in-memory filtering. Cache loads still fetch the full table (parameters are not applied to source fetches).
JOIN support (R2) — multi-table SELECTs are parsed into per-table column sets; each table is cached independently and the JOIN runs in the in-memory SQLite. Columns in a multi-table query must be qualified by table or alias.
SELECT * support (R3) — wildcard (and alias.*) queries discover all columns from the source DB, cache the whole table, and mark it is_full so later column queries are guaranteed cache hits without re-fetch.
Three-part table names (R4) — [catalog].[schema].[table] is parsed to its base name for caching; the in-memory query is rewritten to strip catalog/schema prefixes so it runs under SQLite.
SQLMEM_SQL_DIALECT env var (default tsql) — sqlglot dialect used to parse incoming SQL; T-SQL also accepts ANSI SQL and MSSQL bracket quoting.
CacheManager.discover_columns() and CacheManager.is_table_full(); load_table() gained a full flag.
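
Since parameters are passed straight to SQLite, the two supported styles behave the way they do in the standard sqlite3 module. The snippet below shows both against a plain in-memory database; the table and data are invented for the example and no sqlmem API is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Prague"), (2, "Brno"), (3, "Prague")],
)

# Positional style: ? placeholders bound from a tuple or list.
positional = conn.execute(
    "SELECT id FROM customers WHERE city = ? ORDER BY id", ("Prague",)
).fetchall()

# Named style: :name placeholders bound from a dict.
named = conn.execute(
    "SELECT id FROM customers WHERE city = :city ORDER BY id", {"city": "Brno"}
).fetchall()
```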

Changed

pyproject.toml — bumped version to 1.2.0
parser.py — ParsedQuery.table: str replaced by tables: list[str] plus columns_by_table, sqlite_sql, params, and wildcard_tables; SQL is parsed with the configured dialect and rendered to SQLite for execution.
executor.py — loads each referenced table independently and applies query parameters during in-memory execution.
cache.py — schema version bumped to 2; _sqlmem_tables gained an is_full column (existing on-disk caches are discarded and rebuilt on load).

[1.1.0] - 2026-06-03

Added

Stats and TableStats frozen dataclasses — snapshot of runtime cache statistics (hit/miss/refetch counts, per-table row count, columns, last refresh timestamp)
StatsCollector — internal thread-safe counter; increments on every cache hit, miss, and re-fetch
engine.stats property — returns a Stats snapshot at any point in time
Stats and TableStats exported from the public API
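
The pattern of a lock-protected counter that hands out immutable snapshots can be sketched as follows. StatsSnapshot and HitCounter are hypothetical names chosen so they are not mistaken for the library's own Stats and StatsCollector, whose exact fields are not shown in this changelog.

```python
import threading
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class StatsSnapshot:
    """Immutable point-in-time view of the counters."""

    hits: int
    misses: int
    refetches: int


class HitCounter:
    """Thread-safe counter; every update and read takes the same lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {"hits": 0, "misses": 0, "refetches": 0}

    def record(self, kind):
        with self._lock:
            self._counts[kind] += 1

    def snapshot(self):
        with self._lock:
            return StatsSnapshot(**self._counts)


counter = HitCounter()


def worker():
    for _ in range(1000):
        counter.record("hits")


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
counter.record("misses")

snap = counter.snapshot()

# A frozen dataclass rejects attribute assignment.
try:
    snap.hits = 0
    is_frozen = False
except FrozenInstanceError:
    is_frozen = True
```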

Changed

pyproject.toml — bumped version to 1.1.0

[1.0.0] - 2026-06-03

Changed

pyproject.toml — bumped version to 1.0.0

[0.4.0] - 2026-06-03

Added

add_sink(sink, *, level, **kwargs) — public API for routing sqlmem log records to any loguru-compatible sink (stream, file, callable); supports all loguru logger.add() kwargs including rotation, retention, etc.

Changed

pyproject.toml — bumped version to 0.4.0
config.py — replaced destructive logger.remove() + forced default sink with logger.disable("sqlmem"); sqlmem is now silent by default and does not interfere with the host application's logging setup

[0.3.0] - 2026-06-03

Added

README.md — full project documentation: architecture overview, quick start, cache behaviour, persistence, configuration, exceptions, logging, and limitations

Changed

pyproject.toml — bumped version to 0.3.0
parser.py — _extract_columns now deduplicates column names while preserving order
.gitignore — added .env and .env.* to prevent accidental commit of environment files

Security

Removed .env from git tracking (git rm --cached)

[0.2.0] - 2026-06-01

Added

Project specification in project.md — architecture, API design, cache backend, metadata schema, logging strategy, and TODO for future features (JOIN, SELECT * support)
.gitignore for Python/Poetry project
pyproject.toml dependencies: sqlglot, sqlalchemy, loguru, python-dotenv; dev dependencies: pytest, ruff, mypy
src/sqlmem/ package structure with src layout
src/sqlmem/exceptions.py — ReadOnlyError (blocks INSERT/UPDATE/DELETE), UnsupportedQueryError (blocks JOIN and SELECT *)
src/sqlmem/config.py — loads .env, configures loguru with DEBUG/INFO level based on SQLMEM_DEBUG
src/sqlmem/_meta.py — package version constant
src/sqlmem/parser.py — SQL Parser using sqlglot; extracts table and columns from SELECT, raises on writes/JOIN/wildcard
src/sqlmem/registry.py — Column Registry; accumulates requested columns per table, detects missing columns requiring re-fetch
src/sqlmem/cache.py — Cache Manager; SQLite in-memory storage, load from cache.db on startup (with schema version check), hourly backup thread, atexit/SIGTERM flush, metadata tables (_sqlmem_meta, _sqlmem_tables, _sqlmem_columns)
src/sqlmem/executor.py — Query Executor; cache hit/miss logic, re-fetch on new columns with WARNING log
src/sqlmem/engine.py — CachingEngine wrapper; public API compatible with SQLAlchemy, invalidate(table) for manual cache clearing
src/sqlmem/__init__.py — public exports: CachingEngine, ReadOnlyError, UnsupportedQueryError
tests/test_parser.py — parser tests: SELECT parsing, ReadOnlyError, UnsupportedQueryError
tests/test_cache.py — cache tests: load, data correctness, metadata, disk backup/reload
tests/test_registry.py — registry tests: accumulation, needs_refetch, table isolation
