Postgres
- Row-based database
- Open source
Data Warehouse
Extensions
- Citus
- Swarm64
Citus
- Columnar storage
- Sem suporte para update e delete à data
Tips
- don't use the same server as your production system
- upgrade to pg 12+ (or avoid common table expressions in your queries)
- go easy on indexes – less is more
- consider partitioning long tables
- ensure you're not I/O-bound
- vacuum analyze after bulk insertion
- explore parallel queries
- increase statistics sampling
- use fewer columns on frequently-queried tables
Reasons not to use indexes
- For many analytics queries it's faster for Postgres to do a table scan than an index scan
- Indexes increase the size of the table. The smaller the table, the more will fit in memory.
- Indexes add additional cost on every insert / update
Types of Index
https://www.postgresql.org/docs/current/indexes-types.html
- B-tree
- Hash
- GiST
- SP-GiST
- GIN
- Used for text-search
- Could be used for general json search
- BRIN
Partial Indexes
Index of only some rows of a given table source.
CREATE INDEX access_log_client_ip_ix ON access_log (client_ip)
WHERE NOT (client_ip > inet '192.168.100.0' AND client_ip < inet '192.168.100.255');
Partitions
Divide uma tabela em várias o que pode reduzir o tempo de respota se uma tiver que aceder apenas a uma query. No entanto tem um trabalho extra de manutenção (as tabelas não são criadas automaticamente).
Replication
Postgres has 2 modes:
- Physical
- Logica (more robust)
MVCC
Multi-version concurrency control utilizado para garantir ACID em transações concorrentes
Explain Analyze
- Explain shows the query execution plan
- Analyze actually runs the query
Options
- Analyze
- buffers
- Verbose
- Settings
- Wal: shows the wall usage on DML
EXPLAIN (ANALYZE, BUFFERS) /* SQL statement */;
Costs
1 unit Cost = read 8KB page during sequential scan
Startup cost: cost to return first row
Total cost: cost to return all rows
How to check?
- Finds the nodes with the most execution times
- find the lowest node where the estimate is bigger thant the plant (usually 10x more)
- find long running sequential scans with a filter condition. This can be improved with an index
Tools
Import data
- COPY is faster than INSERT
- Increase checkpoints is better
- Logged table are slower than unlogged tables but can crash and doesn't have replication
- Create indexes after inserting data
- Avoid triggers when writing
- Improving column order and space consumption
- Fixed sized columns first
Notification/Listen
- Can one subscribe to changes? (alternative to streaming system)
Postgres 14
- Perf improvements
- Adds multirange type
Postgres 16
- Allows for logical replication from secondary instances
- Many performance improvements due to parallelization
- Adds the pg_stats_io for monitoring IO