See also:

  • Datalog
  • concurrency

Key Value Store

log structured storage a log is a append only store LSM - log structured merge trees. In memory table for writes. Flushed to disk. Multiple read only written to disk, coalesced in background. sstable Tombstone records for deletes. https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/

What’s the big deal about key-value databases like FoundationDB and RocksDB? lobster comments https://lobste.rs/s/avljlh/what_s_big_deal_about_embedded_key_value#c_rx0oid

wide-column store key/value store

Embedded key value store. Backing engines. MySql has support for many backing engines

Algorithms

B-trees

OLTP online transaction processing OLAP online analytical processing hyperloglog bloom filters cuckoo filter

Theory

Conjunctive Queries

Query containment

  • See finite model theory

descriptive complexity NC^0 bounded fan in AC^0 https://en.wikipedia.org/wiki/AC0 unbounded fan in circuit. Constant height https://en.wikipedia.org/wiki/Circuit_complexity

https://pages.cs.wisc.edu/~paris/cs838-s16/lecture-notes/lecture1.pdf

Foundations of database

Conjunctive Query Fun queures to solve NP problems. another angle on the bdd = gj thing

Schema

https://en.wikipedia.org/wiki/Database_normalization

schema is finite set of relation symbol names an instance is a set of concrete relations with those symbol names. Sometimes also called a structure

Functional Dependencies

Armstrong axioms

Normal Formals

Tuple Generating dependencies

Query Optimization

Cascades framework https://github.com/egraphs-good/egg/discussions/189

Zetasql calcite

The Chase

Equality Generating Dependencies The Chase Procedure and its Applications in Data Exchange

Yisu: query optimization data integration querying incomplete databases benchmarking the chase chasebench

Chasefun, DEMOo, Graal, llunatic, pdg, pegasus, dlv, e, rdfox

Stratgeies - (restricted, unrestricted, parallel, skolem, fresh-null

Chase Strategies vs SIPS

The power of the terminating chase

Is the chase meant to be applied to actual databases, symbolic databases / schema, or other dependencies? Is it fair the say that the restricted chase for full dependencies is datalog?

Alice book chapter 8-11

Graal - https://github.com/hamhec/DEFT https://hamhec.github.io/DEFT/ defeasible programming http://lidia.cs.uns.edu.ar/delp_client/ Something about extra negation power? Defeatable rules if something contradicts them Pure is part of graal

llunatic - https://github.com/donatellosantoro/Llunatic

RDfox - https://docs.oxfordsemantic.tech/

dlgp - datalog plus format. Allows variables in head = existentials. Variables in facts. Notion of constraint ! :- and notion of query. Hmm.

Direct modelling of union find in z3? homomorphism is union find

SQL

The core SQL stuff is just a query of the form

SELECT columns and expressions FROM a as alias1, couple as alias2, tables as alias3 
WHERE alias2.col1 = 7 AND alias4.col7 = alias1.foo

It really almost isn’t a programming language. It just so happens that there are enough slightly off the beaten path features that you can do some neat stuff with it. This can ever be useful, because serializing results over the network is probably very bad performance wise.

Sometimes you want to INSERT INTO or DELETE FROM these results rather than just returns them

Some other weird stuff:

You can use it as a calculator to just evaluate expressions.

SELECT 40 + 2;

Creating tables and adding concrete values.

CREATE TABLE T (a int PRIMARY KEY, -- implies not null
 b bool, c text, d int);

-- CREATE TYPE mytype AS (a bool, b text);

INSERT INTO T VALUES
(1,true, "hi", 3),
(2,true, "hi", 3)
;

-- INSERT INTO T TABLE T;

SELECT myrow.* -- 2 returns row variable
FROM T AS myrow;-- 1 binds myrow


SELECT myrow.* -- 2 returns row variable
FROM T AS myrow WHERE myrow.a = myrow.a;

DROP TABLE IF EXISTS T;

--SELECT * FROM T;

-- can label columns
SELECT 40 + 2 AS firstcol, "dog" || "store" AS secondcol;

VALUES (10), (20); -- values may be used anywhere sql expects a table


SELECT * FROM (VALUES (10,20), (0,10)) AS myrow(x,y); 

Scalar subqueries - subqueries that return a single row may be considered as scalar values

From binds below, even though it’s kind of a for loop. [row for row in table] I guess this also reverses order.

Order by expressions. So we coukd have many more ordering constraints than columns for xample

Select distinct on. Returns first row in each group.

agregates bool_and bool_or (forall and exists)

Group by - wadler. Changing type of row entry to bag(row entry)

ALL bag semantics, no all is set semantics

WITH RECURSIVE 
  series(i) as (
    VALUES (0)
    UNION
    SELECT t.i + 1 FROM
      series as t where t.i < 10
  )
 SELECT * FROM series;

WITH RECURSIVE
  root(i,j) AS (
    SELECT foo.i, max(foo.j) 
    FROM (VALUES (1,1), (2,1), (3,3)) AS foo(i,j)
          --UNION 
          --(SELECT i, k FROM root AS (i,j), root as (j1,k) where j = j1))
          )
    SELECT * from root;

SELECT *
  FROM (VALUES (1,1), (2,1), (3,3)) AS foo(i,j);

SELECT (SELECT 42) * 2; -- this works. There is broadcasting of sorts

sql injection https://ctf101.org/web-exploitation/sql-injection/what-is-sql-injection/ everything is foreign keys? Interning

Recursive tables let you do datalog like stuff.

CREATE TABLE edge(a INTEGER, b INTEGER);
INSERT INTO edge(a,b)
VALUES
    (1,2),
    (2,3),
    (3,4);
SELECT a,b FROM edge;

CREATE TABLE path(a INTEGER, b INTEGER);
--INSERT INTO path
--SELECT * FROM edge;

-- path(x,z) :- edge(x,y), path(y,z).
WITH RECURSIVE
  path0(x,y) AS
    -- SELECT 1,2
    (SELECT a,b FROM edge UNION SELECT edge.a, path0.y FROM edge, path0 WHERE path0.x = edge.b )
  INSERT INTO path SELECT x,y FROM path0;
      
SELECT a,b FROM path;
.quit

UF

WITH RECURSIVE 
  parent(x,y) AS
  SELECT a, min(b) (SELECT (a,b) FROM eq UNION eq, parent)

python sqlite3 in stdlib

import sqlite3
con = sqlite3.connect(':memory:')
cur = con.cursor()
# Create table
cur.execute('''CREATE TABLE stocks
               (date text, trans text, symbol text, qty real, price real)''')

# Insert a row of data
cur.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

#cur.executemany("insert into characters(c) values (?)", theIter)
for row in cur.execute('SELECT * FROM stocks ORDER BY price'):
  print(row)

adapters to python types https://en.wikipedia.org/wiki/Materialized_view

sqlite loadable extensions



indices

Building good indices can be important for good query performance.

views

Saved queries that act as virtual tables

triggers

This is interesting

Aggregate functions

Window Functions

Ontology Formats

graph database OWL RDF sparql sparql slides shacl -

semantic web

Knowdlege representation handbook Course https://web.stanford.edu/class/cs227/Lectures/lec02.pdf very similar to bap knoweldge base

Optimal Joins

worst case optimal join algorithm leapfrog triejoin https://github.com/frankmcsherry/blog/blob/master/posts/2018-05-19.md Dovetail join - relational ai unpublished. Julia specific ish? https://relational.ai/blog/dovetail-join use sparsity of all relations to narrow down search Worst case optiomal join Ngo pods 2012 leapfrog triejoin simpel worst case icdt 2015 worst case optimal join for sparql worst case optimal graph joins in almost no space Correlated subqueries: unnesting arbitrary queries How materializr and other databases optimize sql subqueries

genlteish intro to worst case optimal joins

Adopting Worst-Case Optimal Joins in Relational Database Systems tries The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases Persistent Storage of Adaptive Radix Trees in DuckDB

oltp indices 2

umbra spiritual successor to hyper. Hybridizes an in memory system to also work off ssd.

Vectorized Execution

cmu adavanced course lecture Rethinking SIMD Vectorization for In-Memory Databases

masked/selective load masked/selective store scatter gather

selection: branched vs branchless branched checks condition to see if should copy row out branchless writes but only increments index of storage by one if condition is met. I mean. There is a “branch” in this. But I see your point

EmptyHeaded: A Relational Engine for Graph Processing “generalized hypertree decomposition” ? https://github.com/HazyResearch/EmptyHeaded

levelheaded linear algerba stuff?

Multi Version Concurrency Control

https://en.wikipedia.org/wiki/Multiversion_concurrency_control

SQLite

SQLite is an embedded in process database. Has a WASM version It’s a single drop in C file with no dependencies. That means it’s kind of available everywhere It isn’t good for concurrent writers.

Performance tips: WAL mode

sqlite commands that are interesting

  • .help
  • .dump
  • .tables
  • .schema
  • .indexes
  • .expert suggests indices?

Duckdb

https://duckdb.org/ sqlite for olap columnar

import duckdb
con = duckdb.connect(database=':memory:')
import pandas as pd
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
con.execute('SELECT * FROM test_df')
#print(con.fetchall())
print(con.fetchdf())

add_df = pd.DataFrame(columns=["x","y","z"])
print(add_df)
counter = 0
def add(x,y):
  global counter, add_df
  cond = add_df["x"] == x #& add_df["y"] == y
  df = add_df[cond]
  if not df.empty:
    return add_df["z"][0]
  else:
    z = counter
    add_df.append((x,y,z))
    counter += 1
    return z

print(add(-1,-2))

import duckdb
con = duckdb.connect(database=':memory:')
con.execute("CREATE TABLE root (x INTEGER, y INTEGER);")
# "don't use execute many"
con.executemany("INSERT INTO root VALUES (?, ?)", [(1,1),(2,2),(3,3),(1,2),(2,3)])
con.execute("""
SELECT x, max(y)
    FROM root
    GROUP BY x;""")
print(con.fetchall())



#con.execute("""
#UPDATE root a
#  INNER JOIN root b 
#  ON a.y = b.x
#  SET a.y = b.y""")
#print(con.fetchall())

#con.execute("""
#UPDATE root c
#  SET y = max(b.y)
#    FROM root a
#    INNER JOIN root b ON a.x = c.x AND a.y = b.x
#    """)
#print(con.fetchall())

con.execute("""
WITH root2(x1,y1) AS (
  SELECT a.x, max(b.y)
    FROM root a, root b
    WHERE a.y = b.x
    GROUP BY a.x
)
UPDATE root
  SET y = max(b.y)
  FROM root a
  INNER JOIN root b
  ON a.y = b.x
  GROUP BY a.x;
    """)
print(con.fetchall())

con.execute("""
SELECT a.x, max(b.y)
    FROM root a, root b
    WHERE a.y = b.x
    GROUP BY a.x;""")
print(con.fetchall())


catalog multiversion concrruncy control cimpressed execution binder

Postgres

sudo -u postgres psql Very often you need to be the postgres user on the default install

help \h for sql commands \? for

\c connect \dt look at tables

Relational AI

https://www.youtube.com/watch?v=WRHy7M30mM4&ab_channel=CMUDatabaseGroup

snowflake databricks bigquery dbt fivetran

data apps - dapps

lookml sigma legend

Resposnive compilter - matsakis salsa.jl umbra/leanstore

incremental COnvergence of datalog over presmeirings differential dataflor cidr2013 reconciling idfferences 2011 Green F-IVM incrmenetal view mantinance with triple lock fotrization benefits

systemml vecame apache systemds https://systemds.apache.org/

Semantic optimization FAW question asked frequence : Ngo Rudra PODS 2016 What do shannon type ineuqlaities submodular width and disjunctive datalog have to do with one another pods 2017 precise complexity analysis for efficient datalog queries ppdp 2010 functional aggregate queries with additive inequalities convergence of dtalog over pr-esemirign

Relational machine learning Layered aggregate engine for analystics worloads schelich olteanu khamis leanring models over relational data using sparse tenosrs The relational data borg is learning olteanu vldb keynote sturcture aware machine learning over multi relational database relational know graphs as the ofundation for artifical intelligence km-means: fast clustering for relational data https://arxiv.org/abs/1911.06577 Learning Models over Relational Data: A Brief Tutorial

duckdb for sql support calcite postgresql parser

Fortress library traits. OPtimization and parallelism https://relational.ai/blog/categories/research

https://arxiv.org/abs/2004.03716 triangle view mantenance

Streaming

streaming 101 unbounded data

https://en.wikipedia.org/wiki/Stream_processing

lambda architecture - low latency inaccurate, then batch provides accurate

event time vs processing time

Watermark

Flink Apache Beam millwheel spark streaming

https://materialize.com/blog

Data Structures

B Tree

Bw-tree The B-Tree, LSM-Tree, and the Bw-Tree in Between open bw-tree 2018

Radix Trie

Meta Techniques

There are certain good ideas that I don’t even know how to classify really

Timestamps

https://en.wikipedia.org/wiki/Lamport_timestamp logical timestamps

https://en.wikipedia.org/wiki/Logical_clock

Tombstones

https://en.wikipedia.org/wiki/Tombstone_(data_store)

https://docs.datastax.com/en/dse/5.1/dse-arch/datastax_enterprise/dbInternals/archTombstones.html

Rather than deleting immediately, have a table that marks things as deleted. Or a deleted column. Perhaps with deletion time

This goes some ways towards make a persistent data structure. / Maybe you can keep some data read only

https://dba.stackexchange.com/questions/14402/tombstone-table-vs-deleted-flag-in-database-syncronization-soft-delete-scenari

CRDTs

Conflict Free replicated datatypes https://crdt.tech/ martin Kleppmann

CRDT of string - consider fractional positions. Tie breaking. Bad interleaving problem unique identifiers

  • LSeq
  • RGA
  • TreeSeq

https://www.inkandswitch.com/peritext/ crdt rich text

https://github.com/josephg/diamond-types https://josephg.com/blog/crdts-go-brrr/

https://github.com/yjs/yjs

automerge: library of data structures for collab applications in javascript https://mobiuk.org/abstract/S4-P5-Kleppmann-Automerge.pdf local first. use local persistent storage. git for your app’s data. rust implementation?

isabelle crdt I was wrong. CRDTs are the future

Conflict-free Replicated Data Types” “A comprehensive study of Convergent and Commutative Replicated Data Types

Operational Transformation - sequences of insert and delete. Moves possibly.

delta-based vs state-based https://bartoszsypytkowski.com/the-state-of-a-state-based-crdts/

counters

json crdt for vibes patches?

Tree move op. Create delete subtrees.

Synthesizing CRDTs from Sequential Data Types with Verified Lifting https://arxiv.org/abs/2205.12425

Big Data

SLOG: Serializable, Low-latency, Geo-replicated Transactions spanner and calvin

Spark Hadoop MapReduce Dask Flink Storm

Mahout Vowpal Wabbit

hadboop

Giraph

Spark

https://en.wikipedia.org/wiki/Apache_Spark Databricks - company bigdatalog https://www.cis.upenn.edu/~susan/cis700/Papers/BigDataAnalyticsSPARK.pdf https://github.com/ashkapsky/BigDatalog MLlib spark streaming graphx

Message brokrs

RabbitMQ Kafka

Services

BigQuery Snowflake Azure AWS

Graph systems

It isn’t that relational systems can’t express graph problems. But maybe graph systems are more optimized for the problem neo4j Giraph Powergraph graphrex graphx myria graphchi xsteam gridgraph graphlab

SQL

  • create table
  • create index
  • explain query plan I saw explain analyze elsewhere
  • select
  • vacuum - defrag and gabrage collect the db
  • begin transaction

https://github.com/tobymao/sqlglot/blob/main/posts/python_sql_engine.md https://news.ycombinator.com/item?id=34233697 Writing a Python SQL engine from scratch

Resources

Conferences

  • SIGMOD PODS https://sigmod.org/pods-home/ pods uutorials https://sigmod.org/pods-home/pods-tutorials/ Testy of time awards Cool stuff in here.
  • VLDB
  • HYTRADBOI https://www.hytradboi.com/ also very cool stuff.

    Misc

    SQL/DB learning resources

use the index luke sqlbolt = interactive sql tutorial

the art of postgresql a book. select star sql

schemaverse a space battle game written in sql

SQLite: Past, Present, and Future

Datavases, types, and the relational model The third manifesto

how query engines work andy grove

database internals book

database design and implementation

duckdb embedded like sqlite?

https://xtdb.com/

Conjunctive-query containment and constraint satisfaction

Designing Data intensive systems martin kleppmann

scalability but at what cost? big systems vs laptops.

Data integration the relational logic approach

postgres indexes for newbies postgres tutorial raytracer in sql [advent of code sql(https://news.ycombinator.com/item?id=29467671)] sqllancer detecting lgoic bugs in dbms

  • Differential Datalog
  • CRDTs
  • Differential Dataflow
  • Nyberg Accumulators
  • Verkle Trees
  • Cryptrees
  • Byzantine Eventual Consistency
  • Self-renewable hash chains
  • Binary pebbling

https://github.com/dbuenzli/rel

Ezra Cooper. The Script-Writer’s Dream: How to Write Great SQL in Your Own Language, and Be Sure It Will Succeed. 2009. Full text

James Cheney et al. A practical theory of language-integrated query. 2013. Full text

Suzuki et al. Finally, safely-extensible and efficient language-integrated query. 2016. Full text

Oleg Kiselyov et al. Sound and Efficient Language-Integrated Query – Maintaining the ORDER. 2017. Full text

DBSP: Automatic Incremental View Maintenance for Rich Query Languages - mcsherry et al

pavlo advanced databases

awesome database learning

database architects blogs

https://www.reddit.com/r/databasedevelopment/

https://twitter.com/phil_eaton

database internals

Ask HN: What could a modern database do that PostgreSQL and MySQL can’t

postgres internals book

Sqlite virtual tables osquery osquery https://github.com/frabert/ClangQL qerying C++ databases advanced sql course

roaring bitmaps https://vikramoberoi.com/a-primer-on-roaring-bitmaps-what-they-are-and-how-they-work/ Switches out storage method and different scales and density.

nocodb It’s like a spreadsheet that attaches to dbs. Open source airtable?

Does sql need help

Views

postgres

sudo -u postgres psql