DuckDB Essentials: Getting Started
DuckDB is an embedded SQL OLAP database management system designed for analytical workloads. Key features include:
Core Characteristics
Embedded Architecture: Runs within your application process, eliminating inter-process communication overhead
High Performance: Optimized for complex queries on large datasets
Lightweight: Minimal memory footprint ideal for resource-constrained environments
Cross-Platform: Supports Windows, macOS, Linux, Android
Full SQL Support: Aggregations, window functions, joins, and UDFs
Installation Methods
Source Compilation:
# Install build dependencies (RHEL/CentOS; git is needed for the clone step)
yum -y install gcc gcc-c++ make cmake git
# Clone repository
git clone https://github.com/duckdb/duckdb.git
# Build
cd duckdb
make -j8
Binary Installation:
wget https://github.com/duckdb/duckdb/releases/download/v0.9.2/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
./duckdb
Data Import Techniques
CSV Import
-- Auto-detect schema
SELECT * FROM read_csv_auto('test.csv');
-- Load into an existing table (test_csv must already be created); the CSV dialect is auto-detected
COPY test_csv FROM 'test.csv' (AUTO_DETECT true);
Parquet Integration
# Python conversion (requires pandas plus a Parquet engine such as pyarrow)
import pandas as pd
df = pd.read_csv('test.csv')
df.to_parquet('test.parquet')
-- Query Parquet
SELECT * FROM read_parquet('test.parquet');
JSON Handling
-- Auto-detect the schema and the file layout
SELECT * FROM read_json_auto('test.json');
-- Force the 'unstructured' layout, e.g. for pretty-printed JSON that is not newline-delimited
SELECT * FROM read_json_auto('test.json', format='unstructured');
SQL Operations & Extensions
Basic Queries
CREATE TABLE employees (
first_name VARCHAR,
last_name VARCHAR,
age INT
);
INSERT INTO employees VALUES
('Zhang', 'San', 57),
('Li', 'Si', 48);
SELECT * FROM employees;
Extensions
-- Install HTTP/S3 extension
INSTALL httpfs;
LOAD httpfs;
-- Query remote data
SELECT * FROM 'http://example.com/data.csv';
Python API
import duckdb
con = duckdb.connect()
con.sql("SELECT * FROM 'test.csv'").show()
Export & Management
-- Export entire database
EXPORT DATABASE 'my_backup';
-- Attach existing database
ATTACH 'production.db';
SHOW DATABASES;
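Performance Note: On analytical workloads, DuckDB's columnar, vectorized execution engine can process complex aggregations roughly 3-5x faster than traditional row-based databases; the actual speedup depends on data volume, query shape, and hardware.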