# hansardR

A comprehensive R toolkit for processing, validating, and analysing Australian Parliamentary Hansard CSV data. Designed for computational social science research with robust data handling, validation, and structured database storage.
## Features
- 🗃️ Database Creation: Structured SQLite storage with optimised indexes and triggers
- ✅ File Validation: Comprehensive CSV structure and integrity checking
- 📊 Batch Processing: Efficient import of large datasets with progress tracking
- 🔍 Advanced Querying: dplyr-compatible database interface for analysis
- 🔎 Full-Text Search: SQLite FTS5 powered content search with highlighting
- 📈 Built-in Analytics: Speaker statistics, temporal analysis, and content metrics
- ⚡ Performance Optimized: Strategic indexing and database triggers for speed
- 🛠️ Modular Design: Separate validation, processing, and analysis workflows
- 📚 Rich Documentation: Comprehensive vignettes and examples
## Installation

### Development Version (Recommended)

Install the latest development version from GitHub:

```r
pak::pak("Australian-Parliamentary-Speech/hansardR")
```
### System Requirements

- **R Version**: Requires R ≥ 4.1.0 (for the native pipe operator `|>`)
- **Dependencies**: The package will automatically install required dependencies:
  - `DBI`, `RSQLite` - Database interface
  - `dplyr`, `readr`, `purrr`, `stringr`, `tibble` - Data manipulation
  - `progress` (optional) - Progress bars for batch operations
Quick Start
library(hansardR)
# Create database
con <- create_hansard_database("hansard.db")
# Import single file
import_hansard_file("2025-02-04_edit_step7.csv", con)
# Get table references for analysis
tbls <- get_hansard_tables(con)
# Analyse with dplyr
top_speakers <- tbls$speeches |>
left_join(tbls$members, by = "member_id") |>
count(full_name, party, sort = TRUE) |>
collect()
print(top_speakers)
## Sample Data

The package includes sample data for testing and learning:

```r
# Explore included sample data
hansard_sample_info()

# Use sample data in examples
sample_path <- hansard_sample_data()
con <- create_hansard_database(tempfile(fileext = ".db"))
import_hansard_year(file.path(sample_path, "sample_2025"), con)
```
## Comprehensive Workflow

For detailed usage instructions, see the complete workflow vignette:

```r
# View the comprehensive workflow guide
vignette("hansard-workflow", package = "hansardR")

# Or browse online
browseVignettes("hansardR")
```
## Key Functions

| Function | Purpose |
|---|---|
| `create_hansard_database()` | Create structured SQLite database with optimisations |
| `optimize_hansard_database()` | Apply performance optimisations to an existing database |
| `validate_csv_structure()` | Check file integrity before processing |
| `import_hansard_file()` | Import a single CSV file |
| `import_hansard_batch()` | Import multiple files with progress tracking |
| `import_hansard_year()` | Import all files from a year directory |
| `get_hansard_tables()` | Get dplyr table references for analysis |
| `get_top_speakers()` | Built-in speaker activity analysis |
| `search_speech_content()` | Full-text search of speech content with highlighting |
| `search_speeches_advanced()` | Advanced search combining content and metadata filters |
| `get_content_statistics()` | Content analysis and search capability statistics |
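The validation and import functions above are designed to be chained: check a file first, import only if it passes. The sketch below assumes `validate_csv_structure()` returns an object with a logical `valid` field; the actual return structure may differ, so consult `?validate_csv_structure` before relying on it.

```r
library(hansardR)

con <- create_hansard_database("hansard.db")

# Validate before importing; the `$valid` field is an assumption here --
# check the function's help page for the real return structure
check <- validate_csv_structure("2025-02-04_edit_step7.csv")

if (isTRUE(check$valid)) {
  import_hansard_file("2025-02-04_edit_step7.csv", con)
} else {
  print(check)  # inspect the reported problems before importing
}

DBI::dbDisconnect(con)
```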
## Data Structure

The package creates a normalised database schema:

- **sessions**: Parliamentary sitting days
- **members**: MPs with party affiliations and electorates
- **debates**: Major topics discussed in each session
- **speeches**: Individual contributions, questions, answers, and interjections
## Database Schema

The package creates a normalised relational database optimised for parliamentary data analysis:

```mermaid
erDiagram
    SESSIONS {
        int session_id PK
        date session_date UK
        int year "Generated"
        int chamber_type
        text source_file
        text file_hash
        datetime created_at
    }
    MEMBERS {
        int member_id PK
        text name_id UK
        text full_name
        text electorate
        text party
        text role
        text status "active/inactive/unknown"
        date first_seen_date
        date last_seen_date
        date first_parliament_date
        date last_parliament_date
        datetime created_at
    }
    DEBATES {
        int debate_id PK
        int session_id FK
        int parent_debate_id FK "References debate_id for hierarchy"
        text debate_title
        int debate_order
        int debate_level "Hierarchy level"
        datetime created_at
    }
    SPEECHES {
        int speech_id PK
        int session_id FK
        int debate_id FK
        int member_id FK
        int speaker_no
        time speech_time
        real page_no
        text content
        text subdebate_info
        text xml_path
        bool is_question
        bool is_answer
        bool is_interjection
        bool is_speech
        bool is_stage_direction
        int content_length "Generated"
        datetime created_at
    }
    SPEECHES_FTS {
        int rowid FK "Links to speech_id"
        text content "Full-text indexed"
    }
    SESSIONS ||--o{ DEBATES : "has"
    SESSIONS ||--o{ SPEECHES : "contains"
    MEMBERS ||--o{ SPEECHES : "gives"
    DEBATES ||--o{ SPEECHES : "includes"
    SPEECHES ||--|| SPEECHES_FTS : "indexed_by"
```
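Because the schema is plain SQLite, you can also bypass the dplyr interface and query it directly with `DBI`. This sketch uses only table and column names shown in the diagram; it assumes an existing `hansard.db` created by the package.

```r
library(DBI)

con <- DBI::dbConnect(RSQLite::SQLite(), "hansard.db")

# Speeches per member per year, joining the tables from the diagram
per_year <- DBI::dbGetQuery(con, "
  SELECT m.full_name, s.year, COUNT(*) AS n_speeches
  FROM speeches sp
  JOIN sessions s ON sp.session_id = s.session_id
  JOIN members  m ON sp.member_id  = m.member_id
  GROUP BY m.full_name, s.year
  ORDER BY n_speeches DESC
  LIMIT 10
")

DBI::dbDisconnect(con)
```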
## Data Processing Workflow

```mermaid
flowchart TD
    A[CSV Files<br/>2025-02-04_edit_step7.csv] --> B{File Validation}
    B -->|Valid| C[Load & Clean Data]
    B -->|Invalid| D[Report Issues]
    C --> E[Extract Session Info<br/>Date, Chamber, Source]
    C --> F[Extract Members<br/>Name, Party, Electorate]
    C --> G[Extract Debates<br/>Topics, Order]
    C --> H[Extract Speeches<br/>Content, Flags, Metadata]
    E --> I[(Sessions Table)]
    F --> J[(Members Table)]
    G --> K[(Debates Table)]
    H --> L[(Speeches Table)]
    I --> M[Analysis Ready Database]
    J --> M
    K --> M
    L --> M
    M --> N[dplyr Queries]
    M --> O[SQL Queries]
    M --> P[Text Mining]
    N --> Q[Speaker Statistics]
    O --> R[Temporal Analysis]
    P --> S[Content Analysis]
    style A fill:#e1f5fe
    style M fill:#f3e5f5
    style Q fill:#e8f5e8
    style R fill:#e8f5e8
    style S fill:#e8f5e8
```
## Example Analyses

### Speaker Activity Analysis

```r
# Most active speakers
top_speakers <- get_top_speakers(con, limit = 10)

# Questions by party over time
party_questions <- tbls$speeches |>
  filter(is_question == 1) |>
  left_join(tbls$members, by = "member_id") |>
  left_join(tbls$sessions, by = "session_id") |>
  count(party, year, sort = TRUE) |>
  collect()
```

### Temporal Trends

```r
# Parliamentary activity over time
monthly_activity <- tbls$speeches |>
  left_join(tbls$sessions, by = "session_id") |>
  mutate(month = substr(session_date, 1, 7)) |>
  count(month, sort = TRUE) |>
  collect()
```

### Full-Text Search

```r
# Search for climate change mentions
climate_speeches <- search_speech_content(con, "climate change", limit = 50)

# Advanced search: ALP members discussing renewable energy since 2024
renewable_search <- search_speeches_advanced(
  con,
  content_query = "renewable energy",
  party = "ALP",
  date_from = "2024-01-01",
  limit = 25
)

# Get search statistics
search_stats <- get_content_statistics(con)
print(search_stats$basic)
```
## File Format

The package expects CSV files with the Australian Parliamentary Hansard structure:

- **Filename format**: `YYYY-MM-DD_edit_step7.csv`
- **Required columns**: `question_flag`, `answer_flag`, `speech_flag`, `name`, `name.id`, `party`, `content`, etc.
- **Encoding**: UTF-8
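For experimenting without real Hansard exports, you can construct a file with the documented columns yourself. Note this is an illustrative minimum only: real step-7 exports contain additional columns ("etc." above) that this sketch does not include, so it may not pass full validation.

```r
library(readr)
library(tibble)

# Minimal file with only the documented required columns; real exports
# carry additional columns not listed in this README
demo <- tibble(
  question_flag = c(1, 0),
  answer_flag   = c(0, 1),
  speech_flag   = c(0, 0),
  name          = c("Member A", "Minister B"),
  name.id       = c("ID001", "ID002"),
  party         = c("ALP", "LP"),
  content       = c("My question is...", "I thank the member...")
)

# Filename must follow the YYYY-MM-DD_edit_step7.csv pattern
write_csv(demo, "2025-02-04_edit_step7.csv")
```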
## Performance

Designed for large-scale parliamentary data with enterprise-level optimisations:
- ✅ Handles 100+ years of parliamentary data (1901-2025)
- ✅ Efficient batch processing with transaction management
- ✅ Strategic database indexing with 15+ optimized indexes
- ✅ Full-text search with SQLite FTS5 and BM25 ranking
- ✅ Automatic database triggers for data validation and maintenance
- ✅ Memory-efficient streaming for large files
- ✅ Progress tracking for long-running operations
- ✅ Database optimization functions for existing installations
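A bulk-load workflow combining these features might look like the following. The exact arguments of `import_hansard_batch()` are an assumption here (the README only names the function), so treat this as a sketch and check `?import_hansard_batch`.

```r
library(hansardR)

con <- create_hansard_database("hansard.db")

# Gather a directory of sitting-day files; the argument order passed to
# import_hansard_batch() is illustrative, not confirmed by this README
files <- list.files("data/2025", pattern = "_edit_step7\\.csv$", full.names = TRUE)
import_hansard_batch(files, con)

# Re-run the optimiser after a large bulk load
optimize_hansard_database(con)

DBI::dbDisconnect(con)
```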
## Contributing

We welcome contributions! Please see our Contributing Guidelines for details.
## Related Projects
- Australian Parliamentary Speech Project - Broader research initiative
- ParlSpeech - Comparative parliamentary speech data
- quanteda - Text analysis framework
## License

GPL (>= 3) - see the LICENSE file for details.
## Support
- 📖 Package Documentation
- 🐛 Report Issues
- 💬 Discussions
- 📧 Contact: kbenoit@smu.edu.sg
*Developed by the Australian Parliamentary Speech Project*