# hansardR

A comprehensive R toolkit for processing, validating, and analysing Australian Parliamentary Hansard CSV data. Designed for computational social science research with robust data handling, validation, and structured database storage.
## Features
- 🗃️ Database Creation: Structured SQLite storage with optimised indexes and triggers
- ✅ File Validation: Comprehensive CSV structure and integrity checking
- 📊 Batch Processing: Efficient import of large datasets with progress tracking
- 🔍 Advanced Querying: dplyr-compatible database interface for analysis
- 🔎 Full-Text Search: SQLite FTS5 powered content search with highlighting
- 📈 Built-in Analytics: Speaker statistics, temporal analysis, and content metrics
- ⚡ Performance Optimized: Strategic indexing and database triggers for speed
- 🛠️ Modular Design: Separate validation, processing, and analysis workflows
- 📚 Rich Documentation: Comprehensive vignettes and examples
## Installation

### Development Version (Recommended)

Install the latest development version from GitHub:

```r
pak::pak("Australian-Parliamentary-Speech/hansardR")
```
### System Requirements

- **R Version**: Requires R ≥ 4.1.0 (for the native pipe operator `|>`)
- **Dependencies**: The package will automatically install required dependencies:
  - `DBI`, `RSQLite` - Database interface
  - `dplyr`, `readr`, `purrr`, `stringr`, `tibble` - Data manipulation
  - `progress` (optional) - Progress bars for batch operations
Quick Start
library(hansardR)
# Create database
con <- create_hansard_database("hansard.db")
# Import single file
import_hansard_file("2025-02-04_edit_step7.csv", con)
# Get table references for analysis
tbls <- get_hansard_tables(con)
# Analyse with dplyr
top_speakers <- tbls$speeches |>
left_join(tbls$members, by = "member_id") |>
count(full_name, party, sort = TRUE) |>
collect()
print(top_speakers)
## Sample Data

The package includes sample data for testing and learning:

```r
# Explore included sample data
hansard_sample_info()

# Use sample data in examples
sample_path <- hansard_sample_data()
con <- create_hansard_database(tempfile(fileext = ".db"))
import_hansard_year(file.path(sample_path, "sample_2025"), con)
```
## Comprehensive Workflow

For detailed usage instructions, see the complete workflow vignette:

```r
# View the comprehensive workflow guide
vignette("hansard-workflow", package = "hansardR")

# Or browse online
browseVignettes("hansardR")
```
## Key Functions

| Function | Purpose |
|---|---|
| `create_hansard_database()` | Create structured SQLite database with optimisations |
| `optimize_hansard_database()` | Apply performance optimisations to an existing database |
| `validate_csv_structure()` | Check file integrity before processing |
| `import_hansard_file()` | Import a single CSV file |
| `import_hansard_batch()` | Import multiple files with progress tracking |
| `import_hansard_year()` | Import all files from a year directory |
| `get_hansard_tables()` | Get dplyr table references for analysis |
| `get_top_speakers()` | Built-in speaker activity analysis |
| `search_speech_content()` | Full-text search of speech content with highlighting |
| `search_speeches_advanced()` | Advanced search combining content and metadata filters |
| `get_content_statistics()` | Content analysis and search capability statistics |
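The validation and import functions above are designed to be chained: check a file first, import only if it passes. The sketch below assumes `validate_csv_structure()` returns an object with a logical `valid` field; the actual return structure may differ, so consult `?validate_csv_structure` before relying on it.

```r
library(hansardR)

con <- create_hansard_database("hansard.db")

# Validate before importing; the `$valid` field is an assumption here --
# check the function's help page for the real return structure
check <- validate_csv_structure("2025-02-04_edit_step7.csv")

if (isTRUE(check$valid)) {
  import_hansard_file("2025-02-04_edit_step7.csv", con)
} else {
  print(check)  # inspect the reported problems before importing
}

DBI::dbDisconnect(con)
```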
## Data Structure

The package creates a normalised database schema:

- **sessions**: Parliamentary sitting days
- **members**: MPs with party affiliations and electorates
- **debates**: Major topics discussed in each session
- **speeches**: Individual contributions, questions, answers, and interjections
## Database Schema

The package creates a normalised relational database optimised for parliamentary data analysis:

```mermaid
erDiagram
    SESSIONS {
        int session_id PK
        date session_date UK
        int year "Generated"
        int chamber_type
        text source_file
        text file_hash
        datetime created_at
    }
    MEMBERS {
        int member_id PK
        text name_id UK
        text full_name
        text electorate
        text party
        text role
        text status "active/inactive/unknown"
        date first_seen_date
        date last_seen_date
        date first_parliament_date
        date last_parliament_date
        datetime created_at
    }
    DEBATES {
        int debate_id PK
        int session_id FK
        int parent_debate_id FK "References debate_id for hierarchy"
        text debate_title
        int debate_order
        int debate_level "Hierarchy level"
        datetime created_at
    }
    SPEECHES {
        int speech_id PK
        int session_id FK
        int debate_id FK
        int member_id FK
        int speaker_no
        time speech_time
        real page_no
        text content
        text subdebate_info
        text xml_path
        bool is_question
        bool is_answer
        bool is_interjection
        bool is_speech
        bool is_stage_direction
        int content_length "Generated"
        datetime created_at
    }
    SPEECHES_FTS {
        int rowid FK "Links to speech_id"
        text content "Full-text indexed"
    }
    SESSIONS ||--o{ DEBATES : "has"
    SESSIONS ||--o{ SPEECHES : "contains"
    MEMBERS ||--o{ SPEECHES : "gives"
    DEBATES ||--o{ SPEECHES : "includes"
    SPEECHES ||--|| SPEECHES_FTS : "indexed_by"
```
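Because the schema is plain SQLite, you can also bypass the dplyr interface and query it directly with `DBI`. This sketch uses only table and column names shown in the diagram; it assumes an existing `hansard.db` created by the package.

```r
library(DBI)

con <- DBI::dbConnect(RSQLite::SQLite(), "hansard.db")

# Speeches per member per year, joining the tables from the diagram
per_year <- DBI::dbGetQuery(con, "
  SELECT m.full_name, s.year, COUNT(*) AS n_speeches
  FROM speeches sp
  JOIN sessions s ON sp.session_id = s.session_id
  JOIN members  m ON sp.member_id  = m.member_id
  GROUP BY m.full_name, s.year
  ORDER BY n_speeches DESC
  LIMIT 10
")

DBI::dbDisconnect(con)
```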
## Data Processing Workflow

```mermaid
flowchart TD
    A[CSV Files<br/>2025-02-04_edit_step7.csv] --> B{File Validation}
    B -->|Valid| C[Load & Clean Data]
    B -->|Invalid| D[Report Issues]
    C --> E[Extract Session Info<br/>Date, Chamber, Source]
    C --> F[Extract Members<br/>Name, Party, Electorate]
    C --> G[Extract Debates<br/>Topics, Order]
    C --> H[Extract Speeches<br/>Content, Flags, Metadata]
    E --> I[(Sessions Table)]
    F --> J[(Members Table)]
    G --> K[(Debates Table)]
    H --> L[(Speeches Table)]
    I --> M[Analysis Ready Database]
    J --> M
    K --> M
    L --> M
    M --> N[dplyr Queries]
    M --> O[SQL Queries]
    M --> P[Text Mining]
    N --> Q[Speaker Statistics]
    O --> R[Temporal Analysis]
    P --> S[Content Analysis]
    style A fill:#e1f5fe
    style M fill:#f3e5f5
    style Q fill:#e8f5e8
    style R fill:#e8f5e8
    style S fill:#e8f5e8
```
## Example Analyses

### Speaker Activity Analysis

```r
# Most active speakers
top_speakers <- get_top_speakers(con, limit = 10)

# Questions by party over time
party_questions <- tbls$speeches |>
  filter(is_question == 1) |>
  left_join(tbls$members, by = "member_id") |>
  left_join(tbls$sessions, by = "session_id") |>
  count(party, year, sort = TRUE) |>
  collect()
```

### Temporal Trends

```r
# Parliamentary activity over time
monthly_activity <- tbls$speeches |>
  left_join(tbls$sessions, by = "session_id") |>
  mutate(month = substr(session_date, 1, 7)) |>
  count(month, sort = TRUE) |>
  collect()
```

### Full-Text Search

```r
# Search for climate change mentions
climate_speeches <- search_speech_content(con, "climate change", limit = 50)

# Advanced search: ALP members discussing renewable energy since 2024
renewable_search <- search_speeches_advanced(
  con,
  content_query = "renewable energy",
  party = "ALP",
  date_from = "2024-01-01",
  limit = 25
)

# Get search statistics
search_stats <- get_content_statistics(con)
print(search_stats$basic)
```
## File Format

The package expects CSV files with the Australian Parliamentary Hansard structure:

- **Filename format**: `YYYY-MM-DD_edit_step7.csv`
- **Required columns**: `question_flag`, `answer_flag`, `speech_flag`, `name`, `name.id`, `party`, `content`, etc.
- **Encoding**: UTF-8
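For experimenting without real Hansard exports, you can construct a file with the documented columns yourself. Note this is an illustrative minimum only: real step-7 exports contain additional columns ("etc." above) that this sketch does not include, so it may not pass full validation.

```r
library(readr)
library(tibble)

# Minimal file with only the documented required columns; real exports
# carry additional columns not listed in this README
demo <- tibble(
  question_flag = c(1, 0),
  answer_flag   = c(0, 1),
  speech_flag   = c(0, 0),
  name          = c("Member A", "Minister B"),
  name.id       = c("ID001", "ID002"),
  party         = c("ALP", "LP"),
  content       = c("My question is...", "I thank the member...")
)

# Filename must follow the YYYY-MM-DD_edit_step7.csv pattern
write_csv(demo, "2025-02-04_edit_step7.csv")
```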
## Performance

Designed for large-scale parliamentary data with enterprise-level optimisations:
- ✅ Handles 100+ years of parliamentary data (1901-2025)
- ✅ Efficient batch processing with transaction management
- ✅ Strategic database indexing with 15+ optimized indexes
- ✅ Full-text search with SQLite FTS5 and BM25 ranking
- ✅ Automatic database triggers for data validation and maintenance
- ✅ Memory-efficient streaming for large files
- ✅ Progress tracking for long-running operations
- ✅ Database optimization functions for existing installations
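A bulk-load workflow combining these features might look like the following. The exact arguments of `import_hansard_batch()` are an assumption here (the README only names the function), so treat this as a sketch and check `?import_hansard_batch`.

```r
library(hansardR)

con <- create_hansard_database("hansard.db")

# Gather a directory of sitting-day files; the argument order passed to
# import_hansard_batch() is illustrative, not confirmed by this README
files <- list.files("data/2025", pattern = "_edit_step7\\.csv$", full.names = TRUE)
import_hansard_batch(files, con)

# Re-run the optimiser after a large bulk load
optimize_hansard_database(con)

DBI::dbDisconnect(con)
```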
## Contributing

We welcome contributions! Please see our Contributing Guidelines for details.
## Related Projects
- Australian Parliamentary Speech Project - Broader research initiative
- ParlSpeech - Comparative parliamentary speech data
- quanteda - Text analysis framework
## License

GPL (>= 3) - see the LICENSE file for details.
## Support
- 📖 Package Documentation
- 🐛 Report Issues
- 💬 Discussions
- 📧 Contact: kbenoit@smu.edu.sg
*Developed by the Australian Parliamentary Speech Project*