Introduction
When working with large brain connectome datasets, storing all
matrices in memory as R lists can be inefficient. The
riemtan package now supports the Apache Arrow Parquet format
for efficient storage and lazy loading of SPD (symmetric positive definite) matrices.
This vignette demonstrates how to:
- Export connectomes to Parquet format
- Load connectomes using the Parquet backend
- Work with CSample and CSuperSample using Parquet storage
Benefits of Parquet Storage
- Memory Efficiency: Matrices are loaded on-demand rather than all at once
- Fast Access: Columnar storage format optimized for numerical data
- Metadata Support: Store subject IDs, provenance, and other metadata
- Interoperability: Standard format compatible with Python, Julia, and other tools
Basic Workflow
1. Writing Connectomes to Parquet
First, let’s create some example SPD matrices and save them to Parquet format:
library(riemtan)
library(Matrix)
# Create example connectomes (4x4 SPD matrices)
set.seed(42)
connectomes <- lapply(1:50, function(i) {
mat <- diag(4) + matrix(rnorm(16, 0, 0.1), 4, 4)
mat <- (mat + t(mat)) / 2 # Make symmetric
mat <- mat + diag(4) * 0.5 # Shift the diagonal to ensure positive definiteness
Matrix::pack(Matrix::Matrix(mat, sparse = FALSE))
})
# Write to Parquet format
write_connectomes_to_parquet(
connectomes,
output_dir = "my_connectomes",
subject_ids = paste0("subject_", 1:50),
provenance = list(
study = "Example Study",
acquisition_date = "2024-01-01",
preprocessing = "Standard pipeline v1.0"
)
)

This creates a directory structure:
my_connectomes/
├── metadata.json
├── matrix_0001.parquet
├── matrix_0002.parquet
├── ...
└── matrix_0050.parquet
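Because the export is standard Parquet files plus a JSON sidecar, the directory can be inspected with generic tools. The sketch below is hedged: it assumes the arrow and jsonlite packages are installed, and the per-file column schema is an internal detail of riemtan, so only the shape of the data is examined.

```r
library(jsonlite)  # assumed installed; used only for this inspection
library(arrow)     # assumed installed

# Read the sidecar metadata written alongside the matrices
meta <- jsonlite::fromJSON("my_connectomes/metadata.json")
str(meta)  # subject IDs, provenance, etc.

# Open one matrix file directly; the column layout is riemtan-internal,
# so we only check its dimensions here
tbl <- arrow::read_parquet("my_connectomes/matrix_0001.parquet")
dim(tbl)
```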
2. Validating Parquet Directory
Before using a Parquet directory, you can validate its structure:
# Detailed validation with verbose output
validate_parquet_directory("my_connectomes", verbose = TRUE)

3. Creating a CSample with Parquet Backend
Now use the Parquet backend to create a CSample:
# Load AIRM metric
data(airm)
# Create Parquet backend (default cache size: 10 matrices)
backend <- create_parquet_backend(
"my_connectomes",
cache_size = 10
)
# Create CSample with the backend
sample <- CSample$new(
backend = backend,
metric_obj = airm
)
# Sample info
print(paste("Sample size:", sample$sample_size))
print(paste("Matrix dimension:", sample$matrix_size))

4. Computing with Parquet-backed CSample
All standard CSample operations work transparently with the Parquet backend:
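As a sketch, the tangent-space pipeline shown elsewhere in this vignette runs unchanged on the Parquet-backed sample created above; matrices are pulled from disk (and cached) only when a computation touches them:

```r
# Standard pipeline on the Parquet-backed sample
sample$compute_tangents()  # tangent images at the reference point
sample$compute_vecs()      # vectorize the tangent images
sample$compute_fmean()     # Frechet mean under the AIRM metric
```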
5. Lazy Loading Behavior
The Parquet backend loads matrices on-demand:
# Access specific connectome (loads from disk if not cached)
conn_1 <- sample$connectomes[[1]]
# Access all connectomes (loads all from disk)
all_conns <- sample$connectomes
# Cache management
backend$get_cache_size() # Check current cache usage
backend$clear_cache() # Clear cache to free memoryWorking with CSuperSample
CSuperSample works seamlessly with Parquet-backed samples:
# Create two samples with different backends
backend1 <- create_parquet_backend("study1_connectomes")
backend2 <- create_parquet_backend("study2_connectomes")
sample1 <- CSample$new(backend = backend1, metric_obj = airm)
sample2 <- CSample$new(backend = backend2, metric_obj = airm)
# Create super sample
super_sample <- CSuperSample$new(list(sample1, sample2))
# Gather all connectomes
super_sample$gather()
# Compute statistics
super_sample$compute_fmean()
super_sample$compute_variation()

Backwards Compatibility
The Parquet backend is fully compatible with the traditional list-based approach:
# Traditional approach (all in memory)
sample_memory <- CSample$new(
conns = connectomes,
metric_obj = airm
)
# Parquet approach (lazy loading)
backend <- create_parquet_backend("my_connectomes")
sample_parquet <- CSample$new(
backend = backend,
metric_obj = airm
)
# Both work identically
sample_memory$compute_fmean()
sample_parquet$compute_fmean()

Performance Tips
Cache Size Tuning
Adjust cache size based on available memory:
# Small cache (memory-constrained environments)
backend_small <- ParquetBackend$new("my_connectomes", cache_size = 5)
# Large cache (memory-rich environments)
backend_large <- ParquetBackend$new("my_connectomes", cache_size = 50)

Parallel Processing with Parquet
New in v0.2.5: riemtan now supports cross-platform parallel processing that works seamlessly with the Parquet backend.
Enabling Parallel Processing
Configure parallel processing using the futureverse framework:
library(riemtan)
# Enable parallel processing (works on all platforms including Windows!)
set_parallel_plan("multisession", workers = 4)
# Check status
is_parallel_enabled() # TRUE
get_n_workers() # 4
# Create Parquet-backed sample
backend <- create_parquet_backend("large_dataset", cache_size = 20)
sample <- CSample$new(backend = backend, metric_obj = airm)

Parallel I/O and Computation
All major operations automatically use parallel processing when beneficial:
# Parallel tangent computations with progress bar
sample$compute_tangents(progress = TRUE) # 3-8x faster
# Parallel vectorization
sample$compute_vecs(progress = TRUE) # 2-4x speedup
# Parallel Frechet mean computation
sample$compute_fmean(progress = TRUE) # 2-5x faster for large samples

Batch Loading with Parallel I/O
For very large Parquet datasets, use batch loading with parallel I/O and memory management:
# Load specific subset in batches
subset_conns <- sample$load_connectomes_batched(
indices = 1:500, # Load first 500 matrices
batch_size = 50, # 50 matrices per batch
progress = TRUE # Show progress
)
# This loads 500 matrices in 10 batches, clearing cache between batches
# Each batch is loaded in parallel for 5-10x speedup

Performance Comparison
# Sequential (default if not configured)
set_parallel_plan("sequential")
system.time(sample$compute_tangents()) # Baseline
# Parallel with 4 workers
set_parallel_plan("multisession", workers = 4)
system.time(sample$compute_tangents()) # 3-4x faster
# Parallel with 8 workers
set_parallel_plan("multisession", workers = 8)
system.time(sample$compute_tangents()) # 6-8x faster

Progress Reporting
Enable progress bars to monitor long-running operations:
# Install progressr for progress bars (optional)
install.packages("progressr")
# Enable parallel processing
set_parallel_plan("multisession", workers = 4)
# All operations support progress parameter
sample$compute_tangents(progress = TRUE)
sample$compute_vecs(progress = TRUE)
sample$compute_fmean(progress = TRUE)
# Batch loading with progress
conns <- sample$load_connectomes_batched(
indices = 1:1000,
batch_size = 100,
progress = TRUE # Shows "Batch 1/10: loading matrices 1-100"
)

Best Practices
1. Choose appropriate worker count:
# Conservative (leave cores for system)
set_parallel_plan("multisession", workers = parallel::detectCores() - 1)
# Maximum performance (use all cores)
set_parallel_plan("multisession", workers = parallel::detectCores())

2. Memory management with parallelization:
# Each worker maintains its own cache of matrices
# With 4 workers, cache_size = 20, and 200x200 double matrices (~320 KB each):
# Memory ≈ 4 workers × 20 matrices × 320 KB ≈ 25 MB
# Use smaller cache with more workers
backend <- create_parquet_backend("dataset", cache_size = 10)
set_parallel_plan("multisession", workers = 8)

3. Reset when done:
# Compute with parallelization
set_parallel_plan("multisession", workers = 4)
sample$compute_tangents(progress = TRUE)
sample$compute_fmean(progress = TRUE)
# Reset to free worker resources
reset_parallel_plan()

Auto-Detection
Parallel processing is automatically disabled for small datasets to avoid overhead:
# Small dataset (n < 10): Uses sequential processing automatically
small_sample <- CSample$new(conns = small_connectomes[1:5], metric_obj = airm)
small_sample$compute_tangents() # Sequential (no overhead)
# Large dataset (n >= 10): Uses parallel processing automatically
large_sample <- CSample$new(backend = backend, metric_obj = airm)
large_sample$compute_tangents() # Parallel (if a plan is active)

Parallel Strategies
Different strategies for different environments:
# multisession: Works on all platforms (Windows, Mac, Linux)
set_parallel_plan("multisession", workers = 4)
# multicore: Unix only, lower overhead
set_parallel_plan("multicore", workers = 4) # Auto-fallback to multisession on Windows
# cluster: For remote/distributed computing
set_parallel_plan("cluster", workers = c("node1", "node2"))

Advanced: Custom File Patterns
Customize the Parquet file naming pattern:
write_connectomes_to_parquet(
connectomes,
output_dir = "custom_naming",
file_pattern = "conn_%03d.parquet"
)
# Files will be named: conn_001.parquet, conn_002.parquet, ...

Troubleshooting
Large Datasets
For very large datasets (1000+ matrices):
# Use minimal cache
backend <- ParquetBackend$new("large_dataset", cache_size = 3)
# Compute statistics without loading all matrices at once
sample <- CSample$new(backend = backend, metric_obj = airm)
sample$compute_tangents()
sample$compute_vecs()
# Operations that don't need all matrices in memory
sample$compute_fmean(batch_size = 32) # Uses batching

Summary
The Parquet backend provides:
- Scalability: Handle datasets too large for memory
- Efficiency: Lazy loading minimizes memory footprint
- Flexibility: Fully compatible with existing riemtan workflows
- Metadata: Rich metadata support for reproducible research
For small to medium datasets that fit in memory, the traditional list approach remains perfectly valid. For large brain connectome studies, the Parquet backend enables analysis at scale.