cosinorage.datahandlers Module

Module Contents

This module provides functionality to load accelerometer data or minute-level ENMO data from CSV files and to process it into a DataFrame of minute-level ENMO data.

Classes

class DataHandler

Bases: object

A base class for data handlers that process and store ENMO data at the minute level.

This class provides a common interface for data handlers with methods to load data, retrieve processed ENMO values, and save data. The load_data and save_data methods are intended to be overridden by subclasses.

raw_data

Raw accelerometer data loaded from the source.

Type:

pd.DataFrame or None

sf_data

Filtered and processed accelerometer data.

Type:

pd.DataFrame or None

ml_data

Minute-level ENMO data calculated from processed data.

Type:

pd.DataFrame or None

meta_dict

Dictionary storing metadata about the data processing.

Type:

dict

__init__()

Initializes an empty DataHandler instance with an empty DataFrame for storing minute-level ENMO values.

Notes

This is a base class constructor. Subclasses should override this method to accept specific parameters for their data sources.

save_data(output_path)

Save minute-level ENMO data to a specified output path.

This method is intended to be implemented by subclasses, specifying the format and structure for saving data.

Parameters:

output_path (str) – The file path where the minute-level ENMO data will be saved.

get_raw_data()

Retrieve the raw data.

Returns:

A DataFrame containing the raw data.

Return type:

pd.DataFrame

get_sf_data()

Retrieve the filtered data.

Returns:

A DataFrame containing the filtered data.

Return type:

pd.DataFrame

get_ml_data()

Retrieve the minute-level ENMO values.

Returns:

A DataFrame containing the minute-level ENMO values.

Return type:

pd.DataFrame

get_meta_data()

Retrieve the metadata.

Returns:

A dictionary containing the metadata.

Return type:

dict
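
The subclassing pattern described for DataHandler can be sketched as follows. This is a minimal stand-in, not the cosinorage source: MiniDataHandler and ListDataHandler are hypothetical names, plain lists stand in for DataFrames, and only the attribute and method names are taken from the documentation above.

```python
class MiniDataHandler:
    """Hypothetical stand-in mirroring the documented DataHandler interface."""

    def __init__(self):
        self.raw_data = None   # raw data as loaded from the source
        self.sf_data = None    # filtered and processed data
        self.ml_data = None    # minute-level ENMO data
        self.meta_dict = {}    # metadata collected during processing

    def load_data(self):
        raise NotImplementedError("subclasses implement load_data")

    def save_data(self, output_path):
        raise NotImplementedError("subclasses implement save_data")

    def get_raw_data(self):
        return self.raw_data

    def get_ml_data(self):
        return self.ml_data

    def get_meta_data(self):
        return self.meta_dict


class ListDataHandler(MiniDataHandler):
    """Hypothetical subclass that 'loads' minute-level values from a list."""

    def __init__(self, values):
        super().__init__()
        self.values = list(values)
        self.load_data()

    def load_data(self):
        self.raw_data = self.values
        self.sf_data = self.values
        self.ml_data = self.values
        self.meta_dict["raw_n_datapoints"] = len(self.values)

    def save_data(self, output_path):
        # One value per line; the real output format is up to each subclass.
        with open(output_path, "w") as fh:
            fh.write("\n".join(str(v) for v in self.ml_data))


handler = ListDataHandler([12.5, 30.1, 8.0])
print(handler.get_ml_data())
print(handler.get_meta_data())
```

The getters stay in the base class, so every subclass exposes the same read interface regardless of where its data comes from.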

class NHANESDataHandler(nhanes_file_dir, seqn=None, verbose=False)

Bases: DataHandler

Data handler for NHANES accelerometer data.

This class handles loading, filtering, and processing of NHANES accelerometer data.

Parameters:
  • nhanes_file_dir (str)

  • seqn (int)

  • verbose (bool)

nhanes_file_dir

Directory containing NHANES data files.

Type:

str

seqn

ID of the person whose data is being loaded.

Type:

int or None

__init__(nhanes_file_dir, seqn=None, verbose=False)

Initializes an NHANESDataHandler instance. This constructor overrides the DataHandler base-class constructor to accept the NHANES-specific parameters listed below.

Parameters:
  • nhanes_file_dir (str)

  • seqn (int | None)

  • verbose (bool)

get_ml_data()

Get the minute-level data.

class GalaxyDataHandler(galaxy_file_path, data_format='binary', data_type=None, time_column=None, data_columns=None, preprocess_args={}, verbose=False)

Bases: DataHandler

Unified data handler for Samsung Galaxy Watch accelerometer data.

This class handles loading, filtering, and processing of Galaxy Watch accelerometer data in both binary and CSV formats. Currently supports:

  • Binary format with accelerometer data type

  • CSV format with ENMO data type

Parameters:
  • galaxy_file_path (str)

  • data_format (str)

  • data_type (str | None)

  • time_column (str | None)

  • data_columns (list | None)

  • preprocess_args (dict)

  • verbose (bool)

galaxy_file_path

Path to the Galaxy Watch data file (for CSV) or directory (for binary).

Type:

str

data_format

Format of the data (‘csv’ or ‘binary’).

Type:

str

data_type

Type of the data (‘enmo’ or ‘accelerometer’).

Type:

str

time_column

Name of the timestamp column.

Type:

str

data_columns

Names of the data columns.

Type:

list

preprocess_args

Arguments for preprocessing.

Type:

dict

__init__(galaxy_file_path, data_format='binary', data_type=None, time_column=None, data_columns=None, preprocess_args={}, verbose=False)

Initializes a GalaxyDataHandler instance. This constructor overrides the DataHandler base-class constructor to accept the Galaxy Watch-specific parameters listed below.

Parameters:
  • galaxy_file_path (str)

  • data_format (str)

  • data_type (str | None)

  • time_column (str | None)

  • data_columns (list | None)

  • preprocess_args (dict)

  • verbose (bool)

class UKBDataHandler(qa_file_path, ukb_file_dir, eid, verbose=False)

Bases: DataHandler

Data handler for UK Biobank accelerometer data.

This class handles loading, filtering, and processing of UK Biobank accelerometer data.

Parameters:
  • qa_file_path (str)

  • ukb_file_dir (str)

  • eid (int)

  • verbose (bool)

qa_file_path

Path to quality assessment file.

Type:

str

ukb_file_dir

Directory containing UK Biobank data files.

Type:

str

eid

Participant ID.

Type:

int

__init__(qa_file_path, ukb_file_dir, eid, verbose=False)

Initializes a UKBDataHandler instance. This constructor overrides the DataHandler base-class constructor to accept the UK Biobank-specific parameters listed below.

Parameters:
  • qa_file_path (str)

  • ukb_file_dir (str)

  • eid (int)

  • verbose (bool)

class GenericDataHandler(file_path, data_format='csv', data_type='accelerometer-mg', time_format='unix-ms', time_column='timestamp', time_zone=None, data_columns=None, preprocess_args={}, verbose=False)

Bases: DataHandler

Generic data handler for processing accelerometer and ENMO data from CSV files.

This class provides a flexible interface for loading and processing various types of accelerometer data, including ENMO (Euclidean Norm Minus One), raw accelerometer data (x, y, z), and alternative count data. It supports automatic data filtering, resampling, preprocessing, and ENMO calculation.

Parameters:
  • file_path (str)

  • data_format (str)

  • data_type (str)

  • time_format (str)

  • time_column (str)

  • time_zone (str | None)

  • data_columns (list | None)

  • preprocess_args (dict)

  • verbose (bool)

file_path

Path to the CSV file containing the data.

Type:

str

data_format

Format of the data file.

Type:

str

data_type

Type of data in the file.

Type:

str

time_format

Format of timestamps.

Type:

str

time_column

Name of the timestamp column.

Type:

str

time_zone

Timezone for datetime conversion.

Type:

str or None

data_columns

Names of the data columns.

Type:

list

preprocess_args

Preprocessing arguments.

Type:

dict

raw_data

Raw data loaded from the file with timestamp index.

Type:

pd.DataFrame or None

sf_data

Data after filtering and resampling (sensor fusion data).

Type:

pd.DataFrame or None

ml_data

Minute-level ENMO data calculated from the processed data.

Type:

pd.DataFrame or None

meta_dict

Metadata dictionary containing information about the data processing.

Type:

dict

Examples

Load ENMO data from a CSV file:

>>> handler = GenericDataHandler(
...     file_path='data/enmo_data.csv',
...     data_type='enmo-mg',  # or 'enmo-g', matching the units in the file
...     time_column='timestamp',
...     data_columns=['enmo']
... )
>>> raw_data = handler.get_raw_data()
>>> ml_data = handler.get_ml_data()

Load accelerometer data from a CSV file:

>>> handler = GenericDataHandler(
...     file_path='data/accel_data.csv',
...     data_type='accelerometer-mg',  # or 'accelerometer-g' / 'accelerometer-ms2'
...     time_column='time',
...     data_columns=['x', 'y', 'z']
... )
>>> raw_data = handler.get_raw_data()
>>> ml_data = handler.get_ml_data()

Notes

The data processing pipeline includes:

  1. Loading raw data from CSV file

  2. Filtering incomplete days and selecting longest consecutive sequence

  3. Resampling to minute-level data

  4. Preprocessing (wear detection, noise removal, etc.)

  5. Calculating minute-level ENMO values

The class automatically handles column mapping and timestamp processing.
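
The final pipeline step relies on ENMO, conventionally defined as the Euclidean norm of the three axes minus 1 g, with negative results truncated to zero. The sketch below applies that definition to a single sample. enmo_mg is a hypothetical helper, inputs are assumed to be in g, and the library's own implementation may differ in detail.

```python
import math


def enmo_mg(x, y, z):
    """ENMO of one sample in milli-g, for axis values given in g (hypothetical helper)."""
    norm = math.sqrt(x * x + y * y + z * z)
    # Subtract 1 g for gravity; clip negative values to zero.
    return max(norm - 1.0, 0.0) * 1000.0


print(enmo_mg(0.0, 0.0, 1.0))   # device at rest: gravity only
print(enmo_mg(0.0, 0.0, 2.0))   # 1 g of movement on top of gravity
print(enmo_mg(0.3, 0.4, 0.0))   # norm below 1 g is clipped to zero
```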

__init__(file_path, data_format='csv', data_type='accelerometer-mg', time_format='unix-ms', time_column='timestamp', time_zone=None, data_columns=None, preprocess_args={}, verbose=False)

Initialize GenericDataHandler with CSV data file.

Parameters:
  • file_path (str) – Path to the CSV file containing the data.

  • data_format (str, default='csv') – Format of the data file. Currently only ‘csv’ is supported.

  • data_type (str, default='accelerometer-mg') – Type of data in the file. Must be one of:
      - ‘enmo-mg’, ‘enmo-g’: ENMO (Euclidean Norm Minus One) data
      - ‘accelerometer-mg’, ‘accelerometer-g’, ‘accelerometer-ms2’: Raw accelerometer data (x, y, z)
      - ‘alternative_count’: Alternative count data

  • time_format (str, default='unix-ms') – Format of timestamps. Must be one of: ‘unix-ms’, ‘unix-s’, ‘datetime’.

  • time_column (str, default='timestamp') – Name of the timestamp column in the CSV file.

  • time_zone (str, optional) – Timezone for datetime conversion. If None, uses local timezone.

  • data_columns (list, optional) – Names of the data columns in the CSV file. If not provided, defaults are:
      - [‘enmo’] for data_type=‘enmo-mg’ or ‘enmo-g’
      - [‘x’, ‘y’, ‘z’] for data_type=‘accelerometer-mg’, ‘accelerometer-g’, or ‘accelerometer-ms2’
      - [‘counts’] for data_type=‘alternative_count’

  • preprocess_args (dict, default={}) – Additional preprocessing arguments to pass to the filtering and preprocessing functions.

  • verbose (bool, default=False) – Whether to print progress information during data loading and processing.
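
The three time_format values could be interpreted along these lines. This is an illustrative sketch using only the standard library: parse_timestamp is a hypothetical helper, and it pins results to UTC for reproducibility, whereas the handler is documented to fall back to the local timezone when time_zone is None.

```python
from datetime import datetime, timezone


def parse_timestamp(value, time_format="unix-ms"):
    """Convert one raw timestamp to an aware UTC datetime (hypothetical helper)."""
    if time_format == "unix-ms":
        return datetime.fromtimestamp(int(value) / 1000.0, tz=timezone.utc)
    if time_format == "unix-s":
        return datetime.fromtimestamp(int(value), tz=timezone.utc)
    if time_format == "datetime":
        parsed = datetime.fromisoformat(value)
        # Assumption: naive strings are treated as UTC in this sketch.
        return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
    raise ValueError(f"unsupported time_format: {time_format!r}")


# All three inputs describe the same instant.
print(parse_timestamp(1672531200000, "unix-ms"))
print(parse_timestamp(1672531200, "unix-s"))
print(parse_timestamp("2023-01-01 00:00:00", "datetime"))
```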

Utility Functions

Generic Data Functions

read_generic_xD_data(file_path, data_type, meta_dict, n_dimensions, time_format='unix-ms', time_column='timestamp', time_zone=None, data_columns=None, verbose=False)

Read generic accelerometer or count data from a CSV file.

This function loads data from a CSV file and standardizes the column names for further processing. It supports both 1-dimensional (counts/ENMO) and 3-dimensional (accelerometer) data formats.

Parameters:
  • file_path (str) – Path to the CSV file containing the data.

  • meta_dict (dict) – Dictionary to store metadata about the loaded data. Will be populated with:
      - raw_n_datapoints: Number of data points
      - raw_start_datetime: Start timestamp
      - raw_end_datetime: End timestamp
      - sf: Sampling frequency in Hz
      - raw_data_frequency: Sampling frequency as string
      - raw_data_type: Type of data (‘Counts’ or ‘Accelerometer’)
      - raw_data_unit: Unit of data (‘counts’ or ‘mg’)

  • n_dimensions (int) – Number of dimensions in the data. Must be either 1 (for counts/ENMO) or 3 (for accelerometer).

  • time_column (str, default='timestamp') – Name of the timestamp column in the CSV file.

  • data_columns (list, optional) – Names of the data columns in the CSV file. If not provided, defaults are:
      - [‘counts’] for n_dimensions=1
      - [‘x’, ‘y’, ‘z’] for n_dimensions=3

  • verbose (bool, default=False) – Whether to print progress information.

  • data_type (str)

  • time_format (str)

  • time_zone (str | None)

Returns:

DataFrame containing the loaded data with standardized column names:

  • For n_dimensions=1: [‘ENMO’] (single column)

  • For n_dimensions=3: [‘x’, ‘y’, ‘z’] (three columns)

The DataFrame has a datetime index from the timestamp column.

Return type:

pd.DataFrame

Raises:

ValueError – If n_dimensions is not 1 or 3, or if the number of data_columns doesn’t match n_dimensions.

Examples

Load 1-dimensional count data:

>>> meta_dict = {}
>>> data = read_generic_xD_data(
...     file_path='data/counts.csv',
...     data_type='alternative_count',
...     meta_dict=meta_dict,
...     n_dimensions=1,
...     time_column='time',
...     data_columns=['counts']
... )
>>> print(data.columns)
Index(['ENMO'], dtype='object')

Load 3-dimensional accelerometer data:

>>> meta_dict = {}
>>> data = read_generic_xD_data(
...     file_path='data/accel.csv',
...     data_type='accelerometer-mg',
...     meta_dict=meta_dict,
...     n_dimensions=3,
...     time_column='timestamp',
...     data_columns=['accel_x', 'accel_y', 'accel_z']
... )
>>> print(data.columns)
Index(['x', 'y', 'z'], dtype='object')

Notes

The function automatically:

  • Converts timestamps to datetime objects

  • Removes timezone information

  • Fills missing values with 0

  • Sorts data by timestamp

  • Detects sampling frequency from timestamps

  • Populates metadata dictionary with data information
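
How the sampling frequency is detected is not specified here. One common approach, shown below purely as an assumption, takes the median interval between sorted timestamps. detect_sampling_frequency is a hypothetical helper, not a cosinorage function.

```python
from statistics import median


def detect_sampling_frequency(timestamps_ms):
    """Estimate the sampling frequency in Hz from Unix-ms timestamps (hypothetical helper)."""
    ordered = sorted(timestamps_ms)
    deltas = [b - a for a, b in zip(ordered, ordered[1:])]
    if not deltas:
        raise ValueError("need at least two timestamps")
    # The median interval is robust to occasional gaps in the recording.
    return 1000.0 / median(deltas)


stamps = [0, 40, 80, 120, 5000, 5040]  # 40 ms spacing with one long gap
print(detect_sampling_frequency(stamps))
```

Using the median rather than the mean keeps a single long gap from distorting the estimate.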

filter_generic_data(data, data_type, meta_dict={}, verbose=False, preprocess_args={})

Filter generic data by removing incomplete days and selecting longest consecutive sequence.

This function applies data quality filters to ensure only complete and consecutive days of data are retained for analysis. It removes incomplete days and selects the longest sequence of consecutive days.

Parameters:
  • data (pd.DataFrame) – Input DataFrame with datetime index containing accelerometer or count data.

  • data_type (str) – Type of data being processed. Must be one of:
      - ‘enmo’: ENMO (Euclidean Norm Minus One) data
      - ‘accelerometer’: Raw accelerometer data (x, y, z)
      - ‘alternative_count’: Alternative count data

  • meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process. Will be updated with:
      - filtered_n_datapoints: Number of data points after filtering
      - filtered_start_datetime: Start timestamp after filtering
      - filtered_end_datetime: End timestamp after filtering

  • verbose (bool, default=False) – Whether to print progress information during filtering.

  • preprocess_args (dict, default={}) – Additional preprocessing arguments that may affect filtering behavior.

Returns:

Filtered DataFrame containing only complete and consecutive days of data. The DataFrame maintains the same structure as the input.

Return type:

pd.DataFrame

Notes

  • Removes days that don’t have the expected number of data points

  • Selects the longest sequence of consecutive days (minimum 4 days required)

  • Updates metadata with information about the filtered data

  • The function assumes 24-hour periods for day-based filtering
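
The two filtering rules in these notes can be sketched in plain Python. The function below is a hypothetical illustration for minute-level data (1440 expected samples per day). It does not enforce the 4-day minimum mentioned above, and the library's own logic may differ.

```python
from collections import Counter
from datetime import datetime, timedelta


def longest_complete_run(timestamps, expected_per_day=1440):
    """Return the longest run of consecutive dates that each have a full set of samples."""
    counts = Counter(ts.date() for ts in timestamps)
    complete = sorted(day for day, n in counts.items() if n == expected_per_day)
    best, current = [], []
    for day in complete:
        if current and day - current[-1] == timedelta(days=1):
            current.append(day)      # extends the current run
        else:
            current = [day]          # a gap starts a new run
        if len(current) > len(best):
            best = list(current)
    return best


start = datetime(2023, 1, 1)
# 10000 minute-level samples: six full days plus a partial seventh day.
stamps = [start + timedelta(minutes=i) for i in range(10000)]
run = longest_complete_run(stamps)
print(len(run), run[0], run[-1])
```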

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with some incomplete days
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.randn(10000)}, index=dates)
>>>
>>> # Filter the data
>>> meta_dict = {}
>>> filtered_data = filter_generic_data(
...     data, data_type='enmo', meta_dict=meta_dict, verbose=True
... )
>>> print(f"Original data points: {len(data)}")
>>> print(f"Filtered data points: {len(filtered_data)}")

resample_generic_data(data, data_type, meta_dict={}, verbose=False)

Resample generic data to minute-level resolution.

This function resamples high-frequency data to minute-level resolution using mean aggregation. This is a standard preprocessing step for circadian rhythm analysis.

Parameters:
  • data (pd.DataFrame) – Input DataFrame with datetime index containing high-frequency data.

  • data_type (str) – Type of data being processed. Must be one of:
      - ‘enmo’: ENMO (Euclidean Norm Minus One) data
      - ‘accelerometer’: Raw accelerometer data (x, y, z)
      - ‘alternative_count’: Alternative count data

  • meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process. Will be updated with:
      - resampled_n_datapoints: Number of data points after resampling
      - resampled_start_datetime: Start timestamp after resampling
      - resampled_end_datetime: End timestamp after resampling

  • verbose (bool, default=False) – Whether to print progress information during resampling.

Returns:

Resampled DataFrame with minute-level resolution. The DataFrame maintains the same column structure as the input but with reduced temporal resolution.

Return type:

pd.DataFrame

Notes

  • Uses pandas resample(‘min’).mean() for minute-level aggregation

  • The function assumes the input data has a datetime index

  • All columns are resampled using mean aggregation

  • Updates metadata with information about the resampled data
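
The idea behind minute-level mean aggregation can be shown without pandas. The sketch below is a hypothetical stand-in, not the library code: it floors each timestamp to its minute and averages the values in each bucket.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean


def resample_to_minutes(samples):
    """Average (timestamp, value) pairs within each calendar minute."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    return {minute: fmean(values) for minute, values in sorted(buckets.items())}


start = datetime(2023, 1, 1)
# One sample every 10 seconds for 3 minutes; values 0, 1, 2, ...
samples = [(start + timedelta(seconds=10 * i), float(i)) for i in range(18)]
for minute, mean in resample_to_minutes(samples).items():
    print(minute, mean)
```

Unlike a pandas resample, this sketch simply omits minutes that contain no samples instead of emitting NaN rows for them.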

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample high-frequency data (every 10 seconds)
>>> dates = pd.date_range('2023-01-01', periods=8640, freq='10s')  # 24 hours
>>> data = pd.DataFrame({
...     'ENMO': np.random.randn(8640),
...     'wear': np.ones(8640)
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_generic_data(
...     data, data_type='enmo', meta_dict=meta_dict, verbose=True
... )
>>> print(f"Original frequency: {len(data)} points")
>>> print(f"Resampled frequency: {len(resampled_data)} points")

preprocess_generic_data(data, data_type, preprocess_args={}, meta_dict={}, verbose=False)

Preprocess generic accelerometer data with calibration, noise removal, and wear detection.

This function applies a comprehensive preprocessing pipeline to accelerometer data, including calibration, noise filtering, and wear detection. The preprocessing steps are applied based on the data type and preprocessing arguments.

Parameters:
  • data (pd.DataFrame) – Input DataFrame with datetime index containing accelerometer data. For accelerometer data, must have columns [‘x’, ‘y’, ‘z’].

  • data_type (str) – Type of data being processed. Must be one of:
      - ‘enmo’: ENMO (Euclidean Norm Minus One) data
      - ‘accelerometer’: Raw accelerometer data (x, y, z)
      - ‘alternative_count’: Alternative count data

  • preprocess_args (dict, default={}) – Dictionary containing preprocessing parameters:
      - ‘calibrate’: Whether to apply accelerometer calibration (default: False)
      - ‘sphere_crit’: Sphere fitting criterion for calibration (default: 0.3)
      - ‘sd_criteria’: Standard deviation criterion for calibration (default: 0.1)
      - ‘remove_noise’: Whether to apply noise filtering (default: False)
      - ‘filter_cutoff’: Cutoff frequency for noise filter in Hz (default: 2)
      - ‘detect_wear’: Whether to apply wear detection (default: False)
      - ‘sd_crit’: Standard deviation criterion for wear detection (default: 0.013)
      - ‘range_crit’: Range criterion for wear detection (default: 0.05)
      - ‘window_length’: Window length for wear detection in seconds (default: 60)
      - ‘window_skip’: Window skip for wear detection in seconds (default: 30)

  • meta_dict (dict, default={}) – Dictionary to store metadata about the preprocessing process.

  • verbose (bool, default=False) – Whether to print progress information during preprocessing.

Returns:

Preprocessed DataFrame with the same structure as input but with applied preprocessing steps. May include additional columns like ‘wear’ if wear detection is enabled.

Return type:

pd.DataFrame

Notes

  • Calibration is only applied to accelerometer data (data_type=‘accelerometer-mg’, ‘accelerometer-g’, ‘accelerometer-ms2’)

  • Noise removal uses a Butterworth low-pass filter

  • Wear detection adds a binary ‘wear’ column (1=worn, 0=not worn)

  • The function skips preprocessing steps that are not enabled in preprocess_args

  • All preprocessing steps are applied in sequence: calibration → noise removal → wear detection
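
One way to reason about preprocess_args is to overlay user-supplied values on the documented defaults and then read off the enabled steps in the documented order. The helpers below are hypothetical and only mirror the defaults and ordering stated above. They are not the library's internals.

```python
# Defaults as listed in the preprocess_args description.
DEFAULT_PREPROCESS_ARGS = {
    "calibrate": False,
    "sphere_crit": 0.3,
    "sd_criteria": 0.1,
    "remove_noise": False,
    "filter_cutoff": 2,
    "detect_wear": False,
    "sd_crit": 0.013,
    "range_crit": 0.05,
    "window_length": 60,
    "window_skip": 30,
}

# Documented order: calibration, then noise removal, then wear detection.
STEP_FLAGS = ["calibrate", "remove_noise", "detect_wear"]


def resolve_preprocess_args(user_args=None):
    """Overlay user-supplied values on the documented defaults (hypothetical helper)."""
    return {**DEFAULT_PREPROCESS_ARGS, **(user_args or {})}


def enabled_steps(args):
    """List the enabled steps in the documented execution order."""
    return [flag for flag in STEP_FLAGS if args[flag]]


args = resolve_preprocess_args({"detect_wear": True, "calibrate": True, "window_length": 120})
print(enabled_steps(args))
print(args["window_length"], args["sd_crit"])
```

Building a fresh dictionary on every call leaves the defaults untouched, so one caller's overrides cannot leak into another's.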

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> dates = pd.date_range('2023-01-01', periods=1440, freq='min')
>>> data = pd.DataFrame({
...     'x': np.random.randn(1440),
...     'y': np.random.randn(1440),
...     'z': np.random.randn(1440) + 1  # Add gravity component
... }, index=dates)
>>>
>>> # Apply preprocessing with wear detection
>>> preprocess_args = {
...     'calibrate': True,
...     'remove_noise': True,
...     'detect_wear': True
... }
>>> meta_dict = {}
>>> processed_data = preprocess_generic_data(
...     data, data_type='accelerometer',
...     preprocess_args=preprocess_args, meta_dict=meta_dict, verbose=True
... )
>>> print(f"Processed data shape: {processed_data.shape}")
>>> print(f"Wear column present: {'wear' in processed_data.columns}")

Galaxy Smartwatch Data Functions

read_galaxy_binary_data(galaxy_file_dir, meta_dict, time_column='unix_timestamp_in_ms', data_columns=None, verbose=False)

Read accelerometer data from Galaxy Watch binary files.

Parameters:
  • galaxy_file_dir (str) – Directory containing Galaxy Watch data files

  • meta_dict (dict) – Dictionary to store metadata about the loaded data

  • time_column (str) – Name of the timestamp column in the binary data

  • data_columns (list) – Names of the data columns in the binary data

  • verbose (bool) – Whether to print progress information

Returns:

DataFrame containing accelerometer data with columns [‘x’, ‘y’, ‘z’]

Return type:

pd.DataFrame

filter_galaxy_binary_data(data, meta_dict={}, verbose=False, preprocess_args={})

Filter Galaxy Watch accelerometer data by removing incomplete days and selecting longest consecutive sequence.

Parameters:
  • data (pd.DataFrame) – Raw accelerometer data

  • meta_dict (dict) – Dictionary to store metadata about the filtering process

  • verbose (bool) – Whether to print progress information

  • preprocess_args (dict)

Returns:

Filtered accelerometer data

Return type:

pd.DataFrame

resample_galaxy_binary_data(data, meta_dict={}, verbose=False)

Resample Galaxy Watch accelerometer data to a regular interval.

Parameters:
  • data (pd.DataFrame) – Filtered accelerometer data

  • meta_dict (dict) – Dictionary to store metadata about the resampling process

  • verbose (bool) – Whether to print progress information

Returns:

Resampled accelerometer data at regular frequency.

Return type:

pd.DataFrame

preprocess_galaxy_binary_data(data, preprocess_args={}, meta_dict={}, verbose=False)

Preprocess Galaxy Watch accelerometer data including rescaling, calibration, noise removal, and wear detection.

Parameters:
  • data (pd.DataFrame) – Resampled accelerometer data

  • preprocess_args (dict) – Dictionary containing preprocessing parameters

  • meta_dict (dict) – Dictionary to store metadata about the preprocessing

  • verbose (bool) – Whether to print progress information

Returns:

Preprocessed accelerometer data with additional columns for raw values and wear detection

Return type:

pd.DataFrame

acceleration_data_to_dataframe(data)

Convert binary acceleration data to pandas DataFrame.

This function converts raw binary acceleration data from Samsung Galaxy Watch into a structured pandas DataFrame format for further processing.

Parameters:

data (object) – Binary acceleration data object containing samples with the following attributes:

  • acceleration_x: X-axis acceleration value

  • acceleration_y: Y-axis acceleration value

  • acceleration_z: Z-axis acceleration value

  • sensor_body_location: Location of the sensor on the body

  • unix_timestamp_in_ms: Timestamp in milliseconds since Unix epoch

  • effective_time_frame: Effective time frame for the sample

Returns:

DataFrame containing accelerometer data with columns:

  • ‘acceleration_x’: X-axis acceleration values

  • ‘acceleration_y’: Y-axis acceleration values

  • ‘acceleration_z’: Z-axis acceleration values

  • ‘sensor_body_location’: Sensor location information

  • ‘unix_timestamp_in_ms’: Timestamps in milliseconds

  • ‘effective_time_frame’: Effective time frame information

Return type:

pd.DataFrame

Notes

  • This function is used internally by read_galaxy_binary_data

  • The function iterates through all samples in the binary data object

  • Each sample is converted to a dictionary and added to the DataFrame

  • The resulting DataFrame maintains the original data structure from the binary file
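
The sample-to-row conversion described in these notes might look roughly like this. The sketch uses made-up stand-in objects. The `samples` attribute name and the example values are assumptions, and only the six field names come from the documentation above.

```python
from types import SimpleNamespace

# Attribute names as documented for each sample.
FIELDS = (
    "acceleration_x",
    "acceleration_y",
    "acceleration_z",
    "sensor_body_location",
    "unix_timestamp_in_ms",
    "effective_time_frame",
)


def samples_to_rows(data):
    """Build one dict per sample, keyed by the documented attribute names."""
    return [{name: getattr(sample, name) for name in FIELDS} for sample in data.samples]


# Made-up stand-in for a decoded binary object; `samples` is an assumed attribute name.
data = SimpleNamespace(samples=[
    SimpleNamespace(acceleration_x=0.01, acceleration_y=-0.02, acceleration_z=0.98,
                    sensor_body_location="wrist", unix_timestamp_in_ms=1672531200000,
                    effective_time_frame=None),
    SimpleNamespace(acceleration_x=0.03, acceleration_y=-0.01, acceleration_z=0.97,
                    sensor_body_location="wrist", unix_timestamp_in_ms=1672531200040,
                    effective_time_frame=None),
])

rows = samples_to_rows(data)
print(len(rows), list(rows[0]))
```

Passing such a list of dictionaries to pandas.DataFrame(rows) yields one column per field.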

Examples

>>> # This function is typically called internally by read_galaxy_binary_data
>>> # but can be used directly if you have binary data objects:
>>>
>>> # Load binary data (example)
>>> binary_data = load_acceleration_data("path/to/binary/file")
>>>
>>> # Convert to DataFrame
>>> df = acceleration_data_to_dataframe(binary_data)
>>> print(f"Converted {len(df)} acceleration samples")
>>> print(f"Columns: {df.columns.tolist()}")

read_galaxy_csv_data(galaxy_file_path, meta_dict, time_column='timestamp', data_columns=None, verbose=False)

Read ENMO data from Galaxy Watch CSV file.

This function loads ENMO (Euclidean Norm Minus One) data from Samsung Galaxy Watch CSV files and standardizes the format for further processing.

Parameters:
  • galaxy_file_path (str) – Path to the Galaxy Watch CSV data file containing ENMO values.

  • meta_dict (dict) – Dictionary to store metadata about the loaded data. Will be populated with:
      - raw_n_datapoints: Number of data points
      - raw_start_datetime: Start timestamp
      - raw_end_datetime: End timestamp
      - sf: Sampling frequency in Hz
      - raw_data_frequency: Sampling frequency as string
      - raw_data_type: Type of data (‘ENMO’)
      - raw_data_unit: Unit of data (‘mg’)

  • time_column (str, default='timestamp') – Name of the timestamp column in the CSV file.

  • data_columns (list, optional) – Names of the data columns in the CSV file. If not provided, defaults to [‘enmo’].

  • verbose (bool, default=False) – Whether to print progress information during loading.

Returns:

DataFrame containing ENMO data with a standardized column name: ‘ENMO’, holding ENMO values in mg units. The DataFrame has a datetime index from the timestamp column.

Return type:

pd.DataFrame

Notes

  • The function automatically converts UTC timestamps to local time

  • Missing values are filled with 0

  • Data is sorted by timestamp

  • Sampling frequency is automatically detected from timestamps

  • Column names are standardized to ‘ENMO’ for consistency

Examples

>>> import pandas as pd
>>>
>>> # Load ENMO data from Galaxy Watch CSV file
>>> meta_dict = {}
>>> data = read_galaxy_csv_data(
...     galaxy_file_path='data/galaxy_enmo.csv',
...     meta_dict=meta_dict,
...     time_column='time',
...     data_columns=['enmo_mg'],
...     verbose=True
... )
>>> print(f"Loaded {len(data)} ENMO records")
>>> print(f"Sampling frequency: {meta_dict['sf']:.1f} Hz")

filter_galaxy_csv_data(data, meta_dict={}, verbose=False, preprocess_args={})

Filter Galaxy Watch ENMO data by removing incomplete days and selecting longest consecutive sequence.

This function applies data quality filters to Galaxy Watch ENMO data, including removal of incomplete days and selection of the longest consecutive sequence of days.

Parameters:
  • data (pd.DataFrame) – Raw ENMO data with datetime index and ‘ENMO’ column.

  • meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process. Should contain sf, the sampling frequency in Hz.

  • verbose (bool, default=False) – Whether to print progress information during filtering.

  • preprocess_args (dict, default={}) – Dictionary containing filtering parameters: required_daily_coverage is the minimum fraction of daily data required (default: 0.5).

Returns:

Filtered ENMO data containing only complete and consecutive days. The DataFrame maintains the same structure as the input.

Return type:

pd.DataFrame

Notes

  • Removes days that don’t meet the required daily coverage threshold

  • Selects the longest sequence of consecutive days (minimum 4 days required)

  • Resamples data to minute-level resolution

  • Removes incomplete first and last days

  • Updates metadata with information about the filtering process
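
The daily coverage threshold can be illustrated as follows. How the library actually computes coverage is an assumption here: the hypothetical helper below compares each day's sample count with the number expected in 24 hours at sf Hz.

```python
from collections import Counter
from datetime import datetime, timedelta


def days_meeting_coverage(timestamps, sf, required_daily_coverage=0.5):
    """Return the dates whose sample count reaches the required fraction of a full day."""
    expected_per_day = sf * 86400  # samples in 24 hours at sf Hz
    counts = Counter(ts.date() for ts in timestamps)
    return sorted(
        day for day, n in counts.items()
        if n / expected_per_day >= required_daily_coverage
    )


start = datetime(2023, 1, 1)
# Minute-level data (sf = 1/60 Hz): one full day, then 600 of 1440 minutes on the next day.
stamps = [start + timedelta(minutes=i) for i in range(1440 + 600)]
print(days_meeting_coverage(stamps, sf=1 / 60, required_daily_coverage=0.5))
print(days_meeting_coverage(stamps, sf=1 / 60, required_daily_coverage=0.4))
```

The second day covers about 42% of its expected samples, so it is dropped at a threshold of 0.5 and kept at 0.4.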

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample ENMO data
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.randn(10000)}, index=dates)
>>>
>>> # Filter the data
>>> meta_dict = {'sf': 1/60}  # 1 sample per minute
>>> preprocess_args = {'required_daily_coverage': 0.8}
>>> filtered_data = filter_galaxy_csv_data(
...     data, meta_dict=meta_dict, preprocess_args=preprocess_args, verbose=True
... )
>>> print(f"Original data points: {len(data)}")
>>> print(f"Filtered data points: {len(filtered_data)}")

resample_galaxy_csv_data(data, meta_dict={}, verbose=False)

Ensure we have minute-level data across the whole timeseries.

This function resamples Galaxy Watch ENMO data to ensure consistent minute-level resolution across the entire time series.

Parameters:
  • data (pd.DataFrame) – Filtered ENMO data with datetime index and ‘ENMO’ column.

  • meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process.

  • verbose (bool, default=False) – Whether to print progress information during resampling.

Returns:

Resampled ENMO data with consistent minute-level resolution. The DataFrame maintains the same structure as the input.

Return type:

pd.DataFrame

Notes

  • Uses pandas resample(‘1min’) with linear interpolation

  • Backward-fills any remaining gaps with bfill()

  • Ensures consistent temporal resolution for analysis

  • Updates metadata with information about the resampling process

Examples

>>> import pandas as pd
>>>
>>> # Create sample ENMO data with irregular intervals
>>> dates = pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:01:30',
...                         '2023-01-01 00:03:00', '2023-01-01 00:04:30'])
>>> data = pd.DataFrame({'ENMO': [0.1, 0.2, 0.3, 0.4]}, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_galaxy_csv_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Resampled data points: {len(resampled_data)}")

preprocess_galaxy_csv_data(data, preprocess_args={}, meta_dict={}, verbose=False)

Preprocess Galaxy Watch ENMO data. The ENMO values are passed through unchanged and a placeholder ‘wear’ column is added.

This function applies preprocessing steps to Galaxy Watch ENMO data. Currently, wear detection is not implemented for ENMO data as the algorithm relies on raw accelerometer data.

Parameters:
  • data (pd.DataFrame) – Resampled ENMO data with datetime index and ‘ENMO’ column.

  • preprocess_args (dict, default={}) – Dictionary containing preprocessing parameters (currently not used for ENMO data).

  • meta_dict (dict, default={}) – Dictionary to store metadata about the preprocessing process.

  • verbose (bool, default=False) – Whether to print progress information during preprocessing.

Returns:

Preprocessed ENMO data with the following columns:

  • ‘ENMO’: Original ENMO values

  • ‘wear’: Wear detection column (set to -1 for ENMO data)

Return type:

pd.DataFrame

Notes

  • Wear detection is not implemented for ENMO data

  • The ‘wear’ column is set to -1 to indicate no wear detection

  • Future implementations may add wear detection for ENMO data

  • The function maintains the original ENMO values

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample ENMO data
>>> dates = pd.date_range('2023-01-01', periods=1440, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.uniform(0, 0.1, 1440)}, index=dates)
>>>
>>> # Preprocess the data
>>> meta_dict = {}
>>> preprocess_args = {}
>>> processed_data = preprocess_galaxy_csv_data(
...     data, preprocess_args=preprocess_args, meta_dict=meta_dict, verbose=True
... )
>>> print(f"Processed data shape: {processed_data.shape}")
>>> print(f"Wear column present: {'wear' in processed_data.columns}")

UK Biobank Data Functions

read_ukb_data(qc_file_path, enmo_file_dir, eid, meta_dict={}, verbose=False)

Read and process UK Biobank accelerometer data for a specific participant.

This function loads and processes UK Biobank accelerometer data for a specific participant, applying quality control checks and converting the data to a standardized format.

Parameters:
  • qc_file_path (str) – Path to the quality control CSV file containing participant metadata. Must contain columns: eid, acc_data_problem, acc_weartime, acc_calibration, acc_owndata, acc_interrupt_period.

  • enmo_file_dir (str) – Directory containing the ENMO data files (OUT_*.csv format).

  • eid (int) – Participant ID to process.

  • meta_dict (dict, default={}) – Dictionary to store metadata about the loaded data. Will be populated with:
      - raw_n_datapoints: Number of data points
      - raw_start_datetime: Start timestamp
      - raw_end_datetime: End timestamp
      - raw_data_frequency: Sampling frequency (‘minute-level’)
      - raw_data_type: Type of data (‘ENMO’)
      - raw_data_unit: Unit of data (‘mg’)

  • verbose (bool, default=False) – Whether to print processing information and progress.

Returns:

DataFrame containing processed ENMO data with columns: - ‘ENMO’: Euclidean Norm Minus One values in milligravity units The DataFrame has a datetime index.

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If QC file or ENMO directory doesn’t exist.

  • ValueError – If participant data is invalid or fails quality control checks.

Notes

  • Applies multiple quality control filters from the QC file

  • Processes ENMO data from CSV files with acceleration headers

  • Converts timestamps to proper datetime format

  • Filters ENMO values >= 0.1, sets others to 0

  • Sorts data by timestamp for consistency

Examples

>>> import os
>>>
>>> # Load UK Biobank data for a specific participant
>>> qc_file_path = '/path/to/ukb_qc.csv'
>>> enmo_file_dir = '/path/to/enmo/files'
>>> eid = 12345  # Participant ID
>>> meta_dict = {}
>>> data = read_ukb_data(
...     qc_file_path=qc_file_path,
...     enmo_file_dir=enmo_file_dir,
...     eid=eid,
...     meta_dict=meta_dict,
...     verbose=True
... )
>>> print(f"Loaded {len(data)} ENMO records for participant {eid}")
>>> print(f"Data range: {data.index.min()} to {data.index.max()}")
filter_ukb_data(data, meta_dict={}, verbose=False)[source]

Filter UK Biobank accelerometer data to ensure data quality.

This function applies data quality filters to UK Biobank ENMO data, including removal of incomplete days and selection of the longest consecutive sequence.

Parameters:
  • data (pd.DataFrame) – Input DataFrame containing ENMO data with datetime index and ‘ENMO’ column.

  • meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process.

  • verbose (bool, default=False) – Whether to print filtering information and progress.

Returns:

Filtered DataFrame containing only complete and consecutive days of data. Maintains same structure as input DataFrame.

Return type:

pd.DataFrame

Notes

  • Removes incomplete days using filter_incomplete_days (requires 1440 points per day)

  • Selects longest consecutive sequence using filter_consecutive_days

  • Assumes minute-level data (1/60 Hz sampling frequency)

  • Updates metadata with information about the filtering process

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample UK Biobank data
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({'ENMO': np.random.uniform(0, 0.1, 10000)}, index=dates)
>>>
>>> # Filter the data
>>> meta_dict = {}
>>> filtered_data = filter_ukb_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Filtered data points: {len(filtered_data)}")
resample_ukb_data(data, meta_dict={}, verbose=False)[source]

Resample UK Biobank accelerometer data to ensure consistent 1-minute intervals.

This function ensures consistent minute-level resolution for UK Biobank ENMO data by resampling to 1-minute intervals and handling any gaps in the data.

Parameters:
  • data (pd.DataFrame) – Input DataFrame containing ENMO data with datetime index and ‘ENMO’ column.

  • meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process.

  • verbose (bool, default=False) – Whether to print resampling information and progress.

Returns:

Resampled DataFrame with consistent 1-minute intervals. Missing values are interpolated linearly and any remaining gaps are filled using backward fill.

Return type:

pd.DataFrame

Notes

  • Uses pandas resample(‘1min’) with linear interpolation

  • Applies backward fill (bfill) to handle any remaining gaps

  • Ensures consistent temporal resolution for analysis

  • Maintains data integrity and structure

Examples

>>> import pandas as pd
>>>
>>> # Create sample UK Biobank data with irregular intervals
>>> dates = pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:01:30',
...                         '2023-01-01 00:03:00', '2023-01-01 00:04:30'])
>>> data = pd.DataFrame({'ENMO': [0.1, 0.2, 0.3, 0.4]}, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_ukb_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Resampled data points: {len(resampled_data)}")
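
The resampling steps listed in the notes can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `resample_to_minutes_sketch` is hypothetical, and averaging the samples that fall into the same minute before interpolating is an assumption.

```python
import pandas as pd


def resample_to_minutes_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: 1-minute grid, linear interpolation, then bfill."""
    # Assumption: samples that fall into the same minute are averaged.
    per_minute = data.resample("1min").mean()
    # Fill interior gaps linearly, then back-fill anything still missing.
    return per_minute.interpolate(method="linear").bfill()


dates = pd.to_datetime([
    "2023-01-01 00:00:00", "2023-01-01 00:01:30",
    "2023-01-01 00:03:00", "2023-01-01 00:04:30",
])
data = pd.DataFrame({"ENMO": [0.1, 0.2, 0.3, 0.4]}, index=dates)
resampled = resample_to_minutes_sketch(data)
print(resampled)
```

In this sketch the minute starting at 00:02 contains no sample, so its value comes from linear interpolation between the neighbouring minutes.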

NHANES Data Functions

read_nhanes_data(file_dir, seqn=None, meta_dict={}, verbose=False)[source]

Read and process NHANES accelerometer data files for a specific person.

This function loads and processes National Health and Nutrition Examination Survey (NHANES) accelerometer data for a specific participant. It handles the complex NHANES data structure including day-level, minute-level, and header files.

Parameters:
  • file_dir (str) – Directory containing NHANES data files. Must contain: - PAXDAY_*.xpt: Day-level data files - PAXHD_*.xpt: Header data files - PAXMIN_*.xpt: Minute-level data files

  • seqn (str, optional) – Unique identifier for the participant. Although the signature defaults to None, a value must be supplied; otherwise a ValueError is raised.

  • meta_dict (dict, default={}) – Dictionary to store metadata about the loaded data. Will be populated with: - raw_n_datapoints: Number of data points - raw_start_datetime: Start timestamp - raw_end_datetime: End timestamp - raw_data_frequency: Sampling frequency (‘minute-level’) - raw_data_type: Type of data (‘accelerometer’) - raw_data_unit: Unit of data (‘MIMS’)

  • verbose (bool, default=False) – Whether to print processing status and progress information.

Returns:

Processed accelerometer data with columns: - ‘x’, ‘y’, ‘z’: Accelerometer values in MIMS units - ‘wear’: Binary wear detection (1=worn, 0=not worn) - ‘sleep’: Binary sleep detection (1=sleep, 0=wake) - ‘paxpredm’: Original NHANES prediction values The DataFrame is indexed by timestamp.

Return type:

pd.DataFrame

Raises:

ValueError – If seqn is None or if no valid NHANES data is found for the participant.

Notes

  • Automatically detects and processes multiple NHANES data versions

  • Applies data quality filters (paxqfd < 1, valid_hours > 16)

  • Requires at least 4 days of valid data per participant

  • Filters for complete days (288 epochs per day)

  • Converts column names to lowercase for consistency

  • Removes byte-encoded data using remove_bytes function

Examples

>>> import os
>>>
>>> # Load NHANES data for a specific participant
>>> file_dir = '/path/to/nhanes/data'
>>> seqn = '12345'  # Participant ID
>>> meta_dict = {}
>>> data = read_nhanes_data(
...     file_dir=file_dir,
...     seqn=seqn,
...     meta_dict=meta_dict,
...     verbose=True
... )
>>> print(f"Loaded {len(data)} records for participant {seqn}")
>>> print(f"Data columns: {data.columns.tolist()}")
filter_and_preprocess_nhanes_data(data, meta_dict={}, verbose=False)[source]

Filter NHANES accelerometer data for incomplete days and non-consecutive sequences.

This function applies data quality filters to NHANES accelerometer data and converts the data to the standard format used by the CosinorAge pipeline.

Parameters:
  • data (pd.DataFrame) – Raw NHANES accelerometer data with columns [‘x’, ‘y’, ‘z’, ‘wear’, ‘sleep’, ‘paxpredm’] and datetime index.

  • meta_dict (dict, default={}) – Dictionary to store metadata about the filtering process. Will be populated with: - n_days: Number of valid days after filtering

  • verbose (bool, default=False) – Whether to print processing status and progress information.

Returns:

Filtered and preprocessed accelerometer data with columns: - ‘x’, ‘y’, ‘z’: Accelerometer values converted from MIMS to mg units - ‘x_raw’, ‘y_raw’, ‘z_raw’: Original accelerometer values - ‘wear’: Binary wear detection - ‘sleep’: Binary sleep detection - ‘paxpredm’: Original NHANES prediction values - ‘ENMO’: Calculated ENMO values (scaled by factor of 257)

Return type:

pd.DataFrame

Notes

  • Removes incomplete days using filter_incomplete_days

  • Selects longest consecutive sequence using filter_consecutive_days

  • Converts accelerometer values from MIMS to mg units (division by 9.81)

  • Calculates ENMO values with a scaling factor of 257 for parameter tuning

  • Stores original values in *_raw columns for reference

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample NHANES data
>>> dates = pd.date_range('2023-01-01', periods=10000, freq='min')
>>> data = pd.DataFrame({
...     'x': np.random.randn(10000),
...     'y': np.random.randn(10000),
...     'z': np.random.randn(10000),
...     'wear': np.random.choice([0, 1], 10000),
...     'sleep': np.random.choice([0, 1], 10000),
...     'paxpredm': np.random.choice([0, 1, 2], 10000)
... }, index=dates)
>>>
>>> # Filter and preprocess the data
>>> meta_dict = {}
>>> processed_data = filter_and_preprocess_nhanes_data(
...     data, meta_dict=meta_dict, verbose=True
... )
>>> print(f"Processed data shape: {processed_data.shape}")
>>> print(f"Number of days: {meta_dict.get('n_days', 'N/A')}")
resample_nhanes_data(data, meta_dict={}, verbose=False)[source]

Resample NHANES accelerometer data to 1-minute intervals using linear interpolation.

This function ensures consistent minute-level resolution for NHANES accelerometer data by resampling to 1-minute intervals and handling categorical variables appropriately.

Parameters:
  • data (pd.DataFrame) – NHANES accelerometer data with datetime index and columns including ‘x’, ‘y’, ‘z’, ‘sleep’, ‘wear’.

  • meta_dict (dict, default={}) – Dictionary to store metadata about the resampling process.

  • verbose (bool, default=False) – Whether to print processing status and progress information.

Returns:

Resampled accelerometer data with consistent 1-minute intervals. Categorical variables (‘sleep’, ‘wear’) are rounded to nearest integer.

Return type:

pd.DataFrame

Notes

  • Uses pandas resample(‘1min’) with linear interpolation for continuous variables

  • Applies backward fill (bfill) to handle any remaining gaps

  • Rounds categorical variables (‘sleep’, ‘wear’) to nearest integer

  • Maintains data integrity for binary classification variables

Examples

>>> import pandas as pd
>>>
>>> # Create sample NHANES data with irregular intervals
>>> dates = pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:01:30',
...                         '2023-01-01 00:03:00', '2023-01-01 00:04:30'])
>>> data = pd.DataFrame({
...     'x': [0.1, 0.2, 0.3, 0.4],
...     'y': [0.1, 0.2, 0.3, 0.4],
...     'z': [0.1, 0.2, 0.3, 0.4],
...     'sleep': [0, 1, 0, 1],
...     'wear': [1, 1, 0, 1]
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {}
>>> resampled_data = resample_nhanes_data(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original data points: {len(data)}")
>>> print(f"Resampled data points: {len(resampled_data)}")
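
The handling of binary columns after interpolation can be illustrated as follows. This is a hedged sketch rather than the library's code: `resample_with_flags_sketch` is a hypothetical name, and the per-minute mean aggregation is an assumption.

```python
import pandas as pd


def resample_with_flags_sketch(data, flag_cols=("sleep", "wear")):
    """Hypothetical helper: interpolate everything, then round the binary flags."""
    # Assumption: per-minute mean, linear interpolation, then back-fill.
    out = data.resample("1min").mean().interpolate(method="linear").bfill()
    for col in flag_cols:
        if col in out.columns:
            # Interpolation yields fractions such as 0.33; rounding restores 0/1.
            out[col] = out[col].round().astype(int)
    return out


dates = pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 00:03:00"])
data = pd.DataFrame({"x": [0.0, 0.3], "sleep": [0, 1], "wear": [1, 0]}, index=dates)
resampled = resample_with_flags_sketch(data)
print(resampled)
```

Rounding keeps 'sleep' and 'wear' usable as class labels, while continuous columns such as 'x' keep their interpolated values.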
remove_bytes(df)[source]

Convert byte string columns to regular strings in a DataFrame.

This function handles byte-encoded string columns that are common in NHANES data files, converting them to UTF-8 encoded strings for proper processing.

Parameters:

df (pd.DataFrame) – Input DataFrame containing potential byte string columns.

Returns:

DataFrame with byte strings converted to UTF-8 strings. Non-byte string columns remain unchanged.

Return type:

pd.DataFrame

Notes

  • Only processes columns with object dtype (likely to contain byte strings)

  • Uses UTF-8 encoding for conversion

  • Leaves non-byte string values unchanged

  • Common in NHANES data due to SAS file format

Examples

>>> import pandas as pd
>>>
>>> # Create sample DataFrame with byte strings
>>> data = {
...     'col1': [b'hello', b'world', 'normal_string'],
...     'col2': [1, 2, 3],
...     'col3': [b'byte1', b'byte2', b'byte3']
... }
>>> df = pd.DataFrame(data)
>>>
>>> # Convert byte strings
>>> cleaned_df = remove_bytes(df)
>>> print(cleaned_df['col1'].iloc[0])  # 'hello' instead of b'hello'
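
The conversion described in the notes amounts to decoding `bytes` values in object-dtype columns. The sketch below shows one way to do this; `decode_byte_columns_sketch` is a hypothetical name and the actual implementation may differ.

```python
import pandas as pd


def decode_byte_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: decode bytes values in object columns as UTF-8."""
    out = df.copy()
    # Only object-dtype columns can hold Python bytes objects.
    for col in out.select_dtypes(include=["object"]).columns:
        out[col] = out[col].map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
        )
    return out


df = pd.DataFrame({
    "col1": [b"hello", b"world", "normal_string"],
    "col2": [1, 2, 3],
    "col3": [b"byte1", b"byte2", b"byte3"],
})
cleaned = decode_byte_columns_sketch(df)
print(cleaned)
```

Values that are already strings, and numeric columns, pass through unchanged.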
clean_data(df, days)[source]

Clean NHANES minute-level data by applying quality filters.

This function applies multiple quality filters to NHANES minute-level data to ensure only valid measurements are included in the analysis.

Parameters:
  • df (pd.DataFrame) – Raw minute-level NHANES data containing columns: - ‘SEQN’: Participant identifier - ‘PAXMTSM’: MIMS triaxial value for the minute - ‘PAXPREDM’: Prediction values - ‘PAXQFM’: Quality flag

  • days (pd.DataFrame) – Day-level NHANES data containing valid participant identifiers in ‘seqn’ column.

Returns:

Cleaned minute-level data with invalid measurements and participants removed.

Return type:

pd.DataFrame

Notes

  • Filters for participants present in day-level data

  • Removes measurements with PAXMTSM = -0.01 (the NHANES placeholder for a MIMS value that could not be computed)

  • Excludes PAXPREDM values of 3 or 4 (invalid predictions)

  • Removes measurements with PAXQFM >= 1 (poor quality)

Examples

>>> import pandas as pd
>>>
>>> # Create sample NHANES data
>>> minute_data = pd.DataFrame({
...     'SEQN': ['12345', '12345', '12346', '12345'],
...     'PAXMTSM': [0, -0.01, 60, 120],
...     'PAXPREDM': [1, 2, 3, 1],
...     'PAXQFM': [0, 0, 1, 0]
... })
>>>
>>> day_data = pd.DataFrame({'seqn': ['12345']})
>>>
>>> # Clean the data
>>> cleaned_data = clean_data(minute_data, day_data)
>>> print(f"Original records: {len(minute_data)}")
>>> print(f"Cleaned records: {len(cleaned_data)}")
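
The four filters listed in the notes can be combined into one boolean mask. The following is a sketch under the assumption that the filters are applied exactly as listed; `clean_minutes_sketch` is a hypothetical name.

```python
import pandas as pd


def clean_minutes_sketch(df: pd.DataFrame, days: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper applying the four documented filters in one mask."""
    keep = (
        df["SEQN"].isin(days["seqn"])      # participant present in day-level data
        & (df["PAXMTSM"] != -0.01)         # drop the -0.01 placeholder rows
        & ~df["PAXPREDM"].isin([3, 4])     # drop prediction values 3 and 4
        & (df["PAXQFM"] < 1)               # keep rows without quality flags
    )
    return df[keep]


minute_data = pd.DataFrame({
    "SEQN": ["12345", "12345", "12346", "12345"],
    "PAXMTSM": [0, -0.01, 60, 120],
    "PAXPREDM": [1, 2, 3, 1],
    "PAXQFM": [0, 0, 1, 0],
})
day_data = pd.DataFrame({"seqn": ["12345"]})
cleaned = clean_minutes_sketch(minute_data, day_data)
print(len(cleaned))
```

With this sample, the second row is dropped for its -0.01 value and the third for its participant, prediction value, and quality flag.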
calculate_measure_time(row)[source]

Calculate the measurement timestamp for a row of NHANES data.

This function converts NHANES timing information into actual datetime timestamps by combining the day start time with the seconds since midnight.

Parameters:

row (pd.Series) – Row containing timing information: - ‘day1_start_time’: Start time of the first day in format “HH:MM:SS” - ‘paxssnmp’: Offset from that start time, expressed in samples at 80 per second (divided by 80 to obtain seconds)

Returns:

Calculated measurement timestamp combining base time and offset.

Return type:

datetime

Notes

  • Converts day1_start_time string to datetime object

  • Divides paxssnmp by 80 to get actual seconds (NHANES scaling factor)

  • Adds the offset to the base time to get measurement timestamp

  • Used for creating proper datetime index for NHANES data

Examples

>>> import pandas as pd
>>>
>>> # Create sample row with timing information
>>> row = pd.Series({
...     'day1_start_time': '08:30:00',
...     'paxssnmp': 8000  # 100 seconds * 80
... })
>>>
>>> # Calculate measurement time
>>> measure_time = calculate_measure_time(row)
>>> print(f"Measurement time: {measure_time}")
>>> # Output: 1900-01-01 08:31:40 (base time + 100 seconds)
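
The timestamp arithmetic described in the notes needs only the standard library. This is a minimal sketch, with the hypothetical name `measure_time_sketch`, that takes the two fields as plain arguments instead of a row.

```python
from datetime import datetime, timedelta


def measure_time_sketch(day1_start_time: str, paxssnmp: float) -> datetime:
    """Hypothetical helper: base time plus paxssnmp / 80 seconds."""
    # Parsing a time-only string yields a datetime dated 1900-01-01.
    base = datetime.strptime(day1_start_time, "%H:%M:%S")
    return base + timedelta(seconds=paxssnmp / 80)


measure_time = measure_time_sketch("08:30:00", 8000)
print(measure_time)
```

Because the start time carries no date, the result is anchored at 1900-01-01, as in the example output above.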

General Utility Functions

filter_incomplete_days(df, data_freq, expected_points_per_day=None)[source]

Filter out data from incomplete days to ensure 24-hour data periods.

This function removes data from days that don’t have the expected number of data points to ensure that only complete 24-hour data is retained for analysis.

Parameters:
  • df (pd.DataFrame) – DataFrame with datetime index, which is used to determine the day. The index should contain datetime objects.

  • data_freq (float) – Frequency of data collection in Hz (e.g., 1/60 for minute-level data).

  • expected_points_per_day (int, optional) – Expected number of data points per day. If None, calculated using data_freq * 86400.

Returns:

Filtered DataFrame containing only complete days. Returns empty DataFrame if an error occurs during processing.

Return type:

pd.DataFrame

Notes

  • Calculates expected points per day as data_freq * 60 * 60 * 24 if not provided

  • Groups data by date and counts points per day

  • Retains only days with sufficient data points

  • Removes the temporary ‘DATE’ column before returning

  • Handles errors gracefully by returning empty DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with some incomplete days
>>> dates = pd.date_range('2023-01-01', periods=5000, freq='min')
>>> data = pd.DataFrame({'value': np.random.randn(5000)}, index=dates)
>>>
>>> # Filter incomplete days (expecting 1440 points per day for minute data)
>>> filtered_data = filter_incomplete_days(data, data_freq=1/60, expected_points_per_day=1440)
>>> print(f"Original days: {data.index.normalize().nunique()}")
>>> print(f"Complete days: {filtered_data.index.normalize().nunique()}")
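
The group-and-count logic from the notes can be sketched as below. This is an illustration under assumptions, not the library's code: `keep_complete_days_sketch` is a hypothetical name, and a day counts as complete when it has at least the expected number of points.

```python
import pandas as pd


def keep_complete_days_sketch(df, data_freq, expected_points_per_day=None):
    """Hypothetical helper: keep only days with enough samples."""
    if expected_points_per_day is None:
        # data_freq is in Hz, so a full day holds data_freq * 86400 samples.
        expected_points_per_day = int(round(data_freq * 60 * 60 * 24))
    day = df.index.normalize()            # midnight of each sample's day
    counts = df.groupby(day).size()       # samples per calendar day
    complete_days = counts[counts >= expected_points_per_day].index
    return df[day.isin(complete_days)]


dates = pd.date_range("2023-01-01", periods=5000, freq="min")
data = pd.DataFrame({"value": range(5000)}, index=dates)
filtered = keep_complete_days_sketch(data, data_freq=1 / 60)
print(len(data), len(filtered))
```

The 5000 one-minute samples cover three full days plus 680 minutes of a fourth, so only the first 4320 rows remain.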
filter_consecutive_days(df)[source]

Filter DataFrame to retain only the longest sequence of consecutive days.

This function identifies the longest sequence of consecutive days in the data and filters the DataFrame to include only those days. This is important for circadian rhythm analysis which requires continuous data.

Parameters:

df (pd.DataFrame) – DataFrame with datetime index containing the data to filter.

Returns:

Filtered DataFrame containing only the longest sequence of consecutive days.

Return type:

pd.DataFrame

Raises:

ValueError – If fewer than 2 consecutive days are found in the data.

Notes

  • Extracts unique dates from the datetime index

  • Finds the longest consecutive sequence using largest_consecutive_sequence

  • Requires at least 2 consecutive days for valid analysis

  • Filters the DataFrame to include only data from consecutive days

  • Important for circadian rhythm analysis which requires continuous data

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with gaps
>>> dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
...                         '2023-01-05', '2023-01-06', '2023-01-07'])
>>> data = pd.DataFrame({'value': np.random.randn(len(dates))}, index=dates)
>>>
>>> # Filter to longest consecutive sequence
>>> filtered_data = filter_consecutive_days(data)
>>> print(f"Original dates: {data.index.date.tolist()}")
>>> print(f"Consecutive dates: {filtered_data.index.date.tolist()}")
largest_consecutive_sequence(dates)[source]

Find the longest sequence of consecutive dates in a list.

This function analyzes a list of dates and returns the longest subsequence of consecutive dates. It’s used to identify continuous periods of data for circadian rhythm analysis.

Parameters:

dates (List[datetime]) – List of dates to analyze for consecutive sequences.

Returns:

Longest sequence of consecutive dates found. Returns empty list if input is empty.

Return type:

List[datetime]

Notes

  • Sorts and removes duplicate dates before processing

  • Compares dates using timedelta(days=1) for consecutive day detection

  • Maintains the original order within consecutive sequences

  • Handles edge cases like empty lists and single dates

  • Used internally by filter_consecutive_days

Examples

>>> from datetime import datetime
>>>
>>> # Example with gaps in dates
>>> dates = [
...     datetime(2023, 1, 1), datetime(2023, 1, 2), datetime(2023, 1, 3),
...     datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)
... ]
>>> consecutive = largest_consecutive_sequence(dates)
>>> print(f"Longest consecutive sequence: {consecutive}")
>>> # Output: [datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)]
>>>
>>> # Example with single date
>>> single_date = [datetime(2023, 1, 1)]
>>> result = largest_consecutive_sequence(single_date)
>>> print(f"Single date result: {result}")
>>> # Output: [datetime(2023, 1, 1)]
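
A single pass over the sorted, de-duplicated dates is enough to find the longest run. The sketch below uses the hypothetical name `longest_run_sketch`. Its tie-breaking rule (the later run wins) is an assumption, chosen so that the result agrees with the example output above; the library's actual rule is not stated.

```python
from datetime import datetime, timedelta


def longest_run_sketch(dates):
    """Hypothetical helper: longest run of dates spaced exactly one day apart."""
    unique_sorted = sorted(set(dates))
    if not unique_sorted:
        return []
    best = current = [unique_sorted[0]]
    for d in unique_sorted[1:]:
        if d - current[-1] == timedelta(days=1):
            current = current + [d]       # extend the current run
        else:
            current = [d]                 # gap found: start a new run
        if len(current) >= len(best):     # assumption: later run wins ties
            best = current
    return best


dates = [
    datetime(2023, 1, 1), datetime(2023, 1, 2), datetime(2023, 1, 3),
    datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7),
]
print(longest_run_sketch(dates))
```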
calculate_enmo(data, verbose=False)[source]

Calculate the Euclidean Norm Minus One (ENMO) metric from accelerometer data.

This function computes the ENMO metric, which is a widely used measure in physical activity research for quantifying acceleration while accounting for gravity.

Parameters:
  • data (pd.DataFrame) – DataFrame containing accelerometer data with columns: - ‘x’: X-axis acceleration values - ‘y’: Y-axis acceleration values - ‘z’: Z-axis acceleration values All values should be in g units (1g = 9.81 m/s²).

  • verbose (bool, default=False) – If True, prints processing information.

Returns:

Array of ENMO values. Values are truncated at 0, meaning negative values are set to 0. Returns np.nan if calculation fails.

Return type:

numpy.ndarray

Notes

  • ENMO = sqrt(x² + y² + z²) - 1

  • Values are truncated at 0 (negative values become 0)

  • ENMO represents acceleration in excess of 1g (gravity)

  • Commonly used in physical activity and sleep research

  • Handles errors gracefully by returning np.nan

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> data = pd.DataFrame({
...     'x': [0.1, 0.2, 0.3],
...     'y': [0.1, 0.2, 0.3],
...     'z': [1.0, 1.1, 1.2]  # Close to 1g (gravity)
... })
>>>
>>> # Calculate ENMO
>>> enmo_values = calculate_enmo(data, verbose=True)
>>> print(f"ENMO values: {enmo_values}")
>>> # Output: approximately [0.010, 0.136, 0.273]
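
The formula in the notes translates directly into NumPy. This is a minimal sketch with the hypothetical name `enmo_sketch`; it omits the error handling described above.

```python
import numpy as np
import pandas as pd


def enmo_sketch(data: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: max(sqrt(x^2 + y^2 + z^2) - 1, 0) per row."""
    norm = np.sqrt(data["x"] ** 2 + data["y"] ** 2 + data["z"] ** 2)
    # Negative values (norm below 1 g) are truncated to zero.
    return np.maximum(norm - 1.0, 0.0).to_numpy()


data = pd.DataFrame({
    "x": [0.1, 0.2, 0.3, 0.0],
    "y": [0.1, 0.2, 0.3, 0.0],
    "z": [1.0, 1.1, 1.2, 0.5],  # last row has a norm below 1 g
})
values = enmo_sketch(data)
print(values)
```

The last row has a vector norm of 0.5 g, so its ENMO is truncated to 0 instead of -0.5.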
calculate_minute_level_enmo(data, meta_dict={}, verbose=False)[source]

Resample high-frequency ENMO data to minute-level by averaging over each minute.

This function aggregates high-frequency ENMO data to minute-level resolution using mean aggregation, which is the standard approach for circadian rhythm analysis.

Parameters:
  • data (pd.DataFrame) – DataFrame with datetime index and ‘ENMO’ column containing high-frequency ENMO data. Optional ‘wear’ column for wear time information.

  • meta_dict (dict, default={}) – Dictionary containing metadata. Should include: - ‘sf’: Sampling frequency in Hz (defaults to 25Hz if not specified)

  • verbose (bool, default=False) – If True, prints processing information.

Returns:

DataFrame containing minute-level aggregated data with: - ‘ENMO’: Mean ENMO value for each minute - ‘wear’: Mean wear time for each minute (if wear column exists in input) Index is datetime at minute resolution.

Return type:

pd.DataFrame

Raises:

ValueError – If sampling frequency is less than 1/60 Hz (less than one sample per minute).

Notes

  • Uses pandas resample(‘min’).mean() for aggregation

  • Handles both ENMO and wear columns if present

  • Converts index to datetime format

  • Standard preprocessing step for circadian rhythm analysis

  • Handles errors gracefully by returning empty DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample high-frequency ENMO data
>>> dates = pd.date_range('2023-01-01 00:00:00', periods=3600, freq='s')  # 1 hour of second-level data
>>> data = pd.DataFrame({
...     'ENMO': np.random.uniform(0, 0.1, 3600),
...     'wear': np.random.choice([0, 1], 3600)
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {'sf': 1}  # 1 Hz sampling frequency
>>> minute_data = calculate_minute_level_enmo(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original records: {len(data)}")
>>> print(f"Minute-level records: {len(minute_data)}")
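
The aggregation step named in the notes, `resample('min').mean()`, can be illustrated in isolation. The helper name `minute_level_sketch` is hypothetical, and the sampling-frequency check is left out of this sketch.

```python
import numpy as np
import pandas as pd


def minute_level_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-minute mean of 'ENMO' and, if present, 'wear'."""
    cols = [c for c in ("ENMO", "wear") if c in data.columns]
    out = data[cols].copy()
    out.index = pd.to_datetime(out.index)   # make sure the index is datetime
    return out.resample("min").mean()


# One hour of 1 Hz data; a Timedelta frequency avoids alias differences
# between pandas versions.
dates = pd.date_range("2023-01-01 00:00:00", periods=3600,
                      freq=pd.Timedelta(seconds=1))
data = pd.DataFrame({"ENMO": np.linspace(0.0, 0.1, 3600), "wear": 1}, index=dates)
minute_data = minute_level_sketch(data)
print(len(data), len(minute_data))
```

Each output row is the mean of the 60 samples recorded within that minute, so the 3600 input rows collapse to 60.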
calibrate_accelerometer(data, sphere_crit, sd_criteria, meta_dict=None, verbose=False)[source]

Calibrate accelerometer data using sphere fitting method.

This function applies accelerometer calibration using the sphere fitting approach to correct for sensor bias and scaling errors. The calibration process fits the accelerometer data to a unit sphere and applies correction factors.

Parameters:
  • data (pd.DataFrame) – Raw accelerometer data with datetime index and columns [‘x’, ‘y’, ‘z’]. Data should be in g units (1g = 9.81 m/s²).

  • sphere_crit (float) – Sphere fitting criterion threshold. Controls the tolerance for sphere fitting. Lower values result in stricter calibration requirements.

  • sd_criteria (float) – Standard deviation criterion threshold. Controls the tolerance for standard deviation of the calibrated data.

  • meta_dict (dict, optional) – Dictionary to store calibration parameters and metadata. If None, an empty dict will be created. Updated with calibration results including: - ‘calibration_offset’: Offset correction factors - ‘calibration_scale’: Scale correction factors

  • verbose (bool, default=False) – Whether to print progress information during calibration.

Returns:

Calibrated accelerometer data with the same structure as input data. The calibrated data has corrected bias and scaling errors.

Return type:

pd.DataFrame

Notes

  • The function uses the skdh.preprocessing.CalibrateAccelerometer class

  • Calibration parameters are stored in meta_dict for future reference

  • The function assumes data is sampled at the frequency specified in meta_dict[‘sf’]

  • If no sampling frequency is found in meta_dict, defaults to 25 Hz

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000),
...     'y': np.random.normal(0, 0.1, 1000),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Calibrate the data
>>> meta_dict = {'sf': 25}
>>> calibrated_data = calibrate_accelerometer(
...     data, sphere_crit=0.3, sd_criteria=0.1,
...     meta_dict=meta_dict, verbose=True
... )
>>> print(f"Calibration offset: {meta_dict.get('calibration_offset')}")
detect_frequency_from_timestamps(timestamps)[source]

Detect sampling frequency by finding the most common time delta.

This function analyzes a series of timestamps to determine the sampling frequency of the data by calculating the time differences between consecutive samples and finding the most frequently occurring interval.

Parameters:

timestamps (pd.Series) – Series or array of datetime objects representing the timestamps of data points. Can be pandas datetime objects, numpy datetime64, or string timestamps that can be converted to datetime.

Returns:

Sampling frequency in Hz (samples per second).

Return type:

float

Raises:

ValueError – If fewer than two timestamps are provided, if no time deltas can be calculated, if the most common time delta is zero, or if the mode cannot be determined.

Notes

  • The function converts all timestamps to pandas datetime format

  • Time deltas are calculated in seconds

  • The most common (mode) time delta is used to determine frequency

  • Frequency is calculated as 1.0 / most_common_delta

Examples

>>> import pandas as pd
>>>
>>> # Regular 25 Hz sampling
>>> timestamps = pd.date_range('2023-01-01', periods=100, freq='40ms')
>>> freq = detect_frequency_from_timestamps(timestamps)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
>>>
>>> # Irregular sampling with some missing points
>>> irregular_times = pd.to_datetime([
...     '2023-01-01 00:00:00.000',
...     '2023-01-01 00:00:00.040',
...     '2023-01-01 00:00:00.080',
...     '2023-01-01 00:00:00.120',
...     '2023-01-01 00:00:00.200',  # Gap here
...     '2023-01-01 00:00:00.240'
... ])
>>> freq = detect_frequency_from_timestamps(irregular_times)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
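
The mode-of-deltas approach described in the notes can be sketched as follows. `detect_frequency_sketch` is a hypothetical name, and the error messages are illustrative only.

```python
import pandas as pd


def detect_frequency_sketch(timestamps) -> float:
    """Hypothetical helper: 1 / (most common gap between timestamps, in seconds)."""
    ts = pd.Series(pd.to_datetime(timestamps)).reset_index(drop=True)
    if len(ts) < 2:
        raise ValueError("At least two timestamps are required")
    deltas = ts.diff().dropna().dt.total_seconds()
    modes = deltas.mode()
    if modes.empty:
        raise ValueError("Could not determine the most common time delta")
    most_common = modes.iloc[0]
    if most_common == 0:
        raise ValueError("Most common time delta is zero")
    return 1.0 / most_common


regular = pd.date_range("2023-01-01", periods=100, freq="40ms")
# Same spacing, but with one 80 ms gap between the fourth and fifth samples.
irregular = pd.Timestamp("2023-01-01") + pd.to_timedelta(
    [0, 40, 80, 120, 200, 240], unit="ms"
)
print(detect_frequency_sketch(regular), detect_frequency_sketch(irregular))
```

Using the mode instead of the mean makes the estimate robust to occasional gaps: the single 80 ms interval does not change the detected rate.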
remove_noise(data, sf, filter_type='lowpass', filter_cutoff=2, verbose=False)[source]

Remove noise from accelerometer data using a Butterworth filter.

This function applies a digital Butterworth filter to remove noise from accelerometer data. The filter can be configured as lowpass, highpass, bandpass, or bandstop depending on the noise characteristics.

Parameters:
  • data (pd.DataFrame) – DataFrame containing accelerometer data with columns [‘x’, ‘y’, ‘z’]. Data should have a datetime index and contain acceleration values in g units.

  • sf (float) – Sampling frequency of the accelerometer data in Hz.

  • filter_type (str, default='lowpass') – Type of filter to apply. Must be one of: - ‘lowpass’: Removes high-frequency noise above cutoff - ‘highpass’: Removes low-frequency noise below cutoff - ‘bandpass’: Keeps frequencies between two cutoff values - ‘bandstop’: Removes frequencies between two cutoff values

  • filter_cutoff (float or list, default=2) – Cutoff frequency(ies) for the filter in Hz. - For lowpass/highpass: single float value - For bandpass/bandstop: list of two values [low_cutoff, high_cutoff]

  • verbose (bool, default=False) – Whether to print progress information during filtering.

Returns:

DataFrame with noise removed from the [‘x’, ‘y’, ‘z’] columns. The filtered data maintains the same structure as the input.

Return type:

pd.DataFrame

Raises:
  • ValueError – If filter_type is ‘bandpass’ or ‘bandstop’ but filter_cutoff is not a list of two values. If filter_type is ‘lowpass’ or ‘highpass’ but filter_cutoff is not a single numeric value. If the input DataFrame is empty.

  • KeyError – If the DataFrame does not contain required columns [‘x’, ‘y’, ‘z’].

Notes

  • Uses scipy.signal.butter and scipy.signal.filtfilt for zero-phase filtering

  • The filter order is fixed at 2 (second-order Butterworth filter)

  • The function applies the same filter to all three axes (x, y, z)

  • Zero-phase filtering is used to avoid phase distortion

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data with noise
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000) + 0.5*np.sin(2*np.pi*0.1*np.arange(1000)),
...     'y': np.random.normal(0, 0.1, 1000) + 0.3*np.cos(2*np.pi*0.05*np.arange(1000)),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Remove high-frequency noise with lowpass filter
>>> filtered_data = remove_noise(data, sf=25, filter_type='lowpass',
...                              filter_cutoff=2, verbose=True)
>>>
>>> # Remove low-frequency drift with highpass filter
>>> filtered_data = remove_noise(data, sf=25, filter_type='highpass',
...                              filter_cutoff=0.1, verbose=True)
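
The zero-phase filtering described in the notes can be reproduced with `scipy.signal.butter` and `scipy.signal.filtfilt`. This is a sketch under assumptions: `butterworth_sketch` is a hypothetical name, and input validation is omitted.

```python
import numpy as np
import pandas as pd
from scipy.signal import butter, filtfilt


def butterworth_sketch(data, sf, filter_type="lowpass", filter_cutoff=2, order=2):
    """Hypothetical helper: second-order Butterworth, applied forward and backward."""
    # Passing fs lets the cutoff be given in Hz instead of normalised units.
    b, a = butter(order, filter_cutoff, btype=filter_type, fs=sf)
    out = data.copy()
    for axis in ("x", "y", "z"):
        out[axis] = filtfilt(b, a, data[axis].to_numpy())
    return out


sf = 25
t = np.arange(1000) / sf
slow = np.sin(2 * np.pi * 0.5 * t)         # 0.5 Hz component to keep
fast = 0.5 * np.sin(2 * np.pi * 10 * t)    # 10 Hz component to remove
index = pd.date_range("2023-01-01", periods=1000, freq="40ms")
data = pd.DataFrame({"x": slow + fast, "y": slow + fast, "z": 1.0 + fast}, index=index)
filtered = butterworth_sketch(data, sf=sf, filter_type="lowpass", filter_cutoff=2)
print(filtered.head())
```

Running the filter in both directions cancels the phase shift of a single pass, so the 0.5 Hz component stays aligned with the original signal while the 10 Hz component is attenuated.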
detect_wear_periods(data, sf, sd_crit, range_crit, window_length, window_skip, meta_dict={}, verbose=False)[source]

Detect periods of device wear using acceleration thresholds.

This function identifies when the accelerometer device is being worn by analyzing the standard deviation and range of acceleration data within sliding windows. The algorithm is based on the assumption that worn devices show more variable acceleration patterns than unworn devices.

Parameters:
  • data (pd.DataFrame) – Preprocessed accelerometer data with datetime index and columns [‘x’, ‘y’, ‘z’]. Data should be in g units and cleaned of major artifacts.

  • sf (float) – Sampling frequency of the accelerometer data in Hz.

  • sd_crit (float) – Standard deviation criterion for wear detection. Threshold for the minimum standard deviation required to classify a window as “worn”.

  • range_crit (float) – Range criterion for wear detection. Threshold for the minimum range of acceleration values required to classify a window as “worn”.

  • window_length (int) – Length of the sliding window in seconds. Longer windows provide more stable wear detection but may miss brief wear periods.

  • window_skip (int) – Number of seconds to skip between consecutive windows. Controls the temporal resolution of wear detection.

  • meta_dict (dict, default={}) – Dictionary to store wear detection metadata and parameters.

  • verbose (bool, default=False) – Whether to print progress information during wear detection.

Returns:

DataFrame with binary wear detection column [‘wear’] where: - 1 indicates the device is being worn - 0 indicates the device is not being worn The DataFrame has the same index as the input data.

Return type:

pd.DataFrame

Notes

  • Uses skdh.preprocessing.AccelThresholdWearDetection for the core algorithm

  • The function converts acceleration data from g to mg units for processing

  • Wear periods are determined by analyzing both standard deviation and range

  • The algorithm is sensitive to the choice of sd_crit and range_crit parameters

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'x': np.random.normal(0, 0.1, 1000),
...     'y': np.random.normal(0, 0.1, 1000),
...     'z': np.random.normal(1, 0.1, 1000)  # Gravity component
... }, index=timestamps)
>>>
>>> # Detect wear periods
>>> wear_data = detect_wear_periods(
...     data, sf=25, sd_crit=0.013, range_crit=0.05,
...     window_length=60, window_skip=30, verbose=True
... )
>>> print(f"Wear time: {wear_data['wear'].sum() / 25:.1f} seconds")

calc_weartime(data, sf, meta_dict, verbose)[source]

Calculate total, wear, and non-wear time from accelerometer data.

This function computes summary statistics about device wear time based on wear detection results. It calculates the total recording duration, time the device was worn, and time the device was not worn.

Parameters:
  • data (pd.DataFrame) – DataFrame containing accelerometer data with a ‘wear’ column indicating wear status (1 for worn, 0 for not worn). Should have a datetime index.

  • sf (float) – Sampling frequency of the accelerometer data in Hz.

  • meta_dict (dict) – Dictionary to store wear time metadata. Will be updated with the following keys:

      • ‘total_time’: Total recording time in seconds

      • ‘wear_time’: Time device was worn in seconds

      • ‘non-wear_time’: Time device was not worn in seconds

  • verbose (bool) – Whether to print progress information during calculation.

Returns:

Nothing is returned; meta_dict is updated in place with the wear time statistics.

Return type:

None

Notes

  • Total time is calculated from the first to last timestamp

  • Wear time is calculated by summing the ‘wear’ column and converting to seconds

  • Non-wear time is calculated as total_time - wear_time

  • All times are stored in seconds in the meta_dict
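
The arithmetic in these notes can be sketched as follows. This is an illustration rather than the library's code: the helper name sketch_weartime is hypothetical, and dividing the sum of the ‘wear’ column by sf to obtain seconds is an assumption about how the conversion is done.

```python
import numpy as np
import pandas as pd


def sketch_weartime(data, sf):
    """Hypothetical helper: return total, wear, and non-wear time in seconds."""
    # Total time: span between the first and last timestamp
    total_time = (data.index[-1] - data.index[0]).total_seconds()
    # Wear time: number of worn samples divided by samples per second (assumed)
    wear_time = float(data['wear'].sum()) / sf
    return {
        'total_time': total_time,
        'wear_time': wear_time,
        'non-wear_time': total_time - wear_time,
    }


# 1000 samples at 25 Hz, with the first 750 samples marked as worn
idx = pd.date_range('2023-01-01', periods=1000, freq='40ms')
wear = np.concatenate([np.ones(750, dtype=int), np.zeros(250, dtype=int)])
stats = sketch_weartime(pd.DataFrame({'wear': wear}, index=idx), sf=25)
```

Because total time is measured between the first and last timestamp, 1000 samples at 25 Hz span 39.96 seconds rather than 40, so the three values need not add up to a whole number of sampling periods.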

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample data with wear detection
>>> timestamps = pd.date_range('2023-01-01', periods=1000, freq='40ms')
>>> data = pd.DataFrame({
...     'wear': np.random.choice([0, 1], 1000, p=[0.3, 0.7])  # 70% wear time
... }, index=timestamps)
>>>
>>> # Calculate wear time statistics
>>> meta_dict = {}
>>> calc_weartime(data, sf=25, meta_dict=meta_dict, verbose=True)
>>> print(f"Total time: {meta_dict['total_time']:.1f} seconds")
>>> print(f"Wear time: {meta_dict['wear_time']:.1f} seconds")
>>> print(f"Non-wear time: {meta_dict['non-wear_time']:.1f} seconds")

Visualization Functions

plot_orig_enmo(acc_handler, resample='15min', wear=True)[source]

Plot the original ENMO values resampled at a specified interval.

This function creates a time series plot of ENMO (Euclidean Norm Minus One) values with optional highlighting of wear and non-wear periods. The data is resampled to reduce noise and improve visualization clarity.

Parameters:
  • acc_handler (DataHandler) – Accelerometer data handler object containing the processed accelerometer data. Must provide a get_sf_data() method that returns a DataFrame with ‘ENMO’ and ‘wear’ columns.

  • resample (str, default='15min') – The resampling interval for the plot. Can be any pandas time frequency string (e.g., ‘5min’, ‘1H’, ‘1D’). Note that pandas 2.2 and later deprecate the uppercase ‘H’ alias in favor of lowercase ‘h’.

  • wear (bool, default=True) – Whether to add color bands for wear and non-wear periods.

      • True: Shows red bands for non-wear periods

      • False: Shows only the ENMO time series

Returns:

Displays a matplotlib plot.

Return type:

None

Notes

  • The function resamples the data using mean aggregation

  • Non-wear periods are highlighted with red bands when wear=True

  • The plot uses a progress bar (tqdm) when processing wear data

  • The figure size is set to 12x6 inches
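
The data preparation behind such a plot can be sketched with pandas alone. This is a hedged illustration, not the function's actual code: the helper names resample_enmo and nonwear_spans are hypothetical, and grouping consecutive non-wear samples into (start, end) spans is an assumption about how the bands could be derived. Each span could then be drawn with matplotlib's Axes.axvspan.

```python
import numpy as np
import pandas as pd


def resample_enmo(sf_data, resample='15min'):
    """Hypothetical helper: mean-aggregate the 'ENMO' column per interval."""
    return sf_data['ENMO'].resample(resample).mean()


def nonwear_spans(wear):
    """Hypothetical helper: (start, end) timestamps of contiguous runs where wear == 0."""
    is_nonwear = wear == 0
    # A new run starts wherever the wear status changes
    run_id = is_nonwear.astype(int).diff().ne(0).cumsum()
    spans = []
    for _, run in wear[is_nonwear].groupby(run_id[is_nonwear]):
        spans.append((run.index[0], run.index[-1]))
    return spans


# One hour of minute-spaced samples with two non-wear stretches
idx = pd.date_range('2023-01-01', periods=60, freq='1min')
wear = np.ones(60, dtype=int)
wear[10:20] = 0
wear[40:45] = 0
sf_data = pd.DataFrame({'ENMO': np.arange(60, dtype=float), 'wear': wear}, index=idx)

enmo_15min = resample_enmo(sf_data, '15min')
spans = nonwear_spans(sf_data['wear'])
```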

Examples

>>> from cosinorage.datahandlers import GenericDataHandler
>>>
>>> # Load data
>>> handler = GenericDataHandler('data.csv')
>>>
>>> # Plot with wear periods highlighted
>>> plot_orig_enmo(handler, resample='30min', wear=True)
>>>
>>> # Plot without wear highlighting
>>> plot_orig_enmo(handler, resample='1H', wear=False)

plot_enmo(handler)[source]

Plot minute-level ENMO values with optional wear/non-wear period highlighting.

This function creates a time series plot of minute-level ENMO values. When the data contains a ‘wear’ column, wear and non-wear periods are highlighted automatically with colored bands.

Parameters:

handler (DataHandler) – Data handler object containing the minute-level ENMO data. Must provide a get_ml_data() method that returns a DataFrame with an ‘ENMO’ column and, optionally, a ‘wear’ column marking wear/non-wear periods.

Returns:

Displays a matplotlib plot showing ENMO values over time with optional wear/non-wear period highlighting in green/red.

Return type:

None

Notes

  • Wear periods are highlighted in green

  • Non-wear periods are highlighted in red

  • The plot automatically adjusts Y-axis limits to show the full range

  • If no ‘wear’ column is present, only the ENMO time series is shown

  • The figure size is set to 12x6 inches

Examples

>>> from cosinorage.datahandlers import GenericDataHandler
>>>
>>> # Load data
>>> handler = GenericDataHandler('data.csv')
>>>
>>> # Plot minute-level ENMO with wear highlighting
>>> plot_enmo(handler)

plot_orig_enmo_freq(acc_handler)[source]

Plot the frequency domain representation of the original ENMO signal using Welch’s method.

This function computes and displays the power spectral density (PSD) of the ENMO signal using Welch’s method, which provides a smoothed estimate of the signal’s frequency content.

Parameters:

acc_handler (DataHandler) – Accelerometer data handler object containing the raw ENMO data. Must provide a get_sf_data() method that returns a DataFrame with an ‘ENMO’ column.

Returns:

Displays a matplotlib plot showing the power spectral density of the ENMO signal computed using Welch’s method.

Return type:

None

Notes

  • Uses scipy.signal.welch for power spectral density estimation

  • Sampling frequency is set to 80 Hz

  • Segment length is set to 1024 samples for frequency resolution

  • The plot shows frequency (Hz) on the x-axis and power spectral density on the y-axis

  • The figure size is set to 20x5 inches
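
The spectral estimate described in these notes can be reproduced in a minimal sketch using scipy.signal.welch with the stated sampling frequency (80 Hz) and segment length (1024 samples). The helper name enmo_psd and the synthetic 5 Hz test signal are assumptions for illustration; all other welch settings are left at SciPy's defaults, which may differ from what the function uses.

```python
import numpy as np
from scipy.signal import welch


def enmo_psd(enmo, fs=80.0, nperseg=1024):
    """Hypothetical helper: Welch power spectral density of an ENMO series."""
    freqs, psd = welch(np.asarray(enmo, dtype=float), fs=fs, nperseg=nperseg)
    return freqs, psd


# Synthetic signal: constant offset plus a 5 Hz oscillation, sampled at 80 Hz
t = np.arange(8192) / 80.0
enmo = 0.05 + 0.02 * np.sin(2 * np.pi * 5.0 * t)

freqs, psd = enmo_psd(enmo)
peak_freq = freqs[np.argmax(psd)]  # frequency bin with the highest power
```

With these settings the frequency bins are spaced 80 / 1024 ≈ 0.078 Hz apart and extend up to the Nyquist frequency of 40 Hz. If the data was recorded at a different rate, the frequency axis of a PSD computed with fs=80 would be scaled incorrectly.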

Examples

>>> from cosinorage.datahandlers import GenericDataHandler
>>>
>>> # Load data
>>> handler = GenericDataHandler('data.csv')
>>>
>>> # Plot frequency domain representation
>>> plot_orig_enmo_freq(handler)