cosinorage.datahandlers Module

Module Contents

Classes

class DataHandler

Bases: object

A base class for data handlers that process and store ENMO data at the minute level.

This class provides a common interface for data handlers with methods to load data, retrieve processed ENMO values, and save data. The load_data and save_data methods are intended to be overridden by subclasses.

raw_data

Raw accelerometer data loaded from the source.

Type:

pd.DataFrame or None

sf_data

Filtered and processed accelerometer data.

Type:

pd.DataFrame or None

ml_data

Minute-level ENMO data calculated from processed data.

Type:

pd.DataFrame or None

meta_dict

Dictionary storing metadata about the data processing.

Type:

dict

__init__()

Initializes an empty DataHandler instance with an empty DataFrame for storing minute-level ENMO values.

Notes

This is a base class constructor. Subclasses should override this method to accept specific parameters for their data sources.

save_data(output_path)

Save minute-level ENMO data to a specified output path.

This method is intended to be implemented by subclasses, specifying the format and structure for saving data.

Parameters:

output_path (str) – The file path where the minute-level ENMO data will be saved.

get_raw_data()

Retrieve the raw data.

Returns:

A DataFrame containing the raw data.

Return type:

pd.DataFrame

get_sf_data()

Retrieve the filtered data.

Returns:

A DataFrame containing the filtered data.

Return type:

pd.DataFrame

get_ml_data()

Retrieve the minute-level ENMO values.

Returns:

A DataFrame containing the minute-level ENMO values.

Return type:

pd.DataFrame

get_meta_data()

Retrieve the metadata.

Returns:

A dictionary containing the metadata.

Return type:

dict
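A minimal sketch of how a subclass might fill in this interface. The base class below is a stand-in written from the attributes and methods described above, not the library's source; BaseHandlerSketch, CsvHandler, its source argument, and the CSV column names are all hypothetical.

```python
import io

import pandas as pd


class BaseHandlerSketch:
    """Stand-in for the documented base class (assumption, not library code)."""

    def __init__(self):
        self.raw_data = None
        self.sf_data = None
        self.ml_data = None
        self.meta_dict = {}

    def load_data(self):
        raise NotImplementedError("subclasses are expected to override load_data")

    def save_data(self, output_path):
        raise NotImplementedError("subclasses are expected to override save_data")

    def get_ml_data(self):
        return self.ml_data

    def get_meta_data(self):
        return self.meta_dict


class CsvHandler(BaseHandlerSketch):
    """Hypothetical subclass that reads minute-level ENMO from a CSV source."""

    def __init__(self, source):
        super().__init__()
        self.source = source  # file path or file-like object
        self.load_data()

    def load_data(self):
        self.raw_data = pd.read_csv(
            self.source, parse_dates=["timestamp"], index_col="timestamp"
        )
        self.sf_data = self.raw_data  # no filtering in this sketch
        self.ml_data = self.raw_data[["ENMO"]]
        self.meta_dict["n_minutes"] = len(self.ml_data)

    def save_data(self, output_path):
        self.ml_data.to_csv(output_path)


csv_text = "timestamp,ENMO\n2023-01-01 00:00:00,0.01\n2023-01-01 00:01:00,0.03\n"
handler = CsvHandler(io.StringIO(csv_text))
print(handler.get_ml_data())
print(handler.get_meta_data())
```

The subclass constructor takes the data source as a parameter and calls load_data itself, which matches the note that subclasses should override the constructor for their own sources.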

Utility Functions

Generic Data Functions

Galaxy Smartwatch Data Functions

UK Biobank Data Functions

NHANES Data Functions

General Utility Functions

filter_incomplete_days(df, data_freq, expected_points_per_day=None)

Filter out data from incomplete days to ensure 24-hour data periods.

This function removes data from days that don’t have the expected number of data points to ensure that only complete 24-hour data is retained for analysis.

Parameters:
  • df (pd.DataFrame) – DataFrame with a datetime index. The index is used to determine the day each row belongs to.

  • data_freq (float) – Frequency of data collection in Hz (e.g., 1/60 for minute-level data).

  • expected_points_per_day (int, optional) – Expected number of data points per day. If None, it is calculated as data_freq * 86400.

Returns:

Filtered DataFrame containing only complete days. Returns an empty DataFrame if an error occurs during processing.

Return type:

pd.DataFrame

Notes

  • Calculates expected points per day as data_freq * 60 * 60 * 24 if not provided

  • Groups data by date and counts points per day

  • Retains only days with sufficient data points

  • Removes the temporary 'DATE' column before returning

  • Handles errors gracefully by returning an empty DataFrame
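The steps in the notes can be sketched roughly as follows. This is an illustrative re-implementation under stated assumptions, not the library's source: keep_complete_days is a hypothetical name, the per-day key is the normalized timestamp rather than a temporary column, and "sufficient" is read as "at least the expected count".

```python
import numpy as np
import pandas as pd


def keep_complete_days(df, data_freq, expected_points_per_day=None):
    """Illustrative sketch; not the library's implementation."""
    if expected_points_per_day is None:
        # round() guards against floating-point error in 1/60 * 86400
        expected_points_per_day = int(round(data_freq * 60 * 60 * 24))
    # Midnight of each row's day serves as the grouping key.
    day = df.index.normalize()
    points_per_day = pd.Series(day).value_counts()
    complete = points_per_day[points_per_day >= expected_points_per_day].index
    return df[day.isin(complete)]


# 5000 minutes = 3 complete days (4320 rows) plus 680 minutes of a fourth day
index = pd.date_range("2023-01-01", periods=5000, freq="min")
df = pd.DataFrame({"value": np.arange(5000)}, index=index)
complete = keep_complete_days(df, data_freq=1 / 60)
print(len(df), len(complete))  # 5000 4320
```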

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data: 5000 minutes = 3 complete days plus 680 minutes of a fourth day
>>> dates = pd.date_range('2023-01-01', periods=5000, freq='min')
>>> data = pd.DataFrame({'value': np.random.randn(5000)}, index=dates)
>>>
>>> # Filter incomplete days (expecting 1440 points per day for minute data)
>>> filtered_data = filter_incomplete_days(data, data_freq=1/60, expected_points_per_day=1440)
>>> # DatetimeIndex.date is a NumPy array without .unique(), so count normalized days instead
>>> print(f"Original days: {data.index.normalize().nunique()}")
>>> print(f"Complete days: {filtered_data.index.normalize().nunique()}")

filter_consecutive_days(df)

Filter DataFrame to retain only the longest sequence of consecutive days.

This function identifies the longest sequence of consecutive days in the data and filters the DataFrame to include only those days. This is important for circadian rhythm analysis which requires continuous data.

Parameters:

df (pd.DataFrame) – DataFrame with datetime index containing the data to filter.

Returns:

Filtered DataFrame containing only the longest sequence of consecutive days.

Return type:

pd.DataFrame

Raises:

ValueError – If fewer than 2 consecutive days are found in the data.

Notes

  • Extracts unique dates from the datetime index

  • Finds the longest consecutive sequence using largest_consecutive_sequence

  • Requires at least 2 consecutive days for valid analysis

  • Filters the DataFrame to include only data from consecutive days

  • Important for circadian rhythm analysis which requires continuous data
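One way to picture this filtering, written with pandas only. It is a sketch under assumptions, not the library's code: keep_longest_run_of_days is a hypothetical name, runs are found with diff and cumsum instead of the helper named in the notes, and on a tie this sketch keeps the earliest run (how the library breaks ties is not stated here).

```python
import pandas as pd


def keep_longest_run_of_days(df):
    """Illustrative sketch; not the library's implementation."""
    day_of_row = df.index.normalize()
    days = day_of_row.unique().sort_values()
    # A new run starts wherever the gap to the previous day is not exactly one day.
    starts_new_run = pd.Series(days).diff() != pd.Timedelta(days=1)
    run_id = starts_new_run.cumsum().to_numpy()
    run_lengths = pd.Series(run_id).groupby(run_id).size()
    if run_lengths.max() < 2:
        raise ValueError("fewer than 2 consecutive days found")
    longest = run_lengths.idxmax()  # first maximum, i.e. the earliest run on a tie
    keep = days[run_id == longest]
    return df[day_of_row.isin(keep)]


index = pd.to_datetime([
    "2023-01-01", "2023-01-02", "2023-01-03",                 # run of 3 days
    "2023-01-05", "2023-01-06", "2023-01-07", "2023-01-08",   # run of 4 days
])
df = pd.DataFrame({"value": range(len(index))}, index=index)
result = keep_longest_run_of_days(df)
print(result.index.day.tolist())  # [5, 6, 7, 8]
```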

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample data with gaps
>>> dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03',
...                         '2023-01-05', '2023-01-06', '2023-01-07'])
>>> data = pd.DataFrame({'value': np.random.randn(len(dates))}, index=dates)
>>>
>>> # Filter to longest consecutive sequence
>>> filtered_data = filter_consecutive_days(data)
>>> print(f"Original dates: {data.index.date.tolist()}")
>>> print(f"Consecutive dates: {filtered_data.index.date.tolist()}")

largest_consecutive_sequence(dates)

Find the longest sequence of consecutive dates in a list.

This function analyzes a list of dates and returns the longest subsequence of consecutive dates. It’s used to identify continuous periods of data for circadian rhythm analysis.

Parameters:

dates (List[datetime]) – List of dates to analyze for consecutive sequences.

Returns:

Longest sequence of consecutive dates found. Returns an empty list if the input is empty.

Return type:

List[datetime]

Notes

  • Sorts and removes duplicate dates before processing

  • Compares dates using timedelta(days=1) for consecutive day detection

  • Maintains the original order within consecutive sequences

  • Handles edge cases like empty lists and single dates

  • Used internally by filter_consecutive_days
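The notes describe a single pass over the sorted, de-duplicated dates. A minimal sketch of that idea follows; longest_run is a hypothetical name and this is not the library's source. With the strict comparison used here, the earliest run wins a tie, which is an assumption of the sketch rather than documented behavior.

```python
from datetime import datetime, timedelta


def longest_run(dates):
    """Illustrative sketch; not the library's implementation."""
    if not dates:
        return []
    ordered = sorted(set(dates))  # sort and drop duplicates
    best, current = [], [ordered[0]]
    for previous, today in zip(ordered, ordered[1:]):
        if today - previous == timedelta(days=1):
            current.append(today)  # still inside the same run
        else:
            if len(current) > len(best):
                best = current
            current = [today]  # gap found, start a new run
    if len(current) > len(best):  # final run is never closed by a gap
        best = current
    return best


# Unsorted input with a duplicate: runs are Jan 1-2 and Jan 5-9
dates = [datetime(2023, 1, d) for d in (9, 1, 2, 2, 5, 6, 7, 8)]
print(longest_run(dates))
```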

Examples

>>> from datetime import datetime
>>>
>>> # Example with gaps in dates
>>> dates = [
...     datetime(2023, 1, 1), datetime(2023, 1, 2), datetime(2023, 1, 3),
...     datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)
... ]
>>> consecutive = largest_consecutive_sequence(dates)
>>> print(f"Longest consecutive sequence: {consecutive}")
>>> # Output: [datetime(2023, 1, 5), datetime(2023, 1, 6), datetime(2023, 1, 7)]
>>>
>>> # Example with single date
>>> single_date = [datetime(2023, 1, 1)]
>>> result = largest_consecutive_sequence(single_date)
>>> print(f"Single date result: {result}")
>>> # Output: [datetime(2023, 1, 1)]

calculate_enmo(data, verbose=False)

Calculate the Euclidean Norm Minus One (ENMO) metric from accelerometer data.

This function computes the ENMO metric, which is a widely used measure in physical activity research for quantifying acceleration while accounting for gravity.

Parameters:
  • data (pd.DataFrame) – DataFrame containing accelerometer data with the following columns. All values should be in g units (1 g ≈ 9.81 m/s²).

      • 'x': X-axis acceleration values

      • 'y': Y-axis acceleration values

      • 'z': Z-axis acceleration values

  • verbose (bool, default=False) – If True, prints processing information.

Returns:

Array of ENMO values. Values are truncated at 0, meaning negative values are set to 0. Returns np.nan if the calculation fails.

Return type:

numpy.ndarray

Notes

  • ENMO = sqrt(x² + y² + z²) - 1

  • Values are truncated at 0 (negative values become 0)

  • ENMO represents acceleration in excess of 1g (gravity)

  • Commonly used in physical activity and sleep research

  • Handles errors gracefully by returning np.nan
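The formula and truncation in the notes amount to max(sqrt(x² + y² + z²) - 1, 0) per sample. A short NumPy sketch of that arithmetic follows; enmo_sketch is a hypothetical name and the error handling of the real function is not reproduced.

```python
import numpy as np
import pandas as pd


def enmo_sketch(data):
    """Illustrative sketch of the formula; not the library's implementation."""
    norm = np.sqrt(data["x"] ** 2 + data["y"] ** 2 + data["z"] ** 2)
    # Subtract 1 g for gravity, then truncate negative values at 0.
    return np.maximum(norm - 1.0, 0.0).to_numpy()


data = pd.DataFrame({
    "x": [0.0, 0.1, 0.0],
    "y": [0.0, 0.1, 0.0],
    "z": [1.0, 1.0, 0.5],
})
values = enmo_sketch(data)
print(values)
```

The first row is a device at rest (norm exactly 1 g, ENMO 0), the second gives sqrt(1.02) - 1 ≈ 0.00995, and the third has a norm below 1 g, so the negative result is truncated to 0.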

Examples

>>> import pandas as pd
>>> import numpy as np
>>>
>>> # Create sample accelerometer data
>>> data = pd.DataFrame({
...     'x': [0.1, 0.2, 0.3],
...     'y': [0.1, 0.2, 0.3],
...     'z': [1.0, 1.1, 1.2]  # Close to 1g (gravity)
... })
>>>
>>> # Calculate ENMO
>>> enmo_values = calculate_enmo(data, verbose=True)
>>> print(f"ENMO values: {enmo_values}")
>>> # Output: [0.010, 0.136, 0.273] (approximately, e.g. sqrt(0.01 + 0.01 + 1.0) - 1 ≈ 0.010)

calculate_minute_level_enmo(data, meta_dict={}, verbose=False)

Resample high-frequency ENMO data to minute-level by averaging over each minute.

This function aggregates high-frequency ENMO data to minute-level resolution using mean aggregation, which is the standard approach for circadian rhythm analysis.

Parameters:
  • data (pd.DataFrame) – DataFrame with a datetime index and an 'ENMO' column containing high-frequency ENMO data. An optional 'wear' column carries wear time information.

  • meta_dict (dict, default={}) – Dictionary containing metadata. Should include:

      • 'sf': Sampling frequency in Hz (defaults to 25 Hz if not specified)

  • verbose (bool, default=False) – If True, prints processing information.

Returns:

DataFrame containing minute-level aggregated data, indexed by datetime at minute resolution, with the following columns:

  • 'ENMO': Mean ENMO value for each minute

  • 'wear': Mean wear time for each minute (if a wear column exists in the input)

Return type:

pd.DataFrame

Raises:

ValueError – If sampling frequency is less than 1/60 Hz (less than one sample per minute).

Notes

  • Uses pandas resample('min').mean() for aggregation

  • Handles both ENMO and wear columns if present

  • Converts index to datetime format

  • Standard preprocessing step for circadian rhythm analysis

  • Handles errors gracefully by returning an empty DataFrame
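The aggregation named in the notes can be tried directly in pandas. The snippet below is a sketch on made-up data, not the library function: each minute holds a constant value so the per-minute means are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Three minutes of 1 Hz data; each minute holds a constant ENMO and wear value.
index = pd.date_range("2023-01-01 00:00:00", periods=180, freq="s")
data = pd.DataFrame(
    {
        "ENMO": np.repeat([0.0, 0.1, 0.2], 60),
        "wear": np.repeat([1, 1, 0], 60),
    },
    index=index,
)

data.index = pd.to_datetime(data.index)  # ensure a datetime index before resampling
minute = data[["ENMO", "wear"]].resample("min").mean()
print(minute)
```

The result has one row per minute (three rows here), with the mean of the 60 samples in each column.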

Examples

>>> import numpy as np
>>> import pandas as pd
>>>
>>> # Create sample high-frequency ENMO data
>>> # ('s' is the seconds alias; uppercase 'S' is deprecated in recent pandas releases)
>>> dates = pd.date_range('2023-01-01 00:00:00', periods=3600, freq='s')  # 1 hour of second-level data
>>> data = pd.DataFrame({
...     'ENMO': np.random.uniform(0, 0.1, 3600),
...     'wear': np.random.choice([0, 1], 3600)
... }, index=dates)
>>>
>>> # Resample to minute level
>>> meta_dict = {'sf': 1}  # 1 Hz sampling frequency
>>> minute_data = calculate_minute_level_enmo(data, meta_dict=meta_dict, verbose=True)
>>> print(f"Original records: {len(data)}")
>>> print(f"Minute-level records: {len(minute_data)}")

detect_frequency_from_timestamps(timestamps)

Detect sampling frequency by finding the most common time delta.

This function analyzes a series of timestamps to determine the sampling frequency of the data by calculating the time differences between consecutive samples and finding the most frequently occurring interval.

Parameters:

timestamps (pd.Series) – Series or array of datetime objects representing the timestamps of data points. Can be pandas datetime objects, numpy datetime64, or string timestamps that can be converted to datetime.

Returns:

Sampling frequency in Hz (samples per second).

Return type:

float

Raises:

ValueError – Raised in any of the following cases:

  • Fewer than two timestamps are provided.

  • No time deltas can be calculated.

  • The most common time delta is zero.

  • The mode cannot be determined.

Notes

  • The function converts all timestamps to pandas datetime format

  • Time deltas are calculated in seconds

  • The most common (mode) time delta is used to determine frequency

  • Frequency is calculated as 1.0 / most_common_delta
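The four steps in the notes map onto a few pandas calls. The following is a sketch under assumptions, not the library's source: detect_frequency_sketch is a hypothetical name and only two of the documented error cases are reproduced.

```python
import pandas as pd


def detect_frequency_sketch(timestamps):
    """Illustrative sketch; not the library's implementation."""
    times = pd.Series(pd.to_datetime(timestamps))
    if len(times) < 2:
        raise ValueError("at least two timestamps are required")
    # Differences between consecutive samples, in seconds.
    deltas = times.diff().dropna().dt.total_seconds()
    most_common = deltas.mode().iloc[0]
    if most_common == 0:
        raise ValueError("most common time delta is zero")
    return 1.0 / most_common


# Samples every 500 ms with one sample missing between 1.5 s and 2.5 s
stamps = [
    "2023-01-01 00:00:00.000",
    "2023-01-01 00:00:00.500",
    "2023-01-01 00:00:01.000",
    "2023-01-01 00:00:01.500",
    "2023-01-01 00:00:02.500",
    "2023-01-01 00:00:03.000",
]
freq = detect_frequency_sketch(stamps)
print(freq)  # 2.0
```

Using the mode rather than the mean makes the estimate robust to occasional gaps: the single 1-second gap above does not shift the result away from 2 Hz.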

Examples

>>> import pandas as pd
>>>
>>> # Regular 25 Hz sampling
>>> timestamps = pd.date_range('2023-01-01', periods=100, freq='40ms')
>>> freq = detect_frequency_from_timestamps(timestamps)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz
>>>
>>> # Irregular sampling with some missing points
>>> irregular_times = pd.to_datetime([
...     '2023-01-01 00:00:00',
...     '2023-01-01 00:00:00.040',
...     '2023-01-01 00:00:00.080',
...     '2023-01-01 00:00:00.120',
...     '2023-01-01 00:00:00.200',  # Gap here
...     '2023-01-01 00:00:00.240'
... ])
>>> freq = detect_frequency_from_timestamps(irregular_times)
>>> print(f"Detected frequency: {freq:.1f} Hz")
Detected frequency: 25.0 Hz

Visualization Functions