Group & Aggregate
SummarizeGroup rows by one or more columns and compute a summary statistic (mean, sum, count, median, etc.) for each group, collapsing the dataset to one row per unique group. This is essential for creating summary tables, computing descriptive statistics by condition, and reducing raw data to group-level results for reporting or further analysis.
When to Use
- You need to compute group-level summaries from individual observations (e.g., mean score per treatment group, total sales per region, median survival time per cohort).
- You want to create a summary table for reporting, with one row per experimental condition or group, ready for inclusion in a manuscript or presentation.
- You are reducing a large dataset to group-level statistics before comparing conditions, creating plots, or exporting results to another tool.
- You need to count observations per group (e.g., sample sizes per treatment arm) to verify that your experimental design is balanced.
Required Inputs
- Group By -- one or more columns that define the grouping variable (e.g., Treatment, TimePoint, Site). Each unique combination of values across these columns defines one output row.
- Aggregation function for each remaining column: Sum, Average (mean), Count (non-null values), Min, Max, Median, Std Dev, or None (exclude the column from the output). Choose the function that makes sense for each variable.
What Changes
- Row count decreases to one row per unique combination of the Group By columns. If you group by Treatment with two levels (A, B), the result has exactly two rows.
- Columns set to "None" are excluded from the output entirely and do not appear in the result.
- Cell values are replaced by the computed aggregate (mean, sum, etc.). The original individual observations are no longer visible in the aggregated output.
- The Group By columns appear first in the output, followed by the aggregated value columns in their original order.
Common Pitfalls
- Numeric aggregation functions (Sum, Average, Median, Std Dev) require numeric columns. Applying them to text columns produces errors or NaN values. Use Count for text columns if you need a summary.
- The Count function counts non-null values, not total rows. If a column has missing values, the count will be less than the group size. Use a column with no missing values for accurate group counts.
- Setting a column to "None" removes it from the output entirely. If you want to keep a column visible as a row-count indicator, use Count instead of None.
- Grouping by a column with many unique values (e.g., a continuous numeric column or a free-text field) may produce as many output rows as input rows, defeating the purpose of aggregation. Group by categorical variables for meaningful summaries.
Example: Before & After
Before
| Group | Score |
|---|---|
| A | 80 |
| A | 90 |
| B | 70 |
| B | 85 |
→
After
| Group | Score |
|---|---|
| A | 85 |
| B | 77.5 |
Group By: Group, Aggregation: Score -> Average. Group A's mean is (80 + 90) / 2 = 85; Group B's mean is (70 + 85) / 2 = 77.5.