CSV Transformation Best Practices
CSV files are simple but often messy. Good data transformation is about understanding the data first, then applying changes systematically.
Understanding the Data First
Before any transformation, answer these questions:
- How many rows and columns?
- What are the column names?
- What data types should each column be?
- Are there obvious issues (blanks, duplicates, inconsistencies)?
- What's the desired end state?
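Most of these questions can be answered with a quick profiling pass. The following is a minimal sketch using Python's standard `csv` module; the `profile` function and the sample data are illustrative assumptions, not part of any particular tool.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; in practice, read the real file instead.
SAMPLE = """name,age,city
John,30,NYC
Ana,,Boston
John,30,NYC
"""

def profile(text):
    """Return basic facts about CSV text: shape, headers, blanks, duplicate rows."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    # Count blank (empty or whitespace-only) cells per column.
    blanks = {c: sum(1 for r in rows if not (r[c] or "").strip()) for c in columns}
    # Count rows that repeat an earlier row across all columns.
    counts = Counter(tuple(r[c] for c in columns) for r in rows)
    duplicate_rows = sum(n - 1 for n in counts.values())
    return {
        "rows": len(rows),
        "columns": columns,
        "blanks": blanks,
        "duplicate_rows": duplicate_rows,
    }

report = profile(SAMPLE)
print(report)
```

Data types and the desired end state still need a human decision; the report only shows where to look.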
Common Cleaning Operations
Missing Data
- Identify: Which columns have blanks/nulls?
- Options:
  - Remove rows with missing data
  - Fill with a default value
  - Fill with a calculated value (mean, median)
  - Leave as-is (if downstream consumers can handle blanks)
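The first three options can be sketched in a few lines of Python. The rows, column names, and the `"Unknown"` default below are made up for illustration; which option is right depends on the dataset.

```python
import statistics

# Hypothetical rows, as csv.DictReader would produce them (all values are strings).
rows = [
    {"name": "John", "age": "30", "city": "NYC"},
    {"name": "Ana", "age": "", "city": "Boston"},
    {"name": "Li", "age": "40", "city": ""},
]

def is_blank(value):
    return value is None or value.strip() == ""

def fill(row, column, replacement):
    """Return a copy of row with a blank value in column replaced."""
    if is_blank(row[column]):
        return {**row, column: replacement}
    return dict(row)

# Option 1: remove rows with any missing value.
complete_rows = [r for r in rows if not any(is_blank(v) for v in r.values())]

# Option 2: fill with a default value.
with_default = [fill(r, "city", "Unknown") for r in rows]

# Option 3: fill with a calculated value (median of the known ages).
known_ages = [float(r["age"]) for r in rows if not is_blank(r["age"])]
median_age = statistics.median(known_ages)
with_median = [fill(r, "age", str(median_age)) for r in rows]
```

Each option builds a new list, so the original `rows` stay untouched for comparison.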
Duplicates
- Define: What makes a row a duplicate? (All columns, or a subset?)
- Options:
  - Remove all copies of duplicated rows
  - Keep the first or last occurrence
  - Merge duplicate rows
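As a rough sketch, the duplicate definition can be passed in as a list of key columns. The `dedupe` and `drop_all_duplicated` helpers and the sample rows are hypothetical names chosen for this example.

```python
from collections import Counter

# Hypothetical rows where "email" defines a duplicate.
rows = [
    {"id": "1", "email": "a@example.com", "plan": "free"},
    {"id": "2", "email": "b@example.com", "plan": "free"},
    {"id": "3", "email": "a@example.com", "plan": "pro"},
]

def row_key(row, key_columns):
    return tuple(row[c] for c in key_columns)

def dedupe(rows, key_columns, keep="first"):
    """Keep one row per key; keep="first" or "last" picks the surviving occurrence.

    Output order follows the first appearance of each key.
    """
    chosen = {}
    for row in rows:
        key = row_key(row, key_columns)
        if keep == "last" or key not in chosen:
            chosen[key] = row
    return list(chosen.values())

def drop_all_duplicated(rows, key_columns):
    """Remove every row whose key appears more than once."""
    counts = Counter(row_key(r, key_columns) for r in rows)
    return [r for r in rows if counts[row_key(r, key_columns)] == 1]

first_kept = dedupe(rows, ["email"])
last_kept = dedupe(rows, ["email"], keep="last")
unique_only = drop_all_duplicated(rows, ["email"])
```

Merging duplicates is left out because the merge rule (for example, which `plan` wins) is specific to each dataset.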
Formatting Issues
- Whitespace: Trim leading/trailing spaces
- Case: Standardize to upper/lower/title case
- Dates: Convert to consistent format
- Numbers: Remove currency symbols, standardize decimals
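The formatting fixes above might look like this in Python. The function names are illustrative, and the accepted date formats and the "." decimal mark are assumptions that must be checked against the actual data (for example, `03/04/2024` is read here as March 4).

```python
from datetime import datetime

def clean_name(value):
    """Trim surrounding whitespace and standardize to title case."""
    return value.strip().title()

def clean_date(value, input_formats=("%Y-%m-%d", "%m/%d/%Y")):
    """Try each assumed input format and return an ISO 8601 date (YYYY-MM-DD)."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {value!r}")

def clean_amount(value):
    """Strip currency symbols and thousands separators; assumes '.' is the decimal mark."""
    stripped = value.strip().replace("$", "").replace("€", "").replace(",", "")
    return float(stripped)

print(clean_name("  john SMITH "))
print(clean_date("03/15/2024"))
print(clean_amount("$1,234.50"))
```

Raising on an unrecognized date is a deliberate choice in this sketch: silently guessing a format tends to hide problems until later.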
Data Types
- Convert strings to numbers where appropriate
- Parse dates from text
- Standardize booleans (yes/no → true/false)
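A minimal sketch of type conversion, assuming every CSV value arrives as a string. The sets of accepted boolean spellings are an assumption; unrecognized values become `None` so they can be reviewed rather than guessed.

```python
TRUE_VALUES = {"yes", "y", "true", "1"}
FALSE_VALUES = {"no", "n", "false", "0"}

def to_bool(value):
    """Map common yes/no spellings to True/False; None for anything else."""
    text = value.strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    return None

def to_int(value):
    """Convert a string to int; None if it is not a whole number."""
    try:
        return int(value.strip())
    except ValueError:
        return None

row = {"name": "John", "age": "30", "active": "Yes"}
typed = {
    "name": row["name"],
    "age": to_int(row["age"]),
    "active": to_bool(row["active"]),
}
print(typed)
```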
Common Transform Operations
Column Operations
- Rename: Change column headers
- Reorder: Rearrange column sequence
- Add: Create new columns (calculated or constant)
- Remove: Drop unnecessary columns
- Split: Break one column into multiple (e.g., "John Smith" → "John", "Smith")
- Combine: Merge multiple columns into one
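With rows held as dictionaries, all six column operations reduce to building a new dictionary per row. The column names below are invented for illustration, and splitting on the first space is a simplifying assumption that will mishandle names such as "Mary Ann Smith".

```python
# Hypothetical input row.
rows = [
    {"Full Name": "John Smith", "city": "NYC", "state": "NY", "zip": "10001", "internal_id": "x1"},
]

def transform(row):
    first, _, last = row["Full Name"].partition(" ")   # Split on the first space
    return {
        # Reorder: keys are emitted in the order written here.
        "first_name": first,
        "last_name": last,
        "location": f"{row['city']}, {row['state']}",  # Combine two columns
        "postcode": row["zip"],                        # Rename zip -> postcode
        "source": "import",                            # Add a constant column
        # Remove: internal_id is simply not copied over.
    }

result = [transform(r) for r in rows]
print(result)
```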
Row Operations
- Filter: Keep rows matching criteria
- Sort: Order by one or more columns
- Sample: Take subset of rows
- Aggregate: Group and summarize (count, sum, average)
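A short sketch of the four row operations on a list of dictionaries. The data is made up, and amounts are assumed to be already converted to numbers.

```python
import random
from collections import defaultdict

rows = [
    {"city": "NYC", "amount": 10.0},
    {"city": "Boston", "amount": 5.0},
    {"city": "NYC", "amount": 30.0},
    {"city": "Boston", "amount": 15.0},
]

# Filter: keep rows matching a criterion.
filtered = [r for r in rows if r["amount"] >= 10]

# Sort: by city ascending, then amount descending.
ordered = sorted(rows, key=lambda r: (r["city"], -r["amount"]))

# Sample: a seeded generator makes the subset repeatable.
sample = random.Random(0).sample(rows, k=2)

# Aggregate: group by city, then summarize each group.
groups = defaultdict(list)
for r in rows:
    groups[r["city"]].append(r["amount"])
summary = {
    city: {"count": len(v), "sum": sum(v), "average": sum(v) / len(v)}
    for city, v in groups.items()
}
print(summary)
```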
Value Operations
- Replace: Find and replace values
- Map: Transform values using lookup
- Calculate: Create derived values
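The three value operations can be sketched as one per-row function. `STATE_NAMES` is a hypothetical lookup table, and treating `"N/A"` as blank is an assumption about this sample data.

```python
# Hypothetical lookup table for the Map step.
STATE_NAMES = {"NY": "New York", "MA": "Massachusetts"}

rows = [
    {"state": "NY", "status": "N/A", "price": "10", "qty": "3"},
    {"state": "TX", "status": "ok", "price": "2.5", "qty": "4"},
]

def transform(row):
    out = dict(row)
    # Replace: treat the placeholder "N/A" as blank.
    out["status"] = "" if row["status"] == "N/A" else row["status"]
    # Map: translate via the lookup, keeping unknown codes unchanged.
    out["state"] = STATE_NAMES.get(row["state"], row["state"])
    # Calculate: derive a total from two existing columns.
    out["total"] = float(row["price"]) * int(row["qty"])
    return out

result = [transform(r) for r in rows]
print(result)
```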
Merge Operations
Join Types
- Inner: Only rows whose key appears in both files
- Left: All rows from the first file, plus matching data from the second
- Right: All rows from the second file, plus matching data from the first
- Outer: All rows from both files, matched where possible
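The four join types differ only in what happens to unmatched rows. The `join` function below is a hand-rolled sketch for illustration, not a library API; it assumes both inputs are non-empty and share no column names other than the key.

```python
def join(left, right, key, how="inner"):
    """Join two non-empty lists of dicts on one key column.

    how is "inner", "left", "right", or "outer". Unmatched sides are
    filled with empty strings.
    """
    left_blank = {column: "" for column in left[0]}
    right_blank = {column: "" for column in right[0]}
    right_index = {}
    for right_row in right:
        right_index.setdefault(right_row[key], []).append(right_row)

    result, matched = [], set()
    for left_row in left:
        matches = right_index.get(left_row[key], [])
        if matches:
            matched.add(left_row[key])
            result.extend({**left_row, **right_row} for right_row in matches)
        elif how in ("left", "outer"):
            result.append({**left_blank, **right_blank, **left_row})
    if how in ("right", "outer"):
        result.extend(
            {**left_blank, **right_row}
            for right_row in right
            if right_row[key] not in matched
        )
    return result

customers = [{"id": "1", "name": "John"}, {"id": "2", "name": "Ana"}]
orders = [{"id": "1", "total": "30"}, {"id": "3", "total": "12"}]

inner = join(customers, orders, "id")
left_join = join(customers, orders, "id", how="left")
right_join = join(customers, orders, "id", how="right")
outer = join(customers, orders, "id", how="outer")
```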
Key Matching
- Single column: Simple match on one field
- Multiple columns: Composite key matching
- Fuzzy matching: When exact match isn't possible
Common Issues
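- Duplicate keys: What happens when a key matches several rows in the other file? Each matching pair produces an output row, so the row count can grow.
- Missing keys: How should rows without a match be handled?
- Column name conflicts: Both files have columns with the same name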
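These issues are easier to handle when they are detected before the merge. The following sketch checks a composite key for duplicates, non-matches, and conflicting column names; `key_report` and the sample rows are illustrative assumptions, and fuzzy matching is not covered here.

```python
from collections import Counter

def key_report(left, right, key_columns):
    """Summarize key problems between two non-empty lists of dicts."""
    def keys(rows):
        return [tuple(r[c] for c in key_columns) for r in rows]

    def duplicated(key_list):
        return sorted(k for k, n in Counter(key_list).items() if n > 1)

    left_keys, right_keys = keys(left), keys(right)
    shared_columns = set(left[0]) & set(right[0])
    return {
        "duplicate_left_keys": duplicated(left_keys),
        "duplicate_right_keys": duplicated(right_keys),
        "left_only": sorted(set(left_keys) - set(right_keys)),
        "right_only": sorted(set(right_keys) - set(left_keys)),
        "conflicting_columns": sorted(shared_columns - set(key_columns)),
    }

left = [
    {"id": "1", "region": "east", "name": "John"},
    {"id": "2", "region": "east", "name": "Ana"},
]
right = [
    {"id": "1", "region": "east", "name": "J. Smith", "total": "30"},
    {"id": "1", "region": "east", "name": "J. Smith", "total": "5"},
    {"id": "3", "region": "west", "name": "Li", "total": "12"},
]

report = key_report(left, right, ["id", "region"])
print(report)
```

A conflicting column such as `name` would typically be renamed on one side (for example with a suffix) before joining.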
Format Conversions
CSV to JSON
name,age,city
John,30,NYC
→
[{"name":"John","age":"30","city":"NYC"}]
CSV to Markdown Table
| name | age | city |
|------|-----|------|
| John | 30 | NYC |
Encoding
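- UTF-8 is the usual default and the preferred encoding
- Watch for garbled special characters (accents, symbols), which usually mean the file was decoded with the wrong encoding
- Excel may save CSV files in other encodings, such as a legacy Windows code page or UTF-8 with a byte order mark (BOM)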
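The two conversions and the encoding advice can be combined in one small sketch. It assumes the input is UTF-8, possibly with a BOM as some Excel exports produce; `to_markdown` is an illustrative helper that does not escape pipe characters inside values.

```python
import csv
import io
import json

# Sample bytes as a UTF-8 export with a leading byte order mark would contain them.
raw = "name,age,city\nJohn,30,NYC\n".encode("utf-8-sig")

# Decoding with "utf-8-sig" drops the BOM if present and also accepts plain UTF-8.
text = raw.decode("utf-8-sig")
rows = list(csv.DictReader(io.StringIO(text)))

# CSV to JSON: values stay strings unless converted explicitly.
as_json = json.dumps(rows, separators=(",", ":"))

def to_markdown(rows):
    """Render a non-empty list of dicts as a Markdown table."""
    headers = list(rows[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    lines += ["| " + " | ".join(r[h] for h in headers) + " |" for r in rows]
    return "\n".join(lines)

as_markdown = to_markdown(rows)
print(as_json)
print(as_markdown)
```

Decoding with plain `"utf-8"` instead would leave the BOM glued to the first header name, which is a common cause of "column not found" errors.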
Validation
After transformation, verify:
- Row count (expected vs actual)
- Column count
- Sample values look correct
- No unexpected nulls introduced
- Data types are correct
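Several of these checks can be automated. The `validate` function below is a sketch under the assumption that transformed rows are dictionaries with typed values; checking that sample values look correct still needs a human eye.

```python
def validate(rows, expected_rows, expected_columns, column_types):
    """Return a list of problems; an empty list means every check passed.

    column_types maps a column name (present in expected_columns) to a Python type.
    """
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"row count: expected {expected_rows}, got {len(rows)}")
    for number, row in enumerate(rows, start=1):
        if list(row) != expected_columns:
            problems.append(f"row {number}: columns are {list(row)}")
            continue
        for column, expected_type in column_types.items():
            value = row[column]
            if value is None:
                problems.append(f"row {number}: {column} is null")
            elif not isinstance(value, expected_type):
                problems.append(f"row {number}: {column} is {type(value).__name__}")
    return problems

types = {"name": str, "age": int}
good = [{"name": "John", "age": 30}, {"name": "Ana", "age": 25}]
bad = [{"name": "John", "age": "30"}, {"name": "Ana", "age": None}]

good_problems = validate(good, 2, ["name", "age"], types)
bad_problems = validate(bad, 3, ["name", "age"], types)
print(bad_problems)
```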
Output Options
Full Data
- Complete transformed dataset
- Suitable for small to medium files
Summary
- First N rows as preview
- Row/column counts
- Basic statistics
Sample
- Random subset for verification
- Useful for large files
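The summary and sample outputs might be produced as follows. Both helpers are hypothetical, and `summarize` assumes a non-empty dataset with one already-numeric column.

```python
import random
import statistics

rows = [
    {"city": "NYC", "amount": 10},
    {"city": "Boston", "amount": 20},
    {"city": "NYC", "amount": 30},
]

def summarize(rows, numeric_column, preview_size=2):
    """Preview, counts, and basic statistics for one numeric column."""
    values = [r[numeric_column] for r in rows]
    return {
        "preview": rows[:preview_size],
        "row_count": len(rows),
        "column_count": len(rows[0]),
        "stats": {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
        },
    }

def sample_rows(rows, k, seed=0):
    """Random subset of at most k rows; the seed makes it repeatable."""
    return random.Random(seed).sample(rows, min(k, len(rows)))

summary = summarize(rows, "amount")
subset = sample_rows(rows, 2)
print(summary)
```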
Best Practices
- Preview first: Look at sample before transforming
- Document changes: Track what was done
- Preserve original: Don't modify source files
- Validate output: Check results make sense
- Handle errors: Decide what to do with problematic rows
Common Issues
| Issue | Solution |
|---|---|
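| Comma in values | Enclose the value in double quotes |
| Newlines in values | Enclose the value in double quotes and use a parser that supports quoted fields |
| Different delimiters | Detect the delimiter or specify it explicitly |
| Header issues | Confirm whether the first row is a header |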
| Encoding problems | Convert to UTF-8 |
| Large files | Process in chunks |
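Python's standard `csv` module covers several rows of this table. The sketch below shows quoting, delimiter detection with `csv.Sniffer`, and chunked reading; `read_in_chunks` is an illustrative helper, and the sample data is made up.

```python
import csv
import io
from itertools import islice

# Quoting: csv.writer quotes fields that contain commas or newlines,
# and csv.reader restores them.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Smith, John", "line one\nline two"])
round_trip = next(csv.reader(io.StringIO(buffer.getvalue())))

# Delimiters: let Sniffer choose among a set of candidates.
semicolon_text = "name;age;city\nJohn;30;NYC\nAna;25;Boston\n"
dialect = csv.Sniffer().sniff(semicolon_text, delimiters=",;\t|")
parsed = list(csv.DictReader(io.StringIO(semicolon_text), dialect=dialect))

# Large files: yield fixed-size batches instead of loading every row.
def read_in_chunks(lines, chunk_size):
    """Yield lists of up to chunk_size rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

chunk_text = "id\n1\n2\n3\n"
chunk_sizes = [len(c) for c in read_in_chunks(io.StringIO(chunk_text), 2)]
```

`csv.Sniffer` is a heuristic, so specifying the delimiter explicitly is safer whenever it is known. For a real file, `read_in_chunks` can be given an open file object in place of the `io.StringIO` used here.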