
CSV Transformation Best Practices

CSV files are simple but often messy. Good data transformation is about understanding the data first, then applying changes systematically.

Understanding the Data First

Before any transformation:

  • How many rows and columns?
  • What are the column names?
  • What data types should each column be?
  • Are there obvious issues (blanks, duplicates, inconsistencies)?
  • What's the desired end state?
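
A quick profiling pass with pandas can answer most of these; a minimal sketch (the file name is a placeholder):

import pandas as pd

df = pd.read_csv("input.csv")       # placeholder path
print(df.shape)                     # (rows, columns)
print(df.columns.tolist())          # column names
print(df.dtypes)                    # inferred type per column
print(df.isna().sum())              # blank/null count per column
print(df.duplicated().sum())        # fully duplicated rows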

Common Cleaning Operations

Missing Data

  • Identify: Which columns have blanks/nulls?
  • Options:
    • Remove rows with missing data
    • Fill with default value
    • Fill with calculated value (mean, median)
    • Leave as-is (if downstream processing can handle it)
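
A sketch of these options using pandas (the data and column names are illustrative):

import io
import pandas as pd

raw = "name,age,country\nJohn,30,\nJane,,US\n"
df = pd.read_csv(io.StringIO(raw))

dropped = df.dropna(subset=["country"])             # remove rows missing a required field
df["country"] = df["country"].fillna("unknown")     # fill with a default value
df["age"] = df["age"].fillna(df["age"].median())    # fill with a calculated value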

Duplicates

  • Define: What makes a row a duplicate? (all columns, or a subset?)
  • Options:
    • Remove all duplicates
    • Keep first/last occurrence
    • Merge duplicate rows
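
In pandas, the subset argument defines what counts as a duplicate and keep controls which occurrence survives; a sketch with illustrative data:

import io
import pandas as pd

raw = "id,name,city\n1,John,NYC\n1,John,NYC\n1,John,Boston\n"
df = pd.read_csv(io.StringIO(raw))

exact = df.drop_duplicates()                               # duplicate = every column equal
by_key = df.drop_duplicates(subset=["id"], keep="first")   # duplicate = same id, keep first seen
# keep="last" keeps the last occurrence; merging duplicates typically uses groupby + agg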

Formatting Issues

  • Whitespace: Trim leading/trailing spaces
  • Case: Standardize to upper/lower/title case
  • Dates: Convert to consistent format
  • Numbers: Remove currency symbols, standardize decimals
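
A sketch of these cleanups with pandas (the columns and formats are illustrative):

import io
import pandas as pd

raw = 'name,city,signup,price\n John ,new york,2024-01-05,"$1,200.50"\n'
df = pd.read_csv(io.StringIO(raw))

df["name"] = df["name"].str.strip()                                            # trim whitespace
df["city"] = df["city"].str.title()                                            # standardize case
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")                   # consistent date handling
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)   # strip symbols, standardize decimals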

Data Types

  • Convert strings to numbers where appropriate
  • Parse dates from text
  • Boolean standardization (yes/no → true/false)
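
A pandas sketch; errors="coerce" turns unparseable values into NaN/NaT instead of failing (the data is illustrative):

import io
import pandas as pd

raw = "age,joined,active\n30,2024-01-05,yes\nn/a,not a date,No\n"
df = pd.read_csv(io.StringIO(raw))

df["age"] = pd.to_numeric(df["age"], errors="coerce")                      # strings -> numbers
df["joined"] = pd.to_datetime(df["joined"], errors="coerce")               # parse dates from text
df["active"] = df["active"].str.lower().map({"yes": True, "no": False})    # yes/no -> true/false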

Common Transform Operations

Column Operations

  • Rename: Change column headers
  • Reorder: Rearrange column sequence
  • Add: Create new columns (calculated or constant)
  • Remove: Drop unnecessary columns
  • Split: Break one column into multiple (e.g., "John Smith" → "John", "Smith")
  • Combine: Merge multiple columns into one
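
Each of these maps to a short pandas operation; a sketch with illustrative column names:

import io
import pandas as pd

raw = "fname,full_name,qty,unit_price,internal_id\nJohn,John Smith,2,9.99,abc\n"
df = pd.read_csv(io.StringIO(raw))

df = df.rename(columns={"fname": "first_name"})                             # rename
df = df.drop(columns=["internal_id"])                                       # remove
df["total"] = df["qty"] * df["unit_price"]                                  # add a calculated column
df[["first", "last"]] = df["full_name"].str.split(" ", n=1, expand=True)    # split one column into two
df["label"] = df["first"] + " / " + df["last"]                              # combine columns
df = df[["first_name", "first", "last", "label", "qty", "unit_price", "total"]]   # reorder (leaving full_name out also drops it)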

Row Operations

  • Filter: Keep rows matching criteria
  • Sort: Order by one or more columns
  • Sample: Take subset of rows
  • Aggregate: Group and summarize (count, sum, average)
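
A sketch with pandas (the data and sample size are illustrative):

import io
import pandas as pd

raw = "name,age,city\nJohn,30,NYC\nJane,17,NYC\nAli,45,LA\n"
df = pd.read_csv(io.StringIO(raw))

adults = df[df["age"] >= 18]                                                       # filter
ordered = df.sort_values(["city", "age"], ascending=[True, False])                 # sort by multiple columns
sample = df.sample(n=2, random_state=0)                                            # random subset
by_city = df.groupby("city").agg(rows=("age", "size"), avg_age=("age", "mean"))    # aggregate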

Value Operations

  • Replace: Find and replace values
  • Map: Transform values using lookup
  • Calculate: Create derived values
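
A sketch with pandas (the lookup table and tax rate are illustrative):

import io
import pandas as pd

raw = "city,state,price\nNYC,NY,100\nLA,CA,200\n"
df = pd.read_csv(io.StringIO(raw))

df["city"] = df["city"].replace({"NYC": "New York", "LA": "Los Angeles"})   # find and replace
df["region"] = df["state"].map({"NY": "East", "CA": "West"})                # transform via a lookup
df["price_with_tax"] = df["price"] * 1.08                                   # derived value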

Merge Operations

Join Types

  • Inner: Only rows that match in both files
  • Left: All rows from first file, matching from second
  • Right: All rows from second file, matching from first
  • Outer: All rows from both files
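
In pandas, these correspond to the how argument of merge; a sketch with illustrative data:

import io
import pandas as pd

customers = pd.read_csv(io.StringIO("customer_id,name\n1,John\n2,Jane\n"))
orders = pd.read_csv(io.StringIO("customer_id,total\n1,50\n3,75\n"))

inner = customers.merge(orders, on="customer_id", how="inner")   # only customer 1
left = customers.merge(orders, on="customer_id", how="left")     # John and Jane; Jane's total is NaN
right = customers.merge(orders, on="customer_id", how="right")   # customers 1 and 3; 3 has no name
outer = customers.merge(orders, on="customer_id", how="outer")   # customers 1, 2, and 3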

Key Matching

  • Single column: Simple match on one field
  • Multiple columns: Composite key matching
  • Fuzzy matching: When exact matches aren't possible (e.g., misspellings or formatting differences)

Common Issues

  • Duplicate keys: What happens when one file has multiple matches?
  • Missing keys: How to handle non-matches?
  • Column name conflicts: Both files have columns with same name
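
pandas' merge has options that surface these problems instead of hiding them; a sketch with illustrative data:

import io
import pandas as pd

customers = pd.read_csv(io.StringIO("customer_id,name\n1,John\n2,Jane\n"))
orders = pd.read_csv(io.StringIO("customer_id,name,total\n1,Order A,50\n1,Order B,75\n"))

merged = customers.merge(
    orders,
    on="customer_id",
    how="left",
    suffixes=("_cust", "_ord"),    # resolves the "name" column conflict
    indicator=True,                # adds a _merge column showing match status
    validate="one_to_many",        # raises if duplicate keys break the expected relationship
)
unmatched = merged[merged["_merge"] == "left_only"]   # rows with no match in orders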

Format Conversions

CSV to JSON

name,age,city
John,30,NYC
→
[{"name":"John","age":"30","city":"NYC"}]

CSV to Markdown Table

| name | age | city |
|------|-----|------|
| John | 30  | NYC  |
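
One way to produce such a table is pandas' DataFrame.to_markdown, which depends on the optional tabulate package; a sketch:

import io
import pandas as pd

df = pd.read_csv(io.StringIO("name,age,city\nJohn,30,NYC\n"))
print(df.to_markdown(index=False))   # requires the optional "tabulate" package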

Encoding

  • UTF-8 is the default and preferred encoding
  • Watch for encoding issues with special characters
  • Excel often writes CSVs in a legacy code page (e.g., Windows-1252) or as UTF-8 with a byte-order mark
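
A sketch for reading an Excel export and writing clean UTF-8 back out (the file names are placeholders):

import pandas as pd

# "utf-8-sig" strips the byte-order mark that Excel's "CSV UTF-8" option adds;
# older Excel exports may instead need a code page such as encoding="cp1252".
df = pd.read_csv("export.csv", encoding="utf-8-sig")     # placeholder path
df.to_csv("clean.csv", index=False, encoding="utf-8")    # placeholder path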

Validation

After transformation, verify:

  • Row count (expected vs actual)
  • Column count
  • Sample values look correct
  • No unexpected nulls introduced
  • Data types are correct
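
These checks are cheap to script; a sketch that assumes the expected columns and types are known (the file names are placeholders):

import pandas as pd

before = pd.read_csv("input.csv")      # placeholder paths
after = pd.read_csv("output.csv")

assert len(after) == len(before), "row count changed unexpectedly"
assert list(after.columns) == ["name", "age", "city"], "unexpected columns"
assert after["age"].notna().all(), "nulls introduced in age"
assert pd.api.types.is_numeric_dtype(after["age"]), "age is not numeric"
print(after.head())                    # eyeball a few sample values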

Output Options

Full Data

  • Complete transformed dataset
  • Suitable for small to medium files

Summary

  • First N rows as preview
  • Row/column counts
  • Basic statistics

Sample

  • Random subset for verification
  • Useful for large files

Best Practices

  1. Preview first: Look at sample before transforming
  2. Document changes: Track what was done
  3. Preserve original: Don't modify source files
  4. Validate output: Check results make sense
  5. Handle errors: Decide in advance what to do with problematic rows (skip, fix, or fail)

Common Issues

  • Commas in values: Use quoted strings
  • Newlines in values: Use proper escaping
  • Different delimiters: Detect or specify the delimiter
  • Header issues: Check whether the first row is a header
  • Encoding problems: Convert to UTF-8
  • Large files: Process in chunks
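
Two of these fixes in code: csv.Sniffer guesses the delimiter from a sample, and chunksize streams a large file instead of loading it all at once (the file name is a placeholder):

import csv
import pandas as pd

with open("big.csv", newline="") as f:                 # placeholder path
    dialect = csv.Sniffer().sniff(f.read(4096))        # guess the delimiter from a sample

total_rows = 0
for chunk in pd.read_csv("big.csv", sep=dialect.delimiter, chunksize=100_000):
    total_rows += len(chunk)                           # replace with real per-chunk processing
print(total_rows)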