Optimizing KNIME Workflow Runtime using Parquet

Introduction

Enterprise analytics workflows routinely ingest millions of rows from regulated databases, legacy Excel report chains, and EUC (End-User Computing) remediation sources. In KNIME, when these datasets pass into Python Script nodes, the JVM serialises the entire table across the Java–Python bridge — row by row, type by type. At a few hundred thousand rows this is tolerable. At five million rows it becomes a workflow killer.

This guide documents a production-tested pattern: writing interim data to Apache Parquet before consuming it in Python. Across real-world KNIME deployments, this single architectural change reduced Python node execution time by 4–8× and cut infrastructure compute costs measurably — without touching business logic.

The Root Problem: How KNIME Passes Data to Python

Before choosing Parquet, it helps to understand precisely where the bottleneck lives.

When KNIME's Python Script node receives a table, the runtime:

•Serialises the KNIME BufferedDataTable through the JVM–Python bridge (based on Apache Arrow's IPC format internally, but still subject to JVM memory pressure and bridge latency)
•Deserialises it into a pandas.DataFrame inside the Python process
•Re-serialises results back across the bridge on completion

Each column goes through type negotiation. Strings are the worst offenders — KNIME's internal String representation is far heavier than Python's. For a 5-million-row table with 40 columns, this round-trip can cost 30–60 seconds before a single line of business logic runs.

The fix is to sidestep the bridge entirely for large datasets: write to disk in a format Python reads natively and fast.

Why Parquet? Every Competing Format Examined

Not all file formats are equal. Here is an honest assessment:

CSV and TSV

The universal fallback. Every tool reads them; every tool writes them slowly.

•No schema: every column is a string until parsed at read time
•No compression: a 500 MB CSV is 500 MB on disk
•Slow reads: the parser must scan every byte
•No predicate pushdown: to read 3 columns of 100, you load all 100

For large KNIME datasets, CSV is the worst possible intermediary.

Excel (.xlsx / .xls)

Worse than CSV for programmatic use. XLSX is a ZIP of XML files. openpyxl reading a 200,000-row sheet can take 45–90 seconds because it parses XML node-by-node. There is no compression benefit — the XML inflates the actual data. Never use Excel as an intermediate format in any automated pipeline.

JSON / NDJSON

JSON is a document format dressed up as a data format. Row-oriented, schema-free, verbose. A number stored as "value": 1234567.89 costs 20+ bytes vs 8 bytes in a typed binary format. NDJSON (newline-delimited) is marginally better for streaming but still offers no columnar advantages.

Apache Feather (Arrow IPC)

Feather is KNIME's closest rival to Parquet. It is essentially Apache Arrow serialised to disk — extremely fast read/write, full type fidelity, excellent Python support. Its weaknesses: no built-in compression in v1 (v2 added LZ4/ZSTD but adoption lags), not splittable for distributed reads, and lower ecosystem support outside Python/R. A legitimate choice for intermediate scratch files where compression is less critical.

HDF5

A scientific data format with a complex hierarchical structure. Read/write performance is good for numeric arrays. Problems: the library (h5py) is a C extension with complex installation, the format is not designed for tabular analytics, and KNIME has no native HDF5 node. Not appropriate here.

Apache ORC

ORC (Optimised Row Columnar) is Parquet's direct competitor in the Hadoop/Hive world. Similar columnar architecture, similar compression support. The decisive problem: minimal Python support. pyarrow can read ORC but write support was only added in v3.0 and remains less mature. In KNIME, there is no native ORC writer node. Use Parquet instead.

Apache Parquet — The Right Choice

Parquet hits every requirement simultaneously:

| Requirement | Parquet | |---|---| | Columnar storage | ✓ Native | | Compression | ✓ SNAPPY, ZSTD, GZIP, LZ4, BROTLI | | Strong typing | ✓ INT32/64, FLOAT, DOUBLE, BYTE_ARRAY, BOOLEAN | | KNIME native support | ✓ Parquet Writer / Reader nodes | | Python ecosystem | ✓ pyarrow, fastparquet, pandas, DuckDB, Polars | | Predicate pushdown | ✓ Skip row-groups at read time | | Splittable | ✓ Distributed reads in Spark/Dask | | Schema evolution | ✓ Add columns without rewriting |

Columnar Storage: Why It Makes Python 4–8× Faster

Row-oriented formats (CSV, Excel, JSON) store data like a spreadsheet printed row-by-row. To compute the sum of column 42, you read every byte of every row.

Parquet stores data column-by-column within each row group (default ~128 MB):

Row Group 1
  ├── Column "account_id"   [INT64, SNAPPY-compressed]
  ├── Column "balance"      [DOUBLE, SNAPPY-compressed]
  └── Column "date"         [INT32 (date), SNAPPY-compressed]
Row Group 2
  └── ...

When your Python script only needs balance and date, Parquet reads two columns and skips everything else. For a 40-column table where your script uses 5 columns, you skip 87.5% of the I/O entirely.

Combined with predicate pushdown — where the library skips entire row groups whose min/max statistics don't match a filter condition — large datasets can be read in a fraction of the time.

Compression Algorithms: A Practical Comparison

Parquet supports multiple compression codecs, each with different trade-offs. Choosing wrong costs either storage or CPU.

| Algorithm | Ratio vs raw | Write speed | Read speed | CPU usage | Best use case | |---|---|---|---|---|---| | UNCOMPRESSED | 1× | ★★★★★ | ★★★★★ | None | Dev/debug only | | SNAPPY | ~2.5–3× | ★★★★★ | ★★★★★ | Very low | KNIME default — production | | LZ4 | ~2.5–3× | ★★★★★ | ★★★★★ | Very low | I/O-bound pipelines | | ZSTD | ~4–5× | ★★★★☆ | ★★★★☆ | Moderate | Modern production — best balance | | GZIP (DEFLATE) | ~4–5× | ★★☆☆☆ | ★★★★☆ | High | Archival, cold storage | | BROTLI | ~5–6× | ★☆☆☆☆ | ★★★☆☆ | Very high | Static/rarely-read archives |

SNAPPY — Production Default

SNAPPY was designed by Google specifically for speed over compression ratio. It does not try for maximum compression; it tries for consistent, low-latency throughput. For KNIME workflows that write and read Parquet multiple times per execution, SNAPPY is the right default. KNIME's Parquet Writer defaults to SNAPPY for this reason.

ZSTD — The Modern Upgrade

ZSTD (Zstandard, by Facebook) achieves GZIP-level compression at speeds approaching SNAPPY. At compression level 3 (the default), it is a strict upgrade over SNAPPY for most scenarios. If your Parquet files are stored in a shared network directory or S3 bucket where bandwidth is constrained, switching from SNAPPY to ZSTD level 3 reduces transfer overhead while adding only ~15% write time.

import pyarrow.parquet as pq
import pyarrow as pa

# Write with ZSTD compression
pq.write_table(
    pa.Table.from_pandas(df),
    "output.parquet",
    compression="zstd",
    compression_level=3,     # 1–22; 3 is the practical default
)

GZIP — Only for Cold Storage

GZIP achieves good compression ratios but writes slowly (3–5× slower than SNAPPY). Use it only for files that are written once and read rarely — compliance archives, end-of-month snapshots. Never use GZIP in a workflow that re-runs daily.

LZ4 — When Disk Is Your Bottleneck

LZ4 is faster than SNAPPY at the same compression ratio, but has less support in older pyarrow versions. If you are on pyarrow ≥ 4.0 and your bottleneck is disk throughput (e.g., spinning HDDs or slow NAS), LZ4 is worth testing. On modern SSDs the difference from SNAPPY is negligible.

Choosing the Right Codec

# Quick decision guide
# • Daily workflow, SSD storage:      SNAPPY (fastest, default)
# • Network storage or S3:            ZSTD level 3 (better size/speed tradeoff)
# • Archive / compliance:             GZIP (best ratio, write-once)
# • Maximum speed, no size concern:   UNCOMPRESSED (dev/debug only)

Python Libraries for Parquet: pyarrow vs fastparquet vs Others

pyarrow (Apache Arrow) — Recommended

pyarrow is the reference implementation of the Apache Arrow columnar memory format. It is fast, mature, actively maintained by the Apache Foundation, and is the engine that KNIME itself uses internally.

import pyarrow.parquet as pq
import pyarrow as pa

# Read — full file
df = pq.read_table("data.parquet").to_pandas()

# Read — specific columns only (huge speed win on wide tables)
df = pq.read_table("data.parquet", columns=["account_id", "balance"]).to_pandas()

# Read — with predicate pushdown (skip row groups where date < threshold)
import pyarrow.compute as pc
df = pq.read_table(
    "data.parquet",
    filters=[("date", ">=", "2024-01-01")],
).to_pandas()

# Write with schema control
schema = pa.schema([
    ("account_id", pa.int64()),
    ("balance", pa.float64()),
    ("date", pa.date32()),
])
pq.write_table(
    pa.Table.from_pandas(df, schema=schema),
    "data.parquet",
    compression="zstd",
)

Why pyarrow over alternatives: It supports the full Parquet spec including nested types, encrypted columns, and row-group statistics. It powers pandas, polars, dask, and DuckDB's Parquet backends. Its memory format is zero-copy compatible with C++, Java, and Rust Arrow implementations.

fastparquet — Lightweight Alternative

fastparquet is a pure-Python Parquet implementation developed by the Dask team. It is smaller and simpler than pyarrow.

When to prefer it:

•Installing pyarrow is problematic (restricted environments, constrained containers)
•You are already in a Dask/Numba stack where fastparquet integrates directly

When NOT to use it:

•Complex nested schemas: fastparquet supports fewer Parquet types
•Performance benchmarks consistently show pyarrow 20–40% faster for read-heavy workloads
•Encryption and advanced compression (ZSTD, BROTLI, LZ4) support is limited

import fastparquet as fp

# Read
pf = fp.ParquetFile("data.parquet")
df = pf.to_pandas()

# Write
fp.write("data.parquet", df, compression="SNAPPY")

pandas Native Engine Selection

pandas.read_parquet delegates to an engine. Specifying it explicitly removes ambiguity:

# Explicit engine selection
df = pd.read_parquet("data.parquet", engine="pyarrow")    # default, recommended
df = pd.read_parquet("data.parquet", engine="fastparquet")  # alternative

When both are installed, pandas defaults to pyarrow. Pin this explicitly in production code to avoid silent engine changes across environments.

Why NOT modin, Dask, or Distributed Pandas?

A common suggestion is to replace pandas with modin or dask to speed up Python processing in KNIME. This is the wrong solution for the right problem.

The Actual Bottleneck

The performance issue is not inside the pandas computation — it is in the JVM-to-Python data transfer. A pandas operation that takes 0.5 seconds on 5M rows is irrelevant if the bridge serialisation takes 40 seconds before it can run.

Modin and Dask speed up the computation phase. They do nothing to reduce bridge overhead. You would still wait 40 seconds for the data to arrive before modin's parallel groupby saves you 0.3 seconds.

The Overhead Problem

Both modin and Dask require initialising a distributed scheduler (Ray or Dask.distributed). In a KNIME Python Script node, this happens on every execution, adding 2–8 seconds of worker initialisation before any work runs. For single-node KNIME workflows processing files of reasonable size (< 50M rows), this overhead exceeds any benefit.

The Right Mental Model

| Approach | Solves bridge overhead? | Adds scheduler overhead? | Complexity | |---|---|---|---| | Parquet + pyarrow | ✓ Yes — bypasses bridge | None | Low | | modin (Ray backend) | ✗ No | 3–8s | High | | Dask | ✗ No | 2–5s | High | | pandas as-is | ✗ No | None | Baseline |

Parquet solves the right problem. modin and Dask solve a different problem (in-memory computation at scale) that most KNIME workflows never encounter.

KNIME Implementation

Step 1: Pre-Processing Component Configuration

Add a configuration component at the start of the workflow. Define a unique string ID for each Parquet table — typically the workflow name + a table identifier:

workflow_parquet_id = "euc_remediation_main_20241129"

This ID becomes the filename base. Unique IDs prevent collisions when multiple workflow instances run concurrently on KNIME Server.

Step 2: Define Output Paths with Variable Creator

Use the Variable Creator node to construct full file paths:

parquet_uid = /knime/workflow-data/euc_remediation_main_20241129.parquet

Using flow variables (not hardcoded strings) ensures the path resolves correctly across local executor, KNIME Server, and CI/CD environments without modification.

Step 3: Read Source Data Normally

Use standard KNIME reader nodes — Excel Reader, CSV Reader, DB Reader, etc. — as you normally would. No changes at this stage.

Step 4: Insert the Parquet Writer Node

Connect the table output to a Parquet Writer node. Configure:

•Output location: use the flow variable from Step 2
•Overwrite policy: set to Overwrite (required for re-runs)
•Compression: SNAPPY (default) or ZSTD for network storage

Step 5: Python Script Node — Read from Parquet

Replace the direct table input in your Python Script node with a file-path flow variable and pyarrow:

import pyarrow.parquet as pq
import pandas as pd

# Read only the columns this script actually needs
df = pq.read_table(
    flow_variables["parquet_uid"],
    columns=["account_id", "product_code", "balance", "report_date"],
).to_pandas()

# All downstream business logic unchanged

Key change: instead of receiving the table from KNIME's bridge (input_table_1), the script reads from disk via pyarrow. The bridge is used only for the final output table, which is typically much smaller than the input.

Step 6: Validate with Timer Info

Add a Timer Info node immediately after the Parquet Writer and after the Python Script node. Compare the millisecond timestamps. A representative before/after benchmark:

| Stage | Before (CSV via bridge) | After (Parquet + pyarrow) | |---|---|---| | Table → Python bridge | 38 500 ms | — (bypassed) | | Parquet write | — | 3 200 ms | | Parquet read (pyarrow) | — | 4 800 ms | | Python computation | 12 000 ms | 11 500 ms | | Total | 50 500 ms | 19 500 ms |

Net improvement: ~2.6× in this scenario. Gains scale with row count. Above 10M rows, improvements of 6–8× are common.

Infrastructure Cost Analysis

In cloud or on-premise KNIME Server environments, workflow runtime directly maps to compute cost.

Consider a workflow running on a 4-vCPU executor instance at $0.20/hour:

| Scenario | Daily executions | Runtime per run | Monthly cost | |---|---|---|---| | Before: CSV bridge | 120 | 52 min | $20.80 | | After: Parquet | 120 | 21 min | $8.40 | | Saving | — | — | $12.40 / month |

For high-frequency workflows (hourly runs, event-driven triggers), or environments with 10–50 concurrent workflows, these savings compound significantly. Parquet's smaller file sizes also reduce shared storage costs and accelerate backup windows.

Conclusion: Choosing the Right Intermediate Format

Parquet is not always the right answer — but for structured tabular data in Python analytics pipelines, it is almost always the best one.

| Format | Speed | Compression | Type safety | Ecosystem | Verdict | |---|---|---|---|---|---| | Parquet + SNAPPY | ★★★★★ | ★★★★☆ | ★★★★★ | ★★★★★ | ✓ Best for KNIME pipelines | | Parquet + ZSTD | ★★★★☆ | ★★★★★ | ★★★★★ | ★★★★★ | ✓ Best for network/cloud storage | | Feather v2 (LZ4) | ★★★★★ | ★★★☆☆ | ★★★★★ | ★★★☆☆ | Good for Python-only scratch | | CSV | ★★☆☆☆ | ★☆☆☆☆ | ★☆☆☆☆ | ★★★★★ | Only for human-readable exports | | Excel | ★☆☆☆☆ | ★★☆☆☆ | ★★☆☆☆ | ★★★★☆ | Avoid in automated pipelines | | JSON | ★★☆☆☆ | ★☆☆☆☆ | ★★☆☆☆ | ★★★★★ | API payloads only | | HDF5 | ★★★★☆ | ★★★☆☆ | ★★★★☆ | ★★★☆☆ | Scientific arrays, not tabular | | ORC | ★★★★☆ | ★★★★☆ | ★★★★★ | ★★☆☆☆ | Hive/Spark only |

The Decision Rule

•KNIME daily workflows with SSD storage → Parquet + SNAPPY
•KNIME Server workflows writing to NAS or S3 → Parquet + ZSTD level 3
•Archive / compliance snapshots → Parquet + GZIP
•Scratch files, Python-only, maximum speed → Feather v2
•Human-readable output for business users → CSV (export only, never intermediate)

The pattern is simple: write Parquet once, read it fast in Python, skip the JVM bridge entirely for bulk data. The performance and cost benefits are immediate, measurable, and require no changes to business logic.