Optimizing KNIME Workflow Runtime using Parquet
A deep-dive tutorial on replacing KNIME's JVMβPython bridge overhead with Apache Parquet: compression algorithms (SNAPPY, ZSTD, GZIP, LZ4), pyarrow vs fastparquet, why modin doesn't help, and a format comparison to close.
Introduction
Enterprise analytics workflows routinely ingest millions of rows from regulated databases, legacy Excel report chains, and EUC (End-User Computing) remediation sources. In KNIME, when these datasets pass into Python Script nodes, the JVM serialises the entire table across the JavaβPython bridge β row by row, type by type. At a few hundred thousand rows this is tolerable. At five million rows it becomes a workflow killer.
This guide documents a production-tested pattern: writing interim data to Apache Parquet before consuming it in Python. Across real-world KNIME deployments, this single architectural change reduced Python node execution time by 4β8Γ and cut infrastructure compute costs measurably β without touching business logic.
The Root Problem: How KNIME Passes Data to Python
Before choosing Parquet, it helps to understand precisely where the bottleneck lives.
When KNIME's Python Script node receives a table, the runtime:
- β’Serialises the KNIME
BufferedDataTablethrough the JVMβPython bridge (based on Apache Arrow's IPC format internally, but still subject to JVM memory pressure and bridge latency) - β’Deserialises it into a
pandas.DataFrameinside the Python process - β’Re-serialises results back across the bridge on completion
Each column goes through type negotiation. Strings are the worst offenders β KNIME's internal String representation is far heavier than Python's. For a 5-million-row table with 40 columns, this round-trip can cost 30β60 seconds before a single line of business logic runs.
The fix is to sidestep the bridge entirely for large datasets: write to disk in a format Python reads natively and fast.
Why Parquet? Every Competing Format Examined
Not all file formats are equal. Here is an honest assessment:
CSV and TSV
The universal fallback. Every tool reads them; every tool writes them slowly.
- β’No schema: every column is a string until parsed at read time
- β’No compression: a 500 MB CSV is 500 MB on disk
- β’Slow reads: the parser must scan every byte
- β’No predicate pushdown: to read 3 columns of 100, you load all 100
For large KNIME datasets, CSV is the worst possible intermediary.
Excel (.xlsx / .xls)
Worse than CSV for programmatic use. XLSX is a ZIP of XML files. openpyxl reading a 200,000-row sheet can take 45β90 seconds because it parses XML node-by-node. There is no compression benefit β the XML inflates the actual data. Never use Excel as an intermediate format in any automated pipeline.
JSON / NDJSON
JSON is a document format dressed up as a data format. Row-oriented, schema-free, verbose. A number stored as "value": 1234567.89 costs 20+ bytes vs 8 bytes in a typed binary format. NDJSON (newline-delimited) is marginally better for streaming but still offers no columnar advantages.
Apache Feather (Arrow IPC)
Feather is KNIME's closest rival to Parquet. It is essentially Apache Arrow serialised to disk β extremely fast read/write, full type fidelity, excellent Python support. Its weaknesses: no built-in compression in v1 (v2 added LZ4/ZSTD but adoption lags), not splittable for distributed reads, and lower ecosystem support outside Python/R. A legitimate choice for intermediate scratch files where compression is less critical.
HDF5
A scientific data format with a complex hierarchical structure. Read/write performance is good for numeric arrays. Problems: the library (h5py) is a C extension with complex installation, the format is not designed for tabular analytics, and KNIME has no native HDF5 node. Not appropriate here.
Apache ORC
ORC (Optimised Row Columnar) is Parquet's direct competitor in the Hadoop/Hive world. Similar columnar architecture, similar compression support. The decisive problem: minimal Python support. pyarrow can read ORC but write support was only added in v3.0 and remains less mature. In KNIME, there is no native ORC writer node. Use Parquet instead.
Apache Parquet β The Right Choice
Parquet hits every requirement simultaneously:
| Requirement | Parquet | |---|---| | Columnar storage | β Native | | Compression | β SNAPPY, ZSTD, GZIP, LZ4, BROTLI | | Strong typing | β INT32/64, FLOAT, DOUBLE, BYTE_ARRAY, BOOLEAN | | KNIME native support | β Parquet Writer / Reader nodes | | Python ecosystem | β pyarrow, fastparquet, pandas, DuckDB, Polars | | Predicate pushdown | β Skip row-groups at read time | | Splittable | β Distributed reads in Spark/Dask | | Schema evolution | β Add columns without rewriting |
Columnar Storage: Why It Makes Python 4β8Γ Faster
Row-oriented formats (CSV, Excel, JSON) store data like a spreadsheet printed row-by-row. To compute the sum of column 42, you read every byte of every row.
Parquet stores data column-by-column within each row group (default ~128 MB):
Row Group 1
βββ Column "account_id" [INT64, SNAPPY-compressed]
βββ Column "balance" [DOUBLE, SNAPPY-compressed]
βββ Column "date" [INT32 (date), SNAPPY-compressed]
Row Group 2
βββ ...
When your Python script only needs balance and date, Parquet reads two columns and skips everything else. For a 40-column table where your script uses 5 columns, you skip 87.5% of the I/O entirely.
Combined with predicate pushdown β where the library skips entire row groups whose min/max statistics don't match a filter condition β large datasets can be read in a fraction of the time.
Compression Algorithms: A Practical Comparison
Parquet supports multiple compression codecs, each with different trade-offs. Choosing wrong costs either storage or CPU.
| Algorithm | Ratio vs raw | Write speed | Read speed | CPU usage | Best use case | |---|---|---|---|---|---| | UNCOMPRESSED | 1Γ | β β β β β | β β β β β | None | Dev/debug only | | SNAPPY | ~2.5β3Γ | β β β β β | β β β β β | Very low | KNIME default β production | | LZ4 | ~2.5β3Γ | β β β β β | β β β β β | Very low | I/O-bound pipelines | | ZSTD | ~4β5Γ | β β β β β | β β β β β | Moderate | Modern production β best balance | | GZIP (DEFLATE) | ~4β5Γ | β β βββ | β β β β β | High | Archival, cold storage | | BROTLI | ~5β6Γ | β ββββ | β β β ββ | Very high | Static/rarely-read archives |
SNAPPY β Production Default
SNAPPY was designed by Google specifically for speed over compression ratio. It does not try for maximum compression; it tries for consistent, low-latency throughput. For KNIME workflows that write and read Parquet multiple times per execution, SNAPPY is the right default. KNIME's Parquet Writer defaults to SNAPPY for this reason.
ZSTD β The Modern Upgrade
ZSTD (Zstandard, by Facebook) achieves GZIP-level compression at speeds approaching SNAPPY. At compression level 3 (the default), it is a strict upgrade over SNAPPY for most scenarios. If your Parquet files are stored in a shared network directory or S3 bucket where bandwidth is constrained, switching from SNAPPY to ZSTD level 3 reduces transfer overhead while adding only ~15% write time.
import pyarrow.parquet as pq
import pyarrow as pa
# Write with ZSTD compression
pq.write_table(
pa.Table.from_pandas(df),
"output.parquet",
compression="zstd",
compression_level=3, # 1β22; 3 is the practical default
)
GZIP β Only for Cold Storage
GZIP achieves good compression ratios but writes slowly (3β5Γ slower than SNAPPY). Use it only for files that are written once and read rarely β compliance archives, end-of-month snapshots. Never use GZIP in a workflow that re-runs daily.
LZ4 β When Disk Is Your Bottleneck
LZ4 is faster than SNAPPY at the same compression ratio, but has less support in older pyarrow versions. If you are on pyarrow β₯ 4.0 and your bottleneck is disk throughput (e.g., spinning HDDs or slow NAS), LZ4 is worth testing. On modern SSDs the difference from SNAPPY is negligible.
Choosing the Right Codec
# Quick decision guide
# β’ Daily workflow, SSD storage: SNAPPY (fastest, default)
# β’ Network storage or S3: ZSTD level 3 (better size/speed tradeoff)
# β’ Archive / compliance: GZIP (best ratio, write-once)
# β’ Maximum speed, no size concern: UNCOMPRESSED (dev/debug only)
Python Libraries for Parquet: pyarrow vs fastparquet vs Others
pyarrow (Apache Arrow) β Recommended
pyarrow is the reference implementation of the Apache Arrow columnar memory format. It is fast, mature, actively maintained by the Apache Foundation, and is the engine that KNIME itself uses internally.
import pyarrow.parquet as pq
import pyarrow as pa
# Read β full file
df = pq.read_table("data.parquet").to_pandas()
# Read β specific columns only (huge speed win on wide tables)
df = pq.read_table("data.parquet", columns=["account_id", "balance"]).to_pandas()
# Read β with predicate pushdown (skip row groups where date < threshold)
import pyarrow.compute as pc
df = pq.read_table(
"data.parquet",
filters=[("date", ">=", "2024-01-01")],
).to_pandas()
# Write with schema control
schema = pa.schema([
("account_id", pa.int64()),
("balance", pa.float64()),
("date", pa.date32()),
])
pq.write_table(
pa.Table.from_pandas(df, schema=schema),
"data.parquet",
compression="zstd",
)
Why pyarrow over alternatives: It supports the full Parquet spec including nested types, encrypted columns, and row-group statistics. It powers pandas, polars, dask, and DuckDB's Parquet backends. Its memory format is zero-copy compatible with C++, Java, and Rust Arrow implementations.
fastparquet β Lightweight Alternative
fastparquet is a pure-Python Parquet implementation developed by the Dask team. It is smaller and simpler than pyarrow.
When to prefer it:
- β’Installing pyarrow is problematic (restricted environments, constrained containers)
- β’You are already in a Dask/Numba stack where fastparquet integrates directly
When NOT to use it:
- β’Complex nested schemas: fastparquet supports fewer Parquet types
- β’Performance benchmarks consistently show pyarrow 20β40% faster for read-heavy workloads
- β’Encryption and advanced compression (ZSTD, BROTLI, LZ4) support is limited
import fastparquet as fp
# Read
pf = fp.ParquetFile("data.parquet")
df = pf.to_pandas()
# Write
fp.write("data.parquet", df, compression="SNAPPY")
pandas Native Engine Selection
pandas.read_parquet delegates to an engine. Specifying it explicitly removes ambiguity:
# Explicit engine selection
df = pd.read_parquet("data.parquet", engine="pyarrow") # default, recommended
df = pd.read_parquet("data.parquet", engine="fastparquet") # alternative
When both are installed, pandas defaults to pyarrow. Pin this explicitly in production code to avoid silent engine changes across environments.
Why NOT modin, Dask, or Distributed Pandas?
A common suggestion is to replace pandas with modin or dask to speed up Python processing in KNIME. This is the wrong solution for the right problem.
The Actual Bottleneck
The performance issue is not inside the pandas computation β it is in the JVM-to-Python data transfer. A pandas operation that takes 0.5 seconds on 5M rows is irrelevant if the bridge serialisation takes 40 seconds before it can run.
Modin and Dask speed up the computation phase. They do nothing to reduce bridge overhead. You would still wait 40 seconds for the data to arrive before modin's parallel groupby saves you 0.3 seconds.
The Overhead Problem
Both modin and Dask require initialising a distributed scheduler (Ray or Dask.distributed). In a KNIME Python Script node, this happens on every execution, adding 2β8 seconds of worker initialisation before any work runs. For single-node KNIME workflows processing files of reasonable size (< 50M rows), this overhead exceeds any benefit.
The Right Mental Model
| Approach | Solves bridge overhead? | Adds scheduler overhead? | Complexity | |---|---|---|---| | Parquet + pyarrow | β Yes β bypasses bridge | None | Low | | modin (Ray backend) | β No | 3β8s | High | | Dask | β No | 2β5s | High | | pandas as-is | β No | None | Baseline |
Parquet solves the right problem. modin and Dask solve a different problem (in-memory computation at scale) that most KNIME workflows never encounter.
KNIME Implementation
Step 1: Pre-Processing Component Configuration
Add a configuration component at the start of the workflow. Define a unique string ID for each Parquet table β typically the workflow name + a table identifier:
workflow_parquet_id = "euc_remediation_main_20241129"
This ID becomes the filename base. Unique IDs prevent collisions when multiple workflow instances run concurrently on KNIME Server.
Step 2: Define Output Paths with Variable Creator
Use the Variable Creator node to construct full file paths:
parquet_uid = /knime/workflow-data/euc_remediation_main_20241129.parquet
Using flow variables (not hardcoded strings) ensures the path resolves correctly across local executor, KNIME Server, and CI/CD environments without modification.
Step 3: Read Source Data Normally
Use standard KNIME reader nodes β Excel Reader, CSV Reader, DB Reader, etc. β as you normally would. No changes at this stage.
Step 4: Insert the Parquet Writer Node
Connect the table output to a Parquet Writer node. Configure:
- β’Output location: use the flow variable from Step 2
- β’Overwrite policy: set to Overwrite (required for re-runs)
- β’Compression: SNAPPY (default) or ZSTD for network storage
Step 5: Python Script Node β Read from Parquet
Replace the direct table input in your Python Script node with a file-path flow variable and pyarrow:
import pyarrow.parquet as pq
import pandas as pd
# Read only the columns this script actually needs
df = pq.read_table(
flow_variables["parquet_uid"],
columns=["account_id", "product_code", "balance", "report_date"],
).to_pandas()
# All downstream business logic unchanged
Key change: instead of receiving the table from KNIME's bridge (
input_table_1), the script reads from disk via pyarrow. The bridge is used only for the final output table, which is typically much smaller than the input.
Step 6: Validate with Timer Info
Add a Timer Info node immediately after the Parquet Writer and after the Python Script node. Compare the millisecond timestamps. A representative before/after benchmark:
| Stage | Before (CSV via bridge) | After (Parquet + pyarrow) | |---|---|---| | Table β Python bridge | 38 500 ms | β (bypassed) | | Parquet write | β | 3 200 ms | | Parquet read (pyarrow) | β | 4 800 ms | | Python computation | 12 000 ms | 11 500 ms | | Total | 50 500 ms | 19 500 ms |
Net improvement: ~2.6Γ in this scenario. Gains scale with row count. Above 10M rows, improvements of 6β8Γ are common.
Infrastructure Cost Analysis
In cloud or on-premise KNIME Server environments, workflow runtime directly maps to compute cost.
Consider a workflow running on a 4-vCPU executor instance at $0.20/hour:
| Scenario | Daily executions | Runtime per run | Monthly cost | |---|---|---|---| | Before: CSV bridge | 120 | 52 min | $20.80 | | After: Parquet | 120 | 21 min | $8.40 | | Saving | β | β | $12.40 / month |
For high-frequency workflows (hourly runs, event-driven triggers), or environments with 10β50 concurrent workflows, these savings compound significantly. Parquet's smaller file sizes also reduce shared storage costs and accelerate backup windows.
Conclusion: Choosing the Right Intermediate Format
Parquet is not always the right answer β but for structured tabular data in Python analytics pipelines, it is almost always the best one.
| Format | Speed | Compression | Type safety | Ecosystem | Verdict | |---|---|---|---|---|---| | Parquet + SNAPPY | β β β β β | β β β β β | β β β β β | β β β β β | β Best for KNIME pipelines | | Parquet + ZSTD | β β β β β | β β β β β | β β β β β | β β β β β | β Best for network/cloud storage | | Feather v2 (LZ4) | β β β β β | β β β ββ | β β β β β | β β β ββ | Good for Python-only scratch | | CSV | β β βββ | β ββββ | β ββββ | β β β β β | Only for human-readable exports | | Excel | β ββββ | β β βββ | β β βββ | β β β β β | Avoid in automated pipelines | | JSON | β β βββ | β ββββ | β β βββ | β β β β β | API payloads only | | HDF5 | β β β β β | β β β ββ | β β β β β | β β β ββ | Scientific arrays, not tabular | | ORC | β β β β β | β β β β β | β β β β β | β β βββ | Hive/Spark only |
The Decision Rule
- β’KNIME daily workflows with SSD storage β Parquet + SNAPPY
- β’KNIME Server workflows writing to NAS or S3 β Parquet + ZSTD level 3
- β’Archive / compliance snapshots β Parquet + GZIP
- β’Scratch files, Python-only, maximum speed β Feather v2
- β’Human-readable output for business users β CSV (export only, never intermediate)
The pattern is simple: write Parquet once, read it fast in Python, skip the JVM bridge entirely for bulk data. The performance and cost benefits are immediate, measurable, and require no changes to business logic.