I am trying to filter a parquet file with over 50 million rows down to the rows whose URI column contains an empty string, using:
    import polars as pl

    lf = pl.scan_parquet("data.parquet")
    lf.filter(pl.col("URI") == "").collect()

Output:
    shape: (0, 3)
    ┌─────┬────────┬───────────┐
    │ URI ┆ REMARK ┆ TIMESTAMP │
    │ --- ┆ ---    ┆ ---       │
    │ str ┆ str    ┆ i64       │
    ╞═════╪════════╪═══════════╡
    └─────┴────────┴───────────┘

Luckily, I had labelled the rows that have an empty-string URI with NO URI in the REMARK column, so
    lf.filter(pl.col("REMARK") == "NO URI").collect()

yields:
    shape: (7_767, 3)
    ┌─────┬────────┬────────────┐
    │ URI ┆ REMARK ┆ TIMESTAMP  │
    │ --- ┆ ---    ┆ ---        │
    │ str ┆ str    ┆ i64        │
    ╞═════╪════════╪════════════╡
    │     ┆ NO URI ┆ 1759257000 │
    │     ┆ NO URI ┆ 1759257000 │
    │     ┆ NO URI ┆ 1759257000 │
    │     ┆ NO URI ┆ 1759257000 │
    │     ┆ NO URI ┆ 1759257000 │
    │ …   ┆ …      ┆ …          │
    │     ┆ NO URI ┆ 1759257000 │
    │     ┆ NO URI ┆ 1759257000 │
    │     ┆ NO URI ┆ 1759257000 │
    │     ┆ NO URI ┆ 1759257000 │
    │     ┆ NO URI ┆ 1759257000 │
    └─────┴────────┴────────────┘

Also, to confirm that the URI value in those rows is just an empty string:
    len(lf.filter(pl.col("REMARK") == "NO URI").collect()["URI"][0])  # Outputs 0

Is this a bug in Polars, or have I missed something important? And how do I get the rows with an empty-string URI?
Python version: 3.14.2
Polars version: 1.35.2
