I have several NDJSON files totaling nearly 800 GB, produced by parsing the Wikipedia dump. I would like to remove duplicate HTML entries, so I group by "url" and keep the "html" with the most recent "dateModified".

```python
import duckdb
from pathlib import Path

inDir = Path(r"E:\Personal Projects\tmp\result")
outDir = Path(r"C:\Users\Akira\Documents\enwiktionary.ndjson")

con = duckdb.connect()
con.execute("SET threads=5")
con.execute("SET memory_limit='12.5GB'")
con.execute("SET preserve_insertion_order=false")

result = con.sql(f"""
    COPY (
        SELECT
            arg_max(html, dateModified) AS html
        FROM read_ndjson('{inDir / "*enwiktionary*.ndjson"}')
        GROUP BY url
    )
    TO "{outDir}"
""")
```
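For clarity, here is a minimal pure-Python sketch of what `arg_max(html, dateModified) ... GROUP BY url` computes per group, using hypothetical toy records (only the field names come from the query above):

```python
import json

# Toy NDJSON lines: two snapshots of the same url plus one other url.
lines = [
    '{"url": "a", "html": "<p>old</p>", "dateModified": "2023-01-01"}',
    '{"url": "a", "html": "<p>new</p>", "dateModified": "2024-06-01"}',
    '{"url": "b", "html": "<p>only</p>", "dateModified": "2022-03-15"}',
]

# Group by url, keeping the record whose dateModified is largest --
# the same per-group result arg_max(html, dateModified) produces.
# (ISO-8601 date strings compare correctly as plain strings.)
latest = {}
for line in lines:
    rec = json.loads(line)
    key = rec["url"]
    if key not in latest or rec["dateModified"] > latest[key]["dateModified"]:
        latest[key] = rec

deduped = [rec["html"] for rec in sorted(latest.values(), key=lambda r: r["url"])]
print(deduped)  # ['<p>new</p>', '<p>only</p>']
```

The difference, of course, is that this dictionary holds every group's latest `html` in memory at once, which is exactly what becomes expensive at 800 GB scale.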

Then I get this error:

```
---------------------------------------------------------------------------
OutOfMemoryException                      Traceback (most recent call last)
Cell In[5], line 16
     12 con.execute("SET memory_limit='12.5GB'")
     13 con.execute("SET preserve_insertion_order=false")
---> 16 result = con.sql(f"""
     17     COPY(
     18         SELECT
     19             arg_max(html, dateModified) as html
     20         FROM read_ndjson('{inDir / "*enwiktionary*.ndjson"}')
     21         GROUP BY url
     22     )
     23     TO "{outDir}"
     24 """)

OutOfMemoryException: Out of Memory Error: failed to allocate data of size 16.0 MiB (11.6 GiB/11.6 GiB used)
Possible solutions:
* Reducing the number of threads (SET threads=X)
* Disabling insertion-order preservation (SET preserve_insertion_order=false)
* Increasing the memory limit (SET memory_limit='...GB')
See also https://duckdb.org/docs/stable/guides/performance/how_to_tune_workloads
```

On the other hand, if I set `SET memory_limit='13GB'`, the code runs without error. My laptop has 32 GB of RAM and 8 CPU cores (16 threads). I read the Memory Management page of the DuckDB documentation but could not see how to fine-tune these parameters.
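For completeness, these are the knobs I have been experimenting with. The temp-directory path below is a placeholder, not my actual setup; `temp_directory` and `max_temp_directory_size` are documented DuckDB settings that control whether and how much DuckDB may spill to disk:

```sql
-- Fewer threads mean fewer concurrent per-thread hash tables.
SET threads = 2;
-- Cap the buffer manager's memory.
SET memory_limit = '13GB';
-- Ordered output is not needed for this dedup.
SET preserve_insertion_order = false;
-- Allow spilling to disk (placeholder path) and cap the spill size.
SET temp_directory = 'E:\duckdb_tmp';
SET max_temp_directory_size = '200GB';
```

Even with these set, the run still fails at 12.5 GB, which is why I am asking what part of this aggregation cannot spill.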

Which kinds of computation force DuckDB to keep data in RAM rather than spilling to disk? And how do I fine-tune the `memory_limit` parameter for a given dataset?
