Is there a way to directly convert CSV to Parquet with DuckDB in Java?

I am doing some tests comparing DuckDB usage across different languages, and I've noticed something strange. In Python you can do the following:

duckdb.read_csv(inputFile, max_line_size=10000000, null_padding=True, delimiter=";").write_parquet(outputFile)

However, this syntax is not available in Java, where you'd have to do the following:

PreparedStatement statement = conn.prepareStatement(String.format("""
        PRAGMA memory_limit='16GB';
        CREATE TABLE "%s" AS SELECT * FROM read_csv("%s", delim=';');
        COPY "%s" TO "%s" (FORMAT parquet);
        """,
        path.getFileName(),
        path.toAbsolutePath(),
        path.getFileName(),
        outputdir.getAbsolutePath() + "/" + path.getFileName().toString().replace(".csv", ".parquet")
));
statement.execute();

The issue is that this approach is far slower. Normally, I'd expect to be able to write a single statement like:

COPY read_csv("INPUTFILE", delim=';') TO "OUTPUTFILE" (FORMAT parquet);

But as far as I can tell, within a COPY statement you can only apply these arguments to one side of the statement. That means that if you use a non-standard delimiter (delimiter != ","), you first have to load the entire file into an in-memory table before writing it back out.
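For reference, this is roughly how I'd like to issue that single statement from Java over JDBC. This is only a sketch of the form I'd expect to work, not something I've gotten DuckDB to accept; inputFile and outputFile are placeholder paths:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch only: the single-statement CSV-to-Parquet conversion I'd like to run.
// inputFile and outputFile are placeholder path strings.
try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
     Statement stmt = conn.createStatement()) {
    stmt.execute(String.format(
            "COPY read_csv('%s', delim=';') TO '%s' (FORMAT parquet);",
            inputFile, outputFile));
}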

In my benchmarks I am converting about 35 gigabytes of data spread across 25k CSV files, as this is essentially what the real-world usage will be.
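For context, the benchmark loop looks roughly like this (a sketch, not my exact code: "inputdir" is a placeholder directory and convert() is a placeholder helper that runs the CREATE TABLE / COPY statements shown above):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Collect all CSV files in the input directory.
List<Path> csvFiles;
try (Stream<Path> stream = Files.list(Path.of("inputdir"))) {
    csvFiles = stream.filter(p -> p.toString().endsWith(".csv")).toList();
}

// Convert each file and accumulate per-file timings.
long totalNanos = 0;
for (Path path : csvFiles) {
    long start = System.nanoTime();
    convert(path); // placeholder: runs the CREATE TABLE / COPY statements above
    totalNanos += System.nanoTime() - start;
}
System.out.printf("Total time %.3f (s) Average time %d (ns)%n",
        totalNanos / 1e9, totalNanos / csvFiles.size());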

This makes the program far slower in Java:

Java: Total time 528.120 (s) Average time 20912133 (ns)

vs

Python: Total Time: 375.7097873687744 (s) Average time: 14896365.527604684 (ns)

Does anyone know of a better way to approach this? Or is this a shortcoming of the DuckDB syntax?