Is there a way to directly convert CSV to Parquet with DuckDB in Java?

I am doing some tests comparing DuckDB usage across different languages, and I've noticed something strange. In Python you can do the following:

duckdb.read_csv(inputFile, max_line_size=10000000, null_padding=True, delimiter=";").write_parquet(outputFile)

However, this syntax is not available in Java, where you'd have to do the following:

PreparedStatement statement = conn.prepareStatement(String.format("""
        PRAGMA memory_limit='16GB';
        CREATE TABLE "%s" AS SELECT * FROM read_csv("%s", delim=';');
        COPY "%s" TO "%s" (FORMAT parquet);
        """,
        path.getFileName(),
        path.toAbsolutePath(),
        path.getFileName(),
        outputdir.getAbsolutePath() + "/" + path.getFileName().toString().replace(".csv", ".parquet")
));
statement.execute();

The issue is that this approach is far slower. Normally, I'd expect to be able to write a single statement like:

COPY read_csv("INPUTFILE", delim=';') TO "OUTPUTFILE" (FORMAT parquet);

But as far as I can tell, within a COPY statement you can only apply these arguments to one side of the statement. That means that if you use a non-standard delimiter (delimiter != ","), you first have to load the entire file into an in-memory table before writing it back out.
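For reference, this is roughly how I'd like to issue that single statement from Java over JDBC. This is only a sketch of the form I'd expect to work, not something I've gotten DuckDB to accept; inputFile and outputFile are placeholder paths:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch only: the single-statement CSV-to-Parquet conversion I'd like to run.
// inputFile and outputFile are placeholder path strings.
try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
     Statement stmt = conn.createStatement()) {
    stmt.execute(String.format(
            "COPY read_csv('%s', delim=';') TO '%s' (FORMAT parquet);",
            inputFile, outputFile));
}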

In my benchmarks I am converting about 35 gigabytes of data spread across 25k CSV files, as this is essentially what the real-world usage will be.
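For context, the benchmark loop looks roughly like this (a sketch, not my exact code: "inputdir" is a placeholder directory and convert() is a placeholder helper that runs the CREATE TABLE / COPY statements shown above):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Collect all CSV files in the input directory.
List<Path> csvFiles;
try (Stream<Path> stream = Files.list(Path.of("inputdir"))) {
    csvFiles = stream.filter(p -> p.toString().endsWith(".csv")).toList();
}

// Convert each file and accumulate per-file timings.
long totalNanos = 0;
for (Path path : csvFiles) {
    long start = System.nanoTime();
    convert(path); // placeholder: runs the CREATE TABLE / COPY statements above
    totalNanos += System.nanoTime() - start;
}
System.out.printf("Total time %.3f (s) Average time %d (ns)%n",
        totalNanos / 1e9, totalNanos / csvFiles.size());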

This makes the program far slower in Java:

Java: Total time 528.120 (s) Average time 20912133 (ns)

vs

Python: Total Time: 375.7097873687744 (s) Average time: 14896365.527604684 (ns)

Does anyone know of a better way to approach this? Or is this a shortcoming of the DuckDB syntax?