3 min read

Analyzing large files locally in seconds with DuckDB and DataGrip


If you have ever received a huge CSV file that you had to analyze, or just wanted to peek into quickly to check its structure, your go-to tool is usually pandas and a small Python script.

But if you are like me, always have DataGrip (or any other JDBC-compatible SQL IDE) open, and prefer SQL over Python for quick routine checks like this, then this guide is for you!

If you haven’t heard about the two pieces of tech we’ll use, here’s a short description of each:

DuckDB is an in-process SQL OLAP database management system. All the benefits of a database, none of the hassle.

DataGrip is a database management environment for developers. It is designed to query, create, and manage databases. Databases can work locally, on a server, or in the cloud.

Set up the environment

1. Download the DuckDB JDBC driver from Maven.

2. In DataGrip, create a new Driver configuration using the downloaded jar file.

3. Create a new Data Source. The connection URL should be just jdbc:duckdb: as shown in the screenshot below.
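A quick aside: the bare jdbc:duckdb: URL gives you an in-memory database that is discarded when the connection closes, which is exactly what we want for ad-hoc file analysis. If you ever want the database itself to persist, the DuckDB JDBC driver should also accept a file path in the URL, something like the (made-up) example below.

jdbc:duckdb:/path/to/analysis.duckdb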

4. Generate a large .csv file to showcase the functionality.

# generate-csv.sh: emit random employee/department/salary rows
# (plus a header) and keep the first $1 lines in data.csv
hexdump -v -e '5/1 "%02x""\n"' /dev/urandom |
 awk -v OFS=',' '
   NR == 1 { print "employee", "department", "salary" }
   { print substr($0, 1, 8), substr($0, 9, 2), int(NR * rand()) }' |
 head -n "$1" > data.csv

Using this shell script we can easily generate a CSV with 10,000,000 lines:

~/duckdb-datagrip/generate-csv.sh 10000000

~/duckdb-datagrip ❯ ls -lah
drwxr-xr-x  4 daniel.palma  staff   128B May 28 08:47 .
drwxr-xr-x  8 daniel.palma  staff   256B May 28 08:33 ..
-rw-r--r--  1 daniel.palma  staff   187M May 28 09:16 data.csv
-rwxr--r--  1 daniel.palma  staff   217B May 28 09:14 generate-csv.sh

As you can see, we generated an almost 200MB .csv file in a second (at least on my machine). This is a fairly realistic use case (although a bit bigger than usual).

5. Let’s get to querying!

select * from '/path/to/data.csv';

Reading the whole 200MB file took less than a second, and DuckDB parsed the column types automatically and perfectly.
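If you want to double-check what DuckDB inferred, a DESCRIBE over the same file should show the column names and types it picked, and, if I remember the CSV reader correctly, read_csv_auto lets you set options such as the header flag explicitly (a minimal sketch, the paths are placeholders):

-- show the inferred schema of the CSV
DESCRIBE SELECT * FROM '/path/to/data.csv';

-- call the CSV reader explicitly and force the header option
SELECT * FROM read_csv_auto('/path/to/data.csv', header=true) LIMIT 10;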

Let’s see how the performance holds up with some basic analytical queries.

select count(*) from '/path/to/data.csv';

1 row retrieved starting from 1 in 683 ms (execution: 663 ms, fetching: 20 ms)
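Grouped aggregations run directly against the file as well. As a sketch (the column names are the ones our generator script wrote, and the timing will obviously vary by machine):

-- headcount and salary stats per department, biggest departments first
SELECT department,
       count(*)    AS employees,
       avg(salary) AS avg_salary,
       max(salary) AS max_salary
FROM '/path/to/data.csv'
GROUP BY department
ORDER BY employees DESC
LIMIT 5;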

Not bad! What about a window function? Let’s try to find the employees in each department with the second highest salary.

WITH ranking AS
    (SELECT *, RANK() OVER (PARTITION BY employee, department ORDER BY salary DESC) AS rnk
     FROM '/Users/daniel.palma/Personal/duckdb-datagrip/data.csv')
SELECT employee, department, salary
FROM ranking
WHERE rnk = 2;

52 rows retrieved starting from 1 in 16 s 557 ms (execution: 16 s 544 ms, fetching: 13 ms)

Alright, that took a bit longer, but I’m still comfortable with this.
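As an aside, recent DuckDB releases also support a QUALIFY clause, which, if your version has it, lets you filter on the window function directly and skip the CTE. A sketch of the same query in that style:

SELECT employee, department, salary
FROM '/Users/daniel.palma/Personal/duckdb-datagrip/data.csv'
QUALIFY RANK() OVER (PARTITION BY employee, department ORDER BY salary DESC) = 2;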

Conclusion

This is just scratching the surface of what DuckDB can do, so make sure to check it out!
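One last teaser: if you end up querying the same file repeatedly, converting it to Parquet should (as far as I know) be a single statement, sketched here with placeholder paths:

COPY (SELECT * FROM '/path/to/data.csv') TO '/path/to/data.parquet' (FORMAT PARQUET);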