Writing Parquet Files to Azure with Java

In this guide I will show you how to write Parquet files from vanilla Java code. To do that, we will work with AvroParquetWriter<GenericRecord>, together with Path and Configuration from the Hadoop libraries.

Why Java?

It is very common to reach for Apache Spark when working with Parquet files, but in many data flow architectures we don't want to use Spark for microservices. Consider, for example, a service that parses small chunks of data and saves them to blob storage: running Spark for that would be overkill and would cost us in performance. Instead, we can write the microservice in plain Java and deploy it to Kubernetes, which saves us a lot of compute time, and money.

Dependencies

Add the following dependencies to your pom.xml:

<dependencies>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-avro</artifactId>
        <version>1.11.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>com.rainerhahnekamp</groupId>
        <artifactId>sneakythrow</artifactId>
        <!-- any recent version works; 1.2.0 at the time of writing -->
        <version>1.2.0</version>
    </dependency>
</dependencies>

We will use parquet-avro to do the actual writing, hadoop-common and hadoop-azure to connect to our desired storage, and sneakythrow to call the writer's checked-exception API from inside a lambda.
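
Before moving on, here is a minimal sketch of what sneakythrow buys us (printLine and the strings below are made-up illustrations): sneaked() wraps a lambda that throws a checked exception so it can be passed where a plain Consumer is expected.

import static com.rainerhahnekamp.sneakythrow.Sneaky.sneaked;

import java.io.IOException;
import java.util.List;

public class SneakyDemo {
    // A method with a checked exception, standing in for parquetWriter::write.
    static void printLine(String line) throws IOException {
        if (line.isEmpty()) throw new IOException("empty line");
        System.out.println(line);
    }

    public static void main(String[] args) {
        // Without sneaked(), forEach would reject printLine because of the checked IOException.
        List.of("hello", "world").forEach(sneaked(SneakyDemo::printLine));
    }
}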

Blob Storage Configuration

import org.apache.hadoop.conf.Configuration;


String accountName = "your-account-name";
String accessKey = "your-access-key";
String containerName = "target-container";
Configuration conf = new Configuration();
// Register the account key so the wasbs:// filesystem can authenticate.
conf.set(String.format("fs.azure.account.key.%s.blob.core.windows.net", accountName), accessKey);
// Base path for every file we write into the container.
String pathTemplate = String.format("wasbs://%s@%s.blob.core.windows.net/", containerName, accountName);

Data Lake Gen2 Configuration

import org.apache.hadoop.conf.Configuration;


String accountName = "your-account-name";
String accessKey = "your-access-key";
String containerName = "target-container";
Configuration conf = new Configuration();
// Register the account key so the abfs:// filesystem can authenticate.
conf.set(String.format("fs.azure.account.key.%s.dfs.core.windows.net", accountName), accessKey);
// Base path for every file we write into the container.
String pathTemplate = String.format("abfs://%s@%s.dfs.core.windows.net/", containerName, accountName);
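
Whichever storage flavor you use, you can sanity-check the configuration before writing anything by resolving the filesystem behind the path template (a small sketch; hadoop-azure provides the filesystem implementations for the wasbs:// and abfs:// schemes):

import java.net.URI;
import org.apache.hadoop.fs.FileSystem;

// Throws an exception if the scheme is unknown or the account settings are malformed.
FileSystem fs = FileSystem.get(new URI(pathTemplate), conf);
System.out.println(fs.getUri());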

Writing Parquet

After configuring the Configuration object for the storage we use, we can use AvroParquetWriter as usual, just as we would against HDFS storage.

import static com.rainerhahnekamp.sneakythrow.Sneaky.sneaked;

import java.io.IOException;
import java.util.Collection;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public void writeBulk(String path, Schema schema, Collection<GenericRecord> records) throws IOException {
    String newPath = pathTemplate + path; // conf, pathTemplate and compressionCodecName are fields set up above
    try (ParquetWriter<GenericRecord> parquetWriter =
                 AvroParquetWriter.<GenericRecord>builder(new Path(newPath))
                         .withConf(conf)
                         .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                         .withCompressionCodec(compressionCodecName)
                         .withSchema(schema)
                         .build()) {
        // sneaked() lets the checked IOException from write() pass through the lambda
        records.forEach(sneaked(parquetWriter::write));
    }
}

This function writes all of the records to the given path in your Azure Storage account.
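
If you want to verify the upload, you can read the file back with AvroParquetReader, which ships in the same parquet-avro dependency (a minimal sketch, reusing the conf and pathTemplate fields from above):

import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public void printAll(String path) throws IOException {
    try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path(pathTemplate + path))
                         .withConf(conf)
                         .build()) {
        // read() returns the next record, or null once the file is exhausted.
        for (GenericRecord record = reader.read(); record != null; record = reader.read()) {
            System.out.println(record);
        }
    }
}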

Creating GenericRecord

import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

// Define a schema with two required fields, then build records against it.
Schema schema = SchemaBuilder
        .builder()
        .record("record")
        .fields()
        .requiredLong("id")
        .requiredString("name")
        .endRecord();
GenericRecord record1 = new GenericRecordBuilder(schema).set("id", 1L).set("name", "irony").build();
GenericRecord record2 = new GenericRecordBuilder(schema).set("id", 2L).set("name", "dev").build();
List<GenericRecord> recordList = new ArrayList<>();
recordList.add(record1);
recordList.add(record2);

Example

writeBulk("newParquert.parquet", schema, records)

And that's it, you've got a new Parquet file in your Azure Storage!