Writing Parquet Files to Azure with Java
In this guide I will show you how to write Parquet files from vanilla Java code. To do that we will work with AvroParquetWriter<GenericRecord>, together with Path and Configuration from the Hadoop (HDFS) libraries.
Why Java?
It is very common to use Apache Spark when working with Parquet files, but in many data-flow architectures we don't want to use Spark for microservices; for example, a service that parses small chunks of data and saves them to blob storage. In this case Spark would be overkill and would cost us in performance. Instead, we can write the microservice in Java and deploy it to Kubernetes, which saves a lot of compute time, and money.
Dependencies
Add the following dependencies to your pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.parquet</groupId>
        <artifactId>parquet-avro</artifactId>
        <version>1.11.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-azure</artifactId>
        <version>3.3.0</version>
    </dependency>
    <dependency>
        <groupId>com.rainerhahnekamp</groupId>
        <artifactId>sneakythrow</artifactId>
    </dependency>
</dependencies>
We will use parquet-avro to actually write the files, and hadoop-common and hadoop-azure to connect to our desired storage.
Blob Storage Configuration
import org.apache.hadoop.conf.Configuration;

String accountName = "your-account-name";
String accessKey = "your-access-key";
String containerName = "target-container";

Configuration conf = new Configuration();
conf.set(String.format("fs.azure.account.key.%s.blob.core.windows.net", accountName), accessKey);
String pathTemplate = String.format("wasbs://%s@%s.blob.core.windows.net/", containerName, accountName);
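To make the resulting location concrete, the template above simply composes with a relative file path. The account name, container name, and relative path below are placeholder values for illustration:

```java
public class PathDemo {
    public static void main(String[] args) {
        // Placeholder values, matching the configuration snippet above
        String accountName = "your-account-name";
        String containerName = "target-container";
        String pathTemplate = String.format(
                "wasbs://%s@%s.blob.core.windows.net/", containerName, accountName);

        // Appending a relative path to the template yields the full wasbs:// URI
        String fullPath = pathTemplate + "output/data.parquet";
        System.out.println(fullPath);
        // prints: wasbs://target-container@your-account-name.blob.core.windows.net/output/data.parquet
    }
}
```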
Data Lake Gen2 Configuration
import org.apache.hadoop.conf.Configuration;

String accountName = "your-account-name";
String accessKey = "your-access-key";
String containerName = "target-container";

Configuration conf = new Configuration();
conf.set(String.format("fs.azure.account.key.%s.dfs.core.windows.net", accountName), accessKey);
String pathTemplate = String.format("abfs://%s@%s.dfs.core.windows.net/", containerName, accountName);
Writing Parquet
Once we have configured the Configuration object for the storage we use, we can use AvroParquetWriter as usual, just as with HDFS storage.
public void writeBulk(String path, Schema schema, Collection<GenericRecord> records) throws IOException {
    String newPath = pathTemplate + path;
    try (ParquetWriter<GenericRecord> parquetWriter =
            AvroParquetWriter.<GenericRecord>builder(new Path(newPath))
                .withConf(conf)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withCompressionCodec(compressionCodecName) // e.g. CompressionCodecName.SNAPPY
                .withSchema(schema)
                .build()) {
        // sneaked(...) comes from the sneakythrow dependency
        records.forEach(sneaked(parquetWriter::write));
    }
}
This function writes all of the records to the given path in your Azure Storage.
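A note on sneaked(...): Collection.forEach takes a java.util.function.Consumer, whose accept method cannot declare checked exceptions, so the method reference parquetWriter::write (which throws IOException) would not compile on its own; sneakythrow's sneaked wraps it. If you prefer to avoid the extra dependency, a plain loop does the same job. Here is a self-contained sketch of that pattern; the ThrowingConsumer interface and the String records are illustrative stand-ins, not part of the Parquet API:

```java
import java.io.IOException;
import java.util.List;

public class PlainLoopDemo {
    // Stand-in for a writer whose write method throws a checked exception,
    // the way ParquetWriter.write throws IOException
    interface ThrowingConsumer<T> {
        void accept(T t) throws IOException;
    }

    // A plain for-loop lets the checked IOException propagate normally,
    // so no sneakythrow wrapper is needed
    static void writeAll(List<String> records, ThrowingConsumer<String> writer) throws IOException {
        for (String record : records) {
            writer.accept(record);
        }
    }

    public static void main(String[] args) throws IOException {
        writeAll(List.of("id=1", "id=2"), r -> System.out.println("wrote " + r));
    }
}
```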
Creating GenericRecord
Schema schema = SchemaBuilder
.builder()
.record("record")
.fields()
.requiredLong("id")
.requiredString("name")
.endRecord();
GenericRecord record1 = new GenericRecordBuilder(schema).set("id", 1L).set("name", "irony").build();
GenericRecord record2 = new GenericRecordBuilder(schema).set("id", 2L).set("name", "dev").build();
List<GenericRecord> recordList = new ArrayList<>();
recordList.add(record1);
recordList.add(record2);
Example
writeBulk("newParquet.parquet", schema, recordList);
And that's it: you have a new Parquet file in your Azure Storage!