Custom Encoders vs. UDTs

Encoders use runtime reflection and code generation to build expressions that encode/decode objects directly to/from the low-level binary row format developed as part of Project Tungsten a few Spark releases ago (circa Spark 1.4/1.5).

Because this reflection and code generation happen at runtime, Encoders can be used across different versions of Spark; binary compatibility between versions is not required.
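As a minimal sketch of that mechanism, the example below uses ExpressionEncoder, the internal class behind Spark's encoders. Person is a hypothetical case class, and the method names follow the Spark 3.x internal API, which may differ in other releases:

    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

    // Hypothetical case class used only for this sketch.
    case class Person(name: String, age: Int)

    // Runtime reflection over Person's fields drives the generated code.
    val enc = ExpressionEncoder[Person]()

    // Encode a JVM object directly into a Tungsten binary InternalRow...
    val toRow = enc.createSerializer()
    val row = toRow(Person("Alice", 29))

    // ...and decode it back into a Person.
    val fromRow = enc.resolveAndBind().createDeserializer()
    val person = fromRow(row)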

Spark provides encoders out of the box for most common types (e.g. String, Int) and for simple algebraic data types it knows about (List, Tuple, etc.).
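For example, with the built-in encoders in scope (the session setup and sample data below are illustrative):

    import org.apache.spark.sql.{Encoders, SparkSession}

    // Hypothetical case class; Spark derives a product encoder for it.
    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("encoder-example").getOrCreate()
    import spark.implicits._  // built-in encoders for primitives, tuples, case classes

    val ints   = Seq(1, 2, 3).toDS()              // uses Encoder[Int]
    val pairs  = Seq(("a", 1), ("b", 2)).toDS()   // uses a tuple encoder
    val people = Seq(Person("Alice", 29)).toDS()  // uses a case-class encoder

    // Encoders can also be requested explicitly:
    val personEncoder = Encoders.product[Person]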

Always prefer Encoders when you can: the generated code sidesteps binary-compatibility concerns entirely, and its performance and efficiency keep improving with newer versions of Spark.

If Encoders do not work for your data types, you can use UDTs (user-defined types).  Fortunately, the UDT public API has remained stable since Spark 1.3, so binary compatibility shouldn't be an issue with newer versions of Spark.
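For reference, a minimal UDT sketch might look like the following. Point and PointUDT are hypothetical names, the exact Catalyst representation expected by serialize/deserialize has changed between Spark versions (the signatures below follow the Spark 2.x-era API), and the UDT API's visibility has also varied across releases, so treat this as a sketch rather than a drop-in implementation:

    import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
    import org.apache.spark.sql.types._

    // Hypothetical user type that Spark has no built-in encoder for.
    @SQLUserDefinedType(udt = classOf[PointUDT])
    case class Point(x: Double, y: Double)

    class PointUDT extends UserDefinedType[Point] {
      // How the value is stored inside Spark SQL rows.
      override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

      // JVM object -> Catalyst internal representation.
      override def serialize(p: Point): Any =
        new GenericArrayData(Array(p.x, p.y))

      // Catalyst internal representation -> JVM object.
      override def deserialize(datum: Any): Point = datum match {
        case data: ArrayData =>
          val values = data.toDoubleArray()
          Point(values(0), values(1))
      }

      override def userClass: Class[Point] = classOf[Point]
    }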
