Using Shapeless for Data Cleaning in Apache Spark

When it comes to importing data into a big data infrastructure like Hadoop, Apache Spark is one of the most widely used tools for ETL jobs. Because input data (in this case CSV) often contains invalid values, a data cleaning layer is needed. Most data cleaning tasks are very specific and have to be implemented for your particular data, but some can be generalized. In this post, I won’t go into Spark, ETL or big data in general, but will present one approach to cleaning null / empty values from a data set. [Read More]
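To make the idea of cleaning null / empty values concrete, here is a minimal sketch in plain Scala. It is not the Shapeless-based approach the post describes, and it does not use Spark. The names `NullCleaning`, `clean` and `cleanRow` are hypothetical, and the rule "whitespace-only counts as empty" is an assumption.

```scala
object NullCleaning {
  // Assumption: null, empty and whitespace-only strings all count as "missing".
  // Option(raw) turns null into None; trim + filter drops blank values.
  def clean(raw: String): Option[String] =
    Option(raw).map(_.trim).filter(_.nonEmpty)

  // Apply the same rule to every field of one parsed CSV row.
  def cleanRow(row: Seq[String]): Seq[Option[String]] =
    row.map(clean)
}
```

Representing missing fields as `None` instead of `null` or `""` lets later steps handle them explicitly through the type system.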

Java Libs in Scala - A bit more Functional

Every Java library can be used in Scala, which is, for me, one of the good parts of the JVM world. But Java libs are mostly object-oriented rather than functional, and therefore full of side effects and sometimes “ugly” to use in Scala. There are, however, some ways to make Java libs (or their interfaces) more functional, so that they can be used almost like a Scala lib. Java 8 Type Conversion: Many Java types like Map or List, but also functional types (Java 8) like Optional<T>, have Scala counterparts. [Read More]
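As a rough illustration of such type conversions, the sketch below uses the converters from the Scala 2.13 standard library (`scala.jdk.CollectionConverters` and `scala.jdk.OptionConverters`). This assumes Scala 2.13 or newer; the post itself may use a different mechanism. The object `JavaInterop` and its method names are hypothetical.

```scala
import java.util.Optional
import scala.jdk.CollectionConverters._
import scala.jdk.OptionConverters._

object JavaInterop {
  // java.util.List -> immutable Scala List
  def toScalaList[A](xs: java.util.List[A]): List[A] =
    xs.asScala.toList

  // java.util.Map with boxed Integer values -> immutable Scala Map with Int values
  def toScalaMap(m: java.util.Map[String, Integer]): Map[String, Int] =
    m.asScala.view.mapValues(_.intValue).toMap

  // java.util.Optional -> scala.Option
  def toScalaOption[A](o: Optional[A]): Option[A] =
    o.toScala
}
```

Converting at the boundary like this keeps the rest of the code working with immutable Scala collections and `Option`, so the Java types do not leak further into the program.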

Scala Compiler Tuning

As my Scala projects go on, I want to share some compiler configuration and tricks that I use on many projects. A few small configuration options can greatly improve your code and warn you about things you would probably never discover otherwise. Basically, you can pass compiler options to scalac as command-line arguments:

    $ scalac -deprecation -unchecked -Xlint something.scala

If you are using SBT, it’s even simpler… You can just use the following configuration snippet in your build. [Read More]
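The excerpt stops before the snippet itself. As a minimal sketch (not necessarily the author's exact configuration), the three flags from the scalac command above could be set in `build.sbt` through sbt's `scalacOptions` setting:

```scala
// build.sbt
// Assumption: only the three flags named in the scalac example are enabled here.
scalacOptions ++= Seq(
  "-deprecation", // warn about usages of deprecated APIs
  "-unchecked",   // warn where generated code depends on unchecked assumptions (e.g. type erasure)
  "-Xlint"        // enable additional lint warnings
)
```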