## Using Shapeless for Data Cleaning in Apache Spark

When it comes to importing data into a BigData infrastructure like Hadoop, Apache Spark is one of the most used tools for ETL jobs. Because input data – in this case CSV – has often invalid values, a data cleaning layer is needed. Most tasks in data cleaning are very specific and therefore need to be implemented depending on your data, but some tasks can be generalized. In this post, I’ll not go into Spark, ETL or BigData in general, but provide one approach to clean null / empty values off a data set. [Read More]

## Java Libs in Scala - A bit more Functional

Every Java library can be used in Scala, which is, for me, one of the good parts of the JVM world. But Java libs are mostly object-oriented and not functional, therefore full of side effects and somtimes “ugly” to use in Scala. But there are some approaches how to make Java libs (or their interfaces) more functional, so they can almost be used like a Scala lib. Java 8 Type Conversion Many Java types like Map or List, but also functional types (Java 8) like Optional<T> have Scala pendents. [Read More]

## Understanding Stemmers (Natural Language Processing)

I am interested in NLP and have already some experience with Apache Solr. It’s time to dig a little in-deep regarding stemmers. First of all, I was looking for a general definition of what a stemmer is, and I found this one, which IMHO is quite good: stemmer — an algorithm for removing inflectional and derivational endings in order to reduce word forms to a common stem So what a stemmer does is nothing more, than converting words to their word stem. [Read More]

## If pragmatism raises technical debt, call it oversimplification (rant)

The word “pragmatism” or “pragmatic” is, in my personal opinion, the most overrated word in agile development. Many people use this as a buzzword without knowing what it means. I hear people saying “He solved that complex problem in half an hour, he’s so pragmatic!” and think for myself “Yeah, but that ‘solution’ probably causes other devs three times more effort than a sustainable solution would take.” Okay, but that’s only my anger speaking. [Read More]