Scalding is a powerful framework for writing complex data processing applications on Apache Hadoop. It’s concise and expressive - almost to a fault. It’s dangerously easy to pack gobs of subtle business logic into just a few lines of code. If you’re writing real data processing applications and not just ad-hoc reports, unit testing is a must. However, tests can get unwieldy to manage as job complexity grows and the arity of the data (the number of fields per tuple) increases.
Testing this job end-to-end would be fragile because there is so much going on, and it would be tedious and noisy to build fake data to isolate and highlight edge cases. The pivot operations on lines 20-22 only deal with browser and country, yet test data with all 10 fields is required, including valid timestamps and user agents, just to get to the pivot logic.
There are a few ways to tackle this, and an approach I like is to use extension methods to break down the logic into smaller chunks of testable code. The result might look something like this.
Each block of code depends on only a few fields so it doesn’t require mocking the entire input set.
    import Dsl._

    object ComplicatedJob {
      implicit class ComplicatedJobRichPipe(pipe: Pipe) {
        // this chunk of code is testable in isolation
        def countCountryByBrowser(): Pipe = {
          pipe
            .map('country -> 'country) { c: String => if (c == "us") c else "other" }
            .groupBy('browser, 'country) { _.size('count) }
            .groupBy('browser) { _.pivot(('country, 'count) -> ('us, 'other)) }
        }
        ...
      }
    }
In this example only browser and country are required, so setting up test data is reasonably painless and the intent of the test case isn’t lost in a sea of tuples. Granted, this approach requires creating a helper job to set up the input and capture the output for test assertions, but I think it’s a worthwhile trade-off to reveal such a clear test case.
    import ComplicatedJob._
    import ComplicatedJobTests._

    @RunWith(classOf[JUnitRunner])
    class ComplicatedJobTests extends FunSuite with ShouldMatchers {

      test("should count and pivot rows into columns") {
        val input = List[InputTuple](
          ("firefox", "us"),
          ("chrome", "us"),
          ("safari", "us"),
          ("firefox", "us"),
          ("firefox", "br"),
          ("chrome", "de"))

        val expected = Set[OutputTuple](
          ("firefox", 2, 1),
          ("safari", 1, 0),
          ("chrome", 1, 1))

        count(input) {
          _.toSet should equal (expected)
        }
      }
    }

    object ComplicatedJobTests {
      type InputTuple = (String, String)
      type OutputTuple = (String, Int, Int)

      // this is a helper job to set up the inputs and outputs
      // for the chunk of code we're trying to test
      class CountCountryByBrowser(args: Args) extends Job(args) {
        Tsv("input", ('browser, 'country))
          .read
          .countCountryByBrowser() // this is what we're testing
          .project('browser, 'us, 'other)
          .write(Tsv("output"))
      }

      // helper method to run our test job
      def count(input: List[InputTuple])(fn: List[OutputTuple] => Unit) {
        import Dsl._
        JobTest[CountCountryByBrowser]
          .source(Tsv("input", ('browser, 'country)), input)
          .sink[OutputTuple](Tsv("output")) { b => fn(b.toList) }
          .run
          .finish
      }
    }
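If you want to sanity-check the expected tuples before wiring up a test job, the same count-and-pivot arithmetic can be sketched on plain Scala collections. This is only a rough stand-in, not Scalding code: `PivotSketch` and `RichRows` are hypothetical names, the extension method lives on a `List` rather than a `Pipe`, and it says nothing about how Scalding's own `pivot` fills in a missing column. It does show the extension-method pattern in miniature and confirms the numbers used in the test above.

```scala
// Plain-Scala stand-in for the pipe logic; no Scalding or Hadoop required.
object PivotSketch {
  type InputTuple = (String, String)     // (browser, country)
  type OutputTuple = (String, Int, Int)  // (browser, us, other)

  // Hypothetical extension method on a List, mirroring countCountryByBrowser
  implicit class RichRows(rows: List[InputTuple]) {
    def countCountryByBrowser: Set[OutputTuple] =
      rows
        // collapse every country other than "us" into "other"
        .map { case (browser, country) =>
          (browser, if (country == "us") "us" else "other")
        }
        // gather the rows for each browser
        .groupBy { case (browser, _) => browser }
        // count the "us" rows; whatever remains in the group is "other"
        .map { case (browser, group) =>
          val us = group.count { case (_, country) => country == "us" }
          (browser, us, group.size - us)
        }
        .toSet
  }

  def main(args: Array[String]): Unit = {
    val input = List(
      ("firefox", "us"), ("chrome", "us"), ("safari", "us"),
      ("firefox", "us"), ("firefox", "br"), ("chrome", "de"))

    val expected = Set(("firefox", 2, 1), ("safari", 1, 0), ("chrome", 1, 1))

    // same input and expected values as the Scalding test case
    assert(input.countCountryByBrowser == expected)
  }
}
```

Because the method is attached through an implicit class, the call site reads as `input.countCountryByBrowser`, the same shape as the `.countCountryByBrowser()` call on the pipe in the helper job.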