Welcome back!

If you’re new to this blog series, head on over to part 1 to catch up on what we’re doing here. That first post sets the stage, gives motivation to the problem, and sets the parameters of the challenge using the following text as our test set:

text <- c(
  "Pneumonoultramicroscopicsilicovolcanoconiosis is a really long word", 
  "It's a name that has been invented for a lung disease caused by breathing in very small pieces of ash or dust", 
  "Pneumono refers to the lung. Ultra means extremely, microscopic means tiny, silica is sand, volcano is self-evident, and coniosis is scarring.")

In this post, I’m going to walk through my initial attempt at solving the text-wrapping problem at hand. Heads up – it’s over-engineered and it’s slow. But hey! It tells a story and hopefully you can learn a thing or two as I walk through the code. That’s what this is all about anyway.

So let’s get started.

Solution 1: Recursion

Who doesn’t love a good recursive function? Especially when it’s not the de facto Fibonacci example. The first time I looked at this problem, I saw a recursive solution:

  1. Use stringi::stri_wrap() to break the input string into chunks by width, split at whitespace
  2. Check if any of those chunks exceed the allotted width, and if so find the first segment that does
  3. Capture everything that comes before and after the log word separately
  4. Grab width - 1 characters out of the long word, and concatenate a hyphen onto that segment, and additionally grab the excess
  5. Take the excess and everything following the long word, then recurse until there are no long words left in the string
  6. When the break condition is met (i.e. no more long words), concatenate everything remaining, delimited by newlines (\n)

Like I said… overengineered

Recursion
#' A recursive approach to string splitting
#'
#' While this example is the most complicated, it's also the second least
#' efficient. Recursion is fun, but there's a lot of redundant work going on
#' inside this function
#'
#' @param s The string to split
#' @param width Longest allowable width of a line
#' @param indent Indent passed through to stri_wrap
#' @param exdent Exdent passed through to stri_wrap
#' @param join Character to join the lines together (default to \\n)
#'
#' @return Wrapped string
overflow_wrap1 <- function(s, width=80, indent = 0, exdent = 0, join = "\n") {
   # Don't go down this hole if we can avoid it
   if (nchar(s) < width) {
      return(s)
   }
   # Run str_wrap to get the break vector
   str_vec <- stri_wrap(s, width, indent = indent, exdent = exdent, simplify=TRUE)
   # Do any of the wrapped segments exceed the string width?
   if (any(which(nchar(str_vec) > width))) {
      # Get the first segment that exceeds width
      fe <- min(which(nchar(str_vec) > width))
      if (length(str_vec) > 1) {
         # If the exceeded word is the last word, need to handle that case
         if (fe == length(str_vec)) {
            pre <- str_vec[1:fe-1]
            post <- character()
         } else {
            # Get everything before and after the long word
            pre <- str_vec[1:fe-1]
            post <- str_vec[fe + 1: (length(str_vec) - fe)]
         }
      } else {
         # If it's just one long word we have to play around
         pre <- character()
         post <- character()
      }
      # Get the long word
      lw <- str_vec[fe]
      # Split the long word and concatenate with a dash on the end
      trimmed <- paste0(str_sub(lw, 1, width-1), "-")
      excess <- str_sub(lw, width)
      # Now process again and continue wrapping
      str_vec <- c(pre, trimmed, overflow_wrap1(paste(c(excess, post), collapse=" "), width, indent, exdent, join))
   }
   # Join everything back together
   paste(str_vec, collapse=join)
}
# Vectorized version of overflow_wrap1
overflow_wrapper1 <- Vectorize(overflow_wrap1, vectorize.args = "s")

Using stringr::str_wrap() nicely breaks the string out by the word, but when a single string exceeds the alotted with (without any whitespace), stringr::str_wrap() allows it to overflow. stringr::str_wrap() returns a string with newline characters inserted. If you go down into stringi::stri_wrap() instead, you get a nicely chunked character vector.

As a quick aside, from the first post of this series someone pointed out that there’s more to stringr::str_wrap() than just wrapping words nicely based on allotted width. This function actually uses the Knuth’s word wrapping algorithm under the hood, which minimizes the amount of trailing white space on each line – aiming for the most aesthetically pleasing output. This makes the speed of our baseline results here all the more impressive. Learn more in the stringi::str_wrap() documentation

So first question, why do we need to recurse? Why can’t we just find all the long strings and break them apart after stringi::stri_wrap() has been run? Because stringi::stri_wrap() optimizes those break points based on your set width. By breaking up the long strings, you’ll likely be left with certain wrapped lines that are very short – and the overall appearance will be quite poor. Everything prior to that long string will look fine, but everything after it needs to be wrapped again at the proper locations.

Take this small segment of the solutions:

microscop-
ic 
means
tiny,
silica

“Microscopic” needed to wrap the “ic” onto the next line, but “means” could still fit in the allotted width after “ic”. The splitting was already run, and now the proper break locations after the “ic” in “Microscopic” need to be reset.

If it feels redundant and like we’re doing to much rework – that’s because it is. While it was a fun solution to write, it’s not very elegant. Take a look and digest the code. There are a couple extra pieces included here to handle edge cases as well, as well as the vectorization of the function to work on character vectors instead of single strings.

Measuring Performance

So how well does it work? Or, how poorly? Let’s find out.

Now would be a good time to grab some coffee, go to lunch. This is gonna take a while

wrap1_bench <- microbenchmark::microbenchmark(
   overflow_wrapper1(rep(text, 1000), width=10),
   unit='s'
)
microbenchmark:::autoplot.microbenchmark(wrap1_bench)
Graph measuring performance

So wow – this is a dramatic difference. we now have a mean execution time of about 2.1 seconds, versus the .04 from our baseline. That’s quite a bit of overhead, and the overhead is quite clear from this example as well. Each time a string exceeds the allotted width, we’re essentially running the baseline example again. If the width is 10, using our first example sentence “Pneumonoultramicroscopicsilicovolcanoconiosis is a really long word”, that means we’re running stringi::stri_wrap() 5 times!

Needless to say, the performance of this solution is quite poor. So let’s see if we can do better next week!

As a reminder – you can submit your solutions in our GitHub repo, which has submission instructions in the README.

Back to Blog