November 22, 2013

Word capitalization with Haskell


We have this problem, given a string of words, capitalize just the first letter, but the main trouble here is the lack of a common pattern to distinguish between one and another, for example:

"helloanyonethere" -> "HelloAnyoneThere"
"himartin" -> "HiMartin"
"hellorange" -> "HelloRange"
"thereach" -> "TheReach"

Well, the trouble can be addressed using Haskell, list and the always opportune recursion. The input information would be a dictionary that contains all possible words and the string to analyze; the output should be an string with the capitalized letters.

Now, lets explain the code step by step, first we need a function that search a string in a list of strings, which is simple, just tell me if the element is contained in the list.


isInDic :: String -> [String] -> Bool
isInDic str dic = elem str dic


Next, the magic happens with two big (but equally structured) functions, one that considers as priority bigger words and another that starts from smaller words (dropMaxDic <-> dropMinDic). The only difference between these two functions is the comparison of the current index N, if it is less or equal to the maximum allowed word size or, in the other hand, greater than or equal to the minimum allowed word size.


dropMaxDic :: Int -> String -> [String] -> [String]
dropMaxDic n str dic 
    | n == minWordSize    = []
    | otherwise = 
        if  (foundInDic == True)
        then
            capitalize : dropMaxDic maxWordSize removed dic
        else
            dropMaxDic (n - 1) str dic
    where
        wordToSearch = take n str
        capitalize = toUpper (wordToSearch!!0) : drop 1 wordToSearch
        foundInDic = isInDic wordToSearch dic
        removed = drop n str


Apart from that, they just recursively call, have you found a word, then concatenate the first capital along with the rest of the word and continue searching for more given a Mix/Max size. On the main loop we just load the british english dictionary, parse it so that bigger words are at the beginning and call the desired function on each test. The output of this program is shown as follows:


*CameCase> main
["Hello","Anyone","There"]
["Him","Art","In"]
--- Martin is not part of the
--- dictionary, that is why it
--- is split into "Art" & "In"
["Hello","Range"]
["There"]  
--- Here, it was more important to catch
--- "There" than "The","Reach"

--- This section goes from
--- Min to Max
["Hell"]
["Him","Ar","Ti"]
["Hell","Or","An","Ge"]
["The","Re","Ac"]


As we can see, the second section (Min -> Max) do not show any of the desired strings as output, this is mainly restricted to the use of the dictionary. Of course, this procedure is not efficient at all!!, I am aware of that, because it behaves (in the worst case scenario) as O(n^n), first because every word (either Max or Min) makes "n" searches to the dictionary, then if nothing is found, continue with the following iteration (Min + 1 | Max - 1). n searches for n possible words it is not efficient at all!!, but there are some improvements that could take place:

  • Have a dictionary with a reduced set of words, those that are probabilistically more precise to appear.
  • Parse the dictionary and transform it into a tree, where a lookup can take just O(n log n) to search for an entry.
  • Parse the string as a "trie" structure, where the alphabet would mean the first entries and possible words could be constructed out of the connected leafs to the root.
  • Execute Max and Min concurrently and merge the end results. This would only output more accurate results but perform badly still.
  • Use a bullet proof library for parsing strings.

  • Become a better programmer and design well from the beginning :).

Therefore, for now and as an educational improvement on Haskell usage, this seems to be a fair program. Links:


No comments:

Post a Comment