Handling Byte-Order-Mark Characters in Groovy
Author: Paul King
Published: 2024-07-11 08:00PM
A recent article showed how to process Byte Order Mark (BOM) characters within text files when coding in Java. In particular, those characters often need to be removed manually when processing text files. The article showed how to remove the BOM characters when using the InputStream and Reader classes as well as how to do it using NIO functionality. It also showed how the BOMInputStream class in Apache Commons IO, which automatically skips over the BOM characters, could be used.
Those examples can be run as is in Groovy (albeit after fixing a bug in the first example) but the (complete!) idiomatic solution in Groovy is:
println new File('file.txt').text
That’s right, Groovy automatically detects the encoding, and removes BOM characters, when using the getText() method along with others like eachLine, splitEachLine, readLines, withReader, and filterLine.
The same functionality can be obtained using the newReader method too on files and URLs.
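To make that behavior concrete, here is a minimal sketch that writes a small temporary file starting with the UTF-8 BOM and then reads it back with getText, readLines, and newReader. The temporary file name and its contents are assumptions chosen for illustration only.

```groovy
// Create a temporary file that starts with the UTF-8 BOM (bytes EF BB BF)
def file = File.createTempFile('bom-demo', '.txt')
file.deleteOnExit()
file.withOutputStream { out ->
    out.write([0xEF, 0xBB, 0xBF] as byte[])
    out.write('Hello\nWorld\n'.getBytes('UTF-8'))
}

// getText() detects the encoding and drops the leading BOM character
def text = file.text
assert !text.startsWith('\uFEFF')
assert text == 'Hello\nWorld\n'

// readLines() goes through the same detection
def lines = file.readLines()
assert lines == ['Hello', 'World']

// newReader() with no arguments hands back a reader positioned past the BOM
def firstLine = file.newReader().withCloseable { it.readLine() }
assert firstLine == 'Hello'
```

None of the calls needs an encoding argument; the detection happens before the first character is handed to your code.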
There are also variants that let you specify the encoding explicitly. In that case, you’d need to handle the BOM characters manually.
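As a rough sketch of what manual handling can look like, the following reads a BOM-prefixed file with an explicit charset and then strips the leading U+FEFF character. The stripBom closure is a hypothetical helper written for this example, not something Groovy provides, and the file contents are again made up.

```groovy
// Create a temporary file that starts with the UTF-8 BOM (bytes EF BB BF)
def file = File.createTempFile('bom-explicit', '.txt')
file.deleteOnExit()
file.withOutputStream { out ->
    out.write([0xEF, 0xBB, 0xBF] as byte[])
    out.write('Hello'.getBytes('UTF-8'))
}

// With an explicit charset, the BOM is decoded as the character U+FEFF and kept
def raw = file.getText('UTF-8')
assert raw.startsWith('\uFEFF')

// Hypothetical helper (not part of Groovy): drop a single leading BOM character
def stripBom = { String s -> s.startsWith('\uFEFF') ? s.substring(1) : s }

def cleaned = stripBom(raw)
assert cleaned == 'Hello'
```

This only covers the case where the declared charset matches the BOM. If a file might arrive in several encodings, letting Groovy detect the encoding is likely the simpler route.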
Groovy’s methods like getText call an underlying CharsetToolkit class. You can also use that class directly should you wish to learn more about the encoding of a file.
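For instance, a short sketch along the following lines asks CharsetToolkit which charset it guesses for a file and whether it saw a BOM. The temporary file and its contents are assumptions for illustration; the import is shown for clarity even though groovy.util is imported by default.

```groovy
import groovy.util.CharsetToolkit

// Create a temporary file that starts with the UTF-8 BOM (bytes EF BB BF)
def file = File.createTempFile('bom-toolkit', '.txt')
file.deleteOnExit()
file.withOutputStream { out ->
    out.write([0xEF, 0xBB, 0xBF] as byte[])
    out.write('Hello'.getBytes('UTF-8'))
}

// The toolkit inspects the start of the file when it is constructed
def toolkit = new CharsetToolkit(file)

// Ask which kind of BOM, if any, was found
assert toolkit.hasUTF8Bom()
assert !toolkit.hasUTF16LEBom()
assert !toolkit.hasUTF16BEBom()

// getCharset() returns the guessed java.nio.charset.Charset
def detected = toolkit.charset.name()
assert detected == 'UTF-8'

println "Detected ${detected} (UTF-8 BOM present: ${toolkit.hasUTF8Bom()})"
```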