Before gene prediction, the sequence should be masked for interspersed
repeats. These repeats are degenerate copies of transposable elements,
and make up about a third of the human genome. The protein-coding
portion of a gene almost never contains interspersed repeats, so
masking them improves gene prediction.
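As a rough illustration of what masking means, the sketch below
replaces annotated repeat intervals with N characters ("hard masking").
This is an assumption made for illustration: the text above does not say
how N-SCAN masks internally, and the function name hard_mask and the
0-based, half-open interval format are hypothetical. In practice the
intervals would come from a separate repeat-annotation step.

```python
def hard_mask(sequence, repeat_intervals):
    """Return a copy of sequence with each repeat interval replaced by N.

    repeat_intervals holds 0-based, half-open (start, end) pairs.
    Intervals are clipped to the sequence bounds and may overlap.
    """
    masked = list(sequence)
    for start, end in repeat_intervals:
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


# Mask positions 2..5 of a toy sequence.
print(hard_mask("ACGTACGTACGT", [(2, 6)]))  # prints ACNNNNGTACGT
```

Masked bases carry no sequence signal, so a gene predictor working on
the masked sequence has little reason to place coding exons there.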
Low-complexity and simple repeats are short, repetitive sequences
such as TATATATATA or GAGATAGAGAGA. Genes sometimes do contain such
repeats, so by default they are not masked. If you see whole exons
consisting of such repeats in your results, you may want to re-run
prediction with the simple repeats masked.
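One way to check a predicted exon for this problem is to measure how
much of it is covered by short tandem repeats. The sketch below is only
a heuristic: the function name simple_repeat_fraction, the maximum unit
length of 3, and the minimum of 3 copies are hypothetical choices, and
this is not the method used by N-SCAN or by any repeat masker.

```python
def simple_repeat_fraction(seq, max_unit=3, min_copies=3):
    """Fraction of seq covered by tandem repeats of a short unit.

    A stretch counts when it is periodic with a period of 1..max_unit
    bases and is at least min_copies units long.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    covered = [False] * len(seq)
    for unit in range(1, max_unit + 1):
        start = 0  # start of the current stretch with period `unit`
        for i in range(unit, len(seq) + 1):
            if i < len(seq) and seq[i] == seq[i - unit]:
                continue  # the periodic stretch keeps going
            # seq[start:i] is a maximal stretch with period `unit`
            if i - start >= unit * min_copies:
                for j in range(start, i):
                    covered[j] = True
            start = i - unit + 1
    return sum(covered) / len(seq)


print(simple_repeat_fraction("TATATATATA"))  # prints 1.0
```

An exon whose fraction is close to 1.0 consists almost entirely of
simple repeats and is a candidate for re-running prediction with simple
repeats masked.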
By default, interspersed repeats are masked, and low-complexity and
simple repeats are not. These are the settings we use when running
N-SCAN on whole genomes.