Remove Extra New Lines and Whitespace Using Java Regex

Extracting text from another source can sometimes introduce formatting issues, such as extra blank lines or leading/trailing whitespace on each line.

This commonly occurs when extracting from HTML elements or XML documents.

The regular expressions below, passed to String.replaceAll, can fix these issues.

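(?m) = multi-line mode, in which ^ and $ match at the start and end of each line instead of only at the start and end of the whole string.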
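
To make the effect of the (?m) flag concrete, here is a minimal sketch that prefixes line starts with a marker, with and without the flag. The class name MultilineModeDemo and the method markLineStarts are hypothetical names used only for this illustration.

```java
// Hypothetical demo class, not part of any library.
final class MultilineModeDemo {

    // Inserts "> " wherever ^ matches. With (?m), ^ matches at the start
    // of every line; without it, only at the start of the whole string.
    static String markLineStarts(String text, boolean multiline) {
        String regex = multiline ? "(?m)^" : "^";
        return text.replaceAll(regex, "> ");
    }

    public static void main(String[] args) {
        String text = "first\nsecond";
        System.out.println(markLineStarts(text, false)); // only "first" is marked
        System.out.println(markLineStarts(text, true));  // both lines are marked
    }
}
```

The same flag can also be set with Pattern.MULTILINE when compiling a Pattern explicitly; the inline (?m) form is convenient because String.replaceAll takes the regex as a plain string.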

The following removes the leading/trailing whitespace from each line in the string.

// [\s&&[^\n]] matches any whitespace character except a newline
node.getTextContent().replaceAll(
    "(?m)^[\\s&&[^\\n]]+|[\\s&&[^\\n]]+$", "");
Example:
    The quick brown fox
  jumps over
      the lazy dog.
Result:
The quick brown fox
jumps over
the lazy dog.
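
As a self-contained sketch of the per-line trim, the snippet below applies the expression to a plain String instead of a DOM node. The class name TrimLinesDemo and the method trimLines are hypothetical, and the sample input assumes \n line endings.

```java
// Hypothetical demo class, not part of any library.
final class TrimLinesDemo {

    // Removes leading and trailing whitespace from every line.
    // The character-class intersection [\s&&[^\n]] means "whitespace
    // other than a newline", so the line breaks themselves are kept.
    static String trimLines(String text) {
        return text.replaceAll("(?m)^[\\s&&[^\\n]]+|[\\s&&[^\\n]]+$", "");
    }

    public static void main(String[] args) {
        String input = "   The quick brown fox   \n\tjumps over\t\n  the lazy dog.  ";
        System.out.println(trimLines(input));
    }
}
```

Whitespace between words is left alone, because it sits at neither the start nor the end of a line. Lines that contain only whitespace become empty lines rather than disappearing, which is why a separate expression is needed for blank lines.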

The following removes extra blank lines from the string.

node.getTextContent().replaceAll("(?m)^[ \\t]*\\r?\\n", "");
Example:
The quick brown fox

jumps over


the lazy dog.
Result:
The quick brown fox
jumps over
the lazy dog.
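
The blank-line removal can be tried the same way on a plain String. This is a minimal sketch; the class name BlankLineDemo and the method removeBlankLines are hypothetical names chosen for the example.

```java
// Hypothetical demo class, not part of any library.
final class BlankLineDemo {

    // Deletes every line that is empty or holds only spaces and tabs,
    // together with its line break (\n or \r\n).
    static String removeBlankLines(String text) {
        return text.replaceAll("(?m)^[ \\t]*\\r?\\n", "");
    }

    public static void main(String[] args) {
        String input = "The quick brown fox\n\n  \njumps over\n\t\n\nthe lazy dog.";
        System.out.println(removeBlankLines(input));
    }
}
```

Because the pattern requires a line break, a whitespace-only final line with no trailing newline is not removed; calling trim() on the result is one way to handle that case.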

References
Regular-Expressions.info – Specifying Modes Inside The Regular Expression

