The website owner is the Brazilian government. Knowing how the government works, it would be very difficult to find and contact anyone who even knows what a HEAD request is. Even then, the typical government website is developed by a contractor: the contract may have expired, may not have any room for fixes like this, and so on. All in all, I find it next to impossible to get the website owner to cooperate and change this behaviour.
Why is the crawler even attempting a HEAD request before trying GET? It's not as if document webpages would be retrievable by anything other than GET, and using only GET is what web browsers do. I think the crawler should always issue GET directly (after fetching and parsing robots.txt via GET first, of course), as in the sketch below.
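For illustration, here is a minimal sketch of the GET-only flow I have in mind, using only the Python standard library. The host, page URL, and user-agent string are placeholders, not the crawler's actual values:

```python
import urllib.robotparser
import urllib.request

ROBOTS_URL = "https://example.gov.br/robots.txt"   # hypothetical host
PAGE_URL = "https://example.gov.br/some/document"  # hypothetical page
USER_AGENT = "ExampleCrawler/1.0"                  # hypothetical crawler name

# Fetch and parse robots.txt first (itself retrieved with GET).
rp = urllib.robotparser.RobotFileParser(ROBOTS_URL)
rp.read()

if rp.can_fetch(USER_AGENT, PAGE_URL):
    # Go straight to GET, as a browser would; no preliminary HEAD probe.
    req = urllib.request.Request(PAGE_URL, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        print(resp.status, len(body), "bytes")
else:
    print("Disallowed by robots.txt")
```

This is one extra round trip saved per page, and it sidesteps servers (like this one, apparently) that mishandle HEAD entirely.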