Crawl failure: Consumidor.gov.br

I tried to add the service Consumidor.gov.br, but the crawler can’t read any documents there. I tried several different crawlers. Are they all being blocked by the service?

The error message:

[Screenshot: the error shown in Phoenix, 2021-07-09]

The bot gets the status code 405 Method Not Allowed, which is strange. In a browser the pages open just fine.

The website might block any form of crawling, which would prevent the documents from being crawled.
I’ve sometimes encountered that error code and never found any form of workaround for it.

The website does not allow HEAD requests (which the crawler sends before GETting any content). The only workaround, as of now, would be contacting the site owner.
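For what it’s worth, here is a minimal sketch (Python with requests, not the actual crawler code) that reproduces the difference; the document path below is a placeholder, not a verified URL:

```python
# Minimal reproduction sketch: the same URL rejects HEAD but answers GET.
# Assumption: the path below is a placeholder, not an actual document URL.
import requests

URL = "https://www.consumidor.gov.br/pages/principal/termos-de-uso"

head_resp = requests.head(URL, allow_redirects=True, timeout=10)
get_resp = requests.get(URL, timeout=10)

print("HEAD:", head_resp.status_code)  # expected: 405 Method Not Allowed
print("GET: ", get_resp.status_code)   # expected: 200 OK
```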


The website owner is the Brazilian government. Knowing how the government works, it would be very difficult to find and contact anyone who even knows what a HEAD request is. Even then, the typical government website is developed by a contractor; the contract may have expired, may not have any development room for fixes like this, etc. All in all, I find it next to impossible to get the website owner to cooperate and change this behaviour.

Why is the crawler even attempting a HEAD request before trying GET? It’s not like document webpages would be retrievable by anything but GET, and using only GET is what web browsers do. I think the crawlers should always try to GET directly (after GETting and parsing robots.txt, of course).
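As an illustration of that GET-only flow, here is a sketch assuming Python’s standard urllib.robotparser and the requests library; the user agent string and URL are placeholders:

```python
# Sketch of a GET-only flow: fetch robots.txt (a plain GET under the hood),
# then GET the document directly, with no HEAD in between.
from urllib.robotparser import RobotFileParser
import requests

USER_AGENT = "tosdr-crawler-example"  # placeholder user agent
url = "https://www.consumidor.gov.br/pages/principal/termos-de-uso"  # placeholder

rp = RobotFileParser("https://www.consumidor.gov.br/robots.txt")
rp.read()

if rp.can_fetch(USER_AGENT, url):
    resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    print(resp.status_code, resp.headers.get("Content-Type"))
```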


The crawler does a HEAD request first (sketched below) to:

  1. Ensure the content type is crawlable
  2. Ensure the content size is not bigger than X (in testing)
  3. Check the status code of the request
  4. Save resources (bandwidth and RAM): we have to spawn a new Chrome instance for each GET request, plus each request costs bandwidth, and that stuff doesn’t grow on trees.
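In rough terms, that pre-check looks like this (a sketch assuming Python/requests; the size limit, allow-list, and crawl_with_chrome() are stand-ins, not the real values or functions):

```python
# Sketch of the HEAD pre-check: verify status, content type and size
# before spending a Chrome instance on the full GET.
import requests

MAX_BYTES = 10 * 1024 * 1024        # assumed limit, not the real value
CRAWLABLE_TYPES = ("text/html",)    # assumed allow-list

def precheck(url: str) -> bool:
    resp = requests.head(url, allow_redirects=True, timeout=10)
    if resp.status_code != 200:                    # 3. status code
        return False
    if not resp.headers.get("Content-Type", "").startswith(CRAWLABLE_TYPES):  # 1. type
        return False
    return int(resp.headers.get("Content-Length", 0)) <= MAX_BYTES            # 2. size

def crawl(url: str):
    if precheck(url):
        return crawl_with_chrome(url)  # stand-in for the expensive Chrome fetch
    return None
```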

Ok. But we could consider having a different crawler instance that skips the HEAD request, just for those cases where the site owner won’t cooperate in fixing it. Also maybe a manual copy/paste feature for sites that block bots, use CAPTCHAs, or otherwise make the document difficult to crawl (that should maybe also warrant a new BAD or BLOCKER case for when sites don’t want to allow their ToS to be crawled).
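For illustration, such an opt-out could be as small as a per-service flag (hypothetical name skip_head_check, reusing precheck() and crawl_with_chrome() from the sketch above):

```python
# Hypothetical per-service flag that bypasses the HEAD pre-check for sites
# known to reject HEAD requests (e.g. Consumidor.gov.br).
def crawl_service(url: str, skip_head_check: bool = False):
    if skip_head_check or precheck(url):
        return crawl_with_chrome(url)  # stand-in for the real fetch
    return None
```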

https://tosdr.atlassian.net/browse/EDIT-9


I have created a Jira Epic for this: [TDC-4]
