Channel: Thumbtack Engineering

Googlebot makes POST requests via AJAX


Googlebot is constantly evolving to better capture the web’s content. Over the past few years we’ve seen Googlebot submit GET forms and execute JavaScript. But we’ve always taken it for granted that Googlebot would never execute a POST request, nor would any other well-behaved web crawler.

We were wrong about that. Recently, we started observing Googlebot making POST requests to thumbtack.com. As far as we can tell, such requests have not been publicly documented before. These Apache access log excerpts show a few examples:

66.249.71.47 - - [04/Sep/2011:04:53:52 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/ma/malden/dog-walking/dog-walking-and-pet-care-services" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.198 - - [25/Sep/2011:04:27:50 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/ca/solana-beach/wedding-photographers/photography-cary-pennington-photography" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.72.207 - - [04/Oct/2011:09:53:08 +0000] "POST /act/site/clienterror HTTP/1.1" 200 36 "http://www.thumbtack.com/tx/san-antonio/painting/residential-commercial-construction-services" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
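To look for similar requests in your own logs, something like the following sketch may help. It assumes the Apache combined log format shown in the excerpts above; the function names `parseLine` and `googlebotPosts` are hypothetical, and matching on the User-Agent string alone does not prove a request really came from Google.

```javascript
// Minimal sketch: find POST requests whose User-Agent claims to be Googlebot.
// Assumes Apache "combined" log lines like the excerpts above.
const COMBINED =
  /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/;

// Parse one log line into an object, or return null if it does not match.
function parseLine(line) {
  const m = COMBINED.exec(line.trim());
  if (!m) return null;
  return {
    ip: m[1],
    time: m[2],
    method: m[3],
    path: m[4],
    status: Number(m[5]),
    referer: m[7],
    userAgent: m[8],
  };
}

// Keep only POST entries with "Googlebot" in the User-Agent.
function googlebotPosts(lines) {
  return lines
    .map(parseLine)
    .filter((e) => e !== null && e.method === "POST" && /Googlebot/.test(e.userAgent));
}
```

Each result still needs its IP address checked, since any client can send a Googlebot User-Agent.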

We’ve verified the requests are coming from real Google crawler IP addresses:

$ dig -x 66.249.71.47 +short
crawl-66-249-71-47.googlebot.com.
$ dig crawl-66-249-71-47.googlebot.com. +short
66.249.71.47
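The same reverse-then-forward lookup can be automated. The sketch below uses Node's built-in `dns` module; the helper names `isGooglebotHost` and `isVerifiedGooglebot` are hypothetical, and the accepted hostname suffix is an assumption based only on the `googlebot.com` name seen in the `dig` output above.

```javascript
// Minimal sketch of a forward-confirmed reverse DNS check, assuming Node.js.
const dns = require("dns").promises;

// True if the hostname ends in ".googlebot.com" (trailing dot tolerated).
// Assumption: only the suffix seen in the dig output above is accepted.
function isGooglebotHost(hostname) {
  const host = hostname.replace(/\.$/, "").toLowerCase();
  return host.endsWith(".googlebot.com");
}

// Reverse-resolve the IP, then confirm the name resolves back to the same IP.
// `resolver` is injectable so the function can be exercised without a network.
async function isVerifiedGooglebot(ip, resolver = dns) {
  let names;
  try {
    names = await resolver.reverse(ip);
  } catch (err) {
    return false; // no PTR record or lookup failure
  }
  for (const name of names) {
    if (!isGooglebotHost(name)) continue;
    try {
      const addresses = await resolver.resolve4(name.replace(/\.$/, ""));
      if (addresses.includes(ip)) return true;
    } catch (err) {
      // forward lookup failed; try the next name
    }
  }
  return false;
}
```

The forward lookup matters because the owner of an IP block controls its reverse DNS and could point it at any name.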

The source of the requests is our client-side JavaScript error tracking code, which installs a global JavaScript error handler and attempts to POST to our server when unhandled errors are detected on the client. The requests from Googlebot include traceback information, so it appears the code was genuinely executed and not simply parsed to extract links.
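For readers unfamiliar with this kind of tracking code, here is a minimal sketch of the general technique. It is not our production code: the payload fields and the function names `buildErrorReport` and `installErrorReporter` are assumptions for illustration, and only the `/act/site/clienterror` path is taken from the logs above.

```javascript
// Minimal sketch of a global JavaScript error reporter (illustrative only).

// Serialize the error details; the field names here are assumptions.
function buildErrorReport(message, source, line, error) {
  return JSON.stringify({
    message: String(message),
    source: source || "",
    line: line || 0,
    traceback: error && error.stack ? String(error.stack) : "",
  });
}

// Install a window.onerror handler that POSTs each unhandled error.
// `win` is passed in so the function can be tried outside a browser.
function installErrorReporter(win, endpoint) {
  win.onerror = function (message, source, line, column, error) {
    try {
      const xhr = new win.XMLHttpRequest();
      xhr.open("POST", endpoint, true);
      xhr.setRequestHeader("Content-Type", "application/json");
      xhr.send(buildErrorReport(message, source, line, error));
    } catch (e) {
      // The reporter itself must never throw.
    }
    return false; // let the browser's default error handling continue
  };
}

// In a browser, install against the real window object.
if (typeof window !== "undefined") {
  installErrorReporter(window, "/act/site/clienterror");
}
```

Because the handler fires without any user action, any crawler that executes the page's JavaScript and hits an error will send the POST too.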

Now, this isn’t necessarily harmful behavior. In discussing request safety, RFC 2616 sec. 9.1.1 states:

The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.

In this case, the JavaScript code makes an unprompted POST request upon page load, not resulting from any user action. One might say that the request fits the above definition and is therefore safe, regardless of the request method. We conclude simply that this is an interesting new feature of Googlebot and one that webmasters should be aware of.

