Bots that impersonate Googlebot

Anyone can pose as Googlebot just by sending the Googlebot user agent in a request. Sometimes crawlers do that to see what other bots might see. Sometimes it’s to circumvent robots.txt directives that apply to them, but not to Googlebot. Sometimes people hope to get a glimpse at cloaking. Whatever the reason, these kinds of requests can be annoying since they make log file analysis much harder.


For Googlebot, there’s a way of verifying that a request comes from Google: a reverse DNS lookup of the requesting IP address. Google mentions this in passing in its own documentation.

Faking a Googlebot

Faking Googlebot is easy enough using curl:

curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  http://johnmu.com/

Recognizing a fake Googlebot

Using the reverse IP lookup, you can verify an IP address fairly quickly:

$ nslookup 54.149.84.83
83.84.149.54.in-addr.arpa	name = ec2-54-149-84-83.us-west-2.compute.amazonaws.com.

$ nslookup 66.249.65.238
238.65.249.66.in-addr.arpa	name = crawl-66-249-65-238.googlebot.com.

Unfortunately, whoever controls the reverse DNS for an IP address can point it at any hostname, so the reverse lookup alone is not proof. To confirm it, do a forward lookup on the hostname: only the owner of the domain name can set which IP addresses it resolves to. If the result matches the original IP address, then it’s a valid IP address for that hostname.

$ # We don't need to check this one, since it's clearly not a Googlebot IP address, but anyway

$ host ec2-54-149-84-83.us-west-2.compute.amazonaws.com
ec2-54-149-84-83.us-west-2.compute.amazonaws.com has address 54.149.84.83

$ host crawl-66-249-65-238.googlebot.com.
crawl-66-249-65-238.googlebot.com has address 66.249.65.238

The last one there returned the original IP address, therefore this hostname is valid, and it’s a valid Googlebot IP address. If a hostname is not under googlebot.com, we don’t even need to run the host command.
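The two steps above (reverse lookup, then forward lookup) can be wrapped into one small bash function. This is only a sketch: the names ptr_lookup, fwd_lookup and is_googlebot_ip are made up for illustration, it assumes the host command prints "domain name pointer" and "has address" lines as it does on common Linux systems, and it only accepts hostnames under googlebot.com, as in the examples above.

```shell
#!/usr/bin/env bash
# Sketch of a forward-confirmed reverse DNS check. Function names are hypothetical.

# Reverse lookup: print the first PTR hostname for an IP (nothing if there is none).
ptr_lookup() {
  host "$1" | awk '/domain name pointer/ {print $NF; exit}'
}

# Forward lookup: print every IPv4 address the hostname resolves to, one per line.
fwd_lookup() {
  host -t A "$1" | awk '/has address/ {print $NF}'
}

# Succeeds (exit status 0) only if the IP reverse-resolves to a name under
# googlebot.com AND that name resolves back to the same IP.
is_googlebot_ip() {
  local ip=$1 name
  name=$(ptr_lookup "$ip")
  case "$name" in
    *.googlebot.com.) ;;   # host prints PTR names with a trailing dot
    *) return 1 ;;         # wrong domain or no PTR record at all
  esac
  # -x: whole line must match, -F: treat the IP as a fixed string, not a regex
  fwd_lookup "$name" | grep -qxF -- "$ip"
}

# Example use (needs network access):
#   is_googlebot_ip 66.249.65.238 && echo "looks like Googlebot"
```

Matching on ".googlebot.com." with the leading dot keeps a lookalike such as "evilgooglebot.com." from passing the first step.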

Mass checks

Let’s check a bunch of server logs :-). This is from a variety of sites for 2020.

$ cat *.log >total.txt

$ # Get all IP addresses with Googlebot useragents

$ cat total.txt \
   | grep "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
   | awk '{print $1}' | sort | uniq -c | sort -nr | awk '{print $2}' >ips.txt

$ # Check reverse IP lookup - list the obvious bad

$ while read li; do nslookup $li | grep name | grep -v '\.googlebot\.com\.$' \
  | sed -e "s/^/$li is bad: /" ; done <ips.txt

The top obvious bad ones I stumbled upon are:

46.229.173.68 is bad: 68.173.229.46.in-addr.arpa	name = bot.semrush.com.
46.229.173.66 is bad: 66.173.229.46.in-addr.arpa	name = bot.semrush.com.
46.229.173.67 is bad: 67.173.229.46.in-addr.arpa	name = bot.semrush.com.
73.247.74.109 is bad: 109.74.247.73.in-addr.arpa	name = c-73-247-74-109.hsd1.il.comcast.net.
79.142.76.203 is bad: 203.76.142.79.in-addr.arpa	name = swe-net-ip.as51430.net.
193.142.146.4 is bad: 4.146.142.193.in-addr.arpa	name = ie.chimpjust.net.
51.68.50.3 is bad: 3.50.68.51.in-addr.arpa	name = ip3.ip-51-68-50.eu.
95.163.174.88 is bad: 88.174.163.95.in-addr.arpa	name = 95.163.174.88.dynamic-pppoe.dt.ipv4.wtnet.de.
89.139.229.72 is bad: 72.229.139.89.in-addr.arpa	name = 89-139-229-72.bb.netvision.net.il.
86.165.228.31 is bad: 31.228.165.86.in-addr.arpa	name = host86-165-228-31.range86-165.btcentralplus.com.
86.132.140.155 is bad: 155.140.132.86.in-addr.arpa	name = host86-132-140-155.range86-132.btcentralplus.com.
82.2.0.77 is bad: 77.0.2.82.in-addr.arpa	name = cpc81189-farn9-2-0-cust76.6-2.cable.virginm.net.
78.109.29.246 is bad: 246.29.109.78.in-addr.arpa	name = 246.29.109.78.hosting.ua.
39.110.213.227 is bad: 227.213.110.39.in-addr.arpa	name = fs276ed5e3.tkyc511.ap.nuro.jp.
213.32.83.105 is bad: 105.83.32.213.in-addr.arpa	name = ip105.ip-213-32-83.eu.
207.237.32.84 is bad: 84.32.237.207.in-addr.arpa	name = 207-237-32-84.s338.c3-0.elm-ubr2.qens-elm.ny.cable.rcncustomer.com.
184.73.90.224 is bad: 224.90.73.184.in-addr.arpa	name = ec2-184-73-90-224.compute-1.amazonaws.com.
178.165.56.235 is bad: 235.56.165.178.in-addr.arpa	name = 178-165-56-235-kh.maxnet.ua.
91.232.188.5 is bad: 5.188.232.91.in-addr.arpa	name = router-nat.ricom.org.
91.121.150.229 is bad: 229.150.121.91.in-addr.arpa	name = ns358486.ip-91-121-150.eu.
88.99.97.100 is bad: 100.97.99.88.in-addr.arpa	name = d97db8de0.fastvps-server.com.
78.180.194.138 is bad: 138.194.180.78.in-addr.arpa	name = 78.180.194.138.dynamic.ttnet.com.tr.
66.154.113.244 is bad: 244.113.154.66.in-addr.arpa	name = 66.154.113.244.static.quadranet.com.
62.210.157.10 is bad: 10.157.210.62.in-addr.arpa	name = 62-210-157-10.rev.poneytelecom.eu.
54.38.123.240 is bad: 240.123.38.54.in-addr.arpa	name = ip240.ip-54-38-123.eu.
54.149.84.83 is bad: 83.84.149.54.in-addr.arpa	name = ec2-54-149-84-83.us-west-2.compute.amazonaws.com.
51.91.176.137 is bad: 137.176.91.51.in-addr.arpa	name = ip137.ip-51-91-176.eu.
34.76.60.69 is bad: 69.60.76.34.in-addr.arpa	name = 69.60.76.34.bc.googleusercontent.com.
216.127.173.250 is bad: 250.173.127.216.in-addr.arpa	name = 20201212.cloudcone.com.
155.94.138.170 is bad: 170.138.94.155.in-addr.arpa	name = 155.94.138.170.static.quadranet.com.
149.28.63.32 is bad: 32.63.28.149.in-addr.arpa	name = 149.28.63.32.vultr.com.
147.135.191.81 is bad: 81.191.135.147.in-addr.arpa	name = ip81.ip-147-135-191.eu.
107.178.236.6 is bad: 6.236.178.107.in-addr.arpa	name = 6.236.178.107.gae.googleusercontent.com.
103.143.76.150 is bad: 150.76.143.103.in-addr.arpa	name = magicugly.net.

The most common of these is obviously an SEO tool (cough), some are random user IP addresses (probably “non-datacenter proxies”, so computers basically taken over for monetization purposes and used as proxies), and some are running on cloud providers like Amazon AWS & Google Cloud.
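To get a rough count of where the fakes come from, the "is bad" lines can be grouped by the last two labels of the hostname. This is just a sketch: summarize_bad is a made-up helper name, bad.txt is a hypothetical file holding the output above, and the two-label cut is a crude heuristic (it turns netvision.net.il into net.il, for example).

```shell
#!/usr/bin/env bash
# Sketch: count "is bad" lines per hostname suffix (last two DNS labels).
# summarize_bad is a hypothetical helper; it reads the lines from stdin.
summarize_bad() {
  awk '{
         name = $NF                    # hostname is the last field on the line
         sub(/\.$/, "", name)          # drop the trailing dot
         n = split(name, p, ".")
         d = (n >= 2) ? (p[n-1] "." p[n]) : name
         count[d]++
       }
       END { for (d in count) print count[d], d }' \
    | sort -k1,1nr -k2                 # highest count first, then by name
}

# Example use, assuming the list above was saved to bad.txt:
#   summarize_bad <bad.txt
```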

Let’s double-check the ones that returned a googlebot.com hostname – are they all from Google?

$ while read li; do nslookup $li | grep name | grep '\.googlebot\.com\.$' \
  | sed -e "s/^/$li /" ; done <ips.txt >potentiallygoogle.txt

$ wc -l potentiallygoogle.txt
293 potentiallygoogle.txt

$ # 293 potentially-Google hostnames & IP addresses; let's add a fake one to test for

$ echo "10.20.30.40 40.30.20.10.in-addr.arpa name = crawl-66-249-65-219.googlebot.com." \
  >>potentiallygoogle.txt

$ while read li; do e=($li); host ${e[4]} | grep -v "${e[0]}" \
  | sed -e "s/^/${e[0]} -- /"; done <potentiallygoogle.txt

10.20.30.40 -- crawl-66-249-65-219.googlebot.com has address 66.249.65.219

$ # All checked, no (other) fake hostnames spotted
