A tale of a Tweet
Posted by Mike Haller
on Tuesday, June 8. 2010
at 21:25
in Communities
What happens in the first minute after you tweet?
When you post an update to your Twitter status (engl. to tweet) which contains a URL, there is going to be some automated reaction from the network. Let's examine what happens after I've tweeted the following:

The first thing that happens, within seconds, is that Twitter's own bot (Twitterbot/0.1) requests the URL to see if it is valid. The IP 128.242.241.133 is hosted at dedicatedserver.com, an NTT company located in San Jose. The data center seems to be the same one where Twitter itself is hosted. The bot does not retrieve the content (it uses the HEAD method instead of GET), perhaps to resolve redirects from shortened URLs.
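To make the HEAD-versus-GET point concrete, here is a minimal Python sketch of how a link checker might resolve a shortened URL with HEAD requests only. It is an illustration, not Twitterbot's actual code: the function name resolve_with_head, the hop limit and the User-Agent string are all made up for this example.

```python
import http.client
from urllib.parse import urlsplit, urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_with_head(url, max_hops=5):
    """Follow redirects using HEAD only; return (status, final_url)."""
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=10)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            # HEAD asks for the status line and headers only, no body.
            conn.request("HEAD", path,
                         headers={"User-Agent": "example-link-checker/0.1"})
            response = conn.getresponse()
            status = response.status
            location = response.getheader("Location")
        finally:
            conn.close()
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
            continue
        return status, url
    raise RuntimeError("too many redirects")
```

In the access log such a client shows up as a HEAD line with "-" as the response size, because no body is sent.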
At the same time, 38.113.234.181 retrieves the content of the blog post. The bot is called Voyager/1.0. The IP is routed through "PSINet, Inc." and has the ID COGENT-NB-0002. Someone else already found out in 2007 that this IP is used by the Kosmix.com crawler:
38.113.234.181 resolves to crawl1.cosmixcorp.com, and cosmixcorp.com redirects to kosmix.com - a California, USA-based outfit which appears to be legit in a "we're a cool California start-up" kind of way. Not quite sure what they're doing (hey - it's Web 2.0), but it evidently involves crawling without an identifiable bot UA.
Also in the first second after the tweet, two requests from 72.13.91.40 and 72.13.91.42 retrieve parts of the HTML content. I say parts, because the blog post has 34kb of HTML content, but those bots (written in Java) only retrieve 20kb. Perhaps they're using compression and Voyager doesn't? Or they just cancelled in the middle of the download, who knows? 72.13.91.42 is routed by Energy Group Networks (EGIHosting EGIHOSTING-1, Edgios Inc. EDGIOS), administrated by Scott Brookshire, which appears to be a low-cost streaming hosting provider.
In the second second (haha), another bot appears which reads the whole blog post. It's called NjuiceBot and disguises itself as Firefox. Originating from 85.114.136.243, the bot also downloads the favicon and the root of the blog. The funny thing about this IP is that it's hosted by "SK-Gaming via gamed.de Gameserver", a gaming hosting company. Some people, like me, don't understand what a game server has to do with tweets.
The next few requests are pretty boring, so I'll keep it short:
Google's Googlebot comes next from 66.249.71.205.
Another bot called OneRiot/1.0 from 216.24.142.46.
And something called metaURI from 75.101.232.27, which provides meta-information about URLs via an API.
Unidentified bots from 65.52.29.84, 65.52.2.3 and 70.37.65.108. All three receive the full HTML content of the blog post (but no images). None of the three identifies itself as a bot, which is bad behavior.
Another bot called mxbot/1.0 from 67.207.201.160, which is the first to read robots.txt. Congratulations on being the first bot out of ten to adhere to the rules.
Then, Amazon from 75.101.170.136 tries to resolve URLs by issuing a HEAD request using PycURL/7.18.2.
And another one called Twingly Recon from 79.99.6.106 using HEAD.
And someone, deployed on Google AppEngine, who doesn't know how to parse URLs correctly. It failed to parse /archives/150-How-to-implement-password-policies-using-business-rules-modeling.html and cut the URL off at the first dash character, resulting in this little request for 40kb (my blog "redirects" 404 file-not-found errors to the root page).
74.125.75.1 - - [03/Jun/2010:23:23:32 +0200] "GET /archives/150 HTTP/1.1" 200 41431 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: linksalpha)"
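How exactly that bot extracts URLs is unknown to me. One plausible way to end up with this kind of truncation, and it is only a guess, is a hand-written regular expression whose character class forgets the dash. The sketch below uses a hypothetical host name (blog.example) and two made-up patterns to show the effect:

```python
import re
from urllib.parse import urlsplit

# Hypothetical host; the path is the one from the log line above.
url = ("http://blog.example/archives/"
       "150-How-to-implement-password-policies-using-business-rules-modeling.html")

# Naive pattern: the character class has no "-", so matching stops there.
NAIVE_URL = re.compile(r"https?://[\w./]+")
# More forgiving pattern: take everything up to the next whitespace.
BETTER_URL = re.compile(r"https?://\S+")

naive = NAIVE_URL.match(url).group(0)
better = BETTER_URL.match(url).group(0)

print(urlsplit(naive).path)   # /archives/150
print(urlsplit(better).path)  # the full /archives/150-How-to-... path
```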
Then a bot from 64.13.147.188 called abby/1.0 (See here)
Finally, the last within 60 seconds of the first request, an inefficient robot called TweetmemeBot from 89.151.116.53 requests the same page twice, but each time only partially (20kb instead of 35kb), also not adhering to robots.txt.
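For bot authors wondering what "adhering to robots.txt" takes: Python's standard library ships a parser for it. The sketch below is only an illustration. The robots.txt content is invented (it is not my blog's real file), and ExampleBot and the allowed helper are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for this example, not the blog's actual robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(allowed("ExampleBot/1.0", "/archives/150-How-to-implement-password-policies-using-business-rules-modeling.html"))  # True
print(allowed("ExampleBot/1.0", "/admin/"))  # False
```

A well-behaved crawler would download /robots.txt once, feed it to the parser, and run this check before every further request.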
After so many mechanoids, it's time for some humans. After a couple more bots, the first (seemingly) human being joins the party after 16 minutes.
And he is a Mac fan boy:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.6) Gecko/2009011912 Firefox/3.0.6
Let's look where he came from. His IP 211.43.152.54 resolves to San Jose.
But wait, what's that? The IP 211.43.152.61 also appears in the logs, and it is very unusual to see such similar IPs from humans. The other IP uses Mozilla/5.0 Firefox/3.0.5. Hm, the IPs both originate from Korea (SEOUL Gasan-dong Geumcheon-gu Seoul) from a network called UTILUS-KR, operated by Lee SeungHo. The company name is CDNetworks (http://www.us.cdnetworks.com/). Great, so this is not a human either (why are they using Macs for robots anyway?! Isn't this a waste of design?)
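Spotting such neighbouring addresses by eye is tedious, so here is a rough sketch of how to group visitor IPs by network. The /24 prefix is an assumption chosen for this example (real allocations can be larger or smaller), and group_by_network is a made-up helper name.

```python
import ipaddress
from collections import defaultdict


def group_by_network(ips, prefix=24):
    """Return only those networks that contain more than one of the given IPs."""
    groups = defaultdict(list)
    for ip in ips:
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[str(network)].append(ip)
    return {net: members for net, members in groups.items() if len(members) > 1}


suspicious = group_by_network(["211.43.152.54", "211.43.152.61", "91.23.219.243"])
print(suspicious)  # {'211.43.152.0/24': ['211.43.152.54', '211.43.152.61']}
```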
Okay, the next one which seems like a human being comes from 91.23.219.243 and has downloaded the whole blog post, all the images, all the CSS etc. The user agent is Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100501 Iceweasel/3.5.9 (like Firefox/3.5.9). The IP resolves to Deutsche Telekom. That is very likely a human being.
Thank god, at least someone really clicked on that URL.
Google again. A bot called Googlebot-Image/1.0 downloads all the images. Yeah. It comes from 66.249.71.205.
Uh oh, someone without a user agent. The IP resolves to 173-233-65-10.turnkeyinternet.net
173.233.65.10 - - [03/Jun/2010:23:41:04 +0200] "GET /uploads/ScoringPart1.serendipityThumb.png HTTP/1.1" 200 4274 "-" "-"
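The log lines quoted in this post are in Apache's combined log format. If you want to run the same analysis on your own blog, a small parser helps. The following is a sketch under that assumption; LOG_RE and parse_line are names I made up, and the regular expression only covers the fields that appear in the lines shown here.

```python
import re
from datetime import datetime

# ip ident user [time] "method path protocol" status size "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Parse one combined-format log line into a dict, or return None."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # "-" means no body was sent (HEAD) or the header was missing.
    entry["size"] = None if entry["size"] == "-" else int(entry["size"])
    entry["agent"] = None if entry["agent"] in ("", "-") else entry["agent"]
    return entry


line = ('173.233.65.10 - - [03/Jun/2010:23:41:04 +0200] '
        '"GET /uploads/ScoringPart1.serendipityThumb.png HTTP/1.1" 200 4274 "-" "-"')
entry = parse_line(line)
print(entry["ip"], entry["size"], entry["agent"])  # 173.233.65.10 4274 None
```

Filtering for entries whose agent is None finds visitors like this one, and comparing size against the real page size reveals the partial downloads mentioned earlier.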
Then someone from 94.242.42.95 (Russia) using Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.0 (KHTML, like Gecko) Chrome/6.0.408.1 Safari/534.0, and the Referer shows Twitter.
Many minutes later, Bloggsi comes along and plays nice by downloading the robots.txt. After that, it continues to download the whole blog, post by post. As referer, it shows feedburner.
88.198.69.113 - - [04/Jun/2010:00:01:17 +0200] "GET /robots.txt HTTP/1.1" 200 106 "-" "Bloggsi/1.0 (http://bloggsi.com/)"
Another bot testing for a valid URL:
204.236.254.109 - - [04/Jun/2010:00:20:56 +0200] "HEAD /archives/150-How-to-implement-password-policies-using-business-rules-modeling.html HTTP/1.1" 200 - "-" "PostRank/2.0 (postrank.com)"
And that, my dear friends, is what happens to your tweets - the sad story about "being social" with robots.
