A tale of a Tweet

Posted by Mike Haller on Tuesday, June 8. 2010 at 21:25 in Communities
What happens in the first minute after you tweet?

When you post an update to your Twitter status (that is, when you tweet) and it contains a URL, the network reacts automatically. Let's examine what happens after I've tweeted the following:



Within seconds, Twitter's own bot (Twitterbot/0.1) sends a request to see if the URL is valid. The IP 128.242.241.133 is hosted at dedicatedserver.com, an NTT company located in San Jose. The data center seems to be the same one where Twitter itself is hosted. The bot does not retrieve the contents (it uses the HEAD method instead of GET), perhaps to resolve redirects from shortened URLs.
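A redirect check of this kind can be sketched in Python. This is only an illustration of the HEAD-based approach, not Twitterbot's actual code: the helper names `head` and `resolve`, the `example-link-checker/0.1` user agent and the five-hop limit are all assumptions made for the example.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def head(url, timeout=10):
    """Send a single HEAD request and return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Hypothetical user agent; a well-behaved bot identifies itself.
        conn.request("HEAD", path, headers={"User-Agent": "example-link-checker/0.1"})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def resolve(url, head_fn=head, max_hops=5):
    """Follow redirects using HEAD only; return (final URL, final status)."""
    for _ in range(max_hops):
        status, location = head_fn(url)
        if status in REDIRECT_CODES and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        else:
            return url, status
    raise RuntimeError("too many redirects")
```

Because only headers travel over the wire, a check like this stays cheap even when the target page is large. The `head_fn` parameter exists so the redirect logic can be exercised without touching the network.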

At the same time, 38.113.234.181 retrieves the blog post's content. The bot is called Voyager/1.0. The IP is routed through "PSINet, Inc." and has the ID COGENT-NB-0002. Someone else already found out in 2007 that this IP is used by the Kosmix.com crawler:

38.113.234.181 resolves to crawl1.cosmixcorp.com, and cosmixcorp.com redirects to kosmix.com - a California, USA-based outfit which appears to be legit in a "we're a cool California start-up" kind of way. Not quite sure what they're doing (hey - it's Web 2.0), but it evidently involves crawling without an identifiable bot UA.


Also within the first second after the tweet, two requests from 72.13.91.40 and 72.13.91.42 retrieve parts of the HTML content. I say parts because the blog post has 34kb of HTML content, but those bots (written in Java) only retrieve 20kb. Perhaps they're using compression and Voyager isn't? Or they just cancelled in the middle of the download, who knows? 72.13.91.42 is routed by Energy Group Networks (EGIHosting EGIHOSTING-1, Edgios Inc. EDGIOS), administered by Scott Brookshire, which appears to be a low-cost streaming hosting provider.
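Whether those bots really used compression cannot be told from the numbers alone, but the idea is easy to try out. The following sketch compares the raw and gzip-compressed size of a made-up, highly repetitive HTML snippet; the function name `transfer_sizes` and the sample markup are assumptions, and a real blog page will usually compress less well than this sample.

```python
import gzip


def transfer_sizes(html: str) -> tuple[int, int]:
    """Return (uncompressed bytes, gzip-compressed bytes) for an HTML string."""
    raw = html.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Made-up stand-in for a blog page: repetitive markup of roughly 33 KB.
sample = "<div class='entry'><p>How to implement password policies</p></div>\n" * 500

raw_size, gzip_size = transfer_sizes(sample)
print(f"uncompressed: {raw_size} bytes, gzip: {gzip_size} bytes")
```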

In the second second (haha), another bot appears which reads the whole blog post. It's called NjuiceBot and disguises itself as Firefox. Originating from 85.114.136.243, the bot also downloads the favicon and the root of the blog. The funny thing about this IP is that it's hosted by "SK-Gaming via gamed.de Gameserver", a gaming hosting company. Some people, like me, don't understand what a game server has to do with tweets.

The next few requests are pretty boring, so I'll keep it short:

Google's Googlebot comes next, from 66.249.71.205.
Another bot called OneRiot/1.0, from 216.24.142.46.
Something called metaURI from 75.101.232.27, which provides access to meta-information about URLs via an API.
Unidentified bots from 65.52.29.84, 65.52.2.3 and 70.37.65.108. All three receive the full HTML content of the blog post (but no images), and none of them identifies itself as a bot, which is bad behavior.
Another bot called mxbot/1.0 from 67.207.201.160, which is the first to read robots.txt. Congratulations on being the first bot out of ten to adhere to the rules.

Then Amazon, from 75.101.170.136, tries to resolve URLs by issuing a HEAD request using PycURL/7.18.2.
And another one called Twingly Recon from 79.99.6.106, also using HEAD.

And someone, deployed on Google AppEngine, who doesn't know how to parse URLs correctly. It failed to parse /archives/150-How-to-implement-password-policies-using-business-rules-modeling.html and cut the URL off at the first dash character, resulting in this little request of 40kb (my blog "redirects" 404 File Not Found errors to the root page):
74.125.75.1 - - [03/Jun/2010:23:23:32 +0200] "GET /archives/150 HTTP/1.1" 200 41431
"-" "AppEngine-Google; (+http://code.google.com/appengine; appid: linksalpha)"


Then a bot from 64.13.147.188 called abby/1.0.
Finally, the last one within 60 seconds of the first request: an inefficient robot called TweetmemeBot from 89.151.116.53 requests the same page twice, but each time only partially (20kb instead of 35kb), and it also does not adhere to robots.txt.
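What "adhering to robots.txt" means in practice can be sketched with Python's standard `urllib.robotparser` module. The rules in `ROBOTS_TXT` and the `example.com` URLs are hypothetical stand-ins, not this blog's real robots.txt; `is_allowed` is a helper name chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite bot asks this question before every request.
print(is_allowed(ROBOTS_TXT, "mxbot/1.0", "http://example.com/archives/150-some-post.html"))
print(is_allowed(ROBOTS_TXT, "mxbot/1.0", "http://example.com/admin/settings"))
```

A real crawler would first download /robots.txt from the site (which is the request that shows up in the access log) and then feed its contents to the parser.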

After so many mechanoids, it's time for some humans. After a couple more bots, the first (seemingly) human being joins the party after 16 minutes.

And he is a Mac fanboy:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.6) Gecko/2009011912 Firefox/3.0.6

Let's look where he came from. His IP 211.43.152.54 resolves to San Jose.

But wait, what's that? The IP 211.43.152.61 also appears in the logs, and it is very unnatural to see such similar IPs from humans. The other IP uses Mozilla/5.0 Firefox/3.0.5. Hm, both IPs originate from Korea (SEOUL Gasan-dong Geumcheon-gu Seoul), from a network called UTILUS-KR, operated by Lee SeungHo. The company name is CDNetworks (http://www.us.cdnetworks.com/). Great, so this is not a human either (why are they using Macs for robots anyway?! Isn't this a waste of design?)
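Spotting such neighbouring addresses by eye gets tedious in a long log. A minimal sketch using Python's `ipaddress` module groups client IPs by network; treating a shared /24 as "suspiciously similar" is an assumption made for this example, not a rule, and `group_by_subnet` is a hypothetical helper name.

```python
import ipaddress
from collections import defaultdict


def group_by_subnet(ips, prefix=24):
    """Group IPv4 addresses by the network they fall into for the given prefix."""
    groups = defaultdict(list)
    for ip in ips:
        # strict=False lets us pass a host address and get its enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[str(network)].append(ip)
    return dict(groups)


visitors = ["211.43.152.54", "211.43.152.61", "91.23.219.243"]
for network, members in group_by_subnet(visitors).items():
    if len(members) > 1:
        print(f"{network}: {', '.join(members)}")
```

Several "visitors" from one small network do not prove automation, but they are a good hint to look at the network's owner, as done above.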

Okay, the next one that seems like a human being comes from 91.23.219.243 and has downloaded the whole blog post, all the images, all the CSS etc. The user agent is Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100501 Iceweasel/3.5.9 (like Firefox/3.5.9). The IP resolves to Deutsche Telekom. That is very likely a human being.

Thank god, at least someone really clicked on that URL.

Google again: a bot called Googlebot-Image/1.0 downloads all the images. Yeah. It comes from 66.249.71.205.

Uh oh, someone without a user agent. The IP resolves to 173-233-65-10.turnkeyinternet.net:
173.233.65.10 - - [03/Jun/2010:23:41:04 +0200]
"GET /uploads/ScoringPart1.serendipityThumb.png HTTP/1.1" 200 4274 "-" "-"


Then someone from 94.242.42.95 (Russia) using Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.0 (KHTML, like Gecko) Chrome/6.0.408.1 Safari/534.0, with a Referer that points to Twitter.

Many minutes later, Bloggsi comes along and plays nice by downloading robots.txt first. After that, it continues to download the whole blog, post by post. As the referer, it shows FeedBurner.
88.198.69.113 - - [04/Jun/2010:00:01:17 +0200] "GET /robots.txt HTTP/1.1"
200 106 "-" "Bloggsi/1.0 (http://bloggsi.com/)"


Another bot testing for valid URL:
204.236.254.109 - - [04/Jun/2010:00:20:56 +0200]
"HEAD /archives/150-How-to-implement-password-policies-using-business-rules-modeling.html HTTP/1.1"
200 - "-" "PostRank/2.0 (postrank.com)"
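The log excerpts quoted in this post look like Apache's combined log format, so an analysis like this one can be automated with a small parser. The regular expression below is a sketch written for these excerpts, not a complete parser for every log variant. The last lines are a pure guess at how a URL might get cut off at the first dash (a pattern that accepts only word characters and slashes); the actual cause of the AppEngine app's bug is unknown.

```python
import re

# One named group per field of the (assumed) combined log format.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the log fields as a dict of strings, or None if the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


line = (
    '74.125.75.1 - - [03/Jun/2010:23:23:32 +0200] "GET /archives/150 HTTP/1.1" '
    '200 41431 "-" "AppEngine-Google; (+http://code.google.com/appengine; appid: linksalpha)"'
)
entry = parse_line(line)
print(entry["ip"], entry["method"], entry["path"], entry["size"])

# Hypothetical reconstruction of the truncation: \w does not match "-",
# so this naive pattern stops right before the first dash.
full_path = "/archives/150-How-to-implement-password-policies-using-business-rules-modeling.html"
naive_path = re.match(r"[\w/]+", full_path).group(0)
print(naive_path)
```

Grouping the parsed entries by user agent and by method (HEAD versus GET) would reproduce most of the observations made by hand in this post.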


And that, my dear friends, is what happens to your tweets - the sad story about "being social" with robots.

Thomas
Nice one. What about the first human reader using Opera? Did I get the prize?


About

My name is Mike Haller and I'm a software developer and architect at Bosch Software Innovations in Germany. I love programming, playing games and reading books. I like good food, taking photos, and learning and mentoring about the craftsmanship of commercial software development.
