Sunday, April 10, 2016

Downloading Images from Twitter

    So, continuing my exploration of the Twitter API using tweepy, I realized that exploring different users' tweets is ridiculously easy. So, what exciting things can one do with these tweets?

    So here I was, scrolling Twitter in my free time, when I came across a lovely Özil wallpaper. Arsenal's Twitter feed generally includes some great pictures, and the same goes for movie stars and artists who post their pictures on Twitter. Downloading them manually takes a lot of time, so with the power of tweepy, why not just automate the entire process?

Downloading images from a user can be divided into 3 major sub-routines:
  • Downloading the user's tweets and keeping only the ones that contain media. 
  • Extracting the media links from these tweets. 
  • Downloading each of the links. 

    Since Twitter limits the number of tweets one can download at a time, we need to keep track of the ID of the last tweet fetched and pass it as max_id to download older tweets.

raw_tweets = []
last_tweet_id = None  # None makes the first call start from the newest tweet

while True:
    temp_raw_tweets = api.user_timeline(screen_name=username, max_id=last_tweet_id,
                                        include_rts=False, exclude_replies=True)
    if len(temp_raw_tweets) == 0:
        break  # no older tweets left
    # step past the oldest tweet fetched so the next call returns older tweets
    last_tweet_id = temp_raw_tweets[-1].id - 1
    raw_tweets = raw_tweets + temp_raw_tweets

    Once the tweets are downloaded, extract the ones that have media links. This can be done by checking whether the tweet's entities contain a media value. If there is no media attached to the tweet, an empty list is returned and the loop moves on; otherwise, the media URL is added to the set of collected links.

media_urls = set()

for tweet in all_tweets:
    # tweets with no media attached simply yield an empty list
    media = tweet.entities.get('media', [])
    if len(media) > 0:
        media_urls.add(media[0]['media_url'])
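    One caveat: entities['media'] only ever lists the first photo of a tweet. When a tweet carries several photos, the full set lives under the extended_entities key of the raw JSON. A hedged sketch of collecting all of them, assuming tweepy exposes that key as an attribute when it is present:

for tweet in all_tweets:
    # extended_entities holds every attached photo, not just the first
    for photo in getattr(tweet, 'extended_entities', {}).get('media', []):
        media_urls.add(photo['media_url'])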

    After the links have been extracted, download all of them. urllib2 or requests are the usual choices for downloading files and writing the data out in binary, but the wget module makes it completely easy and hassle-free. All the downloaded files are stored in the directory "twitter_images", inside a folder named after the user's handle.

for url in media_urls:
    wget.download(url)
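    The snippet above drops files into the current directory; the twitter_images/<handle> layout mentioned earlier needs a little extra setup. A minimal sketch, assuming the username variable from the pagination loop and wget's out parameter for the target path:

import os
import wget

# assumption: username is the handle used in the pagination loop above
out_dir = os.path.join("twitter_images", username)
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

for url in media_urls:
    wget.download(url, out=out_dir)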




Ah! Scripting makes life so much easier :)


Link to the code: Github


Friday, April 1, 2016

Streaming tweets with Python using tweepy

    So, I recently started experimenting with Twitter's API. Twitter provides several APIs for developers to work with, the major ones being the Streaming API, the REST API and the Search API. Twitter requires special tokens before it gives out its data, so you need to generate your own tokens should you want to use it.

    Now, there are plenty of modules available for working with Twitter's data; the one I used was tweepy, as it was recommended by many developers and has very good documentation. Another quite useful module is twython, but I decided to stick with tweepy.

    First step: authentication via tokens. Go to apps.twitter.com and create a new app to get your tokens. If you're having trouble, this link shows precisely what to do. Then, import OAuthHandler from tweepy and pass the keys and tokens to this handler to create an API object.

from tweepy import OAuthHandler, API

# t is a local module holding the keys generated at apps.twitter.com
auth = OAuthHandler(t.CONSUMER_KEY, t.CONSUMER_SECRET)
auth.set_access_token(t.ACCESS_TOKEN, t.ACCESS_TOKEN_SECRET)
api = API(auth)

    Tweepy has a simple class designed especially for grabbing real-time streaming data, called StreamListener, which can be inherited and tailored as per our requirements. The data sent by Twitter's servers is gigantic in size; they send every piece of information there is about a tweet, and hence there is a lot of data you might not need.
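    The listener itself is only a few lines. Here is a minimal sketch of subclassing StreamListener (the class name MyListener is my own):

from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):
    def on_data(self, data):
        # data is the raw JSON string for a single tweet, like the dump below
        print(data)
        return True

    def on_error(self, status):
        # returning False on a 420 disconnects instead of hammering the rate limit
        if status == 420:
            return False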


{"created_at":"Thu Mar 24 21:07:52 +0000 2016","id":713110104418172928,"id_str":"713110104418172928",
"text":"New Offer Bet \u00a310 Get \u00a320 FREE - Bet Now: https:\/\/t.co\/8OLizB0qka #twitter92 #Arsenal https:\/\/t.co\/y2HKha3IFC","source":"\u003ca href=\"https:\/\/www.socialoomph.com\" rel=\"nofollow\"\u003eSocialOomph\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,
"user":{"id":2762185416,"id_str":"2762185416","name":"Cash Offers","screen_name":"betting_Big","location":null,"url":null,"description":"#1 Twitter for Free Money Bets","protected":false,"verified":false,"followers_count":2797,"friends_count":1674,"listed_count":112,"favourites_count":2,"statuses_count":38601,"created_at":"Sun Aug 24 12:04:07 +0000 2014","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)",
"geo_enabled":false,"lang":"en","contributors_enabled":false,
"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6",
"profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/619882722266357760\/gctGZjOJ_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/619882722266357760\/gctGZjOJ_normal.jpg",
"profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2762185416\/1436627381",
"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,
"entities":{"hashtags":[{"text":"twitter92","indices":[66,76]},{"text":"Arsenal","indices":[77,85]}],"urls":[{"url":"https:\/\/t.co\/8OLizB0qka","expanded_url":"http:\/\/bit.ly\/Bet10Gt20","display_url":"bit.ly\/Bet10Gt20","indices":[42,65]}],"user_mentions":[],"symbols":[],"media":[{"id":713110103898132480,"id_str":"713110103898132480",
"indices":[86,109],"media_url":"http:\/\/pbs.twimg.com\/media\/CeV53HyXEAAjBl7.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/CeV53HyXEAAjBl7.jpg","url":"https:\/\/t.co\/y2HKha3IFC","display_url":"pic.twitter.com\/y2HKha3IFC","expanded_url":"http:\/\/twitter.com\/betting_Big\/status\/713110104418172928\/photo\/1","type":"photo",
"sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"large":{"w":520,"h":214,"resize":"fit"},"small":{"w":340,"h":140,"resize":"fit"},"medium":{"w":520,"h":214,"resize":"fit"}}}]},"extended_entities":{"media":[{"id":713110103898132480,"id_str":"713110103898132480","indices":[86,109],
"media_url":"http:\/\/pbs.twimg.com\/media\/CeV53HyXEAAjBl7.jpg","media_url_https":"https:\/\/pbs.twimg.com\/media\/CeV53HyXEAAjBl7.jpg","url":"https:\/\/t.co\/y2HKha3IFC","display_url":"pic.twitter.com\/y2HKha3IFC","expanded_url":"http:\/\/twitter.com\/betting_Big\/status\/713110104418172928\/photo\/1",
"type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"large":{"w":520,"h":214,"resize":"fit"},"small":{"w":340,"h":140,"resize":"fit"},
"medium":{"w":520,"h":214,"resize":"fit"}}}]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1458853672495"}


    This data has a dictionary-like structure, and hence can be easily manipulated using a JSON parser. Twitter's developer page comes in handy when you need to figure out which fields to extract.


import json

j = json.loads(data)
line1 = "@" + j['user']['screen_name'] + " on " + j['created_at'][:-11] + ", language= " + j["lang"] + ": "
line2 = '\n' + j['text']
text = line1 + line2
print(text + "\n\n")


    The stream can be filtered to give only tweets relating to a particular topic or keyword, by passing the keywords to the stream's filter method.
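    A minimal sketch of wiring this up, using the auth handler and the MyListener class from above ('Arsenal' is just a placeholder keyword):

# track tweets mentioning a keyword in real time
stream = Stream(auth, MyListener())
stream.filter(track=['Arsenal'])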




    The REST API works a bit differently from the Streaming API: it searches for tweets that have already been posted instead of taking them in real time. Twitter's rate limiting can be a real issue here, as it only allows up to 100 or 200 queries (maybe even less?) during one search. Just like with the Streaming API, the data sent here is immense and we need to filter out the parts we require. All of these capabilities can let us build a pretty powerful Python-based Twitter client. The possibilities are, as they say, endless!
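    For a taste of the REST side, here is a hedged sketch using tweepy's search call with the api object created earlier; the query is a placeholder:

# search tweets that have already been posted, matching a query
for tweet in api.search(q='Arsenal', count=100):
    print(tweet.text)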


Link to the code: Github



Wednesday, March 23, 2016

Download YouTube Playlist

    It's been ages since I wrote my last post; I had almost forgotten about it completely. Anyway, now that I'm back home for the Holi holidays with a lot of time to kill, I can write one or two posts :P :P

    So, last weekend, I was about to leave for home the next day and wanted to download episodes of Last Week Tonight (it's a news satire & it's hilarious; must watch!) to take back home and kill time. This was supposed to be pretty easy: all I had to do was paste the playlist link into some app and voila!

But as it always turns out, it wasn't.

    I tried various programs on Windows, but all of them had some problem. Some wouldn't let me download more than a certain number of videos in a playlist, some would only download a small part of each video without registering, and the few that actually worked throttled the speed to an absolute moribund state. It was pretty frustrating, and I decided to switch to Linux, because hey, it's *almost* always better!

    So, I booted into Linux and searched the awesome open-source community for the solution to my problems, and it was only a matter of time till I found the perfect module: youtube-dl. It had everything, and more functionality and options than any downloader you could find on Windows. My work was done. I copied the link of the playlist and tried to run the script, but urrrghhh!! this showed up:


    It didn't matter which playlist I chose; it threw the same error, even though it was downloading single videos seamlessly. Bored out of my mind, desperate for the videos and for the sense of *achieving something*, I decided to use Python to do this. It was finally time to put what I'd learned in the past 2 years to actual use.

    So, how to do this? I knew I could string multiple commands together in Linux using the && operator. I tried downloading two single videos together by joining them via &&, and it worked. So, by mathematical induction, I could do this for n videos (it's a pretty good analogy..... and it makes you sound smart :P). With the main problem solved, I next needed the links for all the videos in the playlist. I'd started web-scraping with Python a few days earlier, and this seemed like good practice. The YouTube page was pretty complicated (for a beginner), but after a while I figured it out.
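    The snippet below assumes a soup object and a links list already exist. A minimal sketch of that setup, assuming requests and BeautifulSoup4 (playlist_url is a placeholder for the playlist link):

import requests
from bs4 import BeautifulSoup

# fetch the playlist page and parse it
page = requests.get(playlist_url)
soup = BeautifulSoup(page.text, "html.parser")
links = []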

# scrape the video links from the soup object, prepend the hostname and append to the final list
for x in soup.find_all("a", {"class": "pl-video-title-link yt-uix-tile-link yt-uix-sessionlink  spf-link "}):
    links.append("https://www.youtube.com" + x["href"])


Tried the script on a few playlists and it worked!!! Part 1 was done.


Next up was the main downloader.
This was fairly easy to do: all I needed was to prepend each link with the youtube-dl command and join all of these via the && operator.

# concatenate all the links into one long chained command
command = ""
for link in links:
    command += "youtube-dl -ciwf best " + link + " && "

# drop the trailing " && "
command = command[:-4]
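    The snippet stops at building the string; handing it to the shell is the last step. A minimal sketch, assuming os.system:

import os

# run the whole chain; each youtube-dl call starts only after the previous one succeeds
os.system(command)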

    I first tried to run the download method on a different thread, or by forking out a new process, but I realized youtube-dl already did that and it would always go out of my control, so I dropped the whole idea. I made a few tweaks, like storing the downloaded videos in a separate folder, and I was done.

It took me around 2-3 hours to complete, but I finally did it!!

I started the downloads and went to sleep (it was 5 am already). The satisfaction from completing this was immense, and I had the best sleep in weeks!!

Download the code: YouTube Playlist downloader