Google News App

Hey, long time! Exam season, getting hammered in the CP labs, and other reasons kept me out of touch for a month. Then I thought I would get back to Python, and the first idea that struck me was website parsing (i.e. retrieving info from a website). I started with the idea of extracting the headlines from Google News.

The first problem I faced was proxy authentication: each student can access the net only with a username and password. I came across this blog post, which explained how to do it: http://www.wkoorts.com/wkblog/2008/10/27/python-proxy-client-connections-requiring-authentication-using-urllib2-proxyhandler/

name = raw_input('Enter name ')
password = raw_input('Enter password ')
proxy = raw_input('Enter proxy ')
port = raw_input('Enter port ')
auth = 'http://%s:%s@%s:%s' % (name, password, proxy, port)
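One caveat the post glosses over: if the username or password contains characters like '@' or ':', the auth URL becomes ambiguous, so it is safer to percent-encode the credentials first. A minimal sketch (shown with Python 3's urllib.parse for illustration; in Python 2 the equivalent is urllib.quote):

```python
from urllib.parse import quote

def build_proxy_url(name, password, proxy, port):
    # Percent-encode the credentials so characters like '@' or ':'
    # in them cannot be confused with the URL's own separators.
    return 'http://%s:%s@%s:%s' % (quote(name, safe=''),
                                   quote(password, safe=''),
                                   proxy, port)

print(build_proxy_url('student', 'p@ss:word', '10.0.0.1', '8080'))
# -> http://student:p%40ss%3Aword@10.0.0.1:8080
```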

Then you import urllib2, build a proxy handler, and install an opener, so that later requests retrieve the HTML source code of a webpage through the proxy:

handler = urllib2.ProxyHandler({'http': auth})
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)

After that you parse it with BeautifulSoup (\m/), which helps you retrieve the required tags.

For example, I wanted the data in 'span' tags whose 'class' attribute was one of 'titletext', 'section-name', or 'esc-lead-article-source'. So this is what I did:


soup = BeautifulSoup(webhtml)
divtag = soup.findAll('span', {'class': ['titletext', 'section-name', 'esc-lead-article-source']})
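To see what that kind of lookup matches without hitting the live site, here is a toy example. The HTML is made up, and I'm using the standard library's html.parser (Python 3) instead of BeautifulSoup so it runs anywhere; the idea of filtering span tags by a set of class values is the same:

```python
from html.parser import HTMLParser

class SpanGrabber(HTMLParser):
    '''Collect the text inside <span> tags whose class is in WANTED.'''
    WANTED = {'titletext', 'section-name', 'esc-lead-article-source'}

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0      # > 0 while inside a wanted span
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'span' and dict(attrs).get('class') in self.WANTED:
            self.depth += 1
            self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data

page = ('<span class="section-name">Sports</span>'
        '<span class="titletext">Team wins final</span>'
        '<span class="other">ignored</span>')
p = SpanGrabber()
p.feed(page)
print(p.texts)
# -> ['Sports', 'Team wins final']
```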

Then it was more a matter of logic and observing patterns to extract the required information.

The last problem I faced was that the output was not in readable Unicode (HTML entities like '&#39;' showed up in the text), for which I referred to this:

https://github.com/jayrambhia/Utils/blob/master/Utils/BSutils.py
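For what it's worth, a hand-rolled replace chain like the one in that gist only covers a handful of entities; the standard library can decode all of them in one call (html.unescape in Python 3, shown here for illustration; in Python 2, HTMLParser().unescape() does the same job):

```python
from html import unescape

raw = 'Fish &amp; Chips &#39;special&#39; &raquo; &quot;today&quot;'
# unescape() decodes every named and numeric HTML entity at once.
print(unescape(raw))
# -> Fish & Chips 'special' » "today"
```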

So I came up with this ~100-line script that lets you read the headlines directly from your terminal without opening your web browser. Here is my code:

'''
Created by: S. Manoj Kumar

Date: 17.3.2010

Use: To filter out headlines from Google News
'''

''' Importing modules '''
from BeautifulSoup import BeautifulSoup
import urllib2
import re

''' replacing HTML entities like '&#39;' with their characters '''
def getPrintUnicode(soup):
    body = ''
    if isinstance(soup, unicode):
        soup = soup.replace('&#39;', "'")
        soup = soup.replace('&quot;', '"')
        soup = soup.replace('&nbsp;', ' ')
        soup = soup.replace('&gt;', '>')
        soup = soup.replace('&lt;', '<')
        soup = soup.replace('&raquo;', '>>')
        soup = soup.replace('&amp;', '&')
        body = body + soup
    else:
        if not soup.contents:
            return ''
        con_list = soup.contents
        for con in con_list:
            body = body + getPrintUnicode(con)
    return body

''' getting proxy authentication '''
def proxy_get():
    name = raw_input('Enter name ')
    password = raw_input('Enter password ')
    proxy = raw_input('Enter proxy ')
    port = raw_input('Enter port ')
    auth = 'http://%s:%s@%s:%s' % (name, password, proxy, port)
    handler = urllib2.ProxyHandler({'http': auth})
    opener = urllib2.build_opener(handler)
    urllib2.install_opener(opener)
    print "Hey %s, today's headlines are" % (name)

''' getting html source code of google news '''
def url_get():
    web = urllib2.urlopen('http://news.google.co.in/')
    webhtml = web.read()
    soup_get(webhtml)

''' parsing with BeautifulSoup '''
def soup_get(webhtml):
    soup = BeautifulSoup(webhtml)
    parse(soup)

def parse(soup):
    '''finding span tags with the required attributes'''
    divtag = soup.findAll('span', {'class': ['titletext', 'section-name', 'esc-lead-article-source']})
    listdiv = []
    length = len(divtag)
    '''converting into a list of strings'''
    for index in range(length):
        listdiv.append(str(divtag[index]))
    listcpy = [element for element in listdiv]
    i = 0
    index = []
    '''indexing with required attributes and sorting the list'''
    for element in listdiv:
        value = re.search('class="esc-lead-article-source"', element)
        if value != None:
            index.append(listdiv.index(element) + i)
            listdiv.remove(element)
            i = i + 1
    newindex = [num - 1 for num in index]
    section_index = []
    for element in listcpy:
        val = re.search('section-name', element)
        if val != None:
            section_index.append(listcpy.index(element))
    for num in section_index:
        newindex.append(num)
    newindex.sort()
    a = []
    for num in newindex:
        a.append(listcpy[num])
    newstring = ''.join(a)
    soup = BeautifulSoup(newstring)
    span = soup.findAll('span')
    news(span)

''' printing the headlines '''
def news(span):
    for news in span:
        print getPrintUnicode(news)

def main():
    proxy_get()
    url_get()

if __name__ == '__main__':
    main()

I will come up with something better next time and improve my knowledge of OOP.

