Scraping Past Years' Exam Essays

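Haha, a grad-school-exam grinder rising from the dead to post! I've been reviewing English essay writing lately. I do have a book for it, but carrying it around all the time is inconvenient, so I went looking online for a collection someone had already compiled. I couldn't find one. As it happens, New Oriental hosts the essays from past years, but opening the pages one at a time is a hassle, so I decided to compile them myself. They cover 2005 through 2015, some years have several model essays, copy-pasting is tedious, and the images would have to be saved one by one. So I might as well scrape everything into an md document, generate HTML from it, and convert that to PDF. New Oriental, please don't hold it against me! This is not for commercial use (manual smirk).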

The idea is very simple: fetch all the essay links, parse each page with BeautifulSoup, and save the result to a file. Enough talk, here is the code.

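# -*- coding: utf-8 -*-

from urllib import request
from bs4 import BeautifulSoup

# Index page that links to each year's essay page
root = 'http://m.koolearn.com/news/1070329.html'

def get_years():
    """Collect the links to the yearly essay pages from the index page."""
    urls = []
    with request.urlopen(root) as f:
        if f.status == 200:
            data = f.read().decode('utf-8')
            soup = BeautifulSoup(data, "html.parser")
            for link in soup.find_all('a'):
                # Keep only links whose text marks a past-exam essay page
                if '真题范文及解析' in link.text and link.get('href'):
                    urls.append(link['href'])
    return urls

def get_articles(urls):
    """Download every page and append its text and images to passage.txt."""
    # Explicit encoding so Chinese text is written correctly on any platform
    with open('passage.txt', 'w', encoding='utf-8') as passage:
        for url in urls:
            print("parsing passage %s" % url)
            with request.urlopen(url) as f:
                data = f.read().decode('utf-8')
            soup = BeautifulSoup(data, "html.parser")
            print(soup.title.get_text())
            article = soup.find('div', class_='mt40')
            if article is None:
                # No article container on this page; skip it
                continue
            # Skip the first two child nodes of the container
            for text in list(article.children)[2:]:
                try:
                    if text.img:
                        passage.write('<img src="' + text.img['src'] + '" />')
                    else:
                        passage.write(text.get_text() + '\r\n')
                except (AttributeError, KeyError):
                    # Bare strings have no .img; an <img> may lack a src
                    pass
            passage.write('\r\n\r\n\r\n')

def main():
    urls = get_years()
    get_articles(urls)

if __name__ == '__main__':
    main()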
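
To try the parsing step without hitting the site, here is a small offline sketch. The sample markup is made up (I'm only assuming the essay body sits in a `div` with class `mt40`, as in the script above), and the helper name `extract` is hypothetical. Unlike the script, it skips the first two element children rather than the first two child nodes, so whitespace between tags doesn't shift the count.

```python
from bs4 import BeautifulSoup

# Made-up sample page; the real page structure may differ.
SAMPLE = """
<div class="mt40">
<h1>Title</h1>
<p>meta line</p>
<p>First paragraph.</p>
<p><img src="http://example.com/a.png"></p>
<p>Second paragraph.</p>
</div>
"""

def extract(html, skip=2):
    """Return the text or <img> tag for each element child after the first `skip`."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find('div', class_='mt40')
    if container is None:
        return []
    # Direct element children only; whitespace strings between tags are ignored
    children = container.find_all(True, recursive=False)
    out = []
    for node in children[skip:]:
        img = node.find('img')
        if img is not None and img.get('src'):
            out.append('<img src="%s" />' % img['src'])
        else:
            out.append(node.get_text())
    return out

if __name__ == '__main__':
    for line in extract(SAMPLE):
        print(line)
```

Counting element children instead of raw child nodes is just one way to do it; whether the first two should be skipped at all depends on what the real page puts there.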

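That's all there is to it, and it should be understandable at a glance. I had wanted to keep New Oriental's original tags as well, but I was worried they would depend on CSS, and I couldn't be bothered to do more. After all, I'm someone who loves studying and makes every second count!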

All right, enough talk. This old man is off to study!

I'm showing off so hard I barely recognize myself!
