The most comprehensive database of modern Chinese poetry and foreign poetry

Chinese version | English version


Chinese_version

Introduction

Data format

How the data format is constructed is covered in the earlier post, Modern poetry: the modern poetry database crawling process.
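As a quick reference (the concrete values below are made up for illustration), the script later in this post writes an author index of entries like

[
  {
    "name": "William Blake",
    "src": "blake",
    "id": "a7b9c2d4-made-up-uuid",
    "description": ""
  }
]

and one JSON file per poet holding poem records like

{
  "author": "William Blake",
  "title": "The Tyger",
  "paragraphs": ["Tyger Tyger, burning bright,", "In the forests of the night;"],
  "id": "a7b9c2d4-made-up-uuid"
}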

Gathering the data

The foreign-poetry part of Modern poetry still comes from the foreign poetry collection of the Chinese poetry library (中国诗歌库). There are other poetry sites on the web, but copyright makes them hard to crawl, so if you want to submit poetry data, be sure to check the copyright first.

There is only one real difficulty in crawling this site: encoding.

Because the foreign poems span many languages, choosing the right encoding is critical. On top of that, the site's own data already contained mojibake when it was collected, so the crawled data needs extra processing (replacing the garbled content).

For most of the languages on the site (the Western European ones), ISO-8859-15 usually works for both crawling and saving the data. For Russian, just use utf-8. (Even gbk displays it fine; truly a friendly federation!)
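As a minimal sketch of that idea (the language-to-codec mapping below is my own assumption, not something the site documents), the decoding can be chosen per language before reading response.text:

import requests

# assumed mapping from a page's language to a workable codec:
# ISO-8859-15 for the Western European pages, utf-8 for the Russian ones
ENCODINGS = {"russian": "utf-8"}
DEFAULT_ENCODING = "ISO-8859-15"

def fetch_text(url, language):
    resp = requests.get(url)
    # override requests' guessed encoding before reading resp.text
    resp.encoding = ENCODINGS.get(language.lower(), DEFAULT_ENCODING)
    return resp.text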

import uuid
import re
import requests
import json

from requests.adapters import HTTPAdapter

s = requests.session()
# mutating requests.adapters.DEFAULT_RETRIES after import has no effect (the
# default is bound when HTTPAdapter is defined), so mount adapters explicitly
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))
# a dict cannot hold duplicate keys, so listing three "http" entries would keep
# only the last proxy; use a single proxy per scheme (or rotate them yourself)
s.proxies = {"http": "http://119.41.236.180:8010"}
link = 'https://www.shigeku.org/shiku/ws/ww/index.htm'
headers = {
    'Connection': 'close',  # close the connection after every request
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3464.0 Safari/537.36',
}

def findAll(regex, seq):
    # like re.findall, but only advances one character past each match start,
    # so matches that share a delimiter (consecutive <hr> blocks) are all found
    resultlist = []
    pos = 0
    while True:
        result = regex.search(seq, pos)
        if result is None:
            break
        resultlist.append(seq[result.start():result.end()])
        pos = result.start() + 1
    return resultlist

def parse(List):
    # drop the empty strings left over after splitting
    return [item for item in List if item != '']


def cleantxt(raw):
    # strip every non-ASCII character
    return re.sub(r'[^\x00-\x7f]', '', raw).strip()

def parseString(string):
    # crude tag stripper: drop everything between '<' and '>'
    str_ = ''
    flag = 1
    for ele in string:
        if ele == "<":
            flag = 0
        elif ele == '>':
            flag = 1
            continue
        if flag == 1:
            str_ += ele

    # drop entity remnants and the stray control characters the site embeds
    str_ = str_.replace('\r', '').replace(";", '').replace("&nbsp", '').replace(u"\u0081", '').replace(u"\u008b", '').replace(u"\u008a", '').replace("&quot", '').strip()
    return str_

def author():
    print("Start!")
    html = s.get(link, headers=headers)
    if html.status_code == requests.codes.ok:
        txt = html.text
        authorCountry = re.findall('<p align=left>(.*?)</p>', txt, re.S)
        authorCountry = parse(authorCountry)
        authorList = re.findall('<div id="navcontainer">(.*?)</div>', txt, re.S)
        for i in range(0, 18):  # the index page lists poets for 18 countries
            authorListFinal = []
            country = authorCountry[i]
            country = cleantxt(parseString(country))
            nameListPre = authorList[i+2]  # the first two navcontainer blocks are not country lists
            nameList = re.findall('<li id="navlistli1">(.*?)</li>', nameListPre, re.S)
            for k in nameList:
                name = parseString(k)
                src = re.findall('<a href="(.*?)"', k, re.S)
                src = src[0]
                # entries read like "... by Name (years)"; keep just the name
                index = name.find("(")
                if index != -1:
                    name = name[name.find("by")+3:index-1]
                else:
                    name = name[name.find("by")+3:]
                authorDict = {}
                idAuthor = uuid.uuid3(uuid.NAMESPACE_URL, name)
                authorDict['name'] = name
                authorDict['src'] = src.replace('.htm', '')
                authorDict['id'] = str(idAuthor)
                authorDict['description'] = ""
                authorListFinal.append(authorDict)

            print("Finish ", country)
            # utf-8 so ensure_ascii=False can write any accented name
            with open(country + '-author.json', 'w', encoding='utf-8') as fp:
                json.dump(authorListFinal, fp, ensure_ascii=False)

    print("Finish!")

def poem():
    with open('author.json', 'r') as fp:
        authorPoemPre = json.load(fp)
    prefix = "https://www.shigeku.org/shiku/ws/ww"
    # raise the start index to skip poets that were already fetched
    for i in range(3, len(authorPoemPre)):
        poemList = []
        dictAuthor = authorPoemPre[i]
        src = dictAuthor['src'] + '.htm'
        poemHtml = s.get(prefix + '/' + src, headers=headers)
        print("Download finish!")
        poemHtml.encoding = 'ISO-8859-1'
        txt = poemHtml.text
        # poems are separated by <hr>, and adjacent poems share a delimiter,
        # which is why the overlapping findAll is used instead of re.findall
        pattern = re.compile("<hr>(.*?)<hr>", re.S)
        tempHrList = findAll(pattern, txt)
        for m in tempHrList:
            poem = {"author": dictAuthor['name']}

            content = parse(parseString(m).split('\n'))
            for k in range(0, len(content)):
                content[k] = content[k].strip()
                if k > 0:
                    # strip the stray digits embedded in the body lines
                    for a in range(0, 10):
                        content[k] = content[k].replace(str(a), '')
            content = parse(content)
            title = content[0]
            content = content[1:]
            # also append the raw text for the word-cloud analysis below
            with open("content.txt", 'a', encoding="iso-8859-1") as fp:
                for k in content:
                    fp.write(k + " ")

            poem['title'] = title
            poem['paragraphs'] = content
            poem['id'] = dictAuthor['id']
            poemList.append(poem)

        print("Finish ", dictAuthor['name'])
        with open(dictAuthor['name'] + '.json', 'w', encoding="ISO-8859-1") as fp:
            json.dump(poemList, fp, ensure_ascii=False)

    print("Finish!")

author()
poem()

The program first saves the poets into JSON files grouped by country, then reads the poet information from each country's JSON, crawls that poet's poems, and names the output file after the poet. (Note: when you want to fetch the poems, rename the corresponding country's JSON file to author.json.)
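For example, before fetching the British poets' poems you would copy that country's file into place (the England-author.json name here is only an illustration; use whatever file name author() actually wrote):

import shutil

# hypothetical country file produced by author(); adjust to the actual name
shutil.copyfile('England-author.json', 'author.json')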

To cope with the server refusing connections, the program sets proxies (free proxy IPs from 站大爷). If the crawl still gets interrupted, just restart the program.
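If restarting by hand gets tedious, a small retry wrapper around s.get is one option; this is only a sketch layered on the script above, not part of the original:

import time

def get_with_retry(url, retries=3, backoff=5):
    # retry a few times with a fixed pause before giving up
    for attempt in range(retries):
        try:
            return s.get(url, headers=headers, timeout=10)
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff)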

Data analysis

Data analysis is, as always, indispensable. Because the languages differ so much, I only built word clouds for the British and American poets.

from wordcloud import WordCloud

def analyze(file):
    with open(file, encoding="iso-8859-15") as fp:
        text = fp.read()
        # English needs no segmentation, so the raw text can be fed in directly
        wordcloud = WordCloud(background_color=(255, 255, 255), width=1600, height=800).generate(text)
        image_produce = wordcloud.to_image()
        # save and preview the rendered cloud (quality/subsampling are
        # JPEG-only options, so they are not passed for a PNG)
        image_produce.save('cloud.png')
        image_produce.show()

analyze('cloud.txt')

Compared with Chinese, word-cloud analysis for English is simple: no word segmentation is needed, the text can be analyzed directly. The result:

British and American poetry

English_version

Introduction

Data format

How the data format is constructed is covered in the earlier post, Modern poetry: the modern poetry database crawling process.
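As a quick reference (the concrete values below are made up for illustration), the script later in this post writes an author index of entries like

[
  {
    "name": "William Blake",
    "src": "blake",
    "id": "a7b9c2d4-made-up-uuid",
    "description": ""
  }
]

and one JSON file per poet holding poem records like

{
  "author": "William Blake",
  "title": "The Tyger",
  "paragraphs": ["Tyger Tyger, burning bright,", "In the forests of the night;"],
  "id": "a7b9c2d4-made-up-uuid"
}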

Gathering the data

Modern poetry's foreign-poetry data still comes from the foreign poetry collection of the Chinese poetry library. There are other poetry websites on the web, but copyright makes them hard to crawl, so if you want to submit poetry data, be sure to check the copyright first.

There is only one real difficulty in crawling this site: encoding.

Because the foreign poems span many languages, choosing the right encoding is extremely important. On top of that, the site's own data already contained mojibake when it was collected, so the crawled data needs extra processing (replacing the garbled content).

For most of the languages on the site (the Western European ones), ISO-8859-15 usually works for both crawling and saving the data. For Russian, just use utf-8. (Even gbk displays it fine; truly a friendly federation!)
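As a minimal sketch of that idea (the language-to-codec mapping below is my own assumption, not something the site documents), the decoding can be chosen per language before reading response.text:

import requests

# assumed mapping from a page's language to a workable codec:
# ISO-8859-15 for the Western European pages, utf-8 for the Russian ones
ENCODINGS = {"russian": "utf-8"}
DEFAULT_ENCODING = "ISO-8859-15"

def fetch_text(url, language):
    resp = requests.get(url)
    # override requests' guessed encoding before reading resp.text
    resp.encoding = ENCODINGS.get(language.lower(), DEFAULT_ENCODING)
    return resp.text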

import uuid
import re
import requests
import json

from requests.adapters import HTTPAdapter

s = requests.session()
# mutating requests.adapters.DEFAULT_RETRIES after import has no effect (the
# default is bound when HTTPAdapter is defined), so mount adapters explicitly
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))
# a dict cannot hold duplicate keys, so listing three "http" entries would keep
# only the last proxy; use a single proxy per scheme (or rotate them yourself)
s.proxies = {"http": "http://119.41.236.180:8010"}
link = 'https://www.shigeku.org/shiku/ws/ww/index.htm'
headers = {
    'Connection': 'close',  # close the connection after every request
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3464.0 Safari/537.36',
}

def findAll(regex, seq):
    # like re.findall, but only advances one character past each match start,
    # so matches that share a delimiter (consecutive <hr> blocks) are all found
    resultlist = []
    pos = 0
    while True:
        result = regex.search(seq, pos)
        if result is None:
            break
        resultlist.append(seq[result.start():result.end()])
        pos = result.start() + 1
    return resultlist

def parse(List):
    # drop the empty strings left over after splitting
    return [item for item in List if item != '']


def cleantxt(raw):
    # strip every non-ASCII character
    return re.sub(r'[^\x00-\x7f]', '', raw).strip()

def parseString(string):
    # crude tag stripper: drop everything between '<' and '>'
    str_ = ''
    flag = 1
    for ele in string:
        if ele == "<":
            flag = 0
        elif ele == '>':
            flag = 1
            continue
        if flag == 1:
            str_ += ele

    # drop entity remnants and the stray control characters the site embeds
    str_ = str_.replace('\r', '').replace(";", '').replace("&nbsp", '').replace(u"\u0081", '').replace(u"\u008b", '').replace(u"\u008a", '').replace("&quot", '').strip()
    return str_

def author():
    print("Start!")
    html = s.get(link, headers=headers)
    if html.status_code == requests.codes.ok:
        txt = html.text
        authorCountry = re.findall('<p align=left>(.*?)</p>', txt, re.S)
        authorCountry = parse(authorCountry)
        authorList = re.findall('<div id="navcontainer">(.*?)</div>', txt, re.S)
        for i in range(0, 18):  # the index page lists poets for 18 countries
            authorListFinal = []
            country = authorCountry[i]
            country = cleantxt(parseString(country))
            nameListPre = authorList[i+2]  # the first two navcontainer blocks are not country lists
            nameList = re.findall('<li id="navlistli1">(.*?)</li>', nameListPre, re.S)
            for k in nameList:
                name = parseString(k)
                src = re.findall('<a href="(.*?)"', k, re.S)
                src = src[0]
                # entries read like "... by Name (years)"; keep just the name
                index = name.find("(")
                if index != -1:
                    name = name[name.find("by")+3:index-1]
                else:
                    name = name[name.find("by")+3:]
                authorDict = {}
                idAuthor = uuid.uuid3(uuid.NAMESPACE_URL, name)
                authorDict['name'] = name
                authorDict['src'] = src.replace('.htm', '')
                authorDict['id'] = str(idAuthor)
                authorDict['description'] = ""
                authorListFinal.append(authorDict)

            print("Finish ", country)
            # utf-8 so ensure_ascii=False can write any accented name
            with open(country + '-author.json', 'w', encoding='utf-8') as fp:
                json.dump(authorListFinal, fp, ensure_ascii=False)

    print("Finish!")

def poem():
    with open('author.json', 'r') as fp:
        authorPoemPre = json.load(fp)
    prefix = "https://www.shigeku.org/shiku/ws/ww"
    # raise the start index to skip poets that were already fetched
    for i in range(3, len(authorPoemPre)):
        poemList = []
        dictAuthor = authorPoemPre[i]
        src = dictAuthor['src'] + '.htm'
        poemHtml = s.get(prefix + '/' + src, headers=headers)
        print("Download finish!")
        poemHtml.encoding = 'ISO-8859-1'
        txt = poemHtml.text
        # poems are separated by <hr>, and adjacent poems share a delimiter,
        # which is why the overlapping findAll is used instead of re.findall
        pattern = re.compile("<hr>(.*?)<hr>", re.S)
        tempHrList = findAll(pattern, txt)
        for m in tempHrList:
            poem = {"author": dictAuthor['name']}

            content = parse(parseString(m).split('\n'))
            for k in range(0, len(content)):
                content[k] = content[k].strip()
                if k > 0:
                    # strip the stray digits embedded in the body lines
                    for a in range(0, 10):
                        content[k] = content[k].replace(str(a), '')
            content = parse(content)
            title = content[0]
            content = content[1:]
            # also append the raw text for the word-cloud analysis below
            with open("content.txt", 'a', encoding="iso-8859-1") as fp:
                for k in content:
                    fp.write(k + " ")

            poem['title'] = title
            poem['paragraphs'] = content
            poem['id'] = dictAuthor['id']
            poemList.append(poem)

        print("Finish ", dictAuthor['name'])
        with open(dictAuthor['name'] + '.json', 'w', encoding="ISO-8859-1") as fp:
            json.dump(poemList, fp, ensure_ascii=False)

    print("Finish!")

author()
poem()

The program first saves the poets into JSON files grouped by country, then reads the poet information from each country's JSON, crawls that poet's poems, and names the output file after the poet. (Note: when you want to fetch the poems, rename the corresponding country's JSON file to author.json.)
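For example, before fetching the British poets' poems you would copy that country's file into place (the England-author.json name here is only an illustration; use whatever file name author() actually wrote):

import shutil

# hypothetical country file produced by author(); adjust to the actual name
shutil.copyfile('England-author.json', 'author.json')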

To cope with the server refusing connections, the program sets proxies (free proxy IPs from 站大爷). If the crawl still gets interrupted, just restart the program.
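If restarting by hand gets tedious, a small retry wrapper around s.get is one option; this is only a sketch layered on the script above, not part of the original:

import time

def get_with_retry(url, retries=3, backoff=5):
    # retry a few times with a fixed pause before giving up
    for attempt in range(retries):
        try:
            return s.get(url, headers=headers, timeout=10)
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff)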

Data analysis

Data analysis is, as always, indispensable. Because the languages differ so much, I only built word clouds for the British and American poets.

from wordcloud import WordCloud

def analyze(file):
    with open(file, encoding="iso-8859-15") as fp:
        text = fp.read()
        # English needs no segmentation, so the raw text can be fed in directly
        wordcloud = WordCloud(background_color=(255, 255, 255), width=1600, height=800).generate(text)
        image_produce = wordcloud.to_image()
        # save and preview the rendered cloud (quality/subsampling are
        # JPEG-only options, so they are not passed for a PNG)
        image_produce.save('cloud.png')
        image_produce.show()

analyze('cloud.txt')

Compared with Chinese, word-cloud analysis for English is simple: no word segmentation is needed, the text can be analyzed directly. The result:

British and American poetry