Python统计纯文本文件中英文单词出现个数的方法总结【测试可用】
作者:wanlifeipeng 发布时间:2023-05-14 08:03:02
本文实例讲述了Python统计纯文本文件中英文单词出现个数的方法。分享给大家供大家参考,具体如下:
第一版: 效率低
# -*- coding:utf-8 -*-
#!python3
path = 'test.txt'
with open(path,encoding='utf-8',newline='') as f:
word = []
words_dict= {}
for letter in f.read():
if letter.isalnum():
word.append(letter)
elif letter.isspace(): #空白字符 空格 \t \n
if word:
word = ''.join(word).lower() #转小写
if word not in words_dict:
words_dict[word] = 1
else:
words_dict[word] += 1
word = []
#处理最后一个单词
if word:
word = ''.join(word).lower() # 转小写
if word not in words_dict:
words_dict[word] = 1
else:
words_dict[word] += 1
word = []
for k,v in words_dict.items():
print(k,v)
运行结果:
we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
第二版:
缺点:遇到大文件要一次读入内存,性能不好
# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path,'r',encoding='utf-8') as f:
data = f.read()
word_reg = re.compile(r'\w+')
#word_reg = re.compile(r'\w+\b')
word_list = word_reg.findall(data)
word_list = [word.lower() for word in word_list] #转小写
word_set = set(word_list) #避免重复查询
# words_dict = {}
# for word in word_set:
# words_dict[word] = word_list.count(word)
# 简洁写法
words_dict = {word: word_list.count(word) for word in word_set}
for k,v in words_dict.items():
print(k,v)
运行结果:
on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
第三版:
# -*- coding:utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
word_list = []
word_reg = re.compile(r'\w+')
for line in f:
#line_words = word_reg.findall(line)
#比上面的正则更加简单
line_words = line.split()
word_list.extend(line_words)
word_set = set(word_list) # 避免重复查询
words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
print(k, v)
运行结果:
childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
第四版:使用Counter
统计
# -*- coding:utf-8 -*-
#!python3
import collections
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
word_list = []
word_reg = re.compile(r'\w+')
for line in f:
line_words = line.split()
word_list.extend(line_words)
words_dict = dict(collections.Counter(word_list)) #使用Counter统计
for k, v in words_dict.items():
print(k, v)
运行结果:
We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
注:这里使用的测试文本test.txt如下:
We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
PS:这里再为大家推荐2款相关统计工具供大家参考:
在线字数统计工具:
http://tools.jb51.net/code/zishutongji
在线字符统计与编辑工具:
http://tools.jb51.net/code/char_tongji
希望本文所述对大家Python程序设计有所帮助。
来源:http://www.cnblogs.com/hupeng1234/p/6680491.html
猜你喜欢
- 概述要访问一个变量的内容,可以直接使用其名称。如果该变量是一个数组,可以使用变量名称和关键字或索引的组合来访问其内容。像其他变量一样,使用运
- 我对定格动画非常喜爱,也曾经在大学毕业时期制作过一部个人定格动画MV.恰当给CDC博客写文之机,给大家介绍下定格动画,分享下这门独特的拍摄艺
- 这篇论坛文章着重介绍了Access数据库出现0x80004005问题的解决方法,更多内容请参考下文:项目做了三个月了,终于也差不多完成了,昨
- numpy随机打乱数据方法np.random.shuffleimport numpy as np#实验可得每次shuffle后数据都被打乱,
- 01直接生成这类方法是利用基本程序软件包numpy的随机数产生方法来生成各类用于聚类算法数据集合,也是自行制作轮子的生成方法。一、基础类型1
- # 源码如下:#!/usr/bin/env python#coding=utf-8import osfrom PIL import Imag
- 最近项目需要抓包功能,并且抓包后要对数据包进行存库并分析。抓包想使用tcpdump来完成,但是tcpdump抓包之后只能保存为文件,我需要将
- 以前看到 andy的关于“Quiet Structure”觉的很不错,于是今天到她的个人站点上逛逛,发现不少好的文章,今天介绍的是
- 表单的验证是开发WEB应用程序中常遇到的一关。有时候我们必须保证表单的某些项必须填写、必须为数字、必须是指定的位数等等,这时候就要用到表单验
- 内容摘要:除了内部性能增强和优化外,IIS6.0版本的 Active Server Pages(ASP)&nb
- 环境准备前提已经安装好python、pycharm,配置了对应的环境变量。1、安装selenium模块文件–>设置
- 为了测试一组网页是否能够访问,采取python中的requests包进行批量的访问测试,并输出访问结果。一、requests包的安装 打开命
- 1. 实例描述在平时编程的过程中,会经常在网上翻译一些单词,本文使用Python制作一款翻译小工具,不仅可以自己用,还可以嵌入到程序当中。运
- 本来在网上有不少关于这方面的文章,可是我找了好久也没看到把(可能我的搜索水平有线把)不过倒是聊天室的很多。如何统计会员再线状态,希望对刚开始
- 说明: a、以下字符中数据库名forum,数据库服务器名WWW-2443D34E558\SQL2005(或者127.0.0.1) b、查看s
- JavaScript Dom编程 学习书籍选择JavaScript Dom编程学习,很多朋友无疑对如何选择入门的书籍,比较头疼。或许也是他们
- Python生成随机验证码,需要使用PIL模块,具体内容如下安装:pip3 install pillow基本使用1. 创建图片from PI
- 在本章中,我们将重点介绍RSA密码加密的不同实现及其所涉及的功能.您可以引用或包含此python文件以实现RSA密码算法实现.加密算法模块&
- 在良好的数据库设计基础上,能有效地使用索引是SQL Server取得高性能的基础,SQL Server采用基于代价的优化模型,它对每一个提交
- (一)原理 小偷程序实际上是通过了XML中的XMLHTTP组件调用其它网站上的网页。比如新闻小偷程序,