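A Summary of Ways to Count English Word Occurrences in a Plain-Text File with Python (Tested and Working)
Author: wanlifeipeng  Published: 2023-05-14 08:03:02
This article walks through several ways to count how often each English word occurs in a plain-text file using Python, shared here for reference. The details are as follows: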
Version 1: inefficient
#!python3
# -*- coding:utf-8 -*-
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
    word = []
    words_dict = {}
    for letter in f.read():
        if letter.isalnum():
            word.append(letter)
        elif letter.isspace():  # whitespace: space, \t, \n
            if word:
                word = ''.join(word).lower()  # convert to lowercase
                if word not in words_dict:
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
                word = []
    # handle the last word (the file may not end with whitespace)
    if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
            words_dict[word] = 1
        else:
            words_dict[word] += 1
        word = []
for k, v in words_dict.items():
    print(k, v)
Output:
we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
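The character-by-character scan in version 1 repeats its counting logic twice (once inside the loop and once for the last word). A minimal sketch of the same idea wrapped in a function is shown here; the names count_words_by_char and flush are hypothetical and not part of the original article. It keeps version 1's behavior, including the fact that punctuation is skipped without ending the word, so a token such as "a-b" is counted as "ab".

```python
def count_words_by_char(text):
    """Count lowercase alphanumeric words by scanning one character at a time."""
    counts = {}
    current = []  # characters of the word currently being built

    def flush():
        # Record the word collected so far (if any), then start a new one.
        if current:
            word = ''.join(current).lower()
            counts[word] = counts.get(word, 0) + 1
            current.clear()

    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif ch.isspace():
            flush()
        # Any other character (punctuation) is skipped, as in version 1.
    flush()  # handle the last word when the text does not end in whitespace
    return counts


sample = "We are busy all day, we"
print(count_words_by_char(sample))
# {'we': 2, 'are': 1, 'busy': 1, 'all': 1, 'day': 1}
```

Using dict.get with a default of 0 replaces the if/else membership check, and the single flush helper removes the duplicated block.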
Version 2:
Drawback: the whole file is read into memory at once, so it performs poorly on large files. Note also that word_list.count(word) rescans the entire list for every distinct word.
#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    data = f.read()
    word_reg = re.compile(r'\w+')
    #word_reg = re.compile(r'\w+\b')
    word_list = word_reg.findall(data)
    word_list = [word.lower() for word in word_list]  # convert to lowercase
    word_set = set(word_list)  # avoid counting the same word more than once
    # words_dict = {}
    # for word in word_set:
    #     words_dict[word] = word_list.count(word)
    # concise form
    words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
    print(k, v)
Output:
on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
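One thing worth knowing about the \w+ pattern used in version 2: \w matches letters, digits, and the underscore, so numbers count as words and an apostrophe splits a word in two. The short sketch below illustrates this on a made-up sentence and contrasts it with a stricter, letters-only pattern. The function names and the alternative pattern are assumptions for illustration, not something the original article uses.

```python
import re

W_RE = re.compile(r'\w+')
# Hypothetical stricter pattern: ASCII letters, optionally joined by apostrophes.
LETTERS_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def words_by_w(text):
    """Tokenize the way version 2 does: lowercase, then match \\w+."""
    return W_RE.findall(text.lower())


def words_letters_only(text):
    """Tokenize with the stricter letters-and-apostrophes pattern."""
    return LETTERS_RE.findall(text.lower())


text = "Don't stop: snake_case, 3 times"
print(words_by_w(text))
# ['don', 't', 'stop', 'snake_case', '3', 'times']
print(words_letters_only(text))
# ["don't", 'stop', 'snake', 'case', 'times']
```

For the test text in this article the two patterns give the same words, since it contains no digits, underscores, or apostrophes.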
Version 3:
#!python3
# -*- coding:utf-8 -*-
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:
        #line_words = word_reg.findall(line)
        # Simpler than the regex above, but punctuation stays attached
        # to the words and case is preserved (see the output below)
        line_words = line.split()
        word_list.extend(line_words)
    word_set = set(word_list)  # avoid counting the same word more than once
    words_dict = {word: word_list.count(word) for word in word_set}
for k, v in words_dict.items():
    print(k, v)
Output:
childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
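As the output above shows, splitting on whitespace leaves punctuation attached ("day,", "soul.") and treats "We" and "we" as different words. One possible way to keep split() and still merge those variants is to normalize each token before counting. This is a minimal sketch under that assumption; normalize and count_split_words are hypothetical helper names, not part of the original article.

```python
import string


def normalize(token):
    """Lowercase a whitespace-delimited token and strip punctuation from both ends."""
    return token.strip(string.punctuation).lower()


def count_split_words(lines):
    """Count normalized words from an iterable of lines (e.g. an open file)."""
    counts = {}
    for line in lines:
        for token in line.split():
            word = normalize(token)
            if word:  # skip tokens that consisted only of punctuation
                counts[word] = counts.get(word, 0) + 1
    return counts


lines = ["We are busy all day,", "we grew up, years away."]
print(count_split_words(lines))
# {'we': 2, 'are': 1, 'busy': 1, 'all': 1, 'day': 1, 'grew': 1, 'up': 1, 'years': 1, 'away': 1}
```

Because str.strip only trims the ends of a token, punctuation inside a word (for example an apostrophe or a hyphen) is left in place.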
Version 4: counting with Counter
#!python3
# -*- coding:utf-8 -*-
import collections
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    for line in f:
        line_words = line.split()
        word_list.extend(line_words)
    words_dict = dict(collections.Counter(word_list))  # count with Counter
for k, v in words_dict.items():
    print(k, v)
Output:
We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
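The ideas from the versions above can be combined: read line by line (so a large file never has to fit in memory), tokenize with the \w+ regex and lowercase (so punctuation and case do not split the counts), and let Counter do the tallying in a single pass. The sketch below is one way to put these together, not the original author's code; count_words is a hypothetical name, and io.StringIO stands in for an open file so the example runs without test.txt.

```python
import collections
import io
import re

WORD_RE = re.compile(r'\w+')

SAMPLE = (
    "We are busy all day, like swarms of flies without souls, noisy, restless, "
    "unable to hear the voices of the soul. As time goes by, childhood away, "
    "we grew up, years away a lot of memories, once have also eroded the bottom "
    "of the childish innocence, we regardless of the shackles of mind, indulge "
    "in the world buckish, focus on the beneficial principle, we have lost themselves."
)


def count_words(lines):
    """Count lowercase \\w+ words from an iterable of lines, one line at a time."""
    counter = collections.Counter()
    for line in lines:
        # Counter.update adds one to the tally of each word in the list.
        counter.update(WORD_RE.findall(line.lower()))
    return counter


# With a real file: with open('test.txt', encoding='utf-8') as f: count_words(f)
counts = count_words(io.StringIO(SAMPLE))
print(counts.most_common(3))
# [('the', 7), ('of', 6), ('we', 4)]
```

Counter.most_common(n) returns the n most frequent words in descending order of count, which is usually more useful than printing the dictionary in insertion order.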
Note: the test text test.txt used here is as follows:
We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
I hope this article is helpful for your Python programming.
Source: http://www.cnblogs.com/hupeng1234/p/6680491.html

