python实现将html表格转换成CSV文件的方法
作者:秋风秋雨 发布时间:2023-08-25 00:48:41
标签:python,html,CSV
本文实例讲述了python实现将html表格转换成CSV文件的方法。分享给大家供大家参考。具体如下:
使用方法:python html2csv.py *.html
这段代码使用了 HTMLParser 模块
#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
# Hello, this program is written in Python - http://python.org
programname = 'html2csv - version 2002-09-20 - http://sebsauvage.net'
import sys, getopt, os.path, glob, HTMLParser, re
try: import psyco ; psyco.jit() # If present, use psyco to accelerate the program
except: pass
def usage(progname):
''' Display program usage. '''
progname = os.path.split(progname)[1]
if os.path.splitext(progname)[1] in ['.py','.pyc']: progname = 'python '+progname
return '''%s
A coarse HTML tables to CSV (Comma-Separated Values) converter.
Syntax : %s source.html
Arguments : source.html is the HTML file you want to convert to CSV.
By default, the file will be converted to csv with the same
name and the csv extension (source.html -> source.csv)
You can use * and ?.
Examples : %s mypage.html
: %s *.html
This program is public domain.
Author : Sebastien SAUVAGE <sebsauvage at sebsauvage dot net>
http://sebsauvage.net
''' % (programname, progname, progname, progname)
class html2csv(HTMLParser.HTMLParser):
''' A basic parser which converts HTML tables into CSV.
Feed HTML with feed(). Get CSV with getCSV(). (See example below.)
All tables in HTML will be converted to CSV (in the order they occur
in the HTML file).
You can process very large HTML files by feeding this class with chunks
of html while getting chunks of CSV by calling getCSV().
Should handle badly formated html (missing <tr>, </tr>, </td>,
extraneous </td>, </tr>...).
This parser uses HTMLParser from the HTMLParser module,
not HTMLParser from the htmllib module.
Example: parser = html2csv()
parser.feed( open('mypage.html','rb').read() )
open('mytables.csv','w+b').write( parser.getCSV() )
This class is public domain.
Author: Sébastien SAUVAGE <sebsauvage at sebsauvage dot net>
http://sebsauvage.net
Versions:
2002-09-19 : - First version
2002-09-20 : - now uses HTMLParser.HTMLParser instead of htmllib.HTMLParser.
- now parses command-line.
To do:
- handle <PRE> tags
- convert html entities (&name; and &#ref;) to Ascii.
'''
def __init__(self):
HTMLParser.HTMLParser.__init__(self)
self.CSV = '' # The CSV data
self.CSVrow = '' # The current CSV row beeing constructed from HTML
self.inTD = 0 # Used to track if we are inside or outside a <TD>...</TD> tag.
self.inTR = 0 # Used to track if we are inside or outside a <TR>...</TR> tag.
self.re_multiplespaces = re.compile('\s+') # regular expression used to remove spaces in excess
self.rowCount = 0 # CSV output line counter.
def handle_starttag(self, tag, attrs):
if tag == 'tr': self.start_tr()
elif tag == 'td': self.start_td()
def handle_endtag(self, tag):
if tag == 'tr': self.end_tr()
elif tag == 'td': self.end_td()
def start_tr(self):
if self.inTR: self.end_tr() # <TR> implies </TR>
self.inTR = 1
def end_tr(self):
if self.inTD: self.end_td() # </TR> implies </TD>
self.inTR = 0
if len(self.CSVrow) > 0:
self.CSV += self.CSVrow[:-1]
self.CSVrow = ''
self.CSV += '\n'
self.rowCount += 1
def start_td(self):
if not self.inTR: self.start_tr() # <TD> implies <TR>
self.CSVrow += '"'
self.inTD = 1
def end_td(self):
if self.inTD:
self.CSVrow += '",'
self.inTD = 0
def handle_data(self, data):
if self.inTD:
self.CSVrow += self.re_multiplespaces.sub(' ',data.replace('\t',' ').replace('\n','').replace('\r','').replace('"','""'))
def getCSV(self,purge=False):
''' Get output CSV.
If purge is true, getCSV() will return all remaining data,
even if <td> or <tr> are not properly closed.
(You would typically call getCSV with purge=True when you do not have
any more HTML to feed and you suspect dirty HTML (unclosed tags). '''
if purge and self.inTR: self.end_tr() # This will also end_td and append last CSV row to output CSV.
dataout = self.CSV[:]
self.CSV = ''
return dataout
if __name__ == "__main__":
try: # Put getopt in place for future usage.
opts, args = getopt.getopt(sys.argv[1:],None)
except getopt.GetoptError:
print usage(sys.argv[0]) # print help information and exit:
sys.exit(2)
if len(args) == 0:
print usage(sys.argv[0]) # print help information and exit:
sys.exit(2)
print programname
html_files = glob.glob(args[0])
for htmlfilename in html_files:
outputfilename = os.path.splitext(htmlfilename)[0]+'.csv'
parser = html2csv()
print 'Reading %s, writing %s...' % (htmlfilename, outputfilename)
try:
htmlfile = open(htmlfilename, 'rb')
csvfile = open( outputfilename, 'w+b')
data = htmlfile.read(8192)
while data:
parser.feed( data )
csvfile.write( parser.getCSV() )
sys.stdout.write('%d CSV rows written.\r' % parser.rowCount)
data = htmlfile.read(8192)
csvfile.write( parser.getCSV(True) )
csvfile.close()
htmlfile.close()
except:
print 'Error converting %s ' % htmlfilename
try: htmlfile.close()
except: pass
try: csvfile.close()
except: pass
print 'All done. '
希望本文所述对大家的Python程序设计有所帮助。


猜你喜欢
- 使用使用navicat连接远程linux mysql数据库出现10061未知故障,设置使用ssh连接后出现2013故障本机环境:win10
- python 字典操作提取key,value dictionaryName[key] = value1.为字典增加一项 2.访问字典中的值
- 技术背景对于一些连续运行或者长时间运行的Python程序而言,如服务器的后端,或者是长时间运行的科学计算程序。当我们涉及到一些中途退出的操作
- 1. 内部重构#2. 外部重构#website/blog/urls.pywebsite/website/urls.py3. 两种参数处理方式
- 前言QTableWidget是Qt程序中常用的显示数据表格的控件,类似于c#中的DataGrid。QTableWidget是QTableVi
- 本文是一篇关于《Effective Python》书中一节的学习笔记,记录了示例代码和思路。如果函数要产生一系列结果,那么最简单的做法就是把
- Python标准库的OS模块对操作系统的API进行了封装,并且使用统一的API访问不同操作系统的相同功能。OS模块包含与操作系统的系统环境、
- 前言需求是将两个list同时进行遍历,然后同步的将每个元素add到一个dict中,虽然有麻烦的方式,比如直接用list的数组下标可以实现,但
- 一、架构介绍Scrapy一个开源和协作的框架,其最初是为了页面抓取 (更确切来说, 网络抓取 )所设计的,使用它可以以快速、简单、可扩展的方
- 如下所示:import numpy as npnp.set_printoptions(threshold='nan')来源:
- 字符编码我们已经讲过了,字符串也是一种数据类型,但是,字符串比较特殊的是还有一个编码问题。因为计算机只能处理数字,如果要处理文本,就必须先把
- 本文实例讲述了python实现的接收邮件功能。分享给大家供大家参考,具体如下:一 简介本代码实现从网易POP3服务器接收邮件二 代码impo
- 这篇文章主要介绍了Python enumerate函数遍历数据对象组合过程解析,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定
- 不知大家对精华区的表格排序终极优化是否还有记忆,当时讨论的结果曾以为是最快的JS排序了,实则不然,按前段时间我发的DHTML性能提升帖(转译
- 首先必须将权重也转为Tensor的cuda格式;然后将该class_weight作为交叉熵函数对应参数的输入值。class_weight =
- 在用Linux(OS:Centos 7.2)时看到有一行代码是:export PYTHONPATH=$PYTHONPATH:/home/us
- 本文介绍一款工具 go-callvis,它能够将 Go 代码的调用关系可视化出来,并提供了可交互式的 web 服务。go get -u gi
- 一、安装github:https://github.com/kubernetes-client/python安装pip install ku
- 在MySQL中可以使用IF()、IFNULL()、NULLIF()、ISNULL()函数进行流程的控制。1、IF()函数的使用IF(expr
- 如果不想允许随意修改一个类的某个属性,常用的方法是使用property装饰器以及在属性前加下划线。class V: def __