Content
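The following Python code builds an index of the words in a file, recording the line number and column number of every occurrence of each word. To make missing keys easy to handle, the code uses defaultdict.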
import sys
import re
import collections

# Raw string, so the backslash in \w reaches the regex engine unchanged
WORD_RE = re.compile(r'\w+')

# A missing key is created on first access with an empty list as its value
index = collections.defaultdict(list)
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            # match.start() is 0-based; columns are reported 1-based
            column_no = match.start() + 1
            location = (line_no, column_no)
            index[word].append(location)

# print in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])
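
To see what defaultdict saves here, the sketch below builds the same index twice: once with a plain dict and `dict.setdefault`, once with `collections.defaultdict(list)`. The function names `build_index_plain` and `build_index_default` and the in-memory `sample` lines are assumptions made for illustration; they are not part of the original script, which reads from a file named on the command line.

```python
import collections
import re

WORD_RE = re.compile(r'\w+')


def build_index_plain(lines):
    """Index words with a plain dict (hypothetical helper for comparison)."""
    index = {}
    for line_no, line in enumerate(lines, 1):
        for match in WORD_RE.finditer(line):
            location = (line_no, match.start() + 1)
            # setdefault inserts an empty list the first time a word is seen
            index.setdefault(match.group(), []).append(location)
    return index


def build_index_default(lines):
    """Index words with defaultdict(list) (hypothetical helper)."""
    index = collections.defaultdict(list)
    for line_no, line in enumerate(lines, 1):
        for match in WORD_RE.finditer(line):
            location = (line_no, match.start() + 1)
            # A missing key triggers list() and stores the new empty list
            index[match.group()].append(location)
    return index


# Illustrative input, standing in for the lines of a file
sample = ["a b a", "b c"]
plain = build_index_plain(sample)
default = build_index_default(sample)

# Both approaches produce the same mapping
same = plain == default

# Subscripting a defaultdict with a missing key inserts that key,
# whereas .get() leaves the mapping untouched
got = default.get("missing")
before = "missing" in default
value = default["missing"]
after = "missing" in default
```

One thing to keep in mind with this design: only subscript access (`index[word]`) creates the default value. Lookups through `.get()` or `in` do not, so they are the safer choice when you only want to query the index without adding empty entries.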