Counting word occurrences in a file (setdefault)

January 8, 2023

Content #

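The Python code below collects word occurrence information for a file, recording the line number and column number at which each word appears. The three lines marked #BAD look up the dictionary twice: once to fetch the list and once to store it back. They can be rewritten as a single line that does not need a second lookup.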

import sys
import re

# Raw string so that \w reaches the regex engine unchanged
WORD_RE = re.compile(r'\w+')

index = {}
with open(sys.argv[1], encoding='utf-8') as fp:
    for line_no, line in enumerate(fp, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            # match.start() is 0-based; report 1-based columns
            column_no = match.start() + 1
            location = (line_no, column_no)
            occurrences = index.get(word, [])  #BAD
            occurrences.append(location)       #BAD
            index[word] = occurrences          #BAD

# print in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])

Rewritten version:

index.setdefault(word, []).append(location)
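To try the one-line version without a file on disk, here is a minimal sketch that wraps the same loop in a function. The names `build_index` and `sample` are hypothetical, introduced only for this illustration, and the input is a small in-memory list of lines rather than `sys.argv[1]`.

```python
import re

WORD_RE = re.compile(r'\w+')


def build_index(lines):
    """Map each word to a list of (line_no, column_no) locations.

    `build_index` is a hypothetical helper, not part of the original script.
    """
    index = {}
    for line_no, line in enumerate(lines, 1):
        for match in WORD_RE.finditer(line):
            word = match.group()
            location = (line_no, match.start() + 1)
            # Return the list stored under `word`, inserting an empty list
            # first if the key is missing, then append to that same list.
            index.setdefault(word, []).append(location)
    return index


# Made-up sample input standing in for the lines of a file
sample = ["a b a", "b c"]
index = build_index(sample)

# print in alphabetical order
for word in sorted(index, key=str.upper):
    print(word, index[word])
# a [(1, 1), (1, 5)]
# b [(1, 3), (2, 1)]
# c [(2, 3)]
```

Because `setdefault` returns the list that is actually stored in the dictionary, appending to its return value updates the index in place, so no separate assignment back to `index[word]` is needed.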

