[[toc]]
The year before last, I created a repository called BulkImportToAnki on GitHub. Only today did I notice that an issue had been opened on Dec 22, 2018. I’m very sorry that I missed it, and that I never wrote any documentation for the repository. That is why I am writing this article.
A long time ago, a classmate gave me a PDF file containing 100 sentences for IELTS. I didn’t think memorizing them by reading the file directly was a good idea.
I knew Anki was good for memorization. But how could I convert this PDF into cards that Anki supports? Here is the solution I came up with:
@flowstart
st=>start: Start
stage1=>operation: Extract text from PDF
stage2=>operation: Split the text and save a CSV file
stage3=>operation: Import the CSV file to Anki
e=>end: End

st->stage1->stage2->stage3->e
@flowend
Start: Install pdftotext
I decided to use Python for this. I followed the installation guide and installed pdftotext, which can extract text from a PDF file.
Extract text from PDF
I could then extract the text from my PDF file with:
import pdftotext

# Load your PDF
with open("100_sentences.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
pages = []
for page in pdf:
    pages.append(page)

# Print the first two pages
print(pages[0])
print(pages[1])
::: details Output
7000 雅思词汇用 100 个句子记完!
100 套真题中提炼而出的 100 百个经典句子,包涵了 7000 个雅思词汇。
1. Typical of the grassland dwellers of the continent is the American antelope, or
pronghorn.
1.美洲羚羊,或称叉角羚,是该大陆典型的草原动物。
2. Of the millions who saw Haley’s comet in 1986, how many people will live long
enough to see it return in the twenty-first century.
2. 1986 年看见哈雷慧星的千百万人当中,有多少人能够长寿到足以目睹它在二十一世纪
的回归呢?
3. Anthropologists have discovered that fear, happiness, sadness, and surprise
are universally reflected in facial expressions.
3.人类学家们已经发现,恐惧,快乐,悲伤和惊奇都会行之于色,这在全人类是共通的。
4. Because of its irritating effect on humans, the use of phenol as a general
antiseptic has been largely discontinued.
4.由于苯酚对人体带有刺激性作用,它基本上已不再被当作常用的防腐剂了。
5. In group to remain in existence, a profit-making organization must, in the
long run, produce something consumers consider useful or desirable.
5.任何盈利组织若要生存,最终都必须生产出消费者可用或需要的产品。
6. The greater the population there is in a locality; the greater the need there is
for water, transportation, and disposal of refuse.
6.一个地方的人口越多,其对水,交通和垃圾处理的需求就会越大。
7. It is more difficult to write simply, directly, and effectively than to employ
flowery but vague expressions that only obscure one’s meaning.
7.简明,直接,有力的写作难于花哨,含混而意义模糊的表达。
1 / 16
8. With modern offices becoming more mechanized, designers are attempting
to personalize them with warmer, less severe interiors.
8.随着现代办公室的日益自动化,设计师们正试图利用较为温暖而不太严肃的内部装饰来使
其具有亲切感。
9. The difference between libel and slander is that libel is printed while slander
is spoken.
9.诽谤和流言的区别在于前者是书面的,而后者是口头的。
10. The knee is the joints where the thigh bone meets the large bone of the
lower leg.
10.膝盖是大腿骨和小腿胫的连接处。
11. Acids are chemical compounds that, in water solution, have a sharp taste, a
corrosive action on metals, and the ability to turn certain blue vegetable dyes
red.
11.酸是一种化合物,它在溶于水时具有强烈的气味和对金属的腐蚀性,并且能够使某些蓝
色植物染料变红。
12. Billie Holiday’s reputation as a great jazz-blues singer rests on her ability to
give emotional depth to her songs.
12. Billie Holiday’s 作为一个爵士布鲁斯乐杰出歌手的名声建立在能够赋予歌曲感情深度
的能力。
13. Essentially, a theory is an abstract, symbolic representation of what is
conceived to be reality.
13.理论在本质上是对认识了的现实的一种抽象和符号化的表达。
14. Long before children are able to speak or understand a language, they
communicate through facial expressions and by making noises.
14.儿童在能说或能听懂语言之前,很久就会通过面部表情和靠发出噪声来与人交流了。
2 / 16
:::
That raised a question:
::: tip How do I get the information I need?
- I need every pair of English and Chinese sentences.
- I don’t need:
  - the first two lines at the top of the first page.
  - the page numbers at the foot of each page, such as ‘1 / 16’ and ‘2 / 16’.
:::
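The two requirements above can also be expressed as a line-based filter. The following is only a minimal sketch, not the code from the repository: `SAMPLE_PAGES`, `FOOTER`, and `clean_pages` are hypothetical names, the sample strings merely imitate the extracted text, and it assumes the title occupies exactly the first two lines of the first page.

```python
import re

# Hypothetical sample imitating two extracted pages; not the real PDF text.
SAMPLE_PAGES = [
    "Title line\nSubtitle line\n1. First sentence.\n1.第一句。\n1 / 16\n",
    "2. Second sentence.\n2.第二句。\n2 / 16\n",
]

# A footer line looks like "<page> / <total>" with nothing else on it.
FOOTER = re.compile(r"^\s*\d+\s*/\s*\d+\s*$")


def clean_pages(pages, header_lines=2):
    """Return the content lines of all pages without title lines or footers."""
    kept = []
    for index, page in enumerate(pages):
        lines = page.splitlines()
        if index == 0:
            # Assumption: the title block is exactly the first two lines.
            lines = lines[header_lines:]
        kept.extend(
            line for line in lines if line.strip() and not FOOTER.match(line)
        )
    return kept


cleaned = clean_pages(SAMPLE_PAGES)
print(cleaned)
```

Matching the footer by its shape, rather than by a fixed number of characters, should keep working when the page number grows from one digit to two.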
Split the text and save a CSV file
::: details All code
import re

import pandas as pd
import pdftotext

# Load your PDF
with open("100_sentences.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Join all pages into one string
pages = ''
for page in pdf:
    # Remove the page number by dropping the last 10 characters
    page = page[:-10]
    # Remove all '\n' and surrounding whitespace
    page = page.replace('\n', '').strip()
    pages += page
print(pages)

# Split on the sentence numbers ("1.", "2.", ...) and drop the title text before the first one
text = re.split(r'[0-9]+\.', pages)[1:]
print(text)

# Each English sentence is followed by its Chinese translation, so pair them up
pairs_list = []
for i in range(len(text)):
    if i % 2 == 1:
        pairs_list.append((text[i-1], text[i]))
for pair in pairs_list:
    print(pair, '\n')

df = pd.DataFrame(pairs_list, columns=['Front', 'Back'])
# The DataFrame index is written as the first column of the CSV
df.to_csv('DataExportToAnki.csv', encoding='utf-8')
# It's not elegant, I know, but it worked.
:::
With that, I had converted the PDF to a CSV file, which could easily be imported into Anki.
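The script pairs sentences by position, which breaks if a sentence itself contains a number followed by a period. As a possible alternative, the sketch below pairs them by their sentence number instead. It is an illustration under assumptions, not the repository's code: `SAMPLE`, `ITEM`, and `pair_by_number` are hypothetical names, the sample text is made up, and it assumes every item starts at the beginning of a line and that the English sentence comes before its translation.

```python
import re
from collections import defaultdict

# Hypothetical sample, already stripped of title lines and page numbers.
SAMPLE = (
    "1. The comet was seen in 1986. It returns\n"
    "later this century.\n"
    "1.这颗彗星在1986年被看到。\n"
    "2. The knee is a joint.\n"
    "2.膝盖是一个关节。\n"
)

# An item starts at the beginning of a line with "<number>." and runs
# until the next such line or the end of the text.
ITEM = re.compile(r"^(\d+)\.\s*(.*?)(?=^\d+\.|\Z)", re.MULTILINE | re.DOTALL)


def pair_by_number(text):
    """Group items by their number and return (front, back) tuples."""
    grouped = defaultdict(list)
    for match in ITEM.finditer(text):
        grouped[int(match.group(1))].append(match.group(2))

    pairs = []
    for number in sorted(grouped):
        parts = grouped[number]
        if len(parts) != 2:
            continue  # skip anything that is not a clean two-part pair
        # Re-join wrapped English lines with single spaces.
        front = " ".join(parts[0].split())
        # Chinese text needs no space where a line was wrapped.
        back = parts[1].replace("\n", "").strip()
        pairs.append((front, back))
    return pairs


for front, back in pair_by_number(SAMPLE):
    print(front, "|", back)
```

Because "1986." appears in the middle of a line here, it is not mistaken for a new item. A wrapped line that happens to begin with a number and a period would still confuse this pattern.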
Import the CSV file to Anki
On my computer, I opened Anki -> File -> Import, and then chose the CSV file I had made.
Because the first column was just a row number, I ignored Field 1.
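If you would rather not have that extra column at all, one option is to write only the two card fields. The sketch below uses Python's standard `csv` module instead of pandas. `write_anki_csv` and the sample `pairs` are hypothetical and not part of the repository, and the file goes to a temporary directory purely for demonstration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample pairs; the second one contains a comma and quotes
# to show that the csv module escapes them.
pairs = [
    ("The knee is a joint.", "膝盖是一个关节。"),
    ('He said "hi", then left.', "他说了“你好”，然后离开了。"),
]


def write_anki_csv(pairs, path):
    """Write one row per card: front, back. No index column, no header row."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(pairs)


out_path = Path(tempfile.mkdtemp()) / "DataExportToAnki.csv"
write_anki_csv(pairs, out_path)

# Read the file back to check that each row has exactly two fields.
with open(out_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

With a file like this, the first CSV column is the front of the card and the second is the back, so there should be no field to ignore in the import dialog.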
I’m not sure whether a CSV file can be imported into the mobile Anki app directly. Either way, you can export a package using Anki -> File -> Export and import it on your phone.
Thanks for reading. If you have any questions, leave a comment below.