[Python] Removing stop words from text

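・Topic: a while ago I converted text to numbers in order to analyze its characteristics. Words that carry little meaning, such as "a" and "the", showed up in large numbers and got in the way of the analysis, so I want to remove them.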

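・In natural language processing, words that carry little information on their own, such as pronouns and articles, are called stop words, and they are often removed as a preprocessing step. spaCy is probably the best-known library for this kind of work. For Japanese, a library called GiNZA seems to be well known.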

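・First I installed spaCy and tried to use it, but it failed right at the start with the following error (spaCy raises this when the English model is requested but not installed).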

Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

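So I also installed the English model data (run python -m spacy download en_core_web_sm in the Anaconda Prompt; the short form python -m spacy download en that I originally used is deprecated as of spaCy v3). Note that the STOP_WORDS import used later in this post works without the model.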

・With the environment ready, I prepared some text. Using Biopython, I fetched the abstract of one paper that matches a keyword.

from Bio import Entrez
Entrez.email = "abc@def.ghi"  # replace with an e-mail address you can use
handle = Entrez.esearch(db="pubmed", term="cancer", retmax="1")
record = Entrez.read(handle)

from Bio import Medline
handle = Entrez.efetch(db="pubmed", id=record["IdList"], rettype="medline",retmode="text")
records = Medline.parse(handle)
records = list(records)
text = records[0]["AB"]  # keep the abstract in text for the steps below
text

This returned:

"1. Both PI3K signaling quality (interaction partners, effectors) and quantity (strength, kinetics) differ across contexts, and are under dynamic regulation. 2. Rather than as a simple ON/OFF cellular switch,......

・Before removing stop words, I first split the text on whitespace to get a list of words.

text.split()

This returned:

['1.',
'Both',
'PI3K',
'signaling',
'quality',
'(interaction',
'partners,',
'effectors)',
'and',
'quantity',
'(strength,',
'kinetics)',

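・Words with parentheses or commas attached stand out. Left like this, "partners," would never match "partners", so it seems better to strip these characters first. Let's try removing them.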

import re
text=re.sub(r"[.,)(}{!?:;]","", text)

・text now looks like this. The commas, periods, and parentheses are indeed gone.

"1 Both PI3K signaling quality interaction partners effectors and quantity strength kinetics differ across contexts and are under dynamic regulation 2 Rather than as a simple ON/OFF cellular switch the PI3K signaling pathway should be viewed as a .......
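・One side effect of deleting every period is that a decimal such as 0.5 would become 05. As a possible alternative, here is a minimal sketch that extracts tokens with a regular expression instead of deleting characters. The names TOKEN_PATTERN and tokenize are my own for illustration, and the pattern is only one reasonable choice, not the method used in this post.

```python
import re

# Hypothetical pattern: a run of letters/digits, optionally joined by
# '.', '/', "'" or '-' to further runs (keeps "0.5" and "ON/OFF" whole).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[./'-][A-Za-z0-9]+)*")


def tokenize(text):
    """Return the tokens found in text, without surrounding punctuation."""
    return TOKEN_PATTERN.findall(text)


sample = "1. Both PI3K signaling quality (interaction partners, effectors) differ."
print(tokenize(sample))
```

With this approach the leading "1." still becomes "1" and "(interaction" becomes "interaction", while punctuation inside a token is kept.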

・While I was at it, I converted everything to lower case and then built the word list.

text = text.lower()   # convert to lower case
words = text.split()  # split on whitespace into a list

['1',
'both',
'pi3k',
'signaling',
'quality',
'interaction',
'partners',
'effectors',
'and',

・Next, remove the stop words. Importing STOP_WORDS from spaCy and printing it shows that STOP_WORDS is a set (a collection) of the words to remove.

from spacy.lang.en.stop_words import STOP_WORDS
STOP_WORDS

{"'d",
"'ll",
"'m",
"'re",
"'s",
"'ve",
'a',
'about',
'above',

・So taking the words from the list above one at a time and keeping only those that are not in this set gives a list with the stop words removed.

words=[w for w in words if w not in STOP_WORDS]

・words now looks like this. The remaining words do seem more distinctive than before.

['1',
'pi3k',
'signaling',
'quality',
'interaction',
'partners',
'effectors',
'quantity',
'strength',
'kinetics',
'differ',
'contexts',
'dynamic',
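
・The output above still contains the bare number '1', which is not a stop word. If such tokens also get in the way, the same list comprehension can take an extra condition. This is only a sketch: SAMPLE_STOP_WORDS is a small stand-in so the example runs without spaCy (in practice you would pass spaCy's STOP_WORDS), and drop_stop_words_and_numbers is a name I made up.

```python
# Small stand-in for spaCy's STOP_WORDS (assumption, for illustration only).
SAMPLE_STOP_WORDS = {"both", "and", "across", "are", "under"}


def drop_stop_words_and_numbers(words, stop_words):
    """Keep words that are neither stop words nor made up only of digits."""
    return [w for w in words if w not in stop_words and not w.isdigit()]


words = ["1", "both", "pi3k", "signaling", "quality", "and", "quantity", "2"]
print(drop_stop_words_and_numbers(words, SAMPLE_STOP_WORDS))
# prints ['pi3k', 'signaling', 'quality', 'quantity']
```

Note that 'pi3k' is kept, because str.isdigit() is true only when every character is a digit.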

・Run from start to finish, it looks like this.

# Fetch the abstract.

from Bio import Entrez
Entrez.email = "abc@def.ghi"  # replace with an e-mail address you can use
handle = Entrez.esearch(db="pubmed", term="cancer", retmax="1")
record = Entrez.read(handle)

from Bio import Medline
handle = Entrez.efetch(db="pubmed", id=record["IdList"], rettype="medline", retmode="text")
records = Medline.parse(handle)
records = list(records)

text = records[0]["AB"]

# Process the abstract.

import re
text = re.sub(r"[.,)(}{!?:;]", "", text)  # remove punctuation
text = text.lower()                       # convert to lower case
words = text.split()                      # split on whitespace into a list

from spacy.lang.en.stop_words import STOP_WORDS
words = [w for w in words if w not in STOP_WORDS]

・words should now be the list of words with the stop words removed.

That's all.
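
・If the same processing is needed for many abstracts, the three text-processing steps could be wrapped in one function. The following is a minimal sketch under my own assumptions: remove_stop_words is a hypothetical name, and SAMPLE_STOP_WORDS is a tiny stand-in set so the example runs without spaCy or network access. With spaCy installed, STOP_WORDS would be passed instead.

```python
import re


def remove_stop_words(text, stop_words):
    """Strip punctuation, lower-case, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[.,)(}{!?:;]", "", text)  # same character class as in this post
    return [w for w in cleaned.lower().split() if w not in stop_words]


# Tiny stand-in for spaCy's STOP_WORDS (assumption, for illustration only).
SAMPLE_STOP_WORDS = {"both", "and", "are", "under", "across"}

abstract = ("Both PI3K signaling quality and quantity differ across contexts, "
            "and are under dynamic regulation.")
print(remove_stop_words(abstract, SAMPLE_STOP_WORDS))
```

Passing the stop-word set as an argument makes it easy to swap in a different or extended set later.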