tokenize-file

2019-12-03

What is tokenize-file

tokenize-file reads a file, tokenizes it, and returns the tokens as JSON.

tokenize-file usage and documentation

Installation

npm i tokenize-file -S

Example

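const tokenizeFile = require("tokenize-file");

// Tokenize the file, then keep tokens that are neither stop words nor nouns.
// Noun tags all start with "N" (NN, NNS, NNP, NNPS); no tag is exactly "N".
tokenizeFile("path/to/file.txt", tokens => {
  console.log(tokens.filter(d => !d.stop_word && !d.pos.startsWith("N")));
});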
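
The callback style above can be wrapped in a Promise for use with async/await. This is a minimal sketch, not part of the package: `tokenizeFileAsync` and `fakeTokenizeFile` are hypothetical names, and the wrapper assumes the callback receives only the token array, as in the example. Since no error argument is documented, the Promise never rejects.

```javascript
// Hypothetical wrapper: adapts a tokenizeFile-style function
// (path, callback) to return a Promise of the token array.
function tokenizeFileAsync(tokenizeFile, filePath) {
  return new Promise(resolve => {
    tokenizeFile(filePath, tokens => resolve(tokens));
  });
}

// Stand-in for the real package so the sketch runs without installing it.
// It ignores the path and returns one made-up token.
function fakeTokenizeFile(filePath, callback) {
  callback([{ value: "door", count: 1, pos: "NN", stop_word: false }]);
}

tokenizeFileAsync(fakeTokenizeFile, "path/to/file.txt").then(tokens => {
  console.log(tokens.length); // 1
});
```

To use it with the real package, pass `require("tokenize-file")` as the first argument instead of the stand-in.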

API

tokenizeFile(path/to/file_name, callback)

Reads a file, tokenizes it, and passes the tokens to the callback function as an array of objects. Each token in the array has the following shape:

{
  value: "String", // the token
  count: Number, // the number of times it appears in the file
  pos: "String", // the token's Penn Treebank POS tag
  stop_word: Boolean // whether the token value is a stop word, which can be filtered out in some analyses
}
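
As an illustration of working with this shape, the sketch below picks the most frequent tokens that are not stop words. `topTokens` and the sample data are hypothetical and only assume the four fields listed above.

```javascript
// Hypothetical helper: return the n most frequent non-stop-word tokens.
function topTokens(tokens, n) {
  return tokens
    .filter(t => !t.stop_word)         // filter returns a new array
    .sort((a, b) => b.count - a.count) // highest count first
    .slice(0, n);
}

// Made-up sample data in the documented token shape.
const sampleTokens = [
  { value: "the", count: 12, pos: "DT", stop_word: true },
  { value: "door", count: 3, pos: "NN", stop_word: false },
  { value: "took", count: 1, pos: "VBD", stop_word: false },
  { value: "big", count: 5, pos: "JJ", stop_word: false }
];

console.log(topTokens(sampleTokens, 2).map(t => t.value)); // [ 'big', 'door' ]
```

Because `filter` copies the array before sorting, the array passed to the callback is left in its original order.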

tokenizeFile can read any type of file supported by textract:

  • HTML, HTM
  • ATOM, RSS
  • Markdown
  • XML, XSL
  • PDF
  • DOC, DOCX
  • ODT, OTT (experimental, feedback needed!)
  • RTF
  • XLS, XLSX, XLSB, XLSM, XLTX
  • CSV
  • ODS, OTS
  • PPTX, POTX
  • ODP, OTP
  • ODG, OTG
  • PNG, JPG, GIF
  • DXF
  • application/javascript
  • All text/* mime-types.

The POS tags are:

POS Tag | Description | Example
--- | --- | ---
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | there is
FW | foreign word | d'oeuvre
IN | preposition/subordinating conjunction | in, of, like
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | door
NNS | noun, plural | doors
NNP | proper noun, singular | John
NNPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend's
PRP | personal pronoun | I, he, it
PRP$ | possessive pronoun | my, his
RB | adverb | however, usually, naturally, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | to go, to him
UH | interjection | uhhuhhuhh
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, singular present, non-3rd person | take
VBZ | verb, 3rd person singular present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when
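
Many analyses only need a coarse word class, not the full tag. The sketch below collapses the tags by prefix and sums token counts per class. `coarseClass`, `countByClass`, and the sample data are hypothetical, and the grouping is one possible choice among several.

```javascript
// Hypothetical helper: map a Penn Treebank tag to a coarse word class
// using the shared prefixes in the tag table (NN*, VB*, JJ*, RB*).
function coarseClass(pos) {
  if (pos.startsWith("NN")) return "noun";
  if (pos.startsWith("VB")) return "verb";
  if (pos.startsWith("JJ")) return "adjective";
  if (pos.startsWith("RB")) return "adverb";
  return "other";
}

// Sum the count field of every token, grouped by coarse class.
function countByClass(tokens) {
  const totals = {};
  for (const t of tokens) {
    const key = coarseClass(t.pos);
    totals[key] = (totals[key] || 0) + t.count;
  }
  return totals;
}

// Made-up sample data in the documented token shape.
const taggedTokens = [
  { value: "doors", count: 2, pos: "NNS", stop_word: false },
  { value: "John", count: 1, pos: "NNP", stop_word: false },
  { value: "takes", count: 4, pos: "VBZ", stop_word: false },
  { value: "bigger", count: 1, pos: "JJR", stop_word: false },
  { value: "usually", count: 3, pos: "RB", stop_word: false },
  { value: "the", count: 7, pos: "DT", stop_word: true }
];

console.log(countByClass(taggedTokens));
// { noun: 3, verb: 4, adjective: 1, adverb: 3, other: 7 }
```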
