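tokenizers provides implementations of today's most commonly used tokenizers, with a focus on performance and versatility. Project page: https://github.com/huggingface/tokenizers (Fast State-of-the-Art Tokenizers optimized for Research and Production)
On FreeBSD with Python 3.10, installing tokenizers directly with pip fails: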
pip install tokenizers
Looking in indexes: https://mirror.baidu.com/pypi/simple
Collecting tokenizers
Using cached https://mirror.baidu.com/pypi/packages/59/9a/7ba038f101d74fea0861c8f82e188441fe99b2a26fb0991da35f2850f9f3/tokenizers-0.19.0.tar.gz (320 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... -
error
error: subprocess-exited-with-error
× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [93 lines of output]
Updating crates.io index
Downloading crates ...
Downloaded darling_macro v0.20.8
Downloaded windows-targets v0.52.5
Downloaded derive_builder_macro v0.20.0
Downloaded monostate-impl v0.1.12
Downloaded pyo3-macros v0.21.2
Downloaded fastrand v2.0.2
Downloaded itoa v1.0.11
Downloaded macro_rules_attribute-proc_macro v0.2.0
Downloaded memoffset v0.9.1
Downloaded rayon-cond v0.3.0
Downloaded autocfg v1.2.0
Downloaded env_filter v0.1.0
Downloaded macro_rules_attribute v0.2.0
Downloaded indoc v2.0.5
Downloaded monostate v0.1.12
Downloaded thiserror-impl v1.0.58
Downloaded anstyle v1.0.6
Downloaded either v1.11.0
Downloaded num-integer v0.1.46
Downloaded thiserror v1.0.58
Downloaded pkg-config v0.3.30
Downloaded darling v0.20.8
Downloaded derive_builder_core v0.20.0
Downloaded env_logger v0.11.3
Downloaded pyo3-build-config v0.21.2
Downloaded num-complex v0.4.5
Downloaded anstream v0.6.13
Downloaded quote v1.0.36
Downloaded getrandom v0.2.14
Downloaded derive_builder v0.20.0
Downloaded tempfile v3.10.1
Downloaded bitflags v2.5.0
Downloaded proc-macro2 v1.0.81
Downloaded ryu v1.0.17
Downloaded serde_derive v1.0.198
Downloaded log v0.4.21
Downloaded num-traits v0.2.18
Downloaded pyo3-macros-backend v0.21.2
Downloaded matrixmultiply v0.3.8
Downloaded darling_core v0.20.8
Downloaded indicatif v0.17.8
Downloaded pyo3-ffi v0.21.2
Downloaded serde v1.0.198
Downloaded cc v1.0.94
Downloaded numpy v0.21.0
Downloaded memchr v2.7.2
Downloaded unicode-segmentation v1.11.0
Downloaded aho-corasick v1.1.3
Downloaded serde_json v1.0.116
Downloaded esaxx-rs v0.1.10
Downloaded rayon v1.10.0
Downloaded regex v1.10.4
Downloaded syn v2.0.59
warning: spurious network error (3 tries remaining): [28] Timeout was reached (download of `regex-automata v0.4.6` failed to transfer more than 10 bytes in 30s)
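The build stalled while cargo was downloading crates, so I switched to installing manually from source.
Download the source code: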
git clone https://github.com/huggingface/tokenizers
Enter the Python bindings directory, then build and install:
cd tokenizers/bindings/python
pip install -e .
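To confirm that the editable install actually worked, a quick import check along these lines may help. This is only a sketch: check_tokenizers is a hypothetical helper name, and it assumes the interpreter is available as python3.10, as in the setup described above.

```shell
# check_tokenizers is a hypothetical helper, not part of pip or tokenizers.
check_tokenizers() {
    # Try to import the package and print its version.
    # If the import (or the interpreter lookup) fails, print a short message instead.
    python3.10 -c 'import tokenizers; print(tokenizers.__version__)' 2>/dev/null \
        || echo "tokenizers is not importable from python3.10"
}

check_tokenizers
```

If a version string is printed, the build from bindings/python is visible to that interpreter.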
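In my experience, once tokenizers has been built and installed this way, later pip installs of other tokenizers versions usually go through without trouble, presumably because cargo has already cached the crates it downloaded.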
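The log above stops at a cargo download timeout ("3 tries remaining", "failed to transfer more than 10 bytes in 30s"), so the failure looks network-related rather than specific to FreeBSD. One thing that might be worth trying is giving cargo more retries and a longer timeout through its environment variables. The values below are assumptions to tune for your own network, not tested recommendations.

```shell
# Assumed values; adjust for your connection.
# CARGO_NET_RETRY maps to cargo's net.retry setting (number of retries on network errors).
export CARGO_NET_RETRY=10
# CARGO_HTTP_TIMEOUT maps to cargo's http.timeout setting (seconds).
export CARGO_HTTP_TIMEOUT=120

# Then rerun the install from the same shell, for example:
#   pip install tokenizers
```

Because pip runs cargo as a child process during the build, variables exported in the shell should be inherited by it.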