Accuracy is moderate, but it still beats retyping everything by hand.
Convert PDF to text
How to use
1. sudo apt install -y poppler-utils
2. pdftotext file.pdf file.txt    # one PDF per call; a glob like *.pdf breaks when it matches several files
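
To convert a whole directory of PDFs, one option is a small loop that calls pdftotext once per file. The sketch below is only an illustration: the function name pdf_dir_to_txt is made up for this note, and it assumes each text file should sit next to its PDF with the same base name.

```shell
#!/bin/sh
# Hypothetical helper (the name is not part of poppler-utils):
# convert every PDF in a directory to a .txt file with the same base name.
pdf_dir_to_txt() {
    dir="${1:-.}"                       # directory to scan; defaults to the current one
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue       # skip the literal pattern when no PDF matches
        # pdftotext takes one input PDF and one output text file per call
        pdftotext "$pdf" "${pdf%.pdf}.txt"
    done
}
```

After sourcing the function, running pdf_dir_to_txt ~/docs should leave a .txt next to each PDF in ~/docs.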
Convert an image to text
How to use (takes a few more steps)
1. sudo apt install -y tesseract-ocr tesseract-ocr-chi-sim
2. convert file.png out.tif
3. tesseract out.tif out    # writes out.txt; tesseract appends the .txt extension itself
Installing Tesseract on Ubuntu / Linux
sudo apt-get install -y tesseract-ocr tesseract-ocr-chi-sim
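You can install additional language packs as needed; they are packaged as tesseract-ocr-<lang> (tesseract-ocr-chi-sim above is Simplified Chinese).
Before running Tesseract, convert the image (png/jpg) to TIFF. Older Tesseract builds only read TIFF; recent versions can usually read PNG and JPEG directly, in which case this step is optional. Use the following command (you may need to install the imagemagick package):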
convert file_name.png out_file_name.tif
Now run Tesseract on the converted file. The second argument is the output base name, so the recognized text is written to output_content.txt.
tesseract your_scanned_file.tif output_content
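
The convert and tesseract steps can be chained for a directory of scans. This is a rough sketch under a few assumptions: the function name ocr_dir is invented for this note, the inputs are .png files, and the language defaults to chi_sim (the pack installed above). Pass a different code such as eng as the second argument if that pack is installed.

```shell
#!/bin/sh
# Hypothetical helper (the name is not part of Tesseract or ImageMagick):
# OCR every PNG in a directory, producing <name>.tif and <name>.txt beside it.
ocr_dir() {
    dir="${1:-.}"                       # directory to scan; defaults to the current one
    lang="${2:-chi_sim}"                # Tesseract language code; assumed default
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue       # skip the literal pattern when no PNG matches
        base="${img%.png}"
        convert "$img" "$base.tif"      # ImageMagick: PNG to TIFF
        # Second argument is the output base; tesseract writes "$base.txt"
        tesseract "$base.tif" "$base" -l "$lang"
    done
}
```

For example, ocr_dir ~/scans eng would process ~/scans/*.png with the English language pack.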