Node module for extracting text from various file types

added by bpwndaddy
8/4/2015 3:26:21 PM

Currently extracts:

HTML, HTM
Markdown
XML, XSL
PDF
DOC, DOCX
ODT, OTT (experimental, feedback needed!)
RTF
XLS, XLSX, XLSB, XLSM, XLTX
ODS, OTS
PPTX, POTX
ODP, OTP
ODG, OTG
PNG, JPG, GIF
DXF
application/javascript
All text/* mime-types

In almost all cases above, what textract cares about is the mime type. So .html and .htm, which share a mime type, will both be extracted. Other extensions that share mime types with those above should also extract successfully. For example, application/vnd.ms-excel is the mime type for .xls, but also for 5 other file types.

Does textract not extract from files of the type you need? Add an issue or submit a pull request. In many cases textract is already capable; it is just not paying attention to the mime type you may be interested in.
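
To make the mime-type point concrete, here is a minimal sketch of choosing by mime type rather than by file extension. It is not textract's actual source: the tables MIME_BY_EXTENSION and SUPPORTED and the functions mimeFor and canExtract are hypothetical names invented for illustration, and the extension table is deliberately tiny.

```javascript
// Illustrative sketch only: NOT textract's real implementation.
const path = require('path');

// Hypothetical, deliberately small extension -> mime type table.
const MIME_BY_EXTENSION = {
  '.html': 'text/html',
  '.htm': 'text/html',
  '.xls': 'application/vnd.ms-excel',
  '.xlt': 'application/vnd.ms-excel',
  '.txt': 'text/plain',
  '.js': 'application/javascript',
};

// Hypothetical list of mime types the extractor is "paying attention to".
// A RegExp entry stands in for a wildcard such as text/*.
const SUPPORTED = [
  'text/html',
  'application/vnd.ms-excel',
  'application/javascript',
  /^text\//,
];

// Look up the mime type for a path, ignoring the case of the extension.
function mimeFor(filePath) {
  const ext = path.extname(filePath).toLowerCase();
  return MIME_BY_EXTENSION[ext] || null;
}

// A file is extractable when its mime type matches any supported rule.
function canExtract(filePath) {
  const mime = mimeFor(filePath);
  if (mime === null) return false;
  return SUPPORTED.some((rule) =>
    rule instanceof RegExp ? rule.test(mime) : rule === mime
  );
}
```

With these assumed tables, canExtract('page.htm') and canExtract('page.html') are both true because the two extensions resolve to the same mime type. An unlisted extension fails only because nothing maps it to a supported mime type, which is the situation the issue or pull request suggestion above addresses.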

