Lynx works OK, and mine defaults to UTF-8. I use a sed filter I built to convert extended-ASCII characters into US-ASCII-compliant text. Here is my filter so far:
https://every.sdf.org/.webshare/TXT.txt

If anyone knows of a proxy I could give a web URL to and receive a simple .txt version of the article back, please let me know! Otherwise, I might be tempted to create one. Maybe a gopher service?
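(The actual filter lives at that link; as a rough sketch of the kind of substitutions such a filter typically contains, and assuming GNU sed running in a UTF-8 locale, it might look something like the following, saved under a made-up name like to-ascii.sed. This is an illustration, not the contents of TXT.txt.)

# Illustrative sketch only: map common UTF-8 punctuation to US-ASCII equivalents.
# Curly quotes to straight quotes:
s/[""]/"/g
s/['']/'/g
# En and em dashes:
s/–/-/g
s/—/--/g
# Ellipsis:
s/…/.../g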
I don't know about a proxy, but I wonder how far @m150 could get with the following command:
$ lynx -dump -nolist "${URL}" > "${FILENAME}.txt"
If a site is too dependent on JS, this won't work, but if there's text hidden under entirely too much JS this might be enough to extract it. You'll still want to massage it using sed, though.
That's what I did when retrieving and cleaning the Limyaael Rants.
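(For what it's worth, the dump and the cleanup can be combined into one pipeline, reusing the made-up to-ascii.sed filename from the sketch above:)

$ lynx -dump -nolist "${URL}" | sed -f to-ascii.sed > "${FILENAME}.txt"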
Thanks starbreaker! That's actually a very elegant way; I'm always impressed by the wonders of piping commands. Someone else mentioned:
textify.it, which I still haven't tested.
I just tried textify.it with my websites. It seems to balk at processing tables that contain links.