Computer Hope

Software => Computer programming => Topic started by: TheWaffle on March 23, 2017, 05:58:28 PM

Title: I need to determine an unkown encoding of a .XLS file.
Post by: TheWaffle on March 23, 2017, 05:58:28 PM: I am working on writing a python(3) script to take data out of an xls file, and place it in an sqlite database. The xls file is downloaded from the Internet, and encoded in a way I haven't been able to get to work with the xlrd library.

Code: [Select]
Python 3.5.2+ (default, Sep 22 2016, 12:18:14) [GCC 6.2.0 20160927] on linux Type "copyright", "credits" or "license()" for more information. >>> import xlrd >>> file = xlrd.open_workbook("test.xls") WARNING *** file size (262435) not 512 + multiple of sector size (512) WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero Traceback (most recent call last): File "<pyshell#1>", line 1, in <module> file = xlrd.open_workbook("test.xls") File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook ragged_rows=ragged_rows, File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls bk.parse_globals() File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals self.handle_writeaccess(data) File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess strg = unpack_unicode(data, 0, lenlen=2) File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode strg = unicode(rawstrg, 'utf_16_le') File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda> unicode = lambda b, enc: b.decode(enc) File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode return codecs.utf_16_le_decode(input, errors, True) UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
I was reading through the docs, and it didn't give a clear way to test how the file is encoded.
Link Related: http://xlrd.readthedocs.io/en/latest/unicode.html

I did try other codes I have heard of, but I've yet to get one to work. Any thoughts?
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: BC_Programmer on March 23, 2017, 07:29:06 PM: Sounds like the file is missing the CODEPAGE information. You'll have to override it as documented here (http://www.lexicon.net/sjmachin/xlrd.html).

Code: [Select]
file = xlrd.open_workbook('test.xls', encoding_override="utf_16_le")
You'll have to experiment with different encodings depending on what kind it actually is.
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: TheWaffle on March 24, 2017, 04:20:52 PM: I have been try different codecs, but have yet to get one to work. I may write a script that will just try them all...

An interesting thing I have found is that no matter what codec I have given as a argument, I always get the same error:
Code: [Select]
>>> xlrd.open_workbook('test.xls', encoding_override="iso8859_15") WARNING *** file size (262435) not 512 + multiple of sector size (512) WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook ragged_rows=ragged_rows, File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls bk.parse_globals() File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals self.handle_writeaccess(data) File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess strg = unpack_unicode(data, 0, lenlen=2) File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode strg = unicode(rawstrg, 'utf_16_le') File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda> unicode = lambda b, enc: b.decode(enc) File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode return codecs.utf_16_le_decode(input, errors, True) UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: BC_Programmer on March 24, 2017, 04:40:23 PM: OK, then the file probably does have a CODEPAGE but the file itself is corrupted. Can you open the file in Excel/LibreOffice?
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: TheWaffle on March 28, 2017, 05:53:34 PM: Yes, I can open it in LibreOffice, and even if I change the file extension libreoffice doesn't complain. It is not an xlsx for sure because it isn't a compressed file. (I can see the text of the spread sheet with a hex editor)
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: TheWaffle on April 03, 2017, 04:30:35 PM: I have been investigating the file a bit with a hex editor...
I found the data of the first cell of the downloaded file here:
Code: [Select]
4F 00 72 00 64 00 65 00 72 00 20 00 47 00 75 00 69 00 64 00 65 00Libre Office interprets it as "Order Guide"

Now I copied this cell to a blank workbook and saved it.
I found "Order Guide" in the document as:
Code: [Select]
4F 72 64 65 72 20 47 75 69 64 65
Does anyone with more experience with encodings have any idea what the downloaded file could be encoded with? If not does anyone know where else to look in the file?
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: BC_Programmer on April 03, 2017, 04:40:18 PM: First example is UTF-16 Little Endian (LE).
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: TheWaffle on April 07, 2017, 06:26:07 PM: I've tried most common codecs with no success.
So maybe the file is utf_16_le.

Assuming the file is intentionally corrupted (100% reproducible), with python I would like to attempt to reconstruct it.

Code: [Select]
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated dataPosition 108 I am assuming means the 108th byte in the file.
0x20 from what I understand is usually a space, I will add 00 before it.
brb
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: TheWaffle on April 07, 2017, 07:49:40 PM: The error message was not helpful, no byte 0x20 was near byte position 108, nor near positions of 108*2^n.
>:(
So I decided to loop through the file with:
Code: [Select]
file = open("prices.xls", encoding="utf-16-le") byte = file.read(1) while byte != "": byte = file.read(1) file.close()and I got this error:
Code: [Select]
Traceback (most recent call last): File "./test.py", line 5, in <module> byte = file.read(1) File "/usr/lib/python3.5/codecs.py", line 321, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1018-1019: illegal encodingBytes in position 1018-1019 are: 00 00
???

On a side note, if I open the file in libreoffice and save as an xls, I can open it with xlrd, and the file shrinks significantly. 262.5kb to 217.6kb.

Any ideas on what to do next?
I do not want my script to be dependent on libreoffice being installed.
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: BC_Programmer on April 08, 2017, 12:58:22 AM: Quote
Assuming the file is intentionally corrupted (100% reproducible)
Where did it come from?
Title: Re: I need to determine an unkown encoding of a .XLS file.
Post by: TheWaffle on April 08, 2017, 08:40:21 AM: The working par t of the script, fills out a couple webforms, and downloads the xls file using beautiful soup.
This is the part that does it:
Code: [Select]
request = browser.session.get(myURL, stream=True) with open(path + os.sep +"test.xls", "wb") as output: output.write(request.content) output.close()
Quote
Assuming the file is intentionally corrupted (100% reproducible)
Sorry for not being clear to what is happening.
So the file that my code downloads, matches the file that I can download using Firefox. I know the xls file is generated by some webservice. That xls file I assume is corrupted is some form.
To clean up the xls file my current scripted solution is to run this:
Code: [Select]
from subprocess import call call(["unoconv", "-d", "spreadsheet", "--format=ods", path + os.sep +"test.xls"]) call(["unoconv", "-d", "spreadsheet", "--format=xls", path + os.sep +"test.ods"])While this solution works, I am not happy with it because it adds dependencies of libreoffice and unoconv.
What I would like to do is accomplish the same end result, but without adding nonpython dependencies.
Any idea where to start?