Welcome guest. Before posting on our computer help forum, you must register. Click here it's easy and free.

Author Topic: I need to determine an unkown encoding of a .XLS file.  (Read 15378 times)

0 Members and 1 Guest are viewing this topic.

TheWaffle

    Topic Starter


    Hopeful
  • Thanked: 4
    • Yes
  • Computer: Specs
  • Experience: Beginner
  • OS: Linux variant
I need to determine an unkown encoding of a .XLS file.
« on: March 23, 2017, 05:58:28 PM »
I am working on writing a python(3) script to take data out of an xls file, and place it in an sqlite database.  The xls file is downloaded from the Internet, and encoded in a way I haven't been able to get to work with the xlrd library.

Code: [Select]
Python 3.5.2+ (default, Sep 22 2016, 12:18:14)
[GCC 6.2.0 20160927] on linux
Type "copyright", "credits" or "license()" for more information.
>>> import xlrd
>>> file = xlrd.open_workbook("test.xls")
WARNING *** file size (262435) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    file = xlrd.open_workbook("test.xls")
  File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook
    ragged_rows=ragged_rows,
  File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals
    self.handle_writeaccess(data)
  File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess
    strg = unpack_unicode(data, 0, lenlen=2)
  File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode
    strg = unicode(rawstrg, 'utf_16_le')
  File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda>
    unicode = lambda b, enc: b.decode(enc)
  File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data

I was reading through the docs, and it didn't give a clear way to test how the file is encoded. 
Link Related: http://xlrd.readthedocs.io/en/latest/unicode.html

I did try other codes I have heard of, but I've yet to get one to work. Any thoughts?

BC_Programmer


    Mastermind
  • Typing is no substitute for thinking.
  • Thanked: 1140
    • Yes
    • Yes
    • BC-Programming.com
  • Certifications: List
  • Computer: Specs
  • Experience: Beginner
  • OS: Windows 11
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #1 on: March 23, 2017, 07:29:06 PM »
Sounds like the file is missing the CODEPAGE information. You'll have to override it as documented here.

Code: [Select]
file = xlrd.open_workbook('test.xls', encoding_override="utf_16_le")

You'll have to experiment with different encodings depending on what kind it actually is.
I was trying to dereference Null Pointers before it was cool.

TheWaffle

    Topic Starter


    Hopeful
  • Thanked: 4
    • Yes
  • Computer: Specs
  • Experience: Beginner
  • OS: Linux variant
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #2 on: March 24, 2017, 04:20:52 PM »
I have been try different codecs, but have yet to get one to work. I may write a script that will just try them all...

An interesting thing I have found is that no matter what codec I have given as a argument, I always get the same error:
Code: [Select]
>>> xlrd.open_workbook('test.xls', encoding_override="iso8859_15")
WARNING *** file size (262435) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook
    ragged_rows=ragged_rows,
  File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls
    bk.parse_globals()
  File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals
    self.handle_writeaccess(data)
  File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess
    strg = unpack_unicode(data, 0, lenlen=2)
  File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode
    strg = unicode(rawstrg, 'utf_16_le')
  File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda>
    unicode = lambda b, enc: b.decode(enc)
  File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data

BC_Programmer


    Mastermind
  • Typing is no substitute for thinking.
  • Thanked: 1140
    • Yes
    • Yes
    • BC-Programming.com
  • Certifications: List
  • Computer: Specs
  • Experience: Beginner
  • OS: Windows 11
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #3 on: March 24, 2017, 04:40:23 PM »
OK, then the file probably does have a CODEPAGE but the file itself is corrupted. Can you open the file in Excel/LibreOffice?
I was trying to dereference Null Pointers before it was cool.

TheWaffle

    Topic Starter


    Hopeful
  • Thanked: 4
    • Yes
  • Computer: Specs
  • Experience: Beginner
  • OS: Linux variant
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #4 on: March 28, 2017, 05:53:34 PM »
Yes, I can open it in LibreOffice, and even if I change the file extension libreoffice doesn't complain.  It is not an xlsx for sure because it isn't a compressed file. (I can see the text of the spread sheet with a hex editor)

TheWaffle

    Topic Starter


    Hopeful
  • Thanked: 4
    • Yes
  • Computer: Specs
  • Experience: Beginner
  • OS: Linux variant
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #5 on: April 03, 2017, 04:30:35 PM »
I have been investigating the file a bit with a hex editor...
I found the data of the first cell of the downloaded file here:
Code: [Select]
4F 00 72 00 64 00 65 00 72 00 20 00 47 00 75 00 69 00 64 00 65 00
Libre Office interprets it as "Order Guide"

Now I copied this cell to a blank workbook and saved it.
I found  "Order Guide" in the document as:
Code: [Select]
4F 72 64 65 72 20 47 75 69 64 65

Does anyone with more experience with encodings have any idea what the downloaded file could be encoded with? If not does anyone know where else to look in the file?

BC_Programmer


    Mastermind
  • Typing is no substitute for thinking.
  • Thanked: 1140
    • Yes
    • Yes
    • BC-Programming.com
  • Certifications: List
  • Computer: Specs
  • Experience: Beginner
  • OS: Windows 11
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #6 on: April 03, 2017, 04:40:18 PM »
First example is UTF-16 Little Endian (LE).
I was trying to dereference Null Pointers before it was cool.

TheWaffle

    Topic Starter


    Hopeful
  • Thanked: 4
    • Yes
  • Computer: Specs
  • Experience: Beginner
  • OS: Linux variant
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #7 on: April 07, 2017, 06:26:07 PM »
I've tried most common codecs with no success.
So maybe the file is utf_16_le.

Assuming the file is intentionally corrupted (100% reproducible), with python I would like to attempt to reconstruct it.

Code: [Select]
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
Position 108 I am assuming means the 108th byte in the file.
0x20 from what I understand is usually a space, I will add 00 before it.
brb

TheWaffle

    Topic Starter


    Hopeful
  • Thanked: 4
    • Yes
  • Computer: Specs
  • Experience: Beginner
  • OS: Linux variant
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #8 on: April 07, 2017, 07:49:40 PM »
The error message was not helpful, no byte 0x20 was near byte position 108, nor near positions of 108*2^n.
 >:(
So I decided to loop through the file with:
Code: [Select]
file = open("prices.xls", encoding="utf-16-le")
byte = file.read(1)
while byte != "":
    byte = file.read(1)
file.close()
and I got this error:
Code: [Select]
Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    byte = file.read(1)
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1018-1019: illegal encoding
Bytes in position 1018-1019 are: 00 00
 ???

On a side note, if I open the file in libreoffice and save as an xls, I can open it with xlrd, and the file shrinks significantly. 262.5kb to 217.6kb. 

Any ideas on what to do next?
I do not want my script to be dependent on libreoffice being installed.

BC_Programmer


    Mastermind
  • Typing is no substitute for thinking.
  • Thanked: 1140
    • Yes
    • Yes
    • BC-Programming.com
  • Certifications: List
  • Computer: Specs
  • Experience: Beginner
  • OS: Windows 11
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #9 on: April 08, 2017, 12:58:22 AM »
Quote
Assuming the file is intentionally corrupted (100% reproducible)
Where did it come from?
I was trying to dereference Null Pointers before it was cool.

TheWaffle

    Topic Starter


    Hopeful
  • Thanked: 4
    • Yes
  • Computer: Specs
  • Experience: Beginner
  • OS: Linux variant
Re: I need to determine an unkown encoding of a .XLS file.
« Reply #10 on: April 08, 2017, 08:40:21 AM »
The working par t of the script, fills out a couple webforms, and downloads the xls file using beautiful soup.
This is the part that does it:
Code: [Select]
request = browser.session.get(myURL, stream=True)
with open(path + os.sep +"test.xls", "wb") as output:
    output.write(request.content)
    output.close()
Quote
Assuming the file is intentionally corrupted (100% reproducible)
Sorry for not being clear to what is happening.
So the file that my code downloads, matches the file that I can download using Firefox. I know the xls file is generated by some webservice.  That xls file I assume is corrupted is some form.
To clean up the xls file my current scripted solution is to run this:
Code: [Select]
from subprocess import call
call(["unoconv", "-d", "spreadsheet", "--format=ods", path + os.sep +"test.xls"])
call(["unoconv", "-d", "spreadsheet", "--format=xls", path + os.sep +"test.ods"])
While this solution works, I am not happy with it because it adds dependencies of libreoffice and unoconv.
What I would like to do is accomplish the same end result, but without adding nonpython dependencies.
Any idea where to start?