Computer Hope
Software => Computer programming => Topic started by: TheWaffle on March 23, 2017, 05:58:28 PM
-
I am working on writing a python(3) script to take data out of an xls file, and place it in an sqlite database. The xls file is downloaded from the Internet, and encoded in a way I haven't been able to get to work with the xlrd library.
Python 3.5.2+ (default, Sep 22 2016, 12:18:14)
[GCC 6.2.0 20160927] on linux
Type "copyright", "credits" or "license()" for more information.
>>> import xlrd
>>> file = xlrd.open_workbook("test.xls")
WARNING *** file size (262435) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
File "<pyshell#1>", line 1, in <module>
file = xlrd.open_workbook("test.xls")
File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows,
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls
bk.parse_globals()
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals
self.handle_writeaccess(data)
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess
strg = unpack_unicode(data, 0, lenlen=2)
File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode
strg = unicode(rawstrg, 'utf_16_le')
File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda>
unicode = lambda b, enc: b.decode(enc)
File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
I was reading through the docs, and it didn't give a clear way to test how the file is encoded.
Link Related: http://xlrd.readthedocs.io/en/latest/unicode.html
I did try other codes I have heard of, but I've yet to get one to work. Any thoughts?
-
Sounds like the file is missing the CODEPAGE information. You'll have to override it as documented here (http://www.lexicon.net/sjmachin/xlrd.html).
file = xlrd.open_workbook('test.xls', encoding_override="utf_16_le")
You'll have to experiment with different encodings depending on what kind it actually is.
-
I have been try different codecs, but have yet to get one to work. I may write a script that will just try them all...
An interesting thing I have found is that no matter what codec I have given as a argument, I always get the same error:
>>> xlrd.open_workbook('test.xls', encoding_override="iso8859_15")
WARNING *** file size (262435) not 512 + multiple of sector size (512)
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/xlrd/__init__.py", line 441, in open_workbook
ragged_rows=ragged_rows,
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 116, in open_workbook_xls
bk.parse_globals()
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1180, in parse_globals
self.handle_writeaccess(data)
File "/usr/local/lib/python3.5/dist-packages/xlrd/book.py", line 1145, in handle_writeaccess
strg = unpack_unicode(data, 0, lenlen=2)
File "/usr/local/lib/python3.5/dist-packages/xlrd/biffh.py", line 303, in unpack_unicode
strg = unicode(rawstrg, 'utf_16_le')
File "/usr/local/lib/python3.5/dist-packages/xlrd/timemachine.py", line 30, in <lambda>
unicode = lambda b, enc: b.decode(enc)
File "/usr/lib/python3.5/encodings/utf_16_le.py", line 16, in decode
return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
-
OK, then the file probably does have a CODEPAGE but the file itself is corrupted. Can you open the file in Excel/LibreOffice?
-
Yes, I can open it in LibreOffice, and even if I change the file extension libreoffice doesn't complain. It is not an xlsx for sure because it isn't a compressed file. (I can see the text of the spread sheet with a hex editor)
-
I have been investigating the file a bit with a hex editor...
I found the data of the first cell of the downloaded file here:
4F 00 72 00 64 00 65 00 72 00 20 00 47 00 75 00 69 00 64 00 65 00
Libre Office interprets it as "Order Guide"
Now I copied this cell to a blank workbook and saved it.
I found "Order Guide" in the document as:
4F 72 64 65 72 20 47 75 69 64 65
Does anyone with more experience with encodings have any idea what the downloaded file could be encoded with? If not does anyone know where else to look in the file?
-
First example is UTF-16 Little Endian (LE).
-
I've tried most common codecs with no success.
So maybe the file is utf_16_le.
Assuming the file is intentionally corrupted (100% reproducible), with python I would like to attempt to reconstruct it.
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x20 in position 108: truncated data
Position 108 I am assuming means the 108th byte in the file.
0x20 from what I understand is usually a space, I will add 00 before it.
brb
-
The error message was not helpful, no byte 0x20 was near byte position 108, nor near positions of 108*2^n.
>:(
So I decided to loop through the file with:
file = open("prices.xls", encoding="utf-16-le")
byte = file.read(1)
while byte != "":
byte = file.read(1)
file.close()
and I got this error:
Traceback (most recent call last):
File "./test.py", line 5, in <module>
byte = file.read(1)
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 1018-1019: illegal encoding
Bytes in position 1018-1019 are: 00 00
???
On a side note, if I open the file in libreoffice and save as an xls, I can open it with xlrd, and the file shrinks significantly. 262.5kb to 217.6kb.
Any ideas on what to do next?
I do not want my script to be dependent on libreoffice being installed.
-
Assuming the file is intentionally corrupted (100% reproducible)
Where did it come from?
-
The working par t of the script, fills out a couple webforms, and downloads the xls file using beautiful soup.
This is the part that does it:
request = browser.session.get(myURL, stream=True)
with open(path + os.sep +"test.xls", "wb") as output:
output.write(request.content)
output.close()
Assuming the file is intentionally corrupted (100% reproducible)
Sorry for not being clear to what is happening.
So the file that my code downloads, matches the file that I can download using Firefox. I know the xls file is generated by some webservice. That xls file I assume is corrupted is some form.
To clean up the xls file my current scripted solution is to run this:
from subprocess import call
call(["unoconv", "-d", "spreadsheet", "--format=ods", path + os.sep +"test.xls"])
call(["unoconv", "-d", "spreadsheet", "--format=xls", path + os.sep +"test.ods"])
While this solution works, I am not happy with it because it adds dependencies of libreoffice and unoconv.
What I would like to do is accomplish the same end result, but without adding nonpython dependencies.
Any idea where to start?