Re: AW: Indexation of Excel files newer than 2007

2019-05-08 Thread ad...@extremeshok.com
Piler should drop all those outdated libraries and use 
https://tika.apache.org/

> On 08 May 2019, at 10:52, Katterl Christian  wrote:
> 
> In at least my case, this does not seem to work.
>  
> BR, Christian
>  
>  
>  
> From: Janos SUTO
> Sent: Monday, 6 May 2019 11:33
> To: Piler User
> Subject: Re: Indexation of Excel files newer than 2007
>  
> Newer Office files (e.g. xlsx) should be handled internally by the parser, 
> provided that the libzip package is installed along with its header files 
> (libzip-dev or similar).
> 
> Janos
> From: Katterl Christian 
> Sent: Mon May 06 10:19:07 GMT+02:00 2019
> To: Piler User 
> Subject: AW: Indexation of Excel files newer than 2007
> 
>  
> Hello again,
>  
> 
> for docx, there would be: https://github.com/ankushshah89/python-docx2txt
>  
> 
> Unfortunately, I am not a software developer, so I cannot make the adaptations myself.
>  
> 
> BR Christian
>  
> 
>  
> 
> From: Martin Nadvornik
> Sent: Monday, 6 May 2019 09:46
> To: Piler User
> Subject: Re: Indexation of Excel files newer than 2007
>  
> 
> Hello Christian,
> 
> catdoc cannot process the newer Office formats, and as far as I know there 
> is no intention to implement this in catdoc in the foreseeable future. The 
> same problem exists for xls2csv. You could theoretically try to call unoconv 
> (https://github.com/unoconv/unoconv) before catdoc, but that will probably 
> have a big performance impact, since it launches LibreOffice / OpenOffice 
> for the conversion. If you try this, I would be interested in your results, 
> since being limited to indexing only the old Office formats is also something 
> we would like to overcome. Alternatively, if you can find open source 
> software that efficiently extracts plain text from current Office formats, 
> it should be easy to integrate into piler (basically a few lines in 
> extract.c, as far as I can tell). For Excel there are 
> https://github.com/xevo/xls2csv and https://github.com/nagirrab/xls2csv, 
> which claim to be capable of processing xlsx files, but I haven't looked 
> into them yet.
> 
> Kind Regards
> Martin
> 
> On 06.05.2019 at 06:45, Katterl Christian wrote:
> Hello,
>  
> Indexation of Excel files newer than Excel 2007 fails in my installation.
> I am using catdoc 0.95 and it reports:
>  
> This file looks like ZIP archive or Office 2007 or later file.
> Not supported by catdoc
>  
> The Excel file was created with Excel 2010.
>  
> BR, Christian
> 
> 
> Christian Katterl
> Teamleader Technical IT 
> 
> 
> 
> Asamer Baustoffe AG
> Unterthalham Straße 2
> 4694 Ohlsdorf
> Austria
> tel  +43 50 799 - 2511
> mobile +43 664 811 54 99
> email c.katt...@asamer.at
> www.abag.at
> 
> This message is confidential. It may not be disclosed to, or used by, anyone 
> other than the addressee. If you receive this message by mistake, please 
> advise the sender.
> Firmenbuch: Landesgericht Wels, FN: 407726y, ATU 68646334
> 
>  


AW: Indexation of Excel files newer than 2007

2019-05-08 Thread Katterl Christian
In at least my case, this does not seem to work.

BR, Christian








AW: Indexation of Excel files newer than 2007

2019-05-06 Thread Katterl Christian
Hi,

I will try that.
I had libzip installed, but not libzip-dev, as I could not find a hint in the 
installation manual that it is needed (or did I overlook it?).

I will try now with libzip-dev installed.

BR, Christian
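
A quick way to see whether both pieces of libzip mentioned here are visible on a machine is sketched below. It is only a rough check under assumptions: the header name zip.h is what libzip installs, but the include directories listed are common defaults and may differ per distribution, and the function names are hypothetical. Since header files only matter at build time, piler would presumably need to be rebuilt after installing them; that is an inference, not something stated in this thread.

```python
import ctypes.util
import os

# Assumed default header locations; distributions may use other paths.
DEFAULT_INCLUDE_DIRS = ("/usr/include", "/usr/local/include")


def find_header(name, include_dirs=DEFAULT_INCLUDE_DIRS):
    """Return the first path where the header `name` exists, or None."""
    for directory in include_dirs:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None


def libzip_status(include_dirs=DEFAULT_INCLUDE_DIRS):
    """Report the libzip runtime library and header, each None if not found."""
    return {
        # Shared library, as shipped by the runtime package (libzip).
        "runtime": ctypes.util.find_library("zip"),
        # Header file, as shipped by the development package (libzip-dev).
        "header": find_header("zip.h", include_dirs),
    }


if __name__ == "__main__":
    print(libzip_status())
```

If "runtime" is set but "header" is None, the situation matches the one described above: libzip is installed, but the development package is missing.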






AW: Indexation of Excel files newer than 2007

2019-05-06 Thread Katterl Christian
Hello again,

for docx, there would be: https://github.com/ankushshah89/python-docx2txt

Unfortunately, I am not a software developer, so I cannot make the adaptations myself.

BR Christian
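
For illustration, the core of docx text extraction fits in a few lines, because a .docx file is a ZIP container whose main text lives in word/document.xml. The sketch below uses only the Python standard library; it is not how python-docx2txt or piler does it, the function name is hypothetical, and it ignores headers, footers, footnotes and comments.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per non-empty paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Usage would be along the lines of `docx_to_text(open("report.docx", "rb").read())`, where report.docx is a placeholder file name.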






AW: Indexation of Excel files newer than 2007

2019-05-06 Thread Katterl Christian
Hello,

I found out that xlsx2csv (https://pypi.org/project/xlsx2csv/#files) is able 
to read the contents of xlsx files. It is Python-based.
I am probably the wrong person to integrate it into piler, but I think that 
should be possible?

BR, Christian
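
To give an idea of what such an extractor has to do, here is a minimal sketch using only the Python standard library. It is not xlsx2csv's code and not piler's internal parser; the function name is hypothetical, and it ignores formulas, number formats and dates. An .xlsx file is a ZIP container (which is why catdoc reports "looks like ZIP archive"): cell text is usually stored once in xl/sharedStrings.xml and referenced by index from the worksheets under xl/worksheets/.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace used by the shared-string table and the worksheets.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def xlsx_to_text(data: bytes) -> str:
    """Return the cell contents of an .xlsx file as space-separated text."""
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        raise ValueError("not a ZIP container, so not an .xlsx file")
    words = []
    with zipfile.ZipFile(buf) as zf:
        names = zf.namelist()
        # Shared strings: each <si> entry holds one or more <t> text runs.
        shared = []
        if "xl/sharedStrings.xml" in names:
            root = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            for si in root.iter(NS + "si"):
                shared.append("".join(t.text or "" for t in si.iter(NS + "t")))
        for name in sorted(names):
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            sheet = ET.fromstring(zf.read(name))
            for cell in sheet.iter(NS + "c"):
                if cell.get("t") == "inlineStr":
                    # Text stored directly in the cell.
                    text = "".join(t.text or "" for t in cell.iter(NS + "t"))
                else:
                    value = cell.find(NS + "v")
                    if value is None or value.text is None:
                        continue
                    if cell.get("t") == "s":
                        # The value is an index into the shared-string table.
                        text = shared[int(value.text)]
                    else:
                        text = value.text
                words.append(text)
    return " ".join(words)
```

For full-text indexing, a flat stream of words like this is usually enough; a tool such as xlsx2csv additionally preserves rows and columns.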

