Re: [libreoffice-users] A word of warning about PDF text

2014-01-31 Thread Dominique Michel
On Fri, 31 Jan 2014 13:22:41 +1000,
Peter West li...@pbw.id.au wrote:

 A word of warning about text retrieved from PDF documents.
 
 Recovering text blocks from PDFs is inherently risky.  PDF is a page 
 definition format, and so it has no notion of the semantics of the
 text it contains. It places bits of text at certain positions on the
 page. You can create a whole page of text by taking the individual
 characters and their attributes and position on the page, shuffling
 them, and writing them to the file.  That will produce a readable
 file, but try extracting the text from that file. Unless you have a
 very, very smart text extractor that reverse-engineers the process of
 creating the page, then calculates the _visual_ order of the text
 elements, you will end up with gibberish.
 
 _Most_ pdf text, _most_ of the time, is laid on the page in visual 
 order, but in even the best-behaved files, you are likely to be
 surprised.
 
 If you don't _know_ that your PDF text extractor program is
 completely visually accurate by design, don't tell your boss that
 you can easily extract that PDF text, without allowing time for
 proof-reading every page. You will get burned.
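
A minimal sketch of the ordering problem described above, in Python.
The fragment coordinates and words are made-up illustration data, and
the "extractor" is deliberately naive; real extractors must also cope
with uneven baselines, columns and rotated text.

```python
import random

# Each fragment is (x, y, text); y grows upward, as in PDF user space.
# Coordinates and words are hypothetical illustration data.
fragments = [
    (72, 700, "PDF"), (100, 700, "places"), (150, 700, "text"),
    (72, 686, "at"), (90, 686, "fixed"), (130, 686, "positions."),
]

def stream_order(frags):
    """Naive extraction: join fragments in the order they are stored."""
    return " ".join(text for _, _, text in frags)

def visual_order(frags):
    """Sort top-to-bottom (descending y), then left-to-right (ascending x)."""
    ordered = sorted(frags, key=lambda f: (-f[1], f[0]))
    return " ".join(text for _, _, text in ordered)

# Simulate a producer that writes the fragments out of order.
shuffled = fragments[:]
random.Random(1).shuffle(shuffled)

print(stream_order(shuffled))   # word order depends on the shuffle
print(visual_order(shuffled))   # sentence restored from the positions
```

Both outputs come from a file that would render identically on the
page; only the second one uses the positions to rebuild the reading
order.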

This is why I open the PDF file in a separate program and use the
mouse to select the text, then copy and paste it (Ctrl-C/Ctrl-V). That
way, I have full control over how the text appears when I select it.

I also use other programs such as pdfimages, pdftoppm and convert to
extract the images directly from the PDF. The images can be rotated or
mirrored, which is why convert is useful too. When they are split into
small pieces, pdftoppm gives me an exact copy of each page of the PDF,
one ppm file per page, which is then converted to jpeg. In that case,
gimp is useful to extract only the images from these files and cut out
the text.
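
As a rough sketch of that workflow (not the pdf2jpg script itself), the
Python below only builds the command lines for pdfimages, pdftoppm and
ImageMagick's convert. The file names and the helper names are
hypothetical; the output names that pdfimages and pdftoppm generate
vary with the version, so they are not guessed here.

```python
import shlex

def extract_commands(pdf, prefix, dpi=150):
    """Build argv lists: raw embedded images, then full-page renders."""
    return [
        # -j saves embedded JPEG images as .jpg, everything else as ppm/pbm
        ["pdfimages", "-j", pdf, prefix],
        # render every page as a ppm file at the given resolution
        ["pdftoppm", "-r", str(dpi), pdf, prefix + "-page"],
    ]

def convert_command(src, dst, rotate_degrees=0, mirror=False):
    """Build a convert call that undoes a rotation and/or a mirror."""
    cmd = ["convert", src]
    if rotate_degrees:
        cmd += ["-rotate", str(rotate_degrees)]
    if mirror:
        cmd.append("-flop")   # -flop mirrors left-right, -flip top-bottom
    cmd.append(dst)           # the extension of dst selects the format
    return cmd

for cmd in extract_commands("manual.pdf", "manual"):
    print(shlex.join(cmd))
print(shlex.join(convert_command("in.ppm", "out.jpg", 90, mirror=True)))
```

Nothing is executed here; each list can be passed to
subprocess.run(cmd, check=True) once the tools are installed.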

The script I use for the images is attached. To use it, place it
somewhere in your path, check that it is executable, go into the
directory where your PDF file is, and run 'pdf2jpg'. Run this way, it
will only print a help message. Be aware that it will extract all the
PDF files in that directory in one go. Be also aware that, if the final
output is jpeg files, ppm files are automatically used as intermediates
when needed; the conversion will then be much slower and these files
can use a lot of disk space.

So, if you want to extract pictures from a 100MB PDF file, count on at
least 2GB of temporary disk usage to be safe in all cases. (This is an
estimate from memory, so run your own tests if you don't have a lot of
free disk space.)
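
To see why the ppm intermediates get large, here is a back-of-envelope
sketch in Python. It assumes 8-bit RGB ppm files (3 bytes per pixel,
header ignored) and the 150 dpi default of pdftoppm; the page size and
the page count are hypothetical.

```python
def ppm_bytes(width_in, height_in, dpi=150):
    """Approximate size of one 8-bit RGB ppm page: 3 bytes per pixel."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3

page = ppm_bytes(8.5, 11)        # one US Letter page at 150 dpi
pages = 300                      # hypothetical document length
print(f"{page / 1e6:.1f} MB per page")
print(f"{pages * page / 1e9:.2f} GB for {pages} pages")
```

The size depends only on page dimensions and resolution, not on how
well the original PDF was compressed, so doubling the resolution
multiplies the space needed by four.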

Also, with some distributions, you may have to adjust the names of the
pdfimages and pdftoppm commands in the script. They are part of poppler
on gentoo (poppler-utils or something like that on Debian); in the past,
they were part of xpdf.

Dominique

 
 I don't know how LO extracts PDF text; perhaps it is very
 sophisticated. I have my doubts.
 

-- 
To unsubscribe e-mail to: users+unsubscr...@global.libreoffice.org
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/users/
All messages sent to this list will be publicly archived and cannot be deleted



Re: [libreoffice-users] A word of warning about PDF text

2014-01-31 Thread Dominique Michel
On Sat, 1 Feb 2014 01:18:22 +0100,
Dominique Michel dominique.mic...@vtxnet.ch wrote:

 The script I use for the images is attached. To use it, place it
 somewhere in your path, check that it is executable, go into the
 directory where your PDF file is, and run 'pdf2jpg'.

The script didn't make it. Here it is:
http://fvwm-crystal.sourceforge.net/other/pdf2jpg

Dominique

 
  
  
 




Re: [libreoffice-users] Writer - Merging PDFs

2014-01-29 Thread Dominique Michel
On Tue, 28 Jan 2014 20:41:42 -0500,
charles meyer reachmepl...@gmail.com wrote:

 I've got 3 or so separate PDF files.
 
 I'd like to merge them all into one PDF file in Writer in Office
 3.6.2.2
 
 Each page of each PDF file has a lot of empty space around the
 graphic image.
 
 Ex. 2 inches above and below the graphic on each page and a good 3
 inches on each side of each graphic is white, empty space.
 
 Is there a way to eliminate all the empty space around each graphic
 in each page in each PDF or once all the PDFs are merged into one
 larger PDF?

On Linux, you can open a PDF file with a PDF viewer, for example
evince, select what you want to copy, right-click and select copy.
Then in writer you can paste it. Only the text will be copied, and you
will lose most of the page layout and things like links. Also, it will
not work if the PDF is copy protected.

For the pictures, the viewer comes with a set of console tools like
pdfimages, which can extract the raw pictures, or pdftoppm, which can
extract the whole PDF as a set of pictures. Be aware that images in a
PDF can be a real mess, because they can be rotated, mirrored, or
split into small pieces. In the first two cases, convert can be used to
rotate and mirror the pictures back; in the last case, only pdftoppm
lets you extract the pictures.

Dominique
 
 Thanks so much,
 
 Charles.
 




Re: [libreoffice-users] Writer - Merging PDFs

2014-01-29 Thread Dominique Michel
On Wed, 29 Jan 2014 09:52:32 -0500,
Anthony Baldwin baldwinling...@gmx.com wrote:

 On 1/29/2014 10:07 AM, Dominique Michel wrote:
  On Tue, 28 Jan 2014 20:41:42 -0500,
  charles meyer reachmepl...@gmail.com wrote:
 
  I've got 3 or so separate PDF files.
 
  I'd like to merge them all into one PDF file in Writer in Office
  3.6.2.2
 
  Each page of each PDF file has a lot of empty space around the
  graphic image.
 
  Ex. 2 inches above and below the graphic on each page and a good 3
  inches on each side of each graphic is white, empty space.
 
  Is there a way to eliminate all the empty space around each
  graphic in each page in each PDF or once all the PDFs are
  merged into one larger PDF?
 
  On Linux, you can open a PDF file with a PDF viewer, for example
  evince, select what you want to copy, right-click and select
  copy. Then in writer you can paste it. Only the text will be copied,
  and you will lose most of the page layout and things like links.
  Also, it will not work if the PDF is copy protected.
 
 If the pdf is selectable/copyable text, you can also export it
 directly to a plain text file with pdftotext (on linux).
 
 
 
  For the pictures, the viewer comes with a set of console tools like
  pdfimages, which can extract the raw pictures, or pdftoppm, which can
  extract the whole PDF as a set of pictures. Be aware that images
  in a PDF can be a real mess, because they can be rotated,
  mirrored, or split into small pieces. In the first two cases, convert
  can be used to rotate and mirror the pictures back; in the last
  case, only pdftoppm lets you extract the pictures.
 
 When clients send me pdf files full of images to be translated,
 I often just snap a screenshot of them and manage the images in GIMP.
 In such cases, I am reconstructing their document in LO (in English,
 whereas the originals come to me in any of French, Portuguese or 
 Spanish), so then I just insert the images into the document in LO.
 
 Of course, I could probably just import the whole PDF into GIMP
 (pages as images, not layers), and then grab the images I need by
 cropping the pages for them.

I made a bash script that uses pdfimages, pdftoppm and convert. In most
cases (when the images are not split into small pieces), it is faster
than working with gimp.

Dominique

 
 
 Tony




Re: [libreoffice-users] Re: Limited Unicode Support in LibreOffice 4.1.4.2? Character insertion, non-zero (SMP, SIP) planes, and multi language documents.

2014-01-29 Thread Dominique Michel
On Wed, 29 Jan 2014 19:18:32 -0800 (PST),
♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ gisrup2...@126.com wrote:

 Tractor wrote
  in Ubuntu 13.10 with Unity, in Thunderbird e-mail the 4 characters
  are present. If I paste the same in LO Writer from the Document
  Foundation 4.1.4.2 they also show up as 4 characters.
 
 Your browser probably has full support for CJK characters in the basic
 multilingual plane. I'm curious what font LO uses to display
 these 4 chars, as I am gradually shifting to Linux platform as a
 beginner.
 
http://www.latouche.info/admin/user_guides/chinese_support_gentoo.html
This is for gentoo, so the names of the packages may be a little bit
different on other distributions.

Also, it is a little bit outdated. With recent Xorg versions, you no
longer need a font section in /etc/X11/xorg.conf at all. The most
important thing now is to install the fonts. After that, each
distribution has its own way to prioritize them. On gentoo, this is
done via eselect. Every distribution should have documentation about
this.

For example, this one for Arch is more up to date:
https://wiki.archlinux.org/index.php/Fonts
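
Regarding the planes mentioned in the subject line, a small Python
sketch can show which Unicode plane a character belongs to. The
function name is made up for the example; the sample string is the
four characters from the signature quoted below, and U+2A6A5 is only
an arbitrary comparison character from a supplementary plane.

```python
def plane(ch):
    """Unicode plane number: 0 = BMP, 1 = SMP, 2 = SIP."""
    return ord(ch) >> 16

sample = "雷聲靐䨻"            # the four characters from the signature
for ch in sample:
    print(f"U+{ord(ch):04X} plane {plane(ch)}")

# A character from the Supplementary Ideographic Plane, for comparison.
print(f"U+2A6A5 plane {plane(chr(0x2A6A5))}")
```

Characters in plane 0 can be displayed by any font that covers them,
while the higher planes need both a font with those glyphs and an
application that handles code points above U+FFFF.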

Dominique
 
 
 -
 雷聲 靐䨻: if you can see four Chinese characters instead of blobs
 or question marks then your browser/OS has good support for Unicode
 and CJK characters! (P.S. the four characters put together mean loud
 thunder)
 



Re: [libreoffice-users] how many languages can LO support at the same time. . .

2014-01-28 Thread Dominique Michel
On Mon, 27 Jan 2014 23:57:49 -0500,
Kracked_P_P---webmaster webmas...@krackedpress.com wrote:

 
 The last time I tested the language dictionary issue, there seemed to
 be a limit of 3 or 4.  I do not have the test document anymore that had
 several non-English languages so retesting it would be a problem for
 me right now.
 
 Why talk about StarOffice at this point?  There have been so many
 modifications to the original code to make it OOo and then LO, you
 cannot rely on things working the same as they did back with
 StarOffice. I have had font listing issues that came into being when
 I went from one version of LO to another one, then they were fixed in
 a later version.

The first move of Sun when they acquired StarOffice and began with
OpenOffice was to restrict both the multi-language capability of
StarOffice to 3 languages and its html capability. They called it
OpenOffice.
This was a little bit tricky to install, but it was possible to make it
work with more than 3 languages even at that time.

They brought back the multi-language capabilities one year later or so
in OpenOffice, and since then it has just worked fine with however
many languages are available and installed.

The html capability is another story. It is only recently, after the
move to LibreOffice, that it has finally been updated, for which I am
very grateful.

Dominique

 
 Your launch coding is something that I have not seen before.  So it 
 might help, as long as you know how to set up all of these users
 files.
 
 The macro idea is fine, if you can write one, which I cannot and the 
 potential users should not be asked to do.
 
 Inserting special characters for Español [Spanish] and Français 
 [French] is easy if you have a Unicode font or a good inclusive one
 that has the needed characters when viewed with the Insert Special
 Character option.  I have done that myself for a few things.
 
 
 On 01/27/2014 02:38 PM, Regina Henschel wrote:
  Hi,
 
  Kracked_P_P---webmaster wrote:
 
  The problem I have with selling LO to computer centers, both
  regular and ones that teach English as a second language, is how
  many languages can LO support at the same time.
 
  I am talking about two ways.  1 - usable dictionaries in the
  list.  2 - different languages you can change your menus to.
 
  The first one is the key for me.
 
  You have English, French, Spanish [3 regional versions], Italian,
  and 4 or 5 other different language dictionaries, installed and
  enabled, in the Extension Manager.  How many of those languages
  are usable to the user writing documents in English and one or two
  other languages at a time, then someone else sits down and tried
  to use his or hers language[s] with English.  So how many
  installed and enabled languages can be used at the same time?   I
  was told there was a very small limited number.
 
  I know of no restriction. StarOffice was shipped with about fourteen 
  languages and had over twenty Autotext variants. Why not test it?
 
 
  Then the second is not much of a problem for me.  Yet, since you
  can switch between language packs and their help packs, how many
  can you install and be able to switch back and forth between
  English and the other non-English languages?
 
  Same as above, I know of no restriction.
 
 
  Of course, if a center worker needs to switch the menus back to
  English, or to their default settings, how easy is it if the
  person/worker does not read/speak the language the menu is
  currently in?
 
  If the language is bound to the person, you can use different user 
  settings and provide each user a prepared link to his special user 
  settings.
  For example on Windows
  C:\Program Files\LibreOffice 4\program\soffice.exe 
  -env:UserInstallation=file:///f:/SoftwareLO/user_DE
 
  will launch LibreOffice with the user settings in folder 
  f:/SoftwareLO/user_DE. These different user settings can use 
  different languages in the UI. You can even run LibreOffice once
  and have several calls to it with different user settings in parallel.
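
A small Python sketch of how such per-user launch commands could be
generated. The soffice.exe path and the profile folder are the ones
from the example above; the function name and the extra user_FR folder
are hypothetical.

```python
from pathlib import PureWindowsPath

# Install path taken from the example above; adjust to the real one.
SOFFICE = r"C:\Program Files\LibreOffice 4\program\soffice.exe"

def launch_command(profile_dir):
    """Build the argv that starts LibreOffice with its own user profile."""
    uri = PureWindowsPath(profile_dir).as_uri()   # file:///f:/...
    return [SOFFICE, "-env:UserInstallation=" + uri]

# One prepared command per language profile (user_FR is made up).
for folder in ("f:/SoftwareLO/user_DE", "f:/SoftwareLO/user_FR"):
    print(launch_command(folder))
```

Each list can be stored in a shortcut or passed to subprocess.Popen,
so every user gets a link that opens their own settings.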
 
  Is it possible to make a script to reset the preferences back to a
  default instead of some manual copy/paste-over of some file?
 
  Changing the UI from one language to another for the same user 
  requires restarting LibreOffice.
 
 
  I do not use any language, other than American English, but others
  do need to deal with more than one language.  I met a lady a
  number of years ago.  She was from my area of the USA, but she
  worked in Israel as a travel advisor.  She had to use several
  different packages of MS Office, since she needed to write in
  English, French, Hebrew, and one other language that I cannot
  remember the name of.  I told her that with the language support
  of LO she could use it to write in which ever language she needed
  to do.  I do not know if LO can write both English and Hebrew,
  since one is left to right and the other is right to left [if I
  remember correctly], but if the document was only 

Re: [libreoffice-users] how many languages can LO support at the same time. . .

2014-01-28 Thread Dominique Michel
On Tue, 28 Jan 2014 12:56:25 +0100,
Dominique Michel dominique.mic...@vtxnet.ch wrote:

 They brought back the multi-language capabilities one year later or so
 in OpenOffice, and since then it has just worked fine with however
 many languages are available and installed.
At least on Linux.

 

[libreoffice-users] [CALC] Automatic sheet generation

2014-01-26 Thread Dominique Michel
Hi,

This is my first message to this list, and I am a newbie with calc.

What I want to do is to derive a new sheet from an existing sheet, and
later do some calculations on the new sheet. The existing sheet is a
simple x/y table with x in A1, some values in A2 to An, and, in the
other rows, the value of x in the first cell and, in the following
cells, the values of y corresponding to A2 to An. This gives a chart
with one x/y curve for each cell between A2 and An. This works fine.
The sheet looks like:

x   -20   -10   0     10    20
0   0     0     0     0     0
1   0.1
2   0.2
3   0.23
4   0.34  0.15
5   0.4   0.3
...

The new sheet will represent the values of x at A2 to An, for
different values of y. In other words, the first sheet is y=f(x) for
different constant n, and the new sheet will be x=f(n) for different
constant y.

n 0.1 0.2 0.3 0.4 0.5
20  1
10  2
0   5
...

First I need a way to search for the 2 nearest values of y in the
table, and from them calculate the value of x for y. It must also know
the value of n.

There are 2 cases: if the value of y equals the y of the new sheet, no
calculation is needed; if it does not, I must make a simple
calculation to determine the value of x from the 2 nearest values of y.
(I have enough points, so I can assume it is a straight line between
the 2 values).

What I cannot figure out is how to go through the existing sheet with a
formula that will find the 2 nearest values of y and apply a function
to these values, or directly output the value if an exact match is
found, and from that build a new sheet.
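
The calculation described above can be sketched in Python as follows.
This is not a Calc formula, only the logic: it assumes the y values of
one curve do not decrease along x, treats empty cells as None, and uses
made-up sample numbers. The function name is hypothetical.

```python
def x_for_y(xs, ys, target):
    """Return the x at which the curve (xs, ys) reaches target.

    An exact match is returned directly; otherwise the value is
    interpolated on the straight line between the two nearest points.
    Returns None when target is outside the tabulated range.
    """
    points = [(x, y) for x, y in zip(xs, ys) if y is not None]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 == target:
            return x0                      # exact match, no calculation
        if y0 < target < y1:
            # straight line between the two bracketing points
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    if points and points[-1][1] == target:
        return points[-1][0]
    return None

xs = [0, 1, 2, 3, 4, 5]                    # illustrative x column
ys = [0.0, 0.1, 0.2, None, 0.34, 0.4]      # one curve, with an empty cell
print(x_for_y(xs, ys, 0.2))                # exact match
print(x_for_y(xs, ys, 0.3))                # interpolated between x=2 and x=4
print(x_for_y(xs, ys, 0.9))                # out of range
```

Running this once per curve (per value of n) and per target y would
fill the new table cell by cell. Because it works on whatever rows and
columns are passed in, it does not depend on a fixed table size.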

Another issue is that the original data comes from a csv file that is
synchronized to the original sheet. The csv file is generated by
engauge, and between 2 runs the number of rows and columns in that
file will change. That implies the new sheet must also stay
synchronized with the old one, even if the number of rows and columns
changes. Is it possible to do that?


Can you explain to me how to do this, or give me a link to some
documentation describing a similar issue?

Cheers,
Dominique
