Re: [libreoffice-users] A word of warning about PDF text
On Fri, 31 Jan 2014 13:22:41 +1000, Peter West li...@pbw.id.au wrote:

> A word of warning about text retrieved from PDF documents.
>
> Recovering text blocks from PDFs is inherently risky. PDF is a page definition format, and so it has no notion of the semantics of the text it contains. It places bits of text at certain positions on the page. You can create a whole page of text by taking the individual characters and their attributes and position on the page, shuffling them, and writing them to the file. That will produce a readable file, but try extracting the text from that file. Unless you have a very, very smart text extractor that reverse-engineers the process of creating the page, then calculates the _visual_ order of the text elements, you will end up with gibberish.
>
> _Most_ pdf text, _most_ of the time, is laid on the page in visual order, but in even the best-behaved files, you are likely to be surprised. If you don't _know_ that your PDF text extractor program is completely visually accurate by design, don't tell your boss that you can easily extract that PDF text, without allowing time for proof-reading every page. You will get burned.

That is why I open the PDF file in a separate program, select the text with the mouse, and copy/paste it with Ctrl-C/Ctrl-V. That way, I have full control over how the text appears when I select it.

I also use other programs, namely pdfimages, pdftoppm and convert, to extract the images directly from the PDF. The images can be rotated or mirrored, which is why convert is useful too. When they are split into small pieces, pdftoppm gives me an exact copy of each page of the PDF as a ppm file, which is then converted to jpeg. In that case, gimp is useful to extract only the images from these files and cut out the text.

The script I use for the images is attached. To use it, place it somewhere in your path, make sure it is executable, go into the directory where your PDF file is, and run 'pdf2jpg'. It will only print a help message. Be aware that it will extract all the PDF files in that directory on the fly.

Also be aware that, when the final output is jpeg files, ppm files are automatically used as intermediates when needed. In that case the conversion is much slower and the ppm files can use a lot of disk space. So, if you want to extract pictures from a 100MB PDF file, count on at least 2GB of temporary disk usage to be safe in all cases. (This is an estimate from memory, so run your own tests if you don't have a lot of free disk space.)

Also, with some distributions, you may have to adjust the names of the pdfimages and pdftoppm commands in the script. They are part of poppler on gentoo (poppler-utils or something like that on Debian); in the past, they were part of xpdf.

Dominique

> I don't know how LO extracts PDF text; perhaps it is very sophisticated. I have my doubts.

--
To unsubscribe e-mail to: users+unsubscr...@global.libreoffice.org
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/users/
All messages sent to this list will be publicly archived and cannot be deleted
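Peter's point about shuffled text elements can be illustrated with a small sketch. It assumes a hypothetical list of (x, y, text) fragments, with y growing upwards as in PDF page coordinates. The names `Fragment`, `raw_order` and `visual_order` are made up for this illustration and do not come from any PDF library. Real extractors also have to cope with columns, rotation and kerning, which this ignores.

```python
import random
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One piece of text placed at a position on the page (hypothetical model)."""
    x: float
    y: float  # PDF-style coordinates: y grows towards the top of the page
    text: str


def raw_order(fragments: List[Fragment]) -> str:
    """Join fragments in file order, as a naive extractor would."""
    return " ".join(f.text for f in fragments)


def visual_order(fragments: List[Fragment]) -> str:
    """Join fragments top-to-bottom, then left-to-right."""
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    return " ".join(f.text for f in ordered)


page = [
    Fragment(72, 700, "You"), Fragment(100, 700, "will"),
    Fragment(72, 686, "get"), Fragment(100, 686, "burned."),
]
shuffled = page[:]
random.Random(1).shuffle(shuffled)  # renders the same page, but file order differs

print(raw_order(shuffled))     # depends on the shuffle
print(visual_order(shuffled))  # You will get burned.
```

Sorting by position only works for simple single-column pages, which is exactly why proof-reading time is still needed.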
Re: [libreoffice-users] A word of warning about PDF text
On Sat, 1 Feb 2014 01:18:22 +0100, Dominique Michel dominique.mic...@vtxnet.ch wrote:

> The script I use for the images is attached.

The script didn't make it. Here it is: http://fvwm-crystal.sourceforge.net/other/pdf2jpg

Dominique
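The temporary disk usage mentioned earlier in this thread can be estimated roughly. A binary ppm file with 8-bit RGB stores 3 bytes per pixel plus a short header, so its size depends only on page size and resolution, not on how well the PDF is compressed. The sketch below assumes A4 pages at 150 dpi (believed to be the pdftoppm default, but check your own call). `ppm_bytes` and the page count are made up for illustration.

```python
def ppm_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Approximate size of one 8-bit RGB ppm page, ignoring the short header."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * 3  # 3 bytes per pixel


# Assumed values for illustration: A4 page (8.27 x 11.69 inches) at 150 dpi.
per_page = ppm_bytes(8.27, 11.69, 150)
pages = 300  # hypothetical page count
print(f"{per_page / 1e6:.1f} MB per page, "
      f"{per_page * pages / 1e9:.2f} GB for {pages} pages")
```

Under these assumptions a page takes about 6.5 MB, so a few hundred pages already reach the gigabyte range, which is in line with the estimate given from memory.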
Re: [libreoffice-users] Writer - Merging PDFs
On Tue, 28 Jan 2014 20:41:42 -0500, charles meyer reachmepl...@gmail.com wrote:

> I've got 3 or so separate PDF files. I'd like to merge them all into one PDF file in Writer in Office 3.6.2.2
>
> Each page of each PDF file has a lot of empty space around the graphic image. Ex. 2 inches above and below the graphic on each page and a good 3 inches on each side of each graphic is white, empty space.
>
> Is there a way to eliminate all the empty space around each graphic in each page in each PDF or once all the PDFs are merged into one larger PDF?

On Linux, you can open a PDF file with a PDF viewer, for example evince, select what you want to copy, right-click and select copy. Then you can paste it in Writer. Only the text will be copied, and you will lose most of the page layout and things like links. Also, it will not work if the PDF is copy protected.

For the pictures, the viewer comes with a set of console tools such as pdfimages, which can extract the raw pictures, and pdftoppm, which can render the whole PDF as a set of pictures. Be aware that images in a PDF can be a real mess, because they can be rotated, mirrored, or split into small pieces. In the first two cases, convert can be used to rotate and mirror the pictures back; in the last case, only pdftoppm will let you extract the pictures.

Dominique

> Thanks so much,
> Charles.
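The tools named in the reply can be chained from a script. The following is only a sketch and not Dominique's own script; it builds the command lines without running them. The function names, file names and output prefixes are made up, and the options used (`pdfimages -j`, `pdftoppm -r`, and the `-rotate` and `-flop` options of convert) should be checked against the versions installed on your system.

```python
from typing import List


def extract_images_cmd(pdf: str, prefix: str) -> List[str]:
    """pdfimages: pull the embedded images out of the PDF (-j keeps JPEGs as .jpg)."""
    return ["pdfimages", "-j", pdf, prefix]


def render_pages_cmd(pdf: str, prefix: str, dpi: int = 150) -> List[str]:
    """pdftoppm: render every page to a ppm file, for images split into pieces."""
    return ["pdftoppm", "-r", str(dpi), pdf, prefix]


def fix_orientation_cmd(src: str, dst: str, degrees: int = 0,
                        mirrored: bool = False) -> List[str]:
    """convert: rotate and/or mirror an image back; output format follows dst."""
    cmd = ["convert", src]
    if degrees:
        cmd += ["-rotate", str(degrees)]
    if mirrored:
        cmd.append("-flop")  # horizontal mirror
    cmd.append(dst)
    return cmd


for cmd in (
    extract_images_cmd("report.pdf", "img"),
    render_pages_cmd("report.pdf", "page"),
    fix_orientation_cmd("img-000.ppm", "img-000.jpg", degrees=90, mirrored=True),
):
    print(" ".join(cmd))  # pass the list to subprocess.run(cmd, check=True) to execute
```

Keeping the commands as lists avoids shell quoting problems with file names that contain spaces.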
Re: [libreoffice-users] Writer - Merging PDFs
On Wed, 29 Jan 2014 09:52:32 -0500, Anthony Baldwin baldwinling...@gmx.com wrote:

> On 1/29/2014 10:07 AM, Dominique Michel wrote:
>> Only the text will be copied, and you will lose most of the page layout and things like links. Also, it will not work if the PDF is copy protected.
>
> If the PDF has selectable/copyable text, you can also export it directly to a plain text file with pdftotext (on Linux).
>
>> Be aware that images in a PDF can be a real mess, because they can be rotated, mirrored, or split into small pieces. In the first two cases, convert can be used to rotate and mirror the pictures back; in the last case, only pdftoppm will let you extract the pictures.
>
> When clients send me PDF files full of images to be translated, I often just snap a screenshot of them and manage the images in GIMP. In such cases, I am reconstructing their document in LO (in English, whereas the originals come to me in any of French, Portuguese or Spanish), so then I just insert the images into the document in LO. Of course, I could probably just import the whole PDF into GIMP (pages as images, not layers), and then grab the images I need by cropping the pages for them.

I made a bash script that uses pdfimages, pdftoppm and convert. In most cases (when the images are not split into small pieces), it is faster than working with gimp.

Dominique

> Tony
Re: [libreoffice-users] Re: Limited Unicode Support in LibreOffice 4.1.4.2? Character insertion, non-zero (SMP, SIP) planes, and multi language documents.
On Wed, 29 Jan 2014 19:18:32 -0800 (PST), ♪͡♪♪͡♪Neil Ren♪͡♪♪͡♪ gisrup2...@126.com wrote:

> Tractor wrote
>> in Ubuntu 13.10 with Unity, in Thunderbird e-mail the 4 characters are present. If I paste the same in LO Write from the Document Foundation 4.1.4.2 they also show up as 4 characters.
>
> Your browser probably has full support for CJK characters in the basic multilingual plane.
>
> I'm curious what font LO uses to display these 4 chars, as I am gradually shifting to Linux platform as a beginner.

http://www.latouche.info/admin/user_guides/chinese_support_gentoo.html

This is for gentoo, so the names of the packages may be a little different. Also, it is a little outdated. With recent Xorg versions, you don't need a font section in /etc/X11/xorg.conf at all. The most important thing now is to install the fonts. After that, each distribution has its own way to prioritize them. On gentoo, this is done via eselect. Every distribution should have documentation about this. For example, this page for Arch is more up-to-date: https://wiki.archlinux.org/index.php/Fonts

Dominique

> - 雷聲 靐䨻: if you can see four Chinese characters instead of blobs or question marks then your browser/OS has good support for Unicode and CJK characters! (P.S. the four characters put together mean loud thunder)
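The distinction between the basic multilingual plane (BMP) and the higher planes named in the subject line (SMP, SIP) can be checked directly from the code point: the plane number is the code point divided by 0x10000. The sketch below uses only built-in Python functions; `plane` and `needs_non_bmp_support` are names made up for this illustration.

```python
def plane(ch: str) -> int:
    """Return the Unicode plane number of a single character (0 = BMP)."""
    return ord(ch) >> 16


def needs_non_bmp_support(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(plane(ch) > 0 for ch in text)


sample = "雷聲靐䨻"  # the four characters from the signature above
for ch in sample:
    print(f"U+{ord(ch):04X} plane {plane(ch)}")
print(needs_non_bmp_support(sample))
```

All four characters of the signature are in plane 0, so displaying them only shows that a font with those glyphs is installed; it says nothing about support for planes 1 and 2.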
Re: [libreoffice-users] how many languages can LO support at the same time. . .
On Mon, 27 Jan 2014 23:57:49 -0500, Kracked_P_P---webmaster webmas...@krackedpress.com wrote:

> The last time I tested the language dictionary issue, there seemed to be a limit of 3 or 4. I do not have the test document anymore that had several non-English languages so retesting it would be a problem for me right now.
>
> Why talk about StarOffice at this point? There have been so many modifications to the original code to make it OOo and then LO, you cannot rely on things working the same as they did back with StarOffice. I have had font listing issues that came into being when I went from one version of LO to another one, then it was fixed in a later version.

The first move of Sun, when they acquired StarOffice and started OpenOffice, was to restrict both the multi-language capability of StarOffice (to 3 languages) and its HTML capability. They called it OpenOffice. It was a little bit tricky to install, but it was possible to make it work with more than 3 languages even at that time. They brought back the multi-language capabilities about one year later in OpenOffice, and since then it has just worked fine with however many languages are available and installed.

The HTML capability is another story. It is only recently, after the move to LibreOffice, that it has finally been updated, for which I am very grateful.

Dominique

> Your launch coding is something that I have not seen before. So it might help, as long as you know how to set up all of these user files. The macro idea is fine, if you can write one, which I cannot and the potential users should not be asked to do.
>
> Inserting special characters for Español [Spanish] and Français [French] is easy if you have a Unicode font or a good inclusive one that has the needed characters when viewed with the Insert Special Character option. I have done that myself for a few things.
>
> On 01/27/2014 02:38 PM, Regina Henschel wrote:
>> Hi,
>>
>> Kracked_P_P---webmaster wrote:
>>> The problem I have with selling LO to computer centers, both regular and ones that teach English as a second language, is how many languages LO can support at the same time.
>>>
>>> I am talking about two ways.
>>> 1 - usable dictionaries in the list.
>>> 2 - different languages you can change your menus to.
>>>
>>> The first one is the key for me. You have English, French, Spanish [3 regional versions], Italian, and 4 or 5 other different language dictionaries, installed and enabled, in the Extension Manager. How many of those languages are usable to the user writing documents in English and one or two other languages at a time, then someone else sits down and tries to use his or her language[s] with English.
>>>
>>> So how many installed and enabled languages can be used at the same time? I was told there was a very small limited number.
>>
>> I know of no restriction. StarOffice was shipped with about fourteen languages and had over twenty Autotext variants. Why not test it?
>>
>>> Then the second is not much of a problem for me. Yet, since you can switch between language packs and their help packs, how many can you install and be able to switch back and forth between English and the other non-English languages?
>>
>> Same as above, I know of no restriction.
>>
>>> Of course, if a center worker needs to switch the menus back to English, or to their default settings, how easy is it if the person/worker does not read/speak the language the menu is currently in?
>>
>> If the language is bound to the person, you can use different user settings and provide each user a prepared link to his special user settings. For example on Windows
>> C:\Program Files\LibreOffice 4\program\soffice.exe -env:UserInstallation=file:///f:/SoftwareLO/user_DE
>> will launch LibreOffice with the user settings in folder f:/SoftwareLO/user_DE. And these different user settings can use different languages in the UI. You can even run LibreOffice once and have several calls to it with different user settings in parallel.
>>
>>> Is it possible to make a script to reset the preferences back to a default instead of some manual copy/paste-over of some file?
>>
>> Changing the UI from one language to another for the same user requires restarting LibreOffice.
>>
>>> I do not use any language other than American English, but others do need to deal with more than one language. I met a lady a number of years ago. She was from my area of the USA, but she worked in Israel as a travel advisor. She had to use several different packages of MS Office, since she needed to write in English, French, Hebrew, and one other language that I cannot remember the name of. I told her that with the language support of LO she could use it to write in whichever language she needed. I do not know if LO can write both English and Hebrew, since one is left to right and the other is right to left [if I remember correctly], but if the document was only
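Regina's per-user launch command can be generated for several languages with a few lines of script. This is a sketch under assumptions: only the soffice.exe path and the user_DE folder come from the thread, while the user_FR folder and the function names are hypothetical. The code only builds the argument lists and does not start LibreOffice.

```python
from typing import List


def profile_url(profile_dir: str) -> str:
    """Turn a Windows-style directory such as f:/SoftwareLO/user_DE into a file URL."""
    return "file:///" + profile_dir.replace("\\", "/")


def launch_command(soffice: str, profile_dir: str) -> List[str]:
    """Build the argument list that starts LibreOffice with its own user profile."""
    return [soffice, "-env:UserInstallation=" + profile_url(profile_dir)]


SOFFICE = r"C:\Program Files\LibreOffice 4\program\soffice.exe"
# Hypothetical per-language profile folders; only user_DE appears in the thread.
profiles = {"de": "f:/SoftwareLO/user_DE", "fr": "f:/SoftwareLO/user_FR"}

for lang, folder in profiles.items():
    print(lang, launch_command(SOFFICE, folder))
```

Each list could be passed to `subprocess.Popen` to start that profile. As for resetting preferences, one option that may work is to keep a clean copy of each profile folder and restore it before launch, but that should be tested on the version in use.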
Re: [libreoffice-users] how many languages can LO support at the same time. . .
On Tue, 28 Jan 2014 12:56:25 +0100, Dominique Michel dominique.mic...@vtxnet.ch wrote:

> They brought back the multi-language capabilities about one year later in OpenOffice, and since then it has just worked fine with however many languages are available and installed.

At least on Linux.
[libreoffice-users] [CALC] Automatic sheet generation
Hi,

This is my first message to this list, and I am a newbie with Calc.

What I want to do is to extrapolate a sheet from an existing sheet, and later do some calculations on the new sheet.

The existing sheet is a simple x/y graph with x in A1, some values in A2 to An, and, in the other rows, the value of x in the first cell and, in the following cells, the values of y corresponding to A2 to An. This gives a chart with one x/y curve for each cell from A2 to An. This works fine. The sheet looks like:

x -20 -10 0 10 20
0 0 0 0 0 0
1 0.1
2 0.2
3 0.23
4 0.34 0.15
5 0.4 0.3
...

The new sheet will give the values of x at A2 to An, for different values of y. In other words, the first sheet is y=f(x) for different, constant n; the new sheet will be x=f(n) for different, constant y.

n 0.1 0.2 0.3 0.4 0.5
20 1
10 2
0 5
...

First, I need a way to search for the 2 nearest values of y in the table, and from them do a calculation to extrapolate the value of x for y. It must also know the value of n. There are 2 cases: if the value of y equals the y of the new sheet, no calculation is needed; if it does not, I must make a simple calculation to determine the value of x from the 2 nearest values of y. (I have enough points, so I can assume it is a straight line between the 2 values.)

What I cannot figure out is how to go through the existing sheet with a formula that will find the 2 nearest values of y and apply a function to these values, or output the value directly if an exact match is found, and from that build a new sheet.

Another issue is that the original data comes from a csv file that is synchronized with the original sheet. The csv file is generated by engauge, and between 2 runs, the numbers of rows and columns in that file will change. That implies the new sheet must also be able to stay synchronized with the old one, even if the numbers of rows and columns change.

Is it possible to do that? Can you explain to me how to do this, or give me a link to some documentation describing a similar issue?

Cheers,
Dominique
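The calculation described in the question (use the exact value when there is a match, otherwise a straight line between the two nearest points) can be written down as a short sketch. This is plain Python rather than a Calc formula, and it is only meant to pin down the arithmetic. The function name and the sample numbers are made up, and empty cells are assumed to arrive as `None`. In Calc, the same lookup could probably be built from functions such as MATCH and INDEX, which is not shown here.

```python
from typing import Optional, Sequence


def x_for_y(xs: Sequence[float], ys: Sequence[Optional[float]],
            target: float) -> Optional[float]:
    """Return the x at which the curve (xs, ys) reaches target, or None.

    An exact match is returned as is; otherwise the value is interpolated
    on a straight line between the two neighbouring points around target.
    """
    points = [(x, y) for x, y in zip(xs, ys) if y is not None]  # skip empty cells
    for x, y in points:
        if y == target:
            return x
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if min(y0, y1) < target < max(y0, y1):
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    return None  # target lies outside the measured range


# Made-up curve for one value of n: one column of the first sheet.
xs = [0, 10, 20, 30]
ys = [0.0, 1.0, 2.0, None]

print(x_for_y(xs, ys, 1.5))  # 15.0, interpolated between x=10 and x=20
print(x_for_y(xs, ys, 2.0))  # 20, exact match
print(x_for_y(xs, ys, 5.0))  # None, out of range
```

Calling this once per curve and per target y fills the new sheet cell by cell, and because it works on whatever rows are present, it does not depend on a fixed number of rows or columns.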