[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-06-01 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #18 from Pierre Fortin  ---
>Trying to start from scratch here because I am confused by the steps.

Probably my bad: I included the source data and the resulting issue in the same
file to reduce the data to a minimal set...

The real source files are HUGE. One column in these HUGE files contains text
like this:XX[nnn]  (2 letters and 4-7 digits)

When copying the column, there is no TAB in that data -- it's already been
absorbed by the import (more below).

After copying/splitting this column, I use the steps in comment #1... The
dialog shows only XX|1234 because the number of sampled rows are all simply 2
letters and 4 digits.  Apparently, there are enough 2+4 rows to convince the
dialog that all rows must be 2+4. There are always corner cases; but sampling
the last row in this case would result in one of: XX1234, XX12345, XX123456,
XX1234567... Not a good/reliable solution to sample the last row...

The issue is that the dialog should not assume the data can be split after 4
digits.  There was no explicit fixed split-point specified after 2+4, so none
should be assumed.  The flaw is that an unspecified split is assumed; only
explicit split points should be used.

The solution seems this simple:  split ONLY where split-points are specified
and leave the string remainder intact in the last resulting column so that the
resultant data becomes:
xx 1234
...
xx 12345
...
xx 123456

etc... and not:
xx 1234
...
xx 1234 5
...
xx 1234 56
...
etc...

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-06-01 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

Stéphane Guillou (stragu)  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||stephane.guillou@libreoffic
   ||e.org
 Status|UNCONFIRMED |NEEDINFO

--- Comment #17 from Stéphane Guillou (stragu) 
 ---
Trying to start from scratch here because I am confused by the steps.
Please tell me if the following steps are correct:

1. Open attachment 187388 with Calc
2. In text import dialog, import as tab-delimited

Result 1: import dialog preview shows 3 columns, but resulting column D
contains data starting from row 7895.
However: original file _does_ contain an extra tab at the end of the lines (and
before the last digit in the bottom rows, starting from lin 7895)!You can see
that with a text editor.

3. Copy column A, paste unformatted in column E (Edit > Paste > Paste
unformatted)
4. Using the "fixed width" setting, place a delimiter after the first two
letters "AH". No other delimiter.
5. Press OK

Result 2: works as expected for me, only two columns as a result, column G has
4 digits at the beginning, 5 digits at the end.

So in conclusion, according to my testing, there is no issue with how the data
is imported. However, the importa dialo preview _should_ show that a fourth
column will be created, because there are a total of three tab delimiters by
line. That last column should show even if it's empty for the however many
first rows.

Can you please follow the steps outlined above and confirm if you see the same
thing?

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-06-01 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #16 from Stéphane Guillou (stragu) 
 ---
The content of attachment 187386 has been deleted for the following reason:

looks like sensitive data

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-06-01 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #15 from Stéphane Guillou (stragu) 
 ---
The content of attachment 187376 has been deleted for the following reason:

looks like sensitive data

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-20 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #14 from Pierre Fortin  ---
(In reply to ady from comment #13)

> The problem is that the last column's width is limited in the dialog and it
> seems it cannot be marked in any way in order for it to be wider. (At least
> the data is not completely lost.)

Not lost; but certainly not useful... and destructive if I've only allotted 2
columns and the data is split into 3, thereby overwriting some of the data in
the right-side adjacent column...  at least there's Ctrl+z 

While "natural sort" can be used on this data, arithmetic operations cannot
unless the original column is split into prefix & gap'ed sequences.

> Having said that, care should be taken if solving this issue, I could
> imagine some case in which someone complains in the opposite direction (e.g.
> the last column is too wide and contains unwanted characters, not seen in
> the dialog).

Which "unwanted characters" would you foresee being added?  I've always held
the position: "I can't insert data that doesn't exist"...  The last column
should only contain what is present, without "padding".  Or did you mean
something else?

> It seems as if somehow some lookup is performed in order to hint about some
> columns' widths; when solving this problem, care should be taken not to
> delay the dialog too much (for instance if there are many columns and many
> rows).

This is a "fixed width" issue; not "separated by". In fact, I often separate
columns with "separated by [comma]"... Since the user has selected how to split
the first N characters, Calc does not need to concern itself with anything but
the last X characters to determine the max width.  
A Python example:
  maxXwidth = 0  # last expected column
  for row in rows:
 maxXwidth = max( maxXwidth, len( row[last_split_point:] ) )

As far as "not to delay", this is a column, not sheet issue; so doing an
in-memory scan should be barely noticeable if at all. To wit, anyone who uses
AutoFilter is subliminally aware of any likely delay dealing with a column.

> I am able to overcome the case by using the "Separated by" alternative when
> importing the first time the csv (no need to copy after the initial import),
> but, again, this is because the relevant column is not the _last_ one.

Sorry; but what I provided was a sample of the issue. The real "initial import"
contains 90+ varying width columns, delimited by tab, comma, or other. There's
no way to use fixed width on that initial data without providing a dialog which
allows "fixed width" splitting on certain columns within a "separated by"
scenario. e.g., initial data is imported with "separated by" along with certain
columns further split via a second level "fixed width" or "separated by" (see
below).

> So, clearly, there is some problem when importing, with the "last" column's
> width.

Thinking about the problem more logically...  aside from the incorrect final
split; this may be more of a feature request:  the need is to split a single
column. In this case "fixed width"; but I also need to split a single columns
with "separated by", such as an address column into street, city, zip...   Or a
phone number column into area_code, NNX, number (separated by '-').  This is
further complicated for me because I have to deal with international data for
which there is no standard when multiple countries' data is provided within a
single column.  ;p

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-19 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #13 from ady  ---
(In reply to Pierre Fortin from comment #12)
> (In reply to ady from comment #10)
> 
> So...  rather than fixing what I consider a rookie coding error, the user is
> expected to abandon work on a HUGE spreadsheet (often 14-15M rows)

Let me clarify my comment then. There is indeed a problem. The fixed-width
selection should not trim the second/last column. Additionally, the width of
the last column should not strictly depend on some kind of scanning or look-up
of the first few rows. The current result acts as if the second column's width
was somehow selected/delimited, but the user didn't do that. The problem is
that the last column's width is limited in the dialog and it seems it cannot be
marked in any way in order for it to be wider. (At least the data is not
completely lost.)

Having said that, care should be taken if solving this issue, I could imagine
some case in which someone complains in the opposite direction (e.g. the last
column is too wide and contains unwanted characters, not seen in the dialog).

It seems as if somehow some lookup is performed in order to hint about some
columns' widths; when solving this problem, care should be taken not to delay
the dialog too much (for instance if there are many columns and many rows).

I am able to overcome the case by using the "Separated by" alternative when
importing the first time the csv (no need to copy after the initial import),
but, again, this is because the relevant column is not the _last_ one.

So, clearly, there is some problem when importing, with the "last" column's
width.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-19 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #12 from Pierre Fortin  ---
(In reply to ady from comment #10)

More specifically to your points...

> While I can reproduce it when copying from the already-imported first column
> (by selecting the first column _only_), it can be easily avoided when
> initially importing the data (using either "Separated by" or "fixed width"),
> precisely because there is more data on the right-side (duplicated in this
> case).

So...  rather than fixing what I consider a rookie coding error, the user is
expected to abandon work on a HUGE spreadsheet (often 14-15M rows) that takes 4
minutes just to load on a very capable CPU:
  Processors: 20 × 12th Gen Intel® Core™ i7-12700K (4@5GHz, 12@4.9GHz,
4@3.8GHz)
  Memory: 125.5 GiB of RAM and 2TB + 4TB SSDs + 44TB on 6 platter drives
in order to save a few milliseconds finding the max width of the incoming data?

> OTOH, I think users shouldn't expect to look-up 7900 rows (or whatever) just
> in order to find the data format before suggesting the columns' width. That
> would probably be inefficient.

Compared to 4 minutes to load a large file; that's not even noticeable.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-19 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #11 from Pierre Fortin  ---
Created attachment 187409
  --> https://bugs.documentfoundation.org/attachment.cgi?id=187409=edit
Number of supported rows (OT)

I don't think I said, suggested or implied that many rows need to be inspected.

Questions: 
 - What is the logic behind adding another [hidden] split point based on the
first row(s)?
 - The request is to split once between 2nd and 3rd characters in this case;
why is an unspecified, unwanted additional split added?
 - Why is the fixed width UI limited to what is visible in the first few rows,
vs allowing the user to specify how to split wider cells by adding a split bar
farther the right, without having to scan the entire data?
 - Why even have an extra [hidden] split point?

>to find the data format before suggesting the columns' width.
What does the columns' width have to do with explicit split points?  
A suggested column's width is no reason for adding an unwanted/unexpected split
point.

Just accept the split(s) as specified, without adding another hidden split at
the end of what is presented as examples...

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-19 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #10 from ady  ---
While I can reproduce it when copying from the already-imported first column
(by selecting the first column _only_), it can be easily avoided when initially
importing the data (using either "Separated by" or "fixed width"), precisely
because there is more data on the right-side (duplicated in this case).

OTOH, I think users shouldn't expect to look-up 7900 rows (or whatever) just in
order to find the data format before suggesting the columns' width. That would
probably be inefficient.

[OT] BTW, 1.5M rows are not supported anyway.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-19 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #9 from Pierre Fortin  ---
1. my first attachment was the wrong file -- no idea how that happened since
that was not a file I had open.

2. sample file 2 is correct. 

3. I know how to use fixed width; the issue is that fixed width does not work
correctly if there are >= N (unknown to me) rows that *appear* to imply that
ALL data is like these initial N samples.

4. Try it yourself: 

  a) to see the problem as reported:
   - select cell D1
   - Ctrl+down-arrow
   - do you see digits in column D?  That is the issue!

  b) to reproduce the problem:
   - select column A
   - Copy it
   - right-click on E1 or any column to the right; NOT on column A-D as they
already contain data at the bottom.
   - Paste Special > Unformatted Text
 - Fixed width
   - set the divider line between characters 2 and 3
   - Note that the sample display is only 6 characters wide
   - OK
   - check the bottom rows...  is your pasted data in 2 or 3 columns?  Unless
this bug has been fixed, you should have "AH" in column E, 4 digits in column
F, and 1 digit in column G.  With a huge sample of ~1.5M rows, you would see
this in the bottom rows:  AH, 1234, 567 vs the expected AH, 1234567

HTH

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-19 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #8 from m.a.riosv  ---
Created attachment 187408
  --> https://bugs.documentfoundation.org/attachment.cgi?id=187408=edit
Screenshot how to do it.

Seems your screenshot, it's for the first one.

But what you copy to do the paste special.

Maybe the fixed with gets what you need to do.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-19 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #7 from m.a.riosv  ---
Created attachment 187407
  --> https://bugs.documentfoundation.org/attachment.cgi?id=187407=edit
Screenshot opening sample2

I think it is a different file from your screenshot.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-18 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

QA Administrators  changed:

   What|Removed |Added

 Ever confirmed|1   |0
 Status|NEEDINFO|UNCONFIRMED

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-18 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #6 from QA Administrators  ---
[Automated Action] NeedInfo-To-Unconfirmed

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-18 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #5 from Pierre Fortin  ---
Created attachment 187388
  --> https://bugs.documentfoundation.org/attachment.cgi?id=187388=edit
sample file 2

I have no idea how that file got selected...  Here's another try at the file I
thought was attached...

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-18 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #4 from m.a.riosv  ---
Created attachment 187386
  --> https://bugs.documentfoundation.org/attachment.cgi?id=187386=edit
Screenshot opening csv sample file

I think sample file, it's not the same you have reported.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-18 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #3 from Pierre Fortin  ---
Forgot to mention:  the original file was 14,361,164 rows with 100 unique
2-character prefixes; so this issue caused a bit of grief... :)

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-18 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

--- Comment #2 from Pierre Fortin  ---
Created attachment 187376
  --> https://bugs.documentfoundation.org/attachment.cgi?id=187376=edit
sample file

Column A is the data I copied, then pasted into B and C; yet 4 digits went into
column C and the rest into column D.   I can provide sample files with up to
1.5M rows if necessary.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Libreoffice-bugs] [Bug 155384] Fixed width paste does not sample enough input data

2023-05-18 Thread bugzilla-daemon
https://bugs.documentfoundation.org/show_bug.cgi?id=155384

m.a.riosv  changed:

   What|Removed |Added

 CC||miguelangelrv@libreoffice.o
   ||rg
 Status|UNCONFIRMED |NEEDINFO
 Ever confirmed|0   |1

--- Comment #1 from m.a.riosv  ---
Please attach a sample file, so someone can test it.

-- 
You are receiving this mail because:
You are the assignee for the bug.