Re: [Rd] readLines() segfaults on large file & question on how to work around

2017-09-02 Thread Suzen, Mehmet
Jennifer, why not try SparkR?

https://spark.apache.org/docs/1.6.1/api/R/read.json.html
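For concreteness, a minimal sketch against the SparkR 1.6.x API from the link
above; the master setting and file name are placeholders, not details from
this thread. One caveat: Spark 1.6's read.json expects one JSON object per
line, so a single-line document may still need preprocessing (later Spark
versions add a multiLine option).

  library(SparkR)
  sc <- sparkR.init(master = "local[*]")     # local Spark using all cores
  sqlContext <- sparkRSQL.init(sc)
  df <- read.json(sqlContext, "file.json")   # placeholder path; Spark, not R, holds the data
  printSchema(df)
  sparkR.stop()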

On 2 September 2017 at 23:15, Jennifer Lyon wrote:
> Thank you for your suggestion. Unfortunately, while R doesn't segfault
> calling readr::read_file() on the test file I described, I get the error
> message:
>
> Error in read_file_(ds, locale) : negative length vectors are not allowed
>
> Jen
>
> On Sat, Sep 2, 2017 at 1:38 PM, Ista Zahn wrote:
>
>> As a work-around I suggest readr::read_file.
>>
>> --Ista
>>
>>
>> On Sep 2, 2017 2:58 PM, "Jennifer Lyon" wrote:
>>
>>> Hi:
>>>
>>> I have a 2.1GB JSON file. Typically I use readLines() and
>>> jsonlite::fromJSON() to extract data from a JSON file.
>>>
>>> When I try to read in this file using readLines(), R segfaults.
>>>
>>> I believe the two salient issues with this file are
>>> 1). Its size
>>> 2). It is a single line (no line breaks)
>>> [...]


Re: [Rd] readLines() segfaults on large file & question on how to work around

2017-09-02 Thread Iñaki Úcar
2017-09-02 20:58 GMT+02:00 Jennifer Lyon:
> Hi:
>
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.
>
> When I try to read in this file using readLines(), R segfaults.
>
> I believe the two salient issues with this file are
> 1). Its size
> 2). It is a single line (no line breaks)

As a workaround you can pipe the input through something like
"sed 's/,/,\\n/g'" before your R script to insert line breaks.
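From inside R the same idea can be sketched with pipe(); GNU sed and the file
name "big.json" are assumptions here, and note that this naive split also
breaks at commas inside JSON strings:

  # Let sed insert line breaks so no single line approaches R's per-string limit.
  con <- pipe("sed 's/,/,\\n/g' big.json")
  lines <- readLines(con)   # many short lines instead of one huge one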

Iñaki


Re: [Rd] I have corrected a dead link in the treering documentation

2017-09-02 Thread Martin Maechler
> Thomas Levine <_...@thomaslevine.com>
> on Fri, 1 Sep 2017 13:23:47 + writes:

> Martin Maechler writes:
>> There may be one small problem: IIUC, the wayback machine
>> is a +- private endeavor and really great and fantastic
>> but it does need (US? tax deductible) donations,
>> https://archive.org/donate/, to continue thriving.  This
>> makes me hesitate a bit to link to it within the "base R"
>> documentation.  But that may be wrong -- and I should
>> really use it to *help* the project?

> I agree that the Wayback Machine is a private
> endeavor. After reviewing other base library
> documentation, I have concluded that it would regardless
> be consistent with current practice to reference it in the
> base documentation.

> I share your concern regarding the support of other
> institutions, and I have found some references that are
> more problematic to me than the one of present interest. I
> would thus support an initiative to consider the social
> implications of the different references and to adjust the
> references accordingly.

> Below I start by making a distinction between two types of
> references that I think should be treated differently in
> terms of your concern.  Next, I assess whether there is a
> precedent for inclusion of references to private
> publishers, as in the present patch; I conclude that there
> is such a precedent. Then I present my opinion regarding
> the present patch. Finally, I present some other
> considerations that I find relevant to the discussion.

> Distinguishing between two link types
> -------------------------------------
> For discussion of this issue, I think it is helpful to
> distinguish between references to sources and references
> to other materials.

> In the case of references to sources, there is little
> choice but to reference the publisher, even though the
> overwhelming majority of referenced publishers are private
> companies that impose restrictive licenses on their
> journals and books and cannot be reasonably trusted to
> maintain access to the materials or the availability of
> webpages.

> With other references, it is possible to replace the
> reference with a different document that contains similar
> information.

> For example, if a function implements a method based on a
> particular journal article, that article's citation needs
> to stay, even if the journal is published by a private
> institution. On the other hand, if the reference just
> provides context or suggestions related to usage, then the
> reference is provided just as information and can be
> replaced.

> Precedent for inclusion of private non-source materials
> -------------------------------------------------------
> The dead link of interest is only informational, not a
> citation of a source, and so it could be replaced. So I
> assessed whether it would match current practice to
> include it, and I concluded that there is substantial
> precedent for inclusion of private reference materials
> other than strict sources. Not having access to a good
> library at the moment, I have limited my research on this
> matter to website references.

> In SVN revision 73164, \url calls are distributed among
> 148 files, from 1 call to 13 calls per file, with a mean of
> 1.75 and a median of 1.

>   grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -n
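> For what it's worth, those summary numbers can be recomputed in R
> from the same pipeline (a sketch, assuming the top of an R source
> tree as the working directory):

>   cnt <- read.table(pipe("grep '\\\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c"),
>                     col.names = c("calls", "file"))
>   mean(cnt$calls)    # 1.75
>   median(cnt$calls)  # 1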

> The total number of library documentation files is 1419.

>   find src/library/ -name \*.Rd | wc -l

> I randomly selected 20 matching files for further study.

>   % grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -R | head -n 20 | tee /tmp/rd 2
> src/library/grDevices/man/pdf.Rd 1
> src/library/base/man/taskCallbackNames.Rd 1
> src/library/stats/man/shapiro.test.Rd 1
> src/library/tcltk/man/TkWidgets.Rd 2
> src/library/graphics/man/assocplot.Rd 1
> src/library/base/man/sprintf.Rd 6
> src/library/base/man/regex.Rd 3
> src/library/datasets/man/HairEyeColor.Rd 1
> src/library/stats/man/optimize.Rd 1
> src/library/datasets/man/UKDriverDeaths.Rd 1
> src/library/utils/man/object.size.Rd 1
> src/library/utils/man/unzip.Rd 1
> src/library/base/man/dcf.Rd 1
> src/library/base/man/DateTimeClasses.Rd 3
> src/library/stats/man/GammaDist.Rd 2
> src/library/utils/man/maintainer.Rd 2
> src/library/base/man/libcurlVersion.Rd 2
> src/library/base/man/eigen.Rd 2
> src/library/base/man/chol2inv.Rd 1
> src/library/tools/man/update_pkg_po.Rd

> From these 20 I composed a table with statistical unit of
> \url call and with variables filename, url, type of reference,
> and type of publisher.  The following commands were helpful.

>   se

Re: [Rd] readLines() segfaults on large file & question on how to work around

2017-09-02 Thread Jennifer Lyon
Thank you for your suggestion. Unfortunately, while R doesn't segfault
calling readr::read_file() on the test file I described, I get the error
message:

Error in read_file_(ds, locale) : negative length vectors are not allowed
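Presumably this is a hard limit rather than a bug: a single R character
string can hold at most 2^31 - 1 bytes, and the test file is larger than
that, so any reader that returns the whole file as one string has to fail.
A quick check against the repro file from my original post:

  file.size("file.txt") > .Machine$integer.max   # TRUE: ~2.4e9 > 2147483647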

Jen

On Sat, Sep 2, 2017 at 1:38 PM, Ista Zahn wrote:

> As a work-around I suggest readr::read_file.
>
> --Ista
>
>
> On Sep 2, 2017 2:58 PM, "Jennifer Lyon" wrote:
>
>> Hi:
>>
>> I have a 2.1GB JSON file. Typically I use readLines() and
>> jsonlite::fromJSON() to extract data from a JSON file.
>>
>> When I try to read in this file using readLines(), R segfaults.
>>
>> I believe the two salient issues with this file are
>> 1). Its size
>> 2). It is a single line (no line breaks)
>> [...]


Re: [Rd] readLines() segfaults on large file & question on how to work around

2017-09-02 Thread Ista Zahn
As a work-around I suggest readr::read_file.
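For concreteness, a minimal sketch of the suggestion ("file.json" is a
placeholder); note, per Jennifer's follow-up, that this still errors once the
file outgrows what a single R string can hold:

  library(readr)
  library(jsonlite)
  txt <- read_file("file.json")   # whole file as one string, no line splitting
  dat <- fromJSON(txt)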

--Ista


On Sep 2, 2017 2:58 PM, "Jennifer Lyon" wrote:

> Hi:
>
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.
>
> When I try to read in this file using readLines(), R segfaults.
>
> I believe the two salient issues with this file are
> 1). Its size
> 2). It is a single line (no line breaks)
> [...]


[Rd] readLines() segfaults on large file & question on how to work around

2017-09-02 Thread Jennifer Lyon
Hi:

I have a 2.1GB JSON file. Typically I use readLines() and
jsonlite::fromJSON() to extract data from a JSON file.

When I try to read in this file using readLines(), R segfaults.

I believe the two salient issues with this file are
1). Its size
2). It is a single line (no line breaks)

I can reproduce this issue as follows:
# Generate a big file with no line breaks
# In R
> writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")

# In a unix shell: double the 36-byte file 26 times (36 * 2^26 ~ 2.4e9 bytes)
cp alpha.txt file.txt
for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done

This generates a 2.3GB file with no line breaks.

In R:
> moo <- readLines("file.txt")

 *** caught segfault ***
address 0x7cff, cause 'memory not mapped'

Traceback:
 1: readLines("file.txt")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3

I conclude:
 I am potentially running up against a limit in R, which should give a
reasonable error, but currently just segfaults.

My question:
Most of the content of the JSON is an approximately 100K x 6K JSON
equivalent of a dataframe, and I know R can handle much bigger than this
size. I am expecting these JSON files to get even larger. My R code lives
in a bigger system, and the JSON comes in via stdin, so I have absolutely
no control over the data format. I can imagine trying to incrementally
parse the JSON so I don't bump up against the limit, but I am eager for
suggestions of simpler solutions.
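For instance, a minimal sketch of that incremental direction, using only base
R (the chunk size and file name are placeholders): read fixed-size pieces
with readChar() and hand each to a streaming JSON parser, never building the
whole document as one string -- reassembling the pieces with paste0() would
hit the same per-string limit.

  con <- file("file.txt", open = "rb")
  repeat {
    piece <- readChar(con, nchars = 1e8)   # ~100 MB per chunk
    if (length(piece) == 0) break          # zero-length result means EOF
    # feed 'piece' to an incremental JSON parser here
  }
  close(con)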

Also, I apologize for the timing of this bug report, as I know folks are
working to get out the next release of R, but like so many things I have no
control over when bugs leap up.

Thanks.

Jen

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: R-3.4.1/lib/libRblas.so
LAPACK: R-3.4.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.1



Re: [Rd] Please avoid direct use of NAMED and SET_NAMED macros

2017-09-02 Thread luke-tierney

On Sat, 2 Sep 2017, Radford Neal wrote:


>> To allow for future changes in the way the need for duplication is
>> detected in R internal C code, package C code should avoid direct
>> use of NAMED and SET_NAMED, or assumptions on the maximal value
>> of NAMED. Use the macros MAYBE_REFERENCED, MAYBE_SHARED, and
>> MARK_NOT_MUTABLE instead. These currently correspond to
>>
>> MAYBE_REFERENCED(x):   NAMED(x) > 0
>> MAYBE_SHARED(x):       NAMED(x) > 1
>> MARK_NOT_MUTABLE(x):   SET_NAMED(x, NAMEDMAX)
>>
>> Best,
>>
>> luke
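To illustrate the recommended pattern, a sketch (not code from this thread)
of package C code using MAYBE_SHARED() where NAMED(x) > 1 would previously
have been tested; the function name is invented, and a real version would
also check the argument's type:

  #include <R.h>
  #include <Rinternals.h>

  /* Add 1 to the first element, copying first if the value may be shared. */
  SEXP add_one(SEXP x)
  {
      if (MAYBE_SHARED(x))          /* instead of NAMED(x) > 1 */
          x = duplicate(x);
      PROTECT(x);
      REAL(x)[0] += 1.0;
      UNPROTECT(1);
      return x;
  }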



> Checking https://cran.r-project.org/doc/manuals/r-release/R-exts.html
> shows that currently there is no mention of these macros in the
> documentation for package writers.  Of course, the explanation of
> NAMED there also does not adequately describe what it is supposed to
> mean, which may explain why it's often not used correctly.


As of yesterday they are mentioned in the R-devel version of this
manual, which will make it to the web in due course.


> Before embarking on a major change to the C API, I'd suggest that you
> produce clear and complete documentation on the new scheme.
>
>    Radford Neal



--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                    Phone:            319-335-3386
Department of Statistics and          Fax:              319-335-3017
   Actuarial Science
241 Schaeffer Hall                    email:   luke-tier...@uiowa.edu
Iowa City, IA 52242                   WWW:  http://www.stat.uiowa.edu



Re: [Rd] Please avoid direct use of NAMED and SET_NAMED macros

2017-09-02 Thread Radford Neal
> To allow for future changes in the way the need for duplication is
> detected in R internal C code, package C code should avoid direct
> use of NAMED and SET_NAMED, or assumptions on the maximal value
> of NAMED. Use the macros MAYBE_REFERENCED, MAYBE_SHARED, and
> MARK_NOT_MUTABLE instead. These currently correspond to
> 
> MAYBE_REFERENCED(x):   NAMED(x) > 0
> MAYBE_SHARED(x):       NAMED(x) > 1
> MARK_NOT_MUTABLE(x):   SET_NAMED(x, NAMEDMAX)
> 
> Best,
> 
> luke


Checking https://cran.r-project.org/doc/manuals/r-release/R-exts.html
shows that currently there is no mention of these macros in the
documentation for package writers.  Of course, the explanation of
NAMED there also does not adequately describe what it is supposed to
mean, which may explain why it's often not used correctly.

Before embarking on a major change to the C API, I'd suggest that you
produce clear and complete documentation on the new scheme.

Radford Neal



Re: [Rd] Missing y label

2017-09-02 Thread Dirk Eddelbuettel

On 1 September 2017 at 15:50, Therneau, Terry M., Ph.D. wrote:
| The system admins here ...

I suggest you get these local admins to help you.

These CRAN repos for Ubuntu are used by thousands of people every day, and
they "just work", for both the recent releases and the most recent LTS.

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



Re: [Rd] Wayback and related questions (was: RE: I have corrected a dead link ...)

2017-09-02 Thread Thomas Levine
> If the R project cannot use or reference any site that uses non-open
> code, including minified javascript - which appears to be the
> principal issue for GitHub - I suspect that you will be obliged to
> discontinue links to almost every journal, university, charity,
> government and research establishment site currently in existence as
> soon as GNU get round to assessing them.  I personally have great
> difficulty seeing that as sensible. 

The policy that you suggest would indeed be completely stupid.
Fortunately, a reasonable policy that vaguely matches the current
practices is likely to affect hardly any documentation files.

I don't have a strong opinion as to whether publishing characteristics
of references should be a consideration during the composition of R
documentation files, and I trust the R developers to decide well.
