Re: Accessing neighboring elements in an RDD
Interestingly, there was an almost identical question posed on Aug 22 by cjwang. Here's the link to the archive:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664

On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote:

Hi all,

Assume I have read the lines of a text file into an RDD:

    textFile = sc.textFile("SomeArticle.txt")

Also assume that the sentence breaks in SomeArticle.txt were done by machine and have some errors, such as the break at "Fig." in the sample text below.

    Index   Text
    N       ...as shown in Fig.
    N+1     1.
    N+2     The figure shows...

What I want is an RDD with:

    Index   Text
    N       ... as shown in Fig. 1.
    N+1     The figure shows...

Is there some way a filter() can look at neighboring elements in an RDD? That way I could look, in parallel, at neighboring elements in an RDD and come up with a new RDD that may have a different number of elements. Or do I just have to sequentially iterate through the RDD?

Thanks,
Ron
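For reference, the transformation Ron describes can be written as a sequential pass in plain Python. This is only a sketch of the "sequentially iterate" fallback, not a Spark solution: the function name `merge_broken_sentences` and the `ABBREVIATIONS` tuple are hypothetical, and the only abbreviation taken from the thread is "Fig.".

```python
# Hypothetical list of abbreviations that should not end a sentence.
# The thread only mentions "Fig."; anything else would be an assumption.
ABBREVIATIONS = ("Fig.",)


def merge_broken_sentences(lines):
    """Append a line to the previous one when the previous line ends with an abbreviation."""
    merged = []
    for line in lines:
        if merged and merged[-1].endswith(ABBREVIATIONS):
            # The previous line was cut short at an abbreviation, so rejoin it.
            merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    return merged


sample = ["...as shown in Fig.", "1.", "The figure shows..."]
print(merge_broken_sentences(sample))
```

The output has fewer elements than the input, which is why a plain filter() or map() over single elements is not enough: each decision depends on the neighboring line.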
Re: Accessing neighboring elements in an RDD
There is support for Spark in ElasticSearch’s Hadoop integration package:
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html

Maybe you could split and insert all of your documents from Spark and then query for “MoreLikeThis” on the ElasticSearch index. I haven’t tried it, but maybe someone else has more experience using Spark with ElasticSearch.

At some point, maybe there could be an information retrieval package for Spark with locality-sensitive hashing and other similar functions.

In reply to: Victor Tso-Guillen v...@paxata.com, Sep 3, 2014, 10:40 AM
RE: Accessing neighboring elements in an RDD
Thanks for the pointer to that thread. Looks like there is some demand for this capability, but not a lot yet. Also doesn't look like there is an easy answer right now.

Thanks,
Ron

In reply to: Victor Tso-Guillen [mailto:v...@paxata.com], Wednesday, September 03, 2014 10:40 AM
Re: Accessing neighboring elements in an RDD
There is a sliding method implemented in MLlib (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala), which is used in computing Area Under Curve:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/AreaUnderCurve.scala#L45

With it, you can process neighboring lines with:

    rdd.sliding(3).map { case Seq(l0, l1, l2) => ... }

-Xiangrui

In reply to: Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com, Wed, Sep 3, 2014 at 11:30 AM
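To illustrate how a window of three neighbors could be used for the sentence-break problem, here is a plain-Python sketch. It does not call Spark: the `sliding` helper below is a hypothetical stand-in that only mimics the idea of MLlib's Scala `sliding` method, and `fix_breaks` is an illustrative name. Two assumptions are worth stating loudly. First, the input is padded with `None` at both ends so that the first and last lines get a window of their own; as far as I can tell, MLlib's `sliding` yields only full windows, so the ends would need separate handling there. Second, the rule assumes bad breaks are isolated; if two consecutive lines both end with an abbreviation, this per-window rule would drop a line.

```python
def sliding(items, size):
    """Yield consecutive windows of `size` items (plain-Python stand-in, not the MLlib API)."""
    items = list(items)
    for i in range(len(items) - size + 1):
        yield tuple(items[i:i + size])


def fix_breaks(lines, abbreviations=("Fig.",)):
    """Rejoin lines split after an abbreviation, deciding each line from its own window.

    Assumes isolated breaks: no two consecutive lines end with an abbreviation.
    """
    # Pad so that every real line appears exactly once as the middle of a window.
    padded = [None] + list(lines) + [None]
    fixed = []
    for prev, cur, nxt in sliding(padded, 3):
        if prev is not None and prev.endswith(abbreviations):
            # cur is the continuation of prev and is emitted with prev's window.
            continue
        if nxt is not None and cur.endswith(abbreviations):
            fixed.append(cur + " " + nxt)
        else:
            fixed.append(cur)
    return fixed


sample = ["Intro.", "...as shown in Fig.", "1.", "The figure shows..."]
print(fix_breaks(sample))
```

Because each window is handled without reference to any other window, the same per-window rule could in principle be applied inside a `map` or `flatMap` over the windows that `sliding(3)` produces, which is what makes the approach parallel-friendly.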
RE: Accessing neighboring elements in an RDD
Thanks Xiangrui, that looks very helpful.

Best regards,
Ron

In reply to: Xiangrui Meng [mailto:men...@gmail.com], Wednesday, September 03, 2014 1:19 PM