Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Victor Tso-Guillen
Interestingly, there was an almost identical question posed on Aug 22 by
cjwang. Here's the link to the archive:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664


On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:

 Hi all,

 Assume I have read the lines of a text file into an RDD:

 textFile = sc.textFile("SomeArticle.txt")

 Also assume that the sentence breaks in SomeArticle.txt were done by
 machine and have some errors, such as the break at Fig. in the sample text
 below.

 Index   Text
 N       ...as shown in Fig.
 N+1     1.
 N+2     The figure shows...

 What I want is an RDD with:

 N       ...as shown in Fig. 1.
 N+1     The figure shows...

 Is there some way a filter() can look at neighboring elements in an RDD?
 That way I could look, in parallel, at neighboring elements in an RDD and
 come up with a new RDD that may have a different number of elements.  Or do
 I just have to sequentially iterate through the RDD?

 Thanks,
 Ron





Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Chris Gore
There is support for Spark in ElasticSearch’s Hadoop integration package.

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html

Maybe you could split your documents and index them all from Spark, then
run a “MoreLikeThis” query against the ElasticSearch index. I haven’t tried it,
but maybe someone else has more experience using Spark with ElasticSearch. At
some point, maybe there could be an information retrieval package for Spark
with locality-sensitive hashing and other similar functions.

 



RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Thanks for the pointer to that thread. Looks like there is some demand for this 
capability, but not a lot yet. Also doesn't look like there is an easy answer 
right now.

Thanks,
Ron






Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Xiangrui Meng
There is a sliding method implemented in MLlib
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala),
which is used in computing Area Under Curve:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/AreaUnderCurve.scala#L45

With it, you can process neighboring lines with

rdd.sliding(3).map { case Seq(l0, l1, l2) => ... }

-Xiangrui
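
As a rough illustration of the sliding-window idea applied to the bad
sentence breaks in the original question, here is a minimal sketch in plain
Python. It runs on a local list, not on an RDD, and is not Spark code. The
names sliding, merge_bad_breaks and ABBREVIATIONS are hypothetical, and the
rule "a line ending in an abbreviation such as Fig. continues on the next
line" is an assumption made for the example. It only handles a single bad
break at a time, not chains of consecutive fragments.

```python
# Hypothetical list of line endings that signal a wrongly placed sentence break.
ABBREVIATIONS = ("Fig.", "Eq.", "et al.")


def sliding(items, size):
    """Yield consecutive overlapping windows of `size` elements."""
    items = list(items)
    for start in range(len(items) - size + 1):
        yield tuple(items[start:start + size])


def ends_with_abbreviation(line):
    """True if the line exists and ends with one of the known abbreviations."""
    return line is not None and line.rstrip().endswith(ABBREVIATIONS)


def merge_bad_breaks(lines):
    """Glue a line onto its predecessor when the predecessor ends in an abbreviation."""
    # Pad both ends so every real line appears exactly once as the middle
    # element of a 3-element window (previous, current, next).
    padded = [None] + list(lines) + [None]
    merged = []
    for prev, cur, nxt in sliding(padded, 3):
        if ends_with_abbreviation(prev):
            # The current line was already appended to the previous one.
            continue
        if ends_with_abbreviation(cur) and nxt is not None:
            merged.append(cur + " " + nxt)
        else:
            merged.append(cur)
    return merged


lines = ["...as shown in Fig.", "1.", "The figure shows..."]
result = merge_bad_breaks(lines)
print(result)
```

Each window is handled independently of the others, which is what would let
the same decision logic be applied per window in parallel. The output can
have fewer elements than the input because continuation lines are dropped
once they have been merged.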




RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)
Thanks Xiangrui, that looks very helpful.

Best regards,
Ron

