Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Victor Tso-Guillen

Interestingly, there was an almost identical question posed on Aug 22 by
cjwang. Here's the link to the archive:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664


On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:

 Hi all,

 Assume I have read the lines of a text file into an RDD:

 textFile = sc.textFile("SomeArticle.txt")

 Also assume that the sentence breaks in SomeArticle.txt were done by
 machine and have some errors, such as the break at Fig. in the sample text
 below.

 Index   Text
 N       ...as shown in Fig.
 N+1     1.
 N+2     The figure shows...

 What I want is an RDD with:

 N       ...as shown in Fig. 1.
 N+1     The figure shows...

 Is there some way a filter() can look at neighboring elements in an RDD?
 That way I could look, in parallel, at neighboring elements in an RDD and
 come up with a new RDD that may have a different number of elements.  Or do
 I just have to sequentially iterate through the RDD?

 Thanks,
 Ron
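
For reference, the sequential fallback Ron mentions can be sketched in a few lines of plain Python. This is only an illustration of the transformation in the sample table, not Spark code: the function name merge_broken_sentences and the abbreviation list are hypothetical, and the rule "join a line onto the previous one if the previous one ends with a known abbreviation" is an assumption about what counts as a bad break.

```python
# Hypothetical list of abbreviations after which a sentence break is
# assumed to be spurious. Not taken from the original message.
ABBREVIATIONS = ("Fig.", "Eq.", "Tab.")


def merge_broken_sentences(lines):
    """Sequentially join a line onto its predecessor whenever the
    predecessor ends with one of ABBREVIATIONS. Returns a new list."""
    merged = []
    for line in lines:
        if merged and merged[-1].endswith(ABBREVIATIONS):
            # The previous line was cut off at an abbreviation: glue this one on.
            merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    return merged


sample = ["...as shown in Fig.", "1.", "The figure shows..."]
result = merge_broken_sentences(sample)
```

The output list is shorter than the input, which is the "different number of elements" property asked about above. The open question in the thread is how to get the same effect without a single sequential pass.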

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Chris Gore

There is support for Spark in Elasticsearch's Hadoop integration package.

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html

Maybe you could split and insert all of your documents from Spark and then
query for "MoreLikeThis" on the Elasticsearch index. I haven't tried it, but
maybe someone else has more experience using Spark with Elasticsearch. At some
point, maybe there could be an information retrieval package for Spark with
locality-sensitive hashing and other similar functions.


RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)

Thanks for the pointer to that thread. Looks like there is some demand for this 
capability, but not a lot yet. Also doesn't look like there is an easy answer 
right now.

Thanks,
Ron



Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Xiangrui Meng

There is a sliding method implemented in MLlib
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala),
which is used in computing Area Under Curve:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/AreaUnderCurve.scala#L45

With it, you can process neighboring lines with

import org.apache.spark.mllib.rdd.RDDFunctions._
rdd.sliding(3).map { case Seq(l0, l1, l2) => ... }

-Xiangrui
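
To make the window idea concrete for the "Fig." example from the original question, here is a minimal sketch in plain Python. It is not Spark or MLlib code: the sliding helper below only imitates the idea of overlapping windows, and the padding with None, the name repair, and the single-abbreviation rule are all assumptions made for illustration. Each window is judged on its own, which is what would let the same logic run as a map-style step over windows.

```python
# Hypothetical rule: a line ending in one of these was cut off too early.
ABBREVIATIONS = ("Fig.",)


def sliding(items, size):
    """Yield consecutive overlapping windows (tuples) of length `size`."""
    items = list(items)
    for start in range(len(items) - size + 1):
        yield tuple(items[start:start + size])


def repair(lines):
    """Decide the fate of each line from its (previous, current, next) window.

    Assumes at most one spurious break in a row; with two or more
    consecutive broken lines a wider window would be needed, and this
    sketch would drop text.
    """
    # Pad both ends so the first and last lines also sit in the middle
    # of a window. The padding is an assumption of this sketch.
    padded = [None] + list(lines) + [None]
    out = []
    for prev, cur, nxt in sliding(padded, 3):
        if prev is not None and prev.endswith(ABBREVIATIONS):
            continue  # cur is emitted by the previous line's window
        if nxt is not None and cur.endswith(ABBREVIATIONS):
            out.append(cur + " " + nxt)
        else:
            out.append(cur)
    return out


sample = ["...as shown in Fig.", "1.", "The figure shows..."]
repaired = repair(sample)
```

Because a window either emits one line or nothing, the result can have fewer elements than the input. Whether the real sliding method pads the ends is not stated in this thread, so the first and last lines may need separate handling there.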


RE: Accessing neighboring elements in an RDD

2014-09-03 Thread Daniel, Ronald (ELS-SDG)

Thanks Xiangrui, that looks very helpful.

Best regards,
Ron

